Quick Start¶
This guide helps you run your first quantized, containerized model in minutes. You can run models locally with Docker or Cog, or optionally deploy them to Replicate.
There is no vendor lock-in.
Prerequisites¶
Ensure the following are available:
- Git
- Cog
- Python 3.11+ (only required for `cog predict`)
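To confirm everything is on your `PATH`, a quick check (reported versions will vary):

```bash
# Each command should print a version string
git --version
cog --version
python3 --version   # should report 3.11 or newer for `cog predict`
```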
Info
All deployments are Cog-based and can run fully offline on Docker, without Replicate.
Step 1: Clone the Repository¶
This repository contains Cog configurations, quantized model weights and production-ready containers.
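A minimal sketch of the clone step; the repository URL and directory name below are placeholders, so substitute the actual repository you are working from:

```bash
# Clone the repository and move into it (URL is a placeholder)
git clone https://github.com/<your-org>/<this-repo>.git
cd <this-repo>
```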
Step 2: Choose a Deployment¶
Pick a deployment based on your task:
| Task | Deployment |
|---|---|
| Text-to-Image | Flux Fast Lora Hotswap |
| Image-to-Image | Flux Fast Lora Hotswap Img2Img |
| Multimodal Model | Gemma Torchao |
| Reasoning Model | Phi4 Reasoning Plus Unsloth |
| Lightweight Model | SmolLM3 Pruna |
See the full list in the Deployment Overview.
Step 3: Run with Docker Only (No Replicate)¶
You can run any Cog-based model directly via Docker, avoiding hosted platforms and vendor lock-in.
Start the Container¶
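One way to do this with plain Docker, assuming the image has been built with Cog (the image name `flux-fast-lora-hotswap` is illustrative):

```bash
# Build the image from the deployment's cog.yaml (image name is illustrative)
cog build -t flux-fast-lora-hotswap

# Run the container with GPU access; Cog-built images serve predictions on port 5000
docker run -d --gpus all -p 5000:5000 flux-fast-lora-hotswap
```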
This launches the model server locally, exposes an HTTP API on port 5000 and uses your local GPU via CUDA.
Step 4: Make an HTTP Inference Request¶
```bash
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Skyscrapers hover above clouds, bathed in golden sunrise.",
      "trigger_word": "Photographic"
    }
  }' \
  http://localhost:5000/predictions
```
The response will contain generated image URLs or streamed outputs, depending on the model.
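For reference, a successful response from a Cog server is a JSON object whose `output` field holds the result; the shape below is illustrative and varies by deployment:

```json
{
  "status": "succeeded",
  "output": ["data:image/png;base64,..."]
}
```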
Step 5: Run with Cog CLI (Local Inference)¶
If you prefer direct CLI-based inference:
Run inference for the model:
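For example (input names depend on the deployment's schema; these match the text-to-image request above):

```bash
# Run a single prediction locally; Cog builds the image first if needed
cog predict -i prompt="Skyscrapers hover above clouds, bathed in golden sunrise." \
            -i trigger_word="Photographic"
```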
This builds the container if needed and runs inference locally in the same container environment used in production.
Optional: Deploy to Replicate¶
If you want a managed, hosted endpoint on Replicate:
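The usual flow is to authenticate with Cog and push the image to Replicate's registry; `<username>/<model-name>` is a placeholder for a model you have created on Replicate:

```bash
# Authenticate against Replicate, then push the built image
cog login
cog push r8.im/<username>/<model-name>
```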
Deployment is optional. Local Docker and Cog usage are fully supported.
Why This Matters (No Vendor Lock-In)¶
- Models run identically on local Docker, on-prem GPUs or cloud
- No dependency on proprietary inference services
- Containers are portable across environments
- Easy migration between self-hosted and hosted deployments
This design ensures long-term maintainability and operational freedom.
Next Steps¶
- Architecture — Understand internals
- API Reference — Review inputs and outputs
- Usage Examples — Explore workflows
Troubleshooting¶
Docker can't see GPU
: Verify that `nvidia-smi` works on the host and that the NVIDIA Container Toolkit is installed (see the check below)

Slow inference
: Confirm quantized weights are being used

Schema errors
: Check the deployment-specific input fields
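A quick way to confirm GPU visibility both on the host and from inside a container (the CUDA image tag is illustrative; use whatever CUDA base image you have available):

```bash
# Host-level check: the NVIDIA driver should list your GPU
nvidia-smi

# Container-level check: confirms the NVIDIA Container Toolkit passes the GPU through
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```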
For unresolved issues, open a GitHub issue with logs and hardware details.