Deployments & API Reference

All deployments in this repository are production-ready, quantization-aware, and containerized with Cog, enabling seamless execution across local Docker, on-prem GPUs, and hosted platforms such as Replicate.

Available Deployments

The repository provides five optimized deployments covering text generation, reasoning, multimodal inference, and diffusion-based image generation.

Deployment                        Task                      Input                                  Output
Flux Fast LoRA Hotswap            Text-to-Image             Text prompt + style trigger            PNG image
Flux Fast LoRA Hotswap Img2Img    Image-to-Image            Image + text prompt + style trigger    PNG image
SmolLM3 Pruna                     Text Generation           Text prompt                            Text output
Phi-4 Reasoning Plus (Unsloth)    Reasoning & Explanation   Text prompt                            Structured text output
Gemma Torchao                     Multimodal QA             Image + text prompt                    Text output

Each deployment exposes a stable input schema, supports deterministic inference and can be executed without vendor lock-in.
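
For example, any deployment can be called with a few lines of Python against a locally running Cog server. A minimal sketch (the endpoint and payload shape match the curl examples later on this page; the fixed seed is an illustrative choice for repeatable runs where a deployment supports it):

import requests

# Call a locally running Cog server; all deployments share this request shape.
resp = requests.post(
    "http://localhost:5000/predictions",
    json={
        "input": {
            "prompt": "Golden sunset over mountain peaks",
            "seed": 42,  # fixed seed for repeatable runs where supported
        }
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["output"])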


Flux Fast LoRA Hotswap

Input Schema

{
  "prompt": "string (required)",
  "trigger_word": "string (optional)",
  "height": "integer (default: 768)",
  "width": "integer (default: 768)",
  "num_inference_steps": "integer (default: 30)",
  "guidance_scale": "float (default: 3.5)",
  "seed": "integer (optional)"
}

Output

  • Type: PNG image (base64-encoded or URL; see the decoding sketch below)
  • Resolution: 768×768 (configurable)
  • Format: RGB
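
Because the image may come back either as a URL or as a base64 payload, clients should handle both cases. A minimal sketch, assuming base64 outputs arrive as a data URI (the exact encoding may vary by platform):

import base64
import requests

def save_output_image(output: str, path: str = "output.png") -> None:
    # Handle both delivery modes: data-URI base64 payloads and plain URLs.
    if output.startswith("data:"):
        _, encoded = output.split(",", 1)
        data = base64.b64decode(encoded)
    else:
        data = requests.get(output, timeout=60).content
    with open(path, "wb") as f:
        f.write(data)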

Available Styles

Enhanced Image Preferences (data-is-better-together/open-image-preferences-v1-flux-dev-lora)

Trigger words: Cinematic, Photographic, Anime, Manga, Digital art, Pixel art, Fantasy art, Neonpunk, 3D Model, Painting, Animation, Illustration

Ghibsky Illustration (aleksa-codes/flux-ghibsky-illustration)

Trigger word: GHIBSKY

Performance

  • Inference latency: 8–15 seconds (with torch.compile; see the sketch below)
  • VRAM usage: ~18 GB (optimized from 24 GB baseline)
  • Quality: Full FLUX.1-dev fidelity
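
The figures above assume a single resident base model onto which style adapters are hot-swapped, with the transformer compiled once at startup. A minimal sketch of that pattern with diffusers (the compile mode and setup order are illustrative; the deployment's actual configuration may differ):

import torch
from diffusers import FluxPipeline

# Load the base model once; adapters are swapped on top of it per request.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Swap in a style adapter without reloading the base weights.
pipe.load_lora_weights("aleksa-codes/flux-ghibsky-illustration")

# Compile the transformer once; later calls reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")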

Example Request

curl -X POST http://localhost:5000/predictions \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Golden sunset over mountain peaks, cinematic lighting",
      "trigger_word": "Cinematic",
      "num_inference_steps": 28
    }
  }'

Flux Fast LoRA Hotswap Img2Img

Input Schema

{
  "image": "file or URL (required)",
  "prompt": "string (required)",
  "trigger_word": "string (optional)",
  "strength": "float (0.0–1.0, default: 0.7)",
  "num_inference_steps": "integer (default: 30)",
  "guidance_scale": "float (default: 3.5)",
  "seed": "integer (optional)"
}

Output

  • Type: PNG image (base64-encoded or URL)
  • Resolution: Matches input image dimensions
  • Format: RGB

Parameters

strength : Controls how much the original image is modified. Lower values (0.3–0.5) preserve structure; higher values (0.7–0.9) allow more creative transformation.

trigger_word : Same style triggers as text-to-image deployment. Omit for content-preserving edits.
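
To see the effect of strength in practice, the same edit can be run at several values using the request shape shown below; this sketch fixes the seed so that only strength varies:

import requests

for strength in (0.3, 0.5, 0.7, 0.9):
    # Lower strength preserves the source image; higher strength restyles it.
    resp = requests.post(
        "http://localhost:5000/predictions",
        json={
            "input": {
                "image": "https://example.com/photo.jpg",
                "prompt": "Transform into Studio Ghibli style",
                "trigger_word": "GHIBSKY",
                "strength": strength,
                "seed": 42,
            }
        },
        timeout=300,
    )
    resp.raise_for_status()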

Performance

  • Inference latency: 6–12 seconds (optimized)
  • VRAM usage: ~18 GB
  • Quality: Preserves FLUX.1-dev fidelity while restyling

Example Request

curl -X POST http://localhost:5000/predictions \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "image": "https://example.com/photo.jpg",
      "prompt": "Transform into Studio Ghibli style",
      "trigger_word": "GHIBSKY",
      "strength": 0.8
    }
  }'

SmolLM3 Pruna

Input Schema

{
  "prompt": "string (required)",
  "max_new_tokens": "integer (default: 512, max: 16384)",
  "temperature": "float (default: 0.7, range: 0.0–2.0)",
  "top_p": "float (default: 0.9, range: 0.0–1.0)",
  "mode": "string (default: 'no_think', options: 'think', 'no_think')",
  "seed": "integer (optional)"
}

Output

  • Type: Plain text
  • Persistence: Automatically saved as a .txt artifact
  • Format: UTF-8

Modes

think : Enables extended reasoning. Model produces step-by-step logical chains before final output. Useful for complex analysis or problem-solving.

no_think : Direct, concise response without intermediate reasoning steps. Faster and more suitable for simple queries.
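
The two modes accept identical inputs and differ only in the mode field. A minimal comparison sketch against a local server:

import requests

def ask(prompt: str, mode: str) -> str:
    resp = requests.post(
        "http://localhost:5000/predictions",
        json={"input": {"prompt": prompt, "mode": mode, "max_new_tokens": 512}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["output"]

# 'think' returns step-by-step reasoning; 'no_think' returns the direct answer.
print(ask("Is 1007 prime?", mode="think"))
print(ask("Is 1007 prime?", mode="no_think"))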

Performance

  • Inference latency: 1–3 seconds per 256 tokens (optimized)
  • VRAM usage: ~8 GB (reduced from 12 GB baseline)
  • Throughput: ~80–120 tokens/second

Example Request

curl -X POST http://localhost:5000/predictions \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Explain quantum entanglement in simple terms",
      "max_new_tokens": 256,
      "mode": "no_think",
      "temperature": 0.6
    }
  }'

Phi-4 Reasoning Plus (Unsloth)

Input Schema

{
  "prompt": "string (required)",
  "max_new_tokens": "integer (default: 2048)",
  "temperature": "float (default: 0.7, range: 0.0–2.0)",
  "top_p": "float (default: 0.95, range: 0.0–1.0)",
  "seed": "integer (optional)"
}

Output

  • Type: Plain text with reasoning annotations
  • Format: UTF-8
  • Structure: Explicit logical steps and explanations

Characteristics

  • Reasoning-first design: Naturally produces structured explanations
  • No prompt engineering required: Works well with natural, conversational prompts
  • Explanation-focused: Ideal for educational and analytical tasks

Performance

  • Inference latency: 2–5 seconds per 256 tokens
  • VRAM usage: ~12 GB (optimized via Unsloth)
  • Throughput: ~50–80 tokens/second

Example Request

curl -X POST http://localhost:5000/predictions \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Why do planets orbit the sun? Explain step-by-step.",
      "max_new_tokens": 1024,
      "temperature": 0.5
    }
  }'

Gemma Torchao

Input Schema

{
  "image": "file or URL (required)",
  "prompt": "string (required)",
  "max_new_tokens": "integer (default: 256)",
  "temperature": "float (default: 0.7, range: 0.0–2.0)",
  "top_p": "float (default: 0.9, range: 0.0–1.0)",
  "seed": "integer (optional)"
}

Output

  • Type: Plain text
  • Format: UTF-8
  • Content: Image description, answer to query, or analysis

Capabilities

  • Visual understanding: Analyzes image content with high accuracy
  • Question answering: Responds to specific queries about image regions
  • Description generation: Produces natural-language descriptions

Performance

  • Inference latency: 1–3 seconds (vision encoding + text generation)
  • VRAM usage: ~6 GB (INT8 quantization + sparsity; sketched below)
  • Quality: Retains full model fidelity despite optimizations
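
The memory reduction comes from INT8 weight quantization applied with torchao. A minimal sketch of the general pattern using a toy model (API names vary across torchao versions, the deployment quantizes a Gemma multimodal checkpoint rather than this stand-in, and its sparsity configuration is not shown):

import torch
from torchao.quantization import quantize_, int8_weight_only

# Toy stand-in model; the deployment quantizes a Gemma checkpoint instead.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

# Rewrites eligible Linear layers in place to use INT8 weights.
quantize_(model, int8_weight_only())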

Example Request

curl -X POST http://localhost:5000/predictions \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "image": "https://example.com/photo.jpg",
      "prompt": "What objects are visible in this image?",
      "max_new_tokens": 128
    }
  }'

Testing & Validation

The repository follows a structured approach to ensure robustness across rapid upstream changes.

Tier 1: Unit Tests

Purpose: Validate individual components in isolation

Coverage

  • Quantization routines
  • Pruning filters
  • Input schema validation
  • Utility functions

Characteristics

  • CPU-first execution
  • Fast runtime (seconds)
  • Minimal resource requirements
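
As an illustration of the style of test in this tier, a schema-validation check might look like the following (the validate_input helper and its SchemaError type are hypothetical, not names from the repository):

import pytest

from schema_utils import SchemaError, validate_input  # hypothetical helper

def test_missing_required_prompt_is_rejected():
    # Every deployment requires a prompt; omitting it must fail fast on CPU.
    with pytest.raises(SchemaError):
        validate_input({"max_new_tokens": 256})

def test_defaults_are_applied():
    parsed = validate_input({"prompt": "hello"})
    assert parsed["max_new_tokens"] == 512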

Tier 2: Integration Tests

Purpose: Validate end-to-end model behavior with optimizations

Coverage

  • Quantized checkpoint loading
  • Single inference step validation
  • Output shape verification
  • Schema compatibility checks

Characteristics

  • GPU-enabled (optional)
  • Small input batches
  • Exponential retry logic (see the sketch below)
  • Focus on correctness over performance
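
The retry behavior follows the usual exponential-backoff pattern; a minimal sketch (the repository's actual helper may differ):

import time

def with_retries(fn, attempts: int = 4, base_delay: float = 1.0):
    # Retry transient failures (e.g., model warm-up) with exponential backoff.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)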

Tier 3: Canary Release Tests

Purpose: Detect functional, semantic, and performance regressions by comparing a newly deployed candidate against a pinned, known-good stable baseline before promotion.


Coverage

  • Live inference via Replicate deployments (stable vs candidate)
  • Schema-correct inputs (text, multimodal, or image as applicable)
  • Output sanity checks (length, format, degeneration)
  • Semantic equivalence checks (e.g., embedding similarity for text; see the sketch after this list)
  • Latency and throughput regression detection
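
As one way to implement the semantic check, stable and candidate outputs can be embedded and compared by cosine similarity. A minimal sketch assuming the sentence-transformers library and an illustrative threshold:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def outputs_semantically_match(stable: str, candidate: str,
                               threshold: float = 0.85) -> bool:
    # Embed both outputs and compare meaning, not exact wording.
    emb = encoder.encode([stable, candidate], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold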

Characteristics

  • Executed against deployed models, not local containers
  • GPU execution provided by the production inference platform (e.g., Replicate)
  • CPU-only CI runners are sufficient for test execution
  • Deterministic inputs and seeds where supported
  • Pinned stable baseline per deployment to avoid cross-model collisions
  • Lightweight enough to run on every deployment event

What This Tier Explicitly Avoids

  • Local Docker-in-Docker execution
  • Building or pushing containers
  • Full offline benchmark suites
  • Synthetic load or stress testing

These concerns are handled earlier (build-time) or separately (benchmarking).

Benchmarking Metrics

Each deployment is evaluated across:

  • Latency: End-to-end inference time (including I/O)
  • VRAM usage: Peak GPU memory during inference
  • Throughput: Tokens/second or images/second
  • Quality: Task-specific metrics (visual fidelity, semantic correctness)
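
Latency and peak VRAM can be captured around a single inference call with standard PyTorch utilities. A minimal sketch (CUDA-only; the repository's benchmarking harness may record more than this):

import time
import torch

def benchmark(run_inference):
    # Reset counters, then time one full inference including GPU sync.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    run_inference()
    torch.cuda.synchronize()
    latency_s = time.perf_counter() - start
    peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
    return latency_s, peak_vram_gb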

Observability & Debugging

Logging

All deployments emit structured logs at key points:

  • Model initialization: Weights loaded, quantization applied, compilation status
  • Inference start: Input schema validation, processing details
  • Inference complete: Output shape, timing, file paths (if applicable)
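
A minimal sketch of emitting such structured log lines from a predictor (the field names are illustrative, not the repository's exact schema):

import json
import logging
import time

logger = logging.getLogger("deployment")

def log_event(stage: str, **fields):
    # One JSON object per line keeps logs grep- and machine-friendly.
    logger.info(json.dumps({"stage": stage, "ts": time.time(), **fields}))

log_event("model_init", quantization="int8", compiled=True)
log_event("inference_start", prompt_tokens=42)
log_event("inference_complete", output_tokens=256, latency_s=2.1)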

Error Handling

Clear error messages for common issues:

  • Schema violations: Detailed message listing required/optional fields and types
  • GPU out-of-memory: Suggestions for reducing batch size or resolution
  • Missing dependencies: Installation instructions for required libraries
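
For example, GPU out-of-memory errors can be caught and re-raised with actionable guidance; a minimal sketch of the pattern:

import torch

def run_with_oom_hint(fn, **kwargs):
    try:
        return fn(**kwargs)
    except torch.cuda.OutOfMemoryError as err:
        # Surface concrete remediation steps instead of a raw CUDA traceback.
        raise RuntimeError(
            "GPU out of memory: try lowering height/width, "
            "num_inference_steps, or max_new_tokens."
        ) from err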

Output Inspection

  • Text outputs: Plain UTF-8 files for easy inspection
  • Image outputs: PNG files with metadata preserved
  • Artifacts: Automatically persisted for auditing and reproducibility