QuickSilver Pro vs Modal
Modal isn't an LLM API the way QuickSilver Pro (QSP) is; it's serverless GPU compute. You ship Python code that loads a model and serves it, pay per GPU-second, and Modal handles cold starts and scaling. QSP is the opposite trade: you give up the ability to run custom models, and in exchange every call is one HTTP request to a managed OpenAI-compatible endpoint. This page exists for teams comparing both approaches; most projects pick one or run them side-by-side.
At a glance
| Feature | QuickSilver Pro | Modal |
|---|---|---|
| Product shape | Managed inference API | Serverless GPU compute |
| What you bring | An API key | Python code + a model |
| Cost shape | Per token | Per GPU-second |
| Custom / finetuned models | No (curated catalog only) | Yes |
| Cold start | None | Seconds (model load) per scale-up |
| DeepSeek / Qwen / Kimi out of the box | Yes (7 LLMs) | BYO image |
| Setup | Sign up, paste key | modal CLI + container + GPU plumbing |
Pricing (per million tokens, USD)
Public list prices as of May 2026.
| Model | QSP input | QSP output | Modal compute | Modal overhead | Verdict |
|---|---|---|---|---|---|
| DeepSeek R1 (managed) | $0.56 | $2.00 | ~$2/hr H100 | + engineering | depends on traffic |
| Custom finetuned model | — | — | Pay per GPU-sec | BYO | Modal only |
| Qwen 3.6-35B (long context) | $0.12 | $0.80 | BYO image | + ops | QSP for managed RAG |
Migration - two lines
# QSP isn't a drop-in for Modal -- they're different categories.
# But if you're moving Modal-hosted LLM inference behind an
# OpenAI-compatible interface, here's the QSP equivalent:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.quicksilverpro.io/v1",
    api_key=os.environ["QSP_KEY"],
)
r = client.chat.completions.create(
    model="deepseek-v3",  # or whichever LLM your Modal app served
    messages=[{"role": "user", "content": "Hi"}],
)
FAQ
If your workload runs on stock open-source LLMs (DeepSeek, Qwen, Kimi, Llama), QSP is cheaper and faster to ship: zero ops, OpenAI-compatible, per-token billing. If you need to run a custom-finetuned model, a non-LLM workload (image generation, embeddings, transcription), or anything that doesn't fit a chat-completions shape, Modal's serverless GPU is the right tool. Many teams use both: QSP for chat, Modal for custom models.
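For teams that do run both, the split can be sketched as a small routing helper. This is a minimal sketch, not a prescribed setup: it assumes your Modal app exposes an OpenAI-compatible endpoint, and `MODAL_BASE_URL`, `MODAL_LLM_KEY`, `MANAGED_MODELS`, and `pick_backend` are hypothetical names introduced only for illustration. Only the QSP base URL, the `QSP_KEY` variable, and the `deepseek-v3` model id come from the migration snippet.

```python
import os

# From the migration snippet on this page.
QSP_BASE_URL = "https://api.quicksilverpro.io/v1"

# Hypothetical: URL of an OpenAI-compatible server you deployed on Modal.
MODAL_BASE_URL = os.environ.get("MODAL_BASE_URL", "https://example.invalid/v1")

# Hypothetical allow-list: models you want served from the managed catalog.
MANAGED_MODELS = {"deepseek-v3"}


def pick_backend(model: str) -> dict:
    """Return connection settings for the backend that should serve `model`."""
    if model in MANAGED_MODELS:
        # Stock open-source LLM: use the managed, per-token endpoint.
        return {"base_url": QSP_BASE_URL, "api_key_env": "QSP_KEY"}
    # Anything else (for example a custom finetune) goes to your own Modal deployment.
    return {"base_url": MODAL_BASE_URL, "api_key_env": "MODAL_LLM_KEY"}
```

A caller would then build the client with `OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["api_key_env"]])`, so the rest of the application code stays identical for both backends.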
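Same math as NIM self-host. Take an H100 on Modal at ~$2/hr serving ~200 R1 tokens/sec. The break-even against QSP's $2.00/M output price is roughly sustained >60% GPU utilization, provided batching lifts aggregate throughput well above 200 tokens/sec: at 200 tokens/sec total, even a fully busy GPU yields only 0.72M tokens/hr, about $2.78/M. Spiky traffic loses badly: every cold start is wasted GPU-seconds, and Modal's per-second billing means idle scale-up time is paid for.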
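The break-even arithmetic can be sketched as follows. The $2/hr, 200 tokens/sec, and $2.00/M figures come from the paragraph above; treating the tokens/sec figure as the GPU's total (aggregate) throughput is an assumption, and both function names are illustrative.

```python
def gpu_cost_per_million(gpu_usd_per_hour: float,
                         tokens_per_sec: float,
                         utilization: float) -> float:
    """Effective USD per million output tokens on a rented GPU.

    `utilization` is the fraction of paid GPU time spent generating tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_usd_per_hour: float,
                           tokens_per_sec: float,
                           managed_usd_per_million: float) -> float:
    """Utilization at which the GPU's cost per token equals the managed price.

    A result above 1.0 means the GPU cannot break even at this throughput.
    """
    full_util_cost = gpu_cost_per_million(gpu_usd_per_hour, tokens_per_sec, 1.0)
    return full_util_cost / managed_usd_per_million


# Figures from the paragraph above (aggregate throughput is an assumption).
print(round(gpu_cost_per_million(2.0, 200, 1.0), 2))    # about 2.78 USD per million
print(round(break_even_utilization(2.0, 200, 2.0), 2))  # about 1.39, i.e. above 1.0
```

With these inputs the ratio comes out above 1, so a break-even near 60% corresponds to roughly 460 tokens/sec of aggregate throughput. The result depends heavily on how much batching the serving stack achieves, which is why measured throughput for your own traffic matters more than any single headline number.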
The weights are open-source (DeepSeek, Qwen, and Kimi are MIT / Apache), so yes, you can load them on Modal yourself. The question is whether owning that integration pays for itself: QSP serves V3 at $0.616/M output; matching that on Modal requires sustained utilization above the cross-over point, and keeping it there becomes a full-time job for the engineering team. For most teams, the managed price wins.
QSP has no cold starts: it's a managed shared service, so models are always warm. Modal's cold start (a fresh GPU plus a loaded LLM) takes seconds for small models and longer for 70B+ class models. For latency-sensitive workloads (interactive chat, agents), QSP is the safer default.
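One way to check the cold-start difference on your own workload is to measure time to first token against each endpoint. The sketch below assumes an OpenAI-compatible client (such as the `client` from the migration snippet) and uses the standard `stream=True` option of chat completions; `time_to_first_token` is an illustrative helper, not part of either product.

```python
import time


def time_to_first_token(client, model: str, prompt: str) -> float:
    """Seconds from sending the request until the first non-empty streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # receive the response incrementally
    )
    for chunk in stream:
        # Some chunks carry no choices or only role metadata; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without any content")
```

Calling it once after a period of no traffic and again right afterwards gives a rough cold-versus-warm comparison; a single pair of calls is noisy, so repeated measurements are more informative.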