Use case

Qwen3.5-35B-A3B for long context

Qwen3.5-35B-A3B is a 35B-parameter MoE with only 3B active per token and a 262K context window. The MoE lets it run at a 3B-dense cost while keeping a 35B knowledge base — ideal for RAG and long-document workflows. At $0.13 input / $1.00 output per 1M tokens, it's the cheapest per-input-token model in our catalog.

$0.13 / $1.00 per 1M tokens

Why it's a fit for RAG

262K context: Fits a 500-page PDF or 200 code files into a single prompt. No need for aggressive chunking if the retrieved corpus fits; single-shot RAG simplifies your pipeline.

Low input cost: $0.13 per 1M input tokens means a 100K-token RAG prompt costs $0.013. DeepSeek V3 at $0.24/1M would cost $0.024 for the same prompt — about 85% more (equivalently, Qwen is roughly 46% cheaper).

MoE speed: Only 3B parameters are active per token, so inference speed is closer to a 3B dense model than a 35B dense one. For long-input workflows, this shows up as noticeably lower per-request latency.
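The input-cost comparison above is quick arithmetic; a minimal sketch using the prices quoted on this page:

```python
# Per-prompt input cost at the published prices: Qwen3.5-35B-A3B at
# $0.13/1M input tokens vs DeepSeek V3 at $0.24/1M input tokens.
QWEN_INPUT = 0.13 / 1_000_000      # $ per input token
DEEPSEEK_INPUT = 0.24 / 1_000_000  # $ per input token

def prompt_cost(tokens: int, price_per_token: float) -> float:
    return tokens * price_per_token

qwen = prompt_cost(100_000, QWEN_INPUT)          # $0.013
deepseek = prompt_cost(100_000, DEEPSEEK_INPUT)  # $0.024
print(f"Qwen: ${qwen:.3f}  DeepSeek: ${deepseek:.3f}  "
      f"savings: {1 - qwen / deepseek:.0%}")
```

At a typical long-context RAG volume (say 10M input tokens/day), that gap compounds to about $1.10/day saved on input alone.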

RAG pipeline pattern

Simple single-shot: if retrieved context fits in 262K tokens, skip reranking and hierarchical summarization — feed everything to Qwen3.5-35B-A3B in one call. Lower pipeline complexity, lower latency.

With retrieval: embed → top-K retrieve → concat into a 50-100K token prompt → Qwen3.5-35B-A3B answer. Input-cost economics favor longer top-K (more context) because input tokens are cheap.

Summarize-then-answer: for >262K corpora, first summarize by section with Qwen3.5-35B-A3B, then answer on the summaries. Two-pass; still cheaper than most alternatives.
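The two-pass summarize-then-answer pattern can be sketched as below. `client` is the OpenAI-compatible client from the quickstart; the chunk size and the ~4-chars-per-token heuristic are illustrative assumptions, not tuned values.

```python
# Summarize-then-answer sketch for a corpus larger than the 262K window.
# `client` is the OpenAI-compatible client from the quickstart below.
MODEL = "qwen3.5-35b"
CHUNK_CHARS = 200_000 * 4  # ~200K tokens per chunk, assuming ~4 chars/token

def chunk_text(text: str, size: int = CHUNK_CHARS) -> list[str]:
    # Naive fixed-size split; a real pipeline would cut on section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_chunk(client, chunk: str) -> str:
    # Pass 1: compress each chunk independently.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Summarize the key facts:\n{chunk}"}],
        max_tokens=1000,
    )
    return resp.choices[0].message.content

def answer_over_summaries(client, question: str, corpus: str) -> str:
    # Pass 2: answer against the concatenated summaries.
    summaries = "\n\n".join(summarize_chunk(client, c) for c in chunk_text(corpus))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Summaries:\n{summaries}\n\nQuestion: {question}"}],
        max_tokens=500,
    )
    return resp.choices[0].message.content
```

Because input tokens dominate the bill here, the second pass over summaries is cheap; most of the cost is the one-time pass-1 read of the corpus.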

Quickstart code

python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.quicksilverpro.io/v1",
    api_key="sk-qsp-...",
)

# Load a long document — say a 500-page PDF, already extracted to text
with open("annual-report.txt", encoding="utf-8") as f:
    document = f.read()  # ~180K tokens

resp = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[
        {"role": "system", "content": "You answer questions using only the provided document."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What was free cash flow in Q3?"},
    ],
    max_tokens=500,
)
print(resp.choices[0].message.content)
usage = resp.usage
# Cost from the published prices: $0.13/1M input, $1.00/1M output
cost = usage.prompt_tokens * 0.13 / 1e6 + usage.completion_tokens * 1.00 / 1e6
print(f"Input tokens: {usage.prompt_tokens}, cost: ${cost:.4f}")

FAQ

Can I really use 262K tokens in one prompt?

Yes. The 262,144-token context is the published hard limit. Long-context performance (needle-in-a-haystack recall) is strong up to about 200K; past that, accuracy can degrade on fine-grained lookup tasks. For critical retrieval, combine with vector search to put the most relevant chunks near the top of the prompt.

What's the "3B active MoE" thing?

Mixture-of-Experts routes each token through only a subset of the model's parameters. Qwen3.5-35B-A3B has 35B total parameters but activates only 3B per token. Compute per token is that of a 3B dense model; knowledge capacity is closer to a 35B model. The result is faster and cheaper inference than dense 35B, which is why long-context workloads are a particularly good fit.
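A toy illustration of the routing idea — the expert count, k, and the fake router scores below are illustrative only, not Qwen's actual configuration:

```python
# Toy MoE top-k routing: a router scores all experts for each token and
# only the top-k experts' parameters do work on that token.
# NUM_EXPERTS and TOP_K are illustrative, not Qwen's real configuration.
import random

NUM_EXPERTS = 64
TOP_K = 2  # experts active per token in this sketch

def route(token: str) -> list[int]:
    # Real routers use a learned linear layer over the token's hidden state;
    # here a seeded RNG fakes scores so routing is deterministic per token.
    rng = random.Random(token)
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    return sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]

for tok in ["free", "cash", "flow"]:
    print(tok, "->", route(tok))
```

The key property the sketch shows: per-token compute scales with TOP_K experts, while total parameters (and thus stored knowledge) scale with NUM_EXPERTS.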

Does thinking mode affect cost?

Qwen3.5-35B-A3B ships with reasoning mode available. On QuickSilver Pro, reasoning mode is suppressed by default to keep output concise and predictable — you're not billed for unnecessary thinking tokens. This matches the behavior most RAG and summarization workloads expect.

Does Qwen support tool calling?

Yes, via the OpenAI tools API. Tool-call reliability is good for simple function signatures; for complex multi-tool agent loops, DeepSeek V3 tends to be more reliable. Benchmark both on your specific agent before committing.
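A minimal sketch of the OpenAI-style tools format with Qwen — the tool name, its fields, and the stand-in data source are hypothetical; reuse `client` from the quickstart:

```python
# OpenAI-style tool definition plus a local dispatcher.
# "lookup_metric" and its return value are hypothetical examples.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_metric",
        "description": "Look up a financial metric for a given quarter.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "quarter": {"type": "string"},
            },
            "required": ["metric", "quarter"],
        },
    },
}]

def lookup_metric(metric: str, quarter: str) -> str:
    return f"{metric} for {quarter}: $412M"  # stand-in for a real data source

def dispatch(tool_call) -> str:
    # Route a model-emitted tool call to the matching local function.
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "lookup_metric":
        return lookup_metric(**args)
    raise ValueError(f"unknown tool {tool_call.function.name}")

# Pass tools=tools to client.chat.completions.create(...); when the response
# contains tool_calls, run dispatch() on each and send the result back as a
# {"role": "tool", "tool_call_id": ..., "content": ...} message.
```

This single-tool loop is where Qwen is reliable; per the note above, benchmark multi-tool agent loops against DeepSeek V3 before committing.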

Try it on $1 free credits

Get API Key