What is Qwen3.5-35B-A3B good for?

Qwen3.5-35B-A3B is a 35B-parameter mixture-of-experts (MoE) model that activates only 3B parameters per token and has a 262,144-token context window. It is particularly well suited to long-document RAG, multi-document summarization, and workflows where the prompt contains large amounts of retrieved context. Because only 3B parameters are active per token, its per-token compute, and therefore its speed and cost, is close to that of a 3B dense model despite the 35B total.

How much does the Qwen3.5-35B-A3B API cost?

On QuickSilver Pro: $0.111 per million input tokens and $0.80 per million output tokens. For a RAG pipeline with 50K input tokens of retrieved context per query and 500 output tokens per answer, that is $0.00555 input + $0.0004 output = $0.00595 per query, or roughly $6 per 1,000 queries.
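The arithmetic above can be reproduced with a small helper. This is a minimal sketch that assumes only the two prices quoted in the answer; the function name `query_cost` is made up for illustration and is not part of any SDK.

```python
# Prices quoted above, in USD per one million tokens.
INPUT_PRICE_PER_M = 0.111
OUTPUT_PRICE_PER_M = 0.80


def query_cost(input_tokens: int, output_tokens: int,
               input_price: float = INPUT_PRICE_PER_M,
               output_price: float = OUTPUT_PRICE_PER_M) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The RAG example from the text: 50K tokens in, 500 tokens out.
per_query = query_cost(50_000, 500)
print(f"per query: ${per_query:.5f}")
print(f"per 1,000 queries: ${per_query * 1000:.2f}")
```

Passing different `input_price` and `output_price` values lets you run the same estimate for another model in the catalog.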

When should I use Qwen3.5-35B-A3B vs DeepSeek V3?

Use Qwen3.5-35B-A3B when the prompt is large, typically more than 32K tokens of retrieved context or a long document to summarize. Its 262K context window is twice the size of DeepSeek V3's (131K), and its input price is about 31% lower ($0.111 vs. $0.16 per million tokens). For short-prompt tasks (chat, coding, extraction), DeepSeek V3 has stronger general reasoning at a similar output price.
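If you route requests between the two models automatically, the rule of thumb above might be sketched like this. The 32K threshold and the 262,144-token limit come from the text; the model ID `deepseek-v3` is an assumption (check your catalog for the real identifier), and `pick_model` is a hypothetical helper.

```python
LONG_PROMPT_THRESHOLD = 32_000   # tokens; rule of thumb from the text
QWEN_CONTEXT_WINDOW = 262_144    # tokens; published limit

QWEN_MODEL = "qwen3.5-35b"       # ID used in the quick-start example
DEEPSEEK_MODEL = "deepseek-v3"   # assumed ID, verify against your catalog


def pick_model(prompt_tokens: int) -> str:
    """Choose a model ID from the prompt size alone."""
    if prompt_tokens > QWEN_CONTEXT_WINDOW:
        raise ValueError("prompt exceeds the 262,144-token window; summarize or retrieve first")
    if prompt_tokens > LONG_PROMPT_THRESHOLD:
        return QWEN_MODEL
    return DEEPSEEK_MODEL


print(pick_model(80_000))
print(pick_model(2_000))
```

Prompt size is only one signal. A production router would likely also consider task type, since the text recommends DeepSeek V3 for short reasoning-heavy prompts.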

Is Qwen3.5-35B-A3B the same model as Qwen3?

No. Qwen3.5-35B-A3B is the 35B-parameter MoE variant with 3B active parameters, a distinct model from Qwen3's dense and larger MoE variants. "A3B" denotes the 3B active-parameter count. It is optimized for long-context workloads where compute per token is the bottleneck.


Use case

Qwen3.5-35B-A3B for long context

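Qwen3.5-35B-A3B is a 35B-parameter MoE model that activates only 3B parameters per token and supports a 262K context window. MoE gives it knowledge capacity close to that of a 35B model while it runs at roughly the per-token cost of a 3B dense model, which makes it a strong fit for RAG and long-document workflows. It is one of the cheapest models in our catalog for input: just $0.111 per 1M tokens.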

$0.111 input / $0.80 output per 1M tokens

Why it works well for RAG

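262K context: a 500-page PDF or 200 code files can fit in a single prompt. As long as the retrieved corpus fits, there is no need for aggressive chunking, and single-pass RAG can simplify the pipeline considerably.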

Cheap input: at $0.111 per 1M input tokens, a 100K-token RAG prompt costs about $0.011. The same prompt on DeepSeek V3 ($0.16/1M) costs $0.016, about 44% more.

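MoE speed: only 3B parameters are active per token, so inference speed is closer to that of a 3B dense model than a 35B dense one. For long-input workflows, this usually shows up as noticeably lower per-request latency.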
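Before skipping chunking, it helps to check whether a corpus plausibly fits in the window. The sketch below assumes a rough heuristic of about 4 characters per token, which is only an approximation for English text and not the model's real tokenizer. The function names are hypothetical, and the 500-token reserve for the answer mirrors the `max_tokens` value used elsewhere on this page.

```python
import math

CONTEXT_WINDOW = 262_144  # tokens; published limit


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; use the real tokenizer when accuracy matters."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(texts: list[str], reserved_output_tokens: int = 500,
                    window: int = CONTEXT_WINDOW) -> bool:
    """True if all texts plus the reserved answer budget fit in one prompt."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_output_tokens <= window


small_corpus = ["x" * 400_000]     # about 100K estimated tokens
large_corpus = ["x" * 1_200_000]   # about 300K estimated tokens
print(fits_in_context(small_corpus))
print(fits_in_context(large_corpus))
```

Because the estimate is coarse, leaving a safety margin below the limit is a sensible precaution, especially for code or non-English text where tokens per character can differ a lot.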

RAG pipeline patterns

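Simple single call: if the retrieved context fits in 262K tokens, you can skip reranking and hierarchical summarization and hand everything to Qwen3.5-35B-A3B in one request. The pipeline is simpler and latency is lower.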

With retrieval: embed → top-K retrieval → concatenate into a 50K-100K token prompt → Qwen3.5-35B-A3B answers. Because input tokens are cheap, you can use a larger top-K and let the model see more context.

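Summarize, then answer: for corpora larger than 262K tokens, first summarize section by section with Qwen3.5-35B-A3B, then answer from the summaries. It is a two-stage flow, yet usually still cheaper than most alternatives.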
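The second and third patterns can be sketched without committing to a particular retriever or client. In the example below, `pack_chunks` greedily fills a token budget with already-ranked chunks, and `summarize_then_answer` runs the two-stage flow. Both names are hypothetical. `count_tokens` and `ask` are caller-supplied functions (for instance, a tokenizer call and a thin wrapper around the chat completion request), and the prompt wording is only a placeholder.

```python
from typing import Callable, Iterable


def pack_chunks(ranked_chunks: Iterable[str], budget_tokens: int,
                count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the highest-ranked chunks, in order, until the budget is used up."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        packed.append(chunk)
        used += cost
    return packed


def summarize_then_answer(sections: list[str], question: str,
                          ask: Callable[[str], str]) -> str:
    """Stage 1: summarize each section. Stage 2: answer from the summaries."""
    summaries = [
        ask(f"Summarize this section, keeping figures and names:\n\n{section}")
        for section in sections
    ]
    context = "\n\n".join(summaries)
    return ask(f"Section summaries:\n{context}\n\nQuestion: {question}")


# Toy usage with stand-ins: character count as the token counter, and a fake model.
chunks = pack_chunks(["aaaa", "bbbb", "cc"], budget_tokens=9, count_tokens=len)
print(chunks)
```

Stopping at the first chunk that does not fit keeps the selection strictly rank-ordered. Skipping it and trying smaller, lower-ranked chunks would use the budget more fully, at the cost of admitting less relevant text.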

Quick-start code

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.quicksilverpro.io/v1",
    api_key="sk-qsp-...",
)

# Load a long document, e.g. a 500-page PDF already extracted to text (~180K tokens)
with open("annual-report.txt", encoding="utf-8") as f:
    document = f.read()

resp = client.chat.completions.create(
    model="qwen3.5-35b",
    messages=[
        {"role": "system", "content": "You answer questions using only the provided document."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What was free cash flow in Q3?"},
    ],
    max_tokens=500,
)
print(resp.choices[0].message.content)

# `usage.cost` is a provider-specific field, not part of the standard OpenAI
# usage schema, so fall back gracefully if the response does not include it.
cost = getattr(resp.usage, "cost", None)
cost_text = f"${cost:.4f}" if cost is not None else "n/a"
print(f"Input tokens: {resp.usage.prompt_tokens}, cost: {cost_text}")
```

Frequently asked questions

Can I really use 262K tokens in a single prompt?

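Yes. 262,144 tokens is the published hard limit. On long-context retrieval tasks, needle-in-a-haystack performance is generally strong up to around 200K tokens; beyond that, accuracy on fine-grained lookups may drop. For retrieval-critical use cases, we still recommend combining the model with vector search and placing the most relevant content at the front of the prompt.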

What does "3B active MoE" mean?

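Mixture-of-Experts routes each token through only a subset of the model's parameters. Qwen3.5-35B-A3B has 35B parameters in total, but only 3B are active for any given token. Per-token compute is therefore close to that of a 3B dense model, while knowledge capacity is closer to 35B. That makes it faster and cheaper than a dense 35B model, which is why it suits long-context workloads.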
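To make the routing idea concrete, here is a toy illustration of top-k expert selection, a common way MoE layers choose which experts run for a token. The expert count, the value of k, and the random router scores are all made-up illustration values; this page does not describe the model's actual router configuration, so none of this should be read as its real architecture.

```python
import math
import random


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
NUM_EXPERTS, TOP_K = 16, 2  # toy values, not the model's real configuration
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
weights = top_k_route(logits, TOP_K)

print(f"active experts for this token: {sorted(weights)} of {NUM_EXPERTS}")
print(f"share of experts used: {TOP_K / NUM_EXPERTS:.1%}")
```

Only the selected experts' feed-forward weights are evaluated for the token, which is how a model with 35B total parameters can do roughly 3B parameters' worth of work per token (about 9% of the total).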

Does thinking mode affect cost?

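Qwen3.5-35B-A3B natively supports a reasoning mode. On QuickSilver Pro, reasoning mode is suppressed by default to keep output concise and predictable, so you do not pay for unnecessary thinking tokens. This matches what most RAG and summarization workflows expect.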

Does Qwen support tool calling?

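Yes, through the OpenAI tools API. Tool-call reliability is good for simple function signatures; for complex multi-tool agent loops, DeepSeek V3 tends to be more stable. Before going to production, we recommend benchmarking both on your specific agent task.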
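A single round of tool use through the OpenAI-compatible `tools` parameter might look like the sketch below. The tool `get_invoice_total`, its schema, and its stand-in implementation are hypothetical. `client` is assumed to be an `openai.OpenAI` instance pointed at the QuickSilver Pro base URL, and the model ID is the one used in the quick-start example.

```python
import json

# Hypothetical tool for illustration; the schema follows the OpenAI "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_invoice_total",
        "description": "Return the total amount of an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]


def get_invoice_total(invoice_id: str) -> dict:
    """Stand-in implementation; a real one would query your billing system."""
    return {"invoice_id": invoice_id, "total_usd": 1250.0}


HANDLERS = {"get_invoice_total": get_invoice_total}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call requested by the model and serialize the result."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid arguments: {exc}"})
    return json.dumps(HANDLERS[name](**args))


def ask_with_tools(client, question: str, model: str = "qwen3.5-35b") -> str:
    """Send the question, run any requested tools once, and return the final answer."""
    messages = [{"role": "user", "content": question}]
    resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        return msg.content
    messages.append(msg)  # keep the assistant turn that requested the tools
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content
```

Returning errors to the model as JSON, rather than raising, gives it a chance to correct a malformed call on the next turn. Because `model` is a parameter, the same function can be pointed at either model when benchmarking them on your agent task.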
