Pricing Guide • 2026

Kimi K2 API Pricing: Token Costs, Cache Discounts & Budget Control

Understand how the Kimi K2 API bills for input tokens, output tokens, cached/prefix tokens, and tool calls so you can ship AI features without surprise invoices.

Last updated: February 2, 2026. Always confirm the latest rates on the official Moonshot AI pricing page before production use.

Quick Snapshot

  • Usage-based billing: pay for tokens processed (input + output)
  • Cache savings: repeated/stable prompt prefixes can be billed at a lower input rate when cached
  • Turbo tiers: faster variants typically cost more per token
  • Tool fees: optional tools (e.g., web search) may add per-call charges
  • Budget control: cap output length, summarize chat history, and keep RAG context tight

Pricing is usage-based: input tokens + output tokens + cached tokens + tool fees.

Kimi K2 API Pricing Comparison: Kimi vs K2 vs K2.5 Token Costs & Best Value

All rates below are USD per 1M tokens.

| API / Model family | Typical context tiers | Input (cache miss) | Input (cache hit / cached) | Output | Notes |
| --- | --- | --- | --- | --- | --- |
| Kimi API (kimi-latest) | 8K / 32K / 128K | $0.20 (8K) / $1.00 (32K) / $2.00 (128K) | Often shown as $0.15 cached input | $2.00 (8K) / $3.00 (32K) / $5.00 (128K) | Pricing changes based on which context tier you use (short prompts are much cheaper). |
| Kimi K2 API | Long context (commonly listed 128K+) | $0.60 | $0.15 | $2.50 | Great “default” for agentic + coding with strong cost/perf. |
| Kimi K2.5 API | Long context (commonly listed ~256K) | $0.60 | $0.10 | $3.00 | Positioned for “visual coding + agent swarm”; output is higher than K2, cached input can be cheaper. |



1) What “Kimi K2 API pricing” actually means

When people say “Kimi K2 API pricing,” they usually mean:

  • Per-token costs for using K2-family chat/completions endpoints (input tokens + output tokens).

  • Optional discounted input pricing when your request benefits from context caching (cached/prefix-reused tokens).

  • Add-on tool costs if you call official tools like web search.

Unlike “flat monthly subscription” products, this is usage-based: you pay for what you process. Moonshot AI positions this as a “pay-as-you-go” platform in its developer messaging.

This guide focuses on how to estimate and control spend—because the cheapest model can still become expensive if you accidentally:

  • resend huge context every turn,

  • let output run long,

  • or use Turbo when you don’t need it.


2) The 3 things you can be billed for

A) Input tokens (prompt + system + conversation history)

Every request includes input: user messages, system instructions, tool outputs you include, retrieved documents, etc.

B) Output tokens (model’s answer)

Longer answers cost more. “Unlimited max_tokens” is the fastest way to surprise yourself at month-end.

C) Tool calls (example: web search)

If you call the official web search tool, you can pay a per-call fee in addition to token costs.


3) Current model pricing tiers (K2, K2 Thinking, Turbo)

Because the official docs pages can be dynamic, the most consistent way to describe pricing is to focus on the published “per 1M tokens” numbers that show up across:

  • official pricing snippets that include a “Last updated” timestamp,

  • community/vendor listings that mirror the same rates,

  • and ecosystem aggregators.

A) Standard K2 and K2 Thinking (non-Turbo)

A widely repeated and cross-sourced baseline for K2-family standard (non-Turbo) usage is:

  • Input (cache miss / regular input): $0.60 per 1M tokens

  • Input (cache hit / cached input): $0.15 per 1M tokens

  • Output: $2.50 per 1M tokens

You’ll often see this described as “$0.15 input when cached, $0.60 when not cached, $2.50 output.”

B) Turbo variants (higher speed, higher cost)

Turbo variants cost more (you’re paying for faster throughput / latency targets). A common set of Turbo numbers seen in multiple listings:

  • Turbo input: $1.15 per 1M tokens

  • Turbo output: $8.00 per 1M tokens

C) Context window differences (important for budgeting)

You’ll also see different K2 preview IDs with different context windows (e.g., ~131k vs ~262k tokens).

Why it matters: Longer context makes it easier to include huge documents—but it also makes it easier to run up your bill by repeatedly resending big chunks.

Takeaway:

  • Use standard K2 / K2 Thinking for most production workloads.

  • Reserve Turbo for time-sensitive flows (interactive IDE assistant, live agent UI) where latency is revenue.


4) How token counting works in real apps

Token costs feel confusing until you break down a single “turn” of a chat app.

A typical turn contains:

  1. System prompt (your rules, safety, style)

  2. Developer prompt (your app instructions / tool descriptions)

  3. Conversation history (prior user + assistant messages)

  4. Your new user message

  5. (Optional) retrieved context (RAG documents)

  6. (Optional) tool outputs (web search results, structured data)

  7. Model output

All of that can be billed (as input or output). So the biggest drivers of cost are:

  • How much you resend every turn (history + RAG)

  • How long the model answers

  • How often you call tools

This is why a “cheap per-token model” can still become costly in a badly structured agent loop.


5) Context caching: how it saves money

What is “cache hit” pricing?

K2 pricing is often described with two input rates:

  • Cache miss (regular input tokens): ~$0.60 / 1M

  • Cache hit (cached/reused prefix tokens): ~$0.15 / 1M

The idea: if you reuse the same prefix (system + long instructions + stable context), the platform can avoid recomputing it and discount those input tokens.

When caching helps the most

Caching shines when you have:

  • a large, stable system prompt (policies, tool schema, long instructions),

  • a stable knowledge pack (product catalog, FAQ, style guide),

  • a multi-step agent that keeps repeating the same scaffolding.

When caching helps the least

Caching won’t save much if:

  • every request is totally different (new prefix each time),

  • you constantly reshuffle the prompt,

  • your input is mostly “fresh” text (new document each time),

  • your app is output-heavy (output tokens dominate anyway).

Practical design tips to maximize cache hits

  1. Keep the beginning of the prompt stable. Put “always the same” content first.

  2. Don’t inject timestamps/random IDs early. Put volatile fields at the end.

  3. Split static vs dynamic context. Static = cached; dynamic = appended.

  4. For RAG: keep retrieval content after the stable instruction block.

If your app is mostly output tokens, caching is still good—but output control matters more.
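To make the ordering concrete, here is a minimal sketch of a cache-friendly request layout in Python. It assumes an OpenAI-compatible chat-completions message format (a common way to call Moonshot’s API); the system text and helper function are illustrative placeholders, not official names.

```python
# Sketch: order request content so the stable prefix can be cached across calls.
# The system text and helper below are illustrative placeholders.

STABLE_SYSTEM = (
    "You are a support assistant for ExampleCo.\n"
    "Policies: ...\n"     # long, unchanging instructions
    "Tool schema: ...\n"  # keep this byte-identical across requests
)

def build_messages(retrieved_docs: str, user_question: str) -> list[dict]:
    """Stable content first (cache-friendly), volatile content last."""
    return [
        {"role": "system", "content": STABLE_SYSTEM},  # identical every call
        {"role": "user",
         "content": f"Context:\n{retrieved_docs}\n\nQuestion: {user_question}"},
    ]
```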


6) Web search tool pricing

If you use official web search tooling, there’s typically an additional fee:

  • $0.005 per web search call

One important nuance from the published pricing: some calls may not be charged the call fee, depending on the tool’s finish reason (as described in the pricing text).

Budget impact of web search

At $0.005 per call:

  • 100 searches/month → $0.50

  • 1,000 searches/month → $5

  • 10,000 searches/month → $50

Tool fees can become meaningful at scale—especially if your agent loops call web search repeatedly.


7) Rate limits, top-ups, and “cumulative recharge”

Many developers get confused because “pricing” is not the whole story: rate limits affect how many requests you can run concurrently.

Moonshot’s docs mention rate limits tied to cumulative recharge (how much you’ve topped up total).

Voucher / bonus mechanics (early usage)

Some community guides and public posts mention a $5 voucher after reaching a certain recharge threshold (often described as “after your first recharge” or “when cumulative recharge reaches $5”).

Treat promos as nice-to-have, not core budgeting. Promos change; architecture lasts.


8) Practical cost calculator

You can estimate a request cost with:

Cost = (InputTokens / 1,000,000 × InputRate) + (OutputTokens / 1,000,000 × OutputRate) + ToolFees

Using standard non-Turbo baseline rates:

  • InputRate (miss): 0.60

  • InputRate (cached): 0.15

  • OutputRate: 2.50

Convert to “per 1K tokens” (easier mental math)

Divide per-1M by 1000:

  • Input miss: $0.60 / 1M → $0.00060 per 1K input tokens

  • Input cached: $0.15 / 1M → $0.00015 per 1K cached input tokens

  • Output: $2.50 / 1M → $0.00250 per 1K output tokens

So if a typical response uses:

  • 3,000 input tokens (mostly uncached)

  • 800 output tokens

Estimated token cost:

  • Input: 3 × $0.00060 = $0.00180

  • Output: 0.8 × $0.00250 = $0.00200

  • Total ≈ $0.00380 (under half a cent)

That’s why K2 can feel “shockingly cheap” until you scale to millions of requests or accidentally push huge contexts.
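If you want this as code, here is a minimal estimator using the baseline non-Turbo rates quoted above (verify the rates for your exact model in the Moonshot console before relying on it):

```python
# Sketch: per-request cost estimator at the baseline non-Turbo rates.
RATES = {"input_miss": 0.60, "input_cached": 0.15, "output": 2.50}  # USD per 1M tokens

def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0, tool_fees: float = 0.0) -> float:
    uncached = input_tokens - cached_tokens
    return (uncached / 1_000_000 * RATES["input_miss"]
            + cached_tokens / 1_000_000 * RATES["input_cached"]
            + output_tokens / 1_000_000 * RATES["output"]
            + tool_fees)

# The worked example above: 3,000 mostly-uncached input + 800 output tokens.
print(f"${estimate_cost(3_000, 800):.5f}")  # ≈ $0.00380
```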



9) Real-world budget scenarios

Below are realistic scenarios to help you plan monthly spend. I’ll use the baseline non-Turbo numbers and call out where caching/tool fees change things.

Scenario 1: Customer support bot (simple)

Assume:

  • 2,000 input tokens/request

  • 400 output tokens/request

  • 50,000 requests/month

  • No web search

Per request:

  • Input: 2 × 0.00060 = $0.00120

  • Output: 0.4 × 0.00250 = $0.00100

  • Total: $0.00220

Monthly:

  • 50,000 × $0.00220 = $110/month

How to cut it: cap output length (reduce from 400 → 250 tokens), and summarize chat history. Biggest savings often come from output control.
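As a sanity check, here is a short sketch that reproduces this scenario’s math and shows the saving from the suggested output cap (same baseline rates as above):

```python
# Sketch: monthly spend for Scenario 1 at baseline rates (USD per 1M tokens).
INPUT_RATE, OUTPUT_RATE = 0.60, 2.50

def monthly_cost(input_tokens: int, output_tokens: int, requests: int) -> float:
    per_request = (input_tokens / 1_000_000 * INPUT_RATE
                   + output_tokens / 1_000_000 * OUTPUT_RATE)
    return per_request * requests

print(monthly_cost(2_000, 400, 50_000))  # ≈ 110.00 (as computed above)
print(monthly_cost(2_000, 250, 50_000))  # ≈ 91.25  (output capped at 250 tokens)
```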


Scenario 2: RAG (chat with docs) for a knowledge base

Assume:

  • 1,500 stable instruction tokens (cached)

  • 2,500 retrieved document tokens (uncached)

  • 600 output tokens

  • 20,000 requests/month

Per request:

  • Cached input: 1.5 × 0.00015 = $0.000225

  • Uncached input: 2.5 × 0.00060 = $0.00150

  • Output: 0.6 × 0.00250 = $0.00150

  • Total ≈ $0.003225

Monthly:

  • 20,000 × $0.003225 ≈ $64.50/month

Where teams lose money: retrieving too many chunks (e.g., 10k tokens every time) and not caching the stable prefix.


Scenario 3: Coding agent that iterates (multi-step tool use)

Assume:

  • Each task triggers 8 model calls (plan → implement → test → fix loops)

  • Each call: 4,000 input tokens, 900 output tokens

  • 2,000 tasks/month

  • No web search tool, but heavy conversation history

Per call:

  • Input: 4 × 0.00060 = $0.00240

  • Output: 0.9 × 0.00250 = $0.00225

  • Total: $0.00465

Per task (8 calls):

  • 8 × 0.00465 = $0.0372

Monthly:

  • 2,000 × $0.0372 ≈ $74.40/month

That’s still quite manageable, but only if you keep history under control. If you let context explode to 50k tokens/call, costs jump.


Scenario 4: Long-context summarization (the “silent bill killer”)

Assume:

  • You send a 120,000-token document

  • Output summary: 2,000 tokens

  • Do this 300 times/month

Per request:

  • Input: 120 × 0.00060 = $0.072

  • Output: 2 × 0.00250 = $0.005

  • Total ≈ $0.077

Monthly:

  • 300 × $0.077 = $23.10/month

This is not huge—but if you scale to thousands of long-document jobs, it becomes a real line item. And if you choose Turbo for these jobs, the cost can spike.


Scenario 5: Web-search agent (tool fees add up)

Assume:

  • 5 web searches per user session

  • 30,000 sessions/month

  • Web search fee: $0.005/call

Tool fees:

  • 5 × 30,000 × $0.005 = $750/month (tool fees alone)

This is the big warning: tool calls at scale can cost more than tokens. You must:

  • dedupe searches,

  • cache search results,

  • and avoid “search loops.”


10) Cost-optimization playbook (how serious teams keep K2 cheap)

Strategy A: Use a “cheap default → expensive escalate” router

Most queries do not need “thinking” or “turbo.”

Pattern

  1. Run standard K2 for the first attempt.

  2. If low confidence / user asks for deeper reasoning, escalate to K2 Thinking.

  3. If user is in a latency-critical UI moment, escalate to Turbo.

This one pattern often cuts spend 30–70% in production.
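A minimal routing sketch, assuming illustrative model IDs (check the exact IDs in your console) and whatever confidence signal your app already produces:

```python
# Sketch: "cheap default -> expensive escalate" model router.
# Model IDs below are illustrative, not official names.

def pick_model(needs_deep_reasoning: bool, latency_critical: bool) -> str:
    if latency_critical:
        return "kimi-k2-turbo"     # pay the premium only where speed is revenue
    if needs_deep_reasoning:
        return "kimi-k2-thinking"  # escalate on low confidence or explicit request
    return "kimi-k2"               # standard K2 handles most traffic
```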


Strategy B: Hard caps on output

Output tokens are expensive relative to input ($2.50 vs $0.60 per 1M at the baseline rates).

Do:

  • Set max_tokens per endpoint by route (FAQ answers: 200–400; summaries: 600–1200).

  • Use “short answer + expandable details” UI.

  • Use stop sequences where appropriate (lists, code blocks).
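A small sketch of per-route caps; the numbers mirror the ranges above, and the route names are placeholders:

```python
# Sketch: per-route output caps so no endpoint ships uncapped generation.
MAX_TOKENS_BY_ROUTE = {
    "faq": 400,       # FAQ answers: 200-400
    "summary": 1200,  # summaries: 600-1200
}

def completion_params(route: str) -> dict:
    # Unknown routes fall back to a conservative cap instead of "unlimited".
    return {"max_tokens": MAX_TOKENS_BY_ROUTE.get(route, 500)}
```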


Strategy C: Summarize history aggressively

For chat apps:

  • Keep the last ~10–20 messages raw.

  • Summarize older conversation into a compact memory.

  • Store facts in structured state (JSON) instead of replaying full text.

This reduces input tokens and improves model focus.
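One way to sketch this, with summarize() standing in for a cheap summarization call (a hypothetical helper, not a library API):

```python
# Sketch: keep recent messages raw, fold older ones into a compact memory.
KEEP_RAW = 15  # keep the last ~10-20 messages verbatim

def compact_history(messages: list[dict], summarize) -> list[dict]:
    if len(messages) <= KEEP_RAW:
        return messages
    older, recent = messages[:-KEEP_RAW], messages[-KEEP_RAW:]
    memory = summarize(older)  # e.g., a short model call that extracts key facts
    return [{"role": "system", "content": f"Conversation memory: {memory}"}] + recent
```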


Strategy D: RAG discipline (retrieve less, retrieve better)

Common RAG mistake: retrieving “everything” so the model has no signal.

Do:

  • Limit retrieved chunks (e.g., top 3–5).

  • Use shorter chunk sizes.

  • Use “query rewriting” to target retrieval.

  • Only include citations/quotes you truly need in the prompt.
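A retrieval-discipline sketch; retriever.search() is a placeholder for whatever vector store you use:

```python
# Sketch: few, short chunks instead of "retrieve everything".
TOP_K = 4               # top 3-5 chunks
MAX_CHUNK_CHARS = 1200  # keep each chunk short

def build_context(retriever, query: str) -> str:
    chunks = retriever.search(query, k=TOP_K)  # placeholder retriever API
    return "\n---\n".join(chunk.text[:MAX_CHUNK_CHARS] for chunk in chunks)
```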


Strategy E: Design prompts for cache hits

Because cached input is far cheaper (often shown as $0.15 vs $0.60 per 1M), architect your prompt so the stable prefix is reused.

Concrete steps:

  • Keep system and tool schema identical across calls.

  • Put variable content (user question, retrieved docs) after stable instructions.

  • Avoid random IDs at the beginning.

  • Keep formatting identical (even whitespace changes can matter in some caching implementations).


Strategy F: Deduplicate tool calls (web search)

At $0.005 per search, you must treat web search like an API billable resource.

Do:

  • Cache results per query for a TTL (e.g., 30–120 minutes).

  • Use one search + multiple follow-up reasoning steps.

  • Add a “search budget” per session.
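A minimal sketch of all three, where web_search() stands in for the billed tool call:

```python
# Sketch: TTL cache + per-session budget around a billed web search call.
import time

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 60 * 60  # pick something in the 30-120 minute range

def cached_search(query: str, web_search, budget: dict) -> object:
    key = query.strip().lower()            # cheap dedupe of near-identical queries
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: no $0.005 tool fee
    if budget["searches_left"] <= 0:
        raise RuntimeError("search budget exhausted for this session")
    budget["searches_left"] -= 1
    result = web_search(query)             # the billed call
    _CACHE[key] = (time.time(), result)
    return result
```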


11) Common mistakes that silently inflate your bill

  1. Unlimited output (“just generate everything”)

  2. Resending entire conversation history every turn

  3. RAG retrieving too many tokens (10k–50k per request)

  4. Using Turbo everywhere (paying premium for no user-visible benefit)

  5. Tool-call loops (web search, browse, re-search)

  6. No observability (you can’t optimize what you don’t measure)


API pricing comparison (USD per 1M tokens)

| Model family | Example model / tier used for comparison | Input / 1M | Output / 1M | Important notes |
| --- | --- | --- | --- | --- |
| Kimi K2 (Moonshot AI) | Kimi K2 | $0.60 | $2.50 | Straightforward value pricing for agentic + coding workloads. |
| GPT-4 (OpenAI) | GPT-4 Turbo (legacy) | $5.00 | $15.00 | The classic “GPT-4 Turbo” era pricing (older models also shown on the same page). |
| | GPT-4 (older, not Turbo) | $15.00 | $30.00 | Listed as older GPT-4 variants (more expensive). |
| | GPT-4.1 (newer flagship line) | $3.00 (cached $0.75) | $12.00 | If you mean “best OpenAI flagship pricing now,” GPT-4.1 is the more current reference point. |
| Claude (Anthropic) | Standard tier (≤200K input tokens) | $3.00 | $15.00 | If your input exceeds 200K tokens, the whole request moves to higher “long context” rates. |
| | Long context (>200K input tokens) | $6.00 | $22.50 | Triggered by input tokens only. |
| | Opus 4.5 (premium) | $5.00 | $25.00 | Model-specific pricing (more premium than standard). |
| | Haiku 4.5 (budget) | $1.00 | $5.00 | Cheapest Claude family option (great for throughput). |
| Gemini (Google) | Gemini 2.5 Pro (paid tier) | $1.25 (≤200K) / $2.50 (>200K) | $10.00 (≤200K) / $15.00 (>200K) | Pricing changes when prompts exceed 200K tokens; output includes “thinking tokens.” |
| | Gemini 2.5 Flash (paid tier, same page) | $0.625 (≤200K) / $1.25 (>200K) | $5.00 (≤200K) / $7.50 (>200K) | Often the “best value” Gemini choice if you don’t need Pro quality. |
| DeepSeek (DeepSeek) | deepseek-chat | $0.27 (cache miss) / $0.07 (cache hit) | $1.10 | Very cost-effective; cache-hit pricing is notably low. |
| | deepseek-reasoner | $0.55 (miss) / $0.14 (hit) | $2.19 | “Reasoner” costs more than chat (as expected). |
| Qwen (Alibaba Cloud) | Qwen3 Next 80B A3B (AWS Bedrock) | $0.15 | $1.20 | AWS lists per 1,000 tokens; converted here to per-1M. |
| | Qwen3 VL 235B A22B (AWS Bedrock) | $0.53 | $2.66 | More capable multimodal variant, higher price. |
| | qwen3-coder-flash, ≤32K (Model Studio example) | $0.144 | $0.574 | Qwen pricing varies by token-per-request tier on some platforms. |
| Llama (Meta) | Together-hosted Llama 3.1 8B / 70B / 405B | $0.18 / $0.88 / $3.50 | $0.18 / $0.88 / $3.50 | Llama is open-weights: self-hosting is possible; hosted pricing depends on provider. |

What’s cheapest (usually), and what’s “closest to Kimi K2”

  • Lowest cost for general chat often goes to DeepSeek Chat (especially with cache hits).

  • Closest “mid-cost but strong” range to Kimi K2 is typically Qwen (some tiers) and Gemini Flash, depending on the exact model you pick.

  • Most expensive in this list (for text output) is commonly GPT-4 class pricing and premium Claude variants (e.g., Opus) - though OpenAI’s “newer flagship line” pricing (GPT-4.1) is lower than old GPT-4 Turbo/legacy references.



12) FAQs: Kimi K2 API Pricing

Is Kimi K2 API free?

Not really. You can usually create an account and generate an API key, but actual API usage is pay-as-you-go and you’ll typically need to top up credits to run meaningful requests in production.

Does Kimi K2 have free credits?

There are sometimes promo vouchers. For example, Moonshot AI has publicly mentioned a $5 voucher after your first recharge/top-up.
(These promos can change; always check your dashboard for current offers.)


How much does Kimi K2 cost per 1000 tokens?

Pricing is usually listed per 1,000,000 (1M) tokens. Two common references you’ll see:

  • Kimi K2 (commonly listed): $0.60 / 1M input and $2.50 / 1M output
    That equals:

    • Input: $0.60 ÷ 1000 = $0.00060 per 1K tokens

    • Output: $2.50 ÷ 1000 = $0.00250 per 1K tokens

  • Kimi K2 variant shown on Moonshot’s platform home page (example): $0.60 / 1M input, $3.00 / 1M output, and $0.10 / 1M cache-hit (cached input).
    That equals:

    • Cached input: $0.10 ÷ 1000 = $0.00010 per 1K cached tokens

    • Output: $3.00 ÷ 1000 = $0.00300 per 1K tokens

Because different model IDs/versions can have slightly different rates, the safest approach is: use the price shown in your Moonshot console for the exact model you call.


How to reduce Kimi K2 API cost (best tactics)

  1. Cap output length (max_tokens) - output is usually the most expensive part.

  2. Don’t resend huge chat history - summarize older messages into a short memory.

  3. Tighten RAG - retrieve fewer chunks (top 3–5) and keep them short.

  4. Use Turbo only when speed is critical - Turbo is priced higher than base.

  5. Avoid tool-call loops (web search, etc.) - per-call fees can dominate at scale.


Does Kimi K2 support caching discounts?

Yes - Moonshot describes cached tokens being charged at a cache-hit input rate (discounted input).
For Turbo pricing, Moonshot explicitly lists a cache-hit rate in its pricing update.
(How much you benefit depends on whether your prompt has a stable repeated prefix, like a long system prompt/tool schema.)


How to calculate your Kimi K2 bill

Use this template:

**Total cost = (input_tokens ÷ 1,000,000 × input_rate)
+ (cached_input_tokens ÷ 1,000,000 × cache_hit_rate)
+ (output_tokens ÷ 1,000,000 × output_rate)
+ (web_search_calls × $0.005)**

Example (using the commonly listed K2 rates):

  • Input: 10,000 tokens → 10,000/1,000,000 × 0.60 = $0.006

  • Output: 2,000 tokens → 2,000/1,000,000 × 2.50 = $0.005

  • Total ≈ $0.011 (without tools)


Kimi K2 API pricing changelog (what to include)

If you want a clean pricing changelog section on your page, use dated bullets like:

  • Nov 6, 2025: Moonshot announced new Turbo pricing (cache hit, cache miss, output) and launched K2 Thinking Turbo at the new rates.

  • Jan 26, 2026: Moonshot documented web search tool pricing: $0.005 per web_search call (with a noted condition where no call fee is charged in some cases).

And add: “Always verify current rates in the dashboard,” because pricing can update outside announcements.


Is Kimi K2 pricing “per request” or “per token”?

It’s primarily per token (input + output), plus possible tool-call fees (like web search).


What’s the difference between “cache hit” and “cache miss” pricing?

Cache hits apply when your request reuses a stable prefix / cached content, lowering input cost (often shown as ~$0.15 per 1M instead of ~$0.60).


Does K2 Thinking cost more than K2?

In many listings, K2 Thinking uses the same baseline per-token rates as standard K2 (non-Turbo), though the total tokens can increase because reasoning-style outputs can be longer.


Why would I ever use Turbo if it’s more expensive?

Because it can be faster (better user experience, higher throughput). Use it where speed directly impacts conversion, retention, or dev productivity. Turbo rates are commonly listed much higher (e.g., ~$1.15 input / $8 output per 1M).


How much does the official web search tool cost?

Published pricing indicates $0.005 per web search call (in addition to tokens).


Are there rate limits?

Yes - rate limits can depend on cumulative recharge/top-up, per the platform’s rate limit guidance.


Is there any promo credit?

Some public posts and guides reference a $5 voucher tied to recharge milestones. Promos change, so treat them as temporary.


What’s the fastest way to reduce cost without hurting quality?

  1) Cap output, 2) summarize history, 3) tighten RAG, 4) use caching-friendly prompts, 5) route Turbo only when needed.


Key takeaways

  • Baseline (standard K2 family): ~$0.60 / 1M input (miss), ~$0.15 / 1M cached input, ~$2.50 / 1M output

  • Turbo: higher cost (often ~$1.15 input / $8 output per 1M)

  • Web search tool: ~$0.005 per call

  • Biggest cost wins come from output caps + context control + tool-call discipline

Kimi AI with K2.5 | Visual Coding Meets Agent Swarm

Kimi K2 API pricing is what decides whether that power feels effortless or expensive. This guide breaks down token costs, cache discounts, Turbo trade-offs, and real budget examples so you can scale agents confidently without invoice surprises.