Kimi API - a complete developer guide to Moonshot AI’s Open Platform
The Kimi API (Moonshot AI Open Platform) gives developers programmatic access to Kimi and Moonshot’s large language models, covering chat completions, long-context reasoning, tool calling (function calling), vision inputs, and supporting APIs for token estimation and files. It is designed to feel familiar: many endpoints follow an OpenAI-compatible structure, so you can migrate existing applications with minimal changes.
This page is written as a practical, production-first reference. It explains how authentication works, what base URL to use, how to pick models, how to call tools safely, how to handle long context and vision prompts, how to keep costs predictable, and how to build a reliable architecture around rate limits and retries.
1) What the Kimi API is (and when it’s the right choice)
Moonshot AI’s Open Platform (often called the “Kimi API”) provides an HTTP API to interact with Moonshot/Kimi models. The official docs describe an ecosystem that includes long-context models (32k, 128k, and newer 256k-class context), tool calling, and multimodal vision models.
When Kimi API is a good fit
- Long-context chat and document Q&A: you want large context windows for deep threads, long docs, or multi-step workflows.
- Tool-using agents: you want the model to call tools (functions) to fetch data, run actions, or orchestrate workflows.
- Migration from OpenAI-style chat: you already have code written around OpenAI-like chat completions and want a compatible provider.
- Cost control: you need competitive token pricing and predictable unit economics (especially at scale).
- Vision and multimodal workflows: you want models that accept image inputs and combine them with text prompts.
When Kimi API is not the best fit
- Purely local/offline requirements: if your product must run without external network access, a hosted API won’t work.
- Ultra-low latency micro-responses: LLM responses involve token generation; for strict <50ms use cases, you may want smaller models or heuristics.
- Fully deterministic output: LLMs are probabilistic; use structured output constraints and validation for critical workflows.
2) Authentication & base URLs (global vs China)
Kimi API uses standard Bearer authentication. You send your API key in the Authorization header: Authorization: Bearer YOUR_KEY. Base URLs can differ by region. Documentation and integrations commonly reference: https://api.moonshot.ai/v1 (global) and https://api.moonshot.cn/v1 (China).
Global base URL
Use this for many international deployments:
https://api.moonshot.ai/v1
China base URL
Use this if your account/region is set up for China:
https://api.moonshot.cn/v1
Security best practices
- Never place API keys in frontend JavaScript. Always call Kimi API from your backend.
- Store keys in a secrets manager (or at least environment variables) and rotate periodically.
- Use request logging and per-user quotas to reduce abuse risk and prevent surprise billing.
- If you allow user-supplied prompts, add content filtering and “dangerous prompt” detection for your domain.
Quick curl health test: list models
Many platforms expose a models listing endpoint. A practical first test is to call your provider’s models list (or make a minimal chat request). If your key is valid and base URL is correct, you should get a JSON response.
curl "https://api.moonshot.ai/v1/models" \
-H "Authorization: Bearer $MOONSHOT_API_KEY"
3) OpenAI compatibility: why migration is usually straightforward
A major advantage of the Kimi API ecosystem is compatibility with familiar “chat completions” style request bodies. In practice, migration often comes down to:
- Swap base URL to https://api.moonshot.ai/v1 (or .cn).
- Replace the API key environment variable and header.
- Update the model string to a Moonshot/Kimi model ID (see next section).
- Review any differences in tool calling parameters (especially legacy functions vs modern tool schema).
Basic “drop-in” chat request (conceptual)
POST https://api.moonshot.ai/v1/chat/completions
Authorization: Bearer $MOONSHOT_API_KEY
Content-Type: application/json
{
"model": "moonshot-v1-32k",
"messages": [
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Summarize this article in 6 bullets..."}
],
"temperature": 0.3
}
Do I need OpenAI’s SDK to call Kimi API?
No. You can use plain HTTPS requests. However, many OpenAI-compatible SDKs allow specifying a custom base URL. If your existing code already uses an OpenAI client, you can often keep your code structure and swap configuration.
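For example, here is a minimal sketch that points the official OpenAI Python SDK (v1+) at Moonshot’s base URL. The client class and method names come from the OpenAI SDK itself; only the base URL, API key, and model ID change.

import os
from openai import OpenAI  # pip install openai (v1 or later)

# Point the client at Moonshot instead of api.openai.com
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)

resp = client.chat.completions.create(
    model="moonshot-v1-32k",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    temperature=0.3,
)
print(resp.choices[0].message.content)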
4) Models & context windows (how to choose the right one)
Moonshot’s documentation describes multiple families: moonshot-v1 models optimized for different context lengths (commonly 32k and 128k), plus vision preview variants and newer Kimi K2/K2.5 generation models that emphasize tool use, coding, and agentic workflows. In the official “Main Concepts” documentation, models are described by intended use and context size.
Core model selection rules
Pick context first
If your conversation or retrieved documents are large, choose a larger context model. If most prompts are small, avoid paying for massive context you don’t use.
Pick capability second
For tool-using agents and complex coding tasks, prefer K2/K2.5-class models when available. For simple Q&A, smaller models may be more cost-effective.
Keep a tiered UX
In product UI, expose “Standard / Long Context / Pro Agent / Vision” modes rather than showing raw model IDs to users.
Common model categories
| Category | Typical model IDs | Best for | Notes |
|---|---|---|---|
| General chat | moonshot-v1-32k / moonshot-v1-128k | Summaries, assistants, long text, knowledge work | 32k for most; 128k for very large prompts or multi-doc RAG |
| Vision (preview) | moonshot-v1-8k-vision-preview, moonshot-v1-32k-vision-preview, moonshot-v1-128k-vision-preview | Image + text prompts, screenshots, charts, UI analysis | Use only when you actually need image understanding |
| Agentic / coding (K2/K2.5) | Kimi K2 / K2.5 family IDs (see your /models list) | Tool calling, code generation, longer reasoning chains | Often best for multi-step tasks and automation pipelines |
5) Chat API: messages, parameters, and streaming
Most applications start with chat completions. You send a list of messages (system, user, assistant) and receive a model-generated response. A production-grade chat layer usually includes: safe prompt construction, streaming support, response validation, retries, and logging.
Message roles and prompt hygiene
- system: set behavior (“You are a concise support agent.”), output constraints, tone, policy.
- user: the user’s input (ideally sanitized and length-checked).
- assistant: previous replies (conversation memory). Keep only what you need to control token usage.
Parameter cheatsheet (typical)
Quality + creativity knobs
temperature controls randomness (lower = more deterministic). For extraction and structured outputs, keep it low (0–0.3). For ideation and creative writing, you can raise it (0.7–1.0). If the API supports top_p, use one or the other—don’t “fight” both.
Length control
Use max output limits where available, and design your prompts to request a specific shape: “Return exactly 6 bullets” or “Return JSON with keys …”. This reduces run-on outputs and keeps costs predictable.
JavaScript example (fetch)
async function kimiChat({ prompt }) {
const res = await fetch("https://api.moonshot.ai/v1/chat/completions", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.MOONSHOT_API_KEY}`,
"Content-Type": "application/json"
},
body: JSON.stringify({
model: "moonshot-v1-32k",
messages: [
{ role: "system", content: "You are a concise developer assistant." },
{ role: "user", content: prompt }
],
temperature: 0.2
})
});
if (!res.ok) {
const err = await res.text();
throw new Error(`Kimi API error ${res.status}: ${err}`);
}
const data = await res.json();
return data.choices?.[0]?.message?.content ?? "";
}
Python example (requests)
import os, requests
def kimi_chat(prompt: str) -> str:
url = "https://api.moonshot.ai/v1/chat/completions"
headers = {
"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}",
"Content-Type": "application/json",
}
payload = {
"model": "moonshot-v1-32k",
"messages": [
{"role": "system", "content": "You are a helpful assistant. Keep answers under 150 words."},
{"role": "user", "content": prompt},
],
"temperature": 0.2,
}
r = requests.post(url, headers=headers, json=payload, timeout=60)
r.raise_for_status()
data = r.json()
return data["choices"][0]["message"]["content"]
Streaming: why you probably want it
Streaming improves perceived latency: the user sees tokens as they arrive. It also reduces “double click” spam. If your SDK supports streaming, treat it as the default UX for interactive chat. For background jobs (batch summaries), non-streaming is fine.
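If you call the API over plain HTTPS, streaming is mostly a matter of setting stream to true and reading the response incrementally. The sketch below assumes Moonshot follows the OpenAI-style server-sent-events format, where each line is "data: {json chunk}" and the stream ends with "data: [DONE]"; verify the exact chunk shape against the Chat API reference.

import os, json, requests

def kimi_chat_stream(prompt: str):
    r = requests.post(
        "https://api.moonshot.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json={
            "model": "moonshot-v1-32k",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,  # let requests yield the body incrementally
        timeout=60,
    )
    r.raise_for_status()
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        payload = json.loads(chunk)
        if not payload.get("choices"):
            continue  # some chunks (e.g. usage summaries) carry no delta
        delta = payload["choices"][0].get("delta", {})
        yield delta.get("content") or ""

# Usage: print tokens as they arrive
for piece in kimi_chat_stream("Explain streaming in one paragraph."):
    print(piece, end="", flush=True)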
Structured outputs: how to do it safely
Ask for a strict JSON schema in your prompt and validate the output on your backend. If parsing fails, retry with a “repair” prompt that includes the model’s previous output and asks it to fix only the JSON. Keep temperature low, and never trust JSON without validation.
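As a sketch of that validate-then-repair pattern, reusing the kimi_chat helper from the Python example above (the repair prompt wording is only illustrative):

import json

def chat_json(prompt: str, max_repairs: int = 2) -> dict:
    """Ask for JSON, validate it, and retry with a repair prompt if parsing fails."""
    raw = kimi_chat(prompt + "\nReturn ONLY valid JSON, no prose.")
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt == max_repairs:
                raise ValueError("Model did not return valid JSON after repair attempts")
            # Feed the broken output back and ask the model to fix only the JSON
            raw = kimi_chat(
                "The following was supposed to be valid JSON but is not. "
                "Fix it and return only the corrected JSON:\n" + raw
            )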
6) Tool calling (function calling): building agents that do real work
Tool calling allows the model to decide when to call external functions (tools) you define. Moonshot’s tool use documentation describes tool calling as a crucial feature for building agentic workflows: the model outputs a structured “tool call” object, and your application executes the tool and returns the result back to the model as a follow-up message.
How tool calling works in practice
- You define tools (functions) with names, descriptions, and a JSON schema for parameters.
- You send a chat request with tools included.
- The model replies either with normal text, or with one or more tool_calls.
- Your code executes the tool(s) securely and returns tool results as messages.
- The model uses those results to produce the final user-facing response.
Tool definition example (JSON schema)
{
"type": "function",
"function": {
"name": "search_kb",
"description": "Search the internal knowledge base for relevant documents.",
"parameters": {
"type": "object",
"properties": {
"query": { "type": "string", "description": "Search query." },
"top_k": { "type": "integer", "description": "Number of results.", "default": 5 }
},
"required": ["query"]
}
}
}
Typical tool calling loop (pseudo-code)
messages = [system, user]
tools = [search_kb, get_order_status, ...]
resp = chat(messages, tools, tool_choice="auto")
if resp contains tool_calls:
for each call:
args = validate(call.arguments)
result = execute_tool(call.name, args)
messages.append({role:"tool", tool_call_id: call.id, content: json(result)})
resp2 = chat(messages, tools, tool_choice="auto")
return resp2.final_answer
else:
return resp.text
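The same loop as a concrete Python sketch. It assumes an OpenAI-compatible response shape (an assistant message carrying tool_calls, each with a function name and JSON-encoded arguments); chat and tool_registry are placeholders for your own client wrapper and tool implementations.

import json

def run_agent(chat, messages, tools, tool_registry, max_rounds=3):
    """chat(messages, tools) returns an assistant message dict; tool_registry maps name -> callable."""
    for _ in range(max_rounds):
        msg = chat(messages, tools)          # one chat completion with tools attached
        tool_calls = msg.get("tool_calls") or []
        if not tool_calls:
            return msg.get("content", "")    # plain text answer: we are done
        messages.append(msg)                 # keep the assistant's tool_call message in history
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"])  # validate against your schema here
            result = tool_registry[name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("Agent exceeded the tool-call budget")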
Important note about tool_choice for certain Kimi models
Some Kimi K2.5 quickstart guidance notes that tool_choice may be restricted to "auto" or "none" by default to avoid conflicts between reasoning and forced tool selection. In practice, “auto” is the best default; only use “none” for pure text generation.
Tool calling design patterns that actually scale
- Keep tools small and single-purpose (search, fetch, update).
- Use consistent output formats from tools (JSON).
- Add timeouts and retries around tool execution.
- Log every tool call for debugging and safety auditing.
- Add a “planner” prompt to encourage the model to call tools only when needed, not by habit.
Official tools vs your tools
Moonshot documentation also references “official tools.” In product development, you typically mix: official/provider-supported tools (when available) and your application-specific tools (your database, your APIs). Regardless, the execution should happen in your trusted runtime with access control.
7) Vision inputs: using Kimi vision preview models
Moonshot’s guides describe vision preview models such as moonshot-v1-8k-vision-preview, moonshot-v1-32k-vision-preview, and moonshot-v1-128k-vision-preview. These models accept image input plus text, enabling workflows like screenshot interpretation, chart explanations, UI analysis, and multimodal reasoning.
When to use vision
- Users upload screenshots and want explanations (“Why is this build failing?” “What does this chart imply?”).
- Document processing includes scanned pages or images where text extraction alone loses context.
- You want the assistant to interpret UI/UX mocks or design comps and generate code or copy.
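In each of these cases the request carries both text and image content. As a rough sketch, assuming the OpenAI-style content array with an image_url part (a base64 data URL) that Moonshot’s vision guide describes, sending an image plus a question could look like this; confirm the exact payload format and model ID for your account:

import base64, os, requests

def ask_about_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "moonshot-v1-32k-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 0.2,
    }
    r = requests.post(
        "https://api.moonshot.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json=payload,
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]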
Vision cost & token planning
Vision inputs typically consume tokens too (or have separate accounting). Moonshot provides a token estimation API that can include both plain text and visual input. Use it to forecast cost before sending large images in production.
Prompting tips for vision
Be explicit about the task
“Describe what’s in this image” is vague. Better: “Read the error message, identify the root cause, and propose 3 fixes in priority order.” Or: “Extract table rows into JSON with keys {name, value, unit}.”
Ask for structured outputs
Vision analysis often benefits from structure: bullet lists, step-by-step troubleshooting, or JSON. Keep temperature low and validate outputs, especially for extraction tasks.
Common pitfalls with screenshots
Screenshots can include sensitive data (emails, tokens, personal info). If your app supports user uploads, implement redaction guidance, access control, and retention policies. Also consider adding an “auto-blur” step for obvious secrets like API keys in logs.
8) Token estimation: the underrated API that saves money
Moonshot exposes an “Estimate Tokens” API used to calculate token count for a request, including both plain text and visual input. In production, token estimation is valuable because it gives you a chance to: (1) warn users before an expensive request, (2) automatically compress/trim context, and (3) enforce budgets.
What to estimate before you send
- Long user messages or pasted documents
- Large RAG results that might overrun context
- Multimodal messages with images
- Agent loops that might chain many tool calls and assistant turns
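A minimal sketch of estimating before you send: the endpoint path and response shape below (tokenizers/estimate-token-count returning data.total_tokens) are taken from Moonshot’s token estimation docs, so double-check them against the official reference linked at the end of this page.

import os, requests

def estimate_tokens(messages, model="moonshot-v1-32k") -> int:
    """Ask the platform how many tokens a request would consume before actually sending it."""
    r = requests.post(
        "https://api.moonshot.ai/v1/tokenizers/estimate-token-count",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json={"model": model, "messages": messages},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]["total_tokens"]

# Usage: warn, trim, or switch strategies before an expensive call
msgs = [{"role": "user", "content": "…a very long pasted document…"}]
if estimate_tokens(msgs) > 20_000:
    print("Prompt is large: trim, summarize, or use retrieval instead of full paste.")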
Practical strategy: three-stage context management
Stage 1: trim
Remove older conversation turns that are no longer relevant. Keep only the last few user/assistant turns and any “pinned” system instructions.
Stage 2: summarize
Summarize older context into a short “memory” block: decisions, user preferences, known facts. Then replace many turns with one summary message.
Stage 3: retrieve
Instead of pasting full documents, retrieve only the relevant chunks (RAG) and cite them or attach them as context. Use token estimation to keep retrieval bounded.
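A tiny sketch of the trim stage in code, using a "keep system plus last N turns" policy and the estimate_tokens helper sketched above (the turn count and token budget are illustrative, and each loop iteration makes one estimation call):

def trim_history(messages, keep_turns=6):
    """Stage 1: keep pinned system instructions plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_turns:]

def fit_to_budget(messages, budget_tokens=12_000):
    """Drop the oldest non-system turn until the estimated size fits the budget."""
    trimmed = trim_history(messages)
    while len(trimmed) > 2 and estimate_tokens(trimmed) > budget_tokens:
        # index 0 holds the system message (if any), so drop the turn right after it
        trimmed.pop(1 if trimmed[0]["role"] == "system" else 0)
    return trimmed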
Why estimation matters even for 32k+
A bigger context window doesn’t mean “free.” Large prompts cost more, slow down responses, and increase the chance of the model getting distracted. Estimation helps you keep prompts tight and improves answer quality.
9) Files & attachments: how developers typically use them
Moonshot’s API documentation includes a Files endpoint category. Even when you don’t need “files” as a feature, it’s common to build file-handling in your application for these reasons:
- Large documents that you don’t want to paste directly into messages.
- Repeatability: upload a file once, reference it many times.
- Auditing: keep track of which file influenced which answer.
- Security: store files under your own access control and share only safe excerpts.
Recommended approach: keep “source-of-truth” in your storage
Many teams store original uploads in their own object storage (S3/R2/GCS) and only send extracted text snippets to the model. This reduces vendor lock-in and gives you control over retention and deletion.
RAG vs full document paste: which is better?
Retrieval-augmented generation (RAG) is usually better for reliability and cost: you store embeddings and retrieve only relevant chunks. Full paste can work for smaller documents, but it’s expensive and can confuse the model when the doc is long. A good default: paste excerpts + cite where they came from.
10) Pricing & budgeting: understanding the unit economics
Moonshot’s official pricing documentation presents costs as price per 1M tokens (1,000,000 tokens). Different models can have different prices. The pricing page also commonly distinguishes between input and output tokens, and some providers mention discounts for cache hits (repeated inputs).
Cost control checklist
Product guardrails
- Per-user daily/monthly credits
- Hard caps for free tier
- Preview-first flows (short answers first, “expand” on demand)
- Model gating: “Pro model” only for paid users
Engineering guardrails
- Token estimation before large requests
- Context trimming + summarization
- Deduplicate repeated prompts (idempotency hash)
- Cache retrieval results (RAG)
A simple “back-of-the-envelope” calculator
Monthly cost ≈ (requests × avg input tokens / 1,000,000 × input price) + (requests × avg output tokens / 1,000,000 × output price) + retry overhead.
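To make that formula concrete, a tiny calculator; the prices in the example call are placeholders rather than Moonshot’s actual rates, so plug in the numbers from the official pricing page:

def monthly_cost(requests_per_month, avg_in_tokens, avg_out_tokens,
                 in_price_per_m, out_price_per_m, retry_rate=0.05):
    """Back-of-the-envelope monthly spend; retry_rate adds overhead for retried calls."""
    input_cost = requests_per_month * avg_in_tokens / 1_000_000 * in_price_per_m
    output_cost = requests_per_month * avg_out_tokens / 1_000_000 * out_price_per_m
    return (input_cost + output_cost) * (1 + retry_rate)

# Example: 100k requests/month, 2,000 input and 500 output tokens each,
# placeholder prices of $2 and $5 per 1M tokens -> about $682.50
print(round(monthly_cost(100_000, 2_000, 500, 2.0, 5.0), 2))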
If your app is growing, track two KPIs: tokens per successful user outcome and retries per completion. Those usually matter more than the headline price per 1M tokens.
Should I always use the cheapest model?
Not always. If a stronger model completes a task in fewer turns, fewer retries, and less prompt engineering, it can be cheaper overall. Choose based on total cost per successful outcome—not per-request price alone.
11) Rate limits & reliability: designing for 429s and concurrency
Like most AI APIs, Kimi API enforces rate limits and concurrency constraints. Early or trial tiers can be strict (low requests per minute and limited concurrency). Your application should assume that “Too Many Requests” (HTTP 429) will happen sometimes and handle it gracefully.
What a robust client does
- Retries with backoff: on 429/5xx, wait and retry rather than hammer the API.
- Jitter: add randomness to backoff to avoid synchronized spikes.
- Queue: if users submit many requests, enqueue them rather than firing immediately.
- Timeouts: set timeouts for API calls and tool executions; don’t hang forever.
- Idempotency: avoid duplicate charges when users click “Send” multiple times.
Backoff example (Python)
import random, time
import requests

def call_with_backoff(call_kimi, max_attempts=6):
    delay = 1.5  # initial delay in seconds
    for attempt in range(max_attempts):
        try:
            return call_kimi()  # e.g. lambda: kimi_chat(prompt)
        except requests.HTTPError as err:
            # retry only on 429 (rate limit) and 5xx (server errors)
            status = err.response.status_code if err.response is not None else 500
            if status != 429 and status < 500:
                raise
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids synchronized spikes
            delay = min(delay * 1.8, 15)  # exponential backoff, capped at 15 seconds
    raise RuntimeError("rate-limited: retries exhausted")
UX tips that reduce retries and rage-clicking
Show progress
Even for non-streaming responses, show “Thinking…” plus a spinner and disable the send button briefly. This alone reduces duplicate requests dramatically.
Offer “Stop”
Users feel in control when they can stop generation. If your API supports aborting streams, wire it up. Otherwise, stop displaying output and ignore late tokens.
Explain limits
Clear plan messaging (“Free: 3 RPM, 1 concurrency”) prevents confusion and reduces support. Silent throttling feels broken; visible throttling feels fair.
Error handling checklist
- Log: request ID, user ID, model, prompt token estimate, response latency, error code, retry count.
- Return: user-friendly errors with next steps (“Try again in 10 seconds”).
- Alert: on elevated 429/5xx rates or latency spikes.
12) Production architecture: how to ship Kimi API reliably
A scalable LLM system is a combination of prompt design, backend orchestration, and observability. Below is a simple blueprint that works for most products: the UI calls your backend, your backend calls Kimi, a queue manages concurrency for heavy tasks, and your storage layer holds any persistent artifacts.
| Component | Responsibility | Why it matters |
|---|---|---|
| Frontend | Collect prompts, show streaming output, show status and history | Good UX reduces duplicate requests and improves retention |
| Backend API | Auth, quotas, request validation, prompt templates | Protects your keys and enforces budgets |
| Queue + workers | Batch tasks: long summaries, tool pipelines, document processing | Prevents timeouts, controls concurrency and cost spikes |
| Vector store (optional) | RAG retrieval, embeddings, chunk storage | Improves accuracy and reduces tokens vs full paste |
| Observability | Logs, traces, metrics, evals, cost tracking | Debug faster, prevent regressions, improve prompts |
Prompt design system: the easiest win
Most teams eventually create a “prompt library” of reusable templates: customer support, extraction, summarization, report writing, SQL generation, agent planning. Version these prompts like code, measure quality, and roll out updates safely.
Evaluation loop (how to improve over time)
- Collect real user queries (anonymized) and label good vs bad outcomes.
- Create a small test suite for each feature (“support replies”, “invoice extraction”).
- Run offline evaluations: compare models, prompts, temperatures.
- Ship changes behind a flag, monitor cost + satisfaction.
- Iterate and keep a changelog for transparency.
Agentic workflows: controlling tool-call explosions
Tool-calling agents can spiral into long loops. Control this with: max tool calls per run, max tokens per run, timeouts per tool, and a “budget” message in system prompt: “You have at most 3 tool calls; prefer the most informative call first.” Combine this with logging so you can see why an agent is over-calling tools.
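One lightweight way to enforce those caps is a per-run budget object that every completion and tool call charges against before proceeding; the limits below are illustrative defaults, not recommendations.

from dataclasses import dataclass

@dataclass
class RunBudget:
    max_tool_calls: int = 3
    max_total_tokens: int = 30_000
    tool_calls_used: int = 0
    tokens_used: int = 0

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> None:
        """Record usage and stop the run as soon as either budget is exceeded."""
        self.tool_calls_used += tool_calls
        self.tokens_used += tokens
        if self.tool_calls_used > self.max_tool_calls:
            raise RuntimeError("Tool-call budget exceeded for this run")
        if self.tokens_used > self.max_total_tokens:
            raise RuntimeError("Token budget exceeded for this run")

# Usage inside an agent loop: budget.charge(tool_calls=len(tool_calls), tokens=usage_tokens)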
13) Kimi K2 API
Kimi K2 API is Moonshot AI’s flagship “K2” large-language-model API on the Kimi Open Platform, built for production apps that need long context (up to 256K) and reliable tool calling (function/tool use) for agentic workflows. It’s commonly used for RAG chatbots, document QA over large inputs, structured extraction, and automation where the model decides when to call tools and your backend executes them securely.
14) Kimi K2.5 API
Kimi K2.5 API is the next-generation Kimi model API that builds on K2 and adds stronger multimodal capability (notably visual and coding use cases), plus more advanced agentic behavior such as a self-directed agent swarm approach for complex tasks. Moonshot’s K2.5 docs and technical posts emphasize improved performance on workflows like “design/mock → code,” higher-capacity long-horizon tool use, and multimodal inputs (including images and, in some contexts, video as an experimental feature).
15) FAQ: Kimi API
What is the base URL for Kimi API?
Commonly referenced base URLs include https://api.moonshot.ai/v1 (global) and https://api.moonshot.cn/v1 (China region). Always use the one recommended by your console/docs for your account.
Is the Kimi API OpenAI-compatible?
Many endpoints follow an OpenAI-style chat completions structure. Migration usually involves swapping the base URL, using your Moonshot API key, and updating model IDs. Review tool calling differences (use modern tools rather than legacy functions).
How do I choose between 32k and 128k context models?
Choose based on how much context you routinely send. If most prompts are short, use a smaller context model for cost and speed. If you frequently paste long documents or maintain long conversation history, use 128k (or larger) and implement trimming/summary anyway.
Does Kimi API support tool calling?
Yes. Tool calling is a core feature described in Moonshot’s “Tool Use” docs. Define tools with JSON schemas, send them in a chat request, and execute tool calls securely in your application.
How do I handle rate limits?
Implement backoff retries for 429 responses, add a queue for bursty traffic, and cap concurrency per user or per workspace. Also improve UX (disable send briefly, show progress) to reduce duplicate requests.
What’s the best way to reduce cost?
Control tokens: trim conversation history, summarize older context, use RAG instead of full paste, estimate tokens before big requests, and keep temperature low for structured tasks to reduce retries. Cost savings are usually about prompt size, not about tiny parameter tweaks.
References (official docs)
Use these as the source of truth for schemas, model availability, pricing updates, and advanced guides.
| Topic | Official link | Use it for |
|---|---|---|
| Docs overview | https://platform.moonshot.ai/docs/overview | Platform capabilities, navigation, guides |
| Quickstart | https://platform.moonshot.ai/docs/guide/start-using-kimi-api | First request, basic setup |
| Main concepts | https://platform.moonshot.ai/docs/introduction | Model families, context windows, concepts |
| Chat API | https://platform.moonshot.ai/docs/api/chat | Chat request/response details |
| Tool use | https://platform.moonshot.ai/docs/api/tool-use | Tool calling schema and examples |
| Vision guide | https://platform.moonshot.ai/docs/guide/use-kimi-vision-model | Vision preview models and usage |
| Token estimation | https://platform.moonshot.ai/docs/api/estimate | Estimate token counts for text + images |
| Pricing | https://platform.moonshot.ai/docs/pricing/chat | Price per 1M tokens by model tier |
| FAQ | https://platform.moonshot.ai/docs/guide/faq | Limits, edge cases, common questions |