Chat • Tool Calls • Vision • Token Estimation • Pricing

Kimi API - a complete developer guide to Moonshot AI’s Open Platform

The Kimi API (Moonshot AI Open Platform) gives developers programmatic access to Kimi and Moonshot’s large language models, covering everyday chat completions, long-context reasoning, tool calling (function calling), vision inputs, and supporting APIs for token estimation and files. It’s designed to be highly familiar: many endpoints follow an OpenAI-compatible structure, so you can migrate existing applications with minimal changes.

This page is written as a practical, production-first reference. It explains how authentication works, what base URL to use, how to pick models, how to call tools safely, how to handle long context and vision prompts, how to keep costs predictable, and how to build a reliable architecture around rate limits and retries.

Official docs hub: platform.moonshot.ai/docs
Quickstart: Start using Kimi API
Tool calling: Tool Use API
Pricing (chat): Pricing reference
Educational disclaimer
This guide is for educational purposes and summarizes common integration patterns. Always verify the latest request/response fields, model availability, pricing, and policy requirements in the official Moonshot AI documentation and console before shipping to production.

1) What the Kimi API is (and when it’s the right choice)

Moonshot AI’s Open Platform (often called the “Kimi API”) provides an HTTP API to interact with Moonshot/Kimi models. The official docs describe an ecosystem that includes long-context models (32k, 128k, and newer 256k-class context), tool calling, and multimodal vision models.

When Kimi API is a good fit

  • Long-context chat and document Q&A: you want large context windows for deep threads, long docs, or multi-step workflows.
  • Tool-using agents: you want the model to call tools (functions) to fetch data, run actions, or orchestrate workflows.
  • Migration from OpenAI-style chat: you already have code written around OpenAI-like chat completions and want a compatible provider.
  • Cost control: you need competitive token pricing and predictable unit economics (especially at scale).
  • Vision and multimodal workflows: you want models that accept image inputs and combine them with text prompts.

When Kimi API is not the best fit

  • Purely local/offline requirements: if your product must run without external network access, a hosted API won’t work.
  • Ultra-low latency micro-responses: LLM responses involve token generation; for strict <50ms use cases, you may want smaller models or heuristics.
  • Fully deterministic output: LLMs are probabilistic; use structured output constraints and validation for critical workflows.
Best mental model
Treat Kimi API like a “reasoning coprocessor” that can read large context, generate structured outputs, and decide when to call tools. The biggest wins come from designing systems where the model does the thinking and orchestration, while your code does the secure execution.

2) Authentication & base URLs (global vs China)

Kimi API uses standard Bearer authentication. You send your API key in the Authorization header: Authorization: Bearer YOUR_KEY. Base URLs can differ by region. Documentation and integrations commonly reference: https://api.moonshot.ai/v1 (global) and https://api.moonshot.cn/v1 (China).

Global base URL

Use this for most international deployments:
https://api.moonshot.ai/v1

International · OpenAI-style paths · v1 endpoints

China base URL

Use this if your account/region is set up for China:
https://api.moonshot.cn/v1

China region · OpenAI-style paths · v1 endpoints

Security best practices

  • Never place API keys in frontend JavaScript. Always call Kimi API from your backend.
  • Store keys in a secrets manager (or at least environment variables) and rotate periodically.
  • Use request logging and per-user quotas to reduce abuse risk and prevent surprise billing.
  • If you allow user-supplied prompts, add content filtering and “dangerous prompt” detection for your domain.
Quick curl health test: list models

Many platforms expose a models listing endpoint. A practical first test is to call your provider’s models list (or make a minimal chat request). If your key is valid and base URL is correct, you should get a JSON response.

curl "https://api.moonshot.ai/v1/models" \
  -H "Authorization: Bearer $MOONSHOT_API_KEY"

3) OpenAI compatibility: why migration is usually straightforward

A major advantage of the Kimi API ecosystem is compatibility with familiar “chat completions” style request bodies. In practice, migration often comes down to:

  • Swap base URL to https://api.moonshot.ai/v1 (or .cn).
  • Replace the API key environment variable and header.
  • Update the model string to a Moonshot/Kimi model ID (see next section).
  • Review any differences in tool calling parameters (especially legacy functions vs modern tool schema).
Compatibility warning: “functions” vs “tools”
In the OpenAI ecosystem, the legacy functions field has been deprecated in favor of tools. Moonshot’s migration docs emphasize tool calling support and recommend the modern tools and tool_choice patterns for best results.

Basic “drop-in” chat request (conceptual)

POST https://api.moonshot.ai/v1/chat/completions
Authorization: Bearer $MOONSHOT_API_KEY
Content-Type: application/json

{
  "model": "moonshot-v1-32k",
  "messages": [
    {"role":"system","content":"You are a helpful assistant."},
    {"role":"user","content":"Summarize this article in 6 bullets..."}
  ],
  "temperature": 0.3
}
Do I need OpenAI’s SDK to call Kimi API?

No. You can use plain HTTPS requests. However, many OpenAI-compatible SDKs allow specifying a custom base URL. If your existing code already uses an OpenAI client, you can often keep your code structure and swap configuration.
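
For example, if your stack already uses the official OpenAI Python SDK, switching is often a configuration-only change. A minimal sketch, assuming your account and model IDs are set up on the global endpoint:

import os
from openai import OpenAI

# Point the OpenAI-compatible client at Moonshot's base URL instead of OpenAI's.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)

resp = client.chat.completions.create(
    model="moonshot-v1-32k",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0.3,
)
print(resp.choices[0].message.content)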

4) Models & context windows (how to choose the right one)

Moonshot’s documentation describes multiple families: moonshot-v1 models optimized for different context lengths (commonly 8k, 32k, and 128k), plus vision preview variants and newer Kimi K2/K2.5 generation models that emphasize tool use, coding, and agentic workflows. In the official “Main Concepts” documentation, models are described by intended use and context size.

Core model selection rules

Pick context first

If your conversation or retrieved documents are large, choose a larger context model. If most prompts are small, avoid paying for massive context you don’t use.

Pick capability second

For tool-using agents and complex coding tasks, prefer K2/K2.5-class models when available. For simple Q&A, smaller models may be more cost-effective.

Keep a tiered UX

In product UI, expose “Standard / Long Context / Pro Agent / Vision” modes rather than showing raw model IDs to users.

Common model categories

Category | Typical model IDs | Best for | Notes
General chat | moonshot-v1-32k / moonshot-v1-128k | Summaries, assistants, long text, knowledge work | 32k for most; 128k for very large prompts or multi-doc RAG
Vision (preview) | moonshot-v1-8k-vision-preview, moonshot-v1-32k-vision-preview, moonshot-v1-128k-vision-preview | Image + text prompts, screenshots, charts, UI analysis | Use only when you actually need image understanding
Agentic / coding (K2/K2.5) | Kimi K2 / K2.5 family IDs (see your /models list) | Tool calling, code generation, longer reasoning chains | Often best for multi-step tasks and automation pipelines
Tip: list models dynamically
Model IDs can evolve. In production, store a “supported models” configuration and refresh it periodically by calling your provider’s models endpoint. Then map the raw model IDs to user-friendly product tiers.
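
A minimal sketch of that pattern, assuming the models endpoint returns an OpenAI-style list ({"data": [{"id": ...}]}); the tier mapping below is illustrative, not official:

import os
import requests

# Hypothetical mapping from raw model IDs to product tiers shown in your UI.
TIER_BY_MODEL = {
    "moonshot-v1-8k": "Standard",
    "moonshot-v1-32k": "Standard",
    "moonshot-v1-128k": "Long Context",
}

def refresh_supported_models() -> dict:
    """Fetch the live model list and map each raw ID to a user-friendly tier."""
    r = requests.get(
        "https://api.moonshot.ai/v1/models",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        timeout=30,
    )
    r.raise_for_status()
    ids = [m["id"] for m in r.json().get("data", [])]
    return {model_id: TIER_BY_MODEL.get(model_id, "Other") for model_id in ids}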

5) Chat API: messages, parameters, and streaming

Most applications start with chat completions. You send a list of messages (system, user, assistant) and receive a model-generated response. A production-grade chat layer usually includes: safe prompt construction, streaming support, response validation, retries, and logging.

Message roles and prompt hygiene

  • system: set behavior (“You are a concise support agent.”), output constraints, tone, policy.
  • user: the user’s input (ideally sanitized and length-checked).
  • assistant: previous replies (conversation memory). Keep only what you need to control token usage.

Parameter cheatsheet (typical)

Quality + creativity knobs

temperature controls randomness (lower = more deterministic). For extraction and structured outputs, keep it low (0–0.3). For ideation and creative writing, you can raise it (0.7–1.0). If the API supports top_p, use one or the other—don’t “fight” both.

Length control

Use max output limits where available, and design your prompts to request a specific shape: “Return exactly 6 bullets” or “Return JSON with keys …”. This reduces run-on outputs and keeps costs predictable.

JavaScript example (fetch)

async function kimiChat({ prompt }) {
  const res = await fetch("https://api.moonshot.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.MOONSHOT_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      model: "moonshot-v1-32k",
      messages: [
        { role: "system", content: "You are a concise developer assistant." },
        { role: "user", content: prompt }
      ],
      temperature: 0.2
    })
  });

  if (!res.ok) {
    const err = await res.text();
    throw new Error(`Kimi API error ${res.status}: ${err}`);
  }

  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}

Python example (requests)

import os, requests

def kimi_chat(prompt: str) -> str:
    url = "https://api.moonshot.ai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "moonshot-v1-32k",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. Keep answers under 150 words."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
    }
    r = requests.post(url, headers=headers, json=payload, timeout=60)
    r.raise_for_status()
    data = r.json()
    return data["choices"][0]["message"]["content"]
Streaming: why you probably want it

Streaming improves perceived latency: the user sees tokens as they arrive. It also reduces “double click” spam. If your SDK supports streaming, treat it as the default UX for interactive chat. For background jobs (batch summaries), non-streaming is fine.
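
A minimal streaming sketch with requests, assuming the endpoint supports OpenAI-style stream: true server-sent events (verify the exact chunk format in the Chat API reference):

import json
import os
import requests

def kimi_chat_stream(prompt: str):
    """Yield content fragments as they arrive instead of waiting for the full reply."""
    payload = {
        "model": "moonshot-v1-32k",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "stream": True,
    }
    with requests.post(
        "https://api.moonshot.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json=payload,
        stream=True,
        timeout=300,
    ) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            delta = json.loads(chunk)["choices"][0].get("delta", {})
            if delta.get("content"):
                yield delta["content"]

# Usage: print tokens as they arrive
# for piece in kimi_chat_stream("Explain backoff in two sentences."):
#     print(piece, end="", flush=True)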

Structured outputs: how to do it safely

Ask for a strict JSON schema in your prompt and validate the output on your backend. If parsing fails, retry with a “repair” prompt that includes the model’s previous output and asks it to fix only the JSON. Keep temperature low, and never trust JSON without validation.
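
A sketch of that validate-and-repair loop, reusing the kimi_chat helper from the Python example above:

import json

def chat_json(prompt: str, max_repairs: int = 1):
    """Ask for strict JSON, validate it on the backend, and retry once with a repair prompt."""
    raw = kimi_chat(f"{prompt}\n\nReturn ONLY valid JSON, no prose.")
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            if attempt == max_repairs:
                raise ValueError("model did not return valid JSON after repair attempts") from err
            raw = kimi_chat(
                "The following output should be valid JSON but failed to parse "
                f"({err}). Fix ONLY the JSON and return nothing else:\n\n{raw}"
            )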

6) Tool calling (function calling): building agents that do real work

Tool calling allows the model to decide when to call external functions (tools) you define. Moonshot’s tool use documentation describes tool calling as a crucial feature for building agentic workflows: the model outputs a structured “tool call” object, and your application executes the tool and returns the result back to the model as a follow-up message.

How tool calling works in practice

  1. You define tools (functions) with names, descriptions, and a JSON schema for parameters.
  2. You send a chat request with tools included.
  3. The model replies either with normal text, or with one or more tool_calls.
  4. Your code executes the tool(s) securely and returns tool results as messages.
  5. The model uses those results to produce the final user-facing response.
Golden rule
The model can suggest tool calls—but your code must be the authority. Validate parameters, enforce permissions, and never let a tool call perform an action the user isn’t allowed to do.

Tool definition example (JSON schema)

{
  "type": "function",
  "function": {
    "name": "search_kb",
    "description": "Search the internal knowledge base for relevant documents.",
    "parameters": {
      "type": "object",
      "properties": {
        "query": { "type": "string", "description": "Search query." },
        "top_k": { "type": "integer", "description": "Number of results.", "default": 5 }
      },
      "required": ["query"]
    }
  }
}

Typical tool calling loop (pseudo-code)

messages = [system, user]
tools = [search_kb, get_order_status, ...]
resp = chat(messages, tools, tool_choice="auto")

if resp contains tool_calls:
  for each call:
    args = validate(call.arguments)
    result = execute_tool(call.name, args)
    messages.append({role:"tool", tool_call_id: call.id, content: json(result)})
  resp2 = chat(messages, tools, tool_choice="auto")
  return resp2.final_answer
else:
  return resp.text
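
The same loop as a minimal Python sketch, assuming an OpenAI-style response shape (an assistant message that may carry tool_calls, each with id, function.name, and JSON-encoded function.arguments); chat_request and tool_registry are hypothetical helpers you would implement:

import json

def run_with_tools(chat_request, messages, tools, tool_registry, max_rounds=3):
    """chat_request(...) returns the assistant message dict (choices[0].message);
    tool_registry maps tool name -> a callable that executes it securely."""
    for _ in range(max_rounds):
        msg = chat_request(messages=messages, tools=tools, tool_choice="auto")
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg.get("content", "")  # plain text answer: we're done
        messages.append(msg)  # keep the assistant turn that requested the tools
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"])  # validate before executing
            result = tool_registry[name](**args)  # your code is the authority
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool-call budget exceeded")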

Important note about tool_choice for certain Kimi models

Some Kimi K2.5 quickstart guidance notes that tool_choice may be restricted to "auto" or "none" by default to avoid conflicts between reasoning and forced tool selection. In practice, “auto” is the best default; only use “none” for pure text generation.

Tool calling design patterns that actually scale

  1. Keep tools small and single-purpose (search, fetch, update).
  2. Use consistent output formats from tools (JSON).
  3. Add timeouts and retries around tool execution.
  4. Log every tool call for debugging and safety auditing.
  5. Add a “planner” prompt to encourage the model to call tools only when needed, not by habit.

Official tools vs your tools

Moonshot documentation also references “official tools.” In product development, you typically mix: official/provider-supported tools (when available) and your application-specific tools (your database, your APIs). Regardless, the execution should happen in your trusted runtime with access control.

7) Vision inputs: using Kimi vision preview models

Moonshot’s guides describe vision preview models such as moonshot-v1-8k-vision-preview, moonshot-v1-32k-vision-preview, and moonshot-v1-128k-vision-preview. These models accept image input plus text, enabling workflows like screenshot interpretation, chart explanations, UI analysis, and multimodal reasoning.
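
A sketch of a vision request, assuming the OpenAI-style content-parts message format (a text part plus an image_url part with a base64 data URL); confirm the exact shape and model IDs in the vision guide:

import base64
import os
import requests

def ask_about_image(image_path: str, question: str) -> str:
    """Send an image plus a text prompt to a vision preview model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "moonshot-v1-32k-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},  # adjust MIME type to your file
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 0.2,
    }
    r = requests.post(
        "https://api.moonshot.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json=payload,
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]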

When to use vision

  • Users upload screenshots and want explanations (“Why is this build failing?” “What does this chart imply?”).
  • Document processing includes scanned pages or images where text extraction alone loses context.
  • You want the assistant to interpret UI/UX mocks or design comps and generate code or copy.

Vision cost & token planning

Vision inputs typically consume tokens too (or have separate accounting). Moonshot provides a token estimation API that can include both plain text and visual input. Use it to forecast cost before sending large images in production.

Prompting tips for vision

Be explicit about the task

“Describe what’s in this image” is vague. Better: “Read the error message, identify the root cause, and propose 3 fixes in priority order.” Or: “Extract table rows into JSON with keys {name, value, unit}.”

Ask for structured outputs

Vision analysis often benefits from structure: bullet lists, step-by-step troubleshooting, or JSON. Keep temperature low and validate outputs, especially for extraction tasks.

Common pitfalls with screenshots

Screenshots can include sensitive data (emails, tokens, personal info). If your app supports user uploads, implement redaction guidance, access control, and retention policies. Also consider adding an “auto-blur” step for obvious secrets like API keys in logs.

8) Token estimation: the underrated API that saves money

Moonshot exposes an “Estimate Tokens” API used to calculate token count for a request, including both plain text and visual input. In production, token estimation is valuable because it gives you a chance to: (1) warn users before an expensive request, (2) automatically compress/trim context, and (3) enforce budgets.
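
A sketch of calling the estimation endpoint before an expensive request. The path and response fields below are illustrative; verify both against the Estimate Tokens reference for your region:

import os
import requests

def estimate_tokens(messages, model: str = "moonshot-v1-32k") -> int:
    """Ask the platform how many tokens a request would consume before sending it."""
    r = requests.post(
        "https://api.moonshot.ai/v1/tokenizers/estimate-token-count",  # confirm exact path in the docs
        headers={"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"},
        json={"model": model, "messages": messages},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["data"]["total_tokens"]  # confirm field names in the docs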

What to estimate before you send

  • Long user messages or pasted documents
  • Large RAG results that might overrun context
  • Multimodal messages with images
  • Agent loops that might chain many tool calls and assistant turns

Practical strategy: three-stage context management

Stage 1: trim

Remove older conversation turns that are no longer relevant. Keep only the last few user/assistant turns and any “pinned” system instructions (a trimming sketch follows Stage 3 below).

Stage 2: summarize

Summarize older context into a short “memory” block: decisions, user preferences, known facts. Then replace many turns with one summary message.

Stage 3: retrieve

Instead of pasting full documents, retrieve only the relevant chunks (RAG) and cite them or attach them as context. Use token estimation to keep retrieval bounded.
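
A minimal sketch of the Stage 1 trimming step: keep the system message plus the newest turns that fit inside a token budget. estimate_fn wraps whatever estimation call you use (for example, the estimate_tokens helper above); in production you would cache estimates rather than re-estimating on every iteration:

def trim_to_budget(messages, budget_tokens: int, estimate_fn):
    """Drop the oldest non-system turns until the conversation fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        if estimate_fn(system + [msg] + kept) > budget_tokens:
            break
        kept.insert(0, msg)
    return system + kept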

Why estimation matters even for 32k+

A bigger context window doesn’t mean “free.” Large prompts cost more, slow down responses, and increase the chance of the model getting distracted. Estimation helps you keep prompts tight and improves answer quality.

9) Files & attachments: how developers typically use them

Moonshot’s API documentation includes a Files endpoint category. Even when you don’t need “files” as a feature, it’s common to build file-handling in your application for these reasons:

  • Large documents that you don’t want to paste directly into messages.
  • Repeatability: upload a file once, reference it many times.
  • Auditing: keep track of which file influenced which answer.
  • Security: store files under your own access control and share only safe excerpts.

Recommended approach: keep “source-of-truth” in your storage

Many teams store original uploads in their own object storage (S3/R2/GCS) and only send extracted text snippets to the model. This reduces vendor lock-in and gives you control over retention and deletion.

RAG vs full document paste: which is better?

Retrieval-augmented generation (RAG) is usually better for reliability and cost: you store embeddings and retrieve only relevant chunks. Full paste can work for smaller documents, but it’s expensive and can confuse the model when the doc is long. A good default: paste excerpts + cite where they came from.

10) Pricing & budgeting: understanding the unit economics

Moonshot’s official pricing documentation presents costs as price per 1M tokens (1,000,000 tokens). Different models can have different prices. The pricing page also commonly distinguishes between input and output tokens, and some providers mention discounts for cache hits (repeated inputs).

What matters for your product
Your real cost depends on: (1) model tier, (2) average prompt tokens, (3) average output tokens, (4) retries, (5) how much long context you send unnecessarily. Most teams reduce costs by controlling context, not by chasing tiny price differences.

Cost control checklist

Product guardrails

  • Per-user daily/monthly credits
  • Hard caps for free tier
  • Preview-first flows (short answers first, “expand” on demand)
  • Model gating: “Pro model” only for paid users

Engineering guardrails

  • Token estimation before large requests
  • Context trimming + summarization
  • Deduplicate repeated prompts with an idempotency hash (see the sketch below)
  • Cache retrieval results (RAG)
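
A sketch of the idempotency-hash idea from the list above: hash the canonical request so duplicate submissions (for example, a double-clicked Send button) can be detected before they hit the API:

import hashlib
import json

def request_fingerprint(model: str, messages: list, params: dict) -> str:
    """Stable hash of a request; store it briefly and skip calls whose fingerprint you just served."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()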

A simple “back-of-the-envelope” calculator

Monthly cost ≈ (requests × avg input tokens / 1,000,000 × input price) + (requests × avg output tokens / 1,000,000 × output price) + retry overhead.
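
The same formula as a tiny helper; prices are per 1M tokens and all numbers are placeholders for your actual traffic and pricing:

def monthly_cost(requests_per_month: int, avg_input_tokens: float, avg_output_tokens: float,
                 input_price_per_m: float, output_price_per_m: float, retry_rate: float = 0.05) -> float:
    """Back-of-the-envelope monthly spend; retry_rate is the fraction of requests retried once."""
    base = requests_per_month * (
        avg_input_tokens / 1_000_000 * input_price_per_m
        + avg_output_tokens / 1_000_000 * output_price_per_m
    )
    return base * (1 + retry_rate)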

If your app is growing, track two KPIs: tokens per successful user outcome and retries per completion. Those usually matter more than the headline price per 1M tokens.

Should I always use the cheapest model?

Not always. If a stronger model completes a task in fewer turns, fewer retries, and less prompt engineering, it can be cheaper overall. Choose based on total cost per successful outcome—not per-request price alone.

11) Rate limits & reliability: designing for 429s and concurrency

Like most AI APIs, Kimi API enforces rate limits and concurrency constraints. Early or trial tiers can be strict (low requests per minute and limited concurrency). Your application should assume that “Too Many Requests” (HTTP 429) will happen sometimes and handle it gracefully.

What a robust client does

  • Retries with backoff: on 429/5xx, wait and retry rather than hammer the API.
  • Jitter: add randomness to backoff to avoid synchronized spikes.
  • Queue: if users submit many requests, enqueue them rather than firing immediately.
  • Timeouts: set timeouts for API calls and tool executions; don’t hang forever.
  • Idempotency: avoid duplicate charges when users click “Send” multiple times.

Backoff example (Python)

import random
import time

def call_with_backoff(send_request, max_attempts=6):
    """Retry on 429/5xx with exponential backoff plus jitter; send_request() performs
    one Kimi API call and returns a requests.Response (without raising on 429)."""
    delay = 1.5
    for _ in range(max_attempts):
        resp = send_request()
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay + random.uniform(0, 0.5))  # backoff + jitter
            delay = min(delay * 1.8, 15.0)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError("rate-limited: giving up after retries")

UX tips that reduce retries and rage-clicking

Show progress

Even for non-streaming responses, show “Thinking…” plus a spinner and disable the send button briefly. This alone reduces duplicate requests dramatically.

Offer “Stop”

Users feel in control when they can stop generation. If your API supports aborting streams, wire it up. Otherwise, stop displaying output and ignore late tokens.

Explain limits

Clear plan messaging (“Free: 3 RPM, 1 concurrency”) prevents confusion and reduces support. Silent throttling feels broken; visible throttling feels fair.

Error handling checklist

  • Log: request ID, user ID, model, prompt token estimate, response latency, error code, retry count.
  • Return: user-friendly errors with next steps (“Try again in 10 seconds”).
  • Alert: on elevated 429/5xx rates or latency spikes.

12) Production architecture: how to ship Kimi API reliably

A scalable LLM system is a combination of prompt design, backend orchestration, and observability. Below is a simple blueprint that works for most products: the UI calls your backend, your backend calls Kimi, a queue manages concurrency for heavy tasks, and your storage layer holds any persistent artifacts.

Component | Responsibility | Why it matters
Frontend | Collect prompts, show streaming output, show status and history | Good UX reduces duplicate requests and improves retention
Backend API | Auth, quotas, request validation, prompt templates | Protects your keys and enforces budgets
Queue + workers | Batch tasks: long summaries, tool pipelines, document processing | Prevents timeouts, controls concurrency and cost spikes
Vector store (optional) | RAG retrieval, embeddings, chunk storage | Improves accuracy and reduces tokens vs full paste
Observability | Logs, traces, metrics, evals, cost tracking | Debug faster, prevent regressions, improve prompts

Prompt design system: the easiest win

Most teams eventually create a “prompt library” of reusable templates: customer support, extraction, summarization, report writing, SQL generation, agent planning. Version these prompts like code, measure quality, and roll out updates safely.

Evaluation loop (how to improve over time)

  1. Collect real user queries (anonymized) and label good vs bad outcomes.
  2. Create a small test suite for each feature (“support replies”, “invoice extraction”).
  3. Run offline evaluations: compare models, prompts, temperatures.
  4. Ship changes behind a flag, monitor cost + satisfaction.
  5. Iterate and keep a changelog for transparency.
Agentic workflows: controlling tool-call explosions

Tool-calling agents can spiral into long loops. Control this with: max tool calls per run, max tokens per run, timeouts per tool, and a “budget” message in system prompt: “You have at most 3 tool calls; prefer the most informative call first.” Combine this with logging so you can see why an agent is over-calling tools.

13) Kimi K2 API

Kimi K2 API is Moonshot AI’s flagship “K2” large-language-model API on the Kimi Open Platform, built for production apps that need long context (up to 256K) and reliable tool calling (function/tool use) for agentic workflows. It’s commonly used for RAG chatbots, document QA over large inputs, structured extraction, and automation where the model decides when to call tools and your backend executes them securely.

14) Kimi K2.5 API

Kimi K2.5 API is the next-gen Kimi model API that builds on K2 and adds stronger multimodal capability (notably visual + coding use cases), plus more advanced agentic behavior such as a self-directed agent swarm approach for complex tasks. Moonshot’s K2.5 docs and technical posts emphasize improved performance on workflows like “design/mock → code,” higher-capacity long-horizon tool use, and multimodal inputs (including images and, in some contexts, video as an experimental feature).

15) FAQ: Kimi API

What is the base URL for Kimi API?

Commonly referenced base URLs include https://api.moonshot.ai/v1 (global) and https://api.moonshot.cn/v1 (China region). Always use the one recommended by your console/docs for your account.

Is the Kimi API OpenAI-compatible?

Many endpoints follow an OpenAI-style chat completions structure. Migration usually involves swapping the base URL, using your Moonshot API key, and updating model IDs. Review tool calling differences (use modern tools rather than legacy functions).

How do I choose between 32k and 128k context models?

Choose based on how much context you routinely send. If most prompts are short, use a smaller context model for cost and speed. If you frequently paste long documents or maintain long conversation history, use 128k (or larger) and implement trimming/summary anyway.

Does Kimi API support tool calling?

Yes. Tool calling is a core feature described in Moonshot’s “Tool Use” docs. Define tools with JSON schemas, send them in a chat request, and execute tool calls securely in your application.

How do I handle rate limits?

Implement backoff retries for 429 responses, add a queue for bursty traffic, and cap concurrency per user or per workspace. Also improve UX (disable send briefly, show progress) to reduce duplicate requests.

What’s the best way to reduce cost?

Control tokens: trim conversation history, summarize older context, use RAG instead of full paste, estimate tokens before big requests, and keep temperature low for structured tasks to reduce retries. Cost savings are usually about prompt size, not about tiny parameter tweaks.

References (official docs)

Use these as the source of truth for schemas, model availability, pricing updates, and advanced guides.

Topic | Official link | Use it for
Docs overview | https://platform.moonshot.ai/docs/overview | Platform capabilities, navigation, guides
Quickstart | https://platform.moonshot.ai/docs/guide/start-using-kimi-api | First request, basic setup
Main concepts | https://platform.moonshot.ai/docs/introduction | Model families, context windows, concepts
Chat API | https://platform.moonshot.ai/docs/api/chat | Chat request/response details
Tool use | https://platform.moonshot.ai/docs/api/tool-use | Tool calling schema and examples
Vision guide | https://platform.moonshot.ai/docs/guide/use-kimi-vision-model | Vision preview models and usage
Token estimation | https://platform.moonshot.ai/docs/api/estimate | Estimate token counts for text + images
Pricing | https://platform.moonshot.ai/docs/pricing/chat | Price per 1M tokens by model tier
FAQ | https://platform.moonshot.ai/docs/guide/faq | Limits, edge cases, common questions