DeepSeek API (2026): Complete Developer Guide
The DeepSeek API is a developer platform for calling DeepSeek models using an OpenAI-compatible request/response format. That means you can often take existing code written for OpenAI-style clients, swap the base URL, set your DeepSeek API key, and start shipping. DeepSeek’s platform focuses on fast, cost-efficient inference and includes both chat-style models and reasoning models that are designed for harder tasks.
What is DeepSeek API?
DeepSeek API is an HTTP-based service for generating text (and in some cases advanced reasoning) from DeepSeek models. You send a request containing a conversation (messages), a model name, and generation settings (like temperature and max tokens), and the API returns a model response you can display, store, transform, or feed into downstream systems.
What makes DeepSeek especially practical for teams is that the API follows a familiar “chat completions” pattern that many developer tools already support. If your stack uses an OpenAI-compatible SDK, it’s often “plug-and-play”: update the base URL, provide your API key, and keep your existing code structure for messages, tools, and streaming.
What you can build with DeepSeek API
- Customer support assistants with grounded answers and escalation paths
- RAG search (document Q&A, knowledge-base chat, internal copilot)
- Extraction APIs (pull structured fields from text: invoices, resumes, tickets, reviews)
- Coding assistants and code review bots
- Research workflows that summarize, compare, and explain sources
- Agentic automation using tools (function calls) and multi-step planning
Why teams choose DeepSeek
Most teams pick DeepSeek for one or more of these reasons:
- OpenAI-compatible schema: faster migration and fewer rewrites across SDKs and gateways.
- Cost structure: token-based billing with discounted cache-hit input pricing can be extremely cost-effective if you architect for reuse.
- Model variety: a chat model for general tasks and a reasoner model for deeper, multi-step reasoning.
- Practical production docs: clear error codes and best practices for retries and resilience.
Core concepts: tokens, context, caching, and output limits
To build predictably with any LLM API, you need four mental models: tokens, context window, output limit, and caching.
Tokens (what you pay for)
DeepSeek pricing is expressed per 1 million tokens (1M tokens). Tokens are small chunks of text (words, parts of words, punctuation, and sometimes whitespace). The API bills the total number of tokens processed: input tokens (your messages + system prompt + tool schema) and output tokens (the model’s response).
Context window (how much you can send)
Each model has a maximum context length (for example, 64K for certain DeepSeek models). Your request must fit inside this window: system instructions + conversation history + any retrieved documents. If you exceed the window, you’ll need to truncate, summarize, or chunk your content.
Output token limit (how long the answer can be)
Many APIs let you cap the model’s output length (max_tokens). Production apps should always set a maximum output, because without a cap, long responses can increase costs and latency. For some models, the output maximum might be lower than the full context window—so design your UI and expectations accordingly.
Caching (how you can reduce cost)
In many token-billing platforms, caching reduces the cost of repeated prompt prefixes. The high-level strategy is: keep stable instructions and repeated content (policy text, formatting rules, tool schemas, or static knowledge) in a consistent prefix, then reuse it across calls. If the provider supports cache hit/miss pricing, cache hits are cheaper than cache misses. Even if your application-layer caching does most of the work, it’s still useful to understand how provider caching influences cost.
Models overview: deepseek-chat vs deepseek-reasoner
DeepSeek’s official pricing docs list two primary model families used by most developers:
| Model | What it’s best at | Typical usage |
|---|---|---|
| deepseek-chat | General conversational tasks, drafting, summarization, extraction, tool-using assistants | Customer support bots, content generation, structured extraction, RAG chat |
| deepseek-reasoner | Harder reasoning tasks, multi-step planning, complex analysis | Agents, code reasoning, deep analysis, complex decision support |
A simple approach that works well in production is to use deepseek-chat for most traffic and only route to deepseek-reasoner when the task is genuinely difficult or the user is on a tier that can afford higher reasoning cost. This is similar to how teams often route between “fast” and “deep” models on other platforms.
Thinking mode (when available)
DeepSeek’s chat completion docs include a thinking parameter (enabled/disabled) to switch between thinking and non-thinking modes in some cases. In practice, you should treat “thinking” as a knob that trades cost and latency for stronger reasoning. When you ship it, expose it as a product feature only if users understand the trade-off.
Authentication & base URL
DeepSeek API uses Bearer token authentication. You attach your API key in the HTTP header: Authorization: Bearer YOUR_API_KEY. The official docs show the base URL as https://api.deepseek.com, and note that you can also use https://api.deepseek.com/v1 as an OpenAI-compatible base (the “v1” path is for schema compatibility, not a model version).
Checklist for secure key management
- Store keys in a secrets manager, not hard-coded strings or public repos.
- Create separate keys for dev/staging/prod; rotate keys regularly.
- Log request IDs and user IDs, not the API key or raw prompts with sensitive data.
- Use allowlists for internal services that can access the key.
OpenAI compatibility & migration
DeepSeek explicitly documents an OpenAI-compatible API format. That means many OpenAI-style SDKs can work by changing:
- Base URL → https://api.deepseek.com (or .../v1)
- API key → DeepSeek API key
- Model name → DeepSeek model identifier (e.g., deepseek-chat)
In real migrations, the biggest “gotchas” are not the endpoint paths but the small behavior differences: default max tokens, how tools are represented, whether a parameter is supported, and how error codes are returned. The safest path is to create a small adapter layer in your code: a “Model Provider” interface with DeepSeek as one implementation. That way, you can keep your app logic stable even if provider details change.
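One way to sketch that adapter layer in Python (the interface and class names here are my own, not an official SDK):

```python
# A minimal "Model Provider" adapter sketch. App code depends on the
# interface; provider details (base URL, model name, auth) live in one place.
from dataclasses import dataclass
from typing import Protocol

class ModelProvider(Protocol):
    def complete(self, messages: list[dict], max_tokens: int = 256) -> str: ...

@dataclass
class DeepSeekProvider:
    api_key: str
    base_url: str = "https://api.deepseek.com/v1"
    model: str = "deepseek-chat"

    def complete(self, messages, max_tokens=256):
        # A real implementation would POST an OpenAI-style payload to
        # {base_url}/chat/completions; omitted to keep this sketch offline.
        raise NotImplementedError

@dataclass
class FakeProvider:
    reply: str
    def complete(self, messages, max_tokens=256):
        return self.reply

def answer(provider: ModelProvider, question: str) -> str:
    return provider.complete([{"role": "user", "content": question}])

# App logic is testable without any network call:
assert answer(FakeProvider(reply="42"), "What is 6*7?") == "42"
```

The fake implementation doubles as a test seam: your business logic never imports a provider SDK directly, so swapping providers (or mocking in CI) is a one-line change.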
Endpoints you’ll actually use
For most applications, there are only a few endpoints you need to understand:
| Endpoint | What it does | Why it matters |
|---|---|---|
| POST /chat/completions | Generate a model response from messages (chat format) | Your main workhorse for chat, RAG, extraction, and agents |
| GET /models | List available models and basic metadata | Useful to verify model availability and build model selectors |
| Docs: error codes | Explains 402/422/429/500/503 behavior | Critical for retries, UX messaging, and incident response |
Everything else (files, embeddings, images, etc.) depends on your specific stack and the provider’s evolving capabilities. Many teams keep the core chat completion endpoint as the primary interface and implement specialized functionality (like embedding generation) via an internal service or a gateway that can swap providers.
Quickstart (curl, Python, Node)
Below are practical “first request” examples. They intentionally keep the payload minimal so you can validate: your key works, your base URL is correct, and your model name is available. After that, you can add streaming, tools, structured outputs, and RAG.
1) curl — minimal chat completion
curl https://api.deepseek.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPSEEK_API_KEY" \
-d '{
"model": "deepseek-chat",
"messages": [
{"role":"system","content":"You are a concise assistant."},
{"role":"user","content":"Explain what an API is in 2 sentences."}
],
"temperature": 0.6,
"max_tokens": 200
}'
2) Python — OpenAI-style client pattern
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["DEEPSEEK_API_KEY"],
base_url="https://api.deepseek.com/v1"
)
resp = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Give me 5 bullet tips for reducing API latency."}
],
temperature=0.4,
max_tokens=250
)
print(resp.choices[0].message.content)
3) Node.js — minimal request (fetch)
const resp = await fetch("https://api.deepseek.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${process.env.DEEPSEEK_API_KEY}`,
},
body: JSON.stringify({
model: "deepseek-chat",
messages: [
{ role: "system", content: "You are a careful assistant." },
{ role: "user", content: "Write a short JSON schema for a support ticket." }
],
temperature: 0.2,
max_tokens: 250
})
});
const data = await resp.json();
console.log(data.choices?.[0]?.message?.content);
Prompting patterns that work
Prompting is not “magic words.” It’s simply shaping the input so the model understands the task, constraints, and output format. The most reliable prompts are structured and explicit:
1) The stable system prompt
The system prompt is the place to define your assistant’s identity, safety boundaries, and output rules. Keep it stable and versioned (e.g., policy_v12) so you can audit behavior changes over time. A good system prompt is short but clear: it sets tone, rules, and what to do when unsure.
2) Task framing in the user message
Many failures happen because the user’s request is ambiguous. Your UI can help by transforming user intent into a clearer prompt: include the goal, desired format, and any constraints. For extraction tasks, always specify the fields and define “null” behavior.
3) Examples (few-shot) when needed
If you want consistent formatting, give a short example. Few-shot examples increase tokens, so use them only when the benefit is large. A balanced approach is to provide a single example for tricky formats and rely on structured output constraints for everything else.
4) Separation of concerns
Keep policy text and formatting rules in the system prompt, and keep task-specific content in the user message. This makes debugging much easier: if the output is wrong, you know whether it’s a policy/format problem or a task problem.
Streaming responses
Streaming returns partial output as it is generated. Users perceive streaming as faster because they see tokens immediately. It also allows you to stop generation early if the model goes off-topic or if the user cancels. If your SDK supports streaming, treat it as a UX improvement rather than a business logic dependency: always handle the non-streaming “final response” too.
Streaming UX best practices
- Show “thinking…” state and begin rendering tokens as soon as they arrive.
- Allow cancellation; stop streaming and discard partial output if the user changes the request.
- Don’t save partial content to your DB until you receive completion (unless you want a full transcript).
- Use a max_tokens cap even for streaming, to avoid runaway responses.
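The practices above boil down to a small consumption loop. A provider-agnostic sketch (here `chunks` stands in for the text deltas an OpenAI-style streaming client yields, e.g. `chunk.choices[0].delta.content`; the cancel hook and cap mirror the bullets above):

```python
# Sketch: accumulate streamed text pieces, with a cancellation check and a
# hard cap so a runaway stream cannot grow without bound.

def consume_stream(chunks, should_cancel=lambda: False, max_chunks=1000):
    """Accumulate streamed text; stop early on cancel or when the cap is hit."""
    parts = []
    for i, text in enumerate(chunks):
        if should_cancel() or i >= max_chunks:
            break
        parts.append(text)  # in a real UI, render this piece immediately
    return "".join(parts)

assert consume_stream(["Hel", "lo", "!"]) == "Hello!"
assert consume_stream(["a", "b", "c"], max_chunks=2) == "ab"
```

Only persist the joined result after the stream finishes cleanly, per the best practices above.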
Structured JSON outputs (how to make LLMs reliable)
The biggest leap in reliability comes when you stop asking the model for “free-form prose” and instead ask for structured outputs. For example: classification, extraction, routing decisions, and config generation. This reduces hallucinations because your validator can reject invalid structures and you can re-ask the model with a correction prompt.
Pattern: “Return JSON only” + validate
In many production systems, the app first asks the model for JSON, validates it, and only then renders a user-friendly explanation. This approach works because the model’s job is no longer to write the final UI text. It’s to produce structured data.
SYSTEM:
You are a strict JSON generator. Output must be valid JSON and nothing else.
USER:
Extract a support ticket from this message. Return:
{
"category": "billing|bug|feature_request|account|other",
"urgency": "low|medium|high",
"summary": string,
"needs_human": boolean
}
If unsure, use "other" and set needs_human=true.
Message:
"My API key stopped working after I rotated it. Also getting 402 errors."
Then your backend can validate that the JSON matches expected enums and types. If validation fails, send a correction prompt: “The JSON was invalid because urgency must be one of… Return corrected JSON only.”
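That validate-then-correct loop can be sketched for the ticket schema above. The field rules mirror the prompt; the returned error strings are exactly what you would feed into the correction prompt:

```python
import json

ALLOWED_CATEGORY = {"billing", "bug", "feature_request", "account", "other"}
ALLOWED_URGENCY = {"low", "medium", "high"}

def validate_ticket(raw: str):
    """Return (ticket, errors); non-empty errors mean re-ask with a correction prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    errors = []
    if data.get("category") not in ALLOWED_CATEGORY:
        errors.append("category must be one of " + "|".join(sorted(ALLOWED_CATEGORY)))
    if data.get("urgency") not in ALLOWED_URGENCY:
        errors.append("urgency must be one of low|medium|high")
    if not isinstance(data.get("summary"), str):
        errors.append("summary must be a string")
    if not isinstance(data.get("needs_human"), bool):
        errors.append("needs_human must be a boolean")
    return (data if not errors else None), errors

ok, errs = validate_ticket(
    '{"category":"billing","urgency":"high","summary":"402 errors","needs_human":true}'
)
assert ok is not None and errs == []

bad, errs = validate_ticket(
    '{"category":"payments","urgency":"high","summary":"x","needs_human":true}'
)
assert bad is None and "category" in errs[0]
```

In production you would typically cap the number of correction round-trips (one or two) and fall back to `needs_human=true` if validation still fails.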
Tool calling & agent workflows
Many teams want “agents”: the model decides when to call tools (functions) to do work like searching documents, retrieving account details, creating tickets, or performing actions in external systems. Whether your SDK calls it “tools,” “functions,” or “function calling,” the principle is the same:
- Define tools as strict schemas.
- Let the model request tool calls.
- Execute tools in your backend (never in the model).
- Feed tool results back to the model and ask it to finalize.
Agent safety rule: “tools are allowed actions, not suggestions”
Tool calling can accidentally become a security vulnerability if you let the model decide too much. Your backend must enforce permissions. For example: if the model calls “refund_customer,” your backend must check: Is the user an admin? Is the ticket eligible? Has the user confirmed? Is this request within policy? The model can propose; your system must verify.
RAG: retrieval with citations
Retrieval-Augmented Generation (RAG) means you fetch relevant documents from your own knowledge base and include them in the prompt. This makes answers more accurate and grounded. A good RAG system has two parts:
- Retrieval: search and rank relevant passages.
- Generation: answer using only retrieved passages, and include citations to the passages.
RAG workflow that scales
- Chunk documents (300–800 tokens per chunk), store embeddings, and keep metadata (source, date, URL).
- At query time, retrieve top-k chunks, then compress them (optional summarization) to fit the context window.
- Instruct the model: “Use only provided sources. If not present, say you don’t know.”
- Include citations like [1], [2] mapping to sources in your UI.
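The generation side of that workflow can be sketched as a prompt builder that numbers retrieved passages for [n] citations (a sketch of prompt assembly only, not a retrieval pipeline):

```python
# Sketch: turn retrieved (source, text) chunks into a grounded prompt with
# numbered sources, plus a citation map for rendering [n] links in the UI.

def build_rag_prompt(question, chunks):
    """chunks: list of (source, text) pairs; returns (prompt, {n: source})."""
    sources = {}
    lines = []
    for i, (source, text) in enumerate(chunks, start=1):
        sources[i] = source
        lines.append(f"[{i}] {text}")
    prompt = (
        "Use ONLY the sources below. Cite claims with [n]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        "Sources:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
    )
    return prompt, sources

prompt, cites = build_rag_prompt(
    "What is the refund window?",
    [("policy.md", "Refunds are allowed within 30 days."),
     ("faq.md", "Contact support for billing issues.")],
)
assert "[1] Refunds" in prompt
assert cites[1] == "policy.md"
```

The citation map is kept outside the prompt on purpose: the model only sees numbers, and your UI resolves them to real sources it can display or link.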
Pricing (cache hit/miss) & a simple cost calculator
DeepSeek pricing is published per 1M tokens and separated into input and output costs. The pricing table also distinguishes “cache hit” input price vs “cache miss” input price. In practice, you should budget using cache miss until you confirm how caching behaves for your workload and account.
Published pricing (USD per 1M tokens)
| Model | Context | Max output | Input (cache hit) | Input (cache miss) | Output |
|---|---|---|---|---|---|
| deepseek-chat | 64K | 8K | $0.07 | $0.27 | $1.10 |
| deepseek-reasoner | 64K | 8K | $0.14 | $0.55 | $2.19 |
Cost calculator formula
Estimate cost using:
- Input cost = (input_tokens / 1,000,000) × input_price
- Output cost = (output_tokens / 1,000,000) × output_price
- Total = input cost + output cost
Example: deepseek-chat
Suppose your request uses 6,000 input tokens and returns 800 output tokens. Using cache miss input pricing:
- Input: (6,000 / 1,000,000) × $0.27 ≈ $0.00162
- Output: (800 / 1,000,000) × $1.10 ≈ $0.00088
- Total ≈ $0.00250 per call
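The same arithmetic as a small helper, using the cache-miss input prices from the table above:

```python
# Cost estimator using the published per-1M-token prices (USD).
# Prices here are (input cache-miss, output) from the pricing table.

PRICES = {
    "deepseek-chat": (0.27, 1.10),
    "deepseek-reasoner": (0.55, 2.19),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Return estimated USD cost for one call at cache-miss input pricing."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# The worked example above: 6,000 input + 800 output tokens on deepseek-chat.
cost = estimate_cost("deepseek-chat", 6_000, 800)
assert abs(cost - 0.0025) < 1e-6
```

Running this per request (or per user) is the foundation of the cost-accounting metrics discussed later; budget at cache-miss rates and treat any cache-hit savings as upside.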
The best way to reduce cost is rarely “use a lower temperature.” It’s almost always:
- Reduce input tokens (summarize history, trim tool schemas, retrieve fewer RAG chunks)
- Reduce output tokens (set max_tokens, ask for concise answers, use structured outputs)
- Cache repeated prompts and retrieval results where appropriate
- Route tasks: chat model by default, reasoner only when needed
Rate limits, timeouts & retries
DeepSeek’s rate-limit documentation indicates that the API does not impose a fixed per-user rate limit the way some providers do. It does note, however, that under high traffic your requests may take longer, and the HTTP connection may remain open while content streams back. In practice, you still need to protect your service with client-side and server-side rate controls, because:
- Your own infrastructure has limits (worker concurrency, database, network, queue backlogs).
- Upstream services can overload (503) or rate-limit (429) under spikes.
- Retries without discipline can amplify incidents and double your costs.
Retry strategy that won’t melt your system
- Retry only on transient errors (429, 500, 503, network timeouts).
- Use exponential backoff with jitter (e.g., 0.5s, 1s, 2s, 4s, max 10s).
- Cap retries (2–4 attempts). If still failing, return a graceful error and allow the user to retry manually.
- Use idempotency keys to avoid duplicate charges when a request is retried.
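The first three rules fit in a small helper. A sketch, assuming the call reports an HTTP status code (idempotency keys are omitted here, but belong in the request payload or headers of the real call):

```python
import random
import time

RETRYABLE = {429, 500, 503}  # transient errors only; never retry 402/422

def with_retries(call, max_attempts=3, base_delay=0.5, max_delay=10.0):
    """Retry transient failures with exponential backoff and full jitter.

    `call` returns (status, result); 200 means success.
    """
    for attempt in range(max_attempts):
        status, result = call()
        if status == 200:
            return result
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}")
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))  # full jitter spreads retry bursts
    raise RuntimeError("unreachable")

# Demo: fails twice with 503, then succeeds (tiny delays to keep the demo fast).
attempts = []
def flaky():
    attempts.append(1)
    return (503, None) if len(attempts) < 3 else (200, "ok")

assert with_retries(flaky, base_delay=0.01) == "ok"
assert len(attempts) == 3
```

Note that a 422 raises immediately: retrying an invalid payload just repeats the failure and, at scale, amplifies the incident.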
Error codes & debugging
DeepSeek documents common HTTP error codes and their likely causes. In production, you should map these codes to user-friendly messages and to actionable logs for your engineers.
| Status | Meaning | What to do |
|---|---|---|
| 402 | Insufficient balance | Show billing CTA, alert admins, pause background jobs, and avoid repeated retries. |
| 422 | Invalid parameters | Fix payload (model name, fields, types). Don’t retry automatically. |
| 429 | Rate limit reached / too many requests | Backoff + retry with jitter; reduce concurrency; add queueing. |
| 500 | Server error | Retry with backoff; log correlation IDs; open incident if persistent. |
| 503 | Server overloaded | Retry after delay; degrade gracefully; consider fallback provider if critical. |
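A sketch of mapping these codes to user-facing messages plus a retryable flag your retry layer can consult (the message copy is illustrative):

```python
# Map the status codes from the table to UX messaging and retry behavior.

ERROR_HANDLING = {
    402: ("Your balance is insufficient. Please top up to continue.", False),
    422: ("The request was invalid. Please contact support.", False),
    429: ("We're receiving a lot of requests. Please try again shortly.", True),
    500: ("Something went wrong on our side. Please try again.", True),
    503: ("The service is busy right now. Please try again in a moment.", True),
}

def handle_error(status):
    """Return a user-safe message and whether automatic retry is appropriate."""
    message, retryable = ERROR_HANDLING.get(
        status, ("Unexpected error. Please try again.", False)
    )
    return {"status": status, "message": message, "retryable": retryable}

assert handle_error(402)["retryable"] is False   # billing: alert, don't retry
assert handle_error(503)["retryable"] is True    # overload: backoff + retry
```

Keeping this table in one place means your retry logic, UX copy, and alerting all agree on what each status means.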
Debugging checklist
- Confirm base URL and key are correct (most “401-like” issues are config mistakes).
- List models via GET /models and ensure your model exists.
- Check payload types: numbers vs strings, booleans, and nested tool schemas.
- Validate token limits: context window exceeded can produce errors or truncated output.
- Measure latency: slow responses under load might look like timeouts if your timeouts are too strict.
Security, privacy & compliance
LLM APIs are powerful, but they also handle sensitive data: customer chats, internal documents, and potentially personal information. A production DeepSeek integration should include:
Security basics
- Backend-only keys: never expose keys in the client.
- Prompt injection defense: treat user content as untrusted; never let it override system policy.
- Tool permission checks: enforce auth and policy in your backend, not in the model.
- Data minimization: send only what’s necessary; avoid dumping entire documents into context.
- Encryption: TLS in transit; encrypt sensitive logs and stored conversations.
Privacy design that users trust
- Show a clear privacy policy explaining what you store and for how long.
- Allow users to delete conversation history and generated outputs.
- For enterprise use, provide workspace-level controls: retention windows, audit logs, and admin overrides.
Production architecture (what actually scales)
The fastest demo is “frontend calls API.” The correct production approach is: frontend → your backend → DeepSeek → your backend → frontend. Your backend becomes the control plane that enforces authentication, quotas, safety, and observability.
Recommended production components
- API gateway (auth, rate limit, request validation)
- Chat service (calls DeepSeek, handles streaming, validates output)
- RAG service (retrieval + chunking + citation formatting)
- Tool executor (function calls, permission checks, audit logs)
- Queue for long-running tasks (summaries, batch extraction, report generation)
- Database for sessions, messages, tool results, and cost accounting
Routing strategy: “fast by default”
Route most requests to deepseek-chat. Route to deepseek-reasoner when:
- The user explicitly asks for deep reasoning (“analyze”, “prove”, “find edge cases”, “compare trade-offs”).
- Your classifier flags complexity (multi-part reasoning, code debugging, ambiguous constraints).
- The user is on a plan that supports higher-cost reasoning.
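A deliberately naive version of this router (the keyword list is a stand-in for a real complexity classifier):

```python
# Sketch: "fast by default" routing. Route to the reasoner only when the text
# signals deep reasoning AND the user's plan allows the higher cost.

DEEP_KEYWORDS = ("analyze", "prove", "edge cases", "trade-offs", "debug")

def choose_model(user_text, plan_allows_reasoner=True):
    """Return a model name; defaults to the cheap chat model."""
    text = user_text.lower()
    if plan_allows_reasoner and any(k in text for k in DEEP_KEYWORDS):
        return "deepseek-reasoner"
    return "deepseek-chat"

assert choose_model("Summarize this email") == "deepseek-chat"
assert choose_model("Analyze the trade-offs of both designs") == "deepseek-reasoner"
assert choose_model("Prove this invariant", plan_allows_reasoner=False) == "deepseek-chat"
```

In production, log the route and the reason for it (see the observability section's "route reason" metric); that is what lets you tune the classifier against real traffic.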
Conversation memory that doesn’t explode tokens
Long conversations can become expensive. The solution is a tiered memory approach:
- Short-term window: last N turns included verbatim.
- Rolling summary: older content summarized into 200–500 tokens.
- Long-term memory: store facts as structured notes (user preferences, decisions) and retrieve only when relevant.
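The three tiers above can be sketched as a context assembler (the rolling summary itself would be produced by a separate summarization call, omitted here):

```python
# Sketch: assemble a bounded context from a rolling summary plus the last N
# turns verbatim. Older turns are represented only by the summary.

def build_context(turns, keep_last=4, summary=None):
    """turns: full message history; returns the bounded message list to send."""
    messages = []
    if summary:
        messages.append(
            {"role": "system", "content": f"Conversation summary: {summary}"}
        )
    messages.extend(turns[-keep_last:])  # short-term window, verbatim
    return messages

turns = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
ctx = build_context(turns, keep_last=3, summary="User is debugging 402 errors.")
assert len(ctx) == 4                     # 1 summary + 3 recent turns
assert ctx[0]["role"] == "system"
assert ctx[-1]["content"] == "msg 9"
```

The key property is that context size is now bounded by `keep_last` plus the summary budget, regardless of conversation length, which keeps per-turn cost flat.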
Logging & observability
If you can’t see it, you can’t scale it. Observability for LLM apps includes normal metrics (latency, error rate) plus LLM-specific metrics:
- Input/output tokens per request
- Cost estimate per request and per user/workspace
- Model choice and route reason (chat vs reasoner)
- Tool calls (which tools, duration, success/failure)
- RAG stats (retrieved chunks, sources, compression rate)
- Safety outcomes (blocked requests, redactions, human review)
FAQ
Official resources & links
Use the official docs as the source of truth for current models, parameters, pricing, and error behavior.
- DeepSeek API docs (getting started): api-docs.deepseek.com
- API reference overview: DeepSeek API reference
- Create chat completion: /chat/completions
- List models: GET /models
- Models & pricing: Pricing
- Pricing details (USD): Pricing details
- Error codes: Error codes
- Rate limit note: Rate limit
Changelog
- Initial publication: OpenAI-compatible setup, endpoints, models, pricing table, production patterns, and FAQs.