DeepSeek API (2026): Complete Developer Guide
The DeepSeek API is a developer platform for calling DeepSeek models using an OpenAI-compatible request/response format. That means you can often take existing code written for OpenAI-style clients, swap the base URL, set your DeepSeek API key, and start shipping. DeepSeek’s platform focuses on fast, cost-efficient inference and includes both chat-style models and reasoning models that are designed for harder tasks.
What is DeepSeek API?
DeepSeek API is an HTTP-based service for generating text (and in some cases advanced reasoning) from DeepSeek models. You send a request containing a conversation (messages), a model name, and generation settings (like temperature and max tokens), and the API returns a model response you can display, store, transform, or feed into downstream systems.
What makes DeepSeek especially practical for teams is that the API follows a familiar “chat completions” pattern that many developer tools already support. If your stack uses an OpenAI-compatible SDK, it’s often “plug-and-play”: update the base URL, provide your API key, and keep your existing code structure for messages, tools, and streaming.
What you can build with DeepSeek API
- Customer support assistants with grounded answers and escalation paths
- RAG search (document Q&A, knowledge-base chat, internal copilot)
- Extraction APIs (pull structured fields from text: invoices, resumes, tickets, reviews)
- Coding assistants and code review bots
- Research workflows that summarize, compare, and explain sources
- Agentic automation using tools (function calls) and multi-step planning
Why teams choose DeepSeek
Most teams pick DeepSeek for one or more of these reasons:
- OpenAI-compatible schema: faster migration and fewer rewrites across SDKs and gateways.
- Cost structure: token-based billing with discounted cache-hit input pricing can be extremely cost-effective if you architect for reuse.
- Model variety: a chat model for general tasks and a reasoner model for deeper, multi-step reasoning.
- Practical production docs: clear error codes and best practices for retries and resilience.
Core concepts: tokens, context, caching, and output limits
To build predictably with any LLM API, you need four mental models: tokens, context window, output limit, and caching.
Tokens (what you pay for)
DeepSeek pricing is expressed per 1 million tokens (1M tokens). Tokens are small chunks of text (words, parts of words, punctuation, and sometimes whitespace). The API bills the total number of tokens processed: input tokens (your messages + system prompt + tool schema) and output tokens (the model’s response).
Context window (how much you can send)
Each model has a maximum context length (for example, 64K for certain DeepSeek models). Your request must fit inside this window: system instructions + conversation history + any retrieved documents. If you exceed the window, you’ll need to truncate, summarize, or chunk your content.
Output token limit (how long the answer can be)
Many APIs let you cap the model’s output length (max_tokens). Production apps should always set a maximum output, because without a cap, long responses can increase costs and latency. For some models, the output maximum might be lower than the full context window—so design your UI and expectations accordingly.
Caching (how you can reduce cost)
In many token-billing platforms, caching reduces the cost of repeated prompt prefixes. The high-level strategy is: keep stable instructions and repeated content (policy text, formatting rules, tool schemas, or static knowledge) in a consistent prefix, then reuse it across calls. If the provider supports cache hit/miss pricing, cache hits are cheaper than cache misses. Even if your application-layer caching does most of the work, it’s still useful to understand how provider caching influences cost.
Models overview: deepseek-chat vs deepseek-reasoner
DeepSeek’s official pricing docs list two primary model families used by most developers:
| Model | What it’s best at | Typical usage |
|---|---|---|
| deepseek-chat | General conversational tasks, drafting, summarization, extraction, tool-using assistants | Customer support bots, content generation, structured extraction, RAG chat |
| deepseek-reasoner | Harder reasoning tasks, multi-step planning, complex analysis | Agents, code reasoning, deep analysis, complex decision support |
A simple approach that works well in production is to use deepseek-chat for most traffic and only route to deepseek-reasoner when the task is genuinely difficult or the user is on a tier that can afford higher reasoning cost. This is similar to how teams often route between “fast” and “deep” models on other platforms.
Thinking mode (when available)
DeepSeek’s chat completion docs include a thinking parameter (enabled/disabled) to switch between thinking and non-thinking modes in some cases. In practice, you should treat “thinking” as a knob that trades cost and latency for stronger reasoning. When you ship it, expose it as a product feature only if users understand the trade-off.
Authentication & base URL
DeepSeek API uses Bearer token authentication. You attach your API key in the HTTP header: Authorization: Bearer YOUR_API_KEY. The official docs show the base URL as https://api.deepseek.com, and note that you can also use https://api.deepseek.com/v1 as an OpenAI-compatible base (the “v1” path is for schema compatibility, not a model version).
Checklist for secure key management
- Store keys in a secrets manager, not hard-coded strings or public repos.
- Create separate keys for dev/staging/prod; rotate keys regularly.
- Log request IDs and user IDs, not the API key or raw prompts with sensitive data.
- Use allowlists for internal services that can access the key.
OpenAI compatibility & migration
DeepSeek explicitly documents an OpenAI-compatible API format. That means many OpenAI-style SDKs can work by changing:
- Base URL → https://api.deepseek.com (or .../v1)
- API key → DeepSeek API key
- Model name → DeepSeek model identifier (e.g., deepseek-chat)
In real migrations, the biggest “gotchas” are not the endpoint paths but the small behavior differences: default max tokens, how tools are represented, whether a parameter is supported, and how error codes are returned. The safest path is to create a small adapter layer in your code: a “Model Provider” interface with DeepSeek as one implementation. That way, you can keep your app logic stable even if provider details change.
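One way to sketch that adapter layer in Python (the interface and class names here are my own, not an official SDK):

```python
# A minimal "Model Provider" adapter sketch. App code depends on the
# interface; provider details (base URL, model name, auth) live in one place.
from dataclasses import dataclass
from typing import Protocol

class ModelProvider(Protocol):
    def complete(self, messages: list[dict], max_tokens: int = 256) -> str: ...

@dataclass
class DeepSeekProvider:
    api_key: str
    base_url: str = "https://api.deepseek.com/v1"
    model: str = "deepseek-chat"

    def complete(self, messages, max_tokens=256):
        # A real implementation would POST an OpenAI-style payload to
        # {base_url}/chat/completions; omitted to keep this sketch offline.
        raise NotImplementedError

@dataclass
class FakeProvider:
    reply: str
    def complete(self, messages, max_tokens=256):
        return self.reply

def answer(provider: ModelProvider, question: str) -> str:
    return provider.complete([{"role": "user", "content": question}])

# App logic is testable without any network call:
assert answer(FakeProvider(reply="42"), "What is 6*7?") == "42"
```

The fake implementation doubles as a test seam: your business logic never imports a provider SDK directly, so swapping providers (or mocking in CI) is a one-line change.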
Endpoints you’ll actually use
For most applications, there are only a few endpoints you need to understand:
| Endpoint | What it does | Why it matters |
|---|---|---|
| POST /chat/completions | Generate a model response from messages (chat format) | Your main workhorse for chat, RAG, extraction, and agents |
| GET /models | List available models and basic metadata | Useful to verify model availability and build model selectors |
| Docs: error codes | Explains 402/422/429/500/503 behavior | Critical for retries, UX messaging, and incident response |
Everything else (files, embeddings, images, etc.) depends on your specific stack and the provider’s evolving capabilities. Many teams keep the core chat completion endpoint as the primary interface and implement specialized functionality (like embedding generation) via an internal service or a gateway that can swap providers.
Quickstart (curl, Python, Node)
Below are practical “first request” examples. They intentionally keep the payload minimal so you can validate: your key works, your base URL is correct, and your model name is available. After that, you can add streaming, tools, structured outputs, and RAG.
1) curl — minimal chat completion
curl https://api.deepseek.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPSEEK_API_KEY" \
-d '{
"model": "deepseek-chat",
"messages": [
{"role":"system","content":"You are a concise assistant."},
{"role":"user","content":"Explain what an API is in 2 sentences."}
],
"temperature": 0.6,
"max_tokens": 200
}'
2) Python — OpenAI-style client pattern
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["DEEPSEEK_API_KEY"],
base_url="https://api.deepseek.com/v1"
)
resp = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Give me 5 bullet tips for reducing API latency."}
],
temperature=0.4,
max_tokens=250
)
print(resp.choices[0].message.content)
3) Node.js — minimal request (fetch)
const resp = await fetch("https://api.deepseek.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${process.env.DEEPSEEK_API_KEY}`,
},
body: JSON.stringify({
model: "deepseek-chat",
messages: [
{ role: "system", content: "You are a careful assistant." },
{ role: "user", content: "Write a short JSON schema for a support ticket." }
],
temperature: 0.2,
max_tokens: 250
})
});
const data = await resp.json();
console.log(data.choices?.[0]?.message?.content);
Prompting patterns that work
Prompting is not “magic words.” It’s simply shaping the input so the model understands the task, constraints, and output format. The most reliable prompts are structured and explicit:
1) The stable system prompt
The system prompt is the place to define your assistant’s identity, safety boundaries, and output rules. Keep it stable and versioned (e.g., policy_v12) so you can audit behavior changes over time. A good system prompt is short but clear: it sets tone, rules, and what to do when unsure.
2) Task framing in the user message
Many failures happen because the user’s request is ambiguous. Your UI can help by transforming user intent into a clearer prompt: include the goal, desired format, and any constraints. For extraction tasks, always specify the fields and define “null” behavior.
3) Examples (few-shot) when needed
If you want consistent formatting, give a short example. Few-shot examples increase tokens, so use them only when the benefit is large. A balanced approach is to provide a single example for tricky formats and rely on structured output constraints for everything else.
4) Separation of concerns
Keep policy text and formatting rules in the system prompt, and keep task-specific content in the user message. This makes debugging much easier: if the output is wrong, you know whether it’s a policy/format problem or a task problem.
Streaming responses
Streaming returns partial output as it is generated. Users perceive streaming as faster because they see tokens immediately. It also allows you to stop generation early if the model goes off-topic or if the user cancels. If your SDK supports streaming, treat it as a UX improvement rather than a business logic dependency: always handle the non-streaming “final response” too.
Streaming UX best practices
- Show “thinking…” state and begin rendering tokens as soon as they arrive.
- Allow cancellation; stop streaming and discard partial output if the user changes the request.
- Don’t save partial content to your DB until you receive completion (unless you want a full transcript).
- Use a max_tokens cap even for streaming, to avoid runaway responses.
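The practices above boil down to a small consumption loop. A provider-agnostic sketch (here `chunks` stands in for the text deltas an OpenAI-style streaming client yields, e.g. `chunk.choices[0].delta.content`; the cancel hook and cap mirror the bullets above):

```python
# Sketch: accumulate streamed text pieces, with a cancellation check and a
# hard cap so a runaway stream cannot grow without bound.

def consume_stream(chunks, should_cancel=lambda: False, max_chunks=1000):
    """Accumulate streamed text; stop early on cancel or when the cap is hit."""
    parts = []
    for i, text in enumerate(chunks):
        if should_cancel() or i >= max_chunks:
            break
        parts.append(text)  # in a real UI, render this piece immediately
    return "".join(parts)

assert consume_stream(["Hel", "lo", "!"]) == "Hello!"
assert consume_stream(["a", "b", "c"], max_chunks=2) == "ab"
```

Only persist the joined result after the stream finishes cleanly, per the best practices above.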
Structured JSON outputs (how to make LLMs reliable)
The biggest leap in reliability comes when you stop asking the model for “free-form prose” and instead ask for structured outputs. For example: classification, extraction, routing decisions, and config generation. This reduces hallucinations because your validator can reject invalid structures and you can re-ask the model with a correction prompt.
Pattern: “Return JSON only” + validate
In many production systems, the app first asks the model for JSON, validates it, and only then renders a user-friendly explanation. This approach works because the model’s job is no longer to write the final UI text. It’s to produce structured data.
SYSTEM:
You are a strict JSON generator. Output must be valid JSON and nothing else.
USER:
Extract a support ticket from this message. Return:
{
"category": "billing|bug|feature_request|account|other",
"urgency": "low|medium|high",
"summary": string,
"needs_human": boolean
}
If unsure, use "other" and set needs_human=true.
Message:
"My API key stopped working after I rotated it. Also getting 402 errors."
Then your backend can validate that the JSON matches expected enums and types. If validation fails, send a correction prompt: “The JSON was invalid because urgency must be one of… Return corrected JSON only.”
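That validate-then-correct loop can be sketched for the ticket schema above. The field rules mirror the prompt; the returned error strings are exactly what you would feed into the correction prompt:

```python
import json

ALLOWED_CATEGORY = {"billing", "bug", "feature_request", "account", "other"}
ALLOWED_URGENCY = {"low", "medium", "high"}

def validate_ticket(raw: str):
    """Return (ticket, errors); non-empty errors mean re-ask with a correction prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    errors = []
    if data.get("category") not in ALLOWED_CATEGORY:
        errors.append("category must be one of " + "|".join(sorted(ALLOWED_CATEGORY)))
    if data.get("urgency") not in ALLOWED_URGENCY:
        errors.append("urgency must be one of low|medium|high")
    if not isinstance(data.get("summary"), str):
        errors.append("summary must be a string")
    if not isinstance(data.get("needs_human"), bool):
        errors.append("needs_human must be a boolean")
    return (data if not errors else None), errors

ok, errs = validate_ticket(
    '{"category":"billing","urgency":"high","summary":"402 errors","needs_human":true}'
)
assert ok is not None and errs == []

bad, errs = validate_ticket(
    '{"category":"payments","urgency":"high","summary":"x","needs_human":true}'
)
assert bad is None and "category" in errs[0]
```

In production you would typically cap the number of correction round-trips (one or two) and fall back to `needs_human=true` if validation still fails.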
Tool calling & agent workflows
Many teams want “agents”: the model decides when to call tools (functions) to do work like searching documents, retrieving account details, creating tickets, or performing actions in external systems. Whether your SDK calls it “tools,” “functions,” or “function calling,” the principle is the same:
- Define tools as strict schemas.
- Let the model request tool calls.
- Execute tools in your backend (never in the model).
- Feed tool results back to the model and ask it to finalize.
Agent safety rule: “tools are allowed actions, not suggestions”
Tool calling can accidentally become a security vulnerability if you let the model decide too much. Your backend must enforce permissions. For example: if the model calls “refund_customer,” your backend must check: Is the user an admin? Is the ticket eligible? Has the user confirmed? Is this request within policy? The model can propose; your system must verify.
RAG: retrieval with citations
Retrieval-Augmented Generation (RAG) means you fetch relevant documents from your own knowledge base and include them in the prompt. This makes answers more accurate and grounded. A good RAG system has two parts:
- Retrieval: search and rank relevant passages.
- Generation: answer using only retrieved passages, and include citations to the passages.
RAG workflow that scales
- Chunk documents (300–800 tokens per chunk), store embeddings, and keep metadata (source, date, URL).
- At query time, retrieve top-k chunks, then compress them (optional summarization) to fit the context window.
- Instruct the model: “Use only provided sources. If not present, say you don’t know.”
- Include citations like [1], [2] mapping to sources in your UI.
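The generation side of that workflow can be sketched as a prompt builder that numbers retrieved passages for [n] citations (a sketch of prompt assembly only, not a retrieval pipeline):

```python
# Sketch: turn retrieved (source, text) chunks into a grounded prompt with
# numbered sources, plus a citation map for rendering [n] links in the UI.

def build_rag_prompt(question, chunks):
    """chunks: list of (source, text) pairs; returns (prompt, {n: source})."""
    sources = {}
    lines = []
    for i, (source, text) in enumerate(chunks, start=1):
        sources[i] = source
        lines.append(f"[{i}] {text}")
    prompt = (
        "Use ONLY the sources below. Cite claims with [n]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        "Sources:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
    )
    return prompt, sources

prompt, cites = build_rag_prompt(
    "What is the refund window?",
    [("policy.md", "Refunds are allowed within 30 days."),
     ("faq.md", "Contact support for billing issues.")],
)
assert "[1] Refunds" in prompt
assert cites[1] == "policy.md"
```

The citation map is kept outside the prompt on purpose: the model only sees numbers, and your UI resolves them to real sources it can display or link.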
Pricing (cache hit/miss) & a simple cost calculator
DeepSeek pricing is published per 1M tokens and separated into input and output costs. The pricing table also distinguishes “cache hit” input price vs “cache miss” input price. In practice, you should budget using cache miss until you confirm how caching behaves for your workload and account.
Published pricing (USD per 1M tokens)
| Model | Context | Max output | Input (cache hit) | Input (cache miss) | Output |
|---|---|---|---|---|---|
| deepseek-chat | 64K | 8K | $0.07 | $0.27 | $1.10 |
| deepseek-reasoner | 64K | 8K | $0.14 | $0.55 | $2.19 |
Cost calculator formula
Estimate cost using:
- Input cost = (input_tokens / 1,000,000) × input_price
- Output cost = (output_tokens / 1,000,000) × output_price
- Total = input cost + output cost
Example: deepseek-chat
Suppose your request uses 6,000 input tokens and returns 800 output tokens. Using cache miss input pricing:
- Input: (6,000 / 1,000,000) × $0.27 ≈ $0.00162
- Output: (800 / 1,000,000) × $1.10 ≈ $0.00088
- Total ≈ $0.00250 per call
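The same arithmetic as a small helper, using the cache-miss input prices from the table above:

```python
# Cost estimator using the published per-1M-token prices (USD).
# Prices here are (input cache-miss, output) from the pricing table.

PRICES = {
    "deepseek-chat": (0.27, 1.10),
    "deepseek-reasoner": (0.55, 2.19),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Return estimated USD cost for one call at cache-miss input pricing."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# The worked example above: 6,000 input + 800 output tokens on deepseek-chat.
cost = estimate_cost("deepseek-chat", 6_000, 800)
assert abs(cost - 0.0025) < 1e-6
```

Running this per request (or per user) is the foundation of the cost-accounting metrics discussed later; budget at cache-miss rates and treat any cache-hit savings as upside.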
The best way to reduce cost is rarely “use a lower temperature.” It’s almost always:
- Reduce input tokens (summarize history, trim tool schemas, retrieve fewer RAG chunks)
- Reduce output tokens (set max_tokens, ask for concise answers, use structured outputs)
- Cache repeated prompts and retrieval results where appropriate
- Route tasks: chat model by default, reasoner only when needed
Rate limits, timeouts & retries
DeepSeek’s rate-limit documentation indicates that the API does not impose a fixed per-user rate limit the way some providers do. It does note, however, that under high traffic your requests may take longer, and the HTTP connection may remain open while content streams back. In practice, you still need to protect your service with client-side and server-side rate controls, because:
- Your own infrastructure has limits (worker concurrency, database, network, queue backlogs).
- Upstream services can overload (503) or rate-limit (429) under spikes.
- Retries without discipline can amplify incidents and double your costs.
Retry strategy that won’t melt your system
- Retry only on transient errors (429, 500, 503, network timeouts).
- Use exponential backoff with jitter (e.g., 0.5s, 1s, 2s, 4s, max 10s).
- Cap retries (2–4 attempts). If still failing, return a graceful error and allow the user to retry manually.
- Use idempotency keys to avoid duplicate charges when a request is retried.
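The first three rules fit in a small helper. A sketch, assuming the call reports an HTTP status code (idempotency keys are omitted here, but belong in the request payload or headers of the real call):

```python
import random
import time

RETRYABLE = {429, 500, 503}  # transient errors only; never retry 402/422

def with_retries(call, max_attempts=3, base_delay=0.5, max_delay=10.0):
    """Retry transient failures with exponential backoff and full jitter.

    `call` returns (status, result); 200 means success.
    """
    for attempt in range(max_attempts):
        status, result = call()
        if status == 200:
            return result
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}")
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))  # full jitter spreads retry bursts
    raise RuntimeError("unreachable")

# Demo: fails twice with 503, then succeeds (tiny delays to keep the demo fast).
attempts = []
def flaky():
    attempts.append(1)
    return (503, None) if len(attempts) < 3 else (200, "ok")

assert with_retries(flaky, base_delay=0.01) == "ok"
assert len(attempts) == 3
```

Note that a 422 raises immediately: retrying an invalid payload just repeats the failure and, at scale, amplifies the incident.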
Error codes & debugging
DeepSeek documents common HTTP error codes and their likely causes. In production, you should map these codes to user-friendly messages and to actionable logs for your engineers.
| Status | Meaning | What to do |
|---|---|---|
| 402 | Insufficient balance | Show billing CTA, alert admins, pause background jobs, and avoid repeated retries. |
| 422 | Invalid parameters | Fix payload (model name, fields, types). Don’t retry automatically. |
| 429 | Rate limit reached / too many requests | Backoff + retry with jitter; reduce concurrency; add queueing. |
| 500 | Server error | Retry with backoff; log correlation IDs; open incident if persistent. |
| 503 | Server overloaded | Retry after delay; degrade gracefully; consider fallback provider if critical. |
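A sketch of mapping these codes to user-facing messages plus a retryable flag your retry layer can consult (the message copy is illustrative):

```python
# Map the status codes from the table to UX messaging and retry behavior.

ERROR_HANDLING = {
    402: ("Your balance is insufficient. Please top up to continue.", False),
    422: ("The request was invalid. Please contact support.", False),
    429: ("We're receiving a lot of requests. Please try again shortly.", True),
    500: ("Something went wrong on our side. Please try again.", True),
    503: ("The service is busy right now. Please try again in a moment.", True),
}

def handle_error(status):
    """Return a user-safe message and whether automatic retry is appropriate."""
    message, retryable = ERROR_HANDLING.get(
        status, ("Unexpected error. Please try again.", False)
    )
    return {"status": status, "message": message, "retryable": retryable}

assert handle_error(402)["retryable"] is False   # billing: alert, don't retry
assert handle_error(503)["retryable"] is True    # overload: backoff + retry
```

Keeping this table in one place means your retry logic, UX copy, and alerting all agree on what each status means.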
Debugging checklist
- Confirm base URL and key are correct (most “401-like” issues are config mistakes).
- List models via GET /models and ensure your model exists.
- Check payload types: numbers vs strings, booleans, and nested tool schemas.
- Validate token limits: context window exceeded can produce errors or truncated output.
- Measure latency: slow responses under load might look like timeouts if your timeouts are too strict.
Security, privacy & compliance
LLM APIs are powerful, but they also handle sensitive data: customer chats, internal documents, and potentially personal information. A production DeepSeek integration should include:
Security basics
- Backend-only keys: never expose keys in the client.
- Prompt injection defense: treat user content as untrusted; never let it override system policy.
- Tool permission checks: enforce auth and policy in your backend, not in the model.
- Data minimization: send only what’s necessary; avoid dumping entire documents into context.
- Encryption: TLS in transit; encrypt sensitive logs and stored conversations.
Privacy design that users trust
- Show a clear privacy policy explaining what you store and for how long.
- Allow users to delete conversation history and generated outputs.
- For enterprise use, provide workspace-level controls: retention windows, audit logs, and admin overrides.
Production architecture (what actually scales)
The fastest demo is “frontend calls API.” The correct production approach is: frontend → your backend → DeepSeek → your backend → frontend. Your backend becomes the control plane that enforces authentication, quotas, safety, and observability.
Recommended production components
- API gateway (auth, rate limit, request validation)
- Chat service (calls DeepSeek, handles streaming, validates output)
- RAG service (retrieval + chunking + citation formatting)
- Tool executor (function calls, permission checks, audit logs)
- Queue for long-running tasks (summaries, batch extraction, report generation)
- Database for sessions, messages, tool results, and cost accounting
Routing strategy: “fast by default”
Route most requests to deepseek-chat. Route to deepseek-reasoner when:
- The user explicitly asks for deep reasoning (“analyze”, “prove”, “find edge cases”, “compare trade-offs”).
- Your classifier flags complexity (multi-part reasoning, code debugging, ambiguous constraints).
- The user is on a plan that supports higher-cost reasoning.
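A deliberately naive version of this router (the keyword list is a stand-in for a real complexity classifier):

```python
# Sketch: "fast by default" routing. Route to the reasoner only when the text
# signals deep reasoning AND the user's plan allows the higher cost.

DEEP_KEYWORDS = ("analyze", "prove", "edge cases", "trade-offs", "debug")

def choose_model(user_text, plan_allows_reasoner=True):
    """Return a model name; defaults to the cheap chat model."""
    text = user_text.lower()
    if plan_allows_reasoner and any(k in text for k in DEEP_KEYWORDS):
        return "deepseek-reasoner"
    return "deepseek-chat"

assert choose_model("Summarize this email") == "deepseek-chat"
assert choose_model("Analyze the trade-offs of both designs") == "deepseek-reasoner"
assert choose_model("Prove this invariant", plan_allows_reasoner=False) == "deepseek-chat"
```

In production, log the route and the reason for it (see the observability section's "route reason" metric); that is what lets you tune the classifier against real traffic.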
Conversation memory that doesn’t explode tokens
Long conversations can become expensive. The solution is a tiered memory approach:
- Short-term window: last N turns included verbatim.
- Rolling summary: older content summarized into 200–500 tokens.
- Long-term memory: store facts as structured notes (user preferences, decisions) and retrieve only when relevant.
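The three tiers above can be sketched as a context assembler (the rolling summary itself would be produced by a separate summarization call, omitted here):

```python
# Sketch: assemble a bounded context from a rolling summary plus the last N
# turns verbatim. Older turns are represented only by the summary.

def build_context(turns, keep_last=4, summary=None):
    """turns: full message history; returns the bounded message list to send."""
    messages = []
    if summary:
        messages.append(
            {"role": "system", "content": f"Conversation summary: {summary}"}
        )
    messages.extend(turns[-keep_last:])  # short-term window, verbatim
    return messages

turns = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
ctx = build_context(turns, keep_last=3, summary="User is debugging 402 errors.")
assert len(ctx) == 4                     # 1 summary + 3 recent turns
assert ctx[0]["role"] == "system"
assert ctx[-1]["content"] == "msg 9"
```

The key property is that context size is now bounded by `keep_last` plus the summary budget, regardless of conversation length, which keeps per-turn cost flat.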
Logging & observability
If you can’t see it, you can’t scale it. Observability for LLM apps includes normal metrics (latency, error rate) plus LLM-specific metrics:
- Input/output tokens per request
- Cost estimate per request and per user/workspace
- Model choice and route reason (chat vs reasoner)
- Tool calls (which tools, duration, success/failure)
- RAG stats (retrieved chunks, sources, compression rate)
- Safety outcomes (blocked requests, redactions, human review)
FAQ
Official resources & links
Use the official docs as the source of truth for current models, parameters, pricing, and error behavior.
- DeepSeek API docs (getting started): api-docs.deepseek.com
- API reference overview: DeepSeek API reference
- Create chat completion: /chat/completions
- List models: GET /models
- Models & pricing: Pricing
- Pricing details (USD): Pricing details
- Error codes: Error codes
- Rate limit note: Rate limit
Changelog
- Initial publication: OpenAI-compatible setup, endpoints, models, pricing table, production patterns, and FAQs.