xAI Grok API - Complete Developer Guide to Endpoints, Models, Pricing & Production Integration
The Grok API (often called the xAI API) is a developer platform for calling Grok models over HTTP. It’s designed to be compatible with OpenAI-style REST endpoints so teams can migrate quickly by changing a base URL, swapping keys, and selecting Grok models. Under the hood, you get modern LLM capabilities such as long-context reasoning, tool calling for agent workflows, multimodal inputs (text + images for certain models), and a growing set of media APIs for image/video/audio creation depending on what your account enables.
What is the Grok API?
Grok is a family of large language models developed by xAI. The Grok API is the hosted interface that lets you call those models from your application—web, mobile, backend services, automation tools, or internal assistants. At a high level, it works like other modern LLM APIs: you send input (messages, instructions, optional images/files), the model produces output (text and, for some endpoints, media), and the response includes usage information so you can measure cost and rate-limit pressure.
What makes the Grok API especially interesting in 2026 is its ecosystem approach:
- OpenAI-compatible REST design: Many teams can reuse existing HTTP clients, schemas, and middleware with minor changes. This reduces migration friction and keeps your codebase portable.
- Multiple “primitives”: You can use a modern “responses” style API for richer features or a chat-completions style API for straightforward conversational patterns.
- Agents & tools: The API supports building assistants that can call functions, fetch information, reason over retrieved content (RAG), and follow structured output schemas when you need deterministic JSON.
- Multimodal and creative APIs: Depending on your account access, you may use image understanding, image generation, and video/audio workflows in adjacent APIs.
Grok Imagine API
Grok Imagine API is xAI’s media-generation API for creating AI videos (and related creative workflows) with a focus on strong quality, low latency, and cost efficiency. xAI positions it as a “world-class video generation model,” and provides dedicated Imagine API docs, playground, and SDK to build text-to-video and editing pipelines into apps.
What you can do with Grok Imagine API
- Text → Video: Generate short videos from a text prompt (often returned asynchronously via a request_id).
- Image → Video: Animate a still image into a video, guided by a prompt.
- Video editing (video → video): Provide an input video URL plus instructions to produce an edited version (xAI notes a max supported length of 8.7 seconds for input videos in the video edit flow).
- Asynchronous job pattern: Start a generation/edit request, then poll results using the returned request_id to fetch the final video URL.
How it typically works in production
- Start a job (prompt + optional image/video URL) using the Imagine video model (commonly referenced as grok-imagine-video).
- Get a request ID immediately (so your UI can show “Rendering…” and you can track/queue work).
- Fetch the completed output later with the same request ID and store the resulting video in your own CDN/storage for reliability and retention control.
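The start-then-poll flow above can be sketched in Python. The endpoint paths, status field, and response keys below are assumptions for illustration (only the grok-imagine-video model name and the request_id pattern come from the description above); confirm the real shapes in the Imagine API reference:

```python
import os
import time

import requests  # third-party HTTP client, matching the guide's Python example

API_BASE = "https://api.x.ai/v1"
HEADERS = {"Authorization": f"Bearer {os.environ.get('XAI_API_KEY', '')}"}

def start_video_job(prompt: str) -> str:
    """Submit a generation request and return its request_id."""
    resp = requests.post(
        f"{API_BASE}/video/generations",  # hypothetical path: check the Imagine docs
        headers=HEADERS,
        json={"model": "grok-imagine-video", "prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["request_id"]

def poll_video_job(request_id: str, interval_s: float = 5.0, max_wait_s: float = 600.0) -> str:
    """Poll until the job completes, then return the final video URL."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{API_BASE}/video/generations/{request_id}",  # hypothetical path
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "completed":
            return body["video_url"]  # download and copy to your own CDN/storage
        time.sleep(interval_s)
    raise TimeoutError(f"Video job {request_id} did not finish within {max_wait_s}s")
```

In a real service you would run the polling in a background worker rather than a request handler, and persist the request_id so the job survives restarts.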
Important safety note for apps
Because Grok’s media generation features have been linked to non-consensual sexual deepfake misuse in recent reporting, it’s smart to add strict guardrails if you build on Imagine (age gating, consent checks, abuse reporting, watermarking/provenance, and strong moderation).
Grok 4 API
Grok 4 API (via the xAI API) is xAI’s flagship reasoning-first model endpoint for building advanced assistants, coding copilots, and multimodal apps. Developers call Grok 4 through standard REST endpoints on api.x.ai, most notably the Responses API (recommended for new builds) and the legacy Chat Completions endpoint, to generate text, plan multi-step workflows, and power tool-using agents.
What makes Grok 4 different: it’s explicitly a reasoning model (there is no non-reasoning mode), designed for harder tasks such as complex problem solving, long-context synthesis, and code reasoning. It also supports modern features such as streaming and structured outputs, and xAI markets it for reasoning, coding, and visual processing.
Important implementation notes: Grok 4 has a few parameter constraints typical of reasoning models. xAI notes that presencePenalty, frequencyPenalty, and stop aren’t supported for reasoning models, and that Grok 4 doesn’t accept a reasoning effort parameter (sending one returns an error). Plan your request schema accordingly when migrating from other providers.
Context & Grok 4 API headline capabilities: xAI’s Grok 4 announcement highlights a 256,000-token context window and “frontier-level” multimodal understanding with advanced reasoning for text + vision use cases.
Who this guide is for
This page is for developers and teams building real products with Grok:
- Application engineers who need a dependable pattern for calling Grok from web/mobile/backends.
- Platform engineers who care about security, latency, caching, observability, and cost control.
- Founders choosing models, estimating spend, and deciding whether to use reasoning or non-reasoning variants.
- Data and ML teams implementing evaluation, monitoring, and prompt/version management.
It is written to be practical and production-minded: not just endpoints, but the “glue” that makes your system reliable—queues, retries, rate limit backoff, and safe defaults.
The big picture: base URL, endpoints, and compatibility
The xAI API uses a base host at https://api.x.ai. Many routes are under /v1/. You authenticate using a Bearer token header with your xAI API key. The documentation emphasizes compatibility with OpenAI’s REST API shape, which matters for:
- Existing code that calls /v1/chat/completions.
- Libraries and frameworks expecting OpenAI-like response schemas.
- Tooling such as SDK adapters, agent frameworks, and gateways that support “OpenAI mode.”
In practice, your integration strategy should aim to be provider-agnostic: build an internal interface like generateText() or runAssistant(), then map to xAI’s endpoints in one adapter module. This keeps you flexible if your product later needs multiple providers or if you want to swap models without rewriting business logic.
Accounts, billing, and API keys
To use the Grok API, you typically need:
- An xAI account registered through the official account system.
- Billing/credits (most APIs require you to load credits or configure billing before requests succeed).
- An API key generated in the xAI Console.
Operational best practices:
- Create separate keys for development, staging, and production.
- Store keys in a secret manager (or at minimum environment variables) and never hardcode them.
- Rotate keys regularly. Treat logs and crash reports as sensitive; never log keys.
- Use least-privilege patterns: if your org supports separate projects/teams, isolate environments.
Your first request (curl)
The simplest way to test connectivity is to call the Responses endpoint. This is a “single request / single response” pattern that can represent chat messages, system instructions, and richer multimodal inputs depending on model support. Below is a safe starter request:
curl https://api.x.ai/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $XAI_API_KEY" \
-m 3600 \
-d '{
"model": "grok-4-1-fast-reasoning",
"input": [
{ "role": "system", "content": "You are Grok, a helpful assistant." },
{ "role": "user", "content": "Explain rate limits like I am new to APIs." }
]
}'
Notes:
- The -m 3600 timeout is intentionally long for reasoning models that may take more time under load.
- You should expect the response to include model output plus a usage object (token counts and possibly cost fields).
- If you receive 401, check that your key is present and that billing/credits are configured.
- If you receive 429, you’re hitting rate limits—reduce request frequency or move to a higher tier.
SDK options (xAI SDK + OpenAI SDK compatibility)
There are three common ways to integrate Grok in 2026:
- Raw HTTPS: Use fetch/axios/requests/curl directly. This is the most universal method and easiest to debug.
- xAI native SDK: Useful if you prefer first-party helpers, typed objects, and convenience wrappers for chat sessions. Native SDKs also sometimes expose additional features sooner than generic clients.
- OpenAI-compatible SDK usage: If your current system uses OpenAI’s SDKs, you may be able to swap base_url to the xAI endpoint and keep much of your code intact, especially for chat completions patterns.
Example: minimal JavaScript fetch
const res = await fetch("https://api.x.ai/v1/responses", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${process.env.XAI_API_KEY}`
},
body: JSON.stringify({
model: "grok-4-1-fast-non-reasoning",
input: [{ role: "user", content: "Write a short project plan for an API integration." }]
})
});
if (!res.ok) throw new Error(`HTTP ${res.status}`);
const data = await res.json();
console.log(data);
Example: server-side environment variable pattern
On Linux/macOS shells:
export XAI_API_KEY="your_api_key_here"
Then read it in your backend. This keeps secrets out of source control and works well with CI/CD and container deployments.
Authentication & headers
The Grok API uses a simple Bearer token header:
Authorization: Bearer <your-xai-api-key>
Common headers you’ll use:
- Content-Type: application/json for JSON request bodies.
- Optional correlation/request IDs you generate, such as X-Request-Id, to make logs and debugging easier.
- Cache-related headers or conversation IDs (if you’re taking advantage of prompt caching behaviors on supported tiers). Even when the platform does caching automatically, you should design your own application cache for repeated prompts and repeated tool results.
Responses API (/v1/responses)
The Responses API is a modern “do everything” endpoint pattern. In many platforms, “responses” can represent simple text output, multi-turn conversation, structured output enforcement, and tool execution flows. Even if you start with chat completions, adopting responses early can help you unlock richer agent features.
Typical request fields
- model — which Grok model to run (reasoning or non-reasoning, long-context, coding variants, etc.).
- input — an array of message-like items (roles such as system/user, possibly tool messages depending on your architecture).
- temperature / top_p — sampling controls for creativity vs determinism (availability may vary by endpoint/model).
- max_output_tokens — cost control knob; prevents runaway output length.
- tools — tool definitions for function calling (JSON schema describing tool name and parameters).
- stream — enable streaming output where supported.
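The fields above can be combined into a single request body. The payload below is a sketch using the field names from this guide; the tool definition (lookup_status) is a hypothetical example, and availability of each knob may vary by model:

```python
import json

# Example Responses API payload exercising the fields listed above.
payload = {
    "model": "grok-4-1-fast-reasoning",
    "max_output_tokens": 400,          # cost-control knob
    "temperature": 0.2,                # lower = more deterministic
    "stream": False,
    "input": [
        {"role": "system", "content": "You are a precise assistant."},
        {"role": "user", "content": "Plan a 3-step rollout for a new API."},
    ],
    "tools": [
        {
            "type": "function",
            "name": "lookup_status",   # hypothetical tool, for illustration only
            "description": "Look up deployment status by service name.",
            "parameters": {
                "type": "object",
                "properties": {"service": {"type": "string"}},
                "required": ["service"],
            },
        }
    ],
}

body = json.dumps(payload)  # ready to send as the POST body
```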
When to use Responses vs Chat Completions
You can think of it this way:
- Chat Completions is great when you want a familiar “messages in, message out” interface.
- Responses is better when you want a single endpoint that supports advanced agent patterns, structured output, and tool orchestration.
For new projects, build around Responses if you can; for existing OpenAI-compatible stacks, start with chat completions and migrate once your basics are stable.
Chat Completions (/v1/chat/completions)
Chat completions is the “classic” interface: provide an array of messages, get a completion. It’s simple, widely supported by tools and frameworks, and often the fastest way to integrate Grok into an existing codebase.
A safe minimal chat completion example
curl https://api.x.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $XAI_API_KEY" \
-d '{
"model": "grok-4-1-fast-non-reasoning",
"messages": [
{ "role": "system", "content": "You are a concise assistant." },
{ "role": "user", "content": "Give me 5 bullet points on how to handle 429 errors." }
]
}'
Many production teams keep chat completions around even after adopting responses because it maps neatly to conversation UIs. The key is to keep an internal abstraction so you can switch models and endpoints without rewriting your product.
Streaming
Streaming lets you deliver tokens to the user as they are generated. It’s essential for good UX in chat apps, code assistants, and long-form responses: users feel the system is responsive, and you can allow cancellation mid-generation to save cost.
Streaming is often implemented via Server-Sent Events (SSE) or chunked HTTP. Your backend should:
- Open a streaming connection to xAI, then relay chunks to the client.
- Implement timeouts and cancellation (user hits “Stop”).
- Handle partial output persistence (save what’s produced so far if appropriate).
- Measure “time to first token” separately from “time to last token.”
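A relay along the lines above can be sketched as follows. This assumes the common `data: {...}` SSE framing with a `[DONE]` sentinel used by OpenAI-style streaming APIs; confirm the exact event format in xAI's docs before relying on it:

```python
import json

def parse_sse_line(line: str):
    """Parse one Server-Sent Events line into a JSON payload, or None.

    Returns None for comments, keep-alives, and the assumed [DONE] sentinel.
    """
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    return json.loads(data)

def relay_stream(response, send_to_client, on_first_token=None):
    """Relay streamed chunks from an upstream HTTP response to a client.

    `response` is any streaming HTTP response exposing iter_lines() (e.g.
    requests with stream=True); `send_to_client` pushes data to your UI
    transport (WebSocket, SSE, etc.).
    """
    first = True
    for raw in response.iter_lines(decode_unicode=True):
        event = parse_sse_line(raw or "")
        if event is None:
            continue
        if first and on_first_token:
            on_first_token()   # measure time-to-first-token here
            first = False
        send_to_client(event)  # or extract just the text delta before sending
```

Keeping the parser separate from the relay makes time-to-first-token instrumentation and cancellation handling easy to test without a live connection.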
Structured outputs & JSON you can trust
If you’re building automation—forms, extraction, classification, policy checks, routing—you often need the model to emit valid JSON. The most reliable approach is to use a structured outputs feature (when available) that enforces a schema rather than hoping the model “behaves.”
A structured output workflow usually includes:
- A JSON Schema describing the output fields.
- A “strict” mode that rejects outputs that don’t match the schema.
- Retry logic that asks the model to correct formatting issues (preferably automatically).
In production, you should still validate JSON on your side. Treat model output as untrusted input: parse, validate, sanitize, then proceed.
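The parse-validate-sanitize step can be as simple as the stdlib-only sketch below; a schema library (jsonschema, pydantic) is the usual production choice, but the principle is the same:

```python
import json

def parse_model_json(raw: str, required: dict) -> dict:
    """Parse and validate model output as JSON; raise ValueError on mismatch.

    `required` maps field name -> expected Python type. Treat the model's
    output as untrusted input: never feed it to downstream systems unchecked.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model output is not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    for field_name, typ in required.items():
        if field_name not in data:
            raise ValueError(f"Missing required field: {field_name}")
        if not isinstance(data[field_name], typ):
            raise ValueError(f"Field {field_name!r} should be {typ.__name__}")
    return data

# Example: extracting ticket fields from a support-assistant response.
ticket = parse_model_json('{"priority": "high", "category": "billing"}',
                          {"priority": str, "category": str})
```

On ValueError, a retry that echoes the error message back to the model often fixes formatting issues automatically.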
Model families and selection
xAI offers multiple Grok models aimed at different needs: fast and affordable general models, reasoning variants, and coding-focused variants. The details change over time, but the selection strategy stays consistent:
- Fast non-reasoning: best for chat, summaries, simple extraction, and high-throughput apps.
- Fast reasoning: best for multi-step tasks, planning, hard math/logic, and tool-using agents that benefit from deeper thinking.
- Coding models: best for code generation, refactoring, debugging, and agentic coding workflows (where tool calls are common).
- Long-context variants: best for large documents, multi-file codebases, and “chat with knowledge base” features.
A practical product uses multiple models:
- Default to a fast model for most user requests.
- Escalate to a reasoning model when your router detects complexity (long prompts, multi-step instructions, ambiguous tasks).
- Use coding models for code tasks; use multimodal-capable models only when images are present.
Router strategy (simple but effective)
You can implement a router in your backend that chooses a model based on:
- Task type: chat, coding, research, extraction, planning.
- Input size: larger context means you need long-context models and cost controls.
- Latency budget: interactive UI vs background batch job.
- Tier: free vs paid users; enterprise may get higher quality by default.
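The routing signals above reduce to a small, testable function. Model names and thresholds here are illustrative defaults to tune against your own evals, not xAI-documented values:

```python
def choose_model(task_type: str, input_tokens: int, interactive: bool) -> str:
    """Pick a Grok model from simple request signals (sketch)."""
    if task_type == "coding":
        return "grok-code-fast-1"          # hypothetical coding-model name
    if input_tokens > 100_000:
        return "grok-4-1-fast-reasoning"   # large context: escalate to reasoning
    if task_type in ("planning", "research") or not interactive:
        return "grok-4-1-fast-reasoning"   # complex or background work
    return "grok-4-1-fast-non-reasoning"   # cheap, fast default for chat
```

Start with a rules-based router like this; only graduate to a classifier model for routing once you have data showing the rules misroute.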
Pricing concepts: token categories (how Grok API billing works)
In the Grok API, cost is typically based on tokens and model-specific rates. There are multiple token categories that can influence billing:
- Input tokens: your prompt and conversation history (and sometimes system formatting tokens).
- Completion/output tokens: the tokens the model generates as the response.
- Reasoning tokens: for reasoning models, internal reasoning may be tracked and billed similarly to output tokens (availability depends on model/endpoint).
- Image tokens: when images are analyzed as part of the input.
- Cached prompt tokens: prompt tokens served from cache, usually at a discounted rate when cache hits occur.
The key idea: you don’t pay for “requests,” you pay for the amount and category of work done. That means the same API call can cost wildly different amounts depending on prompt length, output length, whether images are included, and whether your request triggers reasoning or tools.
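Because each token category has its own rate, a per-request cost estimate is a weighted sum. The rates below are placeholders (NOT real prices); look up current per-model pricing in the xAI console:

```python
def estimate_cost(usage: dict, rates: dict) -> float:
    """Estimate request cost in USD from a usage breakdown.

    `rates` maps token category -> price per million tokens.
    """
    return sum(
        usage.get(category, 0) * per_million / 1_000_000
        for category, per_million in rates.items()
    )

# Placeholder rates (USD per 1M tokens) -- illustrative only.
RATES = {"prompt_tokens": 2.0, "completion_tokens": 10.0, "cached_tokens": 0.5}

cost = estimate_cost(
    {"prompt_tokens": 12_000, "completion_tokens": 800, "cached_tokens": 8_000},
    RATES,
)
```

Note how 8,000 cached tokens cost a fraction of what the same tokens would cost uncached; this is why stable, reusable system prompts pay off.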
Practical pricing advice
- Keep system prompts short and stable. Reuse them and rely on caching where available.
- Summarize history instead of sending long conversation transcripts every time.
- Set output limits with max tokens and “be concise” instructions for typical UI flows.
- Use cheap models for classification and routing; save expensive reasoning for the tasks that truly need it.
- Cache tool results (web search results, database lookups) to avoid repeated tool calls for the same query.
Usage object & cost tracking
Grok API responses often include a usage section that breaks down tokens and sometimes cost metrics. Your product should capture usage for:
- Billing and quotas: enforce user-level credit limits or monthly budgets.
- Cost monitoring: alert when average cost per request spikes unexpectedly.
- Model evaluation: compare cost vs quality across models in your A/B tests.
- Rate limit forecasting: track tokens per minute under peak load.
A robust approach is to store a normalized usage record per request:
- request_id, user_id, model, endpoint
- prompt_tokens, completion_tokens, reasoning_tokens, image_tokens, cached_tokens
- total_tokens, total_cost_estimate
- latency_ms, status_code, error_type
Then build dashboards and alerting. This is not optional for real products: cost surprises are one of the top reasons LLM apps fail in production.
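The normalized record above maps naturally onto a dataclass (or a table row with the same columns); this sketch uses the field names from the list:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UsageRecord:
    """One row per API request, mirroring the fields listed above."""
    request_id: str
    user_id: str
    model: str
    endpoint: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    reasoning_tokens: int = 0
    image_tokens: int = 0
    cached_tokens: int = 0
    total_cost_estimate: float = 0.0
    latency_ms: int = 0
    status_code: int = 200
    error_type: Optional[str] = None
    created_at: float = field(default_factory=time.time)

    @property
    def total_tokens(self) -> int:
        return (self.prompt_tokens + self.completion_tokens
                + self.reasoning_tokens + self.image_tokens)
```

Writing one of these per request (success or failure) gives you the raw material for cost dashboards, per-user quotas, and model A/B comparisons.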
Rate limits & tiers
Like any large API platform, xAI enforces rate limits to ensure fair usage and stability. Rate limits can differ by model and by your team’s tier. The platform typically limits both:
- Requests per minute (RPM) — how many calls you can make.
- Tokens per minute (TPM) — how much total model work you can consume.
In production you should assume you will hit limits at some point (launches, traffic spikes, batch jobs). Design for this from day one:
- Queue requests and process with controlled concurrency.
- Retry 429s with exponential backoff and jitter.
- Fail gracefully in the UI: “We’re busy—try again in a moment.”
- Use fallback models (cheaper/faster) when appropriate.
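The "controlled concurrency" point above is the single highest-leverage defense against 429s. A minimal asyncio sketch, assuming `worker` wraps your actual API call:

```python
import asyncio

async def run_with_limit(jobs, worker, max_concurrency=4):
    """Run `worker(job)` coroutines with bounded concurrency.

    Capping parallel in-flight requests keeps you under RPM/TPM limits;
    combine with exponential backoff on 429 for bursty failures.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with sem:
            return await worker(job)

    # gather preserves input order, so results line up with jobs.
    return await asyncio.gather(*(guarded(j) for j in jobs))
```

For heavier workloads, the same idea moves into a real queue (e.g. a job broker with a fixed worker pool) rather than in-process semaphores.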
What to do when you hit 429
- Reduce concurrency (fewer parallel requests).
- Reduce prompt sizes (summarize history, shrink context).
- Reduce output sizes (max tokens, concise responses).
- Spread load across time (batch at off-peak, schedule jobs).
- Move to a higher tier or request increased limits if your use case needs it.
Image understanding (text + images in one request)
Many modern Grok models can accept images along with text input. This enables:
- Describing or summarizing images (captions, alt text, product descriptions).
- Extracting structured information (receipts, forms, diagrams—depending on quality and limitations).
- Visual Q&A (“what’s the error message in this screenshot?”).
- Design review and UI feedback (“is this layout accessible?”).
In a production app, treat image inputs as sensitive:
- Do not store images unless necessary, and provide deletion controls.
- Use signed URLs with short expiration if you store user uploads.
- Redact personally identifying information where possible before sending images to the model.
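A text+image request typically mixes content parts in one user message. The content-part shape below follows the OpenAI-compatible convention (typed parts with an "image_url" entry); verify the exact schema for your target Grok model in the xAI docs:

```python
# Sketch of a multimodal (text + image) request body.
payload = {
    "model": "grok-4-1-fast-reasoning",
    "input": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What error message is shown in this screenshot?"},
                {"type": "image_url",
                 # Prefer short-lived signed URLs for user uploads.
                 "image_url": {"url": "https://example.com/uploads/screenshot.png"}},
            ],
        }
    ],
}
```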
Image generation
In addition to image understanding, xAI’s docs include image generation capabilities. Image generation APIs are typically used for:
- Marketing creative, thumbnails, social media assets.
- Design mockups and concept art.
- Product visualization for e-commerce prototypes.
If you ship image generation, add controls for:
- Output size and quality (cost vs speed).
- Content filtering and safety constraints.
- Provenance metadata (so users can identify AI-generated media in workflows).
Video generation (and Grok Imagine)
xAI has also announced creative video workflows (often described as “video generation” and “video editing”). These are typically separate endpoints from core text responses and may require specific plans or account enablement.
Video generation systems are usually asynchronous:
- Submit a request (prompt + optional image reference + style controls).
- Receive a job ID.
- Poll job status or receive a webhook on completion.
- Download final artifacts and store them in your own media storage/CDN.
For production, treat video generation like a render pipeline: it needs queues, retry logic, artifact integrity checks, and lifecycle management (retention, deletion, and safe sharing).
Audio / voice agent API overview
xAI documentation also references audio and voice capabilities, including “voice agent” style APIs. Voice agents usually combine:
- Speech-to-text (user talks, model understands).
- Text-to-speech (model responds with a natural voice).
- Realtime streaming (low-latency back-and-forth).
- Tool use (agent can look up account data, execute tasks, call your services).
Voice features are higher-risk than text because they can be used for impersonation or deceptive content. If you build voice agents, implement strong safeguards: verified identities for sensitive actions, clear user consent, and secure authentication.
Tool calling in Grok (functions, actions, and agents)
Tool calling is what turns “chat” into an “agent.” Instead of responding with text only, the model can decide to call a tool, such as a function in your backend (e.g., “createInvoice”), a search function (web/X), a database query, or a code execution sandbox.
A safe tool-calling design uses a strict separation of responsibilities:
- The model chooses what to call and produces structured arguments.
- Your backend validates arguments against a schema and authorization rules.
- Your backend executes the tool (or rejects it) and returns results back to the model.
- The model summarizes results for the user and asks for confirmation before any irreversible action.
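The backend's half of that separation (validate, authorize, then execute) can be sketched as a dispatch gate. The tool registry and argument shapes here are illustrative, not a real API:

```python
# Allowlist of tools the model may request, with required argument types.
ALLOWED_TOOLS = {
    "createInvoice": {"required": {"customer_id": str, "amount_cents": int}},
}

def execute_tool_call(name: str, args: dict, user_permissions: set):
    """Validate a model-proposed tool call before running it."""
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        raise PermissionError(f"Tool {name!r} is not on the allowlist")
    if name not in user_permissions:
        raise PermissionError(f"User may not call {name!r}")
    for field_name, typ in spec["required"].items():
        if not isinstance(args.get(field_name), typ):
            raise ValueError(f"Bad or missing argument: {field_name}")
    # ...only now invoke the real implementation and return its result...
    return {"status": "ok", "tool": name}
```

The model never touches this code path directly: it only proposes (name, args), and your backend is the sole authority on whether anything runs.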
Golden rules for tool safety
- Never trust tool arguments blindly. Validate types, ranges, and permissions.
- Require confirmations for destructive actions (delete, transfer funds, send emails).
- Use allowlists instead of blocklists for actions the model may execute.
- Log tool calls with correlation IDs for audit and debugging.
- Throttle tool calls separately from text calls to prevent runaway loops.
Web search, X search, and citations
Many modern agent stacks include “search” tools. In xAI’s ecosystem, this can mean:
- Web search for general internet sources.
- X search for information and trends on the X platform.
- Citations tooling to attach sources to claims.
When you integrate search, your product should:
- Show sources clearly in the UI (users need to verify and trust outputs).
- Cache search results briefly to reduce cost and improve performance.
- Prevent prompt injection from retrieved pages by sanitizing content and applying a “trusted tool output” policy.
- Handle freshness: expose timestamps, and let users request “latest” with clear expectations.
For enterprise or regulated domains, search tool usage often needs explicit controls: an admin can disable it, restrict domains, or route it through an internal approved search index.
Code execution
Some agent systems allow the model to run code in a sandbox to compute results, analyze data, or generate artifacts. Code execution can dramatically improve reliability for:
- Math-heavy tasks (the model writes a small script to verify calculations).
- Data transformation (CSV parsing, simple analytics, formatting).
- Developer tools (generating or validating code, building config files).
But it’s also risky. If you expose code execution, enforce:
- Strict sandboxing with no network access (unless explicitly required and secured).
- CPU/memory/time limits.
- File system restrictions and scanning of outputs.
- Audit logs of what code was executed and by whom.
Files, collections, and RAG (retrieval-augmented generation)
“Chat with files” and “collections” features let your app ground Grok’s answers in your own data: PDFs, docs, knowledge base articles, wikis, policies, and internal notes. This is one of the highest-value use cases for enterprise products.
A typical RAG pipeline includes:
- Ingest documents (upload or sync).
- Chunk documents into passages (e.g., 300–1,200 tokens each depending on type).
- Embed chunks into vector representations.
- Retrieve the most relevant chunks at query time.
- Generate a response grounded in retrieved passages.
- Cite sources so the user can verify.
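The retrieval step of that pipeline reduces to nearest-neighbor search over embeddings. A dependency-free sketch (in production you would use a vector database, and embeddings would come from an embedding model rather than being hand-made):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """Return the top-k chunks by cosine similarity.

    `index` is a list of (chunk_id, embedding, text) tuples; keep chunk_id
    so answers can cite traceable sources.
    """
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return scored[:k]
```

Apply access-control filtering to `index` before ranking, not after: users should never have forbidden chunks enter the scoring step at all.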
RAG best practices
- Keep citations to chunk IDs or document sections so they’re traceable.
- Prefer smaller, higher-quality corpora over “dump everything in.” Noisy corpora reduce answer quality.
- Use access control at retrieval time: users should only retrieve documents they’re allowed to see.
- Refresh index when documents change; stale embeddings lead to wrong answers.
Production architecture (recommended)
A reliable Grok API integration is as much about architecture as endpoints. Here’s a strong baseline:
- Frontend: chat UI, assistant UI, or tool UI. Never stores xAI keys.
- API Gateway: authenticates users, rate-limits per user/workspace, validates inputs.
- Orchestrator service: builds prompts, selects model, handles tool routing, manages conversation state.
- Queue + workers: for long tasks (reasoning, batch, media generation). Keeps your web servers fast.
- Storage: conversation transcripts (optional), tool outputs, RAG indexes, and artifacts.
- Observability: structured logs, metrics, tracing, and alerting.
Why queues matter
Even if your first prototype is synchronous, production traffic will force you to handle spikes, timeouts, and rate limiting. Queues let you:
- Control concurrency so you don’t exceed rate limits.
- Retry safely when upstream is unstable.
- Provide “in progress” UI states and background job notifications.
- Separate fast interactive requests from heavy background processing.
Reliability: retries, timeouts, and idempotency
LLM APIs can fail for routine reasons (network blips, upstream 5xx, timeouts, rate limits). Your app should treat failures as normal and handle them gracefully.
Retry strategy (safe defaults)
- Retry on 429 and 5xx with exponential backoff + jitter.
- Do not retry on 400 unless you can fix the request.
- Cap retries (e.g., 3–5 attempts) and surface an actionable error message to the user.
- Use idempotency keys so double-clicks don’t double-bill or double-create actions.
Timeouts
Reasoning models can take longer. Set timeouts appropriate to your UX:
- Interactive chat: 10–45 seconds typical; stream output and allow cancellation.
- Complex reasoning: longer timeouts or async jobs (minutes) with progress UI.
- Batch processing: async jobs with queues, status endpoints, and worker time limits.
Security & privacy checklist
If your product handles user data, treat LLM prompts and outputs as potentially sensitive. A safe baseline includes:
- Backend-only calls to xAI endpoints; keys never leave the server.
- Encryption at rest for stored prompts, files, and transcripts.
- Access control for conversation retrieval and file browsing (per user/workspace).
- Redaction of secrets and personal data before sending to the model when possible.
- Data retention controls and a deletion pathway (“delete my data”).
- Audit logging for tool calls and admin actions.
Secrets management
Use a secrets manager (cloud secret store, Vault, or encrypted environment variables in your deployment platform). Do not:
- Commit keys into Git.
- Paste keys into client-side code.
- Store keys in plaintext databases.
Compliance & safety
Compliance isn’t only about regulations; it’s also about preventing misuse and protecting users. If you ship a Grok-powered assistant, build safety into your product design:
- Content controls: reject or transform disallowed requests, especially in areas involving harassment, hate, or non-consensual imagery.
- Identity & consent checks: for features that could be used for impersonation or deception (especially voice and media).
- Age-appropriate UX: if minors might use your product, enforce stricter defaults and clearer safeguards.
- Transparency: label AI-generated content and provide an explanation of limitations.
If you operate in jurisdictions with strong privacy law (e.g., GDPR-like rules), ensure you have a lawful basis for processing, provide clear notices, and avoid collecting more data than you need.
Evaluation, monitoring, and prompt/version management
Production AI systems drift: models change, data changes, user behavior changes. To keep your product stable, invest in:
- Golden tests: a suite of prompts and expected outputs (or expected properties) run daily.
- Regression monitoring: track quality metrics (helpfulness, correctness, refusal accuracy) and compare across model versions.
- Prompt versioning: store prompts as versioned templates; tie responses to prompt version for debugging.
- Human review loops: allow users or moderators to flag bad outputs and feed those into evaluation datasets.
For structured outputs, add schema-level tests: ensure outputs parse reliably, required fields exist, and types match. For tool-using agents, test tool selection correctness (did the model call the right tool?) and tool argument validity.
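A golden test then becomes "output + expectations in, failures out." Property-based checks (contains, length, parses) survive minor model drift better than exact string matches; the expectation keys below are illustrative:

```python
import json

def check_golden(output: str, expectations: dict) -> list:
    """Return a list of failed expectation names for one golden test."""
    failures = []
    for needle in expectations.get("must_contain", []):
        if needle.lower() not in output.lower():
            failures.append(f"missing: {needle}")
    if "max_words" in expectations and len(output.split()) > expectations["max_words"]:
        failures.append("too long")
    if expectations.get("must_be_json"):
        try:
            json.loads(output)
        except ValueError:
            failures.append("not valid JSON")
    return failures
```

Run the suite daily against each model version you use, and alert when the failure count rises rather than on any single flaky case.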
Example requests (curl, JS, Python)
Below are practical examples you can adapt. They are intentionally conservative: they include timeouts, safe system prompts, and cost controls.
1) Responses API with max output tokens
curl https://api.x.ai/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $XAI_API_KEY" \
-d '{
"model": "grok-4-1-fast-non-reasoning",
"max_output_tokens": 300,
"input": [
{ "role": "system", "content": "You are a precise assistant. Answer in short paragraphs and bullet points." },
{ "role": "user", "content": "Create a checklist for launching an LLM feature safely." }
]
}'
2) Chat completions for a UI chat app
curl https://api.x.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $XAI_API_KEY" \
-d '{
"model": "grok-4-1-fast-non-reasoning",
"messages": [
{ "role": "system", "content": "You are friendly and concise." },
{ "role": "user", "content": "Summarize what an API rate limit is and why it exists." }
]
}'
3) JavaScript backend wrapper with retries (pseudo-implementation)
async function grokRequest(body, { attempts = 4 } = {}) {
const url = "https://api.x.ai/v1/responses";
for (let i = 0; i < attempts; i++) {
const res = await fetch(url, {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${process.env.XAI_API_KEY}`,
"X-Request-Id": crypto.randomUUID()
},
body: JSON.stringify(body)
});
if (res.ok) return await res.json();
// Retry on 429 and 5xx
const retryable = res.status === 429 || (res.status >= 500 && res.status < 600);
if (!retryable) {
const text = await res.text().catch(() => "");
throw new Error(`Non-retryable HTTP ${res.status}: ${text}`);
}
// Exponential backoff with jitter
const base = Math.min(2000 * 2 ** i, 15000);
const jitter = Math.floor(Math.random() * 250);
await new Promise(r => setTimeout(r, base + jitter));
}
throw new Error("Grok request failed after retries");
}
4) Python “requests” style call (simple)
import os, requests
url = "https://api.x.ai/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
}
payload = {
"model": "grok-4-1-fast-non-reasoning",
"messages": [
{"role":"system","content":"You are a concise assistant."},
{"role":"user","content":"Give 6 tips to reduce LLM API cost in production."}
]
}
r = requests.post(url, headers=headers, json=payload, timeout=60)
r.raise_for_status()
print(r.json())
Migration notes (from OpenAI or Anthropic tooling)
Because xAI’s API is designed to be compatible with OpenAI-like schemas, migrations often follow a predictable path:
- Change base URL to xAI’s endpoint.
- Replace API key with your xAI key.
- Switch model name to a Grok model available in your console.
- Confirm response parsing still matches your code (usage fields, message shape, tool call fields).
- Re-tune prompts (small differences in instruction following can matter).
If your system uses an older Anthropic-compatible integration, treat it as legacy. In modern stacks, prefer Responses-style APIs or first-class tool calling features rather than older compatibility layers.
Common product patterns (what teams actually build)
Here are the most common “Grok API” product patterns and how to implement them well:
1) Customer support assistant (RAG + guardrails)
- Index help articles and policy docs into collections.
- Retrieve top passages and cite them in answers.
- Use structured outputs to extract ticket fields (priority, category).
- Escalate to a human when confidence is low or policy is ambiguous.
2) Internal copilot (tools + permissions)
- Connect tools: HR policy search, Jira, GitHub, internal dashboards.
- Use a strict permission layer; the model never decides authorization.
- Require confirmation for writes (closing tickets, merging PRs).
- Log all tool calls for audit and compliance.
3) Code assistant (coding model + code execution)
- Use a coding-optimized Grok model for code tasks.
- Allow optional sandbox code execution for verification.
- Adopt diff-based edits rather than rewriting whole files.
- Stream output for a fast developer experience.
4) Content pipeline (batch + cost controls)
- Queue generation tasks and process in batches.
- Enforce strict max output tokens per item.
- Cache shared context like brand guidelines and style guides.
- Run automated quality checks (length, tone, banned phrases) before publishing.
Developer + business FAQs
Glossary
- Base URL: the root host for all API calls, such as https://api.x.ai.
- RPM/TPM: requests per minute / tokens per minute. Rate limits often apply to both.
- Reasoning model: a model variant optimized for multi-step thinking; may be slower but better on complex tasks.
- Tool calling: letting the model request that your system executes a function/tool with structured arguments.
- RAG: retrieval-augmented generation—grounding responses in retrieved documents.
- Streaming: delivering output tokens incrementally so the UI updates in real time.
- Idempotency: ensuring retries and duplicates don’t create duplicate side effects or double-bill.
- Prompt caching: discounted reuse of repeated prompt prefixes, when supported.
Changelog
- Initial publication of this Grok API guide (endpoints, models/pricing concepts, rate limits, tools, multimodal, and production patterns).