LangSmith API: a complete developer guide

LangSmith is a platform for observing, debugging, evaluating, and operationalizing LLM applications and agents. This page explains the LangSmith API end-to-end: how authentication works, how tracing/ingest works, how to query runs and attach feedback, how to manage datasets and evaluations, how OpenTelemetry fits in, and how to ship reliable systems in production.

You can use LangSmith with or without LangChain. Think of it like a “video editor timeline” for your agent: every step (LLM call, tool call, retriever query, parser step) becomes a segment on a trace timeline you can inspect, compare, and score.

  • API host: api.smith.langchain.com
  • Auth header: X-Api-Key / x-api-key
  • Tracing: SDK, REST ingest, and OpenTelemetry
  • Core objects: projects, runs, feedback, datasets, experiments



1) What is the LangSmith API?

The LangSmith API is the programmatic surface you use to integrate the LangSmith platform into your development and production workflows. “Programmatic surface” includes both REST endpoints and SDKs that call those endpoints on your behalf. With the API you can: ingest traces (runs), query and export run data, attach feedback signals, manage datasets and examples, and automate evaluations/experiments.

If you’re new to LLM apps, it helps to separate two ideas that often get mixed together: (1) the model API (OpenAI, Anthropic, Google, etc.) which generates text or structured outputs, and (2) the observability/evaluation API which tells you how well your overall system behaved. LangSmith is in category (2). It’s like having a professional-grade logging, debugging, and QA system specifically designed for agentic workflows.

Observability

See every step: prompts, tool calls, retrieval queries, parsing, retries, errors, latency.

Evaluation

Attach scores/labels, run offline evaluations on datasets, compare versions, detect regressions.

Production confidence

Monitor real traffic with privacy controls, sampling, and dashboards tied to releases.

Video analogy: Your agent’s behavior is like raw footage. Traces are the clips, runs are the segments, feedback is the review notes, and evaluations are the “screening” you do before you publish the final cut.

Core nouns you should learn once

These terms show up everywhere in the API and UI. If you understand them, everything else becomes easier.

Project

A named container for runs. Many teams map projects to environments (dev/staging/prod) or to products.

Run

A trace record representing a unit of work: a root request, a chain step, a model call, a tool invocation, or a retriever query.

Feedback

Signals attached to runs: numeric scores, categorical labels, comments, rubric versions, reviewer identity.

Dataset & Example

A dataset is a curated set of examples (inputs and optional reference outputs) used for evaluation and regression testing.

2) Quickstart (fastest working setup)

The quickest path is: create an API key, enable tracing, run a small piece of code, and verify you can see a trace. Once you have that, you can expand into datasets, evals, and automation.

2.1 Create an API key

In the LangSmith app, generate an API key in your settings. In many setups you’ll see different key types: service keys (recommended for servers, CI, and automation) and personal tokens (recommended for individual development). Use service keys for production because they’re easier to rotate and don’t depend on a single person’s identity.

2.2 Set environment variables

Shell
# Example environment variables (names can vary by SDK / integration)
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="ls_your_key_here"
export LANGSMITH_PROJECT="my-app-dev"

2.3 Trace something simple (Python)

The goal isn’t fancy logic — it’s proving your integration works end-to-end. Once it does, you’ll wrap real steps like model calls, retrieval, and tool usage.

Python
from langsmith import traceable

@traceable(name="hello_trace")
def hello_trace(user_text: str) -> str:
    # Replace this with your real LLM call.
    return f"Echo: {user_text}"

if __name__ == "__main__":
    print(hello_trace("Tracing is working!"))

If traces don’t appear: (1) confirm the key, (2) confirm tracing is enabled, (3) confirm you’re writing to the intended project, and (4) if your org has multiple workspaces, confirm tenant selection headers/settings.

2.4 Quick sanity check via HTTP

Even if you use SDKs, it’s helpful to know the REST host and the auth pattern. Many requests authenticate with a key header: X-Api-Key / x-api-key. A direct curl test can reveal auth/network issues quickly.

curl
# Pick a real endpoint from the official LangSmith API docs for a true test.
# Pattern shown here: set API key header and call the API host.

curl -sS -H "X-Api-Key: ls_your_key_here" \
  "https://api.smith.langchain.com/"

3) Authentication & API keys

Every serious integration starts with auth. The common pattern is an API key in a request header. In multi-tenant organizations, you may also need to specify which workspace/tenant a request belongs to. In production, treat keys like passwords: store them in a secret manager, rotate them, and apply least privilege.

3.1 Headers you’ll see in practice

Headers
X-Api-Key: ls_your_key_here
# Optional when a key can access multiple workspaces/tenants:
X-Tenant-Id: your_workspace_or_tenant_id
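
For raw HTTP calls it helps to centralize header construction in one helper. A minimal sketch in Python; the X-Tenant-Id header name follows the example above and applies only in multi-workspace setups, so confirm the exact header your deployment expects in the official docs:

```python
from typing import Optional

def langsmith_headers(api_key: str, tenant_id: Optional[str] = None) -> dict:
    """Assemble auth headers for a raw LangSmith REST call.
    The tenant header is only needed when one key can access
    multiple workspaces; confirm the exact name in the API docs."""
    headers = {"X-Api-Key": api_key, "Content-Type": "application/json"}
    if tenant_id:
        headers["X-Tenant-Id"] = tenant_id
    return headers
```

Keeping this in one place also makes key rotation a one-line change.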

3.2 Service keys vs personal tokens

For production services and CI pipelines, prefer service keys because they’re easy to rotate and do not depend on one developer’s account. For local testing, personal tokens are convenient. But avoid using personal tokens in production because they can complicate audits, create brittle ownership, and make incident response harder if a person leaves a team.

3.3 Key handling checklist

  • Never commit keys to version control. Use environment variables and secret managers.
  • Rotate keys routinely and after incidents. Keep a documented rotation playbook.
  • Separate environments: use different keys for dev/staging/prod.
  • Minimize scope: keys used only for ingestion shouldn’t also power admin scripts unless needed.

4) REST API overview

The LangSmith REST API is the universal interface for managing LangSmith resources from any language or environment. Even if you never call REST directly, it’s useful to understand what the API offers because the SDKs are effectively “friendly wrappers” over these endpoints.

4.1 What you can do via REST

Ingest (write)

Send runs/traces (including batch/multipart), export OpenTelemetry traces, create feedback, create datasets and examples.

Query & manage (read/write)

Query runs by time range and metadata, export data, manage experiments/evaluations, and build admin automation.

4.2 When REST is best

  • Your stack is not Python/JS and you want a stable HTTP-based integration.
  • You operate batch data jobs: nightly exports, evaluation runners, or log ingestion.
  • You want a small dependency footprint and prefer raw HTTP clients.
  • You’re building internal tools or dashboards that query runs and feedback.

4.3 When SDKs are best

  • You want automatic nested run trees with minimal code.
  • You want convenient tracing decorators/context managers.
  • You want typed client methods and common patterns built-in.

Implementation rule: start with SDK tracing for speed, then add REST-based automation as your product matures (exports, evaluators, CI gates, internal dashboards).

5) Tracing: SDK instrumentation vs REST ingest

Tracing is the heart of LangSmith. A trace is a structured record of “what happened” during an LLM workflow: inputs, outputs, intermediate steps, tool calls, retrieved documents, errors, retries, latency, and metadata. In LangSmith, these are stored as runs — often organized into a parent/child tree.

5.1 Two tracing strategies

SDK tracing (most common)

Instrument your app with the LangSmith SDK so each function/step becomes a run. The SDK handles nesting and metadata, and it’s usually the fastest to implement.

REST ingest / OpenTelemetry

Use this when you have custom tracing infrastructure, multiple languages, or a pipeline that emits OTel spans. You can send traces in batches for performance and resilience.

5.2 What to record in a run (practical list)

Exact schemas evolve, but these fields consistently matter:

  • name: stable step name like answer_question or retrieve_context
  • run_type: llm, chain, tool, retriever, etc.
  • inputs / outputs: what went in and what came out (redact sensitive data)
  • metadata: model, temperature, prompt version, environment, release sha
  • tags: low-cardinality labels (prod, agent-v3, checkout)
  • timing: start/end timestamps, latency, retries, timeouts
  • error: structured error + safe stack summary
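
Put together, a single run record following this checklist might look like the dict below. Field names mirror common LangSmith run fields but are illustrative here, not the authoritative schema; verify the exact ingest payload against the API reference before sending:

```python
import datetime
import uuid

# Illustrative run record following the checklist above. Field names
# are an approximation of common LangSmith run fields; check the
# official ingest API reference for the authoritative schema.
run = {
    "id": str(uuid.uuid4()),
    "name": "retrieve_context",
    "run_type": "retriever",
    "inputs": {"query": "reset my password"},       # redact before logging
    "outputs": {"doc_ids": ["kb-121", "kb-348"]},   # ids, not full documents
    "start_time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "end_time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "extra": {
        "metadata": {"app.version": "1.7.3", "env": "prod",
                     "prompt.version": "v12"},
    },
    "tags": ["prod", "agent-v3"],
    "error": None,
}
```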

5.3 A production-grade tracing pipeline

Tracing is easiest when you treat it like high-volume telemetry rather than synchronous logging. A robust architecture:

  1. Collect run events in-process while handling a user request.
  2. Enqueue events to a queue/worker (or a background task) so user latency isn’t impacted.
  3. Batch events to reduce HTTP overhead (batch/multipart ingest where supported).
  4. Retry on transient failures with exponential backoff + jitter.
  5. Sample if needed (trace all errors, sample successes, temporarily trace 100% on new releases).

Video analogy: You don’t export every raw pixel to your editor in real time — you capture footage, then import it efficiently. Treat tracing the same way: capture locally, upload in batches, then review and score.
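
Steps 1 through 4 of the pipeline can be sketched as a small in-process buffer. Here `send_batch` is a stand-in for a real batch-ingest call (SDK or REST); the background worker loop, retries, and sampling are left out for brevity:

```python
import queue

class RunBuffer:
    """Collect run events in-process, then flush them in batches
    off the request path (steps 1-4 above)."""

    def __init__(self, send_batch, batch_size: int = 20):
        self._q = queue.Queue()
        self._send = send_batch   # stand-in for a real batch-ingest call
        self._size = batch_size

    def record(self, event: dict) -> None:
        self._q.put(event)        # cheap; never blocks the user request

    def flush_once(self) -> int:
        """Drain up to one batch, send it, return how many events went out.
        In production a background worker calls this in a loop, with
        retry + exponential backoff inside send_batch."""
        batch = []
        while len(batch) < self._size:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._send(batch)
        return len(batch)

sent = []
buf = RunBuffer(send_batch=sent.append, batch_size=3)
for i in range(5):
    buf.record({"name": f"step-{i}"})
buf.flush_once()  # first batch of 3
buf.flush_once()  # remaining 2
```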

5.4 Video-style “shot list” for agents (how to name runs)

Developers often struggle with naming. A simple naming convention makes traces readable:

  • Root run: request or chat_turn (include request_id, session_id, user_hash as metadata)
  • LLM calls: llm.generate or llm.classify_intent
  • Retrieval: retriever.query and retriever.rerank
  • Tools: tool.search, tool.calendar, tool.db_write
  • Post-processing: parser.json_schema, formatter.response

With this style, your run tree reads like a storyboard. That makes debugging faster and evaluation labeling more consistent.

6) Runs: query, inspect, and export

Once tracing is on, the most valuable API workflow becomes: query runs → inspect → label/score → fix → re-evaluate. Runs are the raw data that supports debugging, analytics, and quality improvement.

6.1 Typical run query use cases

Debugging

Find errors, inspect tool calls, see which step produced the wrong output, and correlate with your logs.

Quality review

Sample production traces, label failure types, attach feedback scores, and build datasets from real traffic.

Performance & cost

Measure latency, token usage, tool count, retries, and “expensive path” frequency by version and feature.

6.2 Metadata schema (recommended starter set)

Your metadata choices determine what you can filter and measure later. A stable starter schema:

| Key | Example value | Why it matters | Notes / privacy |
| --- | --- | --- | --- |
| app.version | 1.7.3 or git:9f3c2a1 | Compare releases, rollbacks, and regressions | No PII; always include |
| env | dev, staging, prod | Separate experiments from production telemetry | Low risk; always include |
| feature | checkout_assistant | Find issues by product area / routing logic | Low risk; keep names stable |
| request_id | req_01HT... | Correlate with logs and incident reports | Avoid embedding raw user data |
| tenant | hash:ab12... | Multi-tenant analytics and support workflows | Prefer salted hashes; no raw customer names |
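
A small helper that emits this starter schema keeps metadata consistent across services. A sketch; the salted hash for the tenant id follows the privacy note in the table:

```python
import hashlib

def run_metadata(version: str, env: str, feature: str,
                 request_id: str, tenant: str, salt: str) -> dict:
    """Build the starter metadata set from the table above. The tenant
    id is salted and hashed so raw customer names never leave the app."""
    digest = hashlib.sha256((salt + tenant).encode()).hexdigest()[:12]
    return {
        "app.version": version,
        "env": env,
        "feature": feature,
        "request_id": request_id,
        "tenant": f"hash:{digest}",
    }

md = run_metadata("1.7.3", "prod", "checkout_assistant",
                  "req_123", tenant="acme-corp", salt="s3cret")
```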

6.3 Export patterns (how to avoid a privacy mess)

It’s tempting to export everything, but the safest approach is exporting derived metrics and selected fields. For example, export latency, token counts, model identifiers, and feedback scores widely. Export full inputs/outputs only to restricted systems, and only when you truly need them. If you must export content, implement a consistent redaction layer first.

Simple rule: store what you need to improve quality, not what you can. Traces are powerful; keep them safe.

7) Feedback: scores, labels, and evaluation signals

Tracing tells you what happened. Feedback tells you whether it was good. Feedback is the bridge between observability and evaluation: you attach human ratings, QA labels, or automated grader outputs to runs so you can measure quality over time and compare versions.

7.1 Common feedback types

Scalar scores

Numeric ratings like helpfulness (0–1), correctness (0–1), or quality (1–5). Choose a scale and keep it consistent across releases.

Labels / categories

Structured labels like “hallucination”, “tool misuse”, “missing citations”, “slow response”, “bad formatting”. Labels are great for root-cause breakdowns.

7.2 A feedback rubric that teams actually use

If you want feedback to be actionable, define a rubric. Here’s a practical rubric you can adopt immediately:

  • Correctness (0–1): is the response factually correct or consistent with known data?
  • Completeness (0–1): does it cover the user’s requirements and edge cases?
  • Instruction following (0–1): does it obey format, tone, and constraints?
  • Tool use correctness (0–1): did it call the right tools with the right arguments?
  • Safety / policy adherence (pass/fail): does it avoid disallowed content and unsafe recommendations?
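
The rubric maps naturally onto per-run feedback records. A minimal sketch; the key names are illustrative, not a fixed LangSmith schema:

```python
def score_with_rubric(correctness: float, completeness: float,
                      instruction: float, tool_use: float,
                      safety_pass: bool) -> list:
    """Turn the rubric above into per-run feedback records. The pass/fail
    safety check becomes a 0/1 score so it can be aggregated alongside
    the numeric dimensions. Key names are illustrative."""
    feedback = [
        {"key": "correctness", "score": correctness},
        {"key": "completeness", "score": completeness},
        {"key": "instruction_following", "score": instruction},
        {"key": "tool_use_correctness", "score": tool_use},
        {"key": "safety", "score": 1.0 if safety_pass else 0.0},
    ]
    for item in feedback:
        assert 0.0 <= item["score"] <= 1.0, f"score out of range: {item}"
    return feedback

fb = score_with_rubric(0.9, 0.8, 1.0, 1.0, safety_pass=True)
```

Keeping one record per dimension (rather than one blended score) is what makes the root-cause breakdowns in the next section possible.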

7.3 Root-run vs child-run feedback

Use both whenever possible:

  • Root run feedback measures the end-user outcome (“did the system solve the request?”).
  • Child run feedback diagnoses specific stages (retrieval quality, tool accuracy, prompt adherence).

Should feedback be human-only?
No. Human review is gold but expensive. Many teams use a hybrid: a small stream of human-reviewed runs plus automated graders (rules, validators, or LLM-as-judge) to get broad coverage. Keep grader versions in metadata so you can interpret score shifts.

8) Datasets & examples: build regression tests for your agent

Datasets are where LangSmith becomes a release machine. A dataset is a curated set of examples that represent what users actually need. When you change prompts, models, tools, or retrieval settings, you run evaluations on the dataset to catch regressions before shipping.

8.1 What belongs in an example?

At minimum, an example needs inputs. Optionally it can include reference outputs (ground truth), plus metadata such as: domain, difficulty, language, required tools, or expected output schema. Even without ground truth, you can evaluate with rubrics and validators.

8.2 The best way to build datasets: harvest from production traces

Instead of inventing examples, harvest them:

  1. Query recent production root runs for your target feature (e.g., “checkout”).
  2. Filter out sensitive content (redact, hash, or drop fields).
  3. Convert run inputs into dataset inputs.
  4. Add reference outputs for high-value tasks where you have authoritative answers.
  5. Tag examples with metadata: difficulty, language, tool requirements, or known failure type.

This ensures your evaluation set mirrors reality. It also makes your model upgrades safer: you can run the same “real-world” examples against a new model before you deploy it.
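
The harvesting steps can be sketched as a conversion function. The redaction shown is deliberately simple (emails only); a real pipeline would cover more PII classes:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def run_to_example(run: dict) -> dict:
    """Turn a harvested production run into a dataset example
    (steps 1-5 above). Redaction here covers emails only; a real
    pipeline would handle more PII classes."""
    question = EMAIL.sub("<email>", run["inputs"]["question"])
    return {
        "inputs": {"question": question},
        "outputs": run.get("reference_outputs"),  # only when authoritative
        "metadata": {
            "source": "production",
            "feature": run.get("feature", "unknown"),
        },
    }

ex = run_to_example({
    "inputs": {"question": "Refund order 88 for bob@example.com"},
    "feature": "checkout",
})
```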

Video analogy: Datasets are your “highlight reel.” They contain the most important scenes your audience cares about. If a new edit makes the reel worse, you don’t publish it.

8.3 Dataset sizing and iteration

You don’t need thousands of examples to start. Many teams get real value with 50–200 examples. The secret is coverage: include easy cases, typical cases, and “problem cases” where your agent has historically failed. Over time, add examples based on: support tickets, production errors, or newly shipped features.

How do I avoid overfitting to my dataset?
Keep two sets: a development dataset you iterate on frequently and a “holdout” dataset you run less often. Also sample real production traces regularly to ensure your dataset still represents current user behavior.

9) Evaluations & experiments: compare versions like you compare video cuts

Evaluations answer the practical questions that matter for shipping: “Is version B better than version A?” and “Did we regress?” In LangSmith, evaluations can be run offline on datasets (release gating) and augmented with online feedback signals (monitoring).

9.1 Two evaluation modes

Offline evals (pre-release)

Run a dataset through two versions, score outputs, compare metrics and failure categories, and block regressions.

Online eval signals (post-release)

Attach feedback to real traffic, monitor drift, detect quality drops, and target improvements by segment.

9.2 What to evaluate (a prioritized list)

  • Task success: did the agent produce a usable result?
  • Correctness: is it factually correct or consistent with retrieved sources?
  • Tool accuracy: correct tool selection and correct arguments.
  • Retrieval relevance: relevant docs retrieved; minimal irrelevant context.
  • Output validity: if you require JSON or schema, does it validate?
  • Latency/cost: speed and expense, especially in multi-step agents.
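
Output validity is the easiest item on this list to automate. A minimal sketch that checks whether the model's raw text parses as a JSON object with the required keys; a stricter evaluator could apply a full JSON Schema instead:

```python
import json

def output_is_valid(raw: str, required_keys: set) -> bool:
    """Output-validity check from the list above: does the model's
    raw text parse as a JSON object containing all required keys?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

good = output_is_valid('{"answer": "yes", "sources": []}', {"answer", "sources"})
bad = output_is_valid('answer: yes', {"answer"})
```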

9.3 “Experiment hygiene” (how to interpret results)

A single average score can hide important regressions. Use layered analysis:

  1. Check overall score shifts.
  2. Break down by category: language, domain, difficulty, tenant segment, tool usage.
  3. Inspect failure cases and label root causes.
  4. Fix one root cause at a time (prompt, retrieval, tool guardrails, routing).
  5. Re-run evaluations and confirm you improved without breaking other categories.
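
Step 2 is where hidden regressions show up. A sketch of a per-category breakdown over evaluation results; the result shape here is illustrative:

```python
from collections import defaultdict
from statistics import mean

def score_by_category(results: list, category_key: str) -> dict:
    """Break evaluation scores down by a metadata category (step 2
    above) so a single average can't hide a per-segment regression."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["metadata"][category_key]].append(r["score"])
    return {cat: round(mean(scores), 3) for cat, scores in buckets.items()}

results = [
    {"score": 0.9, "metadata": {"language": "en"}},
    {"score": 0.8, "metadata": {"language": "en"}},
    {"score": 0.4, "metadata": {"language": "de"}},  # hidden by the overall mean
]
by_lang = score_by_category(results, "language")
```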

Release gating like a “final cut”

Before deploying a new agent version, run your offline evaluation suite. If the metrics are worse or the failure types increase, treat it as a bad edit: revise and re-run until the final cut is better.

  • Use dataset coverage to represent real user scenes.
  • Score consistently using a rubric and grader versions.
  • Inspect traces for failures (not just averages).

10) OpenTelemetry (OTel): unify LLM traces with your system telemetry

OpenTelemetry is a widely adopted standard for traces and spans. If your infrastructure already emits OTel telemetry, integrating LangSmith with OTel can give you a unified view: LLM calls and agent steps alongside API latency, database queries, and downstream service performance.

10.1 Why OTel matters for agent systems

  • Cross-service correlation: connect a user request to every microservice hop and every agent step.
  • Standard attributes: use consistent naming across language stacks.
  • Multiple destinations: route telemetry to LangSmith plus your APM (depending on architecture).

10.2 Safe attribute design

Treat attributes like logs: don’t include secrets or raw PII. Prefer hashed identifiers and redacted content. If you need content for debugging, keep it limited and restrict access. A good compromise is: store full content only in dev/staging, and in prod store derived metrics plus a small sample of redacted content.

Design tip: Create a standard “telemetry contract” for your org (names, keys, redaction rules). That makes upgrades and incident response easier.

11) Rate limits & performance: ingest fast, query smart

Observability APIs typically enforce rate limits for stability. In LangSmith, the practical pattern is: ingest endpoints are built for throughput (especially batch/multipart) while run querying can be more restrictive. Your goal is to design for throughput without harming user latency.

11.1 Endpoint throughput patterns

| Endpoint (example) | Auth type | Throughput pattern | What it’s for |
| --- | --- | --- | --- |
| POST /runs/batch | x-api-key | High throughput (batch ingest) | Send many run records efficiently |
| POST /otel/v1/traces | x-api-key | High throughput (OTel ingest) | Export OTel traces/spans to LangSmith |
| POST /runs/multipart | x-api-key | Very high throughput (multipart ingest) | Efficient ingest for large payloads |
| POST /runs/query | x-api-key | Lower throughput (query) | Query/filter runs via API |
| POST /runs/query | x-user-id + IP | Higher than API-key query in some cases | Query runs in a “user” context |

11.2 Performance best practices

  • Batch ingest rather than sending one HTTP request per run.
  • Async flush so user requests aren’t slowed by telemetry network calls.
  • Retry with jitter on transient failures (especially 429 and timeouts).
  • Narrow query windows: prefer “last hour/day” over “last month” for dashboards.
  • Cache dashboard queries if your UI repeatedly requests the same slices.

What should I do on HTTP 429?
Implement exponential backoff with jitter, reduce concurrency, batch more aggressively, and narrow query frequency/windows. If you hit 429 regularly in production, treat it as a capacity signal: reduce chatter and optimize your telemetry pipeline.
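
A minimal sketch of that backoff policy using "full jitter" (each retry waits a random time up to an exponentially growing cap). The HTTP call and the actual sleep are stubbed out so the retry logic stays visible:

```python
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """'Full jitter' backoff: each retry waits a random time between
    0 and min(cap, base * 2**attempt)."""
    return [random.uniform(0, min(cap, base * (2 ** a))) for a in range(attempts)]

def call_with_retry(do_request, max_attempts: int = 5):
    """Retry a request on HTTP 429 using the delays above. `do_request`
    returns a status code; the sleep is elided in this sketch."""
    for delay in backoff_delays(max_attempts):
        status = do_request()
        if status != 429:
            return status
        # time.sleep(delay) in real code; also consider lowering concurrency
    raise RuntimeError("rate limited after retries")

responses = iter([429, 429, 200])
status = call_with_retry(lambda: next(responses))
```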

12) Deployment workflows: ship agents with confidence

The LangSmith API becomes even more valuable when you connect it to your deployment process. The pattern is: trace everything in dev/staging, build datasets from real usage, run offline evaluations before releases, and monitor online feedback after releases. That gives you a continuous loop: observe → evaluate → improve → ship.

12.1 A practical CI/CD checklist

  1. On every PR, run a small eval subset (fast feedback).
  2. Before a release, run the full dataset suite and compare to the current production baseline.
  3. Block deploy if key metrics regress or failure labels spike.
  4. After deploy, temporarily increase tracing and monitoring to catch new issues.
  5. Rotate keys and confirm the ingest pipeline health for the new release.

12.2 Versioning strategy

Don’t rely on memory or “it feels better.” Put versions into metadata: app.version, prompt.version, retriever.version, and toolset.version. When something breaks, you’ll know exactly which change caused it.

Team habit that wins: every time you fix a production failure, add a dataset example that represents it. Over time, your dataset becomes a map of your product’s real-world edge cases.

13) Security & privacy: log enough to debug, not enough to leak

Traces can contain sensitive content: user messages, internal documents retrieved for RAG, tool arguments with identifiers, and model outputs. The safest approach is to build privacy into your instrumentation from day one.

13.1 Redaction strategy

Implement redaction in your app before sending traces:

  • Mask secrets (API keys, tokens, passwords).
  • Redact PII (emails, phone numbers, addresses) when feasible.
  • Hash user identifiers and tenant identifiers with a salt.
  • Separate content logging from metrics logging.
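
A redaction layer can start as a small set of patterns applied before any trace leaves the process. The regexes below are deliberately simple sketches (the key prefixes are illustrative), not production-grade PII detection:

```python
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"\b(?:sk|ls)_[A-Za-z0-9]{8,}\b"),  # illustrative prefixes
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Apply the masking rules above before a trace leaves the process.
    These regexes are simple sketches; real pipelines layer patterns
    with review and allow-lists."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

clean = redact("Contact bob@example.com, key ls_abcd1234efgh")
```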

13.2 Access control mindset

Restrict access to raw trace content. Many teams allow broad access to metrics dashboards but restrict raw content views to a small set of engineers or reviewers. Treat trace content like production logs: useful, powerful, and potentially risky.

How do I handle sensitive documents in RAG?
Consider logging document identifiers and relevance scores rather than full document text. If you need full text to debug, log only short snippets with redaction, and restrict access to those traces.

14) Troubleshooting: common problems and fast fixes

14.1 401 / 403 (auth or permissions)

  • Confirm the header name and key value (X-Api-Key / x-api-key).
  • If your key can access multiple workspaces, set the tenant/workspace header or configuration.
  • Verify the key hasn’t been revoked and has the needed permissions (read vs write).
  • Confirm your server clock is sane if you use any signed time-based auth in surrounding infrastructure.

14.2 Traces missing or incomplete trees

  • Confirm tracing flags are enabled and the project name is correct.
  • Ensure background flushing completes (process may exit early in scripts).
  • Check that parent/child linking is consistent (especially for custom REST ingest).
  • Reduce sampling temporarily to confirm you are not skipping runs.

14.3 429 rate limits

  • Batch ingest and reduce request frequency.
  • Retry with exponential backoff + jitter.
  • Narrow query windows and cache repeated slices.

60-second integration checklist
1) Trigger exactly one traceable request.
2) Print your tracing/project config (without printing secrets).
3) Verify the run appears in the UI (filter to the last 5 minutes).
4) If not, run a curl test to the API host with the same API key header.
5) If multi-workspace, add the tenant/workspace header and retry.

15) FAQ: LangSmith API

Is LangSmith only for LangChain apps?
No. Although LangSmith comes from the LangChain ecosystem, you can instrument any framework or custom stack using SDK tracing, REST ingest, or OpenTelemetry. The key is sending structured run data and metadata.
Do I need datasets and evaluations immediately?
Not immediately, but they’re the fastest way to prevent regressions. Start with tracing + basic feedback. Then build a small dataset (50–200 examples) and run offline evals before releases.
What’s the best way to keep the system fast?
Trace asynchronously, batch ingest, avoid heavy synchronous telemetry in the request path, and sample in production if necessary. Use narrow query windows and cache repeated analytics requests.
How should I name projects?
A simple convention is product-env like assistant-dev, assistant-staging, assistant-prod. You can also split by feature if your org is large, but avoid too many projects early.
How do I make traces safe for privacy?
Redact secrets and PII, hash identifiers, store derived metrics broadly, restrict raw content access, and apply retention policies. Make privacy a first-class requirement in your instrumentation.