1) What is the LangSmith API?
The LangSmith API is the programmatic surface you use to integrate the LangSmith platform into your development and production workflows. “Programmatic surface” includes both REST endpoints and SDKs that call those endpoints on your behalf. With the API you can: ingest traces (runs), query and export run data, attach feedback signals, manage datasets and examples, and automate evaluations/experiments.
If you’re new to LLM apps, it helps to separate two ideas that often get mixed together: (1) the model API (OpenAI, Anthropic, Google, etc.) which generates text or structured outputs, and (2) the observability/evaluation API which tells you how well your overall system behaved. LangSmith is in category (2). It’s like having a professional-grade logging, debugging, and QA system specifically designed for agentic workflows.
Observability
See every step: prompts, tool calls, retrieval queries, parsing, retries, errors, latency.
Evaluation
Attach scores/labels, run offline evaluations on datasets, compare versions, detect regressions.
Production confidence
Monitor real traffic with privacy controls, sampling, and dashboards tied to releases.
Core nouns you should learn once
These terms show up everywhere in the API and UI. If you understand them, everything else becomes easier.
Project
A named container for runs. Many teams map projects to environments (dev/staging/prod) or to products.
Run
A trace record representing a unit of work: a root request, a chain step, a model call, a tool invocation, or a retriever query.
Feedback
Signals attached to runs: numeric scores, categorical labels, comments, rubric versions, reviewer identity.
Dataset & Example
A dataset is a curated set of examples (inputs and optional reference outputs) used for evaluation and regression testing.
2) Quickstart (fastest working setup)
The quickest path is: create an API key, enable tracing, run a small piece of code, and verify you can see a trace. Once you have that, you can expand into datasets, evals, and automation.
2.1 Create an API key
In the LangSmith app, generate an API key in your settings. In many setups you’ll see different key types: service keys (recommended for servers, CI, and automation) and personal tokens (recommended for individual development). Use service keys for production because they’re easier to rotate and don’t depend on a single person’s identity.
2.2 Set environment variables
# Example environment variables (names can vary by SDK / integration)
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="ls_your_key_here"
export LANGSMITH_PROJECT="my-app-dev"
2.3 Trace something simple (Python)
The goal isn’t fancy logic — it’s proving your integration works end-to-end. Once it does, you’ll wrap real steps like model calls, retrieval, and tool usage.
from langsmith import traceable

@traceable(name="hello_trace")
def hello_trace(user_text: str) -> str:
    # Replace this with your real LLM call.
    return f"Echo: {user_text}"

if __name__ == "__main__":
    print(hello_trace("Tracing is working!"))
2.4 Quick sanity check via HTTP
Even if you use SDKs, it’s helpful to know the REST host and the auth pattern. Many requests authenticate with a key header: X-Api-Key (also accepted as x-api-key). A direct curl test can reveal auth/network issues quickly.
# Pick a real endpoint from the official LangSmith API docs for a true test.
# Pattern shown here: set the API key header and call the API host.
curl -sS -H "X-Api-Key: ls_your_key_here" \
  "https://api.smith.langchain.com/"
3) Authentication & API keys
Every serious integration starts with auth. The common pattern is an API key in a request header. In multi-tenant organizations, you may also need to specify which workspace/tenant a request belongs to. In production, treat keys like passwords: store them in a secret manager, rotate them, and apply least privilege.
3.1 Headers you’ll see in practice
X-Api-Key: ls_your_key_here
# Optional when a key can access multiple workspaces/tenants:
X-Tenant-Id: your_workspace_or_tenant_id
3.2 Service keys vs personal tokens
For production services and CI pipelines, prefer service keys because they’re easy to rotate and do not depend on one developer’s account. For local testing, personal tokens are convenient. But avoid using personal tokens in production because they can complicate audits, create brittle ownership, and make incident response harder if a person leaves a team.
3.3 Key handling checklist
- Never commit keys to version control. Use environment variables and secret managers.
- Rotate keys routinely and after incidents. Keep a documented rotation playbook.
- Separate environments: use different keys for dev/staging/prod.
- Minimize scope: keys used only for ingestion shouldn’t also power admin scripts unless needed.
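The checklist above can be partially enforced in code with a fail-fast check at startup. A minimal sketch, assuming the environment variable names from the quickstart:

```python
import os

def load_langsmith_key(var: str = "LANGSMITH_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it's missing.

    Never hard-code the key; in production, have your secret manager
    populate this variable at deploy time.
    """
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; refusing to start without credentials")
    return key
```

Failing at startup beats discovering a missing key via silent trace loss hours later.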
4) REST API overview
The LangSmith REST API is the universal interface for managing LangSmith resources from any language or environment. Even if you never call REST directly, it’s useful to understand what the API offers because the SDKs are effectively “friendly wrappers” over these endpoints.
4.1 What you can do via REST
Ingest (write)
Send runs/traces (including batch/multipart), export OpenTelemetry traces, create feedback, create datasets and examples.
Query & manage (read/write)
Query runs by time range and metadata, export data, manage experiments/evaluations, and build admin automation.
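To make the raw HTTP pattern concrete, here is a sketch of building an authenticated POST request with the standard library. The endpoint path and payload schema are placeholders; take real ones from the official API reference:

```python
import json
import urllib.request

API_HOST = "https://api.smith.langchain.com"

def build_request(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST request against the LangSmith API host.

    The path and payload shape are illustrative; consult the official
    API reference for real endpoint schemas.
    """
    return urllib.request.Request(
        url=f"{API_HOST}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(build_request("/runs", key, run_payload)) as resp:
#     print(resp.status)
```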
4.2 When REST is best
- Your stack is not Python/JS and you want a stable HTTP-based integration.
- You operate batch data jobs: nightly exports, evaluation runners, or log ingestion.
- You want a small dependency footprint and prefer raw HTTP clients.
- You’re building internal tools or dashboards that query runs and feedback.
4.3 When SDKs are best
- You want automatic nested run trees with minimal code.
- You want convenient tracing decorators/context managers.
- You want typed client methods and common patterns built-in.
5) Tracing: SDK instrumentation vs REST ingest
Tracing is the heart of LangSmith. A trace is a structured record of “what happened” during an LLM workflow: inputs, outputs, intermediate steps, tool calls, retrieved documents, errors, retries, latency, and metadata. In LangSmith, these are stored as runs — often organized into a parent/child tree.
5.1 Two tracing strategies
SDK tracing (most common)
Instrument your app with the LangSmith SDK so each function/step becomes a run. The SDK handles nesting and metadata, and it’s usually the fastest to implement.
REST ingest / OpenTelemetry
Use this when you have custom tracing infrastructure, multiple languages, or a pipeline that emits OTel spans. You can send traces in batches for performance and resilience.
5.2 What to record in a run (practical list)
Exact schemas evolve, but these fields consistently matter:
- name: stable step name like answer_question or retrieve_context
- run_type: llm, chain, tool, retriever, etc.
- inputs / outputs: what went in and what came out (redact sensitive data)
- metadata: model, temperature, prompt version, environment, release sha
- tags: low-cardinality labels (prod, agent-v3, checkout)
- timing: start/end timestamps, latency, retries, timeouts
- error: structured error + safe stack summary
5.3 A production-grade tracing pipeline
Tracing is easiest when you treat it like high-volume telemetry rather than synchronous logging. A robust architecture:
- Collect run events in-process while handling a user request.
- Enqueue events to a queue/worker (or a background task) so user latency isn’t impacted.
- Batch events to reduce HTTP overhead (batch/multipart ingest where supported).
- Retry on transient failures with exponential backoff + jitter.
- Sample if needed (trace all errors, sample successes, temporarily trace 100% on new releases).
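The batch-and-retry steps above can be sketched as follows; send_batch is a stand-in for whatever actually performs the HTTP call in your pipeline:

```python
import random
import time

def flush_with_retry(batch: list[dict], send_batch, *,
                     max_attempts: int = 5, base_delay: float = 0.5) -> bool:
    """Send a batch of run events, retrying transient failures with
    exponential backoff plus jitter. `send_batch` should raise on failure."""
    for attempt in range(max_attempts):
        try:
            send_batch(batch)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False  # give up; consider spilling to disk for later replay
            # exponential backoff (0.5s, 1s, 2s, ...) plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return False
```

Run this in a background worker so user-facing latency never waits on telemetry.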
5.4 Video-style “shot list” for agents (how to name runs)
Developers often struggle with naming. A simple naming convention makes traces readable:
- Root run: request or chat_turn (include request_id, session_id, user_hash as metadata)
- LLM calls: llm.generate or llm.classify_intent
- Retrieval: retriever.query and retriever.rerank
- Tools: tool.search, tool.calendar, tool.db_write
- Post-processing: parser.json_schema, formatter.response
With this style, your run tree reads like a storyboard. That makes debugging faster and evaluation labeling more consistent.
6) Runs: query, inspect, and export
Once tracing is on, the most valuable API workflow becomes: query runs → inspect → label/score → fix → re-evaluate. Runs are the raw data that supports debugging, analytics, and quality improvement.
6.1 Typical run query use cases
Debugging
Find errors, inspect tool calls, see which step produced the wrong output, and correlate with your logs.
Quality review
Sample production traces, label failure types, attach feedback scores, and build datasets from real traffic.
Performance & cost
Measure latency, token usage, tool count, retries, and “expensive path” frequency by version and feature.
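For the debugging case, a common query is "error runs in the last hour of production". Here is a sketch: the parameter builder below is pure, and the commented call uses the Python SDK's Client.list_runs (check the SDK docs for the exact argument names it accepts):

```python
from datetime import datetime, timedelta, timezone

def recent_error_query(project: str, hours: int = 1) -> dict:
    """Parameters for 'error runs in the last N hours'. The keys mirror
    common Client.list_runs arguments but should be verified against
    the SDK documentation."""
    return {
        "project_name": project,
        "start_time": datetime.now(timezone.utc) - timedelta(hours=hours),
        "error": True,
    }

# Usage with the LangSmith SDK (requires credentials and network access):
# from langsmith import Client
# for run in Client().list_runs(**recent_error_query("my-app-prod")):
#     print(run.name, run.error)
```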
6.2 Metadata schema (recommended starter set)
Your metadata choices determine what you can filter and measure later. A stable starter schema:
| Key | Example value | Why it matters | Notes / privacy |
|---|---|---|---|
| app.version | 1.7.3 or git:9f3c2a1 | Compare releases, rollbacks, and regressions | No PII; always include |
| env | dev, staging, prod | Separate experiments from production telemetry | Low risk; always include |
| feature | checkout_assistant | Find issues by product area / routing logic | Low risk; keep names stable |
| request_id | req_01HT... | Correlate with logs and incident reports | Avoid embedding raw user data |
| tenant | hash:ab12... | Multi-tenant analytics and support workflows | Prefer salted hashes; no raw customer names |
6.3 Export patterns (how to avoid a privacy mess)
It’s tempting to export everything, but the safest approach is exporting derived metrics and selected fields. For example, export latency, token counts, model identifiers, and feedback scores widely. Export full inputs/outputs only to restricted systems, and only when you truly need them. If you must export content, implement a consistent redaction layer first.
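One way to enforce "derived metrics widely, content narrowly" is a projection step that strips everything but allow-listed fields before any broad export. A minimal sketch (the field names are illustrative):

```python
# Fields safe for broad export: metrics and identifiers, never raw content.
SAFE_EXPORT_FIELDS = {"id", "name", "run_type", "start_time", "end_time",
                      "latency_ms", "total_tokens", "model", "feedback_score"}

def to_safe_export(run: dict) -> dict:
    """Project a run record down to non-content fields for broad export.

    Full inputs/outputs stay behind a separate, restricted export path
    with redaction applied first.
    """
    return {k: v for k, v in run.items() if k in SAFE_EXPORT_FIELDS}
```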
7) Feedback: scores, labels, and evaluation signals
Tracing tells you what happened. Feedback tells you whether it was good. Feedback is the bridge between observability and evaluation: you attach human ratings, QA labels, or automated grader outputs to runs so you can measure quality over time and compare versions.
7.1 Common feedback types
Scalar scores
Numeric ratings like helpfulness (0–1), correctness (0–1), or quality (1–5). Choose a scale and keep it consistent across releases.
Labels / categories
Structured labels like “hallucination”, “tool misuse”, “missing citations”, “slow response”, “bad formatting”. Labels are great for root-cause breakdowns.
7.2 A feedback rubric that teams actually use
If you want feedback to be actionable, define a rubric. Here’s a practical rubric you can adopt immediately:
- Correctness (0–1): is the response factually correct or consistent with known data?
- Completeness (0–1): does it cover the user’s requirements and edge cases?
- Instruction following (0–1): does it obey format, tone, and constraints?
- Tool use correctness (0–1): did it call the right tools with the right arguments?
- Safety / policy adherence (pass/fail): does it avoid disallowed content and unsafe recommendations?
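The rubric above is easy to enforce in code so every reviewer and automated grader submits the same shape. A sketch, assuming snake_case keys derived from the rubric names:

```python
# Rubric dimensions and their allowed score ranges (from the rubric above).
RUBRIC = {
    "correctness": (0.0, 1.0),
    "completeness": (0.0, 1.0),
    "instruction_following": (0.0, 1.0),
    "tool_use_correctness": (0.0, 1.0),
}

def validate_feedback(scores: dict, safety_pass: bool) -> list[dict]:
    """Range-check rubric scores, then return one feedback entry per
    dimension, ready to attach to a run."""
    for key, (lo, hi) in RUBRIC.items():
        value = scores[key]
        if not lo <= value <= hi:
            raise ValueError(f"{key}={value} outside [{lo}, {hi}]")
    entries = [{"key": k, "score": scores[k]} for k in RUBRIC]
    entries.append({"key": "safety", "value": "pass" if safety_pass else "fail"})
    return entries
```

Keeping safety as pass/fail rather than a score makes policy violations impossible to average away.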
7.3 Root-run vs child-run feedback
Use both whenever possible:
- Root run feedback measures the end-user outcome (“did the system solve the request?”).
- Child run feedback diagnoses specific stages (retrieval quality, tool accuracy, prompt adherence).
Should feedback be human-only?
No. Combine both: humans define the rubric and label samples, while automated graders (validators, LLM-as-judge) scale scoring across traffic and releases.
8) Datasets & examples: build regression tests for your agent
Datasets are where LangSmith becomes a release machine. A dataset is a curated set of examples that represent what users actually need. When you change prompts, models, tools, or retrieval settings, you run evaluations on the dataset to catch regressions before shipping.
8.1 What belongs in an example?
At minimum, an example needs inputs. Optionally it can include reference outputs (ground truth), plus metadata such as: domain, difficulty, language, required tools, or expected output schema. Even without ground truth, you can evaluate with rubrics and validators.
8.2 The best way to build datasets: harvest from production traces
Instead of inventing examples, harvest them:
- Query recent production root runs for your target feature (e.g., “checkout”).
- Filter out sensitive content (redact, hash, or drop fields).
- Convert run inputs into dataset inputs.
- Add reference outputs for high-value tasks where you have authoritative answers.
- Tag examples with metadata: difficulty, language, tool requirements, or known failure type.
This ensures your evaluation set mirrors reality. It also makes your model upgrades safer: you can run the same “real-world” examples against a new model before you deploy it.
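The harvesting steps above reduce to a small transform. In this sketch, redact is a placeholder for your real redaction layer, and the SDK calls that would upload the result are shown commented (verify their signatures against the SDK docs):

```python
def runs_to_examples(runs: list[dict], redact) -> list[dict]:
    """Convert production root runs into dataset examples.

    `redact` maps a raw inputs dict to a privacy-safe one (drop, hash,
    or mask fields); reference outputs are added later for high-value tasks.
    """
    return [
        {
            "inputs": redact(run["inputs"]),
            "outputs": None,  # fill with authoritative answers where available
            "metadata": {"source_run_id": run["id"], "feature": run.get("feature")},
        }
        for run in runs
    ]

# Upload with the SDK (requires credentials; check parameter names in the docs):
# from langsmith import Client
# client = Client()
# dataset = client.create_dataset("checkout-regressions")
# client.create_examples(dataset_id=dataset.id,
#                        inputs=[e["inputs"] for e in examples],
#                        metadata=[e["metadata"] for e in examples])
```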
8.3 Dataset sizing and iteration
You don’t need thousands of examples to start. Many teams get real value with 50–200 examples. The secret is coverage: include easy cases, typical cases, and “problem cases” where your agent has historically failed. Over time, add examples based on: support tickets, production errors, or newly shipped features.
How do I avoid overfitting to my dataset?
Keep the dataset alive: regularly add fresh examples from production, rotate in new failure cases, and avoid tuning prompts against the same fixed examples without expanding coverage.
9) Evaluations & experiments: compare versions like you compare video cuts
Evaluations answer the practical questions that matter for shipping: “Is version B better than version A?” and “Did we regress?” In LangSmith, evaluations can be run offline on datasets (release gating) and augmented with online feedback signals (monitoring).
9.1 Two evaluation modes
Offline evals (pre-release)
Run a dataset through two versions, score outputs, compare metrics and failure categories, and block regressions.
Online eval signals (post-release)
Attach feedback to real traffic, monitor drift, detect quality drops, and target improvements by segment.
9.2 What to evaluate (a prioritized list)
- Task success: did the agent produce a usable result?
- Correctness: is it factually correct or consistent with retrieved sources?
- Tool accuracy: correct tool selection and correct arguments.
- Retrieval relevance: relevant docs retrieved; minimal irrelevant context.
- Output validity: if you require JSON or schema, does it validate?
- Latency/cost: speed and expense, especially in multi-step agents.
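Output validity is the easiest evaluator to start with because it needs no ground truth. A sketch that checks an output parses as a JSON object with required keys (the returned dict shape is illustrative, modeled on a key/score/comment feedback entry):

```python
import json

def json_validity_evaluator(output_text: str, required_keys: set[str]) -> dict:
    """Score 1.0 when the output parses as a JSON object containing all
    required keys; otherwise 0.0 with a comment explaining the failure."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError as exc:
        return {"key": "output_validity", "score": 0.0,
                "comment": f"invalid JSON: {exc}"}
    if not isinstance(parsed, dict):
        return {"key": "output_validity", "score": 0.0,
                "comment": "top level is not an object"}
    missing = required_keys - parsed.keys()
    if missing:
        return {"key": "output_validity", "score": 0.0,
                "comment": f"missing keys: {sorted(missing)}"}
    return {"key": "output_validity", "score": 1.0, "comment": None}
```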
9.3 “Experiment hygiene” (how to interpret results)
A single average score can hide important regressions. Use layered analysis:
- Check overall score shifts.
- Break down by category: language, domain, difficulty, tenant segment, tool usage.
- Inspect failure cases and label root causes.
- Fix one root cause at a time (prompt, retrieval, tool guardrails, routing).
- Re-run evaluations and confirm you improved without breaking other categories.
Release gating like a “final cut”
Before deploying a new agent version, run your offline evaluation suite. If the metrics are worse or the failure types increase, treat it as a bad edit: revise and re-run until the final cut is better.
- Use dataset coverage to represent real user scenes.
- Score consistently using a rubric and grader versions.
- Inspect traces for failures (not just averages).
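A release gate then reduces to comparing baseline and candidate metrics; the threshold here is illustrative and should be tuned per metric:

```python
def release_gate(baseline: dict[str, float], candidate: dict[str, float],
                 max_regression: float = 0.02) -> tuple[bool, list[str]]:
    """Block a release when any shared metric regresses by more than
    `max_regression` (absolute). Returns (ok, list of failing metrics)."""
    failures = [
        f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f}"
        for metric in baseline
        if metric in candidate
        and candidate[metric] < baseline[metric] - max_regression
    ]
    return (not failures, failures)
```

Wire this into CI so a failing gate blocks the deploy, then inspect the failing traces rather than just the averages.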
10) OpenTelemetry (OTel): unify LLM traces with your system telemetry
OpenTelemetry is a widely adopted standard for traces and spans. If your infrastructure already emits OTel telemetry, integrating LangSmith with OTel can give you a unified view: LLM calls and agent steps alongside API latency, database queries, and downstream service performance.
10.1 Why OTel matters for agent systems
- Cross-service correlation: connect a user request to every microservice hop and every agent step.
- Standard attributes: use consistent naming across language stacks.
- Multiple destinations: route telemetry to LangSmith plus your APM (depending on architecture).
10.2 Safe attribute design
Treat attributes like logs: don’t include secrets or raw PII. Prefer hashed identifiers and redacted content. If you need content for debugging, keep it limited and restrict access. A good compromise is: store full content only in dev/staging, and in prod store derived metrics plus a small sample of redacted content.
11) Rate limits & performance: ingest fast, query smart
Observability APIs typically enforce rate limits for stability. In LangSmith, the practical pattern is: ingest endpoints are built for throughput (especially batch/multipart) while run querying can be more restrictive. Your goal is to design for throughput without harming user latency.
11.1 Endpoint throughput patterns
| Endpoint (example) | Auth type | Throughput pattern | What it’s for |
|---|---|---|---|
| POST /runs/batch | x-api-key | High throughput (batch ingest) | Send many run records efficiently |
| POST /otel/v1/traces | x-api-key | High throughput (OTel ingest) | Export OTel traces/spans to LangSmith |
| POST /runs/multipart | x-api-key | Very high throughput (multipart ingest) | Efficient ingest for large payloads |
| POST /runs/query | x-api-key | Lower throughput (query) | Query/filter runs via API |
| POST /runs/query | x-user-id + IP | Higher than API query in some cases | Query runs in “user” context |
11.2 Performance best practices
- Batch ingest rather than sending one HTTP request per run.
- Async flush so user requests aren’t slowed by telemetry network calls.
- Retry with jitter on transient failures (especially 429 and timeouts).
- Narrow query windows: prefer “last hour/day” over “last month” for dashboards.
- Cache dashboard queries if your UI repeatedly requests the same slices.
What should I do on HTTP 429?
Back off: retry with exponential backoff plus jitter, switch to batch ingest, and reduce request frequency. Persistent 429s usually mean you should batch more aggressively or sample.
12) Deployment workflows: ship agents with confidence
The LangSmith API becomes even more valuable when you connect it to your deployment process. The pattern is: trace everything in dev/staging, build datasets from real usage, run offline evaluations before releases, and monitor online feedback after releases. That gives you a continuous loop: observe → evaluate → improve → ship.
12.1 A practical CI/CD checklist
- On every PR, run a small eval subset (fast feedback).
- Before a release, run the full dataset suite and compare to the current production baseline.
- Block deploy if key metrics regress or failure labels spike.
- After deploy, temporarily increase tracing and monitoring to catch new issues.
- Rotate keys and confirm the ingest pipeline health for the new release.
12.2 Versioning strategy
Don’t rely on memory or “it feels better.” Put versions into metadata:
app.version, prompt.version, retriever.version, and toolset.version.
When something breaks, you’ll know exactly which change caused it.
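A sketch of stamping these versions once at startup so every run can carry them; the environment variable names are assumptions to be replaced with whatever your build system actually sets:

```python
import os

def version_metadata() -> dict:
    """Collect version identifiers once so every trace can carry them.

    The environment variable names are illustrative; wire in whatever
    your build pipeline sets (e.g. the release git sha).
    """
    return {
        "app.version": os.environ.get("APP_VERSION", "unknown"),
        "prompt.version": os.environ.get("PROMPT_VERSION", "unknown"),
        "retriever.version": os.environ.get("RETRIEVER_VERSION", "unknown"),
        "toolset.version": os.environ.get("TOOLSET_VERSION", "unknown"),
    }
```

Merge this dict into the metadata of every root run so regressions can be pinned to a specific change.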
13) Security & privacy: log enough to debug, not enough to leak
Traces can contain sensitive content: user messages, internal documents retrieved for RAG, tool arguments with identifiers, and model outputs. The safest approach is to build privacy into your instrumentation from day one.
13.1 Redaction strategy
Implement redaction in your app before sending traces:
- Mask secrets (API keys, tokens, passwords).
- Redact PII (emails, phone numbers, addresses) when feasible.
- Hash user identifiers and tenant identifiers with a salt.
- Separate content logging from metrics logging.
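The redaction steps above can start as a small pure function applied before any trace leaves the process. The patterns here are deliberately simple sketches; extend them for your data:

```python
import hashlib
import re

# Simple email matcher; production redaction should cover more PII types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_for_tracing(text: str, user_id: str, salt: str) -> tuple[str, str]:
    """Mask emails in content and salted-hash the user identifier.

    Returns (redacted_text, hashed_user_id). Keep the salt in your
    secret manager, not in code.
    """
    redacted = EMAIL_RE.sub("[email]", text)
    hashed = "hash:" + hashlib.sha256((salt + user_id).encode()).hexdigest()[:12]
    return redacted, hashed
```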
13.2 Access control mindset
Restrict access to raw trace content. Many teams allow broad access to metrics dashboards but restrict raw content views to a small set of engineers or reviewers. Treat trace content like production logs: useful, powerful, and potentially risky.
How do I handle sensitive documents in RAG?
Redact or hash sensitive fields before tracing, prefer logging document identifiers over full document content, and restrict access to any raw retrieved text you do record.
14) Troubleshooting: common problems and fast fixes
14.1 401 / 403 (auth or permissions)
- Confirm the header name and key value (X-Api-Key / x-api-key).
- If your key can access multiple workspaces, set the tenant/workspace header or configuration.
- Verify the key hasn’t been revoked and has the needed permissions (read vs write).
- Confirm your server clock is sane if you use any signed time-based auth in surrounding infrastructure.
14.2 Traces missing or incomplete trees
- Confirm tracing flags are enabled and the project name is correct.
- Ensure background flushing completes (process may exit early in scripts).
- Check that parent/child linking is consistent (especially for custom REST ingest).
- Reduce sampling temporarily to confirm you are not skipping runs.
14.3 429 rate limits
- Batch ingest and reduce request frequency.
- Retry with exponential backoff + jitter.
- Narrow query windows and cache repeated slices.
60-second integration checklist
1) Run a minimal traced function (for example, the hello_trace quickstart).
2) Print your tracing/project config (without printing secrets).
3) Verify the run appears in the UI (filter to the last 5 minutes).
4) If not, run a curl test to the API host with the same API key header.
5) If multi-workspace, add the tenant/workspace header and retry.
15) FAQ: LangSmith API
Is LangSmith only for LangChain apps?
No. The SDKs and the REST API work with any application or framework; LangChain and LangGraph simply get the most automatic tracing.
Do I need datasets and evaluations immediately?
No. Start with tracing, add feedback as you review traces, then build datasets from real traffic once you know which cases matter.
What’s the best way to keep the system fast?
Batch your ingest, flush traces asynchronously, and sample where appropriate so telemetry never blocks user requests.
How should I name projects?
A common pattern is product-env, like assistant-dev, assistant-staging, assistant-prod.
You can also split by feature if your org is large, but avoid too many projects early.