1) What is a LangSmith trace?
A trace in LangSmith represents a single “operation” end-to-end: most commonly a user request, a background job, an evaluation run, or a scheduled agent workflow. In practical terms, a trace is the unit you open in the UI when you want to answer: “What happened, in what order, and why did we get this output?”
LangSmith traces are built from smaller pieces called runs (also described as spans). If you have experience with OpenTelemetry, the mapping is intuitive: a trace is a collection of spans, where each span is one step in the overall operation.
One-sentence definition: A trace is a collection of runs/spans for a single operation, bound together by a shared trace ID, so you can inspect an operation as a coherent timeline.
What counts as “one operation”?
In real systems, “one operation” depends on your product. A chat assistant might define one trace as one user message and the system’s response. A document processing pipeline might define one trace as “ingest one PDF and produce one summary + embeddings + metadata.” A multi-agent planner might define one trace as “complete one task,” even if it takes dozens of sub-steps and tool calls.
The point isn’t to force a single perfect definition—it’s to choose a trace boundary that makes debugging and evaluation meaningful. Too small and you lose context (you can’t see the whole chain of decisions). Too large and your traces become noisy and expensive, and you’ll spend time hunting for the “interesting” parts.
What you see in the UI when you open a trace
The LangSmith UI typically presents a trace as:
- A timeline/tree of runs (root run → children)
- Inputs/outputs at each step
- Timing information and (often) token/cost metadata when available
- Tags, metadata, errors/exceptions, and attachments such as retrieved documents
- Links to annotations, eval results, or monitoring dashboards depending on your setup
That structure is why traces are so useful: they convert “the model did something weird” into inspectable facts you can act on.
2) Runs and spans: how traces are structured
The building blocks of a trace are runs (a.k.a. spans). Each run captures one step in the execution of your application. A run has a start time, an end time, inputs, outputs, and optional metadata. Runs can be nested, which is how LangSmith represents complex agent behavior.
Root run (the “operation”)
A trace usually has a root run that represents the top-level operation: “handle user message,” “answer question,” “run evaluation,” etc. The root run’s children represent the steps that happened inside.
If you make just one LLM call and return the result, you might see a short trace: root run → LLM run → output parser run.
Child runs (the “steps”)
Child runs are the nested steps: LLM calls, tool calls, retrieval operations, formatting, re-ranking, routing, validators, and more. Nesting is the key to making the execution understandable.
When an agent loops, you’ll often see repeated patterns: plan → tool → observe → plan → tool → observe.
Trace IDs and run IDs
Runs are bound to a trace by a trace ID. That trace ID is how LangSmith knows “these spans belong together.” Within the trace, each run also has its own identifier so you can query, link, and reference specific spans.
This matters for production systems because you often want to connect LangSmith traces to external systems: your request IDs, user session IDs, error tracking events, A/B test cohorts, or incident tickets. The best practice is to store those as metadata/tags and also keep a stable external request ID that you can search for.
Why the run (span) data format matters
LangSmith stores trace data in a structured format designed to be easy to export and import. Understanding the shape of the data helps when you want to:
- Build internal dashboards beyond the UI
- Archive traces for compliance or long-term analytics
- Write scripts to triage failed runs
- Compare runs across versions of prompts or tools
If you ever feel like the UI is “just a viewer,” remember: traces are data, and you can treat them like a dataset.
3) Why tracing matters for LLM apps (debugging, evals, monitoring)
LLM apps fail differently than classic software. Instead of a stack trace and a deterministic exception, you get hallucinations, tool misuse, retrieval misses, instruction drift, formatting errors, latency spikes, cost blow-ups, and “it works on my prompt but not on theirs” behavior.
Tracing turns “mystery behavior” into inspectable evidence
- Debugging: see exact prompt, tool parameters, retrieved docs, parsing steps, and errors.
- Quality: connect outputs to eval scores, labels, and human feedback.
- Ops: monitor latency, failure rates, and high-cost traces; set alerts when patterns change.
- Iteration: compare traces across prompt versions or model swaps to identify what improved.
Tracing is not just for engineers
Strong organizations turn traces into a shared artifact:
- Product reviews: “Show me 10 traces where the agent failed on refunds.”
- Support triage: “Here’s the trace for customer ticket #18493.”
- Safety audits: “Here are traces that triggered a policy rule.”
- Model governance: “What changed when we upgraded to a new model?”
A good trace is a story: the context the system saw, the decisions it made, and the outcome it produced.
4) How to send traces to LangSmith
LangSmith supports multiple ways to capture traces, depending on your stack:
A) Native LangChain tracing
If you’re using LangChain Runnables, you can enable tracing via environment configuration and/or runtime config. This is usually the fastest path: you write normal LangChain code and LangSmith records the run tree.
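As a sketch, the environment-based setup usually looks like the following. The exact variable names have changed across SDK versions (older setups used `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY`), so verify against your SDK’s documentation:

```shell
# Enable LangSmith tracing for LangChain code via environment variables.
# Variable names differ by SDK version; check the current docs.
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-api-key>"   # from your LangSmith settings
export LANGSMITH_PROJECT="myapp-dev"        # project to file traces under
```

With these set, ordinary LangChain code produces run trees in the named project without code changes.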
B) LangGraph tracing
If you’re building multi-step agent graphs with LangGraph, you can trace graph execution so each node and tool call becomes a run inside the trace.
C) REST API / OpenTelemetry
If you’re not in the LangChain ecosystem—or you want deeper control—you can send traces via the LangSmith REST API or instrument with OpenTelemetry and forward traces to LangSmith.
Tracing setup: the core idea
Regardless of the integration method, tracing is a pipeline:
- Create a trace context (trace ID, root run)
- Create nested runs/spans for steps (LLM, tool, retrieval, parse)
- Attach inputs/outputs/metadata
- Send the run tree to LangSmith (ideally asynchronously)
- View/query/export in UI or via SDK/API
Performance note (especially for API-based tracing)
When you send traces yourself (e.g., via REST), avoid blocking your user response on trace uploads. Synchronous trace posting can add latency and reduce reliability if the observability backend is slow or temporarily unavailable. The usual production pattern is: enqueue trace events → send in background → retry with backoff → drop if necessary.
Production principle: Observability must never be your bottleneck. Trace aggressively, but ship reliably.
Minimal mental model for “traceable” work
If you’re instrumenting custom functions, think in spans:
- Wrap each meaningful unit of work in a run/span.
- Use nesting to preserve causality: “this tool call happened because of this plan step.”
- Store the things you will need later: prompt, tool args, retrieved docs, errors, versions.
- Keep sensitive fields out or redacted (more on this later).
5) Projects: organizing traces so they stay usable
Projects are how you keep trace data from turning into a junk drawer. A good project structure makes it obvious: which environment produced the trace, which application produced it, and which experiment or version it belongs to.
Common project structures that work
By environment
Use separate projects for dev, staging, and prod. Production traces are precious for incident response and quality monitoring. Dev traces are noisy and often contain sensitive test content.
- proj: myapp-dev
- proj: myapp-staging
- proj: myapp-prod
By product or agent
If you run multiple agents, separate them. Otherwise your monitoring signals become ambiguous and your debugging time increases.
- proj: support-agent
- proj: sales-agent
- proj: doc-summarizer
Naming runs for readability
Within a trace, run names act like headings. Good names make the trace understandable at a glance. For example: “Retrieve policies,” “Call refund tool,” “Generate final reply,” “Validate JSON schema,” “Safety filter,” etc.
Naming is not cosmetic: it’s the difference between a trace that your team can read in 30 seconds and a trace that requires tribal knowledge and guessing.
6) What a trace typically contains (and what you should add)
A trace is only as useful as the data captured inside. At minimum, you want inputs and outputs for each run. In practice, high-quality traces contain additional fields that make analysis and debugging faster.
Core fields you’ll usually see
- Inputs: prompt, system instruction, tool arguments, retrieved docs, structured request.
- Outputs: model response, tool results, parsed structured output, final message.
- Timing: start/end timestamps and duration per run.
- Status: success/failure, error messages, exception type/stack where applicable.
- Trace/run IDs: stable identifiers for linking and querying.
High-leverage extras you should consider adding
Versioning metadata
Store prompt version, model version, tool version, and app build SHA. This is essential for regression analysis.
metadata: { prompt_version: "v12", model: "gpt-4.1", build: "a13f9c2" }
User & session context
Store a hashed user ID or session ID, plus important flags (locale, plan tier, channel). Avoid raw PII unless you have explicit permission and strong controls.
tags: ["locale:en-US","channel:web","tier:pro"]
Retrieval diagnostics
When doing RAG, log the query, top-k docs, doc IDs, scores, and any re-ranking results. Most RAG failures are “retrieved the wrong thing,” not “the model is dumb.”
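One way to make those diagnostics consistent is to build them with a small helper and attach the result as run metadata. A sketch, where the payload field names are illustrative rather than any LangSmith schema:

```python
def retrieval_diagnostics(query, docs, scores, k=5):
    """Build a metadata payload for a retrieval run.

    `docs` are (doc_id, text) pairs and `scores` their similarity
    scores. Field names are illustrative, not a fixed schema.
    """
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)[:k]
    return {
        "query": query,
        "top_k": k,
        "doc_ids": [doc_id for (doc_id, _), _ in ranked],
        "scores": [round(s, 4) for _, s in ranked],
    }

diag = retrieval_diagnostics(
    "refund policy for pro tier",
    docs=[("doc-7", "..."), ("doc-2", "..."), ("doc-9", "...")],
    scores=[0.81, 0.94, 0.40],
    k=2,
)
# Highest-scoring docs come first, so a wrong retrieval is visible at a glance.
```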
Token/cost metadata (when available)
For budgeting and performance work, token counts and per-call cost fields can be extremely useful. If your stack provides token usage in responses, capture it on the corresponding run. If you don’t have this data, you’ll end up guessing which parts of your agent are expensive.
Redaction and safety
Traces can contain sensitive user content. Before you trace in production, decide what you will store and what you will redact. Many teams use:
- Field-level redaction: remove or mask PII in inputs/outputs.
- Selective tracing: only trace a sample of requests, or only trace error cases.
- Separate projects: isolate sensitive traces and restrict access.
- Short retention: keep most traces in base retention and only upgrade “golden” cases.
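Field-level redaction can be as simple as masking known PII patterns before a payload leaves your process. A minimal sketch; the two regexes below are illustrative and nowhere near exhaustive, so a real policy needs review (names, addresses, account numbers, and so on):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask obvious PII patterns. Illustrative, not exhaustive."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def redact_run_payload(payload: dict) -> dict:
    """Apply redaction to every string field of a run's inputs/outputs."""
    return {k: redact(v) if isinstance(v, str) else v
            for k, v in payload.items()}

safe = redact_run_payload({"prompt": "Refund order for jane@example.com"})
```

Running redaction at the point where you assemble run payloads (rather than in the handler) keeps the policy in one place.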
7) Pricing fundamentals: base traces and included monthly amounts
If you’re reading this page, you likely searched “LangSmith traces” because you want to understand what you’re being billed for. The key is to separate:
- What is counted: traces (collections of runs), with pricing expressed in “base traces”
- What is included: a monthly included amount per plan
- What changes cost: volume beyond included + retention upgrades
Included base traces by plan (self-serve)
| Plan | Included base traces / month | After included amount | Notes |
|---|---|---|---|
| Developer | Up to 5,000 base traces / month | Pay-as-you-go beyond included | Billing setup removes the 5k rate limit; overages are charged at the pricing-page rate. |
| Plus | Up to 10,000 base traces / month | Pay-as-you-go beyond included | Team orgs include 10k traces per month before overage rates apply. |
| Enterprise | Custom | Custom | Contracts may include different bundled usage and retention options. |
Overage rate (base traces)
Official pricing describes base trace overage as $0.50 per 1,000 base traces. A practical way to think about it is: once you exceed included traces, each additional trace has a small per-trace fee—then retention can multiply it.
Budgeting shortcut: Estimate monthly traces (N). If N exceeds included, overage cost is roughly (N − included) / 1000 × $0.50 for base traces, before considering retention upgrades.
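That shortcut, as a tiny function. The rate and included amounts mirror the figures quoted in this section; confirm them against the current pricing page before budgeting:

```python
def base_trace_overage(monthly_traces: int, included: int,
                       rate_per_1k: float = 0.50) -> float:
    """Overage cost in dollars for base traces, before retention
    upgrades. Rates here mirror the text; verify on the pricing page."""
    extra = max(0, monthly_traces - included)
    return extra / 1000 * rate_per_1k

# Example: 40,000 traces/month on a Plus plan (10k included)
# leaves 30,000 overage traces.
cost = base_trace_overage(40_000, included=10_000)
```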
What counts as “one trace” for billing?
In LangSmith, pricing is stated in terms of traces rather than runs/spans. A trace corresponds to one top-level operation (one trace ID). That means an agent trace might contain many runs (LLM + tools + retrieval), but it still counts as one trace at the “base trace” level.
That’s good news: a richly instrumented trace doesn’t automatically mean you pay per span. However, richly instrumented traces may encourage you to trace more operations, and high throughput systems can generate many traces quickly—so the biggest cost driver is typically request volume and retention, not how many spans you have within each trace.
How billing limits affect tracing behavior
The billing docs describe a rate limit on personal organizations (5k traces/month) until a card is added, and that team organizations have an initial 10k traces/month included. This means if you want to run high-volume load tests or production traffic, you should plan billing setup early—otherwise you might hit trace limits mid-experiment.
8) Retention: base (14 days) vs extended (400 days)
Retention is the second major axis of trace cost and governance. Official support documentation describes two fixed retention periods:
- Base retention: 14 days
- Extended retention: 400 days
Why retention exists (and why it’s not configurable)
Retention affects storage and indexing cost. Many teams want “keep everything forever,” but that’s rarely necessary. In fact, if you keep everything, you’ll stop looking at the data because the signal-to-noise ratio collapses.
Today, retention is offered as two fixed tiers rather than arbitrary configuration. If you need custom retention, support documentation suggests workarounds like programmatic deletion after your desired period, implemented as an automated job.
Automatic upgrades to extended retention
Retention is also influenced by your automation setup. If an automation rule matches any run within a trace, the trace can be auto-upgraded to extended retention. This is intentional: teams often want to preserve exactly the traces that match certain criteria (errors, high latency, low eval scores, certain tags, specific customers, etc.).
Important: Automations are powerful, but they can also increase extended retention usage. Treat rules like a cost lever: define them precisely and audit them periodically.
How to choose what gets extended retention
A sustainable retention strategy usually looks like this:
- Default: base retention for the majority of traffic.
- Upgrade: extended retention for high-value traces (failures, edge cases, golden datasets, audits).
- Delete: programmatically remove traces that must not be stored longer than policy allows.
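The default/upgrade/delete strategy above amounts to a small decision function you can run over trace summaries. A sketch; the field names and thresholds are illustrative policy knobs, not LangSmith attributes:

```python
def retention_decision(trace: dict) -> str:
    """Return 'delete', 'extended', or 'base' for a trace summary.
    Field names and thresholds are illustrative policy knobs."""
    if trace.get("contains_restricted_data"):
        return "delete"      # policy: must not outlive the base window
    if trace.get("error") or trace.get("eval_score", 1.0) < 0.5:
        return "extended"    # preserve failures and low-quality cases
    if trace.get("golden"):
        return "extended"    # curated examples for future evals
    return "base"            # the bulk of traffic stays cheap
```

Note the ordering: the delete rule wins even for failures, which is usually what a compliance policy requires.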
Retention and compliance
Retention decisions are not just cost decisions. They affect privacy and legal posture. If your application processes personal data, you should:
- Redact sensitive fields before tracing.
- Restrict access to sensitive projects/workspaces.
- Document the retention tier used per project.
- Implement deletion workflows if required by policy.
Retention and debugging velocity
Short retention can feel risky: “What if we need an old trace?” The fix is not “store everything for 400 days.” The fix is to identify which traces matter and preserve those. When you do this well, you end up with:
- A compact set of “golden” traces you can revisit
- A clear set of failure examples you can evaluate against
- Lower noise when you investigate current incidents
9) Querying and exporting traces (SDK + API patterns)
Once you have traces, the next step is: find the right ones. The recommended way to query the span data is to query runs (runs are the span objects inside traces). In other words: you filter runs by project, time range, tags, status, name, metadata fields, or trace ID—and then you can reconstruct or inspect the trace context.
Why “query runs” instead of “query traces”?
Runs are the atomic units that carry the details you filter on: run name, errors, model type, tool name, tags, and metadata. Traces are bundles. When you search “tool failed,” you’re really searching for runs where a specific tool run errored, and then you want the trace that contains it.
Common query patterns
Incident response
- Filter: last 30 minutes
- Filter: error status
- Group: by run name (e.g., “Call payment tool”)
- Open: top failing traces
Quality regression
- Filter: model version changed
- Filter: eval score below threshold
- Compare: traces before vs after deployment
- Extract: golden set of failures for fixes
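In the SDK, filtering like this is typically done with the client’s run-query methods (e.g. `Client.list_runs` in the Python SDK). To keep the example self-contained, here is the incident-triage logic sketched over plain run dicts, with illustrative field names:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def triage(runs, window_minutes=30):
    """Group recent errored runs by run name, most frequent first."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    errored = [r for r in runs
               if r["error"] and r["start_time"] >= cutoff]
    return Counter(r["name"] for r in errored).most_common()

now = datetime.now(timezone.utc)
runs = [
    {"name": "Call payment tool", "error": True, "start_time": now},
    {"name": "Call payment tool", "error": True, "start_time": now},
    {"name": "Generate final reply", "error": False, "start_time": now},
    {"name": "Retrieve policies", "error": True,
     "start_time": now - timedelta(hours=2)},   # outside the window
]
top = triage(runs)   # [("Call payment tool", 2)]
```

The output tells you which run name to open first; from any of those runs you pivot to the containing trace for full context.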
Exporting traces (why and when)
Exporting becomes important when you want to:
- Build internal BI dashboards from trace data
- Archive high-value traces into a long-term dataset
- Run offline analyses at scale
- Create reproducible evaluation corpora
A practical approach is to schedule a daily export for specific projects, then analyze in your data warehouse. If you do this, make sure you also export versioning metadata so you can compare changes over time.
Accessing the “current span” in custom code
Advanced tracing often requires injecting metadata into the currently active run. LangSmith provides helper functions in the SDKs to access the current run tree, which enables advanced workflows like tagging runs with request IDs, attaching additional artifacts, or dynamically changing trace naming.
Best practice: attach the external request ID and build SHA to the root run early, then inherit or reuse it on child runs as needed for consistent querying.
10) Production tracing patterns (sampling, async, privacy, and “don’t DDoS yourself”)
Production tracing has one rule: don’t break your product to record observability data. Everything else is implementation detail. Below are patterns that keep tracing useful and safe.
Pattern A: asynchronous ingestion
Send traces in the background. A clean architecture is:
- Capture runs/spans in memory during request handling
- Serialize to a queue (memory buffer, Redis, Kafka, etc.)
- Worker flushes to LangSmith with retries/backoff
- On failure, drop non-critical traces or sample down
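The queue-and-worker architecture above can be sketched in-process with the stdlib. A real pipeline would batch events, use a durable queue, and have `send` call the LangSmith ingestion API; here `send` is just a parameter:

```python
import queue
import threading
import time

trace_queue: "queue.Queue" = queue.Queue(maxsize=10_000)

def enqueue_trace(event: dict) -> bool:
    """Non-blocking: if the buffer is full, drop rather than stall."""
    try:
        trace_queue.put_nowait(event)
        return True
    except queue.Full:
        return False             # dropped under pressure

def flush_worker(send, max_retries=3):
    """Background worker: retry with exponential backoff, then give up."""
    while True:
        event = trace_queue.get()
        if event is None:        # sentinel: shut down
            return
        for attempt in range(max_retries):
            try:
                send(event)      # e.g. POST to the ingestion endpoint
                break
            except Exception:
                time.sleep(0.1 * 2 ** attempt)   # 0.1s, 0.2s, 0.4s
        trace_queue.task_done()

sent = []
worker = threading.Thread(target=flush_worker, args=(sent.append,),
                          daemon=True)
worker.start()
enqueue_trace({"trace_id": "t-1", "runs": []})
trace_queue.put(None)            # stop the worker
worker.join()
```

The key property: `enqueue_trace` never blocks the request path, so a slow or unavailable observability backend degrades tracing, not your product.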
Pattern B: sampling (trace less, learn more)
Sampling is not a compromise—it’s a strategy. Many systems use:
- Baseline sample: 1% of all requests
- Error sample: 100% of error requests
- Latency sample: 100% of slow requests
- Customer sample: 100% of whitelisted accounts during onboarding
This preserves visibility where it matters while keeping trace volume manageable and cost predictable.
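A sampling policy like the one above is often implemented as a deterministic hash of the request ID, so every service makes the same decision for the same request. A sketch with illustrative names and rates:

```python
import hashlib

def should_trace(request_id: str, *, error: bool, latency_ms: float,
                 vip_account: bool, baseline_rate: float = 0.01,
                 slow_threshold_ms: float = 5_000) -> bool:
    """Trace 100% of errors, slow requests, and flagged accounts;
    otherwise apply a deterministic ~1% baseline keyed on request ID."""
    if error or vip_account or latency_ms >= slow_threshold_ms:
        return True
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < baseline_rate

# Hash-based bucketing means the same request ID always gets the same
# decision, even when several services decide independently.
```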
Pattern C: “upgrade only the good stuff” retention strategy
Use base retention for the bulk of traffic. Then upgrade:
- Traces with low eval scores
- Traces with high user impact (refunds, payment flows, safety issues)
- Traces associated with escalated tickets
- Traces selected for golden datasets
Pattern D: privacy-first tracing
Privacy-first tracing is a combination of:
- Redaction: remove PII before sending to LangSmith
- Least privilege: limit who can view sensitive projects
- Isolation: separate projects/workspaces for sensitive workflows
- Deletion: implement trace deletion if required by policy
Pattern E: version every change that affects outputs
Many trace investigations fail because teams can’t answer: “Which prompt/model/tool version produced this output?” Always log versions:
- Prompt version or commit hash
- Model name and provider
- Tool version and schema version
- App build SHA / container image tag
11) Best practices checklist (copy/paste into your runbook)
Trace design
- Define “one operation” clearly (one trace = one user request, job, or task).
- Make a root run with a descriptive name.
- Use nested runs for steps; keep naming consistent.
- Attach external request/session IDs early.
Data & governance
- Redact PII and secrets; avoid storing raw sensitive content by default.
- Store versions: prompt, model, tools, build SHA.
- Separate projects by env (dev/staging/prod).
- Restrict access to sensitive traces; audit permissions.
Cost control
- Estimate monthly traces; compare to included amounts (5k Developer / 10k Plus).
- Use sampling; trace 100% of errors, not 100% of everything.
- Keep most traces at base retention; upgrade only valuable cases.
- Audit automation rules that upgrade retention.
Reliability
- Send traces asynchronously; never block user latency on tracing.
- Retry with backoff; drop non-critical traces if under pressure.
- Implement rate limiting on trace upload pipeline.
- Monitor tracing failures separately from app failures.
Rule of thumb: Trace enough to explain failures, measure quality, and monitor drift—then use sampling + retention to keep the signal strong and the cost predictable.
12) FAQ: LangSmith traces
Does LangSmith charge per run/span or per trace?
How many traces are included on Developer and Plus?
What’s the safest way to trace in production?
Why are some traces stored much longer than others?
Can I set a custom retention period like 30 or 90 days?
How do I query traces programmatically?
13) Official references (verify current behavior here)
Pricing, billing, and retention can evolve. Use these official pages as the source of truth:
- LangSmith Plans & Pricing
- Manage billing in your account
- Observability concepts (Traces/Runs)
- Automation rules (retention upgrades)
- Retention periods (support)
- Query/export traces (runs/query)
- Trace with API (REST)
- Run (span) data format
Educational note: This page is an independent guide. Confirm pricing/limits/retention rules on the official pages above.