AssemblyAI API: A Complete Guide to Speech-to-Text, Streaming, LeMUR, and Production Voice AI

AssemblyAI is a developer-focused Voice AI platform: you send audio (files or live streams), receive high-quality transcripts, and optionally layer “speech understanding” and generative features on top (summaries, chapters, redaction, diarization, entity detection, etc.). This page is a practical guide: how the API is structured, how to implement both async and streaming transcription, how LeMUR fits in, what the limits are, what pricing really looks like, and how to ship reliably in production.

This guide is intentionally “builder-first.” It explains what to do, what can go wrong, and how to design a robust system (uploads, polling vs webhooks, retries, rate limits, and security).

1) What the AssemblyAI API is

Overview

The AssemblyAI API is a set of endpoints and models for converting speech into text and extracting structured insights from audio. Think of it as two layers:

  • Speech-to-Text (STT): turn audio into words, timestamps, and speaker segments.
  • Speech Understanding: take the transcript and generate “meaningful outputs” like summaries, chapters, entities, and redaction.

As a developer, you typically use the API in one of two modes:

Async transcription (files)

You submit an audio file (or an audio URL) to /v2/transcript, the job runs in the background, and you fetch the transcript later (or receive a webhook). This is ideal for recorded meetings, podcasts, call recordings, and video archives.

Streaming transcription (live audio)

You open a WebSocket connection and stream audio chunks. You receive partial and final transcripts in near real-time. This is ideal for voice agents, live captions, and real-time assistants.

What makes AssemblyAI “developer friendly”?

The core patterns are simple: upload (optional) → submit transcript job → wait (poll/webhook) → use results. Most complexity comes from production details: retries, rate limits, costs, and security. This guide focuses on those details so you can ship.
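That submit step can be sketched with Python's standard library. The endpoint and header follow the curl examples later in this guide; the helper names (`build_transcript_request`, `submit_transcript`) are illustrative, not SDK functions.

```python
import json
import urllib.request

API_KEY = "<YOUR_API_KEY>"  # keep this server-side, never in browser code
BASE = "https://api.assemblyai.com"

def build_transcript_request(audio_url, **features):
    """JSON body for POST /v2/transcript; optional feature flags
    (e.g. speaker_labels=True) are merged into the body."""
    body = {"audio_url": audio_url}
    body.update(features)
    return body

def submit_transcript(audio_url, **features):
    """Create a transcription job and return its id."""
    req = urllib.request.Request(
        f"{BASE}/v2/transcript",
        data=json.dumps(build_transcript_request(audio_url, **features)).encode(),
        headers={"Authorization": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

Keeping the body builder separate from the HTTP call makes feature flags easy to test and audit before you spend money on a job.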

Base URL and API shape

The REST API accepts and returns JSON and is hosted at https://api.assemblyai.com. You authenticate by sending an Authorization header containing your API key.

2) Key concepts (jobs, transcripts, models, and add-ons)

Concepts

Before writing code, it helps to align on what the API actually returns and how the pieces fit together. This prevents common mistakes like “I uploaded audio but nothing happened” or “My transcript doesn’t include speakers.”

Transcript job vs transcript result

When you call /v2/transcript, you are creating a transcription job. The API returns an id and an initial status. You then fetch the transcript later using the id until the status becomes completed (or error).

Speech models and options

The submit endpoint supports specifying speech model options and many feature flags (for example: diarization, chapters, summarization, content safety, custom spelling, etc.). The key idea is: if you want extra outputs, enable them at submission time so the job produces them.

Feature compatibility matters

Some features are mutually exclusive. For example, AssemblyAI’s docs note you can only enable one of Summarization and Auto Chapters on the same transcription. That is not a bug; it is a product constraint you should design around (e.g., run two jobs, or choose the one you need).

Transcription outputs

Text, confidence, timestamps (word/sentence/paragraph), utterances, speaker labels, etc.

Understanding outputs

Summary, chapters, entities, redaction, and other “structured meaning” layers.

LLM layer (LeMUR)

Generative tasks on transcripts: summarize, Q&A, action items, custom prompts.

3) Quickstart: transcribe a public audio URL (fastest path)

Quickstart

The simplest way to start is to provide an audio file hosted at a public, direct URL (not a webpage). You submit that URL to the transcript endpoint and then poll until it completes. This avoids the upload step, which is useful while you’re experimenting.

Step A: Create a transcript job

The request below is representative of the official submit endpoint: POST https://api.assemblyai.com/v2/transcript.  Replace <YOUR_API_KEY> with your real key.

curl -X POST https://api.assemblyai.com/v2/transcript \
  -H "Authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3"
  }'

Step B: Check status until completed

The API returns an id. Use it to fetch status:

curl -X GET https://api.assemblyai.com/v2/transcript/<TRANSCRIPT_ID> \
  -H "Authorization: <YOUR_API_KEY>"

Keep checking until the status becomes completed. AssemblyAI documents typical statuses as: queued, processing, completed, error. 

Production tip: don’t poll forever

Polling is okay for quick experiments, but for production you’ll generally prefer webhooks so you don’t waste requests and you can handle large volumes cleanly. (Webhooks are covered below.)
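For the experiments where polling is fine, a small loop with a growing delay and a hard timeout avoids the two classic mistakes (hammering the API every second, and waiting forever). The fetch function is injected so the loop itself is easy to test; the status names match the lifecycle documented below.

```python
import time

TERMINAL = {"completed", "error"}

def poll_until_done(fetch_status, max_wait_s=3600, base_delay_s=3.0, max_delay_s=30.0):
    """Poll fetch_status() (a callable returning the transcript JSON)
    until the job reaches a terminal status, doubling the delay between
    checks up to max_delay_s, and giving up after max_wait_s."""
    waited, delay = 0.0, base_delay_s
    while True:
        result = fetch_status()
        if result["status"] in TERMINAL:
            return result
        if waited >= max_wait_s:
            raise TimeoutError("transcript did not finish in time")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)
```

In production you would replace `fetch_status` with a GET to /v2/transcript/<id>; in tests you can pass a stub.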

4) Authentication & headers (REST + streaming)

Auth

AssemblyAI uses API key authentication for REST and WebSocket streaming. The typical pattern is to send an Authorization header with your key. For REST calls, the upload endpoint explicitly shows this header as required. 

REST auth (most endpoints)

  • Header: Authorization: <YOUR_API_KEY>
  • Content-Type: application/json for most endpoints; application/octet-stream for /v2/upload.

Streaming auth (WebSocket)

Streaming uses a WebSocket URL and can authenticate either via an Authorization header (API key) or via a temporary token query parameter, per the Streaming API reference. 

Keep keys server-side

Treat your AssemblyAI API key as a secret. Do not embed it in client-side JavaScript on public websites. Put it on your server, use environment variables, and proxy requests when needed.

5) Uploading audio vs using a public URL

Ingestion

You have two ways to provide audio for async transcription:

Option 1: Use a public audio URL

You pass audio_url directly to the transcript submit endpoint. It must be a direct media file URL accessible from AssemblyAI’s servers (not a page that requires cookies/logins). This is fastest for demos and tests.

Option 2: Upload a local file

You first upload the file to AssemblyAI using POST /v2/upload, receive an upload_url, then use that URL in audio_url when creating the transcript. This is common for user-uploaded files and private recordings.

How file upload works

The upload endpoint is: POST https://api.assemblyai.com/v2/upload and expects Content-Type: application/octet-stream.  The response returns an upload_url. 

curl -X POST https://api.assemblyai.com/v2/upload \
  -H "Authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @path/to/file

Then submit a transcript using the returned URL:

curl -X POST https://api.assemblyai.com/v2/transcript \
  -H "Authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://cdn.assemblyai.com/upload/<YOUR_UPLOAD_ID>"
  }'

Common mistake: uploading without submitting a transcript

Uploading only stores the media and returns a URL. Transcription does not start until you create a transcript job. In production, treat “upload” and “transcribe” as two distinct pipeline steps.

6) Async Speech-to-Text pipeline (REST) — the production foundation

Async STT

Async STT is the most common way teams start using AssemblyAI, because most businesses already have recordings: customer calls, meetings, podcasts, interviews, training videos, lectures, and webinars. The API is job-based: create a transcript job, then fetch results.

The canonical async pipeline

  1. Ingest audio: public URL or upload a local file to get an upload_url.
  2. Create transcript job: POST /v2/transcript with options.
  3. Wait: poll status or use webhook_url to get notified.
  4. Consume results: store transcript text + structured outputs in your DB for downstream use.
  5. Post-process: optionally run LeMUR or your own LLM on the transcript.

Enabling features at submit time

The submit endpoint supports many optional fields. The docs show a request containing fields like auto_chapters, auto_highlights, content_safety, custom_spelling, and more.  The core idea: “turn on what you want to receive.”

Design rule: treat features as explicit cost & output decisions

Many features are add-ons. Decide what is essential for your product experience, then turn on only those features for those workloads (e.g., diarization for meetings, redaction for healthcare).

Handling partial audio (start/end offsets)

The submit endpoint also supports fields like audio_start_from and audio_end_at in examples.  This enables use cases like “transcribe only the first 10 minutes” or “skip the intro.” In production, this is a powerful cost-saving lever when your audio contains irrelevant sections.
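A minimal sketch of a submit body that transcribes only a window of the audio; the offset fields are taken from the docs examples and are in milliseconds, and the helper name is illustrative.

```python
def windowed_request(audio_url, start_ms=None, end_ms=None):
    """Submit body that transcribes only [start_ms, end_ms] of the audio.
    audio_start_from / audio_end_at are millisecond offsets per the
    docs examples."""
    body = {"audio_url": audio_url}
    if start_ms is not None:
        body["audio_start_from"] = start_ms
    if end_ms is not None:
        body["audio_end_at"] = end_ms
    return body

# "transcribe only the first 10 minutes"
first_ten = windowed_request("https://example.com/call.mp3", end_ms=10 * 60 * 1000)
```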

7) Transcript statuses & lifecycle (queued → processing → completed)

Lifecycle

Transcript status is the simplest (and most important) state machine in the platform. After submitting a job, it transitions through statuses such as: queued, processing, completed, and error. 

  • queued: the job is waiting to be processed. Do: wait; avoid aggressive polling. Pitfall: assuming “queued” means stuck; it may just be load.
  • processing: the job is actively transcribing. Do: wait; show progress in the UI if needed. Pitfall: polling too often and burning requests.
  • completed: the transcript is ready. Do: fetch and store the results. Pitfall: not storing outputs and re-fetching repeatedly.
  • error: an error occurred. Do: inspect the error message; retry only if appropriate. Pitfall: retrying blindly without fixing the underlying issue.

Why jobs fail

The docs mention common causes such as corrupted or unsupported audio, providing a webpage URL instead of a file, or a URL not accessible from AssemblyAI’s servers.  In production, validate audio early (format, duration, accessibility) to reduce error rates.

8) Webhooks vs polling (and when to use each)

Webhooks

Polling is easy: keep requesting transcript status until it completes. But polling becomes expensive and noisy at scale. Webhooks let AssemblyAI call your server when the transcript is ready. The official docs describe webhooks as custom HTTP callbacks you define to get notified when transcripts are ready. 

When polling is okay

  • Prototyping or a hackathon.
  • Low volume transcription (a few jobs/day) with no strict efficiency requirements.
  • Local scripts and personal automation.

When to use webhooks

  • Anything production that might scale beyond a small trickle.
  • Background processing pipelines (queues, workers, job systems).
  • When you want reliable completion events without tight polling loops.

Webhook design best practices

  1. Idempotency: treat webhook deliveries as “may repeat.” If you receive the same completion event twice, do not double-bill or double-store.
  2. Verification: verify requests (shared secret / signature pattern if provided, or allowlist IPs if your infra supports it).
  3. Async handling: accept quickly (200 OK), then process in background (queue) so webhook delivery doesn’t time out.
  4. Safe retries: if your handler fails, you should be able to re-fetch transcript results later using the transcript ID.
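The four practices above can be condensed into a tiny handler core. This is a sketch, not the official webhook contract: the `transcript_id` payload field and the return-a-status-code shape are assumptions to check against your webhook configuration.

```python
def handle_webhook(payload, seen_ids, enqueue):
    """Minimal webhook handler core: validate, dedupe, enqueue heavy
    work, and acknowledge fast. Returns the HTTP status code to send."""
    transcript_id = payload.get("transcript_id")
    if not transcript_id:
        return 400  # malformed delivery
    if transcript_id in seen_ids:
        return 200  # duplicate delivery: acknowledge, do nothing
    seen_ids.add(transcript_id)
    enqueue(transcript_id)  # a background worker fetches the transcript
    return 200
```

In a real service `seen_ids` would be a database constraint or cache, not an in-memory set, but the control flow is the same.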
Practical pattern

Webhooks should trigger a “fetch transcript details” job in your system. The webhook itself should not do heavy work. It’s a notification, not a batch processor.

9) Realtime Streaming Speech-to-Text (WebSockets)

Streaming

Streaming transcription is for live audio. Instead of uploading a file, you open a persistent WebSocket connection and send audio frames as they are recorded. AssemblyAI returns partial and final transcripts over the same connection.

WebSocket handshake and endpoint

The Streaming API reference lists the handshake URL as: wss://streaming.assemblyai.com/v3/ws.  It also notes: to use the EU server, replace the host with streaming.eu.assemblyai.com. 

Required parameters

The handshake requires at least sample_rate. There are options like encoding (e.g., PCM 16-bit little-endian or mu-law), and settings that influence turn detection and formatting. 

Authentication choices

You can authenticate with an API key in the Authorization header, or generate a temporary token and pass it via a query parameter, per the docs. 

Streaming is a product design decision

Streaming isn’t only a “technical” upgrade. It changes user experience: you can display words as they’re spoken, detect turn ends, and drive a live UI. It also changes infrastructure: you need stable connections, backpressure handling, and failure recovery (reconnect strategy).
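Building the handshake URL is the first concrete step. The host names and the required sample_rate parameter come from the streaming reference; the `token` query-parameter name for temporary tokens is an assumption to verify against the docs.

```python
from urllib.parse import urlencode

def streaming_url(sample_rate, eu=False, token=None, **params):
    """Build the v3 streaming handshake URL. sample_rate is required;
    extra handshake params (e.g. encoding) pass through, and a
    temporary token can ride as a query parameter instead of an
    Authorization header."""
    host = "streaming.eu.assemblyai.com" if eu else "streaming.assemblyai.com"
    query = {"sample_rate": sample_rate, **params}
    if token:
        query["token"] = token
    return f"wss://{host}/v3/ws?{urlencode(query)}"
```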

10) Turn detection, partial vs final, endpointing (how streaming feels fast)

Realtime UX

In real-time speech recognition, users judge “speed” by how quickly they see partial text and how quickly the system correctly decides that someone finished speaking (end-of-turn detection / endpointing). AssemblyAI’s streaming API includes parameters that help tune this behavior, such as an end-of-turn confidence threshold. 

Partial vs final transcripts

  • Partial: fast, may change, used for live captions or immediate feedback.
  • Final: stable, used for saving, compliance, downstream NLP, and analytics.
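The partial/final distinction maps naturally onto a small client-side state holder: partials overwrite a mutable tail, finals commit. This is an illustrative UI-state sketch, independent of any particular message schema.

```python
class CaptionState:
    """Track live-caption text: partials overwrite the tail, finals commit."""

    def __init__(self):
        self.committed = []  # finalized turn texts (safe to store)
        self.partial = ""    # latest in-flight text; may still change

    def on_partial(self, text):
        self.partial = text

    def on_final(self, text):
        self.committed.append(text)
        self.partial = ""

    def display(self):
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```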

When to use formatted final transcripts

The streaming API includes an option to return formatted final transcripts. Use this when the final output is user-facing (subtitles, notes, transcripts shown in UI). If you’re using the transcript as raw NLP input, you may prefer minimal formatting and do your own post-processing.

Voice agent latency tip

In voice agents, “natural conversation” depends on turn detection. Tune threshold settings carefully: too sensitive and you cut users off; too slow and the agent feels laggy.

11) Speech Understanding: summaries, chapters, entities, redaction

Understanding

Speech-to-text gives you words. Speech understanding gives you structure and meaning. AssemblyAI provides several understanding models you can enable at transcription time, returning extra fields in the transcript response.

Common “Speech Understanding” features

Summarization

Generate a summary of the transcript. Controlled by summary model/type. 

Auto Chapters

Generate time-based chapters: headline, gist, summary, and timestamps. 

Entity Detection

Detect names, orgs, addresses, phone numbers, and other entities. 

PII Redaction

Remove or mask personal data in the transcript to protect privacy. 

Speaker Diarization

Segment transcript into speaker turns with speaker labels. 

Keyterms Prompting

Improve recognition for domain words (product names, jargon). 

The important architectural point: most of these are computed as part of the transcription job (or are tightly coupled to it). So you should decide and enable what you need when you create the transcript job, then store the resulting structured fields for fast UI rendering and analytics.

12) Speaker diarization and speaker identification

Speakers

Speaker diarization answers: “who spoke when?” It detects multiple speakers and returns a list of utterances (chunks of uninterrupted speech) per speaker.  In practice, this powers meeting transcripts, call center analytics, interviews, and panel discussions.

How to enable diarization

AssemblyAI’s diarization docs show enabling it by setting speaker_labels to true in the transcription config.  The transcript result then includes “utterances” with speaker labels.
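Once you have the utterances list, a common rendering step is collapsing consecutive utterances from the same speaker into turns. A sketch, assuming each utterance carries "speaker" and "text" fields as in the diarization docs:

```python
def speaker_turns(utterances):
    """Collapse an utterances list into (speaker, text) turns, merging
    consecutive utterances from the same speaker."""
    turns = []
    for u in utterances:
        if turns and turns[-1][0] == u["speaker"]:
            # same speaker kept talking: extend the current turn
            turns[-1] = (u["speaker"], turns[-1][1] + " " + u["text"])
        else:
            turns.append((u["speaker"], u["text"]))
    return turns
```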

Speaker count controls

You can set speakers_expected when you’re confident about the exact number of speakers.  The docs warn to use it only when you are certain; forcing the wrong number can produce bad splits.  A more flexible alternative is to set a range of possible speakers (when supported by options).

Generic labels vs real names

Diarization typically labels speakers as “Speaker A/B/…”. AssemblyAI also describes a separate concept of Speaker Identification to name speakers (roles/names) beyond generic labels.  In product terms:

  • Diarization: segmenting + labeling distinct voices.
  • Identification: mapping voice turns to human-meaningful names/roles.

Product tip: diarization drives UX

If you are building a meeting app, diarization is not “optional polish.” It changes the experience: users scan speaker turns and quickly locate decisions. If you can’t ship diarization everywhere, ship it at least for longer meetings or paid tiers.

13) Summarization (models, types, and practical usage)

Summaries

AssemblyAI’s Summarization model generates a summary of the resulting transcript.  The docs highlight that you control the style and format using a summary model and summary type.  They also note a constraint: you can only enable one of Summarization and Auto Chapters in the same transcription. 

Why summaries matter

Summaries convert “pages of transcript” into something users can act on: meeting highlights, lecture overviews, daily call digests, and quick review cards. They also reduce cost if you later run a separate LLM, because you can feed the summary instead of the full transcript.

When summaries fail

Summaries struggle when audio quality is poor, speakers overlap, or the content is extremely technical. If you are summarizing high-stakes content (medical, legal, financial), treat summaries as “assistive,” not authoritative.

Recommended UI pattern

Show summary first, then let users expand into chapters or the full transcript. Add a “jump to timestamp” link (if you store timestamps) so users can verify.
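Those jump links need a timestamp formatter. Transcript offsets are in milliseconds, so a small helper covers both short and hour-long recordings:

```python
def fmt_ms(ms):
    """Format a millisecond offset as H:MM:SS (or M:SS under an hour)
    for jump-to-timestamp links."""
    total_s = ms // 1000
    h, rem = divmod(total_s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"
```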

14) Auto Chapters (time-based navigation built for long audio)

Chapters

Auto Chapters is a model that summarizes audio data over time into chapters, making it easier to navigate and find specific information.  Each chapter contains items like summary, gist, headline, and start/end timestamps. 

Why chapters are better than “one big summary”

A single summary is great for speed, but long recordings (podcasts, lectures, meetings) contain multiple topics. Chapters provide a table-of-contents style navigation: users can jump directly to the section they care about without searching raw text.

Chapters vs summarization (choose one per job)

Like summarization, chapters have a compatibility constraint: the docs state you can only enable one of Auto Chapters and Summarization.  If your product needs both, design for a second pass job or a LeMUR step (see LeMUR section).

15) PII redaction (privacy, compliance, and safe sharing)

Privacy

PII redaction is about minimizing sensitive information by identifying and removing it from transcripts.  PII includes data like names, email addresses, phone numbers, and other identifiers. This feature is especially important if you store transcripts for long periods, share them externally, or process regulated content.

When PII redaction is a must

  • Customer support calls that include addresses, card details, or account numbers.
  • Healthcare-related audio that includes personal patient details.
  • HR interviews and recruiting calls.
  • Any transcript you export or share with third parties.

Redaction as a workflow, not a checkbox

In production, redaction should be treated as part of a data governance pipeline: access controls, retention policy, audit logging, and careful handling of raw audio. The API provides redaction outputs, but your system still needs policy and enforcement.

16) Entity detection (structured data extraction from speech)

Entities

Entity detection automatically identifies and categorizes key information in audio, including names of people, organizations, addresses, phone numbers, dates, and more.  This transforms a transcript into searchable structured data.

Real-world use cases

  • Sales calls: extract competitors, product names, contract dates, and pricing mentions.
  • Support calls: extract ticket numbers, error codes, device models.
  • Podcasts: extract guest names, brands, locations, and events.
  • Compliance: detect mentions of sensitive identifiers and trigger review flows.

Entity detection works best with clean transcripts

If audio quality is weak, consider doing a first pass with keyterms prompting for domain vocabulary, then enable entity detection. You can also enforce higher confidence thresholds in your downstream logic.

17) LeMUR: LLM features on transcripts (summary, Q&A, action items, custom tasks)

LeMUR

LeMUR is AssemblyAI’s framework for applying large language model capabilities to recognized speech. Conceptually, it turns “a transcript” into “an interactive document”: you can ask questions, generate action items, summarize with custom prompts, and run other generative tasks on top of the transcript.

How LeMUR fits into a product

Many apps need two layers: (1) stable transcript + structured fields (speakers, timestamps, chapters), and (2) generative insights (action items, Q&A, executive summaries). LeMUR is commonly used as the second layer once transcription is completed.

Using existing transcripts

LeMUR workflows typically reference an existing transcript ID. The official LeMUR docs show fetching an existing transcript via GET https://api.assemblyai.com/v2/transcript/YOUR_TRANSCRIPT_ID. 

When to prefer LeMUR vs built-in summarization/chapters

  • Prefer built-in summarization/chapters when you want predictable formats and simple configuration at transcription time.
  • Prefer LeMUR when you want custom prompts, multi-step insights, action items, question answering, and flexible responses.

Practical strategy

For many products, the best combo is: transcript + diarization + chapters (for navigation) → LeMUR action items + Q&A (for interactivity).
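A sketch of a task-style LeMUR request body that references existing transcripts by ID rather than re-sending their text. The exact endpoint path and field names should be checked against the LeMUR reference; this body shape is an assumption for illustration.

```python
def lemur_task_request(transcript_ids, prompt):
    """Body for a LeMUR custom-task request referencing transcripts
    by ID. Field names are illustrative; verify against the LeMUR
    reference before use."""
    return {"transcript_ids": list(transcript_ids), "prompt": prompt}

action_items = lemur_task_request(
    ["<TRANSCRIPT_ID>"],
    "List the action items from this meeting with owners and due dates.",
)
```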

18) Limits (file size, duration) & scaling patterns

Limits

Limits matter because they shape your product boundaries and your “edge cases.” AssemblyAI’s FAQ documents that the maximum file size for submission to /v2/transcript is 5GB and maximum duration is 10 hours.  It also notes an upload size limit for /v2/upload (local upload) of 2.2GB. 

What to do with files over the limit

  • Split audio: chunk the file into segments and transcribe separately.
  • Use partial offsets: if only certain sections matter (intro, Q&A segment), transcribe those sections only.
  • Downsample / re-encode: reduce size by using efficient encoding (without destroying intelligibility).
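The splitting option is easy to get wrong at chunk boundaries (words cut mid-sentence). A small planner with overlapping windows avoids that; the cap and overlap values here are parameters, not documented limits.

```python
def plan_chunks(duration_s, max_chunk_s, overlap_s=5):
    """Split a long recording into (start, end) second windows under a
    per-job duration cap, with a small overlap so speech at the cut
    point appears in both neighbors and isn't lost."""
    assert max_chunk_s > overlap_s
    chunks, start = [], 0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        if end == duration_s:
            break
        start = end - overlap_s
    return chunks
```

Downstream you would deduplicate the overlapped words when stitching transcripts back together.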

Scaling without pain

Scaling isn’t only “more CPUs.” It’s also fewer failures: validate audio, use webhooks, implement retries, and make your pipeline idempotent. If you want increased rate limits, AssemblyAI’s docs suggest contacting them. 

19) Pricing (base + add-ons) & realistic cost planning

Pricing

AssemblyAI uses usage-based pricing. The official pricing page lists base rates and add-on costs. At the time of writing, streaming speech-to-text is listed at $0.15/hr, and several add-ons have their own hourly prices (for example, Speaker Diarization at $0.02/hr, Entity Detection at $0.08/hr, and Keyterms Prompting at $0.04/hr).

Cost planning: think in “workload recipes”

A helpful way to estimate costs is to define a few recipe profiles your app supports, like:

  • Transcript only: the base speech-to-text model. Best for searchable transcripts, captions, and archives. Cost: low and predictable.
  • Meeting mode: STT + diarization (plus optional speaker identification). Best for meetings, interviews, and support calls. Cost: moderate; scales with add-ons.
  • Compliance mode: STT + PII redaction (plus content safety). Best for regulated or shareable transcripts. Cost: higher; privacy features add cost.
  • Insights mode: STT + chapters/summaries + entities. Best for analytics and knowledge extraction. Cost: higher; multiple models are applied.

Where costs get out of control

  • Turning on everything by default for every transcription, even when users don’t need it.
  • Re-processing the same audio multiple times instead of caching results.
  • Polling at high frequency across large numbers of jobs.
  • Not trimming audio (transcribing silence, intros, hold music, or irrelevant sections).

Cost control checklist

1) Enable features only when needed. 2) Store transcript outputs. 3) Use webhooks. 4) Trim silence/irrelevant audio. 5) Apply LeMUR only on “important” conversations (or paid tiers).
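The recipe arithmetic is simple enough to keep in code. No rates are hard-coded below; pass in whatever the current pricing page lists, so estimates can't silently drift from reality.

```python
def estimate_cost(hours, base_rate_per_hr, addon_rates_per_hr=()):
    """Estimated cost for a workload: hours * (base + enabled add-ons).
    Rates are inputs taken from the current pricing page."""
    return round(hours * (base_rate_per_hr + sum(addon_rates_per_hr)), 2)

# e.g. 200 hours of "meeting mode" at a hypothetical $0.15/hr base
# plus a $0.02/hr diarization add-on:
monthly = estimate_cost(200, 0.15, [0.02])
```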

20) Production architecture & reliability patterns

Production

The difference between a demo and a product is reliability. Here’s a proven architecture for building on AssemblyAI.

Reference architecture (async transcription)

  1. Frontend: user uploads audio/video, or connects a recording source.
  2. Backend API: receives upload, stores file in your object storage, generates a job record.
  3. Worker: uploads to AssemblyAI (or supplies a signed URL) and creates transcript job.
  4. Webhook endpoint: receives completion event and enqueues a “fetch results” task.
  5. Processor: fetches transcript JSON, stores in DB, indexes text for search, stores structured fields (speakers/entities/chapters).
  6. Optional LLM layer: run LeMUR or your own LLM for action items, summaries, Q&A, or extraction.

Idempotency strategy

Use stable IDs and deduplication keys:

  • Compute a file hash (or storage object key) and store it so the same file isn’t processed twice.
  • Store transcript job ID and treat “fetch transcript” as safe to retry.
  • On webhook receive, enqueue a job keyed by transcript ID; drop duplicates.
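The file-hash part of that strategy is a few lines. A sketch using a streamed SHA-256 so large audio files never load fully into memory; in production `processed_keys` would be a unique database index rather than a set.

```python
import hashlib

def content_key(path, chunk_size=1 << 20):
    """Stable dedup key for an audio file: sha256 of its bytes,
    computed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def should_process(path, processed_keys):
    """True only the first time a given file's content is seen."""
    key = content_key(path)
    if key in processed_keys:
        return False
    processed_keys.add(key)
    return True
```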

Retries and backoff

Not every failure is permanent. Network errors, timeouts, and occasional server errors should be retried with exponential backoff and jitter. However, “bad input” errors (corrupt audio, inaccessible URL) should not be retried until fixed. The docs mention that if failures persist after resubmitting, you should contact support. 
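One common way to implement "exponential backoff with jitter" is the full-jitter variant: each delay is drawn uniformly between zero and a capped exponential. The random source is injectable so the schedule is testable.

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter backoff: delay i is uniform in
    [0, min(cap, base * 2**i)] seconds."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]
```

Apply these delays only to transient failures (network errors, timeouts, 5xx); bad-input errors should go to a dead-letter path instead.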

Storage & indexing

Store:

  • Raw transcript text
  • Word-level or sentence-level timestamps (for subtitles and jump links)
  • Utterances and speaker labels (for meeting UX)
  • Entities and redaction outputs (for search and compliance)
  • Chapters/summaries (for navigation and “TL;DR” UX)

Product UX hack

Add “jump to timestamp” everywhere: in summary bullet points, in chapters, and in search results. It makes transcripts feel trustworthy because users can verify by listening.

21) Security best practices (keys, URLs, webhooks)

Security

Voice data is sensitive. Your security posture should assume audio can contain PII, confidential business info, and regulated content. Good security is a combination of key handling, network controls, and data governance.

API key handling

  • Keep your key in server environment variables (not in client apps).
  • Rotate keys if accidentally exposed.
  • Log carefully: never log full headers containing Authorization keys.

Public URLs and access control

If you pass audio_url as a public URL, ensure it is a direct file URL and accessible from AssemblyAI’s servers. Avoid exposing private recordings with globally public URLs; prefer short-lived signed URLs when possible.

Webhook security

  • Use HTTPS only.
  • Validate incoming webhook payloads (signature/secret if supported in your configuration).
  • Make handlers idempotent and quick; enqueue heavy work.
  • Rate limit webhook endpoints to defend against spam.

Compliance mindset

If you handle sensitive audio, also define retention (how long you keep audio and transcripts), access controls (who can view), and audit logs (who accessed what). The API provides capabilities, but your policies complete the security story.

22) Troubleshooting: common errors and fixes

Troubleshooting

“My transcript is stuck”

If status is queued or processing, it may simply be waiting for capacity. Avoid polling every second. If the job stays in that state for an unusually long time, try:

  • Confirm the audio URL is a direct file and reachable.
  • Confirm the audio isn’t corrupted or unsupported.
  • Retry submission once (don’t spam).

401 Unauthorized

Usually means missing or invalid API key. Confirm you are sending Authorization: <YOUR_API_KEY> and you did not include extra spaces or quotes.

429 Too Many Requests

Rate limiting can occur on some endpoints/services. The docs describe rate limits with X-RateLimit-* headers and 429 responses for exceeded limits.  Implement exponential backoff and reduce polling frequency. Use webhooks instead of polling at scale.

422 Unprocessable Entity / Upload failures

Upload failures are usually about request formatting or payload issues: wrong content type, sending JSON instead of binary, or a file that exceeds the upload size constraints. Ensure you use Content-Type: application/octet-stream and --data-binary for curl uploads. 

“Audio URL is a webpage”

The docs explicitly mention failures when the audio URL is a webpage rather than a file.  Fix by using a direct file URL or upload the media via /v2/upload.

Debug checklist

1) Confirm the URL is direct media. 2) Confirm it’s accessible without cookies. 3) Check duration/size limits. 4) Reduce features to isolate (transcript-only), then add features back one by one.

23) AssemblyAI API FAQs

FAQ

Is AssemblyAI REST or WebSocket?

Both. Async transcription uses REST endpoints under https://api.assemblyai.com.  Streaming transcription uses a WebSocket endpoint like wss://streaming.assemblyai.com/v3/ws. 

Do I have to upload my audio?

No. If you have a public direct audio URL, you can pass it as audio_url to /v2/transcript.  Uploading is helpful when you have local files or private user uploads.

What is the max file size and duration?

AssemblyAI’s FAQ states /v2/transcript supports up to 5GB and up to 10 hours, and /v2/upload supports local uploads up to 2.2GB. 

Can I enable both Summarization and Auto Chapters?

The docs state you can only enable one of Summarization and Auto Chapters in the same transcription.  If you need both, consider running separate jobs or using LeMUR for one of the outputs.

How does speaker diarization work?

Diarization detects multiple speakers and returns utterances segmented by speaker. The docs describe that the transcript returns a list of utterances when diarization is enabled. 

How do I reduce costs?

Turn on only necessary add-ons, trim audio (skip intros/silence), store transcript results to avoid reprocessing, and use webhooks instead of frequent polling. Pricing includes base rates and add-on rates. 

24) Sources & official docs (recommended starting points)

Sources

Use official docs for exact request/response fields and the latest feature set. The links below are the most important pages to bookmark.

Core REST endpoints

  • API Overview / Base URL: REST base URL is https://api.assemblyai.com. 
  • Upload: POST /v2/upload with binary upload. 
  • Submit transcript: POST /v2/transcript with audio_url. 
  • Status lifecycle: queued/processing/completed/error. 

Streaming (real-time)

  • WebSocket handshake: wss://streaming.assemblyai.com/v3/ws. 
  • EU streaming host option: streaming.eu.assemblyai.com. 

Models & features

  • Summarization docs: summary model/type + mutual exclusion with chapters. 
  • Auto Chapters docs: chapter fields and navigation structure. 
  • Speaker diarization docs: utterances + speaker_labels + best practices. 
  • Entity detection docs: supported entity types. 
  • PII redaction docs: redaction behavior and purpose. 
  • Pricing page: base rates and add-ons. 
  • Limits FAQ: file size/duration constraints. 