1) What the AssemblyAI API is
The AssemblyAI API is a set of endpoints and models for converting speech into text and extracting structured insights from audio. Think of it as two layers:
- Speech-to-Text (STT): turn audio into words, timestamps, and speaker segments.
- Speech Understanding: take the transcript and generate “meaningful outputs” like summaries, chapters, entities, and redaction.
As a developer, you typically use the API in one of two modes:
Async transcription (files)
You submit an audio file (or an audio URL) to /v2/transcript,
the job runs in the background, and you fetch the transcript later (or receive a webhook).
This is ideal for recorded meetings, podcasts, call recordings, and video archives.
Streaming transcription (live audio)
You open a WebSocket connection and stream audio chunks. You receive partial and final transcripts in near real-time. This is ideal for voice agents, live captions, and real-time assistants.
The core patterns are simple: upload (optional) → submit transcript job → wait (poll/webhook) → use results. Most complexity comes from production details: retries, rate limits, costs, and security. This guide focuses on those details so you can ship.
Base URL and API shape
The REST API uses JSON requests/responses and is hosted at a base URL of https://api.assemblyai.com.
In general, you authenticate by sending an Authorization header containing your API key.
2) Key concepts (jobs, transcripts, models, and add-ons)
Before writing code, it helps to align on what the API actually returns and how the pieces fit together. This prevents common mistakes like “I uploaded audio but nothing happened” or “My transcript doesn’t include speakers.”
Transcript job vs transcript result
When you call /v2/transcript, you are creating a transcription job.
The API returns an id and an initial status.
You then fetch the transcript later using the id until the status becomes completed (or error).
Speech models and options
The submit endpoint supports specifying speech model options and many feature flags (for example: diarization, chapters, summarization, content safety, custom spelling, etc.). The key idea is: if you want extra outputs, enable them at submission time so the job produces them.
Feature compatibility matters
Some features are mutually exclusive. For example, AssemblyAI’s docs note you can only enable one of Summarization and Auto Chapters on the same transcription. That’s not a “bug”—it’s a product constraint you should design around (e.g., run two jobs, or choose the one you need).
Transcription outputs
Text, confidence, timestamps (word/sentence/paragraph), utterances, speaker labels, etc.
Understanding outputs
Summary, chapters, entities, redaction, and other “structured meaning” layers.
LLM layer (LeMUR)
Generative tasks on transcripts: summarize, Q&A, action items, custom prompts.
3) Quickstart: transcribe a public audio URL (fastest path)
The simplest way to start is to provide an audio file hosted at a public, direct URL (not a webpage). You submit that URL to the transcript endpoint and then poll until it completes. This avoids the upload step, which is useful while you’re experimenting.
Step A: Create a transcript job
The request below is representative of the official submit endpoint: POST https://api.assemblyai.com/v2/transcript.
Replace <YOUR_API_KEY> with your real key.
curl -X POST https://api.assemblyai.com/v2/transcript \
-H "Authorization: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://assembly.ai/wildfires.mp3"
}'
Step B: Check status until completed
The API returns an id. Use it to fetch status:
curl -X GET https://api.assemblyai.com/v2/transcript/<TRANSCRIPT_ID> \
-H "Authorization: <YOUR_API_KEY>"
Keep checking until the status becomes completed.
AssemblyAI documents typical statuses as: queued, processing, completed, error.
Polling is okay for quick experiments, but for production you’ll generally prefer webhooks so you don’t waste requests and you can handle large volumes cleanly. (Webhooks are covered below.)
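For quick experiments, the poll loop itself is worth getting right: use a bounded wait and a sensible interval rather than hammering the endpoint. A minimal sketch (the `fetch_status` callable is an assumption — in practice it would wrap a GET to `/v2/transcript/<id>` and return the `status` field):

```python
import time

def poll_until_done(fetch_status, interval_s=3.0, max_wait_s=900.0, sleep=time.sleep):
    """Poll a transcript job until it reaches a terminal status.

    fetch_status: callable returning the current status string
    (one of "queued", "processing", "completed", "error").
    Returns the terminal status, or raises TimeoutError.
    """
    waited = 0.0
    while waited <= max_wait_s:
        status = fetch_status()
        if status in ("completed", "error"):
            return status
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError("transcript did not finish within max_wait_s")
```

The injected `sleep` parameter keeps the loop testable; in real code you'd just use the default.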
4) Authentication & headers (REST + streaming)
AssemblyAI uses API key authentication for REST and WebSocket streaming.
The typical pattern is to send an Authorization header with your key.
For REST calls, the upload endpoint explicitly shows this header as required.
REST auth (most endpoints)
- Header: Authorization: <apiKey>
- Content-Type: application/json for most endpoints; application/octet-stream for upload.
Streaming auth (WebSocket)
Streaming uses a WebSocket URL and can authenticate either via an Authorization header (API key)
or via a temporary token query parameter, per the Streaming API reference.
Treat your AssemblyAI API key as a secret. Do not embed it in client-side JavaScript on public websites. Put it on your server, use environment variables, and proxy requests when needed.
5) Uploading audio vs using a public URL
You have two ways to provide audio for async transcription:
Option 1: Use a public audio URL
You pass audio_url directly to the transcript submit endpoint.
It must be a direct media file URL accessible from AssemblyAI’s servers (not a page that requires cookies/logins).
This is fastest for demos and tests.
Option 2: Upload a local file
You first upload the file to AssemblyAI using POST /v2/upload,
receive an upload_url, then use that URL in audio_url when creating the transcript.
This is common for user-uploaded files and private recordings.
How file upload works
The upload endpoint is:
POST https://api.assemblyai.com/v2/upload
and expects Content-Type: application/octet-stream.
The response returns an upload_url.
curl -X POST https://api.assemblyai.com/v2/upload \
-H "Authorization: <YOUR_API_KEY>" \
-H "Content-Type: application/octet-stream" \
--data-binary @path/to/file
Then submit a transcript using the returned URL:
curl -X POST https://api.assemblyai.com/v2/transcript \
-H "Authorization: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://cdn.assemblyai.com/upload/<YOUR_UPLOAD_ID>"
}'
Uploading only stores the media and returns a URL. Transcription does not start until you create a transcript job. In production, treat “upload” and “transcribe” as two distinct pipeline steps.
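To keep those two steps distinct in code, it can help to separate request construction from request sending. A minimal sketch of the two request shapes described above (pure builders, no HTTP client assumed; send them with whatever client you use):

```python
import json

API_BASE = "https://api.assemblyai.com"

def build_upload_request(api_key: str) -> dict:
    # Step 1: POST /v2/upload expects the raw file bytes, not JSON.
    # The response contains an upload_url.
    return {
        "method": "POST",
        "url": f"{API_BASE}/v2/upload",
        "headers": {
            "Authorization": api_key,
            "Content-Type": "application/octet-stream",
        },
    }

def build_transcript_request(api_key: str, audio_url: str) -> dict:
    # Step 2: POST /v2/transcript takes a JSON body referencing the audio
    # (either a public URL or the upload_url returned by /v2/upload).
    return {
        "method": "POST",
        "url": f"{API_BASE}/v2/transcript",
        "headers": {
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"audio_url": audio_url}),
    }
```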
6) Async Speech-to-Text pipeline (REST) — the production foundation
Async STT is the most common way teams start using AssemblyAI, because most businesses already have recordings: customer calls, meetings, podcasts, interviews, training videos, lectures, and webinars. The API is job-based: create a transcript job, then fetch results.
The canonical async pipeline
- Ingest audio: public URL, or upload a local file to get an upload_url.
- Create transcript job: POST /v2/transcript with options.
- Wait: poll status, or use webhook_url to get notified.
- Consume results: store transcript text + structured outputs in your DB for downstream use.
- Post-process: optionally run LeMUR or your own LLM on the transcript.
Enabling features at submit time
The submit endpoint supports many optional fields. The docs show a request containing fields like
auto_chapters, auto_highlights, content_safety, custom_spelling, and more.
The core idea: “turn on what you want to receive.”
Many features are add-ons. Decide what is essential for your product experience, then turn on only those features for those workloads (e.g., diarization for meetings, redaction for healthcare).
Handling partial audio (start/end offsets)
The submit endpoint’s documented examples also include fields like audio_start_from and audio_end_at.
This enables use cases like “transcribe only the first 10 minutes” or “skip the intro.”
In production, this is a powerful cost-saving lever when your audio contains irrelevant sections.
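A small payload helper makes the offsets convenient to use from seconds. (This sketch assumes audio_start_from / audio_end_at are expressed in milliseconds, as the docs’ examples suggest — verify against the current API reference before relying on it.)

```python
def transcript_body_with_window(audio_url, start_s=None, end_s=None):
    """Build a /v2/transcript body that transcribes only part of the audio.

    start_s / end_s are seconds; the API fields are assumed to take
    milliseconds, so we convert here.
    """
    body = {"audio_url": audio_url}
    if start_s is not None:
        body["audio_start_from"] = int(start_s * 1000)
    if end_s is not None:
        body["audio_end_at"] = int(end_s * 1000)
    return body
```

For example, “transcribe only the first 10 minutes” becomes `transcript_body_with_window(url, end_s=600)`.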
7) Transcript statuses & lifecycle (queued → processing → completed)
Transcript status is the simplest (and most important) state machine in the platform.
After submitting a job, it transitions through statuses such as:
queued, processing, completed, and error.
| Status | Meaning | What you should do | Common pitfalls |
|---|---|---|---|
| queued | Job is waiting to be processed. | Wait; avoid aggressive polling. | Assuming “queued” means stuck; it may just be load. |
| processing | Job is actively transcribing. | Wait; show progress in UI if needed. | Polling too often; burning requests. |
| completed | Transcript is ready. | Fetch and store results. | Not storing outputs; re-fetching repeatedly. |
| error | An error occurred. | Inspect error message; retry if appropriate. | Retrying blindly without fixing the underlying issue. |
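The table above can be collapsed into a small dispatch function, so every consumer of transcript status handles all four cases explicitly (a sketch; the callbacks are placeholders for your own logic):

```python
def handle_status(status, on_completed, on_error, backoff):
    """Map each documented transcript status to an action.

    Mirrors the lifecycle table: terminal statuses get handled,
    non-terminal statuses wait, anything else is a programming error.
    """
    if status == "completed":
        return on_completed()          # fetch and store results
    if status == "error":
        return on_error()              # inspect message; maybe retry
    if status in ("queued", "processing"):
        return backoff()               # keep waiting; don't hammer the API
    raise ValueError(f"unexpected status: {status!r}")
```

Raising on unknown statuses is deliberate: it surfaces API changes instead of silently ignoring them.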
Why jobs fail
The docs mention common causes such as corrupted or unsupported audio, providing a webpage URL instead of a file, or a URL not accessible from AssemblyAI’s servers. In production, validate audio early (format, duration, accessibility) to reduce error rates.
8) Webhooks vs polling (and when to use each)
Polling is easy: keep requesting transcript status until it completes. But polling becomes expensive and noisy at scale. Webhooks let AssemblyAI call your server when the transcript is ready. The official docs describe webhooks as custom HTTP callbacks you define to get notified when transcripts are ready.
When polling is okay
- Prototyping or a hackathon.
- Low volume transcription (a few jobs/day) with no strict efficiency requirements.
- Local scripts and personal automation.
When to use webhooks
- Anything production that might scale beyond a small trickle.
- Background processing pipelines (queues, workers, job systems).
- When you want reliable completion events without tight polling loops.
Webhook design best practices
- Idempotency: treat webhook deliveries as “may repeat.” If you receive the same completion event twice, do not double-bill or double-store.
- Verification: verify requests (shared secret / signature pattern if provided, or allowlist IPs if your infra supports it).
- Async handling: accept quickly (200 OK), then process in background (queue) so webhook delivery doesn’t time out.
- Safe retries: if your handler fails, you should be able to re-fetch transcript results later using the transcript ID.
Webhooks should trigger a “fetch transcript details” job in your system. The webhook itself should not do heavy work. It’s a notification, not a batch processor.
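The best practices above can be sketched as a framework-agnostic receiver: acknowledge fast, deduplicate, and enqueue the heavy work. (The payload field name "transcript_id" and the in-memory `seen` set are assumptions for illustration; in a real service the dedup store would be a database or cache shared across workers, and you'd check the actual webhook payload shape in the docs.)

```python
import queue

class WebhookReceiver:
    """Minimal idempotent webhook handler: ack fast, dedupe, enqueue work."""

    def __init__(self):
        self.seen = set()           # transcript IDs already enqueued
        self.work = queue.Queue()   # background "fetch results" jobs

    def receive(self, payload: dict) -> int:
        transcript_id = payload.get("transcript_id")
        if not transcript_id:
            return 400                  # malformed notification
        if transcript_id in self.seen:
            return 200                  # duplicate delivery: ack, do nothing
        self.seen.add(transcript_id)
        self.work.put(transcript_id)    # heavy work happens off the request path
        return 200                      # ack quickly so delivery doesn't time out
```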
9) Realtime Streaming Speech-to-Text (WebSockets)
Streaming transcription is for live audio. Instead of uploading a file, you open a persistent WebSocket connection and send audio frames as they are recorded. AssemblyAI returns partial and final transcripts over the same connection.
WebSocket handshake and endpoint
The Streaming API reference lists the handshake URL as:
wss://streaming.assemblyai.com/v3/ws.
It also notes: to use the EU server, replace the host with streaming.eu.assemblyai.com.
Required parameters
The handshake requires at least sample_rate.
There are options like encoding (e.g., PCM 16-bit little-endian or mu-law),
and settings that influence turn detection and formatting.
Authentication choices
You can authenticate with an API key in the Authorization header,
or generate a temporary token and pass it via a query parameter, per the docs.
Streaming isn’t only a “technical” upgrade. It changes user experience: you can display words as they’re spoken, detect turn ends, and drive a live UI. It also changes infrastructure: you need stable connections, backpressure handling, and failure recovery (reconnect strategy).
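Putting the handshake details together, a small helper can build the connection URL (a sketch: sample_rate is the documented required parameter; the temporary-token query parameter name and any extra parameters should be checked against the Streaming API reference):

```python
from urllib.parse import urlencode

def streaming_ws_url(sample_rate, eu=False, token=None, extra_params=None):
    """Build the streaming WebSocket handshake URL.

    Uses the EU host when eu=True; a temporary token can be passed as a
    query parameter instead of sending an Authorization header.
    """
    host = "streaming.eu.assemblyai.com" if eu else "streaming.assemblyai.com"
    params = {"sample_rate": sample_rate}
    if token:
        params["token"] = token         # assumed parameter name; verify in docs
    if extra_params:
        params.update(extra_params)
    return f"wss://{host}/v3/ws?{urlencode(params)}"
```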
10) Turn detection, partial vs final, endpointing (how streaming feels fast)
In real-time speech recognition, users judge “speed” by how quickly they see partial text and how quickly the system correctly decides that someone finished speaking (end-of-turn detection / endpointing). AssemblyAI’s streaming API includes parameters that help tune this behavior, such as an end-of-turn confidence threshold.
Partial vs final transcripts
- Partial: fast, may change, used for live captions or immediate feedback.
- Final: stable, used for saving, compliance, downstream NLP, and analytics.
When to use formatted final transcripts
The streaming API includes a formatting option for returning formatted final transcripts. Use this when the final output is user-facing (subtitles, notes, transcripts shown in UI). If you’re using the transcript as raw NLP input, you may prefer minimal formatting and do your own post-processing.
In voice agents, “natural conversation” depends on turn detection. Tune threshold settings carefully: too sensitive and you cut users off; too slow and the agent feels laggy.
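The partial/final split above maps naturally onto a small client-side buffer: partials overwrite the in-progress line, finals are committed. (Message shapes vary by API version; this sketch models only the idea, not the actual streaming message schema.)

```python
class CaptionBuffer:
    """Track live-caption state from a stream of partial/final transcripts."""

    def __init__(self):
        self.finals = []     # committed, stable text
        self.partial = ""    # current in-progress line, may still change

    def on_partial(self, text):
        self.partial = text          # partials replace, never append

    def on_final(self, text):
        self.finals.append(text)     # finals are durable; safe to store
        self.partial = ""            # the turn ended; reset the live line

    def display(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Save only `finals` for compliance and downstream NLP; render `display()` in the live UI.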
11) Speech Understanding: summaries, chapters, entities, redaction
Speech-to-text gives you words. Speech understanding gives you structure and meaning. AssemblyAI provides several understanding models you can enable at transcription time, returning extra fields in the transcript response.
Common “Speech Understanding” features
Summarization
Generate a summary of the transcript. Controlled by summary model/type.
Auto Chapters
Generate time-based chapters: headline, gist, summary, and timestamps.
Entity Detection
Detect names, orgs, addresses, phone numbers, and other entities.
PII Redaction
Remove or mask personal data in the transcript to protect privacy.
Speaker Diarization
Segment transcript into speaker turns with speaker labels.
Keyterms Prompting
Improve recognition for domain words (product names, jargon).
The important architectural point: most of these are computed as part of the transcription job (or are tightly coupled to it). So you should decide and enable what you need when you create the transcript job, then store the resulting structured fields for fast UI rendering and analytics.
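Since features must be chosen at job-creation time, it helps to validate the combination before submitting. A sketch that assembles the body and enforces the documented Summarization/Auto Chapters exclusivity (the feature flag names here follow the docs' examples; confirm the exact field names in the API reference):

```python
def understanding_config(audio_url, **features):
    """Assemble a /v2/transcript body with understanding features enabled.

    Rejects the documented invalid combination of Summarization and
    Auto Chapters on the same transcription.
    """
    if features.get("summarization") and features.get("auto_chapters"):
        raise ValueError(
            "summarization and auto_chapters are mutually exclusive; "
            "run two jobs or pick one"
        )
    return {"audio_url": audio_url, **features}
```

Failing fast locally is cheaper than discovering the constraint via an API error after submission.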
12) Speaker diarization and speaker identification
Speaker diarization answers: “who spoke when?” It detects multiple speakers and returns a list of utterances (chunks of uninterrupted speech) per speaker. In practice, this powers meeting transcripts, call center analytics, interviews, and panel discussions.
How to enable diarization
AssemblyAI’s diarization docs show enabling it by setting speaker_labels to true in the transcription config.
The transcript result then includes “utterances” with speaker labels.
Speaker count controls
You can set speakers_expected when you’re confident about the exact number of speakers.
The docs warn to use it only when you are certain; forcing the wrong number can produce bad splits.
A more flexible alternative is to set a range of possible speakers (when supported by options).
Generic labels vs real names
Diarization typically labels speakers as “Speaker A/B/…”. AssemblyAI also describes a separate concept of Speaker Identification to name speakers (roles/names) beyond generic labels. In product terms:
- Diarization: segmenting + labeling distinct voices.
- Identification: mapping voice turns to human-meaningful names/roles.
If you are building a meeting app, diarization is not “optional polish.” It changes the experience: users scan speaker turns and quickly locate decisions. If you can’t ship diarization everywhere, ship it at least for longer meetings or paid tiers.
13) Summarization (models, types, and practical usage)
AssemblyAI’s Summarization model generates a summary of the resulting transcript. The docs highlight that you control the style and format using a summary model and summary type. They also note a constraint: you can only enable one of Summarization and Auto Chapters in the same transcription.
Why summaries matter
Summaries convert “pages of transcript” into something users can act on: meeting highlights, lecture overviews, daily call digests, and quick review cards. They also reduce cost if you later run a separate LLM, because you can feed the summary instead of the full transcript.
When summaries fail
Summaries struggle when audio quality is poor, speakers overlap, or the content is extremely technical. If you are summarizing high-stakes content (medical, legal, financial), treat summaries as “assistive,” not authoritative.
Show summary first, then let users expand into chapters or the full transcript. Add a “jump to timestamp” link (if you store timestamps) so users can verify.
14) Auto Chapters (time-based navigation built for long audio)
Auto Chapters is a model that summarizes audio data over time into chapters, making it easier to navigate and find specific information. Each chapter contains items like summary, gist, headline, and start/end timestamps.
Why chapters are better than “one big summary”
A single summary is great for speed, but long recordings (podcasts, lectures, meetings) contain multiple topics. Chapters provide a table-of-contents style navigation: users can jump directly to the section they care about without searching raw text.
Chapters vs summarization (choose one per job)
Like summarization, chapters have a compatibility constraint: the docs state you can only enable one of Auto Chapters and Summarization. If your product needs both, design for a second pass job or a LeMUR step (see LeMUR section).
15) PII redaction (privacy, compliance, and safe sharing)
PII redaction is about minimizing sensitive information by identifying and removing it from transcripts. PII includes data like names, email addresses, phone numbers, and other identifiers. This feature is especially important if you store transcripts for long periods, share them externally, or process regulated content.
When PII redaction is a must
- Customer support calls that include addresses, card details, or account numbers.
- Healthcare-related audio that includes personal patient details.
- HR interviews and recruiting calls.
- Any transcript you export or share with third parties.
Redaction as a workflow, not a checkbox
In production, redaction should be treated as part of a data governance pipeline: access controls, retention policy, audit logging, and careful handling of raw audio. The API provides redaction outputs, but your system still needs policy and enforcement.
16) Entity detection (structured data extraction from speech)
Entity detection automatically identifies and categorizes key information in audio, including names of people, organizations, addresses, phone numbers, dates, and more. This transforms a transcript into searchable structured data.
Real-world use cases
- Sales calls: extract competitors, product names, contract dates, and pricing mentions.
- Support calls: extract ticket numbers, error codes, device models.
- Podcasts: extract guest names, brands, locations, and events.
- Compliance: detect mentions of sensitive identifiers and trigger review flows.
If audio quality is weak, consider doing a first pass with keyterms prompting for domain vocabulary, then enable entity detection. You can also enforce higher confidence thresholds in your downstream logic.
17) LeMUR: LLM features on transcripts (summary, Q&A, action items, custom tasks)
LeMUR is AssemblyAI’s framework for applying large language model capabilities to recognized speech. Conceptually, it turns “a transcript” into “an interactive document”: you can ask questions, generate action items, summarize with custom prompts, and run other generative tasks on top of the transcript.
How LeMUR fits into a product
Many apps need two layers: (1) stable transcript + structured fields (speakers, timestamps, chapters), and (2) generative insights (action items, Q&A, executive summaries). LeMUR is commonly used as the second layer once transcription is completed.
Using existing transcripts
LeMUR workflows typically reference an existing transcript ID.
The official LeMUR docs show fetching an existing transcript via
GET https://api.assemblyai.com/v2/transcript/YOUR_TRANSCRIPT_ID.
When to prefer LeMUR vs built-in summarization/chapters
- Prefer built-in summarization/chapters when you want predictable formats and simple configuration at transcription time.
- Prefer LeMUR when you want custom prompts, multi-step insights, action items, question answering, and flexible responses.
For many products, the best combo is: transcript + diarization + chapters (for navigation) → LeMUR action items + Q&A (for interactivity).
18) Limits (file size, duration) & scaling patterns
Limits matter because they shape your product boundaries and your “edge cases.”
AssemblyAI’s FAQ documents that the maximum file size for submission to /v2/transcript is 5GB
and maximum duration is 10 hours.
It also notes an upload size limit for /v2/upload (local upload) of 2.2GB.
What to do with files over the limit
- Split audio: chunk the file into segments and transcribe separately.
- Use partial offsets: if only certain sections matter (intro, Q&A segment), transcribe those sections only.
- Downsample / re-encode: reduce size by using efficient encoding (without destroying intelligibility).
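The splitting option can be planned up front. A sketch of a segment planner (the overlap trick is a general chunking technique, not an AssemblyAI feature: a small overlap reduces the chance of cutting a word at a boundary, and you de-duplicate the overlapping text when stitching results):

```python
def plan_segments(duration_s, max_segment_s, overlap_s=2.0):
    """Plan (start, end) windows, in seconds, for chunking long audio.

    Each segment is at most max_segment_s long; consecutive segments
    overlap by overlap_s so boundary words appear in both chunks.
    """
    if max_segment_s <= overlap_s:
        raise ValueError("segment length must exceed the overlap")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s   # back up so the next chunk overlaps
    return segments
```

Each planned window can then be submitted as its own job, e.g. via audio_start_from/audio_end_at or pre-cut files.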
Scaling without pain
Scaling isn’t only “more CPUs.” It’s also fewer failures: validate audio, use webhooks, implement retries, and make your pipeline idempotent. If you want increased rate limits, AssemblyAI’s docs suggest contacting them.
19) Pricing (base + add-ons) & realistic cost planning
AssemblyAI uses usage-based pricing. The official pricing page lists base rates and add-on costs. As of the pricing page details shown, streaming speech-to-text is listed at $0.15/hr, and several add-ons have their own hourly prices (for example, Speaker Diarization at $0.02/hr, Entity Detection at $0.08/hr, Keyterms Prompting at $0.04/hr, and more).
Cost planning: think in “workload recipes”
A helpful way to estimate costs is to define a few recipe profiles your app supports, like:
| Recipe | What it enables | Best for | Cost behavior |
|---|---|---|---|
| Transcript only | Speech-to-text base model | Searchable transcripts, captions, archives | Low and predictable |
| Meeting mode | STT + diarization (+ optional speaker identification) | Meetings, interviews, support calls | Moderate; scales with add-ons |
| Compliance mode | STT + PII redaction (+ content safety) | Regulated or shareable transcripts | Higher; privacy features add cost |
| Insights mode | STT + chapters/summaries + entities | Analytics, knowledge extraction | Higher; multiple models applied |
Where costs get out of control
- Turning on everything by default for every transcription, even when users don’t need it.
- Re-processing the same audio multiple times instead of caching results.
- Polling at high frequency across large numbers of jobs.
- Not trimming audio (transcribing silence, intros, hold music, or irrelevant sections).
1) Enable features only when needed. 2) Store transcript outputs. 3) Use webhooks. 4) Trim silence/irrelevant audio. 5) Apply LeMUR only on “important” conversations (or paid tiers).
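The recipe idea translates directly into a small cost estimator. (The rates below are the add-on prices quoted earlier in this section, as of the time of writing; check the live pricing page before relying on them, and note that base STT is billed separately.)

```python
# Add-on hourly rates quoted in this section (verify against the live
# pricing page; these are illustrative, not authoritative).
ADDON_RATES_PER_HR = {
    "speaker_diarization": 0.02,
    "entity_detection": 0.08,
    "keyterms_prompting": 0.04,
}

def estimate_addon_cost(hours, addons):
    """Estimate add-on cost in dollars for a workload 'recipe'.

    Covers add-ons only; base speech-to-text is priced separately.
    """
    unknown = [a for a in addons if a not in ADDON_RATES_PER_HR]
    if unknown:
        raise KeyError(f"no rate on file for: {unknown}")
    return round(hours * sum(ADDON_RATES_PER_HR[a] for a in addons), 4)
```

For example, a “meeting mode” month of 10 audio hours with diarization and entity detection adds roughly `estimate_addon_cost(10, ["speaker_diarization", "entity_detection"])` dollars on top of base transcription.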
20) Production architecture & reliability patterns
The difference between a demo and a product is reliability. Here’s a proven architecture for building on AssemblyAI.
Reference architecture (async transcription)
- Frontend: user uploads audio/video, or connects a recording source.
- Backend API: receives upload, stores file in your object storage, generates a job record.
- Worker: uploads to AssemblyAI (or supplies a signed URL) and creates transcript job.
- Webhook endpoint: receives completion event and enqueues a “fetch results” task.
- Processor: fetches transcript JSON, stores in DB, indexes text for search, stores structured fields (speakers/entities/chapters).
- Optional LLM layer: run LeMUR or your own LLM for action items, summaries, Q&A, or extraction.
Idempotency strategy
Use stable IDs and deduplication keys:
- Compute a file hash (or storage object key) and store it so the same file isn’t processed twice.
- Store transcript job ID and treat “fetch transcript” as safe to retry.
- On webhook receive, enqueue a job keyed by transcript ID; drop duplicates.
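The first two bullets can be sketched as a tiny registry: hash the bytes once, and only submit a job when the fingerprint is new. (The in-memory dict stands in for a database table keyed by fingerprint; `submit` is a placeholder for your job-creation call.)

```python
import hashlib

def file_fingerprint(data):
    """Stable dedup key for an audio payload: SHA-256 of the bytes."""
    return hashlib.sha256(data).hexdigest()

class JobRegistry:
    """Maps file fingerprints to transcript job IDs so the same audio
    is never submitted twice."""

    def __init__(self):
        self._by_fingerprint = {}

    def get_or_register(self, data, submit):
        key = file_fingerprint(data)
        if key not in self._by_fingerprint:
            self._by_fingerprint[key] = submit()   # create the job exactly once
        return self._by_fingerprint[key]
```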
Retries and backoff
Not every failure is permanent. Network errors, timeouts, and occasional server errors should be retried with exponential backoff and jitter. However, “bad input” errors (corrupt audio, inaccessible URL) should not be retried until fixed. The docs mention that if failures persist after resubmitting, you should contact support.
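A common way to implement “exponential backoff with jitter” is full jitter: each retry waits a random duration up to an exponentially growing cap. A minimal sketch of the delay schedule:

```python
import random

def backoff_delays(base_s=1.0, factor=2.0, max_delay_s=60.0, attempts=5,
                   rng=random.random):
    """Exponential backoff with full jitter.

    Attempt n waits rng() * min(max_delay_s, base_s * factor**n),
    so retries spread out instead of stampeding in lockstep.
    """
    delays = []
    for attempt in range(attempts):
        cap = min(max_delay_s, base_s * (factor ** attempt))
        delays.append(rng() * cap)
    return delays
```

Pair this with an error classifier: retry network/5xx failures through this schedule, but surface “bad input” errors immediately.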
Storage & indexing
Store:
- Raw transcript text
- Word-level or sentence-level timestamps (for subtitles and jump links)
- Utterances and speaker labels (for meeting UX)
- Entities and redaction outputs (for search and compliance)
- Chapters/summaries (for navigation and “TL;DR” UX)
Add “jump to timestamp” everywhere: in summary bullet points, in chapters, and in search results. It makes transcripts feel trustworthy because users can verify by listening.
21) Security best practices (keys, URLs, webhooks)
Voice data is sensitive. Your security posture should assume audio can contain PII, confidential business info, and regulated content. Good security is a combination of key handling, network controls, and data governance.
API key handling
- Keep your key in server environment variables (not in client apps).
- Rotate keys if accidentally exposed.
- Log carefully: never log full headers containing Authorization keys.
Public URLs and access control
If you pass audio_url as a public URL, ensure it is a direct file URL and accessible from AssemblyAI’s servers.
Avoid exposing private recordings with globally public URLs; prefer short-lived signed URLs when possible.
Webhook security
- Use HTTPS only.
- Validate incoming webhook payloads (signature/secret if supported in your configuration).
- Make handlers idempotent and quick; enqueue heavy work.
- Rate limit webhook endpoints to defend against spam.
If you handle sensitive audio, also define retention (how long you keep audio and transcripts), access controls (who can view), and audit logs (who accessed what). The API provides capabilities, but your policies complete the security story.
22) Troubleshooting: common errors and fixes
“My transcript is stuck”
If status is queued or processing, it may simply be waiting for capacity.
Avoid polling every second. If the job stays in that state for an unusually long time, try:
- Confirm the audio URL is a direct file and reachable.
- Confirm the audio isn’t corrupted or unsupported.
- Retry submission once (don’t spam).
401 Unauthorized
Usually means missing or invalid API key.
Confirm you are sending Authorization: <YOUR_API_KEY> and you did not include extra spaces or quotes.
429 Too Many Requests
Rate limiting can occur on some endpoints/services.
The docs describe rate limits with X-RateLimit-* headers and 429 responses for exceeded limits.
Implement exponential backoff and reduce polling frequency. Use webhooks instead of polling at scale.
422 Unprocessable Entity / Upload failures
Upload failures are usually about request formatting or payload issues:
wrong content type, sending JSON instead of binary, or a file that exceeds the upload size constraints.
Ensure you use Content-Type: application/octet-stream and --data-binary for curl uploads.
“Audio URL is a webpage”
The docs explicitly mention failures when the audio URL is a webpage rather than a file.
Fix by using a direct file URL or upload the media via /v2/upload.
1) Confirm URL is direct media. 2) Confirm it’s accessible without cookies. 3) Check duration/size limits. 4) Reduce features to isolate (transcript-only), then add features back one by one.
23) AssemblyAI API FAQs
Is AssemblyAI REST or WebSocket?
Both. Async transcription uses REST endpoints under https://api.assemblyai.com.
Streaming transcription uses a WebSocket endpoint like wss://streaming.assemblyai.com/v3/ws.
Do I have to upload my audio?
No. If you have a public direct audio URL, you can pass it as audio_url to /v2/transcript.
Uploading is helpful when you have local files or private user uploads.
What is the max file size and duration?
AssemblyAI’s FAQ states /v2/transcript supports up to 5GB and up to 10 hours,
and /v2/upload supports local uploads up to 2.2GB.
Can I enable both Summarization and Auto Chapters?
The docs state you can only enable one of Summarization and Auto Chapters in the same transcription. If you need both, consider running separate jobs or using LeMUR for one of the outputs.
How does speaker diarization work?
Diarization detects multiple speakers and returns utterances segmented by speaker. The docs describe that the transcript returns a list of utterances when diarization is enabled.
How do I reduce costs?
Turn on only necessary add-ons, trim audio (skip intros/silence), store transcript results to avoid reprocessing, and use webhooks instead of frequent polling. Pricing includes base rates and add-on rates.
24) Sources & official docs (recommended starting points)
Use official docs for exact request/response fields and the latest feature set. The links below are the most important pages to bookmark.
Core REST endpoints
- API Overview / Base URL: the REST base URL is https://api.assemblyai.com.
- Upload: POST /v2/upload with a binary upload.
- Submit transcript: POST /v2/transcript with audio_url.
- Status lifecycle: queued / processing / completed / error.
Streaming (real-time)
- WebSocket handshake: wss://streaming.assemblyai.com/v3/ws.
- EU streaming host option: streaming.eu.assemblyai.com.
Models & features
- Summarization docs: summary model/type + mutual exclusion with chapters.
- Auto Chapters docs: chapter fields and navigation structure.
- Speaker diarization docs: utterances + speaker_labels + best practices.
- Entity detection docs: supported entity types.
- PII redaction docs: redaction behavior and purpose.
- Pricing page: base rates and add-ons.
- Limits FAQ: file size/duration constraints.