Speech-to-Text, Text-to-Speech, and Voice Agent APIs for developers
Deepgram is a Voice AI platform for developers building speech recognition, voice synthesis, and autonomous voice agent applications.
Developers integrate Deepgram through REST and WebSocket APIs, SDKs, or an in-browser playground. The primary workflow involves sending audio streams or files to receive transcriptions, synthesized speech, or fully orchestrated voice agent responses. The Voice Agent API handles the full pipeline — speech recognition, LLM orchestration, and voice synthesis — without requiring developers to stitch together separate services.
Deepgram's STT offering includes two models: Nova-3, a general-purpose model with native streaming, and Flux, a conversational model with built-in end-of-turn detection and interruption handling. On the TTS side, Aura-2 provides over 40 voice personas with sub-200ms time-to-first-byte. Audio Intelligence features — summarization, sentiment analysis, topic detection, and intent recognition — run in real time alongside transcription. The platform also supports HIPAA-compliant medical transcription via Nova-3 Medical models trained on clinical terminology.
Deepgram targets developers and enterprise teams building contact center infrastructure, conversational AI, healthcare transcription, and quick-service restaurant automation. Pricing follows a usage-based model with Pay-As-You-Go, Growth, and Enterprise tiers. Competitors in the ASR and Voice AI category include OpenAI Whisper, Amazon Transcribe, Google Cloud Speech-to-Text, and Microsoft Azure Speech Services. Deepgram publishes benchmark comparisons against each of these, including an interactive ASR comparison tool on its website.
The platform supports self-hosted deployment via VPC or on-premises installation, in addition to its cloud offering. Native integrations exist for Five9 and Genesys in contact center contexts, and Deepgram is the exclusive voice partner for IBM watsonx Orchestrate. Official SDKs cover major programming languages, and full API reference documentation is available at developers.deepgram.com.
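As a rough illustration of the file-to-transcript workflow described above, the sketch below builds (but does not send) a REST request for pre-recorded audio using only the Python standard library. The endpoint URL, the `Token` authorization scheme, the `model` query parameter, and the response shape are assumptions based on a general reading of the public API; check each against the reference at developers.deepgram.com before relying on them. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for pre-recorded transcription; confirm in the API reference.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_transcription_request(audio: bytes, api_key: str,
                                model: str = "nova-3",
                                content_type: str = "audio/wav") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes."""
    query = urllib.parse.urlencode({"model": model})
    return urllib.request.Request(
        url=f"{DEEPGRAM_LISTEN_URL}?{query}",
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",  # assumed auth scheme
            "Content-Type": content_type,
        },
    )


def extract_transcript(response_json: dict) -> str:
    """Pull the first transcript out of the assumed response shape."""
    return response_json["results"]["channels"][0]["alternatives"][0]["transcript"]
```

To actually send the request, pass the result to `urllib.request.urlopen`, parse the body with `json.load`, and hand the parsed dictionary to `extract_transcript`. The official SDKs wrap these steps and are likely the better choice in production.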
A conversational-first speech-to-text model with built-in end-of-turn detection and natural interruption handling for dialogue-focused applications.
A single WebSocket API that unifies STT, LLM orchestration, and TTS to deliver end-to-end voice interactions with sub-300ms latency.
Real-time audio analysis features including summarization, sentiment analysis, topic detection, and intent recognition applied to transcribed audio.
Enterprise-grade TTS engine offering 40+ voice personas with sub-200ms time-to-first-byte latency.
Deepgram's flagship STT model achieving a 5.26% Word Error Rate with native streaming support for real-time transcription.
A scalable voice AI infrastructure capable of handling 140,000+ concurrent calls with native integrations for Five9 and Genesys platforms.
Deepgram serves as the exclusive voice AI partner for IBM watsonx Orchestrate, embedding its voice capabilities into IBM's enterprise AI platform.
Officially maintained SDKs for integrating Deepgram's speech-to-text, text-to-speech, and language understanding APIs into developer applications.
HIPAA-compliant STT models trained specifically on clinical terminology for use in medical and healthcare environments.
Documented support for deploying Deepgram models within private VPC or on-premises infrastructure for data-sensitive environments.
An in-browser testing environment within the Deepgram console that allows developers to test all Deepgram models without writing code.
An interactive browser-based tool that enables side-by-side accuracy testing of Deepgram against other ASR providers using custom audio input.
Pay-As-You-Go: Self-serve, usage-based access to Deepgram APIs with no upfront commitment. Pay only for what you use.
Growth: For growing teams and businesses scaling voice AI usage, with additional support and higher throughput.
Enterprise: Custom pricing for large-scale deployments requiring dedicated infrastructure, SLAs, and compliance needs.
Deepgram owns the voice AI stack developers actually want to build on.
“5.26% WER on Nova-3, sub-300ms Voice Agent latency, and a single WebSocket replacing three separate services. IBM's exclusive voice partner and scaling to 140,000 concurrent calls — this isn't a scrappy challenger.”
The IBM watsonx Orchestrate exclusive tells you something. Enterprise partnerships at that level don't go to vendors who won't be around. The 140,000 concurrent call capacity and Five9/Genesys integrations confirm they're already embedded in production infrastructure, not pilots.
The unified Voice Agent API is the real differentiator. Google Cloud Speech-to-Text and Amazon Transcribe make you stitch STT, LLM, and TTS together yourself. Deepgram does it over a single WebSocket. That's developer hours, not just latency. The Nova-3 Medical HIPAA compliance opens healthcare deals competitors aren't positioned to close.
The tradeoff: pricing page is absent, starting price unknown. Pay-As-You-Go is listed as free-tier access, not a fixed rate — renewal math is invisible until you're already scaled. Pilot aggressively, but get the volume pricing in writing before you standardize.
Nova-3's 5.26% WER leads Deepgram's published benchmarks against Amazon Transcribe and Google Cloud Speech-to-Text; making the comparison tool public is a confident move.
IBM partnership plus Five9/Genesys integrations make this easy to defend; the lack of pricing transparency is the only awkward board-level question.
In-browser API Playground and no-code testing mean developers can validate production viability in hours, not sprint cycles.
The Voice Agent API collapses three-service architectures into one; that's an architectural advance, not just a cost saving.
IBM watsonx exclusive partnership and 140,000+ concurrent call infrastructure suggest a vendor with real enterprise traction and staying power.
Engineering teams building production voice agents who want one vendor instead of three.
You need multilingual support beyond 10 languages and can't afford pricing uncertainty at scale.
Deepgram is the voice infrastructure layer that serious voice products are built on.
“Nova-3's 5.26% WER and sub-300ms Voice Agent latency aren't marketing numbers — they're architectural commitments. For teams building voice-forward products, this is the platform that removes the stitching work.”
40+ voice personas in Aura-2, a unified WebSocket pipeline for the full STT-LLM-TTS stack, and HIPAA-compliant medical models — that's not a feature list, that's a platform decision. Someone here has shipped production voice infrastructure before. The ASR comparison tool is a confident move: you don't put competitors side-by-side unless you know you win.
The creative ceiling question is real, though. 40 voice personas sounds deep until your brand needs a voice that doesn't sound like anyone else's. Custom voice cloning isn't surfaced in the evidence — if it's absent, teams building distinctive audio identities will hit that wall within 18 months. ElevenLabs owns that creative tier right now.
If you adopt Deepgram as your voice infrastructure, in 3 years you have enterprise-grade reliability, IBM watsonx as a distribution moat, and 140,000+ concurrent call capacity. What you may not have is brand voice differentiation. Right infrastructure choice, potentially wrong creative choice.
Publishing benchmark comparisons against Amazon Transcribe, Google Cloud Speech, and Azure is a category-leader posture — few challengers do it this transparently.
Built for developer-led voice product teams, not brand creative workflows — the API Playground confirms the practitioner profile they're designing for.
Single WebSocket for the full voice pipeline, official multi-language SDKs, and self-hosted VPC options cover nearly every enterprise deployment pattern.
IBM watsonx exclusivity and Five9/Genesys integrations create distribution depth, but custom voice identity capability isn't evidenced and that gap compounds over time.
Nova-3 Medical, Flux's end-of-turn detection, and real-time Audio Intelligence show genuine model specialization beyond generic ASR.
Developer-led teams building production voice products where accuracy, latency, and compliance requirements are non-negotiable.
Your primary need is distinctive brand voice creation rather than reliable voice infrastructure at scale.
5.26% WER, sub-300ms latency, zero published unit pricing — classic enterprise bait.
“Deepgram's technical specs are credible and competitive. But no pricing page means every TCO model starts with a phone call.”
Nova-3 at 5.26% WER and Aura-2 at sub-200ms TTFB are real numbers. 140,000 concurrent calls is a real ceiling. The Voice Agent API unifying STT, LLM, and TTS over one WebSocket connection cuts integration cost — fewer vendors, fewer invoices. Compare that to stitching Amazon Transcribe plus Polly plus Lambda plus your own orchestration. The consolidation math favors Deepgram at scale.
The pricing page doesn't exist. Three tiers listed — Pay As You Go, Growth, Enterprise — all marked 'Free' as a placeholder. No per-minute rate published. No overage rate published. That's the real risk: not the sticker, the invoice you can't model. Category norm for ASR is $0.006–$0.024 per minute. Deepgram's actual rate is unknown from public materials.
Enterprise tier adds HIPAA compliance and self-hosted VPC, both meaningful for healthcare and contact center buyers. But 'custom pricing' plus no termination-for-convenience language visible means procurement will fight this. Growth tier has higher concurrency but no defined threshold. Budget conservatively: assume 20–30% volume growth annually. At a flat rate that compounds to roughly 1.4–1.7× year-1 by year 3, and with no contractual ceiling on rates the year-3 invoice could reach 2×.
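The budgeting arithmetic above can be checked with a small projection. This is a sketch under stated assumptions: the year-1 volume of 10 million minutes is hypothetical, the per-minute rates are the category-norm range quoted above rather than Deepgram's actual rates (which are unpublished), and the rate is held flat across years.

```python
def projected_annual_costs(minutes_year1, rate_per_minute, annual_growth, years=3):
    """Return the cost for each year, compounding volume growth at a flat rate."""
    return [
        minutes_year1 * (1 + annual_growth) ** year * rate_per_minute
        for year in range(years)
    ]


# Hypothetical volume: 10 million minutes in year 1.
MINUTES_YEAR1 = 10_000_000

for rate in (0.006, 0.024):        # category-norm range, not Deepgram's rates
    for growth in (0.20, 0.30):    # volume growth range from the text
        costs = projected_annual_costs(MINUTES_YEAR1, rate, growth)
        multiple = costs[-1] / costs[0]
        print(f"rate=${rate}/min growth={growth:.0%}: "
              f"year1=${costs[0]:,.0f} year3=${costs[-1]:,.0f} ({multiple:.2f}x)")
```

With the rate held flat, two years of compounding at 20–30% gives a year-3 multiple of about 1.44× to 1.69×, independent of the rate. A 2× year-3 invoice would need either roughly 41% annual volume growth or a rate increase on top of the volume growth, which is why a contractual rate ceiling matters.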
Pay-As-You-Go self-serve lowers SMB friction, but Enterprise tier requires a sales call and custom contract — standard procurement overhead.
No public auto-renewal terms, no published cancellation window, and 'custom pricing' Enterprise contracts suggest standard negotiation friction.
No unit pricing published anywhere; three tiers exist but none lists per-minute rates or overage caps in public materials.
5.26% WER and sub-300ms latency are measurable; contact center and medical transcription use cases produce quantifiable throughput and accuracy gains.
Consolidated API reduces integration vendor count, but unpublished rates make 3-year TCO modeling impossible without a sales conversation.
Enterprise contact center or healthcare teams with volume to negotiate custom rates and measurable accuracy requirements.
Your team needs a predictable monthly invoice without a sales relationship to set it.
Nova-3's 5.26% WER and sub-300ms latency make this a serious production stack.
“Deepgram ships a unified Voice Agent API over a single WebSocket — STT, LLM orchestration, TTS in one connection. For audio producers building voice pipelines, that's less stitching, more shipping.”
Nova-3 at 5.26% WER is a number worth respecting. Whisper needs workarounds for real-time streaming that Deepgram handles natively. Flux adds end-of-turn detection out of the box; that's not a minor feature, that's the difference between a conversation agent that works and one you spend weeks tuning. Aura-2's sub-200ms time-to-first-byte on TTS means the voice response doesn't feel like a chatbot with lag.
The Audio Intelligence layer — summarization, sentiment, topic detection, intent — runs alongside transcription in real time. That's a meaningful workflow win for post-production pipelines handling call center audio. The 140,000+ concurrent call capacity signals real infrastructure, not a demo-tier promise. Self-hosted VPC deployment is documented, which matters the moment a healthcare client asks about HIPAA.
Pricing page isn't public, which is a daily friction point when you're estimating project costs for a client. The API Playground helps day-one orientation, but the lack of transparent per-minute rates means budgeting requires a conversation, not a spreadsheet.
Single WebSocket for the full voice agent pipeline means fewer integration surfaces to babysit daily, but opaque pricing creates recurring friction when scoping new projects.
developers.deepgram.com with changelog, API reference, and in-browser playground suggests the docs are maintained by people who actually integrate the API.
No public pricing page forces cost estimation offline; otherwise the API Playground and docs with a visible changelog reduce daily friction considerably.
Nova-3 Medical for HIPAA-compliant clinical transcription, self-hosted VPC deployment, and 140,000+ concurrent call handling give power users real headroom beyond the starter tier.
Native Five9 and Genesys integrations, official SDKs across major languages, and Flux's built-in interruption handling map directly onto contact center and conversational AI build workflows.
Audio producers and dev teams building production voice agents or high-volume transcription pipelines who need sub-300ms latency and don't want to orchestrate three separate APIs.
You need transparent per-minute pricing upfront or are building a multilingual product requiring broad language coverage beyond 10 languages.
The API-first voice stack that makes Amazon Transcribe look like it's trying too hard
“Deepgram has the bones of a genuine category leader — 5.26% word error rate, sub-300ms Voice Agent latency, a single WebSocket that replaces three stitched-together services. This is infrastructure for builders, not a shiny dashboard for occasional users.”
The Voice Agent API pulling STT, LLM orchestration, and TTS over one WebSocket is genuinely thoughtful. Anyone who's duct-taped together Whisper, an LLM, and a TTS service knows the latency compounding pain. Deepgram just skips that tax. Sub-300ms end-to-end isn't a marketing number — it's the difference between a voice product that feels alive and one that feels like a conference call with a bad connection.
The in-browser API Playground is the kind of thing that means someone on the team actually thought about the new-developer experience. Test Nova-3 without writing a line of code. That's day-one friction removed. The ASR comparison tool is a nice flex too — they're confident enough to let you upload your own audio and run it against Google and Amazon live.
The real tradeoff: there's no pricing page. Usage-based with no public rates means you're flying blind until you're already invested. For solo builders, that's annoying. The mobile story is also basically nonexistent — this is an API platform, so that's expected, but worth naming.
The API Playground and ASR comparison tool suggest real attention to developer-facing detail, though the missing public pricing page is a conspicuous rough edge.
Official SDKs across major languages plus full docs at developers.deepgram.com make the ramp reasonable, though the Flux versus Nova-3 model decision requires some homework.
This is an API-first developer platform, so mobile isn't the product, and the web console offers no meaningful mobile experience.
Free sign-up plus a no-code Playground means a developer can hear Nova-3 working before they've written a single line — that's a strong first ten minutes.
The 140,000+ concurrent call capacity and dedicated enterprise SLAs at the top tier suggest the infrastructure is built to hold, not just demo well.
Developer teams building production voice agents, contact center infrastructure, or healthcare transcription who want one platform instead of three.
You need transparent upfront pricing before you can get internal budget approval.
5.26% WER and 140k concurrent calls — the numbers do real work here
“Deepgram has the benchmarks, the enterprise integrations, and the latency specs to be a credible default choice for developer-first voice AI. The missing pricing page and no free plan are small yellow flags in an otherwise solid evidence set.”
Three tells upfront. One: 'The Voice AI Economy is Powered by Deepgram' is the kind of headline that ages poorly, but the actual feature claims are specific and testable. Two: no public pricing page was found, which means cost at scale is an unknown until you're already committed. Three: IBM watsonx as exclusive voice partner is a real anchor tenant, not a logo-wall vanity badge.
The differentiation is genuine. Nova-3 at 5.26% WER with a medical vertical variant, Flux with built-in end-of-turn detection, and a single-WebSocket Voice Agent API under 300ms: that's not Amazon Transcribe territory. Amazon and Google leave you to stitch those layers yourself. Deepgram bundles them. That matters for contact center buyers.
Exit portability is the quiet tradeoff. The Voice Agent API is a proprietary orchestration layer. If you build deep into it, migrating to OpenAI or Azure means re-architecting. STT-only usage exits cleanly. The bundled stack doesn't.
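One common way to keep the STT-only exit clean is to code against a small vendor-neutral interface, so a migration touches one adapter rather than the whole application. The sketch below is illustrative only: every class and function name is hypothetical, and the adapters return placeholder strings where a real one would call the vendor's API.

```python
from typing import Protocol


class SpeechToText(Protocol):
    """Hypothetical vendor-neutral interface the application codes against."""

    def transcribe(self, audio: bytes) -> str: ...


class DeepgramSTT:
    """Placeholder adapter; a real one would call the vendor's API here."""

    def transcribe(self, audio: bytes) -> str:
        return f"[deepgram transcript of {len(audio)} bytes]"


class OtherVendorSTT:
    """Placeholder adapter for whichever vendor you might migrate to."""

    def transcribe(self, audio: bytes) -> str:
        return f"[other-vendor transcript of {len(audio)} bytes]"


def handle_call(audio: bytes, stt: SpeechToText) -> str:
    """Application logic depends only on the interface, not the vendor."""
    return stt.transcribe(audio).upper()


# Swapping vendors changes only the object passed in.
print(handle_call(b"1234", DeepgramSTT()))
print(handle_call(b"1234", OtherVendorSTT()))
```

The same pattern is harder to apply to a bundled voice agent pipeline, because turn-taking, LLM orchestration, and synthesis are handled inside the vendor's connection rather than behind a single request-response call, which is the re-architecture cost the paragraph above describes.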
Bundled STT+LLM+TTS under one WebSocket with an interactive ASR comparison tool is a concrete gap vs. Amazon Transcribe and Google Cloud Speech-to-Text.
STT-only exits are clean via standard APIs, but the Voice Agent WebSocket orchestration layer creates real re-architecture costs if you go deep.
Changelog present, active SDK maintenance, IBM partnership, and 140k concurrent-call scale claims suggest real infrastructure investment — no public funding data visible though.
Specific numbers like 5.26% WER and sub-300ms latency are verifiable claims; the 'Voice AI Economy' framing is puffery but the product page stays mostly grounded.
IBM watsonx exclusivity, Five9/Genesys integrations, and Nova-3 Medical all suggest enterprise traction beyond typical API-startup-round-two fadeouts.
Developer teams building production contact center or healthcare voice applications who want a single vendor for the full STT-to-TTS pipeline.
You need transparent usage pricing before committing, or you're prototyping and want a no-credit-card sandbox.
Common questions answered by our AI research team
Nova-3 achieves a 5.26% Word Error Rate.
The Voice Agent API delivers end-to-end voice interactions with sub-300ms latency over a single WebSocket connection.
Yes, Deepgram is available in both cloud and self-hosted (on-premises) deployments.
The Flux Multilingual model supports 10 languages.
Yes, Deepgram offers a free sign-up and a Playground to test the API without writing code.
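For readers unfamiliar with the metric behind the 5.26% figure: Word Error Rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so a 5.26% WER means roughly five word errors per hundred reference words. The sketch below computes it with a standard word-level edit distance. It is a generic illustration, not Deepgram's benchmarking method, and it skips the text normalization (casing, punctuation) that real evaluations usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.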
Deepgram is a San Francisco-based speech AI company offering speech-to-text, text-to-speech, and voice agent APIs for developers and enterprises.