"Whisper vs Deepgram vs Google STT for real-world Portuguese audio: the benchmark I actually ran"

Direct answer first: for real-world Brazilian-Portuguese audio, after running 50 actual clips through all three, the pick is Deepgram for streaming (lowest latency, lowest cost, best on noise and heavy accents), Whisper API for batch (good accuracy, five lines of integration), and self-hosted Whisper large-v3 when the audio can't leave your machine. Google Cloud Speech-to-Text I skip — it didn't win anything for Portuguese in my test and it's the priciest of the three for streaming: $0.016/min (Google Cloud STT pricing) versus $0.0048/min for Deepgram Nova-3 (Deepgram pricing).

I'm Ulisses. I run Hens, a software studio, and I ship voice and bot pipelines for clients. I also built OverAir, a WhatsApp-based digital memory product (0 paying customers today — I'll stay honest about that the whole way through). In April 2026 I had to pick a transcription engine for a pipeline that chews through WhatsApp voice notes in production. I didn't trust the vendor marketing benchmarks. I built my own.

If you're a team in the US or the Gulf evaluating speech-to-text for a non-English market, the lesson generalizes: benchmark on your real audio, not on LibriSpeech. WhatsApp is the default channel across Latin America and the Middle East, and a voice note recorded on a phone in a noisy souk or a São Paulo bus is nothing like an English audiobook.

The dataset — and why it changes everything

Most STT benchmarks you'll find run on LibriSpeech: studio-grade English, trained narrator. That has zero overlap with the audio that hits a real consumer product in Portuguese, Arabic, or Hindi.

My dataset is 50 Brazilian-Portuguese clips, split on purpose:

30 WhatsApp voice notes — recorded on phones, outdoors, with wind, a TV in the background, people talking fast and swallowing syllables. The audio ships as compressed OPUS, mono, 16 kHz. Worst case, and the real case.
20 podcast segments — decent mic, treated room, articulate speech. Best case.

For every clip I transcribed a "golden" reference by hand. Without a hand-made golden, WER is meaningless — you're just comparing one machine to another. Transcribing 50 clips manually took me about 6 hours. Boring, and the part that made everything downstream worth anything.

WER (Word Error Rate) is the metric: substitutions, deletions, and insertions counted against the hand-made reference, divided by the number of words in that reference. A 10% WER means roughly 10 errors per 100 reference words. For voice notes, anything under 12% lets an LLM extract intent reliably. Above 18%, the LLM starts hallucinating on top of the error.
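For readers who want to score their own clips, here is a minimal sketch of WER as word-level edit distance divided by the reference length. The `normalize` step (lowercasing, stripping punctuation) and the function names are assumptions for illustration, not the exact scoring script behind the numbers in this post.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split into words (assumed preprocessing)."""
    # \w is Unicode-aware in Python 3, so accented letters such as "ã" survive.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only one previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the denominator is the reference length, an engine that invents a lot of extra words can score above 100%, which is one reason to read a few transcripts by hand and not just trust the mean.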

The result in one table

Benchmark run in April 2026 on my machine (Apple M2, 16 GB) and on the cloud APIs. Mean WER over the 50 clips:

Engine	WER clean audio	WER noisy audio	Latency	Where it runs
Whisper large-v3 (self-hosted M2)	8.2%	14.1%	~1/3 of clip length (~10s for a 30s clip)	Local
OpenAI Whisper API (`whisper-1`)	9.5%	15.8%	~5s constant	Cloud (batch)
Deepgram Nova-2 PT-BR	7.8%	12.3%	<300ms (streaming)	Cloud (streaming)
Google Cloud STT v2 (`chirp_2`)	8.5%	13.5%	~4s	Cloud

Three things jump out. Deepgram won both WER scenarios and is by far the fastest. Self-hosted Whisper comes within half a point on clean audio (8.2% vs 7.8%) but drops harder under noise (14.1% vs 12.3%). And Google sits mid-pack: not bad, just not best at anything, which is a problem when it's also the most expensive for streaming.
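How much each engine degrades under noise can be read straight off the table. A small sketch that computes the gap in percentage points; the numbers are copied from the table, and the `noise_penalty` helper is just an illustrative name.

```python
# Mean WER (%) from the table above: (clean audio, noisy audio).
results = {
    "Whisper large-v3 (self-hosted)": (8.2, 14.1),
    "OpenAI Whisper API": (9.5, 15.8),
    "Deepgram Nova-2 PT-BR": (7.8, 12.3),
    "Google Cloud STT v2": (8.5, 13.5),
}


def noise_penalty(clean: float, noisy: float) -> float:
    """Percentage points of WER added when moving from clean to noisy audio."""
    return round(noisy - clean, 1)


penalties = {engine: noise_penalty(*wers) for engine, wers in results.items()}

# Print engines from most to least robust under noise.
for engine, penalty in sorted(penalties.items(), key=lambda item: item[1]):
    print(f"{engine}: +{penalty} points")
```

Deepgram loses the least going from podcast to voice note, and both Whisper variants lose the most.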

Honesty note: I ran this on Deepgram Nova-2, the Portuguese model at the time. Deepgram has since shipped Nova-3, which it bills as the first model to do real-time multilingual transcription (Deepgram, Nova-3 launch). I haven't re-run the full benchmark on Nova-3, so I won't quote a number I didn't measure, and keep in mind that the pricing below is Nova-3's while the WER above is Nova-2's. But the trend of Deepgram leading on noisy non-English audio matches independent testing: Whisper large-v3 trails Nova-3 by 1–3 percentage points on most real-world clips (Northflank STT benchmarks 2026).

The monthly bill — 1,000 hours of audio

WER doesn't pay invoices. Cost does. So I ran the math for a realistic production volume: 1,000 hours of audio per month = 60,000 minutes. Roughly an active WhatsApp bot with a few thousand users sending voice notes.

Engine	Price/min	Cost of 1,000h/month	Source
GPT-4o-mini-transcribe	$0.003	$180	OpenAI
Deepgram Nova-3 (streaming mono)	$0.0048	$288	Deepgram
OpenAI Whisper API (`whisper-1`)	$0.006	$360	OpenAI
Google STT v2 Chirp (streaming)	$0.016	$960	Google Cloud
Whisper large-v3 (self-hosted)	"free" + hardware/power	~$0 marginal	—

At today's rate that's roughly AED 1,058 vs AED 3,525/month between Deepgram and Google for the same work. Google costs 3.3x Deepgram and 2.7x the Whisper API, and it returned worse WER than Deepgram in my test. That's what took Google off the table early. To be fair: Google has a dynamic batch tier at $0.004/min (Google Cloud STT pricing) if you can wait up to 24 hours for results. Fine for last week's podcast. Useless for answering a voice note in real time.
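The arithmetic is simple enough to keep in a script and re-run whenever a vendor changes its price list. A minimal sketch, with per-minute prices copied from the table above; the dictionary keys and the `cost_ratio` helper are illustrative names.

```python
MINUTES_PER_MONTH = 1_000 * 60  # 1,000 hours of audio

# Per-minute list prices in USD, copied from the table above.
price_per_min = {
    "gpt-4o-mini-transcribe": 0.003,
    "Deepgram Nova-3 (streaming mono)": 0.0048,
    "OpenAI Whisper API (whisper-1)": 0.006,
    "Google STT v2 Chirp (streaming)": 0.016,
}

monthly_cost = {
    engine: round(price * MINUTES_PER_MONTH, 2)
    for engine, price in price_per_min.items()
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first engine costs than the second."""
    return round(monthly_cost[expensive] / monthly_cost[cheap], 1)


for engine, cost in sorted(monthly_cost.items(), key=lambda item: item[1]):
    print(f"{engine}: ${cost:,.0f}/month")
```

Swap in your own monthly minutes and the ranking stays the same; only the size of the gap changes.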

Self-hosted Whisper looks "free" in the table, and that's the classic trap. It isn't free. You pay in hardware (an M2 or a GPU), in power, and mostly in your own time keeping the service alive. For 1,000h/month of batch on a Mac mini already on your desk, the marginal cost really is near zero. For streaming with an SLA, forget it — you don't want to be on call for your own Whisper at 3am.

The OPUS gotcha nobody writes down

Here's the detail that ate two hours of my Saturday and that you won't find in any blog post.

WhatsApp voice notes arrive as OPUS inside an .ogg container. Whisper running locally usually eats OPUS directly: the reference openai-whisper package shells out to ffmpeg, and faster-whisper decodes through the FFmpeg libraries bundled with PyAV, so it works without you thinking about it (whisper.cpp is stricter and, depending on the build, may want WAV first). The cloud APIs are pickier: Google and several endpoints would rather you convert to wav/m4a first, and even when they accept OPUS, the edge conversion burns 1–2 seconds per file. In a voice-note pipeline, that 1–2s times thousands of messages becomes a queue, and the queue becomes a support ticket.

The fix is dumb once you know it: convert OPUS → 16 kHz mono WAV once, the moment audio enters the pipeline, and hand the WAV to the engine. A single `ffmpeg -i audio.ogg -ar 16000 -ac 1 audio.wav` does it. But if you discover this in production, with the bot already choking, the cost isn't the CPU second. It's the customer watching a "slow" bot and filing a complaint.
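Wrapped for a pipeline entry point, the same conversion could look like the sketch below. It assumes `ffmpeg` is on the PATH; the function names and the output-directory layout are hypothetical, and the only flag added on top of the command above is `-y`, which overwrites an existing output file.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg call that converts a voice note to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",             # overwrite dst if it already exists
        "-i", str(src),   # input: the OPUS-in-.ogg voice note
        "-ar", "16000",   # resample to 16 kHz
        "-ac", "1",       # downmix to mono
        str(dst),
    ]


def to_wav(src: Path, out_dir: Path) -> Path:
    """Convert one voice note at pipeline entry and return the WAV path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / (src.stem + ".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Passing the command as a list and skipping the shell means a filename with spaces or quotes from a user upload can't break the call.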

In OverAir I didn't even use dedicated STT for everything: Gemini 2.5 Flash takes audio as direct multimodal input and hands back extracted intent in a single call. For a bot that just needs to understand "book me 3pm tomorrow," that beats transcribe-then-interpret on cost. Dedicated STT earns its place when you need the exact text — captions, meeting transcripts, compliance. That's when the benchmark above matters.

The test that separated the engines most: regional accent

Almost every engine nails clean podcast audio. What separates the contenders in any non-English language is heavy accent and regional slang.

I loaded the dataset with speakers who have thick, "closed" accents: strong rural Southern Brazilian, heavy Northeastern slang, fast urban speech eating word endings. The ranking shifted:

Deepgram held up best under strong accent. It'd miss a proper noun but keep the sentence standing.
Whisper large-v3 failed on roughly 20% of heavy-slang cases — and not just by missing a word. Sometimes it invented a plausible neutral-Portuguese sentence with nothing to do with the audio. That's the worst kind of error, because it reads as correct.
Google landed in between: less hallucination than Whisper, more dropped words than Deepgram.

This matches the literature. Whisper wins on clean multilingual benchmarks like FLEURS (arXiv 2501.06117), but "clean" is the keyword. Street audio isn't FLEURS. If your product will hear the real world — ride-hailing, delivery, counter service captured on a phone — Whisper alone will let you down on strong accents, and you'll only find out when a user complains. This generalizes straight to Gulf Arabic dialects or Indian English: benchmark on the accent your users actually have.

When I pick each one

Enough "it depends." Here's what I actually do:

Streaming / real time → Deepgram. Sub-300ms latency, best WER under noise, best on accent, cheaper than Google. If the product transcribes while the person speaks — live captions, voice agent, call center — there's no argument for me. Nova-3 also brought real-time multilingual, which fixes audio that mixes two languages mid-sentence (common in the Gulf: Arabic and English in one breath).

Batch / quality with simple integration → Whisper API (`whisper-1`). Five lines of code, $0.006/min, nothing to maintain. For transcribing files that already exist (podcasts, recorded meetings, support media) with no millisecond urgency, it's the best cost-per-brain-cell. And if cost is everything, `gpt-4o-mini-transcribe` at $0.003/min halves the bill at what should be near-equal quality, though it isn't in the WER table above, so check it on your own audio first.
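For a sense of how small the integration is, here is a sketch built on the official `openai` Python package's `audio.transcriptions.create` call. The `transcribe_file` wrapper and its parameters are hypothetical; the client is passed in so the function can be exercised with a stub and without network access.

```python
from pathlib import Path


def transcribe_file(client, path: Path, model: str = "whisper-1",
                    language: str = "pt") -> str:
    """Send one audio file to the transcription endpoint and return the text."""
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language=language,  # ISO-639-1 hint; "pt" covers Brazilian Portuguese
        )
    return result.text


# Usage, assuming `pip install openai` and OPENAI_API_KEY in the environment:
#   from openai import OpenAI
#   print(transcribe_file(OpenAI(), Path("audio.wav")))
```

Switching to the cheaper model is then a matter of passing `model="gpt-4o-mini-transcribe"`, which makes it easy to run both over the same clips and compare.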

Privacy / offline / huge batch volume → self-hosted Whisper large-v3. When audio can't leave the machine (healthcare, legal, sensitive data) or when volume is so large the API gets expensive, running local on an M2 or a GPU pays off. Accept the maintenance cost as part of the deal.

Google Cloud STT → I skip it for Portuguese. Not because it's bad — it's decent. I skip it because it's the priciest streaming option and it beat nobody in my test. The one case where it earns a look: you're already deep in GCP and the dynamic-batch tier at $0.004/min, wired natively into your data lake, saves more in engineering than it loses on per-minute price. Otherwise it's money left on the table.
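The rules of thumb in this section can be sketched as one decision function. The `Workload` fields and the returned labels are hypothetical names. Letting the privacy constraint win over the real-time one is an assumption on top of the text, and it means accepting the on-call burden of self-hosted streaming.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    realtime: bool            # transcribe while the person is still speaking
    must_stay_local: bool     # audio may not leave your own hardware
    huge_batch: bool = False  # volume high enough that per-minute pricing hurts


def pick_engine(workload: Workload) -> str:
    """Encode this section's rules of thumb as a single decision."""
    # Hard constraint first: if audio can't leave the machine, nothing else matters.
    if workload.must_stay_local:
        return "self-hosted Whisper large-v3"
    if workload.realtime:
        return "Deepgram (streaming)"
    # Batch work: self-host only when the volume makes the API bill painful.
    if workload.huge_batch:
        return "self-hosted Whisper large-v3"
    return "OpenAI Whisper API (whisper-1)"
```

Google Cloud STT is left out on purpose: the one case for it (already deep in GCP, dynamic-batch tier) is an infrastructure argument, not a property of the workload.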

What I'd do differently

If I built this pipeline again today, I'd go straight to Deepgram Nova-3 for streaming and fall back to the Whisper API only where the audio is already sitting on disk. I wouldn't waste time re-testing Google: I tested it for you, it cost more and landed mid-pack. And I'd convert every WhatsApp OPUS file to 16 kHz WAV at the pipeline entrance, not at the edge of each call, because that 1–2s bites when you least expect it.

If you're building a WhatsApp bot, a voice agent, or anything that has to actually understand non-English audio in production, that measured-not-guessed kind of decision is what Hens ships. Reach out.

Sources

Deepgram — Pricing (Nova-3 streaming/batch, mono and multilingual)
Deepgram — Introducing Nova-3
OpenAI — Whisper / transcription models
Google Cloud — Speech-to-Text pricing
Northflank — Best open-source STT in 2026 (benchmarks)
FLEURS-SLU multilingual benchmark — arXiv 2501.06117