benchr Issue No. 07

Voice models compared: ElevenLabs, Whisper, OpenAI, Cartesia

Real latency numbers, Arabic narration tests, and the voice model worth shipping with right now.


Models tested: 4 (ElevenLabs, Whisper, OpenAI, Cartesia)
Languages: 2 (English + Arabic MSA)
Lowest latency: 78 ms (Cartesia Sonic)
Top quality: ElevenLabs (for narration paths)

Four production voice models. The same Arabic and English test scripts. Side-by-side latency numbers. The race is closer than the marketing makes it sound.

The voice AI market in 2026 has narrowed down to roughly four serious players plus a long tail of niche providers. This piece covers them against four specific tasks: English narration of a technical paragraph, Arabic narration of MSA help-document copy, real-time conversation latency from a typical residential connection, and voice cloning quality. The voice model that fits depends entirely on the workload, and any serious voice app is better served by running two providers than one.

Tested: ElevenLabs v3, Whisper Large V3 (used for the recognition side), OpenAI's full-stack voice model (the Realtime API with native audio in and out), and Cartesia Sonic (the low-latency text-to-speech model picking up production attention). Open-source alternatives — Bark, OpenVoice, the Coqui descendants — aren't covered. None are close to production quality on the Arabic side.

Going in, I thought ElevenLabs would win every category. The voice quality has been the gold standard for two years. The latency category went to Cartesia by a wide enough margin to change my recommendation: for real-time use, ElevenLabs' 380 ms median end-to-end latency (320 ms to first token) is too slow regardless of how good the voice sounds.

English narration

ElevenLabs v3 won clearly. The prosody is the best in the field. It understands sentence-level emphasis, handles ellipses and parenthetical asides with appropriate pauses, and the voice quality on its top tier is nearly indistinguishable from a human in short clips. Median end-to-end latency was 380 ms (320 ms to first token), which is on the high side but acceptable for narration use.

OpenAI's voice model came second. Voice quality is excellent, with a slightly different texture from ElevenLabs: somewhat warmer in the lower frequencies. Latency is much better, at around 240 ms median end to end (180 ms to first token). Pronunciation of technical terms (CPU, GPU, driver names, acronyms) was occasionally wrong in ways ElevenLabs got right.

Cartesia Sonic sounds different: a flatter rhythm than ElevenLabs, but the lowest latency in the field at around 130 ms median end to end (78 ms to first token). The voice quality is good enough for production, but a listener can tell after a few sentences that something about the rhythm is off.

I don't know if the Cartesia naturalness gap will close as the model trains on more data, or whether it's a fundamental architectural difference from the way ElevenLabs handles prosody. The honest answer is I can't tell from a six-week window.

First-token latency (milliseconds)

Time from text input to first audio output. Lower is better.

  • Cartesia Sonic: 78 ms
  • OpenAI Realtime: 180 ms
  • ElevenLabs v3: 320 ms
  • Whisper (recognition only): 400 ms
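For anyone who wants to reproduce a first-token number, the measurement itself is simple: start a clock when the text is submitted and stop it when the first audio chunk comes back. The sketch below is a minimal harness under that assumption. `fake_synthesize` is a hypothetical stand-in with made-up delays, not any provider's SDK, and the benchmark's actual harness isn't published here.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return milliseconds between submitting text and receiving the first audio chunk."""
    start = time.perf_counter()
    stream = iter(synthesize(text))
    try:
        next(stream)  # blocks until the first chunk is available
    except StopIteration:
        raise ValueError("synthesizer produced no audio") from None
    return (time.perf_counter() - start) * 1000.0


def fake_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical streaming TTS stand-in; the delays are invented for the demo."""
    time.sleep(0.05)  # pretend the model needs 50 ms before the first audio
    yield b"\x00" * 320
    for _ in text.split():
        time.sleep(0.01)  # later chunks trickle in as the text is processed
        yield b"\x00" * 320


if __name__ == "__main__":
    ms = time_to_first_chunk(fake_synthesize, "Hello from the latency harness.")
    print(f"first audio after {ms:.0f} ms")
```

Swapping `fake_synthesize` for a thin wrapper around a real streaming client keeps the timing logic identical across providers, which is the point of measuring them the same way.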

One genuine uncertainty: voice-cloning consent. The technical capability of voice cloning has run ahead of the consent infrastructure. I'm not going to make a policy recommendation in a technical piece — but the working assumption in production should be that the legal landscape will tighten, possibly retroactively. Build the consent flow now.

Arabic narration

This is where the picture gets interesting. ElevenLabs v3 added Arabic support in November 2025 and the quality is good for Modern Standard Arabic. Pronunciation is accurate, prosody respects Arabic sentence rhythm, and the voice options include both male and female speakers that pass as native in short clips.

The OpenAI voice model speaks Arabic with a clear accent. It sounds like an English speaker who has learned Arabic well rather than a native speaker. For a Saudi audience, that's an immediate giveaway. Pronunciation of specific letters (the emphatic consonants, the difference between ha and kha) is sometimes off in ways that change the word entirely.

Cartesia's Arabic support was added in March 2026 and is solid for MSA but doesn't yet handle dialect well. Asked to narrate a short Khaleeji-flavored marketing line, the output reverted to MSA-coded pronunciation. Skip Cartesia for dialect work.

Whisper Large V3 handles Arabic recognition well. Testing on 40 short Arabic clips across three dialects (MSA, Khaleeji, Egyptian) produced word-level accuracy around 91% for MSA, 85% for Khaleeji, and 89% for Egyptian. Other recognition tools (Deepgram, AssemblyAI) had similar or slightly worse Arabic numbers in the same tests. Whisper is the right default for the recognition side of any Arabic stack. For more on how the models handle Arabic generally, see AI for Arabic content.
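Word-level accuracy is usually reported as one minus the word error rate, where the error count is the word-level edit distance between the reference transcript and the recognizer's output. The sketch below assumes that definition; the exact scoring used for the numbers above may differ, and it skips the normalization (diacritics, punctuation, letter variants) a real Arabic evaluation would need.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, floored at 0.0. Splits on whitespace only."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - word_errors(ref, hyp) / len(ref))


if __name__ == "__main__":
    # One wrong word out of four in each pair.
    print(word_accuracy("the quick brown fox", "the quick fox"))
    print(word_accuracy("مرحبا بكم في الدليل", "مرحبا بكم في دليل"))
```

Averaging this score over a clip set, grouped by dialect, gives figures comparable in kind to the percentages quoted above.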

For most production voice work, the differences between the top models are smaller than the marketing suggests. Where they matter, they matter a lot.

Real-time conversation latency

The most important metric for any interactive use case. Numbers below are end-to-end from user speech ending to model speech beginning, measured over a stable home connection in a major Gulf city.

End-to-end voice-stack latency, stable home connection, Gulf city, January 2026

| Stack | P50 latency | P95 latency | Stability |
| --- | --- | --- | --- |
| Cartesia Sonic (TTS only) | 130 ms | 180 ms | Excellent |
| OpenAI Realtime API | 240 ms | 410 ms | Good |
| ElevenLabs v3 streaming | 380 ms | 620 ms | Good |
| Custom pipeline (Whisper → LLM → ElevenLabs) | 900 ms | 1,400 ms | Variable |

The custom pipeline is the realistic baseline if you're building your own voice agent: it's what you get when transcription, LLM, and TTS are chained sequentially. The full-stack solutions are much faster because they pipeline the steps and start synthesizing audio before the LLM response is complete.
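One common way to get some of that overlap in a custom pipeline is to cut the LLM's token stream into sentences and hand each one to TTS as soon as it closes, rather than waiting for the full reply. The sketch below shows only the chunking step. `fake_llm_stream` is a hypothetical token source, and the punctuation rule is deliberately naive (it would split on a decimal point, for example).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!", "؟")  # includes the Arabic question mark


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment that never got punctuation


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical token stream standing in for a streaming LLM response."""
    yield from ["Sure", ".", " The", " driver", " is", " installed", ".",
                " Anything", " else", "?"]


if __name__ == "__main__":
    for sentence in sentence_chunks(fake_llm_stream()):
        # A real pipeline would send each sentence to the TTS client here,
        # so audio for sentence one plays while the LLM is still writing.
        print(repr(sentence))
```

With this in place, time to first audio depends on the length of the first sentence instead of the length of the whole response.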

For conversational use, anything over 500 ms feels sluggish to the user, and anything over 800 ms feels broken. The Realtime API and Cartesia are the only options that stay below the 500 ms threshold at both P50 and P95; ElevenLabs streaming clears it at P50 (380 ms) but not at P95 (620 ms). The custom pipeline can be optimized, but matching the native solutions takes serious engineering effort most teams won't invest.
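To apply those thresholds to your own measurements, compute P50 and P95 from the raw samples and bucket each one. The sketch below uses a nearest-rank percentile, which is one of several common definitions and not necessarily the one behind the table; the latency figures in `TABLE` are copied from the table above.

```python
import math
from typing import Dict, Sequence, Tuple

SLUGGISH_MS = 500  # above this, a reply feels sluggish
BROKEN_MS = 800    # above this, a reply feels broken


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def feel(latency_ms: float) -> str:
    """Bucket a latency using the 500 ms / 800 ms thresholds."""
    if latency_ms > BROKEN_MS:
        return "broken"
    if latency_ms > SLUGGISH_MS:
        return "sluggish"
    return "ok"


# (P50, P95) in milliseconds, copied from the table above.
TABLE: Dict[str, Tuple[int, int]] = {
    "Cartesia Sonic (TTS only)": (130, 180),
    "OpenAI Realtime API": (240, 410),
    "ElevenLabs v3 streaming": (380, 620),
    "Custom pipeline": (900, 1400),
}

if __name__ == "__main__":
    for stack, (p50, p95) in TABLE.items():
        print(f"{stack}: P50 {feel(p50)}, P95 {feel(p95)}")
```

Judging by P95 rather than P50 matters here: a stack that is fine at the median can still feel sluggish on one reply in twenty.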

Naturalness (score out of 100)

Prosody, emotional range, English clarity. Higher is better.

  • ElevenLabs v3: 95
  • OpenAI Realtime: 86
  • Cartesia Sonic: 78

Voice cloning

A 90-second sample of a single human voice was recorded reading a piece of MSA news copy. The sample was uploaded to ElevenLabs and to OpenAI's voice clone feature, and each model generated a one-minute paragraph in the cloned voice. The original and the two cloned versions were played to three listeners who know the speaker's voice well, in randomized order, with the question: which is real?

ElevenLabs produced clones that two of the three listeners mis-identified as the original. The third listener was correct, but said the clue was a subtle difference in how the speaker emphasizes certain consonants, a tell that wouldn't have been caught by anyone who didn't already know the speaker's speech patterns.

OpenAI's voice clone was identifiable as not-the-original by all three listeners, though they couldn't articulate the difference clearly. The voice was close but the texture was slightly synthetic in a way ElevenLabs has solved.

Cartesia's voice cloning is in beta and wasn't scored, pending the production release. The early version produced credible clones with a narrower emotional range than the input. Promising, not yet final.

The ethical implications of voice cloning at this quality are real and not in scope for this review. Building with this tech needs a consent and watermarking story before the feature ships. Anyone deploying voice cloning without that story is building toward an incident.

Pricing as of April 2026

ElevenLabs charges roughly $0.30 per minute of generated audio on the standard tier, dropping to about $0.15 per minute on the high-volume tier, per ElevenLabs' pricing page. OpenAI's Realtime API runs around $0.06 per minute for audio input and $0.24 per minute for audio output, per OpenAI's API pricing. Cartesia is the cheapest at about $0.04 per minute generated, with volume discounts going lower. Whisper for transcription is around $0.006 per minute, which is cheap enough that the cost of the recognition side is almost always rounding error compared with synthesis. For broader cost context across workloads, see price per use case.

For a voice feature running a thousand minutes a day (about 30,000 minutes a month), the monthly bill is about $9,000 on ElevenLabs' standard tier, or $4,500 on its high-volume tier, versus $1,200 on Cartesia. That is a real number that should factor into the architecture decision.
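The arithmetic is easy to rerun for your own volume. The sketch below uses the per-minute list prices quoted above and assumes a 30-day month and synthesis output only; both are simplifications, and the prices will drift.

```python
from typing import Dict

# Per-minute prices in USD, as quoted in this section.
PRICE_PER_MINUTE: Dict[str, float] = {
    "ElevenLabs (standard)": 0.30,
    "ElevenLabs (high volume)": 0.15,
    "OpenAI Realtime (audio output)": 0.24,
    "Cartesia": 0.04,
}


def monthly_cost(price_per_minute: float, minutes_per_day: int = 1000, days: int = 30) -> float:
    """Monthly synthesis cost in USD, rounded to cents. Assumes a flat per-minute rate."""
    return round(price_per_minute * minutes_per_day * days, 2)


if __name__ == "__main__":
    for provider, price in PRICE_PER_MINUTE.items():
        print(f"{provider}: ${monthly_cost(price):,.0f} per month")
```

At a thousand minutes a day this gives $9,000 for ElevenLabs' standard tier and $1,200 for Cartesia, a gap of $7,800 a month.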

  • ElevenLabs v3 (narration): highest quality, $0.30/min
  • Whisper Large V3 (ASR): recognition, $0.006/min
  • OpenAI Realtime (chat): mid-tier, full-stack
  • Cartesia Sonic (real-time): lowest latency, $0.04/min

  1. Text input: the sentence to be spoken aloud.
  2. Tokenize + plan prosody: the acoustic model decides timing and emphasis.
  3. Synthesis (streaming): audio chunks are generated as text is processed.
  4. Speaker output: the user hears the first sound in 78–400 ms.

Splitting the stack pays off

The right architecture for serious voice work in 2026 isn't one provider. It's two.

For latency-critical paths — real-time conversation, voice agents, anything where the user is waiting on the response — use Cartesia Sonic on the synthesis side and Whisper Large V3 on the recognition side. The combo hits sub-300 ms total round-trip on a typical connection, costs an order of magnitude less than the alternatives, and produces voice quality good enough for interactive use.

For narration paths — audiobooks, long-form spoken content, recorded podcasts — use ElevenLabs v3. The quality difference is audible after a few sentences and the latency doesn't matter when the audio will be played back later.

For multilingual paths where the target audience is Arabic-speaking, ElevenLabs handles MSA well enough to ship for premium content. Cartesia handles MSA well enough to ship for utility content. Neither handles dialect at the level a native speaker would produce. For dialect work that matters, budget for human voice talent.

The voice AI market has matured to the point where the choice of provider isn't a quality decision anymore. It's an architecture decision. Each of the four serious players is the right answer for a different workload, and the wrong answer for the others. The most common mistake teams make in 2026 is picking one provider for everything — usually ElevenLabs because of its name recognition — and absorbing the latency or cost penalty instead of splitting the stack.

If you're building voice features in the next year, design the system to support multiple providers from the start. Wrap the synthesis and recognition layers behind clean interfaces. Use the right model for each path.
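In practice, "clean interfaces" can be as small as two protocols and a router keyed by path. The sketch below is one possible shape, not a prescribed design: `Synthesizer`, `Recognizer`, `VoiceRouter`, and `StubSynthesizer` are all hypothetical names, and the stub stands in for adapters that would wrap each provider's real SDK.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Protocol


class Synthesizer(Protocol):
    """Anything that turns text into a stream of audio chunks."""
    def synthesize(self, text: str) -> Iterator[bytes]: ...


class Recognizer(Protocol):
    """Anything that turns audio into text."""
    def transcribe(self, audio: bytes) -> str: ...


@dataclass
class StubSynthesizer:
    """Hypothetical placeholder; a real adapter would call the provider's SDK here."""
    name: str

    def synthesize(self, text: str) -> Iterator[bytes]:
        yield f"[{self.name}] {text}".encode("utf-8")


class VoiceRouter:
    """Pick a synthesizer per path (e.g. 'narration', 'realtime')."""

    def __init__(self, synthesizers: Dict[str, Synthesizer], default_path: str) -> None:
        if default_path not in synthesizers:
            raise ValueError(f"unknown default path: {default_path!r}")
        self._synthesizers = synthesizers
        self._default_path = default_path

    def synthesize(self, path: str, text: str) -> Iterator[bytes]:
        # Unknown paths fall back to the default provider instead of failing.
        provider = self._synthesizers.get(path, self._synthesizers[self._default_path])
        return provider.synthesize(text)


if __name__ == "__main__":
    router = VoiceRouter(
        {"narration": StubSynthesizer("elevenlabs"), "realtime": StubSynthesizer("cartesia")},
        default_path="realtime",
    )
    print(b"".join(router.synthesize("narration", "Chapter one.")))
```

Because callers only see the path name, swapping the provider behind "realtime" or adding a third path is a configuration change rather than a rewrite.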

A sensible production split runs ElevenLabs for narration, Whisper for recognition, and Cartesia for real-time. Pick what fits the task. The cost difference is too big to use one tool for everything.

Bottom line

ElevenLabs v3 for narration paths. Whisper Large V3 for recognition. Cartesia Sonic for real-time conversation. OpenAI Realtime API for an all-in-one mid-tier choice. The cost difference between the providers is too big to use one tool for everything. Design the system to support multiple providers from the start.

Frequently asked

Which AI voice model has the lowest latency?

Cartesia Sonic at 78 ms first-token time. OpenAI Realtime API is second at about 180 ms. ElevenLabs streaming runs around 320 ms. For real-time conversation on a consumer connection, Cartesia and the OpenAI Realtime API are the only options that stay comfortably under 500 ms end to end.

Is ElevenLabs the best voice AI?

For naturalness and narration, yes. For real-time conversation, no — the latency is too high. The right approach is using ElevenLabs for narration paths and Cartesia or OpenAI Realtime for interactive paths.

How accurate is Whisper Large V3 on Arabic?

Word-level accuracy around 91% for Modern Standard Arabic, 85% for Khaleeji, 89% for Egyptian. Whisper is the recognition default for any Arabic stack: Deepgram and AssemblyAI scored similar or slightly worse in the same tests.

What does AI voice cost per minute?

ElevenLabs runs about $0.30/min on the standard tier, $0.15 at high volume. OpenAI Realtime is around $0.24/min for output. Cartesia is cheapest at $0.04/min. Whisper for recognition is $0.006/min.

Can I clone someone's voice with these models?

ElevenLabs produces clones two of three listeners can't distinguish from the original. OpenAI's clone is identifiable as synthetic. Cartesia's is in beta. All three have ethical and consent implications worth resolving before shipping.

Changelog

  • May 25, 2026 — Verified pricing against current provider documentation. Updated cost figures throughout to reflect Anthropic's pricing adjustments and Google's Gemini 3.1 Pro Preview rollout.
  • January 22, 2026 — Corrected Whisper version reference — replaced Whisper v4 with Whisper Large V3 (v4 isn't released).
  • May 11, 2026 — Originally published.

References

  1. ElevenLabs, "Pricing," elevenlabs.io/pricing, accessed May 2026.
  2. OpenAI, "Whisper research," openai.com/research/whisper, accessed May 2026.
  3. OpenAI, "API Pricing," openai.com/api/pricing, accessed May 2026.
  4. Cartesia, "Product site," cartesia.ai, accessed May 2026.