Four production voice models. The same Arabic and English test scripts. Side-by-side latency numbers. The race is closer than the marketing makes it sound.
The voice AI market in 2026 has narrowed to roughly four serious players plus a long tail of niche providers. This piece tests them on four specific tasks: English narration of a technical paragraph, Arabic narration of MSA help-document copy, real-time conversation latency from a typical residential connection, and voice cloning quality. The voice model that fits depends entirely on the workload, and any serious voice app is better served by running two providers than one.
Tested: ElevenLabs v3, Whisper Large V3 (used for the recognition side), OpenAI's full-stack voice model (the Realtime API with native audio in and out), and Cartesia Sonic (the low-latency text-to-speech model picking up production attention). Open-source alternatives — Bark, OpenVoice, the Coqui descendants — aren't covered. None are close to production quality on the Arabic side.
Going in, I thought ElevenLabs would win every category. The voice quality has been the gold standard for two years. But the latency category went to Cartesia by a wide enough margin to change my recommendation: for real-time use, ElevenLabs' 380 ms time to first audio is too slow regardless of how good the voice sounds.
English narration
ElevenLabs v3 won clearly. The prosody is the best in the field. It understands sentence-level emphasis, handles ellipses and parenthetical asides with appropriate pauses, and the voice quality on its top tier is basically indistinguishable from a human in short clips. Latency to first audio was 380 ms, which is on the high side but acceptable for narration use.
OpenAI's voice model came second. Voice quality is excellent. Slightly different texture from ElevenLabs, somewhat warmer in the lower frequencies. Latency is way better at around 240 ms to first audio. Pronunciation of technical terms (CPU, GPU, driver names, acronyms) was occasionally wrong in ways ElevenLabs got right.
Cartesia Sonic produced output with a different sound. Flatter rhythm than ElevenLabs but with the lowest latency in the field at around 130 ms. The voice quality is good enough for production, but a listener can tell after a few sentences that something about the rhythm is off.
I don't know if the Cartesia naturalness gap will close as the model trains on more data, or whether it's a fundamental architectural difference from the way ElevenLabs handles prosody. The honest answer is I can't tell from a six-week window.
One genuine uncertainty: voice-cloning consent. The technical capability of voice cloning has run ahead of the consent infrastructure. I'm not going to make a policy recommendation in a technical piece — but the working assumption in production should be that the legal landscape will tighten, possibly retroactively. Build the consent flow now.
Arabic narration
This is where the picture gets interesting. ElevenLabs v3 added Arabic support in November 2025 and the quality is good for Modern Standard Arabic. Pronunciation is accurate, prosody respects Arabic sentence rhythm, and the voice options include both male and female speakers that pass as native in short clips.
The OpenAI voice model speaks Arabic with a clear accent. It sounds like an English speaker who has learned Arabic well rather than a native speaker. For a Saudi audience, that's an immediate giveaway. Pronunciation of specific letters (the emphatic consonants, the difference between ha and kha) is sometimes off in ways that change the word entirely.
Cartesia's Arabic support was added in March 2026 and is solid for MSA but doesn't yet handle dialect well. Asked to narrate a short Khaleeji-flavored marketing line, the output reverted to MSA-coded pronunciation. Skip Cartesia for dialect work.
Whisper Large V3 handles Arabic recognition well. Testing on 40 short Arabic clips across three varieties (MSA, Khaleeji, Egyptian) produced word-level accuracy around 91% for MSA, 85% for Khaleeji, and 89% for Egyptian. Other recognition tools (Deepgram, AssemblyAI) had similar or slightly worse Arabic numbers in the same tests. Whisper is the right default for the recognition side of any Arabic stack. For more on how the models handle Arabic generally, see AI for Arabic content.
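The article doesn't say exactly how word-level accuracy was scored. One common definition is one minus the word error rate, where the error count is the word-level edit distance between the reference transcript and the recognizer's output. A minimal sketch under that assumption follows; the function name and the whitespace tokenization are illustrative choices, and a real Arabic evaluation would likely also normalize diacritics and letter variants before comparing.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER (floored at 0) using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    wer = prev[-1] / len(ref)
    return max(0.0, 1.0 - wer)


# One substituted word out of four reference words.
score = word_accuracy("open the settings page", "open the setting page")
```

Averaging this score over a clip set, per variety, gives figures comparable in shape to the percentages above, though not necessarily identical to however the original test was scored.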
For most production voice work, the differences between the top models are smaller than the marketing suggests. Where they matter, they matter a lot.
Real-time conversation latency
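Latency is the most important metric for any interactive use case. Numbers below are end-to-end from user speech ending to model speech beginning, measured over a stable home connection in a major Gulf city. The Cartesia row covers synthesis only, so it is not directly comparable with the full-stack rows.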
| Stack | P50 latency | P95 latency | Stability |
|---|---|---|---|
| Cartesia Sonic (TTS only) | 130 ms | 180 ms | Excellent |
| OpenAI Realtime API | 240 ms | 410 ms | Good |
| ElevenLabs v3 streaming | 380 ms | 620 ms | Good |
| Custom pipeline (Whisper → LLM → ElevenLabs) | 900 ms | 1,400 ms | Variable |
The custom pipeline is the realistic baseline if you're building your own voice agent. It's what comes out when transcription, LLM, and TTS get chained sequentially. The full-stack solutions are way faster because they pipeline the steps and start synthesizing audio before the LLM response is complete.
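To see why pipelining matters, here is a back-of-the-envelope model of time to first audio. It is a sketch, not a measurement: the `StageTimings` fields and every number in the example are hypothetical, except the 130 ms synthesis figure, which is borrowed from the table above. The sequential version waits for the whole LLM response before synthesis starts; the pipelined version hands the first sentence to the synthesizer as soon as it is complete.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    asr_ms: float              # transcription after the user stops speaking
    llm_first_token_ms: float  # delay before the LLM emits its first token
    llm_ms_per_token: float    # steady-state generation time per token
    tts_first_audio_ms: float  # text in to first audio chunk out


def sequential_first_audio(t: StageTimings, total_tokens: int) -> float:
    """Synthesis starts only after the full LLM response is available."""
    llm_done = t.llm_first_token_ms + t.llm_ms_per_token * total_tokens
    return t.asr_ms + llm_done + t.tts_first_audio_ms


def pipelined_first_audio(t: StageTimings, first_sentence_tokens: int) -> float:
    """Synthesis starts as soon as the first sentence has been generated."""
    first_sentence = (
        t.llm_first_token_ms + t.llm_ms_per_token * first_sentence_tokens
    )
    return t.asr_ms + first_sentence + t.tts_first_audio_ms


# Hypothetical timings for illustration only.
timings = StageTimings(
    asr_ms=150.0,
    llm_first_token_ms=200.0,
    llm_ms_per_token=10.0,
    tts_first_audio_ms=130.0,
)
sequential = sequential_first_audio(timings, total_tokens=60)
pipelined = pipelined_first_audio(timings, first_sentence_tokens=12)
saved = sequential - pipelined
```

In this model the saving is exactly the generation time for everything after the first sentence, which is why longer responses make a sequential chain look progressively worse.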
For conversational use, anything over 500 ms feels sluggish to the user. Anything over 800 ms feels broken. The Realtime API and Cartesia are the only options that stay comfortably below the 500 ms threshold at both P50 and P95. The custom pipeline can be optimized, but matching the native solutions takes serious engineering effort most teams won't invest.
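If you want to reproduce this kind of comparison on your own connection, the sketch below turns a list of latency samples into P50 and P95 and labels a value against the 500 ms and 800 ms thresholds above. The function names are illustrative, and the percentile method (linear interpolation via the standard library) is an assumption; the article doesn't state which method produced its numbers.

```python
import statistics

SLUGGISH_MS = 500  # above this, the article says responses feel sluggish
BROKEN_MS = 800    # above this, they feel broken


def summarize(samples_ms):
    """Return P50 and P95 of a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index 49 is P50 and index 94 is P95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def feel(latency_ms):
    """Classify a latency against the conversational thresholds."""
    if latency_ms > BROKEN_MS:
        return "broken"
    if latency_ms > SLUGGISH_MS:
        return "sluggish"
    return "ok"


# P95 values taken from the table above.
p95_by_stack = {
    "Cartesia Sonic": 180,
    "OpenAI Realtime API": 410,
    "ElevenLabs v3 streaming": 620,
    "Custom pipeline": 1400,
}
verdicts = {stack: feel(ms) for stack, ms in p95_by_stack.items()}
```

Judging by P95 rather than P50 is the stricter reading: it asks how the slowest one in twenty turns feels, which is what users tend to remember.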
Voice cloning
A 90-second sample of a single human voice was recorded reading a piece of MSA news copy. The sample was uploaded to ElevenLabs and to OpenAI's voice clone feature, and each model generated a one-minute paragraph in the cloned voice. The original and the two cloned versions were played to three listeners who know the speaker's voice well, in randomized order, with the question: which is real?
ElevenLabs produced clones that two of the three listeners mis-identified as the original. The third listener was correct, but said the clue was a subtle difference in how the speaker emphasizes certain consonants. A tell that wouldn't have been caught by anyone who didn't already know the speaker's speech patterns.
OpenAI's voice clone was identifiable as not-the-original by all three listeners, though they couldn't articulate the difference clearly. The voice was close but the texture was slightly synthetic in a way ElevenLabs has solved.
Cartesia's voice cloning is in beta and wasn't scored, pending the production release. The early version produced credible clones with a narrower emotional range than the input. Promising, not yet final.
The ethical implications of voice cloning at this quality are real and not in scope for this review. Building with this tech needs a consent and watermarking story before the feature ships. Anyone deploying voice cloning without that story is building toward an incident.
Pricing as of April 2026
ElevenLabs charges roughly $0.30 per minute of generated audio on the standard tier, dropping to about $0.15 per minute on the high-volume tier, per ElevenLabs' pricing page. OpenAI's Realtime API runs around $0.06 per minute for audio input and $0.24 per minute for audio output, per OpenAI's API pricing. Cartesia is the cheapest at about $0.04 per minute generated, with volume discounts going lower. Whisper for transcription is around $0.006 per minute, which is cheap enough that the cost of the recognition side is almost always rounding error compared with synthesis. For broader cost context across workloads, see price per use case.
For a voice feature running a thousand minutes a day (about 30,000 minutes a month), the monthly bill at standard-tier prices is about $9,000 on ElevenLabs versus $1,200 on Cartesia, a difference of about $7,800. A real number that should factor into the architecture decision.
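The arithmetic behind that comparison is easy to check and to rerun with your own volume. This sketch assumes a 30-day month and uses the approximate per-minute prices quoted above; the dictionary keys and function name are illustrative.

```python
# Approximate per-minute synthesis prices quoted in this section (USD).
PRICE_PER_MIN = {
    "elevenlabs_standard": 0.30,
    "elevenlabs_high_volume": 0.15,
    "cartesia": 0.04,
}


def monthly_cost(price_per_min, minutes_per_day, days=30):
    """Monthly synthesis cost in USD, rounded to cents."""
    return round(price_per_min * minutes_per_day * days, 2)


minutes_per_day = 1000
costs = {
    name: monthly_cost(price, minutes_per_day)
    for name, price in PRICE_PER_MIN.items()
}
gap = round(costs["elevenlabs_standard"] - costs["cartesia"], 2)
```

At a thousand minutes a day this gives about $9,000 for the ElevenLabs standard tier, $4,500 for its high-volume tier, and $1,200 for Cartesia, so the gap is about $7,800 at standard pricing and narrows to about $3,300 if the high-volume tier applies.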
| Model | Best for | Summary |
|---|---|---|
| ElevenLabs v3 | Narration | Highest quality, $0.30/min |
| Whisper Large V3 | ASR | Recognition, $0.006/min |
| OpenAI Realtime | Chat | Mid-tier, full-stack |
| Cartesia Sonic | Real-time | Lowest latency, $0.04/min |

Sentence to be spoken aloud. → Acoustic model decides timing, emphasis. → Audio chunks generated as text is processed. → User hears first sound in 78–400 ms.
Splitting the stack pays off
The right architecture for serious voice work in 2026 isn't one provider. It's two.
For latency-critical paths (real-time conversation, voice agents, anything where the user is waiting on the response), use Cartesia Sonic on the synthesis side and Whisper Large V3 on the recognition side. The combo hits sub-300 ms total round-trip on a typical connection, costs roughly six to seven times less per minute than the alternatives at standard list prices, and produces voice quality good enough for interactive use.
For narration paths — audiobooks, long-form spoken content, recorded podcasts — use ElevenLabs v3. The quality difference is audible after a few sentences and the latency doesn't matter when the audio will be played back later.
For multilingual paths where the target audience is Arabic-speaking, ElevenLabs handles MSA well enough to ship for premium content. Cartesia handles MSA well enough to ship for utility content. Neither handles dialect at the level a native speaker would produce. For dialect work that matters, budget for human voice talent.
The voice AI market has matured to the point where the choice of provider isn't a quality decision anymore. It's an architecture decision. Each of the four serious players is the right answer for a different workload, and the wrong answer for the others. The most common mistake teams make in 2026 is picking one provider for everything — usually ElevenLabs because of its name recognition — and absorbing the latency or cost penalty instead of splitting the stack.
If you're building voice features in the next year, design the system to support multiple providers from the start. Wrap the synthesis and recognition layers behind clean interfaces. Use the right model for each path.
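One way to keep providers swappable is a small interface plus a router keyed by workload path. The sketch below is illustrative only: `Synthesizer`, `FakeSynth`, and `VoiceRouter` are hypothetical names, and the stand-in class makes no network calls. A real adapter would wrap each vendor's SDK behind the same method.

```python
from typing import Dict, Iterator, Protocol


class Synthesizer(Protocol):
    """Anything that turns text into a stream of audio chunks."""

    def synthesize(self, text: str) -> Iterator[bytes]: ...


class FakeSynth:
    """Hypothetical stand-in for a vendor client; makes no network calls."""

    def __init__(self, label: str) -> None:
        self.label = label

    def synthesize(self, text: str) -> Iterator[bytes]:
        # A real adapter would call the vendor SDK here and yield audio bytes.
        yield f"[{self.label}] {text}".encode("utf-8")


class VoiceRouter:
    """Maps a workload path (e.g. "realtime", "narration") to a synthesizer."""

    def __init__(self, routes: Dict[str, Synthesizer]) -> None:
        self._routes = dict(routes)

    def synthesize(self, path: str, text: str) -> Iterator[bytes]:
        if path not in self._routes:
            raise KeyError(f"no synthesizer configured for path {path!r}")
        return self._routes[path].synthesize(text)


router = VoiceRouter({
    "realtime": FakeSynth("low-latency"),
    "narration": FakeSynth("high-quality"),
})
audio = b"".join(router.synthesize("narration", "Chapter one."))
```

Because callers only name the path, moving narration or real-time traffic to a different vendor becomes a change to the routing table rather than to application code. The recognition side can follow the same pattern with its own interface.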
Most production deployments run ElevenLabs for narration, Whisper for recognition, and Cartesia for real-time. Pick what fits the task. The cost difference is too big to use one tool for everything.