One thing to get straight before any picks. This page isn't about translating between Arabic and English. That's a different job with different winners, and it has its own guide. This is about register: you write or speak in Khaleeji, and you want the model to answer in Khaleeji, not in the flat newsreader Arabic that every model defaults to when it's nervous.
That default has a name. Every frontier model is trained on far more Modern Standard Arabic, the formal written register, than on any spoken dialect. So MSA is where they're strong, and dialect is where they wobble. The wobble shows up as drift: you ask in Najdi, and the reply comes back in MSA, or worse, in Egyptian, because Egyptian has more training data than Gulf. Holding the dialect is the whole test here.
Why MSA is the gravity well
Start with the numbers, because they explain everything that follows. On academic Arabic tasks, models sit around 85 to 92 percent. Drop to Gulf dialect and accuracy falls to roughly 75 to 85 percent. Egyptian holds up a little better than Gulf because it's better represented; Levantine sits lower; Maghrebi is the floor. The pattern tracks training-data volume, not anything clever about the dialects themselves.
The clearest measurement anyone's published on Gulf drift is a leakage rate: how often a model that should answer in dialect replies in MSA instead. A Saudi-tuned model called Saudi-Dialect-ALLaM, trained with Hijazi and Najdi datasets, leaked MSA 32.63 percent of the time before tuning. After LoRA fine-tuning on 5,466 synthetic instruction pairs, leakage dropped to 6.21 percent and its Saudi-dialect rate hit 84.21 percent. The lesson for the frontier models is blunt: none of them is tuned like that, so all of them leak. The question is only how much, and toward what.
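If you want a rough version of that measurement for your own prompts, a leakage rate is just the share of replies that land in each register. The sketch below assumes you've already hand-labeled each reply; the label names and the sample counts are hypothetical, not taken from the Saudi-Dialect-ALLaM work, and the function name is made up for illustration.

```python
from collections import Counter


def register_rates(labels):
    """Return the percentage of replies that fall under each register label."""
    if not labels:
        raise ValueError("need at least one labeled reply")
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Hypothetical hand labels for 20 replies to Najdi prompts (illustrative only).
labels = ["saudi"] * 14 + ["msa"] * 4 + ["egyptian"] * 2

rates = register_rates(labels)
print(rates)  # the "msa" entry is the MSA leakage rate; "egyptian" is drift toward Cairo
```

Keeping MSA and Egyptian as separate labels matters, because it answers both halves of the question: how much a model leaks, and toward what.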
Dialect fidelity, model by model
There's no controlled head-to-head that pits Claude, GPT, Gemini, and Qwen against each other on Khaleeji. So the ranking below is an inference, drawn from each model's documented dialect behavior and its academic Arabic scores, not a single benchmark run. Read it as a ranking of how likely each model is to hold your register, with the caveat that the floor moves under all of them.
Claude Opus 4.7 takes the top spot. Anthropic's own multilingual notes, backed by comparative testing, give it a slight edge on Egyptian and Gulf Arabic, likely from training-data mix. In blind tests on academic Arabic summaries it scored 9.5 out of 10, well ahead of GPT at 8.5 and Gemini at 7.5, and it produces more varied, natural Arabic where GPT tends toward repetitive vocabulary and stiffer phrasing. The trait that matters most for dialect work is the failure mode: when Claude isn't sure, it stays nearer your register instead of bailing to MSA. For the wider read on how it stacks up against the field on Saudi-market Arabic, benchr's working report on AI for Arabic content scores five models across MSA and three dialects.
Qwen 3 is the runner-up, and the honest one. It supports 119 languages and dialects with a 128K context window, more linguistic breadth than any rival here, and it's the strongest of the group at switching between Arabic and English mid-sentence. But breadth isn't depth. No public benchmark validates Qwen's Khaleeji output against Claude's, so picking it is a bet on coverage, not a verified Gulf result. If your work is heavily code-mixed, Qwen earns the look; if it's pure Najdi or Hijazi, the case is weaker.
GPT-5.5 is fine at Arabic and bad at this specific job. It's strongest in the dialects with the most training data, Egyptian and Levantine, and accuracy slips in thinner ones. The catch for Gulf users is the direction of the slip: GPT tends to drift toward Egyptian, the dialect it knows best, so a Najdi prompt can come back sounding like Cairo. That's a worse outcome than MSA for a Saudi reader, because it's wrong in a way that sounds confident.
Gemini 3 Pro understands Gulf input fine; it reportedly parses over 16 Arabic dialects on the way in. The problem is the way out. Gemini standardizes its replies to MSA across every regional variant, so you can ask in Khaleeji and you'll get formal Arabic back, every time. For comprehension that's fine. For a reply that sounds like it came from the Gulf, it's a non-starter, and benchr's Gemini 3 Pro evaluation covers the rest of where that model lands.
Najdi and Hijazi: the sub-dialect cliff
Step inside Saudi Arabia and the ground gets thinner. "Gulf Arabic" in a benchmark usually means a blended Khaleeji average. Najdi (central, Riyadh) and Hijazi (western, Jeddah and Mecca) are distinct, and they're rare enough in public data that no frontier model can promise either one. The literature is consistent here: large models stay dominated by MSA with limited support for Saudi dialects specifically, and they tend to collapse Najdi and Hijazi into a generic Gulf register or straight into MSA.
A model can know "Gulf" as a category and still flatten Najdi and Hijazi into the same beige Arabic.
This is where the fine-tuned models earn their place. If your product lives or dies on Saudi sub-dialect accuracy, a base frontier model isn't the answer; a LoRA-tuned Saudi model is, and Saudi-Dialect-ALLaM's 84 percent Saudi rate is the proof of concept. Among the off-the-shelf models, Claude Opus 4.7 collapses the sub-dialects least, but "least" isn't "well." Set expectations accordingly.
The scoreboard
The short version, by what you're trying to do. Find your row and take the pick.
General Khaleeji chat: Claude Opus 4.7. Holds register, least MSA drift.
Arabic-English code-mix: Qwen 3. 119 dialects, best at switching.
Najdi / Hijazi accuracy: Fine-tuned model. ALLaM-class, not a frontier base.
Formal MSA output: Any of them. Gemini defaults here for free.
Voice / transcription: Speechmatics. 6.3% WER on code-switch audio.
Worst fit for Gulf: Gemini 3 Pro. Understands, replies in MSA.
A note on that voice row, because it's easy to over-read. Speechmatics' Arabic-English bilingual model hits a 6.3 percent word error rate on code-switching, about 35 percent lower than Google's 9.7 percent. That's a strong result, but it's speech-to-text. It tells you nothing about how a chat model generates dialect, so don't carry it over to text work. If you're weighing the spoken side more broadly, benchr's comparison of voice models has the wider picture.
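The "about 35 percent lower" figure is a relative reduction, not a gap in percentage points, and it's easy to check. A minimal sketch using only the two word error rates quoted above; the helper name is made up for illustration.

```python
def relative_reduction(baseline, candidate):
    """Percent by which candidate falls below baseline."""
    return 100 * (baseline - candidate) / baseline


# Word error rates quoted above: Google 9.7 percent, Speechmatics 6.3 percent.
reduction = relative_reduction(9.7, 6.3)
print(f"{reduction:.1f}% lower")
```

The absolute gap is 3.4 percentage points; the relative figure is larger because it's measured against Google's 9.7.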
If your real need is moving text cleanly between the two languages rather than holding one dialect, that's the sibling guide's territory: benchr's guide to Arabic-English translation ranks the models on direction quality both ways, which is a separate question from register fidelity. And if the underlying job is just producing solid long-form Arabic prose, the language-agnostic guide to AI for writing is the better starting point, since the writing-quality leader and the dialect leader happen to be the same family.