The best AI for Saudi and Gulf Arabic

Where the models hold Khaleeji dialect and where they slide back into MSA or drift toward Egyptian.

Figures verified against official sources, 30 May 2026

One thing to get straight before any picks. This page isn't about translating between Arabic and English. That's a different job with different winners, and it has its own guide. This is about register: you write or speak in Khaleeji, and you want the model to answer in Khaleeji, not in the flat newsreader Arabic that every model defaults to when it's nervous.

That default has a name: Modern Standard Arabic (MSA), the formal written register. Every frontier model is trained on far more of it than on any spoken dialect, so MSA is where they're strong and dialect is where they wobble. The wobble shows up as drift: you ask in Najdi, and the reply comes back in MSA, or worse, in Egyptian, because Egyptian has more training data than Gulf. Holding the dialect is the whole test here.
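A drift check of this kind can be roughed out in a few lines. The sketch below is a deliberately crude heuristic, not how any study cited here measured dialect: it counts a handful of hand-picked marker words (a hypothetical list chosen for illustration) and guesses the register from whichever group shows up most. The names `MARKERS`, `guess_register`, and `drifted` are all made up for this example. A real evaluation would need a trained dialect classifier or native-speaker review.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical marker words, written without hamza so they match normalized text.
# Some are ambiguous in real usage; this list is illustrative only.
MARKERS = {
    "gulf": {"وش", "شلون", "الحين", "ابي", "ابغى", "وايد"},
    "egyptian": {"ازاي", "عايز", "دلوقتي", "ايه", "كده", "اوي"},
    "msa": {"ماذا", "لماذا", "اريد", "الان", "سوف", "ليس"},
}

_HAMZA = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # short vowels and tatweel
_ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def tokens(text):
    """Split text into Arabic words after light normalization."""
    text = unicodedata.normalize("NFC", text)
    text = _DIACRITICS.sub("", text).translate(_HAMZA)
    return _ARABIC_WORD.findall(text)


def guess_register(text):
    """Return the register whose markers appear most often, or 'unknown'."""
    counts = Counter()
    for tok in tokens(text):
        for label, words in MARKERS.items():
            if tok in words:
                counts[label] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


def drifted(prompt, reply):
    """True when a Gulf-looking prompt gets an MSA- or Egyptian-looking reply."""
    return guess_register(prompt) == "gulf" and guess_register(reply) in {"msa", "egyptian"}
```

Calling `drifted(prompt, reply)` on a batch of prompt and reply pairs gives a rough count of how often a model left the Gulf register. Short replies with no marker words come back as "unknown", so they count as neither held nor drifted.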

Why MSA is the gravity well

Start with the numbers, because they explain everything that follows. On academic benchmarks in MSA, models score around 85 to 92 percent. Drop to Gulf dialect and accuracy falls to roughly 75 to 85 percent. Egyptian holds up a little better than Gulf because it's better represented; Levantine sits lower; Maghrebi is the floor. The pattern tracks training-data volume, not anything clever about the dialects themselves.

32.63% MSA leakage in a Saudi base model before fine-tuning. Specialized LoRA tuning cut it to 6.21%.

That leakage figure is the clearest measurement anyone's published on Gulf drift. A Saudi-tuned model called Saudi-Dialect-ALLaM, trained with Hijazi and Najdi datasets, leaked MSA 32.63 percent of the time before tuning. After LoRA fine-tuning on 5,466 synthetic instruction pairs, leakage dropped to 6.21 percent and its Saudi-dialect rate hit 84.21 percent. The lesson for the frontier models is blunt: none of them are tuned like that, so all of them leak. The question is only how much, and toward what.
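For readers who want to run the same kind of measurement on their own outputs, here is a minimal sketch of how a leakage rate and a dialect rate fall out of labeled replies. It assumes each reply has already been tagged with a register label by a person or a classifier. The label names and the ten-item sample are hypothetical, not data from the paper, and the paper's own scoring method may differ.

```python
from collections import Counter


def register_rates(labels):
    """Return the percentage of outputs carrying each register label."""
    if not labels:
        raise ValueError("need at least one labeled output")
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * n / total for label, n in counts.items()}


# Hypothetical labels for ten model replies (illustration only).
sample = ["saudi"] * 7 + ["msa"] * 2 + ["other"]

rates = register_rates(sample)
msa_leakage = rates.get("msa", 0.0)   # share of replies that fell back to MSA
saudi_rate = rates.get("saudi", 0.0)  # share of replies that held Saudi dialect
```

The two rates need not sum to 100, since some replies land in neither bucket. That matches the published figures, where 6.21 percent leakage and an 84.21 percent Saudi rate leave a remainder.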

Dialect fidelity, model by model

There's no controlled head-to-head that pits Claude, GPT, Gemini, and Qwen against each other on Khaleeji. So the bars below are an inference, drawn from each model's documented dialect behavior and its academic Arabic scores, not a single benchmark run. Read them as a ranking of how likely each model is to hold your register, with the caveat that the floor moves under all of them.

How well each model holds Gulf (Khaleeji) register

Inferred dialect fidelity; higher means better at staying in Gulf rather than drifting to MSA or Egyptian. Based on documented dialect behavior and academic Arabic scores, not a single head-to-head benchmark. May 2026.

  • Claude Opus 4.7: Holds best
  • Qwen 3: Broad, unproven on Gulf
  • GPT-5.5: Drifts Egyptian
  • Gemini 3 Pro: Replies in MSA

Claude Opus 4.7 takes the top spot. Anthropic's own multilingual notes, backed by comparative testing, give it a slight edge on Egyptian and Gulf Arabic, likely from its training-data mix. In blind tests on academic Arabic summaries it scored 9.5 out of 10, well ahead of GPT at 8.5 and Gemini at 7.5, and it produces more varied, natural Arabic where GPT tends toward repetitive vocabulary and stiffer phrasing. The trait that matters most for dialect work is the failure mode: when Claude isn't sure, it stays nearer your register instead of bailing to MSA. For the wider read on how it stacks up against the field on Saudi-market Arabic, benchr's working report on AI for Arabic content scores five models across MSA and three dialects.

Qwen 3 is the runner-up, with an honest caveat attached. It supports 119 languages and dialects with a 128K context window, more linguistic breadth than any rival here, and it's the strongest of the group at switching between Arabic and English mid-sentence. But breadth isn't depth. No public benchmark validates Qwen's Khaleeji output against Claude's, so picking it is a bet on coverage, not a verified Gulf result. If your work is heavily code-mixed, Qwen earns the look; if it's pure Najdi or Hijazi, the case is weaker.

GPT-5.5 is fine at Arabic and bad at this specific job. It's strongest in the dialects with the most training data, Egyptian and Levantine, and accuracy slips in thinner ones. The catch for Gulf users is the direction of the slip: GPT tends to drift toward Egyptian, the dialect it knows best, so a Najdi prompt can come back sounding like Cairo. That's a worse outcome than MSA for a Saudi reader, because it's wrong in a way that sounds confident.

Gemini 3 Pro understands Gulf input fine; it reportedly parses over 16 Arabic dialects on the way in. The problem is the way out. Gemini standardizes its replies to MSA across every regional variant, so you can ask in Khaleeji and you'll get formal Arabic back, every time. For comprehension that's fine. For a reply that sounds like it came from the Gulf, it's a non-starter, and benchr's Gemini 3 Pro evaluation covers the rest of where that model lands.

Najdi and Hijazi: the sub-dialect cliff

Step inside Saudi Arabia and the ground gets thinner. "Gulf Arabic" in a benchmark usually means a blended Khaleeji average. Najdi (central, Riyadh) and Hijazi (western, Jeddah and Mecca) are distinct, and they're rare enough in public data that no frontier model can promise either one. The literature is consistent here: large models stay dominated by MSA with limited support for Saudi dialects specifically, and they tend to collapse Najdi and Hijazi into a generic Gulf register or straight into MSA.

A model can know "Gulf" as a category and still flatten Najdi and Hijazi into the same beige Arabic.

This is where the fine-tuned models earn their place. If your product lives or dies on Saudi sub-dialect accuracy, a base frontier model isn't the answer; a LoRA-tuned Saudi model is, and Saudi-Dialect-ALLaM's 84 percent Saudi rate is the proof of concept. Among the off-the-shelf models, Claude Opus 4.7 collapses the sub-dialects least, but "least" isn't "well." Set expectations accordingly.

The scoreboard

The short version, by what you're trying to do. Find your row and take the pick.

  • General Khaleeji chat: Claude Opus 4.7. Holds register, least MSA drift.
  • Arabic-English code-mix: Qwen 3. 119 languages and dialects, best at switching.
  • Najdi / Hijazi accuracy: Fine-tuned model. ALLaM-class, not a frontier base.
  • Formal MSA output: Any of them. Gemini defaults here for free.
  • Voice / transcription: Speechmatics. 6.3% WER on code-switch audio.
  • Worst fit for Gulf: Gemini 3 Pro. Understands, replies in MSA.

A note on that voice row, because it's easy to over-read. Speechmatics' Arabic-English bilingual model hits a 6.3 percent word error rate on code-switching, about 35 percent lower than Google's 9.7 percent. That's a strong result, but it's speech-to-text. It tells you nothing about how a chat model generates dialect, so don't carry it over to text work. If you're weighing the spoken side more broadly, benchr's comparison of voice models has the wider picture.
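For anyone checking the arithmetic, word error rate is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below is a plain textbook implementation, not Speechmatics' or Google's scoring pipeline (real evaluations also normalize text first). The function names are chosen for this example. It also reproduces the "about 35 percent lower" comparison from the two published rates.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Published rates from the text: 9.7 percent (Google) vs 6.3 percent (Speechmatics).
gap = relative_reduction(9.7, 6.3)
```

Here `gap` works out to a little over 0.35, which is where the 35 percent figure comes from. It is a relative drop, not a 35-point difference in error rate.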

If your real need is moving text cleanly between the two languages rather than holding one dialect, that's the sibling guide's territory: benchr's guide to Arabic-English translation ranks the models on direction quality both ways, which is a separate question from register fidelity. And if the underlying job is just producing solid long-form Arabic prose, the language-agnostic guide to AI for writing is the better starting point, since the writing-quality leader and the dialect leader happen to be the same family.

Frequently asked

Does Claude handle Khaleeji (Gulf Arabic) better than other models?

Probably, on current evidence. Claude Opus 4.7 shows a documented slight advantage in Gulf and Egyptian Arabic, likely due to training-data distribution. It produces more natural dialect variance and drifts to MSA less when it hits uncertain dialectal input, compared with GPT, which tends to drift toward Egyptian, and Gemini, which standardizes everything to MSA in its replies. None of this rests on a head-to-head Gulf benchmark, so treat it as a lead, not a settled result.

What is the difference between MSA and Gulf dialect performance in LLMs?

Models perform much better on Modern Standard Arabic, around 85 to 92 percent on academic benchmarks, than on Gulf dialect, around 75 to 85 percent, because training data skews toward formal written Arabic. When a model hits Gulf input it often drifts back to MSA or a more common dialect, and the effect is worst on underrepresented sub-dialects like Najdi and Hijazi Saudi Arabic.

How do sub-dialects like Najdi and Hijazi perform in modern LLMs?

Najdi and Hijazi are badly underrepresented in every major frontier model. Base models like Claude, GPT, Qwen, and Gemini do not explicitly separate or optimize for these sub-dialects and tend to collapse them toward MSA or a generic Gulf register. Specialized fine-tuning closes the gap: Saudi-Dialect-ALLaM, LoRA-tuned on 5,466 synthetic instruction pairs, reached an 84.21 percent Saudi rate and cut MSA leakage from 32.63 percent to 6.21 percent.

Does Qwen 3 compete with Claude on Arabic dialects?

Qwen 3 supports 119 languages and dialects with a 128K context window, more breadth than any major rival, and it handles Arabic-English code-switching better than most peers. But no published benchmark compares Qwen's Khaleeji output head-to-head against Claude Opus 4.7, so the comparison is open. Multilingual breadth does not guarantee Gulf depth.

Which model best handles Arabic-English code-switching?

For speech-to-text, Speechmatics leads with a 6.3 percent word error rate on Arabic-English code-switching versus Google's 9.7 percent, about 35 percent lower. That result is for transcription, not text generation, so it does not transfer to chat models directly. Among frontier text models, Qwen 3 handles code-switching better than Claude, GPT, or Gemini, though it is not formally benchmarked on Gulf dialect pairs.

Changelog

  • May 30, 2026 — Originally published. Covers Claude Opus 4.7, Qwen 3, GPT-5.5, and Gemini 3 Pro on Gulf register and MSA drift; sub-dialect and fine-tuning notes from current research.

References

  1. Truescho, "Claude vs ChatGPT: Which Is Better for Arabic Content? (2026)," truescho.com, accessed May 2026.
  2. Anthropic, "Multilingual support," platform.claude.com, accessed May 2026.
  3. "Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation," arxiv.org, accessed May 2026.
  4. "Advancing AI-Driven Linguistic Analysis: Arabic Dialect Corpora for Gulf Countries and Saudi Arabia," mdpi.com, accessed May 2026.
  5. "Cross-dialectal Arabic translation: comparative analysis on large language models," frontiersin.org, accessed May 2026.
  6. Speechmatics, "Arabic-English bilingual speech-to-text," speechmatics.com, accessed May 2026.