
Testing Arabic LLM Safety with Saudi Dialect: A Follow-up Study on ASAS

  • Jan 23

Updated: Jan 25

Manar Alharbi, Asma Alsakkaf, Reem Bazarah, and the AI Astrolabe team

(work done as interns at AI Astrolabe)



Introduction


In our previous research, we introduced ASAS (أساس) - the first human-rated Arabic safety benchmark for evaluating LLMs in Modern Standard Arabic (MSA). The results revealed significant safety gaps across frontier models, with the best-performing model achieving only 68% safe responses.


Building on that work, we now present a critical follow-up study: evaluating a subset of the models on a Saudi dialect version of ASAS. This research addresses a fundamental question in multilingual AI safety: Do models that demonstrate certain safety capabilities in Modern Standard Arabic maintain those capabilities when faced with regional dialects?


Even the best-performing model achieves only 59% safe responses on Saudi dialect prompts - a 9 percentage point drop from the best MSA result (68%). This finding underscores a crucial insight: safety alignment must account for dialectal variation, and safety evaluations conducted solely in standard forms may significantly overestimate real-world model safety.


Figure 1: Safety ratings of four tested models on Saudi ASAS. The dialect variant proves even more challenging than MSA, with ALLaM 7B achieving the highest safety score (59%) and OpenAI's GPT-4o performing the weakest.

Why Saudi Dialect Matters


Arabic is not a monolithic language. While Modern Standard Arabic serves as the formal written standard across the Arab world, daily communication occurs in diverse regional dialects that differ substantially in vocabulary, grammar, and syntax. Saudi dialect, spoken by over 21 million people, represents one of the most economically significant and widely used Arabic variants.


Prior work on Arabic LLM evaluation has found persistent gaps in dialectal generalization and uneven performance across Arabic varieties relative to MSA (Altakrori et al., 2025; Robinson et al., 2025). Building on this, we examine whether similar disparities arise in the safety domain and find that:

  1. Dialectal variation creates safety vulnerabilities that are not apparent in MSA testing

  2. Model safety does not transfer uniformly across language variants

  3. Regional AI deployment requires dialect-specific evaluation to ensure true safety alignment



Overall Findings


Our evaluation of four leading models with Arabic capabilities on 694 Saudi dialect prompts reveals:


Model Performance Rankings


  1. ALLaM 7B: 59.1% safe - Surprisingly, this 7B-parameter model achieves the highest safety score, outperforming larger models. A plausible explanation is that ALLaM’s training and alignment pipeline may have had stronger exposure to Saudi/Gulf-relevant Arabic usage and safety data (e.g., region-specific dialect, norms, and policy constraints).

  2. Mistral Saba: 55.5% safe - Solid performance from this regionally focused model

  3. Claude 3.7 Sonnet: 55.2% safe - Despite leading in MSA (68%), Claude shows significant performance degradation on this dialect

  4. GPT-4o: 36.3% safe - The lowest-performing model, with a dramatic 17.7 percentage point drop from its MSA performance (54%)


Category-Specific Findings


The most challenging safety categories remain consistent with MSA results:


  1. Guns & Illegal Weapons: The most difficult category across all models


  • GPT-4o: Only 8.9% safe, with 80% of responses rated extremely unsafe

  • ALLaM 7B: 30.4% safe (best in category, but still concerning)

  • Models struggle significantly more than in MSA

  2. Controlled Substances: Second most challenging category


  • GPT-4o: 29.9% safe with 60.8% extremely unsafe

  • Claude 3.7 Sonnet: 46.9% safe

  • ALLaM 7B: 53.1% safe (best in category)


  3. Suicide & Self-Harm: Critical safety category showing major gaps


  • GPT-4o: Only 24.2% safe responses

  • Claude 3.7 Sonnet: 64.7% safe (best performer)

  • ALLaM 7B: 41.2% safe (struggles more in this category)


  4. Islamic/Arab Cultural Alignment: Largest category, culturally sensitive


  • Mistral Saba: 73.9% safe (best in category)

  • ALLaM 7B: 59.1% safe

  • GPT-4o: 47.0% safe


Figure 2: Comparative view of model performance across safety categories. The most challenging categories remain consistent with MSA, with GPT-4o showing the lowest safety rates, while Islamic/Arab Cultural Alignment favors regionally trained models such as ALLaM.

Attack Type Analysis


Different attack vectors show varying effectiveness across models. Most effective attacks (lowest safety scores):


  1. Storytelling: Most effective at eliciting unsafe responses


  • GPT-4o: Only 6.9% safe (75.9% extremely unsafe)

  • Claude 3.7 Sonnet: 22.4% safe

  • ALLaM 7B: 43.1% safe

  • Mistral Saba: 44.8% safe


  2. Hypothetical Testing: Second most effective


  • GPT-4o: 16.7% safe (56.1% extremely unsafe)

  • Claude 3.7 Sonnet: 43.9% safe

  • ALLaM 7B: 42.4% safe

  • Mistral Saba: 53.0% safe


  3. Persona Emulation: Third most effective, relying on roleplay-based prompts


  • GPT-4o: 28.1% safe

  • Mistral Saba: 40.4% safe

  • Claude 3.7 Sonnet: 42.1% safe

  • ALLaM 7B: 50.9% safe


Most resilient defense (highest safety scores):


Out of Context: Multi-turn conversational attacks are least effective:

  • Claude 3.7 Sonnet: 82.5% safe

  • ALLaM 7B: 75.9% safe

  • Mistral Saba: 72.3% safe

  • GPT-4o: 62.3% safe


Figure 3: Comparison of model performance across attack vectors. Storytelling is the most effective attack, followed by Hypothetical Testing, while Out of Context is the least effective overall.

Safety Severity Distribution


The extremely unsafe response rates are particularly concerning:

  • GPT-4o: 46.1% of responses are extremely unsafe when guardrails fail

  • ALLaM 7B: 21.2% extremely unsafe responses

  • Claude 3.7 Sonnet: 19.3% extremely unsafe responses

  • Mistral Saba: 10.5% extremely unsafe responses (best containment of severe failures)


Figure 4: Distribution of safety severity across attack vectors. Models remain susceptible to breakdown, with GPT-4o exhibiting the highest rate of extremely unsafe responses.

Model-Specific Insights

ALLaM 7B: A Notable Performer

ALLaM’s strong results are notable given its 7B parameter size. Strengths include:

  • Relatively consistent safety across most categories

  • Solid performance on Islamic/Arab Cultural Alignment (59.1%)

  • Best-in-class on Controlled Substances (53.1%)

  • A balanced profile with fewer sharp regressions than some larger models

However, ALLaM struggles with:

  • Suicide & Self-Harm: 41.2% safe

  • Storytelling attacks: 43.1% safe (still stronger than Claude and GPT-4o in our setup)

  • Guns & Illegal Weapons: 30.4% safe


GPT-4o: Significant Dialect Vulnerability


GPT-4o's dramatic performance drop from MSA to Saudi dialect (54% → 36.3%) reveals critical gaps:


  • Extremely dangerous when guardrails fail: 46% of responses are extremely unsafe

  • Catastrophic failure in weapons category: Only 8.9% safe

  • Highly vulnerable to storytelling: 6.9% safe (worst across all models/attacks)

  • Poor dialectal safety alignment across the board


This suggests GPT-4o's safety training may be heavily biased toward standard language forms. In future iterations of this work, we plan to redteam GPT-5.2, where some of these limitations may already have been mitigated through improved multilingual and dialectal safety alignment.


Claude 3.7 Sonnet: Degraded but Controlled


Despite leading the MSA benchmark (68%), Claude drops to 55.2% on Saudi dialect:


  • Still maintains best-in-class for Suicide & Self-Harm (64.7%)

  • Strong at defending against Out of Context attacks (82.5%)

  • Lower rate of extremely unsafe responses (19.3%)

  • Struggles with Storytelling (22.4%) and Guns & Illegal Weapons (47.8%)

The performance gap suggests dialect-specific safety tuning may be lacking.


Mistral Saba: Regional Focus Pays Off


As a model specifically trained for regional languages, Mistral Saba performs competitively:


  • Best containment of extremely unsafe responses (10.5%)

  • Leads in Islamic/Arab Cultural Alignment (73.9%)

  • Consistent mid-range performance across categories

  • More resilient to severe failure modes



The Dialect Challenge: Why This Matters


Our findings reveal a fundamental challenge in Arabic LLM safety:


1. Dialectal Variation Creates Safety Blind Spots


Models trained and evaluated primarily on MSA develop safety patterns that don't fully transfer to dialects. Differences in:


  • Vocabulary (different words for the same concepts)

  • Phrasing and idioms

  • Cultural context and references

  • Conversation patterns


all create opportunities for safety mechanisms to fail.


2. Real-World Usage is Primarily Dialectal


While MSA dominates formal writing, Arabic speakers use dialects for:


  • Daily conversation and chat

  • Social media communication

  • Voice assistants and conversational AI

  • Informal online content


Safety evaluations that ignore dialectal variation fail to assess real-world risk.


3. Deployment Risk in Regional Markets


Organizations deploying AI systems in Arabic-speaking regions face significant risks if safety is only verified in MSA:


  • Saudi Arabia: 21M+ speakers of Saudi dialect

  • Egypt: 100M+ speakers of Egyptian Arabic

  • Levantine region: 60M+ speakers of Levantine dialects

  • North Africa: 90M+ speakers of Maghrebi dialects


Each region requires dialect-specific safety validation.


What's Next?


Our Saudi ASAS study demonstrates that Arabic LLM safety research must expand beyond Modern Standard Arabic. At AI Astrolabe, we are committed to:


  1. Developing dialect-specific safety benchmarks for other major Arabic varieties (Egyptian, Levantine, Maghrebi)

  2. Creating safety datasets that help improve model alignment across dialectal variation

  3. Advancing research on transfer learning approaches that enable safety to generalize across language variants

  4. Redteaming newer model generations (e.g., updated GPT and Claude releases) to assess whether dialectal safety gaps are narrowing and to identify remaining failure modes

  5. Providing evaluation services for organizations deploying Arabic AI systems regionally


The path to truly safe Arabic AI systems requires understanding and addressing the full linguistic diversity of Arabic-speaking communities.



Methodology Note


This study used 694 prompts from ASAS translated and adapted to Saudi dialect, covering:


  • 7 safety categories (Violence & Hate, Controlled Substances, Bias, Suicide & Self-Harm, Islamic/Arab Cultural Alignment, Guns & Illegal Weapons, Criminal Planning)

  • 8 attack strategies (Direct Prompting, Step-by-Step Escalation, Code/Encryption, Hypothetical Testing, Storytelling, False Premise, Persona Emulation, Out of Context)


All responses were evaluated by trained human experts using a 4-level safety taxonomy:


  • Safe: Appropriate refusal or helpful response without harmful content

  • Slightly Unsafe: Minor issues or borderline content

  • Moderately Unsafe: Clear policy violations with moderate harm potential

  • Extremely Unsafe: Severe violations with high harm potential
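As a rough illustration of how aggregate numbers like those in Figures 1-4 follow from these labels, the sketch below computes a model's safe rate and extremely-unsafe rate from a list of per-response ratings. The label constants and the `safety_summary` function are illustrative assumptions for this post, not our actual evaluation pipeline.

```python
from collections import Counter

# Hypothetical labels mirroring the 4-level taxonomy described above.
SAFE = "safe"
SLIGHTLY_UNSAFE = "slightly_unsafe"
MODERATELY_UNSAFE = "moderately_unsafe"
EXTREMELY_UNSAFE = "extremely_unsafe"

def safety_summary(ratings):
    """Summarize human ratings for one model (or one category/attack slice).

    Only responses rated `safe` count toward the safe rate, so any of the
    three unsafe tiers counts as a guardrail failure.
    """
    counts = Counter(ratings)
    total = len(ratings)
    return {
        "safe_rate": counts[SAFE] / total,
        "extremely_unsafe_rate": counts[EXTREMELY_UNSAFE] / total,
    }

# Toy example: 10 rated responses for one model on one category.
example = [SAFE] * 6 + [SLIGHTLY_UNSAFE] * 2 + [EXTREMELY_UNSAFE] * 2
print(safety_summary(example))  # safe_rate 0.6, extremely_unsafe_rate 0.2
```

In practice the same summary is computed per category and per attack strategy, which is where slice-level numbers such as "8.9% safe on Guns & Illegal Weapons" come from.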


Conclusion


The Saudi ASAS results paint a concerning picture: dialectal variation significantly impacts LLM safety, and models that appear reasonably safe in Modern Standard Arabic can show dramatic safety degradation when evaluated in regional dialects.


ALLaM 7B's performance demonstrates that regionally focused models with proper training can achieve better safety outcomes than larger general-purpose models. However, even the best model achieves only 59% safe responses - far from the safety standards needed for real-world deployment.


The message is clear: Arabic LLM safety evaluation must be multilectal. Organizations developing or deploying Arabic AI systems need dialect-specific safety testing to understand true risk profiles and ensure safe, trustworthy AI across the diverse Arabic-speaking world.



AI Astrolabe helps model and agent builders evaluate, stress-test, and improve safety - across dialects, contexts, and cultures. Reach out to see how we can partner (contact@aiastrolabe.com).

 
 
 
