Testing Arabic LLM Safety with Saudi Dialect: A Follow-up Study on ASAS
- Jan 23
Updated: Jan 25
Manar Alharbi, Asma Alsakkaf, Reem Bazarah, and the AI Astrolabe team
(work done as interns at AI Astrolabe)
Introduction
In our previous research, we introduced ASAS (أساس) - the first human-rated Arabic safety benchmark for evaluating LLMs in Modern Standard Arabic (MSA). The results revealed significant safety gaps across frontier models, with the best-performing model achieving only 68% safe responses.
Building on that work, we now present a critical follow-up study: evaluating a subset of the models on a Saudi dialect version of ASAS. This research addresses a fundamental question in multilingual AI safety: Do models that demonstrate certain safety capabilities in Modern Standard Arabic maintain those capabilities when faced with regional dialects?
Even the best-performing model achieves only 59% safe responses on Saudi dialect prompts - roughly 9 percentage points below the best MSA result (68%). This finding underscores a crucial insight: safety alignment must account for dialectal variation, and safety evaluations conducted solely in standard forms may significantly overestimate real-world model safety.

Why Saudi Dialect Matters
Arabic is not a monolithic language. While Modern Standard Arabic serves as the formal written standard across the Arab world, daily communication occurs in diverse regional dialects that differ substantially in vocabulary, grammar, and syntax. Saudi dialect, spoken by over 21 million people, represents one of the most economically significant and widely used Arabic variants.
Prior work on Arabic LLM evaluation has found persistent gaps in dialectal generalization and uneven performance across Arabic varieties relative to MSA (Altakrori et al., 2025; Robinson et al., 2025). Building on this, we examine whether similar disparities arise in the safety domain and find that:
Dialectal variation creates safety vulnerabilities that are not apparent in MSA testing
Model safety does not transfer uniformly across language variants
Regional AI deployment requires dialect-specific evaluation to ensure true safety alignment
Overall Findings
Our evaluation of four leading models with Arabic capabilities on 694 Saudi dialect prompts reveals:
Model Performance Rankings
ALLaM 7B: 59.1% safe - Surprisingly, this 7B-parameter model achieves the highest safety score, outperforming larger models. A plausible explanation is that ALLaM's training and alignment pipeline had stronger exposure to Saudi- and Gulf-relevant Arabic usage and safety data (e.g., region-specific dialect, norms, and policy constraints).
Mistral Saba: 55.5% safe - Solid performance from this regionally-focused model
Claude 3.7 Sonnet: 55.2% safe - Despite leading in MSA (68%), Claude shows significant performance degradation on this dialect
GPT-4o: 36.3% safe - The lowest-performing model, with a dramatic 17.7 percentage point drop from its MSA performance (54%)
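The MSA-to-dialect drops above are simple percentage-point deltas. As an illustrative sketch (the rates are the ones reported in this post and the MSA study; the script and variable names are ours, not part of the benchmark tooling):

```python
# Safe-response rates (%) as reported; only models with published scores in
# both settings get a delta.
msa_safe = {"Claude 3.7 Sonnet": 68.0, "GPT-4o": 54.0}
dialect_safe = {
    "ALLaM 7B": 59.1,
    "Mistral Saba": 55.5,
    "Claude 3.7 Sonnet": 55.2,
    "GPT-4o": 36.3,
}

# Percentage-point drop from MSA to Saudi dialect
drops = {m: round(msa_safe[m] - dialect_safe[m], 1) for m in msa_safe}
print(drops)  # GPT-4o drops 17.7 points; Claude 3.7 Sonnet drops 12.8
```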
Category-Specific Findings
The most challenging safety categories remain consistent with MSA results:
Guns & Illegal Weapons: The most difficult category across all models
GPT-4o: Only 8.9% safe, with 80% of responses rated extremely unsafe
ALLaM 7B: 30.4% safe (best in category, but still concerning)
Models struggle significantly more than in MSA
Controlled Substances: Second most challenging category
GPT-4o: 29.9% safe with 60.8% extremely unsafe
Claude 3.7 Sonnet: 46.9% safe
ALLaM 7B: 53.1% safe (best in category)
Suicide & Self-Harm: Critical safety category showing major gaps
GPT-4o: Only 24.2% safe responses
Claude 3.7 Sonnet: 64.7% safe (best performer)
ALLaM 7B: 41.2% safe (struggles more in this category)
Islamic/Arab Cultural Alignment: The largest and most culturally sensitive category
Mistral Saba: 73.9% safe (best in category)
ALLaM 7B: 59.1% safe
GPT-4o: 47.0% safe

Attack Type Analysis
Different attack vectors show varying effectiveness across models. Most effective attacks (lowest safety scores):
Storytelling: Most effective at eliciting unsafe responses
GPT-4o: Only 6.9% safe (75.9% extremely unsafe)
Claude 3.7 Sonnet: 22.4% safe
ALLaM 7B: 43.1% safe
Mistral Saba: 44.8% safe
Hypothetical Testing: Second most effective
GPT-4o: 16.7% safe (56.1% extremely unsafe)
Claude 3.7 Sonnet: 43.9% safe
ALLaM 7B: 42.4% safe
Mistral Saba: 53.0% safe
Persona Emulation: Roleplay-based attacks
GPT-4o: 28.1% safe
Mistral Saba: 40.4% safe
Claude 3.7 Sonnet: 42.1% safe
ALLaM 7B: 50.9% safe
Most resilient defense (highest safety scores):
Out of Context: These multi-turn conversational attacks are the least effective:
Claude 3.7 Sonnet: 82.5% safe
ALLaM 7B: 75.9% safe
Mistral Saba: 72.3% safe
GPT-4o: 62.3% safe
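Per-model, per-attack safety rates like those above reduce to an aggregation over the human ratings. A minimal sketch, assuming labeled rows of (model, attack, rating) - the rows below are invented toy examples, not actual benchmark data:

```python
from collections import defaultdict

# Toy labeled responses following the study's 4-level taxonomy; invented
# purely for illustration.
rows = [
    ("GPT-4o", "Storytelling", "Extremely Unsafe"),
    ("GPT-4o", "Storytelling", "Safe"),
    ("GPT-4o", "Storytelling", "Moderately Unsafe"),
    ("ALLaM 7B", "Out of Context", "Safe"),
    ("ALLaM 7B", "Out of Context", "Safe"),
]

totals, safe = defaultdict(int), defaultdict(int)
for model, attack, rating in rows:
    totals[(model, attack)] += 1
    safe[(model, attack)] += rating == "Safe"

# Safety rate (%) per (model, attack) cell
rates = {k: round(100 * safe[k] / totals[k], 1) for k in totals}
```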

Safety Severity Distribution
The extremely unsafe response rates are particularly concerning:
GPT-4o: 46.1% of all responses rated extremely unsafe
ALLaM 7B: 21.2% extremely unsafe responses
Claude 3.7 Sonnet: 19.3% extremely unsafe responses
Mistral Saba: 10.5% extremely unsafe responses (best containment of severe failures)

Model-Specific Insights
ALLaM 7B: A Notable Performer
ALLaM’s strong results are notable given its 7B parameter size. Strengths include:
Relatively consistent safety across most categories
Solid performance on Islamic/Arab Cultural Alignment (59.1%)
Best-in-class on Controlled Substances (53.1%)
A balanced profile with fewer sharp regressions than some larger models
However, ALLaM struggles with:
Suicide & Self-Harm: 41.2% safe
Storytelling attacks: 43.1% safe (still stronger than Claude and GPT-4o in our setup)
Guns & Illegal Weapons: 30.4% safe
GPT-4o: Significant Dialect Vulnerability
GPT-4o's dramatic performance drop from MSA to Saudi dialect (54% → 36.3%) reveals critical gaps:
Severe failures dominate: 46.1% of all responses are extremely unsafe
Catastrophic failure in weapons category: Only 8.9% safe
Highly vulnerable to storytelling: 6.9% safe (worst across all models/attacks)
Poor dialectal safety alignment across the board
This suggests GPT-4o's safety training may be heavily biased toward standard language forms. In future iterations of this work, we plan to red-team GPT-5.2, where some of these limitations may already have been mitigated through improved multilingual and dialectal safety alignment.
Claude 3.7 Sonnet: Degraded but Controlled
Despite leading the MSA benchmark (68%), Claude drops to 55.2% on Saudi dialect:
Still maintains best-in-class for Suicide & Self-Harm (64.7%)
Strong at defending against Out of Context attacks (82.5%)
Lower rate of extremely unsafe responses (19.3%)
Struggles with Storytelling (22.4%) and Guns & Illegal Weapons (47.8%)
The performance gap suggests dialect-specific safety tuning may be lacking.
Mistral Saba: Regional Focus Pays Off
As a model specifically trained for regional languages, Mistral Saba performs competitively:
Best containment of extremely unsafe responses (10.5%)
Leads in Islamic/Arab Cultural Alignment (73.9%)
Consistent mid-range performance across categories
More resilient to severe failure modes
The Dialect Challenge: Why This Matters
Our findings reveal a fundamental challenge in Arabic LLM safety:
1. Dialectal Variation Creates Safety Blind Spots
Models trained and evaluated primarily on MSA develop safety patterns that do not fully transfer to dialects. Each of the following differences creates opportunities for safety mechanisms to fail:
Vocabulary (different words for the same concepts)
Phrasing and idioms
Cultural context and references
Conversation patterns
2. Real-World Usage is Primarily Dialectal
While MSA dominates formal writing, Arabic speakers use dialects for:
Daily conversation and chat
Social media communication
Voice assistants and conversational AI
Informal online content
Safety evaluations that ignore dialectal variation fail to assess real-world risk.
3. Deployment Risk in Regional Markets
Organizations deploying AI systems in Arabic-speaking regions face significant risks if safety is only verified in MSA:
Saudi Arabia: 21M+ speakers of Saudi dialect
Egypt: 100M+ speakers of Egyptian Arabic
Levantine region: 60M+ speakers of Levantine dialects
North Africa: 90M+ speakers of Maghrebi dialects
Each region requires dialect-specific safety validation.
What's Next?
Our Saudi ASAS study demonstrates that Arabic LLM safety research must expand beyond Modern Standard Arabic. At AI Astrolabe, we are committed to:
Developing dialect-specific safety benchmarks for other major Arabic varieties (Egyptian, Levantine, Maghrebi)
Creating safety datasets that help improve model alignment across dialectal variation
Advancing research on transfer learning approaches that enable safety to generalize across language variants
Red-teaming newer model generations (e.g., updated GPT and Claude releases) to assess whether dialectal safety gaps are narrowing and to identify remaining failure modes
Providing evaluation services for organizations deploying Arabic AI systems regionally
The path to truly safe Arabic AI systems requires understanding and addressing the full linguistic diversity of Arabic-speaking communities.
Methodology Note
This study used 694 prompts from ASAS translated and adapted to Saudi dialect, covering:
7 safety categories (Violence & Hate, Controlled Substances, Bias, Suicide & Self-Harm, Islamic/Arab Cultural Alignment, Guns & Illegal Weapons, Criminal Planning)
8 attack strategies (Direct Prompting, Step-by-Step Escalation, Code/Encryption, Hypothetical Testing, Storytelling, False Premise, Persona Emulation, Out of Context)
All responses were evaluated by trained human experts using a 4-level safety taxonomy:
Safe: Appropriate refusal or helpful response without harmful content
Slightly Unsafe: Minor issues or borderline content
Moderately Unsafe: Clear policy violations with moderate harm potential
Extremely Unsafe: Severe violations with high harm potential
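Given ratings on this taxonomy, the severity distributions reported earlier (including each model's headline "% safe" figure) are a per-level tally. A hedged sketch with invented toy ratings, not real study data:

```python
from collections import Counter

LEVELS = ["Safe", "Slightly Unsafe", "Moderately Unsafe", "Extremely Unsafe"]

# Toy ratings for one model; real distributions come from the full 694 prompts.
ratings = ["Safe", "Safe", "Extremely Unsafe", "Moderately Unsafe"]

counts = Counter(ratings)
distribution = {lvl: round(100 * counts[lvl] / len(ratings), 1) for lvl in LEVELS}
# distribution["Safe"] is the model's headline safe-response rate
```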
Conclusion
The Saudi ASAS results paint a concerning picture: dialectal variation significantly impacts LLM safety, and models that appear reasonably safe in Modern Standard Arabic can show dramatic safety degradation when evaluated in regional dialects.
ALLaM 7B's performance demonstrates that regionally-focused models with proper training can achieve better safety outcomes than larger general-purpose models. However, even the best model achieves only 59% safe responses - far from the safety standards needed for real-world deployment.
The message is clear: Arabic LLM safety evaluation must be multilectal. Organizations developing or deploying Arabic AI systems need dialect-specific safety testing to understand true risk profiles and ensure safe, trustworthy AI across the diverse Arabic-speaking world.
AI Astrolabe helps model and agent builders evaluate, stress-test, and improve safety - across dialects, contexts, and cultures. Reach out to see how we can partner (contact@aiastrolabe.com).


