DarijaBanking: A new resource for overcoming language barriers in banking intent detection for Moroccan Arabic speakers

Published online by Cambridge University Press: 05 December 2024

Abderrahman Skiredj*
Affiliation:
OCP Solutions, Casablanca, Morocco; UM6P College of Computing, Benguerir, Morocco
Ferdaous Azhari
Affiliation:
National Institute of Posts and Telecoms, Rabat, Morocco
Ismail Berrada
Affiliation:
UM6P College of Computing, Benguerir, Morocco
Saad Ezzini
Affiliation:
School of Computing and Communications, Lancaster University, Lancaster, UK
*Corresponding author: Abderrahman Skiredj; Email: abderrahman.skiredj@ocpsolutions.ma

Abstract

Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems, especially in specialized domains like banking. The Moroccan Dialect of Arabic (Darija) serves as a common language that blends cultural complexities, historical impacts, and regional differences. Its divergence from Modern Standard Arabic (MSA) and the influence of French, Spanish, and Tamazight present unique challenges for language models. To tackle these challenges, this paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain. DarijaBanking comprises over 1,800 parallel high-quality queries in Darija, MSA, English, and French, organized into 24 intent classes. We experimented with various intent classification methods, including full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Model prompting. Furthermore, we propose BERTouch, a BERT-based language model fine-tuned on intent detection in Darija, which outperforms state-of-the-art models, including OpenAI’s GPT-4, achieving F1 scores of 0.98 and 0.96 on Darija and MSA, respectively. The results provide insights into enhancing Moroccan Darija banking intent detection systems, highlighting the value of domain-specific data annotation and of balancing precision and cost-effectiveness.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Some examples of manually corrected translations from English to Darija

Table 2. Comprehensive Intent Catalogue

Table 3. Mean linguistic differences per intent between Darija and MSA, rated on a scale from 1 (very close) to 3 (far apart)

Table 4. Statistics of DarijaBanking dataset

Table 5. Performance of XLM-Roberta Zero-Shot Learning: Gains from Sequential Language Integration in the DarijaBanking Dataset

Table 6. Performance of various pre-trained transformers on DarijaBanking

Table 7. Performance of various pre-trained Retrievers as Intent Detectors on DarijaBanking

Table 8. Performance of LLMs as Intent Detectors on DarijaBanking

Table 9. Results of the NMT pipeline on both Darija and MSA

Table 10. Comparative F1 scores of three models across various customer intents

Table 11. Impact of Linguistic Divergence on F1 Scores Across Intent Detection Methods