Skip to main content

Arabic Dialects in Phonexia Language Identification and Speech to Text

The Arabic language has one standardized variety and many non-standard varieties (dialects). In this article, our linguistic team explains the differences between Modern Standard Arabic and Arabic dialects in the context of Phonexia Arabic models.

Standard variety: Modern Standard Arabic (MSA)

  • All Arabs learn it at school (not from their parents, so it is not their native variety)
  • It is the lingua franca (common language) for the Arabic world, similar to English for Europeans. However, Arabs speak it more fluently since they are taught MSA from an early age.
  • MSA is more similar to some dialects (e.g., Levantine) but is vastly different from others (e.g., Moroccan Arabic). Therefore, if you speak only MSA, you might understand Levantine slightly but not Moroccan.

Data acquisition

AUDIO (used for LID and STT training): MSA is used in formal speaking situations such as sermons, lectures, news broadcasts, and speeches.

  • It is quite difficult/impossible to find recordings of spontaneous phone conversations in MSA.
  • Available MSA recordings are usually from broadcasts (microphone) or formal scripted speeches (also microphone).

TEXT (used for STT language model training): MSA is used in all formal writing, such as official correspondence, literature, newspapers, and websites.

  • There is no issue accumulating a large amount of text, but it will be more formal and not reflective of spontaneous speech.

Support for MSA in Phonexia products

NameLID L4STTDescription
Arabic (MSA)arbModern Standard Arabic, official language in most countries of the Arab world

Non-standard varieties: Arabic dialects

  • Dialects are native regional varieties in the Arab world.
  • They are used in all spontaneous contexts (friends, family) and in more formal environments (work, businesses), meaning dialects are used in all daily life situations, making them more relevant to our customers.
  • Depending on the criteria for classifying dialects, there are between 10 and 30 Arabic dialects:

arabic_dialects

Arabic dialects (image source: Wikipedia)

  • There is no standard norm for writing; speakers have two options:
    • Use MSA for writing something more formal.
    • Use a dialect for more personal communication, like on Facebook, Twitter, or forums.

Data acquisition

AUDIO (used for LID and STT acoustic model training):

  • There is no difficulty in collecting spontaneous speech in dialects.
  • However, it can be challenging to create annotations for STT training—dialect speakers write words as they hear them, but due to the lack of a standardized writing system, different speakers may write words differently. Therefore, annotations in dialect need to be double-checked and unified.

TEXT (used for STT language model training): Dialects are used for more personal communication on platforms like Facebook, Twitter, and forums.

  • There is not much material available, as most written texts are in MSA.
  • Facebook, Twitter, and forums can be sources, but they need to be classified, corrected, and unified manually—which is not done at Phonexia.

These are the reasons for the limited out-of-the-box support for Arabic dialects in STT.

Support for Arabic dialects in Phonexia products

NameLID L4STTDescription
Arabic (Egypt)ar-EGArabic variety spoken in Egypt
Arabic (Gulf, Kuwait)ar-KWAR_KW_6Arabic variety as spoken in Kuwait
Also spoken in Bahrain, Al Hasa, Qatif, United Arab Emirates, Qatar, Southern Iraq, and Northern Oman
Arabic (Iraq)ar-IQArabic variety as spoken in Iraq
Also spoken in Syria, Turkey, Iran, Kuwait, Jordan, parts of northern and eastern Arabia
Arabic (Levantine)ar-XL*AR_XL_6*
AR_XL_5*Arabic variety spoken in Lebanon, Jordan, Syria, Palestine, Israel, and Turkey
Arabic (Maghrebi, Morocco)ar-MAArabic variety spoken in Morocco, Algeria, Tunisia, Libya, Western Sahara, Mauritania

There are two distinct varieties of Levantine Arabic, each with a separate language code—North Levantine (apc) and South Levantine (ajp).
Our models were trained using data from both varieties, so following RFC 5646, section 2.2.4, we created a custom language code ar-XL, where XL stands for "cross-Levantine."

note

To achieve the best STT results, use the model corresponding to the given dialect. The AR_XL_* model is best suited for Levantine dialect recordings.
Using the AR_XL_* model for neighboring dialects, such as Iraqi, will result in significantly lower accuracy, and for dialects like Maghrebi, the results will likely be completely unusable.