Version: 4.0.0-rc1

Language Tags

Some Phonexia speech technologies, such as Language Identification and Speech to Text, are language dependent.

Hence, a set of consistent, systematic, and clear guidelines for language tags or labels has been established. These guidelines adhere to the globally recognized standards for language labeling.

By implementing these guidelines, we can enhance the effectiveness and interoperability of Phonexia speech technologies across diverse linguistic contexts.

Tip

Explore the list of supported languages for Phonexia's Speech to Text technology here, and for Phonexia's Language Identification technology here.

Standard for Labelling Languages

All language tags used in Phonexia products follow the IETF language tags standard outlined in RFC 5646: Tags for Identifying Languages. Broadly speaking, there are 3 types of language tags.

1. Language-REGION

This denotes a combination of the two-letter (or three-letter) lowercase language subtag and the two-letter uppercase subtag for region (country). It designates the language spoken in a specific country. Examples:

Language tag   Description
cs-CZ          Czech as spoken in the Czech Republic
en-US          English as spoken in the United States
es-ES          Spanish as spoken in Spain
fa-AF          Dari, the Persian variety spoken in Afghanistan
fa-IR          Farsi, the Persian variety spoken in Iran
ar-MA          Maghrebi Arabic (Morocco, Algeria, Tunisia, Libya, Mauritania)
ar-KW          Gulf Arabic; Arabic as spoken in Kuwait and other Gulf countries
Subtags
  • The language subtag ideally consists of a 2-letter language code (ISO 639 code) in lowercase. If a 2-letter code is unavailable, a 3-letter code may be used.
  • The region subtag ideally consists of a 2-letter country code (ISO 3166-1 code) in uppercase. If a 2-letter code is unavailable, a 3-digit regional code (UN M.49 code) may be used.
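
The shape of a Language-REGION tag described above can be checked with a short regular expression. The following is a minimal Python sketch, not part of any Phonexia product: the function name `split_language_region` is hypothetical, the three-character region alternative is assumed to be a three-digit UN M.49 code as in RFC 5646, and only the shape of the tag is checked, not whether the codes are actually registered or supported.

```python
import re
from typing import Tuple

# Shape check only: 2-3 lowercase letters, a hyphen, then either
# 2 uppercase letters (ISO 3166-1) or 3 digits (UN M.49, assumed per RFC 5646).
_LANGUAGE_REGION = re.compile(r"([a-z]{2,3})-([A-Z]{2}|[0-9]{3})")


def split_language_region(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'cs-CZ' into its language and region subtags."""
    match = _LANGUAGE_REGION.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a Language-REGION tag: {tag!r}")
    return match.group(1), match.group(2)


language, region = split_language_region("fa-AF")
print(language, region)
```

The check is deliberately case-sensitive, so a tag written as `EN-us` is rejected rather than silently normalized.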

2. Language

The two-letter (or three-letter) language subtag can be used independently, without the need for a region specification. This occurs when the language is spoken across multiple countries, and the Speech to Text model is generic, meaning it has been trained to transcribe all dialectal varieties of the language. Examples:

Language tag   Description
de             Generic German; e.g., a mix of German from Austria and Germany
pt             Generic Portuguese; spoken in Portugal, Brazil, Angola, Mozambique, etc.
gn             Generic Guarani; spoken in Paraguay, Bolivia, Argentina and Brazil
ku             Generic Kurdish; spoken mainly in Iran, Iraq, Turkey and Syria
arb            Modern Standard Arabic (MSA); spoken in all Arab countries. Only a 3-letter ISO code exists for MSA.
Caution

Enhanced Speech to Text Built on Whisper never specifies the region. This applies not only when the language model is generic (e.g., es for generic Spanish) but also when the language is spoken exclusively in one country (e.g., cs for Czech, is for Icelandic).
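
Where tags are stored in the Language-REGION form but a region-less tag is needed, as in the caution above, the conversion amounts to keeping only the first subtag. The sketch below is an illustration under assumptions: the helper name `language_only` is hypothetical, and whether the resulting tag is actually supported by a given model still has to be checked against the list of supported languages.

```python
def language_only(tag: str) -> str:
    """Return the language subtag of a simple tag such as 'cs-CZ' or 'es'."""
    # Everything before the first hyphen is the language subtag.
    language = tag.partition("-")[0]
    # Accept only 2- or 3-letter lowercase ASCII codes; this also rejects
    # fully private tags that start with the 'x' singleton.
    if not (2 <= len(language) <= 3 and language.isascii()
            and language.isalpha() and language.islower()):
        raise ValueError(f"no usable language subtag in {tag!r}")
    return language


for tag in ("cs-CZ", "es", "arb"):
    print(tag, "->", language_only(tag))
```

Dropping the region discards information, so this direction is lossy: `fa-AF` and `fa-IR` both reduce to `fa`.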

3. Non-standard "privateuse" subtag

The privateuse subtag comes into play when there's a necessity to tag non-standardized information, either related to language, region, or both. This could include specifying a particular macroregion or linguistic variety that lacks its own ISO code. The privateuse subtag can substitute either the language or region subtag, or both, as needed.

a. privateuse-REGION

This combination can be used for a privately defined language/dialect in a region. Example:

Language tag   Description
qaa-CZ         Non-standard qaa could refer to the Cieszyn Silesian dialect spoken in the Czech Republic

b. language-PRIVATEUSE

This combination can be used when there is no official ISO code for the region where the language is spoken. For example, no ISO codes are designated for the Levantine region or for the Central and South American region. Therefore, the privateuse region subtags XL (representing cross-Levantine) and XA (representing cross-American) have been established. Examples:

Language tag   Description
ar-XL          Generic Levantine Arabic.
es-XA          Generic American Spanish.

c. entirely non-standard tag

The use of the x singleton at the beginning of a language tag signifies that the entire tag is private. This indicates that the tag comprises only subtags whose meanings are determined by private agreement. However, it's worth noting that the use of such private tags is not recommended for language tagging.
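
The three privateuse cases above can be told apart mechanically. The following minimal Python sketch relies on the ranges that RFC 5646 reserves for private use (language subtags qaa through qtz; region subtags AA, QM through QZ, XA through XZ, and ZZ). The names `TagInfo` and `inspect_tag` are hypothetical, and the sketch handles only the simple tag shapes discussed in this section.

```python
import re
from typing import NamedTuple, Optional


class TagInfo(NamedTuple):
    language: Optional[str]
    region: Optional[str]
    private_language: bool  # language subtag lies in qaa..qtz
    private_region: bool    # region subtag is AA, QM..QZ, XA..XZ or ZZ
    fully_private: bool     # whole tag starts with the 'x' singleton


_SIMPLE_TAG = re.compile(r"([a-z]{2,3})(?:-([A-Z]{2}|[0-9]{3}))?")
_PRIVATE_LANGUAGE = re.compile(r"q[a-t][a-z]")
_PRIVATE_REGION = re.compile(r"AA|ZZ|Q[M-Z]|X[A-Z]")


def inspect_tag(tag: str) -> TagInfo:
    """Report which parts of a simple language tag are privately defined."""
    if tag.startswith("x-"):
        # Entirely private tag: the subtags have no standard meaning.
        return TagInfo(None, None, False, False, True)
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"unsupported tag shape: {tag!r}")
    language, region = match.groups()
    return TagInfo(
        language,
        region,
        private_language=_PRIVATE_LANGUAGE.fullmatch(language) is not None,
        private_region=(region is not None
                        and _PRIVATE_REGION.fullmatch(region) is not None),
        fully_private=False,
    )


for tag in ("qaa-CZ", "ar-XL", "es-XA", "en-US"):
    print(tag, inspect_tag(tag))
```

Returning separate flags rather than a single label keeps the case where both the language and the region are private representable without extra categories.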