Version: 4.0.2

Language Tags

Some Phonexia speech technologies, such as Language Identification and Speech to Text, are language dependent.

Hence, a set of consistent, systematic, and clear guidelines for language tags or labels has been established. These guidelines adhere to the globally recognized standards for language labeling.

By implementing these guidelines, we can enhance the effectiveness and interoperability of Phonexia speech technologies across diverse linguistic contexts.

Tip

Explore the list of supported languages for Phonexia's Speech to Text technology here, and for Phonexia's Language Identification technology here.

Standard for Labelling Languages

All language tags used in Phonexia products follow the IETF language tags standard outlined in RFC 5646: Tags for Identifying Languages. Broadly speaking, there are 3 types of language tags.

1. Language-REGION

This denotes a combination of the two-letter (or three-letter) lowercase language subtag and the two-letter uppercase subtag for region (country). It designates the language spoken in a specific country. Examples:

Language tag	Description
`cs-CZ`	Czech as spoken in the Czech Republic
`en-US`	English as spoken in the United States
`es-ES`	Spanish as spoken in Spain
`fa-AF`	Dari, Persian variety spoken in Afghanistan
`fa-IR`	Farsi, Persian variety spoken in Iran
`ar-MA`	Maghrebi Arabic (Morocco, Algeria, Tunisia, Libya, Mauritania)
`ar-KW`	Gulf Arabic; Arabic as spoken in Kuwait and other Gulf countries

Subtags

The language subtag ideally consists of a 2-letter language code (ISO 639 code) in lowercase. If a 2-letter code is unavailable, a 3-letter code may be used.
The region subtag ideally consists of a 2-letter country code (ISO 3166-1 code) in uppercase. If a 2-letter code is unavailable, a 3-digit regional code (UN M.49 code) may be used.
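The subtag rules above can be sketched as a small parser. This is a minimal illustration, not part of any Phonexia product: the name `split_tag` and the pattern are assumptions, and the pattern covers only the simple forms described on this page (a lowercase language subtag, optionally followed by a region written as two uppercase letters or, as RFC 5646 allows, three digits). It does not implement the full RFC 5646 grammar.

```python
import re

# Simplified pattern (assumption): 2-3 lowercase letters for the language,
# optionally followed by "-" and a region of 2 uppercase letters or 3 digits.
TAG_PATTERN = re.compile(
    r"(?P<language>[a-z]{2,3})(?:-(?P<region>[A-Z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Split a tag into (language, region); region is None if absent."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a recognized language tag: {tag!r}")
    return match.group("language"), match.group("region")


language, region = split_tag("cs-CZ")
print(language, region)
```

A tag such as `CS-cz` is rejected by this sketch because it does not follow the lowercase-language, uppercase-region convention used here.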

2. Language

The two-letter (or three-letter) language subtag can be used independently, without the need for a region specification. This occurs when the language is spoken across multiple countries, and the Speech to Text model is generic, meaning it has been trained to transcribe all dialectal varieties of the language. Examples:

Language tag	Description
`de`	Generic German; e.g., a mix of German from Austria and Germany
`pt`	Generic Portuguese; spoken in Portugal, Brazil, Angola, Mozambique, etc.
`gn`	Generic Guarani; spoken in Paraguay, Bolivia, Argentina and Brazil
`ku`	Generic Kurdish; spoken mainly in Iran, Iraq, Turkey and Syria
`arb`	Modern Standard Arabic; spoken in all Arab countries (only a 3-letter ISO code exists for MSA)

Caution

Enhanced Speech to Text Built on Whisper never specifies the region, not only when language models are general (e.g. `es` for general Spanish) but also when the language is spoken exclusively in one country (e.g. `cs` for Czech, `is` for Icelandic).
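When the same configuration feeds both region-specific models and a model that expects language-only tags, a tag can be reduced to its language subtag before use. The helper below is a hypothetical sketch (the name `language_only` is an assumption, not a Phonexia API) and assumes tags in the Language or Language-REGION form.

```python
def language_only(tag):
    """Return only the language subtag, dropping any region subtag."""
    # Everything before the first hyphen is the language subtag.
    return tag.split("-", 1)[0].lower()


for tag in ("cs-CZ", "es-ES", "is"):
    print(tag, "->", language_only(tag))
```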

3. Non-standard "privateuse" subtag

The privateuse subtag comes into play when there's a necessity to tag non-standardized information, either related to language, region, or both. This could include specifying a particular macroregion or linguistic variety that lacks its own ISO code. The privateuse subtag can substitute either the language or region subtag, or both, as needed.

a. privateuse-REGION

This combination can be used for a privately defined language/dialect in a region. Example:

Language tag	Description
`qaa-CZ`	Non-standard `qaa` could refer to the Cieszyn Silesian dialect spoken in the Czech Republic

b. language-PRIVATEUSE

This combination can be used where there is no official ISO code for the region in which the language is spoken. For example, there are no specific ISO codes for the Levantine region or for the Central and South American region. Therefore, the privateuse subtags `XL` (representing cross-Levantine) and `XA` (representing cross-American) have been established. Examples:

Language tag	Description
`ar-XL`	Generic Levantine Arabic.
`es-XA`	Generic American Spanish.
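The privateuse subtags in the examples above can be recognized programmatically. The sketch below assumes the private-use ranges reserved by RFC 5646 (`qaa` to `qtz` for languages; `AA`, `QM` to `QZ`, `XA` to `XZ`, and `ZZ` for regions); the function name `describe_private_use` is hypothetical, and only tags of the Language or Language-REGION shape are handled.

```python
import re

# Private-use ranges per RFC 5646 (stated here as an assumption of the sketch).
PRIVATE_LANGUAGE = re.compile(r"q[a-t][a-z]")        # qaa..qtz
PRIVATE_REGION = re.compile(r"AA|ZZ|Q[M-Z]|X[A-Z]")  # AA, QM..QZ, XA..XZ, ZZ


def describe_private_use(tag):
    """Report which subtags of a tag fall into a private-use range."""
    language, _, region = tag.partition("-")
    return {
        "private_language": PRIVATE_LANGUAGE.fullmatch(language) is not None,
        "private_region": PRIVATE_REGION.fullmatch(region) is not None,
    }


for tag in ("qaa-CZ", "ar-XL", "es-XA", "cs-CZ"):
    print(tag, describe_private_use(tag))
```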

c. entirely non-standard tag

The use of the x singleton at the beginning of a language tag signifies that the entire tag is private. This indicates that the tag comprises only subtags whose meanings are determined by private agreement. However, it's worth noting that the use of such private tags is not recommended for language tagging.
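Because such tags are discouraged, it can be useful to flag them during validation. A minimal sketch follows; the helper name `is_fully_private` and the tag `x-phx-dialect` are made up for illustration.

```python
def is_fully_private(tag):
    """Return True if the whole tag is private, i.e. it starts with 'x-'."""
    # RFC 5646 treats tags case-insensitively, so compare in lowercase.
    return tag.lower().startswith("x-")


print(is_fully_private("x-phx-dialect"))
print(is_fully_private("ar-XL"))
```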
