Language Tags
Some Phonexia speech technologies, such as Language Identification and Speech to Text, are language dependent.
Hence, a set of consistent, systematic, and clear guidelines for language tags or labels has been established. These guidelines adhere to the globally recognized standards for language labeling.
By implementing these guidelines, we can enhance the effectiveness and interoperability of Phonexia speech technologies across diverse linguistic contexts.
Standard for Labelling Languages
All language tags used in Phonexia products follow the IETF language tags standard outlined in RFC 5646: Tags for Identifying Languages. Broadly speaking, there are 3 types of language tags.
1. Language-REGION
This denotes a combination of the two-letter (or three-letter) lowercase language subtag and the two-letter uppercase subtag for region (country). It designates the language spoken in a specific country. Examples:
Language tag | Description |
---|---|
cs-CZ | Czech as spoken in the Czech Republic |
en-US | English as spoken in the United States |
es-ES | Spanish as spoken in Spain |
fa-AF | Dari, Persian variety spoken in Afghanistan |
fa-IR | Farsi, Persian variety spoken in Iran |
ar-MA | Maghrebi Arabic (Morocco, Algeria, Tunisia, Libya, Mauritania) |
ar-KW | Gulf Arabic; Arabic as spoken in Kuwait and other Gulf countries |
- The language subtag ideally consists of a 2-letter language code (ISO 639) in lowercase. If no 2-letter code is available, a 3-letter code may be used.
- The region subtag ideally consists of a 2-letter country code (ISO 3166-1) in uppercase. If no 2-letter code is available, a 3-digit regional code (UN M.49) may be used.
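The subtag rules above can be checked mechanically. The following is a minimal sketch, not a full RFC 5646 parser: the pattern and the `parse_tag` helper are illustrative assumptions rather than part of any Phonexia API, and the pattern assumes the region is either two uppercase letters or a three-digit UN M.49 code. It also accepts a bare language subtag with no region.

```python
import re
from typing import Optional, Tuple

# Simplified pattern (assumption): 2-3 lowercase letters for the language,
# optionally followed by "-" and a 2-letter uppercase or 3-digit region.
TAG_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})(?:-(?P<region>[A-Z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag into (language, region); region is None if absent."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a language or language-REGION tag: {tag!r}")
    return match.group("language"), match.group("region")


print(parse_tag("cs-CZ"))  # ('cs', 'CZ')
print(parse_tag("arb"))    # ('arb', None)
```

The check is case-sensitive on purpose, so a tag such as `EN-us` is rejected instead of being silently normalized.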
2. Language
The two-letter (or three-letter) language subtag can be used independently, without the need for a region specification. This occurs when the language is spoken across multiple countries, and the Speech to Text model is generic, meaning it has been trained to transcribe all dialectal varieties of the language. Examples:
Language tag | Description |
---|---|
de | Generic German; e.g., a mix of German from Austria and Germany |
pt | Generic Portuguese; spoken in Portugal, Brazil, Angola, Mozambique, etc. |
gn | Generic Guarani; spoken in Paraguay, Bolivia, Argentina and Brazil |
ku | Generic Kurdish; spoken mainly in Iran, Iraq, Turkey and Syria |
arb | Modern Standard Arabic; spoken in all Arab countries (only a 3-letter ISO code exists for MSA) |
Enhanced Speech to Text Built on Whisper never specifies the region, not only when language models are general (e.g. `es` for general Spanish) but also when the language is spoken exclusively in one country (e.g. `cs` for Czech, `is` for Icelandic).
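When the same code has to work with both region-specific and region-less tags, the region can be dropped before comparing them. This is a small sketch; `to_language_only` is a hypothetical helper name, not a Phonexia function, and it assumes the input is a language or language-REGION tag as described above.

```python
def to_language_only(tag: str) -> str:
    """Return only the language subtag, dropping any region that follows."""
    # The language subtag is everything before the first hyphen.
    return tag.split("-", 1)[0]


print(to_language_only("es-ES"))  # es
print(to_language_only("cs"))     # cs
```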
3. Non-standard "privateuse" subtag
The privateuse subtag is used when non-standardized information needs to be tagged, whether it relates to the language, the region, or both. This could include specifying a particular macroregion or linguistic variety that lacks its own ISO code. The privateuse subtag can substitute either the language or region subtag, or both, as needed.
a. privateuse-REGION
This combination can be used for a privately defined language/dialect in a region. Example:
Language tag | Description |
---|---|
qaa-CZ | Non-standard qaa could refer to the Cieszyn Silesian dialect spoken in the Czech Republic |
b. language-PRIVATEUSE
This combination can be used where there is no official ISO code for the region in which the language is spoken. For example, no specific ISO codes are designated for the Levantine region or the Central and South American region. Therefore, the privateuse subtags `XL` (representing cross Levantine) and `XA` (representing cross American) have been established.
Examples:
Language tag | Description |
---|---|
ar-XL | Generic Levantine Arabic. |
es-XA | Generic American Spanish. |
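Private-use subtags such as `qaa`, `XL` and `XA` can be recognized by their code ranges. The sketch below assumes the ranges that RFC 5646 reserves for private use: `qaa` through `qtz` for languages, and `AA`, `QM` through `QZ`, `XA` through `XZ` and `ZZ` for regions. The function names are hypothetical.

```python
import re

# Ranges reserved for private use by RFC 5646.
PRIVATE_LANGUAGE = re.compile(r"q[a-t][a-z]")
PRIVATE_REGION = re.compile(r"AA|Q[M-Z]|X[A-Z]|ZZ")


def is_private_language(subtag: str) -> bool:
    """True if the language subtag lies in the qaa-qtz private-use range."""
    return PRIVATE_LANGUAGE.fullmatch(subtag) is not None


def is_private_region(subtag: str) -> bool:
    """True if the region subtag is one of the private-use region codes."""
    return PRIVATE_REGION.fullmatch(subtag) is not None


print(is_private_language("qaa"))  # True
print(is_private_region("XL"))     # True
print(is_private_region("CZ"))     # False
```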
c. entirely non-standard tag
The use of the `x` singleton at the beginning of a language tag signifies that the entire tag is private. This indicates that the tag comprises only subtags whose meanings are determined by private agreement. However, it's worth noting that the use of such private tags is not recommended for language tagging.
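Because entirely private tags are discouraged, it can be useful to detect them early. A minimal sketch follows, assuming the RFC 5646 form of `x` followed by one or more subtags of 1 to 8 alphanumeric characters. The helper name and the example tag `x-phonexia-dialect` are made up for illustration.

```python
import re

# "x" singleton followed by one or more private subtags.
ENTIRELY_PRIVATE = re.compile(r"x(?:-[A-Za-z0-9]{1,8})+", re.IGNORECASE)


def is_entirely_private(tag: str) -> bool:
    """True if the whole tag is a private-use tag starting with 'x-'."""
    return ENTIRELY_PRIVATE.fullmatch(tag) is not None


print(is_entirely_private("x-phonexia-dialect"))  # True
print(is_entirely_private("qaa-CZ"))              # False
```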