Comparison of the 4th and 5th Generations of Language Identification
At Phonexia, we strive for continuous improvement, which is why we continue to develop new generations of our technologies. This article compares the latest generation of Language Identification, XL5, with its predecessor, L4.
Supported languages
One of the improvements in the new generation is that the number of supported languages has almost doubled, with XL5 offering up to 140.
Added languages include Danish, Finnish, Estonian, additional varieties of Arabic, Hebrew, and many others. For a full list of languages supported by XL5 and L4, please refer to the table below.
Comparison of languages
XL5 Code | XL5 Language Name | L4 Code | L4 Language Name |
---|---|---|---|
ab-GE | Abkhaz | ||
af | Afrikaans | ||
sq-AL | Albanian | sq-AL | Albanian |
am-ET | Amharic | am-ET | Amharic |
ar-EG | Arabic (Egypt) | ar-EG | Arabic (Egypt) |
ar-KW | Arabic (Gulf, Kuwait) | ar-KW | Arabic (Gulf) |
ar-IQ | Arabic (Iraq) | ar-IQ | Arabic (Iraq) |
ar-XL | Arabic (Levantine) | ar-XL | Arabic (Levantine) |
ar-MA | Arabic (Maghrebi) | ar-MA | Arabic (Maghrebi) |
ar-OM | Arabic (Oman) | ||
ar-SA | Arabic (Saudi) | ||
ar-TN | Arabic (Tunisia) | ||
ar-YE | Arabic (Yemen) | ||
arb | Arabic (MSA) | arb | Arabic (MSA) |
hy-AM | Armenian | ||
as-IN | Assamese | as-IN | Assamese |
ast-ES | Asturian | ||
az-AZ | Azerbaijani | az-AZ | Azerbaijani |
ba-RU | Bashkir | ||
eu | Basque | ||
bn-BD | Bengali (Bangladesh) | bn | Bengali |
be-BY | Belarusian | be-BY | Belarusian |
br-FR | Breton | ||
bg-BG | Bulgarian | bg-BG | Bulgarian |
my-MM | Burmese | my-MM | Burmese |
kea-CV | Cape Verdean Creole | ||
ca-ES | Catalan | ||
ceb-PH | Cebuano | ceb-PH | Cebuano |
zh-HK | Chinese (Cantonese, Hong Kong) | zh-HK | Cantonese |
zh-CN | Chinese (Mandarin, China) | zh-CN | Chinese (Mandarin) |
nan-CN | Chinese (Min Nan) | min-CN | Chinese (Min) |
wuu-CN | Chinese (Wu) | wuu-CN | Chinese (Wu) |
cv-RU | Chuvash | cv-RU | Chuvash |
cs-CZ | Czech | cs-CZ | Czech |
fa-AF | Dari | covered in fa - Persian (see below) | |
da-DK | Danish | ||
luo-KE | Dholuo | ||
nl | Dutch | nl | Dutch |
en-AU | English (Australia) | ||
en-IN | English (India) | en-IN | English (India) |
en-GB | English (United Kingdom) | en-GB | English (UK) |
en-US | English (United States) | en-US | English (US) |
et-EE | Estonian | ||
fo | Faroese | ||
fi-FI | Finnish | ||
fr | French | fr | French |
gl-ES | Galician | ||
ka-GE | Georgian | ka-GE | Georgian |
de | German | de | German |
el-GR | Greek | el-GR | Greek |
gn | Guarani | gn | Guarani |
gu-IN | Gujarati | ||
ht-HT | Haitian Creole | ht-HT | Haitian Creole |
ha | Hausa | ha | Hausa |
haw-US | Hawaiian | ||
he-IL | Hebrew | ||
hi-IN | Hindi | hi-IN | Hindi |
hu-HU | Hungarian | hu-HU | Hungarian |
is-IS | Icelandic | ||
ig-NG | Igbo | ||
id-ID | Indonesian | id-ID | Indonesian |
ga-IE | Irish | ||
it | Italian | it-IT | Italian |
ja-JP | Japanese | ja-JP | Japanese |
jv-ID | Javanese | ||
kam-KE | Kamba | ||
kn-IN | Kannada | ||
kk-KZ | Kazakh | kk-KZ | Kazakh |
km | Khmer | km-KH | Khmer |
rn-BI | Rundi | rn-BI | Kirundi |
ko-KR | Korean | ko-KR | Korean |
ku | Kurdish | ku | Kurdish |
ky-KG | Kyrgyz | ||
lo-LA | Lao | lo-LA | Lao |
lv-LV | Latvian | ||
ln | Lingala | ||
lt-LT | Lithuanian | lt-LT | Lithuanian |
lg-UG | Luganda | ||
lb-LU | Luxembourgish | lb-LU | Luxembourgish |
mk-MK | Macedonian | mk-MK | Macedonian |
ms-MY | Malay | ||
ml-IN | Malayalam | ||
mg-MG | Malagasy | ||
mt-MT | Maltese | ||
mi-NZ | Māori | ||
mr-IN | Marathi | ||
mn-MN | Mongolian | ||
nd-ZW | Ndebele | nd-ZW | Ndebele (North) |
nr-ZA | Ndebele (South) | ||
ne-NP | Nepali | ||
no-NO | Norwegian | ||
oc-FR | Occitan | ||
or-IN | Odia | ||
om | Oromo | om-ET | Oromo |
ps | Pashto | ps | Pashto |
fa-IR | Persian (Iran) | fa | Persian |
pl-PL | Polish | pl-PL | Polish |
pt | Portuguese | pt | Portuguese |
ro-RO | Romanian | ro-RO | Romanian |
pa | Punjabi | pa | Punjabi |
ru-RU | Russian | ru-RU | Russian |
sh | Serbo-Croat-Bosnian | hbs | Serbocroatian |
st-ZA | st-ZA | Sesotho | |
sn | Shona | sn | Shona |
si-LK | si-LK | Sinhala | |
sd | sd | Sindhi | |
sl-SI | Slovenian | sl-SI | Slovenian |
sk-SK | Slovak | sk-SK | Slovak |
so | Somali | so | Somali |
es-XA | Spanish (America) | es-XA | Spanish (American) |
es-ES | Spanish (Europe) | es-ES | Spanish (Spain) |
su-ID | su-ID | Sundanese | |
sw | Swahili | sw | Swahili |
ss-ZA | ss-ZA | Swazi | |
sv-SE | Swedish | sv-SE | Swedish |
tg | Tajik | ||
ta | Tamil | ta | Tamil |
tt | Tatar | ||
te-IN | Telugu | te-IN | Telugu |
th-TH | Thai | th-TH | Thai |
bo | Tibetan | bo | Tibetan |
ti | Tigrinya | ti | Tigrinya |
tpi-PG | Tok Pisin | tpi-PG | Tok Pisin |
ts-ZA | Tsonga | ||
tn-ZA | Tswana | ||
tr-TR | Turkish | tr-TR | Turkish |
tk | Turkmen | ||
uk-UA | Ukrainian | uk-UA | Ukrainian |
umb-AO | Umbundu | ||
ur | Urdu | ur | Urdu |
uz-UZ | Uzbek | uz-UZ | Uzbek |
ve-ZA | Venda | ||
vi-VN | Vietnamese | vi-VN | Vietnamese |
cy-GB | Welsh | ||
wo | Wolof | ||
xh-ZA | Xhosa | ||
yi | Yiddish | ||
yo | Yoruba | ||
zu | Zulu | zu | Zulu |
GPU processing
Another difference is GPU processing, which is available in XL5 but not in L4. Processing audio on a GPU greatly improves the performance of the system compared with processing the same amount of data on a CPU. For more information about the performance enhancement, refer to the example measurements.
Accuracy
The new generation is also a step forward in terms of accuracy. Below are three sample measurements.
Dataset | Number of languages | L4 accuracy | XL5 accuracy |
---|---|---|---|
NIST LRE 15 | 13 | 0.597 | 0.626 |
NIST LRE 17 | 17 | 0.659 | 0.713 |
ROXSD | 14 | 0.859 | 0.953 |
The accuracy evaluation is done using the metric described in the Accuracy Evaluation section, where higher accuracy indicates better performance.
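To make the size of the improvement easier to read, the short sketch below computes the absolute and relative accuracy gain of XL5 over L4 for each dataset. The figures are copied from the table above; the function name `accuracy_gain` is only illustrative and is not part of any Phonexia tool.

```python
# Accuracy figures copied from the table above: (L4, XL5).
results = {
    "NIST LRE 15": (0.597, 0.626),
    "NIST LRE 17": (0.659, 0.713),
    "ROXSD": (0.859, 0.953),
}


def accuracy_gain(l4, xl5):
    """Return (absolute gain, relative gain in percent) of XL5 over L4."""
    absolute = xl5 - l4
    relative = 100.0 * absolute / l4
    return absolute, relative


gains = {name: accuracy_gain(l4, xl5) for name, (l4, xl5) in results.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

On these three datasets the relative gains work out to roughly 4.9%, 8.2%, and 10.9%, respectively.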
The accuracy can be further enhanced using subsets of languages or language groups.
Subsets of languages
Users can enhance the accuracy of language identification by configuring the system to detect only a subset of the available languages. These subsets are created based on the languages the user expects to be present within the audio pool to be analyzed. This is achieved by adjusting the prior weights so that only the relevant languages have non-zero weights, thereby excluding the others. This method minimizes confusion and increases the probability of correct identification by focusing on the expected languages.
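The effect of zeroing prior weights can be illustrated with a small, self-contained sketch. It is not the XL5 implementation or its configuration format: it simply assumes that the system produces one log-likelihood per language and that prior weights are applied through Bayes' rule. The function name `posteriors` and the example scores are hypothetical.

```python
import math


def posteriors(log_likelihoods, prior_weights):
    """Combine per-language log-likelihoods with prior weights (Bayes' rule).

    Languages with a zero (or missing) prior weight get a posterior of 0.
    """
    active = {lang: w for lang, w in prior_weights.items() if w > 0}
    if not active:
        raise ValueError("at least one language needs a non-zero prior weight")
    # Subtract the largest log-likelihood for numerical stability.
    peak = max(log_likelihoods[lang] for lang in active)
    unnormalized = {
        lang: w * math.exp(log_likelihoods[lang] - peak)
        for lang, w in active.items()
    }
    total = sum(unnormalized.values())
    return {lang: unnormalized.get(lang, 0.0) / total for lang in log_likelihoods}


# Hypothetical scores for one recording (assumed values, not real output).
log_likelihoods = {"cs-CZ": -1.0, "sk-SK": -0.9, "pl-PL": -2.5}

# All languages enabled: the closely related sk-SK narrowly wins.
uniform = {"cs-CZ": 1.0, "sk-SK": 1.0, "pl-PL": 1.0}
full = posteriors(log_likelihoods, uniform)

# Subset: the user expects only Czech and Polish in the audio pool.
subset = {"cs-CZ": 1.0, "sk-SK": 0.0, "pl-PL": 1.0}
restricted = posteriors(log_likelihoods, subset)

print(max(full, key=full.get))
print(max(restricted, key=restricted.get))
```

In this toy example the excluded language can no longer absorb probability mass, so the remaining candidates are renormalized among themselves and the expected language comes out on top.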
For detailed information, refer to the Specifying Languages and Groups section. Creating and using subsets in XL5 is straightforward and user-friendly.
A similar outcome can be achieved in L4 using custom language packs, but that process is significantly more time-consuming than the streamlined subset configuration in XL5.
Language groups
XL5 also introduces language groups, which offer another way to improve accuracy. When a group is formed from a set of languages, the scores of these languages are combined and reported as a single group score. One potential use case for groups is to combine dialects or regional variants of a language into a single, encapsulating group that represents the language as a whole. For more information, refer to the Specifying Languages and Groups section.
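As a rough illustration of the idea, the sketch below merges several Arabic variants into one group. It assumes that per-language scores behave like probabilities and that "combined" means summed. This is an assumption made for the example, not a description of how XL5 computes group scores. The function name `group_scores` and the score values are hypothetical.

```python
def group_scores(language_scores, groups):
    """Sum member-language scores into one score per group.

    Languages that belong to no group are passed through unchanged.
    """
    grouped = {}
    assigned = set()
    for group_name, members in groups.items():
        grouped[group_name] = sum(language_scores.get(lang, 0.0) for lang in members)
        assigned.update(members)
    for lang, score in language_scores.items():
        if lang not in assigned:
            grouped[lang] = score
    return grouped


# Hypothetical per-language scores for one recording (assumed values).
scores = {"ar-EG": 0.30, "ar-IQ": 0.20, "ar-XL": 0.15, "fa-IR": 0.35}

# One group that represents Arabic as a whole.
groups = {"Arabic": ["ar-EG", "ar-IQ", "ar-XL"]}

result = group_scores(scores, groups)
print(max(scores, key=scores.get))   # best single language
print(max(result, key=result.get))   # best entry after grouping
```

Here the score mass is split across three Arabic variants, so no single variant beats fa-IR on its own. Once the variants are grouped, their combined score is the highest entry.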