Version: 3.3.0

Comparison of the 4th and 5th Generation of Language Identification

At Phonexia, we strive for continuous improvement, which is why we keep developing new generations of our technologies. This article compares the latest generation of Language Identification, XL5, with its predecessor, L4.

Supported languages

One of the improvements in the new generation is that the number of supported languages has almost doubled, with XL5 offering up to 140.

Added languages include Danish, Finnish, Estonian, additional varieties of Arabic, Hebrew, and many others. For a full list of languages supported by XL5 and L4, please refer to the table below.

Comparison of languages

L4 Code	L4 Language Name	XL5 Code	XL5 Language Name
		ab-GE	Abkhaz
		af	Afrikaans
sq-AL	Albanian	sq-AL	Albanian
am-ET	Amharic	am-ET	Amharic
ar-EG	Arabic (Egypt)	ar-EG	Arabic (Egypt)
ar-KW	Arabic (Gulf, Kuwait)	ar-KW	Arabic (Gulf)
ar-IQ	Arabic (Iraq)	ar-IQ	Arabic (Iraq)
ar-XL	Arabic (Levantine)	ar-XL	Arabic (Levantine)
ar-MA	Arabic (Maghrebi)	ar-MA	Arabic (Maghrebi)
		ar-OM	Arabic (Oman)
		ar-SA	Arabic (Saudi)
		ar-TN	Arabic (Tunisia)
		ar-YE	Arabic (Yemen)
arb	Arabic (MSA)	arb	Arabic (MSA)
		hy-AM	Armenian
as-IN	Assamese	as-IN	Assamese
		ast-ES	Asturian
az-AZ	Azerbaijani	az-AZ	Azerbaijani
		ba-RU	Bashkir
		eu	Basque
bn-BD	Bengali (Bangladesh)	bn	Bengali
be-BY	Belarusian	be-BY	Belarusian
		br-FR	Breton
bg-BG	Bulgarian	bg-BG	Bulgarian
my-MM	Burmese	my-MM	Burmese
		kea-CV	Cape Verdean Creole
		ca-ES	Catalan
ceb-PH	Cebuano	ceb-PH	Cebuano
zh-HK	Chinese (Cantonese, Hong Kong)	zh-HK	Cantonese
zh-CN	Chinese (Mandarin, China)	zh-CN	Chinese (Mandarin)
nan-CN	Chinese (Min Nan)	min-CN	Chinese (Min)
wuu-CN	Chinese (Wu)	wuu-CN	Chinese (Wu)
cv-RU	Chuvash	cv-RU	Chuvash
cs-CZ	Czech	cs-CZ	Czech
fa-AF	Dari	fa	Persian (covers Dari; see the Persian row below)
		da-DK	Danish
		luo-KE	Dholuo
nl	Dutch	nl	Dutch
		en-AU	English (Australia)
en-IN	English (India)	en-IN	English (India)
en-GB	English (United Kingdom)	en-GB	English (UK)
en-US	English (United States)	en-US	English (US)
		et-EE	Estonian
		fo	Faroese
		fi-FI	Finnish
fr	French	fr	French
		gl-ES	Galician
ka-GE	Georgian	ka-GE	Georgian
de	German	de	German
el-GR	Greek	el-GR	Greek
gn	Guarani	gn	Guarani
		gu-IN	Gujarati
ht-HT	Haitian Creole	ht-HT	Haitian Creole
ha	Hausa	ha	Hausa
		haw-US	Hawaiian
		he-IL	Hebrew
hi-IN	Hindi	hi-IN	Hindi
hu-HU	Hungarian	hu-HU	Hungarian
		is-IS	Icelandic
		ig-NG	Igbo
id-ID	Indonesian	id-ID	Indonesian
		ga-IE	Irish
it	Italian	it-IT	Italian
ja-JP	Japanese	ja-JP	Japanese
		jv-ID	Javanese
		kam-KE	Kamba
		kn-IN	Kannada
kk-KZ	Kazakh	kk-KZ	Kazakh
km	Khmer	km-KH	Khmer
rn-BI	Rundi	rn-BI	Kirundi
ko-KR	Korean	ko-KR	Korean
ku	Kurdish	ku	Kurdish
		ky-KG	Kyrgyz
lo-LA	Lao	lo-LA	Lao
		lv-LV	Latvian
		ln	Lingala
lt-LT	Lithuanian	lt-LT	Lithuanian
		lg-UG	Luganda
lb-LU	Luxembourgish	lb-LU	Luxembourgish
mk-MK	Macedonian	mk-MK	Macedonian
		ms-MY	Malay
		ml-IN	Malayalam
		mg-MG	Malagasy
		mt-MT	Maltese
		mi-NZ	Māori
		mr-IN	Marathi
		mn-MN	Mongolian
nd-ZW	Ndebele	nd-ZW	Ndebele (North)
		nr-ZA	Ndebele (South)
		ne-NP	Nepali
		no-NO	Norwegian
		oc-FR	Occitan
		or-IN	Odia
om	Oromo	om-ET	Oromo
ps	Pashto	ps	Pashto
fa-IR	Persian (Iran)	fa	Persian
pl-PL	Polish	pl-PL	Polish
pt	Portuguese	pt	Portuguese
ro-RO	Romanian	ro-RO	Romanian
pa	Punjabi	pa	Punjabi
ru-RU	Russian	ru-RU	Russian
sh	Serbo-Croat-Bosnian	hbs	Serbocroatian
st-ZA		st-ZA	Sesotho
sn	Shona	sn	Shona
si-LK		si-LK	Sinhala
sd		sd	Sindhi
sl-SI	Slovenian	sl-SI	Slovenian
sk-SK	Slovak	sk-SK	Slovak
so	Somali	so	Somali
es-XA	Spanish (America)	es-XA	Spanish (American)
es-ES	Spanish (Europe)	es-ES	Spanish (Spain)
su-ID		su-ID	Sundanese
sw	Swahili	sw	Swahili
ss-ZA		ss-ZA	Swazi
sv-SE	Swedish	sv-SE	Swedish
		tg	Tajik
ta	Tamil	ta	Tamil
		tt	Tatar
te-IN	Telugu	te-IN	Telugu
th-TH	Thai	th-TH	Thai
bo	Tibetan	bo	Tibetan
ti	Tigrinya	ti	Tigrinya
tpi-PG	Tok Pisin	tpi-PG	Tok Pisin
		ts-ZA	Tsonga
		tn-ZA	Tswana
tr-TR	Turkish	tr-TR	Turkish
		tk	Turkmen
uk-UA	Ukrainian	uk-UA	Ukrainian
		umb-AO	Umbundu
ur	Urdu	ur	Urdu
uz-UZ	Uzbek	uz-UZ	Uzbek
		ve-ZA	Venda
vi-VN	Vietnamese	vi-VN	Vietnamese
		cy-GB	Welsh
		wo	Wolof
		xh-ZA	Xhosa
		yi	Yiddish
		yo	Yoruba
zu	Zulu	zu	Zulu

GPU processing

Another difference is GPU processing, which is not available in L4. Processing audio on a GPU greatly improves system performance compared with processing the same amount of data on a CPU. For more information about the performance gains, refer to the example measurements.

Accuracy

The new generation is also a step forward in terms of accuracy. Below are three sample measurements.

Dataset	Number of languages	L4 accuracy	XL5 accuracy
NIST LRE 15	13	0.597	0.626
NIST LRE 17	17	0.659	0.713
ROXSD	14	0.859	0.953

Accuracy is evaluated using the metric described in the Accuracy Evaluation section; higher values indicate better performance.
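To make the size of the improvement easier to read, the figures from the table above can be turned into absolute and relative gains. The following is only an illustrative calculation on the published numbers; the names `results` and `gains` are made up for this sketch and are not part of any Phonexia tool.

```python
# Accuracy values copied from the table above: (L4, XL5) per dataset.
results = {
    "NIST LRE 15": (0.597, 0.626),
    "NIST LRE 17": (0.659, 0.713),
    "ROXSD": (0.859, 0.953),
}


def gains(l4, xl5):
    """Return (absolute gain, relative gain in percent) of XL5 over L4."""
    absolute = xl5 - l4
    relative = absolute / l4 * 100
    return round(absolute, 3), round(relative, 1)


summary = {dataset: gains(l4, xl5) for dataset, (l4, xl5) in results.items()}

for dataset, (absolute, relative) in summary.items():
    print(f"{dataset}: {absolute:+.3f} absolute, {relative:+.1f}% relative")
```

The relative gain is computed against the L4 value, so it describes how much XL5 improves on its predecessor for each dataset.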

The accuracy can be further enhanced using subsets of languages or language groups.

Subsets of languages

Users can enhance the accuracy of language identification by configuring the system to detect only a subset of the available languages. These subsets are created based on the languages the user expects to be present within the audio pool to be analyzed. This is achieved by adjusting the prior weights so that only the relevant languages have non-zero weights, thereby excluding the others. This method minimizes confusion and increases the probability of correct identification by focusing on the expected languages.
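The effect of prior weights can be sketched in a few lines. This is a minimal illustration, not the XL5 implementation or its configuration format: it assumes the per-language scores behave like probabilities that sum to one, and that a prior weight is applied by multiplication followed by renormalization. The function name `restrict_to_subset` and the score values are hypothetical.

```python
def restrict_to_subset(scores, expected):
    """Give languages outside `expected` a zero prior weight and renormalize.

    Assumption: `scores` maps language codes to probability-like values.
    """
    # Prior weight is 1.0 for expected languages and 0.0 for all others.
    weights = {lang: (1.0 if lang in expected else 0.0) for lang in scores}
    weighted = {lang: scores[lang] * weights[lang] for lang in scores}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no expected language has a non-zero score")
    # Keep only the languages with non-zero weight and rescale to sum to 1.
    return {
        lang: value / total
        for lang, value in weighted.items()
        if weights[lang] > 0
    }


# Hypothetical scores for one recording (illustrative numbers only).
scores = {"cs-CZ": 0.40, "sk-SK": 0.35, "pl-PL": 0.15, "ru-RU": 0.10}

subset = restrict_to_subset(scores, {"cs-CZ", "pl-PL"})
best = max(subset, key=subset.get)
print(best, subset)
```

In this example the closely competing `sk-SK` score no longer takes probability mass away from `cs-CZ`, which is the kind of confusion a subset is meant to remove.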

For detailed information, refer to the Specifying Languages and Groups section. Creating and using subsets in XL5 is straightforward and user-friendly.

A similar outcome can be achieved in L4 using custom language packs, but that process is significantly more time-consuming than the streamlined subset configuration in XL5.

Language groups

XL5 also introduces language groups, which offer another way to improve accuracy. When a group is formed from a set of languages, the scores of these languages are combined and reported as a single group score. One use case for groups is to combine dialects or regional variants of a language into a single group that represents the language as a whole. For more information, refer to the Specifying Languages and Groups section.
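The idea of a group score can be illustrated with a short sketch. The text above only says that member scores are "combined"; summing probability-like scores is an assumption made here for illustration, and the function `group_scores`, the group name `arabic`, and the score values are all hypothetical rather than part of the XL5 interface.

```python
def group_scores(scores, groups):
    """Combine member scores into one score per group.

    Assumption: combining means summing probability-like scores.
    Languages that belong to no group are reported unchanged.
    """
    grouped = {}
    assigned = set()
    for name, members in groups.items():
        grouped[name] = sum(scores.get(lang, 0.0) for lang in members)
        assigned.update(members)
    for lang, value in scores.items():
        if lang not in assigned:
            grouped[lang] = value
    return grouped


# Hypothetical scores for one recording (illustrative numbers only).
scores = {"ar-EG": 0.30, "ar-IQ": 0.25, "ar-XL": 0.10, "he-IL": 0.35}
groups = {"arabic": ["ar-EG", "ar-IQ", "ar-XL"]}

result = group_scores(scores, groups)
best_single = max(scores, key=scores.get)
best_grouped = max(result, key=result.get)
print(best_single, best_grouped, result)
```

With these numbers, no single Arabic variety outscores `he-IL`, but the combined `arabic` group does, which shows how grouping dialects can change the reported result.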
