Language Identification
This guide demonstrates how to perform Language Identification with Phonexia Speech Platform 4. You can find a high-level description in the About Language Identification article.
For testing, we'll be using the following recordings. You can download them all together in the audio_files.zip archive.
filename | language code | language name |
---|---|---|
Adedewe.wav | yo | Yoruba |
Dina.wav | arb | Arabic (MSA) |
Fadimatu.wav | ha | Hausa |
Harry.wav | en-GB | British English |
Juan.wav | es-XA | Spanish (American) |
Julia.wav | en-US | US English |
Lenka.wav | cs-CZ | Czech |
Lubica.wav | sk-SK | Slovak |
Luka.wav | hbs | Serbo-Croatian |
Nirav.wav | gu-IN | Gujarati |
Noam.wav | he-IL | Hebrew |
Obioma.wav | ig-NG | Igbo |
Tatiana.wav | ru-RU | Russian |
Thida.wav | km-KH | Khmer |
Tuan.wav | vi-VN | Vietnamese |
Xiang.wav | zh-CN | Mandarin Chinese |
Zoltan.wav | hu-HU | Hungarian |
At the end of this guide, you'll find the full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Language Identification into your own projects.
Prerequisites
In the guide, we assume that the Speech Platform server is running on port 8000 of http://localhost and a properly configured Language Identification microservice is available. Here's more information on how to install and start the Speech Platform server and how to make the microservice available.
Environment Setup
We are using Python 3.9 and Python library requests 2.27 in this example. You can install the requests library with pip as follows:
pip install requests~=2.27
Basic Language Identification
To run Language Identification for a single audio file, you should start by sending a POST request to the /api/technology/language-identification endpoint. In Python, you can do this as follows:
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/language-identification"

audio_path = "Harry.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )

print(response.status_code)  # Should print '202'
If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
The URL for polling the result is returned in the X-Location header.
Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.
import time

polling_url = response.headers["x-location"]  # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)
Once the polling finishes, data will contain the latest response from the server: either the result of Language Identification, or an error message with details if processing could not finish properly.
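Note that the loop above stops quietly after 100 attempts, so data may still describe an unfinished task. A minimal sketch of one way to make that case explicit follows. The helper name wait_for_task, its parameters, and the injected fetch callable are hypothetical and not part of the Speech Platform API; only the task.state field and the three final states are taken from this guide.

```python
import time
from typing import Any, Callable, Dict

# Final task states, as used in the polling loop in this guide.
FINAL_STATES = {"done", "failed", "rejected"}


def wait_for_task(
    fetch: Callable[[], Dict[str, Any]],
    max_attempts: int = 100,
    delay: float = 5.0,
) -> Dict[str, Any]:
    """Call `fetch` until the task reaches a final state or attempts run out.

    `fetch` is any zero-argument callable that returns the parsed JSON body
    of the polling response (hypothetical interface, chosen for this sketch).
    """
    for attempt in range(max_attempts):
        data = fetch()
        if data["task"]["state"] in FINAL_STATES:
            return data
        # Sleep only if another attempt will follow.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    raise TimeoutError(f"Task did not finish after {max_attempts} attempts")
```

With the requests library, this could be called as data = wait_for_task(lambda: requests.get(polling_url).json()).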
By default, the result contains scores for more than a hundred languages. The following JSON is a manually shortened result of a successful Language Identification task for the Harry.wav file. It shows that the language was correctly identified as British English ("en-GB") with a probability close to 1.0, and that Australian English ("en-AU") also received some "points", in contrast to Greek ("el-GR"). You can find the meaning of individual language tags in the list of supported languages.
{
  "task": {
    "task_id": "cccd6bf9-9c8c-44a3-9373-c0182fc096b4",
    "state": "done"
  },
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "speech_length": 30.0,
        "scores": [
          ...
          {
            "identifier": "el-gr",
            "identifier_type": "language",
            "probability": 0.0
          },
          {
            "identifier": "en-au",
            "identifier_type": "language",
            "probability": 0.00212
          },
          {
            "identifier": "en-gb",
            "identifier_type": "language",
            "probability": 0.99787
          },
          ...
        ]
      }
    ]
  }
}
You can easily parse the result and select, for example, only the three top-scoring languages in the first channel (those with the highest probability), print them to the console, and save them to a file like this:
import json

scores = data["result"]["channels"][0]["scores"]
top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
print(top_scores)

with open("output.json", "w") as output:
    json.dump(top_scores, output, indent=2)
This will produce the following JSON array:
[
  {
    "identifier": "en-gb",
    "identifier_type": "language",
    "probability": 0.99787
  },
  {
    "identifier": "en-au",
    "identifier_type": "language",
    "probability": 0.00212
  },
  {
    "identifier": "ab-ge",
    "identifier_type": "language",
    "probability": 0.0
  }
]
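Because polling can also end in the failed or rejected state, it may be worth checking the task state before reading result. The following sketch wraps the selection above in a hypothetical helper, top_languages, which is not part of the Speech Platform API. It assumes the response layout shown in this guide and uses the shortened Harry.wav response as sample input.

```python
from typing import Any, Dict, List


def top_languages(
    data: Dict[str, Any], n: int = 3, channel: int = 0
) -> List[Dict[str, Any]]:
    """Return the `n` highest-probability score entries for one channel."""
    state = data["task"]["state"]
    if state != "done":
        # Assumption: failed or rejected tasks carry no usable scores.
        raise RuntimeError(f"Task finished in state '{state}'")
    scores = data["result"]["channels"][channel]["scores"]
    # sorted() is stable, so entries with equal probability keep their order.
    return sorted(scores, key=lambda s: s["probability"], reverse=True)[:n]


# Shortened response for Harry.wav, taken from the example in this guide.
sample = {
    "task": {"task_id": "cccd6bf9-9c8c-44a3-9373-c0182fc096b4", "state": "done"},
    "result": {
        "channels": [
            {
                "channel_number": 0,
                "speech_length": 30.0,
                "scores": [
                    {"identifier": "el-gr", "identifier_type": "language", "probability": 0.0},
                    {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
                    {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99787},
                ],
            }
        ]
    },
}

for entry in top_languages(sample, n=2):
    print(entry["identifier"], entry["probability"])
```

For this sample, the loop prints en-gb 0.99787 followed by en-au 0.00212.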
Language Identification with Parameters
If you want to have more control over the output, you can use the config request body parameter. In it, you can limit the list of languages that will be shown in the output, and you can define language_groups that make certain languages be treated as a single result item. See the endpoint documentation for more details. To use the request body parameters, you can encode them as a JSON string with json.dumps and pass them as an argument for the data parameter of requests.post(). In the following example, we're instructing the Language Identification technology to limit the list of languages to the related languages German, English, and Dutch, and to treat all available dialects of English as one group:
payload = {
    "config": json.dumps(
        {
            "languages": ["de", "en-AU", "en-GB", "en-IN", "en-US", "nl"],
            "language_groups": [
                {
                    "identifier": "English",
                    "languages": ["en-AU", "en-GB", "en-IN", "en-US"],
                }
            ],
        }
    )
}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        data=payload,
        files=files,
    )
After polling for the result as in the previous example, we'll get the following output. Notice that the "English" group now received the maximum possible probability of 1.0 and we can see how much the individual dialects contributed to the overall score:
[
  {
    "identifier": "English",
    "identifier_type": "group",
    "probability": 1.0,
    "languages": [
      {
        "identifier": "en-au",
        "identifier_type": "language",
        "probability": 0.00212
      },
      {
        "identifier": "en-gb",
        "identifier_type": "language",
        "probability": 0.99788
      },
      {
        "identifier": "en-in",
        "identifier_type": "language",
        "probability": 0.0
      },
      {
        "identifier": "en-us",
        "identifier_type": "language",
        "probability": 0.0
      }
    ]
  },
  {
    "identifier": "de",
    "identifier_type": "language",
    "probability": 0.0
  },
  {
    "identifier": "nl",
    "identifier_type": "language",
    "probability": 0.0
  }
]
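To look at the dialect contributions numerically, the nested languages list of a group entry can be summed and compared with the group's own probability. The sketch below does this for the "English" entry above. The helper name group_breakdown is hypothetical, and the idea that a group's probability equals the sum of its members is an assumption that this sample output happens to be consistent with, not documented behavior.

```python
import math
from typing import Any, Dict, List, Tuple


def group_breakdown(entry: Dict[str, Any]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the summed member probability and members sorted by contribution."""
    members = [(m["identifier"], m["probability"]) for m in entry.get("languages", [])]
    members.sort(key=lambda pair: pair[1], reverse=True)
    return sum(p for _, p in members), members


# The "English" group entry from the output in this guide.
english = {
    "identifier": "English",
    "identifier_type": "group",
    "probability": 1.0,
    "languages": [
        {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
        {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99788},
        {"identifier": "en-in", "identifier_type": "language", "probability": 0.0},
        {"identifier": "en-us", "identifier_type": "language", "probability": 0.0},
    ],
}

total, members = group_breakdown(english)
print(members[0])  # the strongest member of the group
# Compare with a tolerance, since the published values are rounded.
print(math.isclose(total, english["probability"], abs_tol=1e-5))
```

Here the strongest member is en-gb, and the member probabilities add up to the group probability within the rounding tolerance.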
Full Python Code
Here is the full example of how to run the Language Identification technology with parameters that limit the list of input languages to just those that are actually spoken in the sample dataset (plus some more English dialects). The code is slightly adjusted and wrapped into functions for better readability.
⚠️ Warning: If you use both the languages and language_groups parameters, make sure that all individual languages in a group are also included in the global languages list. The example also shows that a language group can contain any language (e.g., "Czech" and "Slovak"), not just dialects of one language.
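The constraint from the warning can be checked locally before the request is sent. The following sketch shows one possible way to do so. The function missing_group_languages is a hypothetical helper, not part of the Speech Platform API, and it compares language tags by exact string match, which is stricter than the server may require. The sample config is deliberately incomplete to show a non-empty result.

```python
from typing import Any, Dict, List


def missing_group_languages(config: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each group identifier to its languages absent from `languages`."""
    if "languages" not in config:
        # No global list is given, so there is nothing to check against.
        return {}
    allowed = set(config["languages"])
    missing: Dict[str, List[str]] = {}
    for group in config.get("language_groups", []):
        absent = [lang for lang in group["languages"] if lang not in allowed]
        if absent:
            missing[group["identifier"]] = absent
    return missing


# Illustrative config that violates the rule: two dialects are not listed globally.
config = {
    "languages": ["de", "en-GB", "en-US", "nl"],
    "language_groups": [
        {"identifier": "English", "languages": ["en-AU", "en-GB", "en-IN", "en-US"]}
    ],
}
print(missing_group_languages(config))  # {'English': ['en-AU', 'en-IN']}
```

An empty dictionary means every grouped language also appears in the global list.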
The following code saves the result of the test to the top_scores.json file:
import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/language-identification"

payload = {
    "config": json.dumps(
        {
            "languages": [
                "arb",
                "cs-CZ",
                "en-AU",
                "en-GB",
                "en-IN",
                "en-US",
                "es-XA",
                "gu-IN",
                "ha",
                "hbs",
                "he-IL",
                "hu-HU",
                "ig-NG",
                "km-KH",
                "ru-RU",
                "sk-SK",
                "vi-VN",
                "yo",
                "zh-CN",
            ],
            "language_groups": [
                {
                    "identifier": "English",
                    "languages": ["en-AU", "en-GB", "en-IN", "en-US"],
                },
                {"identifier": "Czecho-Slovak", "languages": ["cs-CZ", "sk-SK"]},
            ],
        }
    )
}
def poll_result(polling_url: str, sleep: int = 5):
    # Poll until the task reaches one of the final states.
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response
def run_language_identification(audio_path: str):
    # Submit the audio file together with the config payload.
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            data=payload,
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    language_identification_response = poll_result(polling_url)
    return language_identification_response.json()
filenames = [
    "Adedewe.wav",
    "Dina.wav",
    "Fadimatu.wav",
    "Harry.wav",
    "Juan.wav",
    "Julia.wav",
    "Lenka.wav",
    "Lubica.wav",
    "Luka.wav",
    "Nirav.wav",
    "Noam.wav",
    "Obioma.wav",
    "Tatiana.wav",
    "Thida.wav",
    "Tuan.wav",
    "Xiang.wav",
    "Zoltan.wav",
]
results = {}
for filename in filenames:
    print(f"Running Language Identification for file {filename}.")
    data = run_language_identification(filename)
    scores = data["result"]["channels"][0]["scores"]
    top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
    results[filename] = top_scores
    print(f"The top-scoring languages in {filename} are: {top_scores}")

with open("top_scores.json", "w") as output:
    json.dump(results, output, indent=2)
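Since the expected language of every recording is listed in the table at the beginning of this guide, the saved results can be compared against it. The following sketch is not part of the original example. The names EXPECTED, predicted_language, and is_correct are hypothetical, only three recordings are listed for brevity, and the comparison assumes that identifiers returned by the server differ from the tags in the table only in letter case, as in the outputs shown in this guide. The sample input reuses the grouped Harry.wav output from the previous section.

```python
from typing import Any, Dict, List

# Expected language codes for some recordings, taken from the table in this guide.
EXPECTED = {"Harry.wav": "en-GB", "Lenka.wav": "cs-CZ", "Zoltan.wav": "hu-HU"}


def predicted_language(top_scores: List[Dict[str, Any]]) -> str:
    """Return the most probable language, looking inside a group if needed."""
    best = top_scores[0]  # entries are assumed to be sorted by probability
    if best["identifier_type"] == "group":
        # Pick the strongest member of the winning group.
        best = max(best["languages"], key=lambda m: m["probability"])
    return best["identifier"]


def is_correct(filename: str, top_scores: List[Dict[str, Any]]) -> bool:
    """Compare the prediction with the expected tag, ignoring letter case."""
    return predicted_language(top_scores).lower() == EXPECTED[filename].lower()


# Grouped output for Harry.wav, as shown earlier in this guide.
harry_top = [
    {
        "identifier": "English",
        "identifier_type": "group",
        "probability": 1.0,
        "languages": [
            {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
            {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99788},
            {"identifier": "en-in", "identifier_type": "language", "probability": 0.0},
            {"identifier": "en-us", "identifier_type": "language", "probability": 0.0},
        ],
    },
    {"identifier": "de", "identifier_type": "language", "probability": 0.0},
]

print(is_correct("Harry.wav", harry_top))  # True
```

In the full script, results[filename] could be passed as top_scores for each file that has an entry in EXPECTED.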