Version: 4.0.2

Language Identification

This guide demonstrates how to perform Language Identification with Phonexia Speech Platform 4. You can find a high-level description in the About Language Identification article.

For testing, we'll be using the following recordings. You can download them all together in the audio_files.zip archive.

filename	language code	language name
Adedewe.wav	yo	Yoruba
Dina.wav	arb	Arabic (MSA)
Fadimatu.wav	ha	Hausa
Harry.wav	en-GB	British English
Juan.wav	es-XA	Spanish (American)
Julia.wav	en-US	US English
Lenka.wav	cs-CZ	Czech
Lubica.wav	sk-SK	Slovak
Luka.wav	hbs	Serbo-Croatian
Nirav.wav	gu-IN	Gujarati
Noam.wav	he-IL	Hebrew
Obioma.wav	ig-NG	Igbo
Tatiana.wav	ru-RU	Russian
Thida.wav	km-KH	Khmer
Tuan.wav	vi-VN	Vietnamese
Xiang.wav	zh-CN	Mandarin Chinese
Zoltan.wav	hu-HU	Hungarian

At the end of this guide, you'll find a full Python code example that combines all the steps, which are first discussed separately. The guide should give you a comprehensive understanding of how to integrate Language Identification into your own projects.

Prerequisites

In this guide, we assume that the Virtual Appliance is running on port 8000 of http://localhost and contains a proper model and license for the technology. For more information on how to install and start the Virtual Appliance, please refer to the Virtual Appliance Installation chapter.

Environment Setup

This example uses Python 3.9 and version 2.27 of the requests library. You can install the requests library with pip as follows:

pip install requests~=2.27

Basic Language Identification

To run Language Identification for a single audio file, you should start by sending a POST request to the /api/technology/language-identification endpoint. In Python, you can do this as follows:

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/language-identification"

audio_path = "Harry.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )
    print(response.status_code)  # Should print '202'

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling the result is returned in the Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.

import time

polling_url = response.headers["Location"]  # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)

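Once the polling finishes, data will contain the latest response from the server: either the result of Language Identification, or an error message with details if processing could not finish properly. Note that the loop above gives up after 100 attempts (about 500 seconds of waiting), so the task may still be in a non-final state at that point.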

By default, the result contains scores for more than a hundred languages. The following JSON is a manually shortened result of a successful Language Identification task for the Harry.wav file. It shows that the language was correctly identified as British English ("en-GB") with a probability close to 1.0, and that Australian English ("en-AU") also received a small share of the probability, in contrast to Greek ("el-GR"), whose probability is 0.0. Note that the identifiers in the response are lower-cased. You can find the meaning of individual language tags in the list of supported languages.

{
  "task": {
    "task_id": "cccd6bf9-9c8c-44a3-9373-c0182fc096b4",
    "state": "done"
  },
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "speech_length": 30.0,
        "scores": [
          ...
          {
            "identifier": "el-gr",
            "identifier_type": "language",
            "probability": 0.0
          },
          {
            "identifier": "en-au",
            "identifier_type": "language",
            "probability": 0.00212
          },
          {
            "identifier": "en-gb",
            "identifier_type": "language",
            "probability": 0.99787
          },
          ...
        ]
      }
    ]
  }
}
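Because data can hold an error instead of a result, it is worth checking the task state before reading the scores. The following is a minimal sketch; the helper name extract_scores is hypothetical, and since this guide doesn't show the layout of an error response, the sketch puts the whole response body into the exception message rather than assuming specific error fields.

```python
def extract_scores(data: dict, channel: int = 0) -> list:
    """Return the scores of one channel, or raise if the task did not finish."""
    state = data["task"]["state"]
    if state != "done":
        # The layout of error details is not assumed here; show the whole body.
        raise RuntimeError(f"Task ended in state '{state}': {data}")
    return data["result"]["channels"][channel]["scores"]
```

With this helper, scores = extract_scores(data) either returns the list of score items or stops with a message that names the state the task ended in.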

You can easily parse the result and select, for example, only the three top-scoring languages in the first channel (those with the highest probability), print them to the console, and save them to a file like this:

import json

scores = data["result"]["channels"][0]["scores"]
top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
print(top_scores)
with open("output.json", "w") as output:
    json.dump(top_scores, output, indent=2)

This will produce the following JSON array:

[
  {
    "identifier": "en-gb",
    "identifier_type": "language",
    "probability": 0.99787
  },
  {
    "identifier": "en-au",
    "identifier_type": "language",
    "probability": 0.00212
  },
  {
    "identifier": "ab-ge",
    "identifier_type": "language",
    "probability": 0.0
  }
]
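The example above reads only the first channel. Since the result holds a list of channels, each with its own channel_number, the same selection can be applied to every channel. The sketch below assumes that a recording with several channels yields one entry per channel in the channels list; the function name top_scores_per_channel is hypothetical.

```python
def top_scores_per_channel(data: dict, n: int = 3) -> dict:
    """Map each channel_number to its n highest-probability score items."""
    return {
        channel["channel_number"]: sorted(
            channel["scores"], key=lambda s: s["probability"], reverse=True
        )[:n]
        for channel in data["result"]["channels"]
    }
```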

Language Identification with Parameters

If you want more control over the output, you can use the config request body parameter. In it, you can use languages to limit the list of languages shown in the output, and you can define language_groups so that certain languages are treated as a single result item. See the endpoint documentation for more details.

To use the request body parameters, encode them as a JSON string with json.dumps and pass them in the data argument of requests.post(). In the following example, we're instructing the Language Identification technology to limit the list of languages to the related languages German, English, and Dutch, and to treat all available dialects of English as one group:

payload = {
    "config": json.dumps(
        {
            "languages": ["de", "en-AU", "en-GB", "en-IN", "en-US", "nl"],
            "language_groups": [
                {
                    "identifier": "English",
                    "languages": ["en-AU", "en-GB", "en-IN", "en-US"],
                }
            ],
        }
    )
}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        data=payload,
        files=files,
    )

After polling for the result as in the previous example and selecting the three top-scoring items, we get the following output. Notice that the "English" group now has the maximum possible probability of 1.0, and that we can see how much the individual dialects contributed to the overall score:

[
  {
    "identifier": "English",
    "identifier_type": "group",
    "probability": 1.0,
    "languages": [
      {
        "identifier": "en-au",
        "identifier_type": "language",
        "probability": 0.00212
      },
      {
        "identifier": "en-gb",
        "identifier_type": "language",
        "probability": 0.99788
      },
      {
        "identifier": "en-in",
        "identifier_type": "language",
        "probability": 0.0
      },
      {
        "identifier": "en-us",
        "identifier_type": "language",
        "probability": 0.0
      }
    ]
  },
  {
    "identifier": "de",
    "identifier_type": "language",
    "probability": 0.0
  },
  {
    "identifier": "nl",
    "identifier_type": "language",
    "probability": 0.0
  }
]
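When groups are in use, a score item is either a plain language or a group with nested member languages, as in the output above. The following sketch shows one way to find the most probable individual language behind an item; the function name best_member is hypothetical, and the field names are taken from the output shown in this guide.

```python
def best_member(item: dict) -> dict:
    """Return the most probable member of a group item.

    A plain language item is returned unchanged.
    """
    if item.get("identifier_type") == "group":
        return max(item["languages"], key=lambda s: s["probability"])
    return item
```

For the "English" group above, this returns the "en-gb" item, while the "de" and "nl" items are returned as they are.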

Full Python Code

Here is the full example of how to run the Language Identification technology with parameters that limit the list of languages to just those that are actually spoken in the sample dataset (plus some more English dialects). The code is slightly adjusted and wrapped into functions for better readability.

⚠️ Warning: If you use both the languages and language_groups parameters, make sure that all individual languages in a group are also included in the global languages list. The example also shows that a language group can contain any language (e.g., "Czech" and "Slovak"), not just dialects of one language.
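The rule from the warning can be checked locally before a request is sent. The following is a minimal sketch, not part of the platform API: the function name missing_group_languages is hypothetical, and the comparison is an exact string match, which assumes that language tags are written the same way in both lists.

```python
def missing_group_languages(config: dict) -> dict:
    """Return {group identifier: [member languages absent from 'languages']}.

    An empty dict means the config satisfies the rule. The rule only applies
    when both 'languages' and 'language_groups' are present.
    """
    if "languages" not in config:
        return {}
    allowed = set(config["languages"])
    problems = {}
    for group in config.get("language_groups", []):
        missing = [lang for lang in group["languages"] if lang not in allowed]
        if missing:
            problems[group["identifier"]] = missing
    return problems
```

Calling this on the dictionary that is passed to json.dumps, and stopping when the result is not empty, catches a forgotten language before the task is submitted.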

The results of the test are saved to the top_scores.json file:

import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/language-identification"


payload = {
    "config": json.dumps(
        {
            "languages": [
                "arb",
                "cs-CZ",
                "en-AU",
                "en-GB",
                "en-IN",
                "en-US",
                "es-XA",
                "gu-IN",
                "ha",
                "hbs",
                "he-IL",
                "hu-HU",
                "ig-NG",
                "km-KH",
                "ru-RU",
                "sk-SK",
                "vi-VN",
                "yo",
                "zh-CN",
            ],
            "language_groups": [
                {
                    "identifier": "English",
                    "languages": ["en-AU", "en-GB", "en-IN", "en-US"],
                },
                {"identifier": "Czecho-Slovak", "languages": ["cs-CZ", "sk-SK"]},
            ],
        }
    )
}


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_language_identification(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            data=payload,
            files=files,
        )
        response.raise_for_status()
    polling_url = response.headers["Location"]
    language_identification_response = poll_result(polling_url)
    return language_identification_response.json()


filenames = [
    "Adedewe.wav",
    "Dina.wav",
    "Fadimatu.wav",
    "Harry.wav",
    "Juan.wav",
    "Julia.wav",
    "Lenka.wav",
    "Lubica.wav",
    "Luka.wav",
    "Nirav.wav",
    "Noam.wav",
    "Obioma.wav",
    "Tatiana.wav",
    "Thida.wav",
    "Tuan.wav",
    "Xiang.wav",
    "Zoltan.wav",
]

results = {}
for filename in filenames:
    print(f"Running Language Identification for file {filename}.")
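    data = run_language_identification(filename)
    # Skip files whose task did not finish successfully.
    if data["task"]["state"] != "done":
        print(f"Task for {filename} ended in state {data['task']['state']}.")
        continue
    scores = data["result"]["channels"][0]["scores"]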
    top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
    results[filename] = top_scores
    print(f"The top-scoring languages in {filename} are: {top_scores}")

with open("top_scores.json", "w") as output:
    json.dump(results, output, indent=2)
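The poll_result function above loops until the task reaches a final state, with no upper bound on the waiting time. The following sketch shows one way to add a deadline. The function name poll_until_terminal and its parameters are hypothetical; the fetch argument is any callable that returns the parsed JSON body of the polling response, which keeps the sketch independent of the HTTP library.

```python
import time
from typing import Callable

TERMINAL_STATES = {"done", "failed", "rejected"}


def poll_until_terminal(
    fetch: Callable[[], dict],
    timeout: float = 600.0,
    interval: float = 5.0,
) -> dict:
    """Call fetch() until the task state is final or the timeout would pass."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch()
        if data["task"]["state"] in TERMINAL_STATES:
            return data
        # Give up if the next sleep would end at or after the deadline.
        if time.monotonic() + interval >= deadline:
            raise TimeoutError(f"Task not finished after {timeout} seconds")
        time.sleep(interval)
```

With requests, it could be called as poll_until_terminal(lambda: requests.get(polling_url).json()).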
