Version: 3.4.0

Speech Translation

This guide demonstrates how to perform Speech Translation to English with Phonexia Speech Platform 4. This technology is based on Enhanced Speech to Text Built on Whisper, and its high-level description can be found in the Enhanced Speech to Text Built on Whisper articles.

For testing, we'll be using the following audio files. You can download them all together in the audio_files.zip archive:

filename       language name
Lenka.wav      Czech
Tatiana.wav    Russian
Xiang.wav      Mandarin Chinese
Zoltan.wav     Hungarian

At the end of this guide, you'll find the full Python code example that combines all the steps that are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Speech Translation into your own projects.

Prerequisites

In this guide, we assume that the Speech Platform server is running at http://localhost:8000 and that a properly configured Enhanced Speech to Text Built on Whisper microservice is available. Here's more information on how to install and start the Speech Platform server and how to make the microservice available.

Environment Setup

We are using Python 3.9 and the Python requests library 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27

Basic Speech Translation

By default, the translation source language is detected once at the beginning of the file by the auto-detect feature. To run Speech Translation for a single media file, you should start by sending a POST request to the /api/technology/speech-translation-whisper-enhanced endpoint. In Python, you can do this as follows:

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)

audio_path = "Lenka.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )

print(response.status_code)  # Should print '202'

If the task has been successfully accepted, a 202 status code is returned together with a unique task ID in the response body. The task isn't processed immediately; it's only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.
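For reference, the body of a successfully accepted task looks roughly like this (the task_id value below is a made-up placeholder, and the real response may contain additional fields):

{
  "task": {
    "task_id": "d3b07384-d9a0-4c9a-8f3a-2a1e5c9b7f10"
  }
}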

import time

import requests

polling_url = response.headers[
    "x-location"
]  # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)

Once the polling finishes, data will contain the latest response from the server -- either the result of Speech Translation, or an error message with details in case processing could not finish properly. The technology result can be accessed as data["result"], and for our sample audio it should look as follows (the result is shortened for readability):

"one_best": {
"segments": [
{
"channel_number": 0,
"start_time": 2.17,
"end_time": 5.3,
"language": "en",
"text": "Good day, I am very happy that you are calling.",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 5.3,
"end_time": 8.3,
"language": "en",
"text": "We have just opened the swimming courses,",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 8.3,
"end_time": 12.08,
"language": "en",
"text": "and we have planned them like this.",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 12.08,
"end_time": 15.41,
"language": "en",
"text": "The course is for one semester,",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 15.41,
"end_time": 18.41,
"language": "en",
"text": "and the training takes place once or twice a week,",
"source_language": "cs",
"detected_source_language": "cs",
},
...,
]
}
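If you are only interested in the translated text itself, you can join the text values of the individual segments. A minimal sketch, assuming data holds the final polled response from the previous step:

# Concatenate the translated text of all segments into a single string
segments = data["result"]["one_best"]["segments"]
translation = " ".join(segment["text"] for segment in segments)
print(translation)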

In the example above, both source_language and detected_source_language contain the same language code. However, the values may differ in some cases -- a detailed explanation can be found in the description of detected_source_language in the documentation of the GET request to the /api/technology/speech-translation-whisper-enhanced/:task_id endpoint.
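Purely as an illustration (the values below are made up and do not come from the sample files), a segment where the two fields differ could look like this:

{
  "channel_number": 0,
  "start_time": 2.17,
  "end_time": 5.3,
  "language": "en",
  "text": "Good day, I am very happy that you are calling.",
  "source_language": "cs",
  "detected_source_language": "sk"
}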

In case you are processing multi-channel media files, the translated segments for all channels appear in the common segments list and are clearly distinguished by their channel_number value. The list of segments is sorted by start_time, end_time and channel_number.
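For example, to process each channel separately, you can group the common segments list by channel_number. A minimal sketch, assuming data holds the polled response for a multi-channel file:

from collections import defaultdict

# Group translated segments by their channel_number
segments_by_channel = defaultdict(list)
for segment in data["result"]["one_best"]["segments"]:
    segments_by_channel[segment["channel_number"]].append(segment)

for channel_number, segments in sorted(segments_by_channel.items()):
    print(f"Channel {channel_number}: {len(segments)} translated segments")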

Speech Translation with Parameters

The technology supports two mutually exclusive query parameters -- source_language and enable_language_switching. If you know which language is spoken in the file, you can specify it with the source_language parameter and possibly make the translation more accurate. If the file contains multiple languages, you can use the enable_language_switching parameter, so that the source language is re-detected every 30 seconds.

When specifying the source_language, the POST request can look as follows:

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)

params = {"source_language": "cs"}  # selecting the source language manually

# or enable language switching instead
# params = {"enable_language_switching": True}

audio_path = "Lenka.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
        params=params,
    )

response.raise_for_status()

You can then poll for the result and parse it as demonstrated in the Basic Speech Translation section.

Full Python code

Here is the full example of how to run the Speech Translation technology. The code is slightly adjusted and wrapped into functions for better readability.

import json
import time

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_speech_translation_whisper_enhanced(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filenames = ["Lenka.wav", "Tatiana.wav", "Xiang.wav", "Zoltan.wav"]

for filename in filenames:
    print(f"Running Enhanced Speech Translation Built on Whisper for file {filename}.")
    data = run_speech_translation_whisper_enhanced(filename)
    result = data["result"]
    print(f"{json.dumps(result, indent=2)}\n")