Speech Translation
This guide demonstrates how to perform Speech Translation to English with Phonexia Speech Platform 4. The technology is based on Enhanced Speech to Text Built on Whisper; a high-level description can be found in the Enhanced Speech to Text Built on Whisper articles.
For testing, we'll be using the following audio files. You can download them all together in the audio_files.zip archive:
filename | language name |
---|---|
Lenka.wav | Czech |
Tatiana.wav | Russian |
Xiang.wav | Mandarin Chinese |
Zoltan.wav | Hungarian |
At the end of this guide, you'll find a full Python code example that combines all the steps discussed separately below. This guide should give you a comprehensive understanding of how to integrate Speech Translation into your own projects.
Prerequisites
In this guide, we assume that the Speech Platform server is running on port `8000` of `http://localhost` and that a properly configured Enhanced Speech to Text Built on Whisper microservice is available. Here's more information on how to install and start the Speech Platform server and how to make the microservice available.
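Before moving on, you can do a quick reachability check. This is just a minimal sketch: it sends a plain GET to the base URL and assumes no authentication is required; any HTTP response indicates the server is up.

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL

# Any HTTP response (even an error status) means the server is reachable;
# a connection error means it is not running or not accessible.
response = requests.get(SPEECH_PLATFORM_SERVER)
print(response.status_code)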
Environment Setup
We are using Python 3.9 and the Python library `requests` 2.27 in this example. You can install the `requests` library with `pip` as follows:
pip install requests~=2.27
Basic Speech Translation
By default, the translation source language is detected once at the beginning of the file by the auto-detect feature. To run Speech Translation for a single media file, you should start by sending a `POST` request to the `/api/technology/speech-translation-whisper-enhanced` endpoint. In Python, you can do this as follows:
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)

audio_path = "Lenka.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )

print(response.status_code)  # Should print '202'
If the task has been successfully accepted, the `202` code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result. The URL for polling the result is returned in the `X-Location` header. Alternatively, you can assemble the polling URL on your own by appending a slash (`/`) and the task ID to the endpoint URL.
import requests
import time

polling_url = response.headers[
    "x-location"
]  # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)
Once the polling finishes, `data` will contain the latest response from the server -- either the result of Speech Translation, or an error message with details in case processing was not able to finish properly. The technology result can be accessed as `data["result"]`, and for our sample audio, the result should look as follows (shortened for readability):
"one_best": {
"segments": [
{
"channel_number": 0,
"start_time": 2.17,
"end_time": 5.3,
"language": "en",
"text": "Good day, I am very happy that you are calling.",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 5.3,
"end_time": 8.3,
"language": "en",
"text": "We have just opened the swimming courses,",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 8.3,
"end_time": 12.08,
"language": "en",
"text": "and we have planned them like this.",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 12.08,
"end_time": 15.41,
"language": "en",
"text": "The course is for one semester,",
"source_language": "cs",
"detected_source_language": "cs",
},
{
"channel_number": 0,
"start_time": 15.41,
"end_time": 18.41,
"language": "en",
"text": "and the training takes place once or twice a week,",
"source_language": "cs",
"detected_source_language": "cs",
},
...,
]
}
In the example above, both `source_language` and `detected_source_language` contain the same language code. However, the values may differ in some cases -- a detailed explanation can be found in the description of `detected_source_language` in the `GET` request to the `/api/technology/speech-translation-whisper-enhanced/:task_id` endpoint.
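For illustration, here is a small sketch that flags segments where the two fields differ; it assumes `data` is the parsed polling response with the result structure shown above.

# `data` is the parsed polling response from the previous step
for segment in data["result"]["one_best"]["segments"]:
    if segment["source_language"] != segment["detected_source_language"]:
        print(
            f"Segment at {segment['start_time']}s: requested "
            f"'{segment['source_language']}' but detected "
            f"'{segment['detected_source_language']}'"
        )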
In case you are processing multi-channel media files, the translated segments for all channels appear in the common `segments` list and are clearly distinguished by their `channel_number` value. The list of segments is sorted by `start_time`, `end_time`, and `channel_number`. You can split the common list back into per-channel lists, as shown in the sketch below.
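A short sketch that groups segments by channel, again assuming the result structure shown earlier:

from collections import defaultdict

# `data` is the parsed polling response from the polling step
segments_by_channel = defaultdict(list)
for segment in data["result"]["one_best"]["segments"]:
    segments_by_channel[segment["channel_number"]].append(segment)

for channel, segments in sorted(segments_by_channel.items()):
    print(f"Channel {channel}: {len(segments)} translated segments")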
Speech Translation with Parameters
The technology supports two mutually exclusive query parameters -- `source_language` and `enable_language_switching`. In case you know what language is used in the file, you can specify it with the `source_language` parameter and possibly make the translation more accurate. In case the file contains multiple languages, you can use the `enable_language_switching` parameter, so that the source language is re-detected every 30 seconds.

When specifying `source_language`, the `POST` request can look as follows:
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)

params = {"source_language": "cs"}  # selecting the source language manually
# or enable language switching
# params = {"enable_language_switching": True}

audio_path = "Lenka.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
        params=params,
    )

response.raise_for_status()
You can follow the same polling steps and result parsing as demonstrated in the Basic Speech Translation section.
Full Python Code
Here is the full example of how to run the Speech Translation technology. The code is slightly adjusted and wrapped into functions for better readability.
import json
import time

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = (
    f"{SPEECH_PLATFORM_SERVER}/api/technology/speech-translation-whisper-enhanced"
)


def poll_result(polling_url: str, sleep: int = 5):
    # Poll until the task reaches a terminal state.
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_speech_translation_whisper_enhanced(audio_path: str):
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filenames = ["Lenka.wav", "Tatiana.wav", "Xiang.wav", "Zoltan.wav"]
for filename in filenames:
    print(f"Running Enhanced Speech Translation Built on Whisper for file {filename}.")
    data = run_speech_translation_whisper_enhanced(filename)
    result = data["result"]
    print(f"{json.dumps(result, indent=2)}\n")