Speech to Text
This guide demonstrates how to run Speech to Text with Phonexia Speech Platform 4 Virtual Appliance. The technology can transform speech in a media file into a text transcript. You can find a high-level description of the technology in the Speech to Text articles.
Phonexia Speech Platform 4 Virtual Appliance supports two different Speech to Text engines — Phonexia 6th generation Speech to Text and Enhanced Speech to Text Built on Whisper. In this guide we will process media files with both of them.
Throughout this guide, we'll be using this mono-channel audio file in English as an example: Harry.wav
At the end of this guide, you'll find the full Python code example that combines all the steps discussed separately above. This guide should give you a comprehensive understanding of how to integrate Speech to Text into your own projects.
Prerequisites
Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.
Run Phonexia 6th generation Speech to Text
To run Phonexia Speech to Text for a single media file, you should start by
sending a POST request to the
/api/technology/speech-to-text
endpoint. The file and language parameters are mandatory. In Python, you can
do this as follows:
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-to-text"

media_file = "Harry.wav"
params = {"language": "en"}  # Language of the transcription model

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
        params=params,
    )

print(start_task_response.status_code)  # Should print '202'
Note that to get meaningful results, the language parameter has to match the
language spoken in the audio file. See the documentation for the list of
supported languages.
If the task was successfully accepted, a 202 code will be returned together with
a unique task ID in the response body. The task isn't processed immediately,
but only scheduled for processing. You can check the current task status by
polling for the result.
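The URL to poll is returned in the Location response header, which is how the full code example at the end of this guide retrieves it. A minimal sketch, continuing from the request above:
polling_url = start_task_response.headers["Location"]  # URL to poll for the task result
print(polling_url)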
Polling
To obtain the final result, periodically query the task status until the task
state changes to done, failed or rejected. The general polling procedure
is described in detail in the
Task lifecycle code examples.
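As a quick illustration, a simple polling loop can look like this (the full code example at the end of this guide wraps the same logic in a poll_result function; requests is already imported above):
import time

# Poll the task endpoint until it reaches a terminal state.
while True:
    polling_task_response = requests.get(polling_url)
    polling_task_response.raise_for_status()
    polling_task_response_json = polling_task_response.json()
    if polling_task_response_json["task"]["state"] in {"done", "failed", "rejected"}:
        break
    time.sleep(5)  # Polling interval in seconds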
Result for Phonexia 6th generation Speech to Text
The result field of the task contains the one_best result, with a list of
segments from all channels. Each segment contains the channel_number,
start_time, end_time, and text of the segment, and a list of individual
timestamped words in the segment.
An example task result of a successful Phonexia Speech to Text transcription looks like this (shortened for readability):
{
  "task": {
    "task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71",
    "state": "done"
  },
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 1.182,
          "end_time": 3.259,
          "text": "oh <silence/> yeah sure <silence/> hold on a second",
          "words": [
            {"start_time": 1.182, "end_time": 1.4, "text": "oh"},
            {"start_time": 1.4, "end_time": 1.444, "text": "<silence/>"},
            {"start_time": 1.444, "end_time": 1.67, "text": "yeah"},
            ...
            {"start_time": 2.81, "end_time": 3.259, "text": "second"}
          ]
        },
        ...
        {
          "channel_number": 0,
          "start_time": 51.79,
          "end_time": 52.545,
          "text": "cheers you too",
          "words": [...]
        }
      ]
    },
    "additional_words": []
  }
}
Phonexia 6th generation Speech to Text provides detailed timestamps for
individual words. If you are only interested in a single transcript of the
entire media file, without timestamps for individual words, you can access the
whole text of each segment like this (using polling_task_response_json from
the polling step):
transcript = ""
for segment in polling_task_response_json["result"]["one_best"]["segments"]:
transcript += segment["text"] + "\n"
Run Phonexia 6th generation Speech to Text with parameters
Phonexia 6th generation Speech to Text provides some options to fine-tune the
transcription. The underlying language model may struggle with jargon, local
names, region-specific pronunciation or neologisms. Furthermore, speakers in
your media files might utter phrases whose transcription is ambiguous, for
example because of a strong accent. You can use the config request body field to
deal with all these situations.
With the preferred_phrases field in the config you can force the technology to
prefer some phrases over alternative transcriptions. In the additional_words
field you can list words that you expect to be missing from the technology's
built-in vocabulary. See the
endpoint documentation
for more details. In the following example, we define a specific phrase that we
expect to appear in the input files. To make the transcription even more
precise, we provide the phonetic form of the word 'superspeed'.
import json

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-to-text"

media_file = "Harry.wav"
params = {"language": "en"}  # Language of the transcription model
config = {
    "config": json.dumps(
        {
            "preferred_phrases": ["superspeed tariff from your offer"],
            "additional_words": [
                {
                    "spelling": "superspeed",
                    "pronunciations": ["s u p @r s p i d"],
                },
            ],
        }
    )
}

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
        data=config,
        params=params,
    )

print(start_task_response.status_code)  # Should print '202'
The config request body field expects a JSON-formatted string. In other words,
config's content is going to be interpreted as JSON by Speech Platform 4
Virtual Appliance. Therefore, all rules of the JSON syntax apply.
Some phoneme symbols (e.g. the Czech phoneme P\) include a backslash, which has a
special meaning in JSON and must be escaped with another backslash (\\). For
example, the correct way to capture the pronunciation of the Czech word řeka
("river") in config.additional_words[*].pronunciations is "P\\ e k a".
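If you build the config in Python and serialize it with json.dumps, as in the examples in this guide, this escaping is handled for you; the manual \\ is only needed when writing the JSON string by hand. A minimal sketch:
import json

# The Python string "P\\ e k a" contains a single backslash; json.dumps
# escapes it automatically when producing the JSON-formatted config string.
config_str = json.dumps(
    {"additional_words": [{"spelling": "řeka", "pronunciations": ["P\\ e k a"]}]}
)
print(config_str)  # ...{"spelling": "\u0159eka", "pronunciations": ["P\\ e k a"]}...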
After polling for the result as in the previous example, we'll get the following output (shortened for readability):
{
  "task": {
    "task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71",
    "state": "done"
  },
  "result": {
    "one_best": {...},
    "additional_words": [
      ...
      {
        "spelling": "superspeed",
        "pronunciations": [
          {
            "pronunciation": "s u p @r s p i d",
            "out_of_vocabulary": false
          }
        ]
      },
      {
        "spelling": "tariff",
        "pronunciations": [
          {
            "pronunciation": "t E r @ f",
            "out_of_vocabulary": false
          }
        ]
      }
    ]
  }
}
Notice that the technology returns pronunciations for all the additional words and words included in preferred phrases, even though we didn't specify pronunciations for all of them. If a word's pronunciation isn't specified by the user, it's either found in the technology's vocabulary or auto-generated.
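To check how each word was resolved, you can iterate over the returned list. A small sketch using polling_task_response_json from the polling step:
# List each additional word with its resolved pronunciations and whether the
# pronunciation had to be generated outside the built-in vocabulary.
for word in polling_task_response_json["result"]["additional_words"]:
    for pron in word["pronunciations"]:
        print(f'{word["spelling"]}: {pron["pronunciation"]} (out of vocabulary: {pron["out_of_vocabulary"]})')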
Run Enhanced Speech to Text Built on Whisper
To run Enhanced Speech to Text Built on Whisper for a single media file, you
should start by sending a POST request to the
/api/technology/speech-to-text-whisper-enhanced
endpoint. The file parameter is the only mandatory one. In Python, you can do this as
follows:
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-to-text-whisper-enhanced"

media_file = "Harry.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )

print(start_task_response.status_code)  # Should print '202'
If the task was successfully accepted, a 202 code will be returned together with
a unique task ID in the response body. The task isn't processed immediately,
but only scheduled for processing. You can check the current task status by
polling for the result.
Polling
To obtain the final result, periodically query the task status until the task
state changes to done, failed or rejected. The general polling procedure
is described in detail in the
Task lifecycle code examples.
Result for Enhanced Speech to Text Built on Whisper
The result field of the task contains the one_best result, with a list of
segments from all channels. Each segment contains the channel_number,
start_time, end_time, detected language, and text of the segment.
An example task result of a successful Enhanced Speech to Text Built on Whisper transcription looks like this (shortened for readability):
{
  "task": {
    "task_id": "b1850fed-3e4b-4f2a-94c8-9a57e73abb9f",
    "state": "done"
  },
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 1.06,
          "end_time": 6.8,
          "language": "en",
          "text": "Yeah, sure. Hold on a second. Where did I put it? Ah, here we are."
        },
        {
          "channel_number": 0,
          "start_time": 6.8,
          "end_time": 19.61,
          "language": "en",
          "text": "So the agreement number is 7895478. Right, the third digit. Where do I..."
        },
        {
          "channel_number": 0,
          "start_time": 19.61,
          "end_time": 26.01,
          "language": "en",
          "text": "Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number."
        },
        ...
        {
          "channel_number": 0,
          "start_time": 49.46,
          "end_time": 52.49,
          "language": "en",
          "text": "Cheers. You too."
        }
      ]
    }
  }
}
You can aggregate the transcript of the entire media file like this (using
polling_task_response_json from the polling step):
transcript = ""
for segment in polling_task_response_json["result"]["one_best"]["segments"]:
transcript += segment["text"] + "\n"
Full Python Code
Here is the full code for this example, slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template, applicable to all technologies.
import json
import time

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
PHONEXIA_MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-to-text"
WHISPER_MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-to-text-whisper-enhanced"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(endpoint_url, media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=endpoint_url,
            files=files,
            params=params or {},
            data={"config": json.dumps(config or {})},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


# Run Speech to Text
media_file = "Harry.wav"
params = {"language": "en"}
config = {
    "preferred_phrases": ["superspeed tariff from your offer"],
    "additional_words": [
        {
            "spelling": "superspeed",
            "pronunciations": ["s u p @r s p i d"],
        },
    ],
}

speech_to_text_phonexia_response = run_media_based_task(
    PHONEXIA_MEDIA_FILE_BASED_ENDPOINT_URL, media_file, params=params, config=config
)
print(f"Phonexia 6th generation:\n{json.dumps(speech_to_text_phonexia_response['result'], indent=2)}\n")

enhanced_speech_to_text_built_on_whisper_response = run_media_based_task(
    WHISPER_MEDIA_FILE_BASED_ENDPOINT_URL, media_file
)
print(f"Enhanced Speech to Text Built on Whisper:\n{json.dumps(enhanced_speech_to_text_built_on_whisper_response['result'], indent=2)}")