Version: 4.0.2

Speech to Text

This tutorial demonstrates how to use the Phonexia Speech Platform 4 to obtain a transcript of the speech in a media file, in other words, how to run Speech to Text. The Speech Platform supports two Speech to Text engines: Phonexia 6th generation Speech to Text and Enhanced Speech to Text Built on Whisper. In this guide we will process audio with both of them.

Attached, you will find the audio file Harry.wav, which is a mono recording in English. It is used as the example audio throughout this guide.

At the end of this guide, you will find a full Python code example that combines all the steps discussed, which you can use as a starting point for implementing Speech to Text in your own projects.

Environment setup

This example uses Python 3.9 and version 2.27 of the requests library. You can install the requests library with pip as follows:

pip install requests~=2.27

Then, you can import the following libraries (json and time are built-in):

import json
import time
import requests

Phonexia 6th generation

To run Phonexia Speech to Text processing for a single media file, start by sending a POST request to the /api/technology/speech-to-text endpoint.

You also need to pass the language of the recording as an argument. The example recording is in English, so you should pass en, as follows:

# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"
with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
        files=files,
        params={"language": "en"},  # The language must match the language spoken in the recording.
    )
    response.raise_for_status()

Note that to get meaningful results, the language parameter has to match the language spoken in the audio file. See the documentation for the list of supported languages.

If the task is accepted, the server returns HTTP status code 202 together with a unique task ID in the response body. The task is not processed immediately; it is only scheduled for processing. You can check the current task status by polling for the result.

The URL for polling for the result is returned in the Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the initial URL.

polling_url = response.headers["Location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)
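
As an alternative to reading the Location header, the polling URL can be assembled by hand, as described above. The following is a minimal sketch of that string operation only. The helper name build_polling_url, the example host and the task ID value are hypothetical and used for illustration. The "initial URL" is taken to mean the endpoint URL without the query string (such as ?language=en).

```python
def build_polling_url(initial_url: str, task_id: str) -> str:
    """Append a slash and the task ID to the URL the file was posted to."""
    # Strip a trailing slash first so the result never contains "//".
    return f"{initial_url.rstrip('/')}/{task_id}"


# Hypothetical server name and task ID, used only for illustration.
initial_url = "https://speech-platform.example.com/api/technology/speech-to-text"
task_id = "f866570e-9d62-4e31-b85b-1fa385e90c71"

polling_url = build_polling_url(initial_url, task_id)
print(polling_url)
```

Using the Location header remains the simpler option, because it does not require parsing the task ID out of the response body.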

Once the polling finishes, data will contain the latest response from the server: either a response with the transcript, or an error message with details if processing could not finish. An example result of a successful Phonexia Speech to Text transcription looks like this (output was shortened for readability):

{
  "task": {"task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71", "state": "done"},
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 1.182,
          "end_time": 3.259,
          "text": "oh <silence/> yeah sure <silence/> hold on a second",
          "words": [
            {"start_time": 1.182, "end_time": 1.4, "text": "oh"},
            {"start_time": 1.4, "end_time": 1.444, "text": "<silence/>"},
            {"start_time": 1.444, "end_time": 1.67, "text": "yeah"},
            ...
            {"start_time": 2.81, "end_time": 3.259, "text": "second"}
          ]
        },
        ...
        {
          "channel_number": 0,
          "start_time": 51.79,
          "end_time": 52.545,
          "text": "cheers you too",
          "words": [...]
        }
      ]
    },
    "additional_words": []
  }
}
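
Because the polling loop stops on failed and rejected tasks as well, and also gives up after 100 attempts, it can help to check the final state before reading the transcript. The sketch below is one possible way to do that. The helper name require_done and the choice of exception types are our own and are not part of the Speech Platform API.

```python
TERMINAL_STATES = {"done", "failed", "rejected"}


def require_done(data: dict) -> dict:
    """Return the "result" part of a finished task, or raise if there is none."""
    state = data["task"]["state"]
    if state not in TERMINAL_STATES:
        # The polling loop ran out of attempts before the task finished.
        raise TimeoutError(f"Task is still in state {state!r}")
    if state != "done":
        # "failed" or "rejected": include the whole body, it carries the details.
        raise RuntimeError(f"Task ended in state {state!r}: {data}")
    return data["result"]
```

With this helper, calling require_done(data) after the loop either returns the result dictionary shown above or raises an exception that includes the server's response.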

Phonexia 6th generation Speech to Text provides detailed timestamps for individual words. If you are only interested in a single transcript of the entire media file, you can concatenate the text of all segments like this:

transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
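
As the example output shows, the segment text can contain markers such as <silence/>. If you want a transcript with spoken words only, one option is to filter these markers out. The following is a minimal sketch. It assumes that every non-speech marker is a single token of the form <name/>, which is only confirmed for <silence/> in the output above. The function names are hypothetical.

```python
def is_marker(token: str) -> bool:
    """Return True for tokens that look like <silence/> (assumed marker form)."""
    return token.startswith("<") and token.endswith("/>")


def plain_transcript(data: dict) -> str:
    """Join the segment texts, one per line, with marker tokens removed."""
    lines = []
    for segment in data["result"]["one_best"]["segments"]:
        tokens = [t for t in segment["text"].split() if not is_marker(t)]
        if tokens:  # Skip segments that contained markers only.
            lines.append(" ".join(tokens))
    return "\n".join(lines)
```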

Congratulations, you have successfully run Speech to Text using the Phonexia 6th generation model.

Phonexia 6th generation with parameters

Phonexia 6th generation Speech to Text provides an option to fine-tune the transcription. The underlying language model may struggle with jargon, local names, region-specific pronunciation or neologisms. Furthermore, speakers in your recordings might utter phrases whose transcription is ambiguous, for example because of a strong accent. You can use the config request body parameter to deal with all these situations.

In the preferred_phrases attribute of config, you can specify phrases to be preferred over other options. In the additional_words attribute, you can list words that you expect not to be part of the language model's built-in vocabulary. See the endpoint documentation for more details. To use the request body parameters, encode them as a JSON string with json.dumps and pass them as the value of the data parameter of requests.post. In the following example, we define a specific phrase we expect to appear in the recording. To make the transcription even more precise, we provide the phonetic form of the word 'superspeed'.

payload = {
    "config": json.dumps(
        {
            "preferred_phrases": ["superspeed tariff from your offer"],
            "additional_words": [
                {
                    "spelling": "superspeed",
                    "pronunciations": ["s u p @r s p i d"],
                },
            ]
        }
    )
}

with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
        files=files,
        data=payload,  # Notice that we now send config in the request body
        params={"language": "en"},
    )
    response.raise_for_status()

Warning

The config parameter expects a JSON formatted string. In other words, config's content is going to be interpreted as JSON by Speech Platform 4. Therefore, all rules of the JSON syntax apply.

Some phoneme symbols (e.g. the Czech phoneme P\) include the backslash which has a special meaning in JSON and must be escaped with another backslash (\\) to suppress the special meaning. For example, the correct way to capture the pronunciation of the Czech word Řek in config.additional_words[*].pronunciations is "P\\ e k".
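
When the config string is built in Python with json.dumps, as in the example above, the JSON-level escaping is done for you. You only need to make sure the Python string itself contains a single backslash. The short sketch below illustrates the two layers of escaping. The word and pronunciation are the ones from the warning above.

```python
import json

# Python string literal: "P\\ e k" holds ONE backslash (the text P\ e k).
# A raw string, r"P\ e k", is an equivalent way to write it.
pronunciation = "P\\ e k"

config = json.dumps(
    {
        "additional_words": [
            {"spelling": "Řek", "pronunciations": [pronunciation]},
        ]
    }
)

# json.dumps escapes the backslash, so the JSON text contains "P\\ e k".
print(config)

# Decoding the JSON gives back the original single-backslash pronunciation.
decoded = json.loads(config)["additional_words"][0]["pronunciations"][0]
print(decoded == pronunciation)
```

If you write the JSON string by hand instead of using json.dumps, the doubled backslash has to appear in the JSON text itself.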

After polling for the result as in the previous example, we'll get the following output (shortened for readability):

{
  "task": {"task_id": "f866570e-9d62-4e31-b85b-1fa385e90c71", "state": "done"},
  "result": {
    "one_best": {...},
    "additional_words": [
      ...
      {
        "spelling": "superspeed",
        "pronunciations": [
          {
            "pronunciation": "s u p @r s p i d",
            "out_of_vocabulary": false
          }
        ]
      },
      {
        "spelling": "tariff",
        "pronunciations": [
          {
            "pronunciation": "t E r @ f",
            "out_of_vocabulary": false
          }
        ]
      }
    ]
  }
}

Notice that the technology returns pronunciation for all the additional words and words included in preferred phrases, even though we didn't specify pronunciations for all of them. If a word's pronunciation isn't specified by the user, it's either found in the model's vocabulary or auto-generated.
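
To inspect which pronunciations were actually used, you can turn the additional_words part of the result into a simple lookup table. The following is a minimal sketch based on the output structure shown above. The function name is hypothetical.

```python
def pronunciations_by_spelling(data: dict) -> dict:
    """Map each spelling to a list of (pronunciation, out_of_vocabulary) pairs."""
    table = {}
    for word in data["result"]["additional_words"]:
        table[word["spelling"]] = [
            (p["pronunciation"], p["out_of_vocabulary"])
            for p in word["pronunciations"]
        ]
    return table
```

For the output above, the table would contain an entry for 'superspeed' with the pronunciation we supplied and an entry for 'tariff' with the pronunciation returned by the technology.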

Enhanced Speech to Text Built on Whisper

In order to run Enhanced Speech to Text Built on Whisper processing for a single media file, you should start by sending a POST request to /api/technology/speech-to-text-whisper-enhanced as follows:

# Replace <speech-platform-server> with the actual server address
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"
with open("Harry.wav", mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
        files=files,
    )
    response.raise_for_status()

The URL for polling for the result is returned in the Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the initial URL.

polling_url = response.headers["Location"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    response.raise_for_status()
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in ["done", "failed", "rejected"]:
        break
    counter += 1
    time.sleep(5)

Once the polling finishes, data will contain the latest response from the server: either a response with the transcript, or an error message with details if processing could not finish. An example result of a successful Enhanced Speech to Text Built on Whisper transcription looks like this (output was shortened for readability):

{
    "task": {"task_id": "b1850fed-3e4b-4f2a-94c8-9a57e73abb9f", "state": "done"},
    "result": {
        "one_best": {
            "segments": [
                {
                    "channel_number": 0,
                    "start_time": 1.06,
                    "end_time": 6.8,
                    "language": "en",
                    "text": "Yeah, sure. Hold on a second. Where did I put it? Ah, here we are."
                },
                {
                    "channel_number": 0,
                    "start_time": 6.8,
                    "end_time": 19.61,
                    "language": "en",
                    "text": "So the agreement number is 7895478. Right, the third digit. Where do I..."
                },
                {
                    "channel_number": 0,
                    "start_time": 19.61,
                    "end_time": 26.01,
                    "language": "en",
                    "text": "Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number."
                },
                ...
                {
                    "channel_number": 0,
                    "start_time": 49.46,
                    "end_time": 52.49,
                    "language": "en",
                    "text": "Cheers. You too."
                }
            ]
        }
    }
}

You can aggregate the transcript of the entire media file like this:

transcript = ""
for segment in data["result"]["one_best"]["segments"]:
    transcript += segment["text"] + "\n"
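
Each segment in the output above also carries start_time, end_time and language. If you want to keep that information in the transcript, you can format each segment as a timestamped line. The following is a minimal sketch. The function names and the line layout are our own choice, and the times are assumed to be in seconds, as the example values suggest.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.cc (minutes, seconds, centiseconds)."""
    # Work in whole centiseconds to avoid results such as "00:60.00".
    total_cs = round(seconds * 100)
    minutes, rest_cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{rest_cs // 100:02d}.{rest_cs % 100:02d}"


def timestamped_transcript(data: dict) -> str:
    """Return one line per segment: [start - end] (language) text."""
    lines = []
    for segment in data["result"]["one_best"]["segments"]:
        start = format_timestamp(segment["start_time"])
        end = format_timestamp(segment["end_time"])
        lines.append(f"[{start} - {end}] ({segment['language']}) {segment['text']}")
    return "\n".join(lines)
```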

Congratulations, you have successfully run Speech to Text using Enhanced Speech to Text Built on Whisper.

Full Python Code

Here is the full code for this example, slightly adjusted and wrapped into functions for better readability:

import json
import time
import requests


# Replace <speech-platform-server> with the actual server address (host only; "https://" is added below)
SPEECH_PLATFORM_SERVER = "<speech-platform-server>"


def poll_result(polling_url: str, sleep: int = 5):
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in ["done", "failed", "rejected"]:
            break
        time.sleep(sleep)
    return response


def do_speech_to_text_phonexia(audio_path: str, language: str, config: dict):
    print("Running Phonexia 6th generation Speech to Text.")
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text",
            files=files,
            params={"language": language},
            data=config,
        )
        response.raise_for_status()
    polling_url = response.headers["Location"]
    speech_to_text_response = poll_result(polling_url)
    return speech_to_text_response.json()


def do_speech_to_text_whisper(audio_path: str):
    print("Running Enhanced Speech to Text Built on Whisper.")
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            f"https://{SPEECH_PLATFORM_SERVER}/api/technology/speech-to-text-whisper-enhanced",
            files=files,
        )
        response.raise_for_status()
    polling_url = response.headers["Location"]
    speech_to_text_whisper_enhanced_response = poll_result(polling_url)
    return speech_to_text_whisper_enhanced_response.json()


file_name = "Harry.wav"

payload = {
    "config": json.dumps(
        {
            "preferred_phrases": ["superspeed tariff from your offer"],
            "additional_words": [
                {
                    "spelling": "superspeed",
                    "pronunciations": ["s u p @r s p i d"],
                },
            ]
        }
    )
}

result = do_speech_to_text_phonexia(file_name, "en", payload)
print(f"Phonexia 6th generation:\n{json.dumps(result, indent=2)}\n")

result = do_speech_to_text_whisper(file_name)
print(f"Enhanced Speech to Text Built on Whisper:\n{json.dumps(result, indent=2)}")
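
Note that poll_result in the full code keeps polling until the task reaches a terminal state, whereas the step-by-step examples stop after 100 attempts. If you prefer a bounded wait in your own project, the following sketch shows one way to do it. The function name, its parameters and the use of TimeoutError are our own choice and not part of the Speech Platform API. The request itself is passed in as the fetch callable, so the logic can be tried without a server.

```python
import time
from typing import Callable

TERMINAL_STATES = {"done", "failed", "rejected"}


def poll_until_terminal(
    fetch: Callable[[], dict],
    max_attempts: int = 100,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the task state is terminal or attempts run out."""
    for attempt in range(max_attempts):
        data = fetch()
        if data["task"]["state"] in TERMINAL_STATES:
            return data
        if attempt < max_attempts - 1:
            sleep(delay)  # No need to wait after the final attempt.
    raise TimeoutError(f"Task not finished after {max_attempts} polls")
```

In the full code above, fetch could be a small function that calls requests.get(polling_url), then raise_for_status() on the response, and returns response.json().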
