Version: 4.0.2

Speech to Text

Speech to Text combines two technologies, Phonexia 6th Gen and Enhanced Speech to Text Built on Whisper, which together support a wide range of languages. The language options you see are tailored to your needs.

Supported languages

Have a look at the complete language portfolio of our Speech to Text.

This page explains how to use Phonexia Speech to Text in our web application. If you want to dive deeper into the inner workings of this technology, check out our detailed technical documentation.

Selecting a language

The first step is to select the language spoken in the recordings. If you are unsure of the language, use auto-detect mode, which identifies the language and then transcribes the recording. If you know that more than one language is spoken in the recording, enable language switching mode to get each part transcribed in its respective language. Read more about language switching mode.

note

If you choose auto-detect or language switching mode, Speech to Text will only identify and use the languages from the Built on Whisper portfolio.

Uploading files

Upload your files or create your own recordings with the built-in recording feature. If you don't have files of your own, you can use the Phonexia examples, which are preconfigured for automatic transcription in auto-detect mode.

Results

The transcription process may take a while.

caution

Cancelling the process or deleting the recording while you wait for the transcription does not affect transcription speed, because the transcription continues uninterrupted on the server.

Leaving the page while you wait for the transcription may interrupt the process.

Export formats

Whether you export in bulk or one recording at a time, you can choose from several export formats. All formats except XLSX use UTF-8 encoding.

Plain text

This format provides plain text without timestamps or any other metadata. The text merges the transcription of all speech and does not distinguish individual channels.

Yeah, sure. Hold on a second. Where did I put it? Ah, here we are.
So the agreement number is 7895478. Right, the third digit. Where do I...
Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number.
Yeah, so the fourth and seventh digit you said, right? Fourth and seventh digit.
Okay, it's three and four. Well, I'm interested in the super speed tariff from your offer.
No, I think that's everything. Thank you.
Yeah, sounds good.
Yeah, that's all.
Cheers. You too.

Text with timestamps

This format contains two types of data: timestamps and text. The text merges the transcription of all speech and does not distinguish individual channels.

00:01  Yeah, sure. Hold on a second. Where did I put it? Ah, here we are.
00:06  So the agreement number is 7895478. Right, the third digit. Where do I...
00:19  Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number.
00:26  Yeah, so the fourth and seventh digit you said, right? Fourth and seventh digit.
00:30  Okay, it's three and four. Well, I'm interested in the super speed tariff from your offer.
00:41  No, I think that's everything. Thank you.
00:45  Yeah, sounds good.
00:46  Yeah, that's all.
00:49  Cheers. You too.
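If you need to process this export programmatically, the lines can be split back into a time offset and the text. The following is a minimal sketch, assuming each line has the `MM:SS` layout shown in the sample above followed by whitespace and the text; the function name `parse_timestamped_text` is hypothetical, and recordings longer than an hour may use a layout this pattern does not cover.

```python
import re

# Assumed line layout, based only on the sample above: "MM:SS", whitespace, text.
LINE_RE = re.compile(r"^(\d+):(\d{2})\s+(.*)$")


def parse_timestamped_text(content: str) -> list[tuple[int, str]]:
    """Return (offset_in_seconds, text) pairs, one per timestamped line."""
    entries = []
    for line in content.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        minutes, seconds, text = match.groups()
        entries.append((int(minutes) * 60 + int(seconds), text))
    return entries


sample = (
    "00:41  No, I think that's everything. Thank you.\n"
    "00:45  Yeah, sounds good.\n"
)
for offset, text in parse_timestamped_text(sample):
    print(offset, text)
```

When reading the export from disk, open the file with `encoding="utf-8"`.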

CSV and XLSX format

These formats contain identical metadata: transcription technology, language of transcription, channel tags, segment timestamps, confidence score, and the transcribed text.

The .CSV format is well suited to users who work with large datasets, because it allows computational processing and filtering by specific metadata criteria. In this format, the start time and end time of each segment are expressed in seconds (for example, 19.61).

Transcription_technology,Language_code,Channel,Start_time,End_time,Confidence_score,Transcription
Built on Whisper,en,0,1.06,6.8,,"Yeah, sure. Hold on a second. Where did I put it? Ah, here we are."
Built on Whisper,en,0,6.8,19.61,,"So the agreement number is 7895478. Right, the third digit. Where do I..."
Built on Whisper,en,0,19.61,26.01,,"Oh, yes, it's nine. Oh, right, the security code. Sorry, not the agreement number."
Built on Whisper,en,0,26.01,30.07,,"Yeah, so the fourth and seventh digit you said, right? Fourth and seventh digit."
Built on Whisper,en,0,30.07,41.58,,"Okay, it's three and four. Well, I'm interested in the super speed tariff from your offer."
Built on Whisper,en,0,41.58,45.6,,"No, I think that's everything. Thank you."
Built on Whisper,en,0,45.6,46.6,,"Yeah, sounds good."
Built on Whisper,en,0,46.6,49.46,,"Yeah, that's all."
Built on Whisper,en,0,49.46,52.49,,Cheers. You too.
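As an illustration of filtering by metadata, the sketch below parses the CSV export with Python's standard `csv` module and keeps only segments longer than two seconds. The column names come from the sample header above; the function name `load_segments`, the derived `duration` field, and the two-second threshold are assumptions made for this example.

```python
import csv
import io


def load_segments(csv_text: str) -> list[dict]:
    """Parse the CSV export into dicts with numeric start and end times."""
    reader = csv.DictReader(io.StringIO(csv_text))
    segments = []
    for row in reader:
        start = float(row["Start_time"])
        end = float(row["End_time"])
        score = row["Confidence_score"]
        segments.append({
            "channel": int(row["Channel"]),
            "start": start,
            "end": end,
            "duration": round(end - start, 2),  # derived, not part of the export
            # The sample rows leave the confidence score empty, so treat it as optional.
            "confidence": float(score) if score else None,
            "text": row["Transcription"],
        })
    return segments


sample = (
    "Transcription_technology,Language_code,Channel,Start_time,End_time,"
    "Confidence_score,Transcription\n"
    'Built on Whisper,en,0,1.06,6.8,,"Yeah, sure. Hold on a second. '
    'Where did I put it? Ah, here we are."\n'
    'Built on Whisper,en,0,45.6,46.6,,"Yeah, sounds good."\n'
)
segments = load_segments(sample)
long_segments = [s for s in segments if s["duration"] > 2.0]
print(len(long_segments), long_segments[0]["text"])
```

To read an exported file instead of a string, pass `open(path, newline="", encoding="utf-8")` to `csv.DictReader`.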

The .XLSX format provides a clear, human-readable overview of the metadata and text, for users who prefer a more visual presentation of the data. In this format, timestamps are shown as HH:MM:SS.

Figure: table showing a list of transcriptions, including metadata such as the selected transcription technology variant, language code, channel, and timestamps.
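To compare values between the two formats, seconds from the CSV export can be converted to the HH:MM:SS notation. This is a minimal sketch with a hypothetical helper `to_hms`; it truncates fractional seconds, which matches the timestamps in the text sample above (19.61 s appears as 00:19), but whether the XLSX export rounds or truncates is an assumption.

```python
def to_hms(total_seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, dropping the fractional part."""
    whole = int(total_seconds)  # truncation is an assumption, see the note above
    hours, remainder = divmod(whole, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(to_hms(19.61))
print(to_hms(3725.4))
```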

JSON format

This format contains metadata similar to that of the formats above. In addition, it provides per-word segmentation, with a timestamp for each individual word.

The first example shows output from Enhanced Speech to Text Built on Whisper; the second shows output from Phonexia 6th Gen.

{
  "one_best": {
    "segments": [
      {
        "channel_number": 0,
        "start_time": 0.32,
        "end_time": 1.12,
        "language": "en",
        "text": "How are you?",
        "words": [
          {
            "start_time": 0.32,
            "end_time": 0.48,
            "text": "how",
            "punctuated_text": "How"
          },
          {
            "start_time": 0.48,
            "end_time": 0.72,
            "text": "are",
            "punctuated_text": "are"
          },
          {
            "start_time": 0.72,
            "end_time": 1.12,
            "text": "you",
            "punctuated_text": "you?"
          }
        ]
      },
      {
        "channel_number": 0,
        "start_time": 2.07,
        "end_time": 3.23,
        "language": "en",
        "text": "Yeah, I'm fine, also.",
        "words": [
          {
            "start_time": 2.07,
            "end_time": 2.31,
            "text": "yeah",
            "punctuated_text": "Yeah,"
          },
          {
            "start_time": 2.31,
            "end_time": 2.51,
            "text": "i'm",
            "punctuated_text": "I'm"
          },
          {
            "start_time": 2.51,
            "end_time": 2.79,
            "text": "fine",
            "punctuated_text": "fine,"
          },
          {
            "start_time": 2.91,
            "end_time": 3.23,
            "text": "also",
            "punctuated_text": "also."
          }
        ]
      },
      {
        "channel_number": 0,
        "start_time": 4.06,
        "end_time": 4.84,
        "language": "en",
        "text": "No, it's okay.",
        "words": [
          {
            "start_time": 4.06,
            "end_time": 4.32,
            "text": "no",
            "punctuated_text": "No,"
          },
          {
            "start_time": 4.32,
            "end_time": 4.58,
            "text": "it's",
            "punctuated_text": "it's"
          },
          {
            "start_time": 4.58,
            "end_time": 4.84,
            "text": "okay",
            "punctuated_text": "okay."
          }
        ]
      }
    ]
  }
}

{
  "one_best": {
    "segments": [
      {
        "channel_number": 0,
        "end_time": 0.987,
        "start_time": 0.271,
        "text": "how are you",
        "words": [
          {
            "end_time": 0.42,
            "start_time": 0.271,
            "text": "how"
          },
          {
            "end_time": 0.571,
            "start_time": 0.42,
            "text": "are"
          },
          {
            "end_time": 0.987,
            "start_time": 0.571,
            "text": "you"
          }
        ]
      },
      {
        "channel_number": 0,
        "end_time": 7.80,
        "start_time": 1.89,
        "text": "yeah <silence/> i'm fine also <silence/> no <silence/> that's okay",
        "words": [
          {
            "end_time": 2.28,
            "start_time": 1.89,
            "text": "yeah"
          },
          {
            "end_time": 2.31,
            "start_time": 2.28,
            "text": "<silence/>"
          },
          {
            "end_time": 2.37,
            "start_time": 2.31,
            "text": "i'm"
          },
          {
            "end_time": 2.76,
            "start_time": 2.37,
            "text": "fine"
          },
          {
            "end_time": 3.3,
            "start_time": 2.76,
            "text": "also"
          },
          {
            "end_time": 3.895,
            "start_time": 3.3,
            "text": "<silence/>"
          },
          {
            "end_time": 4.253,
            "start_time": 3.895,
            "text": "no"
          },
          {
            "end_time": 4.282,
            "start_time": 4.253,
            "text": "<silence/>"
          },
          {
            "end_time": 4.44,
            "start_time": 4.282,
            "text": "that's"
          },
          {
            "end_time": 4.83,
            "start_time": 4.44,
            "text": "okay"
          }
        ]
      }
    ]
  }
}
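The two samples differ in a few ways that matter when reading the JSON: only the Built on Whisper sample carries `language` and `punctuated_text`, and only the Phonexia 6th Gen sample contains `<silence/>` entries in `words`. The sketch below walks the per-word data of either variant. The field names come from the samples above; the helper names `iter_words` and `silence_seconds` are hypothetical, and the assumption that these samples represent every export is untested.

```python
SILENCE_TOKEN = "<silence/>"  # marker seen in the Phonexia 6th Gen sample


def iter_words(export: dict):
    """Yield (display_text, start_time, end_time) for every spoken word."""
    for segment in export["one_best"]["segments"]:
        for word in segment["words"]:
            if word["text"] == SILENCE_TOKEN:
                continue  # skip silence markers
            # Fall back to the plain text when "punctuated_text" is absent.
            display = word.get("punctuated_text", word["text"])
            yield display, word["start_time"], word["end_time"]


def silence_seconds(export: dict) -> float:
    """Sum the duration of all silence markers, rounded to milliseconds."""
    total = 0.0
    for segment in export["one_best"]["segments"]:
        for word in segment["words"]:
            if word["text"] == SILENCE_TOKEN:
                total += word["end_time"] - word["start_time"]
    return round(total, 3)


# Shortened copies of the two samples above.
whisper_sample = {"one_best": {"segments": [{
    "channel_number": 0, "start_time": 0.32, "end_time": 1.12,
    "language": "en", "text": "How are you?",
    "words": [
        {"start_time": 0.32, "end_time": 0.48, "text": "how", "punctuated_text": "How"},
        {"start_time": 0.48, "end_time": 0.72, "text": "are", "punctuated_text": "are"},
        {"start_time": 0.72, "end_time": 1.12, "text": "you", "punctuated_text": "you?"},
    ],
}]}}
sixth_gen_sample = {"one_best": {"segments": [{
    "channel_number": 0, "start_time": 1.89, "end_time": 2.37,
    "text": "yeah <silence/> i'm",
    "words": [
        {"start_time": 1.89, "end_time": 2.28, "text": "yeah"},
        {"start_time": 2.28, "end_time": 2.31, "text": "<silence/>"},
        {"start_time": 2.31, "end_time": 2.37, "text": "i'm"},
    ],
}]}}

for display, start, end in iter_words(whisper_sample):
    print(f"{start:.2f}-{end:.2f} {display}")
print(silence_seconds(sixth_gen_sample))
```

For an exported file, load the data first with `json.load(open(path, encoding="utf-8"))` and pass the result to these helpers.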
