Version: 4.0.0-rc1

Transcription Normalization

Transcription normalization is a post-processing technology designed for the 6th generation of Phonexia Speech-to-Text. Its primary purpose is to transform the raw output of the Phonexia Speech-to-Text engine, which prioritizes accuracy and consistency for downstream processing over human readability, into text that is clear and easy for humans to read.

The technology normalizes transcriptions in two stages. The first stage uses a neural network to restore capitalization and insert punctuation. The second stage uses a context-free grammar model to convert numerical expressions written as words into their corresponding digit forms (e.g., "twenty-one" → "21").
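
The order of the two stages can be illustrated with a minimal Python sketch. It is not the Phonexia implementation: the first stage is a crude placeholder for the neural network (it only capitalizes the first letter and adds a final period), and the second stage replaces the context-free grammar with a small regular expression that covers English numbers from 0 to 99 only. All function names are hypothetical.

```python
import re

ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen sixteen "
         "seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Word -> value lookup for every number word the toy stage understands.
VALUES = {word: value for value, word in enumerate(ONES + TEENS)}
VALUES.update({word: 10 * (index + 2) for index, word in enumerate(TENS)})

# Matches "forty", "forty two", "forty-two", "seventeen", "seven", ...
NUMBER = re.compile(
    r"\b(?:(?P<tens>" + "|".join(TENS) + r")"
    r"(?:[- ](?P<unit>" + "|".join(ONES[1:]) + r"))?"
    r"|(?P<small>" + "|".join(TEENS + ONES) + r"))\b",
    re.IGNORECASE,
)


def restore_case_and_punctuation(text: str) -> str:
    """Stage 1 placeholder: the real stage is a neural network."""
    text = text.strip()
    if not text:
        return text
    return text[0].upper() + text[1:] + "."


def _to_digits(match: re.Match) -> str:
    if match.group("tens"):
        value = VALUES[match.group("tens").lower()]
        if match.group("unit"):
            value += VALUES[match.group("unit").lower()]
    else:
        value = VALUES[match.group("small").lower()]
    return str(value)


def normalize_numbers(text: str) -> str:
    """Stage 2 stand-in: rewrite spelled-out numbers 0-99 as digits.

    Limitation of this toy: it also rewrites non-numeric uses of
    number words, such as the pronoun "one".
    """
    return NUMBER.sub(_to_digits, text)


def normalize(raw_transcript: str) -> str:
    # Stage 1 runs first, stage 2 operates on its output.
    return normalize_numbers(restore_case_and_punctuation(raw_transcript))


print(normalize("the meeting starts in twenty-one minutes"))
```

A real grammar handles far more than this pattern (larger numbers, ordinals, language-specific word order), which is why number normalization is limited to selected languages.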

Key Features

  • Improved Readability: For every supported language, the technology capitalizes the appropriate letters and inserts punctuation marks, making transcriptions more natural and readable.
  • Advanced Number Normalization (Selected Languages): For a subset of supported languages, transcription normalization uses a context-free grammar model to convert numbers in text form into digit form.
  • High Speed: The normalization model is optimized for speed, allowing for fast processing of large volumes of transcriptions.
  • Multilingual: The normalization model is capable of processing transcriptions in any supported language, without the need for language-specific models.
  • Direct Compatibility: The normalization model is designed to directly accept the output of the 6th generation Phonexia Speech-to-Text engine, without the need for any additional transformation steps.
  • Segmentation Timestamps Preserved: The normalization model preserves the segmentation timestamps from the raw transcription output, allowing for further processing and alignment with the original audio input.

Supported Languages

Basic normalization is available for all languages supported by Phonexia Speech-to-Text 6th Gen, with the exception of the following languages:

  • Vietnamese (vi)
  • Kazakh (kk)
  • Georgian (ka)
  • Bengali (bn)
  • Persian (fa)
  • Chinese (zh)
  • Arabic (ar)
  • Pashto (ps)

Languages with Full Grammar Support (number-to-digit conversion):

  • Czech (cs)
  • German (de)
  • English (en)
  • Spanish (es)
  • French (fr)
  • Russian (ru)

Use Case

This technology is intended to be used as a post-processing step after speech-to-text transcription, ensuring that the final output is suitable for direct presentation to end-users or for further text processing tasks that require human-readable formatting.

Frequently Asked Questions

Q: Does the number of input words always match the number of output words?

A: For most languages, the number of words remains the same. However, for languages with full grammar support, the number of words may change because the context-free grammar rules can merge several words into one. For example, the phrase "forty two" may be normalized to the single token "42".
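
The following Python sketch shows why a changed word count does not conflict with preserved segmentation timestamps. The `Segment` layout (`start`, `end`, `text`) and the function names are assumptions made for illustration, not the actual output schema, and `toy_normalize` handles a single hard-coded phrase in place of the real grammar.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    start: float  # segment start in seconds (hypothetical field)
    end: float    # segment end in seconds (hypothetical field)
    text: str


def toy_normalize(text: str) -> str:
    """Stand-in for the real normalizer: rewrites one known phrase."""
    return text.replace("forty two", "42")


def normalize_segments(segments: List[Segment]) -> List[Segment]:
    # Only the text changes; start and end are copied over untouched.
    return [replace(seg, text=toy_normalize(seg.text)) for seg in segments]


raw = [
    Segment(0.0, 1.8, "the answer is forty two"),
    Segment(2.1, 3.0, "thank you"),
]
normalized = normalize_segments(raw)

for before, after in zip(raw, normalized):
    print(len(before.text.split()), "->", len(after.text.split()), after)
```

Code that aligns output with the audio should therefore rely on segment timestamps rather than on word positions, since word indices can shift after normalization.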

Q: Is the language detected automatically?

A: For basic normalization, the language is detected automatically. For languages with full grammar support, however, the language must be specified explicitly in the request; otherwise, the grammar is not applied, even if it is available for that language.
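
A client can check ahead of time what a request will get. The sketch below encodes the language lists from this page in Python; the function names are hypothetical and are not part of any Phonexia API, and the sketch cannot verify that a given code belongs to a language supported by Speech-to-Text in the first place.

```python
from typing import Optional

# Languages with full grammar support (number-to-digit conversion).
GRAMMAR_LANGUAGES = frozenset({"cs", "de", "en", "es", "fr", "ru"})

# Speech-to-Text languages for which no normalization is available.
EXCLUDED_LANGUAGES = frozenset({"vi", "kk", "ka", "bn", "fa", "zh", "ar", "ps"})


def basic_normalization_available(language: str) -> bool:
    """True unless the language is on the exclusion list.

    Assumes the code refers to a language supported by Speech-to-Text.
    """
    return language.strip().lower() not in EXCLUDED_LANGUAGES


def grammar_will_apply(requested_language: Optional[str]) -> bool:
    """True only if a grammar-supported language is given explicitly.

    None models a request that relies on automatic language detection,
    in which case the grammar is not applied.
    """
    if requested_language is None:
        return False
    return requested_language.strip().lower() in GRAMMAR_LANGUAGES


print(grammar_will_apply("en"))   # language given explicitly
print(grammar_will_apply(None))   # language left to auto-detection
print(basic_normalization_available("zh"))
```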

Q: Can I use the grammar for a language that is not supported?

A: No. The grammar is available only for the explicitly supported languages. For number normalization in any other language, you will need a different approach or technology.

Q: Can I use the normalization for multiple languages in a single request?

A: Although this is technically possible, it is not recommended. The technology is trained and intended for the output of the Phonexia 6th Gen Speech-to-Text engine, which does not produce transcriptions in multiple languages. Furthermore, the grammar rules cannot be applied correctly in this case, because the model is designed to work with a single language per request.