How to Properly Convert Confusion Network Results to One-best

Confusion Network (CN) output is the most detailed Speech Engine STT output, because it provides multiple word alternatives for each time slot of the processed speech signal. Many applications therefore use it as their main source of speech transcription and convert it internally to less verbose output formats when needed. This article describes the recommended way to convert CN output to one-best output.

Time slots and word alternatives:

[Figure: STT Confusion Network time slots and word alternatives]

To convert CN output to one-best output:

  1. Loop through all CN time slots from start to end:

    • In each time slot, get the input alternative with the highest score. If it is not <null/> or _DELETE_:
      • Add the input alternative at the end of your output.
  2. Loop through all alternatives in your output:

    • For each alternative, amend its end time to match the start time of the following alternative. The last alternative has no following one, so its end time stays unchanged.
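
The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Alternative` class, its field names, and the representation of a CN as a list of time slots (each a list of alternatives carrying its own start and end time) are hypothetical and do not reflect the actual Speech Engine result format, so adapt them to the data you really receive.

```python
from dataclasses import dataclass, replace
from typing import List

# Words that mark "no word in this time slot" and must not reach the output.
SKIPPED_WORDS = {"<null/>", "_DELETE_"}


@dataclass
class Alternative:
    # Hypothetical structure: field names are assumptions, not the engine's API.
    word: str
    score: float
    start: float
    end: float


def cn_to_one_best(time_slots: List[List[Alternative]]) -> List[Alternative]:
    output: List[Alternative] = []

    # Step 1: take the highest-scoring alternative of each time slot,
    # unless it is <null/> or _DELETE_.
    for slot in time_slots:
        if not slot:
            continue  # defensive: ignore a time slot with no alternatives
        best = max(slot, key=lambda alt: alt.score)
        if best.word not in SKIPPED_WORDS:
            output.append(replace(best))  # copy, so the input CN stays untouched

    # Step 2: stretch each word's end time to the start of the following word.
    # The last word is not visited as "current", so its end time is kept.
    for current, following in zip(output, output[1:]):
        current.end = following.start

    return output
```

For instance, given three time slots whose winners are "hello" (0.0 to 0.5), <null/> (0.5 to 0.7) and "world" (0.7 to 1.2), the sketch returns "hello" spanning 0.0 to 0.7 followed by "world" spanning 0.7 to 1.2: the skipped slot is absorbed into the preceding word.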

Alternatively, the second step can be done right away when building the result:

  1. Loop through all CN time-slots from start to end:
    • In each time-slot, get the input alternative with the highest score. If it's not <null/> or _DELETE_:
      • If your output is not empty, set the end time of the last alternative in your output to the start time of the input alternative.
      • Add the input alternative at the end of your output.
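
A minimal Python sketch of this single-pass variant is shown below. It assumes, purely for illustration, that a CN is a list of time slots and that each time slot is a list of dictionaries with the hypothetical keys `word`, `score`, `start` and `end`; the real Speech Engine result may be structured differently.

```python
# Words that mark "no word in this time slot" and must not reach the output.
SKIPPED_WORDS = {"<null/>", "_DELETE_"}


def cn_to_one_best_single_pass(time_slots):
    """Build the one-best output and fix end times in the same loop.

    `time_slots` is assumed to be a list of lists of dicts with the
    hypothetical keys "word", "score", "start" and "end".
    """
    output = []
    for slot in time_slots:
        if not slot:
            continue  # defensive: ignore a time slot with no alternatives
        best = max(slot, key=lambda alt: alt["score"])
        if best["word"] in SKIPPED_WORDS:
            continue
        if output:
            # Close the previous word exactly where the new word starts.
            output[-1]["end"] = best["start"]
        output.append(dict(best))  # copy, so the input CN stays untouched
    return output
```

Because the end time of a word is only updated when a later word is appended, the last word in the output keeps its original end time, which matches the result of the two-step procedure.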

Example:

[Figure: STT Confusion Network to one-best conversion]