Keyword Spotting
Phonexia Keyword Spotting (KWS) technology enables the identification of specific keywords and key phrases within audio recordings. This tool is designed to extract valuable insights from large volumes of speech data. Users can define the keywords or phrases they are interested in, and the system will detect all instances of these keywords across the recordings. Additionally, this technology facilitates the automatic routing of relevant recordings or calls to designated experts for further analysis.
Typical use cases
Call centers
- Enhance operator and supervisor efficiency by searching for specific calls.
- Detect inappropriate language or expressions used by operators.
- Monitor marketing campaigns through automated script compliance checks.
Mass media and web search engines
- Index and search multimedia content by keyword.
- Route multimedia files and streams based on their content.
Security/defense
- Ensure rapid response times by directing calls with specific content to human operators.
- Search for targeted information within large call archives.
- Trigger immediate alarms (in real-time) when specific events are detected.
Technology
Phonexia Keyword Spotting technology operates purely on acoustic analysis, independent of any language dictionary. This allows for the detection of any words or phrases, including those in foreign languages.
Keyword Spotting utilizes a keyword list containing one or more keywords. Each keyword can have several pronunciation variants.
While there is no limit to the number of keywords or pronunciations that can be used, performance may decrease with larger keyword lists, particularly when pronunciations are not explicitly defined. In such cases, the technology generates pronunciations internally, which can lead to delays, especially if the list contains many undefined pronunciations. To maintain optimal performance, it is recommended to use a maximum of approximately 200 keywords unless pronunciations are predefined. If all keywords have predefined pronunciations, even lists with thousands of keywords do not impact performance.
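As a rough illustration of the guidance above, a keyword list can be modelled as keywords with optional pronunciation variants and checked against the recommendation of approximately 200 keywords without predefined pronunciations. This is only a sketch of one possible reading of that recommendation: the `Keyword` class, its field names, the `check_keyword_list` helper, and the phoneme strings are hypothetical and are not part of the Phonexia API.

```python
from dataclasses import dataclass, field

# Approximate ceiling taken from the text; it applies to keywords
# whose pronunciations must be generated internally.
RECOMMENDED_MAX_UNDEFINED = 200


@dataclass
class Keyword:
    """Hypothetical container: a spelling plus zero or more pronunciation variants."""
    spelling: str
    pronunciations: list[str] = field(default_factory=list)


def check_keyword_list(keywords: list[Keyword]) -> list[str]:
    """Return warnings about list properties that may slow down processing."""
    undefined = [k.spelling for k in keywords if not k.pronunciations]
    warnings = []
    if len(undefined) > RECOMMENDED_MAX_UNDEFINED:
        warnings.append(
            f"{len(undefined)} keywords have no predefined pronunciation; "
            f"consider defining them or staying near {RECOMMENDED_MAX_UNDEFINED}."
        )
    return warnings


# Made-up phoneme strings, for illustration only.
keywords = [
    Keyword("phonexia", ["f o n e k s i a", "f o n e k s i j a"]),
    Keyword("voice biometrics"),  # pronunciation would be generated internally
]
print(check_keyword_list(keywords))  # prints []
```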
The technology processes audio recordings and returns a list of detected keywords, each with an associated score and confidence level.
Note that Keyword Spotting always searches for keywords based on how they are pronounced (how they sound), not how they are written.
Keywords
Keywords are not dependent on any dictionary. This flexibility allows you to define specific, foreign, or even nonexistent words, such as product names.
However, keywords must be composed of allowed graphemes (symbols) from a supported list. This list of supported graphemes can be easily obtained from the API.
If Keyword Spotting rejects a keyword and returns an error, verify that the keyword contains only allowed graphemes.
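The check described above can be sketched as a small pre-validation step run before submitting a keyword. This is a minimal sketch under stated assumptions: the `ALLOWED` set below is a made-up placeholder (the real list must be obtained from the API for the model in use), `find_invalid_graphemes` is a hypothetical helper, and each character is treated as one grapheme, which may not hold for every language.

```python
def find_invalid_graphemes(keyword: str, allowed: set[str]) -> list[str]:
    """Return characters of `keyword` not in `allowed`, in order, without duplicates."""
    invalid = []
    for ch in keyword:
        if ch not in allowed and ch not in invalid:
            invalid.append(ch)
    return invalid


# Placeholder only: replace with the grapheme list returned by the API.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")

print(find_invalid_graphemes("wi-fi 6", ALLOWED))  # prints ['-', '6']
```

Reporting the offending characters, rather than only a pass or fail result, makes it easier to see whether a keyword needs respelling (for example, writing a digit out as a word).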
Pronunciations
Keywords can be defined with or without an explicit pronunciation. If a pronunciation is not provided, the system will create a default pronunciation internally. The default pronunciation is either sourced from a dictionary (if the keyword exists in the dictionary) or generated automatically using a grapheme-to-phoneme mechanism for keywords not found in the dictionary.
It is important to note that the actual pronunciation of keywords in recordings may differ from what Keyword Spotting anticipates. This is especially true for product or brand names, domain-specific terms, misspelled words, or incorrectly pronounced foreign words. Therefore, it is highly recommended to explicitly specify the pronunciation (or multiple pronunciation variants) for keywords to ensure accurate detection.
The simplest approach to defining a pronunciation is to use the automatically generated pronunciation as a starting point and modify it as needed.
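The fallback order described above (dictionary first, grapheme-to-phoneme generation second) can be sketched as follows. Everything here is illustrative: the toy dictionary, the phoneme strings, and the naive `toy_g2p` function are assumptions and do not reflect how the actual technology generates pronunciations.

```python
from typing import Callable


def default_pronunciation(
    keyword: str,
    dictionary: dict[str, str],
    g2p: Callable[[str], str],
) -> str:
    """Use the dictionary entry if the keyword is known, otherwise fall back to G2P."""
    entry = dictionary.get(keyword)
    if entry is not None:
        return entry
    return g2p(keyword)


def toy_g2p(keyword: str) -> str:
    """Naive stand-in for a real G2P model: one symbol per letter."""
    return " ".join(ch for ch in keyword if not ch.isspace())


# Toy dictionary with a made-up phoneme string.
DICTIONARY = {"hello": "h e l o u"}

print(default_pronunciation("hello", DICTIONARY, toy_g2p))  # dictionary hit
print(default_pronunciation("zyxo", DICTIONARY, toy_g2p))   # generated fallback
```

A generated result like the second one is the kind of starting point that would then be edited by hand to match how the word is actually spoken.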
Scoring
Keyword Spotting works by calculating two likelihoods for a given audio segment, that it contains the keyword and that it contains any other speech, and then comparing them.
The following diagram shows a background model (1) of any speech in front of the keyword, the keyword model (2), and a background model (3) of any speech parallel with the keyword model. Models 2 and 3 produce two likelihoods, $L_{kw}$ and $L_{bg}$.
Raw score is calculated as a log-likelihood ratio:
$$\mathrm{score} = \log \frac{L_{kw}}{L_{bg}}$$
Confidence is calculated from the raw score using a sigmoid function:
$$\mathrm{confidence} = \frac{1}{1 + e^{-\mathrm{sharpness} \cdot \mathrm{score}}}$$
where sharpness specifies how the dynamic range of score is used (default is 1).
It is important to understand how score relates to confidence: the sharpness value controls the steepness of the sigmoid function that maps one to the other.
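To get a feel for how sharpness changes the mapping, the relationship can be tabulated in a few lines of code. This sketch assumes the standard logistic form, confidence = 1 / (1 + exp(-sharpness * score)), with the raw score taken as the difference of two log-likelihoods. The exact formula used by the technology is not confirmed here, and the function names are hypothetical.

```python
import math


def raw_score(log_l_kw: float, log_l_bg: float) -> float:
    """Log-likelihood ratio, computed from log-likelihoods of models 2 and 3."""
    return log_l_kw - log_l_bg


def confidence(score: float, sharpness: float = 1.0) -> float:
    """Assumed logistic mapping from raw score to a value in (0, 1)."""
    x = sharpness * score
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


for sharpness in (0.5, 1.0, 2.0):
    row = [round(confidence(s, sharpness), 3) for s in (-4, -2, 0, 2, 4)]
    print(f"sharpness={sharpness}: {row}")
```

Under this assumed form, a score of 0 always maps to a confidence of 0.5, and a larger sharpness pushes scores on either side of 0 toward 0 or 1 more quickly.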
(Q) What languages does the technology support?
A: Keyword Spotting supports 26 languages. See supported languages for more information.
(Q) Can one keyword have multiple pronunciations?
A: Yes. Defining multiple pronunciation variants for a keyword can reduce the number of missed detections.
(Q) What is the typical accuracy of the system?
A: The accuracy of the system is highly dependent on the technology model, keyword list and the audio data itself.
(Q) What is the typical processing speed of the system?
A: The processing speed is approximately 50-100 times faster than real time per core on modern CPUs. However, this speed can vary depending on the hardware, the amount of speech present in the audio, and the technology model.
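For capacity planning, the figures above translate into a simple back-of-the-envelope estimate. The sketch below assumes that throughput scales linearly with the number of cores, which is an idealization; the function name and the example workload are hypothetical.

```python
def estimate_processing_seconds(
    audio_seconds: float, speed_factor: float, cores: int = 1
) -> float:
    """Estimate wall-clock time, assuming linear scaling across cores."""
    if speed_factor <= 0 or cores < 1:
        raise ValueError("speed_factor must be positive and cores at least 1")
    return audio_seconds / (speed_factor * cores)


# Example workload: 1000 hours of audio on 8 cores.
audio_seconds = 1000 * 3600
slow = estimate_processing_seconds(audio_seconds, speed_factor=50, cores=8)
fast = estimate_processing_seconds(audio_seconds, speed_factor=100, cores=8)
print(f"between {fast / 3600:.2f} and {slow / 3600:.2f} hours")
# prints: between 1.25 and 2.50 hours
```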