Version: 3.4.0

Emotion Recognition

Emotion Recognition is a technology for detecting and interpreting human emotions through voice and speech patterns. By analyzing acoustic features of the audio signal, such as tone, pitch, volume, and rhythm, the system aims to determine the emotional state of the speaker.

Typical areas of use

  • Customer Service: the technology is used in call centers to monitor customer satisfaction by detecting frustration, stress, or dissatisfaction from vocal cues.
  • Mental Health Monitoring: it is integrated into mental health applications to detect early signs of depression, anxiety, or emotional distress.
  • Virtual Assistants: digital assistants use it to provide more emotionally intelligent responses and engage users more effectively.
  • Market Research: companies leverage it in market studies to gauge emotional responses during product tests or advertisements.

How does it work?

Emotion Recognition technology consists of the following steps (a short code sketch follows the list):

  • Filtering out segments of silence and technical signals (Voice Activity Detection).
  • Recognizing emotions with a neural network.
  • Returning a score for each of the four emotions (angry, sad, happy, neutral).
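
As a rough illustration, the steps above can be expressed in Python. The `vad` and `model` objects and their methods are hypothetical placeholders for real VAD and neural-network components, not part of any actual SDK:

```python
# Illustrative sketch of the three-step pipeline; `vad` and `model`
# are hypothetical placeholders, not a real API.
def recognize_emotion(audio, vad, model):
    """Return a score for each of the four supported emotions."""
    # Step 1: Voice Activity Detection -- keep only speech,
    # dropping silence and technical signals.
    speech = [segment for segment in vad.split(audio) if segment.is_speech]

    # Step 2: run the neural network on the speech segments.
    scores = model.predict(speech)

    # Step 3: return one score per emotion.
    return {emotion: scores[emotion] for emotion in ("angry", "sad", "happy", "neutral")}
```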

Scoring

Accuracy

The most common metric for evaluating the performance of Emotion Recognition is accuracy. Accuracy is the ratio of correct predictions (both positive and negative) to the total number of predictions.

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} \times 100$$
Note: While useful for general classification problems, accuracy can be misleading in emotion recognition, especially when the dataset is imbalanced (e.g., some emotions, such as neutral, may appear far more frequently than others).
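
For illustration, accuracy can be computed from predicted and reference labels in plain Python (the label lists below are made-up examples):

```python
# Accuracy: the share of predictions that match the reference labels.
def accuracy(predicted, reference):
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference) * 100

predicted = ["neutral", "happy", "sad", "neutral", "angry"]
reference = ["neutral", "happy", "happy", "neutral", "sad"]
print(accuracy(predicted, reference))  # 60.0 (3 of 5 correct)
```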

Precision, Recall, and F1-Score

These metrics are commonly used to assess how well the model performs on specific emotion classes. Precision and recall are especially useful on imbalanced datasets, as they show how well the model performs on minority classes (e.g., rare emotions). The F1-Score offers a balanced metric when both precision and recall matter. See the Precision_and_recall wiki page for more information.
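
The per-class computation can be sketched in plain Python; `predicted` and `reference` are label lists like those in the accuracy example above:

```python
# Precision, recall, and F1 for a single emotion class, computed
# from true positives (tp), false positives (fp), and false
# negatives (fn); the same formulas apply to every class.
def class_metrics(predicted, reference, cls):
    pairs = list(zip(predicted, reference))
    tp = sum(p == cls and r == cls for p, r in pairs)
    fp = sum(p == cls and r != cls for p, r in pairs)
    fn = sum(p != cls and r == cls for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```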

Example measurements

Below are example measurements performed on our Czech dataset.

Class      Precision   Recall   F1
Neutral    0.8016      0.8211   0.8112
Happy      0.8947      0.8718   0.8831
Sad        0.6944      0.5435   0.6098
Angry      0.4815      0.6842   0.5652
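
A single summary number that weights all four classes equally is the macro-averaged F1, i.e., the unweighted mean of the per-class F1 scores from the table:

```python
# Macro-averaged F1 over the four classes from the table above.
f1 = {"neutral": 0.8112, "happy": 0.8831, "sad": 0.6098, "angry": 0.5652}
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 4))  # 0.7173
```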

FAQ

Q: What languages does the technology support?
A: The technology is fundamentally multilingual, but the best results are obtained for English and Czech.

Q: What if the emotion changes during the recording?
A: The technology returns the average emotion across the whole recording.

Q: What if the recording contains multiple speakers with different emotions?
A: The result will be the average emotion across all speakers. However, it is possible to run diarization beforehand to split the recording by speaker and score each speaker separately, as sketched below.
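
A minimal sketch of that workflow, assuming hypothetical `diarize`, `concat`, and `recognize_emotion` helpers (none of these are a real API):

```python
# Hypothetical workflow: diarize first, then score emotions per
# speaker instead of averaging over the whole recording.
def emotions_per_speaker(audio, diarize, concat, recognize_emotion):
    by_speaker = {}
    for speaker, segment in diarize(audio):
        by_speaker.setdefault(speaker, []).append(segment)
    return {
        speaker: recognize_emotion(concat(segments))
        for speaker, segments in by_speaker.items()
    }
```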

Q: What is the typical accuracy of the system?
A: Accuracy depends heavily on the dataset used. You can find example measurements in the Example measurements section above.

Q: What is the typical processing speed of the system?
A: Processing is approximately 5 times faster than real time per core on modern CPUs and approximately 300 times faster than real time on a GPU. The exact speed varies with the hardware and the amount of speech present in the audio.
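
As a rough back-of-the-envelope illustration (the 60-minute duration below is a made-up example), "N times faster than real time" means the processing time is the audio duration divided by N:

```python
# Estimated processing time from the quoted speed factors.
audio_minutes = 60
print(audio_minutes / 5)    # 12.0 minutes on a single CPU core
print(audio_minutes / 300)  # 0.2 minutes (12 seconds) on a GPU
```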