Logo Logo
Hilfe
Kontakt
Switch language to English
Emotion-aware voice interfaces based on speech signal processing
Emotion-aware voice interfaces based on speech signal processing
Voice interfaces (VIs) will become increasingly widespread in current daily lives as AI techniques progress. VIs can be incorporated into smart devices like smartphones, as well as integrated into autos, home automation systems, computer operating systems, and home appliances, among other things. Current speech interfaces, however, are unaware of users’ emotional states and hence cannot support real communication. To overcome these limitations, it is necessary to implement emotional awareness in future VIs. This thesis focuses on how speech signal processing (SSP) and speech emotion recognition (SER) can enable VIs to gain emotional awareness. Following an explanation of what emotion is and how neural networks are implemented, this thesis presents the results of several user studies and surveys. Emotions are complicated, and they are typically characterized using category and dimensional models. They can be expressed verbally or nonverbally. Although existing voice interfaces are unaware of users’ emotional states and cannot support natural conversations, it is possible to perceive users’ emotions by speech based on SSP in future VIs. One section of this thesis, based on SSP, investigates mental restorative effects on humans and their measures from speech signals. SSP is less intrusive and more accessible than traditional measures such as attention scales or response tests, and it can provide a reliable assessment for attention and mental restoration. SSP can be implemented into future VIs and utilized in future HCI user research. The thesis then moves on to present a novel attention neural network based on sparse correlation features. The detection accuracy of emotions in the continuous speech was demonstrated in a user study utilizing recordings from a real classroom. In this section, a promising result will be shown. In SER research, it is unknown if existing emotion detection methods detect acted emotions or the genuine emotion of the speaker. Another section of this thesis is concerned with humans’ ability to act on their emotions. In a user study, participants were instructed to imitate five fundamental emotions. The results revealed that they struggled with this task; nevertheless, certain emotions were easier to replicate than others. A further study concern is how VIs should respond to users’ emotions if SER techniques are implemented in VIs and can recognize users’ emotions. The thesis includes research on ways for dealing with the emotions of users. In a user study, users were instructed to make sad, angry, and terrified VI avatars happy and were asked if they would like to be treated the same way if the situation were reversed. According to the results, the majority of participants tended to respond to these unpleasant emotions with neutral emotion, but there is a difference among genders in emotion selection. For a human-centered design approach, it is important to understand what the users’ preferences for future VIs are. In three distinct cultures, a questionnaire-based survey on users’ attitudes and preferences for emotion-aware VIs was conducted. It was discovered that there are almost no gender differences. Cluster analysis found that there are three fundamental user types that exist in all cultures: Enthusiasts, Pragmatists, and Sceptics. As a result, future VI development should consider diverse sorts of consumers. In conclusion, future VIs systems should be designed for various sorts of users as well as be able to detect the users’ disguised or actual emotions using SER and SSP technologies. Furthermore, many other applications, such as restorative effects assessments, can be included in the VIs system.
future emotion-aware VIs, speech emotion recognition, restorative effects measurements, mimic emotions, emotion reactions
Ma, Yong
2022
Englisch
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Ma, Yong (2022): Emotion-aware voice interfaces based on speech signal processing. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik
[thumbnail of Ma_Yong.pdf] PDF
Ma_Yong.pdf

8MB

Abstract

Voice interfaces (VIs) will become increasingly widespread in current daily lives as AI techniques progress. VIs can be incorporated into smart devices like smartphones, as well as integrated into autos, home automation systems, computer operating systems, and home appliances, among other things. Current speech interfaces, however, are unaware of users’ emotional states and hence cannot support real communication. To overcome these limitations, it is necessary to implement emotional awareness in future VIs. This thesis focuses on how speech signal processing (SSP) and speech emotion recognition (SER) can enable VIs to gain emotional awareness. Following an explanation of what emotion is and how neural networks are implemented, this thesis presents the results of several user studies and surveys. Emotions are complicated, and they are typically characterized using category and dimensional models. They can be expressed verbally or nonverbally. Although existing voice interfaces are unaware of users’ emotional states and cannot support natural conversations, it is possible to perceive users’ emotions by speech based on SSP in future VIs. One section of this thesis, based on SSP, investigates mental restorative effects on humans and their measures from speech signals. SSP is less intrusive and more accessible than traditional measures such as attention scales or response tests, and it can provide a reliable assessment for attention and mental restoration. SSP can be implemented into future VIs and utilized in future HCI user research. The thesis then moves on to present a novel attention neural network based on sparse correlation features. The detection accuracy of emotions in the continuous speech was demonstrated in a user study utilizing recordings from a real classroom. In this section, a promising result will be shown. In SER research, it is unknown if existing emotion detection methods detect acted emotions or the genuine emotion of the speaker. Another section of this thesis is concerned with humans’ ability to act on their emotions. In a user study, participants were instructed to imitate five fundamental emotions. The results revealed that they struggled with this task; nevertheless, certain emotions were easier to replicate than others. A further study concern is how VIs should respond to users’ emotions if SER techniques are implemented in VIs and can recognize users’ emotions. The thesis includes research on ways for dealing with the emotions of users. In a user study, users were instructed to make sad, angry, and terrified VI avatars happy and were asked if they would like to be treated the same way if the situation were reversed. According to the results, the majority of participants tended to respond to these unpleasant emotions with neutral emotion, but there is a difference among genders in emotion selection. For a human-centered design approach, it is important to understand what the users’ preferences for future VIs are. In three distinct cultures, a questionnaire-based survey on users’ attitudes and preferences for emotion-aware VIs was conducted. It was discovered that there are almost no gender differences. Cluster analysis found that there are three fundamental user types that exist in all cultures: Enthusiasts, Pragmatists, and Sceptics. As a result, future VI development should consider diverse sorts of consumers. In conclusion, future VIs systems should be designed for various sorts of users as well as be able to detect the users’ disguised or actual emotions using SER and SSP technologies. Furthermore, many other applications, such as restorative effects assessments, can be included in the VIs system.