A Robust Pitch-Fusion Model for Speech Emotion Recognition in Tonal Languages
Published in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Speech Emotion Recognition (SER) is an essential task in spoken language processing, applicable across various domains. While research on SER systems for English datasets is growing rapidly, the reliability of these models for tonal languages remains a significant concern. Therefore, this paper introduces Pitch-Fusion, a novel SER model tailored for tonal languages. The model enhances tonal SER performance by leveraging pitch features in speech segments. Pitch-Fusion integrates a pitch encoder module with efficient cross-attention and self-attention mechanisms to align pitch features with the contextual acoustic features from a speech representation model such as Wav2Vec 2.0. Experimental results reveal that our proposed model consistently outperforms Wav2Vec 2.0 on tonal datasets, yielding an absolute improvement in weighted accuracy ranging from 9.53% to 21.11%. Furthermore, addressing the lack of public Vietnamese datasets, we introduce ViSEC, the first openly available resource for Vietnamese SER. The dataset is collected from public media, focusing on sources with rich and balanced emotional content. Built through a proposed construction pipeline, ViSEC covers four emotional categories (angry, sad, neutral, and happy) and comprises 5,280 utterances from 147 Vietnamese speakers.
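To make the described fusion concrete, the following is a minimal PyTorch sketch of the idea stated in the abstract: a pitch encoder projects a per-frame F0 contour into the acoustic feature space, acoustic features attend to the encoded pitch via cross-attention, and a self-attention layer refines the fused sequence before emotion classification. All layer sizes, module choices, and the pooling step are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PitchFusionSketch(nn.Module):
    """Hypothetical sketch of the Pitch-Fusion idea (not the authors' code):
    pitch encoder -> cross-attention with acoustic features -> self-attention
    -> utterance-level emotion logits."""
    def __init__(self, d_model=768, n_heads=8, n_emotions=4):
        super().__init__()
        # Pitch encoder: map a scalar per-frame F0 value into the model dimension.
        self.pitch_encoder = nn.Sequential(
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Cross-attention: acoustic features (queries) attend to pitch features (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Self-attention over the fused sequence.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Four emotion classes, as in ViSEC: angry, sad, neutral, happy.
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, acoustic, pitch):
        # acoustic: (B, T, d_model), e.g. Wav2Vec 2.0 hidden states
        # pitch:    (B, T, 1), per-frame F0 contour
        p = self.pitch_encoder(pitch)
        fused, _ = self.cross_attn(acoustic, p, p)
        fused, _ = self.self_attn(fused, fused, fused)
        # Mean-pool over time for an utterance-level prediction (an assumption).
        return self.classifier(fused.mean(dim=1))  # (B, n_emotions) logits

model = PitchFusionSketch()
logits = model(torch.randn(2, 50, 768), torch.randn(2, 50, 1))
print(logits.shape)  # torch.Size([2, 4])
```

The 768-dimensional feature size matches the Wav2Vec 2.0 base model's hidden states; the cross-attention direction (acoustic as query, pitch as key/value) is one plausible reading of "aligning pitch features with contextual acoustic features."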
