Results 1 - 6 of 6
1.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10745-10759, 2023 09.
Article in English | MEDLINE | ID: mdl-37015129

ABSTRACT

Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust compared to a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on-par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best performing model to the community.
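
The reported metric, the concordance correlation coefficient, rewards correlation between predictions and labels while penalising shifts in their means and scales. A minimal sketch of the standard definition (not the authors' released code), in Python with NumPy:

    import numpy as np

    def ccc(x: np.ndarray, y: np.ndarray) -> float:
        """Concordance correlation coefficient between predictions x and labels y."""
        x_mean, y_mean = x.mean(), y.mean()
        covariance = np.mean((x - x_mean) * (y - y_mean))  # population covariance
        return 2 * covariance / (x.var() + y.var() + (x_mean - y_mean) ** 2)

    # Example: perfect agreement yields 1.0
    labels = np.array([0.1, 0.4, 0.6, 0.9])
    print(ccc(labels, labels))  # -> 1.0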


Subject(s)
Algorithms , Speech , Emotions , Machine Learning
2.
Front Digit Health ; 4: 842301, 2022.
Article in English | MEDLINE | ID: mdl-35899034

ABSTRACT

Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects to effectively obtain relevant information from participants. Therefore, we performed a systematic review to provide a high-level overview of practices across various neurological disorders and highlight emerging trends. PRISMA-based literature searches were conducted through PubMed, Web of Science, and IEEE Xplore to identify publications in which original (i.e., newly recorded) datasets were collected. Disorders of interest included psychiatric disorders such as bipolar disorder, depression, and stress, neurodegenerative disorders such as amyotrophic lateral sclerosis, Alzheimer's disease, and Parkinson's disease, and speech impairments (aphasia, dysarthria, and dysphonia). Of the 43 retrieved studies, Parkinson's disease is represented most prominently, with 19 discovered datasets. Free speech and read speech tasks are most commonly used across disorders. Besides popular feature extraction toolkits, many studies utilise custom-built feature sets. Correlations of acoustic features with psychiatric and neurodegenerative disorders are presented. In terms of analysis, statistical testing of individual features for significance is commonly used, as are predictive modeling approaches, especially with support vector machines and a small number of artificial neural networks. An emerging trend, and a recommendation for future studies, is to collect data in everyday life to facilitate longitudinal data collection and to capture the behavior of participants more naturally. Another emerging trend is to record additional modalities beyond voice, which can potentially increase analytical performance.
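
As a hedged illustration of the most commonly reported predictive modeling approach, the sketch below cross-validates a support vector machine on a placeholder acoustic feature matrix; the data, feature count, and labels are invented for demonstration and do not come from any reviewed study:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 88))    # e.g., 88 acoustic features per recording
    y = rng.integers(0, 2, size=60)  # 0 = control, 1 = disorder (placeholder labels)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")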

3.
J Acoust Soc Am ; 149(6): 4377, 2021 06.
Article in English | MEDLINE | ID: mdl-34241490

ABSTRACT

COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the present study investigates voice acoustic correlates of a COVID-19 infection based on a comprehensive acoustic parameter set. We compare 88 acoustic features extracted from recordings of the vowels /i:/, /e:/, /u:/, /o:/, and /a:/ produced by 11 symptomatic COVID-19 positive and 11 COVID-19 negative German-speaking participants. We employ the Mann-Whitney U test and calculate effect sizes to identify features with prominent group differences. The mean voiced segment length and the number of voiced segments per second yield the most important differences across all vowels indicating discontinuities in the pulmonic airstream during phonation in COVID-19 positive participants. Group differences in front vowels are additionally reflected in fundamental frequency variation and the harmonics-to-noise ratio, group differences in back vowels in statistics of the Mel-frequency cepstral coefficients and the spectral slope. Our findings represent an important proof-of-concept contribution for a potential voice-based identification of individuals infected with COVID-19.
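
A minimal sketch of the statistical procedure above, assuming placeholder feature values for the two groups of 11 participants; the abstract does not state which effect-size measure was used, so the rank-biserial correlation is shown as one common choice:

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)
    positive = rng.normal(0.25, 0.05, size=11)  # e.g., mean voiced segment length (s)
    negative = rng.normal(0.30, 0.05, size=11)

    u_stat, p_value = mannwhitneyu(positive, negative, alternative="two-sided")
    # Rank-biserial correlation: r = 1 - 2U / (n1 * n2), in [-1, 1]
    effect_size = 1 - 2 * u_stat / (len(positive) * len(negative))
    print(f"U = {u_stat:.1f}, p = {p_value:.3f}, rank-biserial r = {effect_size:.2f}")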


Subject(s)
COVID-19 , Voice , Acoustics , Humans , Phonation , SARS-CoV-2 , Speech Acoustics , Voice Quality
4.
J Acoust Soc Am ; 142(4): 1805, 2017 10.
Article in English | MEDLINE | ID: mdl-29092548

ABSTRACT

There has been little research on the acoustic correlates of emotional expression in the singing voice. In this study, two pertinent questions are addressed: How does a singer's emotional interpretation of a musical piece affect acoustic parameters in the sung vocalizations? Are these patterns specific enough to allow statistical discrimination of the intended expressive targets? Eight professional opera singers were asked to sing the musical scale upwards and downwards (using meaningless content) to express different emotions, as if on stage. The studio recordings were acoustically analyzed with a standard set of parameters. The results show robust vocal signatures for the emotions studied. Overall, there is a major contrast between sadness and tenderness on the one hand, and anger, joy, and pride on the other. This contrast rests on low vs. high levels of loudness and vocal dynamics, together with greater perturbation variation and a tendency toward more low-frequency energy. The pattern can be explained by the high power and arousal characteristics of the emotions that score high on these components. A multiple discriminant analysis yields classification accuracy greatly exceeding chance level, confirming the reliability of the acoustic patterns.
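
A sketch of a multiple discriminant analysis over acoustic parameters, as named above; scikit-learn's linear discriminant analysis and the placeholder data are assumptions for illustration, not the study's analysis code:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 20))    # 20 acoustic parameters per vocalization
    y = rng.integers(0, 5, size=80)  # 5 emotions: sadness, tenderness, anger, joy, pride

    lda = LinearDiscriminantAnalysis()
    accuracy = cross_val_score(lda, X, y, cv=5).mean()  # cross-validated accuracy
    print(f"Accuracy: {accuracy:.2f} vs. chance level {1/5:.2f}")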


Subject(s)
Acoustics , Emotions , Singing , Female , Humans , Male , Multivariate Analysis , Sound Spectrography , Voice
5.
Front Psychol ; 4: 292, 2013.
Article in English | MEDLINE | ID: mdl-23750144

ABSTRACT

Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning each of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow's pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that, by selection of appropriate descriptors, cross-domain arousal and valence regression is feasible, achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.
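
A sketch of the cross-domain protocol described above, training a regressor for arousal on one domain and testing on another with Pearson's correlation as the score; the ridge regressor and all arrays are placeholders rather than the authors' features or models:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(3)
    X_sound, y_sound = rng.normal(size=(100, 30)), rng.normal(size=100)  # training domain
    X_speech, y_speech = rng.normal(size=(50, 30)), rng.normal(size=50)  # testing domain

    model = Ridge(alpha=1.0).fit(X_sound, y_sound)      # train on sound
    r, _ = pearsonr(model.predict(X_speech), y_speech)  # test on enacted speech
    print(f"Cross-domain Pearson correlation: {r:.2f}")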

6.
PLoS One ; 8(12): e78506, 2013.
Article in English | MEDLINE | ID: mdl-24391704

ABSTRACT

Without doubt, general video and sound, as found in large multimedia archives, carry emotional information. Thus, audio and video retrieval by certain emotional categories or dimensions could play a central role for tomorrow's intelligent systems, enabling search for movies with a particular mood, computer-aided scene and sound design in order to elicit certain emotions in the audience, etc. Yet, the lion's share of research in affective computing focuses exclusively on signals conveyed by humans, such as affective speech. Uniting the fields of multimedia retrieval and affective computing is believed to open up a multiplicity of interesting retrieval applications, and at the same time to benefit affective computing research by moving its methodology "out of the lab" to real-world, diverse data. In this contribution, we address the problem of finding "disturbing" scenes in movies, a scenario that is highly relevant for computer-aided parental guidance. We apply large-scale segmental feature extraction combined with audio-visual classification to the particular task of detecting violence. Our system performs fully data-driven analysis, including automatic segmentation. We evaluate the system in terms of mean average precision (MAP) on the official data set of the MediaEval 2012 evaluation campaign's Affect Task, which consists of 18 original Hollywood movies, achieving up to .398 MAP on unseen test data in full realism. An in-depth analysis of the worth of individual features with respect to the target class and of the system errors reveals the importance of peak-related audio feature extraction and low-level histogram-based video analysis.
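
A sketch of how mean average precision might be computed for this task; the per-segment scores and labels are placeholders, and averaging average precision per movie is an assumption about the campaign's exact protocol:

    import numpy as np
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(4)
    per_movie_ap = []
    for _ in range(18):                        # 18 movies in the Affect Task set
        labels = rng.integers(0, 2, size=200)  # 1 = violent segment (placeholder)
        scores = rng.random(size=200)          # classifier confidence per segment
        per_movie_ap.append(average_precision_score(labels, scores))

    print(f"MAP: {np.mean(per_movie_ap):.3f}")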


Subject(s)
Artificial Intelligence , Motion Pictures , Violence , Algorithms , Databases, Factual , Emotions , Humans , Motion Pictures/statistics & numerical data , Multimedia