1.
R Soc Open Sci ; 10(12): 230574, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38126059

ABSTRACT

The relationship between music and emotion has been addressed within several disciplines, from more historico-philosophical and anthropological ones, such as musicology and ethnomusicology, to others that are traditionally more empirical and technological, such as psychology and computer science. Yet, understanding the link between music and emotion is limited by the scarce interconnections between these disciplines. Trying to narrow this gap, this data-driven exploratory study aims at assessing the relationship between linguistic, symbolic and acoustic features (extracted from lyrics, music notation and audio recordings) and the perception of emotion. Employing a listening experiment, statistical analysis and unsupervised machine learning, we investigate how a data-driven multi-modal approach can be used to explore the emotions conveyed by eight Bach chorales. Through a feature selection strategy based on a set of more than 300 Bach chorales and a transdisciplinary methodology integrating approaches from psychology, musicology and computer science, we aim to initiate an efficient dialogue between disciplines, one able to promote a more integrative and holistic understanding of emotions in music.
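As a rough illustration of the unsupervised step mentioned above, the sketch below clusters standardised multi-modal feature vectors with k-means and returns a cluster label per chorale, which could then be compared with listener ratings. The dummy feature matrix, the number of clusters, and the use of scikit-learn are assumptions for illustration, not the study's actual pipeline.

```python
# Hypothetical sketch: cluster multi-modal (lyric, symbolic, acoustic) feature
# vectors of chorales; feature extraction is assumed to have happened upstream.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 12))   # 8 chorales x 12 illustrative features

X = StandardScaler().fit_transform(features)              # common feature scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster membership per chorale, to relate to perceived emotion
```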

2.
Front Digit Health ; 5: 1196079, 2023.
Article in English | MEDLINE | ID: mdl-37767523

ABSTRACT

Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems into their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies; first and foremost in the field of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential symbol of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting for individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance with the ethical standards accorded to the field of medicine. Thus, we provide an overview and perspective of HEAR4Health: the sketch of a modern, ubiquitous sensing system that can bring computer audition on par with other AI technologies in the quest for improved healthcare systems.

3.
Front Digit Health ; 5: 1058163, 2023.
Article in English | MEDLINE | ID: mdl-36969956

ABSTRACT

The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field which has been explored is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough (CCS) and COVID-19 Speech (CSS).

4.
PLoS One ; 18(1): e0281079, 2023.
Article in English | MEDLINE | ID: mdl-36716307

ABSTRACT

This article contributes to a more adequate modelling of emotions encoded in speech, by addressing four fallacies prevalent in traditional affective computing: First, studies concentrate on few emotions and disregard all other ones ('closed world'). Second, studies use clean (lab) data or real-life ones but do not compare clean and noisy data in a comparable setting ('clean world'). Third, machine learning approaches need large amounts of data; however, their performance has not yet been assessed by systematically comparing different approaches and different sizes of databases ('small world'). Fourth, although human annotations of emotion constitute the basis for automatic classification, human perception and machine classification have not yet been compared on a strict basis ('one world'). Finally, we deal with the intrinsic ambiguities of emotions by interpreting the confusions between categories ('fuzzy world'). We use acted nonsense speech from the GEMEP corpus, emotional 'distractors' as categories not entailed in the test set, real-life noises that mask the clear recordings, and different sizes of the training set for machine learning. We show that machine learning based on state-of-the-art feature representations (wav2vec2) is able to mirror the main emotional categories ('pillars') present in perceptual emotional constellations even in degraded acoustic conditions.
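A minimal sketch of the kind of pipeline implied here: utterance-level wav2vec2 embeddings feeding a simple classifier. The checkpoint name, the mean pooling, and the logistic-regression back-end are assumptions for illustration; the abstract only states that wav2vec2 feature representations were used.

```python
# Sketch under assumptions: 16 kHz mono clips, "facebook/wav2vec2-base" as a
# stand-in for whichever wav2vec2 model the study actually employed.
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform: np.ndarray) -> np.ndarray:
    """Mean-pool the last hidden states into one utterance-level vector."""
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# X_train (list of waveforms) and y_train (emotion labels) are placeholders:
# clf = LogisticRegression(max_iter=1000).fit([embed(w) for w in X_train], y_train)
```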


Subject(s)
Speech Perception , Speech , Humans , Emotions , Machine Learning , Acoustics , Perception
5.
J Speech Lang Hear Res ; 65(12): 4623-4636, 2022 12 12.
Article in English | MEDLINE | ID: mdl-36417788

ABSTRACT

PURPOSE: The aim of this study was to investigate the speech prosody of postlingually deaf cochlear implant (CI) users compared with control speakers without hearing or speech impairment. METHOD: Speech recordings of 74 CI users (37 males and 37 females) and 72 age-balanced control speakers (36 males and 36 females) are considered. All participants are German native speakers and read Der Nordwind und die Sonne (The North Wind and the Sun), a standard text in pathological speech analysis and phonetic transcriptions. Automatic acoustic analysis is performed considering pitch, loudness, and duration features, including speech rate and rhythm. RESULTS: In general, duration and rhythm features differ between CI users and control speakers. CI users read slower and have a lower voiced segment ratio compared with control speakers. A lower voiced ratio goes along with a prolongation of the voiced segments' duration in male CI users and with a prolongation of pauses in female CI users. Rhythm features in CI users have higher variability in the duration of vowels and consonants than in control speakers. The use of bilateral CIs showed no advantages concerning speech prosody features in comparison to unilateral CI use. CONCLUSIONS: Even after cochlear implantation and rehabilitation, the speech of postlingually deaf adults deviates from the speech of control speakers, which might be due to changed auditory feedback. We suggest considering changes in temporal aspects of speech in future rehabilitation strategies. SUPPLEMENTAL MATERIAL: https://doi.org/10.23641/asha.21579171.
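As an illustration of one duration feature mentioned above, the sketch below estimates a voiced segment ratio from frame-level voicing decisions. librosa and the pYIN pitch tracker are stand-ins chosen for the example, not the toolchain the study reports, and the file path is a placeholder.

```python
# Rough sketch (not the study's exact pipeline): fraction of frames judged
# voiced in a recording of a read text.
import librosa
import numpy as np

def voiced_ratio(path: str, sr: int = 16000) -> float:
    y, sr = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    return float(np.mean(voiced_flag))   # share of voiced frames

# e.g. voiced_ratio("north_wind_and_sun.wav")  # hypothetical file name
```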


Subject(s)
Cochlear Implantation , Cochlear Implants , Deafness , Speech Perception , Adult , Male , Female , Humans , Deafness/rehabilitation , Hearing , Acoustics
6.
Article in English | MEDLINE | ID: mdl-35055816

ABSTRACT

Musical listening is broadly used as an inexpensive and safe method to reduce self-perceived anxiety. This strategy is based on the emotivist assumption claiming that emotions are not only recognised in music but induced by it. Yet, the acoustic properties of musical works capable of reducing anxiety are still under-researched. To fill this gap, we explore whether the acoustic parameters relevant in music emotion recognition are also suitable to identify music with relaxing properties. As an anxiety indicator, we take the positive statements from the six-item Spielberger State-Trait Anxiety Inventory, yielding a self-reported score ranging from 3 to 12. A user study with 50 participants assessing the relaxing potential of four musical pieces was conducted; subsequently, the acoustic parameters were evaluated. Our study shows that when using classical Western music to reduce self-perceived anxiety, tonal music should be considered. In addition, it also indicates that harmonicity is a suitable indicator of relaxing music, while the role of scoring and dynamics in reducing non-pathological listener distress should be further investigated.
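For illustration only, here is one crude way to approximate a harmonicity-like measure: the share of signal energy in the harmonic component after harmonic-percussive separation. The study's actual harmonicity parameter may be computed quite differently (e.g. from a standard acoustic feature set), so treat this proxy as an assumption.

```python
# Illustrative harmonicity proxy via harmonic-percussive source separation.
import librosa
import numpy as np

def harmonic_energy_ratio(path: str) -> float:
    y, _ = librosa.load(path, sr=None)
    y_harm, y_perc = librosa.effects.hpss(y)        # split harmonic vs. percussive
    e_harm, e_perc = np.sum(y_harm**2), np.sum(y_perc**2)
    return float(e_harm / (e_harm + e_perc))        # closer to 1.0 = more harmonic
```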


Subject(s)
Music , Acoustic Stimulation , Acoustics , Anxiety/prevention & control , Auditory Perception , Emotions , Humans , Music/psychology
7.
Pattern Recognit ; 122: 108361, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34629550

ABSTRACT

The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics due to its spread via physical contact and the regulations requiring the wearing of face masks. Given these constraints, voice biometrics can offer a suitable contact-less biometric solution; such systems can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of 71.8% Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly used two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by building ensembles of different models and attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of 80.1%. Moreover, we present the results of fusing the approaches, leading to a UAR of 82.6%. Finally, we present a smartphone app that can be used as a proof-of-concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.
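The second pattern mentioned above (spectrogram plus CNN) might look roughly like the following. The layer sizes, input shape, and framework choice are illustrative assumptions, not any participant's actual architecture.

```python
# Illustrative pattern only: a small CNN over log-mel spectrogram "images"
# for binary mask / no-mask classification.
import torch
import torch.nn as nn

class MaskCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # collapse time and frequency axes
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

logits = MaskCNN()(torch.randn(4, 1, 64, 100))   # dummy batch of spectrograms
```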

8.
Pattern Recognit ; 122: 108289, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34483372

ABSTRACT

The Coronavirus (COVID-19) pandemic impelled several research efforts, from collecting COVID-19 patients' data to screening them for virus detection. Some COVID-19 symptoms are related to the functioning of the respiratory system, which influences speech production; this motivates research on identifying markers of COVID-19 in speech and other human-generated audio signals. In this article, we give an overview of research on human audio signals using 'Artificial Intelligence' techniques to screen, diagnose, monitor, and spread awareness about COVID-19. This overview will be useful for developing automated systems that can help in the context of COVID-19, using non-obtrusive and easy-to-use bio-signals conveyed in human non-speech and speech audio productions.

9.
J Acoust Soc Am ; 149(6): 4377, 2021 06.
Article in English | MEDLINE | ID: mdl-34241490

ABSTRACT

COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous, with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the present study investigates voice acoustic correlates of a COVID-19 infection based on a comprehensive acoustic parameter set. We compare 88 acoustic features extracted from recordings of the vowels /i:/, /e:/, /u:/, /o:/, and /a:/ produced by 11 symptomatic COVID-19 positive and 11 COVID-19 negative German-speaking participants. We employ the Mann-Whitney U test and calculate effect sizes to identify features with prominent group differences. The mean voiced segment length and the number of voiced segments per second yield the most important differences across all vowels, indicating discontinuities in the pulmonic airstream during phonation in COVID-19 positive participants. Group differences in front vowels are additionally reflected in fundamental frequency variation and the harmonics-to-noise ratio, while group differences in back vowels are reflected in statistics of the Mel-frequency cepstral coefficients and the spectral slope. Our findings represent an important proof-of-concept contribution for a potential voice-based identification of individuals infected with COVID-19.
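A small sketch of the statistical comparison described above, run per feature and per vowel. The rank-biserial correlation is used as the effect size here as an assumption; the abstract does not name the exact measure.

```python
# Per-feature group comparison: Mann-Whitney U plus a rank-biserial effect size.
import numpy as np
from scipy.stats import mannwhitneyu

def compare_feature(pos: np.ndarray, neg: np.ndarray):
    """Compare one acoustic feature between COVID-19 positive and negative groups."""
    u, p = mannwhitneyu(pos, neg, alternative="two-sided")
    rank_biserial = 1.0 - 2.0 * u / (len(pos) * len(neg))
    return u, p, rank_biserial

# e.g. loop over the 88 features for each vowel, then rank by |effect size|:
# u, p, r = compare_feature(voiced_len_positive, voiced_len_negative)
```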


Subject(s)
COVID-19 , Voice , Acoustics , Humans , Phonation , SARS-CoV-2 , Speech Acoustics , Voice Quality
10.
Int J Speech Technol ; 23(1): 169-182, 2020.
Article in English | MEDLINE | ID: mdl-34867074

ABSTRACT

Most typically developed individuals have the ability to perceive emotions encoded in speech; yet, factors such as age or environmental conditions can restrict this inherent skill. Noise pollution and multimedia over-stimulation are common components of contemporary society, and have been shown to particularly impair a child's interpersonal skills. Assessing the influence of such factors on the perception of emotion over different developmental stages will advance child-related research. The presented work evaluates how background noise and emotionally connoted visual stimuli affect a child's perception of emotional speech. A total of 109 subjects from Spain and Germany (4-14 years) evaluated 20 multi-modal instances of nonsense emotional speech under several environmental and visual conditions. A control group of 17 Spanish adults performed the same perception test. Results suggest that visual stimulation, gender, and the two sub-cultures with different language backgrounds do not influence a child's perception; yet, background noise does compromise their ability to correctly identify emotion in speech, a phenomenon that seems to decrease with age.

11.
PLoS One ; 11(5): e0154486, 2016.
Article in English | MEDLINE | ID: mdl-27176486

ABSTRACT

We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender-balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech; the database is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing low-level acoustic features and by using higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% determination coefficient.
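A minimal sketch of the leave-one-speaker-out SVM evaluation described above, scored with unweighted average recall. The feature matrix, labels, and speaker ids below are random placeholders rather than iHEARu-EAT data, and the linear kernel is an assumption.

```python
# Sketch of a leave-one-speaker-out evaluation with an SVM and macro recall.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 20))            # utterances x acoustic features (dummy)
y = rng.integers(0, 2, size=90)          # eating vs. not eating (dummy labels)
speakers = rng.integers(0, 30, size=90)  # speaker id per utterance (dummy)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pred = cross_val_predict(clf, X, y, groups=speakers, cv=LeaveOneGroupOut())
uar = recall_score(y, pred, average="macro")   # unweighted average recall
print(f"UAR: {uar:.3f}")
```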


Subject(s)
Eating/physiology , Food , Hearing/physiology , Speech Recognition Software , Speech/physiology , Adult , Audiovisual Aids , Automation , Databases as Topic , Female , Humans , Male , Regression Analysis , Self Report , Support Vector Machine
12.
Logoped Phoniatr Vocol ; 36(4): 175-81, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21875389

ABSTRACT

Objective assessment of intelligibility on the telephone is desirable for voice and speech assessment and rehabilitation. A total of 82 patients after partial laryngectomy read a standardized text, which was synchronously recorded by a headset and via telephone. Five experienced raters assessed intelligibility perceptually on a five-point scale. Objective evaluation was performed by support vector regression on the word accuracy (WA) and word correctness (WR) of a speech recognition system, and a set of prosodic features. WA and WR alone exhibited correlations with human evaluation between |r| = 0.57 and |r| = 0.75. The correlation was r = 0.79 for headset and r = 0.86 for telephone recordings when prosodic features and WR were combined. The best feature subset was optimal for both signal qualities. It consists of WR, the average duration of the silent pauses before a word, the standard deviation of the fundamental frequency on the entire sample, the standard deviation of jitter, and the ratio of the durations of the voiced sections and the entire recording.
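A hedged sketch of the regression setup: support vector regression from WR plus prosodic features to the mean expert rating, scored by Pearson correlation. The dummy matrices and the 10-fold cross-validation scheme are assumptions for illustration, not the study's protocol.

```python
# Sketch: predict mean expert intelligibility ratings from WR + prosodic features.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
X = rng.normal(size=(82, 6))          # [WR, pause duration, F0 std, jitter std, ...] (dummy)
ratings = rng.uniform(1, 5, size=82)  # mean of the five experts' 5-point scores (dummy)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
pred = cross_val_predict(model, X, ratings, cv=10)
r, _ = pearsonr(ratings, pred)        # agreement with human evaluation
print(f"r = {r:.2f}")
```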


Subject(s)
Laryngectomy/adverse effects , Signal Processing, Computer-Assisted , Speech Acoustics , Speech Intelligibility , Speech Perception , Speech Recognition Software , Telephone , Voice Quality , Adult , Aged , Aged, 80 and over , Automation , Cluster Analysis , Female , Germany , Humans , Male , Markov Chains , Middle Aged , Sound Spectrography , Speech Production Measurement , Speech, Alaryngeal , Support Vector Machine , Time Factors
13.
Behav Res Methods ; 41(3): 795-804, 2009 Aug.
Article in English | MEDLINE | ID: mdl-19587194

ABSTRACT

This article describes a general framework for detecting sleepiness states on the basis of prosody, articulation, and speech-quality-related speech characteristics. The advantages of this automatic real-time approach are that obtaining speech data is nonobtrusive and free from sensor application and calibration efforts. Different types of acoustic features derived from speech, speaker, and emotion recognition were employed (frame-level speech features). Combining these features with high-level contour descriptors, which capture the temporal information of frame-level descriptor contours, results in 45,088 features per speech sample. In general, the measurement process follows the speech-adapted steps of pattern recognition: (1) recording speech, (2) preprocessing, (3) feature computation (using perceptual and signal-processing-related features such as fundamental frequency, intensity, pause patterns, formants, and cepstral coefficients), (4) dimensionality reduction, (5) classification, and (6) evaluation. After correlation-filter-based feature subset selection was applied to the feature space in order to find the most relevant features, different classification models were trained. The best model, namely the support vector machine, achieved 86.1% classification accuracy in predicting sleepiness in a sleep deprivation study (two-class problem, N=12; 01.00-08.00 a.m.).
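A simple approximation of the selection-then-classification pipeline described above keeps the features most correlated with the sleepiness label and trains an SVM on that subset. The data, feature count, and cut-off of 50 features below are placeholders, not the study's configuration.

```python
# Approximation of correlation-filter feature selection followed by an SVM.
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 500))       # samples x acoustic features (stand-in)
y = rng.integers(0, 2, size=120)      # sleepy vs. alert labels (stand-in)

# Rank features by absolute correlation with the label, keep the top 50.
corr = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])])
top = np.argsort(corr)[::-1][:50]

clf = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X[:, top], y)
```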


Subject(s)
Pattern Recognition, Automated/methods , Sleep Stages , Computer Simulation , Humans , Signal Processing, Computer-Assisted , Speech Recognition Software
14.
Eur Arch Otorhinolaryngol ; 264(11): 1315-21, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17571273

ABSTRACT

In comparison with laryngeal voice, substitute voice after laryngectomy is characterized by restricted aero-acoustic properties. Until now, no objective means of measuring prosodic differences between substitute and normal voices has existed. In a pilot study, we applied an automatic prosody analysis module to 18 speech samples of laryngectomees (age: 64.2 +/- 8.3 years) and 18 recordings of normal speakers of the same age (65.4 +/- 7.6 years). Ninety-five different features per word were measured, based upon the speech energy, fundamental frequency F(0) and duration measures on words, pauses and voiced/voiceless sections. These reflect aspects of loudness, pitch and articulation rate. Subjective evaluation of the 18 patients' voices was performed by a panel of five experts on the criteria "noise", "speech effort", "roughness", "intelligibility", "match of breath and sense units" and "overall quality". These ratings were compared to the automatically computed features. Several features were identified as being twice as high for the laryngectomees as for the normal speakers, or vice versa. Comparing the evaluation data of the human experts and the automatic rating, correlation coefficients of up to 0.84 were measured. The automatic analysis serves as a good means to objectify and quantify the global speech outcome of laryngectomees. Even better results are expected once both the computation of the features and the method of comparison with the human ratings have been revised and adapted to the special properties of the substitute voices.
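A small sketch of the group comparison described above: the per-feature ratio of group means between laryngectomees and age-matched normal speakers, flagging features that are roughly twice as high in either direction. The arrays are dummy stand-ins for the 95 word-level prosodic features.

```python
# Per-feature ratio of group means, laryngectomees vs. normal speakers.
import numpy as np

rng = np.random.default_rng(3)
feats_patients = rng.normal(loc=2.0, size=(18, 95))   # 18 speakers x 95 features (dummy)
feats_controls = rng.normal(loc=1.0, size=(18, 95))   # age-matched controls (dummy)

ratio = feats_patients.mean(axis=0) / feats_controls.mean(axis=0)
flagged = np.where((ratio >= 2.0) | (ratio <= 0.5))[0]   # "twice as high", either way
print(flagged)   # indices of features with large group differences
```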


Subject(s)
Electronic Data Processing , Speech, Alaryngeal , Speech, Esophageal , Tracheoesophageal Fistula , Voice Quality , Aged , Humans , Laryngectomy , Male , Middle Aged