1.
IEEE Trans Pattern Anal Mach Intell ; 46(2): 805-822, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37851557

ABSTRACT

Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware multimodal fusion approach that quantifies modality-wise aleatoric or data uncertainty towards emotion prediction. We propose a novel fusion framework, in which latent distributions over unimodal temporal context are learned by constraining their variance. These variance constraints, Calibration and Ordinal Ranking, are designed such that the variance estimated for a modality can represent how informative the temporal context of that modality is w.r.t. emotion recognition. When well-calibrated, modality-wise uncertainty scores indicate how much their corresponding predictions are likely to differ from the ground truth labels. Well-ranked uncertainty scores allow the ordinal ranking of different frames across different modalities. To jointly impose both these constraints, we propose a softmax distributional matching loss. Our evaluation on AVEC 2019 CES, CMU-MOSEI, and IEMOCAP datasets shows that the proposed multimodal fusion method not only improves the generalisation performance of emotion recognition models and their predictive uncertainty estimates, but also makes the models robust to novel noise patterns encountered at test time.
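To make the idea of modality-wise uncertainty more concrete, the following Python sketch shows a generic variance-weighted fusion of two modalities, where each encoder head predicts a mean and a log-variance and more certain modalities receive larger fusion weights. This is a simplified illustration; the module names, dimensions, and the plain Gaussian likelihood loss are assumptions and do not reproduce the paper's calibration, ordinal-ranking, or softmax distributional matching losses.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedFusion(nn.Module):
    """Illustrative sketch: fuse unimodal predictions by their estimated
    aleatoric (data) uncertainty, so that modalities with lower predicted
    variance receive higher fusion weights. Not the paper's exact method."""

    def __init__(self, audio_dim, video_dim, hidden=64):
        super().__init__()
        # Each head predicts a mean emotion score and a log-variance.
        self.audio_head = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.video_head = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, audio_feat, video_feat):
        mu_a, logvar_a = self.audio_head(audio_feat).chunk(2, dim=-1)
        mu_v, logvar_v = self.video_head(video_feat).chunk(2, dim=-1)
        # Precision-based weights: the more certain modality contributes more.
        weights = torch.softmax(torch.cat([-logvar_a, -logvar_v], dim=-1), dim=-1)
        fused = weights[..., :1] * mu_a + weights[..., 1:] * mu_v
        return fused, (logvar_a, logvar_v)

def gaussian_nll(mu, logvar, target):
    # Standard heteroscedastic regression loss that trains the variance heads.
    return 0.5 * (logvar + (target - mu) ** 2 / logvar.exp()).mean()
```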

2.
Article in English | MEDLINE | ID: mdl-38083138

ABSTRACT

In the presented work, we utilise a noisy dataset of clinical interviews with depression patients conducted over the telephone for the purpose of depression classification and automated detection of treatment response. Compared to most previous studies dealing with depression recognition from speech, our dataset does not include a healthy group of subjects who have never been diagnosed with depression. Furthermore, it contains measurements at different time points for individual subjects, making it suitable for machine learning-based detection of treatment response. In our experiments, we make use of an unsupervised feature quantisation and aggregation method, achieving 69.2% Unweighted Average Recall (UAR) when classifying whether patients are currently in remission or experiencing a major depressive episode (MDE). The performance of our model matches cutoff-based classification via Hamilton Rating Scale for Depression (HRSD) scores. Finally, we show that, using speech samples, we can detect response to treatment with a UAR of 68.1%.
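One common realisation of unsupervised feature quantisation and aggregation is a bag-of-audio-words pipeline: a k-means codebook is learnt over frame-level acoustic features and each recording is summarised as a histogram of codeword assignments. The sketch below illustrates that general pattern with scikit-learn; the feature arrays, codebook size, and classifier are hypothetical and not necessarily those used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def fit_codebook(frame_feats_per_rec, codebook_size=128, random_state=0):
    # Learn an unsupervised codebook over all training frames.
    return KMeans(n_clusters=codebook_size, random_state=random_state).fit(np.vstack(frame_feats_per_rec))

def aggregate(frame_feats_per_rec, codebook):
    # L1-normalised histogram of codeword assignments per recording.
    hists = [np.bincount(codebook.predict(f), minlength=codebook.n_clusters).astype(float)
             for f in frame_feats_per_rec]
    return normalize(np.array(hists), norm="l1")

# Hypothetical usage: train_feats / test_feats are lists of (n_frames, n_dims) arrays.
# codebook = fit_codebook(train_feats)
# clf = LinearSVC(C=1.0).fit(aggregate(train_feats, codebook), y_train)
# uar = recall_score(y_test, clf.predict(aggregate(test_feats, codebook)), average="macro")
```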


Subject(s)
Depressive Disorder, Major; Humans; Depressive Disorder, Major/diagnosis; Depressive Disorder, Major/therapy; Depression/diagnosis; Depression/therapy; Speech; Recognition, Psychology; Health Status
3.
Article in English | MEDLINE | ID: mdl-38083221

ABSTRACT

According to the WHO, approximately one in six individuals worldwide will develop some form of cancer in their lifetime. Therefore, accurate and early detection of lesions is crucial for improving the probability of successful treatment, reducing the need for more invasive treatments, and leading to higher rates of survival. In this work, we propose a novel R-CNN approach with pretraining and data augmentation for universal lesion detection. In particular, we incorporate an asymmetric 3D context fusion (A3D) for feature extraction from 2D CT images with Hybrid Task Cascade. By doing so, we supply the network with further spatial context, refining the mask prediction over several stages and making it easier to distinguish hard foregrounds from cluttered backgrounds. Moreover, we introduce a new video pretraining method for medical imaging by using consecutive frames from the YouTube VOS video segmentation dataset, which improves our model's sensitivity by 0.8 percentage points at a false positive rate of one false positive per image. Finally, we apply data augmentation techniques and analyse their impact on the overall performance of our models at various false positive rates. Using our introduced approach, it is possible to increase the A3D baseline's sensitivity by 1.04 percentage points in mFROC.
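The asymmetric 3D context fusion (A3D) idea relies on giving a 2D detector access to neighbouring CT slices as extra input channels. The minimal NumPy sketch below only illustrates that slice-stacking step on a hypothetical CT volume; the full detector (Hybrid Task Cascade with A3D feature extraction) is not reproduced here.

```python
import numpy as np

def stack_slice_context(volume, index, context=1):
    """Return the slice at `index` together with `context` neighbouring slices
    on each side, stacked along the channel axis. Edge slices are padded by
    repetition. `volume` has shape (num_slices, H, W)."""
    num_slices = volume.shape[0]
    picked = [volume[int(np.clip(index + offset, 0, num_slices - 1))]
              for offset in range(-context, context + 1)]
    return np.stack(picked, axis=0)  # shape: (2 * context + 1, H, W)

# Hypothetical usage with a toy volume of 10 slices of 512x512 voxels:
# ct = np.random.rand(10, 512, 512).astype(np.float32)
# x = stack_slice_context(ct, index=0, context=1)   # (3, 512, 512) network input
```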

4.
Patterns (N Y) ; 4(11): 100873, 2023 Nov 10.
Article in English | MEDLINE | ID: mdl-38035199

ABSTRACT

The monitoring of depressed mood plays an important role as a diagnostic tool in psychotherapy. An automated analysis of speech can provide a non-invasive measurement of a patient's affective state. While speech has been shown to be a useful biomarker for depression, existing approaches mostly build population-level models that aim to predict each individual's diagnosis as a (mostly) static property. Because of inter-individual differences in symptomatology and mood regulation behaviors, these approaches are ill-suited to detect smaller temporal variations in depressed mood. We address this issue by introducing a zero-shot personalization of large speech foundation models. Compared with other personalization strategies, our work does not require labeled speech samples for enrollment. Instead, the approach makes use of adapters conditioned on subject-specific metadata. On a longitudinal dataset, we show that the method improves performance compared with a set of suitable baselines. Finally, applying our personalization strategy improves individual-level fairness.
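A rough sketch of how a metadata-conditioned adapter for zero-shot personalisation might look: a small bottleneck module inserted into an otherwise frozen speech foundation model, whose feature-wise scale and shift are produced from an embedding of subject-specific metadata instead of labelled enrolment speech. All names, dimensions, and the FiLM-style conditioning are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MetadataConditionedAdapter(nn.Module):
    """Bottleneck adapter whose feature-wise scale and shift are generated
    from subject metadata (zero-shot: no labelled enrolment speech needed)."""

    def __init__(self, hidden_dim, metadata_dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        # A metadata embedding produces FiLM-style gamma and beta parameters.
        self.film = nn.Linear(metadata_dim, 2 * bottleneck)

    def forward(self, hidden_states, metadata):
        # hidden_states: (batch, time, hidden_dim); metadata: (batch, metadata_dim)
        gamma, beta = self.film(metadata).chunk(2, dim=-1)
        z = torch.relu(self.down(hidden_states))
        z = gamma.unsqueeze(1) * z + beta.unsqueeze(1)   # condition on the subject
        return hidden_states + self.up(z)                # residual connection
```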

5.
Front Digit Health ; 5: 1196079, 2023.
Article in English | MEDLINE | ID: mdl-37767523

ABSTRACT

Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems to their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies, first and foremost in the field of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential sign of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting for individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance with the ethical standards accorded to the field of medicine. Thus, we provide an overview and perspective of HEAR4Health: the sketch of a modern, ubiquitous sensing system that can bring computer audition on par with other AI technologies in the pursuit of improved healthcare systems.

6.
Front Digit Health ; 5: 1058163, 2023.
Article in English | MEDLINE | ID: mdl-36969956

ABSTRACT

The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field that has been explored is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough (CCS) and COVID-19 Speech (CSS).

7.
Annu Int Conf IEEE Eng Med Biol Soc ; 2022: 2623-2626, 2022 07.
Article in English | MEDLINE | ID: mdl-36086314

ABSTRACT

Although running is a common leisure activity and a core training regimen for several athletes, between 29% and 79% of runners sustain an overuse injury each year. These injuries are linked to excessive fatigue, which alters how someone runs. In this work, we explore the feasibility of modelling the Borg rating of perceived exertion (RPE) scale (range: 6-20), a well-validated subjective measure of fatigue, using audio data captured in realistic outdoor environments via smartphones attached to the runners' arms. Using convolutional neural networks (CNNs) on log-Mel spectrograms, we obtain a mean absolute error (MAE) of 2.35 in subject-dependent experiments, demonstrating that audio can be effectively used to model fatigue, while being more easily and non-invasively acquired than signals from other sensors.
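The input representation can be sketched with librosa: a log-Mel spectrogram is computed from each recording and then fed to a CNN regressor trained with an L1 objective, so that the reported MAE on the Borg scale is optimised directly. Sampling rate, number of Mel bands, and the regression head are illustrative assumptions, not the paper's settings.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_mels=64):
    """Load an audio file and return a log-scaled Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

# A small CNN regressor would then be trained on fixed-size chunks of this
# representation with an L1 loss to predict the RPE value.
```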


Subject(s)
Fatigue; Muscle Fatigue; Fatigue/diagnosis; Humans; Neural Networks, Computer
8.
iScience ; 25(8): 104644, 2022 Aug 19.
Article in English | MEDLINE | ID: mdl-35856034

ABSTRACT

In this article, human semen samples from the Visem dataset are automatically assessed with machine learning methods for their quality with respect to sperm motility. Several regression models are trained to automatically predict the percentage (0-100) of progressive, non-progressive, and immotile spermatozoa. The videos are used for unsupervised tracking and two different feature extraction methods, in particular custom movement statistics and displacement features. We train multiple neural networks and support vector regression models on the extracted features. Best results are achieved using a linear Support Vector Regressor with an aggregated and quantized representation of individual displacement features of each sperm cell. Compared to the best submission of the Medico Multimedia for Medicine challenge, which used the same dataset and splits, the mean absolute error (MAE) could be reduced from 8.83 to 7.31. We provide the source code for our experiments on GitHub (code available at: https://github.com/EIHW/motilitAI).
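Assuming sperm cells have already been tracked, displacement features can be sketched as simple per-cell statistics of frame-to-frame movement that are aggregated per video and passed to a support vector regressor. The statistics and model choice below are illustrative and do not reproduce the paper's exact aggregated and quantised representation.

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_absolute_error

def displacement_features(tracks):
    """`tracks` is a list of (num_frames, 2) arrays of per-cell positions.
    Returns a fixed-size feature vector of aggregated displacement statistics."""
    per_cell = []
    for track in tracks:
        if len(track) < 2:
            continue  # skip cells seen in a single frame only
        step = np.linalg.norm(np.diff(track, axis=0), axis=1)  # frame-to-frame displacement
        per_cell.append([step.mean(), step.std(), step.max(), step.sum()])
    per_cell = np.array(per_cell)
    return np.concatenate([per_cell.mean(axis=0), per_cell.std(axis=0)])

# Hypothetical usage: X is one feature vector per video, y the motility percentage.
# model = LinearSVR(C=1.0).fit(X_train, y_train)
# mae = mean_absolute_error(y_test, model.predict(X_test))
```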

9.
Front Artif Intell ; 5: 856232, 2022.
Article in English | MEDLINE | ID: mdl-35372830

ABSTRACT

Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilise them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image Convolutional Neural Networks (CNNs). The framework creates and augments Mel spectrogram plots on the fly from raw audio signals; these plots are then used to finetune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone. DeepSpectrumLite operates in a decentralised manner, eliminating the need to upload data for further processing. We demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing by obtaining state-of-the-art results on a set of paralinguistic and general audio tasks, including speech and music emotion recognition, social signal processing, COVID-19 cough and COVID-19 speech analysis, and snore sound classification. We provide an extensive command-line interface for users and developers, which is comprehensively documented and publicly available at https://github.com/DeepSpectrum/DeepSpectrumLite.
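The core transfer-learning step can be sketched with torchvision (assuming a recent version that exposes the weights enum): an ImageNet-pretrained DenseNet121 is given a new classification head and finetuned on Mel-spectrogram plot images. This only outlines the principle, not the DeepSpectrumLite implementation itself.

```python
import torch.nn as nn
from torchvision.models import densenet121, DenseNet121_Weights

def build_audio_classifier(num_classes):
    """ImageNet-pretrained DenseNet121 with a new classification head,
    to be finetuned on 3-channel Mel-spectrogram plot images."""
    model = densenet121(weights=DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

# Spectrogram plots rendered as RGB images (e.g. 224x224) would then be passed
# through the usual ImageNet normalisation and trained with cross-entropy.
```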

10.
Pattern Recognit ; 122: 108361, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34629550

ABSTRACT

The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics due to its spread via physical contact and the regulations on wearing face masks. Given these constraints, voice biometrics can offer a suitable contact-less biometric solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of 71.8% Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly used two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by building ensembles of different models and attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of 80.1%. Moreover, we present the results of fusing the approaches, leading to a UAR of 82.6%. Finally, we present a smartphone app that can be used as a proof-of-concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.
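Two details worth making concrete are the evaluation metric and the fusion step: UAR is simply the recall averaged over classes with equal weight, and a basic late fusion averages the class posteriors of several systems. The sketch below uses hypothetical prediction arrays; the challenge's actual fusion strategy may differ.

```python
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    # Unweighted Average Recall = recall averaged over classes with equal weight.
    return recall_score(y_true, y_pred, average="macro")

def late_fusion(list_of_posteriors):
    # Average the class posteriors of several systems and take the argmax.
    return np.mean(np.stack(list_of_posteriors, axis=0), axis=0).argmax(axis=1)

# Hypothetical usage for the binary mask/no-mask task:
# fused_pred = late_fusion([probs_model_a, probs_model_b])  # each (n_samples, 2)
# print(uar(y_test, fused_pred))
```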

11.
Trends Hear ; 25: 23312165211046135, 2021.
Article in English | MEDLINE | ID: mdl-34751066

ABSTRACT

Computer audition (i.e., intelligent audio) has made great strides in recent years; however, it is still far from achieving holistic hearing abilities, which more appropriately mimic human-like understanding. Within an audio scene, a human listener is quickly able to interpret layers of sound at a single time point, with each layer varying in characteristics such as location, state, and trait. Current integrated machine listening approaches, on the other hand, mainly recognise only single events. In this context, this contribution aims to provide key insights and approaches that can be applied in computer audition to achieve the goal of a more holistic intelligent understanding system, as well as to identify challenges in reaching this goal. We first summarise the state of the art in traditional signal-processing-based audio pre-processing and feature representation, as well as automated learning such as by deep neural networks. This concerns, in particular, audio interpretation, decomposition, and understanding, as well as ontologisation. We then present an agent-based approach for integrating these concepts as a holistic audio understanding system. Based on this, we conclude with avenues towards reaching the ambitious goal of 'holistic human-parity' machine listening abilities.


Subject(s)
Neural Networks, Computer; Signal Processing, Computer-Assisted; Humans; Intelligence; Learning; Sound
12.
J Acoust Soc Am ; 149(6): 4377, 2021 06.
Article in English | MEDLINE | ID: mdl-34241490

ABSTRACT

COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the present study investigates voice acoustic correlates of a COVID-19 infection based on a comprehensive acoustic parameter set. We compare 88 acoustic features extracted from recordings of the vowels /i:/, /e:/, /u:/, /o:/, and /a:/ produced by 11 symptomatic COVID-19 positive and 11 COVID-19 negative German-speaking participants. We employ the Mann-Whitney U test and calculate effect sizes to identify features with prominent group differences. The mean voiced segment length and the number of voiced segments per second yield the most important differences across all vowels, indicating discontinuities in the pulmonic airstream during phonation in COVID-19 positive participants. Group differences in front vowels are additionally reflected in fundamental frequency variation and the harmonics-to-noise ratio; group differences in back vowels, in statistics of the Mel-frequency cepstral coefficients and the spectral slope. Our findings represent an important proof-of-concept contribution for a potential voice-based identification of individuals infected with COVID-19.
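The statistical analysis can be outlined with SciPy: a two-sided Mann-Whitney U test per acoustic feature, accompanied by an effect size. The rank-biserial correlation shown here is one common choice for this test and is used only as an illustration; it is not necessarily the effect-size measure reported in the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_feature(positive_values, negative_values):
    """Two-sided Mann-Whitney U test for one acoustic feature, plus the
    rank-biserial correlation as an effect size (illustrative choice)."""
    u_stat, p_value = mannwhitneyu(positive_values, negative_values, alternative="two-sided")
    n1, n2 = len(positive_values), len(negative_values)
    rank_biserial = 1.0 - 2.0 * u_stat / (n1 * n2)
    return u_stat, p_value, rank_biserial

# Hypothetical usage, e.g. for the mean voiced segment length per group:
# u, p, r = compare_feature(np.array(covid_positive), np.array(covid_negative))
```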


Subject(s)
COVID-19; Voice; Acoustics; Humans; Phonation; SARS-CoV-2; Speech Acoustics; Voice Quality
13.
Front Robot AI ; 6: 116, 2019.
Article in English | MEDLINE | ID: mdl-33501131

ABSTRACT

During both positive and negative dyadic exchanges, individuals will often unconsciously imitate their partner. A substantial amount of research has been conducted on this phenomenon, and such studies have shown that synchronization between communication partners can improve interpersonal relationships. Automatic computational approaches for recognizing synchrony are still in their infancy. In this study, we extend previous work in which we applied a novel method utilizing hand-crafted low-level acoustic descriptors and autoencoders (AEs) to analyse synchrony in the speech domain. For this purpose, a database consisting of 394 in-the-wild speakers from six different cultures is used. For each speaker in the dyadic exchange, two AEs are implemented. After the training phase, the acoustic features of one speaker are tested using the AE trained on their dyadic partner. In the same way, we also explore the benefits that deep representations from audio may have, implementing the state-of-the-art Deep Spectrum toolkit. For all speakers at varied time points during their interaction, the reconstruction error from the AE trained on their respective dyadic partner is calculated. The results obtained from this acoustic analysis are then compared with the linguistic experiments based on word counts and word embeddings generated by our word2vec approach. The results demonstrate that there is a degree of synchrony during all interactions. We also find that this degree varies across the six cultures found in the investigated database. These findings are further substantiated through the use of 4,096-dimensional Deep Spectrum features.
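The reconstruction-error idea can be sketched compactly with scikit-learn, using a small MLP trained to reproduce its input as a stand-in for the paper's autoencoder architecture: the model is fitted on one speaker's frame-level acoustic descriptors, and its reconstruction error on the partner's descriptors is read as an inverse measure of acoustic synchrony. Feature inputs, network size, and scaling are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def synchrony_score(feats_speaker_a, feats_speaker_b, bottleneck=16, seed=0):
    """Fit a small autoencoder (an MLP trained to reproduce its input) on
    speaker A's frame-level acoustic descriptors, then return the mean
    reconstruction error on speaker B's descriptors: lower error suggests
    that B's acoustics stay closer to A's learnt feature space."""
    scaler = StandardScaler().fit(feats_speaker_a)
    xa = scaler.transform(feats_speaker_a)
    xb = scaler.transform(feats_speaker_b)
    ae = MLPRegressor(hidden_layer_sizes=(bottleneck,), max_iter=1000, random_state=seed)
    ae.fit(xa, xa)  # reconstruct the input through a bottleneck layer
    return float(np.mean((ae.predict(xb) - xb) ** 2))
```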

14.
Annu Int Conf IEEE Eng Med Biol Soc ; 2018: 4776-4779, 2018 Jul.
Article in English | MEDLINE | ID: mdl-30441416

ABSTRACT

Given the world-wide prevalence of heart disease, the robust and automatic detection of abnormal heart sounds could have profound effects on patient care and outcomes. In this regard, a comparison of conventional and state-of-the-art deep learning based computer audition paradigms for the audio classification task of normal, mild abnormalities, and moderate/severe abnormalities as present in phonocardiogram recordings is presented herein. In particular, we explore the suitability of deep feature representations as learnt by sequence-to-sequence autoencoders based on the auDeep toolkit. Key results, gained on the new Heart Sounds Shenzhen corpus, indicate that a fused combination of deep unsupervised features is well suited to the three-way classification problem, achieving our highest unweighted average recall of 47.9% on the test partition.


Subject(s)
Heart Sounds; Deep Learning; Humans
15.
Annu Int Conf IEEE Eng Med Biol Soc ; 2018: 413-416, 2018 Jul.
Article in English | MEDLINE | ID: mdl-30440421

ABSTRACT

Snoring is often associated with serious health risks such as obstructive sleep apnea and heart disease and may require targeted surgical interventions. In this regard, research into automatically and unobtrusively analysing the site of blockages that cause snore sounds is growing in popularity. Herein, we investigate the use of low-level image texture features in the classification of four specific types of snore sounds. Specifically, we explore histograms of local binary patterns (LBP) over a dense grid of rectangular regions and histograms of oriented gradients (HOG) extracted from colour spectrograms for snore sound characterisation. Support vector machines with homogeneous kernel mapping are used in the classification stage of the proposed method. Various experiments are carried out with both LBP and HOG descriptors on the INTERSPEECH ComParE 2017 snoring sub-challenge dataset. The results indicate that LBP descriptors are better than HOG descriptors for snore type detection, and that fusion of the LBP and HOG descriptors produces stronger results than either individual descriptor. Further, when compared to the challenge baseline and state-of-the-art deep spectrum features, our approach achieves relative percentage increases in unweighted average recall of 23.1% and 8.3%, respectively.
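An approximate sketch of the feature extraction and classification pipeline using scikit-image and scikit-learn: LBP and HOG descriptors are computed from a greyscale spectrogram image, and a linear SVM is trained on an additive chi-squared kernel approximation, which is one standard form of homogeneous kernel mapping. All parameter settings are illustrative, not those of the paper, and the dense-grid partitioning of the LBP histograms is simplified to a single global histogram.

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def lbp_hog_descriptor(spectrogram_image, lbp_points=8, lbp_radius=1):
    """Concatenate an LBP histogram and a HOG descriptor for one
    greyscale spectrogram image (2D array)."""
    img = spectrogram_image.astype(np.float64)
    img = (255 * (img - img.min()) / (np.ptp(img) + 1e-8)).astype(np.uint8)
    lbp = local_binary_pattern(img, lbp_points, lbp_radius, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=lbp_points + 2, range=(0, lbp_points + 2), density=True)
    hog_desc = hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return np.concatenate([lbp_hist, hog_desc])

# Homogeneous kernel mapping followed by a linear SVM, as a rough stand-in
# for the classifier used in the paper:
# clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
# clf.fit(X_train, y_train)
```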


Subject(s)
Pattern Recognition, Automated/methods; Snoring/classification; Snoring/diagnosis; Sound Spectrography; Humans; Sleep Apnea, Obstructive/physiopathology; Sound; Sound Spectrography/methods; Support Vector Machine
16.
Annu Int Conf IEEE Eng Med Biol Soc ; 2017: 3806-3809, 2017 Jul.
Article in English | MEDLINE | ID: mdl-29060727

ABSTRACT

A combination of passive, non-invasive, and non-intrusive smart monitoring technologies is currently transforming healthcare. These technologies will soon be able to provide immediate health-related feedback for a range of illnesses and conditions. Such tools would be game-changing for serious public health concerns, such as seasonal cold and flu, for which early diagnosis and social isolation play a key role in reducing the spread. In this regard, this paper explores, for the first time, the automated classification of individuals with Upper Respiratory Tract Infections (URTI) using recorded speech samples. Key results indicate that our classifiers can achieve results similar to those seen in related health-based detection tasks, indicating the promise of using computational paralinguistic analysis for the detection of URTI-related illnesses.


Subject(s)
Speech; Humans; Respiratory Tract Infections; Sound