Results 1 - 20 of 32
1.
J Acoust Soc Am ; 155(4): 2836-2848, 2024 Apr 01.
Article in English | MEDLINE | ID: mdl-38682915

ABSTRACT

This paper evaluates an innovative framework for spoken dialect density prediction on children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, a classifier is trained to predict the level of dialect density, providing a higher degree of specificity for downstream tasks. Several feature sets are evaluated as input to an XGBoost classifier, including self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, and prosodic features. The classifier is then trained to assign dialect density labels to short recorded utterances. High dialect-density-level classification accuracy is achieved for both child and adult speech, with robust performance across ages and regional varieties of dialect. Additionally, this work serves as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect.
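The pipeline the abstract describes can be sketched compactly. Below is a minimal, hypothetical illustration assuming utterance-level features have already been extracted and pooled; all shapes, names, and hyperparameters are placeholders, not the paper's configuration.

```python
# Hypothetical sketch of a dialect-density level classifier.
# Assumes pre-extracted utterance-level features: mean-pooled HuBERT
# representations, grammar-based counts from ASR transcripts, and prosodic
# statistics. Dimensions and labels below are illustrative only.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_utts = 500
hubert = rng.normal(size=(n_utts, 768))      # mean-pooled HuBERT layer output
grammar = rng.poisson(2, size=(n_utts, 20))  # dialect-feature counts per utterance
prosody = rng.normal(size=(n_utts, 10))      # e.g., f0 and duration statistics
X = np.hstack([hubert, grammar, prosody])
y = rng.integers(0, 4, size=n_utts)          # dialect density level: 0 (low) .. 3 (high)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```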


Subject(s)
Black or African American, Speech Acoustics, Humans, Adult, Child, Male, Female, Speech Production Measurement/methods, Language, Preschool Child, Young Adult, Speech Perception, Adolescent, Phonetics, Child Language
2.
J Acoust Soc Am ; 151(2): 1393, 2022 02.
Article in English | MEDLINE | ID: mdl-35232083

ABSTRACT

This study compares human speaker discrimination performance for read speech versus casual conversations and explores differences between unfamiliar voices that are "easy" versus "hard" to "tell together" versus "tell apart." Thirty listeners were asked whether pairs of short style-matched or -mismatched, text-independent utterances represented the same or different speakers. Listeners performed better when stimuli were style-matched, particularly in read speech-read speech trials (equal error rate, EER, of 6.96% versus 15.12% in conversation-conversation trials). In contrast, the EER was 20.68% for the style-mismatched condition. When styles were matched, listeners' confidence was higher when speakers were the same versus different; however, style variation caused decreases in listeners' confidence for the "same speaker" trials, suggesting a higher dependency of this task on within-speaker variability. The speakers who were "easy" or "hard" to "tell together" were not the same as those who were "easy" or "hard" to "tell apart." Analysis of speaker acoustic spaces suggested that the difference observed in human approaches to "same speaker" and "different speaker" tasks depends primarily on listeners' different perceptual strategies when dealing with within- versus between-speaker acoustic variability.
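The equal error rates reported above can be computed from same/different-speaker trial scores with the standard ROC-based recipe. A minimal sketch; the scores below are simulated stand-ins for listener responses.

```python
# EER: the operating point where the false positive rate equals the
# false negative rate (1 - TPR), found by root-finding on the ROC curve.
import numpy as np
from scipy.optimize import brentq
from scipy.interpolate import interp1d
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
same_scores = rng.normal(1.0, 1.0, 200)   # "same speaker" trials
diff_scores = rng.normal(-1.0, 1.0, 200)  # "different speaker" trials
labels = np.concatenate([np.ones(200), np.zeros(200)])
scores = np.concatenate([same_scores, diff_scores])

fpr, tpr, _ = roc_curve(labels, scores)
eer = brentq(lambda x: 1.0 - x - interp1d(fpr, tpr)(x), 0.0, 1.0)
print(f"EER = {eer:.2%}")
```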


Subject(s)
Speech Perception, Voice, Acoustics, Humans, Speech
3.
Pediatr Res ; 87(3): 576-580, 2020 02.
Article in English | MEDLINE | ID: mdl-31585457

ABSTRACT

BACKGROUND: To characterize acoustic features of an infant's cry and use machine learning to provide an objective measurement of behavioral state in a cry-translator. To apply the cry-translation algorithm to colic, hypothesizing that these cries sound painful. METHODS: Assessment of 1000 cries in a mobile app (ChatterBaby™). Training of a cry-translation algorithm by evaluating >6000 acoustic features to predict whether an infant's cry was due to pain (vaccinations, ear piercings), fussiness, or hunger. Using the algorithm to predict the behavioral state of infants with reported colic. RESULTS: The cry-translation algorithm was 90.7% accurate in identifying pain cries and achieved 71.5% accuracy in discriminating cries of fussiness, hunger, or pain. The ChatterBaby cry-translation algorithm overwhelmingly predicted that colic cries were most likely from pain, compared to fussy and hungry states. Colic cries had an average pain rating of 73%, significantly greater than the pain ratings for fussiness and hunger (p < 0.001, 2-sample t test). Colic cries outranked pain cries on measures of acoustic intensity, including energy, length of voiced periods, and fundamental frequency/pitch, while fussy and hungry cries showed reduced intensity measures compared to pain and colic. CONCLUSIONS: Acoustic features of cries are consistent across a diverse infant population and can be used as objective markers of pain, hunger, and fussiness. The ChatterBaby algorithm detected significant acoustic similarities between colic and painful cries, suggesting that they may share a neuronal pathway.


Subject(s)
Abdominal Pain/psychology, Acoustics, Colic/psychology, Crying, Infant Behavior, Machine Learning, Mobile Applications, Pain Perception, Computer-Assisted Signal Processing, Abdominal Pain/diagnosis, Colic/diagnosis, Female, Humans, Infant, Newborn, Male, Automated Pattern Recognition, Sound Spectrography
4.
J Acoust Soc Am ; 144(6): 3437, 2018 12.
Article in English | MEDLINE | ID: mdl-30599649

ABSTRACT

This paper presents an investigation of children's subglottal resonances (SGRs), the natural frequencies of the tracheo-bronchial acoustic system. A total of 43 children (31 male, 12 female) aged between 6 and 18 yr were recorded. Both microphone signals of various consonant-vowel-consonant words and subglottal accelerometer signals of the sustained vowel /ɑ/ were recorded for each of the children, along with age and standing height. The first three SGRs of each child were measured from the sustained vowel subglottal accelerometer signals. A model relating SGRs to standing height was developed based on the quarter-wavelength resonator model, previously developed for adult SGRs and heights. Based on difficulties in predicting the higher SGR values for the younger children, the model of the third SGR was refined to account for frequency-dependent acoustic lengths of the tracheo-bronchial system. This updated model more accurately estimates both adult and child SGRs based on their heights. These results indicate the importance of considering frequency-dependent acoustic lengths of the subglottal system.
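The quarter-wavelength resonator model underlying the height relation treats the tracheo-bronchial tree as a closed-open tube, so the k-th resonance is f_k = (2k - 1)c / 4L. A toy sketch follows, with an illustrative (not fitted) height-to-acoustic-length ratio; note how the fixed-length model overestimates the higher SGRs relative to the first, which is the mismatch the paper's frequency-dependent refinement addresses.

```python
# Quarter-wavelength resonator sketch relating SGRs to standing height.
C = 35000.0  # speed of sound in warm, humid air (cm/s)

def sgr(k: int, height_cm: float, length_ratio: float = 0.08) -> float:
    """k-th subglottal resonance (Hz) of a closed-open tube whose acoustic
    length is length_ratio * standing height; 0.08 is a placeholder ratio,
    not the paper's fitted constant."""
    L = length_ratio * height_cm
    return (2 * k - 1) * C / (4.0 * L)

for h in (120, 150, 180):  # child-to-adult standing heights (cm)
    print(h, [round(sgr(k, h)) for k in (1, 2, 3)])
# The fixed-length model forces f2 = 3*f1 and f3 = 5*f1; measured ratios are
# lower, motivating frequency-dependent acoustic lengths for higher SGRs.
```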

5.
J Acoust Soc Am ; 144(1): 375, 2018 07.
Article in English | MEDLINE | ID: mdl-30075658

ABSTRACT

Little is known about human and machine speaker discrimination ability when utterances are very short and the speaking style is variable. This study compares text-independent speaker discrimination ability of humans and machines based on utterances shorter than 2 s in two different speaking styles (read sentences and speech directed towards pets, characterized by exaggerated prosody). Recordings of 50 female speakers drawn from the UCLA Speaker Variability Database were used as stimuli. Performance of 65 human listeners was compared to i-vector-based automatic speaker verification systems using mel-frequency cepstral coefficients, voice quality features, which were inspired by a psychoacoustic model of voice perception, or their combination by score-level fusion. Humans always outperformed machines, except in the case of style-mismatched pairs from perceptually-marked speakers. Speaker representations by humans and machines were compared using multi-dimensional scaling (MDS). Canonical correlation analysis showed a weak correlation between machine and human MDS spaces. Multiple regression showed that means of voice quality features could represent the most important human MDS dimension well, but not the dimensions from machines. These results suggest that speaker representations by humans and machines are different, and machine performance might be improved by better understanding how different acoustic features relate to perceived speaker identity.
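Score-level fusion of the two systems can be as simple as a weighted sum of normalized per-trial scores. A minimal sketch, with an arbitrary weight in place of one tuned on development data.

```python
# Score-level fusion of two verification subsystems (e.g., MFCC-based and
# voice-quality-feature-based). Weight w is illustrative, not tuned.
import numpy as np

def znorm(s: np.ndarray) -> np.ndarray:
    return (s - s.mean()) / s.std()

def fuse(scores_a: np.ndarray, scores_b: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted sum of z-normalized per-trial verification scores."""
    return w * znorm(scores_a) + (1.0 - w) * znorm(scores_b)

rng = np.random.default_rng(2)
print(fuse(rng.normal(size=5), rng.normal(size=5), w=0.6))
```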


Subject(s)
Speech Acoustics, Speech Perception/physiology, Speech/physiology, Voice/physiology, Adolescent, Adult, Comprehension/physiology, Female, Humans, Language, Male, Voice Quality, Young Adult
6.
J Acoust Soc Am ; 141(4): EL420, 2017 04.
Article in English | MEDLINE | ID: mdl-28464674

ABSTRACT

This letter investigates the use of subglottal resonances (SGRs) for noise-robust speaker identification (SID). It is motivated by the speaker specificity and stationarity of subglottal acoustics, and the development of noise-robust SGR estimation algorithms which are reliable at low signal-to-noise ratios for large datasets. A two-stage framework is proposed which combines the SGRs with different cepstral features. The cepstral features are used in the first stage to reduce the number of target speakers for a test utterance, and then SGRs are used as complementary second-stage features to conduct identification. Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly (2%-6% absolute accuracy improvement) across all noise conditions in mismatched situations.
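The two-stage structure reduces to: shortlist speakers with cepstral scores, then decide within the shortlist using SGR scores. A schematic sketch with stand-in score arrays; the actual cepstral and SGR scoring models are omitted.

```python
# Two-stage speaker identification: cepstral pruning, then SGR decision.
import numpy as np

def two_stage_sid(cepstral_scores: np.ndarray, sgr_scores: np.ndarray,
                  n_shortlist: int = 5) -> int:
    """Per-enrolled-speaker match scores for one test utterance
    (higher = better). Stage 1 keeps the top-n cepstral candidates;
    stage 2 picks the best SGR score among them."""
    shortlist = np.argsort(cepstral_scores)[-n_shortlist:]
    return int(shortlist[np.argmax(sgr_scores[shortlist])])

rng = np.random.default_rng(3)
print(two_stage_sid(rng.normal(size=100), rng.normal(size=100)))
```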


Subject(s)
Acoustics, Glottis/physiology, Phonation, Speech Acoustics, Speech Production Measurement/methods, Voice Quality, Accelerometry, Algorithms, Automation, Female, Humans, Male, Computer-Assisted Signal Processing, Sound Spectrography, Time Factors
7.
J Acoust Soc Am ; 140(5): 3691, 2016 Nov.
Article in English | MEDLINE | ID: mdl-27908084

ABSTRACT

Automatic phrase detection systems for bird sounds are useful in several applications, as they reduce the need for manual annotations. However, bird phrase detection is challenging due to limited training data and background noise. Limited data occur because of limited recordings or the existence of rare phrases. Background noise interference occurs because of the intrinsic nature of the recording environment, such as wind or other animals. This paper presents a different approach to birdsong phrase classification using template-based techniques suitable even for limited training data and noisy environments. The algorithm utilizes dynamic time warping (DTW) and prominent (high-energy) time-frequency regions of training spectrograms to derive templates. The performance of the proposed algorithm is compared with traditional DTW and hidden Markov model (HMM) methods under several training and test conditions. DTW works well when data are limited, while HMMs do better when more data are available, yet both suffer when background noise is severe. The proposed algorithm outperforms DTW and HMMs in most training and testing conditions, usually by a high margin when the background noise level is high. The innovation of this work is that the proposed algorithm is robust to both limited training data and background noise.
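Dynamic time warping is the core comparison operation in the template approach. A minimal, self-contained DTW distance over spectrogram-frame sequences; template derivation from prominent time-frequency regions is omitted.

```python
# Classic DTW alignment cost between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (Ta, D), b: (Tb, D) frame sequences; returns cumulative
    alignment cost with Euclidean local distance."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[Ta, Tb])

rng = np.random.default_rng(4)
print(dtw_distance(rng.normal(size=(20, 12)), rng.normal(size=(25, 12))))
```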


Subject(s)
Animal Vocalization, Algorithms, Animals, Automation, Birds, Noise
8.
J Acoust Soc Am ; 137(3): 1069-80, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25786922

ABSTRACT

Annotation of phrases in birdsongs can be helpful to behavioral and population studies. To reduce the need for manual annotation, an automated birdsong phrase classification algorithm for limited data is developed. Limited data occur because of limited recordings or the existence of rare phrases. In this paper, classification of up to 81 phrase classes of Cassin's Vireo is performed using one to five training samples per class. The algorithm involves dynamic time warping (DTW) and two passes of sparse representation (SR) classification. DTW improves the similarity between training and test phrases from the same class in the presence of individual bird differences and phrase segmentation inconsistencies. The SR classifier works by finding a sparse linear combination of training feature vectors from all classes that best approximates the test feature vector. When the class decisions from DTW and the first pass SR classification are different, SR classification is repeated using training samples from these two conflicting classes. Compared to DTW, support vector machines, and an SR classifier without DTW, the proposed classifier achieves the highest classification accuracies of 94% and 89% on manually segmented and automatically segmented phrases, respectively, from unseen Cassin's Vireo individuals, using five training samples per class.
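The SR step can be sketched as follows: solve a sparse regression of the test vector against a dictionary of all training vectors, then pick the class whose coefficients best reconstruct it. Lasso here is a stand-in sparse solver, not necessarily the one the paper used.

```python
# Sparse representation (SR) classification by class-wise residuals.
import numpy as np
from sklearn.linear_model import Lasso

def sr_classify(A: np.ndarray, labels: np.ndarray, y: np.ndarray) -> int:
    """A: (D, N) dictionary of training feature vectors (columns),
    labels: (N,) class of each column, y: (D,) test vector."""
    x = Lasso(alpha=0.01, max_iter=10000).fit(A, y).coef_  # sparse coefficients
    residuals = {}
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)                 # keep class-c atoms only
        residuals[c] = np.linalg.norm(y - A @ xc)
    return int(min(residuals, key=residuals.get))

rng = np.random.default_rng(5)
A = rng.normal(size=(64, 30))
labels = np.repeat(np.arange(6), 5)                        # 6 classes, 5 samples each
print(sr_classify(A, labels, A[:, 7] + 0.05 * rng.normal(size=64)))  # expect class 1
```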


Subject(s)
Acoustics, Automated Pattern Recognition, Computer-Assisted Signal Processing, Songbirds/physiology, Animal Vocalization, Algorithms, Animals, Linear Models, Male, Songbirds/classification, Sound Spectrography, Species Specificity, Support Vector Machine, Time Factors, Animal Vocalization/classification
9.
J Acoust Soc Am ; 138(1): 1-10, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26233000

ABSTRACT

Models of the voice source differ in their fits to natural voices, but it is unclear which differences in fit are perceptually salient. This study examined the relationship between the fit of five voice source models to 40 natural voices, and the degree of perceptual match among stimuli synthesized with each of the modeled sources. Listeners completed a visual sort-and-rate task to compare versions of each voice created with the different source models, and the results were analyzed using multidimensional scaling. Neither fits to pulse shapes nor fits to landmark points on the pulses predicted observed differences in quality. Further, the source models fit the opening phase of the glottal pulses better than they fit the closing phase, but at the same time similarity in quality was better predicted by the timing and amplitude of the negative peak of the flow derivative (part of the closing phase) than by the timing and/or amplitude of peak glottal opening. Results indicate that simply knowing how (or how well) a particular source model fits or does not fit a target source pulse in the time domain provides little insight into what aspects of the voice source are important to listeners.


Subject(s)
Auditory Perception/physiology, Voice Quality/physiology, Acoustic Stimulation, Adolescent, Adult, Glottis/physiology, Humans, Middle Aged, Biological Models, Sound Localization/physiology, Sound Spectrography, Young Adult
10.
Comput Speech Lang ; 86, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38313320

ABSTRACT

Speech signals are valuable biomarkers for assessing an individual's mental health, including identifying Major Depressive Disorder (MDD) automatically. A frequently used approach is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could act as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using cross-entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, yielded improved F1 scores for MDD detection and improved voice-privacy attributes, as quantified by the Gain in Voice Distinctiveness (GVD) and the De-Identification score (DeID). On the DAIC-WOZ dataset (English), LECE using ComParE16 features achieves the best F1-score of 80%, the audio-only SOTA for depression detection, along with a GVD of -1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using the raw audio signal achieves an F1-score of 72.38%, surpassing the multi-modal SOTA, along with a GVD of -0.89 dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.
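The ADV variant is conceptually a gradient-reversal setup: the encoder minimizes the depression loss while maximizing the speaker-ID loss. A minimal PyTorch sketch with illustrative sizes and weights, not the paper's architectures.

```python
# Adversarial SID-loss maximization via a gradient reversal layer.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # flip gradient into the encoder

encoder = nn.Sequential(nn.Linear(40, 128), nn.ReLU())
dep_head = nn.Linear(128, 2)   # depressed / not depressed
sid_head = nn.Linear(128, 50)  # 50 training speakers (illustrative)
ce = nn.CrossEntropyLoss()

feats = torch.randn(8, 40)     # batch of acoustic features (illustrative dim)
dep_y = torch.randint(0, 2, (8,))
spk_y = torch.randint(0, 50, (8,))

h = encoder(feats)
loss = ce(dep_head(h), dep_y) + ce(sid_head(GradReverse.apply(h, 0.1)), spk_y)
loss.backward()  # encoder receives reversed SID gradients
```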

11.
CEUR Workshop Proc ; 3649: 57-63, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38650610

ABSTRACT

The proposed method focuses on speaker disentanglement in the context of depression detection from speech signals. Previous approaches require patient/speaker labels, encounter instability due to loss maximization, and introduce unnecessary parameters for adversarial domain prediction. In contrast, the proposed unsupervised approach reduces the cosine similarity between the latent spaces of depression and pre-trained speaker classification models. This method outperforms baseline models, matches or exceeds adversarial methods in performance, and does so without relying on speaker labels or introducing additional model parameters, leading to a reduction in model complexity. The higher the speaker de-identification score (DeID), the better the depression detection system masks a patient's identity, thereby enhancing the privacy attributes of depression detection systems. On the DAIC-WOZ dataset with ComParE16 features and an LSTM-only model, our method achieves an F1-score of 0.776 and a DeID score of 92.87%, outperforming its adversarial counterpart, which has an F1-score of 0.762 and a DeID of 68.37%. Furthermore, we demonstrate that speaker-disentanglement methods are complementary to text-based approaches, and a score-level fusion with a Word2vec-based depression detection model further enhances the overall performance to an F1-score of 0.830.
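The unsupervised objective reduces to an added penalty on the cosine similarity between the two latent spaces. A sketch, assuming the speaker model is frozen; shapes and the weight alpha are illustrative.

```python
# Cosine-similarity disentanglement penalty (no speaker labels needed).
import torch
import torch.nn.functional as F

def disentangle_loss(dep_latent, spk_latent, dep_logits, dep_y, alpha=0.1):
    """dep_latent, spk_latent: (B, D) latent vectors from the depression
    model and a frozen pre-trained speaker classifier."""
    task = F.cross_entropy(dep_logits, dep_y)
    sim = F.cosine_similarity(dep_latent, spk_latent, dim=-1).abs().mean()
    return task + alpha * sim  # push depression latents away from speaker latents

dep_latent = torch.randn(8, 128, requires_grad=True)
spk_latent = torch.randn(8, 128)  # frozen speaker-model activations
logits = torch.randn(8, 2, requires_grad=True)
print(disentangle_loss(dep_latent, spk_latent, logits, torch.randint(0, 2, (8,))))
```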

12.
Commun Biol ; 7(1): 540, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38714798

ABSTRACT

The genetic influence on human vocal pitch in tonal and non-tonal languages remains largely unknown. In tonal languages, such as Mandarin Chinese, pitch changes differentiate word meanings, whereas in non-tonal languages, such as Icelandic, pitch is used to convey intonation. We addressed this question by searching for genetic associations with interindividual variation in median pitch in a Chinese major depression case-control cohort and compared our results with a genome-wide association study from Iceland. The same genetic variant, rs11046212-T in an intron of the ABCC9 gene, was one of the most strongly associated loci with median pitch in both samples. Our meta-analysis revealed four genome-wide significant hits, including two novel associations. The discovery of genetic variants influencing vocal pitch across both tonal and non-tonal languages suggests the possibility of a common genetic contribution to the human vocal system shared in two distinct populations with languages that differ in tonality (Icelandic and Mandarin).


Subject(s)
Genome-Wide Association Study, Language, Humans, Male, Female, Single Nucleotide Polymorphism, Adult, Iceland, Case-Control Studies, Middle Aged, Voice/physiology, Pitch Perception, Asian People/genetics
13.
J Acoust Soc Am ; 133(3): 1656-66, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23464035

ABSTRACT

Because voice signals result from vocal fold vibration, perceptually meaningful vibratory measures should quantify those aspects of vibration that correspond to differences in voice quality. In this study, glottal area waveforms were extracted from high-speed videoendoscopy of the vocal folds. Principal component analysis was applied to these waveforms to investigate the factors that vary with voice quality. Results showed that the first principal component derived from tokens without glottal gaps was significantly (p < 0.01) associated with the open quotient (OQ). The alternating-current (AC) measure had a significant effect (p < 0.01) on the first principal component among tokens exhibiting glottal gaps. A measure AC/OQ, defined as the ratio of AC to OQ, was proposed to combine both amplitude and temporal characteristics of the glottal area waveform for both complete and incomplete glottal closures. Analyses of "glide" phonations in which quality varied continuously from breathy to pressed showed that the AC/OQ measure was able to characterize the corresponding continuum of glottal area waveform variation, regardless of the presence or absence of glottal gaps.
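The proposed AC/OQ measure is straightforward to compute from one cycle of the glottal area waveform: OQ is the fraction of the cycle the glottis is open, and AC the peak-to-trough area amplitude. A toy sketch with an idealized half-sinusoid cycle; the open threshold is illustrative.

```python
# AC/OQ on a single glottal-area cycle.
import numpy as np

def ac_over_oq(area: np.ndarray, open_thresh: float = 0.0) -> float:
    """area: glottal area samples for exactly one cycle."""
    ac = float(area.max() - area.min())             # AC amplitude
    oq = float(np.mean(area > open_thresh))         # open quotient
    return ac / oq

t = np.linspace(0, 1, 200, endpoint=False)
area = np.clip(np.sin(2 * np.pi * t), 0, None)      # idealized cycle, OQ ~ 0.5
print(ac_over_oq(area))                             # -> ~2.0 for this toy cycle
```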


Subject(s)
Glottis/anatomy & histology, Glottis/physiology, Phonation, Speech Acoustics, Voice Quality, Biomechanical Phenomena, Female, Humans, Laryngoscopy, Linear Models, Male, Principal Component Analysis, Time Factors, Vibration, Video Recording, Vocal Folds/anatomy & histology, Vocal Folds/physiology
14.
Interspeech ; 2023: 2343-2347, 2023 Aug.
Article in English | MEDLINE | ID: mdl-38045821

ABSTRACT

While speech-based depression detection methods that use speaker-identity features, such as speaker embeddings, are popular, they often compromise patient privacy. To address this issue, we propose a speaker disentanglement method that utilizes a non-uniform mechanism of adversarial SID-loss maximization, achieved by varying the adversarial weight between different layers of a model during training. We find that a greater adversarial weight for the initial layers leads to performance improvement. Our approach using the ECAPA-TDNN model achieves an F1-score of 0.7349 (a 3.7% improvement over the audio-only SOTA) on the DAIC-WoZ dataset, while simultaneously reducing the speaker-identification accuracy by 50%. Our findings suggest that identifying depression through speech signals can be accomplished without undue reliance on a speaker's identity, paving the way for privacy-preserving approaches to depression detection.
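The non-uniform mechanism can be sketched as a per-layer adversarial weight on speaker-ID losses, larger at the initial layers. The sketch below shows only the encoder-side objective via loss subtraction; in practice a gradient reversal layer would keep the SID heads themselves trained to minimize. Sizes and weights are illustrative, not the ECAPA-TDNN configuration.

```python
# Layer-wise adversarial weighting: bigger weight on earlier layers.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
enc = nn.ModuleList([nn.Linear(40, 64), nn.Linear(64, 64)])
sid_heads = nn.ModuleList([nn.Linear(64, 50), nn.Linear(64, 50)])  # 50 speakers
dep_head = nn.Linear(64, 2)
adv_w = [0.5, 0.1]  # greater adversarial weight on the initial layer

x = torch.randn(8, 40)
dep_y = torch.randint(0, 2, (8,))
spk_y = torch.randint(0, 50, (8,))

loss = 0.0
for layer, head, w in zip(enc, sid_heads, adv_w):
    x = torch.relu(layer(x))
    loss = loss - w * ce(head(x), spk_y)  # maximize per-layer SID loss
loss = loss + ce(dep_head(x), dep_y)      # minimize depression loss
loss.backward()
```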

15.
J Acoust Soc Am ; 132(4): 2625-32, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23039455

ABSTRACT

Increases in open quotient are widely assumed to cause changes in the amplitude of the first harmonic relative to the second (H1*-H2*), which in turn correspond to increases in perceived vocal breathiness. Empirical support for these assumptions is rather limited, and reported relationships among these three descriptive levels have been variable. This study examined the empirical relationship among H1*-H2*, the glottal open quotient (OQ), and glottal area waveform skewness, measured synchronously from audio recordings and high-speed video images of the larynges of six phonetically knowledgeable, vocally healthy speakers who varied fundamental frequency and voice qualities quasi-orthogonally. Across speakers and voice qualities, OQ, the asymmetry coefficient, and fundamental frequency accounted for an average of 74% of the variance in H1*-H2*. However, analyses of individual speakers showed large differences in the strategies used to produce the same intended voice qualities. Thus, H1*-H2* can be predicted with good overall accuracy, but its relationship to phonatory characteristics appears to be speaker dependent.


Subject(s)
Glottis/physiology, Phonation, Phonetics, Speech Acoustics, Voice Quality, Biomechanical Phenomena, Female, Glottis/anatomy & histology, Humans, Laryngoscopy, Linear Models, Male, Speech Production Measurement, Time Factors, Video Recording
16.
J Acoust Soc Am ; 132(4): 2592-602, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23039452

ABSTRACT

This paper presents a large-scale study of subglottal resonances (SGRs) (the resonant frequencies of the tracheo-bronchial tree) and their relations to various acoustical and physiological characteristics of speakers. The paper presents data from a corpus of simultaneous microphone and accelerometer recordings of consonant-vowel-consonant (CVC) words embedded in a carrier phrase spoken by 25 male and 25 female native speakers of American English ranging in age from 18 to 24 yr. The corpus contains 17,500 utterances of 14 American English monophthongs, diphthongs, and the rhotic approximant [ɹ] in various CVC contexts. Only monophthongs are analyzed in this paper. Speaker height and age were also recorded. Findings include (1) normative data on the frequency distribution of SGRs for young adults, (2) the dependence of SGRs on height, (3) the lack of a correlation between SGRs and formants or the fundamental frequency, (4) a poor correlation of the first SGR with the second and third SGRs but a strong correlation between the second and third SGRs, and (5) a significant effect of vowel category on SGR frequencies, although this effect is smaller than the measurement standard deviations and therefore negligible for practical purposes.


Subject(s)
Glottis/physiology, Language, Phonation, Speech Acoustics, Voice Quality, Accelerometry, Adolescent, Age Factors, Biomechanical Phenomena, Body Height, Female, Humans, Male, Sex Factors, Sound Spectrography, Speech Production Measurement, Vibration, Young Adult
17.
Article in English | MEDLINE | ID: mdl-35531125

ABSTRACT

In this paper, a data augmentation method is proposed for depression detection from speech signals. Samples for data augmentation were created by changing the frame-width and the frame-shift parameters during the feature extraction process. Unlike other data augmentation methods (such as VTLP, pitch perturbation, or speed perturbation), the proposed method does not explicitly change acoustic parameters but rather the time-frequency resolution of frame-level features. The proposed method was evaluated using two different datasets, models, and input acoustic features. For the DAIC-WOZ (English) dataset when using the DepAudioNet model and mel-Spectrograms as input, the proposed method resulted in an improvement of 5.97% (validation) and 25.13% (test) when compared to the baseline. The improvements for the CONVERGE (Mandarin) dataset when using the x-vector embeddings with CNN as the backend and MFCCs as input features were 9.32% (validation) and 12.99% (test). Baseline systems do not incorporate any data augmentation. Further, the proposed method outperformed commonly used data-augmentation methods such as noise augmentation, VTLP, Speed, and Pitch Perturbation. All improvements were statistically significant.
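The augmentation amounts to recomputing frame-level features under several frame-width/frame-shift settings, so each utterance yields multiple time-frequency "views". A sketch using librosa mel-spectrograms, with an illustrative parameter grid rather than the paper's.

```python
# Frame-width / frame-shift augmentation of mel-spectrogram features.
import numpy as np
import librosa

def augmented_views(wav: np.ndarray, sr: int):
    views = []
    for win_ms, hop_ms in [(25, 10), (32, 12), (40, 16)]:  # illustrative grid
        n_fft = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=40)
        views.append(librosa.power_to_db(mel))
    return views

wav = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in
for v in augmented_views(wav, 16000):
    print(v.shape)  # same utterance, different time-frequency resolutions
```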

18.
Interspeech ; 2022: 2018-2022, 2022.
Article in English | MEDLINE | ID: mdl-36341466

ABSTRACT

Major Depressive Disorder (MDD) is a severe illness that affects millions of people, and it is critical to diagnose this disorder as early as possible. Detecting depression from voice signals can be of great help to physicians and can be done without any invasive procedure. Since relevant labelled data are scarce, we propose a modified Instance Discriminative Learning (IDL) method, an unsupervised pre-training technique, to extract augment-invariant and instance-spread-out embeddings. For learning augment-invariant embeddings, various data augmentation methods for speech are investigated, and time-masking yields the best performance. For learning instance-spread-out embeddings, we explore methods for sampling instances for a training batch (distinct speaker-based and random sampling). The distinct speaker-based sampling provides better performance than random sampling, and we hypothesize that this is because relevant speaker information is preserved in the embedding. Additionally, we propose a novel sampling strategy, Pseudo Instance-based Sampling (PIS), based on clustering algorithms, to enhance the spread-out characteristics of the embeddings. Experiments are conducted with DepAudioNet on the DAIC-WOZ (English) and CONVERGE (Mandarin) datasets, and statistically significant improvements in MDD detection relative to a baseline without pre-training are observed using PIS (p-values of 0.0015 and 0.05, respectively).
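Time masking, the best-performing augmentation here, simply zeroes a random span of frames so that two views of one utterance differ. A minimal sketch with an illustrative maximum mask length.

```python
# Time masking for augment-invariant embedding learning.
import numpy as np

def time_mask(spec: np.ndarray, max_frames: int = 20, rng=None) -> np.ndarray:
    """spec: (n_mels, T) spectrogram; returns a copy with one random
    span of up to max_frames frames zeroed out."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    t = int(rng.integers(1, max_frames + 1))
    start = int(rng.integers(0, max(1, spec.shape[1] - t)))
    out[:, start:start + t] = 0.0
    return out

spec = np.random.randn(40, 300)
v1, v2 = time_mask(spec), time_mask(spec)  # two augmented views, same utterance
```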

19.
Interspeech ; 2022: 3338-3342, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36341467

ABSTRACT

Preserving a patient's identity is a challenge for automatic, speech-based diagnosis of mental health disorders. In this paper, we address this issue by proposing adversarial disentanglement of depression characteristics and speaker identity. The model used for depression classification is trained in a speaker-identity-invariant manner by minimizing depression prediction loss and maximizing speaker prediction loss during training. The effectiveness of the proposed method is demonstrated on two datasets - DAIC-WOZ (English) and CONVERGE (Mandarin), with three feature sets (Mel-spectrograms, raw-audio signals, and the last-hidden-state of Wav2vec2.0), using a modified DepAudioNet model. With adversarial training, depression classification improves for every feature when compared to the baseline. Wav2vec2.0 features with adversarial learning resulted in the best performance (F1-score of 69.2% for DAIC-WOZ and 91.5% for CONVERGE). Analysis of the class-separability measure (J-ratio) of the hidden states of the DepAudioNet model shows that when adversarial learning is applied, the backend model loses some speaker-discriminability while it improves depression-discriminability. These results indicate that there are some components of speaker identity that may not be useful for depression detection and minimizing their effects provides a more accurate diagnosis of the underlying disorder and can safeguard a speaker's identity.

20.
J Acoust Soc Am ; 129(4): 2144-62, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21476670

ABSTRACT

In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.
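The cost function combines a formant distance with regularization and continuity terms. A schematic sketch in which the synthesis function stands in for the chain-matrix computation of Maeda-model formants; all weights are illustrative.

```python
# Analysis-by-synthesis inversion cost for one articulatory trajectory.
import numpy as np

def inversion_cost(params, target_formants, synth, lam_reg=0.01, lam_cont=0.1):
    """params: (T, P) articulatory trajectory; target_formants: (T, 3) Hz;
    synth: maps one parameter vector to its first three formants."""
    synth_f = np.array([synth(p) for p in params])
    dist = np.mean(((synth_f - target_formants) / target_formants) ** 2)
    reg = lam_reg * np.mean(params ** 2)                      # keep parameters small
    cont = lam_cont * np.mean(np.diff(params, axis=0) ** 2)   # smooth trajectories
    return dist + reg + cont

toy_synth = lambda p: 500.0 * (1 + 0.1 * p[:3])  # stand-in for Maeda synthesis
params = np.zeros((10, 7))
target = np.full((10, 3), 500.0)
print(inversion_cost(params, target, toy_synth))
```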


Subject(s)
Biological Models, Speech Acoustics, Speech Intelligibility/physiology, Speech Recognition Software, Speech/physiology, Acoustics, Calibration, Female, Humans, Male, Markov Chains, Vocal Folds/physiology