Results 1 - 20 of 32
1.
J Acoust Soc Am ; 155(4): 2836-2848, 2024 Apr 01.
Article in English | MEDLINE | ID: mdl-38682915

ABSTRACT

This paper evaluates an innovative framework for predicting spoken dialect density in children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, a classifier is trained to predict the level of dialect density, providing a higher degree of specificity for downstream tasks. To this end, self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, prosodic features, and other feature sets are evaluated as inputs to an XGBoost classifier. The classifier is then trained to assign dialect density labels to short recorded utterances. The system achieves high dialect density classification accuracy for both child and adult speech and demonstrates robust performance across ages and regional dialect varieties. Additionally, this work serves as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect.
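A rough illustration of the classification stage described above: the sketch below trains an XGBoost classifier on utterance-level feature vectors. The feature dimensions, density levels, and random data are assumptions for demonstration, not the paper's setup.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hypothetical inputs: each row pools a mean HuBERT embedding with handcrafted
# grammar/prosody features; y holds ordinal dialect-density levels (0=low..2=high).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768 + 20))   # 768-dim HuBERT mean + 20 handcrafted dims
y = rng.integers(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("density-level accuracy:", (clf.predict(X_te) == y_te).mean())
```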


Subjects
Black or African American; Speech Acoustics; Humans; Adult; Child; Male; Female; Speech Production Measurement/methods; Language; Child, Preschool; Young Adult; Speech Perception; Adolescent; Phonetics; Child Language
2.
J Acoust Soc Am ; 151(2): 1393, 2022 02.
Article in English | MEDLINE | ID: mdl-35232083

ABSTRACT

This study compares human speaker discrimination performance for read speech versus casual conversations and explores differences between unfamiliar voices that are "easy" versus "hard" to "tell together" versus "tell apart." Thirty listeners were asked whether pairs of short style-matched or -mismatched, text-independent utterances represented the same or different speakers. Listeners performed better when stimuli were style-matched, particularly in read speech-read speech trials (equal error rate, EER, of 6.96% versus 15.12% in conversation-conversation trials). In contrast, the EER was 20.68% for the style-mismatched condition. When styles were matched, listeners' confidence was higher when speakers were the same versus different; however, style variation caused decreases in listeners' confidence for the "same speaker" trials, suggesting a higher dependency of this task on within-speaker variability. The speakers who were "easy" or "hard" to "tell together" were not the same as those who were "easy" or "hard" to "tell apart." Analysis of speaker acoustic spaces suggested that the difference observed in human approaches to "same speaker" and "different speaker" tasks depends primarily on listeners' different perceptual strategies when dealing with within- versus between-speaker acoustic variability.
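For reference, the equal error rate (EER) quoted above is the operating point at which false acceptances and false rejections are equally frequent. A minimal sketch of its computation from same/different-speaker trial scores; the synthetic scores are purely illustrative:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep a decision threshold over trial scores and return the EER.
    scores: similarity scores (higher = more likely same speaker)
    labels: 1 for same-speaker trials, 0 for different-speaker trials
    """
    best_eer, best_gap = 1.0, np.inf
    for t in np.unique(scores):
        decisions = scores >= t
        far = np.mean(decisions[labels == 0])    # false acceptance rate
        frr = np.mean(~decisions[labels == 1])   # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Synthetic trials: same-speaker scores centered higher than different-speaker.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 200), rng.normal(0.0, 0.5, 200)])
labels = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
print(f"EER ~ {equal_error_rate(scores, labels):.2%}")
```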


Subjects
Speech Perception; Voice; Acoustics; Humans; Speech
3.
Pediatr Res ; 87(3): 576-580, 2020 02.
Article in English | MEDLINE | ID: mdl-31585457

ABSTRACT

BACKGROUND: To characterize acoustic features of an infant's cry and use machine learning to provide an objective measurement of behavioral state in a cry-translator, and to apply the cry-translation algorithm to colic, hypothesizing that these cries sound painful. METHODS: Assessment of 1000 cries in a mobile app (ChatterBaby™). A cry-translation algorithm was trained by evaluating >6000 acoustic features to predict whether an infant's cry was due to pain (vaccinations, ear piercings), fussiness, or hunger. The algorithm was then used to predict the behavioral state of infants with reported colic. RESULTS: The cry-translation algorithm was 90.7% accurate in identifying pain cries and achieved 71.5% accuracy in discriminating among cries of fussiness, hunger, and pain. The ChatterBaby cry-translation algorithm overwhelmingly predicted that colic cries were most likely from pain, compared with fussy and hungry states. Colic cries had an average pain rating of 73%, significantly greater than the pain measurements found for fussiness and hunger (p < 0.001, 2-sample t test). Colic cries outranked pain cries by measures of acoustic intensity, including energy, length of voiced periods, and fundamental frequency/pitch, while fussy and hungry cries showed reduced intensity measures compared with pain and colic. CONCLUSIONS: Acoustic features of cries are consistent across a diverse infant population and can be utilized as objective markers of pain, hunger, and fussiness. The ChatterBaby algorithm detected significant acoustic similarities between colic and painful cries, suggesting that they may share a neuronal pathway.


Subjects
Abdominal Pain/psychology; Acoustics; Colic/psychology; Crying; Infant Behavior; Machine Learning; Mobile Applications; Pain Perception; Signal Processing, Computer-Assisted; Abdominal Pain/diagnosis; Colic/diagnosis; Female; Humans; Infant; Infant, Newborn; Male; Pattern Recognition, Automated; Sound Spectrography
4.
J Acoust Soc Am ; 144(6): 3437, 2018 12.
Article in English | MEDLINE | ID: mdl-30599649

ABSTRACT

This paper presents an investigation of children's subglottal resonances (SGRs), the natural frequencies of the tracheo-bronchial acoustic system. A total of 43 children (31 male, 12 female) aged between 6 and 18 yr were recorded. Both microphone signals of various consonant-vowel-consonant words and subglottal accelerometer signals of the sustained vowel /ɑ/ were recorded for each of the children, along with age and standing height. The first three SGRs of each child were measured from the sustained vowel subglottal accelerometer signals. A model relating SGRs to standing height was developed based on the quarter-wavelength resonator model, previously developed for adult SGRs and heights. Based on difficulties in predicting the higher SGR values for the younger children, the model of the third SGR was refined to account for frequency-dependent acoustic lengths of the tracheo-bronchial system. This updated model more accurately estimates both adult and child SGRs based on their heights. These results indicate the importance of considering frequency-dependent acoustic lengths of the subglottal system.
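The quarter-wavelength resonator model mentioned above treats the subglottal system as a tube effectively closed at the glottis, so the k-th resonance is SGR_k = (2k - 1)c / (4L). A minimal sketch assuming a frequency-independent acoustic length proportional to standing height; the scaling constant is an illustrative guess, not the paper's fitted value:

```python
C = 35000.0    # speed of sound in warm, humid air (cm/s)
ALPHA = 0.11   # hypothetical ratio of effective acoustic length to height

def sgr(height_cm: float, k: int, alpha: float = ALPHA) -> float:
    """Predict the k-th subglottal resonance (Hz) from standing height (cm)."""
    l_eff = alpha * height_cm              # effective acoustic length (cm)
    return (2 * k - 1) * C / (4 * l_eff)   # quarter-wavelength resonances

for h in (120, 150, 175):                  # child through adult heights
    print(h, [round(sgr(h, k)) for k in (1, 2, 3)])
```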

5.
J Acoust Soc Am ; 144(1): 375, 2018 07.
Article in English | MEDLINE | ID: mdl-30075658

ABSTRACT

Little is known about human and machine speaker discrimination ability when utterances are very short and the speaking style is variable. This study compares the text-independent speaker discrimination ability of humans and machines based on utterances shorter than 2 s in two different speaking styles (read sentences and speech directed toward pets, characterized by exaggerated prosody). Recordings of 50 female speakers drawn from the UCLA Speaker Variability Database were used as stimuli. The performance of 65 human listeners was compared to that of i-vector-based automatic speaker verification systems using mel-frequency cepstral coefficients, voice quality features inspired by a psychoacoustic model of voice perception, or their combination by score-level fusion. Humans outperformed machines in all conditions except style-mismatched pairs from perceptually marked speakers. Speaker representations by humans and machines were compared using multi-dimensional scaling (MDS). Canonical correlation analysis showed a weak correlation between machine and human MDS spaces. Multiple regression showed that means of voice quality features could represent the most important human MDS dimension well, but not the dimensions from machines. These results suggest that speaker representations by humans and machines differ, and machine performance might be improved by a better understanding of how different acoustic features relate to perceived speaker identity.


Subjects
Speech Acoustics; Speech Perception/physiology; Speech/physiology; Voice/physiology; Adolescent; Adult; Comprehension/physiology; Female; Humans; Language; Male; Voice Quality; Young Adult
6.
J Acoust Soc Am ; 141(4): EL420, 2017 04.
Article in English | MEDLINE | ID: mdl-28464674

ABSTRACT

This letter investigates the use of subglottal resonances (SGRs) for noise-robust speaker identification (SID). It is motivated by the speaker specificity and stationarity of subglottal acoustics, and the development of noise-robust SGR estimation algorithms which are reliable at low signal-to-noise ratios for large datasets. A two-stage framework is proposed which combines the SGRs with different cepstral features. The cepstral features are used in the first stage to reduce the number of target speakers for a test utterance, and then SGRs are used as complementary second-stage features to conduct identification. Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly (2%-6% absolute accuracy improvement) across all noise conditions in mismatched situations.
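A schematic of the two-stage idea, with hypothetical nearest-reference scoring standing in for whichever cepstral and SGR back-ends the paper actually uses:

```python
import numpy as np

def two_stage_sid(test_cepstral, test_sgrs, cepstral_refs, sgr_refs, shortlist=5):
    """Stage 1: cepstral features prune the speaker set to a shortlist.
    Stage 2: subglottal resonances decide among the shortlisted speakers.
    cepstral_refs / sgr_refs: hypothetical dicts of speaker -> reference vector."""
    dists = {spk: np.linalg.norm(test_cepstral - ref)
             for spk, ref in cepstral_refs.items()}
    candidates = sorted(dists, key=dists.get)[:shortlist]
    return min(candidates,
               key=lambda spk: np.linalg.norm(test_sgrs - sgr_refs[spk]))
```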


Subjects
Acoustics; Glottis/physiology; Phonation; Speech Acoustics; Speech Production Measurement/methods; Voice Quality; Accelerometry; Algorithms; Automation; Female; Humans; Male; Signal Processing, Computer-Assisted; Sound Spectrography; Time Factors
7.
J Acoust Soc Am ; 140(5): 3691, 2016 Nov.
Article in English | MEDLINE | ID: mdl-27908084

ABSTRACT

Automatic phrase detection systems for bird sounds are useful in several applications because they reduce the need for manual annotation. However, bird phrase detection is challenging due to limited training data and background noise. Data are limited because recordings are scarce or some phrases are rare; background noise interference arises from the recording environment, such as wind or other animals. This paper presents a different approach to birdsong phrase classification using template-based techniques suitable even for limited training data and noisy environments. The algorithm utilizes dynamic time warping (DTW) and prominent (high-energy) time-frequency regions of training spectrograms to derive templates. The performance of the proposed algorithm is compared with traditional DTW and hidden Markov model (HMM) methods under several training and test conditions. DTW works well when data are limited, while HMMs do better when more data are available, yet both suffer when background noise is severe. The proposed algorithm outperforms DTW and HMMs in most training and testing conditions, usually by a large margin when the background noise level is high. The innovation of this work is that the proposed algorithm is robust to both limited training data and background noise.
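A rough sketch of template matching with high-energy masking and DTW alignment; the masking quantile, distance metric, and scoring rule are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np
import librosa

def dtw_template_score(template_spec, test_spec, energy_quantile=0.8):
    """Score a test spectrogram against a template, keeping only the
    template's prominent (high-energy) time-frequency regions."""
    # Zero out low-energy template cells so noise-dominated regions are ignored.
    thresh = np.quantile(template_spec, energy_quantile)
    masked = np.where(template_spec >= thresh, template_spec, 0.0)
    # Align template and test frames with DTW; lower cumulative cost = closer.
    D, wp = librosa.sequence.dtw(X=masked, Y=test_spec, metric="euclidean")
    return -D[-1, -1] / len(wp)   # path-normalized similarity (higher = better)

# Usage: label a phrase with the template class that scores highest.
```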


Subjects
Vocalization, Animal; Algorithms; Animals; Automation; Birds; Noise
8.
J Acoust Soc Am ; 138(1): 1-10, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26233000

ABSTRACT

Models of the voice source differ in their fits to natural voices, but it is unclear which differences in fit are perceptually salient. This study examined the relationship between the fit of five voice source models to 40 natural voices, and the degree of perceptual match among stimuli synthesized with each of the modeled sources. Listeners completed a visual sort-and-rate task to compare versions of each voice created with the different source models, and the results were analyzed using multidimensional scaling. Neither fits to pulse shapes nor fits to landmark points on the pulses predicted observed differences in quality. Further, the source models fit the opening phase of the glottal pulses better than they fit the closing phase, but at the same time similarity in quality was better predicted by the timing and amplitude of the negative peak of the flow derivative (part of the closing phase) than by the timing and/or amplitude of peak glottal opening. Results indicate that simply knowing how (or how well) a particular source model fits or does not fit a target source pulse in the time domain provides little insight into what aspects of the voice source are important to listeners.


Subjects
Auditory Perception/physiology; Voice Quality/physiology; Acoustic Stimulation; Adolescent; Adult; Glottis/physiology; Humans; Middle Aged; Models, Biological; Sound Localization/physiology; Sound Spectrography; Young Adult
9.
J Acoust Soc Am ; 137(3): 1069-80, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25786922

ABSTRACT

Annotation of phrases in birdsongs can be helpful to behavioral and population studies. To reduce the need for manual annotation, an automated birdsong phrase classification algorithm for limited data is developed. Limited data occur because of limited recordings or the existence of rare phrases. In this paper, classification of up to 81 phrase classes of Cassin's Vireo is performed using one to five training samples per class. The algorithm involves dynamic time warping (DTW) and two passes of sparse representation (SR) classification. DTW improves the similarity between training and test phrases from the same class in the presence of individual bird differences and phrase segmentation inconsistencies. The SR classifier works by finding a sparse linear combination of training feature vectors from all classes that best approximates the test feature vector. When the class decisions from DTW and the first pass SR classification are different, SR classification is repeated using training samples from these two conflicting classes. Compared to DTW, support vector machines, and an SR classifier without DTW, the proposed classifier achieves the highest classification accuracies of 94% and 89% on manually segmented and automatically segmented phrases, respectively, from unseen Cassin's Vireo individuals, using five training samples per class.
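The sparse representation step can be sketched as below; Lasso stands in for whichever sparse solver the paper used, and this single-pass version omits the DTW alignment and the second-pass tie-breaking between conflicting classes:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(D, class_ids, x, alpha=0.01):
    """Sparse representation classification: express test vector x as a sparse
    combination of training columns in dictionary D (features x samples), then
    assign the class whose columns best reconstruct x."""
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(D, x)                      # sparse coefficients over all classes
    coef = solver.coef_
    residuals = {c: np.linalg.norm(x - D[:, class_ids == c] @ coef[class_ids == c])
                 for c in np.unique(class_ids)}
    return min(residuals, key=residuals.get)   # smallest reconstruction error
```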


Subjects
Acoustics; Pattern Recognition, Automated; Signal Processing, Computer-Assisted; Songbirds/physiology; Vocalization, Animal; Algorithms; Animals; Linear Models; Male; Songbirds/classification; Sound Spectrography; Species Specificity; Support Vector Machine; Time Factors; Vocalization, Animal/classification
10.
CEUR Workshop Proc ; 3649: 57-63, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38650610

ABSTRACT

The proposed method focuses on speaker disentanglement in the context of depression detection from speech signals. Previous approaches require patient/speaker labels, encounter instability due to loss maximization, and introduce unnecessary parameters for adversarial domain prediction. In contrast, the proposed unsupervised approach reduces the cosine similarity between the latent spaces of a depression model and a pre-trained speaker classification model. This method outperforms baseline models, matches or exceeds adversarial methods in performance, and does so without relying on speaker labels or introducing additional model parameters, leading to a reduction in model complexity. The higher the speaker de-identification score (DeID), the better the depression detection system is at masking a patient's identity, thereby enhancing the privacy attributes of depression detection systems. On the DAIC-WOZ dataset with ComParE16 features and an LSTM-only model, our method achieves an F1-score of 0.776 and a DeID score of 92.87%, outperforming its adversarial counterpart (F1-score of 0.762 and DeID of 68.37%). Furthermore, we demonstrate that speaker-disentanglement methods are complementary to text-based approaches: a score-level fusion with a Word2vec-based depression detection model further improves the overall performance to an F1-score of 0.830.
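A minimal PyTorch sketch of the objective described above, assuming the depression and speaker latents have already been projected to a common dimensionality; the weight lam is an illustrative hyperparameter:

```python
import torch
import torch.nn.functional as F

def disentangled_depression_loss(dep_latent, spk_latent, dep_logits, dep_labels,
                                 lam=0.1):
    """Depression loss plus a penalty on the cosine similarity between the
    depression model's latents and a frozen pre-trained speaker model's
    latents, so the depression branch carries less speaker information."""
    bce = F.binary_cross_entropy_with_logits(dep_logits, dep_labels.float())
    cos = F.cosine_similarity(dep_latent, spk_latent, dim=-1).abs().mean()
    return bce + lam * cos   # no adversarial max step, no extra parameters
```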

11.
Comput Speech Lang ; 86, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38313320

ABSTRACT

Speech signals are valuable biomarkers for assessing an individual's mental health, including automatically identifying Major Depressive Disorder (MDD). A frequently used approach is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker-identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could act as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using cross-entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, yielded improved F1-scores for MDD detection and improved voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (GVD) and De-Identification score (DeID). On the DAIC-WOZ dataset (English), LECE using ComParE16 features achieves the best F1-score of 80%, which represents the audio-only SOTA for depression detection, along with a GVD of -1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using the raw audio signal achieves an F1-score of 72.38%, surpassing the multi-modal SOTA, along with a GVD of -0.89 dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.

12.
Commun Biol ; 7(1): 540, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38714798

ABSTRACT

The genetic influence on human vocal pitch in tonal and non-tonal languages remains largely unknown. In tonal languages, such as Mandarin Chinese, pitch changes differentiate word meanings, whereas in non-tonal languages, such as Icelandic, pitch is used to convey intonation. We addressed this question by searching for genetic associations with interindividual variation in median pitch in a Chinese major depression case-control cohort and compared our results with a genome-wide association study from Iceland. The same genetic variant, rs11046212-T in an intron of the ABCC9 gene, was one of the most strongly associated loci with median pitch in both samples. Our meta-analysis revealed four genome-wide significant hits, including two novel associations. The discovery of genetic variants influencing vocal pitch across both tonal and non-tonal languages suggests the possibility of a common genetic contribution to the human vocal system shared in two distinct populations with languages that differ in tonality (Icelandic and Mandarin).


Subjects
Genome-Wide Association Study; Language; Humans; Male; Female; Polymorphism, Single Nucleotide; Adult; Iceland; Case-Control Studies; Middle Aged; Voice/physiology; Pitch Perception; Asian People/genetics
13.
J Acoust Soc Am ; 133(3): 1656-66, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23464035

ABSTRACT

Because voice signals result from vocal fold vibration, perceptually meaningful vibratory measures should quantify those aspects of vibration that correspond to differences in voice quality. In this study, glottal area waveforms were extracted from high-speed videoendoscopy of the vocal folds. Principal component analysis was applied to these waveforms to investigate the factors that vary with voice quality. Results showed that the first principal component derived from tokens without glottal gaps was significantly (p < 0.01) associated with the open quotient (OQ). The alternating-current (AC) measure had a significant effect (p < 0.01) on the first principal component among tokens exhibiting glottal gaps. A measure AC/OQ, defined as the ratio of AC to OQ, was proposed to combine both amplitude and temporal characteristics of the glottal area waveform for both complete and incomplete glottal closures. Analyses of "glide" phonations in which quality varied continuously from breathy to pressed showed that the AC/OQ measure was able to characterize the corresponding continuum of glottal area waveform variation, regardless of the presence or absence of glottal gaps.
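A minimal sketch of the AC/OQ measure for one glottal cycle; the cycle segmentation and the open/closed threshold are simplifying assumptions, not the paper's exact procedure:

```python
import numpy as np

def ac_over_oq(glottal_area, fs, f0):
    """AC = peak-to-peak glottal area within a cycle (amplitude information).
    OQ = fraction of the cycle during which the glottis is open (timing).
    Returns AC/OQ for the first cycle of a glottal area waveform."""
    period = int(round(fs / f0))
    cycle = glottal_area[:period]
    ac = cycle.max() - cycle.min()
    open_thresh = cycle.min() + 0.01 * ac   # small margin above the minimum
    oq = np.mean(cycle > open_thresh)       # proportion of "open" samples
    return ac / oq if oq > 0 else np.nan
```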


Subjects
Glottis/anatomy & histology; Glottis/physiology; Phonation; Speech Acoustics; Voice Quality; Biomechanical Phenomena; Female; Humans; Laryngoscopy; Linear Models; Male; Principal Component Analysis; Time Factors; Vibration; Video Recording; Vocal Cords/anatomy & histology; Vocal Cords/physiology
14.
Interspeech ; 2023: 2343-2347, 2023 Aug.
Article in English | MEDLINE | ID: mdl-38045821

ABSTRACT

While speech-based depression detection methods that use speaker-identity features, such as speaker embeddings, are popular, they often compromise patient privacy. To address this issue, we propose a speaker disentanglement method that utilizes a non-uniform mechanism of adversarial SID-loss maximization, achieved by varying the adversarial weight across different layers of a model during training. We find that a greater adversarial weight for the initial layers leads to performance improvement. Our approach using the ECAPA-TDNN model achieves an F1-score of 0.7349 (a 3.7% improvement over the audio-only SOTA) on the DAIC-WOZ dataset, while simultaneously reducing speaker-identification accuracy by 50%. Our findings suggest that identifying depression from speech signals can be accomplished without placing undue reliance on a speaker's identity, paving the way for privacy-preserving approaches to depression detection.

15.
J Acoust Soc Am ; 132(4): 2625-32, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23039455

ABSTRACT

Increases in open quotient are widely assumed to cause changes in the amplitude of the first harmonic relative to the second (H1*-H2*), which in turn correspond to increases in perceived vocal breathiness. Empirical support for these assumptions is rather limited, and reported relationships among these three descriptive levels have been variable. This study examined the empirical relationship among H1*-H2*, the glottal open quotient (OQ), and glottal area waveform skewness, measured synchronously from audio recordings and high-speed video images of the larynges of six phonetically knowledgeable, vocally healthy speakers who varied fundamental frequency and voice qualities quasi-orthogonally. Across speakers and voice qualities, OQ, the asymmetry coefficient, and fundamental frequency accounted for an average of 74% of the variance in H1*-H2*. However, analyses of individual speakers showed large differences in the strategies used to produce the same intended voice qualities. Thus, H1*-H2* can be predicted with good overall accuracy, but its relationship to phonatory characteristics appears to be speaker dependent.


Subjects
Glottis/physiology; Phonation; Phonetics; Speech Acoustics; Voice Quality; Biomechanical Phenomena; Female; Glottis/anatomy & histology; Humans; Laryngoscopy; Linear Models; Male; Speech Production Measurement; Time Factors; Video Recording
16.
J Acoust Soc Am ; 132(4): 2592-602, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23039452

ABSTRACT

This paper presents a large-scale study of subglottal resonances (SGRs) (the resonant frequencies of the tracheo-bronchial tree) and their relations to various acoustical and physiological characteristics of speakers. The paper presents data from a corpus of simultaneous microphone and accelerometer recordings of consonant-vowel-consonant (CVC) words embedded in a carrier phrase spoken by 25 male and 25 female native speakers of American English ranging in age from 18 to 24 yr. The corpus contains 17,500 utterances of 14 American English monophthongs, diphthongs, and the rhotic approximant [ɹ] in various CVC contexts. Only monophthongs are analyzed in this paper. Speaker height and age were also recorded. Findings include (1) normative data on the frequency distribution of SGRs for young adults, (2) the dependence of SGRs on height, (3) the lack of a correlation between SGRs and formants or the fundamental frequency, (4) a poor correlation of the first SGR with the second and third SGRs but a strong correlation between the second and third SGRs, and (5) a significant effect of vowel category on SGR frequencies, although this effect is smaller than the measurement standard deviations and therefore negligible for practical purposes.


Subjects
Glottis/physiology; Language; Phonation; Speech Acoustics; Voice Quality; Accelerometry; Adolescent; Age Factors; Biomechanical Phenomena; Body Height; Female; Humans; Male; Sex Factors; Sound Spectrography; Speech Production Measurement; Vibration; Young Adult
17.
Article in English | MEDLINE | ID: mdl-35531125

ABSTRACT

In this paper, a data augmentation method is proposed for depression detection from speech signals. Samples for data augmentation were created by changing the frame-width and the frame-shift parameters during the feature extraction process. Unlike other data augmentation methods (such as VTLP, pitch perturbation, or speed perturbation), the proposed method does not explicitly change acoustic parameters but rather the time-frequency resolution of frame-level features. The proposed method was evaluated using two different datasets, models, and input acoustic features. For the DAIC-WOZ (English) dataset, using the DepAudioNet model with mel-spectrograms as input, the proposed method yielded improvements of 5.97% (validation) and 25.13% (test) over a baseline without data augmentation. For the CONVERGE (Mandarin) dataset, using x-vector embeddings with a CNN backend and MFCCs as input features, the improvements were 9.32% (validation) and 12.99% (test). Furthermore, the proposed method outperformed commonly used data augmentation methods such as noise augmentation, VTLP, and speed and pitch perturbation. All improvements were statistically significant.
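A sketch of the augmentation idea using librosa; the specific (frame width, frame shift) pairs are illustrative, not the values used in the paper:

```python
import librosa

def time_freq_resolution_views(y, sr=16000):
    """Create augmented mel-spectrogram views of one utterance by varying the
    frame width (n_fft) and frame shift (hop_length), i.e., the time-frequency
    resolution, while leaving the waveform itself untouched."""
    configs = [(1024, 256), (512, 128), (2048, 512)]   # (n_fft, hop_length)
    views = []
    for n_fft, hop in configs:
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=64)
        views.append(librosa.power_to_db(mel))
    return views
```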

18.
Interspeech ; 2022: 2018-2022, 2022.
Article in English | MEDLINE | ID: mdl-36341466

ABSTRACT

Major Depressive Disorder (MDD) is a severe illness that affects millions of people, and it is critical to diagnose this disorder as early as possible. Detecting depression from voice signals can be of great help to physicians and can be done without any invasive procedure. Since relevant labelled data are scarce, we propose a modified Instance Discriminative Learning (IDL) method, an unsupervised pre-training technique, to extract augment-invariant and instance-spread-out embeddings. To learn augment-invariant embeddings, various data augmentation methods for speech are investigated, and time-masking yields the best performance. To learn instance-spread-out embeddings, we explore methods for sampling instances for a training batch (distinct speaker-based and random sampling). We find that distinct speaker-based sampling provides better performance than random sampling, and we hypothesize that this is because relevant speaker information is preserved in the embedding. Additionally, we propose a novel sampling strategy, Pseudo Instance-based Sampling (PIS), based on clustering algorithms, to enhance the spread-out characteristics of the embeddings. Experiments are conducted with DepAudioNet on the DAIC-WOZ (English) and CONVERGE (Mandarin) datasets, and statistically significant improvements, with p-values of 0.0015 and 0.05, respectively, are observed using PIS in the detection of MDD relative to the baseline without pre-training.

19.
Interspeech ; 2022: 3338-3342, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36341467

ABSTRACT

Preserving a patient's identity is a challenge for automatic, speech-based diagnosis of mental health disorders. In this paper, we address this issue by proposing adversarial disentanglement of depression characteristics and speaker identity. The model used for depression classification is trained in a speaker-identity-invariant manner by minimizing depression prediction loss and maximizing speaker prediction loss during training. The effectiveness of the proposed method is demonstrated on two datasets - DAIC-WOZ (English) and CONVERGE (Mandarin), with three feature sets (Mel-spectrograms, raw-audio signals, and the last-hidden-state of Wav2vec2.0), using a modified DepAudioNet model. With adversarial training, depression classification improves for every feature when compared to the baseline. Wav2vec2.0 features with adversarial learning resulted in the best performance (F1-score of 69.2% for DAIC-WOZ and 91.5% for CONVERGE). Analysis of the class-separability measure (J-ratio) of the hidden states of the DepAudioNet model shows that when adversarial learning is applied, the backend model loses some speaker-discriminability while it improves depression-discriminability. These results indicate that there are some components of speaker identity that may not be useful for depression detection and minimizing their effects provides a more accurate diagnosis of the underlying disorder and can safeguard a speaker's identity.
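Adversarial speaker-invariant training of this kind is commonly implemented with a gradient reversal layer, which minimizes the depression loss while maximizing the speaker loss in a single backward pass. A generic PyTorch sketch; the module names and scale lam are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward
    pass, so the encoder learns to *hurt* the speaker classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_step_losses(encoder, dep_head, spk_head, x, y_dep, y_spk, lam=1.0):
    z = encoder(x)                                     # shared representation
    dep_loss = F.cross_entropy(dep_head(z), y_dep)     # minimized normally
    spk_loss = F.cross_entropy(spk_head(GradReverse.apply(z, lam)), y_spk)
    return dep_loss + spk_loss   # reversal makes the encoder maximize spk_loss
```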

20.
J Acoust Soc Am ; 129(4): 2144-62, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21476670

ABSTRACT

In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.
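The cost function described above (a formant distance plus regularization and continuity terms) can be sketched as follows; the weights and the normalized formant error are illustrative choices, with the chain-matrix synthesis assumed given:

```python
import numpy as np

def inversion_cost(art_traj, target_formants, synth_formants,
                   w_reg=0.01, w_cont=0.1):
    """art_traj: (T, P) Maeda-style articulatory parameters per frame.
    target_formants: (T, 3) measured F1-F3; synth_formants: a stand-in for
    the chain-matrix vocal-tract simulation mapping art_traj -> (T, 3)."""
    F = synth_formants(art_traj)
    formant_term = np.mean(((F - target_formants) / target_formants) ** 2)
    reg_term = np.mean(art_traj ** 2)                     # stay near neutral
    cont_term = np.mean(np.diff(art_traj, axis=0) ** 2)   # smooth trajectories
    return formant_term + w_reg * reg_term + w_cont * cont_term
```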


Subjects
Models, Biological; Speech Acoustics; Speech Intelligibility/physiology; Speech Recognition Software; Speech/physiology; Acoustics; Calibration; Female; Humans; Male; Markov Chains; Vocal Cords/physiology