Results 1 - 14 of 14
1.
J Acoust Soc Am ; 156(2): 1380-1390, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39196104

ABSTRACT

For most of his illustrious career, Ken Stevens focused on examining and documenting the rich detail about vocal tract changes that underlies the acoustic signal of speech and is available to listeners. Current approaches to speech inversion take advantage of this rich detail to recover information about articulatory movement. Our previous speech inversion work focused on movements of the tongue and lips, for which "ground truth" is readily available. In this study, we describe the acquisition and validation of ground-truth articulatory data on velopharyngeal port constriction, using both the well-established measure of nasometry and a novel technique, high-speed nasopharyngoscopy. Nasometry measures the acoustic output of the nasal and oral cavities to derive a measure called nasalance. High-speed nasopharyngoscopy captures images of the nasopharyngeal region and can resolve velar motion during speech. By comparing simultaneously collected data from both acquisition modalities, we show that nasalance is a sufficiently sensitive measure to serve as ground truth for our speech inversion system. Further, a speech inversion system trained on nasalance can recover the known patterns of velopharyngeal port constriction shown by American English speakers. Our findings match well with Stevens' own studies of the acoustics of nasal consonants.
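As background for the nasometry measure used here, nasalance is conventionally defined as nasal acoustic energy divided by combined nasal-plus-oral energy, expressed as a percentage. A minimal frame-wise sketch in Python (channel names and frame sizes are illustrative; commercial nasometers also band-pass filter both channels, which is omitted here):

```python
import numpy as np

def nasalance(nasal: np.ndarray, oral: np.ndarray,
              frame_len: int = 480, hop: int = 160) -> np.ndarray:
    """Frame-wise nasalance (%) from simultaneously recorded
    nasal- and oral-microphone signals. Illustrative sketch only."""
    scores = []
    n_samples = min(len(nasal), len(oral))
    for start in range(0, n_samples - frame_len, hop):
        n_energy = np.sum(nasal[start:start + frame_len] ** 2)
        o_energy = np.sum(oral[start:start + frame_len] ** 2)
        # Nasalance = nasal energy / (nasal + oral energy), in percent
        scores.append(100.0 * n_energy / (n_energy + o_energy + 1e-12))
    return np.array(scores)
```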


Subjects
Speech Acoustics, Speech Production Measurement, Humans, Male, Speech Production Measurement/methods, Adult, Female, Young Adult, Voice Quality, Pathologic Constriction, Speech/physiology, Endoscopy/methods, Endoscopy/instrumentation
2.
Psychiatr Q ; 94(2): 221-231, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37145257

ABSTRACT

Although digital health solutions are increasingly popular in clinical psychiatry, one application that has not been fully explored is the utilization of survey technology to monitor patients outside of the clinic. Supplementing routine care with digital information collected in the "clinical whitespace" between visits could improve care for patients with severe mental illness. This study evaluated the feasibility and validity of using online self-report questionnaires to supplement in-person clinical evaluations in persons with and without psychiatric diagnoses. We performed a rigorous in-person clinical diagnostic and assessment battery in 54 participants with schizophrenia (N = 23), depressive disorder (N = 14), and healthy controls (N = 17) using standard assessments for depressive and psychotic symptomatology. Participants were then asked to complete brief online assessments of depressive (Quick Inventory of Depressive Symptomatology) and psychotic (Community Assessment of Psychic Experiences) symptoms outside of the clinic for comparison with the ground-truth in-person assessments. We found that online self-report ratings of severity were significantly correlated with the clinical assessments for depression (two assessments used: R = 0.63, p < 0.001; R = 0.73, p < 0.001) and psychosis (R = 0.62, p < 0.001). Our results demonstrate the feasibility and validity of collecting psychiatric symptom ratings through online surveys. Surveillance of this kind may be especially useful in detecting acute mental health crises between patient visits and can generally contribute to more comprehensive psychiatric treatment.
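The reported R values are plain Pearson correlations between paired severity scores; a minimal sketch of how such a comparison can be computed (the score arrays below are made-up placeholders, not study data):

```python
from scipy.stats import pearsonr

# Hypothetical paired severity ratings: in-person clinical assessment vs.
# the corresponding online self-report (e.g., QIDS) for the same participants.
clinic_scores = [14, 3, 21, 8, 11, 17, 5, 9]
online_scores = [12, 5, 19, 10, 9, 18, 4, 11]

r, p = pearsonr(clinic_scores, online_scores)
print(f"R = {r:.2f}, p = {p:.3g}")
```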


Subjects
Depression, Health Surveys, Internet, Psychotic Disorders, Self Report, Mental Health/standards, Internet-Based Intervention, Health Surveys/methods, Health Surveys/standards, Reproducibility of Results, Depression/diagnosis, Depression/psychology, Humans, Male, Female, Young Adult, Adult, Schizophrenia/diagnosis, Psychotic Disorders/diagnosis, Psychotic Disorders/psychology
3.
J Acoust Soc Am ; 146(1): 316, 2019 07.
Article in English | MEDLINE | ID: mdl-31370597

ABSTRACT

Speech inversion is a well-known ill-posed problem, and speaker differences typically make it even harder. Normalizing speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker-independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique that transforms the acoustic features of different speakers into a target speaker's acoustic space such that speaker-specific details are minimized. The speaker-normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh-point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs compared to a system without speaker normalization. To determine the efficacy of the method across datasets, cross-speaker evaluations were performed on speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results show that the VTLN approach improves performance even across datasets.
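For a concrete picture of the mapping being learned, the sketch below shows a generic feed-forward network from contextualized MFCC vectors to six tract-variable trajectories in PyTorch. The layer sizes, context window, and activation are placeholders, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

CONTEXT_FRAMES, N_MFCC, N_TVS = 17, 13, 6   # assumed dimensions

# Generic feed-forward speech-inversion network: speaker-normalized,
# contextualized MFCCs in, six tract-variable (TV) trajectories out.
model = nn.Sequential(
    nn.Linear(CONTEXT_FRAMES * N_MFCC, 512),
    nn.Tanh(),
    nn.Linear(512, 512),
    nn.Tanh(),
    nn.Linear(512, N_TVS),
)

features = torch.randn(32, CONTEXT_FRAMES * N_MFCC)  # a batch of input frames
tv_estimates = model(features)                       # shape: (32, 6)
```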

4.
J Acoust Soc Am ; 136(4): EL268-74, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25324109

ABSTRACT

This study investigated whether recognition of time-compressed speech predicts recognition of natural fast-rate speech, and whether this relationship is influenced by listener age. High- and low-context sentences were presented to younger and older normal-hearing adults at a normal speech rate, a naturally fast speech rate, and a fast rate implemented by time-compressing the normal-rate sentences. Recognition of time-compressed sentences overestimated recognition of natural fast sentences for both groups, especially for older listeners. The findings suggest that older listeners are at a much greater disadvantage when listening to natural fast speech than would be predicted by their recognition performance for time-compressed speech.
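Time compression of the kind described is commonly implemented with a phase-vocoder time stretch that shortens duration without shifting pitch; a minimal sketch using librosa (the file name and compression factor are illustrative, not the stimulus-preparation procedure used in the study):

```python
import librosa

# Compress a normal-rate sentence to roughly 65% of its original duration.
# rate > 1 shortens the signal; pitch is preserved by the phase vocoder.
y, sr = librosa.load("normal_rate_sentence.wav", sr=None)  # hypothetical file
y_fast = librosa.effects.time_stretch(y, rate=1.54)
```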


Subjects
Aging/psychology, Periodicity, Recognition (Psychology), Speech Perception, Acoustic Stimulation, Age Factors, Aged, Speech Audiometry, Cues (Psychology), Female, Humans, Male, Noise/adverse effects, Perceptual Masking, Time Factors, Young Adult
5.
J Acoust Soc Am ; 133(6): EL439-45, 2013 Jun.
Article in English | MEDLINE | ID: mdl-23742437

ABSTRACT

Magnetic resonance imaging has been widely used in speech production research. Often only one image stack (sagittal, axial, or coronal) is used for vocal tract modeling. As a result, complementary information from other available stacks is not utilized. To overcome this, a recently developed super-resolution technique was applied to integrate three orthogonal low-resolution stacks into one isotropic volume. The results on vowels show that the super-resolution volume produces better vocal tract visualization than any of the low-resolution stacks. Its derived area functions generally produce formant predictions closer to the ground truth, particularly for those formants sensitive to area perturbations at constrictions.
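To make the link between area functions and formant predictions concrete, the sketch below estimates resonances of a lossless concatenated-tube model (glottis closed, lips open, no wall losses or radiation load) from an area function. This is a textbook-style illustration under stated simplifications, not the acoustic simulation used in the paper.

```python
import numpy as np

C = 350.0    # speed of sound in warm, humid air (m/s); an assumption
RHO = 1.2    # air density (kg/m^3)

def formants_from_area_function(areas_cm2, section_len_cm, fmax=5000.0, df=1.0):
    """Resonances of a lossless tube chain: frequencies where the D element
    of the glottis-to-lips chain matrix crosses zero (closed-open tube)."""
    areas = np.asarray(areas_cm2) * 1e-4      # cm^2 -> m^2
    seg_len = section_len_cm * 1e-2           # cm  -> m
    freqs = np.arange(df, fmax, df)
    d_vals = np.empty_like(freqs)
    for idx, f in enumerate(freqs):
        kl = 2 * np.pi * f / C * seg_len      # wavenumber x section length
        chain = np.eye(2, dtype=complex)
        for area in areas:                    # glottis -> lips
            z0 = RHO * C / area               # characteristic impedance
            section = np.array([[np.cos(kl), 1j * z0 * np.sin(kl)],
                                [1j * np.sin(kl) / z0, np.cos(kl)]])
            chain = chain @ section
        d_vals[idx] = chain[1, 1].real        # real for a lossless chain
    crossings = np.where(np.diff(np.sign(d_vals)) != 0)[0]
    return freqs[crossings]

# Uniform 17.5 cm tube (35 x 0.5 cm sections): expect ~500, 1500, 2500 Hz, ...
print(formants_from_area_function([5.0] * 35, 0.5))
```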


Subjects
Computer Simulation, Epiglottis/anatomy & histology, Image Enhancement/methods, Computer-Assisted Image Processing/methods, Three-Dimensional Imaging/methods, Larynx/anatomy & histology, Lip/anatomy & histology, Magnetic Resonance Imaging/methods, Pharynx/anatomy & histology, Phonation/physiology, Phonetics, Algorithms, Artifacts, Epiglottis/physiology, Humans, Larynx/physiology, Lip/physiology, Pharynx/physiology, Sensitivity and Specificity, Software, Sound Spectrography, Speech Acoustics
6.
J Acoust Soc Am ; 131(3): 2270-87, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22423722

ABSTRACT

Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable time functions improved gesture recognition performance. In the second stage, the gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.
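As a concrete picture of what "gestural activations and their associated dynamic parameters" look like, the sketch below represents a fragment of a gestural score as activation intervals with targets and stiffness values. Field names and numbers are illustrative assumptions, not TADA's internal representation.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    """One gestural activation interval with its dynamic parameters."""
    tract_variable: str   # e.g., "LA" (lip aperture), "TTCD" (tongue-tip constriction degree)
    onset_ms: float       # start of activation
    offset_ms: float      # end of activation
    target: float         # constriction target for the tract variable
    stiffness: float      # sets the speed of the gesture's dynamics

# Toy score fragment: a bilabial closing gesture with an accompanying glottal gesture
score = [
    Gesture("LA", onset_ms=50.0, offset_ms=180.0, target=-2.0, stiffness=8.0),
    Gesture("GLO", onset_ms=40.0, offset_ms=160.0, target=0.4, stiffness=6.0),
]
```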


Subjects
Gestures, Speech Perception/physiology, Speech Recognition Software, Speech/physiology, Bayes Theorem, Humans, Phonetics, Speech Acoustics, Vocabulary
7.
J Acoust Soc Am ; 132(6): 3980-9, 2012 Dec.
Article in English | MEDLINE | ID: mdl-23231127

ABSTRACT

Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation, and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis, landmark-based time-warping architecture for gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative time-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically referenced dynamic time-warping procedures and provides reliable gestural annotation for speech datasets.
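The acoustic-distance minimization at the heart of the procedure builds on dynamic time warping between feature sequences of the original and synthesized utterances. A minimal, generic DTW cost sketch is shown below; this is the conventional alignment step, not the paper's full iterative analysis-by-synthesis loop.

```python
import numpy as np

def dtw_cost(ref: np.ndarray, test: np.ndarray) -> float:
    """Accumulated dynamic-time-warping cost between two feature
    sequences of shape (frames, dims), e.g. MFCC matrices."""
    n, m = len(ref), len(test)
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # skip in ref
                                                 acc[i, j - 1],      # skip in test
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m]
```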


Subjects
Acoustics, Gestures, Glottis/physiology, Mouth/physiology, Speech Acoustics, Voice Quality, Biomechanical Phenomena, Female, Humans, Male, Theoretical Models, Computer-Assisted Signal Processing, Sound Spectrography, Speech Production Measurement/methods, Time Factors
8.
J Acoust Soc Am ; 123(2): 1154-68, 2008 Feb.
Article in English | MEDLINE | ID: mdl-18247915

ABSTRACT

A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a probabilistic decision on that feature, and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner-class pronunciation models for isolated word recognition with a known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) an HMM-based system using MFCCs.
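The probabilistic decisions come from per-feature binary classifiers operating on the APs. The sketch below shows one such classifier for a single manner feature using a generic kernel classifier on stand-in data; the classifier type, feature dimensionality, and labels are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
ap_frames = rng.normal(size=(2000, 12))          # stand-in acoustic parameters (APs)
is_sonorant = (ap_frames[:, 0] > 0).astype(int)  # stand-in [sonorant] labels

# Binary classifier for one manner feature, returning per-frame probabilities
clf = SVC(probability=True).fit(ap_frames, is_sonorant)
p_sonorant = clf.predict_proba(ap_frames[:5])[:, 1]
```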


Subjects
Theoretical Models, Phonetics, Speech Recognition Software, Algorithms, Cues (Psychology), Markov Chains, Probability, Computer-Assisted Signal Processing
9.
J Acoust Soc Am ; 123(6): 4466-81, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18537397

ABSTRACT

Speakers of rhotic dialects of North American English show a range of different tongue configurations for /r/. These variants produce acoustic profiles that are indistinguishable for the first three formants [Delattre, P., and Freeman, D. C., (1968). "A dialect study of American English r's by x-ray motion picture," Linguistics 44, 28-69; Westbury, J. R. et al. (1998), "Differences among speakers in lingual articulation for American English /r/," Speech Commun. 26, 203-206]. It is puzzling why this should be so, given the very different vocal tract configurations involved. In this paper, two subjects whose productions of "retroflex" /r/ and "bunched" /r/ show similar patterns of F1-F3 but very different spacing between F4 and F5 are contrasted. Using finite element analysis and area functions based on magnetic resonance images of the vocal tract for sustained productions, the results of computer vocal tract models are compared to actual speech recordings. In particular, formant-cavity affiliations are explored using formant sensitivity functions and vocal tract simple-tube models. The difference in F4/F5 patterns between the subjects is confirmed for several additional subjects with retroflex and bunched vocal tract configurations. The results suggest that the F4/F5 differences between the variants can be largely explained by differences in whether the long cavity behind the palatal constriction acts as a half- or a quarter-wavelength resonator.
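The half- versus quarter-wavelength contrast invoked in the conclusion corresponds to the standard tube-resonance formulas, with c the speed of sound and L the effective cavity length:

```latex
\[
\begin{aligned}
  \text{half-wavelength resonator (both ends alike):}\quad
    f_n &= \frac{n c}{2L}, \qquad n = 1, 2, \dots\\[4pt]
  \text{quarter-wavelength resonator (one closed, one open end):}\quad
    f_n &= \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, \dots
\end{aligned}
\]
```

For a cavity of the same length, the lowest half-wavelength resonance lies an octave above the lowest quarter-wavelength resonance, which is consistent with the back-cavity behavior proposed here as the source of the F4/F5 spacing differences.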


Subjects
Language, Magnetic Resonance Imaging/methods, Phonation, Speech Acoustics, Voice/physiology, Acoustics, England, Humans, Larynx/physiology, Speech Articulation Tests, Speech Production Measurement, Tongue/physiology, United States, Vocal Cords/physiology
10.
J Acoust Soc Am ; 121(6): 3858-73, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17552733

ABSTRACT

In this study, vocal tract area functions for one American English speaker, recorded using magnetic resonance imaging, were used to simulate and analyze the acoustics of vowel nasalization. Computer vocal tract models and susceptance plots were used to study the three most important sources of acoustic variability involved in the production of nasalized vowels: velar coupling area, asymmetry of the nasal passages, and the sinus cavities. Analysis of the susceptance plots of the pharyngeal and oral cavities, -(B_p + B_o), and of the nasal cavity, B_n, helped in understanding the movement of poles and zeros with varying coupling areas. Simulations using two nasal passages clearly showed the introduction of extra pole-zero pairs due to the asymmetry between the passages. Simulations including the maxillary and sphenoidal sinuses showed that each sinus can potentially introduce one pole-zero pair into the spectrum; further, the right maxillary sinus introduced a pole-zero pair at the lowest frequency. The effective frequencies of these sinus-related poles and zeros in the sum of the oral and nasal cavity outputs change with the configuration of the oral cavity, which may vary with the coupling area or with the vowel being articulated.
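In this style of susceptance analysis, the poles of the coupled oral-nasal system occur at frequencies where the susceptance looking into the nasal branch cancels the combined pharyngeal and oral susceptance, which is why the plots pair -(B_p + B_o) with B_n: the intersections of the two curves mark the pole frequencies.

```latex
\[
B_p(f) + B_o(f) + B_n(f) = 0
\quad\Longleftrightarrow\quad
B_n(f) = -\bigl(B_p(f) + B_o(f)\bigr)
\]
```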


Subjects
Hearing/physiology, Language, Magnetic Resonance Imaging/methods, Nose/physiology, Speech, Adult, Computer Simulation, Glottis/physiology, Humans, Larynx/anatomy & histology, Larynx/physiology, Male, Paranasal Sinuses/anatomy & histology, Paranasal Sinuses/physiology, Sound, Vocal Cords/physiology
11.
J Acoust Soc Am ; 121(6): 3886-98, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17552735

ABSTRACT

In this paper we present a model called the Modified Phase-Opponency (MPO) model for single-channel speech enhancement when the speech is corrupted by additive noise. The MPO model is based on the auditory PO model, proposed for detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using a cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. The MPO model alters the components of the PO model such that the basic functionality of the PO model is maintained but the properties of the model can be analyzed and modified independently. The MPO-based speech enhancement scheme does not need to estimate the noise characteristics nor does it assume that the noise satisfies any statistical model. The MPO technique leads to the lowest value of the LPC-based objective measures and the highest value of the perceptual evaluation of speech quality measure compared to other methods when the speech signals are corrupted by fluctuating noise. Combining the MPO speech enhancement technique with our aperiodicity, periodicity, and pitch detector further improves its performance.


Subjects
Speech Audiometry, Speech Intelligibility, Speech Perception, Auditory Pathways/physiology, Cues (Psychology), Factual Databases, Humans, Mathematics, Biological Models
12.
IEEE J Sel Top Signal Process ; 4(6): 1027-1045, 2010 Sep 13.
Article in English | MEDLINE | ID: mdl-23326297

ABSTRACT

Many studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, it has to be estimated from the acoustic signal in a process usually termed "speech inversion." This study proposes and compares various machine learning strategies for speech inversion: trajectory mixture density networks (TMDN), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural networks (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
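Of the strategies compared, support vector regression is the easiest to sketch compactly. The example below shows a generic multi-output SVR mapping acoustic feature vectors to tract variables on synthetic data; dimensions, kernel, and hyperparameters are placeholders, not the configurations evaluated in the paper.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(500, 39))    # stand-in acoustic feature vectors
tract_vars = rng.normal(size=(500, 8))   # stand-in tract-variable targets

# One epsilon-SVR per output dimension, wrapped for multi-output regression
inverter = MultiOutputRegressor(SVR(kernel="rbf", C=1.0, epsilon=0.1))
inverter.fit(acoustic, tract_vars)
tv_pred = inverter.predict(acoustic[:10])   # estimated tract variables
```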

13.
J Acoust Soc Am ; 115(3): 1274-80, 2004 Mar.
Article in English | MEDLINE | ID: mdl-15058349

ABSTRACT

The production of the lateral sounds involves airflow paths around the tongue produced by the laterally inward movement of the tongue toward the midsagittal plane. If contact is made with the palate, a closure is formed in the flow path along the midsagittal line. The effects of the lateral channels on the sound spectrum are not clear. In this study, a vocal-tract model with parallel lateral channels and a supralingual cavity was developed. Analysis shows that the lateral channels with dimensions derived from magnetic resonance images of an American English /l/ are able to produce a pole-zero pair in the frequency range of 2-5 kHz. This pole-zero pair, together with an additional pole-zero pair due to the supralingual cavity, results in a low-amplitude and relatively flat spectral shape in the F3-F5 frequency region of the /l/ sound spectrum.


Subjects
Biological Models, Phonation/physiology, Phonetics, Computer Simulation, Humans, Male, Sound Spectrography, Speech Production Measurement
14.
J Acoust Soc Am ; 115(3): 1296-305, 2004 Mar.
Article in English | MEDLINE | ID: mdl-15058352

ABSTRACT

Studies by Shannon et al. [Science 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal even when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information for use in the front end of an automatic speech recognition system. The parameters targeted include energy onsets and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find the abrupt acoustic events that signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal and cepstral parameters results in an accuracy of 74.8%.
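The landmark-style temporal parameters described here revolve around abrupt changes in frame energy. The sketch below flags energy onsets and offsets with a simple derivative-plus-threshold rule; it is a generic illustration, not the adaptive algorithm or parameter set used in the paper.

```python
import numpy as np

def energy_landmarks(signal: np.ndarray, sr: int,
                     frame_ms: float = 10.0, k: float = 2.0):
    """Flag abrupt energy onsets/offsets as candidate landmarks."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    energy = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    log_e = 10.0 * np.log10(energy + 1e-12)      # frame energy in dB
    delta = np.diff(log_e)                       # dB change per frame
    thresh = k * np.std(delta)                   # threshold adapts to the signal
    onsets = np.where(delta > thresh)[0]
    offsets = np.where(delta < -thresh)[0]
    return onsets, offsets
```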


Subjects
Automatism, Speech Perception/physiology, Algorithms, Auditory Cortex, Humans, Biological Models, Phonetics, Sound Spectrography, Speech Acoustics, Time Factors