ABSTRACT
A key barrier to making phonetic studies scalable and replicable is the need to rely on subjective, manual annotation. To help meet this challenge, a machine learning algorithm was developed for automatic measurement of a widely used phonetic measure: vowel duration. Manually annotated data were used to train a model that takes as input an arbitrary-length segment of the acoustic signal containing a single vowel that is preceded and followed by consonants, and outputs the duration of the vowel. The model is based on the structured prediction framework. The input signal and a hypothesized pair of vowel onset and offset times are mapped to an abstract vector space by a set of acoustic feature functions. The learning algorithm is trained in this space to minimize the difference in expectations between predicted and manually measured vowel durations. The trained model can then automatically estimate vowel durations without phonetic or orthographic transcription. Results comparing the model to three sets of manually annotated data suggest it outperformed the current gold standard for duration measurement, a hidden Markov model-based forced aligner (which requires orthographic or phonetic transcription as input).
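To make the structured prediction idea concrete, the following is a minimal sketch of the decoding step the abstract describes: each hypothesized (onset, offset) pair is scored through a feature map, and the best-scoring pair determines the vowel duration. All names here (phi, predict_vowel_span, the toy energy features) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def phi(frames, onset, offset):
    """Toy acoustic feature map for a hypothesized vowel span.

    Uses mean frame energy inside vs. outside the span plus the span
    length; a real system would use richer acoustic feature functions.
    """
    inside = frames[onset:offset].mean()
    outside = np.concatenate([frames[:onset], frames[offset:]]).mean()
    return np.array([inside, outside, float(offset - onset)])

def predict_vowel_span(frames, w):
    """Exhaustive search for the (onset, offset) pair maximizing w . phi."""
    n = len(frames)
    best, best_score = (0, 1), -np.inf
    for onset in range(n - 1):
        for offset in range(onset + 1, n):
            score = w @ phi(frames, onset, offset)
            if score > best_score:
                best, best_score = (onset, offset), score
    return best  # frame indices; duration = (offset - onset) * frame shift
```

In the paper's framework the weight vector would be learned so that predicted durations match manually measured ones; here it is simply assumed to be given.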
Subjects
Phonetics, Acoustics, Algorithms, Machine Learning, Speech Acoustics, Speech Perception

ABSTRACT
Interactive models of language production predict that it should be possible to observe long-distance interactions: effects that arise at one level of processing influence multiple subsequent stages of representation and processing. We examine the hypothesis that disruptions arising in nonform-based levels of planning, specifically lexical selection, should modulate articulatory processing. A novel automatic phonetic analysis method was used to examine productions in a paradigm yielding both general disruptions to formulation processes and, more specifically, overt errors during lexical selection. This analysis method allowed us to examine articulatory disruptions at multiple levels of analysis, from whole words to individual segments. Baseline performance by young adults was contrasted with young speakers' performance under time pressure (which previous work has argued increases interaction between planning and articulation) and with performance by older adults (who may have difficulties inhibiting nontarget representations, leading to heightened interactive effects). The results revealed the presence of interactive effects. Our new analysis techniques revealed that these effects were strongest in initial portions of responses, suggesting that speech is initiated as soon as the first segment has been planned. Interactive effects did not increase under response pressure, suggesting that interaction between planning and articulation is relatively fixed. Unexpectedly, lexical selection disruptions appeared to yield some degree of facilitation in articulatory processing (possibly reflecting semantic facilitation of target retrieval), and older adults showed weaker, not stronger, interactive effects (possibly reflecting weakened connections between lexical and form-level representations).
Subjects
Phonetics, Psycholinguistics, Speech, Adolescent, Aged, Aging/psychology, Association, Female, Humans, Psychological Inhibition, Male, Middle Aged, Neural Networks (Computer), Visual Pattern Recognition, Reading, Young Adult

ABSTRACT
BACKGROUND: Voice analysis has a limited role in the day-to-day voice clinic. We developed objective measurements of vocal fold (VF) glottal closure insufficiency (GCI) during phonation. METHODS: We examined 18 subjects with no history of voice impairment and 20 patients with unilateral VF paralysis before and after injection medialization laryngoplasty. Acoustic voice measures were extracted. We measured the settling time, slope, and area under the fundamental frequency curve from phonation onset to its settling time. RESULTS: The measured parameters, settling time, slope, and area under the curve, correlated with the traditional acoustic voice assessments and with clinical findings both before treatment and after injection medialization laryngoplasty. CONCLUSION: We found that the fundamental frequency curve has several typical contours that correspond to different glottal closure conditions. We proposed a new set of parameters that captures the contour type, and showed that they could be used to quantitatively assess individuals with GCI.
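As a rough illustration of how such contour parameters might be computed from a fundamental frequency (F0) track, here is a minimal sketch. The settling criterion (F0 staying within 2% of its final value) and all function names are hypothetical assumptions; the abstract does not specify the exact definitions used.

```python
import numpy as np

def contour_parameters(f0, t, tol=0.02):
    """f0: F0 track in Hz from phonation onset; t: frame times in seconds.

    Returns (settling_time, slope, area): time until F0 stays within a
    +/-tol band around its final value, the average F0 slope over that
    interval, and the area under the F0 curve up to the settling point.
    """
    steady = f0[-1]                      # assume the final value is steady state
    within = np.abs(f0 - steady) <= tol * steady
    idx = 0                              # first frame after which F0 stays in the band
    for i in range(len(f0)):
        if not within[i]:
            idx = i + 1
    settling_time = t[idx] - t[0]
    slope = (f0[idx] - f0[0]) / settling_time if settling_time > 0 else 0.0
    area = np.trapz(f0[: idx + 1], t[: idx + 1])
    return settling_time, slope, area
```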
Subjects
Laryngoplasty, Phonation, Software, Speech Acoustics, Vocal Cord Paralysis/therapy, Voice Quality, Adult, Aged, Aged 80 and over, Durapatite, Female, Gels, Humans, Male, Middle Aged, Stroboscopy

ABSTRACT
We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs are used as feature functions for the structured model. The overall model is trained with a structured loss function that can be tailored to the given segmentation task. We demonstrate the effectiveness of our method by applying it to two simple tasks commonly used in phonetic studies: word segmentation and voice onset time segmentation. Results suggest the proposed model is superior to previous methods, obtaining state-of-the-art results on the tested datasets.
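A minimal PyTorch sketch of the two-module design follows, assuming an additive boundary-scoring rule and a structured hinge loss; the module sizes, names, and cost function are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RNNSegmenter(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.boundary = nn.Linear(2 * hidden_dim, 1)  # per-frame score

    def frame_scores(self, x):                # x: (1, T, input_dim)
        h, _ = self.rnn(x)                    # (1, T, 2 * hidden_dim)
        return self.boundary(h).squeeze(-1)   # (1, T): RNN outputs as features

    def segmentation_score(self, x, boundaries):
        """Score a segmentation as the sum of its boundary-frame scores."""
        return self.frame_scores(x)[0, boundaries].sum()

def structured_hinge(model, x, gold, predicted):
    """Structured loss: the gold segmentation should outscore the
    predicted one by a margin given by a task-specific cost."""
    cost = float(len(set(predicted) ^ set(gold)))
    loss = (cost + model.segmentation_score(x, predicted)
            - model.segmentation_score(x, gold))
    return torch.clamp(loss, min=0.0)
```

Decoding (finding the highest-scoring segmentation at test time) would require a search over candidate boundary sets, e.g. by dynamic programming; that step is omitted here.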
ABSTRACT
Vowel duration is one of the most widely used measures in phonetic studies, but its measurement has thus far been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for automatic, accurate measurement of vowel duration, where the input to the algorithm is a speech segment containing a single vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we evaluate two deep-network architectures, a convolutional neural network (CNN) and a deep belief network (DBN), and compare their accuracy to that of an HMM-based forced aligner. Results suggest that the CNN outperforms the DBN, and that the CNN and the HMM-based forced aligner are comparable in accuracy, but neither yielded the same predictions as models fit to manually annotated data.
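To illustrate the frame-level approach, a minimal sketch follows: a small CNN assigns each spectrogram frame a vowel/non-vowel score, and the duration is read off the longest run of vowel frames. The layer sizes, names, and post-processing rule are assumptions for illustration, not the paper's exact models.

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """1-D CNN over time; mel bins are treated as input channels."""
    def __init__(self, n_mels=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=1),   # per-frame vowel logit
        )

    def forward(self, spec):                   # spec: (batch, n_mels, T)
        return self.net(spec).squeeze(1)       # (batch, T) frame logits

def vowel_duration(frame_logits, frame_shift=0.01):
    """Duration of the longest run of frames classified as vowel, in s."""
    is_vowel = (frame_logits.sigmoid() > 0.5).tolist()
    best = run = 0
    for v in is_vowel:
        run = run + 1 if v else 0
        best = max(best, run)
    return best * frame_shift
```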