ABSTRACT
The current project undertakes a kinematic examination of vertical larynx actions and intergestural timing stability within multi-gesture complex segments, such as ejectives and implosives, which may possess specific temporal goals critical to their articulatory realization. Using real-time MRI (rtMRI) speech production data from Hausa non-pulmonic and pulmonic consonants, this study illuminates the timing between oral constriction and vertical larynx actions within segments and the role this intergestural timing plays in realizing phonological contrasts and processes in varying prosodic contexts. Results suggest that vertical larynx actions have greater magnitude in the production of ejectives than in their pulmonic counterparts, whereas implosives and pulmonic consonants are differentiated not by vertical larynx magnitude but by the intergestural timing patterns between their oral and vertical larynx gestures. Moreover, intergestural timing stability/variability between oral and non-oral (vertical larynx) actions differs among ejectives, implosives, and pulmonic consonants, with ejectives having the most stable temporal lags, followed by implosives and then pulmonic consonants. Lastly, the findings show how contrastive linguistic 'molecules' - here, segment-sized phonological complexes comprising multiple gestures - interact with phrasal context: phrasal position variably shapes the temporal organization among the participating gestures while preserving the stability of their relative timing within the segment.
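One way to quantify the lag and stability patterns described above is to compare, per segment type, the interval between the onsets of the oral constriction and the vertical larynx action, along with that interval's coefficient of variation across tokens. The sketch below is a minimal illustration of this kind of computation; the onset times are placeholder values, not measurements from the Hausa rtMRI data.

```python
# Hypothetical sketch: quantify intergestural lag and its stability across tokens.
# Assumes gesture onset times (in ms) have already been extracted from rtMRI
# kinematics; the arrays below are illustrative placeholder values.
import numpy as np

# Onset of the oral constriction and of the vertical larynx action per token.
tokens = {
    "ejective":  {"oral": np.array([812, 640, 905, 733]),
                  "larynx": np.array([842, 668, 934, 760])},
    "implosive": {"oral": np.array([510, 622, 480, 731]),
                  "larynx": np.array([528, 655, 492, 770])},
    "pulmonic":  {"oral": np.array([300, 415, 388, 502]),
                  "larynx": np.array([350, 430, 460, 515])},
}

for seg, onsets in tokens.items():
    lag = onsets["larynx"] - onsets["oral"]      # oral-to-larynx lag per token
    mean_lag = lag.mean()
    # Coefficient of variation of the lag: one common index of timing stability
    cv = lag.std(ddof=1) / abs(mean_lag)
    print(f"{seg:10s} mean lag = {mean_lag:6.1f} ms, lag CV = {cv:.2f}")
```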
ABSTRACT
The glossectomy procedure, involving surgical resection of cancerous lingual tissue, has long been observed to affect speech production. This study aims to quantitatively index and compare the complexity of vocal tract shaping due to lingual movement between individuals who have undergone glossectomy and typical speakers, using real-time magnetic resonance imaging data and Principal Component Analysis. The data reveal that (i) the type of glossectomy undergone largely predicts the patterns of vocal tract shaping observed, (ii) gross forward and backward motion of the tongue body accounts for more change in vocal tract shaping than do subtler movements of the tongue (e.g., tongue tip constrictions) in the patient data, and (iii) fewer vocal tract shaping components are required to account for the patients' speech data than for the typical speech data, suggesting that the patient data at hand exhibit less complex vocal tract shaping in the midsagittal plane than do the data from the typical speakers observed.
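As a rough illustration of the analysis described above, the sketch below runs Principal Component Analysis over a matrix of vocal tract shaping features and counts how many components are needed to reach a variance threshold, which is one way to index shaping complexity. The random matrix and the 90% threshold are assumptions standing in for real rtMRI-derived features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# frames x gridline apertures (placeholder for real rtMRI-derived features)
vt_shapes = rng.normal(size=(500, 80))

pca = PCA().fit(vt_shapes)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Number of components required to explain 90% of the shaping variance
n_components_90 = int(np.searchsorted(cum_var, 0.90) + 1)
print(f"{n_components_90} components explain 90% of vocal tract shaping variance")
```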
Subjects
Glossectomy, Tongue Neoplasms, Humans, Principal Component Analysis, Speech, Tongue/diagnostic imaging, Tongue/surgery, Tongue Neoplasms/diagnostic imaging, Tongue Neoplasms/surgery
ABSTRACT
Although substantial variability is observed in the articulatory implementation of the constriction gestures involved in /ɹ/ production, studies of articulatory-acoustic relations in /ɹ/ have largely ignored the potential for subtle variation in the implementation of these gestures to affect salient acoustic dimensions. This study examines how variation in the articulation of American English /ɹ/ influences the relative sensitivity of the third formant to variation in palatal, pharyngeal, and labial constriction degree. Simultaneously recorded articulatory and acoustic data from six speakers in the USC-TIMIT corpus were analyzed to determine how variation in the implementation of each constriction across tokens of /ɹ/ relates to variation in third formant values. Results show that third formant values are differentially affected by constriction degree for the different constrictions used to produce /ɹ/. Additionally, interspeaker variation is observed in the relative effect of different constriction gestures on third formant values, most notably in a division between speakers exhibiting relatively equal effects of palatal and pharyngeal constriction degree on F3 and speakers exhibiting a stronger palatal effect. This division among speakers mirrors interspeaker differences in mean constriction length and location, suggesting that individual differences in /ɹ/ production lead to variation in articulatory-acoustic relations.
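The relative sensitivity of F3 to each constriction can be estimated, in its simplest form, by regressing F3 values on palatal, pharyngeal, and labial constriction degrees across tokens. The sketch below illustrates that idea on synthetic data; the variable names and simulated effect sizes are assumptions, not values from the USC-TIMIT analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
# Constriction degrees per /r/ token (placeholder values)
palatal, pharyngeal, labial = rng.normal(size=(3, n))
# Synthetic F3 with a stronger palatal effect, standing in for measured F3
f3 = 1800 - 120 * palatal - 60 * pharyngeal - 20 * labial + rng.normal(0, 30, n)

X = np.column_stack([palatal, pharyngeal, labial])
model = LinearRegression().fit(X, f3)
for name, coef in zip(["palatal", "pharyngeal", "labial"], model.coef_):
    print(f"F3 sensitivity to {name} constriction degree: {coef:.1f} Hz per unit")
```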
Subjects
Phonetics, Speech Acoustics, Constriction, Language, Pharynx, Speech Production Measurement, United States
ABSTRACT
It has been previously observed [McMicken, Salles, Berg, Vento-Wilson, Rogers, Toutios, and Narayanan. (2017). J. Commun. Disorders, Deaf Stud. Hear. Aids 5(2), 1-6] using real-time magnetic resonance imaging that a speaker with severe congenital tongue hypoplasia (aglossia) had developed a compensatory articulatory strategy where she, in the absence of a functional tongue tip, produced a plosive consonant perceptually similar to /d/ using a bilabial constriction. The present paper provides an updated account of this strategy. It is suggested that the previously observed compensatory bilabial closing that occurs during this speaker's /d/ production is consistent with vocal tract shaping resulting from hyoid raising created with mylohyoid action, which may also be involved in typical /d/ production. Simulating this strategy in a dynamic articulatory synthesis experiment leads to the generation of /d/-like formant transitions.
Subjects
Tongue, Voice, Female, Humans, Phonetics, Speech, Tongue/diagnostic imaging
ABSTRACT
In speech production, the motor system organizes articulators such as the jaw, tongue, and lips into synergies whose function is to produce speech sounds by forming constrictions at the phonetic places of articulation. The present study tests whether synergies for different constriction tasks differ in terms of inter-articulator coordination. The test is conducted on utterances [ɑpɑ], [ɑtɑ], [ɑiɑ], and [ɑkɑ] with a real-time magnetic resonance imaging biomarker that is computed using a statistical model of the forward kinematics of the vocal tract. The present study is the first to estimate the forward kinematics of the vocal tract from speech production data. Using the imaging biomarker, the study finds that the jaw contributes least to the velar stop for [k], more to pharyngeal approximation for [ɑ], still more to palatal approximation for [i], and most to the coronal stop for [t]. Additionally, the jaw contributes more to the coronal stop for [t] than to the bilabial stop for [p]. Finally, the study investigates how this pattern of results varies by participant. The study identifies differences in inter-articulator coordination by constriction task, which support the claim that inter-articulator coordination differs depending on the active articulator synergy.
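A stripped-down stand-in for the forward-kinematic reasoning above is to fit a statistical map from articulator variables to a constriction degree and read off the jaw term as its estimated contribution. The sketch below uses a plain linear regression on synthetic data for illustration only; the study's actual biomarker is built from a richer forward-kinematics model, and all names and coefficients here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
# Articulator variables per frame: jaw plus tongue/lip positions (placeholders)
jaw, tongue_tip, tongue_body, lips = rng.normal(size=(4, n))
# Constriction degree of interest, here with a sizable jaw term built in
coronal_cd = 2.0 * jaw + 1.0 * tongue_tip + 0.3 * tongue_body + rng.normal(0, 0.1, n)

X = np.column_stack([jaw, tongue_tip, tongue_body, lips])
fit = LinearRegression().fit(X, coronal_cd)
# The fitted jaw coefficient approximates the jaw's contribution to this constriction
print(f"Estimated jaw contribution to the coronal constriction: {fit.coef_[0]:.2f}")
```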
Subjects
Speech, Voice/physiology, Adult, Biomechanical Phenomena, Female, Humans, Jaw/diagnostic imaging, Jaw/physiology, Larynx/diagnostic imaging, Larynx/physiology, Magnetic Resonance Imaging, Male, Pharynx/diagnostic imaging, Pharynx/physiology, Phonetics, Psychomotor Performance
ABSTRACT
Sequences of similar (i.e., partially identical) words can be hard to say, as indicated by error frequencies and longer reaction and execution times. This study investigates the role of the location of this partial identity and the accompanying differences, i.e., whether errors are more frequent with mismatches in word onsets (top cop), codas (top tock), or both (pop tot). The number of syllables (tippy ticky) and empty positions (top ta) were also varied. Since the gradient nature of errors can be difficult to determine acoustically, articulatory data were investigated. Articulator movements were recorded using electromagnetic articulography for up to 9 speakers of American English repeatedly producing 2-word sequences to an accelerating metronome. Most word pairs showed more intrusions and greater variability in coda than in onset position, in contrast to the predominance of onset-position errors in corpora based on perceptual observation.
Subjects
Multilingualism, Phonetics, Speech Acoustics, Speech, Adult, Female, Humans, Language, Linguistics, Male, Speech Production Measurement, Young Adult
ABSTRACT
This paper reports on the concurrent use of electroglottography (EGG) and electromagnetic articulography (EMA) in the acquisition of EMA trajectory data for running speech. Static and dynamic intersensor distances, standard deviations, and coefficients of variation associated with inter-sample distances were compared in two conditions: with and without EGG present. Results indicate that measurement discrepancies between the two conditions are within the EMA system's measurement uncertainty. Therefore, potential electromagnetic interference from EGG does not appear to cause differences of practical importance in EMA trajectory behaviors, suggesting that simultaneous EMA and EGG data acquisition is a viable laboratory procedure for speech research.
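The comparison described above boils down to summary statistics of inter-sensor distances in the two recording conditions, checked against the EMA system's nominal measurement uncertainty. The sketch below illustrates that computation on simulated distance samples; the distances and the 0.3 mm uncertainty figure are assumptions, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(3)
# Inter-sensor distances in mm per EMA sample, recorded with and without EGG
# present (placeholder values standing in for real trajectory data)
dist_no_egg = 62.0 + rng.normal(0, 0.12, 5000)
dist_with_egg = 62.0 + rng.normal(0, 0.13, 5000)

def summarize(d):
    """Mean, standard deviation, and coefficient of variation of a distance series."""
    return d.mean(), d.std(ddof=1), d.std(ddof=1) / d.mean()

for label, d in [("without EGG", dist_no_egg), ("with EGG", dist_with_egg)]:
    mean, sd, cv = summarize(d)
    print(f"{label:12s} mean={mean:.2f} mm  SD={sd:.3f} mm  CV={cv:.4f}")

ema_uncertainty_mm = 0.3   # assumed nominal measurement uncertainty of the EMA system
diff = abs(dist_no_egg.mean() - dist_with_egg.mean())
print(f"Condition difference {diff:.3f} mm "
      f"{'within' if diff < ema_uncertainty_mm else 'exceeds'} assumed EMA uncertainty")
```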
Subjects
Electromagnetic Phenomena, Glottis/physiology, Speech Production Measurement/instrumentation, Speech/physiology, Female, Glottis/anatomy & histology, Humans, Larynx/anatomy & histology, Larynx/physiology, Male, Mouth/anatomy & histology, Mouth/physiology
ABSTRACT
The perceptual assimilation model (PAM; Best, C. T. [1995]. A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171-204). Baltimore, MD: York Press.) accounts for developmental patterns of speech contrast discrimination by proposing that infants shift from untuned phonetic perception at 6 months to natively tuned perceptual assimilation at 11-12 months, but the model does not predict initial discrimination differences among contrasts. To address that issue, we evaluated the Articulatory Organ Hypothesis, which posits that consonants produced using different articulatory organs are initially easier to discriminate than those produced with the same articulatory organ. We tested English-learning 6- and 11-month-olds' discrimination of voiceless fricative place contrasts from Nuu-Chah-Nulth (non-native) and English (native), with one within-organ and one between-organ contrast from each language. Both native and non-native contrasts were discriminated across age, suggesting that articulatory-organ differences do not influence perception of speech contrasts by young infants. The results highlight the fact that a decline in discrimination for non-native contrasts does not always occur over age.
Subjects
Child Development/physiology, Psychological Discrimination/physiology, Language Development, Speech Perception/physiology, Discrimination Learning/physiology, Female, Humans, Infant, Male
ABSTRACT
USC-TIMIT is an extensive database of multimodal speech production data, developed to complement existing resources available to the speech research community and with the intention of being continuously refined and augmented. The database currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English. Electromagnetic articulography data have also been collected from four of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460-sentence corpus used previously in the MOCHA-TIMIT database. In both cases the audio signal was recorded and synchronized with the articulatory data. The database and companion software are freely available to the research community.
Subjects
Acoustics, Biomedical Research, Factual Databases, Electromagnetic Phenomena, Magnetic Resonance Imaging, Pharynx/physiology, Speech Acoustics, Speech Production Measurement, Voice Quality, Acoustics/instrumentation, Adult, Biomechanical Phenomena, Female, Humans, Male, Middle Aged, Pharynx/anatomy & histology, Computer-Assisted Signal Processing, Software, Speech Production Measurement/instrumentation, Time Factors, Transducers
ABSTRACT
The English past tense allomorph following a coronal stop (e.g., /bɛndəd/) includes a vocoid that has traditionally been transcribed as a schwa or as a barred i. Previous evidence has suggested that this entity does not involve a specific articulatory gesture of any kind. Rather, its presence may simply result from the temporal coordination of the two temporally adjacent coronal gestures, while the interval between those two gestures remains voiced and is acoustically reminiscent of a schwa. The acoustic and articulatory characteristics of this vocoid are reexamined in this work using real-time MRI with synchronized audio, which affords complete midsagittal views of the vocal tract. A novel statistical analysis is developed to address the issue of articulatory targetlessness, based on previous models that predict articulatory action from segmental context. Results reinforce the idea that this vocoid is different, both acoustically and articulatorily, from lexical schwa, but its targetless nature is not supported. Data suggest that an articulatory target does exist, especially in the pharynx, where it is revealed by the new data acquisition methodology. Moreover, substantial articulatory differences are observed between subjects, which highlights both the difficulty in characterizing this entity previously and the need for further study with additional subjects.
Subjects
Language, Magnetic Resonance Imaging/methods, Phonetics, Speech Acoustics, Speech Articulation Tests/methods, Gestures, Humans, Male, Pharynx/physiology, Young Adult
ABSTRACT
In typical speech, words are grouped into prosodic constituents. This study investigates how such grouping interacts with segmental sequencing patterns in the production of repetitive word sequences. We experimentally manipulated grouping behavior using a rhythmic repetition task to elicit speech for perceptual and acoustic analysis, testing the hypothesis that prosodic structure and patterns of segmental alternation can interact in the production planning process. Talkers produced alternating sequences of two words (top cop) and non-alternating controls (top top and cop cop), organized into six-word sequences. These sequences were further organized into prosodic groupings of three two-word pairs or two three-word triples by means of visual cues and audible metronome clicks. Results for six speakers showed more speech errors when words were grouped into triples, that is, when pairwise word alternation was mismatched with the prosodic subgrouping. This result suggests that the planning process for the segmental units of an utterance interacts with the planning process for the prosodic grouping of its words. It also highlights the importance of extending commonly used experimental speech elicitation methods to include more complex prosodic patterns, in order to evoke the kinds of interaction between prosodic structure and planning that occur in the production of lexical forms in continuous communicative speech.
Subjects
Phonetics, Semantics, Sound Spectrography, Speech Acoustics, Speech Perception, Speech Production Measurement, Time Perception, Adult, Female, Humans, Male, Psycholinguistics, Verbal Behavior, Young Adult
ABSTRACT
This paper presents an automatic procedure to analyze articulatory setting in speech production using real-time magnetic resonance imaging of the moving human vocal tract. The procedure extracts frames corresponding to inter-speech pauses, speech-ready intervals, and absolute rest intervals from magnetic resonance imaging sequences of read and spontaneous speech elicited from five healthy speakers of American English and uses automatically extracted image features to quantify vocal tract posture during these intervals. Statistical analyses show significant differences between vocal tract postures adopted during inter-speech pauses and those at absolute rest before speech; the latter also exhibit greater variability in the adopted postures. In addition, the articulatory settings adopted during inter-speech pauses in read and spontaneous speech are distinct. The results suggest that adopted vocal tract postures differ on average during rest positions, ready positions, and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism.
Subjects
Epiglottis/physiology, Glottis/physiology, Computer-Assisted Image Interpretation/methods, Lip/physiology, Magnetic Resonance Imaging/methods, Soft Palate/physiology, Pharynx/physiology, Phonation/physiology, Phonetics, Speech/physiology, Tongue/physiology, Algorithms, Female, Humans, Muscle Contraction/physiology, Pulmonary Ventilation/physiology, Supine Position/physiology
ABSTRACT
This paper presents a computational approach to derive interpretable movement primitives from speech articulation data. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied to both measured articulatory data obtained through electromagnetic articulography and synthetic data generated using an articulatory synthesizer. The paper then describes how to evaluate the algorithm performance quantitatively and further performs a qualitative assessment of the algorithm's ability to recover compositional structure from data. This is done using pseudo ground-truth primitives generated by the articulatory synthesizer based on an Articulatory Phonology framework [Browman and Goldstein (1995). "Dynamics and articulatory phonology," in Mind as motion: Explorations in the dynamics of cognition, edited by R. F. Port and T. van Gelder (MIT Press, Cambridge, MA), pp. 175-194]. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. Such a framework might aid the understanding of longstanding issues in speech production such as motor control and coarticulation.
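To make the convolutive model concrete, the sketch below implements a simplified sparse convolutive factorization: the data matrix is approximated as a sum of time-shifted products of spatiotemporal bases and an activation matrix, with an L1 penalty encouraging few active primitives per instant. It uses plain projected gradient descent rather than the paper's cNMFsc multiplicative updates, and the input matrix is random placeholder data rather than articulography recordings.

```python
import numpy as np

def shift(M, t):
    """Shift columns of M to the right by t (zero-padded); negative t shifts left."""
    out = np.zeros_like(M)
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :-t]
    else:
        out[:, :t] = M[:, -t:]
    return out

def reconstruct(W, H):
    """Convolutive model: V_hat = sum_t W[t] @ shift(H, t)."""
    return sum(W[t] @ shift(H, t) for t in range(W.shape[0]))

def sparse_convolutive_nmf(V, K=5, T=8, lam=0.1, n_iter=2000, lr=1e-4, seed=0):
    """Projected-gradient convolutive factorization with an L1 penalty on H."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = 0.1 * rng.random((T, F, K))       # T time-slices of F x K bases
    H = 0.1 * rng.random((K, N))          # K activation rows over N frames
    for _ in range(n_iter):
        R = reconstruct(W, H) - V                              # residual
        gW = np.stack([R @ shift(H, t).T for t in range(T)])   # gradient w.r.t. W
        gH = sum(W[t].T @ shift(R, -t) for t in range(T)) + lam  # gradient w.r.t. H
        W = np.maximum(W - lr * gW, 0.0)   # gradient step + nonnegativity projection
        H = np.maximum(H - lr * gH, 0.0)
    return W, H

# Illustrative run on random nonnegative data standing in for articulator trajectories
rng = np.random.default_rng(1)
V = np.abs(rng.normal(size=(14, 400)))     # e.g., 14 articulator channels, 400 frames
W, H = sparse_convolutive_nmf(V)
print("reconstruction error:", np.linalg.norm(V - reconstruct(W, H)))
```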
Subjects
Larynx/physiology, Theoretical Models, Mouth/physiology, Speech Acoustics, Voice Quality, Algorithms, Biomechanical Phenomena, Computer Simulation, Electromagnetic Phenomena, Female, Humans, Male, Motor Skills, Computer-Assisted Numerical Analysis, Reproducibility of Results, Speech Production Measurement, Time Factors
ABSTRACT
We present and evaluate two statistical methods for estimating kinematic relationships of the speech production system: Artificial Neural Networks and Locally-Weighted Regression. The work is motivated by the need to characterize this motor system, with particular focus on estimating differential aspects of kinematics. Kinematic analysis will facilitate progress in a variety of areas, including the nature of speech production goals, articulatory redundancy and, relatedly, acoustic-to-articulatory inversion. Statistical methods must be used to estimate these relationships from data since they are infeasible to express in closed form. Statistical models are optimized and evaluated - using a held-out data validation procedure - on two sets of synthetic speech data. The theoretical and practical advantages of both methods are also discussed. It is shown that both direct and differential kinematics can be estimated with high accuracy, even for complex, nonlinear relationships. Locally-Weighted Regression displays the best overall performance, which may be due to practical advantages in its training procedure. Moreover, accurate estimation can be achieved using only a modest amount of training data, as judged by convergence of performance. The algorithms are also applied to real-time MRI data, and the results are generally consistent with those obtained from synthetic data.
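The sketch below illustrates the Locally-Weighted Regression idea on a synthetic forward map: a Gaussian-weighted linear fit around a query point yields both a prediction (direct kinematics) and a local slope matrix that approximates the Jacobian (differential kinematics). The bandwidth, data, and dimensionalities are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def lwr_predict(x_query, X, Y, bandwidth=0.5):
    """Locally weighted linear regression: returns prediction and local Jacobian.

    X: (n, d_in) input configurations, Y: (n, d_out) output variables. The slope
    of the local linear fit approximates the differential (Jacobian) relationship
    at x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * bandwidth ** 2))
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])       # add intercept column
    WX = Xa * w[:, None]                                 # weighted design matrix
    beta, *_ = np.linalg.lstsq(WX.T @ Xa, WX.T @ Y, rcond=None)
    pred = np.hstack([1.0, x_query]) @ beta
    jacobian = beta[1:].T                                # (d_out, d_in)
    return pred, jacobian

# Illustrative synthetic forward map (placeholder for articulatory-to-task data)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 3))
Y = np.column_stack([np.sin(X[:, 0]) + X[:, 1] ** 2,
                     X[:, 2] * X[:, 0]])
pred, J = lwr_predict(np.array([0.2, -0.1, 0.4]), X, Y)
print("prediction:", pred)
print("estimated local Jacobian:\n", J)
```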
ABSTRACT
Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable time functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.
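As a toy stand-in for the gesture-recognition component described above, the sketch below trains a small feedforward network to label frames with a binary gestural activation from acoustic features. The features, the single synthetic gesture, and the network size are assumptions for illustration only; the paper's architecture and task setup are more elaborate.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder acoustic features (e.g., frame-level cepstra) and a binary gestural
# activation (e.g., lip-closure gesture on/off) standing in for annotated data.
acoustic = rng.normal(size=(3000, 13))
gesture_on = (acoustic[:, 0] + 0.5 * acoustic[:, 3] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(acoustic, gesture_on, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print("frame-level gesture recognition accuracy:", clf.score(X_te, y_te))
```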
Subjects
Gestures, Speech Perception/physiology, Speech Recognition Interface, Speech/physiology, Bayes Theorem, Humans, Phonetics, Speech Acoustics, Vocabulary
ABSTRACT
Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation, and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative time-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically referenced dynamic time-warping procedures and provides reliable gestural annotation for speech datasets.
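At the core of such an architecture is an alignment step that warps the synthesized utterance onto the natural one by minimizing an acoustic distance. The sketch below shows a basic dynamic time warping alignment of two feature sequences; it omits the landmark constraints and the iterative gestural-score optimization, and the sequences are synthetic placeholders.

```python
import numpy as np

def dtw_path(A, B):
    """Basic dynamic time warping between two feature sequences (frames x dims).
    Returns the accumulated cost and the warping path as (i, j) index pairs."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])          # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrace the optimal alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return D[n, m], path[::-1]

# Illustrative alignment of a "synthesized" and a "natural" feature sequence
rng = np.random.default_rng(0)
natural = rng.normal(size=(60, 12))                          # e.g., MFCC frames
synthesized = natural[::2] + 0.05 * rng.normal(size=(30, 12))
cost, path = dtw_path(synthesized, natural)
print("alignment cost:", round(float(cost), 2), "path length:", len(path))
```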
Subjects
Acoustics, Gestures, Glottis/physiology, Mouth/physiology, Speech Acoustics, Voice Quality, Biomechanical Phenomena, Female, Humans, Male, Theoretical Models, Computer-Assisted Signal Processing, Sound Spectrography, Speech Production Measurement/methods, Time Factors
ABSTRACT
Certain consonant/vowel (CV) combinations are more frequent than would be expected from the individual C and V frequencies alone, both in babbling and, to a lesser extent, in adult language, based on dictionary counts: Labial consonants co-occur with central vowels more often than chance would dictate; coronals co-occur with front vowels, and velars with back vowels (Davis & MacNeilage, 1994). Plausible biomechanical explanations have been proposed, but it is also possible that infants are mirroring the frequency of the CVs that they hear. As noted, previous assessments of adult language were based on dictionaries; these "type" counts are incommensurate with the babbling measures, which are necessarily "token" counts. We analyzed the tokens in two spoken corpora for English, two for French, and one for Mandarin. We found that the adult spoken CV preferences correlated with the type counts for Mandarin and French, but not for English. Correlations between the adult spoken corpora and the babbling results had all three possible outcomes: significantly positive (French), uncorrelated (Mandarin), and significantly negative (English). There were no correlations of the dictionary data with the babbling results when all nine combinations of consonants and vowels were considered. The results indicate that spoken frequencies of CV combinations can differ from dictionary (type) counts and that the CV preferences apparent in babbling are biomechanically driven and can ignore the frequencies of CVs in the ambient spoken language.
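The chance comparison described above amounts to an observed-over-expected ratio, where the expected count of a CV combination is derived from the independent consonant and vowel token frequencies. The sketch below computes that ratio for a toy token list; the categories and counts are illustrative, not corpus data.

```python
from collections import Counter

# Toy token list of CV syllables (consonant place class, vowel class),
# standing in for counts drawn from a spoken corpus.
tokens = [("labial", "central"), ("coronal", "front"), ("velar", "back"),
          ("labial", "central"), ("coronal", "front"), ("labial", "front"),
          ("velar", "back"), ("coronal", "central"), ("labial", "central")]

cv = Counter(tokens)
c_count = Counter(c for c, _ in tokens)
v_count = Counter(v for _, v in tokens)
n = len(tokens)

# Observed/expected ratio: >1 means the CV combination occurs more often
# than predicted from the independent C and V token frequencies.
for (c, v), obs in sorted(cv.items()):
    expected = c_count[c] * v_count[v] / n
    print(f"{c:8s}+{v:8s} O/E = {obs / expected:.2f}")
```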
Subjects
Language Development, Lip/physiology, Phonetics, Speech Perception/physiology, Speech/physiology, Adult, Biomechanical Phenomena/physiology, Factual Databases, Feedback, Humans, Infant
ABSTRACT
INTRODUCTION: Most of the previous articulatory studies of stuttering have focused on the fluent speech of people who stutter. However, to better understand what causes the actual moments of stuttering, it is necessary to probe articulatory behaviors during stuttered speech. We examined the supralaryngeal articulatory characteristics of stuttered speech using real-time structural magnetic resonance imaging (RT-MRI). We investigated how articulatory gestures differ across stuttered and fluent speech of the same speaker. METHODS: Vocal tract movements of an adult man who stutters during a pseudoword reading task were recorded using RT-MRI. Four regions of interest (ROIs) were defined on RT-MRI image sequences around the lips, tongue tip, tongue body, and velum. The variation of pixel intensity in each ROI over time provided an estimate of the movement of these four articulators. RESULTS: All disfluencies occurred on syllable-initial consonants. Three articulatory patterns were identified. Pattern 1 showed smooth gestural formation and release, as in fluent speech. Patterns 2 and 3 showed delayed release of gestures due to articulator fixation or oscillation, respectively. Blocks and prolongations corresponded to either pattern 1 or 2. Repetitions corresponded to pattern 3 or a mix of patterns. Gestures for disfluent consonants typically exhibited a greater constriction than fluent gestures, which was rarely corrected during disfluencies. Gestures for the upcoming vowel were initiated and executed during these consonant disfluencies, achieving a tongue body position similar to the fluent counterpart. CONCLUSION: Different perceptual types of disfluencies did not necessarily result from distinct articulatory patterns, highlighting the importance of collecting articulatory data on stuttering. Disfluencies on syllable-initial consonants were related to the delayed release and overshoot of consonant gestures, rather than to the delayed initiation of vowel gestures. This suggests that stuttering does not arise from problems with planning the vowel gestures, but rather with releasing the overly constricted consonant gestures.
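The ROI measure described in the METHODS can be sketched as a mean pixel intensity time series per region, where intensity rises as tissue moves into the region. The example below uses a random image stack and hypothetical bounding boxes in place of real RT-MRI frames and hand-drawn ROIs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder rtMRI image sequence: frames x height x width
frames = rng.random((200, 68, 68))

# Hypothetical ROI bounding boxes (row_start, row_end, col_start, col_end)
rois = {"lips": (40, 48, 5, 15), "tongue_tip": (35, 45, 18, 28),
        "tongue_body": (30, 42, 30, 45), "velum": (15, 25, 40, 50)}

# Mean pixel intensity per ROI per frame: higher intensity indicates more tissue
# in the region, so the time series tracks the articulator moving into the ROI.
roi_timeseries = {name: frames[:, r0:r1, c0:c1].mean(axis=(1, 2))
                  for name, (r0, r1, c0, c1) in rois.items()}
for name, ts in roi_timeseries.items():
    print(f"{name:12s} intensity range: {ts.min():.3f} - {ts.max():.3f}")
```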
Subjects
Stuttering, Adult, Gestures, Humans, Magnetic Resonance Imaging, Male, Speech, Speech Production Measurement
ABSTRACT
Individuals who have undergone treatment for oral cancer often exhibit compensatory behavior in consonant production. This pilot study investigates whether compensatory mechanisms utilized in the production of speech sounds with a given target constriction location vary systematically depending on target manner of articulation. The data reveal that compensatory strategies used to produce target alveolar segments vary systematically as a function of target manner of articulation in subtle yet meaningful ways. When target constriction degree at a particular constriction location cannot be preserved, individuals may leverage their ability to finely modulate constriction degree at multiple constriction locations along the vocal tract.
ABSTRACT
Understanding how the human speech production system is related to the human auditory system has been a perennial subject of inquiry. To investigate the production-perception link, in this paper, a computational analysis has been performed using the articulatory movement data obtained during speech production with concurrently recorded acoustic speech signals from multiple subjects in three different languages: English, Cantonese, and Georgian. The form of articulatory gestures during speech production varies across languages, and this variation is considered to be reflected in the articulatory position and kinematics. The auditory processing of the acoustic speech signal is modeled by a parametric representation of the cochlear filterbank which allows for realizing various candidate filterbank structures by changing the parameter value. Using mathematical communication theory, it is found that the uncertainty about the articulatory gestures in each language is maximally reduced when the acoustic speech signal is represented using the output of a filterbank similar to the empirically established cochlear filterbank in the human auditory system. Possible interpretations of this finding are discussed.
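A simplified version of the analysis described above is to estimate how much information a given filterbank representation carries about an articulatory variable, for instance with a nonparametric mutual information estimator, and to compare that quantity across candidate filterbank settings. The sketch below illustrates the estimation step on synthetic data; the filterbank outputs, the articulatory variable, and their coupling are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_frames = 2000
# Placeholder filterbank outputs (frames x channels) and one articulatory variable
filterbank_energies = rng.normal(size=(n_frames, 24))
tongue_position = filterbank_energies[:, 5] * 0.8 + rng.normal(0, 0.5, n_frames)

# Estimated mutual information (in nats) between each filterbank channel and the
# articulatory variable; summing gives a rough index of how much uncertainty
# about the articulation this filterbank representation removes.
mi = mutual_info_regression(filterbank_energies, tongue_position, random_state=0)
print("total MI across channels (nats):", round(float(mi.sum()), 3))
```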