ABSTRACT
The current project undertakes a kinematic examination of vertical larynx actions and intergestural timing stability within multi-gesture complex segments, such as ejectives and implosives, that may possess specific temporal goals critical to their articulatory realization. Using real-time MRI (rtMRI) speech production data from Hausa non-pulmonic and pulmonic consonants, this study illuminates the timing between oral constriction and vertical larynx actions within segments and the role this intergestural timing plays in realizing phonological contrasts and processes in varying prosodic contexts. Results suggest that vertical larynx actions have greater magnitude in the production of ejectives compared to their pulmonic counterparts, whereas implosives and pulmonic consonants are differentiated not by vertical larynx magnitude but by the intergestural timing patterns between their oral and vertical larynx gestures. Moreover, intergestural timing stability between oral and non-oral (vertical larynx) actions differs among ejectives, implosives, and pulmonic consonants: ejectives show the most stable temporal lags, followed by implosives and then pulmonic consonants. Lastly, the findings show how contrastive linguistic 'molecules' - here, segment-sized phonological complexes comprising multiple gestures - interact with phrasal context: phrasal position variably shapes the temporal organization of the participating gestures while the relative timing between the gestures comprising a segment remains stable.
ABSTRACT
Accurately representing changes in mental states over time is crucial for understanding their complex dynamics. However, there is little methodological research on the validity and reliability of human-produced continuous-time annotation of these states. We present a psychometric perspective on valid and reliable construct assessment, examine the robustness of interval-scale (e.g., values between zero and one) continuous-time annotation, and identify three major threats to validity and reliability in current approaches. We then propose a novel ground truth generation pipeline that combines emerging techniques for improving validity and robustness. We demonstrate its effectiveness in a case study involving crowd-sourced annotation of perceived violence in movies, where our pipeline achieves a .95 Spearman correlation in summarized ratings compared to a .15 baseline. These results suggest that highly accurate ground truth signals can be produced from continuous annotations using additional comparative annotation (e.g., a versus b) to correct structured errors, highlighting the need for a paradigm shift in robust construct measurement over time.
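The pipeline's headline numbers are Spearman rank correlations between summarized annotations and a reference signal. As a minimal illustration (a pure-Python sketch, not the authors' pipeline), Spearman's rho is simply the Pearson correlation computed on the two variables' ranks:

```python
def rank(values):
    # Assign 1-based ranks, averaging ranks within tie groups.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the ranks.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because it operates on ranks, this metric rewards any monotone agreement with the reference, which makes it a natural target when comparative (a-versus-b) annotations are used to correct structured, nonlinear distortions in continuous ratings.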
Subjects
Psychometrics, Humans, Psychometrics/methods, Psychometrics/instrumentation, Reproducibility of Results, Violence/psychology
ABSTRACT
Social networks are the persons surrounding a patient who provide support, circulate information, and influence health behaviors. For patients seen by neurologists, social networks are one of the most proximate social determinants of health that are actually accessible to clinicians, compared with wider social forces such as structural inequalities. We can measure social networks and related phenomena of social connection with a growing set of scalable, quantitative tools, and familiarity with social network effects and mechanisms is increasing accordingly. This scientific approach is built on decades of neurobiological and psychological research highlighting the impact of the social environment on physical and mental well-being, nervous system structure, and neuro-recovery. Here, we review the biology and psychology of social networks, assessment methods including novel social sensors, and the design of network interventions and social therapeutics.
Subjects
Health Behavior, Social Networking, Humans, Neurologists
ABSTRACT
BACKGROUND: Each year, millions of Americans receive evidence-based psychotherapies (EBPs) like cognitive behavioral therapy (CBT) for the treatment of mental and behavioral health problems. Yet, at present, there is no scalable method for evaluating the quality of psychotherapy services, leaving EBP quality and effectiveness largely unmeasured and unknown. Project AFFECT will develop and evaluate an AI-based software system to automatically estimate CBT fidelity from a recording of a CBT session. Project AFFECT is an NIMH-funded research partnership between the Penn Collaborative for CBT and Implementation Science and Lyssn.io, Inc. ("Lyssn"), a start-up developing AI-based technologies that are objective, scalable, and cost-efficient, to support training, supervision, and quality assurance of EBPs. Lyssn provides HIPAA-compliant, cloud-based software for secure recording, sharing, and reviewing of therapy sessions, which includes AI-generated metrics for CBT. The proposed tool will build from and be integrated into this core platform. METHODS: Phase I will work from an existing software prototype to develop a LyssnCBT user interface geared to the needs of community mental health (CMH) agencies. Core activities include a user-centered design focus group and interviews with community mental health therapists, supervisors, and administrators to inform the design and development of LyssnCBT. LyssnCBT will be evaluated for usability and implementation readiness in a final stage of Phase I. Phase II will conduct a stepped-wedge, hybrid implementation-effectiveness randomized trial (N = 1,875 clients) to evaluate the effectiveness of LyssnCBT in improving therapist CBT skills and client outcomes and reducing client drop-out. Analyses will also examine the hypothesized mechanism of action underlying LyssnCBT.
DISCUSSION: Successful execution will provide automated, scalable CBT fidelity feedback for the first time, supporting high-quality training, supervision, and quality assurance, and providing a core technology foundation that could support the quality delivery of a range of EBPs in the future. TRIAL REGISTRATION: ClinicalTrials.gov; NCT05340738; approved 4/21/2022.
Subjects
Artificial Intelligence, Cognitive Behavioral Therapy, Cognitive Behavioral Therapy/methods, Feedback, Humans, Mental Health, Psychotherapy, United States
ABSTRACT
Automatic inference of paralinguistic information from speech, such as age, is an important area of research with many technological applications. Speaker age estimation can help with age-appropriate curation of information content and personalized interactive experiences. However, automatic speaker age estimation in children is challenging due to the paucity of speech data representing the developmental spectrum and the large signal variability, including within a given age group. Most prior approaches to child speaker age estimation adopt methods directly drawn from research on adult speech. In this paper, we propose a novel technique that exploits temporal variability present in children's speech to estimate children's age. We focus on phone durations as a biomarker of children's age. Phone duration distributions are derived by forced-aligning children's speech with transcripts. Regression models are trained to predict speaker age among children from kindergarten through grade 10. Experiments on two children's speech datasets demonstrate the robustness and portability of the proposed features over multiple domains with varying signal conditions. The phonemes contributing most to estimation of child speaker age are analyzed and presented. Experimental results suggest that phone durations carry important information about children's development. The proposed features are also suited for application in low-data scenarios.
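To make the regression idea concrete, here is a deliberately simplified sketch (hypothetical duration values and a single feature; the study uses full phone-duration distributions and stronger regressors): ordinary least squares predicting age from a mean phone duration, which tends to shorten as children mature.

```python
def fit_line(x, y):
    # Ordinary least squares for y = a*x + b with a single predictor.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    b = my - a * mx
    return a, b

# Hypothetical data: mean vowel duration (s) per speaker vs. age (years).
durations = [0.30, 0.25, 0.20]
ages = [10.0, 15.0, 20.0]
slope, intercept = fit_line(durations, ages)

def predict_age(duration):
    # Age estimate for a new speaker from one duration feature.
    return slope * duration + intercept
```

A real system would stack many such per-phone duration statistics into a feature vector and fit a multivariate regression, but the mapping from temporal features to age is the same in spirit.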
Subjects
Schools, Telephone, Adult, Child, Humans, Speech
ABSTRACT
With the growing prevalence of psychological interventions, it is vital to have measures that rate the effectiveness of psychological care to assist in training, supervision, and quality assurance of services. Traditionally, quality assessment is addressed by human raters who evaluate recorded sessions along specific dimensions, often codified through constructs relevant to the approach and domain. This is, however, a cost-prohibitive and time-consuming method that leads to poor feasibility and limited use in real-world settings. To facilitate this process, we have developed an automated competency rating tool able to process the raw recorded audio of a session, analyzing who spoke when, what they said, and how the health professional used language to provide therapy. Focusing on a use case of a specific type of psychotherapy called "motivational interviewing", our system gives comprehensive feedback to the therapist, including information about the dynamics of the session (e.g., therapist's vs. client's talking time), low-level psychological language descriptors (e.g., type of questions asked), as well as other high-level behavioral constructs (e.g., the extent to which the therapist understands the clients' perspective). We describe our platform and its performance using a dataset of more than 5,000 recordings drawn from its deployment in a real-world clinical setting used to assist training of new therapists. Widespread use of automated psychotherapy rating tools may augment experts' capabilities by providing an avenue for more effective training and skill improvement, eventually leading to more positive clinical outcomes.
Subjects
Professional-Patient Relations, Speech, Humans, Language, Psychotherapy/methods
ABSTRACT
To capitalize on investments in evidence-based practices, technology is needed to scale up fidelity assessment and supervision. Stakeholder feedback may facilitate adoption of such tools. This evaluation gathered stakeholder feedback and preferences to explore whether it would be feasible to implement an automated fidelity-scoring supervision tool in community mental health settings. A partially mixed, sequential research method design was used, including focus group discussions with community mental health therapists (n = 18) and clinical leadership (n = 12) to explore typical supervision practices, followed by discussion of an automated fidelity feedback tool embedded in a cloud-based supervision platform. Interpretation of qualitative findings was enhanced through quantitative measures of participants' use of technology and perceptions of acceptability, appropriateness, and feasibility of the tool. Initial perceptions of acceptability, appropriateness, and feasibility of automated fidelity tools were positive and increased after introduction of an automated tool. Standard supervision was described as collaboratively guided and focused on clinical content, self-care, and documentation. Participants highlighted the tool's utility for supervision, training, and professional growth, but questioned its ability to evaluate rapport, cultural responsiveness, and non-verbal communication. Concerns were raised about privacy and the impact of low scores on therapist confidence. Desired features included intervention labeling and transparency about how scores related to session content. Opportunities for asynchronous, remote, and targeted supervision were particularly valued. Stakeholder feedback suggests that automated fidelity measurement could augment supervision practices. Future research should examine the relations among use of such supervision tools, clinician skill, and client outcomes.
Subjects
Artificial Intelligence, Cognitive Behavioral Therapy, Attitude, Cognitive Behavioral Therapy/methods, Focus Groups, Humans, Research Design
ABSTRACT
PURPOSE: To provide 3D real-time MRI of speech production with improved spatio-temporal sharpness using randomized, variable-density, stack-of-spiral sampling combined with a 3D spatio-temporally constrained reconstruction. METHODS: We evaluated five candidate (k, t) sampling strategies using a previously proposed gradient-echo stack-of-spiral sequence and a 3D constrained reconstruction with spatial and temporal penalties. Regularization parameters were chosen by expert readers based on qualitative assessment. We experimentally determined the effect of spiral angle increment and kz temporal order. The strategy yielding highest image quality was chosen as the proposed method. We evaluated the proposed and original 3D real-time MRI methods in 2 healthy subjects performing speech production tasks that invoke rapid movements of articulators seen in multiple planes, using interleaved 2D real-time MRI as the reference. We quantitatively evaluated tongue boundary sharpness in three locations at two speech rates. RESULTS: The proposed data-sampling scheme uses a golden-angle spiral increment in the kx-ky plane and variable-density, randomized encoding along kz. It provided a statistically significant improvement in tongue boundary sharpness score (P < .001) in the blade, body, and root of the tongue during normal and 1.5-times speeded speech. Qualitative improvements were substantial during natural speech tasks of alternating high, low tongue postures during vowels. The proposed method was also able to capture complex tongue shapes during fast alveolar consonant segments. Furthermore, the proposed scheme allows flexible retrospective selection of temporal resolution. CONCLUSION: We have demonstrated improved 3D real-time MRI of speech production using randomized, variable-density, stack-of-spiral sampling with a 3D spatio-temporally constrained reconstruction.
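The golden-angle spiral increment used in the chosen sampling scheme can be sketched as follows (an illustrative computation, not the sequence code): successive spiral interleaves are rotated by 2π(1 − 1/φ) ≈ 137.51°, which keeps the angular coverage near-uniform for any retrospectively chosen temporal window, the property that enables flexible retrospective selection of temporal resolution.

```python
# Golden angle: 360° * (1 - 1/phi), where phi is the golden ratio.
GOLDEN_ANGLE = 360.0 * (1.0 - 1.0 / ((1.0 + 5.0 ** 0.5) / 2.0))  # ~137.5078°

def spiral_angles(n):
    # Rotation angle (degrees, modulo 360) of each successive spiral
    # interleaf under a golden-angle increment.
    return [(i * GOLDEN_ANGLE) % 360.0 for i in range(n)]
```

Because the golden angle is irrational relative to 360°, no two interleaves repeat the same rotation, so any contiguous subset of acquired interleaves samples angles roughly evenly.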
Subjects
Computer-Assisted Image Processing, Speech, Humans, Three-Dimensional Imaging, Magnetic Resonance Imaging, Retrospective Studies, Tongue/diagnostic imaging
ABSTRACT
PURPOSE: To mitigate a common artifact in spiral real-time MRI, caused by aliasing of signal outside the desired FOV. This artifact frequently occurs in midsagittal speech real-time MRI. METHODS: Simulations were performed to determine the likely origin of the artifact. Two methods to mitigate the artifact are proposed. The first approach, denoted as "large FOV" (LF), keeps an FOV that is large enough to include the artifact signal source during reconstruction. The second approach, denoted as "estimation-subtraction" (ES), estimates the artifact signal source and then subtracts a synthetic signal representing that source from the multicoil k-space raw data. Twenty-five midsagittal speech-production real-time MRI data sets were used to evaluate both of the proposed methods. Reconstructions without and with corrections were evaluated by two expert readers using a 5-level Likert scale assessing artifact severity. Reconstruction time was also compared. RESULTS: The origin of the artifact was found to be a combination of gradient nonlinearity and imperfect anti-aliasing in spiral sampling. The LF and ES methods were both able to substantially reduce the artifact, with averaged qualitative score improvements of 1.25 and 1.35 Likert levels for LF correction and ES correction, respectively. Average reconstruction times without correction, with LF correction, and with ES correction were 160.69 ± 1.56, 526.43 ± 5.17, and 171.47 ± 1.71 ms/frame, respectively. CONCLUSION: Both proposed methods were able to reduce the spiral aliasing artifacts, with the ES method being more effective and more time-efficient.
Subjects
Artifacts, Computer-Assisted Image Processing, Magnetic Resonance Imaging, Speech
ABSTRACT
The glossectomy procedure, involving surgical resection of cancerous lingual tissue, has long been observed to affect speech production. This study aims to quantitatively index and compare the complexity of vocal tract shaping due to lingual movement in individuals who have undergone glossectomy and in typical speakers, using real-time magnetic resonance imaging data and Principal Component Analysis. The data reveal that (i) the type of glossectomy undergone largely predicts the patterns in vocal tract shaping observed, (ii) gross forward and backward motion of the tongue body accounts for more change in vocal tract shaping than do subtler movements of the tongue (e.g., tongue tip constrictions) in patient data, and (iii) fewer vocal tract shaping components are required to account for the patients' speech data than for typical speech data, suggesting that the patient data at hand exhibit less complex vocal tract shaping in the midsagittal plane than do the data from the typical speakers observed.
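The complexity comparison rests on counting how many principal components are needed to account for the shaping data. A minimal sketch of that criterion (illustrative eigenvalues and threshold, not the study's actual settings): rank the PCA eigenvalues and find the smallest number of leading components whose cumulative explained-variance ratio reaches a chosen threshold.

```python
def components_needed(eigenvalues, threshold):
    # Smallest number of leading PCA components whose cumulative
    # explained-variance ratio reaches the threshold.
    total = sum(eigenvalues)
    cum = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cum += ev / total
        if cum >= threshold:
            return k
    return len(eigenvalues)
```

Under this criterion, "less complex shaping" corresponds to a spectrum whose variance is concentrated in fewer components, so fewer components clear the threshold.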
Subjects
Glossectomy, Tongue Neoplasms, Humans, Principal Component Analysis, Speech, Tongue/diagnostic imaging, Tongue/surgery, Tongue Neoplasms/diagnostic imaging, Tongue Neoplasms/surgery
ABSTRACT
Autism spectrum disorder (ASD) is characterized by deficits in social communication, and even children with ASD with preserved language are often perceived as socially awkward. We ask whether linguistic patterns are associated with social perceptions of speakers. Twenty-one adolescents with ASD participated in conversations with an adult; each conversation was then rated for the social dimensions of likability, outgoingness, social skilfulness, responsiveness, and fluency. Conversations were analysed for responses to questions, pauses, and acoustic variables. Wide intonation ranges and more pauses within children's own conversational turns were predictors of more positive social ratings, while failure to respond to one's conversational partner, faster syllable rate, and smaller quantity of speech were negative predictors of social perceptions.
Subjects
Autism Spectrum Disorder, Adolescent, Adult, Child, Communication, Humans, Judgment, Language, Speech
ABSTRACT
PURPOSE: To develop and evaluate a fast and effective method for deblurring spiral real-time MRI (RT-MRI) using convolutional neural networks. METHODS: We demonstrate a 3-layer residual convolutional neural network to correct image-domain off-resonance artifacts in speech production spiral RT-MRI without knowledge of field maps. The architecture is motivated by traditional deblurring approaches. Spatially varying off-resonance blur is synthetically generated by using discrete object approximation and field maps, with data augmentation from a large database of 2D human speech production RT-MRI. The effects of off-resonance range, shift-invariance of blur, and readout duration on deblurring performance are investigated. The proposed method is validated using synthetic and real data with longer readouts, quantitatively using image quality metrics and qualitatively via visual inspection, and with a comparison to conventional deblurring methods. RESULTS: Deblurring performance was found superior to a current autocalibrated method for in vivo data and only slightly worse than an ideal reconstruction with perfect knowledge of the field map for synthetic test data. Convolutional neural network deblurring made it possible to visualize articulator boundaries with readouts up to 8 ms at 1.5 T, which is 3-fold longer than the current standard practice. The computation time was 12.3 ± 2.2 ms per frame, enabling low-latency processing for RT-MRI applications. CONCLUSION: Convolutional neural network deblurring is a practical, efficient, and field map-free approach for the deblurring of spiral RT-MRI. In the context of speech production imaging, this can enable a 1.7-fold improvement in scan efficiency and the use of spiral readouts at higher field strengths such as 3 T.
Subjects
Algorithms, Computer-Assisted Image Processing, Artifacts, Humans, Magnetic Resonance Imaging, Neural Networks (Computer)
ABSTRACT
OBJECTIVES: To evaluate a novel method for real-time tagged MRI with increased tag persistence using phase sensitive tagging (REALTAG), demonstrated for speech imaging. METHODS: Tagging is applied as a brief interruption to a continuous real-time spiral acquisition. REALTAG is implemented using a total tagging flip angle of 180° and a novel frame-by-frame phase sensitive reconstruction to remove smooth background phase while preserving the sign of the tag lines. Tag contrast-to-noise ratio of REALTAG and conventional tagging (total flip angle of 90°) is simulated and evaluated in vivo. The ability to extend tag persistence is tested during the production of vowel-to-vowel transitions by American English speakers. RESULTS: REALTAG resulted in a doubling of contrast-to-noise ratio at each time point and increased tag persistence by more than 1.9-fold. The tag persistence was 1150 ms with contrast-to-noise ratio >6 at 1.5T, providing 2 mm in-plane resolution, 179 frames/s, with 72.6 ms temporal window width, and phase sensitive reconstruction. The new imaging window is able to capture internal tongue deformation over word-to-word transitions in natural speech production. CONCLUSION: Tag persistence is substantially increased in intermittently tagged real-time MRI by using the improved REALTAG method. This makes it possible to capture longer motion patterns in the tongue, such as cross-word vowel-to-vowel transitions, and provides a powerful new window to study tongue biomechanics.
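The reported roughly 1.9-fold persistence gain from doubling tag contrast-to-noise ratio is consistent with a simple first-order decay picture. The sketch below assumes a pure exponential decay with an illustrative T1 of 1000 ms (assumptions for illustration, not the paper's signal model): if CNR decays as CNR0·exp(−t/T1), then doubling CNR0 extends the time spent above a usable floor by T1·ln 2.

```python
import math

def tag_cnr(t_ms, cnr0, t1_ms=1000.0):
    # Tag contrast-to-noise ratio decaying with longitudinal relaxation
    # (pure-exponential toy model; T1 value is illustrative).
    return cnr0 * math.exp(-t_ms / t1_ms)

def persistence_ms(cnr0, floor=6.0, t1_ms=1000.0):
    # Time (ms) until CNR falls to the usable floor (here CNR > 6,
    # matching the floor quoted in the abstract).
    return t1_ms * math.log(cnr0 / floor)
```

Under this toy model, going from CNR0 = 2·floor to CNR0 = 4·floor adds T1·ln 2 ≈ 693 ms of usable window, the same order as the persistence extension reported.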
Subjects
Language, Magnetic Resonance Imaging, Biomechanical Phenomena, Speech, Tongue/diagnostic imaging
ABSTRACT
Although substantial variability is observed in the articulatory implementation of the constriction gestures involved in /ɹ/ production, studies of articulatory-acoustic relations in /ɹ/ have largely ignored the potential for subtle variation in the implementation of these gestures to affect salient acoustic dimensions. This study examines how variation in the articulation of American English /ɹ/ influences the relative sensitivity of the third formant to variation in palatal, pharyngeal, and labial constriction degree. Simultaneously recorded articulatory and acoustic data from six speakers in the USC-TIMIT corpus were analyzed to determine how variation in the implementation of each constriction across tokens of /ɹ/ relates to variation in third formant values. Results show that third formant values are differentially affected by constriction degree for the different constrictions used to produce /ɹ/. Additionally, interspeaker variation is observed in the relative effect of different constriction gestures on third formant values, most notably in a division between speakers exhibiting relatively equal effects of palatal and pharyngeal constriction degree on F3 and speakers exhibiting a stronger palatal effect. This division among speakers mirrors interspeaker differences in mean constriction length and location, suggesting that individual differences in /ɹ/ production lead to variation in articulatory-acoustic relations.
Subjects
Phonetics, Speech Acoustics, Constriction, Language, Pharynx, Speech Production Measurement, United States
ABSTRACT
It has been previously observed [McMicken, Salles, Berg, Vento-Wilson, Rogers, Toutios, and Narayanan. (2017). J. Commun. Disorders, Deaf Stud. Hear. Aids 5(2), 1-6] using real-time magnetic resonance imaging that a speaker with severe congenital tongue hypoplasia (aglossia) had developed a compensatory articulatory strategy where she, in the absence of a functional tongue tip, produced a plosive consonant perceptually similar to /d/ using a bilabial constriction. The present paper provides an updated account of this strategy. It is suggested that the previously observed compensatory bilabial closing that occurs during this speaker's /d/ production is consistent with vocal tract shaping resulting from hyoid raising created with mylohyoid action, which may also be involved in typical /d/ production. Simulating this strategy in a dynamic articulatory synthesis experiment leads to the generation of /d/-like formant transitions.
Subjects
Tongue, Voice, Female, Humans, Phonetics, Speech, Tongue/diagnostic imaging
ABSTRACT
While deep learning has driven recent improvements in audio speaker diarization, it often faces performance issues in challenging interaction scenarios and varied acoustic settings such as between a child and adult (caregiver/examiner). In this work, the role of contextual factors that affect diarization performance in such interactions is analyzed. Factors that affect each type of diarization error are identified. Furthermore, a DNN is trained on diarization outputs in conjunction with the factors to improve diarization performance. The results demonstrate the usefulness of incorporating context in improving diarization performance of child-adult interactions in clinical settings.
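Diarization performance in such analyses is typically summarized by the diarization error rate (DER), whose components (false alarm, missed speech, speaker confusion) are exactly the error types whose contextual drivers are studied here. A simplified frame-level version is sketched below (an illustration; standard scoring tools operate on time-weighted segments and apply an unscored collar):

```python
def frame_der(ref, hyp):
    # Simplified frame-level DER over aligned label sequences:
    # (false alarm + miss + confusion) frames / reference speech frames.
    # Labels are speaker ids (e.g., 'child', 'adult') or None for silence.
    fa = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    conf = sum(1 for r, h in zip(ref, hyp)
               if r is not None and h is not None and r != h)
    speech = sum(1 for r in ref if r is not None)
    return (fa + miss + conf) / speech if speech else 0.0
```

Breaking DER into these three terms is what lets a model relate each error type to contextual factors (e.g., overlap, speaker, acoustic condition) separately.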
Subjects
Acoustics, Communication, Adult, Child, Humans
ABSTRACT
Artificial intelligence generally, and machine learning specifically, have become deeply woven into modern life and technology. Machine learning is dramatically changing scientific research and industry and may also hold promise for addressing limitations encountered in mental health care and psychotherapy. The current paper introduces machine learning and natural language processing as related methodologies that may prove valuable for automating the assessment of meaningful aspects of treatment. Prediction of therapeutic alliance from session recordings is used as a case in point. Recordings from 1,235 sessions of 386 clients seen by 40 therapists at a university counseling center were processed using automatic speech recognition software. Machine learning algorithms learned associations between client ratings of therapeutic alliance exclusively from session linguistic content. Using a portion of the data to train the model, machine learning algorithms modestly predicted alliance ratings from session content in an independent test set (Spearman's ρ = .15, p < .001). These results highlight the potential to harness natural language processing and machine learning to predict a key psychotherapy process variable that is relatively distal from linguistic content. Six practical suggestions for conducting psychotherapy research using machine learning are presented along with several directions for future research. Questions of dissemination and implementation may be particularly important to explore as machine learning improves in its ability to automate assessment of psychotherapy process and outcome.
Subjects
Biomedical Research/methods, Machine Learning, Mental Disorders/therapy, Natural Language Processing, Psychotherapy/methods, Therapeutic Alliance, Adolescent, Adult, Biomedical Research/trends, Counseling/methods, Counseling/trends, Female, Humans, Machine Learning/trends, Male, Mental Disorders/psychology, Professional-Patient Relations, Psychotherapeutic Processes, Psychotherapy/trends, Universities/trends, Young Adult
ABSTRACT
OBJECTIVE: Close interpersonal relationships are fundamental to emotion regulation. Clinical theory suggests that one role of therapists in psychotherapy is to help clients regulate emotions; however, whether and how clients and therapists serve to regulate each other's emotions has not been empirically tested. Emotion coregulation - the bidirectional emotional linkage of two people that promotes emotional stability - is a specific, temporal process that provides a framework for testing the way in which therapists' and clients' emotions may be related on a moment-to-moment basis in clinically relevant ways. METHOD: Utilizing 227 audio recordings from a relationally oriented treatment (Motivational Interviewing), we estimated continuous values of vocally encoded emotional arousal via mean fundamental frequency. We used dynamic systems models to examine emotional coregulation, and tested the hypothesis that each individual's emotional arousal would be significantly associated with fluctuations in the other's emotional state over the course of a psychotherapy session. RESULTS: Results indicated that when clients became more emotionally labile over the course of the session, therapists became less so. When changes in therapist arousal increased, the client's tendency to become more aroused during session slowed. Conversely, when changes in client arousal increased, the therapist's tendency to become less aroused slowed.
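The dynamic-systems framing can be illustrated with a toy coupled linear model (purely illustrative coefficients and time step; not the authors' fitted model), in which each person's arousal change depends on their own current level and on their partner's:

```python
def simulate(x0, y0, a_self, a_cross, b_self, b_cross, dt=0.1, steps=100):
    # Euler integration of a coupled linear system:
    #   dx/dt = a_self*x + a_cross*y   (e.g., client arousal)
    #   dy/dt = b_self*y + b_cross*x   (e.g., therapist arousal)
    # Cross terms (a_cross, b_cross) encode coregulation: each person's
    # trajectory is pulled by the other's state.
    x, y = x0, y0
    xs, ys = [x], [y]
    for _ in range(steps):
        dx = a_self * x + a_cross * y
        dy = b_self * y + b_cross * x
        x, y = x + dt * dx, y + dt * dy
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Fitting the self- and cross-coefficients to observed arousal series is what lets such models test whether one partner's fluctuations systematically speed up or slow down the other's.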
Subjects
Emotional Regulation, Emotions, Professional-Patient Relations, Psychotherapy, Arousal, Humans
ABSTRACT
PURPOSE: To improve the depiction and tracking of vocal tract articulators in spiral real-time MRI (RT-MRI) of speech production by estimating and correcting for dynamic changes in off-resonance. METHODS: The proposed method computes a dynamic field map from the phase of single-TE dynamic images after coil phase compensation, where complex coil sensitivity maps are estimated from the single-TE dynamic scan itself. This method is tested using simulations and in vivo data. The depiction of air-tissue boundaries is evaluated quantitatively using a sharpness metric and visual inspection. RESULTS: Simulations demonstrate that the proposed method provides robust off-resonance correction for spiral readout durations up to 5 ms at 1.5T. In vivo experiments during human speech production demonstrate that image sharpness is improved in a majority of data sets at air-tissue boundaries including the upper lip, hard palate, soft palate, and tongue boundaries, whereas the lower lip shows little improvement in edge sharpness after correction. CONCLUSION: Dynamic off-resonance correction is feasible from single-TE spiral RT-MRI data and provides a practical performance improvement in articulator sharpness when applied to speech production imaging.
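The core of the method, recovering a field map from the phase of a single-TE image, reduces to a per-pixel computation. A sketch under the assumption that coil phase has already been compensated (TE and off-resonance values below are illustrative): a spin at off-resonance Δf accrues phase 2π·Δf·TE by the echo time, so Δf = ∠(pixel)/(2π·TE).

```python
import cmath
import math

def field_map_hz(pixel, te_s):
    # Off-resonance (Hz) from the phase of one complex image pixel at
    # echo time te_s, assuming coil phase has been removed. Because the
    # phase wraps at ±pi, the estimate is unambiguous only for
    # |off-resonance| < 1 / (2 * te_s).
    return cmath.phase(pixel) / (2.0 * math.pi * te_s)
```

For example, at TE = 2 ms a 100 Hz off-resonance produces a phase of 0.4π rad, and the unambiguous range is ±250 Hz, which is why short-TE acquisitions are attractive for dynamic field mapping.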
Subjects
Magnetic Resonance Imaging, Mouth/diagnostic imaging, Soft Palate/physiology, Pharynx/physiology, Computer-Assisted Signal Processing, Speech/physiology, Algorithms, Computer Simulation, Healthy Volunteers, Humans, Computer-Assisted Image Processing/methods, Reproducibility of Results, Tongue/physiology
ABSTRACT
PURPOSE: To demonstrate a tagging method compatible with RT-MRI for the study of speech production. METHODS: Tagging is applied as a brief interruption to a continuous real-time spiral acquisition. Tagging can be initiated manually by the operator, cued to the speech stimulus, or applied automatically with a fixed frequency. We use a standard 2D 1-3-3-1 binomial SPAtial Modulation of Magnetization (SPAMM) sequence with 1 cm spacing in both in-plane directions. Tag persistence in tongue muscle is simulated and validated in vivo. The ability to capture internal tongue deformations is tested during speech production of American English diphthongs in native speakers. RESULTS: We achieved an imaging window of 650-800 ms at 1.5T, with imaging signal-to-noise ratio ≥ 17 and tag contrast-to-noise ratio ≥ 5 in human tongue, providing 36 frames/s temporal resolution and 2 mm in-plane spatial resolution with real-time interactive acquisition and view-sharing reconstruction. The proposed method was able to capture tongue motion patterns and their relative timing with adequate spatiotemporal resolution during the production of American English diphthongs and consonants. CONCLUSION: Intermittent tagging during real-time MRI of speech production is able to reveal the internal deformations of the tongue. This capability will allow new investigations of valuable spatiotemporal information on the biomechanics of the lingual subsystems during speech without reliance on binning repeated speech utterances.
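In the small-tip regime, a 1-3-3-1 binomial SPAMM train produces a cos³-shaped tag modulation across each tag period. A sketch of that profile (idealized small-tip approximation, ignoring relaxation during the train): subpulse k sees phase k·φ from the intervening gradient blips, so the normalized excitation profile is |Σ_k w_k e^{ikφ}| / Σ_k w_k, which for weights (1, 3, 3, 1) equals |(1 + e^{iφ})³|/8 = |cos(φ/2)|³.

```python
import cmath

def spamm_tag_profile(phi, weights=(1, 3, 3, 1)):
    # Normalized small-tip tag modulation of a binomial SPAMM train.
    # phi is the gradient-induced phase per subpulse interval at a given
    # position; one full tag period corresponds to phi sweeping 0..2*pi.
    total = sum(weights)
    s = sum(w * cmath.exp(1j * k * phi) for k, w in enumerate(weights))
    return abs(s) / total
```

The binomial weighting is what gives the smooth nulls (profile = 0 at φ = π, the tag-line centers) that make the dark tag lines trackable as the tongue deforms.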