ABSTRACT
This work develops and evaluates a self-navigated variable density spiral (VDS)-based manifold regularization scheme to prospectively improve dynamic speech magnetic resonance imaging (MRI) at 3 T. Spirals with a short readout duration (1.3 ms) were used to minimize sensitivity to off-resonance. A custom 16-channel speech coil was used for improved parallel imaging of vocal tract structures. The manifold model leveraged similarities between frames sharing similar vocal tract postures without explicit motion binning. The self-navigating capability of VDS was leveraged to learn the Laplacian structure of the manifold. Reconstruction was posed as a sensitivity-encoding-based nonlocal soft-weighted temporal regularization scheme. Our approach was compared with view-sharing, low-rank, temporal finite difference, and extra dimension-based sparsity reconstruction constraints. Undersampling experiments were conducted on five volunteers performing repetitive and arbitrary speaking tasks at different speaking rates. Quantitative evaluation in terms of mean square error over moving edges was performed in a retrospective undersampling experiment on one volunteer. For prospective undersampling, blinded image quality evaluation in the categories of alias artifacts, spatial blurring, and temporal blurring was performed by three experts in voice research. Region of interest analysis at articulator boundaries was performed in both experiments to assess articulatory motion. Improved performance with manifold reconstruction constraints was observed over existing constraints. With prospective undersampling, a spatial resolution of 2.4 × 2.4 mm2/pixel and a temporal resolution of 17.4 ms/frame for single-slice imaging, and 52.2 ms/frame for concurrent three-slice imaging, were achieved. We demonstrated implicit motion binning by analyzing the mechanics of the Laplacian matrix.
Manifold regularization demonstrated superior image quality scores in reducing spatial and temporal blurring compared with all other reconstruction constraints. While it exhibited faint (nonsignificant) alias artifacts that were similar to temporal finite difference, it provided statistically significant improvements compared with the other constraints. In conclusion, the self-navigated manifold regularized scheme enabled robust high spatiotemporal resolution dynamic speech MRI at 3 T.
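The central ingredient of such a manifold scheme is a graph Laplacian learned from per-frame navigator data. The sketch below is illustrative only: the function name, Gaussian kernel, and k-nearest-neighbor truncation are assumptions for demonstration, not the paper's exact algorithm.

```python
import numpy as np

def frame_laplacian(navigators, sigma=1.0, k=3):
    """Graph Laplacian over dynamic frames from navigator signatures.

    `navigators` is an (n_frames, n_samples) array holding one
    self-navigator signature per time frame.  Frames whose signatures
    are close (similar vocal tract postures) receive large weights, so
    a Laplacian penalty in the reconstruction ties such frames together
    without explicit motion binning.
    """
    n = navigators.shape[0]
    # pairwise squared distances between navigator signatures
    d2 = np.sum((navigators[:, None, :] - navigators[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest neighbors of each frame (soft binning)
    for i in range(n):
        W[i, np.argsort(W[i])[: n - k]] = 0.0
    W = 0.5 * (W + W.T)                      # symmetrize
    return np.diag(W.sum(axis=1)) - W        # L = D - W
```

The rows of the resulting Laplacian sum to zero, and the matrix is symmetric positive semidefinite, as required for use as a quadratic temporal regularizer.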
Subject(s)
Magnetic Resonance Imaging , Speech , Humans , Speech/physiology , Algorithms , Male , Prospective Studies , Adult , Female
ABSTRACT
The purpose of this study was to determine whether the threshold of velopharyngeal (VP) coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English was different for speech produced by a model based on an adult male, an adult female, and a 4-year-old child. V1CV2 stimuli were generated with a speech production model that encodes phonetic segments as relative acoustic targets imposed on an underlying vocal tract and laryngeal structure that can be scaled according to sex and age. Each V1CV2 was synthesized with a set of VP coupling functions whose maximum area ranged from 0 to 0.1 cm2. Results showed that scaling the vocal tract and vocal folds had essentially no effect on the VP coupling area at which listener identification shifted from stop to nasal. The range of coupling areas at which the crossover occurred was 0.037-0.049 cm2 for the male model, 0.040-0.055 cm2 for the female model, and 0.039-0.052 cm2 for the 4-year-old child model; the overall mean was 0.044 cm2. Calculations of band-limited peak nasalance indicated that 85% peak nasalance during the consonant was well aligned with listener responses.
Subject(s)
Larynx , Speech , Adult , Female , Male , Humans , Child, Preschool , Acoustics , Language , Nose
ABSTRACT
A well-known concept of singing voice pedagogy is "formant tuning," where the lowest two vocal tract resonances (fR1, fR2) are systematically tuned to harmonics of the laryngeal voice source to maximize the level of radiated sound. A comprehensive evaluation of this resonance tuning concept is still needed. Here, the effect of fR1, fR2 variation was systematically evaluated in silico across the entire fundamental frequency range of classical singing for three voice source characteristics with spectral slopes of -6, -12, and -18 dB/octave. Respective vocal tract transfer functions were generated with a previously introduced low-dimensional computational model, and resultant radiated sound levels were expressed in dB(A). Two distinct strategies for optimized sound output emerged for low vs high voices. At low pitches, spectral slope was the predominant factor for sound level increase, and resonance tuning only had a marginal effect. In contrast, resonance tuning strategies became more prevalent and voice source strength played an increasingly marginal role as fundamental frequency increased to the upper limits of the soprano range. This suggests that different voice classes (e.g., low male vs high female) likely have fundamentally different strategies for optimizing sound output, which has fundamental implications for pedagogical practice.
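Expressing radiated levels in dB(A) rests on the standard A-weighting curve, which can be evaluated in closed form. The sketch below is a generic textbook implementation (function names are illustrative), not the study's computational model:

```python
import numpy as np

def a_weight_db(f):
    """IEC 61672-style A-weighting in dB at frequency f (Hz)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / ((f2 + 20.6 ** 2)
         * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
         * (f2 + 12194.0 ** 2))
    # the +2.00 dB offset normalizes the curve to ~0 dB at 1 kHz
    return 20.0 * np.log10(ra) + 2.00

def level_dba(harmonic_freqs, harmonic_levels_db):
    """Overall dB(A) level of a set of harmonics given in dB."""
    weighted = np.asarray(harmonic_levels_db) + a_weight_db(harmonic_freqs)
    return 10.0 * np.log10(np.sum(10.0 ** (weighted / 10.0)))
```

Because A-weighting strongly attenuates low frequencies, boosting harmonics that fall near a resonance in the 1-3 kHz region raises the dB(A) level far more than boosting the fundamental, which is one reason resonance tuning pays off at high pitches.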
Subject(s)
Singing , Voice , Male , Female , Humans , Computer Simulation , Sound , Vibration
ABSTRACT
The harmonics-to-noise ratio (HNR) and other spectral noise parameters are important in clinical objective voice assessment, as they can indicate the presence of nonharmonic phenomena, which are tied to the perception of hoarseness or breathiness. Existing HNR estimators assume that voice signals are nearly periodic (fixed over a short interval), although voice pathology can induce involuntary slow modulation that violates this assumption. This paper proposes the use of a deterministically time-varying harmonic model to improve HNR measurement. To estimate the time-varying model, a two-stage iterative least squares algorithm is proposed to reduce model overfitting. The efficacy of the proposed HNR estimator is demonstrated with synthetic signals, simulated tremor signals, and recorded acoustic signals. Results indicate that the proposed algorithm produces consistent HNR measures as the extent and rate of tremor are varied.
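The idea of absorbing slow modulation into the harmonic model rather than into the noise term can be illustrated with a single least-squares fit. This is a minimal sketch under assumed names and parameters; it does not reproduce the paper's two-stage iterative algorithm or its overfitting controls:

```python
import numpy as np

def hnr_time_varying(x, fs, f0, n_harm=5, order=1):
    """HNR (dB) from a deterministically time-varying harmonic model.

    Each harmonic's cosine/sine amplitudes are allowed to vary as a
    degree-`order` polynomial in time, so slow modulations (e.g.
    tremor) are captured by the harmonic part instead of being counted
    as noise, which would bias the HNR downward.
    """
    n = len(x)
    t = np.arange(n) / fs
    cols = []
    for h in range(1, n_harm + 1):
        for p in range(order + 1):
            cols.append((t ** p) * np.cos(2 * np.pi * h * f0 * t))
            cols.append((t ** p) * np.sin(2 * np.pi * h * f0 * t))
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    harm = A @ coef                 # modeled harmonic component
    noise = x - harm                # residual = noise estimate
    return 10.0 * np.log10(np.sum(harm ** 2) / np.sum(noise ** 2))
```

With `order=0` this reduces to a conventional fixed-amplitude harmonic fit; increasing `order` trades bias against the overfitting the paper's two-stage procedure is designed to control.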
Subject(s)
Tremor , Voice , Acoustics , Humans , Noise , Speech Acoustics
ABSTRACT
We agree with Cristina Romani (CR) about reducing confusion and agree that the issues raised in her commentary are central to the study of apraxia of speech (AOS). However, CR critiques our approach from the perspective of basic cognitive neuropsychology. This is confusing and misleading because, contrary to CR's claim, we did not attempt to inform models of typical speech production. Instead, we relied on such models to study the impairment in the clinical category of AOS (translational cognitive neuropsychology). Thus, the approach along with the underlying assumptions is different. This response aims to clarify these assumptions, broaden the discussion regarding the methodological approach, and address CR's concerns. We argue that our approach is well-suited to meet the goals of our recent studies and is commensurate with the current state of the science of AOS. Ultimately, a plurality of approaches is needed to understand a phenomenon as complex as AOS.
Subject(s)
Aphasia , Apraxias , Aphasia/complications , Apraxias/etiology , Confusion/complications , Female , Humans , Speech , Speech Disorders , Speech Production Measurement
ABSTRACT
This study investigated the underlying nature of apraxia of speech (AOS) by testing two competing hypotheses. The Reduced Buffer Capacity Hypothesis argues that people with AOS can plan speech only one syllable at a time (Rogers & Storkel, 1999. Planning speech one syllable at a time: The reduced buffer capacity hypothesis in apraxia of speech. Aphasiology, 13(9-11), 793-805. https://doi.org/10.1080/026870399401885). The Program Retrieval Deficit Hypothesis states that selecting a motor programme is difficult in the face of competition from other simultaneously activated programmes (Mailend & Maas, 2013. Speech motor programming in apraxia of speech: Evidence from a delayed picture-word interference task. American Journal of Speech-Language Pathology, 22(2), S380-S396. https://doi.org/10.1044/1058-0360(2013/12-0101)). Speakers with AOS and aphasia, speakers with aphasia without AOS, and unimpaired controls were asked to prepare and hold a two-word utterance until a go-signal prompted a spoken response. Phonetic similarity between the target words was manipulated. Speakers with AOS had longer reaction times in conditions with two similar words compared to two identical words. The Control and Aphasia groups did not show this effect. These results suggest that speakers with AOS need additional processing time to retrieve target words when multiple motor programmes are simultaneously activated.
Subject(s)
Aphasia/physiopathology , Apraxias/physiopathology , Phonetics , Speech Disorders/physiopathology , Speech , Adult , Aged , Female , Humans , Male , Middle Aged , Reaction Time , Speech Production Measurement/methods
ABSTRACT
The purpose of this study was to determine the threshold of velopharyngeal coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English, based on V1CV2 stimuli generated with a speech production model that encodes phonetic segments as relative acoustic targets. Each V1CV2 was synthesized with a set of velopharyngeal coupling functions whose area ranged from 0 to 0.1 cm2. Results showed that consonants were identified by listeners as a stop when the coupling area was less than 0.035-0.057 cm2, depending on place of articulation and final vowel. The smallest coupling area (0.035 cm2) at which the stop-to-nasal switch occurred was found for an alveolar consonant in the /ɑCi/ context, whereas the largest (0.057 cm2) was for a bilabial in /ɑCɑ/. For each stimulus, the balance of oral versus nasal acoustic energy was characterized by the peak nasalance during the consonant. Stimuli with peak nasalance below 40% were mostly identified by listeners as stops, whereas those above 40% were identified as nasals. This study was intended to be a precursor to further investigations using the same model but scaled to represent the developing speech production system of male and female talkers.
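Nasalance has a simple definition: the percentage of acoustic energy in the nasal channel relative to the total (nasal plus oral). A minimal framewise sketch is shown below, assuming separate band-limited oral and nasal channel signals; the function name and framing parameters are illustrative, not the study's exact procedure:

```python
import numpy as np

def peak_nasalance(oral, nasal, fs, frame_ms=10.0):
    """Peak nasalance (%) over short, non-overlapping frames.

    nasalance = 100 * nasal_energy / (nasal_energy + oral_energy),
    computed per frame; the maximum over frames is returned.  Any
    band-limiting of the two channels is assumed done upstream.
    """
    n = int(fs * frame_ms / 1000.0)
    peak = 0.0
    for i in range(0, min(len(oral), len(nasal)) - n + 1, n):
        eo = np.sum(oral[i:i + n] ** 2)    # oral-channel frame energy
        en = np.sum(nasal[i:i + n] ** 2)   # nasal-channel frame energy
        if eo + en > 0:
            peak = max(peak, 100.0 * en / (eo + en))
    return peak
```

Under the study's finding, a stimulus whose peak nasalance stays below about 40% during the consonant would be expected to be heard as a stop, and one above 40% as a nasal.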
Subject(s)
Speech Perception , Speech , Female , Humans , Male , North America , Phonetics , Speech Production Measurement
ABSTRACT
In recent studies, it has been assumed that vocal tract formants (Fn) and the voice source could interact. However, only a few studies have analyzed this assumption in vivo. Here, the vowel transitions /i/-/a/-/u/-/i/ of 12 professional classical singers (6 females, 6 males) phonating on the pitch D4 [fundamental frequency (ƒo) ca. 294 Hz] were analyzed using transnasal high-speed videoendoscopy (20,000 fps), electroglottography (EGG), and audio recordings. Fn data were calculated using a cepstral method. Source-filter interaction candidates (SFICs) were determined by (a) algorithmic detection of major intersections of Fn/nƒo and (b) perceptual assessment of the EGG signal. Although the open quotient showed some increase for the /i-a/ and /u-i/ transitions, there were no clear effects at the expected Fn/nƒo intersections. In contrast, ƒo adjustments and changes in the phonovibrogram occurred at perceptually derived SFICs, suggesting level-two interactions. In some cases, these were constituted by intersections between higher nƒo and Fn. The presented data partially corroborate that vowel transitions may result in level-two interactions in professional singers as well. However, the lack of systematically detectable effects suggests either the absence of a strong interaction or the existence of confounding factors that may counterbalance the level-two interactions.
Subject(s)
Singing , Voice , Female , Humans , Male , Occupations , Phonation , Voice Quality
ABSTRACT
The purpose of this study was to assess the effect of downsampling the acoustic signal on the accuracy of linear-predictive (LPC) formant estimation. Based on speech produced by men, women, and children, the first four formant frequencies were estimated at sampling rates of 48, 16, and 10 kHz using different anti-alias filtering. With proper selection of the number of LPC coefficients, anti-alias filter, and between-frame averaging, results suggest that accuracy is not improved by rates substantially below 48 kHz. Any downsampling should not go below 16 kHz, with a filter cut-off centered at 8 kHz.
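The estimation method under test is classical autocorrelation LPC: fit an all-pole model, then read formants from the pole angles. The sketch below is a generic textbook implementation, with an assumed order rule of thumb (2 + fs/1000) to show how the coefficient count scales with sampling rate; it is not the study's exact pipeline:

```python
import numpy as np

def lpc_formants(x, fs, order=None):
    """Formant estimates (Hz) via autocorrelation LPC.

    Poles of the all-pole model with positive frequency, adequate
    distance from DC/Nyquist, and bandwidth under 400 Hz are returned
    in ascending order as formant candidates.
    """
    if order is None:
        order = int(2 + fs / 1000)          # rule-of-thumb model order
    x = x * np.hamming(len(x))              # taper to reduce edge effects
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # autocorrelation normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += np.eye(order) * 1e-9 * r[0]        # small loading for stability
    a = np.linalg.solve(R, r[1:order + 1])
    poles = np.roots(np.concatenate(([1.0], -a)))
    freqs = []
    for p in poles:
        f = np.angle(p) * fs / (2 * np.pi)
        bw = -np.log(np.abs(p)) * fs / np.pi
        if 90 < f < fs / 2 - 50 and bw < 400:
            freqs.append(float(f))
    return sorted(freqs)
```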
Subject(s)
Acoustics , Speech , Child , Female , Humans , Male , Speech Acoustics
ABSTRACT
A model is described in which the effects of articulatory movements to produce speech are generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies that, when associated with a temporal event function, are transformed, via acoustic sensitivity functions, into time-varying modulations of the vocal tract shape. Because the time courses of the events may overlap considerably, coarticulatory effects are automatically generated. Production of sentence-level speech with the model is demonstrated with audio samples and vocal tract animations.
Subject(s)
Models, Biological , Speech Production Measurement , Speech/physiology , Acoustics , Humans , Jaw/physiology , Larynx/physiology , Lip/physiology , Male , Tongue/physiology
ABSTRACT
The purpose of this study was to take a first step toward constructing a developmental and sex-specific version of a parametric vocal tract area function model representative of male and female vocal tracts ranging in age from infancy to 12 years, as well as adults. Anatomic measurements collected from a large imaging database of male and female children and adults provided the dataset from which length warping and cross-dimension scaling functions were derived and applied to the adult-based vocal tract model to project it backward along an age continuum. The resulting model was assessed qualitatively by projecting hypothetical vocal tract shapes onto midsagittal images from the cohort of children, and quantitatively by comparison of formant frequencies produced by the model to those reported in the literature. An additional validation of modeled vocal tract shapes was made possible by comparison to cross-sectional area measurements obtained for children and adults using acoustic pharyngometry. This initial attempt to generate a sex-specific developmental vocal tract model paves a path to study the relation of vocal tract dimensions to documented prepubertal acoustic differences.
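Computing formant frequencies from a vocal tract area function, as done in the quantitative assessment above, is commonly handled with a lossless transmission-line (chain-matrix) evaluation of concatenated tube sections. The sketch below is a generic textbook version under simplifying assumptions (lossless tubes, ideal open-lip termination); it is not the paper's model:

```python
import numpy as np

def tube_formants(areas, total_len=0.175, c=350.0, fmax=5000.0, df=2.0):
    """Resonances (Hz) of a concatenated-tube area function.

    Each section contributes a 2x2 chain matrix; with an ideal open
    termination the volume-velocity transfer is 1/D (D = M[1,1]), and
    its peaks over a frequency grid are taken as the formants.
    """
    dl = total_len / len(areas)
    freqs = np.arange(df, fmax, df)
    mag = np.empty(len(freqs))
    for idx, f in enumerate(freqs):
        k = 2 * np.pi * f / c
        M = np.eye(2, dtype=complex)
        for A in areas:
            Z = 1.0 / A      # ~rho*c/A; the constant factor cancels here
            M = M @ np.array([[np.cos(k * dl), 1j * Z * np.sin(k * dl)],
                              [1j * np.sin(k * dl) / Z, np.cos(k * dl)]])
        mag[idx] = 1.0 / max(abs(M[1, 1]), 1e-12)
    return [float(freqs[i]) for i in range(1, len(mag) - 1)
            if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]
```

For a uniform tube of 17.5 cm this reduces to the familiar odd quarter-wavelength resonances near 500, 1500, 2500 Hz, which makes the routine easy to validate before applying it to warped, age-scaled area functions.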
Subject(s)
Child Development/physiology , Sex Characteristics , Speech/physiology , Vocal Cords/anatomy & histology , Vocal Cords/physiology , Adult , Age Factors , Child , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Male , Sex Factors , Vocal Cords/diagnostic imaging
ABSTRACT
The purpose of this study was to develop a method for visualizing and assessing the characteristics of vowel production by measuring the local density of normalized F1 and F2 formant frequencies. The result is a three-dimensional plot, called the vowel space density (VSD), that indicates the regions in the vowel space most heavily used by a talker during speech production. The area of a convex hull enclosing the vowel space at specific threshold density values was proposed as a means of quantifying the VSD.
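The quantification step can be sketched directly: histogram the (F1, F2) samples, keep cells above a density threshold, and measure the area of their convex hull. The implementation below is illustrative only (histogram binning, threshold, and function names are assumptions, and the study's normalization and density estimation details are not reproduced):

```python
import numpy as np

def _half_hull(points):
    """One chain of Andrew's monotone-chain convex hull."""
    h = []
    for p in points:
        while len(h) >= 2 and ((h[-1][0] - h[-2][0]) * (p[1] - h[-2][1])
                               - (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0])) <= 0:
            h.pop()
        h.append(p)
    return h

def vsd_area(f1, f2, bins=20, threshold=2):
    """Convex-hull area of densely used vowel-space cells.

    A 2-D histogram approximates the vowel space density; cells with
    counts above `threshold` are kept, and the hull area of their
    centers quantifies the actively used vowel space.
    """
    H, xe, ye = np.histogram2d(f1, f2, bins=bins)
    xi, yi = np.nonzero(H > threshold)
    pts = np.column_stack([(xe[xi] + xe[xi + 1]) / 2,
                           (ye[yi] + ye[yi + 1]) / 2])
    if len(pts) < 3:
        return 0.0
    pts = pts[np.lexsort((pts[:, 1], pts[:, 0]))]
    hull = np.array(_half_hull(pts)[:-1] + _half_hull(pts[::-1])[:-1])
    x, y = hull[:, 0], hull[:, 1]
    # shoelace formula for polygon area
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```

Raising `threshold` shrinks the hull toward the most habitual vowel targets, which is the tunable aspect of the proposed measure.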
Subject(s)
Acoustics , Phonetics , Speech Acoustics , Speech Production Measurement/methods , Voice Quality , Humans , Signal Processing, Computer-Assisted , Sound Spectrography
ABSTRACT
The purpose of this study was to further develop a multi-tier model of the vocal tract area function in which the modulations of shape to produce speech are generated by the product of a vowel substrate and a consonant superposition function. The new approach consists of specifying input parameters for a target consonant as a set of directional changes in the resonance frequencies of the vowel substrate. Using calculations of acoustic sensitivity functions, these "resonance deflection patterns" are transformed into time-varying deformations of the vocal tract shape without any direct specification of the location or extent of the consonant constriction along the vocal tract. The constrictions and expansions generated by this process were shown to be physiologically realistic and to produce speech sounds that are easily identifiable as the target consonants. This model is a useful enhancement for area function-based synthesis and can serve as a tool for understanding how the vocal tract is shaped by a talker during speech production.
ABSTRACT
The purpose of this study was to investigate the effects of physiological adjustments on listeners' perception of the magnitude of modulation of voice and to determine the characteristics of the acoustical modulations that explained listeners' judgments. This research was carried out using singers producing vibrato as a model of vocal tremor. Twenty healthy adults participated in a perceptual study involving pair-comparisons of the magnitude of "shakiness" with singers' samples, which differed by fundamental frequency, vocal quality, and vowel. Results revealed that listeners perceived a higher magnitude of voice modulation when female samples had a pressed vocal quality. Acoustical analyses were performed with voice samples to determine the features that predicted listeners' judgments. Based on regression analyses, listeners' judgments were predicted to some extent by modulation information in frequency bands across the spectrum.
Subject(s)
Tremor , Adult , Female , Humans , Judgment , Male , Singing , Voice , Voice Quality , Young Adult
ABSTRACT
OBJECTIVE: The goal of the Arizona Child Acoustic Database project was to obtain a large set of acoustic recordings, primarily vowels, collected from a cohort of children over a critical period of growth and development. METHOD: Data were recorded longitudinally from 63 children between the ages of 2;0 and 7;0 at 3-month intervals. The protocol included individual American English vowels and diphthongs, nonsense multi-vowel transitions, word-level multi-vowel sequences (e.g., Hawaii), single-syllable words targeting each American English vowel, short sentences, and conversation. RESULTS: Acoustic files are available for download through the University of Arizona Library Repository for use in future research projects. CONCLUSION: Longitudinal recordings may be of interest because they allow tracking of acoustic characteristics produced by an individual child during a period of rapid growth and speech development.
Subject(s)
Databases, Factual , Speech Acoustics , Acoustics , Arizona , Child , Child, Preschool , Communication , Female , Humans , Language , Male , Phonetics , Speech Perception
ABSTRACT
The purpose of this study was to determine if adjustments to the voice source [i.e., fundamental frequency (F0), degree of vocal fold adduction] or vocal tract filter (i.e., vocal tract shape for vowels) reduce the perception of simulated laryngeal vocal tremor and to determine if listener perception could be explained by characteristics of the acoustical modulations. This research was carried out using a computational model of speech production that allowed for precise control and manipulation of the glottal and vocal tract configurations. Forty-two healthy adults participated in a perceptual study involving pair-comparisons of the magnitude of "shakiness" with simulated samples of laryngeal vocal tremor. Results revealed that listeners perceived a higher magnitude of voice modulation when simulated samples had a higher mean F0, greater degree of vocal fold adduction, and vocal tract shape for /i/ vs /ɑ/. However, the effect of F0 was significant only when glottal noise was not present in the acoustic signal. Acoustical analyses were performed with the simulated samples to determine the features that affected listeners' judgments. Based on regression analyses, listeners' judgments were predicted to some extent by modulation information present in both low and high frequency bands.
Subject(s)
Speech Disorders/physiopathology , Speech Perception/physiology , Tremor/physiopathology , Voice Quality/physiology , Acoustic Stimulation , Adolescent , Adult , Biomechanical Phenomena , Computer Simulation , Female , Glottis/physiopathology , Humans , Judgment , Laryngeal Muscles/physiopathology , Male , Middle Aged , Observer Variation , Phonetics , Psychoacoustics , Speech Acoustics , Vocal Cords/physiopathology , Young Adult
ABSTRACT
Children's speech presents a challenging problem for formant frequency measurement. In part, this is because the high fundamental frequencies typical of children's speech production generate widely spaced harmonic components that may undersample the spectral shape of the vocal tract transfer function. In addition, there is often a weakening of upper harmonic energy and a noise component due to glottal turbulence. The purpose of this study was to develop a formant measurement technique based on cepstral analysis that does not require modification of the cepstrum itself or transformation back to the spectral domain. Instead, a narrow-band spectrum is low-pass filtered with a cutoff point (i.e., cutoff "quefrency" in the terminology of cepstral analysis) to preserve only the spectral envelope. To test the method, speech representative of a 2-3 year-old child was simulated with an airway modulation model of speech production. The model, which includes physiologically-scaled vocal folds and vocal tract, generates sound output analogous to a microphone signal. The vocal tract resonance frequencies can be calculated independently of the output signal and thus provide test cases that allow for assessing the accuracy of the formant tracking algorithm. When applied to the simulated child-like speech, the spectral filtering approach was shown to provide a clear spectrographic representation of formant change over the time course of the signal and to facilitate tracking of formant frequencies for further analysis.
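The general idea of low-pass filtering a narrow-band spectrum at a cutoff quefrency can be sketched in a few lines. This is an illustrative reduction, not the paper's implementation; the cutoff value, windowing, and function names are assumptions:

```python
import numpy as np

def spectral_envelope(x, fs, cutoff_quef=0.0015):
    """Spectral envelope via low-pass filtering of the log spectrum.

    The log magnitude spectrum is treated as a signal and Fourier
    transformed; component k ripples with period ~(fs/2)/k Hz, i.e.
    quefrency 2k/fs seconds.  Zeroing components above `cutoff_quef`
    removes the fast harmonic ripple and keeps the slow envelope,
    from which formant peaks can be picked.
    """
    n = len(x)
    spec = np.log(np.abs(np.fft.rfft(x * np.hanning(n))) + 1e-12)
    ceps = np.fft.rfft(spec)
    k_cut = max(2, int(cutoff_quef * fs / 2))
    ceps[k_cut:] = 0.0                       # discard fast spectral ripple
    env = np.fft.irfft(ceps, len(spec))
    freqs = np.arange(len(spec)) * fs / n
    return freqs, env
```

For high-pitched (e.g., child) voices the harmonic ripple sits at quefrency 1/f0, so the cutoff can be placed well below it while still retaining the formant-scale envelope variation.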
ABSTRACT
Previous work has shown that human listeners are sensitive to level differences in high-frequency energy (HFE) in isolated vowel sounds produced by male singers. Results indicated that sensitivity to HFE level changes increased with overall HFE level, suggesting that listeners would be more "tuned" to HFE in vocal production exhibiting higher levels of HFE. It follows that sensitivity to HFE level changes should be higher (1) for female vocal production than for male vocal production and (2) for singing than for speech. To test this hypothesis, difference limens for HFE level changes in male and female speech and singing were obtained. Listeners showed significantly greater ability to detect level changes in singing vs speech but not in female vs male speech. Mean difference limen scores for speech and singing were about 5 dB in the 8-kHz octave (5.6-11.3 kHz) but 8-10 dB in the 16-kHz octave (11.3-22 kHz). These scores are lower (better) than those previously reported for isolated vowels and some musical instruments.
Subject(s)
Pitch Discrimination , Singing , Speech Acoustics , Speech Perception , Voice Quality , Acoustic Stimulation , Adult , Audiometry, Speech , Female , Humans , Male , Psychoacoustics , Sex Factors , Sound Spectrography , Young Adult
ABSTRACT
All-pole modeling is a widely used formant estimation method, but its performance is known to deteriorate for high-pitched voices. In order to address this problem, several all-pole modeling methods robust to fundamental frequency have been proposed. This study compares five such previously known methods and introduces a new technique, Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME). WLP-AME utilizes temporally weighted linear prediction (LP) in which the square of the prediction error is multiplied by a given parametric weighting function. The weighting downgrades the contribution of the main excitation of the vocal tract in optimizing the filter coefficients. Consequently, the resulting all-pole model is affected more by the characteristics of the vocal tract, leading to less biased formant estimates. By using synthetic vowels created with a physical modeling approach, the results showed that WLP-AME yields improved formant frequencies for high-pitched sounds in comparison to the previously known methods (e.g., relative error in the first formant of the vowel [a] decreased from 11% to 3% when conventional LP was replaced with WLP-AME). Experiments conducted on natural vowels indicate that the formants detected by WLP-AME changed in a more regular manner between repetitions at different pitches than those computed by conventional LP.
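The core of any weighted LP method is a weighted least-squares criterion: minimize the sum of w[n]·e[n]² rather than e[n]². The sketch below implements that generic criterion with a caller-supplied weight function; it does not reproduce the paper's specific AME weight design, and all names are illustrative:

```python
import numpy as np

def wlp(x, order, w):
    """Weighted linear prediction coefficients A(z) = [1, -a1, ..., -ap].

    Minimizes sum_n w[n] * (x[n] - sum_k a_k x[n-k])^2 over n from
    `order` to len(x)-1 (covariance-style normal equations).  Choosing
    w to attenuate samples near the main glottal excitation, as in
    WLP-AME, makes the fit reflect the vocal tract more than the source.
    """
    n = len(x)
    C = np.zeros((order + 1, order + 1))
    for i in range(order + 1):
        for j in range(order + 1):
            # weighted correlation between lagged copies of the signal
            C[i, j] = np.sum(w[order:n] * x[order - i:n - i] * x[order - j:n - j])
    a = np.linalg.solve(C[1:, 1:], C[1:, 0])
    return np.concatenate(([1.0], -a))
```

With uniform weights this reduces to the ordinary covariance method, which makes it straightforward to validate against a known autoregressive process before experimenting with excitation-attenuating weights.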
Subject(s)
Glottis/physiology , Linear Models , Phonation , Phonetics , Pitch Perception , Speech Acoustics , Voice Quality , Adult , Algorithms , Biomechanical Phenomena , Child, Preschool , Computer Simulation , Female , Glottis/anatomy & histology , Humans , Male , Numerical Analysis, Computer-Assisted , Pattern Recognition, Automated , Pressure , Signal Processing, Computer-Assisted , Sound Spectrography , Speech Production Measurement , Time Factors , Vocal Cords/physiology
ABSTRACT
Various authors have argued that belting is to be produced with "speech-like" sounds, with the first and second supraglottic vocal tract resonances (fR1 and fR2) at the frequencies of the vowels determined by the lyrics to be sung. Acoustically, the hallmark of belting has been identified as a dominant second harmonic, possibly enhanced by first resonance tuning (fR1≈2fo). It is not clear how both these concepts - (a) phonating with "speech-like," unmodified vowels; and (b) producing a belting sound with a dominant second harmonic, typically enhanced by fR1 - can be upheld when singing across a singer's entire musical pitch range. For instance, anecdotal reports from pedagogues suggest that vowels with a low fR1, such as [i] or [u], might have to be modified considerably (by raising fR1) in order to phonate at higher pitches. These issues were systematically addressed in silico with respect to treble singing, using a linear source-filter voice production model. The dominant harmonic of the radiated spectrum was assessed in 12,987 simulations, covering a parameter space of 37 fundamental frequencies (fo) across the musical pitch range from C3 to C6; 27 voice source spectral slope settings from -4 to -30 dB/octave; computed for 13 different IPA vowels. The results suggest that, for most unmodified vowels, the stereotypical belting sound characteristics with a dominant second harmonic can only be produced over a pitch range of about a musical fifth, centered at fo≈0.5fR1. In the [ɔ] and [ɑ] vowels, that range is extended to an octave, supported by a low second resonance. Data aggregation - considering the relative prevalence of vowels in American English - suggests that, historically, belting with fR1≈2fo was derived from speech, and that songs with an extended musical pitch range likely demand considerable vowel modification. We thus argue that - on acoustical grounds - the pedagogical commandment for belting with unmodified, "speech-like" vowels cannot always be fulfilled.
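Why a dominant second harmonic requires fo near 0.5·fR1 can be shown with a toy one-resonance stand-in for the full source-filter model used above. Everything in this sketch (single resonance, second-order magnitude response, bandwidth value, function name) is an illustrative assumption, not the paper's model:

```python
import numpy as np

def dominant_harmonic(fo, fR1, slope_db_oct, bw=80.0, n_harm=20):
    """Index of the strongest radiated harmonic in a one-resonance sketch.

    Source harmonic levels fall at `slope_db_oct` dB/octave; each is
    then shaped by a single resonance at fR1 with bandwidth `bw`
    (simple second-order magnitude response).
    """
    h = np.arange(1, n_harm + 1)
    f = h * fo
    source_db = slope_db_oct * np.log2(h)          # source spectrum in dB
    gain = 1.0 / np.sqrt((1 - (f / fR1) ** 2) ** 2
                         + (f * bw / fR1 ** 2) ** 2)
    level = source_db + 20 * np.log10(gain)        # radiated level in dB
    return int(h[np.argmax(level)])
```

With fo = fR1/2 the second harmonic lands exactly on the resonance and its boost outweighs the source slope penalty, so h2 dominates; at low pitches no single harmonic is boosted enough to overcome the slope, and the fundamental dominates instead.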