ABSTRACT
PURPOSE: This article examines cepstral/spectral analyses of sustained /ɑ/ vowels produced by speakers with hypokinetic dysarthria secondary to idiopathic Parkinson's disease (PD) before and after Lee Silverman Voice Treatment (LSVT® LOUD) and the relationship of these measures with overall voice intensity. METHODOLOGY: Nine speakers with PD were examined in a pre-/post-treatment design, with multiple daily audio recordings before and after treatment. Sustained vowels were analyzed for cepstral peak prominence (CPP), CPP standard deviation (CPP SD), low/high spectral ratio (L/H SR), and Cepstral/Spectral Index of Dysphonia (CSID) using KayPENTAX computer software. RESULTS: CPP and CPP SD increased significantly and CSID decreased significantly from pre- to post-treatment recordings, with strong effect sizes. Increased CPP indicates increased dominance of harmonics in the spectrum following LSVT. After restricting the frequency cutoff to the region above the first and second formants and below the third formant, L/H SR decreased significantly following treatment. Correlation analyses demonstrated that CPP was more strongly associated with CSID before treatment than after. CONCLUSION: In addition to increased vocal intensity following LSVT, speakers with PD exhibited improved harmonic structure and voice quality as reflected by cepstral/spectral analysis, indicating reduced dysphonia after treatment.
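The cepstral measure at the center of this study can be sketched in a few lines. Below is a minimal, illustrative Python computation of CPP in the spirit of the standard approach (real cepstrum, regression line over quefrency, CPP = cepstral peak height above the line); the KayPENTAX implementation is proprietary, so the window, search band, and regression range here are assumptions, not the software's actual parameters.

```python
import numpy as np

def cpp(frame, fs, f0min=60.0, f0max=330.0):
    """Illustrative cepstral peak prominence (dB): height of the cepstral
    peak in the plausible-f0 quefrency band above a regression line fit
    over that band (a Hillenbrand-style sketch, not the CSID algorithm)."""
    x = np.asarray(frame, float) * np.hanning(len(frame))
    log_spec = 20.0 * np.log10(np.abs(np.fft.fft(x)) + 1e-12)
    ceps = 20.0 * np.log10(np.abs(np.fft.ifft(log_spec)) + 1e-12)
    quef = np.arange(len(ceps)) / fs                  # quefrency in seconds
    lo, hi = int(fs / f0max), int(fs / f0min)         # search band in samples
    peak = lo + int(np.argmax(ceps[lo:hi]))
    slope, intercept = np.polyfit(quef[lo:hi], ceps[lo:hi], 1)
    return ceps[peak] - (slope * quef[peak] + intercept)
```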
Subject(s)
Dysarthria/therapy , Parkinson Disease/therapy , Phonation , Sound Spectrography , Voice Disorders/therapy , Voice Quality , Voice Training , Aged , Aged, 80 and over , Dysarthria/diagnosis , Female , Follow-Up Studies , Humans , Male , Middle Aged , Parkinson Disease/diagnosis , Speech Acoustics , Voice Disorders/diagnosis
ABSTRACT
Debates about neonatal imitation remain more open than Keven & Akins (K&A) imply. K&A do not recognize the primacy of the question concerning differential imitation and the links between experimental designs and more or less plausible theoretical assumptions. Moreover, they do not acknowledge previous theorizing on spontaneous behavior, the explanatory power of entrainment, and subtle connections with social cognition.
Subject(s)
Imitative Behavior , Speech , Interpersonal Relations , Social Behavior
ABSTRACT
We report on the emergence of functional flexibility in vocalizations of human infants. This vastly underappreciated capability becomes apparent when prelinguistic vocalizations express a full range of emotional content: positive, neutral, and negative. The data show that at least three types of infant vocalizations (squeals, vowel-like sounds, and growls) occur with this full range of expression by 3-4 mo of age. In contrast, infant cry and laughter, which are species-specific signals apparently homologous to vocal calls in other primates, show functional stability, with cry overwhelmingly expressing negative and laughter positive emotional states. Functional flexibility is a sine qua non in spoken language, because all words or sentences can be produced as expressions of varying emotional states and because learning conventional "meanings" requires the ability to produce sounds that are free of any predetermined function. Functional flexibility is a defining characteristic of language, and empirically it appears before syntax, word learning, and even earlier-developing features presumed to be critical to language (e.g., joint attention, syllable imitation, and canonical babbling). The appearance of functional flexibility early in the first year of human life is a critical step in the development of vocal language and may have been a critical step in the evolution of human language, preceding protosyntax and even primitive single words. Such flexible affect expression of vocalizations has not yet been reported for any nonhuman primate but, if found to occur, would suggest deep roots for functional flexibility of vocalization in our primate heritage.
Subject(s)
Child Development/physiology , Language Development , Speech/physiology , Crying/physiology , Facial Expression , Humans , Infant , Laughter/physiology , Odds Ratio , Tennessee
ABSTRACT
We investigated how neural oscillations code the hierarchical nature of stress rhythms in speech and how stress processing varies with language experience. By measuring phase synchrony of multilevel EEG-acoustic tracking and intra-brain cross-frequency coupling, we show the encoding of stress involves different neural signatures (delta rhythms = stress foot rate; theta rhythms = syllable rate), is stronger for amplitude vs. duration stress cues, and induces nested delta-theta coherence mirroring the stress-syllable hierarchy in speech. Only native English, but not Mandarin, speakers exhibited enhanced neural entrainment at central stress (2 Hz) and syllable (4 Hz) rates intrinsic to natural English. English individuals with superior cortical-stress tracking capabilities also displayed stronger neural hierarchical coherence, highlighting a nuanced interplay between internal nesting of brain rhythms and external entrainment rooted in language-specific speech rhythms. Our cross-language findings reveal that brain-speech synchronization is not purely "bottom-up" but benefits from "top-down" processing shaped by listeners' language-specific experience.
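A minimal sketch of the kind of EEG-acoustic phase synchrony quantified here, assuming a single EEG channel and the speech amplitude envelope resampled to a common rate; the band edges around the 2 Hz stress and 4 Hz syllable rates are illustrative choices, not the study's exact filter settings, and the data below are synthetic.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def plv(sig_a, sig_b, fs, band):
    """Phase-locking value between two signals within one frequency band."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    pa = np.angle(hilbert(filtfilt(b, a, sig_a)))
    pb = np.angle(hilbert(filtfilt(b, a, sig_b)))
    return np.abs(np.mean(np.exp(1j * (pa - pb))))

# Toy data: a 4 Hz "syllable-rate" envelope and a noisy, phase-lagged EEG trace
fs = 100.0
t = np.arange(0, 60, 1 / fs)
env = 1 + np.cos(2 * np.pi * 4 * t)
eeg = np.cos(2 * np.pi * 4 * t + 0.3) + 0.5 * np.random.randn(len(t))
print(plv(eeg, env, fs, (1.0, 3.0)))   # delta band, near the 2 Hz stress rate
print(plv(eeg, env, fs, (3.0, 5.0)))   # theta band, near the 4 Hz syllable rate
```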
Subject(s)
Speech Perception , Humans , Female , Male , Speech Perception/physiology , Adult , Electroencephalography , Brain/physiology , Young Adult , Speech/physiology , Language , Acoustic Stimulation
ABSTRACT
Considerable work suggests the dominant syllable rhythm of the acoustic envelope is remarkably similar across languages (~4-5 Hz) and that oscillatory brain activity tracks these quasiperiodic rhythms to facilitate speech processing. However, whether this fundamental periodicity represents a common organizing principle in both auditory and motor systems involved in speech has not been explicitly tested. To evaluate relations between entrainment in the perceptual and production domains, we measured individuals' (i) neuroacoustic tracking of the EEG to speech trains and their (ii) simultaneous and non-simultaneous productions synchronized to syllable rates between 2.5 and 8.5 Hz. Productions made without concurrent auditory presentation isolated motor speech functions more purely. We show that neural synchronization flexibly adapts to the heard stimuli in a rate-dependent manner, but that phase locking is boosted near ~4.5 Hz, the purported dominant rate of speech. Cued speech productions (which recruit sensorimotor interaction) were optimal between 2.5 and 4.5 Hz, suggesting a low-frequency constraint on motor output and/or sensorimotor integration. In contrast, "pure" motor productions (without concurrent sound cues) were most precisely generated at rates of 4.5 and 5.5 Hz, paralleling the neuroacoustic data. Correlations further revealed strong links between receptive (EEG) and production synchronization abilities; individuals with stronger auditory-perceptual entrainment better matched speech rhythms motorically. Together, our findings support an intimate link between exogenous and endogenous rhythmic processing that is optimized at 4-5 Hz in both auditory and motor systems. Parallels across modalities could result from dynamics of the speech motor system coupled with experience-dependent tuning of the perceptual system via the sensorimotor interface.
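A rough sketch of how the dominant envelope rhythm (~4-5 Hz) can be estimated from a speech waveform, assuming a Hilbert envelope, low-pass smoothing, and a Welch spectrum of the downsampled envelope; all parameter values are illustrative, and the test signal is synthetic amplitude-modulated noise.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, welch

def dominant_envelope_rate(x, fs, env_fs=100):
    """Peak of the low-frequency modulation spectrum of the amplitude
    envelope; for natural speech this typically falls near ~4-5 Hz."""
    env = np.abs(hilbert(x))                    # amplitude envelope
    b, a = butter(4, 20.0 / (fs / 2))           # keep modulations below 20 Hz
    env = filtfilt(b, a, env)[:: int(fs // env_fs)]
    f, pxx = welch(env - env.mean(), fs=env_fs, nperseg=min(len(env), 1024))
    band = (f >= 1.0) & (f <= 10.0)
    return f[band][np.argmax(pxx[band])]

fs = 16000
t = np.arange(0, 10, 1 / fs)
x = (1 + np.cos(2 * np.pi * 4.5 * t)) * np.random.randn(len(t))  # 4.5 Hz AM noise
print(dominant_envelope_rate(x, fs))            # ~4.5
```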
ABSTRACT
Surrounding context influences speech listening, resulting in dynamic shifts to category percepts. To examine its neural basis, event-related potentials (ERPs) were recorded during vowel identification with continua presented in random, forward, and backward orders to induce perceptual warping. Behaviorally, sequential order shifted individual listeners' categorical boundary relative to random delivery, revealing perceptual warping (biasing) of the heard phonetic category dependent on recent stimulus history. ERPs revealed later (~300 ms) activity localized to superior temporal and middle/inferior frontal gyri that predicted listeners' hysteresis/enhanced-contrast magnitudes. Findings demonstrate that interactions between frontotemporal brain regions govern top-down, stimulus-history effects on speech categorization.
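A hedged sketch of how a categorical boundary shift (hysteresis/enhanced contrast) can be quantified from identification data, assuming a logistic psychometric fit per presentation order; the response proportions below are invented for illustration, not study data.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def boundary(steps, prop_category_a):
    """50% crossover of a fitted logistic identification function."""
    (x0, _), _ = curve_fit(logistic, steps, prop_category_a,
                           p0=[np.mean(steps), 1.0])
    return x0

steps = np.arange(1, 8)                       # 7-step vowel continuum
forward = np.array([0.02, 0.05, 0.12, 0.35, 0.80, 0.95, 0.99])
backward = np.array([0.03, 0.10, 0.30, 0.65, 0.92, 0.97, 0.99])
print(boundary(steps, forward) - boundary(steps, backward))  # boundary shift
```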
ABSTRACT
Acoustic analysis of infant vocalizations has typically employed traditional acoustic measures drawn from adult speech acoustics, such as f0, duration, formant frequencies, amplitude, and pitch perturbation. Here an alternative and complementary method is proposed in which data-derived spectrographic features are central. One-second spectrograms of vocalizations produced by six infants recorded longitudinally between ages 3 and 11 months are analyzed using a neural network consisting of a self-organizing map and a single-layer perceptron. The self-organizing map acquires a set of holistic, data-derived spectrographic receptive fields. The single-layer perceptron receives self-organizing map activations as input and is trained to classify utterances into prelinguistic phonatory categories (squeal, vocant, or growl), identify the ages at which they were produced, and identify the individuals who produced them. Classification performance was significantly better than chance for all three classification tasks. Performance is compared to another popular architecture, the fully supervised multilayer perceptron. In addition, the network's weights and patterns of activation are explored from several angles, for example, through traditional acoustic measurements of the network's receptive fields. Results support the use of this and related tools for deriving holistic acoustic features directly from infant vocalization data and for the automatic classification of infant vocalizations.
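A toy sketch of the SOM-plus-perceptron architecture described here, using random stand-ins for the flattened spectrogram vectors; the map size, learning schedule, and inverse-distance activation are assumptions and differ from the study's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((300, 64))             # stand-ins for flattened 1-s spectrograms
y = rng.integers(0, 3, 300)           # 0 = squeal, 1 = vocant, 2 = growl

# --- Self-organizing map: units acquire holistic "receptive fields" ---
side, units = 5, 25
W = rng.random((units, X.shape[1]))
rows, cols = np.divmod(np.arange(units), side)
for t in range(2000):
    x = X[rng.integers(len(X))]
    win = np.argmin(np.linalg.norm(W - x, axis=1))    # best-matching unit
    sigma = 2.0 * np.exp(-t / 1000)                   # shrinking neighborhood
    d2 = (rows - rows[win]) ** 2 + (cols - cols[win]) ** 2
    h = np.exp(-d2 / (2 * sigma**2))[:, None]
    W += 0.1 * np.exp(-t / 1000) * h * (x - W)

# --- Single-layer perceptron trained on SOM activations ---
A = 1.0 / (1e-6 + np.linalg.norm(X[:, None, :] - W[None], axis=2))
Wout = np.zeros((units, 3))
onehot = np.eye(3)[y]
for _ in range(200):                  # softmax-regression gradient steps
    z = A @ Wout
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    Wout += 0.01 * A.T @ (onehot - p) / len(X)
print((np.argmax(A @ Wout, axis=1) == y).mean())      # training accuracy
```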
Subject(s)
Acoustics , Algorithms , Child Language , Models, Biological , Neural Networks, Computer , Phonation , Signal Processing, Computer-Assisted , Voice , Age Factors , Automation , Female , Humans , Infant , Male , Reproducibility of Results , Sound Spectrography , Time Factors
ABSTRACT
PURPOSE: This study measures the experience of spontaneous speech in everyday speaking situations. Spontaneity of speech is a novel concept developed to account for the subjective experience of speaking. Spontaneous speech is characterized by little premeditation and effortless production, and it is enjoyable and meaningful. Attention is not directed at the physical production of speech. Spontaneity is intended to be distinct from fluency so that it can be used to describe both stuttered and fluent speech. This is the first study to attempt to measure the concept of spontaneity of speech. METHOD: The experience sampling method was used with 44 people who stutter. They were surveyed five times a day for 1 week through their cell phones. They reported on their perceived spontaneity, fluency, and speaking context. RESULTS: Results indicate that spontaneity and fluency are independent, though correlated, constructs that vary with context. Importantly, an increase in spontaneity significantly decreases the adverse impact of stuttering on people's lives. Fluency did not significantly affect adverse life impact of stuttering. CONCLUSION: Findings support a theoretical construct of spontaneity that is distinct from speech fluency and that can inform our views of stuttering and approaches to stuttering treatment.
Subject(s)
Speech Perception , Stuttering , Attention , Humans , Speech , Surveys and Questionnaires
ABSTRACT
Stressful conversation is a frequently occurring stressor in our daily life. Stressors not only adversely affect our physical and mental health but also our relationships with family, friends, and coworkers. In this paper, we present a model to automatically detect stressful conversations using wearable physiological and inertial sensors. We conducted a lab and a field study with cohabiting couples to collect ecologically valid sensor data with temporally precise labels of stressors. We introduce the concept of stress cycles, i.e., the physiological arousal and recovery, within a stress event. We identify several novel features from stress cycles and show that they exhibit distinguishing patterns during stressful conversations when compared to physiological response due to other stressors. We observe that hand gestures also show a distinct pattern when stress occurs due to stressful conversations. We train and test our model using field data collected from 38 participants. Our model can determine whether a detected stress event is due to a stressful conversation with an F1-score of 0.83, using features obtained from only one stress cycle, facilitating intervention delivery within 3.9 minutes of the start of a stressful conversation.
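A minimal sketch of the "stress cycle" idea, assuming an arousal time series with a known baseline; the three features extracted here (rise time, recovery time, peak height) are illustrative placeholders, not the paper's actual feature set.

```python
import numpy as np

def stress_cycle_features(arousal, fs, baseline):
    """Illustrative per-cycle features: time to peak arousal, recovery
    time back to baseline, and peak height above baseline."""
    peak = int(np.argmax(arousal))
    after = np.where(arousal[peak:] <= baseline)[0]
    recovery = after[0] / fs if after.size else (len(arousal) - peak) / fs
    return {"rise_s": peak / fs,
            "recovery_s": recovery,
            "peak_height": float(arousal[peak] - baseline)}
```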
ABSTRACT
The primary vocal registers of modal, falsetto, and fry have been studied in adults but not per se in infancy. The vocal ligament is thought to play a critical role in the modal-falsetto contrast but is still developing during infancy (Tateya and Tateya, 2015). Cover tissues are also implicated in the modal-fry contrast, but the low fundamental frequency (fo) cutoff of 70 Hz, shared between genders, suggests a psychoacoustic basis for the contrast. Buder, Chorna, Oller, and Robinson (2008) used the labels of "loft," "modal," and "pulse" for distinct vibratory regimes that appear to be identifiable based on spectrographic inspection of harmonic structure and auditory judgments in infants, but this work did not supply acoustic measurements to verify which of these nominally labeled regimes resembled adult registers. In this report, we identify clear transitions between registers within infant vocalizations and measure these registers and their transitions for fo and relative harmonic amplitudes (H1-H2). By selectively sampling first-year vocalizations, this manuscript quantifies acoustic patterns that correspond to vocal fold vibration types not previously cataloged in infancy. Results support a developmental basis for vocal registers, revealing that a well-developed ligament is not needed for loft-modal quality shifts as seen in harmonic amplitude measures. Results also reveal that a distinctively pulsatile register can occur in infants at a much higher fo than expected on psychoacoustic grounds. Overall results are consistent with cover tissues in infancy that are, for vibratory purposes, highly compliant and readily detached.
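A minimal sketch of the H1-H2 measure used here, assuming a short voiced frame spanning several glottal periods and an external f0 estimate; no formant correction is applied, and the ±10% harmonic search band is an illustrative choice.

```python
import numpy as np

def h1_h2(frame, fs, f0):
    """H1-H2 in dB from the magnitude spectrum, given an f0 estimate.
    The frame should span several glottal periods for adequate resolution."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def harmonic_db(h):                      # peak level near harmonic h
        band = (freqs > h * f0 * 0.9) & (freqs < h * f0 * 1.1)
        return 20.0 * np.log10(spec[band].max() + 1e-12)

    return harmonic_db(1) - harmonic_db(2)
```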
Subject(s)
Phonation , Vocal Cords/growth & development , Voice Quality , Acoustics , Age Factors , Child Development , Female , Humans , Infant , Sound Spectrography , Vibration , Video Recording
ABSTRACT
Prior research has not evaluated acoustic features contributing to perception of human infant vocal distress or lack thereof on a continuum. The present research evaluates perception of infant vocalizations along a continuum ranging from the most prototypical intensely distressful cry sounds ("wails") to the most prototypical of infant sounds that typically express no distress (non-distress "vocants"). Wails are deemed little if at all related to speech while vocants are taken to be clear precursors to speech. We selected prototypical exemplars of utterances representing the whole continuum from 0- and 1-month-olds. In this initial study of the continuum, our goals are to determine (1) listener agreement on level of vocal distress across the continuum, (2) acoustic parameters predicting ratings of distress, (3) the extent to which individual listeners maintain or change their acoustic criteria for distress judgments across the study, (4) the extent to which different listeners use similar or different acoustic criteria to make judgments, and (5) the role of short-term experience among the listeners in judgments of infant vocalization distress. Results indicated that (1) both inter-rater and intra-rater listener agreement on degree of vocal distress was high, (2) the best predictors of vocal distress were number of vibratory regimes within utterances, utterance duration, spectral ratio (spectral concentration) in vibratory regimes within utterances, and mean pitch, (3) individual listeners significantly modified their acoustic criteria for distress judgments across the 10 trial blocks, (4) different listeners, while showing overall similarities in ratings of the 42 stimuli, also showed significant differences in acoustic criteria used in assigning the ratings of vocal distress, and (5) listeners who were both experienced and inexperienced in infant vocalizations coding showed high agreement in rating level of distress, but differed in the extent to which they relied on the different acoustic cues in making the ratings. The study provides clearer characterization of vocal distress expression in infants based on acoustic parameters and a new perspective on active adult perception of infant vocalizations. The results also highlight the importance of vibratory regime segmentation and analysis in acoustically based research on infant vocalizations and their perception.
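As a hedged illustration, the predictor set named in finding (2) could feed a simple multiple regression of mean distress ratings on the acoustic parameters; the values below are invented placeholders, not study data, and the study's actual modeling may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: vibratory regimes per utterance, duration (s), spectral ratio,
# mean pitch (Hz) -- the four best predictors named in the abstract
X = np.array([[1, 0.6, 0.31, 320.0],
              [2, 1.1, 0.44, 365.0],
              [3, 1.8, 0.62, 410.0],
              [4, 2.0, 0.68, 455.0],
              [5, 2.4, 0.75, 480.0],
              [2, 0.9, 0.40, 350.0]])
ratings = np.array([1.3, 2.1, 3.4, 4.0, 4.8, 1.9])   # mean listener ratings

model = LinearRegression().fit(X, ratings)
print(model.coef_, model.score(X, ratings))          # weights and R-squared
```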
ABSTRACT
How did vocal language originate? Before trying to determine how referential vocabulary or syntax may have arisen, it is critical to explain how ancient hominins began to produce vocalization flexibly, without binding to emotions or functions. A crucial factor in the vocal communicative split of hominins from the ape background may thus have been copious, functionally flexible vocalization, starting in infancy and continuing throughout life, long before there were more advanced linguistic features such as referential vocabulary. Modern human infants at 2-3 months of age produce "protophones," including at least three types of functionally flexible non-cry precursors to speech rarely reported in other ape infants. But how early in life do protophones actually appear? We report that the most common protophone types emerge abundantly as early as vocalization can be observed in infancy, in preterm infants still in neonatal intensive care. Contrary to the expectation that cries are the predominant vocalizations of infancy, our all-day recordings showed that protophones occurred far more frequently than cries in both preterm and full-term infants. Protophones were not limited to interactive circumstances, but also occurred at high rates when infants were alone, indicating an endogenous inclination to vocalize exploratorily, perhaps the most fundamental capacity underlying vocal language.
Subject(s)
Child Development , Language , Humans , Infant , Infant, Newborn , Infant, Premature , Speech , Verbal Behavior
ABSTRACT
Monitoring of in-person conversations has largely been done using acoustic sensors. In this paper, we propose a new method to detect moment-by-moment conversation episodes by analyzing breathing patterns captured by a mobile respiration sensor. Since breathing is affected by physical and cognitive activities, we develop a comprehensive method for cleaning, screening, and analyzing noisy respiration data captured in the field environment at the individual breath cycle level. Using training data collected from a speech dynamics lab study with 12 participants, we show that our algorithm can identify each respiration cycle with 96.34% accuracy even in the presence of walking. We present a Conditional Random Field, Context-Free Grammar (CRF-CFG) based conversation model, called rConverse, to classify respiration cycles into speech or non-speech, and subsequently infer conversation episodes. Our model achieves 82.7% accuracy for speech/non-speech classification and it identifies conversation episodes with 95.9% accuracy on lab data using leave-one-subject-out cross-validation. Finally, the system is validated against audio ground truth in a field study with 32 participants. rConverse identifies conversation episodes with 71.7% accuracy on 254 hours of field data. For comparison, the accuracy from a high-quality audio recorder on the same data is 71.9%.
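A minimal sketch of breath-cycle segmentation from a respiration waveform, using low-pass smoothing and peak picking; the 1 Hz cutoff and minimum cycle spacing are assumptions, and the paper's actual pipeline (screening, CRF-CFG inference) is far more elaborate.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def breath_cycles(resp, fs):
    """Segment a respiration waveform into cycles: low-pass smooth,
    mark inhalation peaks, return peak-to-peak (start, end) sample pairs."""
    b, a = butter(2, 1.0 / (fs / 2))                   # breathing is < ~1 Hz
    smooth = filtfilt(b, a, resp)
    peaks, _ = find_peaks(smooth, distance=int(1.5 * fs))  # cycles >= 1.5 s
    return list(zip(peaks[:-1], peaks[1:]))
```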
ABSTRACT
Neonatal imitation has rich implications for neuroscience, developmental psychology, and social cognition, but there is little consensus about this phenomenon. The primary empirical question, whether or not neonatal imitation exists, is not settled. Is it possible to give a balanced evaluation of the theories and methodologies at stake so as to facilitate real progress with respect to the primary empirical question? In this paper, we address this question. We present the operational definition of differential imitation and discuss why it is important to keep it in mind. The operational definition indicates that neonatal imitation may not look like prototypical imitation and sets non-obvious requirements on what can count as evidence for imitation. We also examine the principal explanations for the extant findings and argue that two theories, the arousal hypothesis and the Association by Similarity Theory, which interprets neonatal imitation as differential induction of spontaneous behavior through similarity, offer better explanations than the others. With respect to methodology, we investigate what experimental design can best provide evidence for imitation, focusing on how differential induction may be maximized and detected. Finally, we discuss the significance of neonatal imitation for the field of social cognition. Specifically, we propose links with theories of social interaction and direct social perception. Overall, our goals are to help clarify the complex theoretical issues at stake and suggest fruitful guidelines for empirical research.
ABSTRACT
Voice clinicians require an objective, reliable, and relatively automatic method to assess voice change after medical, surgical, or behavioral intervention. This measure must be sensitive to a variety of voice qualities and severities, and preferably should reflect voice in continuous speech. The long-term average spectrum (LTAS) is a fast Fourier transform-generated power spectrum whose properties can be compared with a Gaussian bell curve using spectral moments analysis. Four spectral moments describe features of the LTAS: Spectral mean (Moment 1) and standard deviation (Moment 2) represent the spectrum's central tendency and dispersion, respectively. Skewness (based on Moment 3) and kurtosis (based on Moment 4) represent the spectrum's tilt and peakedness, respectively. To examine whether the first four spectral moments of the LTAS were sensitive to perceived voice improvement after voice therapy, this investigation compared pretreatment and posttreatment voice samples of 93 patients with functional dysphonia using spectral moments analysis. Inspection of the results revealed that spectral mean and standard deviation lowered significantly with perceived voice improvement after successful behavioral management (p < 0.001). However, changes in skewness and kurtosis were not significant. Furthermore, lowering of the spectral mean uniquely accounted for approximately 14% of the variance in the pretreatment to posttreatment changes observed in perceptual ratings of voice severity (p < 0.001), indicating that spectral mean (i.e., Moment 1) of the LTAS may be one acoustic marker sensitive to improvement in dysphonia severity.
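The four moments described above can be computed directly from an averaged power spectrum. A minimal sketch, assuming a Welch-averaged LTAS treated as a probability distribution over frequency; the segment length is an illustrative choice.

```python
import numpy as np
from scipy.signal import welch

def ltas_moments(x, fs):
    """Welch-averaged LTAS, then the four moments described above:
    spectral mean, SD, skewness (tilt), and kurtosis (peakedness)."""
    f, pxx = welch(x, fs=fs, nperseg=4096)
    p = pxx / pxx.sum()                         # spectrum as a distribution
    mean = np.sum(f * p)                        # Moment 1: central tendency
    sd = np.sqrt(np.sum((f - mean) ** 2 * p))   # Moment 2: dispersion
    skew = np.sum((f - mean) ** 3 * p) / sd**3  # based on Moment 3
    kurt = np.sum((f - mean) ** 4 * p) / sd**4  # based on Moment 4
    return mean, sd, skew, kurt
```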
Subject(s)
Speech Acoustics , Voice Disorders/therapy , Voice Quality , Voice Training , Adolescent , Adult , Aged , Female , Humans , Middle Aged , Time Factors
ABSTRACT
Spectral amplitude measures are sensitive to varying degrees of vocal fold adduction in normal speakers. This study examined the applicability of harmonic amplitude differences to adductor spasmodic dysphonia (ADSD) in comparison with normal controls. Amplitudes of the first and second harmonics (H1, H2) and of harmonics affiliated with the first, second, and third formants (A1, A2, A3) were obtained from spectra of the vowels /ɑ/ and /i/ excerpted from connected speech. Results indicated that these measures could be made reliably in ADSD. With the exception of H1(*)-H2(*), harmonic amplitude differences (H1(*)-A1, H1(*)-A2, and H1(*)-A3(*)) exhibited significant negative linear relationships (P < 0.05) with clinical judgments of overall severity. The four harmonic amplitude differences significantly differentiated between pre-botulinum toxin (pre-BT) and post-BT productions (P < 0.05). After treatment, measurements from /ɑ/ detected significant differences between ADSD and normal controls (P < 0.05), but measurements from /i/ did not. LTAS analysis of ADSD patients' speech samples proved a good fit with harmonic amplitude difference measures. Harmonic amplitude differences also significantly correlated with perceptual judgments of breathiness and roughness (P < 0.05). These findings demonstrate high clinical applicability for harmonic amplitude differences for characterizing phonation in the speech of persons with ADSD, as well as normal speakers, and they suggest promise for future application to other voice pathologies.
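A minimal sketch of uncorrected harmonic amplitude differences (H1-A1/A2/A3), assuming external f0 and formant estimates; the asterisked, formant-corrected versions used in the study require an additional correction step not shown here.

```python
import numpy as np

def harmonic_amplitude_diffs(frame, fs, f0, formants):
    """Uncorrected H1-A1/A2/A3 in dB: level of the first harmonic minus
    the level of the harmonic nearest each formant frequency. The frame
    should span several glottal periods for adequate resolution."""
    spec = 20.0 * np.log10(
        np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def level_near(target_hz):
        h = max(1, round(target_hz / f0))   # nearest harmonic number
        band = (freqs > h * f0 - f0 / 2) & (freqs < h * f0 + f0 / 2)
        return spec[band].max()

    h1 = level_near(f0)
    return [h1 - level_near(F) for F in formants]   # e.g., F = [F1, F2, F3]
```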
Subject(s)
Speech Acoustics , Vocal Cords/physiopathology , Voice Disorders/diagnosis , Adult , Female , Humans , Mathematical Computing , Middle Aged , Predictive Value of Tests , Reproducibility of Results , Sensitivity and Specificity , Sound Spectrography , Speech Production Measurement , Tape Recording , Voice Quality
ABSTRACT
BACKGROUND: Voice tremor, like spasmodic dysphonia and other tremor disorders, may respond to botulinum toxin type A injections. OBJECTIVE: To evaluate the safety and efficacy of botulinum toxin type A injections as treatment for voice tremor. DESIGN: A randomized study of 3 doses of botulinum toxin type A with 6 weeks of follow-up. SETTING: A single-site tertiary care center. PARTICIPANTS AND METHODS: Thirteen subjects (11 women, 2 men; mean age, 73 years) with voice tremor and no spasmodic dysphonia or head, mouth, jaw, or facial tremor were entered into this study. Patients received 1.25 U (n = 5), 2.5 U (n = 5), or 3.75 U (n = 3) of botulinum toxin type A in each vocal cord. All patients were evaluated at baseline and postinjection at weeks 2, 4, and 6. MAIN OUTCOME MEASURES: The primary outcome measure was the patient tremor rating scale, with secondary measures including patient-rated functional disability, response rating scale, independent randomized tremor ratings, and acoustical measures. RESULTS: All patients at all dose levels noted an effect from the injection. The mean time to onset of effect was 2.3 days (range, 1-7 days). For all patients combined, mean tremor severity scale scores (rated by patients on a 5-point scale) improved 1.4 points at week 2, 1.6 points at week 4, and 1.7 points at week 6. Measures of functional disability, measures of the effect of injection, independent ratings of videotaped speech, and acoustic measures of tremor also showed improvement. The main adverse effects at all doses were breathiness and dysphagia. CONCLUSION: Voice tremor improves following injections of botulinum toxin type A.
Subject(s)
Botulinum Toxins, Type A/therapeutic use , Neuromuscular Agents/therapeutic use , Voice Disorders/drug therapy , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Sound Spectrography , Treatment Outcome , Video Recording
ABSTRACT
A method is presented for analyzing phonatory instabilities that occur as modulations of fundamental frequency (f0) and sound pressure level (SPL) on the order of 0.2 to 20 cycles per second. Such long-term phonatory instabilities, including but not limited to traditional notions of tremor, are distinct from cycle-to-cycle perturbation such as jitter or shimmer. For each of the 2 parameters (f0, in Hz, and SPL, in dB), 3 frequency domains are proposed: (a) flutter (10-20 Hz), (b) tremor (2-10 Hz), and (c) wow (0.2-2.0 Hz), yielding 6 types of instability. Analyses were implemented using fast Fourier transforms (FFTs) with domain-specific analysis parameters. Outputs include a graphic display in the form of a set of low-frequency spectrograms (the "modulogram") and quantitative measures of the frequencies, magnitudes, durations, and sinusoidal form of the instabilities. An index of a given instability is developed by combining its duration and average modulation magnitude into a single quantity. Performance of the algorithms was assessed by analyzing test signals with known degrees of modulation, and a range of applications was reviewed to provide a rationale for use of modulograms in phonatory assessment.
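The quantitative core of the modulogram can be sketched as a banded FFT analysis of a detrended f0 or SPL contour. The sketch below assumes the contour is sampled at 40 Hz or higher (so the flutter band is observable) and omits the spectrographic display, duration tracking, and sinusoidality measures described above.

```python
import numpy as np

BANDS = {"wow": (0.2, 2.0), "tremor": (2.0, 10.0), "flutter": (10.0, 20.0)}

def modulation_magnitudes(contour, fs):
    """Per-band modulation peak of a detrended f0 (Hz) or SPL (dB) contour.
    Returns {band: (modulation frequency, magnitude)} for the wow, tremor,
    and flutter domains defined above."""
    n = np.arange(len(contour))
    detr = contour - np.polyval(np.polyfit(n, contour, 1), n)   # detrend
    mag = np.abs(np.fft.rfft(detr * np.hanning(len(detr)))) * 2.0 / len(detr)
    freqs = np.fft.rfftfreq(len(detr), 1.0 / fs)
    out = {}
    for name, (lo, hi) in BANDS.items():
        band = (freqs >= lo) & (freqs < hi)
        out[name] = (freqs[band][np.argmax(mag[band])], mag[band].max())
    return out
```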