ABSTRACT
This paper deals with the study of formant and harmonic contours by processing the group delay (GD) spectrograms of speech signals. The GD spectrum is the negative derivative of the phase spectrum with respect to frequency. A recent study shows that the GD spectrogram can be obtained without phase wrapping. Formant frequency contours can be observed in the display of the peaks of the instantaneous wideband equivalent GD spectrogram, derived using the modified single frequency filtering (SFF) analysis of speech signals; harmonic frequency contours can be observed in the display of the peaks of the corresponding instantaneous narrowband equivalent GD spectrogram. For synthetic speech signals, the observed formant contours match the ground-truth formant contours from which the signal is derived. For natural speech signals, the observed formant contours approximately match the given ground-truth formant contours, mostly in the voiced regions. The results are illustrated for several randomly selected utterances from the TIMIT database. While this study helps to observe formant contours in the display, automatic extraction of the formant frequencies needs further processing, requiring logic to eliminate spurious points without forcing the number of formants.
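The modified SFF analysis itself is not reproduced in this abstract, but the key step, computing a group delay spectrum without phase unwrapping, follows from the standard DFT identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, where X is the spectrum of x[n] and Y the spectrum of n*x[n]. A minimal NumPy sketch (frame and FFT sizes are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def group_delay_spectrum(frame, n_fft=1024):
    """Group delay spectrum of one frame, computed without phase unwrapping.

    Uses tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, where X = FFT{x[n]}
    and Y = FFT{n * x[n]}, so no explicit phase differentiation is needed.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

tau = group_delay_spectrum(np.random.randn(400))  # toy frame; with speech, peaks relate to resonances
```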
Subject(s)
Speech Acoustics; Humans; Sound Spectrography; Signal Processing, Computer-Assisted; Speech Production Measurement/methods; Voice Quality; Time Factors; Phonetics

ABSTRACT
This study investigates whether downstep in Japanese is directly triggered by accents. When the pitch height of a word X is lower after an accented word (A) than after an unaccented word (U), X is diagnosed as downstepped. However, this diagnosis involves two confounding factors: the already lowered F0 before X and phonological phrasing. To control these factors, this study contrasts genitive and nominative case markers and adjusts measurement points. Eight native speakers of Tokyo Japanese participated in a production experiment. The results show six key findings. First, a structure-dependent F0 downtrend was observed in UX. Second, higher F0 peaks with larger initial lowering were observed after accents with a nominative case marker compared to those with a genitive case marker, suggesting a boosting effect by boundaries. Third, larger initial lowering was observed in AX compared to UX, contradicting the notion that X is more compressed in AX due to downstep. Fourth, the paradigmatic difference in F0 height between AX and UX decreases when the F0 of X is increased, supporting the view that boundaries trigger downstep. Fifth, downstep is not physiologically constrained but is phonologically controlled. Finally, the blocking of initial lowering in heavy syllables is not phonological but rather an articulatory phenomenon.
Subject(s)
Phonetics; Speech Acoustics; Speech Production Measurement; Humans; Male; Female; Young Adult; Speech Production Measurement/methods; Adult; Voice Quality; Speech Perception

ABSTRACT
Predictions of gradient degree of lenition of voiceless and voiced stops in a corpus of Argentine Spanish are evaluated using three acoustic measures (minimum and maximum intensity velocity and duration) and two recurrent neural network (Phonet) measures (posterior probabilities of the sonorant and continuant phonological features). While mixed and inconsistent predictions were obtained across the acoustic metrics, sonorant and continuant probability values were consistently in the direction predicted by known factors of a stop's lenition with respect to its voicing, place of articulation, and surrounding contexts. The results suggest the effectiveness of Phonet as an additional or alternative method of lenition measurement. Furthermore, this study has enhanced the accessibility of Phonet by releasing the trained Spanish Phonet model used in this study, along with a pipeline with step-by-step instructions for training new models and running inference with them.
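As a point of reference for the acoustic metrics named above, minimum and maximum intensity velocity can be approximated from an intensity contour and its time derivative. A rough librosa-based sketch (window and hop settings are our assumptions, not the study's):

```python
import numpy as np
import librosa

def intensity_velocity_extrema(y, sr, hop_ms=5.0):
    """Min and max velocity (dB/s) of the intensity contour of a segment."""
    hop = int(sr * hop_ms / 1000)
    rms = librosa.feature.rms(y=y, frame_length=4 * hop, hop_length=hop)[0]
    intensity_db = 20 * np.log10(rms + 1e-10)      # intensity contour in dB
    velocity = np.gradient(intensity_db, hop_ms / 1000.0)
    return velocity.min(), velocity.max()
```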
Subject(s)
Neural Networks, Computer; Phonetics; Speech Acoustics; Humans; Speech Production Measurement/methods; Time Factors; Probability; Acoustics

ABSTRACT
Previous research has shown that prosodic structure can regulate the relationship between co-speech gestures and speech itself. Most co-speech studies have focused on manual gestures, but head movements have also been observed to accompany speech events by Munhall, Jones, Callan, Kuratate, and Vatikiotis-Bateson [(2004). Psychol. Sci. 15(2), 133-137], and these co-verbal gestures may be linked to prosodic prominence, as shown by Esteve-Gibert, Borrás-Comes, Asor, Swerts, and Prieto [(2017). J. Acoust. Soc. Am. 141(6), 4727-4739], Hadar, Steiner, Grant, and Rose [(1984). Hum. Mov. Sci. 3, 237-245], and House, Beskow, and Granström [(2001). Lang. Speech 26(2), 117-129]. This study examines how the timing and magnitude of head nods may be related to degrees of prosodic prominence connected to different focus conditions. Using electromagnetic articulometry, a time-varying signal of vertical head movement for 12 native French speakers was generated to examine the relationship between head nod gestures and F0 peaks. The results suggest that speakers use two different alignment strategies, which integrate both temporal and magnitudinal aspects of the gesture. Some evidence of inter-speaker preferences in the use of the two strategies was observed, although the inter-speaker variability is not categorical. Importantly, prosodic prominence itself is not the cause of the difference between the two strategies, but instead magnifies their inherent differences. In this way, the use of co-speech head nod gestures under French focus conditions can be considered as a method of prosodic enhancement.
Subject(s)
Head Movements; Speech Acoustics; Humans; Male; Female; Young Adult; Adult; Speech Production Measurement/methods; Time Factors; Gestures; Voice Quality; France; Language

ABSTRACT
Vowels vary in their acoustic similarity across regional dialects of American English, such that some vowels are more similar to one another in some dialects than others. Acoustic vowel distance measures typically evaluate vowel similarity at a discrete time point, resulting in distance estimates that may not fully capture vowel similarity in formant trajectory dynamics. In the current study, language and accent distance measures, which evaluate acoustic distances between talkers over time, were applied to the evaluation of vowel category similarity within talkers. These vowel category distances were then compared across dialects, and their utility in capturing predicted patterns of regional dialect variation in American English was examined. Dynamic time warping of mel-frequency cepstral coefficients was used to assess acoustic distance across the frequency spectrum and captured predicted Southern American English vowel similarity. Root-mean-square distance and generalized additive mixed models were used to assess acoustic distance for selected formant trajectories and captured predicted Southern, New England, and Northern American English vowel similarity. Generalized additive mixed models captured the most predicted variation, but, unlike the other measures, do not return a single acoustic distance value. All three measures are potentially useful for understanding variation in vowel category similarity across dialects.
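A minimal sketch of the dynamic-time-warping distance over MFCCs for two vowel tokens, using librosa (the 13-coefficient setting and the path-length normalization are our assumptions, not necessarily the study's):

```python
import librosa

def mfcc_dtw_distance(y1, y2, sr):
    """Acoustic distance between two vowel tokens via DTW over MFCCs."""
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13)
    D, wp = librosa.sequence.dtw(X=m1, Y=m2, metric="euclidean")
    # Normalize accumulated cost by path length so that longer vowels
    # do not automatically look more distant.
    return D[-1, -1] / len(wp)
```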
Subject(s)
Phonetics; Speech Acoustics; Speech Production Measurement; Humans; Speech Production Measurement/methods; Voice Quality; Acoustics; Female; Male; Time Factors; Language; Sound Spectrography; Adult

ABSTRACT
The study of citation tones, lexical tones produced in isolation, is one of the first steps towards understanding speech prosody in tone languages. However, methodologies for investigating citation tones vary significantly, often leading to limited comparability of tone inventories, both within and across languages. This paper presents a systematic review of research methods and practices in 136 citation tone studies on 129 tonal language varieties in China, including 99 studies published in Chinese, which are therefore not easily available to an international scientific readership. The review provides an overview of possible analytical decisions along the research pipeline, and unveils considerable variation in data collection, analysis, and reporting conventions, particularly in how f0, the primary acoustic correlate for tone, is operationalised and reported across studies. Key methodological issues are identified, including small sample sizes and inadequate transparency in communicating methodological decisions and procedures. This paper offers a clear road map for citation tone production research and proposes a range of recommendations on speaker sampling, experimental design, acoustic processing techniques, f0 analysis, and result reporting, with the goal of facilitating future tonal research and enhancing resources for underrepresented tonal varieties.
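One recurring operationalisation choice the review discusses is f0 normalisation; a minimal sketch of one common option, semitone conversion relative to a speaker-specific reference (the reference choice below is ours, for illustration only):

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz):
    """Convert an f0 track in Hz to semitones relative to a reference.

    A speaker-specific reference (e.g., the speaker's median f0) makes
    citation-tone shapes comparable across speakers with different ranges.
    """
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

track = [180.0, 190.0, 210.0, 240.0]       # hypothetical f0 samples (Hz)
print(hz_to_semitones(track, ref_hz=float(np.median(track))))
```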
Subject(s)
Linguistics; Speech Acoustics; Speech Production Measurement; Humans; Language; Linguistics/methods; Phonetics; Speech Perception; Speech Production Measurement/methods

ABSTRACT
Accurately classifying accents and assessing accentedness in non-native speakers are challenging tasks, due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pretrained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID- and SID-encoded accent information complements an end-to-end (E2E) accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior AID accuracy. Next, leveraging automatic speech recognition (ASR) and AID models is investigated for accentedness estimation. The ASR model is an E2E connectionist temporal classification model trained exclusively on American English (en-US) utterances. The ASR error rate and the en-US output of the AID model are used as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models, as well as a robust correlation between objective accentedness scores and subjective scores based on human perception, providing evidence for the reliability and validity of AID-based and ASR-based accentedness assessment in non-native speech. Such systems would benefit accent assessment in language learning, as well as speech and speaker assessment for intelligibility and quality, speaker diarization, and speech recognition.
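As a concrete illustration of the ASR-based score, the word error rate of a native-trained recognizer against the intended prompt can serve as the objective accentedness measure. A sketch using the jiwer package (the example strings are hypothetical):

```python
import jiwer

def accentedness_score(prompt: str, asr_output: str) -> float:
    """Objective accentedness proxy: WER of an en-US-trained ASR system.

    The stronger the accent, the more a native-trained recognizer is
    expected to err, so a higher WER is read as higher accentedness.
    """
    return jiwer.wer(prompt, asr_output)

print(accentedness_score("she had your dark suit", "she had your dark suit"))  # 0.0
print(accentedness_score("she had your dark suit", "she hat your dog suit"))   # 0.4
```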
Subject(s)
Speech Perception; Speech Recognition Software; Humans; Speech Perception/physiology; Speech Acoustics; Phonetics; Language; Speech Production Measurement/methods; Female; Male

ABSTRACT
The quality of speech input influences the efficiency of L1 and L2 acquisition. This study examined modifications in infant-directed speech (IDS) and foreigner-directed speech (FDS) in Standard Mandarin, a tonal language, and explored how IDS and FDS features were manifested in disyllabic words and a longer discourse. The study aimed to determine which characteristics of IDS and FDS were enhanced in comparison with adult-directed speech (ADS), and how IDS and FDS differed when measured on a common set of acoustic parameters. For words, it was found that tone-bearing vowel duration, mean and range of fundamental frequency (F0), and the lexical tone contours were enhanced in IDS and FDS relative to ADS, except for the dipping Tone 3, which exhibited an unexpected lowering in FDS but no modification in IDS when compared with ADS. For the discourse, different aspects of temporal and F0 enhancement were emphasized in IDS and FDS: the mean F0 was higher in IDS, whereas the total discourse duration was greater in FDS. These findings add to the growing literature on L1 and L2 speech input characteristics and their role in language acquisition.
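A minimal sketch of how the F0 measures might be extracted with Praat via the parselmouth package (Praat's default pitch settings are assumed, not the study's exact analysis parameters):

```python
import parselmouth

def f0_summary(wav_path):
    """Mean F0, F0 range (Hz), and total duration (s) of one recording."""
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                      # Praat codes unvoiced frames as 0 Hz
    return f0.mean(), f0.max() - f0.min(), snd.duration

# mean_f0, f0_range, dur = f0_summary("ids_token.wav")  # hypothetical file
```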
Subject(s)
Speech Acoustics; Humans; Female; Male; Infant; Adult; Phonetics; Speech Production Measurement/methods; Young Adult; Multilingualism; Voice Quality; Acoustics; Language; Time Factors; Speech Perception

ABSTRACT
This paper evaluates an innovative framework for spoken dialect density prediction on children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, a classifier is trained to predict the level of dialect density, providing a higher degree of specificity in downstream tasks. For this, self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, prosodic features, and other feature sets are experimented with as input to an XGBoost classifier. The classifier is then trained to assign dialect density labels to short recorded utterances. High dialect-density classification accuracy is achieved for both child and adult speech, with robust performance demonstrated across age groups and regional varieties of dialect. Additionally, this work is used as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect.
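A sketch of the self-supervised branch of such a pipeline, under stated assumptions: the base HuBERT checkpoint and mean pooling over frames are our choices, and the paper's full system fuses several further feature sets not shown here:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel
from xgboost import XGBClassifier

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(waveform_16k):
    """Mean-pooled HuBERT hidden states for one 16-kHz mono utterance."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# X = np.stack([utterance_embedding(w) for w in waveforms])
# clf = XGBClassifier().fit(X, dialect_density_levels)   # ordinal labels
```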
Subject(s)
Black or African American; Speech Acoustics; Humans; Adult; Child; Male; Female; Speech Production Measurement/methods; Language; Child, Preschool; Young Adult; Speech Perception; Adolescent; Phonetics; Child Language

ABSTRACT
Research has shown that talkers reliably coordinate the timing of articulator movements across variation in production rate and syllable stress, and that this precision of inter-articulator timing instantiates phonetic structure in the resulting acoustic signal. We here tested the hypothesis that immediate auditory feedback helps regulate that consistent articulatory timing control. Talkers with normal hearing recorded 480 /tV#Cat/ utterances using electromagnetic articulography, with alternative V (/É/-/É/) and C (/t/-/d/), across variation in production rate (fast-normal) and stress (first syllable stressed-unstressed). Utterances were split between two listening conditions: unmasked and masked. To quantify the effect of immediate auditory feedback on the coordination between the jaw and tongue-tip, the timing of tongue-tip raising onset for C, relative to the jaw opening-closing cycle for V, was obtained in each listening condition. Across both listening conditions, any manipulation that shortened the jaw opening-closing cycle reduced the latency of tongue-tip movement onset, relative to the onset of jaw opening. Moreover, tongue-tip latencies were strongly affiliated with utterance type. During auditory masking, however, tongue-tip latencies were less strongly affiliated with utterance type, demonstrating that talkers use afferent auditory signals in real-time to regulate the precision of inter-articulator timing in service to phonetic structure.
Subject(s)
Feedback, Sensory; Phonetics; Speech Perception; Tongue; Humans; Tongue/physiology; Male; Female; Adult; Feedback, Sensory/physiology; Young Adult; Speech Perception/physiology; Jaw/physiology; Speech Acoustics; Speech Production Measurement/methods; Time Factors; Speech/physiology; Perceptual Masking

ABSTRACT
In this study, a computer-driven, phoneme-agnostic method was explored for assessing speech disorders (SDs) in children, bypassing traditional labor-intensive phonetic transcription. Using the SpeechMark® automatic syllabic cluster (SC) analysis, which detects sequences of acoustic features that characterize well-formed syllables, 1952 American English utterances of 60 preschoolers were analyzed [16 with speech disorder present (SD-P) and 44 with speech disorder not present (SD-NP)] from two dialectal areas. A four-factor regression analysis evaluated the robustness of seven automated measures produced by SpeechMark® and their interactions. SCs significantly predicted SD status (p < 0.001). A secondary analysis using a generalized linear model with a negative binomial distribution evaluated the number of SCs produced by the groups. Results highlighted that children with SD-P produced fewer well-formed clusters [incidence rate ratio (IRR) = 0.8116, p ≤ 0.0137]. The interaction between speech group and age indicated that the effect of age on syllable count was more pronounced in children with SD-P (IRR = 1.0451, p = 0.0251), suggesting that even small changes in age can have a significant effect on SCs. In conclusion, speech status significantly influences the degree to which preschool children produce acoustically well-formed SCs, suggesting the potential for SCs to be speech biomarkers for SD in preschoolers.
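The group-by-age negative binomial model can be reproduced in outline with statsmodels; the sketch below runs on synthetic stand-in data (not the study's) and exponentiates coefficients to obtain incidence rate ratios:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({                       # synthetic stand-in, one row per child
    "sc_count": rng.poisson(20, 60),      # well-formed syllabic clusters
    "group": ["SD-P"] * 16 + ["SD-NP"] * 44,
    "age_months": rng.integers(36, 60, 60),
})

model = smf.glm("sc_count ~ group * age_months", data=df,
                family=sm.families.NegativeBinomial()).fit()
print(np.exp(model.params))               # exponentiated coefficients = IRRs
```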
Subject(s)
Phonetics; Speech Acoustics; Speech Disorders; Speech Production Measurement; Humans; Child, Preschool; Male; Female; Speech Production Measurement/methods; Speech Disorders/physiopathology; Speech Disorders/diagnosis; Child; Child Language; Age Factors

ABSTRACT
Voice and speech production change with age, which can lead to potential communication challenges. This study explored the use of Landmark-based analysis of speech (LMBAS), a knowledge-based speech analysis algorithm based on Stevens' Landmark Theory, to describe age-related changes in adult speakers. The speech samples analyzed were sourced from the University of Florida Aging Voice Database, which included recordings of 16 sentences from the Speech Perception in Noise test of Bilger, Rzcezkowski, Nuetzel, and Rabinowitz [J. Acoust. Soc. Am. 65, S98-S98 (1979)] and Bilger, Nuetzel, Rabinowitz, and Rzeczkowski [J. Speech. Lang. Hear. Res. 27, 32-84 (1984)]. These sentences were read in quiet environments by 50 young, 50 middle-aged, and 50 older American English speakers, with an equal distribution of sexes. Acoustic landmarks, specifically glottal, burst, and syllabicity landmarks, were extracted using the SpeechMark® MATLAB Toolbox, version 1.1.2. The results showed a significant age effect on glottal and burst landmarks. Furthermore, the sex effect was significant for burst and syllabicity landmarks. While the results of LMBAS suggest its potential for detecting age-related changes in speech, the increase in syllabicity landmarks with age was unexpected. This finding may suggest the need for further refinement and adjustment of this analytical approach.
Subject(s)
Aging; Speech Acoustics; Speech Production Measurement; Humans; Male; Female; Middle Aged; Aged; Adult; Young Adult; Aging/physiology; Speech Production Measurement/methods; Age Factors; Voice Quality; Algorithms; Aged, 80 and over; Speech Perception/physiology; Speech/physiology

ABSTRACT
For most of his illustrious career, Ken Stevens focused on examining and documenting the rich detail about vocal tract changes that is available to listeners in the acoustic signal of speech. Current approaches to speech inversion take advantage of this rich detail to recover information about articulatory movement. Our previous speech inversion work focused on movements of the tongue and lips, for which "ground truth" is readily available. In this study, we describe the acquisition and validation of ground-truth articulatory data on velopharyngeal port constriction, using both the well-established measure of nasometry and a novel technique, high-speed nasopharyngoscopy. Nasometry measures the acoustic output of the nasal and oral cavities to derive the measure known as nasalance. High-speed nasopharyngoscopy captures images of the nasopharyngeal region and can resolve velar motion during speech. By comparing simultaneously collected data from both acquisition modalities, we show that nasalance is a sufficiently sensitive measure to use as ground truth for our speech inversion system. Further, a speech inversion system trained on nasalance can recover known patterns of velopharyngeal port constriction shown by American English speakers. Our findings match well with Stevens' own studies of the acoustics of nasal consonants.
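Nasalance itself is simple to compute from a two-channel, nasometer-style recording; a minimal sketch follows (frame size and the energy-ratio definition are assumptions; commercial nasometers typically band-pass filter the channels first):

```python
import numpy as np

def nasalance(nasal, oral, sr, frame_ms=20):
    """Frame-wise nasalance: nasal energy / (nasal + oral) energy."""
    hop = int(sr * frame_ms / 1000)
    n_frames = min(len(nasal), len(oral)) // hop
    out = np.empty(n_frames)
    for i in range(n_frames):
        n_e = np.sum(nasal[i * hop:(i + 1) * hop] ** 2)
        o_e = np.sum(oral[i * hop:(i + 1) * hop] ** 2)
        out[i] = n_e / (n_e + o_e + 1e-12)   # 0 = fully oral, 1 = fully nasal
    return out
```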
Subject(s)
Speech Acoustics; Speech Production Measurement; Humans; Male; Speech Production Measurement/methods; Adult; Female; Young Adult; Voice Quality; Constriction, Pathologic; Speech/physiology; Endoscopy/methods; Endoscopy/instrumentation

ABSTRACT
This study examines the possibility of using acoustic parameters, i.e., the Acoustic Voice Quality Index (AVQI) and Maximum Phonation Time (MPT), for predicting the degree of lung involvement in COVID-19 patients. This cross-sectional case-control study was conducted on voice samples collected from 163 healthy individuals and 181 patients with COVID-19. Each participant produced a sustained vowel /a/ and a phonetically balanced Persian text containing 36 syllables. AVQI and MPT were measured using Praat scripts. Each patient underwent a non-enhanced chest computed tomographic scan, and the Total Opacity (TO) score was rated to assess the degree of lung involvement. The results revealed significant differences between patients with COVID-19 and healthy individuals in terms of AVQI and MPT. A significant difference was also observed between male and female participants in AVQI and MPT. The results from the receiver operating characteristic curve analysis and the area under the curve indicated that MPT (0.909) had higher diagnostic accuracy than AVQI (0.771). A significant relationship was observed between AVQI and TO scores; for MPT, however, no such relationship was observed. The findings indicated that MPT was a better classifier than AVQI for differentiating patients from healthy individuals. The results also showed that AVQI can be used as a predictor of the degree of lung involvement in patients and recovered individuals. A formula is suggested for calculating the degree of lung involvement using AVQI.
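The ROC comparison can be sketched with scikit-learn on toy stand-in values (not the study's data); MPT is sign-flipped so that shorter phonation ranks toward "patient":

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0])               # 1 = patient, 0 = healthy
mpt = np.array([6.0, 8.5, 7.2, 14.0, 16.5, 12.8])   # seconds; lower in patients
avqi = np.array([4.1, 3.6, 5.0, 2.2, 1.9, 2.8])     # higher = worse voice

print(roc_auc_score(y_true, -mpt))   # negate: shorter MPT scores as "patient"
print(roc_auc_score(y_true, avqi))
```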
Subject(s)
COVID-19; Dysphonia; Humans; Male; Female; Dysphonia/diagnosis; Speech Acoustics; Case-Control Studies; Feasibility Studies; Cross-Sectional Studies; Reproducibility of Results; Severity of Illness Index; Acoustics; Tomography; Speech Production Measurement/methods

ABSTRACT
BACKGROUND: Speech-language pathologists often multitask in order to be efficient with their commonly large caseloads. In stuttering assessment, multitasking often involves collecting multiple measures simultaneously. AIMS: The present study sought to determine reliability when collecting multiple measures simultaneously versus individually. METHODS & PROCEDURES: Over two time periods, 50 graduate students viewed videos of four persons who stutter (PWS), counted the number of stuttered syllables, counted the total number of syllables uttered, and rated speech naturalness. Students were randomly assigned to one of two groups: the simultaneous group, in which all measures were gathered during one viewing; and the individual group, in which one measure was gathered per viewing. Relative and absolute intra- and inter-rater reliability values were calculated for each measure. OUTCOMES & RESULTS: The following results were notable: better intra-rater relative reliability for the number of stuttered syllables for the individual group (intraclass correlation coefficient (ICC) = 0.839) compared with the simultaneous group (ICC = 0.350), smaller intra-rater standard error of measurement (SEM) (i.e., better absolute reliability) for the number of stuttered syllables for the individual group (7.40) versus the simultaneous group (15.67), and better inter-rater absolute reliability for the total number of syllables for the individual group (88.29) compared with the simultaneous group (125.05). Absolute reliability was unacceptable for all measures across both groups. CONCLUSIONS & IMPLICATIONS: These findings show that judges are likely to be more reliable when identifying stuttered syllables in isolation than when simultaneously collecting them with total syllables spoken and naturalness data. Results are discussed in terms of narrowing the reliability gap between data collection methods for stuttered syllables, improving the overall reliability of stuttering measurements, and a procedural change when implementing widely used stuttering assessment protocols. WHAT THIS PAPER ADDS: What is already known on the subject The reliability of stuttering judgments has been found to be unacceptable across a number of studies, including those examining the reliability of the most popular stuttering assessment tool, the Stuttering Severity Instrument (4th edition; SSI-4). The SSI-4 and other assessment applications involve collecting multiple measures simultaneously. It has been suggested, but not examined, that collecting measures simultaneously, as occurs in the most popular stuttering assessment protocols, may result in substantially inferior reliability compared with collecting measures individually. What this paper adds to existing knowledge The present study has multiple novel findings. First, relative and absolute intra-rater reliability were substantially better when stuttered syllables data were collected individually compared with when the same data were collected simultaneously with total number of syllables and speech naturalness data. Second, inter-rater absolute reliability for total number of syllables was also substantially better when collected individually. Third, intra-rater and inter-rater reliability were similar when speech naturalness ratings were given individually compared with when they were given while simultaneously counting stuttered and fluent syllables. What are the potential or actual clinical implications of this work?
Clinicians can be more reliable when identifying stuttered syllables individually compared to when they judge stuttering along with other clinical measures of stuttering. In addition, when clinicians and researchers use current popular protocols for assessing stuttering that recommend simultaneous data collection, including the SSI-4, they should instead consider collecting stuttering event counts individually. This procedural change will lead to more reliable data and stronger clinical decision making.
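For readers who want to replicate the reliability analysis, a minimal sketch using the pingouin package on toy data: the two viewings are treated as "raters" for intra-rater ICC, and SEM is computed with the common formulation SEM = SD * sqrt(1 - ICC):

```python
import numpy as np
import pandas as pd
import pingouin as pg

df = pd.DataFrame({                      # toy data: 6 videos judged twice
    "video": list("ABCDEF") * 2,
    "viewing": [1] * 6 + [2] * 6,
    "stuttered": [12, 30, 7, 21, 15, 26, 14, 28, 9, 19, 13, 27],
})

icc_table = pg.intraclass_corr(data=df, targets="video",
                               raters="viewing", ratings="stuttered")
icc2 = icc_table.set_index("Type").loc["ICC2", "ICC"]
sem = df["stuttered"].std(ddof=1) * np.sqrt(1 - icc2)
print(icc2, sem)
```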
Subject(s)
Stuttering; Humans; Reproducibility of Results; Severity of Illness Index; Speech; Speech Production Measurement/methods; Stuttering/diagnosis

ABSTRACT
BACKGROUND: Auditory-perceptual assessment of voice is a subjective procedure. Artificial intelligence with deep learning (DL) may improve the consistency and accessibility of this task. It is unclear how a DL model performs on different acoustic features. AIMS: To develop a generalizable DL framework for identifying dysphonia using a multidimensional acoustic feature. METHODS & PROCEDURES: Recordings of sustained phonations of /a/ and /i/ were retrospectively collected from a clinical database. Subjects comprised 238 dysphonic and 223 vocally healthy speakers of Chinese Mandarin. All audio clips were split into multiple 1.5-s segments and normalized to the same loudness level. Mel-frequency cepstral coefficients (MFCCs) and mel-spectrograms were extracted from these standardized segments. Each set of features was used in a convolutional neural network (CNN) to perform a binary classification task. The best feature was obtained through five-fold cross-validation on a random selection of 80% of the data. The resultant DL framework was tested on the remaining 20% of the data and on a public German voice database. The performance of the DL framework was compared with those of two baseline machine-learning models. OUTCOMES & RESULTS: The mel-spectrogram yielded the best model performance, with a mean area under the receiver operating characteristic curve of 0.972 and an accuracy of 92% in classifying audio segments. The resultant DL framework significantly outperformed both baseline models in detecting dysphonic subjects on both test sets. The best outcomes were achieved when classifications were made based on all segments of both vowels, with 95% accuracy, 92% recall, 98% precision and 98% specificity on the Chinese test set, and 92%, 95%, 90% and 89%, respectively, on the German set. CONCLUSIONS & IMPLICATIONS: This study demonstrates the feasibility of DL for automatic detection of dysphonia. The mel-spectrogram is a preferred acoustic feature for the task. This framework may be used for vocal health screening and facilitate automatic perceptual evaluation of voice in the era of big data. WHAT THIS PAPER ADDS: What is already known on this subject Auditory-perceptual assessment is the current gold standard in clinical evaluation of voice quality, but its value may be limited by the rater's reliability and accessibility. DL is a new method of artificial intelligence that can overcome these disadvantages and promote automatic voice assessment. This study explored the feasibility of a DL approach for automatic detection of dysphonia, along with a quantitative comparison of two common sets of acoustic features. What this study adds to existing knowledge A CNN model is excellent at decoding multidimensional acoustic features, outperforming the baseline parameter-based models in identifying dysphonic voices. The first 13 MFCCs are sufficient for this task. The mel-spectrogram yields greater performance, indicating that it presents the acoustic information to the CNN model in a more favourable way than the MFCCs do. What are the potential or actual clinical implications of this work? DL is a feasible method for the detection of dysphonia. The current DL framework may be used for remote vocal health screening or documenting voice recovery after treatment. In future, DL models may potentially be used to perform auditory-perceptual tasks in an automatic, efficient, reliable and low-cost manner.
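A minimal sketch of the mel-spectrogram front end feeding a small binary CNN (the paper's actual architecture, segment handling, and loudness normalization are not reproduced; the file path and layer sizes are hypothetical):

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_input(path, sr=16000, seg_s=1.5):
    """One fixed-length mel-spectrogram segment, shaped (1, 1, mel, frames)."""
    y, _ = librosa.load(path, sr=sr, duration=seg_s)
    y = np.pad(y, (0, max(0, int(sr * seg_s) - len(y))))   # pad short clips
    m = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
    return torch.from_numpy(m).float().unsqueeze(0).unsqueeze(0)

net = nn.Sequential(                     # toy binary classifier head
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
)
# logits = net(mel_input("sustained_a.wav"))   # hypothetical recording
```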
Subject(s)
Deep Learning; Dysphonia; Humans; Dysphonia/diagnosis; Speech Acoustics; Retrospective Studies; Artificial Intelligence; Reproducibility of Results; Speech Production Measurement/methods; Acoustics

ABSTRACT
Biometrics-based authentication has become the most well-established form of user recognition in systems that demand a certain level of security, including such commonplace activities as access to the work environment or to one's own bank account. Among all biometrics, voice receives special attention due to factors such as ease of collection, the low cost of reading devices, and the large body of literature and software packages available for use. However, this biometric's ability to represent the individual may be impaired by the phenomenon known as dysphonia, a change in the voice signal due to disease acting on the vocal apparatus. As a consequence, for example, a user with the flu may not be properly authenticated by the recognition system. It is therefore important that automatic voice dysphonia detection techniques be developed. In this work, we propose a new framework based on representing the voice signal by multiple projections of cepstral coefficients to detect dysphonic alterations in the voice through machine learning techniques. Most of the best-known cepstral coefficient extraction techniques in the literature are mapped and analyzed separately and together with measures related to the fundamental frequency of the voice signal, and their representation capacity is evaluated with three classifiers. Experiments on a subset of the Saarbruecken Voice Database demonstrate the effectiveness of the proposed approach in detecting the presence of dysphonia in the voice.
Subject(s)
Dysphonia; Voice; Humans; Dysphonia/diagnosis; Speech Acoustics; Voice Quality; Speech Production Measurement/methods

ABSTRACT
There is a general need for more knowledge on the development of French phonology, and little information is currently available for typically developing French-speaking three-year-old children. This study took place in Belgium and explores the accuracy of speech production of 34 typically developing French-speaking children using a picture-naming task. Measures of speech accuracy revealed lower performance than previously reported in the literature. We investigated speech accuracy across different phonological contexts in light of characteristics of target words that are known to influence speech production, namely the condition of production (spontaneous vs. imitated), the length of the word (in number of syllables), syllable complexity (singleton vs. cluster), and positional complexity (onset vs. coda). Results indicate that the accuracy of words produced spontaneously did not differ from that of imitated words. The presence of consonant clusters in the target word was associated with lower performance on measures of Percentage of Consonants Correct and Whole Word Proximity for both 1- and 4-syllable words. Singleton codas were produced less accurately than onsets in 1-syllable words, and word-internal singleton codas were produced less accurately than final codas. In our sample, 1-syllable words showed surprisingly low levels of performance, which may be explained by an over-representation of phonologically complex properties in the target words used in the present study. These results highlight the importance of assessing various aspects of phonological complexity in French speech tasks in order to detect developmental errors in typically developing children and, ultimately, help identify children with speech sound disorders.
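A simplified sketch of the Percentage of Consonants Correct computation (assuming target and produced phone sequences are already aligned one-to-one; real scoring must handle insertions and deletions, and the consonant set shown is only illustrative):

```python
def percent_consonants_correct(target, produced, consonants):
    """PCC over aligned phone lists: correct consonants / target consonants."""
    tgt = [(i, p) for i, p in enumerate(target) if p in consonants]
    hits = sum(1 for i, p in tgt if i < len(produced) and produced[i] == p)
    return 100.0 * hits / len(tgt) if tgt else float("nan")

FRENCH_C = set("p b t d k g f v s z ʃ ʒ m n ɲ l ʁ j w ɥ".split())
# target /tʁɛ̃/ produced as /twɛ̃/: one of two consonants correct -> 50.0
print(percent_consonants_correct(["t", "ʁ", "ɛ̃"], ["t", "w", "ɛ̃"], FRENCH_C))
```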
Subject(s)
Language; Phonetics; Humans; Child; Child, Preschool; Speech Production Measurement/methods; Speech; Child Language

ABSTRACT
This study aimed to determine the effect of phonological and morphological factors on the dysfluencies of Nepali-speaking adults who stutter. Eighteen Nepali-speaking adult speakers with mild to very severe developmental stuttering were recruited. Spontaneous speech samples were audio-video recorded and transcribed orthographically. A total of 350 syllables were analysed to calculate stuttering frequency. Phoneme position, phoneme category, and word length were considered as the phonological factors, and word class as the morphological factor. The percentage of stuttering for each of these variables was computed. The results showed a significant effect of phoneme position and word length but no effect of phoneme category. Significantly greater stuttering was observed in word-initial position and in longer words compared with word-medial position and shorter words, respectively. Among the morphological factors, content words and content-function words had a greater stuttering rate than function words. Overall, this study showed a significant effect of phoneme position, word length, and grammatical class on the frequency of dysfluency in Nepali-speaking adults who stutter, but no effect of phoneme category. The phonetic complexity of these variables may increase motor planning demands, resulting in more stuttering.
Subject(s)
Stuttering; Adult; Humans; Speech Production Measurement/methods; Language; Speech; Phonetics

ABSTRACT
Speech-language pathologists regularly use perceptual methods in clinical practice to assess children's speech. In this study, we examined relationships between measures of speech intelligibility, clinical articulation test results, age, and perceptual ratings of articulatory goodness for children. We also examined the extent to which established measures of intelligibility and clinical articulation test results predicted articulatory goodness ratings, and whether goodness ratings were influenced by intelligibility. A sample of 164 typically developing children (30-47 months) provided speech samples and completed a standardised articulation test. Single-word intelligibility scores and ratings of articulatory goodness were gathered from 328 naïve listeners; scores on the standardised articulation test were obtained for each child. Bivariate Pearson correlation, linear regression, and linear mixed-effects modelling were used for analysis. Results showed that articulatory goodness ratings had the highest correlation with intelligibility, followed by age, followed by articulation score. Age and clinical articulation scores were both significant predictors of goodness ratings, but articulation scores made only a small contribution to prediction. Articulatory goodness ratings were substantially lower for unintelligible words than for intelligible words, but goodness scores increased with age at the same rate for unintelligible and intelligible words. Perceptual ratings of articulatory goodness are sensitive to developmental changes in speech production (regardless of intelligibility) and yield a different kind of information than clinical articulation scores from standardised measures.
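A sketch of the mixed-effects analysis with statsmodels on synthetic stand-in data (the study's actual predictors, listener-level effects, and random-effects structure are not reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({                       # synthetic ratings, one row per word
    "child": rng.integers(0, 30, n),
    "age_months": rng.integers(30, 48, n),
    "intelligible": rng.integers(0, 2, n),
})
df["goodness"] = (2 + 0.05 * df["age_months"] + 1.5 * df["intelligible"]
                  + rng.normal(0, 0.5, n))

m = smf.mixedlm("goodness ~ age_months * intelligible", df,
                groups=df["child"]).fit()
print(m.summary())
```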