Results 1 - 20 of 696
1.
Cerebellum ; 23(2): 459-470, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37039956

ABSTRACT

Dysarthria is a common manifestation across cerebellar ataxias, leading to impaired communication, reduced social connections, and decreased quality of life. While dysarthria symptoms may be present in other neurological conditions, ataxic dysarthria is a perceptually distinct motor speech disorder, its most prominent characteristics being articulation and prosody abnormalities along with distorted vowels. We hypothesized that the uncertainty of vowel predictions by an automatic speech recognition system can capture the speech changes present in cerebellar ataxia. Speech of participants with ataxia (N=61) and healthy controls (N=25) was recorded during the "picture description" task. Additionally, participants' dysarthric speech and ataxia severity were assessed with the Brief Ataxia Rating Scale (BARS). Eight participants with ataxia had speech and BARS data at two timepoints. A neural network trained for phoneme prediction was applied to the speech recordings. The average entropy of vowel token predictions (AVE) was computed for each participant's recording, together with the mean pitch and intensity standard deviations (MPSD and MISD) in the vowel segments. AVE and MISD were associated with the BARS speech score (Spearman's rho = 0.45 and 0.51), and AVE with the BARS total (rho = 0.39). In the longitudinal cohort, Wilcoxon paired signed-rank tests demonstrated an increase in BARS total and AVE, while BARS speech and the acoustic measures did not increase significantly. The relationship of AVE to both BARS speech and BARS total, as well as its ability to capture disease progression even in the absence of measured speech decline, indicates the potential of AVE as a digital biomarker for cerebellar ataxia.
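As a minimal sketch of the AVE measure (illustrative only; the input format, vowel inventory, and function names below are assumptions, not the authors' code), the entropy of each frame's phoneme posterior distribution can be averaged over vowel-aligned frames:

```python
import numpy as np

def average_vowel_entropy(posteriors, frame_phonemes,
                          vowels=("AA", "IY", "UW", "EH", "OW")):
    """Mean Shannon entropy of the phoneme posterior distribution,
    computed only over frames aligned to vowel tokens.

    posteriors:      (n_frames, n_phonemes) array of prediction probabilities
    frame_phonemes:  length-n_frames sequence of aligned phoneme labels
    vowels:          illustrative vowel subset (assumption, not the paper's set)
    """
    eps = 1e-12
    entropy = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    vowel_mask = np.array([p in vowels for p in frame_phonemes])
    return float(entropy[vowel_mask].mean())

# Toy usage: 4 frames, 3 phoneme classes
post = np.array([[0.90, 0.05, 0.05],
                 [0.40, 0.30, 0.30],
                 [0.80, 0.10, 0.10],
                 [0.34, 0.33, 0.33]])
labels = ["AA", "AA", "T", "IY"]
print(average_vowel_entropy(post, labels))  # higher = more uncertain vowels
```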


Subjects
Cerebellar Ataxia , Dysarthria , Humans , Dysarthria/etiology , Dysarthria/complications , Cerebellar Ataxia/diagnosis , Cerebellar Ataxia/complications , Uncertainty , Quality of Life , Ataxia/diagnosis , Ataxia/complications , Biomarkers
2.
J Biomed Inform ; 150: 104598, 2024 02.
Article in English | MEDLINE | ID: mdl-38253228

ABSTRACT

OBJECTIVES: We investigated how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy in the "Cookie Theft" picture description task, and assessed whether imperfect ASR-generated transcripts can provide valuable information for distinguishing between language samples from cognitively healthy individuals and those with Alzheimer's disease (AD). METHODS: We conducted experiments using various ASR models, refining their transcripts with post-editing techniques. Both these imperfect ASR transcripts and manually transcribed ones were used as inputs for downstream dementia classification. We conducted a comprehensive error analysis to compare model performance and assess the effectiveness of ASR-generated transcripts for dementia classification. RESULTS: Imperfect ASR-generated transcripts surprisingly outperformed manual transcription for distinguishing between individuals with AD and those without in the "Cookie Theft" task. These ASR-based models surpassed the previous state-of-the-art approach, indicating that ASR errors may contain valuable cues related to dementia. The synergy between ASR and classification models improved overall accuracy in dementia classification. CONCLUSION: Imperfect ASR transcripts effectively capture linguistic anomalies linked to dementia, improving accuracy in classification tasks. This synergy between ASR and classification models underscores ASR's potential as a valuable tool for assessing cognitive impairment and related clinical applications.
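The abstract does not specify the classification models; the sketch below is a generic stand-in showing how (possibly error-laden) transcripts feed a downstream text classifier, here TF-IDF features with logistic regression on toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy ASR transcripts (errors and fillers left in) with invented labels:
# 0 = healthy control, 1 = AD
transcripts = [
    "the boy is taking cookies from the jar",
    "uh the the boy um cookie the water",
    "the mother is washing dishes and the sink overflows",
    "water water the um sink the boy uh",
]
labels = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, transcripts, labels, cv=2)
print("accuracy per fold:", scores)
```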


Subjects
Alzheimer Disease , Cognitive Dysfunction , Speech Perception , Humans , Speech , Language , Alzheimer Disease/diagnosis
3.
Audiol Neurootol ; : 1-7, 2024 May 20.
Article in English | MEDLINE | ID: mdl-38768568

ABSTRACT

INTRODUCTION: This study aimed to verify the influence of the presentation mode and speed of speech stimuli on auditory recognition in cochlear implant (CI) users with poorer performance. METHODS: This cross-sectional observational study applied auditory speech perception tests to fifteen adults, using three different ways of presenting the stimulus in the absence of competing noise: monitored live voice (MLV), recorded speech at typical speed (RSTS), and recorded speech at slow speed (RSSS). Scores were assessed using the Percent Sentence Recognition Index (PSRI). The data were analysed inferentially using the Friedman and Wilcoxon tests with a 95% confidence interval and a 5% significance level (p < 0.05). RESULTS: The mean age was 41.1 years, the mean duration of CI use was 11.4 years, and the mean hearing threshold was 29.7 ± 5.9 dB HL. Test performance, as measured by the PSRI, was MLV = 42.4 ± 17.9%; RSTS = 20.3 ± 14.3%; RSSS = 40.6 ± 20.7%. RSTS differed significantly from both MLV and RSSS. CONCLUSION: The mode and speed of stimulus presentation affect auditory speech recognition in CI users: comprehension was better when the tests were applied in the MLV and RSSS modalities.
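The reported analysis (a Friedman test followed by pairwise Wilcoxon signed-rank tests) can be reproduced with scipy; the PSRI scores below are invented for illustration, not the study's data:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Toy PSRI scores (%) for the same listeners under the three presentations
mlv  = np.array([40, 55, 30, 62, 45, 38, 50])
rsts = np.array([18, 25, 10, 35, 22, 15, 28])
rsss = np.array([42, 50, 28, 60, 40, 35, 52])

stat, p = friedmanchisquare(mlv, rsts, rsss)
print(f"Friedman: chi2={stat:.2f}, p={p:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests against RSTS
for name, other in [("MLV", mlv), ("RSSS", rsss)]:
    w, pw = wilcoxon(rsts, other)
    print(f"RSTS vs {name}: W={w:.1f}, p={pw:.4f}")
```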

4.
J Exp Child Psychol ; 249: 106088, 2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39316884

ABSTRACT

Multi-talker noise impedes children's speech processing and may affect children listening in their second language more than children listening in their first language. Evidence suggests that multi-talker noise may also impede children's memory retention and learning. A total of 80 culturally and linguistically diverse children aged 7 to 9 years listened to narratives in two listening conditions: quiet and multi-talker noise (signal-to-noise ratio +6 dB). Repeated recall (immediate and delayed) was measured across a 1-week retention interval. Retention was calculated as the difference in recall accuracy per question between immediate and delayed recall. Working memory capacity was assessed, and the children's degree of school-language (Swedish) exposure was quantified. Immediate narrative recall was lower for the narrative encoded in noise than in quiet. At delayed recall, performance was similar for both listening conditions. Children with higher degrees of school-language exposure and higher working memory capacity had better narrative recall overall, but these factors were not associated with an effect of listening condition or retention. Multi-talker babble noise thus does not impair culturally and linguistically diverse primary school children's retention of spoken narratives as measured by multiple-choice questions. Although a quiet listening condition allows for superior encoding compared with a noisy one, details are likely lost during memory consolidation and re-consolidation.

5.
Memory ; 32(2): 237-251, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38265997

ABSTRACT

Recognition of speech in noise is facilitated when spoken sentences are repeated a few minutes later, but the levels of representation involved in this effect have not been specified. Three experiments tested whether the effect would transfer across modalities and languages. In Experiment 1, participants listened to sets of high- and low-constraint sentences and read other sets in an encoding phase. At test, these sentences and new sentences were presented in noise, and participants attempted to report the final word of each sentence. Recognition was more accurate for repeated than for new sentences in both modalities. Experiment 2 was identical except for the implementation of an articulatory suppression task at encoding to reduce phonological recoding during reading. The cross-modal repetition priming effect persisted but was weaker than when the modality was the same at encoding and test. Experiment 3 showed that the repetition priming effect did not transfer across languages in bilinguals. Taken together, the results indicate that the facilitated recognition of repeated speech is based on a combination of modality-specific processes at the phonological word form level and modality-general processes at the lemma level of lexical representation, but the semantic level of representation is not involved.


Subjects
Speech Perception , Speech , Humans , Repetition Priming , Language , Semantics
6.
Eur Arch Otorhinolaryngol ; 281(2): 683-691, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37552281

ABSTRACT

PURPOSE: To investigate 2-year post-operative hearing performance, safety, and patient-reported outcomes of hearing-impaired adults treated with the Osia® 2 System, an active osseointegrated bone-conduction hearing implant that uses piezoelectric technology. METHODS: A prospective, multicenter, open-label, single-arm, within-subject clinical study conducted at three tertiary referral centers in Melbourne, Sydney, and Hong Kong. Twenty adult recipients of the Osia 2 System were enrolled and followed up between 12 and 24 months post-implantation: 17 with mixed or conductive hearing loss and 3 with single-sided sensorineural deafness. Safety data, audiological thresholds, speech recognition thresholds in noise, and patient-reported outcomes were collected and evaluated. In addition, pre- and 6-month post-implantation data were collected retrospectively for this cohort, enrolled in the earlier study (ClinicalTrials.gov NCT04041700). RESULTS: Between the 6- and 24-month follow-ups, there was no statistically significant change in free-field hearing thresholds or speech reception thresholds in noise (p > 0.05), indicating that aided improvements were maintained up to 24 months of follow-up. Furthermore, improvements in health-related quality of life and daily hearing ability, as well as clinical and subjective measures of hearing benefit, remained stable over the 24-month period. No serious adverse events were reported during extended follow-up. CONCLUSIONS: These results provide further evidence for the longer-term clinical safety, hearing performance, and patient-related benefits of the Osia 2 System in patients with conductive hearing loss, mixed hearing loss, or single-sided sensorineural deafness. TRIAL REGISTRATION: ClinicalTrials.gov Identifier: NCT04754477. First posted: February 15, 2021.


Subjects
Deafness , Hearing Aids , Hearing Loss, Mixed Conductive-Sensorineural , Hearing Loss, Sensorineural , Hearing Loss , Speech Perception , Adult , Humans , Hearing Loss, Conductive/surgery , Hearing Loss, Mixed Conductive-Sensorineural/surgery , Follow-Up Studies , Prospective Studies , Quality of Life , Retrospective Studies , Hearing , Bone Conduction , Patient Reported Outcome Measures
7.
Eur Arch Otorhinolaryngol ; 281(3): 1205-1214, 2024 Mar.
Article in English | MEDLINE | ID: mdl-37792216

ABSTRACT

PURPOSE: To identify audiological and demographic variables that predict speech recognition abilities in patients with bilateral microtia who underwent Bonebridge (BB) implantation. METHODS: Fifty patients with bilateral microtia and bilateral conductive hearing loss (CHL) who underwent BB implantation were included. Demographic data, preoperative hearing aid experience, and audiological outcomes (including pure-tone hearing threshold, sound-field hearing threshold [SFHT], and speech recognition ability) were obtained for each participant. The Chinese-Mandarin Speech Test Materials were used to test speech recognition ability. The word recognition score (WRS) for disyllabic words at 65 dB SPL was measured before and after BB implantation in quiet and noisy conditions. RESULTS: The mean preoperative WRSs under quiet and noisy conditions were 10.44 ± 12.73% and 5.90 ± 8.76%, improving significantly to 86.38 ± 9.03% and 80.70 ± 11.34%, respectively, following BB fitting. Multiple linear regression analysis revealed that a lower preoperative SFHT predicted a higher preoperative WRS under both quiet and noisy conditions, and a higher age at implantation predicted a higher preoperative WRS under quiet conditions. Furthermore, patients with more preoperative hearing aid experience and a lower postoperative SFHT were more likely to have higher postoperative WRSs under both quiet and noisy testing conditions. CONCLUSIONS: This study represents the first attempt to identify predictors of preoperative and postoperative speech recognition abilities in patients with bilateral microtia who received BB implantation. The findings emphasize that early hearing intervention before implantation surgery, combined with appropriate postoperative fitting, contributes to optimal postoperative speech recognition.
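A hedged sketch of the reported multiple linear regression, on synthetic data whose directions of effect loosely follow the abstract (variable names, units, and magnitudes are assumptions, not the study's data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
preop_sfht = rng.uniform(55, 75, n)           # dB HL, sound-field hearing threshold
age_at_implant = rng.uniform(6, 45, n)        # years
hearing_aid_months = rng.uniform(0, 120, n)   # pre-op hearing aid experience

# Toy outcome: lower SFHT, higher age, more HA experience -> higher WRS
wrs = (60 - 0.5 * preop_sfht + 0.3 * age_at_implant
       + 0.1 * hearing_aid_months + rng.normal(0, 5, n))

X = sm.add_constant(np.column_stack([preop_sfht, age_at_implant,
                                     hearing_aid_months]))
model = sm.OLS(wrs, X).fit()
print(model.summary(xname=["const", "SFHT", "age", "HA_months"]))
```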


Subjects
Congenital Microtia , Hearing Aids , Speech Perception , Humans , Congenital Microtia/complications , Congenital Microtia/surgery , Retrospective Studies , Speech , Hearing Loss, Conductive/surgery , Bone Conduction
8.
Eur Arch Otorhinolaryngol ; 281(6): 3265-3268, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38409582

ABSTRACT

BACKGROUND: Mitochondrial encephalopathy, lactic acidosis, and stroke-like episodes (MELAS) is a maternally inherited mitochondrial disease that affects various systems of the body, particularly the brain, nervous system, and muscles. Sensorineural hearing loss is a common accompanying symptom. METHODS: We report a 42-year-old female patient with MELAS who experienced bilateral profound deafness and underwent sequential bilateral cochlear implantation (CI). Speech recognition and subjective outcomes were evaluated. RESULTS: After the first CI, the patient exhibited improved speech recognition ability and decided to undergo implantation of the second ear just two months after the initial surgery. The second CI likewise enhanced speech recognition ability, and subjective outcomes were satisfactory for bilateral CIs. CONCLUSIONS: Patients with MELAS receiving bilateral CIs can attain satisfactory post-CI speech recognition, spatial hearing, and sound quality.


Subjects
Cochlear Implantation , Cochlear Implants , MELAS Syndrome , Humans , Female , Adult , MELAS Syndrome/complications , Cochlear Implantation/methods , Hearing Loss, Sensorineural/surgery , Hearing Loss, Sensorineural/etiology , Speech Perception
9.
Int J Audiol ; : 1-8, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38327074

ABSTRACT

OBJECTIVES: (1) To determine whether the standard Dutch word lists for speech audiometry are equally intelligible to normal-hearing listeners (Experiment 1), (2) to investigate whether synthetic speech can be used to create word lists (Experiment 1), and (3) to determine whether the list effect found in Experiment 1 can be reduced by combining two lists into pairs (Experiment 2). DESIGN: Participants performed speech tests in quiet with the original (natural) and synthetic word lists (Experiment 1). In Experiment 2, new participants performed speech tests with list pairs constructed from the original lists on the basis of the results of Experiment 1. STUDY SAMPLE: Twenty-four and twenty-eight normal-hearing adults, respectively. RESULTS: There was a significant list effect for the natural speech lists but not for the synthetic ones. Variability in intelligibility was significantly higher in the former, with list differences of up to 20% at fixed presentation levels. The 95% confidence interval of a list with a score of approximately 70% is around 10 percentage points wider than that of a list pair. CONCLUSIONS: The original Dutch word lists show large variations in intelligibility. List effects can be reduced by combining two lists per condition. Synthetic speech is a promising alternative to natural speech for speech audiometry in quiet.
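The narrowing of the confidence interval when two lists are pooled can be illustrated with a simple binomial (Wilson) interval. This treats scoring units as independent trials and ignores between-list variance, so it shows only the direction of the effect; the number of units per list is hypothetical:

```python
from statsmodels.stats.proportion import proportion_confint

def ci_width_points(score, n_items):
    """Width of the 95% Wilson interval, in percentage points."""
    lo, hi = proportion_confint(round(score * n_items), n_items,
                                alpha=0.05, method="wilson")
    return (hi - lo) * 100

n = 30  # hypothetical number of scoring units per list
print(f"single list: {ci_width_points(0.70, n):.1f} points")
print(f"list pair  : {ci_width_points(0.70, 2 * n):.1f} points")
```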

10.
Int J Audiol ; : 1-6, 2024 Aug 05.
Article in English | MEDLINE | ID: mdl-39101925

ABSTRACT

OBJECTIVES: Wireless sound transfer methods for cochlear implant (CI) sound processors have become popular for remote self-assessed hearing tests. The aim of this study was to determine (1) spectral differences in stimuli between different wireless sound transfer options and (2) their effect on the outcomes of speech recognition tests in noise. DESIGN: In study 1, the frequency response of different streaming options (Phonak Roger Select, Cochlear Mini Mic 2+, telecoil, and Bluetooth connection) was measured by connecting headphones to CI sound processors. Study 2 followed a repeated-measures design in which digits-in-noise (DIN) tests were performed using wireless streaming to sound processors from Cochlear, Advanced Bionics, and MED-EL. STUDY SAMPLE: 20 normal-hearing participants. RESULTS: Differences in frequency response between the loudspeaker and wireless streaming conditions were minimal. We found no significant difference in DIN outcomes (F(4,194) = 1.062, p = 0.376) between the wireless transfer options with the Cochlear Nucleus 7 processor, and no significant difference between Bluetooth streaming and the loudspeaker condition for any of the three tested brands. The mean standard error of measurement was 0.72 dB. CONCLUSIONS: No significant differences in DIN test outcomes were found between wireless sound transfer and the reference method.

11.
Sensors (Basel) ; 24(12)2024 Jun 14.
Article in English | MEDLINE | ID: mdl-38931629

ABSTRACT

Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy, and most existing models are built on the Transformer architecture, which tends to be complex and ill-suited to small datasets. Hence, we propose a nonlinear regularization decoding method for speech recognition. First, we introduce a nonlinear Transformer decoder that breaks away from traditional left-to-right or right-to-left decoding orders and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module to optimize the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of an overly large parameter count. Experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell-1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
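The paper's regularization attention module is not specified in this abstract; below is a generic, illustrative PyTorch sketch of scaled dot-product attention with an entropy penalty on the score matrix, one plausible way to damp the propagation of early decoding errors (a stand-in, not the authors' module):

```python
import torch

def attention_with_entropy_reg(q, k, v, reg_weight=0.01):
    """Scaled dot-product attention plus an entropy regularizer on the
    attention score matrix.

    q, k, v: (batch, seq, dim) tensors
    Returns the attended values and a penalty term to add to the loss.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq, seq)
    attn = scores.softmax(dim=-1)
    # Penalize overly diffuse attention rows (high entropy)
    entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1).mean()
    return attn @ v, reg_weight * entropy

q = k = v = torch.randn(2, 5, 16)
out, reg = attention_with_entropy_reg(q, k, v)
print(out.shape, reg.item())
```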


Subjects
Algorithms , Speech Recognition Software , Humans , Speech/physiology , Nonlinear Dynamics , Pattern Recognition, Automated/methods
12.
Sensors (Basel) ; 24(14)2024 Jul 20.
Article in English | MEDLINE | ID: mdl-39066111

ABSTRACT

In air traffic control (ATC), speech communication over radio is the primary way to exchange information between controller and pilot. The integration of automatic speech recognition (ASR) systems therefore holds immense potential for reducing controllers' workload and plays a crucial role in various ATC scenarios. This article provides a comprehensive review of applications of ASR technology in ATC communication systems. It first surveys current research, including ATC corpora, ASR models, evaluation measures, and application scenarios. A more complete and accurate evaluation methodology tailored to ATC is then proposed, taking into account advances in communication sensing systems and deep learning techniques; this methodology should help researchers enhance ASR systems and improve the overall performance of ATC systems. Finally, future research directions are identified based on the primary challenges and open issues. The authors hope this work will serve as a clear technical roadmap for ASR efforts in the ATC domain and make a valuable contribution to the research community.

13.
Sensors (Basel) ; 24(10)2024 May 09.
Article in English | MEDLINE | ID: mdl-38793860

ABSTRACT

In environments where silent communication is essential, such as libraries and conference rooms, the need for a discreet means of interaction is paramount. Here, we present a single-electrode, contact-separated triboelectric nanogenerator (CS-TENG) characterized by robust high-frequency sensing capabilities and long-term stability. Integrating this TENG onto the inner surface of a mask allows for the capture of conversational speech signals through airflow vibrations, generating a comprehensive dataset. Employing advanced signal processing techniques, including short-time Fourier transform (STFT), Mel-frequency cepstral coefficients (MFCC), and deep learning neural networks, facilitates the accurate identification of speaker content and verification of their identity. The accuracy rates for each category of vocabulary and identity recognition exceed 92% and 90%, respectively. This system represents a pivotal advancement in facilitating secure and efficient unobtrusive communication in quiet settings, with promising implications for smart home applications, virtual assistant technology, and potential deployment in security and confidentiality-sensitive contexts.
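A brief sketch of the named signal-processing front end (STFT and MFCCs) using librosa, computed on a synthetic tone standing in for the mask sensor's airflow-vibration signal:

```python
import numpy as np
import librosa

# Synthetic 1-second tone as a stand-in for the CS-TENG sensor signal
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t) * np.hanning(sr)

# Short-time Fourier transform magnitude
stft = np.abs(librosa.stft(y, n_fft=512, hop_length=128))

# Mel-frequency cepstral coefficients, a typical input to the
# deep learning classifier described in the abstract
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(stft.shape, mfcc.shape)
```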

14.
Sensors (Basel) ; 24(7)2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38610492

ABSTRACT

In recent years, attention to the realization of a distributed fiber-optic microphone for the detection and recognition of the human voice has increased; the most popular schemes are based on φ-OTDR (phase-sensitive optical time-domain reflectometry). Many issues related to the selection of optimal system parameters and the recognition of registered signals, however, remain unresolved. In this research, we conducted theoretical studies of these issues based on a φ-OTDR mathematical model and verified them experimentally. We designed an algorithm for fiber-sensor signal processing, applied a testing kit, and devised a method for the quantitative evaluation of the results. We also proposed a new setup model for lab tests of φ-OTDR single-coordinate sensors, which allows their parameters to be varied quickly. As a result, it was possible to define requirements for the best quality of speech recognition: estimation using the percentage of recognized words yielded a value of 96.3%, and estimation using the Levenshtein distance yielded a value of 15.
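Both reported metrics reduce to edit distance. Below is a self-contained sketch of the Levenshtein distance over word sequences, with one simple way to derive a percentage of recognized words from it (the scoring convention is an assumption, not necessarily the authors'):

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

ref = "the quick brown fox jumps over the lazy dog".split()
hyp = "the quick brown box jumps over lazy dog".split()
dist = levenshtein(ref, hyp)
print("Levenshtein distance:", dist)
print(f"Recognized words: {100 * (len(ref) - dist) / len(ref):.1f}%")
```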

15.
Sensors (Basel) ; 24(13)2024 Jul 04.
Article in English | MEDLINE | ID: mdl-39001130

ABSTRACT

In recent years, embedded system technologies and products for sensor networks and wearable devices used for monitoring people's activities and health have become a focus of the global IT industry. To enhance the speech recognition capabilities of wearable devices, this article discusses the implementation of audio positioning and enhancement in embedded systems, using embedded algorithms for direction detection and mixed-source separation. The two algorithms are implemented on different embedded systems: direction detection on a TI TMS320C6713 DSK and mixed-source separation on a Raspberry Pi 2. For mixed-source separation, in the first experiment, the average signal-to-interference ratio (SIR) at 1 m and 2 m distances was 16.72 and 15.76, respectively. In the second experiment, when evaluated using speech recognition, the algorithm improved speech recognition accuracy to 95%.
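SIR is conventionally a power ratio expressed in dB. A simplified sketch follows (the full bss_eval metric first projects the estimate onto target and interference subspaces; this version assumes the separated signal has already been decomposed):

```python
import numpy as np

def sir_db(target_component, interference_component):
    """Signal-to-interference ratio in dB from the decomposed parts of a
    separated signal (simplified power ratio, not full bss_eval)."""
    p_t = np.sum(target_component ** 2)
    p_i = np.sum(interference_component ** 2)
    return 10 * np.log10(p_t / p_i)

rng = np.random.default_rng(1)
target = rng.normal(0, 1.0, 16000)                  # recovered speech part
residual_interference = rng.normal(0, 0.15, 16000)  # leaked interferer part
print(f"SIR = {sir_db(target, residual_interference):.1f} dB")
```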


Subjects
Algorithms , Wearable Electronic Devices , Humans , Signal Processing, Computer-Assisted , Sound Localization
16.
Sensors (Basel) ; 24(8)2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38676191

ABSTRACT

This paper addresses a joint training approach for a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, in which an acoustic tokenizer is included to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides pseudo-labels through K-means clustering. To transfer this linguistic information, represented by the pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss is proposed; it is a self-supervised contrastive loss and is combined with an information noise-contrastive estimation (infoNCE) loss. This combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained with the proposed approach are higher than those of a standalone SE model and of SE models trained with conventional joint training. An ablation study investigating different combinations of loss functions shows that the proposed CBPC loss combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
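Only the generic infoNCE component is sketched below, in PyTorch; the paper's CBPC term and the combination weights are not specified in the abstract and are not reproduced here:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: pull the anchor embedding toward its positive and
    away from the negatives.

    anchor, positive: (dim,) tensors; negatives: (n_neg, dim)
    """
    pos = F.cosine_similarity(anchor, positive, dim=0) / temperature
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / temperature
    logits = torch.cat([pos.unsqueeze(0), neg])
    # The positive is always at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

anchor = torch.randn(128)
positive = anchor + 0.1 * torch.randn(128)   # e.g., same K-means pseudo-label
negatives = torch.randn(8, 128)              # e.g., other pseudo-labels
print(info_nce(anchor, positive, negatives).item())
```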


Subjects
Noise , Speech Recognition Software , Humans , Cluster Analysis , Algorithms , Speech/physiology
17.
J Oral Rehabil ; 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39135293

ABSTRACT

BACKGROUND: Automatic speech recognition (ASR) can potentially help older adults and people with disabilities reduce their dependence on others and increase their participation in society. However, maxillectomy patients with reduced speech intelligibility may encounter some problems using such technologies. OBJECTIVES: To investigate the accuracy of three commonly used ASR platforms when used by Japanese maxillectomy patients with and without their obturator placed. METHODS: Speech samples were obtained from 29 maxillectomy patients with and without their obturator and 17 healthy volunteers. The samples were input into three speaker-independent speech recognition platforms and the transcribed text was compared with the original text to calculate the syllable error rate (SER). All participants also completed a conventional speech intelligibility test to grade their speech using Taguchi's method. A comprehensive articulation assessment of patients without their obturator was also performed. RESULTS: Significant differences in SER were observed between healthy and maxillectomy groups. Maxillectomy patients with an obturator showed a significant negative correlation between speech intelligibility scores and SER. However, for those without an obturator, no significant correlations were observed. Furthermore, for maxillectomy patients without an obturator, significant differences were found between syllables grouped by vowels. Syllables containing /i/, /u/ and /e/ exhibited higher error rates compared to those containing /a/ and /o/. Additionally, significant differences were observed when syllables were grouped by consonant place of articulation and manner of articulation. CONCLUSION: The three platforms performed well for healthy volunteers and maxillectomy patients with their obturator, but the SER for maxillectomy patients without their obturator was high, rendering the platforms unusable. System improvement is needed to increase accuracy for maxillectomy patients.

18.
Phonetica ; 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-39248125

ABSTRACT

Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements to forced alignment systems. The first is treating the acoustic model as a tagger rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique that allows more precise boundaries than the typical 10 ms limit of modern systems. In testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner at the 10 ms boundary-placement tolerance threshold, the greatest difference being a 28.13% relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting that the output targets for these models do not match phoneticians' conception of similarity between phones, and that reconciling this tension may require rethinking the task and output targets, or how speech itself should be segmented.
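One plausible realization of such sub-frame interpolation (not necessarily MAPS's exact method) is to place the boundary where the two segments' posterior curves cross, linearly interpolated between 10 ms frames:

```python
import numpy as np

def interpolated_boundary(post_a, post_b, frame_step=0.010):
    """Place a boundary between two segments at sub-frame precision.

    post_a, post_b: per-frame posterior curves for the left/right segment
    over a window around the coarse boundary. The boundary is put where
    the curves cross, linearly interpolated between frames.
    Returns the boundary time in seconds, or None if no crossing is found.
    """
    diff = post_a - post_b                     # positive while segment A dominates
    for i in range(len(diff) - 1):
        if diff[i] > 0 >= diff[i + 1]:         # sign change within (i, i+1)
            frac = diff[i] / (diff[i] - diff[i + 1])
            return (i + frac) * frame_step     # finer than the 10 ms frame step
    return None

a = np.array([0.9, 0.8, 0.6, 0.3, 0.1])
b = np.array([0.1, 0.2, 0.4, 0.7, 0.9])
print(f"boundary at {interpolated_boundary(a, b) * 1000:.1f} ms")
```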

19.
Clin Linguist Phon ; : 1-14, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39162064

ABSTRACT

This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs), with the goal of replacing manual transcription in clinical procedures. Because general-purpose ASR models map input speech to standard spellings, even well-known high-performance ASR models are unsuitable for evaluating pronunciation in children with SSDs. We fine-tuned the wav2vec2.0 XLS-R model to recognise words as children actually pronounce them, rather than converting the speech into standard spellings. The model was fine-tuned on a speech dataset of 137 children with SSDs pronouncing 73 Korean words selected for actual clinical diagnosis. The model's phoneme error rate (PER) was only 10% when its predictions of children's pronunciations were compared with human annotations of the pronunciations as heard. In contrast, despite its robust performance on general tasks, the state-of-the-art ASR model Whisper showed clear limitations in recognising the speech of children with SSDs, with a PER of approximately 50%. While the model still requires improvement in recognising unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation-error diagnostic procedures in clinical settings.
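A minimal sketch of one CTC fine-tuning step with Hugging Face transformers. As a runnable stand-in it loads an English CTC checkpoint, whereas the paper fine-tunes wav2vec2.0 XLS-R on Korean children's speech with a pronunciation-as-heard vocabulary; PER would then be an edit distance over phoneme sequences (see the Levenshtein sketch under item 14):

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Stand-in checkpoint; the paper's base model is wav2vec2.0 XLS-R
name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

# One toy CTC training step on a second of silence labelled with one word;
# real fine-tuning would iterate over the clinical recordings
audio = torch.zeros(16000)
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor(text="HELLO", return_tensors="pt").input_ids

out = model(inputs.input_values, labels=labels)
out.loss.backward()  # CTC loss; an optimizer step would follow
print("CTC loss:", out.loss.item())
```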

20.
Dement Geriatr Cogn Disord ; 52(4): 240-248, 2023.
Article in English | MEDLINE | ID: mdl-37433284

ABSTRACT

INTRODUCTION: Alzheimer's disease (AD) is the most prevalent type of dementia and can cause abnormal cognitive function and progressive loss of essential life skills. Early screening is thus necessary for the prevention of, and intervention in, AD. Speech dysfunction is an early-onset symptom of AD. Recent studies have demonstrated the promise of automated assessment using acoustic or linguistic features extracted from speech. However, most previous studies have relied on manual transcription to extract linguistic features, which weakens the efficiency of automated assessment. The present study therefore investigates the effectiveness of automatic speech recognition (ASR) in building an end-to-end automated speech analysis model for AD detection. METHODS: We implemented three publicly available ASR engines and compared classification performance on the ADReSS-IS2020 dataset. The SHapley Additive exPlanations (SHAP) algorithm was then used to identify the critical features that contributed most to model performance. RESULTS: The three automatic transcription tools produced transcripts with mean word error rates of 32%, 43%, and 40%, respectively. These automated transcripts achieved similar or even better results than manual transcripts in model performance for detecting dementia, with classification accuracies of 89.58%, 83.33%, and 81.25%, respectively. CONCLUSION: Our best model, using ensemble learning, is comparable to state-of-the-art manual-transcription-based methods, suggesting the possibility of an end-to-end medical assistance system for AD detection built on ASR engines. Moreover, the critical linguistic features might provide insight into further studies on the mechanism of AD.
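A hedged sketch of the SHAP step: the feature names and data below are invented, and a gradient-boosted classifier stands in for the paper's unspecified ensemble; mean absolute SHAP values then rank feature contributions:

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in: rows = speakers, columns = linguistic features extracted
# from (automatic) transcripts; names are illustrative only
feature_names = ["pronoun_rate", "idea_density", "pause_count", "word_freq"]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 100) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # (n_samples, n_features) log-odds

# Mean absolute SHAP value per feature = its overall contribution
importance = np.abs(shap_values).mean(axis=0)
for fname, imp in zip(feature_names, importance):
    print(f"{fname}: {imp:.3f}")
```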


Subjects
Alzheimer Disease , Speech Perception , Humans , Alzheimer Disease/psychology , Linguistics , Speech , Cognition