Results 1 - 20 of 825
1.
IEEE J Transl Eng Health Med ; 12: 382-389, 2024.
Article in English | MEDLINE | ID: mdl-38606392

ABSTRACT

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence to sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
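The segmentation metric reported above can be computed directly from time-aligned word boundaries. Below is a minimal Python sketch of interval intersection-over-union against a manual parse; the segment times and the helper name are illustrative, not taken from the study.

```python
# Minimal sketch: interval intersection-over-union between an ASR-aligned word
# segment and a manually annotated one, as used to score word-level segmentation.
# The segment values below are illustrative, not from the study.

def interval_iou(pred, ref):
    """IoU of two (start, end) time intervals in seconds."""
    start = max(pred[0], ref[0])
    end = min(pred[1], ref[1])
    intersection = max(0.0, end - start)
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - intersection
    return intersection / union if union > 0 else 0.0

# One IoU per word, then average over the utterance or dataset.
predicted = [(0.12, 0.48), (0.55, 0.90), (1.02, 1.40)]
manual    = [(0.10, 0.50), (0.56, 0.92), (1.00, 1.38)]
mean_iou = sum(interval_iou(p, r) for p, r in zip(predicted, manual)) / len(manual)
print(f"mean IoU: {mean_iou:.3f}")
```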


Subjects
Speech Perception, Speech, Humans, Speech Recognition Software, Dysarthria/diagnosis, Speech Disorders
2.
J Affect Disord ; 355: 40-49, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38552911

ABSTRACT

BACKGROUND: Prior research has associated spoken language use with depression, yet studies often involve small or non-clinical samples and face challenges in the manual transcription of speech. This paper aimed to automatically identify depression-related topics in speech recordings collected from clinical samples. METHODS: The data included 3919 English free-response speech recordings collected via smartphones from 265 participants with a depression history. We transcribed speech recordings via automatic speech recognition (Whisper tool, OpenAI) and identified principal topics from transcriptions using a deep learning topic model (BERTopic). To identify depression risk topics and understand the context, we compared participants' depression severity and behavioral (extracted from wearable devices) and linguistic (extracted from transcribed texts) characteristics across identified topics. RESULTS: Of the 29 topics identified, 6 were depression risk topics: 'No Expectations', 'Sleep', 'Mental Therapy', 'Haircut', 'Studying', and 'Coursework'. Participants mentioning depression risk topics exhibited higher sleep variability, later sleep onset, and fewer daily steps and used fewer words, more negative language, and fewer leisure-related words in their speech recordings. LIMITATIONS: Our findings were derived from a depressed cohort with a specific speech task, potentially limiting the generalizability to non-clinical populations or other speech tasks. Additionally, some topics had small sample sizes, necessitating further validation in larger datasets. CONCLUSION: This study demonstrates that specific speech topics can indicate depression severity. The employed data-driven workflow provides a practical approach for analyzing large-scale speech data collected from real-world settings.
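For readers unfamiliar with the pipeline, the transcribe-then-topic-model workflow described above can be sketched with the open-source openai-whisper and bertopic packages. The directory path, model size, and hyperparameters below are placeholders, not the study's configuration.

```python
# Sketch of the transcribe-then-topic-model workflow described in the abstract,
# using the open-source `openai-whisper` and `bertopic` packages.
import glob
import whisper
from bertopic import BERTopic

asr = whisper.load_model("base")                     # model size is a placeholder
audio_files = sorted(glob.glob("recordings/*.wav"))  # placeholder directory of recordings
transcripts = [asr.transcribe(path)["text"] for path in audio_files]

topic_model = BERTopic(min_topic_size=10)            # hyperparameters illustrative
topics, probs = topic_model.fit_transform(transcripts)
print(topic_model.get_topic_info().head())           # inspect the discovered topics
```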


Subjects
Deep Learning, Speech, Humans, Smartphone, Depression/diagnosis, Speech Recognition Software
3.
JASA Express Lett ; 4(2)2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38350077

ABSTRACT

Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard, human transcription, is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested four state-of-the-art ASR systems with second-language speech in noise and found that one, Whisper, performed at or above human listener accuracy. However, the content of Whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.
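Word error rate, the accuracy measure implied above, can be computed with the jiwer package; a minimal sketch with invented sentences:

```python
# Minimal sketch: scoring ASR output against a human reference transcript with
# word error rate, the usual proxy for intelligibility comparisons. The
# sentences are made up for illustration.
import jiwer

reference  = "the boat drifted slowly toward the shore"
hypothesis = "the boat drifted slowly towards shore"

wer = jiwer.wer(reference, hypothesis)   # (subs + dels + ins) / reference words
print(f"WER: {wer:.2%}")
```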


Subjects
Speech Perception, Humans, Speech Perception/physiology, Noise/adverse effects, Speech Intelligibility/physiology, Speech Recognition Software, Psychological Recognition
4.
Stud Health Technol Inform ; 310: 124-128, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38269778

ABSTRACT

Creating notes in the EHR is one of the most burdensome tasks for health professionals. The main challenges are the time spent on this task and the quality of the records. Automatic speech recognition technologies aim to facilitate clinical documentation for users, optimizing their workflow. In our hospital, we internally developed an automatic speech recognition system (ASR) to record progress notes in a mobile EHR. The objective of this article is to describe the pilot study carried out to evaluate the implementation of ASR to record progress notes in a mobile EHR application. The specialty that used ASR the most was Home Medicine. The lack of access to a computer at the time of care and the need to write short, quick progress notes were the main reasons users adopted the system.


Subjects
Documentation, Speech Recognition Software, Humans, Pilot Projects, Health Personnel, Hospitals
5.
Radiol Artif Intell ; 6(2): e230205, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38265301

ABSTRACT

This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. The performances of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) were compared in detecting these errors, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained F1 scores of 59.1% and 32.2%, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors, including nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning. Supplemental material is available for this article.
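The precision/recall/F1 figures above follow the standard binary-classification definitions, with manual error labels as the reference; a small sketch using scikit-learn on invented labels:

```python
# Sketch of how precision, recall, and F1 are computed when manual error labels
# are the reference and the LLM's flags are the predictions. Labels are invented.
from sklearn.metrics import precision_recall_fscore_support

manual_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = report contains a significant error
llm_flags     = [1, 0, 1, 0, 0, 1, 1, 0]   # 1 = LLM flagged an error

precision, recall, f1, _ = precision_recall_fscore_support(
    manual_labels, llm_flags, average="binary"
)
print(f"precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
```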


Subjects
New World Camelids, Radiology Information Systems, Radiology, Speech Perception, Animals, Speech, Speech Recognition Software, Reproducibility of Results
6.
Sci Rep ; 14(1): 313, 2024 01 03.
Article in English | MEDLINE | ID: mdl-38172277

ABSTRACT

Tashlhiyt is a low-resource language with respect to acoustic databases, language corpora, and speech technology tools, such as Automatic Speech Recognition (ASR) systems. This study investigates whether a method of cross-language re-use of ASR is viable for Tashlhiyt from an existing commercially available system built for Arabic. The source and target language in this case have similar phonological inventories, but Tashlhiyt permits typologically rare phonological patterns, including vowelless words, while Arabic does not. We find systematic disparities in ASR transfer performance (measured as word error rate (WER) and Levenshtein distance) for Tashlhiyt across word forms and speaking style variation. Overall, performance was worse for casual speaking modes across the board. In clear speech, performance was lower for vowelless than for voweled words. These results highlight systematic speaking-mode and phonotactic disparities in cross-language ASR transfer. They also indicate that linguistically informed approaches to ASR re-use can provide more effective ways to adapt existing speech technology tools for low-resource languages, especially when they contain typologically rare structures. The study also speaks to issues of linguistic disparities in ASR and speech technology more broadly. It can also contribute to understanding the extent to which machines are similar to, or different from, humans in mapping the acoustic signal to discrete linguistic representations.
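The two transfer metrics mentioned, WER and Levenshtein distance, both reduce to edit distance; below is a minimal sketch of character-level Levenshtein distance (the example strings are invented, not actual Tashlhiyt ASR output).

```python
# Minimal sketch: Levenshtein (edit) distance, one of the two transfer metrics
# mentioned in the abstract, implemented with standard dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Example with a made-up vowelless/voweled pair (not real system output).
print(levenshtein("tkksit", "takksit"))  # -> 1
```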


Subjects
Speech Perception, Humans, Language, Linguistics, Speech, Speech Recognition Software
7.
Int J Med Inform ; 178: 105213, 2023 10.
Article in English | MEDLINE | ID: mdl-37690224

ABSTRACT

PURPOSE: Considering the significant workload of nursing tasks, enhancing the efficiency of nursing documentation is imperative. This study aimed to evaluate the effectiveness of a machine learning-based speech recognition (SR) system in reducing the clinical workload associated with typing nursing records, implemented in a psychiatry ward. METHODS: The study was conducted between July 15, 2020, and June 30, 2021, at Cheng Hsin General Hospital in Taiwan. The language corpus was based on the existing records from the hospital nursing information system. The participating ward's nursing activities, clinical conversations, and accent data were also collected for deep learning-based SR-engine training. A total of 21 nurses participated in the evaluation of the SR system. Documentation time and recognition error rate were evaluated in parallel between SR-generated records and keyboard entry over 4 sessions. Any differences between SR and keyboard transcriptions were regarded as SR errors. FINDINGS: A total of 200 records were obtained across the four evaluation sessions: at each session, 10 participants used SR and keyboard entry in parallel, and 5 entries were collected from each participant. Overall, the SR system processed 30,112 words in 32,456 s (0.928 words per second). The mean accuracy of the SR system improved after each session, from 87.06% in the 1st session to 95.07% in the 4th session. CONCLUSION: This pilot study demonstrated that our machine learning-based SR system has acceptable recognition accuracy and may reduce the burden of documentation for nurses. However, potential errors in SR transcription should continually be recognized and corrected. Further studies are needed to improve the integration of SR in digital documentation of nursing records, in terms of both productivity and accuracy across different clinical specialties.


Subjects
Speech Recognition Software, Speech, Humans, Pilot Projects, Perception, Documentation
9.
Article in English | MEDLINE | ID: mdl-37603475

ABSTRACT

Automatic Speech Recognition (ASR) technologies can be life-changing for individuals with dysarthria, a speech impairment that affects the articulatory muscles and results in unintelligible speech. Nevertheless, the performance of current dysarthric ASR systems is unsatisfactory, especially for speakers with severe dysarthria, who would benefit most from this technology. While transformer and neural attention-based sequence-to-sequence ASR systems have achieved state-of-the-art results in converting healthy speech to text, their application to dysarthric ASR remains underexplored due to the complexity of dysarthric speech and the lack of extensive training data. In this study, we addressed this gap and proposed our Dysarthric Speech Transformer, which uses a customized deep transformer architecture. To deal with the data scarcity problem, we designed a two-phase transfer learning pipeline to leverage healthy speech, investigated neural freezing configurations, and utilized audio data augmentation. Overall, we trained 45 speaker-adaptive dysarthric ASR models in our investigations. Results indicate the effectiveness of the transfer learning pipeline and data augmentation, and emphasize the significance of deeper transformer architectures. The proposed ASR outperformed the state of the art and delivered better accuracy for 73% of the dysarthric subjects whose speech samples were employed in this study, with improvements of up to 23%.
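The "neural freezing configurations" are described only at a high level; below is a generic PyTorch sketch of one such configuration (freeze the lower encoder layers pre-trained on healthy speech, fine-tune the rest on the small dysarthric set). It illustrates the idea only and is not the authors' Dysarthric Speech Transformer.

```python
# Generic PyTorch sketch of one "freezing configuration" during transfer
# learning: reuse an encoder pre-trained on healthy speech, freeze its lower
# layers, and fine-tune only the upper layers plus the output head on the
# small dysarthric set. Dimensions are illustrative.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # stands in for a pre-trained encoder
head = nn.Linear(256, 32)  # output vocabulary size is illustrative

# Freeze the first 4 encoder layers, adapt the remaining layers and the head.
for layer in encoder.layers[:4]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

x = torch.randn(8, 120, 256)   # batch of 8 utterances, 120 frames, 256-dim features
logits = head(encoder(x))      # (8, 120, 32) frame-level logits
print(logits.shape)
```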


Subjects
Dysarthria, Speech, Humans, Speech Recognition Software, Speech Disorders, Learning
10.
Sensors (Basel) ; 23(13)2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37447886

ABSTRACT

This paper proposes a speech recognition method based on a domain-specific language speech network (DSL-Net) and a confidence decision network (CD-Net). The method involves automatically training a domain-specific dataset, using pre-trained model parameters for migration learning, and obtaining a domain-specific speech model. Importance sampling weights were set for the trained domain-specific speech model, which was then integrated with the trained speech model from the benchmark dataset. This integration automatically expands the lexical content of the model to accommodate the input speech based on the lexicon and language model. The adaptation attempts to address the issue of out-of-vocabulary words that are likely to arise in most realistic scenarios and utilizes external knowledge sources to extend the existing language model. By doing so, the approach enhances the adaptability of the language model in new domains or scenarios and improves the prediction accuracy of the model. For domain-specific vocabulary recognition, a deep fully convolutional neural network (DFCNN) and a connectionist temporal classification (CTC)-based approach were employed to achieve effective recognition of domain-specific vocabulary. Furthermore, a confidence-based classifier was added to enhance the accuracy and robustness of the overall approach. In the experiments, the method was tested on a proprietary domain audio dataset and compared with an automatic speech recognition (ASR) system trained on a large-scale dataset. Based on experimental verification, the model achieved an accuracy improvement from 82% to 91% in the medical domain. The inclusion of domain-specific datasets resulted in a 5% to 7% enhancement over the baseline, while the introduction of model confidence further improved the baseline by 3% to 5%. These findings demonstrate the significance of incorporating domain-specific datasets and model confidence in advancing speech recognition technology.
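For context, a CTC-based objective like the one paired with the DFCNN above can be set up in a few lines of PyTorch; shapes and vocabulary size below are illustrative only.

```python
# Minimal sketch of the connectionist temporal classification (CTC) objective
# that the abstract pairs with a DFCNN acoustic model. Shapes and vocabulary
# size are illustrative; random tensors stand in for real acoustic outputs.
import torch
import torch.nn as nn

T, N, C = 50, 4, 30  # time steps, batch size, vocab size (index 0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)   # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in training this gradient would update the acoustic model
print(float(loss))
```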


Subjects
Theoretical Models, Computer Neural Networks, Speech Recognition Software, Speech, Speech Perception, Datasets as Topic, Sound Spectrography
11.
Int J Med Inform ; 176: 105112, 2023 08.
Article in English | MEDLINE | ID: mdl-37276615

ABSTRACT

BACKGROUND: The purpose of this study is to develop an automatic speech recognition (ASR) deep learning model for transcribing clinician-patient conversations in radiation oncology clinics. METHODS: We fine-tuned the pre-trained English QuartzNet 15x5 model for the Korean language using a publicly available dataset of simulated situations between clinicians and patients. Subsequently, real conversations between a radiation oncologist and 115 patients in actual clinics were prospectively collected, transcribed, and divided into training (30.26 h) and testing (0.79 h) sets. These datasets were used to develop the ASR model for clinics, which was benchmarked against other ASR models, including 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model'. RESULTS: The pre-trained English ASR model was successfully fine-tuned and converted to recognize the Korean language, resulting in a character error rate (CER) of 0.17. However, we found that this performance was not sustained on the real conversation dataset. To address this, we further fine-tuned the model, resulting in an improved CER of 0.26. The other ASR models, 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model', showed CERs of 0.31, 0.28, and 0.25, respectively. On the general Korean conversation dataset 'zeroth-korean', our model showed a CER of 0.44, while 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model' resulted in CERs of 0.26, 0.98, and 0.99, respectively. CONCLUSION: We developed a Korean ASR model to transcribe real conversations between a radiation oncologist and patients. The performance of the model was deemed acceptable for both specific and general purposes, compared to other models. We anticipate that this model will reduce the time required for clinicians to document patients' chief complaints or side effects.
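Character error rate, the metric used throughout this abstract, can be computed with the jiwer package; a minimal sketch on made-up Korean strings:

```python
# Minimal sketch: character error rate (CER), computed with the `jiwer` package
# on invented Korean text (not data from the study).
import jiwer

reference  = "오늘 목이 조금 아픕니다"
hypothesis = "오늘 목이 조금 아파요"

cer = jiwer.cer(reference, hypothesis)   # character-level edit distance over reference length
print(f"CER: {cer:.2f}")
```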


Subjects
Radiation Oncology, Speech Perception, Humans, Speech Recognition Software, Benchmarking, Language, Republic of Korea
12.
Sensors (Basel) ; 23(11)2023 May 30.
Article in English | MEDLINE | ID: mdl-37299935

ABSTRACT

The field of computational paralinguistics emerged from automatic speech processing, and it covers a wide range of tasks involving different phenomena present in human speech. It focuses on the non-verbal content of human speech, including tasks such as spoken emotion recognition, conflict intensity estimation and sleepiness detection from speech, showing straightforward application possibilities for remote monitoring with acoustic sensors. The two main technical issues present in computational paralinguistics are (1) handling varying-length utterances with traditional classifiers and (2) training models on relatively small corpora. In this study, we present a method that combines automatic speech recognition and paralinguistic approaches, which is able to handle both of these technical issues. That is, we trained an HMM/DNN hybrid acoustic model on a general ASR corpus, which was then used as a source of embeddings employed as features for several paralinguistic tasks. To convert the local embeddings into utterance-level features, we experimented with five different aggregation methods, namely mean, standard deviation, skewness, kurtosis and the ratio of non-zero activations. Our results show that the proposed feature extraction technique consistently outperforms the widely used x-vector method used as the baseline, independently of the actual paralinguistic task investigated. Furthermore, the aggregation techniques could be combined effectively as well, leading to further improvements depending on the task and the layer of the neural network serving as the source of the local embeddings. Overall, based on our experimental results, the proposed method can be considered as a competitive and resource-efficient approach for a wide range of computational paralinguistic tasks.
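The five aggregation functions named above map a variable-length sequence of frame-level embeddings to one fixed-size utterance vector; a small NumPy/SciPy sketch follows (the embedding matrix is random, standing in for hidden-layer activations of the HMM/DNN model).

```python
# Sketch of the five aggregation functions named in the abstract, turning a
# variable-length sequence of frame-level embeddings into one fixed-size
# utterance vector. The matrix below is random ReLU-like activations, not
# actual HMM/DNN embeddings.
import numpy as np
from scipy.stats import skew, kurtosis

frames = np.maximum(np.random.randn(437, 1024), 0.0)   # (num_frames, embedding_dim)

utterance_features = np.concatenate([
    frames.mean(axis=0),
    frames.std(axis=0),
    skew(frames, axis=0),
    kurtosis(frames, axis=0),
    (frames != 0).mean(axis=0),        # ratio of non-zero activations
])
print(utterance_features.shape)        # (5 * 1024,)
```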


Subjects
Speech Perception, Speech, Humans, Computer Neural Networks, Speech Recognition Software, Acoustics
13.
Psychiatry Res ; 325: 115252, 2023 07.
Article in English | MEDLINE | ID: mdl-37236098

ABSTRACT

Natural language processing (NLP) tools are increasingly used to quantify semantic anomalies in schizophrenia. Automatic speech recognition (ASR) technology, if robust enough, could significantly speed up the NLP research process. In this study, we assessed the performance of a state-of-the-art ASR tool and its impact on diagnostic classification accuracy based on an NLP model. We compared ASR to human transcripts quantitatively (word error rate (WER)) and qualitatively by analyzing error type and position. Subsequently, we evaluated the impact of ASR on classification accuracy using semantic similarity measures. Two random forest classifiers were trained with similarity measures derived from automatic and manual transcriptions, and their performance was compared. The ASR tool had a mean WER of 30.4%. Pronouns and words in sentence-final position had the highest WERs. The classification accuracy was 76.7% (sensitivity 70%; specificity 86%) using automated transcriptions and 79.8% (sensitivity 75%; specificity 86%) for manual transcriptions. The difference in performance between the models was not significant. These findings demonstrate that using ASR for semantic analysis is associated with only a small decrease in accuracy in classifying schizophrenia, compared to manual transcripts. Thus, combining ASR technology with semantic NLP models qualifies as a robust and efficient method for diagnosing schizophrenia.
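The classification step can be sketched generically with scikit-learn: a random forest trained on semantic-similarity features derived from transcripts. The features and labels below are synthetic and do not reflect the study's data or tuning.

```python
# Generic sketch: random forest on semantic-similarity features, the kind of
# classifier compared across automatic and manual transcriptions above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))      # 10 similarity measures per participant (synthetic)
y = rng.integers(0, 2, size=120)    # 1 = patient group, 0 = control (synthetic)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.1%}")
```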


Subjects
Schizophrenia, Speech Perception, Humans, Semantics, Speech Recognition Software, Natural Language Processing, Schizophrenia/complications, Schizophrenia/diagnosis, Speech
14.
Article in English | MEDLINE | ID: mdl-37030692

ABSTRACT

Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, collecting dysarthric speech is difficult, so machine learning models cannot be trained sufficiently on dysarthric speech alone. To further improve the accuracy of dysarthric speech recognition, we proposed a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we used a convolutional neural network to encode motor information from all facial areas involved in speech, unlike the traditional audio-visual fusion approach based solely on lip movement. In the second stage, we used the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech. The knowledge gained by the pre-trained model is applied to address the model's overfitting problem. Experiments based on UASpeech were designed to evaluate the proposed method. Compared with the baseline method, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieved the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, the WER reached 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can thus effectively reduce the WER on dysarthric speech.


Subjects
Dysarthria, Speech Perception, Humans, Speech, Speech Recognition Software, Computer Neural Networks, Speech Intelligibility
15.
BMC Med Educ ; 23(1): 272, 2023 Apr 21.
Article in English | MEDLINE | ID: mdl-37085837

ABSTRACT

BACKGROUND: To investigate whether speech recognition software for generating interview transcripts can provide more specific and precise feedback for evaluating medical interviews. METHODS: The effects of the two feedback methods on student performance in medical interviews were compared using a prospective observational trial. Seventy-nine medical students in a clinical clerkship were assigned to receive either speech-recognition feedback (n = 39; SRS feedback group) or voice-recording feedback (n = 40; IC recorder feedback group). All students' medical interviewing skills during mock patient encounters were assessed twice, first using a mini-clinical evaluation exercise (mini-CEX) and then a checklist. Medical students then made the most appropriate diagnoses based on medical interviews. The diagnostic accuracy, mini-CEX, and checklist scores of the two groups were compared. RESULTS: The mean diagnostic accuracy rate (SRS feedback group: 1st mock 51.3%, 2nd mock 89.7%; IC recorder feedback group: 1st mock 57.5%, 2nd mock 67.5%; F(1, 77) = 4.0; p = 0.049), mini-CEX scores for overall clinical competence (SRS feedback group: 1st mock 5.2 ± 1.1, 2nd mock 7.4 ± 0.9; IC recorder feedback group: 1st mock 5.6 ± 1.4, 2nd mock 6.1 ± 1.2; F(1, 77) = 35.7; p < 0.001), and checklist scores for clinical performance (SRS feedback group: 1st mock 12.2 ± 2.4, 2nd mock 16.1 ± 1.7; IC recorder feedback group: 1st mock 13.1 ± 2.5, 2nd mock 13.8 ± 2.6; F(1, 77) = 26.1; p < 0.001) were higher with speech recognition-based feedback. CONCLUSIONS: Speech-recognition-based feedback leads to higher diagnostic accuracy rates and higher mini-CEX and checklist scores. TRIAL REGISTRATION: This study was registered in the Japan Registry of Clinical Trials on June 14, 2022. Due to our misunderstanding of the trial registration requirements, we registered the trial retrospectively. This study was registered in the Japan Registry of Clinical Trials on 7/7/2022 (Clinical trial registration number: jRCT1030220188).


Subjects
Educational Measurement, Medical Students, Humans, Educational Measurement/methods, Speech Recognition Software, Retrospective Studies, Clinical Competence
16.
JASA Express Lett ; 3(3): 035208, 2023 03.
Article in English | MEDLINE | ID: mdl-37003705

ABSTRACT

Automatic speech recognition (ASR) systems are vulnerable to adversarial attacks due to their reliance on machine learning models. Many of the defenses explored for ASR systems simply adapt approaches developed for the image domain. This paper explores speech-specific defenses in the feature domain and introduces a defense method called mel domain noise flooding (MDNF). MDNF injects additive noise into the mel spectrogram speech representation prior to re-synthesizing the audio signal input to the ASR system. The defense is evaluated against strong white-box threat models and shows competitive robustness.
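A rough sketch of the mel-domain noise flooding idea as described (mel spectrogram, additive noise, Griffin-Lim-based re-synthesis with librosa) follows; the noise level and example audio are arbitrary, and this is not the authors' exact defense implementation.

```python
# Rough sketch of mel-domain noise flooding as described in the abstract:
# compute a mel spectrogram, add noise to it, and re-synthesize a waveform to
# feed the ASR system. Noise scale and audio are placeholders.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))          # bundled example clip, placeholder for speech
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

noise = np.abs(np.random.normal(scale=0.05 * mel.mean(), size=mel.shape))
mel_flooded = mel + noise                             # additive noise in the mel domain

y_defended = librosa.feature.inverse.mel_to_audio(mel_flooded, sr=sr)
# y_defended would then be passed to the ASR system instead of the raw input.
print(y_defended.shape)
```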


Subjects
Speech Perception, Speech, Speech Recognition Software, Noise/adverse effects, Machine Learning
17.
Sensors (Basel) ; 23(3)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36772184

ABSTRACT

Automatic speech recognition systems with a large vocabulary and other natural language processing applications cannot operate without a language model. Most studies on pre-trained language models have focused on more popular languages such as English, Chinese, and various European languages, but there is no publicly available Uzbek speech dataset. Therefore, language models for low-resource languages need to be studied and created. The objective of this study is to address this limitation by developing a low-resource language model for the Uzbek language and understanding its linguistic phenomena. We proposed an Uzbek language model named UzLM by examining the performance of statistical and neural-network-based language models that account for the unique features of the Uzbek language. Our Uzbek-specific linguistic representation allows us to construct a more robust UzLM, utilizing 80 million words from various sources while using the same or fewer training words than applied in previous studies. Roughly 68,000 distinct words and 15 million sentences were collected for the creation of this corpus. The experimental results of our tests on continuous recognition of Uzbek speech show that, compared with manual encoding, the use of neural-network-based language models reduced the character error rate to 5.26%.


Subjects
Speech Perception, Speech, Humans, Speech Recognition Software, Language, Vocabulary
18.
Sensors (Basel) ; 23(2)2023 Jan 12.
Article in English | MEDLINE | ID: mdl-36679666

ABSTRACT

Building a good speech recognition system usually requires a large amount of paired data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it is rarely used for Kazakh and other Central and West Asian languages. In this paper, wav2vec2.0 is improved by integrating a factorized TDNN layer to better preserve the relationship between the voice and the time steps before and after quantization; we call the result wav2vec-F. The unsupervised pre-training strategy was used to learn latent speech representations from a large amount of unlabeled audio data and was applied to the cross-language ASR task, optimized with a noise-contrastive binary classification task. At the same time, speech synthesis was used to boost speech recognition performance. The experiments show that wav2vec-F can effectively utilize unlabeled data from non-target languages and that multi-language pre-training is clearly better than single-language pre-training. The data augmentation method using speech synthesis brings substantial benefits. Compared with the baseline model, the word error rate on LibriSpeech's test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training using only Kazakh reduced the word error rate by 3.8%. Pre-training on a small amount of Kazakh speech combined with multi-language data synthesized by TTS achieved a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, 30 times less than a previous end-to-end model with comparable results.
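The underlying recipe (pre-train wav2vec 2.0 on unlabeled audio, then fine-tune with a CTC head on a small labeled set) can be sketched with the standard Hugging Face wav2vec2 implementation; note this uses a stock checkpoint as a placeholder, not the authors' wav2vec-F variant with the factorized TDNN layer.

```python
# Generic sketch of the recipe the abstract builds on: load a pre-trained
# wav2vec 2.0 model with a CTC head and prepare it for fine-tuning on a small
# labeled set. The checkpoint name is a placeholder, not wav2vec-F.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

name = "facebook/wav2vec2-base-960h"     # placeholder pre-trained checkpoint
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name)

model.freeze_feature_encoder()           # keep the convolutional front-end fixed
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
# A fine-tuning loop would feed processor(audio, sampling_rate=16000,
# return_tensors="pt") inputs together with `labels`; Wav2Vec2ForCTC then
# returns the CTC loss directly.
```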


Subjects
Speech Perception, Speech, Language, Speech Recognition Software, Noise
19.
J Am Med Inform Assoc ; 30(4): 703-711, 2023 03 16.
Article in English | MEDLINE | ID: mdl-36688526

ABSTRACT

OBJECTIVES: Ambient clinical documentation technology uses automatic speech recognition (ASR) and natural language processing (NLP) to turn patient-clinician conversations into clinical documentation. It is a promising approach to reducing clinician burden and improving documentation quality. However, the performance of current-generation ASR remains inadequately validated. In this study, we investigated the impact of non-lexical conversational sounds (NLCS) on ASR performance. NLCS, such as Mm-hm and Uh-uh, are commonly used to convey important information in clinical conversations, for example, Mm-hm as a "yes" response from the patient to the clinician question "are you allergic to antibiotics?" MATERIALS AND METHODS: In this study, we evaluated 2 contemporary ASR engines, Google Speech-to-Text Clinical Conversation ("Google ASR"), and Amazon Transcribe Medical ("Amazon ASR"), both of which have their language models specifically tailored to clinical conversations. The empirical data used were from 36 primary care encounters. We conducted a series of quantitative and qualitative analyses to examine the word error rate (WER) and the potential impact of misrecognized NLCS on the quality of clinical documentation. RESULTS: Out of a total of 135 647 spoken words contained in the evaluation data, 3284 (2.4%) were NLCS. Among these NLCS, 76 (0.06% of total words, 2.3% of all NLCS) were used to convey clinically relevant information. The overall WER, of all spoken words, was 11.8% for Google ASR and 12.8% for Amazon ASR. However, both ASR engines demonstrated poor performance in recognizing NLCS: the WERs across frequently used NLCS were 40.8% (Google) and 57.2% (Amazon), respectively; and among the NLCS that conveyed clinically relevant information, 94.7% and 98.7%, respectively. DISCUSSION AND CONCLUSION: Current ASR solutions are not capable of properly recognizing NLCS, particularly those that convey clinically relevant information. Although the volume of NLCS in our evaluation data was very small (2.4% of the total corpus; and for NLCS that conveyed clinically relevant information: 0.06%), incorrect recognition of them could result in inaccuracies in clinical documentation and introduce new patient safety risks.


Subjects
Language, Speech Recognition Software, Humans, Speech/physiology, Technology, Documentation
20.
J Voice ; 37(6): 971.e9-971.e16, 2023 Nov.
Article in English | MEDLINE | ID: mdl-34256982

ABSTRACT

As part of our contribution to research on the ongoing COVID-19 pandemic worldwide, we studied cough changes in infected people based on Hidden Markov Model (HMM) speech recognition classification, formant frequency analysis, and pitch analysis. In this paper, an HMM-based cough recognition system was implemented with 5 HMM states, 8 Gaussian mixture distributions (GMMs), and 13 basic Mel-Frequency Cepstral Coefficients (MFCCs), for 39 dimensions in the overall feature vector. Formant frequency and pitch values extracted from the coughs of COVID-19-infected and healthy people were compared to confirm the cough recognition system's results. The experimental results show that the difference between the recognition rates of infected and non-infected people is 6.7%, whereas the formant variation between the coughs of infected and non-infected people is clearly observed for F1, F3, and F4 and is smaller for F0 and F2.
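The 39-dimensional feature vector described above is the classic combination of 13 MFCCs plus their delta and delta-delta coefficients; a minimal librosa sketch follows (the audio path is a placeholder and the exact front-end settings are not assumed).

```python
# Minimal sketch of a 39-dimensional frame-level feature vector: 13 MFCCs plus
# delta and delta-delta coefficients. The file path is a placeholder.
import numpy as np
import librosa

y, sr = librosa.load("cough.wav", sr=16000)          # placeholder cough recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, num_frames)
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, d1, d2])                 # (39, num_frames)
print(features.shape)
```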


Subjects
COVID-19, Speech Recognition Software, Humans, Cough/diagnosis, Cough/etiology, Pandemics, COVID-19/complications, COVID-19/diagnosis, Speech