ABSTRACT
In recent years, artificial intelligence and machine learning (ML) models have advanced significantly, offering transformative solutions across diverse sectors. Emotion recognition in speech has particularly benefited from ML techniques, which have improved its accuracy and applicability. This article proposes a method for emotion detection in Romanian speech that combines two distinct approaches: semantic analysis using a GPT Transformer and acoustic analysis using openSMILE. The results showed an accuracy of 74% and a precision of almost 82%. Several system limitations were observed due to the limited, low-quality dataset. However, the work also opened a new direction in our research: analyzing emotions to identify mental health disorders.
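A minimal sketch of the acoustic half of such a pipeline, assuming eGeMAPS functionals from the opensmile Python package and a generic multilingual sentiment model standing in for the paper's GPT-based semantic analysis (the file name, model choice, and late-fusion step are illustrative, not the authors' implementation):
```python
import numpy as np
import opensmile
from transformers import pipeline

# Acoustic analysis: 88 eGeMAPSv02 functionals per utterance via openSMILE
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevels.Functionals,
)
acoustic = smile.process_file("utterance.wav").to_numpy().ravel()

# Semantic analysis: a generic multilingual sentiment model stands in for the
# paper's GPT Transformer; the transcript string is a placeholder
sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")
semantic_score = sentiment("utterance transcript here")[0]["score"]

# Late fusion: concatenate acoustic functionals with the semantic score and feed
# the joint vector to any downstream emotion classifier
joint = np.concatenate([acoustic, [semantic_score]])
```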
Subjects
Emotions; Speech Recognition Software; Humans; Romania; Machine Learning; Semantics; Artificial Intelligence
ABSTRACT
Objective. The decline in the performance of electromyography (EMG)-based silent speech recognition is widely attributed to disparities in speech patterns, articulation habits, and individual physiology among speakers. Feature alignment by learning a discriminative network that resolves domain offsets across speakers is an effective way to address this problem. However, the prevailing adversarial network, with a branching discriminator specialized in domain discrimination, contributes only indirectly to the categorical predictions of the classifier. Approach. To this end, we propose a simplified discrepancy-based adversarial network with a streamlined end-to-end structure for EMG-based cross-subject silent speech recognition. Highly aligned features across subjects are obtained by introducing a Nuclear-norm Wasserstein discrepancy metric at the back end of the classification network, which can be used for both classification and domain discrimination. Given the low-level and implicitly noisy nature of myoelectric signals, we devise a cascaded adaptive rectification network as the front-end feature extraction network, adaptively reshaping the intermediate feature map with automatically learnable channel-wise thresholds. The resulting features effectively filter out domain-specific information between subjects while retaining domain-invariant features critical for cross-subject recognition. Main results. A series of sentence-level classification experiments with 100 Chinese sentences demonstrates the efficacy of our method, which achieves an average accuracy of 89.46% on 40 new subjects when trained with data from 60 subjects. In particular, our method achieves a remarkable 10.07% improvement over the state-of-the-art model when tested on 10 new subjects with 20 subjects used for training, surpassing that model's result even when it is trained on three times as many subjects. Significance. Our study demonstrates the improved classification performance of the proposed adversarial architecture on cross-subject myoelectric signals, providing a promising prospect for EMG-based interactive speech applications.
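A minimal PyTorch sketch of one common formulation of a nuclear-norm Wasserstein discrepancy computed on classifier predictions; the batch-size normalization and the sign convention are assumptions, and the paper's exact definition may differ:
```python
import torch
import torch.nn.functional as F

def nuclear_wasserstein_discrepancy(logits_src: torch.Tensor,
                                    logits_tgt: torch.Tensor) -> torch.Tensor:
    """Discrepancy between source- and target-subject predictions of one classifier."""
    p_src = F.softmax(logits_src, dim=1)   # (batch, classes) prediction matrices
    p_tgt = F.softmax(logits_tgt, dim=1)
    # Nuclear norm (sum of singular values), normalized by batch size; shrinking the
    # gap encourages predictions that are discriminable yet aligned across subjects.
    nuc_src = torch.linalg.matrix_norm(p_src, ord="nuc") / p_src.shape[0]
    nuc_tgt = torch.linalg.matrix_norm(p_tgt, ord="nuc") / p_tgt.shape[0]
    return nuc_src - nuc_tgt

# Example with random logits over 100 sentence classes
d = nuclear_wasserstein_discrepancy(torch.randn(32, 100), torch.randn(32, 100))
```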
Subjects
Electromyography; Humans; Electromyography/methods; Male; Female; Neural Networks, Computer; Adult; Speech Recognition Software; Young Adult; Pattern Recognition, Automated/methods; Speech/physiology
ABSTRACT
An automated speaker verification system uses speech recognition to verify the identity of a user and block illicit access. Logical access attacks are attempts to gain access to a system by tampering with its algorithms or data, or by circumventing security mechanisms. DeepFake attacks are a form of logical access threat that employs artificial intelligence to produce highly realistic audio clips of a human voice, which may be used to circumvent vocal authentication systems. This paper presents a framework for detecting Logical Access and DeepFake audio spoofing by integrating audio file components and time-frequency representation spectrograms into a lower-dimensional space using sequential prediction models. A bidirectional LSTM trained on the bonafide class generates significant one-dimensional features for both classes. The feature set is then standardized to a fixed size using a novel Bags of Auditory Bites (BoAB) feature-standardization algorithm. An Extreme Learning Machine maps the feature space to predictions that differentiate genuine from spoofed speech. The framework is evaluated on the ASVspoof 2021 dataset, a comprehensive collection of audio recordings designed for evaluating the strength of speaker verification systems against spoofing attacks. It achieves favorable results on synthesized DeepFake attacks, with an Equal Error Rate (EER) of 1.18% in the most optimal setting. Logical Access attacks were more challenging to detect, at an EER of 12.22%. Compared to the state of the art on the ASVspoof 2021 dataset, the proposed method improves the EER for DeepFake attacks by 95.16%.
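The Extreme Learning Machine back end can be sketched in a few lines of NumPy; the hidden-layer width, tanh activation, and one-hot targets below are assumptions rather than the paper's exact configuration:
```python
import numpy as np

class ELM:
    """Single-hidden-layer Extreme Learning Machine with closed-form output weights."""

    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, Y):
        # X: (n_samples, n_features); Y: one-hot targets (n_samples, n_classes)
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))  # random input weights
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)          # random nonlinear hidden projection
        self.beta = np.linalg.pinv(H) @ Y         # least-squares output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return np.argmax(H @ self.beta, axis=1)   # e.g. 0 = bonafide, 1 = spoofed
```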
Assuntos
Algoritmos , Humanos , Interface para o Reconhecimento da Fala , Segurança Computacional , Voz , Fala , Inteligência ArtificialRESUMO
Enabling patients to actively document their health information significantly improves understanding of how therapies work, how disease progresses, and how overall quality of life is affected for those living with chronic disorders such as hematologic malignancies. Advances in artificial intelligence, particularly in areas such as natural language processing and speech recognition, have led to the development of interactive tools tailored for healthcare. This paper introduces an innovative conversational agent tailored to the Greek language. The tool, which incorporates sentiment analysis, is designed and deployed to gather detailed family histories and symptom data from individuals diagnosed with hematologic malignancies. Furthermore, we discuss preliminary findings from a feasibility study assessing the tool's effectiveness. Initial feedback on the user experience suggests a positive reception of the agent's usability, highlighting its potential to enhance patient engagement in a clinical setting.
Subjects
Hematologic Neoplasms; Natural Language Processing; Humans; Greece; User-Computer Interface; Artificial Intelligence; Speech Recognition Software
ABSTRACT
Speech emotion recognition (SER) technology involves feature extraction and prediction models. However, recognition efficiency tends to decrease because of gender differences and the large number of extracted features. Consequently, this paper introduces an SER system based on gender. First, gender and emotion features are extracted from speech signals to develop gender recognition and emotion classification models. Second, according to gender differences, distinct emotion recognition models are established for male and female speakers; the gender of a speaker is determined before the corresponding emotion model is executed. Third, the accuracy of these emotion models is enhanced by using an advanced differential evolution algorithm (ADE) to select optimal features. ADE incorporates new difference vectors, mutation operators, and position learning, which effectively balance global and local search. A new position-repair method is proposed to address gender differences. Finally, experiments on four English datasets demonstrate that ADE is superior to comparison algorithms in recognition accuracy, recall, precision, F1-score, number of selected features, and execution time. The findings highlight the significance of gender in refining emotion models, while mel-frequency cepstral coefficients are important factors in gender differences.
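A rough sketch of the feature-selection idea, using SciPy's standard differential evolution rather than the paper's ADE with its custom mutation operators and position learning; the SVC fitness function, the 0.5 threshold, and the synthetic data are assumptions:
```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_features(X, y, seed=0):
    n_features = X.shape[1]

    def fitness(genes):
        mask = genes > 0.5                        # threshold continuous genes to a feature mask
        if not mask.any():
            return 1.0                            # penalize empty feature subsets
        acc = cross_val_score(SVC(), X[:, mask], y, cv=3).mean()
        return -acc                               # DE minimizes, so negate accuracy

    result = differential_evolution(fitness, bounds=[(0, 1)] * n_features,
                                    maxiter=30, popsize=10, seed=seed)
    return result.x > 0.5                         # boolean mask of selected features

# Example on synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
mask = select_features(X, y)
print("selected", mask.sum(), "of", len(mask), "features")
```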
Subjects
Algorithms; Emotions; Humans; Emotions/physiology; Female; Male; Speech; Sex Factors; Language; Speech Recognition Software
ABSTRACT
BACKGROUND: Traditional methodologies for diagnosing post-traumatic stress disorder (PTSD) rely primarily on interviews, incurring considerable costs and lacking objective indices. Integrating biomarkers and machine learning techniques into this diagnostic process has the potential to help clinicians assess PTSD accurately. METHODS: We assembled a dataset of recordings from 76 individuals diagnosed with PTSD and 60 healthy controls. Leveraging the openSMILE framework, we extracted acoustic features from these recordings and employed a random forest algorithm for feature selection. The selected features were then used as inputs for six distinct classification models and a regression model. RESULTS: Classification models employing a feature set of 18 elements yielded robust binary prediction of PTSD. Notably, the RF model achieved a peak accuracy of 0.975 with the highest AUC of 1.0. The regression model exhibited significant predictive capability for PCL-5 scores (MSE = 0.90, MAE = 0.76, R2 = 0.10, p < 0.001), with a correlation coefficient of 0.33 (p < 0.01) between predicted and actual values. LIMITATIONS: First, the feature-selection process may compromise model stability and potentially lead to overestimated results. Second, it is difficult to elucidate the biological mechanisms underlying the differences between PTSD patients and healthy individuals. Last, the regression model has limited predictive power for PTSD severity. CONCLUSIONS: Distinct speech patterns differentiate PTSD patients from controls. Classification models accurately discern the two groups, and the regression model gauges PTSD severity, but further validation on larger datasets is needed.
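A sketch of the selection-then-classification step with scikit-learn, using synthetic stand-ins for the acoustic feature matrix (76 PTSD recordings, 60 controls, and 88 openSMILE functionals are assumed dimensions, not the study's data):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-ins: 76 PTSD + 60 control recordings, 88 acoustic functionals each
rng = np.random.default_rng(0)
X = rng.normal(size=(136, 88))
y = np.r_[np.ones(76, dtype=int), np.zeros(60, dtype=int)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Rank features by random-forest importance and keep the 18 most informative
ranker = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
top18 = np.argsort(ranker.feature_importances_)[::-1][:18]

# Refit a classifier on the reduced feature set and report accuracy / AUC
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr[:, top18], y_tr)
proba = clf.predict_proba(X_te[:, top18])[:, 1]
print("accuracy:", accuracy_score(y_te, proba > 0.5))
print("AUC:", roc_auc_score(y_te, proba))
```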
Subjects
Machine Learning; Stress Disorders, Post-Traumatic; Humans; Stress Disorders, Post-Traumatic/diagnosis; Male; Female; Adult; Middle Aged; Severity of Illness Index; Speech Recognition Software; Case-Control Studies
ABSTRACT
BACKGROUND: Anesthesia monitors and devices are usually controlled with some combination of dials, keypads, a keyboard, or a touch screen. Thus, anesthesiologists can operate their monitors only when they are physically close to them and not otherwise task-loaded with sterile procedures such as line or block placement. Voice recognition technology has become commonplace and may offer advantages in anesthesia practice, such as reducing surface contamination rates and allowing anesthesiologists to effect changes in monitoring and therapy when they would otherwise be unable to do so. We hypothesized that this technology is practicable and that anesthesiologists would consider it useful. METHODS: A novel voice-driven prototype controller was designed for the GE Solar 8000M anesthesia patient monitor. The apparatus was implemented using a Raspberry Pi 4 single-board computer, an external conference audio device, the Google Cloud Speech-to-Text platform, and a modified Solar controller to effect commands. Fifty anesthesia providers tested the prototype. Evaluations and surveys were completed in a nonclinical environment to avoid any ethical or safety concerns regarding use of the device in direct patient care. All anesthesiologists sampled were fluent English speakers, many with inflections from their first language or national origin, reflecting the diversity of practicing anesthesiologists. RESULTS: The prototype was uniformly well received by anesthesiologists. Ease of use, usefulness, and effectiveness were assessed on a Likert scale, with means of 9.96, 7.22, and 8.48 out of 10, respectively. No population cofactors were associated with these results: advancing level of training (eg, nonattending versus attending), accent of country or region, and vocal pitch register were not correlated with any preference. Statistical analyses were performed with analysis of variance and the unpaired t-test. CONCLUSIONS: The use of voice recognition to control operating room monitors was well received by anesthesia providers. Additional commands are easily implemented on the prototype controller. No adverse relationship was found between acceptability and level of anesthesia experience, pitch of voice, or presence of accent. Voice recognition is a promising method of controlling anesthesia monitors and devices that could potentially increase usability and situational awareness in circumstances where the anesthesiologist is otherwise out of position or task-loaded.
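A minimal sketch of the speech-to-text leg of such a controller using the Google Cloud Speech-to-Text Python client; the LINEAR16/16 kHz capture format and the command phrase are assumptions, and the dispatch function is only a placeholder for the modified Solar controller:
```python
from google.cloud import speech

client = speech.SpeechClient()  # assumes Google Cloud credentials are configured

def transcribe_command(wav_bytes: bytes) -> str:
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=wav_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

def dispatch(command_text: str) -> None:
    # Hypothetical phrase-to-action mapping; the real prototype drives a modified
    # Solar 8000M controller rather than printing a message
    if "blood pressure" in command_text.lower():
        print("-> start a non-invasive blood pressure measurement")
```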
Subjects
Anesthesiologists; Monitoring, Intraoperative; Humans; Monitoring, Intraoperative/instrumentation; Monitoring, Intraoperative/methods; Male; Equipment Design; Voice; Speech Recognition Software; Female; Anesthesiology/instrumentation; Middle Aged; Blood Pressure; Anesthesia; Blood Pressure Determination/instrumentation; Blood Pressure Determination/methods; Adult
ABSTRACT
In Japan, the excessive time required for nursing records has become a social problem. A shift to concise "bulleted" records is needed to apply speech recognition and to work with foreign caregivers. Therefore, using 96,000 anonymized narrative nursing records, we identified typical situations for each information source and attempted to convert them into "bulleted" records using ChatGPT-3.5 (e.g., for return from the operating room: status on return, temperature control, blood drainage, stoma care, monitoring, respiration and oxygen, sensation and pain, etc.). The results showed that ChatGPT-3.5 has some usable functionality as a tool for extracting keywords for "bulleted" records. Furthermore, the conversion process made clear that it would facilitate the transition to standardized nursing records utilizing the Standard Terminology for Nursing Observation and Action (STerNOA).
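A sketch of the record-conversion call with the OpenAI Python client; the English prompt is illustrative only, and the study's actual Japanese prompts and record fields are not reproduced here:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_bulleted_record(narrative_note: str) -> str:
    # Illustrative prompt; the wording used in the study is not reproduced here
    prompt = (
        "Extract the key observations and actions from the following nursing note "
        "as short bullet points, one finding per line:\n\n" + narrative_note
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```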
Subjects
Nursing Records; Japan; Electronic Health Records; Speech Recognition Software; Natural Language Processing; Standardized Nursing Terminology; Humans
ABSTRACT
The complex nature of verbal patient-nurse communication holds valuable insights for nursing research, but traditional documentation methods often miss these crucial details. This study explores the emerging role of speech processing technology in nursing research, with an emphasis on patient-nurse verbal communication. We conducted case studies across various healthcare settings, revealing a substantial gap in electronic health records for capturing vital patient-nurse encounters. Our research demonstrates that speech processing technology can effectively bridge this gap, enhancing documentation accuracy and enriching data for quality-of-care assessment and risk prediction. The technology's application in home healthcare, outpatient settings, and specialized areas such as dementia care illustrates its versatility. It offers the potential for real-time decision support, improved communication training, and enhanced telehealth practices. This paper provides insights into the promises and challenges of integrating speech processing into nursing practice, paving the way for future advancements in patient care and healthcare data management.
Subjects
Electronic Health Records; Nurse-Patient Relations; Humans; Speech Recognition Software; Nursing Records; Nursing Research; Information Sources
ABSTRACT
This pilot study addresses the pervasive issue of burnout among nurses and health disciplines, often exacerbated by the use of electronic health record (EHR) systems. Recognizing the potential of dictation to alleviate documentation burden, the study focuses on the adoption of speech recognition technology (SRT) in a large Canadian urban mental health and addiction teaching hospital. Clinicians who participated in the pilot provided feedback on their experiences via a survey, and analytics data were examined to measure usage and adoption patterns. Preliminary feedback reveals a subset of participants rapidly embracing the technology, reporting decreased documentation times and increased efficiency. However, some clinicians experienced challenges related to initial setup time and the effort of adjusting to a novel documentation approach.
Subjects
Electronic Health Records; Speech Recognition Software; Pilot Projects; Humans; Canada; Burnout, Professional
ABSTRACT
This literature review explores the impact of speech recognition technology (SRT) on nursing documentation within electronic health records (EHRs). A search across PubMed, CINAHL, and Google Scholar identified 156 studies, seven of which met the inclusion criteria. These studies investigated the impact of SRT on documentation time, accuracy, and user satisfaction. Findings suggest that SRT, particularly when integrated with artificial intelligence, can speed up documentation by up to 15%. However, challenges remain in its implementation in real-world clinical settings and existing EHR workflows. Future studies should focus on developing SRT systems that process conversational nursing assessments and integrate into current EHRs.
Subjects
Electronic Health Records; Nursing Records; Speech Recognition Software; Artificial Intelligence; Humans; Documentation
ABSTRACT
Speech emotion recognition (SER) is a prominent and dynamic research field in data science due to its extensive application in domains such as psychological assessment, mobile services, and computer games. In previous research, numerous studies utilized manually engineered features for emotion classification, achieving commendable accuracy. However, these features tend to underperform in complex scenarios, leading to reduced classification accuracy. Such scenarios include (1) datasets that contain diverse speech patterns, dialects, accents, or variations in emotional expression; (2) data with background noise; (3) cases where the distribution of emotions varies significantly across datasets; and (4) combinations of datasets from different sources, which introduce complexities due to variations in recording conditions, data quality, and emotional expression. Consequently, there is a need to improve the classification performance of SER techniques. To address this, a novel SER framework is introduced in this study. Prior to feature extraction, signal preprocessing and data augmentation methods were applied to augment the available data, and 18 informative features were then derived from each signal. A discriminative feature set was obtained using feature selection techniques and was then used as input for emotion recognition on the SAVEE, RAVDESS, and EMO-DB datasets. This research also implemented a cross-corpus model that incorporated all speech files for the emotions common to the three datasets. The experimental outcomes demonstrated the superior performance of the proposed SER framework compared to existing frameworks in the field. Notably, the framework achieved accuracies of 95%, 94%, 97%, and 97% on the SAVEE, RAVDESS, EMO-DB, and cross-corpus datasets, respectively. These results underscore the significant contribution of our proposed framework to the field of SER.
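A sketch of the augmentation and per-signal feature-extraction steps using librosa; the two augmentations, the feature subset shown, and the synthetic test tone are assumptions, and the paper derives 18 features per signal:
```python
import numpy as np
import librosa

def augment(y, sr):
    # Two simple augmentations as examples; the paper's exact augmentation set may differ
    noisy = y + 0.005 * np.random.default_rng(0).normal(size=len(y))
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    return [y, noisy, stretched]

def extract_features(y, sr):
    # Compact per-signal descriptor: MFCC means plus a few spectral/temporal statistics
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    rms = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    return np.hstack([mfcc, zcr, rms, centroid, rolloff])

sr = 22050
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s synthetic tone as stand-in audio
features = np.vstack([extract_features(a, sr) for a in augment(y, sr)])
```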
Subjects
Emotions; Humans; Emotions/physiology; Speech/physiology; Male; Female; Speech Recognition Software; Databases, Factual; Signal Processing, Computer-Assisted
ABSTRACT
In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER has emphasised acoustic properties extracted from speech signals; however, recent developments in deep learning and computer vision have made it possible to use visual representations to enhance SER performance. This work proposes a novel method for improving speech emotion recognition using a lightweight Vision Transformer (ViT) model. We leverage the ViT model's ability to capture spatial dependencies and high-level features from mel-spectrogram inputs, which are adequate indicators of emotional states. To determine the efficiency of the proposed approach, we conduct a comprehensive experiment on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the approach's generalizability, with accuracies of 98%, 91%, and 93% on TESS, EMODB, and the combined TESS-EMODB set, respectively. The comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves speech emotion recognition compared with other state-of-the-art techniques. Our research indicates the potential of integrating Vision Transformer models into SER systems, opening up fresh opportunities for real-world applications requiring accurate emotion recognition from speech.
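A sketch of the mel-spectrogram-to-ViT path with torchaudio and timm; the 128 mel bands, 224x224 resize, ViT-Tiny backbone, and seven emotion classes are assumptions rather than the paper's exact configuration:
```python
import torch
import torchaudio
import timm

# Mel-spectrogram front end (128 mel bands at 16 kHz is an assumption)
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

# A small ViT with non-overlapping 16x16 patches and a single input channel
model = timm.create_model("vit_tiny_patch16_224", pretrained=False,
                          in_chans=1, num_classes=7)

def classify(waveform):                      # waveform: (1, num_samples) at 16 kHz
    spec = to_db(melspec(waveform))          # (1, n_mels, time)
    spec = torch.nn.functional.interpolate(  # resize to the ViT's expected 224x224 grid
        spec.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False)
    return model(spec).softmax(dim=-1)       # per-emotion probabilities

probs = classify(torch.randn(1, 16000))      # 1 s of dummy 16 kHz audio
```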
Subjects
Emotions; Humans; Emotions/physiology; Speech/physiology; Deep Learning; Speech Recognition Software; Databases, Factual; Algorithms
ABSTRACT
Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and the Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and unfriendly to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder that breaks away from traditional left-to-right or right-to-left decoding orders and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module to optimize the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of an overly large number of model parameters. The experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
Subjects
Algorithms; Speech Recognition Software; Humans; Speech/physiology; Nonlinear Dynamics; Pattern Recognition, Automated/methods
ABSTRACT
Accurately classifying accents and assessing accentedness in non-native speakers are challenging tasks, due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. The findings demonstrate that pretrained LID and SID models effectively encode accent/dialect information in speech. Furthermore, the LID- and SID-encoded accent information complements an end-to-end (E2E) accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior AID accuracy. Next, automatic speech recognition (ASR) and AID models are leveraged to explore accentedness estimation. The ASR model is an E2E connectionist temporal classification model trained exclusively on American English (en-US) utterances. The ASR error rate and the en-US output of the AID model are used as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models, as well as a robust correlation between objective accentedness scores and subjective scores based on human perception, providing evidence for the reliability and validity of AID-based and ASR-based systems for accentedness assessment in non-native speech. Such systems would benefit accent assessment in language learning, as well as speech and speaker assessment for intelligibility and quality, and advancements in speaker diarization and speech recognition.
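A sketch of the two objective accentedness scores described above, using the jiwer package for the ASR error rate; the way the scores are combined and the AID en-US posterior value are illustrative placeholders:
```python
import jiwer

def accentedness_scores(reference_text: str, asr_hypothesis: str, aid_en_us_prob: float):
    # ASR word error rate against the prompt text: higher WER suggests a stronger accent
    wer = jiwer.wer(reference_text, asr_hypothesis)
    # AID en-US posterior: a lower native-class probability suggests a stronger accent
    return {"asr_wer": wer, "aid_accentedness": 1.0 - aid_en_us_prob}

print(accentedness_scores("the cat sat on the mat",
                          "the cat sat on a mat",
                          aid_en_us_prob=0.62))
```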
Subjects
Speech Perception; Speech Recognition Software; Humans; Speech Perception/physiology; Speech Acoustics; Phonetics; Language; Speech Production Measurement/methods; Female; Male
ABSTRACT
To obtain a reliable and accurate automatic speech recognition (ASR) machine learning model, sufficient transcribed audio data are required for training. Many languages of the world, especially the agglutinative languages of the Turkic family, suffer from a lack of such data. Many studies have sought better models for low-resource languages using different approaches, the most popular being multilingual training and transfer learning. In this study, we combined five agglutinative languages from the Turkic family (Kazakh, Bashkir, Kyrgyz, Sakha, and Tatar) for multilingual training using connectionist temporal classification and an attention mechanism, including a language model, because these languages share cognate words, sentence-formation rules, and a common (Cyrillic) alphabet. Data from the open-source Common Voice database were used to make the experiments reproducible. The results showed that multilingual training improved ASR performance for all languages included in the experiment except Bashkir. A dramatic result was achieved for Kyrgyz: the word error rate decreased to nearly one-fifth and the character error rate to one-fourth, which shows that this approach can be helpful for critically low-resource languages.
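A minimal PyTorch sketch of the CTC leg of a hybrid CTC/attention objective over a shared Cyrillic character inventory; the vocabulary size, sequence lengths, and random tensors are placeholders, not the paper's setup:
```python
import torch
import torch.nn as nn

vocab_size = 64                     # shared Cyrillic character set; index 0 is the CTC blank
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, batch, target_len = 200, 8, 30
# Encoder outputs shared across the five languages (placeholder random activations)
log_probs = torch.randn(T, batch, vocab_size, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, vocab_size, (batch, target_len))
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # in the hybrid setup this is interpolated with an attention decoder loss
```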
Subjects
Language; Multilingualism; Humans; Machine Learning; Speech Recognition Software
ABSTRACT
BACKGROUND: The method of documentation during a clinical encounter may affect the patient-physician relationship. OBJECTIVES: To evaluate how the use of ambient voice recognition, coupled with natural language processing and artificial intelligence (DAX), affects the patient-physician relationship. METHODS: This was a prospective observational study whose primary aim was to evaluate any difference in patient satisfaction on the Patient-Doctor Relationship Questionnaire-9 (PDRQ-9) scale between primary care encounters in which DAX was used for documentation and those in which another method was used. A single-arm open-label phase was also performed to obtain direct feedback from patients. RESULTS: A total of 288 patients were included in the open-label arm and 304 patients were included in the masked phase of the study comparing encounters with and without DAX use. In the open-label phase, patients strongly agreed that the provider was more focused on them, spent less time typing, and made the encounter feel more personable. In the masked phase, no difference was seen in the total PDRQ-9 score between patients whose encounters used DAX (median: 45, interquartile range [IQR]: 8) and those whose encounters did not (median: 45, IQR: 3.5; p = 0.31). The adjusted odds ratio for DAX use was 0.8 (95% confidence interval: 0.48-1.34) for the patient reporting complete satisfaction with how well their clinician listened to them during their encounter. CONCLUSION: Patients strongly agreed with the use of ambient voice recognition, coupled with natural language processing and artificial intelligence (DAX), for documentation in primary care. However, no difference was detected in the patient-physician relationship on the PDRQ-9 scale.
Subjects
Artificial Intelligence; Natural Language Processing; Physician-Patient Relations; Humans; Male; Female; Middle Aged; Patient Satisfaction/statistics & numerical data; Adult; Voice; Prospective Studies; Speech Recognition Software; Surveys and Questionnaires
ABSTRACT
Expert testimony is only admissible in common-law systems if it will potentially assist the trier of fact. In order for a forensic-voice-comparison expert's testimony to assist a trier of fact, the expert's forensic voice comparison should be more accurate than the trier of fact's speaker identification. "Speaker identification in courtroom contexts - Part I" addressed the question of whether speaker identification by an individual lay listener (such as a judge) would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. The present paper addresses the question of whether speaker identification by a group of collaborating lay listeners (such as a jury) would be more or less accurate than the output of such a forensic-voice-comparison system. As members of collaborating groups, participants listen to pairs of recordings reflecting the conditions of the questioned- and known-speaker recordings in an actual case, confer, and make a probabilistic consensus judgement on each pair of recordings. The present paper also compares group-consensus responses with "wisdom of the crowd" which uses the average of the responses from multiple independent individual listeners.
Subjects
Forensic Sciences; Voice; Humans; Forensic Sciences/methods; Expert Testimony; Male; Female; Adult; Speech Recognition Software; Cooperative Behavior; Biometric Identification/methods
ABSTRACT
This study presents a pioneering approach that leverages advanced sensing technologies and data processing techniques to enhance clinical documentation generation during medical consultations. By employing sophisticated sensors to capture and interpret cues such as speech patterns, intonation, and pauses, the system aims to accurately perceive and understand patient-doctor interactions in real time. This sensing capability allows for the automation of transcription and summarization tasks, facilitating the creation of concise and informative clinical documents. Through the integration of automatic speech recognition sensors, spoken dialogue is seamlessly converted into text, enabling efficient data capture. Additionally, deep models such as Transformers are used to extract and analyze crucial information from the dialogue, ensuring that the generated summaries accurately encapsulate the essence of the consultations. Despite challenges encountered during development, experimentation with these sensing technologies has yielded promising results: the system achieved a maximum ROUGE-1 score of 0.57, demonstrating its effectiveness in summarizing complex medical discussions. This sensor-based approach aims to alleviate the administrative burden on healthcare professionals by automating documentation tasks and safeguarding important patient information. Ultimately, by enhancing the efficiency and reliability of clinical documentation, this innovative method contributes to improving overall healthcare outcomes.
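A sketch of how the reported ROUGE-1 score can be computed with the rouge_score package; the reference and generated summaries below are invented examples, not study data:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

reference = "Patient reports intermittent chest pain for two days, worse on exertion."
generated = "Patient has had chest pain for two days that worsens with exertion."

scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure)   # F1 of unigram overlap, the metric reported in the study
```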
Subjects
Deep Learning; Humans; Speech Recognition Software
ABSTRACT
Speakers tailor their speech to different types of interlocutors. For example, speech directed to voice technology has different acoustic-phonetic characteristics than speech directed to a human. The present study investigates the perceptual consequences of human- and device-directed registers in English. We compare two groups of speakers: participants whose first language is English (L1) and bilingual L1 Mandarin-L2 English talkers. Participants produced short sentences in several conditions: an initial production and a repeat production after a human or device guise indicated either understanding or misunderstanding. In experiment 1, a separate group of L1 English listeners heard these sentences and transcribed the target words. In experiment 2, the same productions were transcribed by an automatic speech recognition (ASR) system. Results show that transcription accuracy was highest for L1 talkers for both human and ASR transcribers. Furthermore, there were no overall differences in transcription accuracy between human- and device-directed speech. Finally, while human listeners showed an intelligibility benefit for coda repair productions, the ASR transcriber did not benefit from these enhancements. Findings are discussed in terms of models of register adaptation, phonetic variation, and human-computer interaction.