Results 1 - 20 of 810
1.
Comput Intell Neurosci ; 2022: 7846877, 2022.
Article in English | MEDLINE | ID: mdl-35498214

ABSTRACT

As economic globalization drives China's socioeconomic development toward internationalization, attention to English in China has grown significantly. However, the level of domestic English teaching is limited: students' English pronunciation cannot be corrected and reasonably evaluated at all times, so oral training has clear disadvantages. Moreover, computer-aided language learning systems at home and abroad focus on word and grammar practice, and their evaluation indicators are few and not comprehensive. Given the complexity of English pronunciation variation, traditional speech recognition struggles to cope with varying speech rates and to improve its accuracy. To strengthen the English pronunciation of domestic students, this work studies a nonlinear network structure that simulates the human brain and establishes a speech recognition model based on Mel-frequency cepstral coefficient (MFCC) parameters derived from a human-ear model and a deep belief network. The traditional computer-based pronunciation evaluation method is improved comprehensively, and a high-quality speech recognition system is constructed. Taking students as the research subjects, the experiments show that the proposed method can give learners an accurate pronunciation quality analysis report and guidance, correct their intonation, and improve the learning effect, and the experimental data verify that the improved speech recognition model has higher recognition ability than the traditional model.
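The front end described above rests on Mel-frequency cepstral coefficients. A minimal extraction sketch, assuming the librosa toolkit (the abstract does not name an implementation):

# Minimal MFCC front-end sketch; librosa is an assumed toolkit choice,
# not one named by the paper.
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=13):
    """Load an utterance and return frame-level MFCC features (frames x coeffs)."""
    y, sr = librosa.load(wav_path, sr=16000)          # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    # Delta features are commonly appended before feeding a deep belief network.
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T

# features = extract_mfcc("student_utterance.wav")  # hypothetical file name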


Subjects
Language , Speech , Algorithms , Humans , Neural Networks, Computer , Speech Recognition Software
2.
BMC Med Inform Decis Mak ; 22(1): 96, 2022 Apr 08.
Article in English | MEDLINE | ID: mdl-35395798

ABSTRACT

BACKGROUND: Despite the rapid expansion of electronic health records, reliance on the computer mouse and keyboard makes data entry into these systems challenging. Speech recognition software is one of the substitutes for the mouse and keyboard. The objective of this study was to evaluate the effect of online and offline speech recognition software on spelling errors in nursing reports and to compare these errors with those in handwritten reports. METHODS: Online and offline speech recognition software packages were selected and customized based on the terms they failed to recognize. Two groups of 35 nurses documented the admission notes of hospitalized patients upon their arrival using one of three data entry methods (the handwritten method or one of the two types of speech recognition software). After at least a month, they created the same reports using the other methods. The number of spelling errors in each method was determined, and these errors were compared between the paper method and the two electronic methods before and after error correction. RESULTS: The lowest accuracy was observed for the online software, at 96.4%. On average per report, the online method generated 6.76 and the offline method 4.56 more errors than the paper method. After the participants corrected the errors, the number of errors in the online reports decreased by 94.75% and in the offline reports by 97.20%. The highest number of reports containing errors was associated with the online software. CONCLUSION: Although both software packages had relatively high accuracy, they produced more errors than the paper method; these errors can be reduced by optimizing and upgrading the software. The results showed that error correction by users significantly reduced the documentation errors caused by the software.
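The reported error reductions are simple percentage changes between error counts before and after user correction. A small sketch of that arithmetic, with hypothetical raw counts chosen only to reproduce the reported 94.75% figure:

# Percentage reduction in spelling errors after user correction.
def percent_reduction(before: float, after: float) -> float:
    return (before - after) / before * 100.0

online_before, online_after = 400, 21      # hypothetical raw error counts
print(round(percent_reduction(online_before, online_after), 2))  # 94.75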


Subjects
Speech Perception , Documentation , Humans , Speech , Speech Recognition Software , Technology
3.
Sensors (Basel) ; 22(8)2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35458932

ABSTRACT

Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google's trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.
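The APIs are scored with standard ASR evaluation measures, of which word error rate is the usual one. A self-contained sketch of the generic metric (not code from the paper):

# Word error rate via Levenshtein distance over word sequences;
# a generic sketch of the standard metric.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the light", "turn the lights"))  # 0.5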


Subjects
Artificial Intelligence , Speech , Cloud Computing , Neural Networks, Computer , Speech Recognition Software
4.
Sensors (Basel) ; 22(8)2022 Apr 15.
Article in English | MEDLINE | ID: mdl-35459013

ABSTRACT

Automatic speech recognition (ASR) is an essential technique in human-computer interaction, and gain control is a common operation in ASR. However, inappropriate gain control strategies can increase the word error rate (WER) of ASR. Because there is currently insufficient theoretical analysis and proof of the relationship between gain control and WER, various unconstrained gain control strategies have been adopted in real ASR systems, and the gain control that is optimal with respect to the lowest WER is rarely achieved. A gain control strategy named maximized original signal transmission (MOST) is proposed in this study to minimize the adverse impact of gain control on ASR systems. First, by modeling the gain control strategy, the quantitative relationship between gain control and ASR performance is established using the noise figure index. Second, through an analysis of this quantitative relationship, an optimal MOST gain control strategy with minimal performance degradation is theoretically deduced. Finally, comprehensive comparative experiments on a Mandarin dataset show that the proposed MOST strategy significantly reduces the WER of the experimental ASR system, with a 10% mean absolute WER reduction at -9 dB gain.
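The gain under study is a scalar applied to the waveform; the experiments report results at -9 dB gain. A sketch of applying a fixed dB gain, with clipping flagged as the kind of degradation an inappropriate strategy can introduce (illustrative only, not the MOST strategy itself):

# Apply a fixed gain (in dB) to a waveform and guard against clipping.
import numpy as np

def apply_gain_db(signal: np.ndarray, gain_db: float) -> np.ndarray:
    gain = 10.0 ** (gain_db / 20.0)        # dB -> linear amplitude factor
    out = signal * gain
    if np.any(np.abs(out) > 1.0):          # assume signal normalized to [-1, 1]
        out = np.clip(out, -1.0, 1.0)      # hard clipping distorts ASR features
    return out

x = np.random.uniform(-0.5, 0.5, 16000).astype(np.float32)  # 1 s of dummy audio
y = apply_gain_db(x, -9.0)                 # attenuation used in the experiments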


Subjects
Speech Perception , Speech Recognition Software , Humans , Noise , Speech
5.
JMIR Mhealth Uhealth ; 9(10): e32301, 2021 10 12.
Article in English | MEDLINE | ID: mdl-34636729

ABSTRACT

BACKGROUND: Prehospitalization documentation is a challenging task and prone to loss of information, as paramedics operate in disruptive environments requiring their constant attention to the patients. OBJECTIVE: The aim of this study is to develop a mobile platform for hands-free prehospitalization documentation to assist first responders in operational medical environments by aggregating all existing solutions for noise resiliency and domain adaptation. METHODS: The platform was built to extract meaningful medical information from real-time audio streaming at the point of injury and transmit complete documentation to a field hospital prior to patient arrival. To this end, state-of-the-art automatic speech recognition (ASR) solutions with the following modular improvements were thoroughly explored: noise-resilient ASR, multi-style training, customized lexicon, and speech enhancement. The development of the platform was strictly guided by qualitative research and simulation-based evaluation to address the relevant challenges through progressive improvements at every process step of the end-to-end solution. The primary performance metrics included medical word error rate (WER) in machine-transcribed text output and an F1 score calculated by comparing the autogenerated documentation to manual documentation by physicians. RESULTS: A total of 15,139 individual words necessary for completing the documentation were identified from all conversations that occurred during the physician-supervised simulation drills. The baseline model showed suboptimal performance, with a WER of 69.85% and an F1 score of 0.611. The noise-resilient ASR, multi-style training, and customized lexicon improved the overall performance; the finalized platform achieved a medical WER of 33.3% and an F1 score of 0.81 when compared to manual documentation. Speech enhancement degraded performance: the medical WER increased from 33.3% to 46.33%, and the corresponding F1 score decreased from 0.81 to 0.78. All changes in performance were statistically significant (P<.001). CONCLUSIONS: This study presented a fully functional mobile platform for hands-free prehospitalization documentation in operational medical environments and lessons learned from its implementation.
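The F1 score above comes from comparing auto-generated documentation against the physicians' manual documentation. A minimal sketch of that comparison over sets of documented items (the items shown are hypothetical placeholders):

# F1 between auto-generated and manually documented items.
def f1_score(auto_items: set, manual_items: set) -> float:
    true_pos = len(auto_items & manual_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(auto_items)
    recall = true_pos / len(manual_items)
    return 2 * precision * recall / (precision + recall)

auto = {"gsw left thigh", "tourniquet applied", "morphine 10mg"}          # hypothetical
manual = {"gsw left thigh", "tourniquet applied", "morphine 10mg", "airway clear"}
print(round(f1_score(auto, manual), 2))  # 0.86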


Subjects
Speech Recognition Software , Speech , Documentation , Humans , Technology
6.
Sensors (Basel) ; 21(19)2021 Sep 27.
Article in English | MEDLINE | ID: mdl-34640780

ABSTRACT

Within the field of Automatic Speech Recognition (ASR) systems, impaired speech is a major challenge because standard approaches are ineffective in the presence of dysarthria. The first aim of our work is to confirm the effectiveness of a new speech analysis technique for speakers with dysarthria. This new approach exploits fine-tuning of the size and shift parameters of the spectral analysis window used to compute the initial short-time Fourier transform, to improve the performance of a speaker-dependent ASR system. The second aim is to determine whether there is a correlation between a speaker's voice features and the optimal window and shift parameters that minimise the error of an ASR system for that specific speaker. For our experiments, we used both impaired and unimpaired Italian speech. Specifically, we used 30 speakers with dysarthria from the IDEA database and 10 professional speakers from the CLIPS database; both databases are freely available. The results confirm that, if a standard ASR system performs poorly for a speaker with dysarthria, it can be improved by using the new speech analysis, whereas the new approach is ineffective for unimpaired or mildly impaired speech. Furthermore, there is a correlation between some of a speaker's voice features and their optimal parameters.
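The technique tunes the size and shift of the spectral analysis window of the initial short-time Fourier transform per speaker. A sketch of sweeping those two parameters, assuming librosa for the STFT and a hypothetical score_asr callback that returns the speaker-dependent ASR error:

# Grid search over STFT window size and shift; librosa is an assumed
# front end and `score_asr` is a hypothetical callback returning WER.
import itertools
import librosa

def best_window(y, sr, score_asr):
    """Return the (window_ms, shift_ms) pair giving the lowest WER."""
    windows_ms = [15, 20, 25, 30, 40]
    shifts_ms = [5, 10, 15]
    best = None
    for win_ms, shift_ms in itertools.product(windows_ms, shifts_ms):
        n_fft = int(sr * win_ms / 1000)
        hop = int(sr * shift_ms / 1000)
        spec = abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
        wer = score_asr(spec)              # speaker-dependent ASR evaluation
        if best is None or wer < best[0]:
            best = (wer, win_ms, shift_ms)
    return best[1], best[2]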


Subjects
Dysarthria , Speech Perception , Humans , Speech , Speech Disorders , Speech Recognition Software
7.
Yearb Med Inform ; 30(1): 191-199, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34479391

ABSTRACT

OBJECTIVES: To describe the use and promise of conversational agents in digital health, including health promotion and prevention, and how they can be combined with other new technologies to provide healthcare at home. METHOD: A narrative review of recent advances in technologies underpinning conversational agents and their use and potential for healthcare and improving health outcomes. RESULTS: By responding to written and spoken language, conversational agents present a versatile, natural user interface and have the potential to make their services and applications more widely accessible. Historically, conversational interfaces for health applications have focused mainly on mental health, but with an increase in affordable devices and the modernization of health services, conversational agents are becoming more widely deployed across the health system. We present our work on context-aware voice assistants capable of proactively engaging users and delivering health information and services. The proactive voice agents we deploy allow us to conduct experience sampling in people's homes and to collect information about the contexts in which users are interacting with them. CONCLUSION: In this article, we describe the state of the art of these and other enabling technologies for speech and conversation and discuss ongoing research efforts to develop conversational agents that "live" with patients and customize their service offerings around their needs. These agents can function as 'digital companions' who will send reminders about medications and appointments, proactively check in to gather self-assessments, and follow up with patients on their treatment plans. Together with an unobtrusive and continuous collection of other health data, conversational agents can provide novel and deeply personalized access to digital health care, and they will continue to become an increasingly important part of the ecosystem for future healthcare delivery.


Subjects
Health Promotion , Health Services Accessibility , Speech Recognition Software , Telemedicine , Communication , Humans , Monitoring, Physiologic/methods , User-Computer Interface
8.
Zhonghua Bing Li Xue Za Zhi ; 50(9): 1034-1038, 2021 Sep 08.
Article in Chinese | MEDLINE | ID: mdl-34496495

ABSTRACT

Objective: To establish a speech recognition system based on adaptive technology and to evaluate its value in pathological grossing processes. Methods: A total of 600 tissue specimens were collected at the Zhejiang Provincial People's Hospital affiliated to Hangzhou Medical College between October 1, 2020 and December 31, 2020. A speech recognition system based on adaptive technology was used in the pathological grossing processes, and the pathological examination reports were generated and extracted. Results: The speech recognition system based on adaptive technology showed a good recognition rate (overall recognition rate = 77.87%) and helped achieve rapid input and output of pathological examination data. Conclusions: The speech recognition system can reduce the labor costs, improve the work efficiency of pathologists and increase the quality of medical services, which may be valuable for building next-generation smart hospitals.


Subjects
Speech Recognition Software , Technology , Humans
9.
Nat Commun ; 12(1): 4234, 2021 07 09.
Article in English | MEDLINE | ID: mdl-34244491

ABSTRACT

We propose a Double EXponential Adaptive Threshold (DEXAT) neuron model that improves the performance of neuromorphic Recurrent Spiking Neural Networks (RSNNs) by providing faster convergence, higher accuracy and a flexible long short-term memory. We present a hardware efficient methodology to realize the DEXAT neurons using tightly coupled circuit-device interactions and experimentally demonstrate the DEXAT neuron block using oxide based non-filamentary resistive switching devices. Using experimentally extracted parameters we simulate a full RSNN that achieves a classification accuracy of 96.1% on SMNIST dataset and 91% on Google Speech Commands (GSC) dataset. We also demonstrate full end-to-end real-time inference for speech recognition using real fabricated resistive memory circuit based DEXAT neurons. Finally, we investigate the impact of nanodevice variability and endurance illustrating the robustness of DEXAT based RSNNs.
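The abstract names the DEXAT neuron but not its equations; one plausible reading of a double-exponential adaptive threshold is a spiking threshold driven by two decaying adaptation traces with different time constants. The sketch below is an assumed illustration of that idea, not the published model:

# Illustrative double-exponential adaptive-threshold LIF neuron.
# Time constants and the update rule are assumptions made for illustration;
# consult the paper for the actual DEXAT formulation.
import numpy as np

def simulate(inputs, dt=1.0, tau_m=20.0, tau1=30.0, tau2=300.0,
             b0=1.0, beta1=0.5, beta2=0.5):
    v, a1, a2 = 0.0, 0.0, 0.0
    spikes = []
    for x in inputs:
        v += dt / tau_m * (-v + x)                 # leaky membrane integration
        threshold = b0 + beta1 * a1 + beta2 * a2   # adaptive threshold
        if v >= threshold:
            spikes.append(1)
            v = 0.0
            a1 += 1.0                              # fast adaptation trace
            a2 += 1.0                              # slow adaptation trace
        else:
            spikes.append(0)
        a1 *= np.exp(-dt / tau1)                   # two exponential decays
        a2 *= np.exp(-dt / tau2)
    return spikes

out = simulate(np.random.uniform(0.0, 2.0, 200))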


Subjects
Models, Neurological , Neural Networks, Computer , Neurons/physiology , Computers , Datasets as Topic , Humans , Nanostructures , Speech Recognition Software
10.
Prensa méd. argent ; 107(5): 282-286, 2021.
Article in English | LILACS, BINACIS | ID: biblio-1359365

ABSTRACT

Deep learning is a type of computerized artificial intelligence that aims to train a computer, on the basis of artificial neural networks, to perform tasks normally carried out by humans. Recent technological advances have shown that artificial neural networks can be applied to fields such as speech and audio recognition, machine translation, board games, drug design, and medical image analysis. The development of these techniques has been extremely rapid in recent years, and artificial neural networks today outperform humans in many of these tasks. Artificial neural networks were inspired by the function of biological systems such as the brain, with connected nodes within these networks modelling neurons. The principle of such networks is that they are trained on data sets for which the ground truth is known. As an example, a network may need to be trained to identify images in which a bicycle is depicted. This requires a large number of images in which bicycles are manually labelled (the so-called ground truth), which are then analyzed by the computer. If enough images with and without bicycles are used, the artificial neural network can be trained to identify bicycles in other sets of images. In medical imaging, classical approaches include the extraction of semantic features defined by human experts or agnostic features defined by equations. Semantic features can provide good specificity for disease diagnosis, but they may differ between physicians depending on their level of experience, are time-consuming, and are expensive. Agnostic features may have limited specificity but offer the advantage of high reproducibility. Deep learning takes a different approach. A training data set is required for which the ground truth, in this case the diagnosis, is known. The amount of data required is large, and typically 100,000 images or more are used. Once the artificial neural network has been trained, it can be applied to a validation data set in which the diagnosis is also known but is not disclosed to the computer. The output of the artificial neural network is, in the simplest case, disease or no disease, which can be compared with the ground truth. Agreement with the ground truth is quantified using measures such as the area under the curve (AUC; values between 0 and 1, with 1 being perfect discrimination between health and disease), specificity (values between 0% and 100%; the proportion of true negatives that are correctly identified), and sensitivity (values between 0% and 100%; the proportion of true positives that are correctly identified). Whether high sensitivity or high specificity is required depends on the disease, its prevalence, and the actual clinical setting in which the network is to be deployed.
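The evaluation measures defined at the end of the abstract (AUC, sensitivity, specificity) can be computed directly; a sketch using scikit-learn (an assumed library choice) on synthetic labels and scores:

# AUC, sensitivity and specificity for a binary classifier;
# labels and scores below are synthetic placeholders.
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true  = [0, 0, 0, 1, 1, 1, 1, 0]                   # ground-truth diagnosis
y_score = [0.1, 0.4, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6]   # network output probability
y_pred  = [int(s >= 0.5) for s in y_score]

auc = roc_auc_score(y_true, y_score)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                          # true-positive rate
specificity = tn / (tn + fp)                          # true-negative rate
print(round(auc, 2), round(sensitivity, 2), round(specificity, 2))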


Subjects
Humans , Artificial Intelligence , Neural Networks, Computer , Speech Recognition Software , Deep Learning
11.
Neural Netw ; 142: 303-315, 2021 Oct.
Article in English | MEDLINE | ID: mdl-34082286

ABSTRACT

The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.


Subjects
Neural Networks, Computer , Speech , Language , Speech Recognition Software
12.
J Trauma Acute Care Surg ; 91(2S Suppl 2): S40-S45, 2021 08 01.
Article in English | MEDLINE | ID: mdl-33938509

ABSTRACT

ABSTRACT: The objective of this project was to identify and develop software for an augmented reality application that runs on the US Army Integrated Visual Augmentation System (IVAS) to support a medical caregiver during tactical combat casualty care scenarios. In this augmented reality tactical combat casualty care application, human anatomy of individual soldiers obtained predeployment is superimposed on the view of an injured war fighter through the IVAS. This offers insight into the anatomy of the injured war fighter to advance treatment in austere environments. In this article, we describe the various software components required for an augmented reality tactical combat casualty care tool. These include a body pose tracking system to track the patient's body pose, a virtual rendering of a human anatomy avatar, speech input to control the application, and rendering techniques to visualize the virtual anatomy and treatment information on the augmented reality display. We then implemented speech commands and visualization for four common medical scenarios, including injury of a limb, a blast to the pelvis, cricothyrotomy, and a pneumothorax, on the Microsoft HoloLens 1 (Microsoft, Redmond, WA). The software is designed as a forward surgical care tool on the US Army IVAS, with the intention of providing the medical caregiver with a unique ability to quickly assess the affected internal anatomy. The current software components still have some limitations with respect to speech recognition reliability in noise and body pose tracking. These will likely be improved with the improved hardware of the IVAS, which is based on a modified HoloLens 2.


Subjects
Augmented Reality , Military Medicine , Traumatology , War-Related Injuries/surgery , Diagnostic Imaging , Forecasting , Humans , Lighting , Military Medicine/methods , Military Medicine/trends , Software , Speech Recognition Software , Traumatology/methods , Traumatology/trends , United States
13.
J Med Internet Res ; 23(5): e22959, 2021 05 25.
Article in English | MEDLINE | ID: mdl-33999834

ABSTRACT

Artificial intelligence-driven voice technology deployed on mobile phones and smart speakers has the potential to improve patient management and organizational workflow. Voice chatbots have already been implemented in health care, leveraging innovative telehealth solutions during the COVID-19 pandemic. They allow for automatic acute care triaging and chronic disease management, including remote monitoring, preventive care, patient intake, and referral assistance. This paper focuses on the current clinical needs and applications of artificial intelligence-driven voice chatbots to drive operational effectiveness and improve patient experience and outcomes.


Subjects
Artificial Intelligence , COVID-19 , Communication , Delivery of Health Care/methods , Speech Recognition Software , Telemedicine/methods , Voice , Cell Phone , Chronic Disease/therapy , Critical Care/methods , Humans , Pandemics , Referral and Consultation , Triage
14.
Neural Netw ; 140: 261-273, 2021 Aug.
Article in English | MEDLINE | ID: mdl-33838592

ABSTRACT

Continuous dimensional emotion recognition from speech helps robots or virtual agents capture the temporal dynamics of a speaker's emotional state in natural human-robot interactions. Temporal modulation cues obtained directly from the time-domain model of auditory perception can better reflect temporal dynamics than the acoustic features usually processed in the frequency domain. Feature extraction, which can reflect temporal dynamics of emotion from temporal modulation cues, is challenging because of the complexity and diversity of the auditory perception model. A recent neuroscientific study suggests that human brains derive multi-resolution representations through temporal modulation analysis. This study investigates multi-resolution representations of an auditory perception model and proposes a novel feature called multi-resolution modulation-filtered cochleagram (MMCG) for predicting valence and arousal values of emotional primitives. The MMCG is constructed by combining four modulation-filtered cochleagrams at different resolutions to capture various temporal and contextual modulation information. In addition, to model the multi-temporal dependencies of the MMCG, we designed a parallel long short-term memory (LSTM) architecture. The results of extensive experiments on the RECOLA and SEWA datasets demonstrate that MMCG provides the best recognition performance in both datasets among all evaluated features. The results also show that the parallel LSTM can build multi-temporal dependencies from the MMCG features, and the performance on valence and arousal prediction is better than that of a plain LSTM method.
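The parallel LSTM processes the four modulation-filtered cochleagrams in separate branches before a joint prediction of valence and arousal. A minimal PyTorch sketch of that branching structure (all dimensions are assumptions, not the paper's configuration):

# Parallel-LSTM sketch over four feature streams (one per MMCG resolution);
# all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ParallelLSTM(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, batch_first=True) for _ in range(n_branches)]
        )
        self.head = nn.Linear(hidden * n_branches, 2)    # valence, arousal

    def forward(self, streams):
        # streams: list of 4 tensors, each (batch, time, feat_dim)
        outs = []
        for lstm, x in zip(self.branches, streams):
            y, _ = lstm(x)
            outs.append(y)
        fused = torch.cat(outs, dim=-1)                   # (batch, time, hidden*4)
        return self.head(fused)                           # frame-level predictions

model = ParallelLSTM()
dummy = [torch.randn(2, 100, 64) for _ in range(4)]
print(model(dummy).shape)                                 # torch.Size([2, 100, 2])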


Subjects
Emotions , Models, Neurological , Speech Perception , Speech Recognition Software , Cochlea/physiology , Cues , Humans , Machine Learning
15.
Neural Netw ; 139: 305-325, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33873122

ABSTRACT

How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs: ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN). These combine Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al., 2019) with the information theoretic extension of GAN - InfoGAN (Chen et al., 2016) - and propose a new latent space structure that can model featural learning simultaneously with a higher level classification and allows for a very low-dimension vector representation of lexical items. In addition to the Generator and Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from the TIMIT corpus learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate training data, but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on suit and dark outputs innovative start, even though it never saw start or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code. Probing deep neural networks trained on well understood dependencies in speech bears implications for latent space interpretability and understanding how deep neural networks learn meaningful representations, as well as potential for unsupervised text-to-speech generation in the GAN framework.


Subjects
Machine Learning , Natural Language Processing , Acoustics , Speech Recognition Software
16.
Sensors (Basel) ; 21(9)2021 Apr 28.
Article in English | MEDLINE | ID: mdl-33924798

ABSTRACT

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token's contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
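BPE-dropout tokenizes the same utterance differently on each pass. With a trained SentencePiece BPE model (an assumed toolkit; the model path below is hypothetical), subword sampling in this spirit can be enabled at encoding time:

# Non-deterministic subword tokenization in the spirit of BPE-dropout.
# Assumes a SentencePiece BPE model has already been trained; "bpe.model"
# is a hypothetical path and SentencePiece is an assumed toolkit choice.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")

utterance = "gamarjoba"  # example word; the Babel data itself is not reproduced here
for _ in range(3):
    # alpha is the dropout probability applied to merges at encoding time,
    # so each call may yield a different segmentation of the same word.
    print(sp.encode(utterance, out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))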


Subjects
Speech Perception , Speech , Acoustics , Speech Recognition Software , Vocabulary
17.
Article in English | MEDLINE | ID: mdl-33929963

ABSTRACT

Dysarthria is a disorder that affects an individual's speech intelligibility due to the paralysis of muscles and organs involved in the articulation process. As the condition is often associated with physically debilitating disabilities, such individuals not only face communication problems but also find interaction with digital devices a burden. For these individuals, automatic speech recognition (ASR) technologies can make a significant difference in their lives, as computing and portable digital devices can become an interaction medium, enabling them to communicate with others and with computers. However, ASR technologies have performed poorly in recognizing dysarthric speech, especially severe dysarthria, due to multiple challenges facing dysarthric ASR systems. We identified that these challenges are due to the alternation and inaccuracy of dysarthric phonemes, the scarcity of dysarthric speech data, and imprecision in phoneme labeling. This paper reports on our second dysarthric-specific ASR system, called Speech Vision (SV), which tackles these challenges by adopting a novel approach towards dysarthric ASR in which speech features are extracted visually; SV then learns to see the shape of the words pronounced by dysarthric individuals. This visual acoustic modeling feature of SV eliminates phoneme-related challenges. To address the data scarcity problem, SV adopts visual data augmentation techniques, generates synthetic dysarthric acoustic visuals, and leverages transfer learning. Benchmarked against the other state-of-the-art dysarthric ASR systems considered in this study, SV outperformed them, improving recognition accuracies for 67% of UA-Speech speakers, with the biggest improvements achieved for severe dysarthria.


Subjects
Deep Learning , Dysarthria , Dysarthria/diagnosis , Humans , Speech Intelligibility , Speech Production Measurement , Speech Recognition Software
18.
Neural Netw ; 141: 225-237, 2021 Sep.
Article in English | MEDLINE | ID: mdl-33930564

ABSTRACT

The traditional generalized sidelobe canceller (GSC) is a common speech enhancement front end to improve the noise robustness of automatic speech recognition (ASR) systems in the far-field cases. However, the traditional GSC is optimized based on the signal level criteria, causing it not to guarantee the optimal ASR performance. To address this issue, we propose a novel dual-channel deep neural network (DNN)-based GSC structure, called nnGSC, which is optimized by using the objective of maximizing the ASR performance. Our key idea is to make each module of the traditional GSC fully learnable and use the acoustic model to perform joint optimization with GSC. We use the coefficients of the traditional GSC to initialize nnGSC, so that both traditional signal processing knowledge and large amounts of data can be used to guide the network learning. In addition, nnGSC can automatically track the target direction-of-arrival (DOA) frame-by-frame without the need for additional localization algorithms. In the experiments, nnGSC achieves a relative character error rate (CER) improvement of 23.7% compared to the microphone observation, 13.5% compared to the oracle direction-based super-directive beamformer, 12.2% compared to the oracle direction-based traditional GSC and 5.9% compared to the oracle mask-based minimum variance distortionless response (MVDR) beamformer. Moreover, we can improve the robustness of nnGSC against array geometry mismatches by training with multi-geometry data.
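A traditional dual-channel GSC consists of a fixed beamformer, a blocking matrix, and an adaptive noise canceller; these are the modules the paper makes learnable. A numpy sketch of the classical signal-level version, shown only to fix the structure being replaced (filter length, step size, and the toy two-microphone mixture are assumptions):

# Classical dual-channel GSC in the time domain (fixed beamformer,
# blocking matrix, NLMS noise canceller). Illustrative of the structure
# the paper turns into learnable modules; parameters are assumptions.
import numpy as np

def gsc_dual_channel(x1, x2, taps=16, mu=0.1, eps=1e-8):
    d = 0.5 * (x1 + x2)          # fixed beamformer: average the aligned channels
    b = x1 - x2                  # blocking matrix output: target-cancelled noise reference
    w = np.zeros(taps)
    out = np.zeros_like(d)
    buf = np.zeros(taps)
    for n in range(len(d)):
        buf = np.roll(buf, 1)
        buf[0] = b[n]
        noise_est = w @ buf                      # adaptive noise estimate
        e = d[n] - noise_est                     # enhanced output sample
        w += mu * e * buf / (buf @ buf + eps)    # NLMS update
        out[n] = e
    return out

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))
noise = rng.normal(0, 0.3, 8000)
x1 = clean + noise
x2 = clean + np.roll(noise, 1)                   # toy: same noise with a one-sample lag
enhanced = gsc_dual_channel(x1, x2)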


Subjects
Deep Learning , Speech Recognition Software , Speech , Humans , Noise
19.
Neural Netw ; 141: 72-86, 2021 Sep.
Article in English | MEDLINE | ID: mdl-33866304

ABSTRACT

Deep learning methods for language recognition have achieved promising performance. However, most of the studies focus on frameworks for single types of acoustic features and single tasks. In this paper, we propose the deep joint learning strategies based on the Multi-Feature (MF) and Multi-Task (MT) models. First, we investigate the efficiency of integrating multiple acoustic features and explore two kinds of training constraints, one is introducing auxiliary classification constraints with adaptive weights for loss functions in feature encoder sub-networks, and the other option is introducing the Canonical Correlation Analysis (CCA) constraint to maximize the correlation of different feature representations. Correlated speech tasks, such as phoneme recognition, are applied as auxiliary tasks in order to learn related information to enhance the performance of language recognition. We analyze phoneme-aware information from different learning strategies, like joint learning on the frame-level, adversarial learning on the segment-level, and the combination mode. In addition, we present the Language-Phoneme embedding extraction structure to learn and extract language and phoneme embedding representations simultaneously. We demonstrate the effectiveness of the proposed approaches with experiments on the Oriental Language Recognition (OLR) data sets. Experimental results indicate that joint learning on the multi-feature and multi-task models extracts instinct feature representations for language identities and improves the performance, especially in complex challenges, such as cross-channel or open-set conditions.


Subjects
Deep Learning , Language , Speech Recognition Software , Acoustics , Humans
20.
Neural Netw ; 139: 326-334, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33878611

ABSTRACT

Keyword search (KWS) means searching for keywords given by the user from continuous speech. Conventional KWS systems are based on Automatic Speech Recognition (ASR), where the input speech has to be first processed by the ASR system before keyword searching. In the recent decade, as deep learning and deep neural networks (DNN) become increasingly popular, KWS systems can also be trained in an end-to-end (E2E) manner. The main advantage of E2E KWS is that there is no need for speech recognition, which makes the training and searching procedure much more straightforward than the traditional ones. This article proposes an E2E KWS model, which consists of four parts: speech encoder-decoder, query encoder-decoder, attention mechanism, and energy scorer. Firstly, the proposed model outperforms the baseline model. Secondly, we find that under various supervision, character or phoneme sequences, speech or query encoders can extract the corresponding information, resulting in different performances. Moreover, we introduce an attention mechanism and invent a novel energy scorer, where the former can help locate keywords. The latter can make final decisions by considering speech embeddings, query embeddings, and attention weights in parallel. We evaluate our model on low resource conditions with about 10-hour training data for four different languages. The experiment results prove that the proposed model can work well on low resource conditions.


Subjects
Machine Learning , Speech Recognition Software , Natural Language Processing