ABSTRACT
Speech neuroprostheses have the potential to restore communication to people living with paralysis, but naturalistic speed and expressivity are elusive [1]. Here we use high-density surface recordings of the speech cortex in a clinical-trial participant with severe limb and vocal paralysis to achieve high-performance real-time decoding across three complementary speech-related output modalities: text, speech audio and facial-avatar animation. We trained and evaluated deep-learning models using neural data collected as the participant attempted to silently speak sentences. For text, we demonstrate accurate and rapid large-vocabulary decoding with a median rate of 78 words per minute and median word error rate of 25%. For speech audio, we demonstrate intelligible and rapid speech synthesis and personalization to the participant's pre-injury voice. For facial-avatar animation, we demonstrate the control of virtual orofacial movements for speech and non-speech communicative gestures. The decoders reached high performance with less than two weeks of training. Our findings introduce a multimodal speech-neuroprosthetic approach that has substantial promise to restore full, embodied communication to people living with severe paralysis.
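The text-decoding pathway described above can be illustrated with a minimal sketch: a recurrent network maps high-density cortical feature frames to a token sequence and is trained with a connectionist temporal classification (CTC) loss that a language model could later rescore. This is an assumed formulation, not the study's published model; the electrode count (253), token inventory (41) and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class NeuralToTextDecoder(nn.Module):
    def __init__(self, n_electrodes=253, n_tokens=41, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_electrodes, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_tokens + 1)  # extra class is the CTC blank

    def forward(self, x):                      # x: (batch, time, electrodes)
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(-1)    # (batch, time, tokens + blank)

model = NeuralToTextDecoder()
ctc = nn.CTCLoss(blank=41, zero_infinity=True)
neural = torch.randn(2, 400, 253)              # two simulated attempted sentences
log_probs = model(neural).transpose(0, 1)      # CTC expects (time, batch, classes)
targets = torch.randint(0, 41, (2, 12))        # simulated token labels
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 400, dtype=torch.long),
           target_lengths=torch.full((2,), 12, dtype=torch.long))
loss.backward()
```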
Subject(s)
Face , Neural Prostheses , Paralysis , Speech , Humans , Cerebral Cortex/physiology , Cerebral Cortex/physiopathology , Clinical Trials as Topic , Communication , Deep Learning , Gestures , Movement , Neural Prostheses/standards , Paralysis/physiopathology , Paralysis/rehabilitation , Vocabulary , Voice
ABSTRACT
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
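A minimal sketch of the two-stage idea described above, assuming simple bidirectional LSTMs for both stages: cortical features are first mapped to articulatory kinematic features, which are then mapped to acoustic features that a vocoder could render as audio. The feature dimensions (256 channels, 33 kinematic dimensions, 32 acoustic dimensions) are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class ArticulatoryDecoder(nn.Module):
    def __init__(self, n_channels=256, n_kinematic=33, n_acoustic=32, hidden=256):
        super().__init__()
        # stage 1: cortical activity -> articulatory kinematics
        self.neural_to_kin = nn.LSTM(n_channels, hidden, num_layers=2,
                                     bidirectional=True, batch_first=True)
        self.kin_head = nn.Linear(2 * hidden, n_kinematic)
        # stage 2: articulatory kinematics -> speech acoustics
        self.kin_to_acoustic = nn.LSTM(n_kinematic, hidden, num_layers=2,
                                       bidirectional=True, batch_first=True)
        self.acoustic_head = nn.Linear(2 * hidden, n_acoustic)

    def forward(self, x):                      # x: (batch, time, channels)
        h, _ = self.neural_to_kin(x)
        kinematics = self.kin_head(h)          # intermediate articulatory representation
        h2, _ = self.kin_to_acoustic(kinematics)
        acoustics = self.acoustic_head(h2)     # features for a downstream vocoder
        return kinematics, acoustics

model = ArticulatoryDecoder()
kinematics, acoustics = model(torch.randn(1, 500, 256))   # one simulated sentence
```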
Subject(s)
Cerebral Cortex/physiology , Movement/physiology , Neural Networks, Computer , Speech Acoustics , Speech/physiology , Adult , Biomechanical Phenomena/physiology , Female , Humans , Male , Speech Articulation Tests , Speech Intelligibility
ABSTRACT
BACKGROUND: Technology to restore the ability to communicate in paralyzed persons who cannot speak has the potential to improve autonomy and quality of life. An approach that decodes words and sentences directly from the cerebral cortical activity of such patients may represent an advancement over existing methods for assisted communication. METHODS: We implanted a subdural, high-density, multielectrode array over the area of the sensorimotor cortex that controls speech in a person with anarthria (the loss of the ability to articulate speech) and spastic quadriparesis caused by a brain-stem stroke. Over the course of 48 sessions, we recorded 22 hours of cortical activity while the participant attempted to say individual words from a vocabulary set of 50 words. We used deep-learning algorithms to create computational models for the detection and classification of words from patterns in the recorded cortical activity. We applied these computational models, as well as a natural-language model that yielded next-word probabilities given the preceding words in a sequence, to decode full sentences as the participant attempted to say them. RESULTS: We decoded sentences from the participant's cortical activity in real time at a median rate of 15.2 words per minute, with a median word error rate of 25.6%. In post hoc analyses, we detected 98% of the attempts by the participant to produce individual words, and we classified words with 47.1% accuracy using cortical signals that were stable throughout the 81-week study period. CONCLUSIONS: In a person with anarthria and spastic quadriparesis caused by a brain-stem stroke, words and sentences were decoded directly from cortical activity during attempted speech with the use of deep-learning models and a natural-language model. (Funded by Facebook and others; ClinicalTrials.gov number, NCT03698149.).
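The sentence-decoding step, which combines word-classification probabilities with next-word probabilities from a language model, can be sketched as a small beam search. The classifier and language model below are random stand-ins over a toy vocabulary, not the trial's actual models.

```python
import numpy as np

VOCAB = ["i", "am", "thirsty", "hello", "you"]     # toy stand-in vocabulary

def classifier_probs(neural_window):
    """Stand-in for the neural word classifier: P(word | cortical activity)."""
    p = np.abs(np.random.randn(len(VOCAB))) + 1e-6
    return p / p.sum()

def lm_probs(history):
    """Stand-in for the natural-language model: P(next word | preceding words)."""
    return np.full(len(VOCAB), 1.0 / len(VOCAB))

def decode_sentence(neural_windows, beam_width=4, lm_weight=0.5):
    beams = [([], 0.0)]                            # (word sequence, log score)
    for window in neural_windows:
        log_p_neural = np.log(classifier_probs(window))
        candidates = []
        for words, score in beams:
            total = score + log_p_neural + lm_weight * np.log(lm_probs(words))
            candidates += [(words + [w], total[i]) for i, w in enumerate(VOCAB)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(decode_sentence([None] * 3))                 # decode a simulated 3-word attempt
```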
Subject(s)
Brain Stem Infarctions/complications , Brain-Computer Interfaces , Deep Learning , Dysarthria/rehabilitation , Neural Prostheses , Speech , Adult , Dysarthria/etiology , Electrocorticography , Electrodes, Implanted , Humans , Male , Natural Language Processing , Quadriplegia/etiology , Sensorimotor Cortex/physiology
ABSTRACT
The human auditory system extracts rich linguistic abstractions from speech signals. Traditional approaches to understanding this complex process have used linear feature-encoding models, with limited success. Artificial neural networks excel in speech recognition tasks and offer promising computational models of speech processing. We used speech representations in state-of-the-art deep neural network (DNN) models to investigate neural coding from the auditory nerve to the speech cortex. Representations in hierarchical layers of the DNN correlated well with the neural activity throughout the ascending auditory system. Unsupervised speech models performed at least as well as other purely supervised or fine-tuned models. Deeper DNN layers were better correlated with the neural activity in the higher-order auditory cortex, with computations aligned with phonemic and syllabic structures in speech. Accordingly, DNN models trained on either English or Mandarin predicted cortical responses in native speakers of each language. These results reveal convergence between DNN model representations and the biological auditory pathway, offering new approaches for modeling neural coding in the auditory cortex.
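The layer-wise comparison can be sketched as a standard encoding-model analysis, assuming ridge regression from a DNN layer's time-aligned speech features to each recording channel, scored by held-out prediction correlation. All data below are simulated and the feature and channel counts are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_time, n_electrodes, n_features = 2000, 64, 512
neural = rng.standard_normal((n_time, n_electrodes))              # simulated responses
dnn_layers = {f"layer_{k}": rng.standard_normal((n_time, n_features)) for k in range(3)}

for name, X in dnn_layers.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, neural, test_size=0.2, random_state=0)
    pred = Ridge(alpha=100.0).fit(X_tr, y_tr).predict(X_te)
    # correlation between predicted and held-out responses, per electrode
    r = [np.corrcoef(pred[:, e], y_te[:, e])[0, 1] for e in range(n_electrodes)]
    print(name, "mean prediction r =", round(float(np.mean(r)), 3))
```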
Subject(s)
Auditory Cortex , Speech Perception , Humans , Speech/physiology , Auditory Pathways , Auditory Cortex/physiology , Neural Networks, Computer , Perception , Speech Perception/physiology
ABSTRACT
Neuroprostheses have the potential to restore communication to people who cannot speak or type due to paralysis. However, it is unclear if silent attempts to speak can be used to control a communication neuroprosthesis. Here, we translated direct cortical signals in a clinical-trial participant (ClinicalTrials.gov; NCT03698149) with severe limb and vocal-tract paralysis into single letters to spell out full sentences in real time. We used deep-learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. "alpha" for "a"). We leveraged broad electrode coverage beyond speech-motor cortex to include supplemental control signals from hand cortex and complementary information from low- and high-frequency signal components to improve decoding accuracy. We decoded sentences using words from a 1,152-word vocabulary at a median character error rate of 6.13% and speed of 29.4 characters per minute. In offline simulations, we showed that our approach generalized to large vocabularies containing over 9,000 words (median character error rate of 8.23%). These results illustrate the clinical viability of a silently controlled speech neuroprosthesis to generate sentences from a large vocabulary through a spelling-based approach, complementing previous demonstrations of direct full-word decoding.
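A toy sketch of the spelling approach, assuming a per-attempt classifier over the 26 code words and a hard vocabulary constraint in place of the full language-model beam search; the classifier and the five-word vocabulary below are stand-ins, not the study's models.

```python
import string
import numpy as np

LETTERS = list(string.ascii_lowercase)        # the 26 code words map one-to-one to letters
VOCAB = {"hi", "yes", "no", "help", "water"}  # toy stand-in vocabulary

def letter_probs(neural_window):
    """Stand-in for the code-word classifier: P(letter | neural attempt)."""
    p = np.abs(np.random.randn(26)) + 1e-6
    return p / p.sum()

def decode_word(neural_windows):
    log_p = np.log([letter_probs(w) for w in neural_windows])   # (attempts, 26)
    best, best_score = None, -np.inf
    for word in VOCAB:                        # vocabulary constraint replaces the LM
        if len(word) != len(neural_windows):
            continue
        score = sum(log_p[i][LETTERS.index(c)] for i, c in enumerate(word))
        if score > best_score:
            best, best_score = word, score
    return best

print(decode_word([None, None]))              # decode a simulated two-letter attempt
```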
Subject(s)
Speech Perception , Speech , Humans , Language , Vocabulary , Paralysis
ABSTRACT
Objective. Decoding language representations directly from the brain can enable new brain-computer interfaces (BCIs) for high-bandwidth human-human and human-machine communication. Clinically, such technologies can restore communication in people with neurological conditions affecting their ability to speak. Approach. In this study, we propose a novel deep network architecture, Brain2Char, for directly decoding text (specifically character sequences) from direct brain recordings (electrocorticography, ECoG). The Brain2Char framework combines state-of-the-art deep-learning modules: 3D Inception layers for multiband spatiotemporal feature extraction from neural data, bidirectional recurrent layers, and dilated convolution layers, followed by a language-model-weighted beam search to decode character sequences while optimizing a connectionist temporal classification loss. Additionally, given the highly non-linear transformations that underlie the conversion of cortical function to character sequences, we regularize the network's latent representations, motivated by insights into the cortical encoding of speech production and by artifactual aspects specific to ECoG data acquisition. To do this, we impose auxiliary losses on latent representations for articulatory movements, speech acoustics and session-specific non-linearities. Main results. In three (out of four) participants reported here, Brain2Char achieves word error rates of 10.6%, 8.5%, and 7.0%, respectively, on vocabulary sizes ranging from 1,200 to 1,900 words. Significance. These results establish a new end-to-end approach to decoding text from brain signals and demonstrate the potential of Brain2Char as a high-performance communication BCI.
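A simplified sketch inspired by the architecture described above: a temporal convolution stands in for the 3D Inception feature extractor, followed by a bidirectional GRU and a CTC loss. The auxiliary articulatory/acoustic losses and the language-model-weighted beam search are omitted, and all sizes are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class Brain2CharSketch(nn.Module):
    def __init__(self, n_channels=128, n_chars=29, hidden=256):
        super().__init__()
        # temporal convolutions stand in for the 3D Inception feature extractor
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_chars + 1)      # +1 for the CTC blank

    def forward(self, x):                      # x: (batch, time, channels)
        h = self.features(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        return self.head(h).log_softmax(-1)

model = Brain2CharSketch()
log_probs = model(torch.randn(2, 300, 128)).transpose(0, 1)  # (time, batch, classes)
loss = nn.CTCLoss(blank=29)(log_probs, torch.randint(0, 29, (2, 20)),
                            torch.full((2,), 300, dtype=torch.long),
                            torch.full((2,), 20, dtype=torch.long))
loss.backward()
```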
Subject(s)
Brain-Computer Interfaces , Speech , Brain , Electrocorticography , Humans , Language
ABSTRACT
When speaking, we dynamically coordinate movements of our jaw, tongue, lips, and larynx. To investigate the neural mechanisms underlying articulation, we used direct cortical recordings from human sensorimotor cortex while participants spoke natural sentences that included sounds spanning the entire English phonetic inventory. We used deep neural networks to infer speakers' articulator movements from produced speech acoustics. Individual electrodes encoded a diversity of articulatory kinematic trajectories (AKTs), each revealing coordinated articulator movements toward specific vocal tract shapes. AKTs captured a wide range of movement types, yet they could be differentiated by the place of vocal tract constriction. Additionally, AKTs manifested out-and-back trajectories with harmonic oscillator dynamics. While AKTs were functionally stereotyped across different sentences, context-dependent encoding of preceding and following movements during production of the same phoneme demonstrated the cortical representation of coarticulation. Articulatory movements encoded in sensorimotor cortex give rise to the complex kinematics underlying continuous speech production. VIDEO ABSTRACT.
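The acoustic-to-articulatory inversion step can be sketched as a recurrent regression network that maps acoustic feature frames to articulator trajectories (lips, jaw, tongue, larynx). This is an assumed formulation with illustrative dimensions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class AcousticToArticulatory(nn.Module):
    def __init__(self, n_acoustic=25, n_articulator_dims=12, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_acoustic, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_articulator_dims)

    def forward(self, x):             # x: (batch, time, acoustic features)
        h, _ = self.rnn(x)
        return self.head(h)           # (batch, time, articulator trajectories)

model = AcousticToArticulatory()
acoustics = torch.randn(1, 600, 25)                       # one simulated sentence
trajectories = model(acoustics)                           # inferred articulator movements
loss = nn.functional.mse_loss(trajectories, torch.zeros_like(trajectories))
loss.backward()
```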
Subject(s)
Neural Networks, Computer , Sensorimotor Cortex/physiology , Speech/physiology , Adult , Biomechanical Phenomena , Electrocorticography , Epilepsy , Female , Humans , Jaw , Larynx , Lip , Middle Aged , Models, Neurological , Phonetics , Tongue
ABSTRACT
BACKGROUND: Interictal epileptiform discharges are an important biomarker for localization of focal epilepsy, especially in patients who undergo chronic intracranial monitoring. Manual detection of these pathophysiological events is cumbersome, but is still superior to current rule-based approaches in most automated algorithms. OBJECTIVE: To develop an unsupervised machine-learning algorithm for the improved, automated detection and localization of interictal epileptiform discharges based on spatiotemporal pattern recognition. METHODS: We decomposed 24 h of intracranial electroencephalography signals into basis functions and activation vectors using non-negative matrix factorization (NNMF). Thresholding the activation vector and the basis function of interest detected interictal epileptiform discharges in time and space (specific electrodes), respectively. We used convolutive NNMF, a refined algorithm, to add a temporal dimension to basis functions. RESULTS: The receiver operating characteristics for NNMF-based detection are close to the gold standard of human visual-based detection and superior to currently available alternative automated approaches (93% sensitivity and 97% specificity). The algorithm successfully identified thousands of interictal epileptiform discharges across a full day of neurophysiological recording and accurately summarized their localization into a single map. Adding a temporal window allowed for visualization of the archetypal propagation network of these epileptiform discharges. CONCLUSION: Unsupervised learning offers a powerful approach towards automated identification of recurrent pathological neurophysiological signals, which may have important implications for precise, quantitative, and individualized evaluation of focal epilepsy.
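The detection scheme can be sketched with a standard (non-convolutive) NMF, as in scikit-learn: decompose a non-negative channels-by-time matrix, select the basis function of interest, and threshold its activation vector to time-stamp candidate discharges, with the basis weights indicating the involved electrodes. The data, component count and thresholds below are simulated assumptions, not the study's parameters.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_channels, n_samples = 32, 5000
data = rng.random((n_channels, n_samples))       # stand-in for rectified iEEG power
data[5:10, 1000:1010] += 5.0                     # injected discharge-like pattern

model = NMF(n_components=5, init="nndsvd", max_iter=500)
W = model.fit_transform(data)                    # (channels, components): basis functions
H = model.components_                            # (components, samples): activation vectors

k = int(np.argmax(W[5:10].sum(axis=0)))          # component loading most on channels 5-9
threshold = H[k].mean() + 3 * H[k].std()
detected_samples = np.where(H[k] > threshold)[0]                  # detection in time
involved_channels = np.where(W[:, k] > W[:, k].mean() + 2 * W[:, k].std())[0]  # localization
print("samples:", detected_samples[:10], "channels:", involved_channels)
```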
Subject(s)
Algorithms , Electroencephalography/methods , Epilepsies, Partial/physiopathology , Unsupervised Machine Learning , Adult , Aged , Epilepsies, Partial/diagnosis , Female , Humans , Male , Middle Aged , Retrospective Studies , Seizures/diagnosis , Seizures/physiopathology
ABSTRACT
A complete neurobiological understanding of speech motor control requires determination of the relationship between simultaneously recorded neural activity and the kinematics of the lips, jaw, tongue, and larynx. Many speech articulators are internal to the vocal tract, and therefore simultaneously tracking the kinematics of all articulators is nontrivial, especially in the context of human electrophysiology recordings. Here, we describe a noninvasive, multi-modal imaging system to monitor vocal tract kinematics, demonstrate this system in six speakers during production of nine American English vowels, and provide new analysis of such data. Classification and regression analysis revealed considerable variability in the articulator-to-acoustic relationship across speakers. Non-negative matrix factorization extracted basis sets capturing vocal tract shapes allowing for higher vowel classification accuracy than traditional methods. Statistical speech synthesis generated speech from vocal tract measurements, and we demonstrate perceptual identification. We demonstrate the capacity to predict lip kinematics from ventral sensorimotor cortical activity. These results demonstrate a multi-modal system to non-invasively monitor articulator kinematics during speech production, describe novel analytic methods for relating kinematic data to speech acoustics, and provide the first decoding of speech kinematics from electrocorticography. These advances will be critical for understanding the cortical basis of speech production and the creation of vocal prosthetics.
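The kinematics-from-ECoG decoding reported above can be sketched, under assumptions, as ridge regression from time-lagged neural features to a lip-aperture trace, scored by held-out correlation; the data below are simulated and the lag count is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_time, n_electrodes, n_lags = 3000, 64, 10
neural = rng.standard_normal((n_time, n_electrodes))               # simulated high-gamma
lip_aperture = 0.5 * neural[:, 0] + rng.standard_normal(n_time)    # toy kinematic target

# lagged design matrix: each prediction sees the current and preceding 9 frames
X = np.hstack([np.roll(neural, lag, axis=0) for lag in range(n_lags)])[n_lags:]
y = lip_aperture[n_lags:]
split = int(0.8 * len(y))

model = Ridge(alpha=10.0).fit(X[:split], y[:split])
r = np.corrcoef(model.predict(X[split:]), y[split:])[0, 1]
print("held-out correlation:", round(float(r), 3))
```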