Results 1 - 20 of 62
1.
Nature ; 626(7999): 593-602, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38093008

ABSTRACT

Understanding the neural basis of speech perception requires that we study the human brain both at the scale of its fundamental computational unit, the neuron, and in the organization of neurons across the depth of cortex. Here we used high-density Neuropixels arrays [1-3] to record from 685 neurons across cortical layers at nine sites in a high-level auditory region that is critical for speech, the superior temporal gyrus [4,5], while participants listened to spoken sentences. Single neurons encoded a wide range of speech sound cues, including features of consonants and vowels, relative vocal pitch, onsets, amplitude envelope and sequence statistics. Each cross-laminar recording exhibited dominant tuning to a primary speech feature while also containing a substantial proportion of neurons that encoded other features, contributing to heterogeneous selectivity. Spatially, neurons at similar cortical depths tended to encode similar speech features. Activity across all cortical layers was predictive of high-frequency field potentials (electrocorticography), providing a neuronal origin for macroelectrode recordings from the cortical surface. Together, these results establish single-neuron tuning across the cortical laminae as an important dimension of speech encoding in the human superior temporal gyrus.
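The single-neuron feature-encoding idea can be illustrated with a very reduced sketch: fit a linear model from speech features to each neuron's firing rate and label the neuron by its largest-weight feature. The feature names, data, and RidgeCV model below are placeholders for illustration, not the study's analysis pipeline.

import numpy as np
from sklearn.linear_model import RidgeCV

# Toy stand-ins: 2000 time bins, 4 speech features, 10 neurons (not real data)
features = {"onset": 0, "consonant features": 1, "relative pitch": 2, "envelope": 3}
X = np.random.rand(2000, 4)            # time x speech features
rates = np.random.rand(2000, 10)       # time x single-unit firing rates

for n in range(rates.shape[1]):
    model = RidgeCV().fit(X, rates[:, n])                      # simple linear encoding model
    best = max(features, key=lambda f: abs(model.coef_[features[f]]))
    print(f"neuron {n}: dominant feature = {best}")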


Subject(s)
Auditory Cortex , Neurons , Speech Perception , Temporal Lobe , Humans , Acoustic Stimulation , Auditory Cortex/cytology , Auditory Cortex/physiology , Neurons/physiology , Phonetics , Speech , Speech Perception/physiology , Temporal Lobe/cytology , Temporal Lobe/physiology , Cues , Electrodes
2.
PLoS Biol ; 21(6): e3002128, 2023 06.
Article in English | MEDLINE | ID: mdl-37279203

ABSTRACT

Humans can easily tune in to one talker in a multitalker environment while still picking up bits of background speech; however, it remains unclear how we perceive speech that is masked and to what degree non-target speech is processed. Some models suggest that perception can be achieved through glimpses, which are spectrotemporal regions where a talker has more energy than the background. Other models, however, require the recovery of the masked regions. To clarify this issue, we directly recorded from primary and non-primary auditory cortex (AC) in neurosurgical patients as they attended to one talker in multitalker speech and trained temporal response function models to predict high-gamma neural activity from glimpsed and masked stimulus features. We found that glimpsed speech is encoded at the level of phonetic features for target and non-target talkers, with enhanced encoding of target speech in non-primary AC. In contrast, encoding of masked phonetic features was found only for the target, with a greater response latency and distinct anatomical organization compared to glimpsed phonetic features. These findings suggest separate mechanisms for encoding glimpsed and masked speech and provide neural evidence for the glimpsing model of speech perception.
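As a rough illustration of the temporal response function (TRF) approach mentioned above, the sketch below fits a ridge regression from time-lagged stimulus features to one electrode's high-gamma trace. The sampling rate, lag range, feature matrix, and regularization strength are assumptions for illustration, not the study's settings.

import numpy as np
from sklearn.linear_model import Ridge

def lag_matrix(stim, max_lag):
    # Stack time-lagged copies of the (time x feature) stimulus into (time x feature*lags)
    T, F = stim.shape
    X = np.zeros((T, F * max_lag))
    for lag in range(max_lag):
        X[lag:, lag * F:(lag + 1) * F] = stim[:T - lag]
    return X

fs = 100                                   # placeholder sampling rate (Hz)
stim = np.random.rand(60 * fs, 8)          # hypothetical glimpsed/masked feature matrix
neural = np.random.randn(60 * fs)          # hypothetical high-gamma envelope, one electrode

X = lag_matrix(stim, max_lag=40)           # lags covering roughly 0-400 ms
trf = Ridge(alpha=1.0).fit(X, neural)
weights = trf.coef_.reshape(40, 8)         # TRF as a (lag x feature) filter
print(weights.shape)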


Subject(s)
Speech Perception , Speech , Humans , Speech/physiology , Acoustic Stimulation , Phonetics , Speech Perception/physiology , Reaction Time
3.
J Neurosci ; 42(17): 3648-3658, 2022 04 27.
Article in English | MEDLINE | ID: mdl-35347046

ABSTRACT

Speech perception in noise is a challenging everyday task with which many listeners have difficulty. Here, we report a case in which electrical brain stimulation of implanted intracranial electrodes in the left planum temporale (PT) of a neurosurgical patient significantly and reliably improved the subjective quality (up to 50%) and objective intelligibility (up to 97%) of speech-in-noise perception. Stimulation resulted in a selective enhancement of speech sounds compared with the background noises. The receptive fields of the PT sites whose stimulation improved speech perception were tuned to spectrally broad and rapidly changing sounds. Corticocortical evoked potential analysis revealed that the PT sites were located between the sites in Heschl's gyrus and the superior temporal gyrus. Moreover, the discriminability of speech from nonspeech sounds increased in population neural responses from Heschl's gyrus to the PT to the superior temporal gyrus sites. These findings causally implicate the PT in background noise suppression and may point to a novel neuroprosthetic solution to assist in the challenging task of speech perception in noise. SIGNIFICANCE STATEMENT: Speech perception in noise remains a challenging task for many individuals. Here, we present a case in which electrical brain stimulation of intracranially implanted electrodes in the planum temporale of a neurosurgical patient significantly improved both the subjective quality (up to 50%) and objective intelligibility (up to 97%) of speech perception in noise. Stimulation resulted in a selective enhancement of speech sounds compared with the background noises. Our local and network-level functional analyses placed the planum temporale sites between the sites in the primary auditory areas in Heschl's gyrus and nonprimary auditory areas in the superior temporal gyrus. These findings causally implicate the planum temporale in acoustic scene analysis and suggest potential neuroprosthetic applications to assist hearing in noise.


Subject(s)
Auditory Cortex , Speech Perception , Acoustic Stimulation , Auditory Cortex/physiology , Brain , Brain Mapping/methods , Hearing , Humans , Magnetic Resonance Imaging/methods , Speech/physiology , Speech Perception/physiology
4.
J Neurosci ; 42(32): 6285-6294, 2022 08 10.
Article in English | MEDLINE | ID: mdl-35790403

ABSTRACT

Neuronal coherence is thought to be a fundamental mechanism of communication in the brain, where synchronized field potentials coordinate synaptic and spiking events to support plasticity and learning. Although the spread of field potentials has garnered great interest, little is known about the spatial reach of phase synchronization, or neuronal coherence. Functional connectivity between different brain regions is known to occur across long distances, but the locality of synchronization across the neocortex is understudied. Here we used simultaneous recordings from electrocorticography (ECoG) grids and high-density microelectrode arrays to estimate the spatial reach of neuronal coherence and spike-field coherence (SFC) across frontal, temporal, and occipital cortices during cognitive tasks in humans. We observed the strongest coherence within a 2-3 cm distance from the microelectrode arrays, potentially defining an effective range for local communication. This range was relatively consistent across brain regions, spectral frequencies, and cognitive tasks. The magnitude of coherence showed power-law decay with increasing distance from the microelectrode arrays, where the highest coherence occurred between ECoG contacts, followed by coherence between ECoG and deep cortical local field potential (LFP), and then SFC (i.e., ECoG > LFP > SFC). The spectral frequency of coherence also affected its magnitude. Alpha coherence (8-14 Hz) was generally higher than other frequencies for signals nearest the microelectrode arrays, whereas delta coherence (1-3 Hz) was higher for signals that were farther away. Action potentials in all brain regions were most coherent with the phase of alpha oscillations, which suggests that alpha waves could play a larger, more spatially local role in spike timing than other frequencies. These findings provide a deeper understanding of the spatial and spectral dynamics of neuronal synchronization, further advancing knowledge about how activity propagates across the human brain. SIGNIFICANCE STATEMENT: Coherence is theorized to facilitate information transfer across cerebral space by providing a convenient electrophysiological mechanism to modulate membrane potentials in spatiotemporally complex patterns. Our work uses a multiscale approach to evaluate the spatial reach of phase coherence and spike-field coherence during cognitive tasks in humans. Locally, coherence can reach up to 3 cm around a given area of neocortex. The spectral properties of coherence revealed that alpha phase-field and spike-field coherence were higher within ranges <2 cm, whereas lower-frequency delta coherence was higher for contacts farther away. Spatiotemporally shared information (i.e., coherence) across neocortex seems to reach farther than field potentials alone.
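A minimal sketch of the coherence-versus-distance measurement described above: compute magnitude-squared coherence between a reference contact and progressively "farther" contacts and summarize it in the alpha band. The synthetic signals and the distance-dependent mixing are assumptions, not the recorded data.

import numpy as np
from scipy.signal import coherence

fs = 1000
t = np.arange(0, 30, 1 / fs)
shared_alpha = np.sin(2 * np.pi * 10 * t)                   # shared 10 Hz rhythm (toy)

def synthetic_contact(strength):
    return strength * shared_alpha + np.random.randn(t.size)

reference = synthetic_contact(1.0)
for dist_cm in (0.5, 1.0, 2.0, 3.0, 5.0):                    # hypothetical contact distances
    other = synthetic_contact(1.0 / dist_cm)                 # weaker shared signal farther away
    f, Cxy = coherence(reference, other, fs=fs, nperseg=1024)
    alpha_band = (f >= 8) & (f <= 14)
    print(f"{dist_cm:.1f} cm: alpha coherence = {Cxy[alpha_band].mean():.2f}")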


Subject(s)
Neocortex , Action Potentials/physiology , Electrocorticography , Humans , Microelectrodes , Neurons/physiology
5.
Neuroimage ; 266: 119819, 2023 02 01.
Article in English | MEDLINE | ID: mdl-36529203

ABSTRACT

The human auditory system displays a robust capacity to adapt to sudden changes in background noise, allowing for continuous speech comprehension despite changes in background environments. However, despite comprehensive studies characterizing this ability, the computations that underlie this process are not well understood. The first step towards understanding a complex system is to propose a suitable model, but the classical and easily interpreted model for the auditory system, the spectro-temporal receptive field (STRF), cannot match the nonlinear neural dynamics involved in noise adaptation. Here, we utilize a deep neural network (DNN) to model neural adaptation to noise, illustrating its effectiveness at reproducing the complex dynamics at the levels of both individual electrodes and the cortical population. By closely inspecting the model's STRF-like computations over time, we find that the model alters both the gain and shape of its receptive field when adapting to a sudden noise change. We show that the DNN model's gain changes allow it to perform adaptive gain control, while the spectro-temporal change creates noise filtering by altering the inhibitory region of the model's receptive field. Further, we find that models of electrodes in nonprimary auditory cortex also exhibit noise filtering changes in their excitatory regions, suggesting differences in noise filtering mechanisms along the cortical hierarchy. These findings demonstrate the capability of deep neural networks to model complex neural adaptation and offer new hypotheses about the computations the auditory cortex performs to enable noise-robust speech perception in real-world, dynamic environments.
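The DNN modelling idea can be sketched with a small convolutional network that maps a recent spectrogram window to one electrode's response. The architecture, input sizes, and use of PyTorch below are assumptions rather than the model reported in the paper.

import torch
import torch.nn as nn

class SpectrogramEncoder(nn.Module):
    # Toy stand-in for a dynamic, STRF-like encoding model of one electrode
    def __init__(self, n_freq=32, n_lags=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * n_freq * n_lags, 1),     # scalar response for this window
        )

    def forward(self, spec_window):                 # (batch, 1, n_freq, n_lags)
        return self.net(spec_window).squeeze(-1)

model = SpectrogramEncoder()
windows = torch.randn(8, 1, 32, 40)                 # eight random spectrogram windows
print(model(windows).shape)                          # torch.Size([8])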


Subject(s)
Auditory Cortex , Humans , Acoustic Stimulation/methods , Auditory Perception , Neurons , Neural Networks, Computer
6.
J Neurosci ; 40(44): 8530-8542, 2020 10 28.
Article in English | MEDLINE | ID: mdl-33023923

ABSTRACT

Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs. SIGNIFICANCE STATEMENT: Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
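One standard way to quantify the phase effects discussed above is inter-trial phase coherence: band-pass the signal, take the analytic phase, and measure phase consistency across trials time-locked to an event. The frequency band, sampling rate, and synthetic trials below are illustrative assumptions, not the study's analysis.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 500
trials = np.random.randn(50, 2 * fs)                 # 50 toy trials, 2 s each, one channel

b, a = butter(4, [1, 8], btype="bandpass", fs=fs)    # slow-oscillation band (placeholder)
phase = np.angle(hilbert(filtfilt(b, a, trials, axis=-1), axis=-1))

itc = np.abs(np.mean(np.exp(1j * phase), axis=0))    # inter-trial phase coherence over time
print(itc.shape, float(itc.max()))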


Subject(s)
Auditory Cortex/physiology , Evoked Potentials/physiology , Nonverbal Communication/physiology , Adult , Drug Resistant Epilepsy/surgery , Electrocorticography , Evoked Potentials, Auditory/physiology , Evoked Potentials, Visual/physiology , Female , Humans , Middle Aged , Neurons/physiology , Nonverbal Communication/psychology , Photic Stimulation , Young Adult
7.
Neuroimage ; 227: 117586, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33346131

ABSTRACT

Acquiring a new language requires individuals to simultaneously and gradually learn linguistic attributes on multiple levels. Here, we investigated how this learning process changes the neural encoding of natural speech by assessing the encoding of the linguistic feature hierarchy in second-language listeners. Electroencephalography (EEG) signals were recorded from native Mandarin speakers with varied English proficiency and from native English speakers while they listened to audio-stories in English. We measured the temporal response functions (TRFs) for acoustic, phonemic, phonotactic, and semantic features in individual participants and found a main effect of proficiency on linguistic encoding. This effect of second-language proficiency was particularly prominent for the neural encoding of phonemes, showing stronger encoding of "new" phonemic contrasts (i.e., English contrasts that do not exist in Mandarin) with increasing proficiency. Overall, we found that the nonnative listeners with higher proficiency levels had a linguistic feature representation more similar to that of native listeners, which enabled the accurate decoding of language proficiency. This result advances our understanding of the cortical processing of linguistic information in second-language learners and provides an objective measure of language proficiency.
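A highly simplified sketch of the proficiency-decoding idea: represent each listener by their flattened phoneme TRF weights and classify proficiency group with cross-validation. The shapes, binary grouping, and logistic-regression classifier are placeholders, not the analysis reported here.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

n_subjects, n_weights = 40, 320
trf_weights = np.random.randn(n_subjects, n_weights)    # hypothetical per-subject phoneme TRFs
proficiency = np.random.randint(0, 2, n_subjects)       # 0 = lower, 1 = higher proficiency group

scores = cross_val_score(LogisticRegression(max_iter=1000), trf_weights, proficiency, cv=5)
print(f"cross-validated decoding accuracy: {scores.mean():.2f}")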


Subject(s)
Brain/physiology , Comprehension/physiology , Multilingualism , Speech Perception/physiology , Adolescent , Adult , Electroencephalography , Female , Humans , Language , Male , Middle Aged , Phonetics , Young Adult
8.
Neuroimage ; 235: 118003, 2021 07 15.
Article in English | MEDLINE | ID: mdl-33789135

ABSTRACT

Heschl's gyrus (HG) is a brain area that includes the primary auditory cortex in humans. Due to the limitations in obtaining direct neural measurements from this region during naturalistic speech listening, the functional organization and the role of HG in speech perception remain uncertain. Here, we used intracranial EEG to directly record neural activity in HG in eight neurosurgical patients as they listened to continuous speech stories. We studied the spatial distribution of acoustic tuning and the organization of linguistic feature encoding. We found a main gradient of change from posteromedial to anterolateral parts of HG: along this axis, tuning to frequency and temporal modulation decreased, while phonemic representation, speaker normalization, speech sensitivity, and response latency increased. We did not observe a difference between the two brain hemispheres. These findings reveal a functional role for HG in processing and transforming simple to complex acoustic features and inform neurophysiological models of speech processing in the human auditory cortex.


Subject(s)
Auditory Cortex/physiology , Brain Mapping , Speech Perception/physiology , Adult , Electrocorticography , Epilepsy/diagnosis , Epilepsy/surgery , Female , Humans , Male , Middle Aged , Neurosurgical Procedures
9.
Neuroimage ; 223: 117282, 2020 12.
Article in English | MEDLINE | ID: mdl-32828921

ABSTRACT

Hearing-impaired people often struggle to follow the speech stream of an individual talker in noisy environments. Recent studies show that the brain tracks attended speech and that the attended talker can be decoded from neural data on a single-trial level. This raises the possibility of "neuro-steered" hearing devices in which the brain-decoded intention of a hearing-impaired listener is used to enhance the voice of the attended speaker from a speech separation front-end. So far, methods that use this paradigm have focused on optimizing the brain decoding and the acoustic speech separation independently. In this work, we propose a novel framework called brain-informed speech separation (BISS) in which the information about the attended speech, as decoded from the subject's brain, is directly used to perform speech separation in the front-end. We present a deep learning model that uses neural data to extract the clean audio signal that a listener is attending to from a multi-talker speech mixture. We show that the framework can be applied successfully to the decoded output from either invasive intracranial electroencephalography (iEEG) or non-invasive electroencephalography (EEG) recordings from hearing-impaired subjects. It also results in improved speech separation, even in scenes with background noise. The generalization capability of the system renders it a perfect candidate for neuro-steered hearing-assistive devices.
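A conceptual sketch of the brain-informed separation idea: a decoded attended-speech envelope conditions a mask-estimation network applied to the mixture spectrogram. The PyTorch architecture and tensor shapes below are assumptions and not the BISS model itself.

import torch
import torch.nn as nn

class BrainInformedMasker(nn.Module):
    # Toy mask estimator conditioned on a brain-decoded envelope (illustrative only)
    def __init__(self, n_freq=128, hidden=64):
        super().__init__()
        self.mix_proj = nn.Linear(n_freq, hidden)
        self.env_proj = nn.Linear(1, hidden)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, mix_spec, decoded_env):
        # mix_spec: (batch, time, n_freq); decoded_env: (batch, time, 1)
        h = torch.relu(self.mix_proj(mix_spec) + self.env_proj(decoded_env))
        return torch.sigmoid(self.out(h))             # time-frequency mask for the attended talker

model = BrainInformedMasker()
mask = model(torch.randn(2, 100, 128), torch.randn(2, 100, 1))
print(mask.shape)                                      # torch.Size([2, 100, 128])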


Subject(s)
Brain/physiology , Electroencephalography , Signal Processing, Computer-Assisted , Speech Acoustics , Speech Perception/physiology , Acoustic Stimulation , Adult , Algorithms , Deep Learning , Hearing Loss/physiopathology , Humans , Middle Aged
10.
Nature ; 495(7441): 327-32, 2013 Mar 21.
Article in English | MEDLINE | ID: mdl-23426266

ABSTRACT

Speaking is one of the most complex actions that we perform, but nearly all of us learn to do it effortlessly. Production of fluent speech requires the precise, coordinated movement of multiple articulators (for example, the lips, jaw, tongue and larynx) over rapid time scales. Here we used high-resolution, multi-electrode cortical recordings during the production of consonant-vowel syllables to determine the organization of speech sensorimotor cortex in humans. We found speech-articulator representations that are arranged somatotopically on ventral pre- and post-central gyri, and that partially overlap at individual electrodes. These representations were coordinated temporally as sequences during syllable production. Spatial patterns of cortical activity showed an emergent, population-level representation, which was organized by phonetic features. Over tens of milliseconds, the spatial patterns transitioned between distinct representations for different consonants and vowels. These results reveal the dynamic organization of speech sensorimotor cortex during the generation of multi-articulator movements that underlies our ability to speak.
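The population-level analysis can be caricatured with a short sketch: take the spatial activity pattern for each syllable, reduce it with PCA, and inspect how the projections group by phonetic feature. Syllable labels, electrode count, and data are placeholders, not the study's recordings.

import numpy as np
from sklearn.decomposition import PCA

n_electrodes = 80                                       # hypothetical electrode count
syllables = {"ba": "labial", "da": "coronal", "ga": "dorsal",
             "pa": "labial", "ta": "coronal", "ka": "dorsal"}
patterns = {s: np.random.rand(n_electrodes) for s in syllables}   # toy mean activity per syllable

pcs = PCA(n_components=2).fit_transform(np.vstack(list(patterns.values())))
for (syl, place), pc in zip(syllables.items(), pcs):
    print(f"/{syl}/ ({place}): PC1 = {pc[0]:+.2f}, PC2 = {pc[1]:+.2f}")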


Subject(s)
Cerebral Cortex/physiology , Speech/physiology , Electromagnetic Phenomena , Feedback, Sensory/physiology , Humans , Phonetics , Principal Component Analysis , Time Factors
11.
J Acoust Soc Am ; 146(6): EL459, 2019 12.
Article in English | MEDLINE | ID: mdl-31893764

ABSTRACT

The analyses of more than 2000 spectro-temporal receptive fields (STRFs) obtained from the ferret primary auditory cortex revealed their dominant encoding properties in the time and frequency domains. Results showed that the peak responses of the STRFs were roughly aligned along the time axis, and enhanced low modulation components of the signal around 3 Hz. In contrast, the peaks of the STRF population along the frequency axis varied and were distributed widely. Further analyses revealed that there were some general properties along the frequency axis around the best frequency of each STRF. The STRFs enhanced modulation frequencies around 0.25 cycles/octave. These findings are consistent with techniques reported in the literature to improve speech recognition performance, highlighting the potential of biological insights to inform the design of better machines.
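The modulation analysis summarized above can be approximated by taking the 2-D Fourier transform of an STRF, which gives its temporal (Hz) and spectral (cycles/octave) modulation content. The toy STRF and the time/frequency resolutions below are assumptions for illustration.

import numpy as np

n_freq, n_time = 32, 50                    # assume 1/8 octave per channel, 10 ms per time bin
strf = np.random.randn(n_freq, n_time)     # placeholder STRF (frequency x time lag)

mtf = np.abs(np.fft.fftshift(np.fft.fft2(strf)))
rates = np.fft.fftshift(np.fft.fftfreq(n_time, d=0.010))    # temporal modulations (Hz)
scales = np.fft.fftshift(np.fft.fftfreq(n_freq, d=0.125))   # spectral modulations (cyc/oct)

i_scale, i_rate = np.unravel_index(np.argmax(mtf), mtf.shape)
print(f"peak modulation near {abs(rates[i_rate]):.1f} Hz and {abs(scales[i_scale]):.2f} cyc/oct")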


Subject(s)
Action Potentials/physiology , Auditory Cortex/physiology , Auditory Perception/physiology , Evoked Potentials, Auditory/physiology , Acoustic Stimulation/methods , Animals , Ferrets , Models, Neurological , Neurons/physiology
12.
J Neurosci ; 37(8): 2176-2185, 2017 02 22.
Article in English | MEDLINE | ID: mdl-28119400

ABSTRACT

Humans are unique in their ability to communicate using spoken language. However, it remains unclear how the speech signal is transformed and represented in the brain at different stages of the auditory pathway. In this study, we characterized electroencephalography responses to continuous speech by obtaining the time-locked responses to phoneme instances (phoneme-related potential). We showed that responses to different phoneme categories are organized by phonetic features. We found that each instance of a phoneme in continuous speech produces multiple distinguishable neural responses occurring as early as 50 ms and as late as 400 ms after the phoneme onset. Comparing the patterns of phoneme similarity in the neural responses and the acoustic signals confirms a repetitive appearance of acoustic distinctions of phonemes in the neural data. Analysis of the phonetic and speaker information in neural activations revealed that different time intervals jointly encode the acoustic similarity of both phonetic and speaker categories. These findings provide evidence for a dynamic neural transformation of low-level speech features as they propagate along the auditory pathway, and form an empirical framework to study the representational changes in learning, attention, and speech disorders. SIGNIFICANCE STATEMENT: We characterized the properties of evoked neural responses to phoneme instances in continuous speech. We show that each instance of a phoneme in continuous speech produces several observable neural responses at different times occurring as early as 50 ms and as late as 400 ms after the phoneme onset. Each temporal event explicitly encodes the acoustic similarity of phonemes, and linguistic and nonlinguistic information are best represented at different time intervals. Finally, we show a joint encoding of phonetic and speaker information, where the neural representation of speakers is dependent on phoneme category. These findings provide compelling new evidence for dynamic processing of speech sounds in the auditory pathway.
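At its core, the phoneme-related potential described here is an average of EEG epochs time-locked to phoneme onsets; the sketch below shows that step with a placeholder sampling rate, channel count, and onset times.

import numpy as np

fs = 128
eeg = np.random.randn(62, 600 * fs)                       # channels x samples (toy recording)
onsets = {"plosives": [12.31, 14.02, 20.55],               # hypothetical onset times in seconds
          "vowels":   [12.45, 14.20, 20.70]}

def phoneme_related_potential(eeg, times, pre=0.1, post=0.4):
    # Average channel x time epochs from -pre to +post seconds around each onset
    pre_s, post_s = int(pre * fs), int(post * fs)
    epochs = [eeg[:, int(t * fs) - pre_s: int(t * fs) + post_s] for t in times]
    return np.mean(epochs, axis=0)

prp = {cat: phoneme_related_potential(eeg, t) for cat, t in onsets.items()}
print(prp["vowels"].shape)                                 # (channels, window samples)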


Subject(s)
Brain Mapping , Evoked Potentials, Auditory/physiology , Phonetics , Speech Perception/physiology , Speech/physiology , Acoustic Stimulation , Acoustics , Electroencephalography , Female , Humans , Language , Male , Reaction Time , Statistics as Topic , Time Factors
13.
Nature ; 485(7397): 233-6, 2012 May 10.
Article in English | MEDLINE | ID: mdl-22522927

ABSTRACT

Humans possess a remarkable ability to attend to a single speaker's voice in a multi-talker background. How the auditory system manages to extract intelligible speech under such acoustically complex and adverse listening conditions is not known, and, indeed, it is not clear how attended speech is internally represented. Here, using multi-electrode surface recordings from the cortex of subjects engaged in a listening task with two simultaneous speakers, we demonstrate that population responses in non-primary human auditory cortex encode critical features of attended speech: speech spectrograms reconstructed based on cortical responses to the mixture of speakers reveal the salient spectral and temporal features of the attended speaker, as if subjects were listening to that speaker alone. A simple classifier trained solely on examples of single speakers can decode both attended words and speaker identity. We find that task performance is well predicted by a rapid increase in attention-modulated neural selectivity across both single-electrode and population-level cortical responses. These findings demonstrate that the cortical representation of speech does not merely reflect the external acoustic environment, but instead gives rise to the perceptual aspects relevant for the listener's intended goal.
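The decoding logic can be sketched as stimulus reconstruction followed by a correlation comparison: map the recorded responses back to a spectrogram with a linear decoder and ask which talker's clean spectrogram it resembles more. Everything below (random data, least-squares decoder, training on one talker as a stand-in) is a placeholder for illustration, not the study's classifier.

import numpy as np

T, n_elec, n_freq = 5000, 64, 32
neural = np.random.randn(T, n_elec)                   # toy responses to the two-talker mixture
spk_a = np.random.rand(T, n_freq)                     # clean spectrogram, talker A
spk_b = np.random.rand(T, n_freq)                     # clean spectrogram, talker B

# In practice the decoder is trained on separate single-talker data; talker A is a stand-in here.
decoder, *_ = np.linalg.lstsq(neural, spk_a, rcond=None)
recon = neural @ decoder

def corr(x, y):
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]

attended = "A" if corr(recon, spk_a) > corr(recon, spk_b) else "B"
print("decoded attended talker:", attended)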


Subject(s)
Acoustic Stimulation , Attention/physiology , Auditory Cortex/physiology , Speech Perception/physiology , Speech , Acoustics , Electrodes , Female , Humans , Language , Male , Models, Neurological , Noise , Sound Spectrography
14.
J Neurosci ; 36(49): 12338-12350, 2016 12 07.
Article in English | MEDLINE | ID: mdl-27927954

ABSTRACT

A primary goal of auditory neuroscience is to identify the sound features extracted and represented by auditory neurons. Linear encoding models, which describe neural responses as a function of the stimulus, have been primarily used for this purpose. Here, we provide theoretical arguments and experimental evidence in support of an alternative approach, based on decoding the stimulus from the neural response. We used a Bayesian normative approach to predict the responses of neurons detecting relevant auditory features, despite ambiguities and noise. We compared the model predictions to recordings from the primary auditory cortex of ferrets and found that: (1) the decoding filters of auditory neurons resemble the filters learned from the statistics of speech sounds; (2) the decoding model captures the dynamics of responses better than a linear encoding model of similar complexity; and (3) the decoding model accounts for the accuracy with which the stimulus is represented in neural activity, whereas the linear encoding model performs very poorly. Most importantly, our model predicts that neuronal responses are fundamentally shaped by "explaining away," a divisive competition between alternative interpretations of the auditory scene. SIGNIFICANCE STATEMENT: Neural responses in the auditory cortex are dynamic, nonlinear, and hard to predict. Traditionally, encoding models have been used to describe neural responses as a function of the stimulus. However, in addition to external stimulation, neural activity is strongly modulated by the responses of other neurons in the network. We hypothesized that auditory neurons aim to collectively decode their stimulus. In particular, a stimulus feature that is decoded (or explained away) by one neuron is not explained by another. We demonstrated that this novel Bayesian decoding model is better at capturing the dynamic responses of cortical neurons in ferrets. Whereas the linear encoding model poorly reflects selectivity of neurons, the decoding model can account for the strong nonlinearities observed in neural data.
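The "explaining away" idea can be made concrete with a toy competition between two units whose decoding filters overlap: iterative non-negative inference lets one unit absorb the stimulus energy so the other stays silent. The filters, stimulus, and update rule are illustrative assumptions, not the paper's Bayesian model.

import numpy as np

filters = np.array([[1.0, 1.0, 0.0],                  # unit A prefers features 1-2
                    [0.0, 1.0, 1.0]])                 # unit B prefers features 2-3
stimulus = np.array([1.0, 1.0, 0.0])                  # exactly unit A's preferred pattern

responses = np.zeros(2)
for _ in range(200):                                   # simple projected-gradient inference
    residual = stimulus - filters.T @ responses        # what is still unexplained
    responses = np.maximum(responses + 0.1 * (filters @ residual), 0.0)

print(responses.round(2))                              # unit A responds; unit B is explained away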


Subject(s)
Auditory Perception/physiology , Ferrets/physiology , Sensory Receptor Cells/physiology , Acoustic Stimulation , Algorithms , Animals , Auditory Cortex/physiology , Bayes Theorem , Female , Male , Models, Neurological , Nerve Net/cytology , Nerve Net/physiology , Noise , Phonetics
15.
J Neurosci ; 36(6): 2014-26, 2016 Feb 10.
Article in English | MEDLINE | ID: mdl-26865624

ABSTRACT

The human superior temporal gyrus (STG) is critical for speech perception, yet the organization of spectrotemporal processing of speech within the STG is not well understood. Here, to characterize the spatial organization of spectrotemporal processing of speech across human STG, we use high-density cortical surface field potential recordings while participants listened to natural continuous speech. While synthetic broad-band stimuli did not yield sustained activation of the STG, spectrotemporal receptive fields could be reconstructed from vigorous responses to speech stimuli. We find that the human STG displays a robust anterior-posterior spatial distribution of spectrotemporal tuning in which the posterior STG is tuned for temporally fast-varying speech sounds that have relatively constant energy across the frequency axis (low spectral modulation), while the anterior STG is tuned for temporally slow-varying speech sounds that have a high degree of spectral variation across the frequency axis (high spectral modulation). This work illustrates the organization of spectrotemporal processing in the human STG and illuminates the processing of ethologically relevant speech signals in a region of the brain specialized for speech perception. SIGNIFICANCE STATEMENT: Considerable evidence has implicated the human superior temporal gyrus (STG) in speech processing. However, the gross organization of spectrotemporal processing of speech within the STG is not well characterized. Here we use natural speech stimuli and advanced receptive field characterization methods to show that spectrotemporal features within speech are well organized along the posterior-to-anterior axis of the human STG. These findings demonstrate robust functional organization based on spectrotemporal modulation content, and illustrate that much of the encoded information in the STG represents the physical acoustic properties of speech stimuli.


Subject(s)
Speech Perception/physiology , Temporal Lobe/physiology , Acoustic Stimulation , Adult , Algorithms , Brain Mapping , Energy Metabolism/physiology , Evoked Potentials/physiology , Female , Humans , Male , Phonetics
16.
Proc Natl Acad Sci U S A ; 111(18): 6792-7, 2014 May 06.
Article in English | MEDLINE | ID: mdl-24753585

ABSTRACT

Humans and animals can reliably perceive behaviorally relevant sounds in noisy and reverberant environments, yet the neural mechanisms behind this phenomenon are largely unknown. To understand how neural circuits represent degraded auditory stimuli with additive and reverberant distortions, we compared single-neuron responses in ferret primary auditory cortex to speech and vocalizations in four conditions: clean, additive white and pink (1/f) noise, and reverberation. Despite substantial distortion, responses of neurons to the vocalization signal remained stable, maintaining the same statistical distribution in all conditions. Stimulus spectrograms reconstructed from population responses to the distorted stimuli resembled more the original clean than the distorted signals. To explore mechanisms contributing to this robustness, we simulated neural responses using several spectrotemporal receptive field models that incorporated either a static nonlinearity or subtractive synaptic depression and multiplicative gain normalization. The static model failed to suppress the distortions. A dynamic model incorporating feed-forward synaptic depression could account for the reduction of additive noise, but only the combined model with feedback gain normalization was able to predict the effects across both additive and reverberant conditions. Thus, both mechanisms can contribute to the abilities of humans and animals to extract relevant sounds in diverse noisy environments.
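A toy version of the two dynamic mechanisms named above, applied to the output of a linear STRF stage: a subtractive term that tracks recent input (standing in for feed-forward synaptic depression) and a slowly updating divisive term (standing in for feedback gain normalization). The time constants and functional forms are assumptions for illustration, not the paper's fitted models.

import numpy as np

fs = 100
linear_drive = np.abs(np.random.randn(10 * fs))        # placeholder for an STRF's linear output

def depress_and_normalize(drive, tau_dep=0.2, tau_gain=1.0, strength=0.5):
    out = np.zeros_like(drive)
    depression, gain = 0.0, 1.0
    for t, x in enumerate(drive):
        depression += (strength * x - depression) / (tau_dep * fs)   # tracks recent input
        gain += (x - gain) / (tau_gain * fs)                         # slow running gain estimate
        out[t] = max(x - depression, 0.0) / (1.0 + gain)             # subtract, then divide
    return out

print(depress_and_normalize(linear_drive)[:5].round(3))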


Subject(s)
Auditory Cortex/physiology , Speech Perception/physiology , Acoustic Stimulation , Animals , Female , Ferrets/physiology , Humans , Models, Neurological , Neurons/physiology , Noise , Nonlinear Dynamics , Vocalization, Animal
17.
Cereb Cortex ; 25(7): 1697-706, 2015 Jul.
Article in English | MEDLINE | ID: mdl-24429136

ABSTRACT

How humans solve the cocktail party problem remains unknown. However, progress has been made recently thanks to the realization that cortical activity tracks the amplitude envelope of speech. This has led to the development of regression methods for studying the neurophysiology of continuous speech. One such method, known as stimulus-reconstruction, has been successfully utilized with cortical surface recordings and magnetoencephalography (MEG). However, the former is invasive and gives a relatively restricted view of processing along the auditory hierarchy, whereas the latter is expensive and rare. Thus it would be extremely useful for research in many populations if stimulus-reconstruction were effective using electroencephalography (EEG), a widely available and inexpensive technology. Here we show that single-trial (≈60 s) unaveraged EEG data can be decoded to determine attentional selection in a naturalistic multispeaker environment. Furthermore, we show a significant correlation between our EEG-based measure of attention and performance on a high-level attention task. In addition, by attempting to decode attention at individual latencies, we identify neural processing at ∼200 ms as being critical for solving the cocktail party problem. These findings open up new avenues for studying the ongoing dynamics of cognition using EEG and for developing effective and natural brain-computer interfaces.


Subject(s)
Attention/physiology , Brain/physiology , Electroencephalography/methods , Signal Processing, Computer-Assisted , Speech Perception/physiology , Acoustic Stimulation , Adult , Female , Humans , Male , Neuropsychological Tests , Time Factors
18.
PLoS Biol ; 10(1): e1001251, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22303281

ABSTRACT

How the human auditory system extracts perceptually relevant acoustic features of speech is unknown. To address this question, we used intracranial recordings from nonprimary auditory cortex in the human superior temporal gyrus to determine what acoustic information in speech sounds can be reconstructed from population neural activity. We found that slow and intermediate temporal fluctuations, such as those corresponding to syllable rate, were accurately reconstructed using a linear model based on the auditory spectrogram. However, reconstruction of fast temporal fluctuations, such as syllable onsets and offsets, required a nonlinear sound representation based on temporal modulation energy. Reconstruction accuracy was highest within the range of spectro-temporal fluctuations that have been found to be critical for speech intelligibility. The decoded speech representations allowed readout and identification of individual words directly from brain activity during single trial sound presentations. These findings reveal neural encoding mechanisms of speech acoustic parameters in higher order human auditory cortex.


Subject(s)
Auditory Cortex/physiology , Brain Mapping , Speech Acoustics , Algorithms , Computer Simulation , Electrodes, Implanted , Electroencephalography , Female , Humans , Linear Models , Male , Models, Biological
19.
J Acoust Soc Am ; 135(3): 1380-91, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24606276

ABSTRACT

Sounds such as the voice or musical instruments can be recognized on the basis of timbre alone. Here, sound recognition was investigated with severely reduced timbre cues. Short snippets of naturally recorded sounds were extracted from a large corpus. Listeners were asked to report a target category (e.g., sung voices) among other sounds (e.g., musical instruments). All sound categories covered the same pitch range, so the task had to be solved on timbre cues alone. The minimum duration for which performance was above chance was found to be short, on the order of a few milliseconds, with the best performance for voice targets. Performance was independent of pitch and was maintained when stimuli contained less than a full waveform cycle. Recognition was not generally better when the sound snippets were time-aligned with the sound onset compared to when they were extracted with a random starting time. Finally, performance did not depend on feedback or training, suggesting that the cues used by listeners in the artificial gating task were similar to those relevant for longer, more familiar sounds. The results show that timbre cues for sound recognition are available at a variety of time scales, including very short ones.


Subject(s)
Cues , Discrimination, Psychological , Pitch Discrimination , Recognition, Psychology , Acoustic Stimulation , Adult , Analysis of Variance , Audiometry , Feedback, Psychological , Female , Humans , Male , Music , Sensory Gating , Signal Detection, Psychological , Singing , Sound Spectrography , Time Factors , Voice Quality , Young Adult
20.
Article in English | MEDLINE | ID: mdl-39049981

ABSTRACT

In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode the 'what' and 'where' of spatial audio. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods to randomly swap the order of the microphones and mask Mel and GCC channels. By using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We also perform a comprehensive analysis of the effect of each augmentation method and a comparison of the fine-tuning performance using different amounts of labeled data.
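A small sketch of the channel-wise augmentations described above: randomly permute the microphone order of the Mel features and zero out whole Mel/GCC channels. The array shapes, masking probability, and zero fill value are assumptions; a faithful implementation would also reorder GCC pairs consistently with the microphone permutation, which is omitted here.

import numpy as np

rng = np.random.default_rng(0)

def channel_augment(mel, gcc, p_mask=0.3):
    # mel: (n_mics, n_mel, T); gcc: (n_pairs, n_lags, T) -- toy shapes
    mel = mel[rng.permutation(mel.shape[0])]           # swap microphone order
    gcc = gcc.copy()
    for feat in (mel, gcc):                            # mask whole channels at random
        feat[rng.random(feat.shape[0]) < p_mask] = 0.0
    return mel, gcc

mel = rng.standard_normal((4, 64, 100))                # 4-mic Mel spectrograms
gcc = rng.standard_normal((6, 51, 100))                # GCC features for all mic pairs
aug_mel, aug_gcc = channel_augment(mel, gcc)
print(aug_mel.shape, aug_gcc.shape)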
