Results 1 - 20 of 58
1.
Heliyon ; 10(2): e24750, 2024 Jan 30.
Article in English | MEDLINE | ID: mdl-38312568

ABSTRACT

Objective: Lipreading, which plays a major role in the communication of the hearing impaired, has lacked a standardised French tool. Our aim was to create and validate an audio-visual (AV) version of the French Matrix Sentence Test (FrMST). Design: Video recordings were created by dubbing the existing audio files. Sample: Thirty-five young, normal-hearing participants were tested in the auditory and visual modalities alone (Ao, Vo) and in AV conditions, in quiet and in noise, with open- and closed-set response formats. Results: Lipreading ability (Vo) ranged from 1% to 77% word comprehension. The absolute AV benefit was 9.25 dB SPL in quiet and 4.6 dB SNR in noise. The response format did not influence the results in the AV noise condition, except during the training phase. Lipreading ability and AV benefit were significantly correlated. Conclusions: The French video material achieved AV benefits similar to those described in the literature for AV MSTs in other languages. For clinical purposes, we suggest targeting SRT80 to avoid ceiling effects and performing two training lists in the AV condition in noise, followed by one AV list in noise, one Ao list in noise and one Vo list, in randomised order, in open- or closed-set format.

2.
Neuroimage ; 282: 120391, 2023 11 15.
Article in English | MEDLINE | ID: mdl-37757989

ABSTRACT

There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Much of the ambiguity stems from a lack of behavioral grounding and neurophysiological analyses that cannot disentangle high-level linguistic and phonetic/energetic contributions from visual speech. To address this, we recorded EEG from human observers as they watched silent videos, half of which were novel and half of which were previously rehearsed with the accompanying audio. We modeled how the EEG responses to novel and rehearsed silent speech reflected the processing of low-level visual features (motion, lip movements) and a higher-level categorical representation of linguistic units, known as visemes. The ability of these visemes to account for the EEG - beyond the motion and lip movements - was significantly enhanced for rehearsed videos in a way that correlated with participants' trial-by-trial ability to lipread that speech. Source localization of viseme processing showed clear contributions from visual cortex, with no strong evidence for the involvement of auditory areas. We interpret this as support for the idea that the visual system produces its own specialized representation of speech that is (1) well-described by categorical linguistic features, (2) dissociable from lip movements, and (3) predictive of lipreading ability. We also suggest a reinterpretation of previous findings of auditory cortical activation during silent speech that is consistent with hierarchical accounts of visual and audiovisual speech perception.


Subject(s)
Auditory Cortex , Speech Perception , Humans , Lipreading , Speech Perception/physiology , Brain/physiology , Auditory Cortex/physiology , Phonetics , Visual Perception/physiology
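The encoding-model comparison described in the abstract above (testing whether viseme categories explain EEG variance beyond motion and lip-movement features) is typically implemented as a lagged linear forward model. The sketch below illustrates that general approach with ridge regression; the feature set, lag range, regularisation strength, and toy data are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of a lagged linear encoding-model comparison (TRF-style), assuming
# precomputed per-frame stimulus features and a single EEG channel.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def lag_features(X, lags):
    """Stack time-lagged copies of feature matrix X (time x features)."""
    T, F = X.shape
    out = np.zeros((T, F * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(X, lag, axis=0)
        shifted[:max(lag, 0), :] = 0          # zero-pad instead of wrapping
        if lag < 0:
            shifted[lag:, :] = 0
        out[:, i * F:(i + 1) * F] = shifted
    return out

def encoding_score(features, eeg, lags, alpha=1e3):
    """Mean cross-validated correlation between predicted and actual EEG."""
    X = lag_features(features, lags)
    scores = []
    for train, test in KFold(n_splits=5).split(X):
        model = Ridge(alpha=alpha).fit(X[train], eeg[train])
        pred = model.predict(X[test])
        scores.append(np.corrcoef(pred, eeg[test])[0, 1])
    return np.mean(scores)

# Hypothetical inputs: 10 min of 64 Hz data, low-level vs. low-level + viseme features
rng = np.random.default_rng(0)
T = 64 * 600
motion_lip = rng.standard_normal((T, 2))       # frame motion + lip aperture
visemes = rng.standard_normal((T, 12))         # viseme category features
eeg = rng.standard_normal(T)                   # one EEG channel
lags = range(0, 32)                            # roughly 0-500 ms at 64 Hz

base = encoding_score(motion_lip, eeg, lags)
full = encoding_score(np.hstack([motion_lip, visemes]), eeg, lags)
# With random toy data the gain hovers around zero; real data would show the viseme benefit.
print(f"viseme gain in prediction r: {full - base:+.3f}")
```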
3.
Brain Sci ; 13(7)2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37508940

ABSTRACT

Traditionally, speech perception training paradigms have not adequately taken into account the possibility that there may be modality-specific requirements for perceptual learning with auditory-only (AO) versus visual-only (VO) speech stimuli. The study reported here investigated the hypothesis that there are modality-specific differences in how prior information is used by normal-hearing participants during vocoded versus VO speech training. Two different experiments, one with vocoded AO speech (Experiment 1) and one with VO (lipread) speech (Experiment 2), investigated the effects of giving different types of prior information to trainees on each trial during training. Training consisted of four ~20-min sessions, during which participants learned to label novel visual images using novel spoken words. Participants were assigned to different types of prior information during training: Word Group trainees saw a printed version of each training word (e.g., "tethon"), and Consonant Group trainees saw only its consonants (e.g., "t_th_n"). Additional groups received no prior information (i.e., Experiment 1, AO Group; Experiment 2, VO Group) or a spoken version of the stimulus in a different modality from the training stimuli (Experiment 1, Lipread Group; Experiment 2, Vocoder Group). That is, in each experiment, there was a group that received prior information in the modality of the training stimuli from the other experiment. In both experiments, the Word Groups had difficulty retaining the novel words they attempted to learn during training. However, when the training stimuli were vocoded, the Word Group improved their phoneme identification. When the training stimuli were visual speech, the Consonant Group improved their phoneme identification and their open-set sentence lipreading. The results are considered in light of theoretical accounts of perceptual learning in relation to perceptual modality.

4.
Sensors (Basel) ; 23(4)2023 Feb 12.
Article in English | MEDLINE | ID: mdl-36850669

ABSTRACT

Endangered languages generally have low-resource characteristics and, as intangible cultural resources, cannot be renewed. Automatic speech recognition (ASR) is an effective means of protecting such languages. However, for a low-resource language, native speakers are few and labeled corpora are insufficient, so ASR suffers from deficiencies including high speaker dependence and overfitting, which greatly harm recognition accuracy. To tackle these deficiencies, this paper puts forward an audiovisual speech recognition (AVSR) approach based on an LSTM-Transformer architecture. The approach introduces visual modality information, including lip movements, to reduce the dependence of acoustic models on speakers and on the quantity of data. Specifically, by fusing audio and visual information, the new approach enhances the expression of speakers' feature space, thus achieving the speaker adaptation that is difficult in a single modality. The approach also includes experiments on speaker dependence and evaluates to what extent audiovisual fusion depends on speakers. Experimental results show that the CER of AVSR is 16.9% lower than that of traditional models (optimal performance scenario) and 11.8% lower than that of lip reading alone. The accuracy of phoneme recognition, especially of finals, improves substantially. For initials, accuracy improves for affricates and fricatives, where the lip movements are obvious, and deteriorates for stops, where the lip movements are not obvious. In AVSR, generalization to different speakers is also better than in a single modality, and the CER can drop by as much as 17.2%. Therefore, AVSR is of great significance for the protection and preservation of endangered languages through AI.


Subject(s)
Acclimatization , Speech , Acoustics , Electric Power Supplies , Language
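The fusion idea in the abstract above (an LSTM over acoustic features combined with a Transformer over lip-region features, fused for recognition) can be sketched roughly as follows. The layer sizes, mean-pooling, and feature-level concatenation are assumptions for illustration; the paper's exact LSTM-Transformer topology is not specified in the abstract.

```python
# Minimal PyTorch sketch of an audio-visual fusion model in the spirit of the
# LSTM-Transformer AVSR described above; all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AVFusionModel(nn.Module):
    def __init__(self, n_audio_feats=80, n_visual_feats=512, n_classes=100):
        super().__init__()
        # Audio branch: BiLSTM over acoustic frames (e.g., filterbank features)
        self.audio_lstm = nn.LSTM(n_audio_feats, 256, num_layers=2,
                                  batch_first=True, bidirectional=True)
        # Visual branch: Transformer encoder over per-frame lip embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model=n_visual_feats, nhead=8,
                                               batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.classifier = nn.Linear(2 * 256 + n_visual_feats, n_classes)

    def forward(self, audio, visual):
        # audio: (batch, T_a, n_audio_feats); visual: (batch, T_v, n_visual_feats)
        a_out, _ = self.audio_lstm(audio)
        a_vec = a_out.mean(dim=1)                  # temporal pooling
        v_out = self.visual_encoder(visual)
        v_vec = v_out.mean(dim=1)
        fused = torch.cat([a_vec, v_vec], dim=-1)  # feature-level fusion
        return self.classifier(fused)

model = AVFusionModel()
logits = model(torch.randn(2, 200, 80), torch.randn(2, 50, 512))
print(logits.shape)   # torch.Size([2, 100])
```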
5.
Sensors (Basel) ; 23(4)2023 Feb 17.
Article in English | MEDLINE | ID: mdl-36850882

ABSTRACT

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.


Subject(s)
Gestures , Speech , Humans , Computers, Handheld , Acoustics , Computer Systems
6.
Small ; 19(17): e2205058, 2023 04.
Article in English | MEDLINE | ID: mdl-36703524

ABSTRACT

Lip-reading provides an effective speech communication interface for people with voice disorders and for intuitive human-machine interactions. Existing systems are generally challenged by bulkiness, obtrusiveness, and poor robustness against environmental interferences. The lack of a truly natural and unobtrusive system for converting lip movements to speech precludes the continuous use and wide-scale deployment of such devices. Here, the design of a hardware-software architecture to capture, analyze, and interpret lip movements associated with either normal or silent speech is presented. The system can recognize different and similar visemes. It is robust in a noisy or dark environment. Self-adhesive, skin-conformable, and semi-transparent dry electrodes are developed to track high-fidelity speech-relevant electromyogram signals without impeding daily activities. The resulting skin-like sensors can form seamless contact with the curvilinear and dynamic surfaces of the skin, which is crucial for a high signal-to-noise ratio and minimal interference. Machine learning algorithms are employed to decode electromyogram signals and convert them to spoken words. Finally, the applications of the developed lip-reading system in augmented reality and medical service are demonstrated, which illustrate the great potential in immersive interaction and healthcare applications.


Subject(s)
Movement , Skin , Humans , Electromyography/methods , Electrodes , Machine Learning
7.
Int J Audiol ; 62(12): 1155-1165, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36129442

ABSTRACT

OBJECTIVE: To understand the communicational and psychosocial effects of COVID-19 protective measures in real-life everyday communication settings. DESIGN: An online survey consisting of closed-set and open-ended questions aimed to describe the communication difficulties experienced in different communication activities (in-person and telecommunication) during the COVID-19 pandemic. STUDY SAMPLE: 172 individuals with hearing loss and 130 who reported not having a hearing loss completed the study. They were recruited through social media, private audiology clinics, hospitals and monthly newsletters sent by the non-profit organisation "Audition Quebec." RESULTS: Face masks were the most problematic protective measure for communication in 75-90% of participants. For all in-person communication activities, participants with hearing loss reported significantly more impact on communication than participants with normal hearing. They also exhibited more activity limitations and negative emotions associated with communication difficulties. CONCLUSION: These results suggest that, in times of pandemic, individuals with hearing loss are more likely to exhibit communication breakdowns in their everyday activities. This may lead to social isolation and have a deleterious effect on their mental health. When interacting with individuals with hearing loss, communication strategies to optimise speech understanding should be used.


Subject(s)
COVID-19 , Deafness , Hearing Loss , Humans , Pandemics , Hearing Loss/epidemiology , Hearing Loss/psychology , Hearing , Communication
8.
J Child Lang ; 50(1): 27-51, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36503546

ABSTRACT

This study investigates how children aged two to eight years (N = 129) and adults (N = 29) use auditory and visual speech for word recognition. The goal was to bridge the gap between apparent successes of visual speech processing in young children in visual-looking tasks, with apparent difficulties of speech processing in older children from explicit behavioural measures. Participants were presented with familiar words in audio-visual (AV), audio-only (A-only) or visual-only (V-only) speech modalities, then presented with target and distractor images, and looking to targets was measured. Adults showed high accuracy, with slightly less target-image looking in the V-only modality. Developmentally, looking was above chance for both AV and A-only modalities, but not in the V-only modality until 6 years of age (earlier on /k/-initial words). Flexible use of visual cues for lexical access develops throughout childhood.


Subject(s)
Lipreading , Speech Perception , Adult , Child , Humans , Child, Preschool , Speech , Language Development , Cues
9.
Brain Behav ; 13(2): e2869, 2023 02.
Article in English | MEDLINE | ID: mdl-36579557

ABSTRACT

INTRODUCTION: Few of us are skilled lipreaders, while most struggle with the task. The neural substrates that enable comprehension of connected natural speech via lipreading are not yet well understood. METHODS: We used a data-driven approach to identify brain areas underlying the lipreading of an 8-min narrative with participants whose lipreading skills varied extensively (range 6-100%, mean = 50.7%). The participants also listened to and read the same narrative. The similarity between individual participants' brain activity during the whole narrative, within and between conditions, was estimated by a voxel-wise comparison of the Blood Oxygenation Level Dependent (BOLD) signal time courses. RESULTS: Inter-subject correlation (ISC) of the time courses revealed that lipreading, listening to, and reading the narrative were largely supported by the same brain areas in the temporal, parietal and frontal cortices, precuneus, and cerebellum. Additionally, listening to and reading connected naturalistic speech engaged higher-level linguistic processing in the parietal and frontal cortices more consistently than lipreading did, probably paralleling the limited understanding obtained via lipreading. Importantly, higher lipreading test scores and subjective estimates of comprehension of the lipread narrative were associated with activity in the superior and middle temporal cortex. CONCLUSIONS: Our new data illustrate that findings from prior studies using well-controlled repetitive speech stimuli and stimulus-driven data analyses are also valid for naturalistic connected speech. Our results might suggest an efficient use of brain areas dealing with phonological processing in skilled lipreaders.


Subject(s)
Lipreading , Speech Perception , Humans , Female , Brain , Auditory Perception , Cognition , Magnetic Resonance Imaging
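The inter-subject correlation (ISC) analysis in the abstract above compares BOLD time courses across participants voxel by voxel. A common way to compute this is the leave-one-out formulation sketched below; the exact pairing scheme and the toy data shapes are assumptions, not the authors' pipeline.

```python
# Sketch of leave-one-out inter-subject correlation (ISC) on BOLD time courses;
# toy data shapes and the leave-one-out formulation are assumptions for illustration.
import numpy as np

def voxelwise_isc(data):
    """data: (n_subjects, n_timepoints, n_voxels) -> (n_subjects, n_voxels) ISC."""
    n_subj = data.shape[0]
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    isc = np.zeros((n_subj, data.shape[2]))
    for s in range(n_subj):
        others = np.delete(z, s, axis=0).mean(axis=0)        # mean of remaining subjects
        others = (others - others.mean(0)) / others.std(0)   # re-standardise over time
        isc[s] = (z[s] * others).mean(axis=0)                # Pearson r per voxel
    return isc

rng = np.random.default_rng(1)
bold = rng.standard_normal((20, 240, 2000))   # 20 subjects, 240 TRs, 2000 voxels
print(voxelwise_isc(bold).mean(axis=0).shape) # group-average ISC map, shape (2000,)
```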
10.
Sensors (Basel) ; 22(20)2022 Oct 12.
Article in English | MEDLINE | ID: mdl-36298089

ABSTRACT

Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user-system interaction. However, its application in real environments is limited owing to various noise disruptions. In this study, a multimode interaction system based on audio and visual information is proposed that enables speech-based interaction with virtual aquarium systems to remain robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and from vision are concatenated and classified. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and for audio speech recognition. Its average recognition rate was 91.42% when only speech was used and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.


Subject(s)
Speech Perception , Speech , Speech Recognition Software , Noise , Signal-To-Noise Ratio
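The system in the abstract above concatenates word vectors derived from a speech API with features from a vision-based network before classification. A minimal late-fusion sketch of that idea follows; the feature dimensions and the logistic-regression classifier are placeholders, not the study's actual model.

```python
# Sketch of the late-fusion idea above: concatenate a word-embedding vector from an
# audio speech-recognition hypothesis with a visual feature vector, then classify.
# Feature dimensions, toy data, and the classifier choice are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_samples, n_classes = 500, 10
audio_word_vecs = rng.standard_normal((n_samples, 300))   # e.g., embeddings of API output
visual_feats = rng.standard_normal((n_samples, 256))      # lip-reading network embedding
labels = rng.integers(0, n_classes, n_samples)

fused = np.concatenate([audio_word_vecs, visual_feats], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```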
11.
J Imaging ; 8(10)2022 Sep 21.
Article in English | MEDLINE | ID: mdl-36286349

ABSTRACT

Despite the success of hand-crafted features in computer vision for many years, they have nowadays largely been replaced by end-to-end learnable features extracted from deep convolutional neural networks (CNNs). Whilst CNNs can learn robust features directly from image pixels, they require large numbers of samples and extensive augmentation. On the contrary, hand-crafted features, like SIFT, exhibit several interesting properties, as they can provide local rotation invariance. In this work, a novel scheme combining the strengths of SIFT descriptors with CNNs, namely SIFT-CNN, is presented. Given a single-channel image, one SIFT descriptor is computed for every pixel, and thus every pixel is represented as an M-dimensional histogram, which ultimately results in an M-channel image. Thus, the SIFT image is generated from the SIFT descriptors of all the pixels in a single-channel image, while at the same time the original spatial size is preserved. Next, a CNN is trained to utilize these M-channel images as inputs by operating directly on the multiscale SIFT images with regular convolution processes. Since these images incorporate the spatial relations between the histograms of the SIFT descriptors, the CNN is guided to learn features from local gradient information of images that otherwise can be neglected. In this manner, the SIFT-CNN implicitly acquires a local rotation invariance property, which is desired for problems where local areas within the image can be rotated without affecting the overall classification result of the respective image. Such problems include indirect immunofluorescence (IIF) cell image classification, ground-based all-sky image-cloud classification and human lip-reading classification. The results on popular datasets for these three problems indicate that the proposed SIFT-CNN can improve performance and surpasses the corresponding CNNs trained directly on pixel values in various challenging tasks, owing to its robustness to local rotations. Our findings highlight the importance of the input image representation for the overall efficiency of a data-driven system.
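The core construction in the abstract above, computing a SIFT descriptor at every pixel of a single-channel image so that the image becomes an M-channel (here 128-channel) CNN input, can be sketched with OpenCV as below. The fixed keypoint size and toy image are assumptions; the authors' preprocessing details may differ.

```python
# Sketch of building a per-pixel "SIFT image": a 128-D SIFT descriptor is computed at
# every pixel of a single-channel image, giving an H x W x 128 tensor for a CNN.
# The keypoint size is an assumption; OpenCV returns one descriptor per provided keypoint.
import cv2
import numpy as np

def sift_image(gray, keypoint_size=8):
    h, w = gray.shape
    sift = cv2.SIFT_create()
    # One keypoint per pixel, at a fixed scale
    kps = [cv2.KeyPoint(float(x), float(y), keypoint_size)
           for y in range(h) for x in range(w)]
    kps, desc = sift.compute(gray, kps)          # desc: (h*w, 128)
    return desc.reshape(h, w, 128).astype(np.float32)

gray = np.random.randint(0, 256, (32, 32), dtype=np.uint8)  # toy single-channel image
multi_channel = sift_image(gray)
print(multi_channel.shape)   # (32, 32, 128) -> feed to a CNN expecting 128 input channels
```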

12.
J Neurosci ; 42(31): 6108-6120, 2022 08 03.
Article in English | MEDLINE | ID: mdl-35760528

ABSTRACT

Speech perception in noisy environments is enhanced by seeing facial movements of communication partners. However, the neural mechanisms by which audio and visual speech are combined are not fully understood. We explore MEG phase-locking to auditory and visual signals in MEG recordings from 14 human participants (6 females, 8 males) that reported words from single spoken sentences. We manipulated the acoustic clarity and visual speech signals such that critical speech information is present in auditory, visual, or both modalities. MEG coherence analysis revealed that both auditory and visual speech envelopes (auditory amplitude modulations and lip aperture changes) were phase-locked to 2-6 Hz brain responses in auditory and visual cortex, consistent with entrainment to syllable-rate components. Partial coherence analysis was used to separate neural responses to correlated audio-visual signals and showed non-zero phase-locking to auditory envelope in occipital cortex during audio-visual (AV) speech. Furthermore, phase-locking to auditory signals in visual cortex was enhanced for AV speech compared with audio-only speech that was matched for intelligibility. Conversely, auditory regions of the superior temporal gyrus did not show above-chance partial coherence with visual speech signals during AV conditions but did show partial coherence in visual-only conditions. Hence, visual speech enabled stronger phase-locking to auditory signals in visual areas, whereas phase-locking of visual speech in auditory regions only occurred during silent lip-reading. Differences in these cross-modal interactions between auditory and visual speech signals are interpreted in line with cross-modal predictive mechanisms during speech perception.SIGNIFICANCE STATEMENT Verbal communication in noisy environments is challenging, especially for hearing-impaired individuals. Seeing facial movements of communication partners improves speech perception when auditory signals are degraded or absent. The neural mechanisms supporting lip-reading or audio-visual benefit are not fully understood. Using MEG recordings and partial coherence analysis, we show that speech information is used differently in brain regions that respond to auditory and visual speech. While visual areas use visual speech to improve phase-locking to auditory speech signals, auditory areas do not show phase-locking to visual speech unless auditory speech is absent and visual speech is used to substitute for missing auditory signals. These findings highlight brain processes that combine visual and auditory signals to support speech understanding.


Subject(s)
Auditory Cortex , Speech Perception , Visual Cortex , Acoustic Stimulation , Auditory Cortex/physiology , Auditory Perception , Female , Humans , Lipreading , Male , Speech/physiology , Speech Perception/physiology , Visual Cortex/physiology , Visual Perception/physiology
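The study above relies on partial coherence to isolate phase-locking to the auditory envelope after accounting for the correlated visual (lip aperture) signal. The sketch below uses the standard cross-spectral formulation with SciPy; the toy signals and window settings are illustrative, not the authors' MEG pipeline.

```python
# Sketch of partial coherence: coherence between a neural signal x and the auditory
# envelope y after conditioning out the correlated lip-aperture signal z.
import numpy as np
from scipy.signal import csd

def partial_coherence(x, y, z, fs, nperseg=256):
    f, Sxy = csd(x, y, fs=fs, nperseg=nperseg)
    _, Sxx = csd(x, x, fs=fs, nperseg=nperseg)
    _, Syy = csd(y, y, fs=fs, nperseg=nperseg)
    _, Sxz = csd(x, z, fs=fs, nperseg=nperseg)
    _, Syz = csd(y, z, fs=fs, nperseg=nperseg)
    _, Szz = csd(z, z, fs=fs, nperseg=nperseg)
    # Condition x and y on z, then form the magnitude-squared coherence
    Sxy_z = Sxy - Sxz * np.conj(Syz) / Szz
    Sxx_z = Sxx - np.abs(Sxz) ** 2 / Szz
    Syy_z = Syy - np.abs(Syz) ** 2 / Szz
    return f, np.abs(Sxy_z) ** 2 / (np.real(Sxx_z) * np.real(Syy_z))

fs = 100
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(3)
z = rng.standard_normal(t.size)              # lip aperture (shared drive)
x = z + rng.standard_normal(t.size)          # "neural" signal
y = 0.8 * z + rng.standard_normal(t.size)    # auditory envelope
f, pcoh = partial_coherence(x, y, z, fs)
# Close to zero (up to estimation bias) once the shared lip signal is partialled out
print(pcoh[(f >= 2) & (f <= 6)].mean())
```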
13.
Sensors (Basel) ; 22(10)2022 May 13.
Article in English | MEDLINE | ID: mdl-35632141

ABSTRACT

Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we can make is finite, leading to problems with homophones when speaking. On the other hand, different speakers will produce different lip movements for the same word. To address these problems, this paper focuses on spatial-temporal feature extraction in word-level lipreading, and an efficient two-stream model is proposed to learn the relative dynamic information of lip motion. In this model, two CNN streams with different channel capacities are used to extract static features in a single frame and dynamic information between multi-frame sequences, respectively. We explored a more effective convolution structure for each component in the front-end model and improved accuracy by about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of the two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the fusion methods of the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved new state-of-the-art results.


Subject(s)
Algorithms , Lipreading , Humans , Learning , Motion , Movement
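The abstract above describes a two-stream front end in which streams of different channel capacity capture static single-frame detail and multi-frame dynamics. A minimal PyTorch sketch of that pattern follows; the channel widths, temporal subsampling factor, and classifier head are assumptions rather than the paper's exact design.

```python
# Minimal PyTorch sketch of a two-stream lip-reading front end: a higher-capacity
# stream on temporally subsampled frames (static detail) and a lower-capacity stream
# on all frames (dynamics), fused by concatenation. All sizes are illustrative.
import torch
import torch.nn as nn

class TwoStreamFrontEnd(nn.Module):
    def __init__(self, n_classes=500, slow_stride=4):
        super().__init__()
        self.slow_stride = slow_stride
        self.slow = nn.Sequential(                     # fewer frames, more channels
            nn.Conv3d(1, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fast = nn.Sequential(                     # all frames, fewer channels
            nn.Conv3d(1, 8, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.classifier = nn.Linear(64 + 8, n_classes)

    def forward(self, clips):
        # clips: (batch, 1, T, H, W) grayscale mouth-region video
        slow_feat = self.slow(clips[:, :, ::self.slow_stride]).flatten(1)
        fast_feat = self.fast(clips).flatten(1)
        return self.classifier(torch.cat([slow_feat, fast_feat], dim=1))

model = TwoStreamFrontEnd()
print(model(torch.randn(2, 1, 29, 88, 88)).shape)   # torch.Size([2, 500])
```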
14.
Sensors (Basel) ; 22(9)2022 May 09.
Article in English | MEDLINE | ID: mdl-35591284

ABSTRACT

Concomitant with recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face images. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, and the decoder uses cascaded local self-attention connectionist temporal classification to collect the details of local contextual information in the immediate vicinity, which results in a substantial performance boost and speedy convergence. For experiments on the OuluVS2 dataset, the data were divided into the four perspectives; compared with the existing state-of-the-art performance, the improvements were 3.31% (0°), 4.79% (30°), 5.51% (45°), and 6.18% (60°), with a mean improvement of 4.95%, and the average performance improved by 9.1% compared with the baseline. Thus, the suggested design enhances the performance of multi-view VSR and boosts its usefulness in real-world applications.


Subject(s)
Lipreading , Neural Networks, Computer , Attention , Humans , Language , Speech
15.
Sensors (Basel) ; 22(8)2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35458932

ABSTRACT

Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google's trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.


Subject(s)
Artificial Intelligence , Speech , Cloud Computing , Neural Networks, Computer , Speech Recognition Software
16.
Int J Dev Disabil ; 68(1): 47-55, 2022.
Article in English | MEDLINE | ID: mdl-35173963

ABSTRACT

A weaker McGurk effect is observed in individuals with autism spectrum disorder (ASD); this weaker integration is considered key to understanding how low-order atypical processing leads to their maladaptive social behaviors. However, the mechanism underlying this weaker McGurk effect has not been fully understood. Here, we investigated (1) whether the weaker McGurk effect in individuals with high autistic traits is caused by poor lip-reading ability and (2) whether the hearing environment modifies the weaker McGurk effect in individuals with high autistic traits. To test these questions, we conducted two analogue studies among university students, based on the dimensional model of ASD. Results showed that individuals with high autistic traits have intact lip-reading ability, as well as intact ability to listen to and recognize audiovisually congruent speech (Experiment 1). Furthermore, the weaker McGurk effect in individuals with high autistic traits, which appeared under the no-noise condition, disappeared under the high-noise condition (Experiments 1 and 2). Our findings suggest that high background noise might shift weight onto the visual cue, thereby increasing the strength of the McGurk effect among individuals with high autistic traits.

17.
HNO ; 70(6): 456-465, 2022 Jun.
Article in German | MEDLINE | ID: mdl-35024877

ABSTRACT

BACKGROUND: Many people benefit from the additional visual information provided by a speaker's lip movements; lipreading itself, however, is very error prone. Algorithms for lip reading with artificial intelligence based on artificial neural networks significantly improve word recognition but are not available for the German language. MATERIALS AND METHODS: A total of 1806 video clips, each containing only one German-speaking person, were selected, split into word segments, and assigned to word classes using speech-recognition software. In 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network. The 3D Convolutional Neural Network and Gated Recurrent Units models and a combination of both models (GRUConv) were compared, as were different image sections and color spaces of the videos. The accuracy was determined over 5000 training epochs. RESULTS: Comparison of the color spaces did not reveal any relevant differences in correct classification rates, which ranged from 69% to 72%. Cropping the video to the lips achieved a significantly higher accuracy of 70% than cropping to the speaker's entire face (34%). With the GRUConv model, the maximum accuracies were 87% with known speakers and 63% in validation with unknown speakers. CONCLUSION: The neural network for lip reading, the first developed for the German language, shows very high accuracy, comparable to that of English-language algorithms. It also works with unknown speakers and can be generalized with more word classes.


Subject(s)
Deep Learning , Language , Algorithms , Artificial Intelligence , Humans , Lipreading
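The work above compares 3D convolutional and GRU-based models and a combined GRUConv model for classifying 18 German word classes from lip-cropped video. The sketch below shows one plausible way to combine a 3D-CNN front end with a GRU back end; the abstract does not specify the topology, so all layer sizes here are guesses.

```python
# Sketch of a 3D-CNN front end followed by a GRU back end for word-level lip reading,
# in the spirit of the GRUConv model above (18 visually distinguishable word classes).
# The exact layer topology is not given in the abstract, so these sizes are assumptions.
import torch
import torch.nn as nn

class GRUConvSketch(nn.Module):
    def __init__(self, n_classes=18):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)))
        self.gru = nn.GRU(input_size=32, hidden_size=128, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, clips):
        # clips: (batch, 1, T, H, W) videos cropped to the lips
        feats = self.frontend(clips)                       # (batch, 32, T, H', W')
        feats = feats.mean(dim=(3, 4)).transpose(1, 2)     # (batch, T, 32) frame vectors
        out, _ = self.gru(feats)
        return self.classifier(out[:, -1])                 # last time step -> word logits

model = GRUConvSketch()
print(model(torch.randn(2, 1, 30, 64, 64)).shape)          # torch.Size([2, 18])
```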
18.
Front Artif Intell ; 5: 1070964, 2022.
Article in English | MEDLINE | ID: mdl-36714203

ABSTRACT

Unlike the conventional frame-based camera, the event-based camera detects changes in the brightness value of each pixel over time. This work explores lip-reading as a new application of the event-based camera and proposes an event-camera-based lip-reading method for isolated single-sound recognition. The proposed method consists of imaging from event data, detection of the face and facial feature points, and recognition using a Temporal Convolutional Network. Furthermore, a method that combines the two modalities of the frame-based camera and the event-based camera is proposed. To evaluate the proposed method, utterance scenes of 15 Japanese consonants from 20 speakers were collected using an event-based camera and a video camera, and an original dataset was constructed. Several experiments were conducted by generating images at multiple frame rates from the event-based camera. The highest recognition accuracy was obtained with images from the event-based camera at 60 fps. Moreover, it was confirmed that combining the two modalities yields higher recognition accuracy than a single modality.

19.
Folia Phoniatr Logop ; 74(2): 131-140, 2022.
Article in English | MEDLINE | ID: mdl-34348290

ABSTRACT

INTRODUCTION: To the best of our knowledge, there is a lack of reliable, validated, and standardized (Dutch) measuring instruments to document visual speech perception in a structured way. This study aimed to: (1) evaluate the effects of age, gender, and the used word list on visual speech perception examined by a first version of the Dutch Test for (Audio-)Visual Speech Perception on word level (TAUVIS-words) and (2) assess the internal reliability of the TAUVIS-words. METHODS: Thirty-nine normal-hearing adults divided into the following 3 age categories were included: (1) younger adults, age 18-39 years; (2) middle-aged adults, age 40-59 years; and (3) older adults, age >60 years. The TAUVIS-words consist of 4 word lists, i.e., 2 monosyllabic word lists (MS 1 and MS 2) and 2 polysyllabic word lists (PS 1 and PS 2). A first exploration of the effects of age, gender, and test stimuli (i.e., the used word list) on visual speech perception was conducted using the TAUVIS-words. A mixed-design analysis of variance (ANOVA) was conducted to analyze the results statistically. Lastly, the internal reliability of the TAUVIS-words was assessed by calculating Cronbach's α. RESULTS: The results revealed a significant effect of the used list. More specifically, the score for MS 1 was significantly better compared to that for PS 2, and the score for PS 1 was significantly better compared to that for PS 2. Furthermore, a significant main effect of gender was found. Women scored significantly better compared to men. The effect of age was not significant. The TAUVIS-word lists were found to have good internal reliability. CONCLUSION: This study was a first exploration of the effects of age, gender, and test stimuli on visual speech perception using the TAUVIS-words. Further research is necessary to optimize and validate the TAUVIS-words, making use of a larger study sample.


Subject(s)
Speech Perception , Adolescent , Adult , Aged , Female , Hearing Tests , Humans , Language , Male , Middle Aged , Reproducibility of Results , Young Adult
20.
J Neurosci ; 42(3): 435-442, 2022 01 19.
Article in English | MEDLINE | ID: mdl-34815317

ABSTRACT

In everyday conversation, we usually process the talker's face as well as the sound of the talker's voice. Access to visual speech information is particularly useful when the auditory signal is degraded. Here, we used fMRI to monitor brain activity while adult humans (n = 60) were presented with visual-only, auditory-only, and audiovisual words. The audiovisual words were presented in quiet and in several signal-to-noise ratios. As expected, audiovisual speech perception recruited both auditory and visual cortex, with some evidence for increased recruitment of premotor cortex in some conditions (including in substantial background noise). We then investigated neural connectivity using psychophysiological interaction analysis with seed regions in both primary auditory cortex and primary visual cortex. Connectivity between auditory and visual cortices was stronger in audiovisual conditions than in unimodal conditions, including a wide network of regions in posterior temporal cortex and prefrontal cortex. In addition to whole-brain analyses, we also conducted a region-of-interest analysis on the left posterior superior temporal sulcus (pSTS), implicated in many previous studies of audiovisual speech perception. We found evidence for both activity and effective connectivity in pSTS for visual-only and audiovisual speech, although these were not significant in whole-brain analyses. Together, our results suggest a prominent role for cross-region synchronization in understanding both visual-only and audiovisual speech that complements activity in integrative brain regions like pSTS.SIGNIFICANCE STATEMENT In everyday conversation, we usually process the talker's face as well as the sound of the talker's voice. Access to visual speech information is particularly useful when the auditory signal is hard to understand (e.g., background noise). Prior work has suggested that specialized regions of the brain may play a critical role in integrating information from visual and auditory speech. Here, we show a complementary mechanism relying on synchronized brain activity among sensory and motor regions may also play a critical role. These findings encourage reconceptualizing audiovisual integration in the context of coordinated network activity.


Subject(s)
Auditory Cortex/physiology , Language , Lipreading , Nerve Net/physiology , Speech Perception/physiology , Visual Cortex/physiology , Visual Perception/physiology , Adult , Aged , Aged, 80 and over , Auditory Cortex/diagnostic imaging , Female , Humans , Magnetic Resonance Imaging , Male , Middle Aged , Nerve Net/diagnostic imaging , Visual Cortex/diagnostic imaging , Young Adult
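The connectivity analysis in the study above uses psychophysiological interaction (PPI): testing whether seed-to-target coupling changes with experimental condition via an interaction regressor in a GLM. The toy sketch below conveys the idea only; a full PPI analysis deconvolves the seed BOLD signal and convolves regressors with an HRF, which this simplified version skips, and all variables here are synthetic.

```python
# Simplified sketch of a psychophysiological interaction (PPI) regression: does
# coupling between a seed region and a target voxel change with condition?
# Real PPI pipelines deconvolve the seed signal and apply HRF convolution; this toy
# version omits those steps and uses synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n_trs = 300
condition = (np.arange(n_trs) // 30) % 2            # 0 = audio-only, 1 = audiovisual blocks
seed = rng.standard_normal(n_trs)                   # seed-region time course
# Target responds to the seed more strongly during audiovisual blocks
target = 0.2 * seed + 0.6 * seed * condition + rng.standard_normal(n_trs)

# GLM with main effects (seed, condition) plus the PPI interaction term
ppi = seed * (condition - condition.mean())          # mean-centred interaction regressor
X = np.column_stack([np.ones(n_trs), seed, condition, ppi])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print("PPI (interaction) beta:", round(beta[3], 2))  # positive -> stronger AV coupling
```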