ABSTRACT
Understanding the perception of emotions or affective states in humans is important for developing emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio-visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio-visual emotion database collected from TV broadcasts such as soap operas and movies, called the IIIT-H Audio-Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio-visual data in English. Using data from all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants' perception of emotions differed remarkably between the audio-alone, video-alone, and audio-visual data. This finding emphasizes the importance of emotion-specific features, compared to commonly used features, in the development of emotion-aware systems.
Subjects
Arousal, Emotions, Humans
ABSTRACT
Voiced speech is generated by the glottal flow interacting with vocal fold vibrations. However, the details of vibrations in the anterior-posterior direction (the so-called zipper-effect) and their correspondence with speech and other glottal signals are not fully understood due to challenges in direct measurements of vocal fold vibrations. In this proof-of-concept study, the potential of four parameters extracted from high-speed videoendoscopy (HSV), electroglottography, and speech signals to indicate the presence of a zipper-type glottal opening is investigated. Comparison with manual labeling of the HSV videos highlighted the importance of multiple parameter-signal pairs in indicating the presence of a zipper-type glottal opening.
Subjects
Phonation, Voice, Glottis, Speech, Vibration, Vocal Cords
ABSTRACT
Existing studies on the classification of phonation types in singing use voice source features and mel-frequency cepstral coefficients (MFCCs), which show poor performance due to the high pitch of singing. In this study, high-resolution spectra obtained using the zero-time windowing (ZTW) method are utilized to capture the effect of the voice excitation. ZTW does not call for computing a source-filter decomposition (which is needed by many voice source features), which makes it robust to high pitch. For the classification, the study proposes extracting MFCCs from the ZTW spectrum. The results show that the proposed features give a clear improvement in classification accuracy compared to the existing features.
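As a hedged illustration of the feature extraction step, the sketch below computes mel-cepstral coefficients from a precomputed high-resolution magnitude spectrum; the ZTW spectrum itself is assumed to be available, and the parameter values are illustrative rather than those used in the study.

```python
# Minimal sketch: cepstral features from a precomputed high-resolution magnitude
# spectrum (e.g., one obtained with zero-time windowing). Parameters are illustrative.
import numpy as np
import librosa
from scipy.fftpack import dct

def spectrum_to_mfcc(mag_spectrum, sr=16000, n_fft=1024, n_mels=40, n_mfcc=13):
    """mag_spectrum: magnitude spectrum of length 1 + n_fft // 2."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1 + n_fft//2)
    mel_energies = mel_fb @ (mag_spectrum ** 2)          # mel-band energies
    log_mel = np.log(mel_energies + 1e-10)               # log compression
    return dct(log_mel, type=2, norm='ortho')[:n_mfcc]   # cepstral coefficients
```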
ABSTRACT
In the production of voiced speech, glottal flow skewing refers to the tilting of the glottal flow pulses to the right, often characterized as a delay of the peak, compared to the glottal area. In the past four decades, several studies have addressed this phenomenon using modeling of voice production with analog circuits and computer simulations. However, previous studies measuring flow skewing in natural speech production are sparse and contain little quantitative data about the degree of skewing between flow and area. In the current study, flow skewing was measured from the natural production of 40 vowel utterances produced by 10 speakers. Glottal flow was estimated from speech using glottal inverse filtering, and glottal area was captured with high-speed videoendoscopy. The estimated glottal flow and area waveforms were parameterized with four robust parameters that measure pulse skewness quantitatively. Statistical tests obtained for all four parameters showed that the flow pulse was significantly more skewed to the right than the area pulse. Hence, this study corroborates the existence of flow skewing using measurements from natural speech production. In addition, the study yields quantitative data about pulse skewness in simultaneously measured glottal flow and area in natural speech production.
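The abstract does not specify the four skewness parameters; as an illustration only, the sketch below computes one common pulse-skewness measure, the speed quotient, for a single glottal flow (or area) pulse.

```python
# Illustrative sketch (not the paper's exact parameters): the speed quotient,
# a common measure of pulse skewness, computed for a single glottal pulse.
import numpy as np

def speed_quotient(pulse):
    """pulse: samples of one glottal flow (or area) pulse from opening to closure.
    Returns opening time / closing time; values > 1 indicate right-skewed pulses."""
    peak = int(np.argmax(pulse))
    opening_time = peak                     # samples from opening instant to the peak
    closing_time = len(pulse) - 1 - peak    # samples from the peak to closure
    return opening_time / max(closing_time, 1)
```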
Subjects
Glottis/physiology, Phonation/physiology, Speech/physiology, Voice/physiology, Adult, Female, Humans, Male, Middle Aged, Speech Acoustics, Speech Production Measurement
ABSTRACT
Estimation of the spectral tilt of the glottal source has several applications in speech analysis and modification. However, direct estimation of the tilt from telephone speech is challenging due to vocal tract resonances and distortion caused by speech compression. In this study, a deep neural network is used for the tilt estimation from telephone speech by training the network with tilt estimates computed by glottal inverse filtering. An objective evaluation shows that the proposed technique gives more accurate estimates for the spectral tilt than previously used techniques that estimate the tilt directly from telephone speech without glottal inverse filtering.
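A minimal sketch of the kind of regression setup described above, assuming MFCC input features and scikit-learn's MLPRegressor; the feature choice, network size, and data below are placeholders, not the study's configuration.

```python
# Assumed setup: regress a glottal-source spectral tilt value (e.g., obtained with
# glottal inverse filtering on clean speech) from features of telephone speech.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 13))   # placeholder MFCC frames from telephone speech
y_train = rng.normal(size=500)         # placeholder tilt targets computed via inverse filtering

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X_train, y_train)
tilt_estimates = model.predict(X_train[:5])   # predict tilt for new frames
```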
Subjects
Acoustics, Deep Learning, Glottis/physiology, Signal Processing, Computer-Assisted, Speech Acoustics, Speech Production Measurement/methods, Telephone, Voice Quality, Female, Humans, Male, Phonation, Sound Spectrography
ABSTRACT
Recently, quasi-closed phase (QCP) analysis of speech signals was proposed for accurate glottal inverse filtering. However, QCP analysis, which belongs to the family of temporally weighted linear prediction (WLP) methods, uses conventional forward sample prediction. This may not be the best choice, especially when computing WLP models with a hard-limiting weighting function: sample-selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted from both its past and future samples, thereby utilizing the available samples more effectively. Formant detection and estimation experiments on synthetic vowels generated with a physical modeling approach, as well as on natural speech utterances, show that the proposed QCP-FB method yields statistically significant improvements over conventional linear prediction and QCP methods.
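A minimal NumPy sketch of temporally weighted forward-backward linear prediction in the spirit of QCP-FB; the quasi-closed-phase weighting function itself is not reproduced here and is simply passed in as an arbitrary per-sample weight.

```python
# Sketch of weighted forward-backward linear prediction: the weighted squared
# forward and backward prediction errors are minimized jointly.
import numpy as np

def wlp_forward_backward(x, p, w):
    """x: speech frame, p: prediction order, w: per-sample weighting function
    (e.g., a quasi-closed-phase weight); returns all-pole coefficients [1, a1, ..., ap]."""
    n = np.arange(p, len(x))
    R = np.zeros((p + 1, p + 1))
    for i in range(p + 1):
        for j in range(p + 1):
            # forward term x[n-i]x[n-j] plus backward term x[n-p+i]x[n-p+j]
            R[i, j] = np.sum(w[n] * (x[n - i] * x[n - j] + x[n - p + i] * x[n - p + j]))
    a = np.linalg.solve(R[1:, 1:], -R[1:, 0])   # solve the weighted normal equations
    return np.concatenate(([1.0], a))
```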
ABSTRACT
Recent studies have shown that acoustically distorted sentences can be perceived as either unintelligible or intelligible depending on whether one has previously been exposed to the undistorted, intelligible versions of the sentences. This allows studying processes specifically related to speech intelligibility since any change between the responses to the distorted stimuli before and after the presentation of their undistorted counterparts cannot be attributed to acoustic variability but, rather, to the successful mapping of sensory information onto memory representations. To estimate how the complexity of the message is reflected in speech comprehension, we applied this rapid change in perception to behavioral and magnetoencephalography (MEG) experiments using vowels, words, and sentences. In the experiments, stimuli were initially presented to the subject in a distorted form, after which undistorted versions of the stimuli were presented. Finally, the original distorted stimuli were presented once more. The resulting increase in intelligibility observed for the second presentation of the distorted stimuli depended on the complexity of the stimulus: vowels remained unintelligible (behaviorally measured intelligibility 27%) whereas the intelligibility of the words increased from 19% to 45% and that of the sentences from 31% to 65%. This increase in the intelligibility of the degraded stimuli was reflected as an enhancement of activity in the auditory cortex and surrounding areas at early latencies of 130-160 ms. In the same regions, increasing stimulus complexity attenuated mean currents at latencies of 130-160 ms whereas at latencies of 200-270 ms the mean currents increased. These modulations in cortical activity may reflect feedback from top-down mechanisms enhancing the extraction of information from speech. The behavioral results suggest that memory-driven expectancies can have a significant effect on speech comprehension, especially in acoustically adverse conditions where the bottom-up information is decreased.
Subjects
Brain/physiology, Comprehension/physiology, Speech Perception/physiology, Acoustic Stimulation, Adult, Female, Humans, Magnetoencephalography, Male, Signal Processing, Computer-Assisted, Speech Intelligibility/physiology, Young Adult
ABSTRACT
Effective speech sound discrimination at preschool age is known to be a prerequisite for the development of language skills and later literacy acquisition. However, the speech specificity of cortical discrimination skills in small children is currently not known, as previous research has either studied speech functions without comparison with non-speech sounds, or used much simpler sounds such as harmonic or sinusoidal tones as control stimuli. We investigated the cortical discrimination of five syllable features (consonant, vowel, vowel duration, fundamental frequency, and intensity), covering both segmental and prosodic phonetic changes, and their acoustically matched non-speech counterparts in 63 six-year-old typically developing children, using a multi-feature mismatch negativity (MMN) paradigm. Each of the five investigated features elicited a unique pattern of differentiating negativities: an early differentiating negativity, the MMN, and a late differentiating negativity. All five studied features showed speech-related enhancement of at least one of these responses, suggesting experience-related neural commitment in both phonetic and prosodic speech processing. In addition, the cognitive performance and language skills of the children were tested extensively. The speech-related neural enhancement was positively associated with the level of performance in several neurocognitive tasks, indicating a relationship between successful establishment of cortical memory traces for speech and enhanced cognitive functioning. The results contribute to the understanding of typical developmental trajectories of linguistic vs. non-linguistic auditory skills, and provide a reference for future studies investigating deficits in language-related disorders at preschool age.
Subjects
Cerebral Cortex/physiology, Cognition, Discrimination, Psychological, Speech Perception, Cerebral Cortex/growth & development, Child, Preschool, Female, Humans, Language Development, Male
ABSTRACT
Natural auditory scenes often consist of several sound sources overlapping in time, but separated in space. Yet, location is not fully exploited in auditory grouping: spatially separated sounds can get perceptually fused into a single auditory object and this leads to difficulties in the identification and localization of concurrent sounds. Here, the brain mechanisms responsible for grouping across spatial locations were explored in magnetoencephalography (MEG) recordings. The results show that the cortical representation of a vowel spatially separated into two locations reflects the perceived location of the speech sound rather than the physical locations of the individual components. In other words, the auditory scene is neurally rearranged to bring components into spatial alignment when they were deemed to belong to the same object. This renders the original spatial information unavailable at the level of the auditory cortex and may contribute to difficulties in concurrent sound segregation.
Subjects
Auditory Cortex/physiology, Auditory Pathways/physiology, Sound Localization, Speech Acoustics, Speech Perception, Voice Quality, Acoustic Stimulation, Humans, Magnetoencephalography, Male, Psychoacoustics, Signal Detection, Psychological, Sound Spectrography
ABSTRACT
Automatic classification of phonation types in the singing voice is essential for tasks such as identification of singing style. This study proposes using wavelet scattering network (WSN)-based features for the classification of phonation types in singing. The WSN, which closely resembles auditory physiological models, generates acoustic features that characterize information related to pitch, formants, and timbre, and hence the WSN-based features can effectively capture discriminative information across phonation types. The experimental results show that the proposed WSN-based features improve phonation classification accuracy by at least 9% compared to state-of-the-art features.
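A hedged sketch of scattering-based feature extraction, assuming the kymatio package is available; the scattering depth, Q value, and time-averaging are illustrative choices rather than the study's configuration.

```python
# Hypothetical sketch using kymatio (assumed available); J, Q, and the pooling
# are illustrative, not the paper's setup.
import numpy as np
from kymatio.numpy import Scattering1D

x = np.random.randn(2 ** 14)                      # placeholder mono audio frame
scattering = Scattering1D(J=8, shape=x.shape[-1], Q=12)
Sx = scattering(x)                                # (n_coeffs, time) scattering coefficients
features = Sx.mean(axis=-1)                       # time-average -> fixed-length feature vector
```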
ABSTRACT
Many acoustic features and machine learning models have been studied to build automatic detection systems to distinguish dysarthric speech from healthy speech. These systems can help to improve the reliability of diagnosis. However, speech recorded for diagnosis in real-life clinical conditions can differ from the training data of the detection system in terms of, for example, recording conditions, speaker identity, and language. These mismatches may lead to a reduction in detection performance in practical applications. In this study, we investigate the use of the wav2vec2 model as a feature extractor together with a support vector machine (SVM) classifier to build automatic detection systems for dysarthric speech. The performance of the wav2vec2 features is evaluated in two cross-database scenarios, language-dependent and language-independent, to study their generalizability to unseen speakers, recording conditions, and languages before and after fine-tuning the wav2vec2 model. The results revealed that the fine-tuned wav2vec2 features showed better generalization in both scenarios and gave an absolute accuracy improvement of 1.46%-8.65% compared to the non-fine-tuned wav2vec2 features.
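An illustrative sketch of the described pipeline using the Hugging Face transformers library and scikit-learn; the checkpoint name, time-average pooling, and SVM settings are assumptions, not the study's exact setup.

```python
# Illustrative sketch: frame-averaged wav2vec2 embeddings fed to an SVM classifier.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16k):
    """waveform_16k: 1-D float array sampled at 16 kHz -> utterance-level embedding."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()     # average pooling over time

# X = np.stack([embed(w) for w in waveforms]); y = labels (0 = healthy, 1 = dysarthric)
# clf = SVC(kernel="rbf").fit(X, y)
```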
Subjects
Dysarthria, Support Vector Machine, Humans, Dysarthria/physiopathology, Dysarthria/diagnosis, Male, Female, Signal Processing, Computer-Assisted, Adult, Young Adult, Databases, Factual, Middle Aged, Algorithms
ABSTRACT
OBJECTIVES: Several studies have reported an increased prevalence of social creak, particularly among female speakers. Social creak has previously been studied by combining perceptual evaluation of speech with conventional acoustic parameters such as the harmonics-to-noise ratio and cepstral peak prominence. In the current study, machine learning (ML) was used to automatically distinguish speech containing a low amount of social creak from speech containing a high amount. METHODS: The amount of creak in continuous speech samples produced in Finnish by 90 female speakers was first perceptually assessed by two voice specialists. Based on their assessments, the speech samples were divided into two categories (low vs. high amount of creak). Using the speech signals and their creak labels, seven different ML models were trained, with three spectral representations used as features for each model. RESULTS: The best performance (accuracy of 71.1%) was obtained by two systems: an AdaBoost classifier using the mel-spectrogram feature and a decision tree classifier using the mel-frequency cepstral coefficient feature. CONCLUSIONS: The study of social creak is becoming increasingly popular in sociolinguistic and vocological research. Since conventional human perceptual assessment of the amount of creak is laborious, ML technology could be used to assist researchers studying social creak. The classification systems reported in this study can be considered baselines for future ML-based studies on social creak.
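A minimal sketch of one of the reported feature-classifier pairings (mel-spectrogram features with an AdaBoost classifier), using librosa and scikit-learn; the time-average pooling and all parameter values are illustrative, not the study's settings.

```python
# Illustrative sketch: time-averaged log-mel-spectrogram features with AdaBoost.
import numpy as np
import librosa
from sklearn.ensemble import AdaBoostClassifier

def mel_features(y, sr=16000, n_mels=40):
    """Utterance-level feature: time-averaged log-mel spectrogram."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(S + 1e-10).mean(axis=1)        # (n_mels,)

# X = np.stack([mel_features(y) for y in utterances]); labels: 0 = low creak, 1 = high creak
# clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)
```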
ABSTRACT
OBJECTIVES: Sound pressure and exhaled flow have been identified as important factors associated with higher particle emissions. The aim of this study was to assess how different vocalizations affect particle generation independently of other factors. DESIGN: Experimental study. METHODS: Thirty-three experienced singers repeated two different sentences at normal loudness and in a whisper. The first sentence consisted mainly of consonants such as /k/ and /t/ as well as open vowels, while the second sentence also included the /s/ sound and contained primarily closed vowels. Particle emission was measured using a condensation particle counter (CPC, 3775 TSI Inc.) and an aerodynamic particle sizer (APS, 3321 TSI Inc.). The CPC measured particle number concentration for particles larger than 4 nm and mainly reflects the number of particles smaller than 0.5 µm, since these particles dominate the total number concentration. The APS measured particle size distribution and number concentration in the size range of 0.5-10 µm, and the data were divided into >1 µm and <1 µm particle size ranges. Generalized linear mixed-effects models were constructed to assess the factors affecting particle generation. RESULTS: Whispering produced more particles than speaking, and sentence 1 produced more particles than sentence 2 when spoken. Sound pressure level affected particle production independently of vocalization, whereas the effect of exhaled airflow was not statistically significant. CONCLUSIONS: Based on our results, the type of vocalization has a significant effect on particle production independently of other factors such as sound pressure level.
ABSTRACT
High vocal effort has characteristic acoustic effects on speech. This study focuses on the utilization of this information by human listeners and a machine-based detection system in the task of detecting shouted speech in the presence of noise. Both female and male speakers read Finnish sentences using normal and shouted voice in controlled conditions, with the sound pressure level recorded. The speech material was artificially corrupted by noise and supplemented with pure noise. The human performance level was statistically evaluated by a listening test, where the subjects labeled noisy samples according to whether shouting was heard or not. A Bayesian detection system was constructed and statistically evaluated. Its performance was compared against that of human listeners, substituting different spectrum analysis methods in the feature extraction stage. Using features capable of taking into account the spectral fine structure (i.e., the fundamental frequency and its harmonics), the machine reached the detection level of humans even in the noisiest conditions. In the listening test, male listeners detected shouted speech significantly better than female listeners, especially with speakers making a smaller vocal effort increase for shouting.
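The abstract does not detail the detector, so the sketch below shows a generic Bayesian two-class detector with Gaussian class-conditional likelihoods over spectral feature vectors; it is an illustration, not the study's implementation.

```python
# Generic Bayesian detection sketch: Gaussian class-conditional likelihoods and
# a posterior log-ratio decision for shouted vs. normal speech features.
import numpy as np
from scipy.stats import multivariate_normal

class GaussianDetector:
    def fit(self, X_normal, X_shout):
        """X_normal, X_shout: (n_frames, n_features) training features per class."""
        self.normal = multivariate_normal(X_normal.mean(0), np.cov(X_normal.T))
        self.shout = multivariate_normal(X_shout.mean(0), np.cov(X_shout.T))
        return self

    def detect(self, X, prior_shout=0.5):
        """Return True where the posterior favors shouted speech."""
        log_ratio = (self.shout.logpdf(X) - self.normal.logpdf(X)
                     + np.log(prior_shout) - np.log(1 - prior_shout))
        return log_ratio > 0
```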
Subjects
Acoustics/instrumentation, Loudness Perception, Noise/adverse effects, Perceptual Masking, Speech Acoustics, Speech Perception, Acoustic Stimulation, Analysis of Variance, Audiometry, Speech, Bayes Theorem, Female, Humans, Male, Sex Factors, Signal Detection, Psychological, Signal Processing, Computer-Assisted, Signal-To-Noise Ratio, Sound Spectrography, Speech Production Measurement
ABSTRACT
All-pole modeling is a widely used formant estimation method, but its performance is known to deteriorate for high-pitched voices. In order to address this problem, several all-pole modeling methods robust to fundamental frequency have been proposed. This study compares five such previously known methods and introduces a new technique, Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME). WLP-AME utilizes temporally weighted linear prediction (LP) in which the square of the prediction error is multiplied by a given parametric weighting function. The weighting downgrades the contribution of the main excitation of the vocal tract in optimizing the filter coefficients. Consequently, the resulting all-pole model is affected more by the characteristics of the vocal tract, leading to less biased formant estimates. Experiments on synthetic vowels created with a physical modeling approach showed that WLP-AME yields improved formant frequencies for high-pitched sounds in comparison to the previously known methods (e.g., the relative error in the first formant of the vowel [a] decreased from 11% to 3% when conventional LP was replaced with WLP-AME). Experiments conducted on natural vowels indicate that the formants detected by WLP-AME changed more regularly between repetitions at different pitches than those computed by conventional LP.
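As a hedged illustration of the formant-estimation step shared by all of the compared methods, the sketch below picks formant frequencies from the roots of an all-pole polynomial; the coefficient vector could come from conventional LP or from a weighted LP variant such as WLP-AME.

```python
# Minimal sketch: formant frequencies from the roots of an all-pole model.
import numpy as np

def formants_from_lpc(a, sr=16000, n_formants=3):
    """a: all-pole coefficients [1, a1, ..., ap]; returns the lowest formant frequencies in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]              # keep one root per complex-conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)     # pole angles -> frequencies
    freqs = np.sort(freqs[freqs > 90])             # drop near-DC poles
    return freqs[:n_formants]
```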
Subjects
Glottis/physiology, Linear Models, Phonation, Phonetics, Pitch Perception, Speech Acoustics, Voice Quality, Adult, Algorithms, Biomechanical Phenomena, Child, Preschool, Computer Simulation, Female, Glottis/anatomy & histology, Humans, Male, Numerical Analysis, Computer-Assisted, Pattern Recognition, Automated, Pressure, Signal Processing, Computer-Assisted, Sound Spectrography, Speech Production Measurement, Time Factors, Vocal Cords/physiology
ABSTRACT
Previous studies on fusion in speech perception have demonstrated the ability of the human auditory system to group separate components of speech-like sounds together and consequently to enable the identification of speech despite the spatial separation between the components. Typically, the spatial separation has been implemented using headphone reproduction where the different components evoke auditory images at different lateral positions. In the present study, a multichannel loudspeaker system was used to investigate whether the correct vowel is identified and whether two auditory events are perceived when a noise-excited vowel is divided into two components that are spatially separated. The two components consisted of the even and odd formants. Both the amount of spatial separation between the components and the directions of the components were varied. Neither the spatial separation nor the directions of the components affected the vowel identification. Interestingly, an additional auditory event not associated with any vowel was perceived at the same time when the components were presented symmetrically in front of the listener. In such scenarios, the vowel was perceived from the direction of the odd formant components.
Subjects
Cues (Psychology), Speech Acoustics, Speech Perception, Voice Quality, Acoustic Stimulation, Adult, Audiometry, Speech, Auditory Threshold, Humans, Male, Pattern Recognition, Physiological, Recognition, Psychology, Sound Localization, Time Factors, Young Adult
ABSTRACT
Human speech perception is highly resilient to acoustic distortions. In addition to distortions from external sound sources, degradation of the acoustic structure of the sound itself can substantially reduce the intelligibility of speech. The degradation of the internal structure of speech happens, for example, when the digital representation of the signal is impoverished by reducing its amplitude resolution. Further, the perception of speech is also influenced by whether the distortion is transient, coinciding with speech, or is heard continuously in the background. However, the complex effects of the acoustic structure and continuity of the distortion on the cortical processing of degraded speech are unclear. In the present magnetoencephalography study, we investigated how the cortical processing of degraded speech sounds, as measured through the auditory N1m response, is affected by variation of both the distortion type (internal, external) and the continuity of distortion (transient, continuous). We found that when the distortion was continuous, the N1m was significantly delayed, regardless of the type of distortion. The N1m amplitude, in turn, was affected only when speech sounds were degraded with transient internal distortion, which resulted in larger response amplitudes. The results suggest that external and internal distortions of speech result in divergent patterns of activity in the auditory cortex, and that the effects are modulated by the temporal continuity of the distortion.
Subjects
Auditory Cortex/physiology, Phonetics, Speech Perception/physiology, Adult, Female, Humans, Magnetoencephalography, Male, Time Factors, Young Adult
ABSTRACT
BACKGROUND: The robustness of speech perception in the face of acoustic variation is founded on the ability of the auditory system to integrate the acoustic features of speech and to segregate them from background noise. This auditory scene analysis process is facilitated by top-down mechanisms, such as recognition memory for speech content. However, the cortical processes underlying these facilitatory mechanisms remain unclear. The present magnetoencephalography (MEG) study examined how the activity of auditory cortical areas is modulated by acoustic degradation and intelligibility of connected speech. The experimental design allowed for the comparison of cortical activity patterns elicited by acoustically identical stimuli which were perceived as either intelligible or unintelligible. RESULTS: In the experiment, a set of sentences was presented to the subject in distorted, undistorted, and again in distorted form. The intervening exposure to undistorted versions of the sentences rendered the initially unintelligible, distorted sentences intelligible, as evidenced by an increase from 30% to 80% in the proportion of sentences reported as intelligible. These perceptual changes were reflected in the activity of the auditory cortex, with the auditory N1m response (~100 ms) being more prominent for the distorted stimuli than for the intact ones. In the time range of the auditory P2m response (>200 ms), the auditory cortex as well as regions anterior and posterior to it generated a stronger response to intelligible than to unintelligible sentences. During the sustained field (>300 ms), stronger activity was elicited by degraded stimuli in the auditory cortex and by intelligible sentences in areas posterior to the auditory cortex. CONCLUSIONS: The current findings suggest that the auditory system comprises bottom-up and top-down processes which are reflected in transient and sustained brain activity. It appears that analysis of acoustic features occurs during the first 100 ms, and sensitivity to speech intelligibility emerges in the auditory cortex and surrounding areas from 200 ms onwards. The two processes are intertwined, with the activity of auditory cortical areas being modulated by top-down processes related to memory traces of speech and supporting speech intelligibility.
Subjects
Auditory Cortex/physiology, Brain Mapping/psychology, Speech Intelligibility/physiology, Speech Perception/physiology, Speech/physiology, Acoustic Stimulation/methods, Adult, Brain Mapping/methods, Evoked Potentials, Auditory/physiology, Humans, Image Processing, Computer-Assisted/methods, Magnetoencephalography/methods, Magnetoencephalography/psychology
ABSTRACT
Post-filtering can be utilized to improve the quality and intelligibility of telephone speech. Previous studies have shown that energy reallocation with a high-pass type filter works effectively in improving the intelligibility of speech in difficult noise conditions. The present study introduces a signal-to-noise ratio adaptive post-filtering method that utilizes energy reallocation to transfer energy from the first formant to higher frequencies. The proposed method adapts to the level of the background noise so that, in favorable noise conditions, the post-filter has a flat frequency response and the effect of the post-filtering is increased as the level of the ambient noise increases. The performance of the proposed method is compared with a similar post-filtering algorithm and unprocessed speech in subjective listening tests which evaluate both intelligibility and listener preference. The results indicate that both of the post-filtering methods maintain the quality of speech in negligible noise conditions and are able to provide intelligibility improvement over unprocessed speech in adverse noise conditions. Furthermore, the proposed post-filtering algorithm performs better than the other post-filtering method under evaluation in moderate to difficult noise conditions, where intelligibility improvement is mostly required.
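An illustrative sketch of an SNR-adaptive high-pass-type post-filter, implemented here as a simple blend between the unprocessed signal and a pre-emphasized version; this is not the paper's filter design, and the SNR thresholds are placeholders.

```python
# Illustrative sketch: the high-pass emphasis grows as the estimated SNR drops,
# approximating the idea of moving energy from low to high frequencies in noise.
import numpy as np
from scipy.signal import lfilter

def adaptive_postfilter(speech, snr_db, snr_low=0.0, snr_high=20.0):
    """Blend between a flat response (clean conditions) and a pre-emphasis
    high-pass response (noisy conditions) according to the estimated SNR in dB."""
    alpha = np.clip((snr_high - snr_db) / (snr_high - snr_low), 0.0, 1.0)
    emphasized = lfilter([1.0, -0.95], [1.0], speech)    # first-order high-pass emphasis
    out = (1 - alpha) * speech + alpha * emphasized
    return out / (np.max(np.abs(out)) + 1e-12) * np.max(np.abs(speech))  # keep level comparable
```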
Subjects
Acoustics, Cell Phone, Signal Processing, Computer-Assisted, Speech Acoustics, Speech Intelligibility, Voice Quality, Acoustic Stimulation, Adult, Algorithms, Analysis of Variance, Female, Humans, Male, Noise/adverse effects, Perceptual Masking, Signal-To-Noise Ratio, Speech Reception Threshold Test, Young Adult
ABSTRACT
Artificial bandwidth extension methods have been developed to improve the quality and intelligibility of narrowband telephone speech and to reduce the difference with wideband speech. Such methods have commonly been evaluated with objective measures or subjective listening-only tests, but conversational evaluations have been rare. This article presents a conversational evaluation of two methods for the artificial bandwidth extension of telephone speech. Bandwidth-extended narrowband speech is compared with narrowband and wideband speech in a test setting including a simulated telephone connection, realistic conversation tasks, and various background noise conditions. The responses of the subjects indicate that speech processed with one of the methods is preferred to narrowband speech in noise, but wideband speech is superior to both narrowband and bandwidth-extended speech. Bandwidth extension was found to be beneficial for telephone conversation in noisy listening conditions.