Results 1-20 of 99
1.
Sensors (Basel); 22(13), 2022 Jun 29.
Article in English | MEDLINE | ID: mdl-35808423

ABSTRACT

Understanding the perception of emotions or affective states in humans is important for developing emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio-visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio-visual emotion database collected from TV broadcasts such as soap operas and movies, called the IIIT-H Audio-Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio-visual data in English. Using data from all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling, and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants' perception of emotions differed markedly between the audio-alone, video-alone, and audio-visual data. This finding emphasizes the importance of emotion-specific features over commonly used features in the development of emotion-aware systems.


Subjects
Arousal, Emotions, Humans
2.
J Acoust Soc Am; 148(2): EL141, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32873022

ABSTRACT

Voiced speech is generated by the glottal flow interacting with vocal fold vibrations. However, the details of vibrations in the anterior-posterior direction (the so-called zipper effect) and their correspondence with speech and other glottal signals are not fully understood, owing to the challenges of directly measuring vocal fold vibrations. In this proof-of-concept study, the potential of four parameters extracted from high-speed videoendoscopy (HSV), electroglottography, and speech signals to indicate the presence of a zipper-type glottal opening is investigated. Comparison with manual labeling of the HSV videos highlighted the importance of using multiple parameter-signal pairs to indicate the presence of a zipper-type glottal opening.


Subjects
Phonation, Voice, Glottis, Speech, Vibration, Vocal Cords
3.
J Acoust Soc Am; 146(5): EL418, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31795672

ABSTRACT

Existing studies on the classification of phonation types in singing use voice source features and mel-frequency cepstral coefficients (MFCCs), which perform poorly due to the high pitch of singing. In this study, high-resolution spectra obtained using the zero-time windowing (ZTW) method are utilized to capture the effect of the voice excitation. ZTW does not call for computing a source-filter decomposition (which many voice source features require), which makes it robust to high pitch. For the classification, the study proposes extracting MFCCs from the ZTW spectrum. The results show that the proposed features give a clear improvement in classification accuracy compared to the existing features.
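As an illustration of the final feature step, here is a minimal sketch that maps a precomputed magnitude spectrum to MFCCs via a mel filterbank, log compression, and a DCT. The ZTW spectrum computation itself is omitted; the `ztw_envelope` frame, sampling rate, and filterbank sizes below are stand-in assumptions, not the paper's settings.

```python
# Sketch: MFCCs from a precomputed spectral envelope (e.g., one frame of
# a ZTW-derived high-resolution spectrum) instead of an STFT frame.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_spectrum(mag_spectrum, sr=16000, n_fft=1024, n_mels=40, n_mfcc=13):
    """Map one magnitude spectrum frame (n_fft//2 + 1 bins) to MFCCs."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ mag_spectrum       # mel filterbank energies
    log_mel = np.log(mel_energies + 1e-10)     # dynamic range compression
    return dct(log_mel, type=2, norm='ortho')[:n_mfcc]

# Usage with a hypothetical ZTW envelope frame (random stand-in):
ztw_envelope = np.abs(np.random.randn(513))
feats = mfcc_from_spectrum(ztw_envelope)
```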

4.
J Acoust Soc Am; 146(4): 2501, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31671985

ABSTRACT

In the production of voiced speech, glottal flow skewing refers to the tilting of the glottal flow pulses to the right, often characterized as a delay of the peak, compared to the glottal area. In the past four decades, several studies have addressed this phenomenon by modeling voice production with analog circuits and computer simulations. However, previous studies measuring flow skewing in the natural production of speech are sparse, and they contain little quantitative data about the degree of skewing between flow and area. In the current study, flow skewing was measured from the natural production of 40 vowel utterances produced by 10 speakers. Glottal flow was estimated from speech using glottal inverse filtering, and glottal area was captured with high-speed videoendoscopy. The estimated glottal flow and area waveforms were parameterized with four robust parameters that measure pulse skewness quantitatively. Statistical tests obtained for all four parameters showed that the flow pulse was significantly more skewed to the right than the area pulse. Hence, this study corroborates the existence of flow skewing using measurements from natural speech production. In addition, the study yields quantitative data about pulse skewness in simultaneously measured glottal flow and area in the natural production of speech.
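Since the abstract does not name its four parameters, the sketch below computes one generic, illustrative skewness measure: the position of the pulse peak normalized by pulse length, where values above 0.5 indicate a pulse skewed to the right. The toy pulse shapes are hypothetical.

```python
# Sketch: a simple "relative peak position" skewness measure for one
# glottal cycle; 0 = onset, 1 = offset, 0.5 = symmetric pulse.
import numpy as np

def relative_peak_position(pulse):
    """Peak time normalized by pulse length."""
    return np.argmax(pulse) / (len(pulse) - 1)

# Toy flow and area pulses (hypothetical shapes for illustration):
t = np.linspace(0, 1, 200)
area = np.sin(np.pi * t) ** 2            # roughly symmetric pulse
flow = np.sin(np.pi * t ** 1.5) ** 2     # peak delayed, i.e. right-skewed
print(relative_peak_position(area), relative_peak_position(flow))
```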


Subjects
Glottis/physiology, Phonation/physiology, Speech/physiology, Voice/physiology, Adult, Female, Humans, Male, Middle Aged, Speech Acoustics, Speech Production Measurement
5.
J Acoust Soc Am; 141(4): EL327, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28464691

ABSTRACT

Estimation of the spectral tilt of the glottal source has several applications in speech analysis and modification. However, direct estimation of the tilt from telephone speech is challenging due to vocal tract resonances and distortion caused by speech compression. In this study, a deep neural network is used for the tilt estimation from telephone speech by training the network with tilt estimates computed by glottal inverse filtering. An objective evaluation shows that the proposed technique gives more accurate estimates for the spectral tilt than previously used techniques that estimate the tilt directly from telephone speech without glottal inverse filtering.
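A minimal sketch of the training setup described above, assuming MFCC input features and a small feedforward regressor; the actual features, network architecture, and tilt targets (here imagined as glottal-inverse-filtering-derived values, e.g., in dB/octave) are not specified in the abstract, so everything below is a stand-in.

```python
# Sketch: regress a glottal spectral-tilt value from telephone-speech
# frame features, with targets assumed to come from glottal inverse
# filtering as the abstract describes.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.randn(5000, 13)    # stand-in: MFCCs of telephone-speech frames
y = np.random.randn(5000)        # stand-in: GIF-derived tilt targets

net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200)
net.fit(X, y)
tilt_estimates = net.predict(X[:10])   # tilt predicted directly from features
```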


Subjects
Acoustics, Deep Learning, Glottis/physiology, Signal Processing, Computer-Assisted, Speech Acoustics, Speech Production Measurement/methods, Telephone, Voice Quality, Female, Humans, Male, Phonation, Sound Spectrography
6.
J Acoust Soc Am; 142(3): 1542, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28964072

ABSTRACT

Recently, a quasi-closed phase (QCP) analysis of speech signals was proposed for accurate glottal inverse filtering. However, the QCP analysis, which belongs to the family of temporally weighted linear prediction (WLP) methods, uses the conventional forward type of sample prediction. This may not be the best choice, especially when computing WLP models with a hard-limiting weighting function: a sample-selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted from both its past and its future samples, thereby utilizing the available samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach, as well as on natural speech utterances, show that the proposed QCP-FB method yields statistically significant improvements over the conventional linear prediction and QCP methods.
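As a rough illustration of forward-backward weighted prediction, the sketch below solves the weighted normal equations that arise when each sample is predicted from both its past and its future samples. The hard-limiting weight here is a hypothetical stand-in, not the actual QCP weighting function, and the solver is a simplified reading of the idea rather than the authors' implementation.

```python
# Sketch: temporally weighted linear prediction with a combined
# forward-backward squared prediction error.
import numpy as np

def wlp_forward_backward(x, w, p):
    """Weighted LP where each sample is predicted from its p past and
    its p future samples; w gives the per-sample weights."""
    n = np.arange(p, len(x) - p)                                  # full context
    past = np.stack([x[n - k] for k in range(1, p + 1)], axis=1)  # (N, p)
    futr = np.stack([x[n + k] for k in range(1, p + 1)], axis=1)  # (N, p)
    W = w[n]
    R = (past * W[:, None]).T @ past + (futr * W[:, None]).T @ futr
    r = past.T @ (W * x[n]) + futr.T @ (W * x[n])
    return np.linalg.solve(R, r)                                  # coefficients

# Hypothetical hard-limiting weight: strongly attenuate every 80th sample,
# standing in for regions near the main glottal excitation.
x = np.random.randn(400)
w = np.ones_like(x)
w[::80] = 1e-4
a = wlp_forward_backward(x, w, p=10)
```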

7.
Neuroimage; 125: 131-143, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26477651

ABSTRACT

Recent studies have shown that acoustically distorted sentences can be perceived as either unintelligible or intelligible depending on whether one has previously been exposed to the undistorted, intelligible versions of the sentences. This allows studying processes specifically related to speech intelligibility, since any change between the responses to the distorted stimuli before and after the presentation of their undistorted counterparts cannot be attributed to acoustic variability but, rather, to the successful mapping of sensory information onto memory representations. To estimate how the complexity of the message is reflected in speech comprehension, we applied this rapid change in perception to behavioral and magnetoencephalography (MEG) experiments using vowels, words, and sentences. In the experiments, stimuli were initially presented to the subject in a distorted form, after which undistorted versions of the stimuli were presented. Finally, the original distorted stimuli were presented once more. The resulting increase in intelligibility observed for the second presentation of the distorted stimuli depended on the complexity of the stimulus: vowels remained unintelligible (behaviorally measured intelligibility 27%) whereas the intelligibility of the words increased from 19% to 45% and that of the sentences from 31% to 65%. This increase in the intelligibility of the degraded stimuli was reflected as an enhancement of activity in the auditory cortex and surrounding areas at early latencies of 130-160 ms. In the same regions, increasing stimulus complexity attenuated mean currents at latencies of 130-160 ms, whereas at latencies of 200-270 ms the mean currents increased. These modulations in cortical activity may reflect feedback from top-down mechanisms enhancing the extraction of information from speech. The behavioral results suggest that memory-driven expectancies can have a significant effect on speech comprehension, especially in acoustically adverse conditions where the bottom-up information is decreased.


Subjects
Brain/physiology, Comprehension/physiology, Speech Perception/physiology, Acoustic Stimulation, Adult, Female, Humans, Magnetoencephalography, Male, Signal Processing, Computer-Assisted, Speech Intelligibility/physiology, Young Adult
8.
Eur J Neurosci; 43(6): 738-50, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26647120

ABSTRACT

Effective speech sound discrimination at preschool age is known to be a prerequisite for the development of language skills and later literacy acquisition. However, the speech specificity of cortical discrimination skills in young children is currently not known, as previous research has either studied speech functions without comparison with non-speech sounds, or used much simpler sounds such as harmonic or sinusoidal tones as control stimuli. We investigated the cortical discrimination of five syllable features (consonant, vowel, vowel duration, fundamental frequency, and intensity), covering both segmental and prosodic phonetic changes, and their acoustically matched non-speech counterparts in 63 six-year-old typically developing children, using a multi-feature mismatch negativity (MMN) paradigm. Each of the five investigated features elicited a unique pattern of differentiating negativities: an early differentiating negativity, the MMN, and a late differentiating negativity. All five studied features showed speech-related enhancement of at least one of these responses, suggesting experience-related neural commitment in both phonetic and prosodic speech processing. In addition, the cognitive performance and language skills of the children were tested extensively. The speech-related neural enhancement was positively associated with the level of performance in several neurocognitive tasks, indicating a relationship between the successful establishment of cortical memory traces for speech and enhanced cognitive functioning. The results contribute to the understanding of typical developmental trajectories of linguistic vs. non-linguistic auditory skills, and provide a reference for future studies investigating deficits in language-related disorders at preschool age.


Subjects
Cerebral Cortex/physiology, Cognition, Discrimination, Psychological, Speech Perception, Cerebral Cortex/growth & development, Child, Preschool, Female, Humans, Language Development, Male
9.
J Acoust Soc Am; 137(6): 3356-65, 2015 Jun.
Article in English | MEDLINE | ID: mdl-26093425

ABSTRACT

Natural auditory scenes often consist of several sound sources overlapping in time but separated in space. Yet, location is not fully exploited in auditory grouping: spatially separated sounds can become perceptually fused into a single auditory object, which leads to difficulties in the identification and localization of concurrent sounds. Here, the brain mechanisms responsible for grouping across spatial locations were explored in magnetoencephalography (MEG) recordings. The results show that the cortical representation of a vowel spatially separated into two locations reflects the perceived location of the speech sound rather than the physical locations of the individual components. In other words, the auditory scene is neurally rearranged to bring components into spatial alignment when they are deemed to belong to the same object. This renders the original spatial information unavailable at the level of the auditory cortex and may contribute to difficulties in concurrent sound segregation.


Subjects
Auditory Cortex/physiology, Auditory Pathways/physiology, Sound Localization, Speech Acoustics, Speech Perception, Voice Quality, Acoustic Stimulation, Humans, Magnetoencephalography, Male, Psychoacoustics, Signal Detection, Psychological, Sound Spectrography
10.
JASA Express Lett; 4(6), 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38847582

ABSTRACT

The automatic classification of phonation types in the singing voice is essential for tasks such as the identification of singing style. In this study, wavelet scattering network (WSN)-based features are proposed for the classification of phonation types in the singing voice. The WSN, which closely resembles auditory physiological models, generates acoustic features that effectively characterize information related to pitch, formants, and timbre. Hence, WSN-based features can capture the discriminative information across phonation types in the singing voice. The experimental results show that the proposed WSN-based features improved phonation classification accuracy by at least 9% compared to state-of-the-art features.
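A minimal sketch of wavelet scattering feature extraction using the kymatio library; the J and Q settings and the time-averaging into one utterance-level vector are illustrative assumptions, not necessarily the paper's configuration.

```python
# Sketch: first-order/second-order scattering coefficients for a singing
# excerpt, pooled over time into a single feature vector.
import numpy as np
from kymatio.numpy import Scattering1D

T = 2 ** 14                                    # samples per excerpt
scattering = Scattering1D(J=8, shape=T, Q=12)  # Q: filters per octave
x = np.random.randn(T).astype(np.float32)      # stand-in for a sung vowel
Sx = scattering(x)                             # (n_coeffs, n_frames) array
feature_vector = Sx.mean(axis=-1)              # time-averaged WSN features
```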

11.
Article in English | MEDLINE | ID: mdl-38669173

ABSTRACT

Many acoustic features and machine learning models have been studied to build automatic detection systems that distinguish dysarthric speech from healthy speech. These systems can help to improve the reliability of diagnosis. However, speech recorded for diagnosis in real-life clinical conditions can differ from the training data of the detection system in terms of, for example, recording conditions, speaker identity, and language. These mismatches may lead to a reduction in detection performance in practical applications. In this study, we investigate the use of the wav2vec2 model as a feature extractor together with a support vector machine (SVM) classifier to build automatic detection systems for dysarthric speech. The performance of the wav2vec2 features is evaluated in two cross-database scenarios, language-dependent and language-independent, to study their generalizability to unseen speakers, recording conditions, and languages before and after fine-tuning the wav2vec2 model. The results revealed that the fine-tuned wav2vec2 features showed better generalization in both scenarios and gave an absolute accuracy improvement of 1.46%-8.65% compared to the non-fine-tuned wav2vec2 features.
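A minimal sketch of the two-stage setup described above: wav2vec2 embeddings mean-pooled per utterance and fed to an SVM. The checkpoint name, the pooling choice, and the toy data are assumptions.

```python
# Sketch: wav2vec2 as a feature extractor, SVM as the classifier.
import torch
import numpy as np
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
from sklearn.svm import SVC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform, sr=16000):
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()     # mean-pool over time

# Stand-in data: a few 1-second utterances with binary labels.
X = np.stack([utterance_embedding(np.random.randn(16000)) for _ in range(8)])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])               # healthy vs. dysarthric
clf = SVC(kernel="rbf").fit(X, y)
```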

12.
J Acoust Soc Am; 133(4): 2377-89, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23556603

ABSTRACT

High vocal effort has characteristic acoustic effects on speech. This study focuses on the utilization of this information by human listeners and a machine-based detection system in the task of detecting shouted speech in the presence of noise. Both female and male speakers read Finnish sentences using normal and shouted voice in controlled conditions, with the sound pressure level recorded. The speech material was artificially corrupted by noise and supplemented with pure noise. The human performance level was statistically evaluated by a listening test, where the subjects labeled noisy samples according to whether shouting was heard or not. A Bayesian detection system was constructed and statistically evaluated. Its performance was compared against that of human listeners, substituting different spectrum analysis methods in the feature extraction stage. Using features capable of taking into account the spectral fine structure (i.e., the fundamental frequency and its harmonics), the machine reached the detection level of humans even in the noisiest conditions. In the listening test, male listeners detected shouted speech significantly better than female listeners, especially with speakers making a smaller vocal effort increase for shouting.
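A minimal sketch of a Bayesian detector in the spirit described above: class-conditional Gaussian mixture models over frame features, with detection by a log-likelihood ratio test. The feature dimensionality, mixture sizes, and threshold are assumptions, and the paper compares several spectrum analysis methods in the feature stage rather than the random stand-ins used here.

```python
# Sketch: GMM likelihood-ratio detection of shouted vs. normal speech.
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in training features (e.g., spectral features per frame):
normal_feats = np.random.randn(2000, 20)
shout_feats = np.random.randn(2000, 20) + 1.0

gmm_normal = GaussianMixture(n_components=8).fit(normal_feats)
gmm_shout = GaussianMixture(n_components=8).fit(shout_feats)

def detect_shout(utterance_feats, threshold=0.0):
    # score() returns the mean log-likelihood per frame
    llr = gmm_shout.score(utterance_feats) - gmm_normal.score(utterance_feats)
    return llr > threshold

print(detect_shout(shout_feats[:100]))
```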


Subjects
Acoustics/instrumentation, Loudness Perception, Noise/adverse effects, Perceptual Masking, Speech Acoustics, Speech Perception, Acoustic Stimulation, Analysis of Variance, Speech Audiometry, Bayes Theorem, Female, Humans, Male, Sex Factors, Signal Detection, Psychological, Signal Processing, Computer-Assisted, Signal-To-Noise Ratio, Sound Spectrography, Speech Production Measurement
13.
J Acoust Soc Am; 134(6): 4508, 2013 Dec.
Article in English | MEDLINE | ID: mdl-25669261

ABSTRACT

Previous studies on fusion in speech perception have demonstrated the ability of the human auditory system to group separate components of speech-like sounds together and consequently to enable the identification of speech despite the spatial separation between the components. Typically, the spatial separation has been implemented using headphone reproduction where the different components evoke auditory images at different lateral positions. In the present study, a multichannel loudspeaker system was used to investigate whether the correct vowel is identified and whether two auditory events are perceived when a noise-excited vowel is divided into two components that are spatially separated. The two components consisted of the even and odd formants. Both the amount of spatial separation between the components and the directions of the components were varied. Neither the spatial separation nor the directions of the components affected the vowel identification. Interestingly, an additional auditory event not associated with any vowel was perceived at the same time when the components were presented symmetrically in front of the listener. In such scenarios, the vowel was perceived from the direction of the odd formant components.


Subjects
Cues, Speech Acoustics, Speech Perception, Voice Quality, Acoustic Stimulation, Adult, Speech Audiometry, Auditory Threshold, Humans, Male, Pattern Recognition, Physiological, Recognition, Psychology, Sound Localization, Time Factors, Young Adult
14.
J Acoust Soc Am; 134(2): 1295-313, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23927127

ABSTRACT

All-pole modeling is a widely used formant estimation method, but its performance is known to deteriorate for high-pitched voices. To address this problem, several all-pole modeling methods robust to the fundamental frequency have been proposed. This study compares five such previously known methods and introduces a new technique, Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME). WLP-AME utilizes temporally weighted linear prediction (LP), in which the square of the prediction error is multiplied by a given parametric weighting function. The weighting downgrades the contribution of the main excitation of the vocal tract in optimizing the filter coefficients. Consequently, the resulting all-pole model is affected more by the characteristics of the vocal tract, leading to less biased formant estimates. On synthetic vowels created with a physical modeling approach, the results showed that WLP-AME yields improved formant frequencies for high-pitched sounds in comparison to the previously known methods (e.g., the relative error in the first formant of the vowel [a] decreased from 11% to 3% when conventional LP was replaced with WLP-AME). Experiments conducted on natural vowels indicate that the formants detected by WLP-AME change in a more regular manner between repetitions at different pitches than those computed by conventional LP.
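As an illustration of the weighting idea, the sketch below builds a parametric weight that dips around given glottal closure instants, attenuating the main excitation. The rectangular dip shape, its duration, and the attenuation depth are all assumptions rather than the paper's exact parameterization; the resulting weight would be plugged into a weighted LP solver such as the forward-backward sketch under entry 6.

```python
# Sketch: a parametric "attenuated main excitation" weighting function.
import numpy as np

def ame_weight(n_samples, gcis, dur=30, depth=0.01):
    """Weight ~1 everywhere, dipping to `depth` for `dur` samples
    around each glottal closure instant (GCI)."""
    w = np.ones(n_samples)
    for gci in gcis:
        lo, hi = max(gci - dur // 2, 0), min(gci + dur // 2, n_samples)
        w[lo:hi] = depth
    return w

# Hypothetical GCIs for one analysis frame:
w = ame_weight(400, gcis=[50, 130, 210, 290])
```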


Subjects
Glottis/physiology, Linear Models, Phonation, Phonetics, Pitch Perception, Speech Acoustics, Voice Quality, Adult, Algorithms, Biomechanical Phenomena, Child, Preschool, Computer Simulation, Female, Glottis/anatomy & histology, Humans, Male, Numerical Analysis, Computer-Assisted, Pattern Recognition, Automated, Pressure, Signal Processing, Computer-Assisted, Sound Spectrography, Speech Production Measurement, Time Factors, Vocal Cords/physiology
15.
Neuroimage; 60(2): 1036-45, 2012 Apr 02.
Article in English | MEDLINE | ID: mdl-22289805

ABSTRACT

Human speech perception is highly resilient to acoustic distortions. In addition to distortions from external sound sources, degradation of the acoustic structure of the sound itself can substantially reduce the intelligibility of speech. The degradation of the internal structure of speech happens, for example, when the digital representation of the signal is impoverished by reducing its amplitude resolution. Further, the perception of speech is also influenced by whether the distortion is transient, coinciding with speech, or is heard continuously in the background. However, the complex effects of the acoustic structure and continuity of the distortion on the cortical processing of degraded speech are unclear. In the present magnetoencephalography study, we investigated how the cortical processing of degraded speech sounds as measured through the auditory N1m response is affected by variation of both the distortion type (internal, external) and the continuity of distortion (transient, continuous). We found that when the distortion was continuous, the N1m was significantly delayed, regardless of the type of distortion. The N1m amplitude, in turn, was affected only when speech sounds were degraded with transient internal distortion, which resulted in larger response amplitudes. The results suggest that external and internal distortions of speech result in divergent patterns of activity in the auditory cortex, and that the effects are modulated by the temporal continuity of the distortion.


Subjects
Auditory Cortex/physiology, Phonetics, Speech Perception/physiology, Adult, Female, Humans, Magnetoencephalography, Male, Time Factors, Young Adult
16.
BMC Neurosci; 13: 157, 2012 Dec 31.
Article in English | MEDLINE | ID: mdl-23276297

ABSTRACT

BACKGROUND: The robustness of speech perception in the face of acoustic variation is founded on the ability of the auditory system to integrate the acoustic features of speech and to segregate them from background noise. This auditory scene analysis process is facilitated by top-down mechanisms, such as recognition memory for speech content. However, the cortical processes underlying these facilitatory mechanisms remain unclear. The present magnetoencephalography (MEG) study examined how the activity of auditory cortical areas is modulated by acoustic degradation and the intelligibility of connected speech. The experimental design allowed for the comparison of cortical activity patterns elicited by acoustically identical stimuli which were perceived as either intelligible or unintelligible. RESULTS: In the experiment, a set of sentences was presented to the subject in distorted, undistorted, and again in distorted form. The intervening exposure to undistorted versions of the sentences rendered the initially unintelligible, distorted sentences intelligible, as evidenced by an increase from 30% to 80% in the proportion of sentences reported as intelligible. These perceptual changes were reflected in the activity of the auditory cortex, with the auditory N1m response (~100 ms) being more prominent for the distorted stimuli than for the intact ones. In the time range of the auditory P2m response (>200 ms), the auditory cortex as well as regions anterior and posterior to this area generated a stronger response to intelligible than to unintelligible sentences. During the sustained field (>300 ms), stronger activity was elicited by degraded stimuli in the auditory cortex and by intelligible sentences in areas posterior to the auditory cortex. CONCLUSIONS: The current findings suggest that the auditory system comprises bottom-up and top-down processes which are reflected in transient and sustained brain activity. It appears that the analysis of acoustic features occurs during the first 100 ms, and sensitivity to speech intelligibility emerges in the auditory cortex and surrounding areas from 200 ms onwards. The two processes are intertwined, with the activity of auditory cortical areas being modulated by top-down processes related to memory traces of speech and supporting speech intelligibility.


Subjects
Auditory Cortex/physiology, Brain Mapping/psychology, Speech Intelligibility/physiology, Speech Perception/physiology, Speech/physiology, Acoustic Stimulation/methods, Adult, Brain Mapping/methods, Evoked Potentials, Auditory/physiology, Humans, Image Processing, Computer-Assisted/methods, Magnetoencephalography/methods, Magnetoencephalography/psychology
17.
J Acoust Soc Am; 132(6): 3990-4001, 2012 Dec.
Article in English | MEDLINE | ID: mdl-23231128

ABSTRACT

Post-filtering can be utilized to improve the quality and intelligibility of telephone speech. Previous studies have shown that energy reallocation with a high-pass type filter works effectively in improving the intelligibility of speech in difficult noise conditions. The present study introduces a signal-to-noise ratio adaptive post-filtering method that utilizes energy reallocation to transfer energy from the first formant to higher frequencies. The proposed method adapts to the level of the background noise so that, in favorable noise conditions, the post-filter has a flat frequency response, and the effect of the post-filtering increases as the level of the ambient noise increases. The performance of the proposed method is compared with a similar post-filtering algorithm and with unprocessed speech in subjective listening tests that evaluate both intelligibility and listener preference. The results indicate that both post-filtering methods maintain the quality of speech in negligible noise conditions and are able to provide an intelligibility improvement over unprocessed speech in adverse noise conditions. Furthermore, the proposed post-filtering algorithm performs better than the other post-filtering method under evaluation in moderate to difficult noise conditions, where intelligibility improvement is needed most.
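A minimal sketch of the adaptation idea, with a first-order pre-emphasis standing in for the actual energy-reallocation filter; the SNR breakpoints and the mapping from SNR to filter strength are assumptions. At high SNR the response stays flat, matching the behavior described above.

```python
# Sketch: SNR-adaptive emphasis post-filter; flat at high SNR, stronger
# high-frequency emphasis (reducing low-frequency dominance) in noise.
import numpy as np
from scipy.signal import lfilter

def adaptive_postfilter(speech, snr_db, snr_hi=25.0, snr_lo=5.0, max_beta=0.9):
    """Blend between a flat response and pre-emphasis y[n] = x[n] - b*x[n-1]
    as the estimated SNR drops from snr_hi to snr_lo."""
    t = np.clip((snr_hi - snr_db) / (snr_hi - snr_lo), 0.0, 1.0)
    beta = t * max_beta            # 0 (flat) ... max_beta (strong emphasis)
    return lfilter([1.0, -beta], [1.0], speech)

x = np.random.randn(16000)
clean_out = adaptive_postfilter(x, snr_db=30)   # essentially unprocessed
noisy_out = adaptive_postfilter(x, snr_db=0)    # strong emphasis applied
```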


Subjects
Acoustics, Cell Phone, Signal Processing, Computer-Assisted, Speech Acoustics, Speech Intelligibility, Voice Quality, Acoustic Stimulation, Adult, Algorithms, Analysis of Variance, Female, Humans, Male, Noise/adverse effects, Perceptual Masking, Signal-To-Noise Ratio, Speech Reception Threshold Test, Young Adult
18.
J Acoust Soc Am; 132(2): 848-61, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22894208

ABSTRACT

Artificial bandwidth extension methods have been developed to improve the quality and intelligibility of narrowband telephone speech and to reduce its difference from wideband speech. Such methods have commonly been evaluated with objective measures or subjective listening-only tests, but conversational evaluations have been rare. This article presents a conversational evaluation of two methods for the artificial bandwidth extension of telephone speech. Bandwidth-extended narrowband speech is compared with narrowband and wideband speech in a test setting that includes a simulated telephone connection, realistic conversation tasks, and various background noise conditions. The responses of the subjects indicate that speech processed with one of the methods is preferred to narrowband speech in noise, but wideband speech is superior to both narrowband and bandwidth-extended speech. Bandwidth extension was found to be beneficial for telephone conversation in noisy listening conditions.


Subjects
Algorithms, Signal Processing, Computer-Assisted, Speech Acoustics, Speech Intelligibility, Speech Perception, Telephone, Adult, Analysis of Variance, Speech Audiometry, Communication, Comprehension, Female, Humans, Male, Middle Aged, Noise/adverse effects, Perceptual Masking, Speech Production Measurement, Young Adult
19.
J Voice; 2022 Apr 27.
Article in English | MEDLINE | ID: mdl-35490081

ABSTRACT

Automatic voice pathology detection is a research topic that has gained increasing interest recently. Although methods based on deep learning are becoming popular, classical pipeline systems based on a two-stage architecture, consisting of a feature extraction stage and a classifier stage, are still widely used. In these classical detection systems, frame-wise computation of mel-frequency cepstral coefficients (MFCCs) is the most popular feature extraction method. However, no systematic study has been conducted to investigate the effect of the MFCC frame length on automatic voice pathology detection. In this work, we studied the effect of the MFCC frame length in voice pathology detection using three disorders (hyperkinetic dysphonia, hypokinetic dysphonia, and reflux laryngitis) from the Saarbrücken Voice Disorders (SVD) database. The detection performance was compared between speaker-dependent and speaker-independent scenarios, as well as between speaking-task-dependent and speaking-task-independent scenarios. The support vector machine (SVM), the most widely used classifier in this area of study, was used as the classifier. The results show that the detection accuracy depended on the MFCC frame length in all the scenarios studied. The best detection accuracy was obtained by using an MFCC frame length of 500 ms with a shift of 5 ms.
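A minimal sketch of extracting MFCCs with the best-performing analysis settings reported above (500 ms frame, 5 ms shift), using librosa; the coefficient count and the toy signal are assumptions.

```python
# Sketch: long-frame MFCC extraction (500 ms window, 5 ms hop).
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)   # stand-in for a voice sample

frame = int(0.500 * sr)                          # 500 ms analysis frame
hop = int(0.005 * sr)                            # 5 ms shift
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=frame, win_length=frame, hop_length=hop)
# mfccs: (13, n_frames) matrix that would be fed to the SVM classifier
```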

20.
J Voice; 2022 Nov 21.
Article in English | MEDLINE | ID: mdl-36424242

ABSTRACT

Neurogenic voice disorders (NVDs) are caused by damage or malfunction of the central or peripheral nervous system that controls vocal fold movement. In this paper, we investigate the potential of the Fisher vector (FV) encoding in the automatic detection of people with NVDs. FVs are used to convert features from the frame level (local descriptors) to the utterance level (global descriptors). At the frame level, we extract two popular cepstral representations, namely, mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction cepstral coefficients (PLPCCs), from acoustic voice signals. In addition, the MFCC features are also extracted from every frame of the glottal source signal computed using a glottal inverse filtering (GIF) technique. The global descriptors derived from the local descriptors are used to train a support vector machine (SVM) classifier. Experiments are conducted using voice signals from 80 healthy speakers and 80 patients with NVDs (40 with spasmodic dysphonia (SD) and 40 with recurrent laryngeal nerve palsy (RLNP)) taken from the Saarbruecken voice disorder (SVD) database. The overall results indicate that the use of the FV encoding leads to better identification of people with NVDs compared to the de facto temporal encoding. Furthermore, the SVM trained using the combination of FVs derived from the cepstral and glottal features provides the best overall detection performance.
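A minimal sketch of Fisher vector encoding over frame-level features using a diagonal-covariance GMM, following the standard formulation (gradients with respect to means and variances, then power and L2 normalization); the GMM size, feature dimensionality, and use of sklearn are assumptions.

```python
# Sketch: frame-level descriptors (e.g., MFCCs) -> one utterance-level FV.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """X: (T, D) local descriptors -> (2*K*D,) global descriptor."""
    T, D = X.shape
    gamma = gmm.predict_proba(X)                       # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)   # (T, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

gmm = GaussianMixture(n_components=16, covariance_type='diag')
gmm.fit(np.random.randn(4000, 13))                     # stand-in MFCC frames
fv = fisher_vector(np.random.randn(300, 13), gmm)      # one utterance -> FV
```

The resulting fixed-length vectors would then train the SVM, regardless of utterance duration.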
