ABSTRACT
The relationship between formant frequencies and vocal tract length (VTL) has been studied intensively over the years. By contrast, the connection between VTL and mel-frequency cepstral coefficients (MFCCs), which concisely encode the overall shape of a speaker's spectral envelope with just a few cepstral coefficients, has been analyzed only modestly and merits further investigation. Using different statistical models, this article therefore explores the advantages and disadvantages of the relatively novel MFCC-based approach in contrast to the more traditional formant-based one. In addition, VTL is usually assumed to be a static, inherent characteristic of a speaker; that is, a single length parameter is typically estimated per speaker. In this paper we instead consider VTL estimation from a dynamic perspective, using real-time Magnetic Resonance Imaging (rtMRI) to measure VTL in parallel with the audio signal. The experiments use data from the USC-TIMIT magnetic resonance videos, which allow 2D real-time analysis of the articulators in motion. We observed that MFCCs achieve higher performance in speaker-dependent modeling; in cross-speaker modeling, however, where different speakers' data are used for training and evaluation, their performance is not significantly different from that obtained with formants. We also note that MFCC-based estimation is robust and has an acceptable computational time complexity, consistent with the traditional approach.
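To make the traditional formant-based view concrete, the following minimal sketch estimates VTL from a set of formant frequencies. It does not reproduce the statistical models evaluated in the article. It assumes the classical uniform-tube approximation (a tube closed at the glottis and open at the lips), in which the n-th resonance is F_n = (2n - 1)c / (4L). The function name `vtl_from_formants` and the speed-of-sound constant are illustrative assumptions, not taken from the study.

```python
# Assumed speed of sound in warm, humid air, in cm/s (illustrative value).
SPEED_OF_SOUND_CM_S = 35000.0


def vtl_from_formants(formants_hz):
    """Estimate vocal tract length (cm) under a uniform-tube assumption.

    Each formant F_n gives its own estimate L_n = (2n - 1) * c / (4 * F_n);
    the per-formant estimates are then averaged.
    """
    if not formants_hz:
        raise ValueError("at least one formant frequency is required")
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * f)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return sum(estimates) / len(estimates)


# Formants of an idealized neutral vowel (hypothetical input values).
print(vtl_from_formants([500.0, 1500.0, 2500.0]))
```

Applied frame by frame to formant tracks, such an estimator would yield a time-varying VTL value, which is the kind of dynamic quantity the rtMRI measurements are compared against.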
ABSTRACT
Acoustic-to-articulatory inversion, which seeks to estimate the positions of the articulators from the acoustic information in the speech signal, offers a variety of potential applications in the field of speech processing; however, it remains an unsolved problem. In this context, finding acoustic representations that can increase the performance of acoustic-to-articulatory inversion systems is an important task. This paper analyzes the relevance of formants as inputs to such inversion systems from an analytical and a statistical perspective. The analytical study is based on an articulatory synthesizer that simulates the equation of the concatenated tubes that model the vocal tract. The statistical analysis is based on real data from an electromagnetic articulograph, for which we estimate the statistical association between acoustic features and articulator movements. An information measure is used as the measure of statistical association. The results of the analysis are corroborated on a neural-network-based acoustic-to-articulatory inversion system. The use of formants as inputs led to improvements of 2.2% and 2.8% in the root-mean-square error and correlation values, respectively.
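The two performance measures reported above can be illustrated with a short sketch. It assumes that the correlation value is the Pearson correlation between estimated and measured articulator trajectories, which the abstract does not state explicitly. The function names and the sample trajectories are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length trajectories."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, measured):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(measured)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)


# Hypothetical articulator positions (e.g. in mm) for a few frames.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(rmse(predicted, measured), pearson(predicted, measured))
```

A lower root-mean-square error and a higher correlation both indicate that the estimated trajectory follows the measured one more closely, so the two measures are usually reported together.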