Search | BVS Infectious and Parasitic Diseases

Cycle consistent network for end-to-end style transfer TTS training.

Xue, Liumeng; Pan, Shifeng; He, Lei; Xie, Lei; Soong, Frank K.

Neural Netw; 140: 223-236, 2021 Aug.

Article in English | MEDLINE | ID: mdl-33780874

ABSTRACT

In this paper, we propose an end-to-end TTS model based on a cycle-consistent network for speaking style transfer, covering intra-speaker, inter-speaker, and unseen-speaker style transfer for both parallel and non-parallel transfers. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. Such a model is usually trained in a paired manner, that is, the reference speech matches the output in speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path that has a separate variational style encoder. The unpaired path takes an unpaired reference speech as input and yields an unpaired output. The unpaired output, which lacks a direct ground-truth target, is then constrained by a carefully designed cycle-consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and the backward transfer yields an output that is expected to be the same as the original unpaired reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE-based systems in all six style transfer categories, in terms of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
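
The forward and backward transfers described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: speech is a toy three-number vector, `toy_tts`, `toy_style_encoder`, and `lossy_style_encoder` are hypothetical stand-ins for the real networks, and the L1 distance is just one plausible choice of reconstruction loss.

```python
from typing import Callable, List

Speech = List[float]  # toy stand-in for an acoustic feature sequence


def l1_distance(a: Speech, b: Speech) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    tts: Callable[[float, float, float], Speech],
    style_encoder: Callable[[Speech], float],
    ref_speech: Speech,
    ref_text: float,
    ref_speaker: float,
    src_text: float,
    src_speaker: float,
) -> float:
    # Forward transfer: source text and speaker, style taken from the
    # unpaired reference. This output has no ground-truth target.
    forward = tts(src_text, src_speaker, style_encoder(ref_speech))
    # Backward transfer: the forward output now acts as the reference,
    # while text and speaker are those of the original reference.
    backward = tts(ref_text, ref_speaker, style_encoder(forward))
    # The backward output should reproduce the original reference speech.
    return l1_distance(backward, ref_speech)


# Hypothetical toy components: speech is simply [text, speaker, style].
def toy_tts(text: float, speaker: float, style: float) -> Speech:
    return [text, speaker, style]


def toy_style_encoder(speech: Speech) -> float:
    return speech[2]  # recovers the style exactly


def lossy_style_encoder(speech: Speech) -> float:
    return 0.5 * speech[2]  # loses half of the style on every pass


reference = toy_tts(1.0, 2.0, 0.8)
perfect_loss = cycle_consistency_loss(
    toy_tts, toy_style_encoder, reference, 1.0, 2.0, 3.0, 4.0
)
lossy_loss = cycle_consistency_loss(
    toy_tts, lossy_style_encoder, reference, 1.0, 2.0, 3.0, 4.0
)
```

With the exact toy encoder the style survives both passes and the loss is zero. With the lossy encoder the style degrades on each pass, so the loss is positive; this is the kind of signal a cycle constraint can supply when no direct target exists.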

Subjects

Machine Learning, Speech Recognition Software, Writing

Effective and direct control of neural TTS prosody by removing interactions between different attributes.

An, Xiaochun; Soong, Frank K; Yang, Shan; Xie, Lei.

Neural Netw; 143: 250-260, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34157649

ABSTRACT

Advances in end-to-end TTS have shown that synthesized speech prosody can be controlled by conditioning the decoder on speech prosody attribute labels. However, quantitatively annotating the prosody patterns of a large set of training data is both time-consuming and expensive. To use unannotated data, the variational autoencoder (VAE) has been proposed to model each individual prosody attribute as a random variable in the latent space. The VAE is an unsupervised approach, and the corresponding latent variables are in general correlated with each other. For more effective and direct control of speech prosody along each attribute dimension, it is highly desirable to disentangle the correlated latent variables. Additionally, being able to interpret the disentangled attributes as speech perceptual cues is useful for designing more efficient prosody control of TTS. In this paper, we propose two attribute separation schemes: (1) using three separate VAEs to model the different real-valued prosodic features, i.e., F0, energy, and duration; (2) minimizing the mutual information between different prosody attributes to remove their mutual correlations and facilitate more direct prosody control. Experimental results confirm that the two proposed schemes can indeed make individual prosody attributes more interpretable and direct TTS prosody control more effective. The improvements are measured objectively by F0 Frame Error (FFE) and subjectively with MOS and A/B comparison listening tests. The t-SNE scatter diagrams also demonstrate the correlations between prosody attributes, which are well disentangled by minimizing their mutual information. Synthesized TTS samples can be found at https://xiaochunan.github.io/prosody/index.html.
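
The abstract names F0 Frame Error (FFE) as its objective metric without defining it. The sketch below follows the commonly used definition: the fraction of frames with either a voicing decision error or a gross pitch error. Several details are assumptions and are not stated in the abstract: the two F0 tracks are already time-aligned frame by frame, a value of 0 marks an unvoiced frame, and a gross pitch error is a deviation of more than 20% from the reference. The function name `f0_frame_error` is hypothetical.

```python
from typing import Sequence


def f0_frame_error(
    ref_f0: Sequence[float],
    syn_f0: Sequence[float],
    tolerance: float = 0.2,
) -> float:
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumes both tracks are frame-aligned and that 0 marks an unvoiced frame.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 tracks must have the same number of frames")
    if not ref_f0:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one track is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            # Gross pitch error: both voiced, but F0 deviates too much.
            errors += 1
    return errors / len(ref_f0)


# Frame by frame: correct unvoiced, 10% off (ok), 30% off (pitch error),
# voiced vs. unvoiced (voicing error), unvoiced vs. voiced (voicing error).
reference_f0 = [0.0, 100.0, 100.0, 200.0, 0.0]
synthesized_f0 = [0.0, 110.0, 130.0, 0.0, 50.0]
ffe = f0_frame_error(reference_f0, synthesized_f0)  # 3 of 5 frames are wrong
```

In practice the synthesized and reference utterances usually differ in duration, so an alignment step is needed before the frames can be compared; that step is omitted here.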

Subjects

Speech Perception, Speech, Cues

Tone recognition in continuous Cantonese speech using supratone models.

Qian, Yao; Lee, Tan; Soong, Frank K.

J Acoust Soc Am; 121(5 Pt1): 2936-45, 2007 May.

Article in English | MEDLINE | ID: mdl-17550191

ABSTRACT

This paper studies automatic tone recognition in continuous Cantonese speech. Cantonese is a major Chinese dialect that is known for being rich in tones. Tone information serves as a useful knowledge source for automatic speech recognition of Cantonese. Cantonese tone recognition is difficult because the tones have similarly shaped pitch contours and are differentiated mainly by their relative pitch heights. In natural speech, the pitch level of a tone may shift up and down, and the F0 ranges of different tones overlap with each other, making them acoustically indistinguishable within the domain of a single syllable. Our study shows that the relative pitch heights are largely preserved between neighboring tones. A novel method of supratone modeling is proposed for Cantonese tone recognition. Each supratone model characterizes the F0 contour of two or three tones in succession. The tone sequence of a continuous utterance is formed as an overlapped concatenation of supratone units. The most likely tone sequence is determined under phonological constraints on syllable-tone combinations. The proposed method attains an accuracy of 74.68% in speaker-independent tone recognition experiments. In particular, the confusion among tones with similar contour shapes is largely resolved.
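
The "overlapped concatenation of supratone units" can be made concrete with a small sketch. It covers only the bookkeeping: how a tone sequence breaks into overlapping two-tone or three-tone units and how such units chain back into one sequence. It does not model F0 contours or the recognition search. The function names and the example tone numbers are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def to_supratone_units(tones: Sequence[int], span: int = 2) -> List[Tuple[int, ...]]:
    """Split a tone sequence into overlapping units of `span` successive tones."""
    if span < 1:
        raise ValueError("span must be at least 1")
    return [tuple(tones[i:i + span]) for i in range(len(tones) - span + 1)]


def from_supratone_units(units: Sequence[Tuple[int, ...]]) -> List[int]:
    """Rebuild the tone sequence from units that overlap by all but one tone."""
    if not units:
        raise ValueError("at least one unit is required")
    tones = list(units[0])
    for prev, unit in zip(units, units[1:]):
        # Adjacent units must agree on the tones they share.
        if prev[1:] != unit[:-1]:
            raise ValueError(f"units {prev} and {unit} do not overlap consistently")
        tones.append(unit[-1])
    return tones


utterance_tones = [2, 5, 1, 3]  # illustrative tone labels, one per syllable
ditone_units = to_supratone_units(utterance_tones, span=2)
tritone_units = to_supratone_units(utterance_tones, span=3)
rebuilt = from_supratone_units(ditone_units)
```

Because each unit spans neighboring syllables, a model attached to it can capture the relative pitch height between adjacent tones, which the abstract identifies as the cue that is largely preserved in continuous speech.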

Subjects

Language, Pitch Perception, Speech Perception, Female, Humans, Male, Recognition (Psychology), Sound Spectrography, Speech Acoustics, Speech Discrimination Tests
