Research on Speech Synthesis Based on Mixture Alignment Mechanism.

Deng, Yan; Wu, Ning; Qiu, Chengjun; Chen, Yan; Gao, Xueshan

Affiliation

Deng Y; School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China.
Wu N; Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou 535011, China.
Qiu C; College of Mechanical Naval Architecture and Ocean Engineering, Beibu Gulf University, Qinzhou 535011, China.
Chen Y; Guangxi Key Laboratory of Ocean Engineering Equipment and Technology, Qinzhou 535011, China.
Gao X; School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China.

Sensors (Basel); 23(16), 2023 Aug 20.

Article in English | MEDLINE | ID: mdl-37631819

ABSTRACT

In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment information between text sequences and the mel-spectrogram. It uses a linguistic encoder that combines soft phoneme-level alignment with hard word-level alignment to explicitly extract word-level semantic information, and it introduces pitch and energy predictors to improve prediction of the rhythmic information of the audio. In addition, Mixture-TTS introduces a post-net based on a five-layer 1D convolutional network to improve the reconstruction of the mel-spectrogram; the output of the decoder is connected to the post-net through a residual connection. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. We evaluate Mixture-TTS on the AISHELL3 and LJSpeech datasets. Experimental results show that Mixture-TTS aligns text sequences and mel-spectrograms somewhat better and is able to synthesize high-quality audio. Ablation studies demonstrate that the structure of Mixture-TTS is effective.
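The post-net step described in the abstract (a five-layer 1D convolution whose output is added back to the decoder output) can be illustrated with a minimal NumPy sketch. Only the layer count and the residual connection come from the abstract. The hidden width, kernel size, tanh activations, random weights, and the function names `conv1d_same`, `init_postnet`, and `postnet_refine` are assumptions made for illustration, and normalization and dropout are omitted; this is not the authors' implementation.

```python
import numpy as np


def conv1d_same(x, weight, bias):
    """1D convolution over time with zero padding so the frame count is kept.

    x: (in_channels, frames); weight: (out_channels, in_channels, kernel);
    bias: (out_channels,). Assumes an odd kernel size.
    """
    out_ch, _, k = weight.shape
    pad = k // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    frames = x.shape[1]
    out = np.empty((out_ch, frames))
    for t in range(frames):
        window = x_pad[:, t:t + k]  # (in_channels, kernel)
        out[:, t] = np.tensordot(weight, window, axes=([1, 2], [0, 1])) + bias
    return out


def init_postnet(n_mels=80, hidden=256, kernel=5, n_layers=5, seed=0):
    """Random weights for an n_layers post-net (sizes are assumed, not from the paper)."""
    rng = np.random.default_rng(seed)
    channels = [n_mels] + [hidden] * (n_layers - 1) + [n_mels]
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        w = rng.normal(0.0, 0.01, size=(c_out, c_in, kernel))
        layers.append((w, np.zeros(c_out)))
    return layers


def postnet_refine(mel, layers):
    """Apply the conv stack, then add the result to the decoder output (residual)."""
    h = mel
    for i, (w, b) in enumerate(layers):
        h = conv1d_same(h, w, b)
        if i < len(layers) - 1:  # assumed: tanh on all but the last layer
            h = np.tanh(h)
    return mel + h


# Usage: refine a stand-in decoder output of 80 mel bins by 120 frames.
decoder_mel = np.random.default_rng(1).normal(size=(80, 120))
refined_mel = postnet_refine(decoder_mel, init_postnet())
print(refined_mel.shape)
```

Because the post-net output is added to its input, the convolution stack only has to learn a correction to the decoder's mel-spectrogram rather than regenerate it from scratch.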

Subjects

Linguistics; Speech; Machine Learning; Semantics

Keywords

acoustic signal processing; deep learning; mixture attention mechanism; speech synthesis

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Speech / Linguistics Type of study: Prognostic_studies Language: En Journal: Sensors (Basel) Year of publication: 2023 Document type: Article