Plain-to-clear speech video conversion for enhanced intelligibility.
Sachdeva, Shubam; Ruan, Haoyao; Hamarneh, Ghassan; Behne, Dawn M; Jongman, Allard; Sereno, Joan A; Wang, Yue.
Affiliation
  • Sachdeva S; Language and Brain Lab, Department of Linguistics, Simon Fraser University, Burnaby, BC Canada.
  • Ruan H; Language and Brain Lab, Department of Linguistics, Simon Fraser University, Burnaby, BC Canada.
  • Hamarneh G; Medical Image Analysis Research Group, School of Computing Science, Simon Fraser University, Burnaby, BC Canada.
  • Behne DM; NTNU Speech Lab, Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway.
  • Jongman A; KU Phonetics and Psycholinguistics Lab, Department of Linguistics, University of Kansas, Lawrence, KS USA.
  • Sereno JA; KU Phonetics and Psycholinguistics Lab, Department of Linguistics, University of Kansas, Lawrence, KS USA.
  • Wang Y; Language and Brain Lab, Department of Linguistics, Simon Fraser University, Burnaby, BC Canada.
Int J Speech Technol ; 26(1): 163-184, 2023.
Article in En | MEDLINE | ID: mdl-37008883
ABSTRACT
Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels, produced by multiple male and female talkers. Using a frame-by-frame, image-warping-based video-generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI lip reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles and achieve enhanced intelligibility for AI; (2) this work suggests that universal, talker-independent clear-speech features may be used to modify any talker's visual speech style; (3) we introduce the "displacement factor" as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the generated videos are high definition, making them ideal candidates for human-centric intelligibility and perceptual-training studies.
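The abstract's "displacement factor" can be pictured as a scalar that interpolates facial landmarks between plain-speech positions and the positions implied by the extracted clear-speech features. The following is a minimal, hypothetical sketch of that idea; the names `plain_landmarks` and `clear_displacements` are illustrative assumptions, and the paper's actual pipeline operates via frame-by-frame image warping on video, not on bare landmark lists:

```python
# Hedged sketch: scaling landmark displacements by a "displacement factor".
# factor = 0.0 leaves plain speech unchanged; factor = 1.0 applies the full
# extracted clear-speech displacement; intermediate values interpolate.

def apply_displacement(plain_landmarks, clear_displacements, factor):
    """Shift each (x, y) landmark toward its clear-speech position."""
    return [
        (x + factor * dx, y + factor * dy)
        for (x, y), (dx, dy) in zip(plain_landmarks, clear_displacements)
    ]

# Example: two lip landmarks, displaced halfway toward clear-speech targets
# (clear speech typically involves larger mouth openings).
plain = [(100.0, 200.0), (140.0, 200.0)]
disp = [(0.0, -10.0), (0.0, 10.0)]
half = apply_displacement(plain, disp, 0.5)
```

In the paper's method, such per-landmark displacements would drive an image warp of each video frame, so the same scalar controls how strongly the plain-speech video is pushed toward a clear-speech articulation style.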
Keywords

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Int J Speech Technol Year of publication: 2023 Document type: Article