Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement.
Ravi, Vijay; Wang, Jinhan; Flint, Jonathan; Alwan, Abeer.
Affiliation
  • Ravi V; Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, 90095, USA.
  • Wang J; Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, 90095, USA.
  • Flint J; Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA, 90095, USA.
  • Alwan A; Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, 90095, USA.
Comput Speech Lang; 86, 2024 Jun.
Article in En | MEDLINE | ID: mdl-38313320
ABSTRACT
Speech signals are valuable biomarkers for assessing an individual's mental health, including identifying Major Depressive Disorder (MDD) automatically. A frequently used approach is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker-identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could act as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using cross-entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, yielded improved F1 scores for MDD detection and improved voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (GVD) and De-Identification score (DeID). On the DAIC-WOZ dataset (English), LECE using ComParE16 features achieves the best F1-Score of 80%, the audio-only state-of-the-art (SOTA) for depression detection, along with a GVD of -1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using the raw audio signal achieves an F1-Score of 72.38%, surpassing the multi-modal SOTA, along with a GVD of -0.89 dB and a DeID of 51.21%. By reducing dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.
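The core idea named in the abstract, training a shared encoder so that speaker identity becomes hard to recover while depression cues survive, can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the authors' implementation: the two-head architecture, feature and hidden dimensions, the lambda weight, the gradient-reversal realization of ADV, and the uniform-target formulations of LECE/LEKLD below are all illustrative choices; only the high-level roles of the losses come from the abstract.

```python
# Hedged sketch of speaker-disentangled depression detection (PyTorch).
# All architecture details and hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward
    pass, so minimizing the SID loss through this layer maximizes it
    with respect to the encoder (one way to realize the ADV variant)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DisentangledModel(nn.Module):
    def __init__(self, feat_dim=130, hidden=128, n_speakers=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.dep_head = nn.Linear(hidden, 2)          # depressed / control
        self.sid_head = nn.Linear(hidden, n_speakers) # speaker classifier
    def forward(self, x, lam=1.0):
        z = self.encoder(x)
        dep_logits = self.dep_head(z)
        sid_logits = self.sid_head(GradReverse.apply(z, lam))
        return dep_logits, sid_logits

def uniformity_loss(sid_logits, kind="kld"):
    """Loss-equalization alternative to gradient reversal: push the SID
    posterior toward uniform. 'kld' loosely corresponds to LEKLD and
    'ce' to LECE (names from the abstract; the paper's exact
    formulations may differ)."""
    log_p = F.log_softmax(sid_logits, dim=-1)
    uniform = torch.full_like(log_p, 1.0 / sid_logits.size(-1))
    if kind == "kld":
        return F.kl_div(log_p, uniform, reduction="batchmean")
    return -(uniform * log_p).sum(dim=-1).mean()

# Illustrative single training step (random data stands in for features).
model = DisentangledModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 130)            # batch of utterance-level features
y_dep = torch.randint(0, 2, (8,))  # depression labels
y_sid = torch.randint(0, 100, (8,))# speaker labels
dep_logits, sid_logits = model(x, lam=0.5)
loss = F.cross_entropy(dep_logits, y_dep) + F.cross_entropy(sid_logits, y_sid)
opt.zero_grad()
loss.backward()  # reversed SID gradients push speaker info out of the encoder
opt.step()
```

A design note on the two families: gradient reversal actively maximizes the SID loss, which can destabilize training, whereas the equalization terms only drive the speaker posterior toward chance level, a gentler target that still removes usable identity information from the shared representation.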
Full text: 1 | Database: MEDLINE | Language: En | Publication year: 2024 | Document type: Article