ABSTRACT
OBJECTIVES: There is currently a lack of objective treatment outcome measures for transgender individuals undergoing gender-affirming voice care. Recently, Bensoussan et al. developed an AI model that generates a voice femininity rating from a short voice sample provided through a smartphone application. The purpose of this study was to examine the feasibility of using this model as a treatment outcome measure by comparing its performance with that of human listeners. Additionally, we examined the effect of two different training datasets on the model's accuracy and performance when presented with external data.

METHODS: One hundred voice recordings from 50 cisgender males and 50 cisgender females were retrospectively collected from patients presenting at a university voice clinic for reasons other than dysphonia. The recordings were evaluated by expert and naïve human listeners, who rated each voice on how certain they were that it belonged to a female speaker (% voice femininity [R]). Human ratings were compared with ratings generated by (1) the AI model trained on a high-quality, low-quantity dataset (voices from the Perceptual Voice Quality Database; PVQD model) and (2) the AI model trained on a low-quality, high-quantity dataset (voices from the Mozilla Common Voice database; Mozilla model). Ambiguity scores were calculated as the absolute value of the difference between the rating and certainty (0 or 100%).

RESULTS: Both expert and naïve listeners achieved 100% accuracy in identifying voice gender based on a binary classification (female >50% voice femininity [R]). In comparison, the Mozilla-trained model achieved 92% accuracy and the previously published PVQD model achieved 84% accuracy in determining voice gender (female >50% AI voice femininity). While both AI models correlated with human ratings, the Mozilla-trained model showed a stronger correlation and lower overall rating ambiguity than the PVQD-trained model.
The Mozilla model also appeared to handle pitch information in a manner similar to human raters.

CONCLUSIONS: The AI model predicted voice gender with high accuracy compared with human listeners and has potential as a useful outcome measure for transgender individuals receiving gender-affirming voice training. The Mozilla-trained model outperformed the PVQD-trained model, indicating that for binary classification tasks, data quantity may influence accuracy more than data quality when training voice AI models.
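The ambiguity score described in the methods (the absolute difference between a 0–100% femininity rating and certainty at 0 or 100%) can be sketched as follows. This is a minimal illustration of one plausible reading of that definition: the distance from the rating to the nearer certainty endpoint, so that 50% is maximally ambiguous and 0% or 100% is fully certain. The function name is ours, not from the study.

```python
def ambiguity_score(rating: float) -> float:
    """Hypothetical implementation of the abstract's ambiguity score.

    Takes a voice femininity rating on a 0-100% scale and returns the
    absolute difference between the rating and the nearer certainty
    endpoint (0 or 100), assuming "certainty" means the closer of the two.
    """
    return min(abs(rating - 0), abs(rating - 100))

# A 50% rating is maximally ambiguous; endpoint ratings are fully certain.
print(ambiguity_score(50))   # 50
print(ambiguity_score(90))   # 10
print(ambiguity_score(0))    # 0
```

Under this reading, lower mean ambiguity across recordings would indicate a model that commits more decisively to one gender category, consistent with the comparison reported between the Mozilla- and PVQD-trained models.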