Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias.

Yu, Zehao; Peng, Cheng; Yang, Xi; Dang, Chong; Adekkanattu, Prakash; Gopal Patra, Braja; Peng, Yifan; Pathak, Jyotishman; Wilson, Debbie L; Chang, Ching-Yuan; Lo-Ciganic, Wei-Hsuan; George, Thomas J; Hogan, William R; Guo, Yi; Bian, Jiang; Wu, Yonghui

Affiliation

Yu Z; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
Peng C; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
Yang X; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
Dang C; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
Adekkanattu P; Information Technologies and Services, Weill Cornell Medicine, New York, NY, USA.
Gopal Patra B; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
Peng Y; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
Pathak J; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
Wilson DL; Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA.
Chang CY; Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA.
Lo-Ciganic WH; Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA.
George TJ; Division of Hematology & Oncology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA.
Hogan WR; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
Guo Y; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
Bian J; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
Wu Y; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA. Electronic address: yonghui.wu@ufl.edu.

J Biomed Inform; 153: 104642, 2024 May.

Article in English | MEDLINE | ID: mdl-38621641

ABSTRACT

OBJECTIVE:

To develop a natural language processing (NLP) package to extract social determinants of health (SDoH) from clinical narratives, examine the bias among race and gender groups, test the generalizability of SDoH extraction across disease groups, and examine the population-level extraction ratio.

METHODS:

We developed SDoH corpora using clinical notes identified at University of Florida (UF) Health. We systematically compared 7 transformer-based large language models (LLMs) and developed an open-source package, SODA (i.e., SOcial DeterminAnts), to facilitate SDoH extraction from clinical narratives. We examined the performance and potential bias of SODA for different race and gender groups, tested the generalizability of SODA using two disease domains (cancer and opioid use), and explored strategies for improvement. We applied SODA to extract 19 categories of SDoH from the breast (n = 7,971), lung (n = 11,804), and colorectal cancer (n = 6,240) cohorts to assess the patient-level extraction ratio and examine the differences among race and gender groups.

RESULTS:

We developed an SDoH corpus using 629 clinical notes of cancer patients with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH, and another cross-disease validation corpus using 200 notes from opioid use patients with 4,342 SDoH concepts/attributes. We compared 7 transformer models and the GatorTron model achieved the best mean average strict/lenient F1 scores of 0.9122 and 0.9367 for SDoH concept extraction and 0.9584 and 0.9593 for linking attributes to SDoH concepts. There is a small performance gap (∼4%) between Males and Females, but a large performance gap (>16%) among race groups. The performance dropped when we applied the cancer SDoH model to the opioid cohort; fine-tuning using a smaller opioid SDoH corpus improved the performance. The extraction ratio varied in the three cancer cohorts, in which 10 SDoH could be extracted from over 70% of cancer patients, but 9 SDoH could be extracted from less than 70% of cancer patients. Individuals from the White and Black groups have a higher extraction ratio than other minority race groups.
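
The abstract does not spell out the strict and lenient F1 criteria. A minimal sketch, assuming the convention common in clinical concept extraction (strict requires identical span boundaries and category; lenient requires only overlapping spans with the same category), could look like the following. The `Span` type, the `f1_score` helper, and the category labels are hypothetical and are not taken from the SODA package.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # SDoH category (hypothetical labels below)


def _match(a: Span, b: Span, lenient: bool) -> bool:
    """Return True if two spans count as the same concept."""
    if a.label != b.label:
        return False
    if lenient:
        # Lenient: any character overlap is enough.
        return a.start < b.end and b.start < a.end
    # Strict: boundaries must be identical.
    return a.start == b.start and a.end == b.end


def f1_score(gold: list, pred: list, lenient: bool = False) -> float:
    """Span-level F1 under the assumed strict or lenient criterion."""
    matched_pred = sum(any(_match(g, p, lenient) for g in gold) for p in pred)
    matched_gold = sum(any(_match(g, p, lenient) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the second prediction has the right category but a
# shifted left boundary, so it counts only under the lenient criterion.
gold = [Span(0, 7, "Tobacco_use"), Span(20, 30, "Employment")]
pred = [Span(0, 7, "Tobacco_use"), Span(22, 30, "Employment")]
strict = f1_score(gold, pred)
lenient = f1_score(gold, pred, lenient=True)
```

Under these assumptions, a boundary error lowers the strict score but not the lenient one, which is consistent with the lenient F1 reported above being the higher of each pair.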

CONCLUSIONS:

Our SODA package achieved good performance in extracting 19 categories of SDoH from clinical narratives. The SODA package with pre-trained transformer models is available at https://github.com/uf-hobi-informatics-lab/SODA_Docker.

Subjects

Narration; Natural Language Processing; Social Determinants of Health; Humans; Female; Male; Bias; Electronic Health Records; Documentation/methods; Data Mining/methods

Keywords

Cancer; Clinical concept extraction; Large language model; Natural language processing; Social determinants of health; Transformer

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Natural Language Processing / Narration / Social Determinants of Health Limits: Female / Humans / Male Language: En Publication year: 2024 Document type: Article
