A compressed large language model embedding dataset of ICD 10 CM descriptions.

Kane, Michael J; King, Casey; Esserman, Denise; Latham, Nancy K; Greene, Erich J; Ganz, David A

Affiliation

Kane MJ; Department of Biostatistics, School of Public Health, Yale University, New Haven, USA. michael.kane@yale.edu.
King C; The Jackson School of Global Affairs, Yale University, New Haven, USA.
Esserman D; US Healthcare and Life Sciences Microsoft, Redmond, USA.
Latham NK; Department of Biostatistics, School of Public Health, Yale University, New Haven, USA.
Greene EJ; Research Program in Men's Health: Aging and Metabolism, Boston Claude D. Pepper Older Americans Independence Center for Function Promoting Therapies, Brigham and Women's Hospital, Boston, USA.
Ganz DA; Department of Biostatistics, School of Public Health, Yale University, New Haven, USA.

BMC Bioinformatics; 24(1): 482, 2023 Dec 17.

Article in English | MEDLINE | ID: mdl-38105180

ABSTRACT

This paper presents novel datasets providing numerical representations of ICD-10-CM codes, obtained by generating description embeddings with a large language model and then reducing their dimension with an autoencoder. The embeddings serve as informative input features for machine learning models by capturing relationships among categories and preserving inherent context information. The model generating the data was validated in two ways. First, the dimension reduction was validated using an autoencoder; second, a supervised model was created to estimate the ICD-10-CM hierarchical categories. Results show that the data can be reduced to as few as 10 dimensions while maintaining the ability to reproduce the original embeddings, with fidelity decreasing as the number of reduced dimensions decreases. Multiple compression levels are provided, so users can choose the level that suits their requirements, then download and use the data without any other setup. The readily available datasets of ICD-10-CM codes are anticipated to be highly valuable for researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
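The trade-off between compression level and reconstruction fidelity described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it swaps in a linear autoencoder fitted by truncated SVD (equivalent to PCA) in place of the paper's autoencoder, and it uses synthetic random vectors as a stand-in for LLM description embeddings. The function names, the data shape, and the bottleneck sizes are all hypothetical.

```python
import numpy as np

def linear_autoencoder(X, k):
    """Fit the best rank-k linear encoder/decoder for X via truncated SVD."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T  # (d, k) projection onto the top-k directions

    def encode(A):
        return (A - mean) @ W

    def decode(Z):
        return Z @ W.T + mean

    return encode, decode

def reconstruction_mse(X, k):
    """Mean squared error after compressing X to k dimensions and back."""
    encode, decode = linear_autoencoder(X, k)
    return float(np.mean((X - decode(encode(X))) ** 2))

# Synthetic stand-in for description embeddings (hypothetical data,
# not the published dataset): 500 vectors in 64 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 64))
embeddings = latent @ mixing + 0.05 * rng.normal(size=(500, 64))

# Reconstruction error grows as the bottleneck shrinks.
errors = {k: reconstruction_mse(embeddings, k) for k in (50, 20, 10)}
for k, mse in errors.items():
    print(f"k={k:>2}  reconstruction MSE={mse:.6f}")
```

A nonlinear autoencoder, as used in the paper, can capture structure that a linear projection misses, so this sketch only shows the general pattern of fidelity falling with the reduced dimension.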

Subjects

Electronic Health Records; International Classification of Diseases; Language; Machine Learning; Natural Language Processing

Keywords

Autoencoder; EHR; Electronic health records; ICD-10-CM; Large language model; NLP

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: International Classification of Diseases / Electronic Health Records Language: En Publication year: 2023 Document type: Article