Comparing natural language processing representations of coded disease sequences for prediction in electronic health records.
Beaney, Thomas; Jha, Sneha; Alaa, Asem; Smith, Alexander; Clarke, Jonathan; Woodcock, Thomas; Majeed, Azeem; Aylin, Paul; Barahona, Mauricio.
Affiliations
  • Beaney T; Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
  • Jha S; Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Alaa A; Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Smith A; Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Clarke J; Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG, United Kingdom.
  • Woodcock T; Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Majeed A; Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
  • Aylin P; Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
  • Barahona M; Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
J Am Med Inform Assoc; 31(7): 1451-1462, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38719204
ABSTRACT

OBJECTIVE:

Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.

MATERIALS AND METHODS:

This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.
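
The comparison described above (an order-blind representation versus an order-aware one, each passed unchanged to a logistic classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four toy patients, their labels, and all function names are invented, and bigram counts merely stand in for the sequence-based algorithms (doc2vec and the transformer models) used in the study. The toy data are built so that event order alone determines the outcome, which exaggerates the gap; the study itself reports only a small absolute improvement.

```python
import math
from collections import Counter

# Invented toy data: time-ordered disease categories per patient.
patients = [
    ["diabetes", "hypertension", "ckd"],
    ["hypertension", "diabetes", "ckd"],
    ["ckd", "diabetes", "hypertension"],
    ["ckd", "hypertension", "diabetes"],
]
# Hypothetical binary outcome (e.g. an event within 1 year).
labels = [1, 1, 0, 0]


def bag_of_words(seq, vocab):
    """Order-invariant count vector over single codes."""
    counts = Counter(seq)
    return [float(counts[code]) for code in vocab]


def bigram_counts(seq, vocab):
    """Order-aware count vector over adjacent code pairs
    (a simple stand-in for a sequence-based representation)."""
    counts = Counter(zip(seq, seq[1:]))
    return [float(counts[pair]) for pair in vocab]


def predict_proba(x, w, b):
    """Logistic (sigmoid) probability for one feature vector."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Unregularized logistic regression via batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(X, y, w, b):
    """Fraction of patients classified correctly at a 0.5 threshold."""
    preds = [1 if predict_proba(xi, w, b) >= 0.5 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


code_vocab = sorted({c for seq in patients for c in seq})
pair_vocab = sorted({p for seq in patients for p in zip(seq, seq[1:])})

# The representations are fixed; only the classifier is trained.
X_bow = [bag_of_words(s, code_vocab) for s in patients]
X_seq = [bigram_counts(s, pair_vocab) for s in patients]

acc_bow = accuracy(X_bow, labels, *fit_logistic(X_bow, labels))
acc_seq = accuracy(X_seq, labels, *fit_logistic(X_seq, labels))
print(f"bag-of-words accuracy: {acc_bow:.2f}")
print(f"bigram accuracy:       {acc_seq:.2f}")
```

Because every toy patient has the same three conditions, the bag-of-words vectors are identical and the classifier has nothing to separate them on, whereas the bigram vectors differ. Accuracy is measured on the training data only, so this shows the mechanism and is not an evaluation.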

RESULTS:

Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories performed similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.

DISCUSSION AND CONCLUSION:

Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.

Full text: 1 | Database: MEDLINE | Main subject: Algorithms / Natural Language Processing / Electronic Health Records | Limits: Female / Humans / Male | Country/Region as subject: Europe | Language: En | Journal: J Am Med Inform Assoc | Journal subject: MEDICAL INFORMATICS | Year: 2024 | Document type: Article | Affiliation country: United Kingdom
