Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data.

Meaney, Christopher; Escobar, Michael; Stukel, Therese A; Austin, Peter C; Kalia, Sumeet; Aliarzadeh, Babak; Greiver, Michelle

Affiliation

Meaney C; University of Toronto, Toronto, ON, Canada.
Escobar M; University of Toronto, Toronto, ON, Canada.
Stukel TA; ICES, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada.
Austin PC; ICES, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada.
Kalia S; University of Toronto, Toronto, ON, Canada.
Aliarzadeh B; University of Toronto, Toronto, ON, Canada.
Moineddin R; University of Toronto, Toronto, ON, Canada.
Greiver M; University of Toronto, Toronto, ON, Canada; North York General Hospital, Toronto, ON, Canada.

Health Informatics J ; 29(1): 14604582221115667, 2023.

Article in English | MEDLINE | ID: mdl-36639910

ABSTRACT

Background/Objectives:

Unsupervised topic models are often used to facilitate improved understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish concurrent-, convergent- and discriminant-validity of learned topic models.

Design/Setting:

Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada, from 01/01/2017 through 12/31/2020.

Methods:

We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector, and qualitatively interpreted how these codes could be used for external validation of the learned topic model.
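The workflow described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which NMF solver or which association measure was used, so the multiplicative-update solver and the Pearson (point-biserial) correlation below are assumptions, as are the function names `fit_nmf` and `code_topic_correlations`, the toy six-document matrix, and the two placeholder diagnostic codes. The toy example uses K = 2 rather than K = 50.

```python
import numpy as np


def fit_nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (documents x terms) as W @ H.

    Uses Lee-Seung multiplicative updates for the squared-error loss.
    W holds the per-document topical vectors; H holds the per-topic term weights.
    (Assumed solver; the source does not name one.)
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def code_topic_correlations(codes, W):
    """Pearson correlation between each Boolean code column and each topic column.

    With a Boolean variable this equals the point-biserial correlation.
    Assumes no code or topic column is constant (otherwise the standard
    deviation is zero). Returns an array of shape (n_codes, n_topics).
    """
    C = codes.astype(float)
    Cz = (C - C.mean(axis=0)) / C.std(axis=0)
    Wz = (W - W.mean(axis=0)) / W.std(axis=0)
    return Cz.T @ Wz / C.shape[0]


# Hypothetical toy DTM: 6 documents x 4 terms with two obvious themes.
# Terms 0-1 stand in for respiratory vocabulary, terms 2-3 for skin vocabulary.
V = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [6, 3, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 6],
    [0, 0, 3, 9],
], dtype=float)

# Hypothetical Boolean diagnostic codes assigned at the same encounters:
# column 0 = a placeholder respiratory code, column 1 = a placeholder skin code.
codes = np.array([[True, False]] * 3 + [[False, True]] * 3)

W, H = fit_nmf(V, k=2)
corr = code_topic_correlations(codes, W)

# For each code, the topic with the strongest positive association.
best_topic = corr.argmax(axis=1)
print("code-by-topic correlations:\n", np.round(corr, 2))
print("most associated topic per code:", best_topic)
```

In a real analysis the correlation matrix would have one row per ICD-9 code and 50 columns, and the codes ranked highest within each column would be read against the top-weighted terms in the matching row of `H` to judge semantic agreement.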

Results:

The DTM consisted of 382,666 documents and 2210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions).

Conclusions:

When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.

Subject(s)

Electronic Health Records; International Classification of Diseases; Humans; Female; Retrospective Studies; Learning; Primary Health Care

Keywords

ICD-9 codes; clinical text data; concurrent validity; convergent validity; discriminant validity; electronic medical record; external validation; non-negative matrix factorization; topic model

Full text: 1 | Database: MEDLINE | Main subject: International Classification of Diseases / Electronic Health Records | Type of study: Diagnostic_studies / Observational_studies / Prognostic_studies / Risk_factors_studies | Language: En | Journal: Health Informatics J | Year: 2023 | Document type: Article