Impact of possible errors in natural language processing-derived data on downstream epidemiologic analysis.

Lan, Zhou; Turchin, Alexander

Affiliation

Lan Z; Center for Clinical Investigation, Brigham & Women's Hospital, Boston, MA 02115, United States.
Turchin A; Harvard Medical School, Boston, MA 02115, United States.

JAMIA Open; 6(4): ooad111, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38152447

ABSTRACT

Objective:

To assess the impact of potential errors in natural language processing (NLP) on the results of epidemiologic studies.

Materials and Methods:

We utilized data from three outcomes research studies where the primary predictor variable was generated using NLP. For each of these studies, Monte Carlo simulations were applied to generate datasets simulating potential errors in NLP-derived variables. We subsequently fit the original regression models to these partially simulated datasets and compared the distribution of coefficient estimates to the original study results.
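The procedure described above can be illustrated with a minimal sketch. The abstract does not specify the error model, the error rates, or the regression types used in the original studies, so everything here is an assumption made for illustration: the NLP-derived predictor is taken to be binary, errors are modeled through hypothetical `sensitivity` and `specificity` parameters, the model is a single-predictor ordinary least squares fit, significance is approximated by |t| >= 1.96, and the data are synthetic.

```python
import random
import statistics


def fit_slope(x, y):
    """OLS fit of y = a + b*x; returns (b, standard error of b)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return b, se


def add_nlp_errors(x, sensitivity, specificity, rng):
    """Randomly misclassify a binary variable (assumed error model)."""
    noisy = []
    for xi in x:
        if xi == 1:
            # A true positive is kept with probability `sensitivity`.
            noisy.append(1 if rng.random() < sensitivity else 0)
        else:
            # A true negative is kept with probability `specificity`.
            noisy.append(0 if rng.random() < specificity else 1)
    return noisy


def simulate(x, y, sensitivity, specificity, n_sims=500, seed=0):
    """Refit the model on error-injected copies of x.

    Returns (mean percent change in the slope relative to the original
    fit, fraction of simulations in which |t| fell below 1.96).
    """
    rng = random.Random(seed)
    b_orig, _ = fit_slope(x, y)
    pct_changes, lost = [], 0
    for _ in range(n_sims):
        x_sim = add_nlp_errors(x, sensitivity, specificity, rng)
        b, se = fit_slope(x_sim, y)
        pct_changes.append(100.0 * (b - b_orig) / b_orig)
        if abs(b / se) < 1.96:
            lost += 1
    return statistics.fmean(pct_changes), lost / n_sims


# Synthetic data standing in for a study dataset (not from the paper).
data_rng = random.Random(1)
x = [1 if data_rng.random() < 0.3 else 0 for _ in range(2000)]
y = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]

# Hypothetical NLP accuracy: 90% sensitivity, 95% specificity.
mean_change, frac_lost = simulate(x, y, sensitivity=0.90, specificity=0.95)
print(f"mean change in estimate: {mean_change:.1f}%")
print(f"significance lost in {100 * frac_lost:.1f}% of simulations")
```

In this toy setting, errors that are independent of the outcome tend to pull the estimate toward zero, which is the kind of shift the mean percent change summarizes. A real analysis would refit each study's actual model, including its confounders, rather than a single-predictor fit.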

Results:

Among the four models evaluated, the mean change in the point estimate of the relationship between the predictor variable and the outcome ranged from -21.9% to 4.12%. In three of the four models, the significance of this relationship was not eliminated in any of the 500 simulations, and in one model it was eliminated in 12% of simulations. Mean changes in the estimates for confounder variables ranged from 0.27% to 2.27%, and the significance of the relationship was eliminated between 0% and 9.25% of the time. No variable underwent a shift in the direction of its interpretation.

Discussion:

The impact of simulated NLP errors on the results of epidemiologic studies was modest, with only small changes in effect estimates and no changes in the interpretation of the findings (direction and significance of association with the outcome) for either the NLP-generated variables or other variables in the models.

Conclusion:

NLP errors are unlikely to affect the results of studies that use NLP as the source of data.

Keywords

Monte Carlo method; electronic health record; epidemiology; natural language processing; outcomes research

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: JAMIA Open Publication year: 2023 Document type: Article Country of affiliation: United States