RESUMO
Background and Aims: Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions. Methods: Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF), and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive and negative predictive value. Additionally, we conducted fairness and error assessments. Results: Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy) while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis. Conclusions: LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.