ABSTRACT
BACKGROUND/AIMS: Performance status is crucial for most clinical research, as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet information on performance status is often embedded as free text within a patient's electronic medical record rather than coded directly, making it extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents is also a major obstacle to search and retrieval, because natural language processing cannot be applied to unstructured text within an image. We therefore used optical character recognition software to convert images to a searchable format, allowing natural language processing to identify pertinent performance status data elements within scanned electronic medical records. METHODS: Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. Manual abstraction of performance status had previously been conducted by a clinical subject matter expert and served as the gold standard. Leveraging our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and reduction in data extraction effort.
RESULTS: Applying our optical character recognition/natural language processing pipeline yielded high accuracy and reduced the time needed to extract performance status data. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. In a second cohort, the median time to review documents for patients with performance status data present was reduced by a third. The largest time saving was in the review of documents that did not in fact contain performance status information: a median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort. CONCLUSION: By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism to transform scanned electronic medical record images to allow the application of natural language processing, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements requiring more human interpretation.
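The evaluation measures named in the results can be illustrated with a short sketch. This is not the authors' code: the function names are hypothetical, and the confusion counts below are made-up placeholders because the abstract does not report them. Only the 18-minute and 108-minute medians come from the abstract, and the sketch checks that they correspond to the stated 83% reduction.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Illustrative counts only (assumption): the abstract gives no confusion matrix.
p, r, f1 = precision_recall_f1(tp=95, fp=1, fn=2)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")

# Median review times reported in the abstract, in minutes.
reduction = percent_reduction(baseline=108, new=18)
print(f"reduction in abstraction effort: {reduction:.0f}%")  # prints 83%
```

The reduction is computed relative to the manual-review baseline, (108 - 18) / 108, which is why it rounds to 83% rather than being expressed as a ratio of the two times.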
Subject(s)
Information Storage and Retrieval , Natural Language Processing , Automation , Clinical Trials as Topic , Electronic Health Records , Humans , Software
ABSTRACT
PURPOSE: Although BRCA1/2 testing in ovarian cancer improves outcomes, it is vastly underutilized. Scalable approaches are urgently needed to improve genomically guided care. METHODS: We developed a Natural Language Processing (NLP) pipeline to extract electronic medical record information to identify recipients of BRCA testing. We applied the NLP pipeline to assess testing status in 308 patients with ovarian cancer receiving care at a National Cancer Institute Comprehensive Cancer Center (main campus [MC] and five affiliated clinical network sites [CNS]) from 2017 to 2019. We compared (1) characteristics of patients who had and had not received testing and (2) testing utilization by site. RESULTS: We found high uptake of BRCA testing (approximately 78%) from 2017 to 2019 with no significant differences between the MC and CNS. We observed an increase in testing over time (67%-85%), higher uptake of testing among younger patients (mean age tested = 61 years v untested = 65 years, P = .01), and higher testing among Hispanic (84%) compared with White, Non-Hispanic (78%), and Asian (75%) patients (P = .006). Documentation of referral for an internal genetics consultation for BRCA pathogenic variant carriers was higher at the MC compared with the CNS (94% v 31%). CONCLUSION: We successfully used a novel NLP pipeline to assess use of BRCA testing among patients with ovarian cancer. Despite relatively high levels of BRCA testing at our institution, 22% of patients had no documentation of genetic testing, and documentation of referral to genetics among BRCA carriers in the CNS was low. Given the success of the NLP pipeline, such an informatics-based approach holds promise as a scalable solution to identify gaps in genetic testing and to ensure optimal treatment interventions in a timely manner.
Subject(s)
BRCA2 Protein , Consumer Health Informatics , Ovarian Neoplasms , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Consumer Health Informatics/methods , Female , Genetic Testing , Humans , Middle Aged , Natural Language Processing , Ovarian Neoplasms/diagnosis , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Referral and Consultation
ABSTRACT
PURPOSE: We performed detailed genomic analysis on 87 cases of de novo diffuse large B-cell lymphoma of germinal center type (GCB DLBCL) to identify characteristics that are associated with survival in those treated with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone). EXPERIMENTAL DESIGN: The cases were extensively characterized by combining the results of IHC, cell-of-origin gene expression profiling (GEP; NanoString), double-hit GEP (DLBCL90), FISH cytogenetic analysis for double/triple-hit lymphoma, copy-number analysis, and targeted deep sequencing using a custom mutation panel of 334 genes. RESULTS: We identified four distinct biologic subgroups with different survivals, and with similarities to the genomic classifications from two large retrospective studies of DLBCL. Patients with the double-hit signature, but no abnormalities of TP53, and those lacking EZH2 mutation and/or BCL2 translocation, had an excellent prognosis. However, patients with an EZB-like profile had an intermediate prognosis, whereas those with TP53 inactivation combined with the double-hit signature had an extremely poor prognosis. This latter finding was validated using two independent cohorts. CONCLUSIONS: We propose a practical schema to use genomic variables to risk-stratify patients with GCB DLBCL. This schema provides a promising new approach to identify high-risk patients for new and innovative therapies.