1.
Patterns (N Y) ; 5(3): 100933, 2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38487800

ABSTRACT

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using artificial intelligence (AI) allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. Finally, we perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
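The abstract reports an average AU-ROC across 32 tissues but does not say how the average was computed. As an illustration only, the sketch below assumes an unweighted (macro) mean of one-vs-rest AU-ROC values. The function names, the toy probabilities, and the three tissue labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def binary_auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AU-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the Mann-Whitney U formulation; tied scores count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(probs: List[Dict[str, float]], truth: List[str]) -> float:
    """Unweighted mean of one-vs-rest AU-ROC over the classes seen in `truth`.

    Assumption: macro averaging; the abstract does not specify the scheme.
    """
    classes = sorted(set(truth))
    per_class = []
    for c in classes:
        scores = [p.get(c, 0.0) for p in probs]      # predicted P(class == c)
        labels = [1 if t == c else 0 for t in truth]  # c versus all others
        per_class.append(binary_auroc(scores, labels))
    return sum(per_class) / len(per_class)


# Hypothetical toy predictions for six reports and three tissue labels.
toy_probs = [
    {"BRCA": 0.8, "LUAD": 0.1, "KIRC": 0.1},
    {"BRCA": 0.6, "LUAD": 0.3, "KIRC": 0.1},
    {"BRCA": 0.2, "LUAD": 0.7, "KIRC": 0.1},
    {"BRCA": 0.5, "LUAD": 0.3, "KIRC": 0.2},
    {"BRCA": 0.1, "LUAD": 0.2, "KIRC": 0.7},
    {"BRCA": 0.4, "LUAD": 0.1, "KIRC": 0.5},
]
toy_truth = ["BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]

if __name__ == "__main__":
    print(round(macro_ovr_auroc(toy_probs, toy_truth), 4))
```

With many classes, a macro average weights rare and common tissues equally. A support-weighted average would be an equally plausible reading of "average AU-ROC".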

2.
Patterns (N Y) ; 4(12): 100889, 2023 Dec 08.
Article in English | MEDLINE | ID: mdl-38106616

ABSTRACT

Coronavirus disease 2019 (COVID-19), the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had extensive economic, social, and public health impacts in the United States and around the world. To date, there have been more than 600 million reported infections worldwide with more than 6 million reported deaths. Retrospective analysis, which identified comorbidities, risk factors, and treatments, has underpinned the response. As the situation transitions to an endemic phase, retrospective analyses using electronic health records will be important to identify the long-term effects of COVID-19. However, these analyses can be complicated by incomplete records, which make it difficult to differentiate visits where the patient had COVID-19. To address this issue, we trained a random forest classifier to assign a probability of a patient having been diagnosed with COVID-19 during each visit. Using these probabilities, we found that higher COVID-19 probabilities were associated with a future diagnosis of myocardial infarction, urinary tract infection, acute renal failure, and type 2 diabetes.

3.
medRxiv ; 2023 Aug 08.
Article in English | MEDLINE | ID: mdl-37609238

ABSTRACT

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.

4.
medRxiv ; 2023 Jun 27.
Article in English | MEDLINE | ID: mdl-37425701

ABSTRACT

Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7,000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term extraction, inferring TNM stage from context when it is not included in the report text explicitly. As external validation, we test our model on almost 8,000 pathology reports from Columbia University Medical Center, finding that our trained model achieves an AU-ROC of 0.815 to 0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.
