Search | VHL Regional Portal

1.

Machine learning and deep learning tools for the automated capture of cancer surveillance data.

Hsu, Elizabeth; Hanson, Heidi; Coyle, Linda; Stevens, Jennifer; Tourassi, Georgia; Penberthy, Lynne.

J Natl Cancer Inst Monogr ; 2024(65): 145-151, 2024 Aug 01.

Article in English | MEDLINE | ID: mdl-39102883

ABSTRACT

The National Cancer Institute and the Department of Energy strategic partnership applies advanced computing and predictive machine learning and deep learning models to automate the capture of information from unstructured clinical text for inclusion in cancer registries. Applications include extraction of key data elements from pathology reports, determination of whether a pathology or radiology report is related to cancer, extraction of relevant biomarker information, and identification of recurrence. With the growing complexity of cancer diagnosis and treatment, capturing essential information with purely manual methods is increasingly difficult. These new methods for applying advanced computational capabilities to automate data extraction represent an opportunity to close critical information gaps and create a nimble, flexible platform on which new information sources, such as genomics, can be added. This will ultimately provide a deeper understanding of the drivers of cancer and outcomes in the population and increase the timeliness of reporting. These advances will enable better understanding of how real-world patients are treated and the outcomes associated with those treatments in the context of our complex medical and social environment.

Subject(s)

Deep Learning , Machine Learning , Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/epidemiology , United States/epidemiology , Registries , National Cancer Institute (U.S.)

2.

Deep learning uncertainty quantification for clinical text classification.

Peluso, Alina; Danciu, Ioana; Yoon, Hong-Jun; Yusof, Jamaludin Mohd; Bhattacharya, Tanmoy; Spannaus, Adam; Schaefferkoetter, Noah; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen; Wiggins, Charles; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia D; Gao, Shang.

J Biomed Inform ; 149: 104576, 2024 01.

Article in English | MEDLINE | ID: mdl-38101690

ABSTRACT

INTRODUCTION: Machine learning algorithms are expected to work side-by-side with humans in decision-making pipelines. Thus, the ability of classifiers to make reliable decisions is of paramount importance. Deep neural networks (DNNs) represent the state-of-the-art models to address real-world classification. Although the strength of activation in DNNs is often correlated with the network's confidence, in-depth analyses are needed to establish whether they are well calibrated. METHOD: In this paper, we demonstrate the use of DNN-based classification tools to benefit cancer registries by automating information extraction of disease at diagnosis and at surgery from electronic text pathology reports from the US National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) population-based cancer registries. In particular, we introduce multiple methods for selective classification to achieve a target level of accuracy on multiple classification tasks while minimizing the rejection amount, that is, the number of electronic pathology reports for which the model's predictions are unreliable. We evaluate the proposed methods by comparing our approach with the current in-house deep learning-based abstaining classifier. RESULTS: Overall, all the proposed selective classification methods effectively allow for achieving the targeted level of accuracy or higher in a trade-off analysis aimed to minimize the rejection rate. On in-distribution validation and holdout test data, with all the proposed methods, we achieve on all tasks the required target level of accuracy with a lower rejection rate than the deep abstaining classifier (DAC). Interpreting the results for the out-of-distribution test data is more complex; nevertheless, in this case as well, the rejection rate from the best among the proposed methods achieving 97% accuracy or higher is lower than the rejection rate based on the DAC. CONCLUSIONS: We show that although both approaches can flag those samples that should be manually reviewed and labeled by human annotators, the newly proposed methods retain a larger fraction and do so without retraining, thus offering a reduced computational cost compared with the in-house deep learning-based abstaining classifier.
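The trade-off described in this abstract, reaching a target accuracy while rejecting as few reports as possible, can be illustrated with a plain confidence cutoff. The sketch below is not one of the paper's methods. It assumes each prediction carries a single confidence score (for example, the maximum softmax value) and that correctness is known on validation data; the function name `pick_threshold` and the toy numbers are hypothetical.

```python
def pick_threshold(confidences, correct, target_accuracy):
    """Find the lowest confidence cutoff whose retained set meets the target accuracy.

    Returns (threshold, rejection_rate); threshold is None if no cutoff works.
    """
    if not confidences:
        raise ValueError("need at least one prediction")
    # Visit predictions from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    best_kept, best_threshold, hits = 0, None, 0
    for kept, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        # A cutoff can only fall between two distinct confidence values.
        at_boundary = kept == len(order) or confidences[order[kept]] < confidences[i]
        if at_boundary and hits / kept >= target_accuracy:
            best_kept, best_threshold = kept, confidences[i]
    return best_threshold, 1 - best_kept / len(order)


# Toy validation set: five predictions, one confident-enough error at 0.60.
threshold, rejection_rate = pick_threshold(
    confidences=[0.99, 0.95, 0.90, 0.60, 0.55],
    correct=[True, True, True, False, True],
    target_accuracy=0.97,
)
print(threshold, rejection_rate)
```

Predictions below the returned cutoff would be routed to human review; the cutoff is chosen on validation data and then applied unchanged to new reports.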

Subject(s)

Deep Learning , Humans , Uncertainty , Neural Networks, Computer , Algorithms , Machine Learning

3.

Using ensembles and distillation to optimize the deployment of deep learning models for the classification of electronic cancer pathology reports.

De Angeli, Kevin; Gao, Shang; Blanchard, Andrew; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M; Wiggins, Charles; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia; Yoon, Hong-Jun.

JAMIA Open ; 5(3): ooac075, 2022 Oct.

Article in English | MEDLINE | ID: mdl-36110150

ABSTRACT

Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports. Materials and Methods: We consider the text classification problem that involves 5 individual tasks. The baseline model consists of a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs. We performed knowledge transfer by training a single model (student) with soft labels derived through the aggregation of ensemble predictions. We evaluate performance based on accuracy and abstention rates by using softmax thresholding. Results: The student model outperforms the baseline MtCNN in terms of abstention rates and accuracy, thereby allowing the model to be used with a larger volume of documents when deployed. The highest boost was observed for subsite and histology, for which the student model classified an additional 1.81% reports for subsite and 3.33% reports for histology. Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data and thereby enable the construction of soft labels with estimated probabilities for multiple classes for a given document. Training models with the derived soft labels reduces model confidence in difficult-to-classify documents, thereby leading to a reduction in the number of highly confident wrong predictions. Conclusions: Ensemble model distillation is a simple tool to reduce model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources where minimizing inference time is required.
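The distillation and abstention steps named in this abstract can be sketched in a few lines. The abstract says only that soft labels come from "the aggregation of ensemble predictions", so the plain averaging below is an assumption, as are the function names and the toy probabilities; the student's training loop itself is omitted.

```python
import math


def soft_labels(ensemble_probs):
    """Average per-class probabilities over ensemble members (assumed aggregation)."""
    n_models = len(ensemble_probs)
    n_classes = len(ensemble_probs[0])
    return [sum(member[c] for member in ensemble_probs) / n_models
            for c in range(n_classes)]


def soft_cross_entropy(student_probs, target_probs, eps=1e-12):
    """Cross-entropy between the student's output and the soft label."""
    return -sum(t * math.log(s + eps) for s, t in zip(student_probs, target_probs))


def predict_or_abstain(probs, threshold):
    """Softmax thresholding: return the top class, or None to abstain."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best if probs[best] >= threshold else None


# Three hypothetical teacher models scoring one document over two classes.
teacher = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
target = soft_labels(teacher)
loss = soft_cross_entropy([0.6, 0.4], target)
print(target, loss, predict_or_abstain([0.6, 0.4], threshold=0.9))
```

Because the soft label spreads probability over several classes when the teachers disagree, a student trained on it is pushed toward lower confidence on exactly those documents.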

4.

Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology.

Yoon, Hong-Jun; Stanley, Christopher; Christian, J Blair; Klasky, Hilda B; Blanchard, Andrew E; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Tourassi, Georgia D.

Cancer Biomark ; 33(2): 185-198, 2022.

Article in English | MEDLINE | ID: mdl-35213361

ABSTRACT

BACKGROUND: With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have also become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. OBJECTIVE: The privacy vulnerability of deep learning models for information extraction from medical textual content needs to be quantified since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of the deep learning models for natural language processing and explore a proper way of securing patients' information to mitigate confidentiality breaches. METHODS: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
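Scheme (a), keeping only words that appear in more than one registry, can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: tokenization is plain lowercase whitespace splitting, the registry names and report texts are invented, and `min_registries` is a hypothetical parameter.

```python
from collections import Counter


def shared_vocabulary(registry_docs, min_registries=2):
    """Return the words that occur in at least `min_registries` registries."""
    registry_counts = Counter()
    for docs in registry_docs.values():
        words_in_registry = set()
        for doc in docs:
            words_in_registry.update(doc.lower().split())
        # Count each word once per registry, however often it appears there.
        registry_counts.update(words_in_registry)
    return {word for word, n in registry_counts.items() if n >= min_registries}


# Invented toy reports; the surnames stand in for registry-specific identifiers.
reports = {
    "registry_a": ["invasive ductal carcinoma smith"],
    "registry_b": ["ductal carcinoma in situ jones"],
}
print(sorted(shared_vocabulary(reports)))
```

Words that occur in a single registry, which is where patient or provider names are most likely to sit, fall out of the vocabulary, while shared clinical terms remain.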

Subject(s)

Confidentiality , Deep Learning , Information Storage and Retrieval/methods , Natural Language Processing , Neoplasms/epidemiology , Artificial Intelligence , Deep Learning/standards , Humans , Neoplasms/pathology , Registries

5.

A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification.

Blanchard, Andrew E; Gao, Shang; Yoon, Hong-Jun; Christian, J Blair; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M; Wiggins, Charles; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia D.

IEEE J Biomed Health Inform ; 26(6): 2796-2803, 2022 06.

Article in English | MEDLINE | ID: mdl-35020599

ABSTRACT

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
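A training loss "with contributions from both raw data and keywords" can be written as a sum of two terms. The abstract does not say how the two contributions are combined, so the weighted sum below, the `keyword_weight` parameter, and the toy probabilities are all assumptions made for illustration.

```python
import math


def mean_nll(probs, labels):
    """Mean negative log-likelihood of the true class over a batch."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def keyword_enhanced_loss(doc_probs, doc_labels, kw_probs, kw_labels, keyword_weight=1.0):
    """Batch loss = loss on full documents + weighted loss on class keywords."""
    data_term = mean_nll(doc_probs, doc_labels)
    keyword_term = mean_nll(kw_probs, kw_labels)
    return data_term + keyword_weight * keyword_term


# One document (true class 0) and one keyword (true class 1), two classes.
loss = keyword_enhanced_loss(
    doc_probs=[[0.5, 0.5]], doc_labels=[0],
    kw_probs=[[0.25, 0.75]], kw_labels=[1],
)
print(loss)
```

The keyword term gives a rare class a gradient signal in every batch, even when no full document of that class was sampled.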

Subject(s)

Reproducibility of Results , Data Collection , Humans

6.

Predictive Radiation Oncology - A New NCI-DOE Scientific Space and Community.

Buchsbaum, Jeffrey C; Jaffray, David A; Ba, Demba; Borkon, Lynn L; Chalk, Christine; Chung, Caroline; Coleman, Matthew A; Coleman, C Norman; Diehn, Maximilian; Droegemeier, Kelvin K; Enderling, Heiko; Espey, Michael G; Greenspan, Emily J; Hartshorn, Christopher M; Hoang, Thuc; Hsiao, H Timothy; Keppel, Cynthia; Moore, Nathan W; Prior, Fred; Stahlberg, Eric A; Tourassi, Georgia; Willcox, Karen E.

Radiat Res ; 197(4): 434-445, 2022 04 01.

Article in English | MEDLINE | ID: mdl-35090025

ABSTRACT

With a widely attended virtual kickoff event on January 29, 2021, the National Cancer Institute (NCI) and the Department of Energy (DOE) launched a series of 4 interactive, interdisciplinary workshops (and a final concluding "World Café" on March 29, 2021) focused on advancing computational approaches for predictive oncology in the clinical and research domains of radiation oncology. These events reflect 3,870 human hours of virtual engagement with representation from 8 DOE national laboratories and the Frederick National Laboratory for Cancer Research (FNL), 4 research institutes, 5 cancer centers, 17 medical schools and teaching hospitals, 5 companies, 5 federal agencies, 3 research centers, and 27 universities. Here we summarize the workshops by first describing the background for the workshops. Participants identified twelve key questions (and collaborative parallel ideas) as the focus of work going forward to advance the field. These were then used to define short-term and longer-term "Blue Sky" goals. In addition, the group determined key success factors for predictive oncology in the context of radiation oncology, if not the future of all of medicine. These are: cross-discipline collaboration, targeted talent development, development of mechanistic mathematical and computational models and tools, and access to high-quality multiscale data that bridges mechanisms to phenotype. The workshop participants reported feeling energized and highly motivated to pursue next steps together to address the unmet needs in radiation oncology specifically and in cancer research generally and that NCI and DOE project goals align at the convergence of radiation therapy and advanced computing.

Subject(s)

Radiation Oncology , Academies and Institutes , Humans , National Cancer Institute (U.S.) , Radiation Oncology/education , United States

7.

Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

De Angeli, Kevin; Gao, Shang; Danciu, Ioana; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia D; Yoon, Hong-Jun.

J Biomed Inform ; 125: 103957, 2022 01.

Article in English | MEDLINE | ID: mdl-34823030

ABSTRACT

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
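The contrast between macro and micro F1 that this abstract relies on is easy to see on a toy imbalanced example. The sketch below uses the standard definitions for single-label classification; the labels and the helper name `f1_scores` are illustrative and not taken from the paper.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# A model that always predicts the common class and misses the rare one.
macro_f1, micro_f1 = f1_scores(
    y_true=["common"] * 4 + ["rare"],
    y_pred=["common"] * 5,
)
print(macro_f1, micro_f1)
```

Here micro F1 is 0.8 while macro F1 is about 0.44, which is why a method tuned for rare cancer types is judged on macro F1 and one that favors the top classes looks better on micro F1.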

Subject(s)

Natural Language Processing , Neoplasms , Electronic Health Records , Humans , Machine Learning , Neural Networks, Computer

8.

Artificial intelligence in cancer research, diagnosis and therapy.

Elemento, Olivier; Leslie, Christina; Lundin, Johan; Tourassi, Georgia.

Nat Rev Cancer ; 21(12): 747-752, 2021 12.

Article in English | MEDLINE | ID: mdl-34535775

ABSTRACT

STANDFIRST: Artificial intelligence and machine learning techniques are breaking into biomedical research and health care, which importantly includes cancer research and oncology, where the potential applications are vast. These include detection and diagnosis of cancer, subtype classification, optimization of cancer treatment and identification of new therapeutic targets in drug discovery. While big data used to train machine learning models may already exist, leveraging this opportunity to realize the full promise of artificial intelligence in both the cancer research space and the clinical space will first require significant obstacles to be surmounted. In this Viewpoint article, we asked four experts for their opinions on how we can begin to implement artificial intelligence while ensuring standards are maintained so as to transform cancer diagnosis and the prognosis and treatment of patients with cancer and to drive biological discovery.

Subject(s)

Biomedical Research , Neoplasms , Artificial Intelligence , Drug Discovery/methods , Humans , Machine Learning , Medical Oncology , Neoplasms/diagnosis , Neoplasms/therapy

9.

Deep active learning for classifying cancer pathology reports.

De Angeli, Kevin; Gao, Shang; Alawad, Mohammed; Yoon, Hong-Jun; Schaefferkoetter, Noah; Wu, Xiao-Cheng; Durbin, Eric B; Doherty, Jennifer; Stroup, Antoinette; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia.

BMC Bioinformatics ; 22(1): 113, 2021 Mar 09.

Article in English | MEDLINE | ID: mdl-33750288

ABSTRACT

BACKGROUND: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model. RESULTS: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes. CONCLUSIONS: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
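The two strategies that did best on the small dataset, marginal and ratio uncertainty sampling, both rank unlabelled reports by how close the top two class probabilities are. The sketch below uses the common textbook definitions (margin = top minus second, ratio = second divided by top); the paper's exact formulas may differ, and the function names and pool values are hypothetical.

```python
def margin_score(probs):
    """Top probability minus runner-up; a small margin means an uncertain prediction."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def ratio_score(probs):
    """Runner-up divided by top probability; a ratio near 1 means an uncertain prediction."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def select_for_labeling(pool_probs, k, strategy="margin"):
    """Return indices of the k most uncertain samples in the unlabelled pool."""
    indices = range(len(pool_probs))
    if strategy == "margin":
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    elif strategy == "ratio":
        ranked = sorted(indices, key=lambda i: ratio_score(pool_probs[i]), reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Model outputs for three unlabelled reports over three classes.
pool = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]
print(select_for_labeling(pool, k=2, strategy="margin"))
print(select_for_labeling(pool, k=2, strategy="ratio"))
```

In each active learning iteration the selected reports would be sent to annotators, added to the labelled set, and the model retrained before scoring the pool again.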

Subject(s)

Machine Learning , Neoplasms , Algorithms , Humans , Neoplasms/genetics , Neoplasms/pathology , Neural Networks, Computer

10.

COVID-19 Evidence Accelerator: A parallel analysis to describe the use of Hydroxychloroquine with or without Azithromycin among hospitalized COVID-19 patients.

Stewart, Mark; Rodriguez-Watson, Carla; Albayrak, Adem; Asubonteng, Julius; Belli, Andrew; Brown, Thomas; Cho, Kelly; Das, Ritankar; Eldridge, Elizabeth; Gatto, Nicolle; Gelman, Alice; Gerlovin, Hanna; Goldberg, Stuart L; Hansen, Eric; Hirsch, Jonathan; Ho, Yuk-Lam; Ip, Andrew; Izano, Monika; Jones, Jason; Justice, Amy C; Klesh, Reyna; Kuranz, Seth; Lam, Carson; Mao, Qingqing; Mataraso, Samson; Mera, Robertino; Posner, Daniel C; Rassen, Jeremy A; Siefkas, Anna; Schrag, Andrew; Tourassi, Georgia; Weckstein, Andrew; Wolf, Frank; Bhat, Amar; Winckler, Susan; Sigal, Ellen V; Allen, Jeff.

PLoS One ; 16(3): e0248128, 2021.

Article in English | MEDLINE | ID: mdl-33730088

ABSTRACT

BACKGROUND: The COVID-19 pandemic remains a significant global threat. However, despite urgent need, there remains uncertainty surrounding best practices for pharmaceutical interventions to treat COVID-19. In particular, conflicting evidence has emerged surrounding the use of hydroxychloroquine and azithromycin, alone or in combination, for COVID-19. The COVID-19 Evidence Accelerator convened by the Reagan-Udall Foundation for the FDA, in collaboration with Friends of Cancer Research, assembled experts from the health systems research, regulatory science, data science, and epidemiology to participate in a large parallel analysis of different data sets to further explore the effectiveness of these treatments. METHODS: Electronic health record (EHR) and claims data were extracted from seven separate databases. Parallel analyses were undertaken on data extracted from each source. Each analysis examined time to mortality in hospitalized patients treated with hydroxychloroquine, azithromycin, and the two in combination as compared to patients not treated with either drug. Cox proportional hazards models were used, and propensity score methods were undertaken to adjust for confounding. Frequencies of adverse events in each treatment group were also examined. RESULTS: Neither hydroxychloroquine nor azithromycin, alone or in combination, were significantly associated with time to mortality among hospitalized COVID-19 patients. No treatment groups appeared to have an elevated risk of adverse events. CONCLUSION: Administration of hydroxychloroquine, azithromycin, and their combination appeared to have no effect on time to mortality in hospitalized COVID-19 patients. Continued research is needed to clarify best practices surrounding treatment of COVID-19.

Subject(s)

Antiviral Agents/therapeutic use , Azithromycin/therapeutic use , COVID-19 Drug Treatment , Hydroxychloroquine/therapeutic use , Pandemics/prevention & control , Data Management/methods , Drug Therapy, Combination/methods , Female , Hospitalization , Humans , Male , SARS-CoV-2/drug effects

11.

Limitations of Transformers on Clinical Text Classification.

Gao, Shang; Alawad, Mohammed; Young, M Todd; Gounley, John; Schaefferkoetter, Noah; Yoon, Hong Jun; Wu, Xiao-Cheng; Durbin, Eric B; Doherty, Jennifer; Stroup, Antoinette; Coyle, Linda; Tourassi, Georgia.

IEEE J Biomed Health Inform ; 25(9): 3596-3607, 2021 09.

Article in English | MEDLINE | ID: mdl-33635801

ABSTRACT

Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text.
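The abstract does not describe its four scaling methods, so the following is only a generic illustration of the underlying problem: a model with a fixed input limit has to see a long document in pieces, and the per-piece outputs then have to be combined. Splitting on word count, the limit of 400, and averaging the chunk probabilities are all assumptions chosen for simplicity, and no actual BERT model is called.

```python
def split_into_chunks(words, max_words):
    """Cut a tokenized document into consecutive chunks of at most max_words."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


def average_chunk_probs(chunk_probs):
    """Combine per-chunk class probabilities into one document-level prediction."""
    n_chunks = len(chunk_probs)
    n_classes = len(chunk_probs[0])
    return [sum(p[c] for p in chunk_probs) / n_chunks for c in range(n_classes)]


# A hypothetical 1,000-word report split for a model limited to ~400 words.
document = [f"word{i}" for i in range(1000)]
chunks = split_into_chunks(document, max_words=400)
print([len(chunk) for chunk in chunks])

# Stand-in per-chunk outputs; a real system would get these from the classifier.
doc_probs = average_chunk_probs([[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]])
print(doc_probs)
```

Averaging treats every chunk equally, which may dilute the few key words or phrases that the abstract identifies as decisive for these labels.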

Subject(s)

Natural Language Processing , Neural Networks, Computer , Humans

12.

Privacy-Preserving Deep Learning NLP Models for Cancer Registries.

Alawad, Mohammed; Yoon, Hong-Jun; Gao, Shang; Mumphrey, Brent; Wu, Xiao-Cheng; Durbin, Eric B; Jeong, Jong Cheol; Hands, Isaac; Rust, David; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia.

IEEE Trans Emerg Top Comput ; 9(3): 1219-1230, 2021.

Article in English | MEDLINE | ID: mdl-36117774

ABSTRACT

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL to tackle this and other real-world problems is proportional to the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles for data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer which may contain patient identifiers. Thus, even distributing the trained models across cancer registries causes a privacy violation issue. In this paper, we propose DL NLP model distribution via privacy-preserving transfer learning approaches without sharing sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics - tumor site, subsite, laterality, behavior, histology, and grade - from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that transfer learning approaches including data sharing and model distribution significantly outperform the single-registry model. In addition, the best performing privacy-preserving model distribution approach achieves statistically indistinguishable average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) as compared to the centralized model (0.827, 0.585).

13.

Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports.

Yoon, Hong-Jun; Klasky, Hilda B; Gounley, John P; Alawad, Mohammed; Gao, Shang; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Coyle, Linda; Penberthy, Lynne; Blair Christian, J; Tourassi, Georgia D.

J Biomed Inform ; 110: 103564, 2020 10.

Article in English | MEDLINE | ID: mdl-32919043

ABSTRACT

OBJECTIVE: In machine learning, it is evident that classification task performance increases if bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. MATERIALS AND METHODS: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL). RESULTS: We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies; and we demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, which had more than 500 class labels in the task; these results show that data partition may alleviate the complexity of the task. On the contrary, the methods did not achieve superior scores for the tasks of site and subsite classification. Intrinsically, since data partitioning was based on the primary cancer site, the accuracy depended on the determination of the partitions, which needs further investigation and improvement. CONCLUSION: Results in this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) we achieved faster training by leveraging the high-performance Summit supercomputer at ORNL.
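The two building blocks of bagging mentioned in this abstract, resampling the training cases with replacement and aggregating the resulting models' predictions, can be sketched without any deep learning framework. Majority voting is an assumed aggregation rule (the abstract does not name one), the site codes are toy values, and model training is left out entirely.

```python
import random
from collections import Counter


def bootstrap_indices(n_cases, rng):
    """Draw n_cases training indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def majority_vote(predictions):
    """Aggregate one document's predictions from many models by plurality."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the resampling is reproducible
samples = [bootstrap_indices(5, rng) for _ in range(3)]
print(samples)

# Three hypothetical models vote on the primary site of one report.
print(majority_vote(["C50", "C34", "C50"]))

# Model count from the abstract: 20 sub-problems x 2,000 bootstrap samples.
n_models = 20 * 2000
print(n_models)
```

Each bootstrap sample is independent of the others, which is what makes training many models concurrently on a large machine straightforward.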

Subject(s)

Neoplasms , Neural Networks, Computer , Computing Methodologies , Humans , Information Storage and Retrieval , Machine Learning

14.

Using case-level context to classify cancer pathology reports.

Gao, Shang; Alawad, Mohammed; Schaefferkoetter, Noah; Penberthy, Lynne; Wu, Xiao-Cheng; Durbin, Eric B; Coyle, Linda; Ramanathan, Arvind; Tourassi, Georgia.

PLoS One ; 15(5): e0232840, 2020.

Article in English | MEDLINE | ID: mdl-32396579

ABSTRACT

Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied towards a wide range of other clinical text-based tasks.

Subject(s)

Electronic Health Records/classification , Neoplasms/pathology , Histological Techniques , Humans , Natural Language Processing , SEER Program

15.

Knowledge Graph-Enabled Cancer Data Analytics.

Hasan, S M Shamimul; Rivera, Donna; Wu, Xiao-Cheng; Durbin, Eric B; Christian, J Blair; Tourassi, Georgia.

IEEE J Biomed Health Inform ; 24(7): 1952-1967, 2020 07.

Article in English | MEDLINE | ID: mdl-32386166

ABSTRACT

Cancer registries collect unstructured and structured cancer data for surveillance purposes which provide important insights regarding cancer characteristics, treatments, and outcomes. Cancer registry data typically (1) categorize each reportable cancer case or tumor at the time of diagnosis, (2) contain demographic information about the patient such as age, gender, and location at time of diagnosis, (3) include planned and completed primary treatment information, and (4) may contain survival outcomes. As structured data is being extracted from various unstructured sources, such as pathology reports, radiology reports, medical records, and stored for reporting and other needs, the associated information representing a reportable cancer is constantly expanding and evolving. While some popular analytic approaches including SEER*Stat and SAS exist, we provide a knowledge graph approach to organizing cancer registry data. Our approach offers unique advantages for timely data analysis and presentation and visualization of valuable information. This knowledge graph approach semantically enriches the data, and easily enables linking with third-party data which can help explain variation in cancer incidence patterns, disparities, and outcomes. We developed a prototype knowledge graph based on the Louisiana Tumor Registry dataset. We present the advantages of the knowledge graph approach by examining: i) scenario-specific queries, ii) links with openly available external datasets, iii) schema evolution for iterative analysis, and iv) data visualization. Our results demonstrate that this graph based solution can perform complex queries, improve query run-time performance by up to 76%, and more easily conduct iterative analyses to enhance researchers' understanding of cancer registry data.

Subject(s)

Knowledge Bases , Neoplasms , Registries , Adult , Aged , Aged, 80 and over , Algorithms , Databases, Factual , Female , Humans , Incidence , Male , Middle Aged , Neoplasms/diagnosis , Neoplasms/epidemiology , Neoplasms/physiopathology

16.

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks.

Alawad, Mohammed; Gao, Shang; Qiu, John X; Yoon, Hong Jun; Blair Christian, J; Penberthy, Lynne; Mumphrey, Brent; Wu, Xiao-Cheng; Coyle, Linda; Tourassi, Georgia.

J Am Med Inform Assoc ; 27(1): 89-98, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31710668

ABSTRACT

OBJECTIVE: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. MATERIALS AND METHODS: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95,231 pathology documents (71,223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). RESULTS: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. CONCLUSIONS: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task-specific model.
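
The efficiency claim in this abstract (about the same number of trainable parameters as a single-task CNN) can be pictured with a rough parameter count. The sketch below assumes a simple layout in which one shared encoder feeds one dense classification head per task. The class counts come from the abstract, but `ENCODER_PARAMS` and `HIDDEN_DIM` are hypothetical sizes chosen only for illustration, and the authors' actual architecture may differ.

```python
# Class counts per task, as listed in the abstract.
TASK_CLASSES = {
    "site": 65,
    "laterality": 4,
    "behavior": 3,
    "histology": 63,
    "grade": 5,
}

# Hypothetical sizes, not reported in the abstract.
ENCODER_PARAMS = 1_000_000  # shared embedding and convolution layers
HIDDEN_DIM = 300            # width of the document vector fed to each head


def head_params(num_classes, hidden_dim=HIDDEN_DIM):
    """Weights plus biases of one dense classification head."""
    return hidden_dim * num_classes + num_classes


def multitask_params():
    """One shared encoder plus one head per task (hard parameter sharing)."""
    return ENCODER_PARAMS + sum(head_params(c) for c in TASK_CLASSES.values())


def separate_models_params():
    """Five independent single-task models, each with its own encoder."""
    return sum(ENCODER_PARAMS + head_params(c) for c in TASK_CLASSES.values())


single_site_model = ENCODER_PARAMS + head_params(TASK_CLASSES["site"])
print(single_site_model, multitask_params(), separate_models_params())
```

Under these assumed sizes the shared model is only slightly larger than one single-task model and roughly five times smaller than five separate models, because the task heads are small relative to the encoder.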

Subject(s)

Information Storage and Retrieval/methods , Machine Learning , Natural Language Processing , Neoplasms/pathology , Neural Networks, Computer , Registries , Humans , Neoplasms/classification , Support Vector Machine

17.

Classifying cancer pathology reports with hierarchical self-attention networks.

Gao, Shang; Qiu, John X; Alawad, Mohammed; Hinkle, Jacob D; Schaefferkoetter, Noah; Yoon, Hong-Jun; Christian, Blair; Fearn, Paul A; Penberthy, Lynne; Wu, Xiao-Cheng; Coyle, Linda; Tourassi, Georgia; Ramanathan, Arvind.

Artif Intell Med ; 101: 101726, 2019 11.

Article in English | MEDLINE | ID: mdl-31813492

ABSTRACT

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks - site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data - Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.
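
This abstract reports both accuracy and macro F-score, and states gains in relative terms. The short Python sketch below shows, under the usual textbook definitions, how the two metrics can diverge on imbalanced labels and how a relative gain is computed. The toy labels and the numbers passed to `relative_gain` are hypothetical and are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(baseline, improved):
    """Improvement expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# Toy example: a classifier that always predicts the majority class.
Y_TRUE = ["a"] * 8 + ["b"] * 2
Y_PRED = ["a"] * 10

print(accuracy(Y_TRUE, Y_PRED))
print(macro_f1(Y_TRUE, Y_PRED))
print(relative_gain(0.50, 0.525))
```

In the toy example the accuracy looks respectable while the macro F-score is much lower, because the minority class is never predicted. This is why the two metrics are often reported together for tasks with many rare classes.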

Subject(s)

Neoplasms/pathology , Deep Learning , Humans , Natural Language Processing , Neoplasms/classification , Neural Networks, Computer

18.

AI Meets Exascale Computing: Advancing Cancer Research With Large-Scale High Performance Computing.

Bhattacharya, Tanmoy; Brettin, Thomas; Doroshow, James H; Evrard, Yvonne A; Greenspan, Emily J; Gryshuk, Amy L; Hoang, Thuc T; Lauzon, Carolyn B Vea; Nissley, Dwight; Penberthy, Lynne; Stahlberg, Eric; Stevens, Rick; Streitz, Fred; Tourassi, Georgia; Xia, Fangfang; Zaki, George.

Front Oncol ; 9: 984, 2019.

Article in English | MEDLINE | ID: mdl-31632915

ABSTRACT

The application of data science in cancer research has been boosted by major advances in three primary areas: (1) Data: diversity, amount, and availability of biomedical data; (2) Advances in Artificial Intelligence (AI) and Machine Learning (ML) algorithms that enable learning from complex, large-scale data; and (3) Advances in computer architectures allowing unprecedented acceleration of simulation and machine learning algorithms. These advances help build in silico ML models that can provide transformative insights from data including: molecular dynamics simulations, next-generation sequencing, omics, imaging, and unstructured clinical text documents. Unique challenges persist, however, in building ML models related to cancer, including: (1) access, sharing, labeling, and integration of multimodal and multi-institutional data across different cancer types; (2) developing AI models for cancer research capable of scaling on next generation high performance computers; and (3) assessing robustness and reliability in the AI models. In this paper, we review the National Cancer Institute (NCI)-Department of Energy (DOE) collaboration, Joint Design of Advanced Computing Solutions for Cancer (JDACS4C), a multi-institution collaborative effort focused on advancing computing and data technologies to accelerate cancer research on three levels: molecular, cellular, and population. This collaboration integrates various types of generated data, pre-exascale compute resources, and advances in ML models to increase understanding of basic cancer biology, identify promising new treatment options, predict outcomes, and eventually prescribe specialized treatments for patients with cancer.

19.

Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records.

Savova, Guergana K; Danciu, Ioana; Alamudun, Folami; Miller, Timothy; Lin, Chen; Bitterman, Danielle S; Tourassi, Georgia; Warner, Jeremy L.

Cancer Res ; 79(21): 5463-5470, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31395609

ABSTRACT

Current models for correlating electronic medical records with -omics data largely ignore clinical text, which is an important source of phenotype information for patients with cancer. This data convergence has the potential to reveal new insights about cancer initiation, progression, metastasis, and response to treatment. Insights from this real-world data will catalyze clinical care, research, and regulatory activities. Natural language processing (NLP) methods are needed to extract these rich cancer phenotypes from clinical text. Here, we review the advances of NLP and information extraction methods relevant to oncology based on publications from PubMed as well as NLP and machine learning conference proceedings in the last 3 years. Given the interdisciplinary nature of the fields of oncology and information extraction, this analysis serves as a critical trail marker on the path to higher fidelity oncology phenotypes from real-world data.

Subject(s)

Data Mining/methods , Medical Oncology/methods , Electronic Health Records , Humans , Machine Learning , Natural Language Processing , Phenotype

20.

Harnessing the Power of Collaboration and Training Within Clinical Data Science to Generate Real-World Evidence in the Era of Precision Oncology.

Rivera, Donna R; Lee, Jerry S H; Hsu, Elizabeth; Khoury, Muin J; Meng, Frank; Olivero, Ofelia; Penberthy, Lynne; Tourassi, Georgia D.

Clin Pharmacol Ther ; 106(1): 60-66, 2019 07.

Article in English | MEDLINE | ID: mdl-31166005

Subject(s)

Cooperative Behavior , Data Science/organization & administration , Neoplasms/therapy , Precision Medicine/methods , Product Surveillance, Postmarketing/methods , Data Collection , Data Interpretation, Statistical , Diffusion of Innovation , Humans , Interdisciplinary Communication
