Search | BVS - MINISTÉRIO DA SAÚDE

Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports.

Elmarakeby, Haitham A; Trukhanov, Pavel S; Arroyo, Vidal M; Riaz, Irbaz Bin; Schrag, Deborah; Van Allen, Eliezer M; Kehl, Kenneth L.

BMC Bioinformatics ; 24(1): 328, 2023 Sep 02.

Article in English | MEDLINE | ID: mdl-37658330

ABSTRACT

BACKGROUND: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. RESULTS: We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. CONCLUSION: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.

Subjects

Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Humans , Carcinoma, Non-Small-Cell Lung/diagnostic imaging , Lung Neoplasms/diagnostic imaging , Disease Progression , Electric Power Supplies , Electronic Health Records

Biologically informed deep neural network for prostate cancer discovery.

Elmarakeby, Haitham A; Hwang, Justin; Arafeh, Rand; Crowdis, Jett; Gang, Sydney; Liu, David; AlDubayan, Saud H; Salari, Keyan; Kregel, Steven; Richter, Camden; Arnoff, Taylor E; Park, Jihye; Hahn, William C; Van Allen, Eliezer M.

Nature ; 598(7880): 348-352, 2021 Oct.

Article in English | MEDLINE | ID: mdl-34552244

ABSTRACT

The determination of molecular features that mediate clinically aggressive phenotypes in prostate cancer remains a major biological and clinical challenge [1,2]. Recent advances in interpretability of machine learning models as applied to biomedical problems may enable discovery and prediction in clinical cancer genomics [3-5]. Here we developed P-NET, a biologically informed deep learning model, to stratify patients with prostate cancer by treatment-resistance state and evaluate molecular drivers of treatment resistance for therapeutic targeting through complete model interpretability. We demonstrate that P-NET can predict cancer state using molecular data with a performance that is superior to other modelling approaches. Moreover, the biological interpretability within P-NET revealed established and novel molecularly altered candidates, such as MDM4 and FGFR1, which were implicated in predicting advanced disease and validated in vitro. Broadly, biologically informed fully interpretable neural networks enable preclinical discovery and clinical prediction in prostate cancer and may have general applicability across cancer types.

Subjects

Deep Learning , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/drug therapy , Cell Cycle Proteins/genetics , Drug Resistance, Neoplasm/drug effects , Drug Resistance, Neoplasm/genetics , Humans , Male , Prostatic Neoplasms/genetics , Proto-Oncogene Proteins/genetics , Receptor, Fibroblast Growth Factor, Type 1/genetics , Receptors, Androgen/genetics , Reproducibility of Results , Tumor Suppressor Protein p53/genetics

Systematic auditing is essential to debiasing machine learning in biology.

Eid, Fatma-Elzahraa; Elmarakeby, Haitham A; Chan, Yujia Alina; Fornelos, Nadine; ElHefnawi, Mahmoud; Van Allen, Eliezer M; Heath, Lenwood S; Lage, Kasper.

Commun Biol ; 4(1): 183, 2021 Feb 10.

Article in English | MEDLINE | ID: mdl-33568741

ABSTRACT

Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications.

Subjects

Data Mining , Machine Learning , Proteins/metabolism , Proteome , Proteomics , Animals , Bias , Databases, Protein , Histocompatibility Antigens/metabolism , Humans , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism , Protein Binding , Protein Interaction Maps , Proteins/chemistry , Reproducibility of Results

ATM Loss Confers Greater Sensitivity to ATR Inhibition Than PARP Inhibition in Prostate Cancer.

Rafiei, Shahrzad; Fitzpatrick, Kenyon; Liu, David; Cai, Mu-Yan; Elmarakeby, Haitham A; Park, Jihye; Ricker, Cora; Kochupurakkal, Bose S; Choudhury, Atish D; Hahn, William C; Balk, Steven P; Hwang, Justin H; Van Allen, Eliezer M; Mouw, Kent W.

Cancer Res ; 80(11): 2094-2100, 2020 Jun 01.

Article in English | MEDLINE | ID: mdl-32127357

ABSTRACT

Alterations in DNA damage response (DDR) genes are common in advanced prostate tumors and are associated with unique genomic and clinical features. ATM is a DDR kinase that has a central role in coordinating DNA repair and cell-cycle response following DNA damage, and ATM alterations are present in approximately 5% of advanced prostate tumors. Recently, inhibitors of PARP have demonstrated activity in advanced prostate tumors harboring DDR gene alterations, particularly in tumors with BRCA1/2 alterations. However, the role of alterations in DDR genes beyond BRCA1/2 in mediating PARP inhibitor sensitivity is poorly understood. To define the role of ATM loss in prostate tumor DDR function and sensitivity to DDR-directed agents, we created a series of ATM-deficient preclinical prostate cancer models and tested the impact of ATM loss on DNA repair function and therapeutic sensitivities. ATM loss altered DDR signaling, but did not directly impact homologous recombination function. Furthermore, ATM loss did not significantly impact sensitivity to PARP inhibition but robustly sensitized to inhibitors of the related DDR kinase ATR. These results have important implications for planned and ongoing prostate cancer clinical trials and suggest that patients with tumor ATM alterations may be more likely to benefit from ATR inhibitor than PARP inhibitor therapy. SIGNIFICANCE: ATM loss occurs in a subset of prostate tumors. This study shows that deleting ATM in prostate cancer models does not significantly increase sensitivity to PARP inhibition but does sensitize to ATR inhibition. See related commentary by Setton and Powell, p. 2085.

Subjects

Poly(ADP-ribose) Polymerase Inhibitors , Prostatic Neoplasms , Ataxia Telangiectasia Mutated Proteins , DNA Damage , DNA Repair , Genome , Humans , Male

Harmonization of Tumor Mutational Burden Quantification and Association With Response to Immune Checkpoint Blockade in Non-Small-Cell Lung Cancer.

Vokes, Natalie I; Liu, David; Ricciuti, Biagio; Jimenez-Aguilar, Elizabeth; Rizvi, Hira; Dietlein, Felix; He, Meng Xiao; Margolis, Claire A; Elmarakeby, Haitham A; Girshman, Jeffrey; Adeni, Anika; Sanchez-Vega, Francisco; Schultz, Nikolaus; Dahlberg, Suzanne; Zehir, Ahmet; Jänne, Pasi A; Nishino, Mizuki; Umeton, Renato; Sholl, Lynette M; Van Allen, Eliezer M; Hellmann, Matthew D; Awad, Mark M.

JCO Precis Oncol ; 3, 2019.

Article in English | MEDLINE | ID: mdl-31832578

ABSTRACT

PURPOSE: Heterogeneity in tumor mutational burden (TMB) quantification across sequencing platforms limits the application and further study of this potential biomarker of response to immune checkpoint inhibitors (ICI). We hypothesized that harmonization of TMB across platforms would enable integration of distinct clinical datasets to better characterize the association between TMB and ICI response. METHODS: Cohorts of NSCLC patients sequenced by one of three targeted panels or by whole exome sequencing (WES) were compared (total n=7297). TMB was calculated uniformly and compared across cohorts. TMB distributions were harmonized by applying a normal transformation followed by standardization to z-scores. In sub-cohorts of patients treated with ICIs (DFCI n=272; MSKCC n=227), the association between TMB and outcome was assessed. Durable clinical benefit (DCB) was defined as responsive/stable disease lasting ≥6 months. RESULTS: TMB values were higher in the panel cohorts than the WES cohort. Average mutation rates per gene were highly concordant across cohorts (Pearson coefficient 0.842-0.866). Subsetting the WES cohort by gene panels only partially reproduced the observed differences in TMB. Standardization of TMB into z-scores harmonized TMB distributions and enabled integration of the ICI-treated sub-cohorts. Simulations indicated that cohorts >900 are necessary for this approach. TMB did not associate with response in never smokers or patients harboring targetable driver alterations, although these analyses were under-powered. Increasing TMB thresholds increased DCB rate, but DCB rates within deciles varied. Receiver operator curves yielded an area under the curve of 0.614 with no natural inflection point. CONCLUSION: Z-score conversion harmonizes TMB values and enables integration of datasets derived from different sequencing panels. Clinical and biologic features may provide context to the clinical application of TMB, and warrant further study.
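The harmonization step described in the METHODS (a normal transformation followed by standardization to z-scores, applied per cohort) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which normal transformation was used, so a rank-based inverse normal transform is assumed here as a stand-in, and the function names, cohort names, and TMB values are all hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def normal_transform(values):
    """Rank-based inverse normal transform (an assumed choice of transformation).

    Tied values share their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # (rank - 0.5) / n keeps every probability strictly inside (0, 1)
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]


def to_z_scores(values):
    """Standardize to mean 0 and standard deviation 1 (fails if all values are equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def harmonize(tmb_by_cohort):
    """Transform and standardize each cohort separately."""
    return {
        name: to_z_scores(normal_transform(vals))
        for name, vals in tmb_by_cohort.items()
    }


# Invented toy values (mutations/Mb); not data from the study.
cohorts = {
    "panel_A": [4.0, 7.5, 9.1, 12.3, 15.0, 22.8, 40.2],
    "wes": [1.1, 2.0, 2.6, 3.9, 5.2, 8.4, 19.7],
}
harmonized = harmonize(cohorts)
```

Because each cohort is transformed and standardized on its own, a patient's harmonized value reflects position within that cohort's distribution and not the absolute TMB reported by a given platform, which is what allows cohorts sequenced on different panels to be pooled.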

A Proposal for a Genome Similarity-Based Taxonomy for Plant-Pathogenic Bacteria that Is Sufficiently Precise to Reflect Phylogeny, Host Range, and Outbreak Affiliation Applied to Pseudomonas syringae sensu lato as a Proof of Concept.

Vinatzer, Boris A; Weisberg, Alexandra J; Monteil, Caroline L; Elmarakeby, Haitham A; Sheppard, Samuel K; Heath, Lenwood S.

Phytopathology ; 107(1): 18-28, 2017 Jan.

Article in English | MEDLINE | ID: mdl-27552324

ABSTRACT

Taxonomy of plant pathogenic bacteria is challenging because pathogens of different crops often belong to the same named species but current taxonomy does not provide names for bacteria below the subspecies level. The introduction of the host range-based pathovar system in the 1980s provided a temporary solution to this problem but has many limitations. The affordability of genome sequencing now provides the opportunity for developing a new genome-based taxonomic framework. We already proposed to name individual bacterial isolates based on pairwise genome similarity. Here, we expand on this idea and propose to use genome similarity-based codes, which we now call life identification numbers (LINs), to describe and name bacterial taxa. Using 93 genomes of Pseudomonas syringae sensu lato, LINs were compared with a P. syringae genome tree whereby the assigned LINs were found to be informative of a majority of phylogenetic relationships. LINs also reflected host range and outbreak association for strains of P. syringae pathovar actinidiae, a pathovar for which many genome sequences are available. We conclude that LINs could provide the basis for a new taxonomic framework to address the shortcomings of the current pathovar system and to complement the current taxonomic system of bacteria in general.

Subjects

Genome, Bacterial/genetics , Host Specificity , Plant Diseases/microbiology , Plants/microbiology , Pseudomonas syringae/classification , Phylogeny , Pseudomonas syringae/genetics , Pseudomonas syringae/physiology , Sequence Analysis, DNA

Similarity-based codes sequentially assigned to ebolavirus genomes are informative of species membership, associated outbreaks, and transmission chains.

Weisberg, Alexandra J; Elmarakeby, Haitham A; Heath, Lenwood S; Vinatzer, Boris A.

Open Forum Infect Dis ; 2(1): ofv024, 2015 Jan.

Article in English | MEDLINE | ID: mdl-26034773

ABSTRACT

Background. Developing a universal standardized microbial typing and nomenclature system that provides phylogenetic and epidemiological information in real time has never been as urgent in public health as it is today. We previously proposed to use genome similarity as the basis for immediate and precise typing and naming of individual organisms or viruses. In this study, we tested the validity of the proposed system and applied it to the epidemiology of infectious diseases using Ebola virus disease (EVD) outbreaks as the example. Methods. One hundred twenty-eight publicly available ebolavirus genomes were compared with each other, and average nucleotide identity (ANI) was calculated. The ANI was then used to assign unique codes, hereafter referred to as Life Identification Numbers (LINs), to every viral isolate, whereby each LIN consisted of a series of positions reflecting increasing genome similarity. Congruence of LINs with phylogenetic and epidemiological relationships was then determined. Results. Assigned LINs correlate with phylogeny at the species and infraspecies level and can even identify some individual transmission chains during the 2014-2015 EVD epidemic in West Africa. Conclusions. Life Identification Numbers can provide a fast, automated, standardized, and scalable approach to precisely identify and name viral isolates upon genome sequence submission, facilitating unambiguous communication during disease epidemics among clinicians, epidemiologists, and governments.
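The idea of a code whose successive positions reflect increasing genome similarity can be illustrated with a small sketch. It assumes one plausible assignment rule: a new genome copies the LIN of its most similar coded genome up to the last ANI threshold it meets, takes the next unused number at the following position, and is padded with zeros. The thresholds, the function name `assign_lin`, and the ANI values below are hypothetical and are not taken from the study.

```python
# Illustrative ANI thresholds (%), one per LIN position, in increasing order.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0]


def assign_lin(ani_to, assigned):
    """Return a LIN for a new genome.

    ani_to:   dict mapping genome id -> ANI (%) between the new genome and that genome
    assigned: dict mapping genome id -> LIN tuple for genomes already coded
    """
    if not assigned:
        return (0,) * len(THRESHOLDS)  # the first genome gets all zeros

    # Most similar genome that already has a LIN.
    nearest = max(assigned, key=lambda g: ani_to[g])
    ani = ani_to[nearest]

    # Count how many leading thresholds the ANI meets.
    shared = 0
    for threshold in THRESHOLDS:
        if ani < threshold:
            break
        shared += 1

    if shared == len(THRESHOLDS):
        return assigned[nearest]  # similar at every position: same LIN

    prefix = assigned[nearest][:shared]
    # Next unused number at the first position that differs, within this prefix.
    used = [lin[shared] for lin in assigned.values() if lin[:shared] == prefix]
    padding = (0,) * (len(THRESHOLDS) - shared - 1)
    return prefix + (max(used) + 1,) + padding


# Invented toy ANI values, processed in submission order.
lins = {}
lins["g1"] = assign_lin({}, lins)
lins["g2"] = assign_lin({"g1": 96.0}, lins)
lins["g3"] = assign_lin({"g1": 85.0, "g2": 84.5}, lins)
lins["g4"] = assign_lin({"g1": 84.0, "g2": 84.0, "g3": 97.0}, lins)
```

Under this rule, two genomes that share a longer LIN prefix are more similar to each other, so grouping by prefix length gives progressively finer clusters, which is the property the abstract relates to species, infraspecies groups, and transmission chains.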
