ABSTRACT
Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1%) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones. We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data.
Subject(s)
Genetic Heterogeneity; Neoplasms/genetics; DNA Copy Number Variations; DNA, Neoplasm/chemistry; DNA, Neoplasm/metabolism; Databases, Genetic; Drug Resistance, Neoplasm/genetics; Humans; Neoplasms/pathology; Polymorphism, Single Nucleotide; Whole Genome Sequencing
ABSTRACT
Cancer develops through a process of somatic evolution1,2. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes3. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)4, we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.
Subject(s)
Evolution, Molecular; Genome, Human/genetics; Neoplasms/genetics; DNA Repair/genetics; Gene Dosage; Genes, Tumor Suppressor; Genetic Variation; Humans; Mutagenesis, Insertional/genetics
ABSTRACT
BACKGROUND: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and online pages. Automatically extracting such information and storing it in structured form could help researchers access it more easily and also make it possible to incorporate it into advanced integrative analyses. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. METHODS: Our method, called GRGT (Grammatical Relationship Graph for Triplets), extracts not only the pairs of terms that have certain relationships but also the type of relationship (the word describing the relationship). In addition, the directionality of the relationship can be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship between the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure, represented as a dependency graph in which words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. A flexible pattern-matching scheme is then used to match a triplet graph with an unknown relationship to the triplet graphs with labels (True or False) in the database. RESULTS: We applied the method to three benchmark datasets to extract protein-protein interactions (PPIs), and obtained better precision than the top-performing methods in the literature. CONCLUSIONS: We have developed a method to extract protein-protein interactions from the biomedical literature. 
PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extract relationship information between other bio-entities.
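The shortest-path step described in METHODS can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the dependency edges are written by hand rather than produced by a parser, the entity names `ProtA` and `ProtB` are hypothetical, and breadth-first search stands in for whichever shortest-path routine GRGT actually uses.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected view of a dependency graph.

    `edges` is a list of (head, dependent, label) tuples. The label is
    ignored here; only connectivity matters for path length.
    """
    graph = {}
    for head, dep, _label in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted for deterministic output when several paths tie.
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two words are not connected


def triplet_paths(edges, entity1, interaction, entity2):
    """Shortest paths among the three word pairs of a candidate triplet."""
    return {
        (entity1, interaction): shortest_path(edges, entity1, interaction),
        (interaction, entity2): shortest_path(edges, interaction, entity2),
        (entity1, entity2): shortest_path(edges, entity1, entity2),
    }


# Hand-written (hypothetical) parse of "ProtA directly activates ProtB in yeast."
EDGES = [
    ("activates", "ProtA", "nsubj"),
    ("activates", "ProtB", "obj"),
    ("activates", "directly", "advmod"),
    ("activates", "yeast", "obl"),
    ("yeast", "in", "case"),
]

PATHS = triplet_paths(EDGES, "ProtA", "activates", "ProtB")
```

In this toy sentence the path between the two entities runs through the interaction word, which is the kind of compact subgraph that could then be compared against labeled triplet graphs.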
Subject(s)
Algorithms; Information Storage and Retrieval/methods; Natural Language Processing; Proteins/metabolism; Databases, Factual
ABSTRACT
BACKGROUND: While many factors may contribute to the higher prostate cancer incidence and mortality experienced by African-American men compared to their counterparts, the contribution of tumor biology is underexplored due to inadequate availability of African-American patient-derived cell lines and specimens. Here, we characterize the proteomes of non-malignant RC-77 N/E and malignant RC-77 T/E prostate epithelial cell lines previously established from prostate specimens from the same African-American patient with early stage primary prostate cancer. METHODS: In this comparative proteomic analysis of RC-77 N/E and RC-77 T/E cells, differentially expressed proteins were identified and analyzed for overrepresentation of PANTHER protein classes, Gene Ontology annotations, and pathways. The enrichment of gene sets and pathway significance were assessed using Gene Set Enrichment Analysis and Signaling Pathway Impact Analysis, respectively. The gene and protein expression data of age- and stage-matched prostate cancer specimens from The Cancer Genome Atlas were analyzed. RESULTS: Structural and cytoskeletal proteins were differentially expressed and statistically overrepresented between RC-77 N/E and RC-77 T/E cells. Beta-catenin, alpha-actinin-1, and filamin-A were upregulated in the tumorigenic RC-77 T/E cells, while integrin beta-1, integrin alpha-6, caveolin-1, laminin subunit gamma-2, and CD44 antigen were downregulated. The increased protein level of beta-catenin and the reduction of caveolin-1 protein level in the tumorigenic RC-77 T/E cells mirrored the upregulation of beta-catenin mRNA and downregulation of caveolin-1 mRNA in African-American prostate cancer specimens compared to non-malignant controls. After subtracting race-specific non-malignant RNA expression, beta-catenin and caveolin-1 mRNA expression levels were higher in African-American prostate cancer specimens than in Caucasian-American specimens. 
The "ECM-Receptor Interaction" and "Cell Adhesion Molecules" pathways contained proteins associated with RC-77 N/E cells, whereas the "Tight Junction" and "Adherens Junction" pathways contained proteins associated with RC-77 T/E cells. CONCLUSIONS: Our results suggest that the RC-77 T/E and RC-77 N/E cell lines can be distinguished by differentially expressed structural and cytoskeletal proteins, which appeared in several pathways across multiple analyses. Our results indicate that the expression of beta-catenin and caveolin-1 may be prostate cancer- and race-specific. Although the RC-77 cell model may not be representative of all African-American prostate cancer due to tumor heterogeneity, it is a unique resource for studying prostate cancer initiation and progression.
Subject(s)
Neoplasm Proteins/genetics; Prostatic Neoplasms/genetics; Proteome/genetics; Proteomics; Black or African American; Cell Line, Tumor; Epithelial Cells/metabolism; Epithelial Cells/pathology; Gene Expression Regulation, Neoplastic; Humans; Male; Neoplasm Staging; Prostate/metabolism; Prostate/pathology; Prostatic Neoplasms/pathology; Signal Transduction/genetics
ABSTRACT
Background: Alzheimer's disease (AD), a progressive neurodegenerative disorder, continues to increase in prevalence, with no effective treatment to date. In this context, knowledge graphs (KGs) have emerged as a pivotal tool in biomedical research, offering new perspectives on drug repurposing and biomarker discovery by analyzing intricate network structures. Our study seeks to build an AD-specific knowledge graph, highlighting interactions among AD, genes, variants, chemicals, drugs, and other diseases. The goal is to shed light on existing treatments, potential targets, and diagnostic methods for AD, thereby aiding in drug repurposing and the identification of biomarkers. Results: We annotated 800 PubMed abstracts and leveraged GPT-4 for text augmentation to enrich our training data for named entity recognition (NER) and relation classification. A comprehensive data mining model, integrating NER and relation classification, was trained on the annotated corpus. This model was subsequently applied to extract relation triplets from unannotated abstracts. To enhance entity linking, we utilized a suite of reference biomedical databases and refined the linking accuracy through abbreviation resolution. As a result, we identified 3,199,276 entity mentions and 633,733 triplets, elucidating connections between 5,000 unique entities. These connections were pivotal in constructing a comprehensive Alzheimer's Disease Knowledge Graph (ADKG). We also integrated the ADKG constructed after entity linking with other biomedical databases. The ADKG served as a training ground for Knowledge Graph Embedding models, with the high-ranking predicted triplets supported by evidence, underscoring the utility of ADKG in generating testable scientific hypotheses. 
Further application of ADKG in predictive modeling using the UK Biobank data revealed that models based on ADKG outperformed others, as evidenced by larger areas under the receiver operating characteristic (ROC) curves. Conclusion: The ADKG is a valuable resource for generating hypotheses and enhancing predictive models, highlighting its potential to advance AD research and treatment strategies.
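The model comparison above rests on the area under the ROC curve. As a reminder of what that number measures, here is a minimal sketch that computes it from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and the two score lists are made-up illustration data, not results from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equivalent to the normalized Mann-Whitney U statistic. Quadratic in
    the number of samples, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: the same six cases scored by two hypothetical models.
LABELS = [1, 0, 1, 0, 1, 0]
MODEL_A = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
MODEL_B = [0.6, 0.5, 0.4, 0.7, 0.8, 0.2]

AUC_A = roc_auc(LABELS, MODEL_A)
AUC_B = roc_auc(LABELS, MODEL_B)
```

On this toy data the first model separates the classes perfectly, so its AUC exceeds that of the second; a real comparison would also need held-out data and a significance test.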
ABSTRACT
Intra-tumor heterogeneity is an important driver of tumor evolution and therapy response. Advances in precision cancer treatment will require understanding of mutation clonality and subclonal architecture. Currently the slow computational speed of subclonal reconstruction hinders large cohort studies. To overcome this bottleneck, we developed Clonal structure identification through Pairwise Penalization, or CliPP, which clusters subclonal mutations using a regularized likelihood model. CliPP reliably processed whole-genome and whole-exome sequencing data from over 12,000 tumor samples within 24 hours, thus enabling large-scale downstream association analyses between subclonal structures and clinical outcomes. Through a pan-cancer investigation of 7,827 tumors from 32 cancer types, we found that high subclonal mutational load (sML), a measure of latency time in tumor evolution, was significantly associated with better patient outcomes in 16 cancer types with low to moderate tumor mutation burden (TMB). In a cohort of prostate cancer patients participating in an immunotherapy clinical trial, high sML was indicative of favorable response to immune checkpoint blockade. This comprehensive study using CliPP underscores sML as a key feature of cancer. sML may be essential for linking mutation dynamics with immunotherapy response in the large population of non-high TMB cancers.
ABSTRACT
To cope with the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have emerged as a powerful data structure for integrating large volumes of heterogeneous data to facilitate accurate and efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured content from scientific literature into KGs has remained a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge to construct a large-scale KG using all PubMed abstracts. The quality of the large-scale information extraction rivals that of human expert annotations, signaling a new era of automatic, high-quality database construction from literature. Our extracted information markedly surpasses the amount of content in manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. The comprehensive KG enabled rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probability-based inference method to identify indirect causal relations and achieved unprecedented results for drug target identification and drug repurposing. Taking lung cancer as an example, a retrospective study showed that 40% of drug targets reported in the literature could have been predicted by our algorithm about 15 years ago, demonstrating that substantial acceleration in scientific discovery could be achieved through automated hypothesis generation and timely dissemination. A cloud-based platform (https://www.biokde.com) was developed for academic users to freely access this rich structured data and associated tools.
ABSTRACT
Single-cell RNA sequencing studies have suggested that total mRNA content correlates with tumor phenotypes. Technical and analytical challenges, however, have so far impeded at-scale pan-cancer examination of total mRNA content. Here we present a method to quantify tumor-specific total mRNA expression (TmS) from bulk sequencing data, taking into account tumor transcript proportion, purity and ploidy, which are estimated through transcriptomic/genomic deconvolution. We estimate and validate TmS in 6,590 patient tumors across 15 cancer types, identifying significant inter-tumor variability. Across cancers, high TmS is associated with increased risk of disease progression and death. TmS is influenced by cancer-specific patterns of gene alteration and intra-tumor genetic heterogeneity as well as by pan-cancer trends in metabolic dysregulation. Taken together, our results indicate that measuring cell-type-specific total mRNA expression in tumor cells predicts tumor phenotypes and clinical outcomes.
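The abstract does not give the TmS formula, so the following is only a sketch of how tumor transcript proportion, purity and ploidy could combine under a simple two-component mixing model, which is an assumption made here. If tumor cells make up a fraction rho of the cells and contribute a fraction pi of the transcripts, the per-cell total mRNA ratio between tumor and non-tumor cells follows algebraically. Dividing by ploidy to express the ratio per haploid genome is a further assumption, as are all function and parameter names.

```python
def per_cell_mrna_ratio(pi, rho):
    """Total mRNA per tumor cell relative to per non-tumor cell.

    Assumes pi = rho*S_T / (rho*S_T + (1 - rho)*S_N), where S_T and S_N
    are per-cell mRNA totals; solving for S_T / S_N gives the expression
    returned below.
    pi  : tumor transcript proportion, strictly between 0 and 1
    rho : tumor purity (fraction of cells that are tumor), between 0 and 1
    """
    if not (0.0 < pi < 1.0 and 0.0 < rho < 1.0):
        raise ValueError("pi and rho must lie strictly between 0 and 1")
    return (pi * (1.0 - rho)) / ((1.0 - pi) * rho)


def per_haploid_genome_ratio(pi, rho, tumor_ploidy, normal_ploidy=2.0):
    """Same ratio, normalized by genome copies (hypothetical normalization)."""
    if tumor_ploidy <= 0 or normal_ploidy <= 0:
        raise ValueError("ploidy must be positive")
    return per_cell_mrna_ratio(pi, rho) * normal_ploidy / tumor_ploidy


# Illustration with made-up inputs: half the cells are tumor cells, but they
# contribute 80% of the transcripts; the tumor genome is tetraploid.
RATIO_PER_CELL = per_cell_mrna_ratio(pi=0.8, rho=0.5)
RATIO_PER_GENOME = per_haploid_genome_ratio(pi=0.8, rho=0.5, tumor_ploidy=4.0)
```

With these inputs each tumor cell carries roughly four times the mRNA of a non-tumor cell, and roughly twice as much once the doubled genome is accounted for.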
Subject(s)
Neoplasms; Humans; Neoplasms/genetics; Neoplasms/metabolism; Genetic Heterogeneity; Genomics; RNA, Messenger/genetics; Disease Progression
ABSTRACT
Bayesian networks (BNs) provide a probabilistic, graphical framework for modeling high-dimensional joint distributions with complex correlation structures. BNs have wide applications in many disciplines, including biology, social science, finance and biomedical science. Despite extensive studies in the past, network structure learning from data is still a challenging open question in BN research. In this study, we present a sequential Monte Carlo (SMC)-based three-stage approach, GRowth-based Approach with Staged Pruning (GRASP). A double filtering strategy was first used for discovering the overall skeleton of the target BN. To search for the optimal network structures we designed an adaptive SMC (adSMC) algorithm to increase the quality and diversity of sampled networks which were further improved by a third stage to reclaim edges missed in the skeleton discovery step. GRASP gave very satisfactory results when tested on benchmark networks. Finally, BN structure learning using multiple types of genomics data illustrates GRASP's potential in discovering novel biological relationships in integrative genomic studies.
ABSTRACT
Radiomics leverages existing image datasets to extract non-visible data via image post-processing, with the aim of identifying prognostic and predictive imaging features at a sub-region of interest level. However, the application of radiomics is hampered by several challenges, such as the lack of standardized image acquisition and analysis methods, which impedes generalizability. As yet, radiomics remains intriguing but not clinically validated. We aimed to test the feasibility of a non-custom-constructed platform for disseminating existing large, standardized databases across institutions to promote radiomics studies. Hence, the University of Texas MD Anderson Cancer Center organized two public radiomics challenges in the head and neck radiation oncology domain. This was done in conjunction with the MICCAI 2016 satellite symposium using Kaggle-in-Class, a machine-learning and predictive analytics platform. We drew on clinical data matched to radiomics data derived from diagnostic contrast-enhanced computed tomography (CECT) images in a dataset of 315 patients with oropharyngeal cancer. Contestants were tasked with developing models for (i) classifying patients according to their human papillomavirus status, or (ii) predicting local tumor recurrence following radiotherapy. Data were split into training and test sets. Seventeen teams from various professional domains participated in one or both of the challenges. This review paper was based on feedback provided by only 8 contestants (47%). Six contestants (75%) incorporated extracted radiomics features into their predictive model building, either alone (n = 5; 62.5%), as was the case with the winner of the "HPV" challenge, or in conjunction with matched clinical attributes (n = 2; 25%). Notably, only 23% of contestants, including the winner of the "local recurrence" challenge, built their models relying solely on clinical data. 
In addition to showing the value of integrating machine learning into clinical decision-making, our experience sheds light on challenges in sharing and directing existing datasets toward clinical applications of radiomics, including the hyper-dimensionality of the clinical and imaging data attributes. Our experience may help guide researchers in creating a framework for sharing and reusing already published data, which we believe will ultimately accelerate the pace of clinical applications of radiomics in both challenge and clinical settings.
ABSTRACT
Choosing the optimal chemotherapy regimen is still an unmet medical need for breast cancer patients. In this study, we reanalyzed data from seven independent data sets totaling 1,079 breast cancer patients. The patients were treated with three different types of commonly used neoadjuvant chemotherapies: anthracycline alone, anthracycline plus paclitaxel, and anthracycline plus docetaxel. We developed random forest models with variable selection using both genetic and clinical variables to predict the response of a patient, using pCR (pathological complete response) as the measure of response. The models were then used to reassign an optimal regimen to each patient to maximize the chance of pCR. An independent validation was performed in which each independent study was left out during model building and later used for validation. The expected pCR rates of our method are significantly higher than the rates of the best treatments for all seven independent studies. A validation study on 21 breast cancer cell lines showed that our prediction agrees with their drug-sensitivity profiles. In conclusion, the new strategy, called PRES (Personalized REgimen Selection), may significantly increase response rates for breast cancer patients, especially those with HER2- and ER-negative tumors, who will receive one of the widely accepted chemotherapy regimens.
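Two steps in this abstract lend themselves to a short sketch: holding out one study at a time for validation, and reassigning each patient to the regimen with the highest predicted pCR probability. The code below illustrates only that control flow. The record layout, study identifiers and probabilities are hypothetical, and no random forest is trained here; the per-regimen probabilities are assumed to come from models fitted elsewhere.

```python
def leave_one_study_out(records):
    """Yield (held_out_study, train, test), holding out each study once.

    `records` is a list of dicts that each carry a "study" key.
    """
    studies = sorted({r["study"] for r in records})
    for held_out in studies:
        train = [r for r in records if r["study"] != held_out]
        test = [r for r in records if r["study"] == held_out]
        yield held_out, train, test


def select_regimen(predicted_pcr):
    """Return the regimen with the highest predicted pCR probability.

    `predicted_pcr` maps regimen name -> predicted probability of pCR.
    Ties are broken alphabetically so the choice is deterministic.
    """
    if not predicted_pcr:
        raise ValueError("no regimens to choose from")
    return max(sorted(predicted_pcr), key=predicted_pcr.get)


# Hypothetical patients from three made-up studies.
RECORDS = [
    {"study": "S1", "patient": "p1"},
    {"study": "S1", "patient": "p2"},
    {"study": "S2", "patient": "p3"},
    {"study": "S3", "patient": "p4"},
]
SPLITS = list(leave_one_study_out(RECORDS))

# Hypothetical model outputs for one patient under the three regimens.
CHOICE = select_regimen({
    "anthracycline": 0.20,
    "anthracycline+paclitaxel": 0.35,
    "anthracycline+docetaxel": 0.30,
})
```

Holding out whole studies rather than random patients keeps study-specific batch effects out of the training data, which gives a more honest estimate of how a model transfers to a new cohort.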
Subject(s)
Antineoplastic Agents/administration & dosage; Breast Neoplasms/drug therapy; Breast Neoplasms/pathology; Drug Therapy/methods; Precision Medicine/methods; Transcriptome; Anthracyclines/administration & dosage; Cell Line, Tumor; Docetaxel; Female; Humans; Male; Models, Biological; Paclitaxel/administration & dosage; Taxoids/administration & dosage
ABSTRACT
Human Papilloma Virus (HPV) has been associated with oropharyngeal cancer prognosis. Traditionally, HPV status is determined through an invasive laboratory test. Recently, the rapid development of statistical image analysis techniques has enabled precise quantitative analysis of medical images. Quantitative analysis of Computed Tomography (CT) images provides a non-invasive way to assess HPV status in oropharyngeal cancer patients. We designed a statistical radiomics approach that analyzes CT images to predict HPV status. Various radiomics features were extracted from CT scans and analyzed using statistical feature selection and prediction methods. Our approach ranked highest in the 2016 Medical Image Computing and Computer Assisted Intervention (MICCAI) grand challenge: Oropharynx Cancer (OPC) Radiomics Challenge, Human Papilloma Virus (HPV) Status Prediction. Further analysis of the most relevant radiomic features distinguishing HPV-positive and HPV-negative subjects suggested that HPV-positive patients usually have smaller and simpler tumors.
ABSTRACT
INTRODUCTION: African American (AA) women diagnosed with breast cancer are more likely to have aggressive subtypes. Investigating differentially expressed genes between patient populations may help explain racial health disparities. Resistin, one such gene, is linked to inflammation, obesity, and breast cancer risk. Previous studies indicated that resistin expression is higher in the serum and tissue of AA breast cancer patients than in Caucasian American (CA) patients. However, resistin expression levels have not been compared between AA and CA patients in a stage- and subtype-specific context. Breast cancer prognosis and treatments vary by subtype. This work investigates differential resistin gene expression in human breast cancer tissues of specific stages, receptor subtypes, and menopause statuses in AA and CA women. METHODS: Differential gene expression analysis was performed using human breast cancer gene expression data from The Cancer Genome Atlas. We compared resistin gene expression levels between AA and CA samples, stratified by receptor status and stage. DESeq was run to test for differential resistin expression. RESULTS: Resistin RNA was higher in AA women overall, with the highest values in receptor-negative subtypes. Estrogen receptor-, progesterone receptor-, and human epidermal growth factor receptor 2-negative groups showed statistically significantly elevated resistin levels in Stage I and II AA women compared to CA women. In inter-racial comparisons, AA women had significantly higher levels of resistin regardless of menopause status. In whole-population comparisons, resistin expression was higher among Stage I and III estrogen receptor-negative cases. In comparisons of molecular subtypes, resistin levels were significantly higher in triple-negative than in luminal A breast cancer. 
CONCLUSION: Resistin gene expression levels were significantly higher in receptor-negative subtypes, especially estrogen receptor-negative cases, in AA women. Resistin may serve as an early breast cancer biomarker and a possible therapeutic target for breast cancer in AA women.