ABSTRACT
Genetic alterations in signaling pathways that control cell-cycle progression, apoptosis, and cell growth are common hallmarks of cancer, but the extent, mechanisms, and co-occurrence of alterations in these pathways differ between individual tumors and tumor types. Using mutations, copy-number changes, mRNA expression, gene fusions and DNA methylation in 9,125 tumors profiled by The Cancer Genome Atlas (TCGA), we analyzed the mechanisms and patterns of somatic alterations in ten canonical pathways: cell cycle, Hippo, Myc, Notch, Nrf2, PI-3-Kinase/Akt, RTK-RAS, TGFβ signaling, p53 and β-catenin/Wnt. We charted the detailed landscape of pathway alterations in 33 cancer types, stratified into 64 subtypes, and identified patterns of co-occurrence and mutual exclusivity. Eighty-nine percent of tumors had at least one driver alteration in these pathways, and 57% of tumors had at least one alteration potentially targetable by currently available drugs. Thirty percent of tumors had multiple targetable alterations, indicating opportunities for combination therapy.
Subjects
Genetic Databases , Neoplasms/pathology , Signal Transduction/genetics , Neoplasm Genes , Humans , Neoplasms/genetics , Phosphatidylinositol 3-Kinases/genetics , Phosphatidylinositol 3-Kinases/metabolism , Transforming Growth Factor beta/genetics , Transforming Growth Factor beta/metabolism , Tumor Suppressor Protein p53/genetics , Tumor Suppressor Protein p53/metabolism , Wnt Proteins/genetics , Wnt Proteins/metabolism
ABSTRACT
Precision medicine initiatives across the globe have led to a revolution of repositories linking large-scale genomic data with electronic health records, enabling genomic analyses across the entire phenome. Many of these initiatives focus solely on research insights, leading to limited direct benefit to patients. We describe the biobank at the Colorado Center for Personalized Medicine (CCPM Biobank) that was jointly developed by the University of Colorado Anschutz Medical Campus and UCHealth to serve as a unique, dual-purpose research and clinical resource accelerating personalized medicine. This living resource currently has more than 200,000 participants with ongoing recruitment. We highlight the clinical, laboratory, regulatory, and HIPAA-compliant informatics infrastructure along with our stakeholder engagement, consent, recontact, and participant engagement strategies. We characterize aspects of genetic and geographic diversity unique to the Rocky Mountain region, the primary catchment area for CCPM Biobank participants. We leverage linked health and demographic information of the CCPM Biobank participant population to demonstrate the utility of the CCPM Biobank to replicate complex trait associations in the first 33,674 genotyped individuals across multiple disease domains. Finally, we describe our current efforts toward return of clinical genetic test results, including high-impact pathogenic variants and pharmacogenetic information, and our broader goals as the CCPM Biobank continues to grow. Bringing clinical and research interests together fosters unique clinical and translational questions that can be addressed from the large EHR-linked CCPM Biobank resource within a HIPAA- and CLIA-certified environment.
Subjects
Learning Health System , Precision Medicine , Humans , Biological Specimen Banks , Colorado , Genomics
ABSTRACT
Data sharing anchors reproducible science, but expectations and best practices are often nebulous. Communities of funders, researchers and publishers continue to grapple with what should be required or encouraged. To illuminate the rationales for sharing data, the technical challenges and the social and cultural challenges, we consider the stakeholders in the scientific enterprise. In biomedical research, participants are key among those stakeholders. Ethical sharing requires considering both the value of research efforts and the privacy costs for participants. We discuss current best practices for various types of genomic data, as well as opportunities to promote ethical data sharing that accelerates science by aligning incentives.
Subjects
Biomedical Research/methods , Biomedical Research/trends , Genomics/ethics , Information Dissemination/ethics , Research Personnel/trends , Cooperative Behavior , Humans , Privacy
ABSTRACT
Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints to published biomedical text as a whole, as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint-peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify the journals and articles that are most linguistically similar to a bioRxiv or medRxiv preprint, as well as observe where the preprint would be positioned within a published article landscape.
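The embedding-based journal matching described in this abstract can be illustrated with a minimal sketch. It assumes that a document embedding is the plain average of its word vectors and that journals are ranked by cosine similarity to a per-journal centroid; the abstract does not spell out either choice, so both are assumptions made for illustration. The word vectors, journal names, and function names below are hypothetical toy values, not output of the authors' word2vec model.

```python
import math


def document_embedding(tokens, word_vectors):
    """Average the vectors of the tokens found in the vocabulary (assumed scheme)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token of this document is in the vocabulary")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_journals(doc_vec, journal_centroids):
    """Return (journal, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(doc_vec, centroid))
        for name, centroid in journal_centroids.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional word vectors and journal centroids.
WORD_VECTORS = {
    "genome": [1.0, 0.0],
    "sequencing": [0.8, 0.2],
    "patient": [0.0, 1.0],
    "trial": [0.2, 0.8],
}
JOURNAL_CENTROIDS = {
    "Hypothetical Genomics Journal": [0.9, 0.1],
    "Hypothetical Clinical Journal": [0.1, 0.9],
}

# Out-of-vocabulary tokens are skipped when averaging.
preprint_tokens = ["genome", "sequencing", "unknownword"]
ranking = rank_journals(
    document_embedding(preprint_tokens, WORD_VECTORS), JOURNAL_CENTROIDS
)
```

In this toy setting the genomics-flavored preprint ranks the genomics journal first; a real application would use the trained model's vocabulary and many more dimensions.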
Subjects
Language , Research Peer Review , Preprints as Topic , Biomedical Research , Publications/standards , Terminology as Topic
ABSTRACT
Evolving in sync with the computation revolution over the past 30 years, computational biology has emerged as a mature scientific field. While the field has made major contributions toward improving scientific knowledge and human health, individual computational biology practitioners at various institutions often languish in career development. As optimistic biologists passionate about the future of our field, we propose solutions for both eager and reluctant individual scientists, institutions, publishers, funding agencies, and educators to fully embrace computational biology. We believe that in order to pave the way for the next generation of discoveries, we need to improve recognition for computational biologists and better align pathways of career success with pathways of scientific progress. With 10 outlined steps, we call on all adjacent fields to move away from the traditional individual, single-discipline investigator research model and embrace multidisciplinary, data-driven, team science.
Subjects
Computational Biology , Budgets , Cooperative Behavior , Humans , Interdisciplinary Research , Mentoring , Motivation , Publications , Reward , Software
ABSTRACT
Those building predictive models from transcriptomic data are faced with two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems will still be well predicted by simple dividing lines, prefers linear models that are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal when predicting tissue and metadata sex labels from expression data by removing the predictive linear signal with Limma, and showed the removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical because while biological systems are high-dimensional, effective dividing lines for predictive models may not be.
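The idea of signal that a linear dividing line cannot capture can be illustrated with a toy sketch. This is not the study's method: it uses four hypothetical XOR-labeled points instead of expression data, a brute-force grid search over linear decision rules instead of logistic regression, and an explicit interaction feature as a stand-in for what a hidden layer could learn.

```python
from itertools import product

# Hypothetical XOR-labeled samples: the label is 1 when exactly one input is 1.
SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def accuracy(weights, bias, featurize):
    """Fraction of samples classified correctly by sign(w . f(x) + b)."""
    correct = 0
    for x, label in SAMPLES:
        feats = featurize(x)
        score = sum(w * f for w, f in zip(weights, feats)) + bias
        correct += int((score > 0) == bool(label))
    return correct / len(SAMPLES)


def best_accuracy(featurize, n_features):
    """Best accuracy over a coarse grid of weights and biases in [-3, 3]."""
    grid = [v / 2 for v in range(-6, 7)]
    return max(
        accuracy(ws, b, featurize)
        for ws in product(grid, repeat=n_features)
        for b in grid
    )


# A rule that is linear in the raw inputs cannot separate XOR.
linear_only = best_accuracy(lambda x: (x[0], x[1]), 2)
# Adding the product x1*x2 as a feature makes the labels separable.
with_interaction = best_accuracy(lambda x: (x[0], x[1], x[0] * x[1]), 3)
```

The linear rule tops out at three of four points, while the rule with the interaction term classifies all four, which is the sense in which a purely non-linear signal survives removal of the linear one.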
Subjects
Gene Expression , Nonlinear Dynamics , Gene Expression Profiling , Computer Neural Networks , Linear Models
ABSTRACT
MOTIVATION: Domain adaptation allows for the development of predictive models even in cases with limited sample data. Weighted elastic net domain adaptation specifically leverages features of genomic data to maximize transferability, but the method is too computationally demanding to apply to many genome-sized datasets. RESULTS: We developed wenda_gpu, which uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. We show that wenda_gpu returns comparable results to the original wenda implementation, and that it predicts cancer mutation status on small sample sizes better than regular elastic net. AVAILABILITY AND IMPLEMENTATION: wenda_gpu is available on GitHub at https://github.com/greenelab/wenda_gpu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Neoplasms , Software , Humans , Genomics/methods , Neoplasms/genetics , Sample Size
ABSTRACT
Pseudomonas aeruginosa strains with loss-of-function mutations in the transcription factor LasR are frequently encountered in the clinic and the environment. Among the characteristics common to LasR-defective (LasR-) strains is increased activity of the transcription factor Anr, relative to their LasR+ counterparts, in low-oxygen conditions. One of the Anr-regulated genes found to be highly induced in LasR- strains was PA14_42860 (PA1673), which we named mhr for microoxic hemerythrin. Purified P. aeruginosa Mhr protein contained the predicted di-iron center and bound molecular oxygen with an apparent Kd of ∼1 µM. Both Anr and Mhr were necessary for fitness in lasR+ and lasR mutant strains in colony biofilms grown in microoxic conditions, and the effects were more striking in the lasR mutant. Among genes in the Anr regulon, mhr was most closely coregulated with the Anr-controlled high-affinity cytochrome c oxidase genes. In the absence of high-affinity cytochrome c oxidases, deletion of mhr no longer caused a fitness disadvantage, suggesting that Mhr works in concert with microoxic respiration. We demonstrate that Anr and Mhr contribute to LasR- strain fitness even in biofilms grown in normoxic conditions. Furthermore, metabolomics data indicate that, in a lasR mutant, expression of Anr-regulated mhr leads to differences in metabolism in cells grown on lysogeny broth or artificial sputum medium. We propose that increased Anr activity leads to higher levels of the oxygen-binding protein Mhr, which confers an advantage to lasR mutants in microoxic conditions.
Subjects
Bacterial Proteins/metabolism , Cell Hypoxia/genetics , Genetic Fitness/genetics , Hemerythrin/metabolism , Pseudomonas aeruginosa , Trans-Activators/metabolism , Bacterial Proteins/genetics , Hemerythrin/genetics , Oxygen/metabolism , Pseudomonas aeruginosa/genetics , Pseudomonas aeruginosa/metabolism , Pseudomonas aeruginosa/physiology , Trans-Activators/genetics
ABSTRACT
Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a 'low-quality' cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA (mtDNA) encoded genes and (ii) if a small number of genes are detected. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on certain types of tissues, such as archived tumor tissues, or tissues associated with mitochondrial function, such as kidney tissue [1]. We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Our software package is available at https://bioconductor.org/packages/miQC.
Subjects
High-Throughput Nucleotide Sequencing/methods , Probability , Quality Control , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Mitochondrial DNA/genetics , Humans
ABSTRACT
Alternative splicing (AS) is frequent during early mouse embryonic development. Specific histone post-translational modifications (hPTMs) have been shown to regulate exon splicing by either directly recruiting splice machinery or indirectly modulating transcriptional elongation. In this study, we hypothesized that hPTMs regulate expression of alternatively spliced genes for specific processes during differentiation. To address this notion, we applied an innovative machine learning approach to relate global hPTM enrichment to AS regulation during mammalian tissue development. We found that specific hPTMs, H3K36me3 and H3K4me1, play a role in skipped exon selection among all the tissues and developmental time points examined. In addition, we used an iterative random forest model and found that interactions of multiple hPTMs most strongly predicted splicing when they included H3K36me3 and H3K4me1. Collectively, our data demonstrated a link between hPTMs and alternative splicing, which will drive further experimental studies on the functional relevance of these modifications to alternative splicing.
Subjects
Alternative Splicing , Embryonic Development/genetics , Exons , Histone Code , Animals , Logistic Models , Machine Learning , Mice , Post-Translational Protein Processing
ABSTRACT
Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.
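The two quantities this abstract relies on, a pooled marginal effect and the heterogeneity of marginal effects across sites, can be illustrated with standard fixed-effect meta-analysis formulas. This is a minimal sketch under that assumption: it computes an inverse-variance weighted mean and Cochran's Q from per-site summary statistics, and it does not reproduce the YETI2 test statistics themselves. The function name and the site-level numbers are hypothetical.

```python
def pooled_effect_and_heterogeneity(effects, std_errors):
    """Inverse-variance pooled effect, its standard error, and Cochran's Q.

    Only per-site summary statistics are needed, so no individual-level
    genotype data has to leave a site.
    """
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    # Q sums the weighted squared deviations of site effects from the pooled effect.
    cochran_q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    return pooled, pooled_se, cochran_q


# Hypothetical marginal effect estimates of one variant at three sites.
site_effects = [0.10, 0.30, 0.20]
site_std_errors = [0.10, 0.10, 0.10]
pooled, pooled_se, q = pooled_effect_and_heterogeneity(site_effects, site_std_errors)
```

Under the usual null of no heterogeneity, Q is compared against a chi-square distribution with (number of sites - 1) degrees of freedom; a large Q for a variant is the kind of across-site heterogeneity the abstract proposes to exploit.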
Subjects
Genetic Epistasis/genetics , Genome-Wide Association Study/methods , Urinary Bladder Neoplasms/genetics , Humans , Information Dissemination , Genetic Models , Single Nucleotide Polymorphism/genetics
ABSTRACT
Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses. We discuss appropriate applications of these methods, their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge - answering questions from high-dimensional data that we have not yet thought to ask.
Subjects
Statistical Data Interpretation , Genomics/statistics & numerical data , Proteomics/statistics & numerical data , Algorithms , Humans , Systems Biology/statistics & numerical data
ABSTRACT
MOTIVATION: Decreasing costs are making it feasible to perform time series proteomics and genomics experiments with more replicates and higher resolution than ever before. With more replicates and time points, proteome and genome-wide patterns of expression are more readily discernible. These larger experiments require more batches, exacerbating batch effects and increasing the number of bias trends. In the case of proteomics, where methods frequently result in missing data, this increasing scale is also decreasing the number of peptides observed in all samples. The sources of batch effects and missing data are incompletely understood, necessitating novel techniques. RESULTS: Here we show that by exploiting the structure of time series experiments, it is possible to accurately and reproducibly model and remove batch effects. We implement Learning and Imputation for Mass-spec Bias Reduction (LIMBR) software, which builds on previous block-based models of batch effects and includes features specific to time series and circadian studies. To aid in the analysis of time series proteomics experiments, which are often plagued with missing data points, we also integrate an imputation system. By building LIMBR for imputation and time series tailored bias modeling into one straightforward software package, we expect that the quality and ease of large-scale proteomics and genomics time series experiments will be significantly increased. AVAILABILITY AND IMPLEMENTATION: Python code and documentation are available for download at https://github.com/aleccrowell/LIMBR and LIMBR can be downloaded and installed with dependencies using 'pip install limbr'. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Software , Genome , Genomics , Mass Spectrometry , Proteomics
ABSTRACT
Open, collaborative research is a powerful paradigm that can immensely strengthen the scientific process by integrating broad and diverse expertise. However, traditional research and multi-author writing processes break down at scale. We present new software named Manubot, available at https://manubot.org, to address the challenges of open scholarly writing. Manubot adopts the contribution workflow used by many large-scale open source software projects to enable collaborative authoring of scholarly manuscripts. With Manubot, manuscripts are written in Markdown and stored in a Git repository to precisely track changes over time. By hosting manuscript repositories publicly, such as on GitHub, multiple authors can simultaneously propose and review changes. A cloud service automatically evaluates proposed changes to catch errors. Publication with Manubot is continuous: When a manuscript's source changes, the rendered outputs are rebuilt and republished to a web page. Manubot automates bibliographic tasks by implementing citation by identifier, where users cite persistent identifiers (e.g. DOIs, PubMed IDs, ISBNs, URLs), whose metadata is then retrieved and converted to a user-specified style. Manubot modernizes publishing to align with the ideals of open science by making it transparent, reproducible, immediate, versioned, collaborative, and free of charge.
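Citation by identifier, as described in this abstract, means the manuscript source carries keys such as a DOI or PubMed ID and the tooling resolves them to metadata later. The sketch below only illustrates the first step, pulling prefixed identifiers out of Markdown text. The regular expression is a simplified assumption and not Manubot's own parser, the function name is hypothetical, and the DOI and PubMed ID in the sample text are placeholders rather than real references.

```python
import re

# Simplified pattern (an assumption): "@prefix:identifier", where the
# identifier ends at whitespace, "]" or ";".
CITATION_PATTERN = re.compile(r"@(doi|pmid|isbn|url):([^\s\];]+)")


def extract_citations(markdown_text):
    """Return unique (prefix, identifier) pairs in order of first appearance."""
    seen = []
    for prefix, identifier in CITATION_PATTERN.findall(markdown_text):
        key = (prefix, identifier)
        if key not in seen:
            seen.append(key)
    return seen


# Placeholder identifiers, for illustration only.
text = (
    "Prior work [@doi:10.1234/example.5678] agrees with ours "
    "[@pmid:12345678; @doi:10.1234/example.5678]."
)
citations = extract_citations(text)
```

Deduplicating the keys mirrors the benefit the abstract points to: each persistent identifier needs its metadata retrieved only once, however often it is cited.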
Subjects
Publishing , Software , Writing , Humans , Medical Manuscripts as Topic
ABSTRACT
The Pseudomonas fluorescens genome encodes more than 50 proteins predicted to be involved in c-di-GMP signaling. Here, we demonstrated that, tested across 188 nutrients, these enzymes and effectors appeared capable of impacting biofilm formation. Transcriptional analysis of network members across ∼50 nutrient conditions indicates that altered gene expression can explain a subset of but not all biofilm formation responses to the nutrients. Additional organization of the network is likely achieved through physical interaction, as determined via probing ∼2,000 interactions by bacterial two-hybrid assays. Our analysis revealed a multimodal regulatory strategy using combinations of ligand-mediated signals, protein-protein interaction, and/or transcriptional regulation to fine-tune c-di-GMP-mediated responses. These results create a profile of a large c-di-GMP network that is used to make important cellular decisions, opening the door to future model building and the ability to engineer this complex circuitry in other bacteria. IMPORTANCE: Cyclic diguanylate (c-di-GMP) is a key signaling molecule regulating bacterial biofilm formation, and many microbes have up to dozens of proteins that make, break, or bind this dinucleotide. A major open issue in the field is how signaling specificity is conferred in the unpartitioned space of a bacterial cell. Here, we took a systems approach, using mutational analysis, transcriptional studies, and bacterial two-hybrid analysis to interrogate this network. We found that a majority of enzymes are capable of impacting biofilm formation in a context-dependent manner, and we revealed examples of two or more modes of regulation (i.e., transcriptional control with protein-protein interaction) being utilized to generate an observable impact on biofilm formation.
Subjects
Biofilms/growth & development , Cyclic GMP/analogs & derivatives , Bacterial Gene Expression Regulation , Pseudomonas fluorescens/growth & development , Cyclic GMP/genetics , Gene Expression Profiling , Pseudomonas fluorescens/genetics , Signal Transduction , Two-Hybrid System Techniques
ABSTRACT
One way to design a drug is to attempt to phenocopy a genetic variant that is known to have the desired effect. In general, drugs that are supported by genetic associations progress further in the development pipeline. However, the number of associations that are candidates for development into drugs is limited because many associations are in non-coding regions or difficult-to-target genes. Approaches that overlay information from pathway databases or biological networks can expand the potential target list. In cases where the initial variant is not targetable or there is no variant with the desired effect, this may reveal new means to target a disease. In this review, we discuss recent examples in the domain of pathway and network-based drug repositioning from genetic associations. We highlight important caveats and challenges for the field, and we discuss opportunities for further development.
ABSTRACT
We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation-based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.
Subjects
High-Throughput Nucleotide Sequencing , Search Engine , Transcriptome , Algorithms , Genetic Databases , Gene Ontology , Hedgehog Proteins/genetics , Hedgehog Proteins/metabolism , Humans , RNA
ABSTRACT
OBJECTIVES: Early prediction of undesired outcomes among newly hospitalized patients could improve patient triage and prompt conversations about patients' goals of care. We evaluated the performance of logistic regression, gradient boosting machine, random forest, and elastic net regression models, with and without unstructured clinical text data, to predict a binary composite outcome of in-hospital death or ICU length of stay greater than or equal to 7 days using data from the first 48 hours of hospitalization. DESIGN: Retrospective cohort study with split sampling for model training and testing. SETTING: A single urban academic hospital. PATIENTS: All hospitalized patients who required ICU care at the Beth Israel Deaconess Medical Center in Boston, MA, from 2001 to 2012. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: Among 25,947 eligible hospital admissions, we observed 5,504 (21.2%) in which patients died or had ICU length of stay greater than or equal to 7 days. The gradient boosting machine model had the highest discrimination without (area under the receiver operating characteristic curve, 0.83; 95% CI, 0.81-0.84) and with (area under the receiver operating characteristic curve, 0.89; 95% CI, 0.88-0.90) text-derived variables. Both gradient boosting machines and random forests outperformed logistic regression without text data (p < 0.001), whereas all models outperformed logistic regression with text data (p < 0.02). The inclusion of text data increased the discrimination of all four model types (p < 0.001). Among those models using text data, the increasing presence of the terms "intubated" and "poor prognosis" was positively associated with mortality and ICU length of stay, whereas the term "extubated" was inversely associated with them.
CONCLUSIONS: Variables extracted from unstructured clinical text from the first 48 hours of hospital admission using natural language processing techniques significantly improved the abilities of logistic regression and other machine learning models to predict which patients died or had long ICU stays. Learning health systems may adapt such models using open-source approaches to capture local variation in care patterns.
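The discrimination measure reported above, the area under the receiver operating characteristic curve, equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition on a handful of hypothetical labels and scores; it is an illustration of the metric, not the study's evaluation code, and it does not compute the confidence intervals.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical admissions: label 1 = death or ICU stay of 7 days or more.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
value = auroc(labels, scores)
```

Here five of the six positive-negative pairs are ordered correctly, so the value is 5/6. The pairwise loop is quadratic in the number of admissions; a rank-based computation would be preferable at the scale of the cohort described.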