ABSTRACT
Improving access to growing amounts of biomedical data provides new opportunities for advanced patient stratification and disease subtyping strategies. This requires computational tools that produce uniformly robust results across highly heterogeneous molecular data. Unsupervised machine learning methodologies are able to discover de novo patterns in such data. Biclustering is especially well suited because it simultaneously identifies sample groups and corresponding feature sets across heterogeneous omics data. However, the performance of available biclustering algorithms depends heavily on individual parameterization and varies with their application. Here, we developed MoSBi (molecular signature identification using biclustering), an automated multialgorithm ensemble approach that integrates results using an error model-supported similarity network. We systematically evaluated the performance of 11 available and established biclustering algorithms together with MoSBi. For this, we used transcriptomics, proteomics, and metabolomics data, as well as synthetic datasets covering various data properties. Profiting from multialgorithm integration, MoSBi identified robust group and disease-specific signatures across all scenarios, overcoming single algorithm specificities. Furthermore, we developed a scalable network-based visualization of bicluster communities that supports biological hypothesis generation. MoSBi is available as an R package and web service to make automated biclustering analysis accessible for application in molecular sample stratification.
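The idea of linking biclusters from several algorithms in a similarity network can be sketched as follows. This is only an illustration: the Jaccard index over covered matrix cells, the fixed threshold of 0.5, and the function names are assumptions made here, not MoSBi's actual similarity metric or error model.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bicluster_cells(rows, cols):
    """All (sample, feature) cells covered by one bicluster."""
    return {(r, c) for r in rows for c in cols}

def similarity_edges(biclusters, threshold=0.5):
    """Return (i, j, similarity) for every bicluster pair at or above the threshold.

    The threshold is a hypothetical fixed cut-off chosen for this sketch.
    """
    cells = [bicluster_cells(rows, cols) for rows, cols in biclusters]
    edges = []
    for i, j in combinations(range(len(cells)), 2):
        s = jaccard(cells[i], cells[j])
        if s >= threshold:
            edges.append((i, j, s))
    return edges

# Hypothetical biclusters, e.g. from different algorithms: (sample set, feature set)
biclusters = [
    ({"s1", "s2", "s3"}, {"g1", "g2"}),
    ({"s1", "s2", "s3"}, {"g1", "g2", "g3"}),
    ({"s7", "s8"}, {"g9"}),
]
edges = similarity_edges(biclusters)
```

Connected groups of nodes in the resulting edge list would correspond to biclusters that several runs agree on; isolated nodes are candidates for algorithm-specific findings.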
Subjects
Disease, Gene Expression Profiling, Metabolomics, Patients, Proteomics, Software, Algorithms, Cluster Analysis, Disease/classification, Humans, Patients/classification
ABSTRACT
BACKGROUND: Distinguishing diseases into distinct subtypes is crucial for study and effective treatment strategies. The Open Targets Platform (OT) integrates biomedical, genetic, and biochemical datasets to empower disease ontologies, classifications, and potential gene targets. Nevertheless, many disease annotations are incomplete, requiring laborious expert medical input. This challenge is especially pronounced for rare and orphan diseases, where resources are scarce. METHODS: We present a machine learning approach to identifying diseases with potential subtypes, using the approximately 23,000 diseases documented in OT. We derive novel features for predicting diseases with subtypes using direct evidence. Machine learning models were applied to analyze feature importance and evaluate predictive performance for discovering both known and novel disease subtypes. RESULTS: Our model achieves a high (89.4%) ROC AUC (Area Under the Receiver Operating Characteristic Curve) in identifying known disease subtypes. We integrated pre-trained deep-learning language models and showed their benefits. Moreover, we identify 515 disease candidates predicted to possess previously unannotated subtypes. CONCLUSIONS: Our models can partition diseases into distinct subtypes. This methodology enables a robust, scalable approach for improving knowledge-based annotations and a comprehensive assessment of disease ontology tiers. Our candidates are attractive targets for further study and personalized medicine, potentially aiding in the unveiling of new therapeutic indications for sought-after targets.
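The ROC AUC reported above can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that rank-based computation, using made-up labels and scores rather than any data from the study:

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = disease has annotated subtypes, 0 = it does not
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; library implementations typically sort the scores instead.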
Subjects
Machine Learning, Humans, Disease/classification, ROC Curve, Computational Biology/methods, Algorithms, Deep Learning
ABSTRACT
Copy number variations (CNVs) are an important class of variations contributing to the pathogenesis of many disease phenotypes. Detecting CNVs from genomic data remains difficult, and the most currently applied methods suffer from an unacceptably high false positive rate. A common practice is to have human experts manually review original CNV calls for filtering false positives before further downstream analysis or experimental validation. Here, we propose DeepCNV, a deep learning-based tool, intended to replace human experts when validating CNV calls, focusing on the calls made by one of the most accurate CNV callers, PennCNV. The sophistication of the deep neural network algorithm is enriched with over 10 000 expert-scored samples that are split into training and testing sets. Variant confidence, especially for CNVs, is a main roadblock impeding the progress of linking CNVs with the disease. We show that DeepCNV adds to the confidence of the CNV calls with an optimal area under the receiver operating characteristic curve of 0.909, exceeding other machine learning methods. The superiority of DeepCNV was also benchmarked and confirmed using an experimental wet-lab validation dataset. We conclude that the improvement obtained by DeepCNV results in significantly fewer false positive results and failures to replicate the CNV association results.
Subjects
DNA Copy Number Variations, Deep Learning, Disease/genetics, Human Genome, Area Under Curve, Benchmarking, Datasets as Topic, Disease/classification, False Positive Reactions, Humans, ROC Curve
ABSTRACT
Systems medicine (SM) has emerged as a powerful tool for studying the human body at the systems level with the aim of improving our understanding, prevention and treatment of complex diseases. Being able to automatically extract relevant features needed for a given task from high-dimensional, heterogeneous data, deep learning (DL) holds great promise in this endeavour. This review paper addresses the main developments of DL algorithms and a set of general topics within the SM landscape where DL is decisive. It discusses how DL can be applied to SM with an emphasis on applications to predictive, preventive and precision medicine. Several key challenges are highlighted, including delivering clinical impact and improving interpretability. We use some prototypical examples to highlight the relevance and significance of adopting DL in SM, one of which involves the creation of a model for personalized Parkinson's disease. The review offers valuable insights and informs research in DL and SM.
Subjects
Deep Learning, Systems Analysis, Algorithms, Biomarkers/metabolism, Disease/classification, Electronic Health Records, Genomics, Humans, Metabolomics, Neural Networks (Computer), Precision Medicine/methods, Proteomics, Transcriptome
ABSTRACT
How best to use microbial taxonomic abundances to predict and explain human diseases remains an appealing and challenging question, and the relative nature of microbiome data necessitates a proper feature selection method to resolve the compositional problem. In this study, we developed an all-in-one platform to address a series of issues in microbiome-based human disease prediction and taxonomic biomarker discovery. We prioritize the interpretation, runtime and classification accuracy of the distal discriminative balances analysis (DBA-distal) method in selecting a set of distal discriminative balances, and develop DisBalance, a comprehensive platform, to integrate and streamline the workflows of disease model building, disease risk prediction and disease-related biomarker discovery for microbiome-based binary classifications. DisBalance allows de novo model building and disease risk prediction in a fast and convenient way. To facilitate model-driven and knowledge-driven discoveries, DisBalance offers multiple strategies for mining microbial biomarkers. The models constructed by the DisBalance pipeline are independently validated on seven microbiome datasets from the original article on DBA-distal. The implementation of the DisBalance platform is demonstrated by a complete analysis of a shotgun metagenomic dataset of Ulcerative Colitis (UC). DisBalance is free and open source and can be accessed at http://lab.malab.cn/soft/DisBalance. The source code and demo data for DisBalance are available at https://github.com/yangfenglong/DisBalance.
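A balance is a normalized log-ratio between the geometric means of two groups of taxa, which is what makes it insensitive to the arbitrary total of a compositional sample. The sketch below uses the standard isometric log-ratio balance formula; the taxon names and abundances are invented, and it is not taken from the DisBalance or DBA-distal code.

```python
import math

def balance(abundances, numerator, denominator):
    """Balance between two taxon groups.

    b = sqrt(r*s / (r+s)) * (mean log abundance of numerator
                             - mean log abundance of denominator)
    where r and s are the group sizes. Abundances must be strictly positive
    (zeros would need a pseudocount first, which is not handled here).
    """
    r, s = len(numerator), len(denominator)
    mean_log_num = sum(math.log(abundances[t]) for t in numerator) / r
    mean_log_den = sum(math.log(abundances[t]) for t in denominator) / s
    return math.sqrt(r * s / (r + s)) * (mean_log_num - mean_log_den)

# Hypothetical sample given as relative abundances and as raw counts
relative = {"taxonA": 0.5, "taxonB": 0.25, "taxonC": 0.25}
counts = {taxon: value * 4000 for taxon, value in relative.items()}

b_relative = balance(relative, ["taxonA"], ["taxonB"])
b_counts = balance(counts, ["taxonA"], ["taxonB"])  # same value: scale cancels out
```

A balance between just two taxa, as in this example, reduces to a scaled log-ratio of the pair, which keeps the selected features easy to interpret.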
Subjects
Computational Biology/methods, Internet, Metagenome/genetics, Metagenomics/methods, Microbiota/genetics, Biomarkers/analysis, Ulcerative Colitis/diagnosis, Ulcerative Colitis/genetics, Ulcerative Colitis/microbiology, Disease/classification, Disease/genetics, Humans, Logistic Models, Reproducibility of Results
ABSTRACT
MarkerDB is a freely available electronic database that attempts to consolidate information on all known clinical and a selected set of pre-clinical molecular biomarkers into a single resource. The database includes four major types of molecular biomarkers (chemical, protein, DNA [genetic] and karyotypic) and four biomarker categories (diagnostic, predictive, prognostic and exposure). MarkerDB provides information such as: biomarker names and synonyms, associated conditions or pathologies, detailed disease descriptions, detailed biomarker descriptions, biomarker specificity, sensitivity and ROC curves, standard reference values (for protein and chemical markers), variants (for SNP or genetic markers), sequence information (for genetic and protein markers), molecular structures (for protein and chemical markers), tissue or biofluid sources (for protein and chemical markers), chromosomal location and structure (for genetic and karyotype markers), clinical approval status and relevant literature references. Users can browse the data by conditions, condition categories, biomarker types, biomarker categories or search by sequence similarity through the advanced search function. Currently, the database contains 142 protein biomarkers, 1089 chemical biomarkers, 154 karyotype biomarkers and 26 374 genetic markers. These are categorized into 25 560 diagnostic biomarkers, 102 prognostic biomarkers, 265 exposure biomarkers and 6746 predictive biomarkers or biomarker panels. Collectively, these markers can be used to detect, monitor or predict 670 specific human conditions which are grouped into 27 broad condition categories. MarkerDB is available at https://markerdb.ca.
Subjects
Biomarkers/metabolism, Factual Databases, Disease/genetics, Genetic Markers, Proteins/genetics, Chromosome Aberrations, Disease/classification, Humans, Internet, Karyotyping, Predictive Value of Tests, Prognosis, Proteins/metabolism, ROC Curve, Software
ABSTRACT
MicroRNA (miRNA)-related single-nucleotide variations (SNVs), including single-nucleotide polymorphisms (SNPs) and disease-related variations (DRVs) in miRNAs and miRNA-target binding sites, can affect miRNA function and/or biogenesis and thereby influence phenotypes. miRNASNP is a widely used database for miRNA-related SNPs and their effects. Here, we updated it to miRNASNP-v3 (http://bioinfo.life.hust.edu.cn/miRNASNP/) with a greatly expanded number of SNVs and new features, especially the DRV data. We analyzed the effects of 7 161 741 SNPs and 505 417 DRVs on 1897 pre-miRNAs (2630 mature miRNAs) and the 3'UTRs of 18 152 genes. miRNASNP-v3 provides a one-stop resource for research on miRNA-related SNVs with the following functions: (i) explore associations between miRNA-related SNPs/DRVs and diseases; (ii) browse the effects of SNPs/DRVs on miRNA-target binding; (iii) perform functional enrichment analysis of miRNA target gain/loss caused by SNPs/DRVs; (iv) investigate correlations between drug sensitivity and miRNA expression; (v) query expression profiles of miRNAs and their targets in cancers; (vi) browse the effects of SNPs/DRVs on pre-miRNA secondary structure changes; and (vii) predict the effects of user-defined variations on miRNA-target binding or pre-miRNA secondary structure. miRNASNP-v3 is a valuable and long-term supported resource for functional variation screening and miRNA function studies.
Subjects
Genetic Databases, Disease/genetics, MicroRNAs/genetics, Single Nucleotide Polymorphism, RNA Precursors/genetics, 3' Untranslated Regions, Binding Sites, Disease/classification, Drug Resistance/genetics, Gene Expression Regulation, Humans, Internet, MicroRNAs/chemistry, MicroRNAs/classification, MicroRNAs/metabolism, Nucleic Acid Conformation, Prescription Drugs/therapeutic use, RNA Precursors/classification, RNA Precursors/metabolism, Software
ABSTRACT
We describe an updated comprehensive database, LincSNP 3.0 (http://bioinfo.hrbmu.edu.cn/LincSNP), which aims to document and annotate disease- or phenotype-associated variants in human long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) or their regulatory elements. LincSNP 3.0 has been updated with several novel features: (i) more types of variants have been included, namely single nucleotide polymorphisms (SNPs), linkage disequilibrium SNPs (LD SNPs), somatic mutations and RNA editing sites; (ii) more regulatory elements have been added, including transcription factor binding sites (TFBSs), enhancers, DNase I hypersensitive sites (DHSs), topologically associated domains (TADs), footprints, methylations and open chromatin regions; (iii) the associations among circRNAs, regulatory elements and variants have been identified; (iv) more experimentally supported variant-lncRNA/circRNA-disease/phenotype associations have been manually collected; (v) the sources of lncRNAs, circRNAs, SNPs, somatic mutations and RNA editing sites have been updated. Moreover, four flexible online tools, Genome Browser, Variant Mapper, Circos Plotter and Functional Annotation, have been developed to retrieve, visualize and analyze the data. Collectively, LincSNP 3.0 provides associations among functional variants, regulatory elements, lncRNAs and circRNAs in diseases. It will serve as an important and continually updated resource for investigating the functions and mechanisms of lncRNAs and circRNAs in diseases.
Subjects
Nucleic Acid Databases, Disease/genetics, Human Genome, Circular RNA/genetics, Long Noncoding RNA/genetics, Nucleic Acid Regulatory Sequences, Binding Sites, Chromatin/chemistry, Chromatin/metabolism, Deoxyribonuclease I/genetics, Deoxyribonuclease I/metabolism, Disease/classification, Humans, Internet, Linkage Disequilibrium, Molecular Sequence Annotation, Single Nucleotide Polymorphism, Protein Binding, Circular RNA/classification, Circular RNA/metabolism, Long Noncoding RNA/classification, Long Noncoding RNA/metabolism, Software, Transcription Factors/genetics, Transcription Factors/metabolism
ABSTRACT
Amino acid transporters (AATs) are membrane-bound transport proteins that mediate transfer of amino acids into and out of cells or cellular organelles. AATs have diverse functional roles ranging from neurotransmission to acid-base balance, intracellular energy metabolism, and anabolic and catabolic reactions. In cancer cells and diabetes, dysregulation of AATs leads to metabolic reprogramming, which changes intracellular amino acid levels, contributing to the pathogenesis of cancer, obesity and diabetes. Indeed, the neutral amino acid transporters (NATs) SLC7A5/LAT1 and SLC1A5/ASCT2 are likely involved in several human malignancies. However, a clinical therapy that directly targets AATs has not yet been developed. The purpose of this review is to highlight the structural and functional diversity of AATs, their diverse physiological roles in different tissues and organs, their wide-ranging implications in human diseases and the emerging strategies and tools that will be necessary to target AATs therapeutically.
Subjects
Amino Acid Transport Systems/metabolism, Inborn Errors of Amino Acid Metabolism/metabolism, Amino Acid Transport Systems/chemistry, Amino Acids/metabolism, Disease/classification, Epithelial Cells/metabolism, Humans, Intestinal Mucosa/metabolism, Kidney/metabolism, Proximal Kidney Tubules/metabolism, Longevity, Protein Conformation, Physiological Stress
ABSTRACT
The effects of lifestyle and physiological variables on human disease risk have been shown to be mediated by gut microbiota. Concordance between case-control studies for detecting disease-associated microbes has been low because of limited sample sizes and population-wide bias in lifestyle and physiological variables. To infer gut microbiota-disease associations accurately, we propose building machine learning models that include both human variables and gut microbiota. When the model using both gut microbiota and human variables performs better than the model using only human variables, an independent gut microbiota-disease association is confirmed. By building models on the American Gut Project dataset, we found that gut microbiota showed distinct association strengths with different diseases. Adding gut microbiota to human variables significantly enhanced the classification performance for IBD; independent associations between occurrence information of gut microbiota and irritable bowel syndrome, C. difficile infection, and unhealthy status were found; adding gut microbiota did not improve model performance for diabetes, small intestinal bacterial overgrowth, lactose intolerance, or cardiovascular disease. Our results suggest that although gut microbiota has been reported to be associated with many diseases, a considerable proportion of these associations may be very weak. We propose a list of microbes as biomarkers to classify IBD and unhealthy status. Further functional investigation of these microbes will improve understanding of the molecular mechanisms of human diseases.
Subjects
Disease/classification, Gastrointestinal Microbiome/physiology, Biomarkers, Clostridium Infections/microbiology, Health Status, Humans, Irritable Bowel Syndrome/microbiology, Life Style, Machine Learning
ABSTRACT
miRNAs are small non-coding RNAs involved in a number of complex biological processes. Numerous studies have suggested that miRNAs are closely associated with many human diseases. In this study, we propose a computational model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association Prediction (SCMFMDA). To combine different disease and miRNA similarity data effectively, we applied the similarity network fusion algorithm to obtain an integrated disease similarity (composed of disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity) and an integrated miRNA similarity (composed of miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity). In addition, L2 regularization terms and similarity constraint terms were added to the traditional Nonnegative Matrix Factorization algorithm to predict disease-related miRNAs. SCMFMDA achieved AUCs of 0.9675 and 0.9447 based on global leave-one-out cross validation and five-fold cross validation, respectively. Furthermore, case studies on two common human diseases were carried out to demonstrate the prediction accuracy of SCMFMDA. The proportion of the top 50 predicted miRNAs confirmed by experimental reports indicated that SCMFMDA is effective for predicting relationships between miRNAs and diseases.
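One of the similarity inputs named above, the Gaussian interaction profile (GIP) kernel, has a compact closed form: K(i, j) = exp(-gamma * ||IP(i) - IP(j)||^2), with gamma set to a bandwidth parameter divided by the mean squared norm of all interaction profiles. The sketch below follows that commonly used definition; the bandwidth default of 1.0 and the toy association matrix are assumptions, not values taken from the SCMFMDA paper.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity matrix.

    profiles: one binary interaction profile per entity, e.g. one row per
    disease with a 1 for every associated miRNA.
    gamma_prime: bandwidth parameter (1.0 is an assumed default).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    return [
        [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
            for q in profiles
        ]
        for p in profiles
    ]

# Hypothetical association matrix: 3 diseases (rows) x 3 miRNAs (columns)
associations = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
K = gip_kernel(associations)
```

Applying the same function to the transposed matrix would give the miRNA-side kernel; both matrices are symmetric with ones on the diagonal.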
Subjects
Algorithms, Disease, MicroRNAs, Statistical Models, Computational Biology, Disease/classification, Disease/genetics, Humans, MicroRNAs/analysis, MicroRNAs/classification, MicroRNAs/genetics
ABSTRACT
The medical and scientific literature is dominated by highly cited historical theories and findings [...].
Subjects
Iron/metabolism, Metals/metabolism, Therapeutics/trends, Chelation Therapy, Disease/classification, Disease/etiology, Drug Therapy/trends, Humans, Iron Chelating Agents/therapeutic use, Iron Deficiencies/drug therapy, Iron Overload/drug therapy, Therapeutics/methods
ABSTRACT
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.
Subjects
Cluster Analysis, Computational Biology/methods, Data Mining/methods, Algorithms, Big Data, Genetic Databases/statistics & numerical data, Disease/classification, Disease/genetics, Gene Expression/drug effects, Gene Expression Profiling/statistics & numerical data, Gene Regulatory Networks, Humans, Molecular Sequence Annotation/statistics & numerical data
ABSTRACT
The Human Disease Ontology (DO) database (http://www.disease-ontology.org) has undergone significant expansion in the past three years. The DO disease classification includes specific formal semantic rules to express meaningful disease models and has expanded from a single asserted classification to include multiple inferred mechanistic disease classifications, thus providing novel perspectives on related diseases. Expansion of disease terms, alternative anatomy, cell type and genetic disease classifications and workflow automation highlight the updates for the DO since 2015. The enhanced breadth and depth of the DO's knowledgebase has expanded the DO's utility for exploring the multi-etiology of human disease, thus improving the capture and communication of health-related data across biomedical databases, bioinformatics tools, and genomic and cancer resources, as demonstrated by a 6.6× growth in the DO's user community since 2015. The DO's continual integration of human disease knowledge, evidenced by the more than 200 SVN/GitHub releases/revisions since our DO 2015 NAR paper, includes the addition of 2650 new disease terms, a 30% increase in textual definitions, and an expanding suite of disease classification hierarchies constructed through defined logical axioms.
Subjects
Biological Ontologies, Factual Databases, Disease, Disease/classification, Disease/etiology, Humans, Workflow
ABSTRACT
DNA methylation, the most intensively studied epigenetic modification, plays an important role in understanding the molecular basis of diseases. Furthermore, epigenome-wide association study (EWAS) provides a systematic approach to identify epigenetic variants underlying common diseases/phenotypes. However, there is no comprehensive database to archive the results of EWASs. To fill this gap, we developed the EWASdb, which is a part of 'The EWAS Project', to store the epigenetic association results of DNA methylation from EWASs. In its current version (v 1.0, up to July 2018), the EWASdb has curated 1319 EWASs associated with 302 diseases/phenotypes. There are three types of EWAS results curated in this database: (i) EWAS for single marker; (ii) EWAS for KEGG pathway and (iii) EWAS for GO (Gene Ontology) category. As the first comprehensive EWAS database, EWASdb has been searched or downloaded by researchers from 43 countries to date. We believe that EWASdb will become a valuable resource and significantly contribute to the epigenetic research of diseases/phenotypes and have potential clinical applications. EWASdb is freely available at http://www.ewas.org.cn/ewasdb or http://www.bioapp.org/ewasdb.
Subjects
DNA Methylation, Genetic Databases, Genetic Epigenesis, Epigenome, Disease/classification, Disease/genetics, Gene Ontology, Genetic Association Studies, Phenotype, User-Computer Interface
ABSTRACT
Reducing premature mortality associated with age-related chronic diseases, such as cancer and cardiovascular disease, is an urgent priority. We report early results using genomics in combination with advanced imaging and other clinical testing to proactively screen for age-related chronic disease risk among adults. We enrolled active, symptom-free adults in a study of screening for age-related chronic diseases associated with premature mortality. In addition to personal and family medical history and other clinical testing, we obtained whole-genome sequencing (WGS), noncontrast whole-body MRI, dual-energy X-ray absorptiometry (DXA), global metabolomics, a new blood test for prediabetes (Quantose IR), echocardiography (ECHO), ECG, and cardiac rhythm monitoring to identify age-related chronic disease risks. Precision medicine screening using WGS and advanced imaging along with other testing among active, symptom-free adults identified a broad set of complementary age-related chronic disease risks associated with premature mortality and strengthened WGS variant interpretation. This and other similarly designed screening approaches anchored by WGS and advanced imaging may have the potential to extend healthy life among active adults through improved prevention and early detection of age-related chronic diseases (and their risk factors) associated with premature mortality.
Subjects
Disease/genetics, Genetic Predisposition to Disease, Computer-Assisted Image Processing/methods, Mutation, Precision Medicine/methods, Whole Genome Sequencing/methods, Adult, Aged, Aged 80 and over, Cardiovascular Diseases/diagnostic imaging, Cardiovascular Diseases/genetics, Cardiovascular Diseases/pathology, Disease/classification, Female, High-Throughput Nucleotide Sequencing, Humans, Male, Middle Aged, Neoplasms/diagnostic imaging, Neoplasms/genetics, Neoplasms/pathology, Nervous System Diseases/diagnostic imaging, Nervous System Diseases/genetics, Nervous System Diseases/pathology, Risk Assessment, RNA Sequence Analysis, Young Adult
ABSTRACT
BACKGROUND: Classification of diseases based on genetic information is of great significance as the basis for precision medicine, increasing the understanding of disease etiology and revolutionizing personalized medicine. Much effort has been directed at understanding disease associations by constructing disease networks, and classifying patient samples according to gene expression data. Integrating human gene networks overcomes limited coverage of genes. Incorporating pathway information into the disease classification procedure addresses the challenge of cellular heterogeneity across patients. RESULTS: In this work, we propose a disease classification model, LAMP, which concentrates on the layered assessment on modules and pathways. Directed human gene interactions are the foundation for constructing the human gene network, in which the significant roles of disease and pathway genes are recognized. The fast unfolding algorithm identifies 11 modules in the largest connected component. Layered networks are then introduced to distinguish the positions of genes in propagating information from sources to targets. After gene screening, hierarchical clustering and a refinement process, 1726 diseases from KEGG are classified into 18 categories. We also show that diseases with overlapping genes may not belong to the same category in LAMP. Within each category, entropy is applied to measure the compositional complexity and to evaluate the prospects for combination diagnosis and gene-targeted therapy for diseases. CONCLUSION: In this work, by collecting data from BioGRID and KEGG, we develop a disease classification model, LAMP, to help people view diseases from the perspective of commonalities in etiology and pathology. Comprehensive research on existing diseases can help meet the challenges of unknown diseases. The results provide suggestions for combination diagnosis and gene-targeted therapy, motivating clinicians and researchers to reposition their understanding of diseases and explore diagnosis and therapy strategies.
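The abstract uses entropy to measure compositional complexity within each disease category. A minimal sketch of Shannon entropy over a list of labels follows; what exactly is counted is an assumption here (the module membership of a category's genes), since the abstract does not spell out the distribution LAMP uses.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical categories described by the module each of their genes falls in
mixed_category = ["m1", "m1", "m2", "m2"]    # genes split evenly over two modules
uniform_category = ["m1", "m1", "m1", "m1"]  # all genes in one module

h_mixed = shannon_entropy(mixed_category)      # 1 bit
h_uniform = shannon_entropy(uniform_category)  # 0 bits
```

Under this reading, a low value marks a category whose genes concentrate in few modules, and a high value marks a more heterogeneous category.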
Subjects
Algorithms, Disease/classification, Disease/genetics, Gene Regulatory Networks, Signal Transduction/genetics, Cluster Analysis, Genetic Therapy, Humans
ABSTRACT
Advances in high-throughput technologies allow for measurements of many types of omics data, yet the meaningful integration of several different data types remains a significant challenge. Another important and difficult problem is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. Here we present a novel approach, called perturbation clustering for data integration and disease subtyping (PINS), which is able to address both challenges. The framework has been validated on thousands of cancer samples, using gene expression, DNA methylation, noncoding microRNA, and copy number variation data available from the Gene Expression Omnibus, the Broad Institute, The Cancer Genome Atlas (TCGA), and the European Genome-Phenome Archive. This simultaneous subtyping approach accurately identifies known cancer subtypes and novel subgroups of patients with significantly different survival profiles. The results were obtained from genome-scale molecular data without any other type of prior knowledge. The approach is sufficiently general to replace existing unsupervised clustering approaches outside the scope of bio-medical research, with the additional ability to integrate multiple types of data.
Subjects
Statistical Data Interpretation, Disease/classification, Algorithms, Cluster Analysis, DNA Methylation, Female, Gene Expression, Inborn Genetic Diseases/classification, Humans, Male, MicroRNAs, Messenger RNA
ABSTRACT
Gene co-expression networks can be used to associate genes of unknown function with biological processes, to prioritize candidate disease genes or to discern transcriptional regulatory programmes. With recent advances in transcriptomics and next-generation sequencing, co-expression networks constructed from RNA sequencing data also enable the inference of functions and disease associations for non-coding genes and splice variants. Although gene co-expression networks typically do not provide information about causality, emerging methods for differential co-expression analysis are enabling the identification of regulatory genes underlying various phenotypes. Here, we introduce and guide researchers through a (differential) co-expression analysis. We provide an overview of methods and tools used to create and analyse co-expression networks constructed from gene expression data, and we explain how these can be used to identify genes with a regulatory role in disease. Furthermore, we discuss the integration of other data types with co-expression networks and offer future perspectives of co-expression analysis.
Subjects
Computational Biology/methods, Disease/classification, Disease/genetics, Gene Expression Regulation, Gene Regulatory Networks, Gene Expression Profiling, Genes, Humans, Phenotype
ABSTRACT
In identifying subgroups of a heterogeneous disease or condition, it is often desirable to identify both the observations and the features which differ between subgroups. For instance, it may be that there is a subgroup of individuals with a certain disease who differ from the rest of the population based on the expression profile for only a subset of genes. Identifying the subgroup of patients and subset of genes could lead to better-targeted therapy. We can represent the subgroup of individuals and genes as a bicluster, a submatrix, U, of a larger data matrix, X, such that the features and observations in U differ from those not contained in U. We present a novel two-step method, SC-Biclust, for identifying U. In the first step, the observations in the bicluster are identified to maximize the sum of the weighted between-cluster feature differences. In the second step, features in the bicluster are identified based on their contribution to the clustering of the observations. This versatile method can be used to identify biclusters that differ on the basis of feature means, feature variances, or more general differences. The bicluster identification accuracy of SC-Biclust is illustrated through several simulated studies. Application of SC-Biclust to pain research illustrates its ability to identify biologically meaningful subgroups.
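The two-step structure described above (first pick the observations, then pick the features that drive the split) can be illustrated with a deliberately simplified stand-in. The sketch below is not SC-Biclust: it swaps the weighted clustering criterion for plain unweighted 2-means and scores features by the absolute difference in cluster means; the data, the score cut-off, and all function names are hypothetical.

```python
def two_means(X, iters=20):
    """Plain 2-means on the rows of X, initialised from the first and last row."""
    centers = [list(X[0]), list(X[-1])]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assign every observation to its nearest center
        labels = [
            min((0, 1), key=lambda k: sum((a - b) ** 2 for a, b in zip(row, centers[k])))
            for row in X
        ]
        # Recompute each center as the mean of its members
        for k in (0, 1):
            members = [row for row, l in zip(X, labels) if l == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def feature_scores(X, labels):
    """Absolute difference in per-feature means between the two clusters.

    Assumes both clusters are non-empty.
    """
    def column_means(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    inside = column_means([row for row, l in zip(X, labels) if l == 1])
    outside = column_means([row for row, l in zip(X, labels) if l == 0])
    return [abs(a - b) for a, b in zip(inside, outside)]

# Hypothetical data: 5 observations x 3 features; the last feature is uninformative
X = [
    [0.0, 0.0, 5.0],
    [0.1, 0.0, 5.0],
    [0.0, 0.2, 5.0],
    [3.0, 3.0, 5.0],
    [3.1, 2.9, 5.0],
]
labels = two_means(X)                 # step 1: observations in the bicluster
scores = feature_scores(X, labels)    # step 2: feature contributions
selected = [j for j, s in enumerate(scores) if s > 1.0]  # assumed cut-off
```

Here the last two observations and the first two features form the recovered bicluster; the published method would additionally weight features during the clustering step and can target differences in variance as well as in means.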