ABSTRACT
The extent to which low-frequency (minor allele frequency (MAF) between 1-5%) and rare (MAF ≤ 1%) variants contribute to complex traits and disease in the general population is mainly unknown. Bone mineral density (BMD) is highly heritable, a major predictor of osteoporotic fractures, and has been previously associated with common genetic variants, as well as rare, population-specific, coding variants. Here we identify novel non-coding genetic variants with large effects on BMD (ntotal = 53,236) and fracture (ntotal = 508,253) in individuals of European ancestry from the general population. Associations for BMD were derived from whole-genome sequencing (n = 2,882 from UK10K (ref. 10); a population-based genome sequencing consortium), whole-exome sequencing (n = 3,549), deep imputation of genotyped samples using a combined UK10K/1000 Genomes reference panel (n = 26,534), and de novo replication genotyping (n = 20,271). We identified a low-frequency non-coding variant near a novel locus, EN1, with an effect size fourfold larger than the mean of previously reported common variants for lumbar spine BMD (rs11692564(T), MAF = 1.6%, replication effect size = +0.20 s.d., Pmeta = 2 × 10(-14)), which was also associated with a decreased risk of fracture (odds ratio = 0.85; P = 2 × 10(-11); ncases = 98,742 and ncontrols = 409,511). Using an En1(cre/flox) mouse model, we observed that conditional loss of En1 results in low bone mass, probably as a consequence of high bone turnover. We also identified a novel low-frequency non-coding variant with large effects on BMD near WNT16 (rs148771817(T), MAF = 1.2%, replication effect size = +0.41 s.d., Pmeta = 1 × 10(-11)). In general, there was an excess of association signals arising from deleterious coding and conserved non-coding variants. 
These findings provide evidence that low-frequency non-coding variants have large effects on BMD and fracture, thereby providing rationale for whole-genome sequencing and improved imputation reference panels to study the genetic architecture of complex traits and disease in the general population.
Subject(s)
Bone Density/genetics , Fractures, Bone/genetics , Genome, Human/genetics , Homeodomain Proteins/genetics , Animals , Bone and Bones/metabolism , Disease Models, Animal , Europe/ethnology , Exome/genetics , Female , Gene Frequency/genetics , Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Genomics , Genotype , Humans , Mice , Sequence Analysis, DNA , White People/genetics , Wnt Proteins/genetics
ABSTRACT
Heritability of bone mineral density (BMD) varies across skeletal sites, reflecting different relative contributions of genetic and environmental influences. To quantify the degree to which common genetic variants tag and environmental factors influence BMD, at different sites, we estimated the genetic (rg) and residual (re) correlations between BMD measured at the upper limbs (UL-BMD), lower limbs (LL-BMD) and skull (SK-BMD), using total-body DXA scans of ∼4,890 participants recruited by the Avon Longitudinal Study of Parents and their Children (ALSPAC). Point estimates of rg indicated that appendicular sites have a greater proportion of shared genetic architecture (LL-/UL-BMD rg = 0.78) between them, than with the skull (UL-/SK-BMD rg = 0.58 and LL-/SK-BMD rg = 0.43). Likewise, the residual correlation between BMD at appendicular sites (r(e) = 0.55) was higher than the residual correlation between SK-BMD and BMD at appendicular sites (r(e) = 0.20-0.24). To explore the basis for the observed differences in rg and re, genome-wide association meta-analyses were performed (n ∼ 9,395), combining data from ALSPAC and the Generation R Study identifying 15 independent signals from 13 loci associated at genome-wide significant level across different skeletal regions. Results suggested that previously identified BMD-associated variants may exert site-specific effects (i.e. differ in the strength of their association and magnitude of effect across different skeletal sites). In particular, variants at CPED1 exerted a larger influence on SK-BMD and UL-BMD when compared to LL-BMD (P = 2.01 × 10(-37)), whilst variants at WNT16 influenced UL-BMD to a greater degree when compared to SK- and LL-BMD (P = 2.31 × 10(-14)). In addition, we report a novel association between RIN3 (previously associated with Paget's disease) and LL-BMD (rs754388: β = 0.13, SE = 0.02, P = 1.4 × 10(-10)).
Our results suggest that BMD at different skeletal sites is under a mixture of shared and specific genetic and environmental influences. Allowing for these differences by performing genome-wide association at different skeletal sites may help uncover new genetic influences on BMD.
Subject(s)
Bone Density/genetics , Carrier Proteins/genetics , Guanine Nucleotide Exchange Factors/genetics , Wnt Proteins/genetics , Adult , Bone Development , Bone and Bones/physiology , Child , Cohort Studies , Female , Genome-Wide Association Study , Humans , Longitudinal Studies , Lower Extremity/growth & development , Lower Extremity/physiology , Male , Osteoporosis/epidemiology , Polymorphism, Single Nucleotide , Pregnancy , Prospective Studies , Skull/growth & development , Skull/physiology , Upper Extremity/growth & development , Upper Extremity/physiology , Young Adult
ABSTRACT
BACKGROUND: The continuous and non-synchronous nature of postnatal male germ-cell development has impeded stage-specific resolution of molecular events of mammalian meiotic prophase in the testis. Here the juvenile onset of spermatogenesis in mice is analyzed by combining cytological and transcriptomic data in a novel computational analysis that allows decomposition of the transcriptional programs of spermatogonia and meiotic prophase substages. RESULTS: Germ cells from testes of individual mice were obtained at two-day intervals from 8 to 18 days post-partum (dpp), prepared as surface-spread chromatin and immunolabeled for meiotic stage-specific protein markers (STRA8, SYCP3, phosphorylated H2AFX, and HISTH1T). Eight stages were discriminated cytologically by combinatorial antibody labeling, and RNA-seq was performed on the same samples. Independent principal component analyses of cytological and transcriptomic data yielded similar patterns for both data types, providing strong evidence for substage-specific gene expression signatures. A novel permutation-based maximum covariance analysis (PMCA) was developed to map co-expressed transcripts to one or more of the eight meiotic prophase substages, thereby linking distinct molecular programs to cytologically defined cell states. Expression of meiosis-specific genes is not substage-limited, suggesting regulation of substage transitions at other levels. CONCLUSIONS: This integrated analysis provides a general method for resolving complex cell populations. Here it revealed not only features of meiotic substage-specific gene expression, but also a network of substage-specific transcription factors and relationships to potential target genes.
Subject(s)
Meiosis , RNA/metabolism , Spermatocytes/metabolism , Animals , Cells, Cultured , Chromatin/metabolism , Gene Regulatory Networks , Germ Cells/cytology , Male , Mice , Mice, Inbred C57BL , Principal Component Analysis , RNA/chemistry , RNA/isolation & purification , Real-Time Polymerase Chain Reaction , Sequence Analysis, RNA , Spermatocytes/cytology , Spermatogenesis , Testis/cytology , Transcription Factors/metabolism , Transcriptome
ABSTRACT
Embryonic stem cells (ESCs), characterized by their ability to both self-renew and differentiate into multiple cell lineages, are a powerful model for biomedical research and developmental biology. Human and mouse ESCs share many features, yet have distinctive aspects, including fundamental differences in the signaling pathways and cell cycle controls that support self-renewal. Here, we explore the molecular basis of human ESC self-renewal using Bayesian network machine learning to integrate cell-type-specific, high-throughput data for gene function discovery. We integrated high-throughput ESC data from 83 human studies (~1.8 million data points collected under 1,100 conditions) and 62 mouse studies (~2.4 million data points collected under 1,085 conditions) into separate human and mouse predictive networks focused on ESC self-renewal to analyze shared and distinct functional relationships among protein-coding gene orthologs. Computational evaluations show that these networks are highly accurate, literature validation confirms their biological relevance, and reverse transcriptase polymerase chain reaction (RT-PCR) validation supports our predictions. Our results reflect the importance of key regulatory genes known to be strongly associated with self-renewal and pluripotency in both species (e.g., POU5F1, SOX2, and NANOG), identify metabolic differences between species (e.g., threonine metabolism), clarify differences between human and mouse ESC developmental signaling pathways (e.g., leukemia inhibitory factor (LIF)-activated JAK/STAT in mouse; NODAL/ACTIVIN-A-activated fibroblast growth factor in human), and reveal many novel genes and pathways predicted to be functionally associated with self-renewal in each species. These interactive networks are available online at www.StemSight.org for stem cell researchers to develop new hypotheses, discover potential mechanisms involving sparsely annotated genes, and prioritize genes of interest for experimental validation.
Subject(s)
Cell Differentiation , Cell Proliferation , Embryonic Stem Cells/cytology , Systems Biology/methods , Algorithms , Animals , Bayes Theorem , Cell Lineage , Computational Biology/methods , Embryonic Stem Cells/metabolism , Gene Regulatory Networks , Humans , Mice , Reproducibility of Results , Signal Transduction
ABSTRACT
High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data, and users now have many visualization methods and tools to choose from. The challenge is to create clear, meaningful and integrated visualizations that give biological insight, without being overwhelmed by the intrinsic complexity of the data. In this review, we discuss how visualization tools are being used to help interpret protein interaction, gene expression and metabolic profile data, and we highlight emerging new directions.
Subject(s)
Genomics , Image Processing, Computer-Assisted , Metabolomics , Proteomics , Systems Biology , Mass Spectrometry , Nuclear Magnetic Resonance, Biomolecular , Protein Binding
ABSTRACT
Integrated analyses of functional genomics data have enormous potential for identifying phenotype-associated genes. Tissue-specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. Accounting for tissue specificity in global integration of functional genomics data is challenging, as "functionality" and "functional relationships" are often not resolved for specific tissue types. We address this challenge by generating tissue-specific functional networks, which can effectively represent the diversity of protein function for more accurate identification of phenotype-associated genes in the laboratory mouse. Specifically, we created 107 tissue-specific functional relationship networks through integration of genomic data utilizing knowledge of tissue-specific gene expression patterns. Cross-network comparison revealed significantly changed genes enriched for functions related to specific tissue development. We then utilized these tissue-specific networks to predict genes associated with different phenotypes. Our results demonstrate that prediction performance is significantly improved through using the tissue-specific networks as compared to the global functional network. We used a testis-specific functional relationship network to predict genes associated with male fertility and spermatogenesis phenotypes, and experimentally confirmed one top prediction, Mybl1. We then focused on a less-common genetic disease, ataxia, and identified candidates uniquely predicted by the cerebellum network, which are supported by both literature and experimental evidence. Our systems-level, tissue-specific scheme advances over traditional global integration and analyses and establishes a prototype to address the tissue-specific effects of genetic perturbations, diseases and drugs.
Subject(s)
Genetic Predisposition to Disease/genetics , Models, Biological , Organ Specificity/genetics , Protein Interaction Mapping/methods , Proteome/genetics , Proteome/metabolism , Signal Transduction/genetics , Animals , Computer Simulation , Humans , Mice , Tissue Distribution
ABSTRACT
Mitochondria are central to many cellular processes including respiration, ion homeostasis, and apoptosis. Using computational predictions combined with traditional quantitative experiments, we have identified 100 proteins whose deficiency alters mitochondrial biogenesis and inheritance in Saccharomyces cerevisiae. In addition, we used computational predictions to perform targeted double-mutant analysis detecting another nine genes with synthetic defects in mitochondrial biogenesis. This represents an increase of about 25% over previously known participants. Nearly half of these newly characterized proteins are conserved in mammals, including several orthologs known to be involved in human disease. Mutations in many of these genes demonstrate statistically significant mitochondrial transmission phenotypes more subtle than could be detected by traditional genetic screens or high-throughput techniques, and 47 have not been previously localized to mitochondria. We further characterized a subset of these genes using growth profiling and dual immunofluorescence, which identified genes specifically required for aerobic respiration and an uncharacterized cytoplasmic protein required for normal mitochondrial motility. Our results demonstrate that by leveraging computational analysis to direct quantitative experimental assays, we have characterized mutants with subtle mitochondrial defects whose phenotypes were undetected by high-throughput methods.
Subject(s)
Mitochondria/genetics , Proteins/physiology , Saccharomyces cerevisiae/ultrastructure , Cell Respiration/genetics , Cytoplasm/chemistry , Genes, Mitochondrial , Mitochondrial Proteins , Mutant Proteins , Mutation , Proteins/genetics , Proteomics , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/growth & development
ABSTRACT
An ultimate goal of genetic research is to understand the connection between genotype and phenotype in order to improve the diagnosis and treatment of diseases. The quantitative genetics field has developed a suite of statistical methods to associate genetic loci with diseases and phenotypes, including quantitative trait loci (QTL) linkage mapping and genome-wide association studies (GWAS). However, each of these approaches has technical and biological shortcomings. For example, the amount of heritable variation explained by GWAS is often surprisingly small and the resolution of many QTL linkage mapping studies is poor. The predictive power and interpretation of QTL and GWAS results are consequently limited. In this study, we propose a complementary approach to quantitative genetics by interrogating the vast amount of high-throughput genomic data in model organisms to functionally associate genes with phenotypes and diseases. Our algorithm combines the genome-wide functional relationship network for the laboratory mouse and a state-of-the-art machine learning method. We demonstrate the superior accuracy of this algorithm through predicting genes associated with each of 1157 diverse phenotype ontology terms. Comparison between our prediction results and a meta-analysis of quantitative genetic studies reveals both overlapping candidates and distinct, accurate predictions uniquely identified by our approach. Focusing on bone mineral density (BMD), a phenotype related to osteoporotic fracture, we experimentally validated two of our novel predictions (not observed in any previous GWAS/QTL studies) and found significant bone density defects for both Timp2 and Abcg8 deficient mice. Our results suggest that the integration of functional genomics data into networks, which itself is informative of protein function and interactions, can successfully be utilized as a complementary approach to quantitative genetics to predict disease risks.
All supplementary material is available at http://cbfg.jax.org/phenotype.
Subject(s)
Chromosome Mapping , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Genomics/methods , ATP Binding Cassette Transporter, Subfamily G, Member 8 , ATP-Binding Cassette Transporters/genetics , Algorithms , Animals , Artificial Intelligence , Bayes Theorem , Bone Density , Cluster Analysis , Databases, Genetic , Disease Models, Animal , Lipoproteins/genetics , Mice , Mice, Transgenic , Osteoporosis/genetics , Phenotype , Quantitative Trait Loci , Reproducibility of Results , Risk Factors , Tissue Inhibitor of Metalloproteinase-2/genetics
ABSTRACT
MOTIVATION: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question. RESULTS: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches, even those employing the same training data, is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations. AVAILABILITY: The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at http://function.princeton.edu/mitochondria.
Subject(s)
Computational Biology/methods , Proteins/metabolism , Algorithms , Databases, Protein , Mitochondria/metabolism , Proteins/chemistry , Saccharomyces cerevisiae/metabolism
ABSTRACT
Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. 
We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms.
Subject(s)
Biology/methods , Mitochondria/physiology , Mitochondrial Proteins/metabolism , Models, Biological , Research Design , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/physiology , Signal Transduction/physiology , Computer Simulation
ABSTRACT
BACKGROUND: Obesity and diabetes mellitus are directly implicated in many adverse health consequences in adults as well as in the offspring of obese and diabetic mothers. Hispanic Americans are particularly at risk for obesity, diabetes, and end-stage renal disease. Maternal obesity and/or diabetes through prenatal programming may alter the fetal epigenome increasing the risk of metabolic disease in their offspring. The aims of this study were to determine if maternal obesity or diabetes mellitus during pregnancy results in a change in infant methylation of CpG islands adjacent to targeted genes specific for obesity or diabetes disease pathways in a largely Hispanic population. METHODS: Methylation levels in the cord blood of 69 newborns were determined using the Illumina Infinium MethylationEPIC BeadChip. Over 850,000 different probe sites were analyzed to determine whether maternal obesity and/or diabetes mellitus contributed directly to differential methylation; epigenome-wide and regional analyses were performed for significant CpG sites. RESULTS: Following quality control, agranular leukocyte samples from 69 newborns (23 normal term (NT), 14 diabetes (DM), 23 obese (OB), 9 DM/OB) were analyzed for over 850,000 different probe sites. Contrasts between the NT, DM, OB, and DM/OB were considered. After correction for multiple testing, 15 CpGs showed differential methylation from the NT, associated with 10 differentially methylated genes between the diabetic and non-diabetic subgroups, CCDC110, KALRN, PAG1, GNRH1, SLC2A9, CSRP2BP, HIVEP1, RALGDS, DHX37, and SCNN1D. The effects of diabetes were partly mediated by the altered methylation of HOOK2, LCE3C, and TMEM63B. The effects of obesity were partly mediated by the differential methylation of LTF and DUSP22. CONCLUSIONS: The presented data highlight the associated altered methylation patterns potentially mediated by maternal diabetes and/or obesity.
Larger studies are warranted to investigate the role of both the identified differentially methylated loci and the effects on newborn body composition and future health risk factors for metabolic disease. Additional future consideration should be targeted to the role of Hispanic inheritance. Potential future targeting of transgenerational propagation and developmental programming may reduce population obesity and diabetes risk.
Subject(s)
DNA Methylation , Diabetes, Gestational/genetics , Epigenomics/methods , Fetal Blood/chemistry , Hispanic or Latino/genetics , Obesity/genetics , Adult , CpG Islands , Diabetes, Gestational/ethnology , Epigenesis, Genetic , Female , Gene Regulatory Networks , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Infant, Newborn , Maternal Age , Maternal-Fetal Exchange , Obesity/ethnology , Pregnancy , Prospective Studies , Young Adult
ABSTRACT
Modern experimental techniques have produced a wealth of high-throughput data that has enabled the ongoing genomic revolution. As the field continues to integrate experimental and computational analyses of this data, it is essential that performance evaluations of high-throughput results be carried out in a consistent and biologically informative manner. Here, we present an overview of evaluation techniques for high-throughput experimental data and computational methods, and we discuss a number of potential pitfalls in this process. These primarily involve the biological diversity of genomic data, which can be masked or misrepresented in overly simplified global evaluations. We describe systems for preserving information about biological context during dataset evaluation, which can help to ensure that multiple different evaluations are more directly comparable. This biological variety in high-throughput data can also be taken advantage of computationally through data integration and process specificity to produce richer systems-level predictions of cellular function. An awareness of these considerations can greatly improve the evaluation and analysis of any high-throughput experimental dataset.
Subject(s)
Genome, Fungal , Proteome , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Computational Biology , Data Interpretation, Statistical , Databases, Genetic/standards , Databases, Genetic/statistics & numerical data , Databases, Protein/standards , Databases, Protein/statistics & numerical data , Genomics/standards , Genomics/statistics & numerical data , Proteomics/standards , Proteomics/statistics & numerical data , Systems Biology
ABSTRACT
MOTIVATION: The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium. RESULTS: We have collected Saccharomyces cerevisiae gene expression microarray data containing roughly 2400 experimental conditions. We analyzed the functional coverage of this collection and we designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context; based on this query, we weight each dataset's relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. Our method exhibits an average increase in accuracy of 273% compared to previous mega-clustering approaches when recapitulating known biology. Further, we find that our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology provides the ability for biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community. AVAILABILITY: Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL. SUPPLEMENTARY INFORMATION: Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement.
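The query-driven search described above can be illustrated with a minimal sketch. This is not the published SPELL implementation: the weighting rule (mean pairwise Pearson correlation among query genes, clipped at zero) and the function names `dataset_weight` and `rank_genes` are assumptions chosen for illustration, and each dataset is assumed to be a plain dictionary mapping gene names to expression vectors.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def dataset_weight(dataset, query):
    """Assumed relevance weight: mean pairwise correlation of the query genes, floored at 0."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def rank_genes(datasets, query):
    """Rank non-query genes by weighted mean correlation to the query set."""
    weights = [dataset_weight(d, query) for d in datasets]
    total = sum(weights)
    if total == 0:
        return []
    candidates = {g for d in datasets for g in d} - set(query)
    scores = {}
    for gene in candidates:
        acc = 0.0
        for d, w in zip(datasets, weights):
            if w == 0 or gene not in d:
                continue
            qs = [q for q in query if q in d]
            acc += w * sum(pearson(d[gene], d[q]) for q in qs) / len(qs)
        scores[gene] = acc / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A dataset in which the query genes are not co-expressed receives zero weight and so does not influence the ranking, which is the sense in which the search is context-sensitive in this sketch.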
Subject(s)
Databases, Protein , Gene Expression Profiling/methods , Gene Expression/physiology , Information Storage and Retrieval/methods , Oligonucleotide Array Sequence Analysis/methods , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Algorithms , Database Management Systems , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
BACKGROUND: The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). RESULTS: We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. CONCLUSION: The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. 
It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
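The idea of clustering from mutual nearest neighbors and overlapping cliques can be sketched roughly as follows. This is an illustrative approximation rather than the authors' NNN code: the parameters `k` and `clique_size`, the use of Pearson correlation as the similarity measure, and the rule that cliques sharing any gene are merged are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mutual_knn_graph(profiles, k):
    """Connect two genes only if each is among the other's k most similar genes."""
    genes = sorted(profiles)
    nearest = {}
    for g in genes:
        others = sorted(
            (h for h in genes if h != g),
            key=lambda h: pearson(profiles[g], profiles[h]),
            reverse=True,
        )
        nearest[g] = set(others[:k])
    return {g: {h for h in nearest[g] if g in nearest[h]} for g in genes}


def nnn_clusters(profiles, k=2, clique_size=3):
    """Find fully connected gene groups and merge those that overlap (assumed rule)."""
    graph = mutual_knn_graph(profiles, k)
    cliques = [
        set(c)
        for c in combinations(sorted(graph), clique_size)
        if all(b in graph[a] for a, b in combinations(c, 2))
    ]
    clusters = []
    for clique in cliques:
        for existing in [c for c in clusters if c & clique]:
            clique |= existing
            clusters.remove(existing)
        clusters.append(clique)
    return clusters
```

Genes that belong to no clique are simply absent from the returned clusters, which mirrors the property that genes with no sufficiently similar partners remain unclustered.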
Subject(s)
Algorithms , Cluster Analysis , Gene Expression , Genes, Fungal , Saccharomyces cerevisiae/genetics , Databases, Genetic , Gene Expression Profiling/methods , Gene Expression Regulation, Fungal , ROC Curve , Software
ABSTRACT
BACKGROUND: The Gene Ontology has become an extremely useful tool for the analysis of genomic data and structuring of biological knowledge. Several excellent software tools for navigating the gene ontology have been developed. However, no existing system provides an interactively expandable graph-based view of the gene ontology hierarchy. Furthermore, most existing tools are web-based or require an Internet connection, will not load local annotations files, and provide either analysis or visualization functionality, but not both. RESULTS: To address the above limitations, we have developed GOLEM (Gene Ontology Local Exploration Map), a visualization and analysis tool for focused exploration of the gene ontology graph. GOLEM allows the user to dynamically expand and focus the local graph structure of the gene ontology hierarchy in the neighborhood of any chosen term. It also supports rapid analysis of an input list of genes to find enriched gene ontology terms. The GOLEM application permits the user either to utilize local gene ontology and annotations files in the absence of an Internet connection, or to access the most recent ontology and annotation information from the gene ontology webpage. GOLEM supports global and organism-specific searches by gene ontology term name, gene ontology id and gene name. CONCLUSION: GOLEM is a useful software tool for biologists interested in visualizing the local directed acyclic graph structure of the gene ontology hierarchy and searching for gene ontology terms enriched in genes of interest. It is freely available both as an application and as an applet at http://function.princeton.edu/GOLEM.
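Finding enriched gene ontology terms for an input gene list is commonly done with a one-sided hypergeometric test, sketched below. The abstract does not state which statistic GOLEM uses, so the test, the Bonferroni correction, and the names `hypergeom_pvalue` and `enriched_terms` are assumptions for illustration; `annotations` is assumed to map each term to the set of genes annotated to it.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_terms(gene_list, annotations, background, alpha=0.05):
    """Return (term, p) pairs passing a Bonferroni-corrected threshold, smallest p first."""
    bg = set(background)
    genes = set(gene_list) & bg
    results = []
    for term, members in annotations.items():
        members = set(members) & bg
        k = len(genes & members)
        p = hypergeom_pvalue(k, len(genes), len(members), len(bg))
        # Bonferroni: multiply by the number of terms tested (an assumed correction).
        if p * len(annotations) <= alpha:
            results.append((term, p))
    return sorted(results, key=lambda tp: tp[1])
```

For instance, if 4 query genes are all annotated to a term carried by 5 of 20 background genes, the p-value is 5/4845, or about 0.001.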
Subject(s)
Computer Graphics , Database Management Systems , Protein Interaction Mapping/methods , Proteins/classification , Proteins/metabolism , Software , User-Computer Interface , Databases, Protein , Documentation/methods , Information Storage and Retrieval , Proteins/genetics
ABSTRACT
BACKGROUND: Accurate evaluation of the quality of genomic or proteomic data and computational methods is vital to our ability to use them for formulating novel biological hypotheses and directing further experiments. There is currently no standard approach to evaluation in functional genomics. Our analysis of existing approaches shows that they are inconsistent and contain substantial functional biases that render the resulting evaluations misleading both quantitatively and qualitatively. These problems make it essentially impossible to compare computational methods or large-scale experimental datasets and also result in conclusions that generalize poorly in most biological applications. RESULTS: We reveal issues with current evaluation methods here and suggest new approaches to evaluation that facilitate accurate and representative characterization of genomic methods and data. Specifically, we describe a functional genomics gold standard based on curation by expert biologists and demonstrate its use as an effective means of evaluation of genomic approaches. Our evaluation framework and gold standard are freely available to the community through our website. CONCLUSION: Proper methods for evaluating genomic data and computational approaches will determine how much we, as a community, are able to learn from the wealth of available data. We propose one possible solution to this problem here but emphasize that this topic warrants broader community discussion.
Subject(s)
Computational Biology/methods , Genomics/methods , Algorithms , Computational Biology/standards , Databases, Genetic/standards , Genomics/standards , Proteomics/methods , Proteomics/standards , Reproducibility of Results , Software/standards
ABSTRACT
BACKGROUND: The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm due to the lack of a gold standard. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue. RESULTS: We present several visualization techniques that incorporate meaningful, noise-robust statistics for analyzing the results of clustering algorithms on microarray data. These include a rank-based visualization method that is more robust to noise, a difference display method to aid assessments of cluster quality and detection of outliers, and a projection of high-dimensional data into a three-dimensional space in order to examine relationships between clusters. Our methods are interactive and are dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://function.princeton.edu/GeneVAnD. CONCLUSION: Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.
Subject(s)
Computational Biology/methods , Computer Graphics , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Artificial Intelligence , Cluster Analysis , Computers , Data Interpretation, Statistical , Databases, Genetic , Gene Expression Profiling , Information Storage and Retrieval , Models, Genetic , Models, Statistical , Pattern Recognition, Automated , Principal Component Analysis , Programming Languages , Sequence Alignment , Sequence Analysis, DNA , Software , User-Computer Interface
ABSTRACT
BACKGROUND: Genetic recombination plays an important role in evolution, facilitating the creation of new, favorable combinations of alleles and the removal of deleterious mutations by unlinking them from surrounding sequences. In most mammals, the placement of genetic crossovers is determined by the binding of PRDM9, a highly polymorphic protein with a long zinc finger array, to its cognate binding sites. It is one of over 800 genes encoding proteins with zinc finger domains in the human genome. RESULTS: We report a novel technique, Affinity-seq, that for the first time both identifies the genome-wide binding sites of DNA-binding proteins and quantitates their relative affinities. We have applied this in vitro technique to PRDM9, the zinc-finger protein that activates genetic recombination, obtaining new information on the regulation of hotspots, whose locations and activities determine the recombination landscape. We identified 31,770 binding sites in the mouse genome for the PRDM9(Dom2) variant. Comparing these results with hotspot usage in vivo, we find that less than half of potential PRDM9 binding sites are utilized in vivo. We show that hotspot usage is increased in actively transcribed genes and decreased in genomic regions containing H3K9me2/3 histone marks or bound to the nuclear lamina. CONCLUSIONS: These results show that a major factor determining whether a binding site will become an active hotspot, and what its activity will be, is the set of constraints imposed by prior chromatin modifications on the ability of PRDM9 to bind to DNA in vivo. These constraints lead to the presence of long genomic regions depleted of recombination.
ABSTRACT
Massively parallel RNA sequencing (RNA-seq) has yielded a wealth of new insights into transcriptional regulation. A first step in the analysis of RNA-seq data is the alignment of short sequence reads to a common reference genome or transcriptome. Genetic variants that distinguish individual genomes from the reference sequence can cause reads to be misaligned, resulting in biased estimates of transcript abundance. Fine-tuning of read alignment algorithms does not correct this problem. We have developed Seqnature software to construct individualized diploid genomes and transcriptomes for multiparent populations and have implemented a complete analysis pipeline that incorporates other existing software tools. We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression. Moreover, when applied to expression QTL mapping we find that our individualized alignment strategy corrects false-positive linkage signals and unmasks hidden associations. We recommend the use of individualized diploid genomes over reference sequence alignment for all applications of high-throughput sequencing technology in genetically diverse populations.
Subject(s)
Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Transcriptome , Animals , Female , Genome , Male , Mice , Quantitative Trait Loci
ABSTRACT
Self-renewal, the ability of a stem cell to divide repeatedly while maintaining an undifferentiated state, is a defining characteristic of all stem cells. Here, we clarify the molecular foundations of mouse embryonic stem cell (mESC) self-renewal by applying a proven Bayesian network machine learning approach to integrate high-throughput data for protein function discovery. By focusing on a single stem-cell system, at a specific developmental stage, within the context of well-defined biological processes known to be active in that cell type, we produce a consensus predictive network that reflects biological reality more closely than those made by prior efforts using more generalized, context-independent methods. In addition, we show how machine learning efforts may be misled if the tissue specific role of mammalian proteins is not defined in the training set and circumscribed in the evidential data. For this study, we assembled an extensive compendium of mESC data: ~2.2 million data points, collected from 60 different studies, under 992 conditions. We then integrated these data into a consensus mESC functional relationship network focused on biological processes associated with embryonic stem cell self-renewal and cell fate determination. Computational evaluations, literature validation, and analyses of predicted functional linkages show that our results are highly accurate and biologically relevant. Our mESC network predicts many novel players involved in self-renewal and serves as the foundation for future pluripotent stem cell studies. This network can be used by stem cell researchers (at http://StemSight.org) to explore hypotheses about gene function in the context of self-renewal and to prioritize genes of interest for experimental validation.