Results 1 - 8 of 8
1.
J Vis Exp ; (152)2019 10 11.
Article in English | MEDLINE | ID: mdl-31657800

ABSTRACT

Differential gene expression analysis is an important technique for understanding disease states. The machine learning algorithm CorEx has shown utility in analyzing differential expression of groups of genes in tumor RNA-seq in a way that may be helpful for advancing precision oncology. However, CorEx produces many factors that can be challenging to analyze and connect to existing understanding. To facilitate such connections, we have built a website, CorExplorer, that allows users to interactively explore the data and answer common questions related to its analysis. We trained CorEx on RNA-seq gene expression data for four tumor types: ovarian, lung, melanoma, and colorectal. We then incorporated corresponding survival, protein-protein interactions, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichments, and heatmaps into the website for association with the factor graph visualization. Here we employ example protocols to illustrate use of the database for comprehending the significance of the learned tumor factors in the context of this external data.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic/genetics , Machine Learning , Neoplasms/genetics , Patient Portals , Gene Expression Profiling/instrumentation , Genome , Humans , Neoplasms/metabolism , Precision Medicine/instrumentation , Precision Medicine/methods , RNA/biosynthesis , RNA/genetics
2.
BMC Med Genomics ; 10(1): 12, 2017 03 15.
Article in English | MEDLINE | ID: mdl-28292312

ABSTRACT

BACKGROUND: De novo inference of clinically relevant gene function relationships from tumor RNA-seq remains a challenging task. Current methods typically either partition patient samples into a few subtypes or rely upon analysis of pairwise gene correlations that will miss some groups in noisy data. Leveraging higher dimensional information can be expected to increase the power to discern targetable pathways, but this is commonly thought to be an intractable computational problem. METHODS: In this work we adapt a recently developed machine learning algorithm for sensitive detection of complex gene relationships. The algorithm, CorEx, efficiently optimizes over multivariate mutual information and can be iteratively applied to generate a hierarchy of relatively independent latent factors. The learned latent factors are used to stratify patients for survival analysis with respect to both single factors and combinations. These analyses are performed and interpreted in the context of biological function annotations and protein network interactions that might be utilized to match patients to multiple therapies. RESULTS: Analysis of ovarian tumor RNA-seq samples demonstrates the algorithm's power to infer well over one hundred biologically interpretable gene cohorts, several times more than standard methods such as hierarchical clustering and k-means. The CorEx factor hierarchy is also informative, with related but distinct gene clusters grouped by upper nodes. Some latent factors correlate with patient survival, including one for a pathway connected with the epithelial-mesenchymal transition in breast cancer that is regulated by a microRNA that modulates epigenetics. Further, combinations of factors lead to a synergistic survival advantage in some cases. 
CONCLUSIONS: In contrast to studies that attempt to partition patients into a small number of subtypes (typically 4 or fewer) for treatment purposes, our approach utilizes subgroup information for combinatoric transcriptional phenotyping. Considering only the 66 gene expression groups that are found to both have significant Gene Ontology enrichment and are small enough to indicate specific drug targets implies a computational phenotype for ovarian cancer that allows for 3^66 possible patient profiles, enabling truly personalized treatment. The findings here demonstrate a new technique that sheds light on the complexity of gene expression dependencies in tumors and could eventually enable the use of patient RNA-seq profiles for selection of personalized and effective cancer treatments.
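
The "multivariate mutual information" named in the METHODS portion of this abstract is commonly called total correlation: the sum of the single-variable entropies minus the joint entropy. The sketch below is a plug-in estimate of that quantity for small discrete data. It is not the CorEx optimization itself, and the function names `entropy` and `total_correlation` are illustrative choices, not taken from the CorEx software.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable observations."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(rows):
    """Plug-in total correlation: sum of marginal entropies minus joint entropy.

    rows: list of equal-length tuples, one tuple per sample
    (for example, discretized expression levels of a few genes).
    """
    columns = list(zip(*rows))                  # one sequence per variable
    marginal = sum(entropy(col) for col in columns)
    joint = entropy([tuple(r) for r in rows])   # entropy of whole samples
    return marginal - joint


# Three perfectly coupled binary variables share 2 bits of redundancy.
coupled = [(0, 0, 0), (1, 1, 1)] * 50
print(total_correlation(coupled))  # 2.0
```

A group of genes whose discretized levels rise and fall together scores high on this measure, while independent genes score near zero. The plug-in estimate is biased upward when samples are few relative to the number of joint states, so it is suitable only for illustration on small groups.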


Subject(s)
Computational Biology/methods , Gene Expression Profiling , Ovarian Neoplasms/genetics , Ovarian Neoplasms/therapy , Cluster Analysis , Epithelial-Mesenchymal Transition/genetics , Female , Humans , Molecular Sequence Annotation , Neoplasm Metastasis , Neoplastic Stem Cells/pathology , Ovarian Neoplasms/metabolism , Ovarian Neoplasms/pathology , Protein Interaction Mapping , Sequence Analysis, RNA
3.
Cell Stem Cell ; 16(1): 88-101, 2015 Jan 08.
Article in English | MEDLINE | ID: mdl-25575081

ABSTRACT

Cellular reprogramming highlights the epigenetic plasticity of the somatic cell state. Long noncoding RNAs (lncRNAs) have emerging roles in epigenetic regulation, but their potential functions in reprogramming cell fate have been largely unexplored. We used single-cell RNA sequencing to characterize the expression patterns of over 16,000 genes, including 437 lncRNAs, during defined stages of reprogramming to pluripotency. Self-organizing maps (SOMs) were used as an intuitive way to structure and interrogate transcriptome data at the single-cell level. Early molecular events during reprogramming involved the activation of Ras signaling pathways, along with hundreds of lncRNAs. Loss-of-function studies showed that activated lncRNAs can repress lineage-specific genes, while lncRNAs activated in multiple reprogramming cell types can regulate metabolic gene expression. Our findings demonstrate that reprogramming cells activate defined sets of functionally relevant lncRNAs and provide a resource to further investigate how dynamic changes in the transcriptome reprogram cell state.


Subject(s)
Cellular Reprogramming/genetics , RNA, Long Noncoding/genetics , Single-Cell Analysis/methods , Transcriptome/genetics , Animals , Cell Lineage/genetics , Gene Expression Regulation, Developmental , Genes, Developmental , Hematopoiesis/genetics , Induced Pluripotent Stem Cells/cytology , Induced Pluripotent Stem Cells/metabolism , Mice , Pluripotent Stem Cells/metabolism , RNA, Long Noncoding/metabolism , Signal Transduction/genetics , ras Proteins/metabolism
4.
Genome Res ; 23(12): 2136-48, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24170599

ABSTRACT

We tested whether self-organizing maps (SOMs) could be used to effectively integrate, visualize, and mine diverse genomics data types, including complex chromatin signatures. A fine-grained SOM was trained on 72 ChIP-seq histone modifications and DNase-seq data sets from six biologically diverse cell lines studied by The ENCODE Project Consortium. We mined the resulting SOM to identify chromatin signatures related to sequence-specific transcription factor occupancy, sequence motif enrichment, and biological functions. To highlight clusters enriched for specific functions such as transcriptional promoters or enhancers, we overlaid onto the map additional data sets not used during training, such as ChIP-seq, RNA-seq, CAGE, and information on cis-acting regulatory modules from the literature. We used the SOM to parse known transcriptional enhancers according to the cell-type-specific chromatin signature, and we further corroborated this pattern on the map by EP300 (also known as p300) occupancy. New candidate cell-type-specific enhancers were identified for multiple ENCODE cell types in this way, along with new candidates for ubiquitous enhancer activity. An interactive web interface was developed to allow users to visualize and custom-mine the ENCODE SOM. We conclude that large SOMs trained on chromatin data from multiple cell types provide a powerful way to identify complex relationships in genomic data at user-selected levels of granularity.
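
For readers unfamiliar with self-organizing maps, the following is a minimal standard-library sketch of the classic online SOM update: find the best-matching unit, then pull it and its grid neighbors toward the sample. The grid size, learning-rate schedule, and neighborhood width are placeholder assumptions. They do not reflect the settings or software used in the study.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_d2 = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols SOM on `data` (a list of equal-length feature lists)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.01    # neighborhood shrinks
            x = data[idx]
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Two toy "signal profiles"; each should land on a different map unit.
data = [[0.0, 0.0, 0.0] for _ in range(10)] + [[1.0, 1.0, 1.0] for _ in range(10)]
som = train_som(data)
print(best_matching_unit(som, [0.0, 0.0, 0.0]), best_matching_unit(som, [1.0, 1.0, 1.0]))
```

Data sets not used in training can then be overlaid by looking up the best-matching unit of each new item, which is the general idea behind mapping additional annotations onto a trained SOM.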


Subject(s)
Chromatin/genetics , Chromatin/metabolism , Histones/genetics , Histones/metabolism , Transcription Factors/genetics , Algorithms , Cell Line , Chromosome Mapping , Computational Biology , Data Mining , Gene Ontology , Human Umbilical Vein Endothelial Cells , Humans , K562 Cells , Promoter Regions, Genetic , User-Computer Interface
5.
Genome Res ; 21(4): 566-77, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21383317

ABSTRACT

Cis-regulatory modules (CRMs) function by binding sequence specific transcription factors, but the relationship between in vivo physical binding and the regulatory capacity of factor-bound DNA elements remains uncertain. We investigate this relationship for the well-studied Twist factor in Drosophila melanogaster embryos by analyzing genome-wide factor occupancy and testing the functional significance of Twist occupied regions and motifs within regions. Twist ChIP-seq data efficiently identified previously studied Twist-dependent CRMs and robustly predicted new CRM activity in transgenesis, with newly identified Twist-occupied regions supporting diverse spatiotemporal patterns (>74% positive, n = 31). Some, but not all, candidate CRMs require Twist for proper expression in the embryo. The Twist motifs most favored in genome ChIP data (in vivo) differed from those most favored by Systematic Evolution of Ligands by EXponential enrichment (SELEX) (in vitro). Furthermore, the majority of ChIP-seq signals could be parsimoniously explained by a CABVTG motif located within 50 bp of the ChIP summit and, of these, CACATG was most prevalent. Mutagenesis experiments demonstrated that different Twist E-box motif types are not fully interchangeable, suggesting that the ChIP-derived consensus (CABVTG) includes sites having distinct regulatory outputs. Further analysis of position, frequency of occurrence, and sequence conservation revealed significant enrichment and conservation of CABVTG E-box motifs near Twist ChIP-seq signal summits, preferential conservation of ±150 bp surrounding Twist occupied summits, and enrichment of GA- and CA-repeat sequences near Twist occupied summits. Our results show that high resolution in vivo occupancy data can be used to drive efficient discovery and dissection of global and local cis-regulatory logic.
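
The CABVTG consensus uses IUPAC degenerate codes, where B stands for C, G, or T and V stands for A, C, or G. A minimal sketch of scanning a sequence for such sites within 50 bp of a ChIP summit might look like the following. The function names and the convention of measuring distance from the motif start are assumptions made for illustration, not the authors' pipeline.

```python
import re

# IUPAC codes needed here: B = C, G or T (not A); V = A, C or G (not T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "B": "[CGT]", "V": "[ACG]"}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex that also finds overlapping sites."""
    body = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + body + "))")  # lookahead keeps overlaps


def motifs_near_summit(sequence, summit, motif="CABVTG", window=50):
    """Return (start, site) pairs whose start is within `window` bp of `summit`."""
    pattern = motif_to_regex(motif)
    hits = []
    for match in pattern.finditer(sequence.upper()):
        if abs(match.start() - summit) <= window:
            hits.append((match.start(), match.group(1)))
    return hits


print(motifs_near_summit("GGCACATGGG", summit=4))  # [(2, 'CACATG')]
```

Under these codes CACATG, CATATG, and CAGCTG all match CABVTG, whereas CAATTG does not because B excludes A.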


Subject(s)
DNA/genetics , Drosophila/embryology , Drosophila/genetics , Evolution, Molecular , Twist-Related Protein 1/genetics , Twist-Related Protein 1/metabolism , Animals , Base Composition , Base Sequence , Binding Sites/genetics , Computational Biology , Consensus Sequence/genetics , Conserved Sequence , Gene Expression Regulation, Developmental , Molecular Sequence Data , Regulatory Elements, Transcriptional/genetics
6.
PLoS Comput Biol ; 6(2): e1000675, 2010 Feb 12.
Article in English | MEDLINE | ID: mdl-20168991

ABSTRACT

During the acquisition of memories, influx of Ca2+ into the postsynaptic spine through the pores of activated N-methyl-D-aspartate-type glutamate receptors triggers processes that change the strength of excitatory synapses. The pattern of Ca2+ influx during the first few seconds of activity is interpreted within the Ca2+-dependent signaling network such that synaptic strength is eventually either potentiated or depressed. Many of the critical signaling enzymes that control synaptic plasticity, including Ca2+/calmodulin-dependent protein kinase II (CaMKII), are regulated by calmodulin, a small protein that can bind up to 4 Ca2+ ions. As a first step toward clarifying how the Ca2+-signaling network decides between potentiation or depression, we have created a kinetic model of the interactions of Ca2+, calmodulin, and CaMKII that represents our best understanding of the dynamics of these interactions under conditions that resemble those in a postsynaptic spine. We constrained parameters of the model from data in the literature, or from our own measurements, and then predicted time courses of activation and autophosphorylation of CaMKII under a variety of conditions. Simulations showed that species of calmodulin with fewer than four bound Ca2+ play a significant role in activation of CaMKII in the physiological regime, supporting the notion that processing of Ca2+ signals in a spine involves competition among target enzymes for binding to unsaturated species of CaM in an environment in which the concentration of Ca2+ is fluctuating rapidly. Indeed, we showed that dependence of activation on the frequency of Ca2+ transients arises from the kinetics of interaction of fluctuating Ca2+ with calmodulin/CaMKII complexes. We used parameter sensitivity analysis to identify which parameters will be most beneficial to measure more carefully to improve the accuracy of predictions.
This model provides a quantitative base from which to build more complex dynamic models of postsynaptic signal transduction during learning.
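
To make the idea of partially Ca2+-bound calmodulin species concrete, here is a deliberately simplified mass-action sketch of sequential binding of up to four Ca2+ ions, integrated with forward Euler at a fixed free Ca2+ level. It is not the published model: it omits CaMKII, lobe-specific sites, and cooperativity, and every rate constant and concentration is a dimensionless placeholder chosen only so the example runs.

```python
def simulate_cam_binding(ca_free, cam_total=1.0, k_on=1.0, k_off=1.0,
                         dt=1e-3, steps=50000):
    """Forward-Euler integration of CaM_i + Ca <-> CaM_(i+1), i = 0..3.

    Returns a list where entry i is the amount of calmodulin with i bound Ca2+.
    Free Ca2+ is held constant (an assumption; in a spine it fluctuates).
    All parameter values are hypothetical placeholders.
    """
    state = [cam_total, 0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        # Net forward flux of each binding step under mass action.
        flux = [k_on * ca_free * state[i] - k_off * state[i + 1]
                for i in range(4)]
        for i in range(4):
            state[i] -= flux[i] * dt
            state[i + 1] += flux[i] * dt
    return state


species = simulate_cam_binding(ca_free=0.5)
for bound, amount in enumerate(species):
    print(bound, round(amount, 4))
```

With these placeholder values, lowering `ca_free` shifts the distribution toward species with fewer bound ions, which illustrates why unsaturated calmodulin can matter when Ca2+ is limiting. Replacing the constant `ca_free` with a time-varying input would be the natural next step toward studying frequency dependence.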


Subject(s)
Calcium-Calmodulin-Dependent Protein Kinase Type 2/chemistry , Calcium/chemistry , Calmodulin/chemistry , Multiprotein Complexes/chemistry , Calcium/metabolism , Calcium-Calmodulin-Dependent Protein Kinase Type 2/metabolism , Calmodulin/metabolism , Kinetics , Models, Chemical , Molecular Dynamics Simulation , Multiprotein Complexes/metabolism , Phosphorylation , Protein Binding , Thermodynamics
7.
Nat Methods ; 6(11 Suppl): S22-32, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19844228

ABSTRACT

Genome-wide measurements of protein-DNA interactions and transcriptomes are increasingly done by deep DNA sequencing methods (ChIP-seq and RNA-seq). The power and richness of these counting-based measurements comes at the cost of routinely handling tens to hundreds of millions of reads. Whereas early adopters necessarily developed their own custom computer code to analyze the first ChIP-seq and RNA-seq datasets, a new generation of more sophisticated algorithms and software tools are emerging to assist in the analysis phase of these projects. Here we describe the multilayered analyses of ChIP-seq and RNA-seq datasets, discuss the software packages currently available to perform tasks at each layer and describe some upcoming challenges and features for future analysis tools. We also discuss how software choices and uses are affected by specific aspects of the underlying biology and data structure, including genome size, positional clustering of transcription factor binding sites, transcript discovery and expression quantification.


Subject(s)
Chromatin Immunoprecipitation/methods , Gene Expression Profiling/methods , RNA/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , RNA/analysis
8.
J Mol Evol ; 64(1): 80-9, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17160642

ABSTRACT

We examine the impact of likelihood surface characteristics on phylogenetic inference. Amino acid data sets simulated from topologies with branch length features chosen to represent varying degrees of difficulty for likelihood maximization are analyzed. We present situations where the tree found to achieve the global maximum in likelihood is often not equal to the true tree. We use the program covSEARCH to demonstrate how the use of adaptively sized pools of candidate trees that are updated using confidence tests results in solution sets that are highly likely to contain the true tree. This approach requires more computation than traditional maximum likelihood methods, hence covSEARCH is best suited to small to medium-sized alignments or large alignments with some constrained nodes. The majority rule consensus tree computed from the confidence sets also proves to be different from the generating topology. Although low phylogenetic signal in the input alignment can result in large confidence sets of trees, some biological information can still be obtained based on nodes that exhibit high support within the confidence set. Two real data examples are analyzed: mammal mitochondrial proteins and a small tubulin alignment. We conclude that the technique of confidence set optimization can significantly improve the robustness of phylogenetic inference at a reasonable computational cost. Additionally, when either very short internal branches or very long terminal branches are present, confident resolution of specific bipartitions or subtrees, rather than whole-tree phylogenies, may be the most realistic goal for phylogenetic methods.


Subject(s)
Algorithms , Models, Biological , Phylogeny , Ascomycota/genetics , Likelihood Functions , Mitochondrial Proteins/genetics , Sequence Alignment/methods , Tubulin