ABSTRACT
Scaffold proteins drive liquid-liquid phase separation (LLPS) to form biomolecular condensates and organize various biochemical reactions in cells. Dysregulation of scaffolds can lead to aberrant condensate assembly and various complex diseases. However, bioinformatics predictors dedicated to scaffolds are still lacking, and their development suffers from an extreme imbalance between the limited number of experimentally identified scaffolds and the many unlabeled candidates. Here, using the joint distribution of hybrid multimodal features, we implemented a positive-unlabeled (PU) learning-based framework named PULPS that combines ProbTagging and penalty logistic regression (PLR) to profile the propensity of scaffolds. PULPS achieved the best AUC of 0.8353 and showed an area under the lift curve (AUL) of 0.8339 as an estimate of true performance. After reviewing recent experimentally verified scaffolds, we performed a partial recovery that raised the AUL by 2.85%, from 0.8339 to 0.8577. PULPS showed a 45.7% improvement in AUL over PLR and an 8.2% improvement over other existing tools. Our study is the first to show that PU learning is more suitable for scaffold prediction, and it demonstrates the widespread existence of phase separation states. This profile also uncovered potential scaffolds that co-drive LLPS in the human proteome and generated candidates for further experiments. PULPS is free for academic research at http://pulps.zbiolab.cn.
Subject(s)
Cell Physiological Phenomena , Proteome , Humans
ABSTRACT
MOTIVATION: Biological and cellular systems are often modeled as graphs in which vertices represent objects of interest (genes, proteins and drugs) and edges represent relational ties between these objects (binds-to, interacts-with and regulates). This approach has been highly successful owing to the theory, methodology and software that support analysis and learning on graphs. Graphs, however, suffer from information loss when modeling physical systems due to their inability to accurately represent multiobject relationships. Hypergraphs, a generalization of graphs, provide a framework to mitigate information loss and unify disparate graph-based methodologies. RESULTS: We present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs. We then introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. The method is based on exact and inexact (via hypergraph edit distances) enumeration of hypergraphlets; i.e. small hypergraphs rooted at a vertex of interest. We empirically evaluate this method on fifteen biological networks and show its potential use in a positive-unlabeled setting to estimate the interactome sizes in various species. AVAILABILITY AND IMPLEMENTATION: https://github.com/jlugomar/hypergraphlet-kernels. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
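The reduction of edge classification to vertex classification rests on the dual hypergraph, in which every original hyperedge becomes a vertex and every original vertex becomes a hyperedge. The following is a minimal Python sketch of that construction only, not the authors' implementation; the function name `dual_hypergraph` and the toy protein data are hypothetical, and the sketch assumes there are no isolated vertices or empty hyperedges.

```python
from collections import defaultdict


def dual_hypergraph(hyperedges):
    """Build the dual of a hypergraph.

    `hyperedges` maps a hyperedge id to the set of vertices it contains.
    The result maps each original vertex to the set of hyperedge ids that
    contain it, i.e. each original vertex becomes a hyperedge of the dual
    and each original hyperedge becomes a vertex of the dual.
    """
    dual = defaultdict(set)
    for edge_id, members in hyperedges.items():
        for vertex in members:
            dual[vertex].add(edge_id)
    return {vertex: frozenset(edge_ids) for vertex, edge_ids in dual.items()}


# Hypothetical toy data: each hyperedge is a multi-protein relationship.
hyperedges = {
    "e1": {"A", "B", "C"},
    "e2": {"B", "D"},
    "e3": {"C", "D", "E"},
}

dual = dual_hypergraph(hyperedges)
for vertex in sorted(dual):
    print(vertex, sorted(dual[vertex]))
```

A label attached to hyperedge `e1` in the original hypergraph is a label on vertex `e1` in the dual, so any vertex classifier can be applied to it. Under the stated assumptions, taking the dual twice returns the original hypergraph.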
Subject(s)
Proteins , Software
ABSTRACT
Preterm birth affects more than 10% of all births worldwide. Such infants are much more prone to Growth Faltering (GF), an issue that remains unsolved despite the implementation of numerous interventions aimed at optimizing preterm infant nutrition. To improve early prediction of GF risk in preterm infants, we collected a comprehensive, large, and unique clinical and microbiome dataset from 3 different sites in the US and the UK. We use and extend machine learning methods for GF prediction from clinical data. We next extend graphical models to integrate time series clinical and microbiome data. A model that integrates clinical and microbiome data improves on the ability to predict GF when compared to models using clinical data only. Information on a small subset of the taxa is enough to help improve model accuracy and to predict interventions that can improve outcome. We show that a hierarchical classifier that only uses a subset of the taxa for a subset of the infants is both the most accurate and cost-effective method for GF prediction. Further analysis of the best classifiers enables the prediction of interventions that can improve outcome.
Subject(s)
Microbiota , Premature Birth , Humans , Infant , Infant, Newborn , Infant, Premature , Machine Learning
ABSTRACT
Elucidating the precise molecular events altered by disease-causing genetic variants represents a major challenge in translational bioinformatics. To this end, many studies have investigated the structural and functional impact of amino acid substitutions. Most of these studies, however, were either limited in scope to individual molecular functions or concerned with functional effects (e.g. deleterious vs. neutral) without specifically considering possible molecular alterations. The recent growth of structural, molecular and genetic data presents an opportunity for more comprehensive studies to consider the structural environment of a residue of interest, to hypothesize specific molecular effects of sequence variants and to statistically associate these effects with genetic disease. In this study, we analyzed data sets of disease-causing and putatively neutral human variants mapped to protein 3D structures as part of a systematic study of the loss and gain of various types of functional attributes potentially underlying pathogenic molecular alterations. We first propose a formal model to probabilistically assess function-impacting variants. We then develop an array of structure-based functional residue predictors, evaluate their performance, and use them to quantify the impact of disease-causing amino acid substitutions on catalytic activity, metal binding, macromolecular binding, ligand binding, allosteric regulation and post-translational modifications. We show that our methodology generates actionable biological hypotheses for up to 41% of disease-causing genetic variants mapped to protein structures, suggesting that it can be reliably used to guide experimental validation. Our results suggest that a significant fraction of disease-causing human variants mapping to protein structures are function-altering both in the presence and absence of stability disruption.
Subject(s)
Amino Acid Sequence/genetics , Disease/genetics , Models, Statistical , Mutation/genetics , Amino Acid Substitution/genetics , Computational Biology , Computer Simulation , Humans , Models, Molecular , Protein Binding
ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
Subject(s)
Genome/genetics , Genomics/methods , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Animals , Chickens/genetics , Chromosome Mapping , Drosophila melanogaster/genetics , Pan troglodytes/genetics
ABSTRACT
Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.
Subject(s)
Genome , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology/methods , Evolution, Molecular , Genomics/methods , Reproducibility of Results
ABSTRACT
The interactions of proteins to form complexes play a crucial role in cell function. Data on protein-protein or pairwise interactions (PPI) typically come from a combination of sample separation and mass spectrometry. Since 2010, several extensive, high-throughput mass spectrometry-based experimental studies have dramatically expanded public repositories for PPI data and, by extension, our knowledge of protein complexes. Unfortunately, challenges of limited overlap between experiments, modality-oriented biases, and prohibitive costs of experimental reproducibility continue to limit coverage of the human protein assembly map, both underscoring the need for and spurring the development of relevant computational approaches. Here, we present a new method for predicting the strength of protein interactions. It addresses two important issues that have limited past PPI prediction approaches: incomplete feature sets and incomplete proteome coverage. For a given collection of protein pairs, we fused data from heterogeneous sources into a feature matrix and identified the minimal set of feature partitions for which a non-empty set of protein pairs had complete values. For each such feature partition, we trained a classifier to predict PPI probabilities. We then calculated an overall prediction for a given protein pair by weighting the probabilities from all models that applied to that pair. Our approach accurately identified known and highly probable PPI, far exceeding the performance of current approaches and providing more complete proteome coverage. We then used the predicted probabilities to assemble complexes using previously-described graph-based tools and clustering algorithms and again obtained improved results. Lastly, we used features for three human cell lines to predict PPI and complex scores and identified complexes predicted to differ between those cell lines.
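The per-partition scheme described above can be illustrated with a minimal Python sketch: each model is tied to a feature partition, applies only to pairs that have complete values for that partition, and the overall prediction is a weighted average over the applicable models. This is a sketch under stated assumptions, not the authors' method: the abstract does not specify the weighting, so the per-model weights here (for example, a validation score) are an assumption, and the feature names, stand-in predictors, and the function `combine_predictions` are all hypothetical.

```python
def combine_predictions(pair_features, partition_models):
    """Combine per-partition model outputs for one protein pair.

    `pair_features` maps feature name -> value; missing features are absent.
    `partition_models` is a list of (feature_names, predict, weight) tuples,
    where `predict` takes the feature values in order and returns a
    probability. The weight is an assumed per-model reliability score.
    Returns the weighted average probability over the models whose features
    are all present, or None if no model applies to this pair.
    """
    total = 0.0
    weight_sum = 0.0
    for feature_names, predict, weight in partition_models:
        values = [pair_features.get(name) for name in feature_names]
        if any(value is None for value in values):
            continue  # this model needs a feature the pair does not have
        total += weight * predict(values)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Hypothetical stand-in models; real classifiers would be trained per partition.
partition_models = [
    (("coelution",), lambda v: v[0], 1.0),
    (("coelution", "coexpression"), lambda v: 0.5 * (v[0] + v[1]), 3.0),
]

full_pair = {"coelution": 0.8, "coexpression": 0.4}
partial_pair = {"coelution": 0.8}
empty_pair = {}

print(combine_predictions(full_pair, partition_models))
print(combine_predictions(partial_pair, partition_models))
print(combine_predictions(empty_pair, partition_models))
```

Because a pair with few measured features still falls into at least the smaller partitions, this design lets every pair with some data receive a score, which is how the approach can extend proteome coverage without imputing missing values.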
ABSTRACT
Many popular spatial transcriptomics techniques lack single-cell resolution. Instead, these methods measure the collective gene expression for each location from a mixture of cells, potentially containing multiple cell types. Here, we developed scResolve, a method for recovering single-cell expression profiles from spatial transcriptomics measurements at multi-cellular resolution. scResolve accurately restores expression profiles of individual cells at their locations, which is unattainable with cell type deconvolution. Applications of scResolve on human breast cancer data and human lung disease data demonstrate that scResolve enables cell-type-specific differential gene expression analysis between different tissue contexts and accurate identification of rare cell populations. The spatially resolved cellular-level expression profiles obtained through scResolve facilitate more flexible and precise spatial analysis that complements raw multi-cellular level analysis.
Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Humans , Gene Expression Profiling/methods , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Female , Algorithms
ABSTRACT
A key challenge in the analysis of longitudinal microbiome data is the inference of temporal interactions between microbial taxa, their genes, the metabolites that they consume and produce, and host genes. To address these challenges, we developed a computational pipeline, a pipeline for the analysis of longitudinal multi-omics data (PALM), that first aligns multi-omics data and then uses dynamic Bayesian networks (DBNs) to reconstruct a unified model. Our approach overcomes differences in sampling and progression rates, utilizes a biologically inspired multi-omic framework, reduces the large number of entities and parameters in the DBNs, and validates the learned network. Applying PALM to data collected from inflammatory bowel disease patients, we show that it accurately identifies known and novel interactions. Targeted experimental validations further support a number of the predicted novel metabolite-taxon interactions. IMPORTANCE: While a number of large consortia collect and profile several different types of microbiome and genomic time series data, very few methods exist for joint modeling of multi-omics data sets. We developed a new computational pipeline, PALM, which uses dynamic Bayesian networks (DBNs) and is designed to integrate multi-omics data from longitudinal microbiome studies. When used to integrate sequence, expression, and metabolomics data from microbiome samples along with host expression data, the resulting models identify interactions between taxa, their genes, and the metabolites that they produce and consume, as well as their impact on host expression. We tested the models both by using them to predict future changes in microbiome levels and by comparing the learned interactions to known interactions in the literature. Finally, we performed experimental validations for a few of the predicted interactions to demonstrate the ability of the method to identify novel relationships and their impact.
ABSTRACT
Several molecular datasets have been recently compiled to characterize the activity of SARS-CoV-2 within human cells. Here we extend computational methods to integrate several different types of sequence, functional and interaction data to reconstruct networks and pathways activated by the virus in host cells. We identify key proteins in these networks and further intersect them with genes differentially expressed at conditions that are known to impact viral activity. Several of the top ranked genes do not directly interact with virus proteins. We experimentally tested treatments for a number of the predicted targets. We show that blocking one of the predicted indirect targets significantly reduces viral loads in stem cell-derived alveolar epithelial type II cells (iAT2s).
ABSTRACT
Identifying pathogenic variants and underlying functional alterations is challenging. To this end, we introduce MutPred2, a tool that improves the prioritization of pathogenic amino acid substitutions over existing methods, generates molecular mechanisms potentially causative of disease, and returns interpretable pathogenicity score distributions on individual genomes. Whilst its prioritization performance is state-of-the-art, a distinguishing feature of MutPred2 is the probabilistic modeling of variant impact on specific aspects of protein structure and function that can serve to guide experimental studies of phenotype-altering variants. We demonstrate the utility of MutPred2 in the identification of the structural and functional mutational signatures relevant to Mendelian disorders and the prioritization of de novo mutations associated with complex neurodevelopmental disorders. We then experimentally validate the functional impact of several variants identified in patients with such disorders. We argue that mechanism-driven studies of human inherited disease have the potential to significantly accelerate the discovery of clinically actionable variants.
Subject(s)
Amino Acid Substitution/genetics , Computational Biology/methods , Genetic Predisposition to Disease , Software , Genome, Human , Humans , Models, Statistical , Mutation , Phenotype , Proteins/genetics
ABSTRACT
BACKGROUND: Several studies have focused on the microbiota living in environmental niches including human body sites. In many of these studies, researchers collect longitudinal data with the goal of understanding not only the composition of the microbiome but also the interactions between the different taxa. However, analysis of such data is challenging and very few methods have been developed to reconstruct dynamic models from time series microbiome data. RESULTS: Here, we present a computational pipeline that enables the integration of data across individuals for the reconstruction of such models. Our pipeline starts by aligning the data collected for all individuals. The aligned profiles are then used to learn a dynamic Bayesian network which represents causal relationships between taxa and clinical variables. Testing our methods on three longitudinal microbiome data sets, we show that our pipeline improves upon prior methods developed for this task. We also discuss the biological insights provided by the models, which include several known and novel interactions. The extended CGBayesNets package is freely available under the MIT Open Source license agreement. The source code and documentation can be downloaded from https://github.com/jlugomar/longitudinal_microbiome_analysis_public . CONCLUSIONS: We propose a computational pipeline for analyzing longitudinal microbiome data. Our results provide evidence that microbiome alignments coupled with dynamic Bayesian networks improve predictive performance over previous methods and enhance our ability to infer biological relationships within the microbiome and between taxa and clinical factors.
Subject(s)
Computational Biology/methods , Microbiota , Algorithms , Bayes Theorem , Humans , Software
ABSTRACT
Protein complexes play a significant role in the core functionality of cells. These complexes are typically identified by detecting densely connected subgraphs in protein-protein interaction (PPI) networks. Recently, multiple large-scale mass spectrometry-based experiments have significantly increased the availability of PPI data in order to further expand the set of known complexes. However, high-throughput experimental data generally are incomplete, show limited agreement between experiments, and show frequent false positive interactions. There is a need for computational approaches that can address these limitations in order to improve the coverage and accuracy of human protein complexes. Here, we present a new method that integrates data from multiple heterogeneous experiments and sources in order to increase the reliability and coverage of predicted protein complexes. We first fused the heterogeneous data into a feature matrix and trained classifiers to score pairwise protein interactions. We next used graph based methods to combine pairwise interactions into predicted protein complexes. Our approach improves the accuracy and coverage of protein pairwise interactions, accurately identifies known complexes, and suggests both novel additions to known complexes and entirely new complexes. Our results suggest that integration of heterogeneous experimental data helps improve the reliability and coverage of diverse high-throughput mass-spectrometry experiments, leading to an improved global map of human protein complexes.