ABSTRACT
MOTIVATION: Knowledge of the specific cell types affected by genetic alterations in rare diseases is crucial for advancing diagnostics and treatments. Despite significant progress, the cell types involved in the majority of rare disease manifestations remain largely unknown. In this study, we integrated scRNA-seq data from non-diseased samples with known genetic disorder genes and phenotypic information to predict the specific cell types disrupted by pathogenic mutations for 482 disease phenotypes. RESULTS: We found significant phenotype-cell type associations focusing on differential expression and co-expression mechanisms. Our analysis revealed that 13% of the associations documented in the literature were captured through differential expression, while 42% were elucidated through co-expression analysis, also uncovering potential new associations. These findings underscore the critical role of cellular context in disease manifestation and highlight the potential of single-cell data for the development of cell-aware diagnostics and targeted therapies for rare diseases. AVAILABILITY: All code generated in this work is available at https://github.com/SergioAlias/sc-coex. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
MBROLE (Metabolites Biological Role) facilitates the biological interpretation of metabolomics experiments. It performs enrichment analysis of a set of chemical compounds through statistical analysis of annotations from several databases. The original MBROLE server was released in 2011 and, since then, different groups worldwide have used it to analyze metabolomics experiments from a variety of organisms. Here we present the latest version of the system, MBROLE3, accessible at http://csbg.cnb.csic.es/mbrole3. This new version contains updated annotations from previously included databases as well as a wide variety of new functional annotations, such as additional pathway databases and Gene Ontology terms. Of special relevance is the inclusion of a new category of annotations, 'indirect annotations', extracted from the scientific literature and from curated chemical-protein associations. The latter allow the analysis of enriched annotations of the proteins known to interact with the set of chemical compounds of interest. Results are provided in the form of interactive tables, formatted data to download, and graphical plots.
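As an illustration of what enrichment analysis of a compound set involves, the following is a minimal sketch of an over-representation test. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis; the abstract does not state which test MBROLE3 applies. The function names, compound identifiers, and annotation terms are hypothetical toy values.

```python
from math import comb

def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` items from `population`,
    of which `annotated` carry the annotation (one-sided test)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

def enrich(query, background, annotations):
    """Return (term, hits, p_value) tuples sorted by ascending p-value."""
    background = set(background)
    query = set(query) & background
    results = []
    for term, members in annotations.items():
        members = set(members) & background
        hits = len(query & members)
        p = hypergeom_tail(len(background), len(members), len(query), hits)
        results.append((term, hits, p))
    return sorted(results, key=lambda r: r[2])

# Toy data (made up for illustration only).
background = [f"C{i}" for i in range(1, 11)]
annotations = {
    "pathway_A": ["C1", "C2", "C3"],
    "pathway_B": ["C4", "C5", "C6", "C7", "C8"],
}
ranked = enrich(["C1", "C2", "C3"], background, annotations)
```

In practice the resulting p-values would also need a multiple-testing correction, which this sketch omits.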
Subject(s)
Metabolomics , Proteins , Software , Databases, Factual , Gene Ontology , Metabolomics/methods
ABSTRACT
Genetic and molecular analysis of rare disease is made difficult by the small numbers of affected patients. Phenotypic comorbidity analysis can help rectify this by combining information from individuals with similar phenotypes and looking for overlap in terms of shared genes and underlying functional systems. However, few studies have combined comorbidity analysis with genomic data. We present a computational approach that connects patient phenotypes based on phenotypic co-occurrence and uses genomic information related to the patient mutations to assign genes to the phenotypes, which are used to detect enriched functional systems. These phenotypes are clustered using network analysis to obtain functionally coherent phenotype clusters. We applied the approach to the DECIPHER database, containing phenotypic and genomic information for thousands of patients with heterogeneous rare disorders and copy number variants. Validity was demonstrated through overlap with known diseases, co-mention within the biomedical literature, semantic similarity measures, and patient cluster membership. These connected pairs formed multiple phenotype clusters, showing functional coherence, and mapped to genes and systems involved in similar pathological processes. Examples include claudin genes from the 22q11 genomic region associated with a cluster of phenotypes related to DiGeorge syndrome and genes related to the GO term anterior/posterior pattern specification associated with abnormal development. The clusters generated can help with the diagnosis of rare diseases, by suggesting additional phenotypes for a given patient and potential underlying functional systems. Other tools to find causal genes based on phenotype were also investigated. The approach has been implemented as a workflow, named PhenCo, which can be adapted to any set of patients for which phenomic and genomic data are available.
Full details of the analysis, including the clusters formed, their constituent functional systems and underlying genes are given. Code to implement the workflow is available from GitHub.
Subject(s)
Comorbidity , Genetic Predisposition to Disease , Genomics , Rare Diseases/genetics , DNA Copy Number Variations/genetics , Databases, Genetic , Genetic Association Studies , Genome, Human/genetics , Genotype , Humans , Mutation/genetics , Phenotype , Rare Diseases/diagnosis , Rare Diseases/pathology
ABSTRACT
MOTIVATION: Predicting the residues controlling a protein's interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them. RESULTS: In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect such residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology in a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both sequence conservation and an equivalent 'unsupervised' method that does not use interactome information. AVAILABILITY AND IMPLEMENTATION: http://csbg.cnb.csic.es/pazos/Xdet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Proteins , Software , Humans , Proteins/genetics , Sequence Alignment , Sequence Analysis, Protein
ABSTRACT
BACKGROUND: Assignment of chemical compounds to biological pathways is a crucial step to understand the relationship between the chemical repertory of an organism and its biology. Protein sequence profiles are very successful in capturing the main structural and functional features of a protein family, and can be used to assign new members to it based on matching of their sequences against these profiles. In this work, we extend this idea to chemical compounds, constructing a profile-inspired model for a set of related metabolites (those in the same biological pathway), based on a fragment-based vectorial representation of their chemical structures. RESULTS: We use this representation to predict the biological pathway of a chemical compound with good overall accuracy (AUC 0.74-0.90 depending on the database tested), and analyze some factors that affect performance. The approach, which is compared with equivalent methods, can in addition detect those molecular fragments characteristic of a pathway. CONCLUSIONS: The method is available as a graphical interactive web server at http://csbg.cnb.csic.es/iFragMent.
Subject(s)
Proteins , Software , Amino Acid Sequence , Databases, Factual , Internet
ABSTRACT
Copy number variation (CNV) related disorders tend to show complex phenotypic profiles that do not match known diseases. This makes it difficult to ascertain their underlying molecular basis. A potential solution is to compare the affected genomic regions for multiple patients that share a pathological phenotype, looking for commonalities. Here, we present a novel approach to associate phenotypes with functional systems, in terms of GO categories and KEGG and Reactome pathways, based on patient data. The approach uses genomic and phenomic data from the same patients, finding shared genomic regions between patients with similar phenotypes. These regions are mapped to genes to find associated functional systems. We applied the approach to analyse patients in the DECIPHER database with de novo CNVs, finding functional systems associated with most phenotypes, often due to mutations affecting related genes in the same genomic region. Manual inspection of the ten top-scoring phenotypes found multiple FunSys connections supported by previous studies for seven of them. The workflow also produces reports focussed on the genes and FunSys connected to the different phenotypes, alongside patient-specific reports, which give details of the associated genes and FunSys for each individual in the cohort. These can be run in "confidential" mode, preserving patient confidentiality. The workflow presented here can be used to associate phenotypes with functional systems using data at the level of a whole cohort of patients, identifying important connections that could not be found when considering them individually. The full workflow is available for download, enabling it to be run on any patient cohort for which phenotypic and CNV data are available.
Subject(s)
DNA Copy Number Variations , Genetic Predisposition to Disease , Genotype , Phenotype , Cohort Studies , Databases, Genetic , Humans
ABSTRACT
Daily work in molecular biology presently depends on a large number of computational tools. An in-depth, large-scale study of that 'ecosystem' of Web tools, its characteristics, interconnectivity, patterns of usage/citation, temporal evolution and rate of decay is crucial for understanding the forces that shape it and for informing initiatives aimed at its funding, long-term maintenance and improvement. In particular, the long-term maintenance of these tools is compromised because of their specific development model. Hundreds of published studies become irreproducible de facto, as the software tools used to conduct them become unavailable. In this study, we present a large-scale survey of >5400 publications describing Web servers within the two main bibliographic resources for disseminating new software developments in molecular biology. For all these servers, we studied their citation patterns, the subjects they address, their citation networks and the temporal evolution of these factors. We also analysed how these factors affect the availability of these servers (whether they are alive). Our results show that this ecosystem of tools is highly interconnected and adapts to the 'trendy' subjects at every moment. The servers present characteristic temporal patterns of citation/usage, and there is a worrying rate of server 'death', which is influenced by factors such as the server's popularity and the institution that hosts it. These results can inform initiatives aimed at the long-term maintenance of these resources.
Subject(s)
Molecular Biology/statistics & numerical data , Software , Computational Biology/methods , Computational Biology/trends , Internet , Molecular Biology/trends , Periodicals as Topic/statistics & numerical data , Software/trends
ABSTRACT
MOTIVATION: The results of some experimental and computational techniques are given in terms of large sets of organisms, especially prokaryotic. While their distinctive features can provide useful data regarding specific phenomena, there are no automated tools for extracting them. RESULTS: We present here the Bacterial Feature Finder web server, a tool to automatically interrogate sets of prokaryotic organisms provided by the user to evaluate their specific biological features. At the core of the system is a searchable database of qualitative and quantitative features compiled for more than 23 000 prokaryotic organisms. Both the input set of organisms and the background set used to calculate the enriched features can be directly provided by the user, or they can be obtained by searching the database. The results are presented via an interactive graphical interface, with links to external resources. AVAILABILITY AND IMPLEMENTATION: The web server is freely available at http://csbg.cnb.csic.es/BaFF. It has been tested in the main web browsers and does not require any special plug-ins or additional software. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Internet , Software , Computational Biology , Databases, Factual , Prokaryotic Cells
ABSTRACT
Co-evolution is a fundamental component of the theory of evolution and is essential for understanding the relationships between species in complex ecological networks. A wide range of co-evolution-inspired computational methods has been designed to predict molecular interactions, but it is only recently that important advances have been made. Breakthroughs in the handling of phylogenetic information and in disentangling indirect relationships have resulted in an improved capacity to predict interactions between proteins and contacts between different protein residues. Here, we review the main co-evolution-based computational approaches, their theoretical basis, potential applications and foreseeable developments.
Subject(s)
Computational Biology/methods , Evolution, Molecular , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Animals , Humans , Models, Genetic , Models, Molecular , Mutation , Phylogeny , Proteins/classification
ABSTRACT
As aberrant protein phosphorylation is a hallmark of tumor cells, the display of tumor-specific phosphopeptides by Human Leukocyte Antigen (HLA) class I molecules can be exploited in the treatment of cancer by T-cell-based immunotherapy. Yet, the characterization and prediction of HLA-I phospholigands is challenging as the molecular determinants of the presentation of such post-translationally modified peptides are not fully understood. Here, we employed a peptidomic workflow to identify 256 unique phosphorylated ligands associated with HLA-B*40, -B*27, -B*39, or -B*07. Remarkably, these phosphopeptides showed similar molecular features. Besides the specific anchor motifs imposed by the binding groove of each allotype, the predominance of phosphorylation at peptide position 4 (P4) became strikingly evident, as was the enrichment of basic residues at P1. To determine the structural basis of this observation, we carried out a series of peptide binding assays and solved the crystal structures of HLA-B*40 in complex with a phosphorylated ligand or its nonphosphorylated counterpart. Overall, our data provide a clear explanation for the common motif found in the phosphopeptidomes associated with different HLA-B molecules. The high prevalence of phosphorylation at P4 is dictated by the presence of the conserved residue Arg62 in the heavy chain, a structural feature shared by most HLA-B alleles. In contrast, the preference for basic residues at P1 is allotype-dependent and might be linked to the structure of the A pocket. This molecular understanding of the presentation of phosphopeptides by HLA-B molecules provides a basis for the improved prediction and identification of phosphorylated neo-antigens, as potentially used for cancer immunotherapy.
Subject(s)
HLA-B Antigens/chemistry , HLA-B Antigens/metabolism , Peptides/chemistry , Proteomics/methods , Amino Acid Motifs , Cell Line , Crystallography, X-Ray , HLA-B40 Antigen/chemistry , HLA-B40 Antigen/metabolism , Humans , Models, Molecular , Peptides/analysis , Phosphorylation , Protein Binding
ABSTRACT
BACKGROUND: The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as the only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. RESULTS: In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. CONCLUSIONS: These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
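To make the distinction between fully conserved positions and positions with a family-dependent conservation pattern concrete, here is a deliberately simple sketch. It is not one of the five methods evaluated in the study: it assumes the subfamily labels are already known, and the function name, alignment, and labels are hypothetical toy values.

```python
def classify_positions(alignment, subfamilies):
    """Label each alignment column as 'conserved', 'family-dependent',
    or 'variable'.

    alignment   -- list of equal-length aligned sequences
    subfamilies -- one subfamily label per sequence (assumed known here)
    """
    labels = []
    for column in zip(*alignment):
        if len(set(column)) == 1:
            labels.append("conserved")
            continue
        # Collect the residues observed within each subfamily.
        groups = {}
        for residue, family in zip(column, subfamilies):
            groups.setdefault(family, set()).add(residue)
        conserved_within = all(len(r) == 1 for r in groups.values())
        # Each subfamily must also carry a different residue.
        distinct_between = (
            len({next(iter(r)) for r in groups.values()}) == len(groups)
        )
        if conserved_within and distinct_between:
            labels.append("family-dependent")
        else:
            labels.append("variable")
    return labels

# Toy alignment: two subfamilies of two sequences each (made up).
alignment = ["ACDK", "ACEK", "AGDR", "AGER"]
subfamilies = ["f1", "f1", "f2", "f2"]
labels = classify_positions(alignment, subfamilies)
```

Real methods score positions continuously and typically infer the subfamilies from the alignment itself, so the strict all-or-nothing rule used here is only meant to show the idea.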
Subject(s)
Amino Acids/chemistry , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Binding Sites , Conserved Sequence , Molecular Sequence Annotation , Sequence Alignment , Time Factors
ABSTRACT
BACKGROUND: Epigenetic phenomena are crucial for explaining the phenotypic plasticity seen in the cells of different tissues, developmental stages and diseases, all holding the same DNA sequence. As technology now allows epigenetic information to be retrieved in a genome-wide fashion, massive epigenomic datasets are being accumulated in public repositories. New approaches are required to mine those data to extract useful knowledge. We present here an automatic approach for detecting genomic regions with epigenetic variation patterns across samples related to a grouping of these samples, as a way of detecting regions functionally associated with the phenomenon behind the classification. RESULTS: We show that the regions automatically detected by the method in the whole human genome associated with three different classifications of a set of epigenomes (cancer vs. healthy, brain vs. other organs, and fetal vs. adult tissues) are enriched in genes associated with these processes. CONCLUSIONS: The method is fully automatic and can exhaustively scan the whole human genome at any resolution using large collections of epigenomes as input, although it also produces good results with small datasets. Consequently, it will be valuable for obtaining functional information from the incoming epigenomic information as it continues to accumulate.
Subject(s)
Computational Biology/methods , Epigenesis, Genetic , Genome, Human , Automation , Brain/metabolism , Databases, Genetic , Fetus/metabolism , Humans , Neoplasms/genetics
ABSTRACT
Determining the residues that are important for the molecular activity of a protein is a topic of broad interest in biomedicine and biotechnology. This knowledge can help to understand the protein's molecular mechanism as well as to fine-tune its natural function, possibly with biotechnological or therapeutic implications. Some of the protein residues are essential for the function common to all members of a family of proteins, while others explain the particular specificities of certain subfamilies (like binding to different substrates or cofactors and distinct binding affinities). Owing to the difficulty in experimentally determining them, a number of computational methods were developed to detect these functional residues, generally known as 'specificity-determining positions' (or SDPs), from a collection of homologous protein sequences. These methods are mature enough to be routinely used by molecular biologists in directing experiments aimed at getting insight into the functional specificity of a family of proteins and eventually modifying it. In this review, we summarize some of the recent discoveries achieved through SDP computational identification in a number of relevant protein families, as well as the main approaches and software tools available to perform this type of analysis.
Subject(s)
Amino Acids/chemistry , Multigene Family , Proteins/chemistry , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Amino Acids/genetics , Conserved Sequence , Proteins/genetics , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods
ABSTRACT
Metabolites Biological Role (MBROLE) is a server that performs functional enrichment analysis of a list of chemical compounds derived from a metabolomics experiment, which allows this list to be interpreted in biological terms. Since its release in 2011, MBROLE has been used by different groups worldwide to analyse metabolomics experiments from a variety of organisms. Here we present the latest version of the system, MBROLE2, accessible at http://csbg.cnb.csic.es/mbrole2. MBROLE2 has been supplemented with 10 databases not available in the previous version, which allow analysis over a larger, richer set of vocabularies including metabolite-protein and drug-protein interactions. This new version performs automatic conversion of compound identifiers from different databases, thus simplifying usage. In addition, the user interface has been redesigned to generate an interactive, more intuitive representation of the results.
Subject(s)
Metabolic Networks and Pathways/genetics , Metabolomics , Small Molecule Libraries/metabolism , User-Computer Interface , Actinobacteria/genetics , Actinobacteria/metabolism , Animals , Arecaceae/genetics , Arecaceae/metabolism , Computer Graphics , Cordyceps/genetics , Cordyceps/metabolism , Databases, Chemical , Databases, Genetic , Escherichia coli/genetics , Escherichia coli/metabolism , Humans , Internet , Rats , Small Molecule Libraries/chemistry , Synechococcus/genetics , Synechococcus/metabolism
ABSTRACT
The CRISPR/Cas technology is enabling targeted genome editing in multiple organisms with unprecedented accuracy and specificity by using RNA-guided nucleases. A critical point when planning a CRISPR/Cas experiment is the design of the guide RNA (gRNA), which directs the nuclease and associated machinery to the desired genomic location. This gRNA has to fulfil the requirements of the nuclease and lack homology with other genome sites that could lead to off-target effects. Here we introduce the Breaking-Cas system for the design of gRNAs for CRISPR/Cas experiments, including those based on the Cas9 nuclease as well as others recently introduced. The server has unique features not available in other tools, including the possibility of using all eukaryotic genomes available in ENSEMBL (currently around 700), placing variable PAM sequences at 5' or 3' and setting the guide RNA length and the per-nucleotide scores. It can be freely accessed at: http://bioinfogp.cnb.csic.es/tools/breakingcas, and the code is available upon request.
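The first step of gRNA design, locating candidate target sites next to a PAM, can be sketched as follows. This is an illustrative sketch only and not the Breaking-Cas implementation: the function name and parameters are hypothetical, only the forward strand is scanned, and the homology search against the rest of the genome (the off-target check) is omitted. PAM letters are interpreted with the standard IUPAC nucleotide codes.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_guides(seq, pam="NGG", pam_side="3prime", guide_len=20):
    """Return (guide_start, guide, pam_match) for every forward-strand site.

    pam_side -- '3prime' if the PAM follows the guide (e.g. NGG for Cas9),
                '5prime' if the PAM precedes it.
    """
    seq = seq.upper()
    pam_re = "".join(IUPAC[base] for base in pam.upper())
    guide_re = "[ACGT]{%d}" % guide_len
    if pam_side == "3prime":
        pattern = re.compile("(?=(%s)(%s))" % (guide_re, pam_re))
    elif pam_side == "5prime":
        pattern = re.compile("(?=(%s)(%s))" % (pam_re, guide_re))
    else:
        raise ValueError("pam_side must be '3prime' or '5prime'")

    hits = []
    # The zero-width lookahead lets overlapping sites be reported.
    for match in pattern.finditer(seq):
        if pam_side == "3prime":
            guide, pam_match = match.group(1), match.group(2)
            start = match.start()
        else:
            pam_match, guide = match.group(1), match.group(2)
            start = match.start() + len(pam)
        hits.append((start, guide, pam_match))
    return hits

# Toy sequences (made up): one 3' NGG site and one 5' TTTV site.
cas9_sites = find_guides("A" * 20 + "TGG")
five_prime_sites = find_guides("TTTACCCCC", pam="TTTV",
                               pam_side="5prime", guide_len=5)
```

A complete design tool would also scan the reverse complement and rank candidates by their similarity to other genomic sites.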
Subject(s)
Bacterial Proteins/genetics , CRISPR-Cas Systems , Clustered Regularly Interspaced Short Palindromic Repeats , Endonucleases/genetics , Genome , RNA, Guide, Kinetoplastida/chemical synthesis , Software , Bacterial Proteins/metabolism , CRISPR-Associated Protein 9 , Endonucleases/metabolism , Eukaryota/genetics , Gene Editing , Information Storage and Retrieval , Internet , Nucleotide Motifs , RNA, Guide, Kinetoplastida/genetics
ABSTRACT
BACKGROUND: Clinical signs are a fundamental aspect of human pathologies. While disease diagnosis is problematic or impossible in many cases, signs are easier to perceive and categorize. Clinical signs are increasingly used, together with molecular networks, to prioritize detected variants in clinical genomics pipelines, even if the patient is still undiagnosed. Here we analyze the ability of these network-based methods to predict genes that underlie clinical signs from the human interactome. RESULTS: Our analysis reveals that these approaches can locate genes associated with clinical signs with variable performance that depends on the sign and associated disease. We analyzed several clinical and biological factors that explain these variable results, including number of genes involved (mono- vs. oligogenic diseases), mode of inheritance, type of clinical sign and gene product function. CONCLUSIONS: Our results indicate that the characteristics of the clinical signs and their related diseases should be considered for interpreting the results of network-prediction methods, such as those aimed at discovering disease-related genes and variants. These results are important due to the increasing use of clinical signs as an alternative to diseases for studying the molecular basis of human pathologies.
Subject(s)
Disease/genetics , Diagnosis , Genomics , Humans , Inheritance Patterns , Protein Interaction Mapping , Proteins/genetics
ABSTRACT
MOTIVATION: Many diseases are related by shared associated molecules and pathways, exhibiting comorbidities and common phenotypes, an indication of the continuous nature of the human pathological landscape. Although it is continuous, this landscape is always partitioned into discrete diseases when studied at the molecular level. Clinical signs are also important phenotypic descriptors that can reveal the molecular mechanisms that underlie pathological states, but have seldom been the subject of systemic research. Here, we quantify the modular nature of the clinical signs associated with genetic diseases in the human interactome. RESULTS: We found that clinical signs are reflected as modules at the molecular network level, at least to the same extent as diseases. They can thus serve as a valid complementary partition of the human pathological landscape, with implications for etiology research, diagnosis and treatment. CONTACT: monica.chagoyen@cnb.csic.es SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Models, Biological , Humans
ABSTRACT
MOTIVATION: The evolution of proteins cannot be fully understood without taking into account the coevolutionary linkages entangling them. From a practical point of view, coevolution between protein families has been used as a way of detecting protein interactions and functional relationships from genomic information. The most common approach to inferring protein coevolution involves the quantification of phylogenetic tree similarity using a family of methodologies termed mirrortree. In spite of their success, a fundamental problem of these approaches is the lack of an adequate statistical framework to assess the significance of a given coevolutionary score (tree similarity). As a consequence, a number of ad hoc filters and arbitrary thresholds are required in an attempt to obtain a final set of confident coevolutionary signals. RESULTS: In this work, we developed a method for associating confidence estimators (P values) to the tree-similarity scores, using a null model specifically designed for the tree comparison problem. We show how this approach largely improves the quality and coverage (number of pairs that can be evaluated) of the detected coevolution in all the stages of the mirrortree workflow, independently of the starting genomic information. This leads not only to a better understanding of protein coevolution and its biological implications, but also to a highly reliable and comprehensive network of predicted interactions, as well as to information on the substructure of macromolecular complexes using only genomic information. AVAILABILITY AND IMPLEMENTATION: The software and datasets used in this work are freely available at: http://csbg.cnb.csic.es/pMT/. CONTACT: pazos@cnb.csic.es SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
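The idea of attaching a P value to a tree-similarity score can be sketched as below. The score is taken here to be the Pearson correlation between the two families' inter-species distance matrices, which is an assumption about how the similarity is quantified. The null model is a generic label-permutation (Mantel-style) test swapped in for illustration; it is not the null model developed in this work, whose details the abstract does not give. All function names and the toy matrices are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def pairwise(dist, order):
    """Distances for every pair of species, taken in the given order."""
    return [dist[i][j] for i, j in combinations(order, 2)]

def tree_similarity(dist_a, dist_b):
    """Correlation between two symmetric inter-species distance matrices."""
    order = list(range(len(dist_a)))
    return pearson(pairwise(dist_a, order), pairwise(dist_b, order))

def permutation_p_value(dist_a, dist_b, n_perm=999, seed=0):
    """Empirical P value from shuffling the species labels of one matrix."""
    rng = random.Random(seed)
    order = list(range(len(dist_a)))
    observed = tree_similarity(dist_a, dist_b)
    reference = pairwise(dist_a, order)
    at_least = 0
    for _ in range(n_perm):
        shuffled = order[:]
        rng.shuffle(shuffled)
        if pearson(reference, pairwise(dist_b, shuffled)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Toy matrices (made up): family B evolves exactly twice as fast as A.
dist_a = [
    [0, 1, 4, 6, 9],
    [1, 0, 3, 5, 8],
    [4, 3, 0, 2, 7],
    [6, 5, 2, 0, 10],
    [9, 8, 7, 10, 0],
]
dist_b = [[2 * d for d in row] for row in dist_a]
score, p_value = permutation_p_value(dist_a, dist_b)
```

A naive permutation test like this ignores the shared phylogenetic background of the two trees, which is one reason a purpose-built null model is needed.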
Subject(s)
Evolution, Molecular , Genome, Human , Macromolecular Substances/chemistry , Proteins/chemistry , Databases, Protein , Humans , Internet , Phylogeny , Sequence Analysis, Protein , Software
ABSTRACT
(+)-7-iso-Jasmonoyl-L-isoleucine (JA-Ile) regulates developmental and stress responses in plants. Its perception involves the formation of a ternary complex with the F-box COI1 and a member of the JAZ family of co-repressors and leads to JAZ degradation. Coronatine (COR) is a bacterial phytotoxin that functionally mimics JA-Ile and interacts with the COI1-JAZ co-receptor with higher affinity than JA-Ile. On the basis of the co-receptor structure, we designed ligand derivatives that spatially impede the interaction of the co-receptor proteins and, therefore, should act as competitive antagonists. One derivative, coronatine-O-methyloxime (COR-MO), has strong activity in preventing the COI1-JAZ interaction, JAZ degradation and the effects of JA-Ile or COR on several JA-mediated responses in Arabidopsis thaliana. Moreover, it potentiates plant resistance, preventing the effect of bacterially produced COR during Pseudomonas syringae infections in different plant species. In addition to the utility of COR-MO for plant biology research, our results underscore its biotechnological potential for safer and sustainable agriculture.
Subject(s)
Amino Acids, Neutral/pharmacology , Amino Acids/chemistry , Cyclopentanes/metabolism , Indenes/chemistry , Oximes/pharmacology , Oxylipins/metabolism , Amino Acids/metabolism , Amino Acids/pharmacology , Anthocyanins/metabolism , Arabidopsis/drug effects , Arabidopsis/metabolism , Arabidopsis/microbiology , Arabidopsis Proteins/antagonists & inhibitors , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Botrytis/pathogenicity , Cyclopentanes/pharmacology , DNA-Binding Proteins/antagonists & inhibitors , DNA-Binding Proteins/metabolism , Drug Design , Gene Expression Regulation, Plant , Indenes/metabolism , Indenes/pharmacology , Isoleucine/analogs & derivatives , Isoleucine/metabolism , Isoleucine/pharmacology , Ligands , Plant Roots/growth & development , Plant Roots/metabolism , Plants, Genetically Modified , Pseudomonas syringae/pathogenicity , Repressor Proteins/genetics , Repressor Proteins/metabolism , Transcription Factors/antagonists & inhibitors , Transcription Factors/metabolism
ABSTRACT
ODCs (Orphan Disease Connections), available at http://csbg.cnb.csic.es/odcs, is a novel resource to explore potential molecular relations between rare diseases. These molecular relations have been established through the integration of disease susceptibility genes and human protein-protein interactions. The database currently contains 54,941 relations between 3032 diseases.