1.
Sci Rep ; 14(1): 6009, 2024 03 12.
Article En | MEDLINE | ID: mdl-38472223

Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.


Algorithms , Proteins , Sequence Alignment , Proteins/genetics , Biological Evolution , Phylogeny , Computational Biology/methods
2.
Mol Cell Proteomics ; 22(10): 100640, 2023 Oct.
Article En | MEDLINE | ID: mdl-37659604

The "Protein Abundances Across Organisms" database (PaxDb) is an integrative metaresource dedicated to protein abundance levels, in tissue-specific or whole-organism proteomes. PaxDb focuses on computing best-estimate abundances for proteins in normal/healthy contexts and expresses abundance values for each protein in "parts per million" in relation to all other protein molecules in the cell. The uniform data reprocessing, quality scoring, and integrated orthology relations have made PaxDb one of the preferred tools for comparisons between individual datasets, tissues, or organisms. In describing the latest version 5.0 of PaxDb, we particularly emphasize the data integration from various types of raw data and how we expanded the number of organisms and tissue groups as well as the proteome coverage. The current collection of PaxDb includes 831 original datasets from 170 species, including 22 Archaea, 81 Bacteria, and 67 Eukaryota. Apart from detailing the data update, we also present a comparative analysis of the human proteome subset of PaxDb against the two most widely used human proteome data resources: Human Protein Atlas and Genotype-Tissue Expression. Lastly, through our protein abundance data, we reveal an evolutionary trend in the usage of sulfur-containing amino acids in the proteomes of Fungi.
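The "parts per million" normalization described in this abstract can be illustrated with a minimal sketch. It assumes only that each protein has a non-negative raw abundance estimate and that ppm is that estimate divided by the sum over all proteins, scaled by one million. The function name `to_ppm` and the example counts are hypothetical and are not taken from PaxDb.

```python
def to_ppm(raw_abundance):
    """Scale raw per-protein abundance estimates to parts per million (ppm).

    Each value is expressed relative to the sum over all proteins,
    so the returned values add up to 1,000,000.
    """
    total = sum(raw_abundance.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {
        protein: value / total * 1_000_000
        for protein, value in raw_abundance.items()
    }


# Hypothetical raw estimates for three proteins (illustration only).
example = {"P1": 50, "P2": 30, "P3": 20}
ppm = to_ppm(example)
```

Because the values are relative to the whole proteome, they are comparable across datasets only when the datasets cover a similar set of proteins.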

3.
Bioinform Adv ; 3(1): vbad050, 2023.
Article En | MEDLINE | ID: mdl-37123454

Motivation: Alternative splicing, as an essential regulatory mechanism in normal mammalian cells, is frequently disturbed in cancer and other diseases. Switches in the expression of most dominant alternative isoforms can alter protein interaction networks of associated genes giving rise to disease and disease progression. Here, we present CanIsoNet, a database to view, browse and search isoform switching events in diseases. CanIsoNet is the first webserver that incorporates isoform expression data with STRING interaction networks and ClinVar annotations to predict the pathogenic impact of isoform switching events in various diseases. Results: Data in CanIsoNet can be browsed by disease or searched by genes or isoforms in annotation-rich data tables. Various annotations for 11 811 isoforms and 14 357 unique isoform switching events across 31 different disease types are available. The network density score for each disease-specific isoform, PFAM domain IDs of disrupted interactions, domain structure visualization of transcripts and expression data of switched isoforms for each sample are given. Additionally, the genes annotated in ClinVar are highlighted in interactive interaction networks. Availability and implementation: CanIsoNet is freely available at https://www.caniso.net. The source codes can be found under a Creative Commons License at https://github.com/kahramanlab/CanIsoNet_Web. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

4.
J Proteome Res ; 22(2): 637-646, 2023 02 03.
Article En | MEDLINE | ID: mdl-36512705

Biological networks are often used to represent complex biological systems, which can contain several types of entities. Analysis and visualization of such networks is supported by the Cytoscape software tool and its many apps. While earlier versions of stringApp focused on providing intraspecies protein-protein interactions from the STRING database, the new stringApp 2.0 greatly improves the support for heterogeneous networks. Here, we highlight new functionality that makes it possible to create networks that contain proteins and interactions from STRING as well as other biological entities and associations from other sources. We exemplify this by complementing a published SARS-CoV-2 interactome with interactions from STRING. We have also extended stringApp with new data and query functionality for protein-protein interactions between eukaryotic parasites and their hosts. We show how this can be used to retrieve and visualize a cross-species network for a malaria parasite, its host, and its vector. Finally, the latest stringApp version has an improved user interface, allows retrieval of both functional associations and physical interactions, and supports group-wise enrichment analysis of different parts of a network to aid biological interpretation. stringApp is freely available at https://apps.cytoscape.org/apps/stringapp.


COVID-19 , Humans , SARS-CoV-2 , Software , Proteins , Eukaryota
5.
Nucleic Acids Res ; 51(D1): D760-D766, 2023 01 06.
Article En | MEDLINE | ID: mdl-36408900

The interpretation of genomic, transcriptomic and other microbial 'omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/.


Genome , Prokaryotic Cells , Databases, Genetic , Genomics , Molecular Sequence Annotation , Bacteria/classification , Bacteria/genetics
6.
Nucleic Acids Res ; 51(D1): D389-D394, 2023 01 06.
Article En | MEDLINE | ID: mdl-36399505

The eggNOG (evolutionary gene genealogy Non-supervised Orthologous Groups) database is a bioinformatics resource providing orthology data and comprehensive functional information for organisms from all domains of life. Here, we present a major update of the database and website (version 6.0), which increases the number of covered organisms to 12 535 reference species, expands functional annotations, and implements new functionality. In total, eggNOG 6.0 provides a hierarchy of over 17M orthologous groups (OGs) computed at 1601 taxonomic levels, spanning 10 756 bacterial, 457 archaeal and 1322 eukaryotic organisms. OGs have been thoroughly annotated using recent knowledge from functional databases, including KEGG, Gene Ontology, UniProtKB, BiGG, CAZy, CARD, PFAM and SMART. eggNOG also offers phylogenetic trees for all OGs, maximising utility and versatility for end users while allowing researchers to investigate the evolutionary history of speciation and duplication events as well as the phylogenetic distribution of functional terms within each OG. Furthermore, the eggNOG 6.0 website contains new functionality to mine orthology and functional data with ease, including the possibility of generating phylogenetic profiles for multiple OGs across species or identifying single-copy OGs at custom taxonomic levels. eggNOG 6.0 is available at http://eggnog6.embl.de.


Databases, Genetic , Genomics , Phylogeny , Computational Biology , Eukaryota/genetics
7.
Nucleic Acids Res ; 51(D1): D638-D646, 2023 01 06.
Article En | MEDLINE | ID: mdl-36370105

Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.


Protein Interaction Mapping , Proteins , Protein Interaction Mapping/methods , Databases, Protein , Proteins/genetics , Proteins/metabolism , Genomics , Proteomics , User-Computer Interface
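The abstract above notes that STRING data can be accessed programmatically. The following is a minimal sketch of building such a request URL. The abstract does not describe the interface, so the endpoint layout (`/api/<format>/network`) and the parameter names `identifiers`, `species` and `required_score` are assumptions based on the public STRING REST interface and should be checked against the current API documentation. The helper name `string_network_url` is hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Assumed base URL of the STRING REST interface.
STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species, required_score=None,
                       output_format="tsv"):
    """Build a URL requesting the interaction network among `identifiers`.

    identifiers    -- list of protein names or STRING identifiers
    species        -- NCBI taxonomy ID (for example 9606 for human)
    required_score -- optional minimum combined score (assumed range 0-1000)
    """
    params = {
        # Multiple identifiers are assumed to be separated by a carriage
        # return, which urlencode percent-encodes as %0D.
        "identifiers": "\r".join(identifiers),
        "species": species,
    }
    if required_score is not None:
        params["required_score"] = required_score
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


url = string_network_url(["TP53", "MDM2"], species=9606, required_score=700)
```

The resulting URL could then be fetched with any HTTP client. Keeping URL construction separate from the request makes the query easy to inspect and log before sending it.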
8.
Brief Bioinform ; 23(5)2022 09 20.
Article En | MEDLINE | ID: mdl-36088548

A knowledge-based grouping of genes into pathways or functional units is essential for describing and understanding cellular complexity. However, it is not always clear a priori how and at what level of specificity functionally interconnected genes should be partitioned into pathways, for a given application. Here, we assess and compare nine existing and two conceptually novel functional classification systems, with respect to their discovery power and generality in gene set enrichment testing. We base our assessment on a collection of nearly 2000 functional genomics datasets provided by users of the STRING database. With these real-life and diverse queries, we assess which systems typically provide the most specific and complete enrichment results. We find many structural and performance differences between classification systems. Overall, the well-established, hierarchically organized pathway annotation systems yield the best enrichment performance, despite covering substantial parts of the human genome in general terms only. On the other hand, the more recent unsupervised annotation systems perform strongest in understudied areas and organisms, and in detecting more specific pathways, albeit with less informative labels.


Genomics , Software , Databases, Factual , Databases, Genetic , Genomics/methods , Humans
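Gene set enrichment testing, as discussed in the abstract above, is often illustrated with a hypergeometric over-representation test. The sketch below shows that generic test only. It is not necessarily the statistic used in the study, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """One-sided over-representation p-value, P(X >= overlap).

    population -- number of genes in the background
    annotated  -- background genes carrying the pathway annotation
    sample     -- number of genes in the query set
    overlap    -- query genes carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ..., upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, 200 in the pathway,
# a query of 100 genes of which 8 fall in the pathway.
p_value = hypergeom_enrichment_p(20000, 200, 100, 8)
```

In practice the p-values for all tested pathways would also be corrected for multiple testing, which this sketch omits.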
10.
Genome Biol ; 22(1): 178, 2021 06 13.
Article En | MEDLINE | ID: mdl-34120611

Genomes are critical units in microbiology, yet ascertaining quality in prokaryotic genome assemblies remains a formidable challenge. We present GUNC (the Genome UNClutterer), a tool that accurately detects and quantifies genome chimerism based on the lineage homogeneity of individual contigs using a genome's full complement of genes. GUNC complements existing approaches by targeting previously underdetected types of contamination: we conservatively estimate that 5.7% of genomes in GenBank, 5.2% in RefSeq, and 15-30% of pre-filtered "high-quality" metagenome-assembled genomes in recent studies are undetected chimeras. GUNC provides a fast and robust tool to substantially improve prokaryotic genome quality.


Chimerism , Computational Biology/methods , Genome, Bacterial , Metagenome , Proteobacteria/genetics , Software , Contig Mapping , Metagenomics/methods , Phylogeny , Prokaryotic Cells/cytology , Prokaryotic Cells/metabolism
11.
Nucleic Acids Res ; 49(D1): D605-D612, 2021 01 08.
Article En | MEDLINE | ID: mdl-33237311

Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.


Databases, Protein , Protein Interaction Mapping , Proteins/genetics , User-Computer Interface
12.
Sci Rep ; 10(1): 14453, 2020 09 02.
Article En | MEDLINE | ID: mdl-32879328

Under normal conditions, cells of almost all tissue types express the same predominant canonical transcript isoform at each gene locus. In cancer, however, splicing regulation is often disturbed, leading to cancer-specific switches in the most dominant transcripts (MDT). To address the pathogenic impact of these switches, we have analyzed isoform-specific protein-protein interaction disruptions in 1,209 cancer samples covering 27 different cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genomics Consortium (ICGC). Our study revealed large variations in the number of cancer-specific MDT (cMDT) with the highest frequency in cancers of female reproductive organs. Interestingly, in contrast to the mutational load, cancers arising from the same primary tissue had a similar number of cMDT. Some cMDT were found in 100% of all samples in a cancer type, making them candidates for diagnostic biomarkers. cMDT tend to be located at densely populated network regions where they disrupted protein interactions in the proximity of pathogenic cancer genes. A gene ontology enrichment analysis showed that these disruptions occurred mostly in protein translation and RNA splicing pathways. Interestingly, samples with mutations in the spliceosomal complex tend to have a higher number of cMDT, while other transcript expressions correlated with mutations in non-coding splice-site and promoter regions of their genes. This work demonstrates for the first time the large extent of cancer-specific alterations in alternative splicing for 27 different cancer types. It highlights distinct and common patterns of cMDT and suggests novel pathogenic transcripts and markers that induce large network disruptions in cancers.


Genomics , Neoplasm Proteins/genetics , Neoplasms/genetics , Protein Isoforms/genetics , Alternative Splicing/genetics , Female , Gene Expression Regulation, Neoplastic/genetics , Genitalia, Female/metabolism , Genitalia, Female/pathology , Humans , Male , Mutation , Neoplasms/pathology , RNA Splicing/genetics , Signal Transduction/genetics , Spliceosomes/genetics , Transcription, Genetic/genetics
13.
Nucleic Acids Res ; 48(W1): W538-W545, 2020 07 02.
Article En | MEDLINE | ID: mdl-32374845

The identification of orthologs (genes in different species which descended from the same gene in their last common ancestor) is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources have been conceived to address this problem, but benchmarking and interpreting them is fraught with difficulties (need to compare them on a common input dataset, absence of ground truth, computational cost of calling orthologs). To address this, the Quest for Orthologs consortium maintains a reference set of proteomes and provides a web server for continuous orthology benchmarking (http://orthology.benchmarkservice.org). Furthermore, consensus ortholog calls derived from public benchmark submissions are provided on the Alliance of Genome Resources website, the joint portal of NIH-funded model organism databases.


Multigene Family , Proteome , Software , Animals , Benchmarking , Consensus , Genomics , Humans , Mice , Phylogeny , Rats
14.
Mol Cell ; 74(5): 1086-1102.e5, 2019 06 06.
Article En | MEDLINE | ID: mdl-31101498

Kinase and phosphatase overexpression drives tumorigenesis and drug resistance. We previously developed a mass-cytometry-based single-cell proteomics approach that enables quantitative assessment of overexpression effects on cell signaling. Here, we applied this approach in a human kinome- and phosphatome-wide study to assess how 649 individually overexpressed proteins modulated cancer-related signaling in HEK293T cells in an abundance-dependent manner. Based on these data, we expanded the functional classification of human kinases and phosphatases and showed that the overexpression effects include non-catalytic roles. We detected 208 previously unreported signaling relationships. The signaling dynamics analysis indicated that the overexpression of ERK-specific phosphatases sustains proliferative signaling. This suggests a phosphatase-driven mechanism of cancer progression. Moreover, our analysis revealed a drug-resistant mechanism through which overexpression of tyrosine kinases, including SRC, FES, YES1, and BLK, induced MEK-independent ERK activation in melanoma A375 cells. These proteins could predict drug sensitivity to BRAF-MEK concurrent inhibition in cells carrying BRAF mutations.


Carcinogenesis/genetics , Melanoma/genetics , Phosphoric Monoester Hydrolases/genetics , Phosphotransferases/genetics , Proto-Oncogene Proteins B-raf/genetics , Cell Proliferation/genetics , Drug Resistance, Neoplasm/genetics , Gene Expression Regulation, Neoplastic/drug effects , HEK293 Cells , Humans , Melanoma/enzymology , Melanoma/pathology , Mutation , Phosphorylation/genetics , Protein Kinase Inhibitors/pharmacology , Proteomics , Signal Transduction/drug effects
15.
BMC Bioinformatics ; 20(1): 228, 2019 May 06.
Article En | MEDLINE | ID: mdl-31060495

BACKGROUND: An orthologous group (OG) comprises a set of orthologous and paralogous genes that share a last common ancestor (LCA). OGs are defined with respect to a chosen taxonomic level, which delimits the position of the LCA in time to a specified speciation event. A hierarchy of OGs expands on this notion, connecting more general OGs, distant in time, to more recent, fine-grained OGs, thereby spanning multiple levels of the tree of life. Large scale inference of OG hierarchies with independently computed taxonomic levels can suffer from inconsistencies between successive levels, such as the position in time of a duplication event. This can be due to confounding genetic signal or algorithmic limitations. Importantly, inconsistencies limit the potential use of OGs for functional annotation and third-party applications. RESULTS: Here we present a new methodology to ensure hierarchical consistency of OGs across taxonomic levels. To resolve an inconsistency, we subsample the protein space of the OG members and perform gene tree-species tree reconciliation for each sampling. Unlike previous approaches, by subsampling the protein space, we avoid the notoriously difficult task of accurately building and reconciling very large phylogenies. We implement the method into a high-throughput pipeline and apply it to the eggNOG database. We use independent protein domain definitions to validate its performance. CONCLUSION: The presented consistency pipeline shows that, contrary to previous limitations, tree reconciliation can be a useful instrument for the construction of OG hierarchies. The key lies in the combination of sampling smaller trees and aggregating their reconciliations for robustness. Results show comparable or greater performance to previous pipelines. The code is available on GitHub at: https://github.com/meringlab/og_consistency_pipeline.


Databases, Protein/standards , Phylogeny
16.
Nucleic Acids Res ; 47(D1): D607-D613, 2019 01 08.
Article En | MEDLINE | ID: mdl-30476243

Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.


Genomics/methods , Protein Interaction Mapping/methods , Software , Animals , Databases, Genetic , Gene Ontology , Humans
17.
Nucleic Acids Res ; 47(D1): D309-D314, 2019 01 08.
Article En | MEDLINE | ID: mdl-30418610

eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the number of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de.


Conserved Sequence , Databases, Genetic , Evolution, Molecular , Phylogeny , Sequence Homology , Animals , Classification , Eukaryota/genetics , Gene Duplication , Gene Ontology , Genes, Viral , Genome , Humans , Molecular Sequence Annotation , Proteome , Sequence Alignment , Structure-Activity Relationship
18.
Viruses ; 10(10)2018 09 23.
Article En | MEDLINE | ID: mdl-30249048

As viruses continue to pose risks to global health, having a better understanding of virus-host protein-protein interactions aids in the development of treatments and vaccines. Here, we introduce Viruses.STRING, a protein-protein interaction database specifically catering to virus-virus and virus-host interactions. This database combines evidence from experimental and text-mining channels to provide combined probabilities for interactions between viral and host proteins. The database contains 177,425 interactions between 239 viruses and 319 hosts. The database is publicly available at viruses.string-db.org, and the interaction data can also be accessed through the latest version of the Cytoscape STRING app.


Databases, Protein , Host-Pathogen Interactions , Protein Interaction Mapping , Proteins/metabolism , Viruses/metabolism , Animals , Gene Ontology , Humans , Probability , Protein Binding , Protein Interaction Maps , Software Design
19.
Mol Biol Evol ; 34(8): 2115-2122, 2017 08 01.
Article En | MEDLINE | ID: mdl-28460117

Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, and most pipelines are relatively inaccessible (e.g., new assignments only available through database updates), less precise homology-based functional transfer is still the default for (meta-)genome annotation. We, therefore, developed eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from the eggNOG database. To validate our method, we benchmarked Gene Ontology (GO) predictions against two widely used homology-based approaches: BLAST and InterProScan. Orthology filters applied to BLAST results reduced the rate of false positive assignments by 11%, and increased the ratio of experimentally validated terms recovered over all terms assigned per protein by 15%. Compared with InterProScan, eggNOG-mapper achieved similar proteome coverage and precision while predicting, on average, 41 more terms per protein and increasing the rate of experimentally validated terms recovered over total term assignments per protein by 35%. EggNOG-mapper predictions scored within the top-5 methods in the three GO categories using the CAFA2 NK-partial benchmark. Finally, we evaluated eggNOG-mapper for functional annotation of metagenomics data, yielding better performance than InterProScan. eggNOG-mapper runs ∼15× faster than BLAST and at least 2.5× faster than InterProScan. The tool is available standalone and as an online service at http://eggnog-mapper.embl.de.


Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Computer Simulation , Databases, Genetic , Databases, Protein , Gene Ontology , Genome/genetics , Phylogeny , Sequence Alignment/statistics & numerical data , Software
20.
Nucleic Acids Res ; 45(D1): D362-D368, 2017 01 04.
Article En | MEDLINE | ID: mdl-27924014

A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein-protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein-protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.


Computational Biology/methods , Databases, Protein , Software , Models, Molecular , Protein Binding , Protein Conformation , Protein Interaction Mapping , Protein Interaction Maps , Proteins/chemistry , Proteins/metabolism , Structure-Activity Relationship , User-Computer Interface , Web Browser
...