Abstract
Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.
Subject(s)
Benchmarking, RNA-Seq
Abstract
Accurate inference of gene regulatory networks (GRN) is an essential component of systems biology, and there is a constant development of new inference methods. The most common approach to assess accuracy for publications is to benchmark the new method against a selection of existing algorithms. This often leads to a very limited comparison, potentially biasing the results, which may stem from tuning the benchmark's properties or incorrect application of other methods. These issues can be avoided by a web server with a broad range of data properties and inference algorithms, that makes it easy to perform comprehensive benchmarking of new methods, and provides a more objective assessment. Here we present https://GRNbenchmark.org/ - a new web server for benchmarking GRN inference methods, which provides the user with a set of benchmarks with several datasets, each spanning a range of properties including multiple noise levels. As soon as the web server has performed the benchmarking, the accuracy results are made privately available to the user via interactive summary plots and underlying curves. The user can then download these results for any purpose, and decide whether or not to make them public to share with the community.
Subject(s)
Benchmarking, Gene Regulatory Networks, Algorithms, Computers, Systems Biology/methods
Abstract
SUMMARY: Predicting orthologs, genes in different species having shared ancestry, is an important task in bioinformatics. Orthology prediction tools are required to make accurate and fast predictions, in order to analyze large amounts of data within a feasible time frame. InParanoid is a well-known algorithm for orthology analysis, shown to perform well in benchmarks, but having the major limitation of long runtimes on large datasets. Here, we present an update to the InParanoid algorithm that can use the faster tool DIAMOND instead of BLAST for the homolog search step. We show that it reduces the runtime by 94%, while still obtaining similar performance in the Quest for Orthologs benchmark. AVAILABILITY AND IMPLEMENTATION: The source code is available at (https://bitbucket.org/sonnhammergroup/inparanoid). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms, Software
Abstract
MOTIVATION: Inferring an accurate gene regulatory network (GRN) has long been a key goal in the field of systems biology. To do this, it is important to find a suitable balance between the maximum number of true-positive and the minimum number of false-positive interactions. Another key feature is that the inference method can handle the large size of modern experimental data, meaning the method needs to be both fast and accurate. The Least Squares Cut-Off (LSCO) method can fulfill both these criteria; however, as it is based on least squares, it is vulnerable to the known issue of amplifying extreme values, small or large. In GRN inference this manifests itself as genes that are erroneously hyper-connected to a large fraction of all genes due to extremely low fold-change values. RESULTS: We developed a GRN inference method called Least Squares Cut-Off with Normalization (LSCON) that tackles this problem. LSCON extends the LSCO algorithm by regularization to avoid hyper-connected genes and thereby reduce false positives. The regularization used is based on normalization, which removes effects of extreme values on the fit. We benchmarked LSCON and compared it to Genie3, LASSO, LSCO and Ridge regression, in terms of accuracy, speed and tendency to predict hyper-connected genes. The results show that LSCON achieves better or equal accuracy compared to LASSO, the best existing method, especially for data with extreme values. Thanks to the speed of least squares regression, LSCON does this an order of magnitude faster than LASSO. AVAILABILITY AND IMPLEMENTATION: Data: https://bitbucket.org/sonnhammergrni/lscon; Code: https://bitbucket.org/sonnhammergrni/genespider. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms, Gene Regulatory Networks, Least-Squares Analysis, Systems Biology, Benchmarking
Abstract
MOTIVATION: Pathway annotation tools are indispensable for the interpretation of a wide range of experiments in life sciences. Network-based algorithms have recently been developed which are more sensitive than traditional overlap-based algorithms, but there is still a lack of good online tools for network-based pathway analysis. RESULTS: We present PathwAX II, a pathway analysis web tool based on network crosstalk analysis using the BinoX algorithm. It offers several new features compared with the first version, including interactive graphical network visualization of the crosstalk between a query gene set and an enriched pathway, and the addition of Reactome pathways. AVAILABILITY AND IMPLEMENTATION: PathwAX II is available at http://pathwax.sbc.su.se. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms, Software, Cell Physiological Phenomena
Abstract
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Subject(s)
Computational Biology/statistics & numerical data, Protein Databases, Proteins/metabolism, Proteome/metabolism, Animals, COVID-19/epidemiology, COVID-19/prevention & control, COVID-19/virology, Computational Biology/methods, Epidemics, Humans, Internet, Molecular Models, Tertiary Protein Structure, Proteins/chemistry, Proteins/genetics, Proteome/classification, Proteome/genetics, Amino Acid Repetitive Sequences/genetics, SARS-CoV-2/genetics, SARS-CoV-2/physiology, Protein Sequence Analysis/methods
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Subject(s)
Computational Biology/methods, Genome, Genetic Databases
Abstract
MOTIVATION: Accurate inference of gene regulatory interactions is of importance for understanding the mechanisms of underlying biological processes. For gene expression data gathered from targeted perturbations, gene regulatory network (GRN) inference methods that use the perturbation design are the top performing methods. However, the connection between the perturbation design and gene expression can be obfuscated due to problems, such as experimental noise or off-target effects, limiting the methods' ability to reconstruct the true GRN. RESULTS: In this study, we propose an algorithm, IDEMAX, to infer the effective perturbation design from gene expression data in order to eliminate the potential risk of fitting a disconnected perturbation design to gene expression. We applied IDEMAX to synthetic data from two different data generation tools, GeneNetWeaver and GeneSPIDER, and assessed its effect on the experiment design matrix as well as the accuracy of the GRN inference, followed by application to a real dataset. The results show that our approach consistently improves the accuracy of GRN inference compared to using the intended perturbation design when much of the signal is hidden by noise, which is often the case for real data. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/sonnhammergrni/idemax. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Abstract
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
Subject(s)
Protein Databases, Proteins/classification, Molecular Sequence Annotation, Protein Domains, Proteins/chemistry, Amino Acid Repetitive Sequences
Abstract
MOTIVATION: Inference of gene regulatory networks (GRNs) from perturbation data can give detailed mechanistic insights of a biological system. Many inference methods exist, but the resulting GRN is generally sensitive to the choice of method-specific parameters. Even though the inferred GRN is optimal given the parameters, many links may be wrong or missing if the data is not informative. To make GRN inference reliable, a method is needed to estimate the support of each predicted link as the method parameters are varied. RESULTS: To achieve this we have developed a method called nested bootstrapping, which applies a bootstrapping protocol to GRN inference, and by repeated bootstrap runs assesses the stability of the estimated support values. To translate bootstrap support values to false discovery rates we run the same pipeline with shuffled data as input. This provides a general method to control the false discovery rate of GRN inference that can be applied to any setting of inference parameters, noise level, or data properties. We evaluated nested bootstrapping on a simulated dataset spanning a range of such properties, using the LASSO, Least Squares, RNI, GENIE3 and CLR inference methods. An improved inference accuracy was observed in almost all situations. Nested bootstrapping was incorporated into the GeneSPIDER package, which was also used for generating the simulated networks and data, as well as running and analyzing the inferences. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/sonnhammergrni/genespider/src/NB/%2B Methods/NestBoot.m.
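The bootstrapping protocol described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the NestBoot code from GeneSPIDER: the inference step is replaced by a simple correlation threshold (a stand-in for LASSO, Least Squares and the other methods named above), and every function name, parameter and data value is hypothetical. Each bootstrap run resamples the experiments with replacement and records how often each link is inferred; the run is repeated several times to see how stable the support is, and the same pipeline is run on shuffled data to obtain a null reference.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def infer_links(samples, threshold=0.8):
    """Toy stand-in for a GRN inference method: link genes with |r| >= threshold."""
    n_genes = len(samples[0])
    columns = [[s[g] for s in samples] for g in range(n_genes)]
    return {
        (i, j)
        for i in range(n_genes)
        for j in range(i + 1, n_genes)
        if abs(pearson(columns[i], columns[j])) >= threshold
    }


def bootstrap_support(samples, n_boot, rng):
    """Fraction of bootstrap resamples in which each link is inferred."""
    counts = {}
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        for link in infer_links(resample):
            counts[link] = counts.get(link, 0) + 1
    return {link: c / n_boot for link, c in counts.items()}


def nested_support(samples, n_outer, n_boot, rng):
    """Repeat the bootstrap n_outer times; report (min, max) support per link."""
    runs = [bootstrap_support(samples, n_boot, rng) for _ in range(n_outer)]
    links = set().union(*runs)
    return {
        link: (min(r.get(link, 0.0) for r in runs), max(r.get(link, 0.0) for r in runs))
        for link in links
    }


def shuffle_genes(samples, rng):
    """Null data: permute each gene's values independently across samples."""
    columns = [list(col) for col in zip(*samples)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


rng = random.Random(0)
# Hypothetical toy data: gene 1 tracks gene 0, gene 2 is independent noise.
data = [[float(x), 2.0 * x + 1.0, rng.random()] for x in range(20)]
real = nested_support(data, n_outer=5, n_boot=50, rng=rng)
null = nested_support(shuffle_genes(data, rng), n_outer=5, n_boot=50, rng=rng)
print("support on real data:", real)
print("support on shuffled data:", null)
```

A narrow (min, max) range for a link suggests its support is stable across repeated runs. Comparing the support seen on real data with that seen on shuffled data is the basis for the false discovery rate estimate mentioned in the abstract; that conversion is not implemented in this sketch.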
Subject(s)
Algorithms, Gene Regulatory Networks
Abstract
This release of the FunCoup database (http://funcoup.sbc.su.se) is the fourth generation of one of the most comprehensive databases for genome-wide functional association networks. These functional associations are inferred via integrating various data types using a naive Bayesian algorithm and orthology-based information transfer across different species. This approach provides high coverage of the included genomes as well as high quality of inferred interactions. In this update of FunCoup we introduce four new eukaryotic species, Schizosaccharomyces pombe, Plasmodium falciparum, Bos taurus and Oryza sativa, and open the database to the prokaryotic domain by including networks for Escherichia coli and Bacillus subtilis. The latter allows us to also introduce a new class of functional association between genes: co-occurrence in the same operon. We also supplemented the existing classes of functional association (metabolic, signaling, complex and physical protein interaction) with up-to-date information. In this release we switched to InParanoid v8 as the source of orthology and base for calculation of phylogenetic profiles. While populating all other evidence types with new data, we introduce a new evidence type based on quantitative mass spectrometry data. Finally, the new JavaScript-based network viewer provides the user with an intuitive and responsive platform to further evaluate the results.
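The general idea behind naive Bayesian integration of several evidence types can be illustrated with a short sketch. It is not FunCoup's scoring scheme: the function name, the prior and the likelihood ratios below are invented for illustration, and the only assumption carried over is the "naive" one, namely that evidence types are treated as independent so their log-likelihood ratios can simply be added.

```python
from math import exp, log


def combine_evidence(prior_prob, likelihood_ratios):
    """Posterior probability of a functional link under naive Bayes.

    prior_prob:        prior probability that two genes are functionally linked
    likelihood_ratios: one ratio P(evidence | linked) / P(evidence | not linked)
                       per evidence type, treated as independent
    """
    log_odds = log(prior_prob / (1.0 - prior_prob))
    log_odds += sum(log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical gene pair: strong co-expression, moderate interaction evidence,
# and a slightly unfavourable phylogenetic profile.
posterior = combine_evidence(prior_prob=0.001, likelihood_ratios=[20.0, 5.0, 0.8])
print(f"posterior probability of a functional link: {posterior:.4f}")
```

Working in log-odds keeps the combination additive, so each evidence type contributes a separate term that can be inspected on its own.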
Subject(s)
Genetic Databases, Animals, Cattle, Gene Regulatory Networks, Genome-Wide Association Study, Genomics, Humans, Operon, Oryza/genetics, Phylogeny, Plasmodium falciparum/genetics, Protein Interaction Maps, Proteomics, Schizosaccharomyces/genetics, User-Computer Interface
Abstract
BACKGROUND: Orthology inference is normally based on full-length protein sequences. However, most proteins contain independently folding and recurring regions, domains. The domain architecture of a protein is vital for its function, and recombination events mean individual domains can have different evolutionary histories. It has previously been shown that orthologous proteins may differ in domain architecture, creating challenges for orthology inference methods operating on full-length sequences. We have developed Domainoid, a new tool aiming to overcome these challenges faced by full-length orthology methods by inferring orthology on the domain level. It employs the InParanoid algorithm on single domains separately, to infer groups of orthologous domains. RESULTS: This domain-oriented approach allows detection of discordant domain orthologs, cases where different domains on the same protein have different evolutionary histories. In addition to domain level analysis, protein level orthology based on the fraction of domains that are orthologous can be inferred. Domainoid orthology assignments were compared to those yielded by the conventional full-length approach InParanoid, and were validated in a standard benchmark. CONCLUSIONS: Our results show that domain-based orthology inference can reveal many orthologous relationships that are not found by full-length sequence approaches. AVAILABILITY: https://bitbucket.org/sonnhammergroup/domainoid/.
Subject(s)
Proteins/analysis, Algorithms, Biological Evolution, Proteins/genetics, Software
Abstract
HieranoiDB (http://hieranoiDB.sbc.su.se) is a freely available on-line database for hierarchical groups of orthologs inferred by the Hieranoid algorithm. It infers orthologs at each node in a species guide tree with the InParanoid algorithm as it progresses from the leaves to the root. Here we present a database HieranoiDB with a web interface that makes it easy to search and visualize the output of Hieranoid, and to download it in various formats. Searching can be performed using protein description, identifier or sequence. In this first version, orthologs are available for the 66 Quest for Orthologs reference proteomes. The ortholog trees are shown graphically and interactively with marked speciation and duplication nodes that show the inferred evolutionary scenario, and allow for correct extraction of predicted orthologs from the Hieranoid trees.
Subject(s)
Genetic Databases, Web Browser, Biological Evolution, Proteomics/methods, Software
Abstract
Analyzing gene expression patterns is a mainstay to gain functional insights of biological systems. A plethora of tools exist to identify significant enrichment of pathways for a set of differentially expressed genes. Most tools analyze gene overlap between gene sets and are therefore severely hampered by the current state of pathway annotation, yet at the same time they run a high risk of false assignments. A way to improve both true positive and false positive rates (FPRs) is to use a functional association network and instead look for enrichment of network connections between gene sets. We present a new network crosstalk analysis method BinoX that determines the statistical significance of network link enrichment or depletion between gene sets, using the binomial distribution. This is a much more appropriate statistical model than previous methods have employed, and as a result BinoX yields substantially better true positive and FPRs than was possible before. A number of benchmarks were performed to assess the accuracy of BinoX and competing methods. We demonstrate examples of how BinoX finds many biologically meaningful pathway annotations for gene sets from cancer and other diseases, which are not found by other methods. BinoX is available at http://sonnhammer.org/BinoX.
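To make the binomial idea concrete, the following sketch computes enrichment and depletion P-values for the number of network links between two gene sets. It is an illustration only, not the BinoX implementation: the function names, the three inputs and the example numbers are assumptions, and in particular the expected link probability is taken as given here, whereas the abstract does not say how BinoX derives it.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def crosstalk_p_values(observed: int, possible: int, p_link: float) -> tuple[float, float]:
    """Return (enrichment_p, depletion_p) for links between two gene sets.

    observed: links seen between the two gene sets (hypothetical input)
    possible: number of gene pairs that could be linked (hypothetical input)
    p_link:   probability that a pair is linked under the null (assumed given)
    """
    # Enrichment: chance of seeing at least the observed number of links.
    enrichment = sum(binomial_pmf(k, possible, p_link) for k in range(observed, possible + 1))
    # Depletion: chance of seeing at most the observed number of links.
    depletion = sum(binomial_pmf(k, possible, p_link) for k in range(0, observed + 1))
    return enrichment, depletion


enrich_p, deplete_p = crosstalk_p_values(observed=12, possible=200, p_link=0.02)
print(f"enrichment p = {enrich_p:.3g}, depletion p = {deplete_p:.3g}")
```

With 200 possible pairs and a link probability of 0.02, about 4 links are expected by chance, so 12 observed links give a small enrichment P-value and a depletion P-value close to 1.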
Subject(s)
Computational Biology/methods, Gene Regulatory Networks, Metabolic Networks and Pathways, Signal Transduction, Software, Algorithms, Genome-Wide Association Study, Genomics/methods, Humans
Abstract
Pathway annotation of gene lists is often used to functionally analyse biomolecular data such as gene expression in order to establish which processes are activated in a given experiment. Databases such as KEGG or GO represent collections of how genes are known to be organized in pathways, and the challenge is to compare a given gene list with the known pathways such that all true relations are identified. Most tools apply statistical measures to the gene overlap between the gene list and pathway. It is however problematic to avoid false negatives and false positives when only using the gene overlap. The pathwAX web server (http://pathwAX.sbc.su.se/) applies a different approach which is based on network crosstalk. It uses the comprehensive network FunCoup to analyse network crosstalk between a query gene list and KEGG pathways. PathwAX runs the BinoX algorithm, which employs Monte-Carlo sampling of randomized networks and estimates a binomial distribution, for estimating the statistical significance of the crosstalk. This results in substantially higher accuracy than gene overlap methods. The system was optimized for speed and allows interactive web usage. We illustrate the usage and output of pathwAX.
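For contrast with the crosstalk approach, the gene-overlap statistic that most tools rely on can be sketched with a hypergeometric test. This is one common choice rather than a method named in the abstract, and the function name and example numbers are hypothetical.

```python
from math import comb


def overlap_p_value(universe: int, pathway: int, query: int, overlap: int) -> float:
    """Hypergeometric upper tail: P(at least `overlap` shared genes by chance).

    universe: number of annotated genes in total
    pathway:  number of genes in the pathway
    query:    number of genes in the query list
    overlap:  number of genes shared by the query list and the pathway
    """
    max_overlap = min(pathway, query)
    total = comb(universe, query)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, query - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 5 of 50 query genes fall in a 100-gene pathway.
p = overlap_p_value(universe=20000, pathway=100, query=50, overlap=5)
print(f"overlap p-value = {p:.3g}")
```

Only genes annotated to the pathway itself contribute to this statistic, so a query list whose genes are network neighbours of the pathway but not members of it scores nothing. That is the limitation that crosstalk-based methods such as the one described above aim to address.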
Subject(s)
Algorithms, Gene Expression Regulation, Gene Regulatory Networks, User-Computer Interface, Animals, Arabidopsis/genetics, Ciona intestinalis/genetics, Gene Expression Profiling, Humans, Internet, Metabolic Networks and Pathways/genetics, Monte Carlo Method, Saccharomyces cerevisiae/genetics
Abstract
We present TreeDom, a web tool for graphically analysing the evolutionary history of domains in multi-domain proteins. Individual domains on the same protein chain may have distinct evolutionary histories, which is important to grasp in order to understand protein function. For instance, it may be important to know whether a domain was duplicated recently or long ago, to know the origin of inserted domains, or to know the pattern of domain loss within a protein family. TreeDom uses the Pfam database as the source of domain annotations, and displays these on a sequence tree. An advantage of TreeDom is that the user can limit the analysis to N sequences that are most similar to a query, or provide a list of sequence IDs to include. Using the Pfam alignment of the selected sequences, a tree is built and displayed together with the domain architecture of each sequence. AVAILABILITY AND IMPLEMENTATION: http://TreeDom.sbc.su.se CONTACT: Erik.Sonnhammer@scilifelab.se.
Subject(s)
Protein Sequence Analysis, Computer Graphics, Protein Databases, Tertiary Protein Structure, Proteins, Sequence Alignment, Software
Abstract
MOTIVATION: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the 'next generation' of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. METHOD: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. RESULTS: CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. CONCLUSION: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. AVAILABILITY AND IMPLEMENTATION: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark). 
CONTACT: forslund@embl.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Benchmarking, Protein Databases, Sequence Homology, High-Throughput Nucleotide Sequencing, Proteins, Amino Acid Sequence Homology
Abstract
The InParanoid database (http://InParanoid.sbc.su.se) provides a user interface to orthologs inferred by the InParanoid algorithm. As there are now international efforts to curate and standardize complete proteomes, we have switched to using these resources rather than gathering and curating the proteomes ourselves. InParanoid release 8 is based on the 66 reference proteomes that the 'Quest for Orthologs' community has agreed on using, plus 207 additional proteomes from the UniProt complete proteomes--in total 273 species. These represent 246 eukaryotes, 20 bacteria and seven archaea. Compared to the previous release, this increases the number of species by 173% and the number of pairwise species comparisons by 650%. In turn, the number of ortholog groups has increased by 423%. We present the contents and usages of InParanoid 8, and a detailed analysis of how the proteome content has changed since the previous release.
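The growth figures quoted above can be checked with a little arithmetic. In the sketch below the current species count comes from the abstract, while the previous count of 100 species is an assumption back-calculated from the stated 173% increase; the abstract does not state it directly.

```python
from math import comb

current_species = 273   # stated in the abstract: 66 reference + 207 additional proteomes
previous_species = 100  # assumption: inferred from the stated 173% increase


def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# Every unordered pair of species is one pairwise comparison.
current_pairs = comb(current_species, 2)
previous_pairs = comb(previous_species, 2)

species_increase = percent_increase(previous_species, current_species)
pairs_increase = percent_increase(previous_pairs, current_pairs)

print(f"species: {previous_species} -> {current_species} (+{species_increase:.0f}%)")
print(f"pairwise comparisons: {previous_pairs} -> {current_pairs} (+{pairs_increase:.0f}%)")
```

Because the number of pairwise comparisons grows roughly with the square of the number of species, a 2.73-fold increase in species yields about a 7.5-fold increase in comparisons, which is consistent with the 650% figure in the abstract.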
Subject(s)
Protein Databases, Proteome/chemistry, Amino Acid Sequence Homology, Algorithms
Abstract
We present an update of the FunCoup database (http://FunCoup.sbc.su.se) of functional couplings, or functional associations, between genes and gene products. Identifying these functional couplings is an important step in the understanding of higher level mechanisms performed by complex cellular processes. FunCoup distinguishes between four classes of couplings: participation in the same signaling cascade, participation in the same metabolic process, co-membership in a protein complex and physical interaction. For each of these four classes, several types of experimental and statistical evidence are combined by Bayesian integration to predict genome-wide functional coupling networks. The FunCoup framework has been completely re-implemented to allow for more frequent future updates. It contains many improvements, such as a regularization procedure to automatically downweight redundant evidences and a novel method to incorporate phylogenetic profile similarity. Several datasets have been updated and new data have been added in FunCoup 3.0. Furthermore, we have developed a new Web site, which provides powerful tools to explore the predicted networks and to retrieve detailed information about the data underlying each prediction.
Subject(s)
Genetic Databases, Gene Regulatory Networks, Genome, Protein Interaction Mapping, Animals, Bayes Theorem, Humans, Internet, Metabolic Networks and Pathways, Mice, Multiprotein Complexes, Phylogeny, Rats, Signal Transduction, Transcription Factors/metabolism
Abstract
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.