ABSTRACT
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Subject(s)
Gene Expression Profiling , RNA-Seq , Humans , Animals , Mice , RNA-Seq/methods , Gene Expression Profiling/methods , Transcriptome , Sequence Analysis, RNA/methods , Molecular Sequence Annotation/methods
ABSTRACT
RegulonDB is a database that contains the most comprehensive corpus of knowledge of the regulation of transcription initiation of Escherichia coli K-12, including data from both classical molecular biology and high-throughput methodologies. Here, we describe biological advances since our last NAR paper of 2019. We explain the changes to satisfy FAIR requirements. We also present a full reconstruction of the RegulonDB computational infrastructure, which has significantly improved data storage, retrieval and accessibility and thus supports a more intuitive and user-friendly experience. The integration of graphical tools provides clear visual representations of genetic regulation data, facilitating data interpretation and knowledge integration. RegulonDB version 12.0 can be accessed at https://regulondb.ccg.unam.mx.
Subject(s)
Databases, Genetic , Escherichia coli K12 , Gene Expression Regulation, Bacterial , Computational Biology/methods , Escherichia coli K12/genetics , Internet , Transcription, Genetic
ABSTRACT
MOTIVATION: Software plays a crucial and growing role in research. Unfortunately, the computational component in Life Sciences research is often challenging to reproduce and verify. It could be undocumented, opaque, contain unknown errors that affect the outcome, or be directly unavailable and impossible to use for others. These issues are detrimental to the overall quality of scientific research. One step to address this problem is the formulation of principles that research software in the domain should meet to ensure its quality and sustainability, resembling the FAIR (findable, accessible, interoperable, and reusable) data principles. RESULTS: We present here a comprehensive series of quantitative indicators based on a pragmatic interpretation of the FAIR Principles and their implementation on OpenEBench, ELIXIR's open platform providing both support for scientific benchmarking and an active observatory of quality-related features for Life Sciences research software. The results serve to understand the current practices around research software quality-related features and provide objective indications for improving them. AVAILABILITY AND IMPLEMENTATION: Software metadata, from 11 different sources, collected, integrated, and analysed in the context of this manuscript are available at https://doi.org/10.5281/zenodo.7311067. Code used for software metadata retrieval and processing is available in the following repository: https://gitlab.bsc.es/inb/elixir/software-observatory/FAIRsoft_ETL.
Subject(s)
Software , Computational Biology/methods , Metadata
ABSTRACT
Human genomics is undergoing a step change from being a predominantly research-driven activity to one driven through health care as many countries in Europe now have nascent precision medicine programmes. To maximize the value of the genomic data generated, these data will need to be shared between institutions and across countries. In recognition of this challenge, 21 European countries recently signed a declaration to transnationally share data on at least 1 million human genomes by 2022. In this Roadmap, we identify the challenges of data sharing across borders and demonstrate that European research infrastructures are well-positioned to support the rapid implementation of widespread genomic data access.
Subject(s)
Biomedical Research , Genome, Human , Human Genome Project , Europe , Humans
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
ABSTRACT
The inherent diversity of approaches in proteomics research has led to a wide range of software solutions for data analysis. These software solutions encompass multiple tools, each employing different algorithms for various tasks such as peptide-spectrum matching, protein inference, quantification, statistical analysis, and visualization. To enable an unbiased comparison of commonly used bottom-up label-free proteomics workflows, we introduce WOMBAT-P, a versatile platform designed for automated benchmarking and comparison. WOMBAT-P simplifies the processing of public data by utilizing the sample and data relationship format for proteomics (SDRF-Proteomics) as input. This feature streamlines the analysis of annotated local or public ProteomeXchange data sets, promoting efficient comparisons among diverse outputs. Through an evaluation using experimental ground truth data and a realistic biological data set, we uncover significant disparities and a limited overlap in the quantified proteins. WOMBAT-P not only enables rapid execution and seamless comparison of workflows but also provides valuable insights into the capabilities of different software solutions. These benchmarking metrics are a valuable resource for researchers in selecting the most suitable workflow for their specific data sets. The modular architecture of WOMBAT-P promotes extensibility and customization. The software is available at https://github.com/wombat-p/WOMBAT-Pipelines.
Subject(s)
Benchmarking , Proteomics , Workflow , Software , Proteins , Data Analysis
ABSTRACT
PhylomeDB is a unique knowledge base providing public access to minable and browsable catalogues of pre-computed genome-wide collections of annotated sequences, alignments and phylogenies (i.e. phylomes) of homologous genes, as well as to their corresponding phylogeny-based orthology and paralogy relationships. In addition, PhylomeDB trees and alignments can be downloaded for further processing to detect and date gene duplication events, infer past events of inter-species hybridization and horizontal gene transfer, as well as to uncover footprints of selection, introgression, gene conversion, or other relevant evolutionary processes in the genes and organisms of interest. Here, we describe the latest evolution of PhylomeDB (version 5). This new version includes a newly implemented web interface and several new functionalities such as optimized searching procedures, the possibility to create user-defined phylome collections, and a fully redesigned data structure. This release also represents a significant core data expansion, with the database providing access to 534 phylomes, comprising over 8 million trees, and homology relationships for genes in over 6000 species. This makes PhylomeDB the largest and most comprehensive public repository of gene phylogenies. PhylomeDB is available at http://www.phylomedb.org.
Subject(s)
Databases, Genetic , Evolution, Molecular , Genome/genetics , Software , Animals , Humans , Knowledge Bases , Molecular Sequence Annotation , Phylogeny , Plants/genetics , Proteome/genetics
ABSTRACT
The Orthology Benchmark Service (https://orthology.benchmarkservice.org) is the gold standard for orthology inference evaluation, supported and maintained by the Quest for Orthologs consortium. It is an essential resource to compare existing and new methods of orthology inference (the bedrock for many comparative genomics and phylogenetic analyses) over a standard dataset and through common procedures. The Quest for Orthologs Consortium is dedicated to keeping the resource up to date, through regular updates of the Reference Proteomes and increasingly accessible data through the OpenEBench platform. For this update, we have added a new benchmark based on curated orthology assertions from the Vertebrate Gene Nomenclature Committee, and provided an example meta-analysis of the public predictions present on the platform.
Subject(s)
Benchmarking , Genomics , Phylogeny , Genomics/methods , Proteome
ABSTRACT
The identification of orthologs, genes in different species which descended from the same gene in their last common ancestor, is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources have been conceived to address this problem, but benchmarking and interpreting them is fraught with difficulties (need to compare them on a common input dataset, absence of ground truth, computational cost of calling orthologs). To address this, the Quest for Orthologs consortium maintains a reference set of proteomes and provides a web server for continuous orthology benchmarking (http://orthology.benchmarkservice.org). Furthermore, consensus ortholog calls derived from public benchmark submissions are provided on the Alliance of Genome Resources website, the joint portal of NIH-funded model organism databases.
Subject(s)
Multigene Family , Proteome , Software , Animals , Benchmarking , Consensus , Genomics , Humans , Mice , Phylogeny , Rats
ABSTRACT
Sugar beet (Beta vulgaris ssp. vulgaris) is an important crop of temperate climates which provides nearly 30% of the world's annual sugar production and is a source for bioethanol and animal feed. The species belongs to the order of Caryophyllales, is diploid with 2n = 18 chromosomes, has an estimated genome size of 714-758 megabases and shares an ancient genome triplication with other eudicot plants. Leafy beets have been cultivated since Roman times, but sugar beet is one of the most recently domesticated crops. It arose in the late eighteenth century when lines accumulating sugar in the storage root were selected from crosses made with chard and fodder beet. Here we present a reference genome sequence for sugar beet as the first non-rosid, non-asterid eudicot genome, advancing comparative genomics and phylogenetic reconstructions. The genome sequence comprises 567 megabases, of which 85% could be assigned to chromosomes. The assembly covers a large proportion of the repetitive sequence content that was estimated to be 63%. We predicted 27,421 protein-coding genes supported by transcript data and annotated them on the basis of sequence homology. Phylogenetic analyses provided evidence for the separation of Caryophyllales before the split of asterids and rosids, and revealed lineage-specific gene family expansions and losses. We sequenced spinach (Spinacia oleracea), another Caryophyllales species, and validated features that separate this clade from rosids and asterids. Intraspecific genomic variation was analysed based on the genome sequences of sea beet (Beta vulgaris ssp. maritima; progenitor of all beet crops) and four additional sugar beet accessions. We identified seven million variant positions in the reference genome, and also large regions of low variability, indicating artificial selection. The sugar beet genome sequence enables the identification of genes affecting agronomically relevant traits, supports molecular breeding and maximizes the plant's potential in energy biotechnology.
Subject(s)
Beta vulgaris/genetics , Crops, Agricultural/genetics , Genome, Plant/genetics , Biofuels/supply & distribution , Carbohydrate Metabolism , Chromosomes, Plant/genetics , Ethanol/metabolism , Genomics , In Situ Hybridization, Fluorescence , Molecular Sequence Data , Phylogeny , Sequence Analysis, DNA , Spinacia oleracea/genetics
ABSTRACT
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
Subject(s)
Computational Biology/standards , Genomics/standards , Phylogeny , Proteomics/standards , Archaea/classification , Archaea/genetics , Bacteria/classification , Bacteria/genetics , Computational Biology/methods , Databases, Genetic , Eukaryota/classification , Eukaryota/genetics , Gene Ontology , Genomics/methods , Models, Genetic , Proteomics/methods , Sequence Analysis, Protein , Sequence Homology , Species Specificity
ABSTRACT
The Quest for Orthologs (QfO) is an open collaboration framework for experts in comparative phylogenomics and related research areas who have an interest in highly accurate orthology predictions and their applications. We here report highlights and discussion points from the QfO meeting 2015 held in Barcelona. Achievements in recent years have established a basis to support developments for improved orthology prediction and to explore new approaches. Central to the QfO effort is proper benchmarking of methods and services, as well as design of standardized datasets and standardized formats to allow sharing and comparison of results. Simultaneously, analysis pipelines have been improved, evaluated and adapted to handle large datasets. All this would not have occurred without the long-term collaboration of Consortium members. Meeting regularly to review and coordinate complementary activities from a broad spectrum of innovative researchers clearly benefits the community. Highlights of the meeting include addressing sources of and legitimacy of disagreements between orthology calls, the context dependency of orthology definitions, special challenges encountered when analyzing very anciently rooted orthologies, orthology in the light of whole-genome duplications, and the concept of orthologous versus paralogous relationships at different levels, including domain-level orthology. Furthermore, particular needs for different applications (e.g. plant genomics, ancient gene families and others) and the infrastructure for making orthology inferences available (e.g. interfaces with model organism databases) were discussed, with several ongoing efforts that are expected to be reported on during the upcoming 2017 QfO meeting.
ABSTRACT
Considerable effort has been devoted to systematically retrieving information on genes and proteins as well as the relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it also enables basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes, CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es.
Subject(s)
Drug-Related Side Effects and Adverse Reactions , Software , Cytochrome P-450 Enzyme System , Data Mining , Genes , Internet , Liver/drug effects
ABSTRACT
Selenoproteins are proteins that incorporate selenocysteine (Sec), a nonstandard amino acid encoded by UGA, normally a stop codon. Sec synthesis requires the enzyme Selenophosphate synthetase (SPS or SelD), conserved in all prokaryotic and eukaryotic genomes encoding selenoproteins. Here, we study the evolutionary history of SPS genes, providing a map of selenoprotein function spanning the whole tree of life. SPS is itself a selenoprotein in many species, although functionally equivalent homologs that replace the Sec site with cysteine (Cys) are common. Many metazoans, however, possess SPS genes with substitutions other than Sec or Cys (collectively referred to as SPS1). Using complementation assays in fly mutants, we show that these genes share a common function, which appears to be distinct from the synthesis of selenophosphate carried out by the Sec- and Cys- SPS genes (termed SPS2), and unrelated to Sec synthesis. We show here that SPS1 genes originated through a number of independent gene duplications from an ancestral metazoan selenoprotein SPS2 gene that most likely already carried the SPS1 function. Thus, in SPS genes, parallel duplications and subsequent convergent subfunctionalization have resulted in the segregation to different loci of functions initially carried by a single gene. This evolutionary history constitutes a remarkable example of emergence and evolution of gene function, which we have been able to trace thanks to the singular features of SPS genes, wherein the amino acid at a single site determines unequivocally protein function and is intertwined with the evolutionary fate of the entire selenoproteome.
Subject(s)
Biological Evolution , Phosphotransferases/genetics , Phosphotransferases/metabolism , Animals , Biomarkers , Eukaryota/genetics , Eukaryota/metabolism , Gene Duplication , Humans , Insecta , Phylogeny , Prokaryotic Cells/metabolism , Selection, Genetic , Selenium/metabolism , Selenoproteins/genetics , Selenoproteins/metabolism , Urochordata , Vertebrates
ABSTRACT
Here, we report the draft genome sequence of Solanum commersonii, which consists of ~830 megabases with an N50 of 44,303 bp anchored to 12 chromosomes, using the potato (Solanum tuberosum) genome sequence as a reference. Compared with potato, S. commersonii shows a striking reduction in heterozygosity (1.5% versus 53 to 59%), and differences in genome sizes were mainly due to variations in intergenic sequence length. Gene annotation by ab initio prediction supported by RNA-seq data produced a catalog of 1703 predicted microRNAs, 18,882 long noncoding RNAs of which 20% are shown to target cold-responsive genes, and 39,290 protein-coding genes with a significant repertoire of nonredundant nucleotide binding site-encoding genes and 126 cold-related genes that are lacking in S. tuberosum. Phylogenetic analyses indicate that domesticated potato and S. commersonii lineages diverged ~2.3 million years ago. Three duplication periods corresponding to genome enrichment for particular gene families related to response to salt stress, water transport, growth, and defense response were discovered. The draft genome sequence of S. commersonii substantially increases our understanding of the domesticated germplasm, facilitating translation of acquired knowledge into advances in crop stability in light of global climate and environmental changes.
Subject(s)
Genome, Plant/genetics , Solanum tuberosum/genetics , Solanum/genetics , Acclimatization , Biological Evolution , Phylogeny , Solanum/classification , Solanum tuberosum/classification
ABSTRACT
Myriapods (e.g., centipedes and millipedes) display a simple homonomous body plan relative to other arthropods. All members of the class are terrestrial, but they attained terrestriality independently of insects. Myriapoda is the only arthropod class not represented by a sequenced genome. We present an analysis of the genome of the centipede Strigamia maritima. It retains a compact genome that has undergone less gene loss and shuffling than previously sequenced arthropods, and many orthologues of genes conserved from the bilaterian ancestor that have been lost in insects. Our analysis locates many genes in conserved macro-synteny contexts, and many small-scale examples of gene clustering. We describe several examples where S. maritima shows different solutions from insects to similar problems. The insect olfactory receptor gene family is absent from S. maritima, and olfaction in air is likely effected by expansion of other receptor gene families. For some genes S. maritima has evolved paralogues to generate coding sequence diversity, where insects use alternate splicing. This is most striking for the Dscam gene, which in Drosophila generates more than 100,000 alternate splice forms, but in S. maritima is encoded by over 100 paralogues. We see an intriguing linkage between the absence of any known photosensory proteins in a blind organism and the additional absence of canonical circadian clock genes. The phylogenetic position of myriapods allows us to identify where in arthropod phylogeny several particular molecular mechanisms and traits emerged. For example, we conclude that juvenile hormone signalling evolved with the emergence of the exoskeleton in the arthropods and that RR-1 containing cuticle proteins evolved in the lineage leading to Mandibulata. We also identify when various gene expansions and losses occurred. The genome of S. maritima offers us a unique glimpse into the ancestral arthropod genome, while also displaying many adaptations to its specific life history.
Subject(s)
Arthropods/genetics , Genome , Synteny , Animals , Circadian Rhythm Signaling Peptides and Proteins/genetics , DNA Methylation , Evolution, Molecular , Female , Genome, Mitochondrial , Hormones/genetics , Male , Multigene Family , Phylogeny , Polymorphism, Genetic , Protein Kinases/genetics , RNA, Untranslated/genetics , Receptors, Odorant/genetics , Selenoproteins/genetics , Sex Chromosomes , Transcription Factors/genetics
ABSTRACT
Reconstructing the evolutionary relationships of species is a major goal in biology. Despite the increasing number of completely sequenced genomes, a large number of phylogenetic projects rely on targeted sequencing and analysis of a relatively small sample of marker genes. The selection of these phylogenetic markers should ideally be based on accurate predictions of their combined, rather than individual, potential to accurately resolve the phylogeny of interest. Here we present and validate a new phylogenomics strategy to efficiently select a minimal set of stable markers able to reconstruct the underlying species phylogeny. In contrast to previous approaches, our methodology does not only rely on the ability of individual genes to reconstruct a known phylogeny, but it also explores the combined power of sets of concatenated genes to accurately infer phylogenetic relationships of species not previously analyzed. We applied our approach to two broad sets of cyanobacterial and ascomycetous fungal species, and provide two minimal sets of six and four genes, respectively, necessary to fully resolve the target phylogenies. This approach paves the way for the informed selection of phylogenetic markers in the effort of reconstructing the tree of life.
Subject(s)
Genomics/methods , Phylogeny , Ascomycota/classification , Ascomycota/genetics , Cyanobacteria/classification , Cyanobacteria/genetics , Genes, Bacterial , Genes, Fungal , Genetic Markers
ABSTRACT
Phylogenetic trees representing the evolutionary relationships of homologous genes are the entry point for many evolutionary analyses. For instance, the use of a phylogenetic tree can aid in the inference of orthology and paralogy relationships, and in the detection of relevant evolutionary events such as gene family expansions and contractions, horizontal gene transfer, recombination or incomplete lineage sorting. Similarly, given the plurality of evolutionary histories among genes encoded in a given genome, there is a need for the combined analysis of genome-wide collections of phylogenetic trees (phylomes). Here, we introduce a new release of PhylomeDB (http://phylomedb.org), a public repository of phylomes. Currently, PhylomeDB hosts 120 public phylomes, comprising >1.5 million maximum likelihood trees and multiple sequence alignments. In the current release, phylogenetic trees are annotated with taxonomic, protein-domain arrangement, functional and evolutionary information. PhylomeDB is also a major source for phylogeny-based predictions of orthology and paralogy, covering >10 million proteins across 1059 sequenced species. Here we describe newly implemented PhylomeDB features, and discuss a benchmark of the orthology predictions provided by the database, the impact of proteome updates and the use of the phylome approach in the analysis of newly sequenced genomes and transcriptomes.