Results 1 - 20 of 68
1.
Bioinform Adv ; 4(1): vbae057, 2024.
Article in English | MEDLINE | ID: mdl-38721398

ABSTRACT

Motivation: Data reuse is a common and vital practice in molecular biology and enables the knowledge gathered over recent decades to drive discovery and innovation in the life sciences. Much of this knowledge has been collated into molecular biology databases, such as UniProtKB, and these resources derive enormous value from sharing data among themselves. However, quantifying and documenting this kind of data reuse remains a challenge. Results: The article reports on a one-day virtual workshop hosted by the UniProt Consortium in March 2023, attended by representatives from biodata resources, experts in data management, and NIH program managers. Workshop discussions focused on strategies for tracking data reuse, best practices for reusing data, and the challenges associated with data reuse and tracking. Surveys and discussions showed that data reuse is widespread, but critical information for reproducibility is sometimes lacking. Challenges include costs of tracking data reuse, tensions between tracking data and open sharing, restrictive licenses, and difficulties in tracking commercial data use. Recommendations that emerged from the discussion include: development of standardized formats for documenting data reuse, education about the obstacles posed by restrictive licenses, and continued recognition by funding agencies that data management is a critical activity that requires dedicated resources. Availability and implementation: Summaries of survey results are available at: https://docs.google.com/forms/d/1j-VU2ifEKb9C-sW6l3ATB79dgHdRk5v_lESv2hawnso/viewanalytics (survey of data providers) and https://docs.google.com/forms/d/18WbJFutUd7qiZoEzbOytFYXSfWFT61hVce0vjvIwIjk/viewanalytics (survey of users).

2.
Sci Data ; 11(1): 268, 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38443367
3.
Sci Rep ; 13(1): 7612, 2023 05 10.
Article in English | MEDLINE | ID: mdl-37165019

ABSTRACT

Epidemiologic surveillance of circulating SARS-CoV-2 variants is essential to assess impact on clinical outcomes and vaccine efficacy. Whole genome sequencing (WGS), the gold standard for identifying variants, requires significant infrastructure and expertise. We developed a digital droplet polymerase chain reaction (ddPCR) assay that can rapidly identify circulating variants of concern/interest (VOC/VOI) using variant-specific mutation combinations in the Spike gene. To validate the assay, 800 saliva samples known to be SARS-CoV-2 positive by RT-PCR were used. During the study (July 2020-March 2022) the assay was easily adaptable to identify not only existing circulating VOC/VOI, but all new variants as they evolved. The assay can discriminate nine variants (Alpha, Beta, Gamma, Delta, Eta, Epsilon, Lambda, Mu, and Omicron) and sub-lineages (Delta 417N, Omicron BA.1, BA.2). Sequence analyses confirmed variant type for 124/124 samples tested. This ddPCR assay is an inexpensive, sensitive, high-throughput assay that can easily be adapted as new variants are identified.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/diagnosis , COVID-19/epidemiology , Polymerase Chain Reaction , Clinical Decision-Making , Population Surveillance , COVID-19 Testing
4.
Genetics ; 224(1)2023 05 04.
Article in English | MEDLINE | ID: mdl-36866529

ABSTRACT

The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO: a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations: evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs): mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.


Subject(s)
Databases, Genetic , Proteins , Gene Ontology , Proteins/genetics , Molecular Sequence Annotation , Computational Biology
5.
Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.


Subject(s)
Databases, Protein , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software
6.
Nucleic Acids Res ; 50(W1): W623-W632, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35552456

ABSTRACT

The Orthology Benchmark Service (https://orthology.benchmarkservice.org) is the gold standard for orthology inference evaluation, supported and maintained by the Quest for Orthologs consortium. It is an essential resource to compare existing and new methods of orthology inference (the bedrock for many comparative genomics and phylogenetic analyses) over a standard dataset and through common procedures. The Quest for Orthologs Consortium is dedicated to keeping the resource up to date, through regular updates of the Reference Proteomes and increasingly accessible data through the OpenEBench platform. For this update, we have added a new benchmark based on curated orthology assertions from the Vertebrate Gene Nomenclature Committee, and provided an example meta-analysis of the public predictions present on the platform.


Subject(s)
Benchmarking , Genomics , Phylogeny , Genomics/methods , Proteome
7.
Nucleic Acids Res ; 50(W1): W57-W65, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35640593

ABSTRACT

The Annotation Query (AnnoQ) (http://annoq.org/) is designed to provide comprehensive and up-to-date functional annotations for human genetic variants. The system is supported by an annotation database with ∼39 million human variants from the Haplotype Reference Consortium (HRC) pre-annotated with sequence feature annotations by WGSA and functional annotations to Gene Ontology (GO) and pathways in PANTHER. The database operates on an optimized Elasticsearch framework to support real-time complex searches. This implementation enables users to annotate data with the most up-to-date functional annotations via simple queries instead of setting up individual tools. A web interface allows users to interactively browse the annotations, annotate variants and search variant data. Its easy-to-use interface and search capabilities are well-suited for scientists with fewer bioinformatics skills such as bench scientists and statisticians. AnnoQ also has an API for users to access and annotate the data programmatically. Packages for programming languages, such as the R package, are available for users to embed the annotation queries in their scripts. AnnoQ serves researchers with a wide range of backgrounds and research interests as an integrated annotation platform.


Subject(s)
Genetic Variation , Molecular Sequence Annotation , Software , Humans , Databases, Genetic , Internet , Molecular Sequence Annotation/methods , User-Computer Interface , Genetic Variation/genetics , Haplotypes/genetics , Programming Languages
8.
Protein Sci ; 31(1): 8-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34717010

ABSTRACT

Phylogenetics is a powerful tool for analyzing protein sequences, by inferring their evolutionary relationships to other proteins. However, phylogenetic analyses can be challenging: they are computationally expensive and must be performed carefully in order to avoid systematic errors and artifacts. Protein Analysis THrough Evolutionary Relationships (PANTHER; http://pantherdb.org) is a publicly available, user-focused knowledgebase that stores the results of an extensive phylogenetic reconstruction pipeline that includes computational and manual processes and quality control steps. First, fully reconciled phylogenetic trees (including ancestral protein sequences) are reconstructed for a set of "reference" protein sequences obtained from fully sequenced genomes of organisms across the tree of life. Second, the resulting phylogenetic trees are manually reviewed and annotated with function evolution events: inferred gains and losses of protein function along branches of the phylogenetic tree. Here, we describe in detail the current contents of PANTHER, how those contents are generated, and how they can be used in a variety of applications. The PANTHER knowledgebase can be downloaded or accessed via an extensive API. In addition, PANTHER provides software tools to facilitate the application of the knowledgebase to common protein sequence analysis tasks: exploring an annotated genome by gene function; performing "enrichment analysis" of lists of genes; annotating a single sequence or large batch of sequences by homology; and assessing the likelihood that a genetic variant at a particular site in a protein will have deleterious effects.


Subject(s)
Databases, Protein , Evolution, Molecular , Phylogeny , Proteins , Sequence Analysis, Protein , Software , Molecular Sequence Annotation , Proteins/chemistry , Proteins/genetics
9.
Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194752, 2021.
Article in English | MEDLINE | ID: mdl-34461313

ABSTRACT

Transcription plays a central role in defining the identity and functionalities of cells, as well as in their responses to changes in the cellular environment. The Gene Ontology (GO) provides a rigorously defined set of concepts that describe the functions of gene products. A GO annotation is a statement about the function of a particular gene product, represented as an association between a gene product and the biological concept a GO term defines. Critically, each GO annotation is based on traceable scientific evidence. Here, we describe the different GO terms that are associated with proteins involved in transcription and its regulation, focusing on the standard of evidence required to support these associations. This article is intended to help users of GO annotations understand how to interpret the annotations and can contribute to the consistency of GO annotations. We distinguish between three classes of activities involved in transcription or directly regulating it - general transcription factors, DNA-binding transcription factors, and transcription co-regulators.


Subject(s)
Databases, Genetic/statistics & numerical data , Gene Expression Regulation , Gene Ontology/statistics & numerical data , Transcription Factors/classification , Computational Biology/methods , Molecular Sequence Annotation/statistics & numerical data
10.
Bioinformatics ; 37(19): 3343-3348, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33964129

ABSTRACT

MOTIVATION: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them, based on sharing of entities between individual reactions, into pathway descriptions. RESULTS: To convert the rich content of Reactome into GO-CAMs, we have developed a software tool, Pathways2GO, that converts the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, yielding a nearly seamless conversion between these two resources for the bioinformatics community. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.
Mol Biol Evol ; 38(8): 3033-3045, 2021 07 29.
Article in English | MEDLINE | ID: mdl-33822172

ABSTRACT

Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology (evolutionary relatedness) is in many cases readily determined based on sequence similarity analysis. By contrast, whether or not two genes directly descended from a common ancestor by a speciation event (orthologs) or duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit, from domains to networks. This meeting brought to light several challenges to come: leveraging orthology computations to get the most out of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.


Subject(s)
Genetic Speciation , Genomics/trends , Phylogeny , Genome, Viral , Genomics/methods
12.
PLoS Comput Biol ; 17(2): e1007948, 2021 02.
Article in English | MEDLINE | ID: mdl-33600408

ABSTRACT

Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.


Subject(s)
Models, Genetic , Molecular Sequence Annotation/methods , Phylogeny , Algorithms , Animals , Bayes Theorem , Computational Biology , Databases, Genetic , Evolution, Molecular , Gene Ontology/statistics & numerical data , Humans , Likelihood Functions , Markov Chains , Mice , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data , Monte Carlo Method , Multigene Family
13.
Bioinformatics ; 36(24): 5712-5718, 2021 04 05.
Article in English | MEDLINE | ID: mdl-32637990

ABSTRACT

MOTIVATION: A large variety of molecular interactions occurs between biomolecular components in cells. When a molecular interaction results in a regulatory effect, exerted by one component onto a downstream component, a so-called 'causal interaction' takes place. Causal interactions constitute the building blocks in our understanding of larger regulatory networks in cells. These causal interactions and the biological processes they enable (e.g. gene regulation) need to be described with a careful appreciation of the underlying molecular reactions. A proper description of this information enables archiving, sharing and reuse by humans and for automated computational processing. Various representations of causal relationships between biological components are currently used in a variety of resources. RESULTS: Here, we propose a checklist that accommodates current representations, called the Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST). This checklist defines both the required core information, as well as a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. The MI2CAST checklist can be used as reporting guidelines when annotating and curating causal statements, while fostering uniformity and interoperability of the data across resources. AVAILABILITY AND IMPLEMENTATION: The checklist together with examples is accessible at https://github.com/MI2CAST/MI2CAST. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Causality , Humans
14.
Nucleic Acids Res ; 49(D1): D394-D403, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33290554

ABSTRACT

PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high-level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.


Subject(s)
Computational Biology/methods , Enhancer Elements, Genetic/genetics , Phylogeny , Software , User-Computer Interface , Evolution, Molecular , Gene Ontology , Genome , Molecular Sequence Annotation , Open Reading Frames/genetics
15.
Nucleic Acids Res ; 49(D1): D344-D354, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33156333

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.


Subject(s)
Databases, Protein , Proteins/chemistry , Amino Acid Sequence , COVID-19/metabolism , Internet , Molecular Sequence Annotation , Protein Domains , Protein Interaction Maps , SARS-CoV-2/metabolism , Sequence Alignment
16.
PLoS One ; 15(12): e0243791, 2020.
Article in English | MEDLINE | ID: mdl-33320871

ABSTRACT

Enhancers are powerful and versatile agents of cell-type specific gene regulation, which are thought to play key roles in human disease. Enhancers are short DNA elements that function primarily as clusters of transcription factor binding sites that are spatially coordinated to regulate expression of one or more specific target genes. These regulatory connections between enhancers and target genes can therefore be characterized as enhancer-gene links that can affect development, disease, and homeostatic cellular processes. Despite their implication in disease and the establishment of cell identity during development, most enhancer-gene links remain unknown. Here we introduce a new, publicly accessible database of predicted enhancer-gene links, PEREGRINE. The PEREGRINE human enhancer-gene links interactive web interface incorporates publicly available experimental data from ChIA-PET, eQTL, and Hi-C assays across 78 cell and tissue types to link 449,627 enhancers to 17,643 protein-coding genes. These enhancer-gene links are made available through the new Enhancer module of the PANTHER database and website where the user may easily access the evidence for each enhancer-gene link, as well as query by target gene and enhancer location.


Subject(s)
Enhancer Elements, Genetic/genetics , Genomics/methods , Cell Line , Databases, Genetic , Quantitative Trait Loci/genetics
17.
Nucleic Acids Res ; 48(W1): W538-W545, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32374845

ABSTRACT

The identification of orthologs (genes in different species which descended from the same gene in their last common ancestor) is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources have been conceived to address this problem, but benchmarking and interpreting them is fraught with difficulties (need to compare them on a common input dataset, absence of ground truth, computational cost of calling orthologs). To address this, the Quest for Orthologs consortium maintains a reference set of proteomes and provides a web server for continuous orthology benchmarking (http://orthology.benchmarkservice.org). Furthermore, consensus ortholog calls derived from public benchmark submissions are provided on the Alliance of Genome Resources website, the joint portal of NIH-funded model organism databases.


Subject(s)
Multigene Family , Proteome , Software , Animals , Benchmarking , Consensus , Genomics , Humans , Mice , Phylogeny , Rats
18.
Plant Direct ; 4(12): e00293, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33392435

ABSTRACT

We aim to enable the accurate and efficient transfer of knowledge about gene function gained from Arabidopsis thaliana and other model organisms to other plant species. This knowledge transfer is frequently challenging in plants due to duplications of individual genes and whole genomes in plant lineages. Such duplications result in complex evolutionary relationships between related genes, which may have similar sequences but highly divergent functions. In such cases, functional inference requires more than a simple sequence similarity calculation. We have developed an online resource, PhyloGenes (phylogenes.org), that displays precomputed phylogenetic trees for plant gene families along with experimentally validated function information for individual genes within the families. A total of 40 plant genomes and 10 non-plant model organisms are represented in over 8,000 gene families. Evolutionary events such as speciation and duplication are clearly labeled on gene trees to distinguish orthologs from paralogs. Nearly 6,000 families have at least one member with an experimentally supported annotation to a Gene Ontology (GO) molecular function or biological process term. By displaying experimentally validated gene functions associated with individual genes within a tree, PhyloGenes enables functional inference for genes of uncharacterized function, based on their evolutionary relationships to experimentally studied genes, in a visually traceable manner. For the many families containing genes that have evolved to perform different functions, PhyloGenes facilitates the use of evolutionary history to determine the most likely function of genes that have not been experimentally characterized. Future work will enrich the resource by incorporating additional gene function datasets such as plant gene expression atlas data.

20.
Mol Biol Evol ; 36(10): 2157-2164, 2019 10 01.
Article in English | MEDLINE | ID: mdl-31241141

ABSTRACT

Gene families evolve by the processes of speciation (creating orthologs), gene duplication (paralogs), and horizontal gene transfer (xenologs), in addition to sequence divergence and gene loss. Orthologs in particular play an essential role in comparative genomics and phylogenomic analyses. With the continued sequencing of organisms across the tree of life, the data are available to reconstruct the unique evolutionary histories of tens of thousands of gene families. Accurate reconstruction of these histories, however, is a challenging computational problem, and the focus of the Quest for Orthologs Consortium. We review the recent advances and outstanding challenges in this field, as revealed at a symposium and meeting held at the University of Southern California in 2017. Key advances have been made both at the level of orthology algorithm development and with respect to coordination across the community of algorithm developers and orthology end-users. Applications spanned a broad range, including gene function prediction, phylostratigraphy, genome evolution, and phylogenomics. The meetings highlighted the increasing use of meta-analyses integrating results from multiple different algorithms, and discussed ongoing challenges in orthology inference as well as the next steps toward improvement and integration of orthology resources.


Subject(s)
Evolution, Molecular , Genomics/trends , Multigene Family , Algorithms , Animals , Genomics/methods , Humans