ABSTRACT
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
Subjects
Enzymes, Natural Language Processing, Enzymes/chemistry, PubMed, Protein Databases, Knowledge Bases
ABSTRACT
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
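The F1 scores quoted above combine precision and recall into a single figure. The following is a minimal sketch of how such a score could be computed for extracted chemical conversion pairs, assuming exact matching of (substrate, product) identifier pairs against a gold standard. The function name and the placeholder identifiers are hypothetical, and this is not the authors' evaluation code.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (substrate, product) pairs; the identifiers are placeholders,
# not real ChEBI accessions.
gold_pairs = {("CHEBI:A", "CHEBI:B"), ("CHEBI:C", "CHEBI:D"), ("CHEBI:E", "CHEBI:F")}
predicted_pairs = {("CHEBI:A", "CHEBI:B"), ("CHEBI:C", "CHEBI:D"), ("CHEBI:X", "CHEBI:Y")}

precision, recall, f1 = f1_score(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact set matching is only one possible scoring convention; a real benchmark may score at the mention, pair, or document level.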
ABSTRACT
ViralZone (http://viralzone.expasy.org) is a knowledge repository for viruses that links biological knowledge and databases. It contains data on virion structure, genome, proteome, replication cycle and host-virus interactions. The new update provides better access to the data through contextual popups and higher resolution images in Scalable Vector Graphics (SVG) format. These images are designed to be dynamic and interactive with human viruses to give users better access to the data. In addition, a new coronavirus-specific resource provides regularly updated data on variants and molecular biology of SARS-CoV-2. Other virus-specific resources have been added to the database, particularly for HIV, herpesviruses and poxviruses.
Subjects
Knowledge Bases, Viruses, Humans, Virion/chemistry, Virion/genetics, Virion/growth & development, Viruses/chemistry, Viruses/genetics, Viruses/growth & development
ABSTRACT
MOTIVATION: To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between protein sequences and structures and small molecule ligands. RESULTS: We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that ChEBI provides. AVAILABILITY AND IMPLEMENTATION: Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML and RDF) and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/) and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Knowledge Bases, Protein Databases, Ligands, Amino Acid Sequence, Binding Sites, Molecular Sequence Annotation
ABSTRACT
SwissBioPics (www.swissbiopics.org) is a freely available resource of interactive, high-resolution cell images designed for the visualization of subcellular location data. SwissBioPics provides images describing cell types from all kingdoms of life, from the specialized muscle, neuronal and epithelial cells of animals, to the rods, cocci, clubs and spirals of prokaryotes. All cell images in SwissBioPics are drawn in Scalable Vector Graphics (SVG), with each subcellular location tagged with a unique identifier from the controlled vocabulary of subcellular locations and organelles of UniProt (https://www.uniprot.org/locations/). Users can search and explore SwissBioPics cell images through our website, which provides a platform for users to learn more about how cells are organized. A web component allows developers to embed SwissBioPics images in their own websites, using the associated JavaScript and a styling template, and to highlight subcellular locations and organelles by simply providing the web component with the appropriate identifier(s) from the UniProt controlled vocabulary or the 'Cellular Component' branch of the Gene Ontology (www.geneontology.org), as well as an organism identifier from the National Center for Biotechnology Information taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy). The UniProt website now uses SwissBioPics to visualize the subcellular locations and organelles where proteins function. SwissBioPics is freely available for anyone to use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. DATABASE URL: www.swissbiopics.org.
Subjects
Proteins, Controlled Vocabulary, Animals
ABSTRACT
The UniProt Knowledgebase (UniProtKB) is a comprehensive, high-quality, and freely accessible resource of protein sequences and functional annotation that covers genomes and proteomes from tens of thousands of taxa, including a broad range of plants and microorganisms producing natural products of medical, nutritional, and agronomical interest. Here we describe work that enhances the utility of UniProtKB as a support both for the study of natural products and for their discovery. The foundation of this work is an improved representation of natural product metabolism in UniProtKB using Rhea, an expert-curated knowledgebase of biochemical reactions that is built on the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Knowledge of natural products and precursors is captured in ChEBI, enzyme-catalyzed reactions in Rhea, and enzymes in UniProtKB/Swiss-Prot, thereby linking chemical structure data directly to protein knowledge. We provide a practical demonstration of how users can search UniProtKB for protein knowledge relevant to natural products through interactive or programmatic queries using metabolite names and synonyms, chemical identifiers, chemical classes, and chemical structures, and show how to federate UniProtKB with other data and knowledge resources and tools using semantic web technologies such as RDF and SPARQL. All UniProtKB data are freely available for download in a broad range of formats for users to further mine or exploit as an annotation source, to enrich other natural product datasets and databases.
ABSTRACT
MOTIVATION: The number of protein records in the UniProt Knowledgebase (UniProtKB: https://www.uniprot.org) continues to grow rapidly as a result of genome sequencing and the prediction of protein-coding genes. Providing functional annotation for these proteins presents a significant and continuing challenge. RESULTS: In response to this challenge, UniProt has developed a method of annotation, known as UniRule, based on expertly curated rules, which integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by the members of the UniProt consortium. UniRule uses protein family signatures from InterPro, combined with taxonomic and other constraints, to select sets of reviewed proteins which have common functional properties supported by experimental evidence. This annotation is propagated to unreviewed records in UniProtKB that meet the same selection criteria, most of which do not have (and are never likely to have) experimentally verified functional annotation. Release 2020_01 of UniProtKB contains 6496 UniRule rules which provide annotation for 53 million proteins, accounting for 30% of the 178 million records in UniProtKB. UniRule provides scalable enrichment of annotation in UniProtKB. AVAILABILITY AND IMPLEMENTATION: UniRule rules are integrated into UniProtKB and can be viewed at https://www.uniprot.org/unirule/. UniRule rules and the code required to run the rules, are publicly available for researchers who wish to annotate their own sequences. The implementation used to run the rules is known as UniFIRE and is available at https://gitlab.ebi.ac.uk/uniprot-public/unifire.
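The propagation logic described above (a family signature plus a taxonomic constraint selects unreviewed records that receive an annotation) can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the `Protein` and `Rule` classes, the identifiers, and the annotation text are all hypothetical, and the code is not the UniRule or UniFIRE implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    signatures: set   # family signature identifiers (hypothetical values)
    lineage: set      # taxon names from the organism's lineage
    reviewed: bool = False
    annotations: list = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    required_signature: str
    required_taxon: str
    annotation: str

    def matches(self, protein):
        # Both the signature condition and the taxonomic constraint must hold.
        return (self.required_signature in protein.signatures
                and self.required_taxon in protein.lineage)


def apply_rules(rules, proteins):
    """Propagate each rule's annotation to unreviewed proteins that satisfy it."""
    for protein in proteins:
        if protein.reviewed:
            continue  # reviewed records already carry curated annotation
        for rule in rules:
            if rule.matches(protein) and rule.annotation not in protein.annotations:
                protein.annotations.append(rule.annotation)
    return proteins


# Hypothetical rule and records, for illustration only.
rules = [Rule("RULE_1", "SIG_0001", "Bacteria", "Function: example annotation")]
proteins = [
    Protein("UNREVIEWED_1", {"SIG_0001"}, {"Bacteria", "Proteobacteria"}),
    Protein("UNREVIEWED_2", {"SIG_0001"}, {"Eukaryota", "Metazoa"}),
]
apply_rules(rules, proteins)

# Coverage figure from the abstract: 53 million of 178 million records.
coverage = 53 / 178
print(f"coverage = {coverage:.1%}")
```

Only the first record satisfies both conditions, so only it receives the annotation; the second matches the signature but fails the taxonomic constraint. The final lines recompute the coverage from the two record counts quoted above, which rounds to the stated 30%.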
Subjects
Knowledge Bases, Proteins, Chromosome Mapping, Protein Databases, Molecular Sequence Annotation, Proteins/genetics
ABSTRACT
BACKGROUND: Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation. RESULTS: Here we demonstrate one approach to generate portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept uses our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, and the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax, and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach supports the generation of annotation that is identical to that generated by our own in-house pipeline, using standard, off-the-shelf solutions, and is applicable to any genome or proteome annotation pipeline. CONCLUSIONS: HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.
Subjects
Genomics/methods, Molecular Sequence Annotation/methods, DNA Sequence Analysis/methods, Protein Sequence Analysis/methods, Software/standards, Animals, Genomics/standards, Humans, Molecular Sequence Annotation/standards, DNA Sequence Analysis/standards, Protein Sequence Analysis/standards
ABSTRACT
MOTIVATION: To provide high quality computationally tractable enzyme annotation in UniProtKB using Rhea, a comprehensive expert-curated knowledgebase of biochemical reactions which describes reaction participants using the ChEBI (Chemical Entities of Biological Interest) ontology. RESULTS: We replaced existing textual descriptions of biochemical reactions in UniProtKB with their equivalents from Rhea, which is now the standard for annotation of enzymatic reactions in UniProtKB. We developed improved search and query facilities for the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that Rhea and ChEBI provide. AVAILABILITY AND IMPLEMENTATION: UniProtKB at https://www.uniprot.org; UniProt REST API at https://www.uniprot.org/help/api; UniProt SPARQL endpoint at https://sparql.uniprot.org/; Rhea at https://www.rhea-db.org.
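As an illustration of the kind of programmatic query this annotation enables, the sketch below builds a SPARQL query that asks for proteins annotated with a given Rhea reaction, and encodes it as a GET request URL without sending it. The endpoint path (`/sparql`), the class and predicate names (`up:Catalytic_Activity_Annotation`, `up:catalyticActivity`, `up:catalyzedReaction`), the Rhea namespace, and the reaction number are assumptions that should be checked against the current UniProt and Rhea RDF documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint path; the abstract only gives https://sparql.uniprot.org/.
SPARQL_ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_rhea_query(rhea_id, limit=10):
    """Build a SPARQL query for proteins annotated with one Rhea reaction.

    The vocabulary used here is assumed, not taken from the abstract.
    """
    return f"""PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rh: <http://rdf.rhea-db.org/>
SELECT ?protein
WHERE {{
  ?protein a up:Protein ;
           up:annotation ?annotation .
  ?annotation a up:Catalytic_Activity_Annotation ;
              up:catalyticActivity ?activity .
  ?activity up:catalyzedReaction rh:{int(rhea_id)} .
}}
LIMIT {int(limit)}"""


def build_request_url(query):
    """Encode the query as a GET request URL (SPARQL protocol 'query' parameter)."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query})


# 10000 is a placeholder reaction number chosen for illustration.
query = build_rhea_query(10000)
url = build_request_url(query)
print(query)
```

The URL can then be fetched with any HTTP client, with the result format requested through the `Accept` header.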
Subjects
Rheiformes, Animals, Protein Databases, Knowledge Bases
ABSTRACT
Bacterial viruses, also called bacteriophages, display a great genetic diversity and utilize unique processes for infecting and reproducing within a host cell. All these processes were investigated and indexed in the ViralZone knowledge base. To facilitate standardizing data, a simple ontology of viral life-cycle terms was developed to provide a common vocabulary for annotating data sets. New terminology was developed to address unique viral replication cycle processes, and existing terminology was modified and adapted. Classically, the viral life-cycle is described by schematic pictures. Using this ontology, it can be represented by a combination of successive events: entry, latency, transcription/replication, host-virus interactions and virus release. Each of these parts is broken down into discrete steps. For example, enterobacteria phage lambda entry is broken down into: viral attachment to host adhesion receptor, viral attachment to host entry receptor, viral genome ejection and viral genome circularization. To demonstrate the utility of a standard ontology for virus biology, this work was completed by annotating virus data in the ViralZone, UniProtKB and Gene Ontology databases.
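The decomposition of a life-cycle into successive events, each broken down into ordered steps, can be written down as a simple data structure. The sketch below uses the event names and the phage lambda entry steps listed above; the variable and function names are hypothetical and do not reflect how ViralZone stores its ontology.

```python
# Successive life-cycle events, in the order given in the abstract.
LIFE_CYCLE_EVENTS = [
    "entry",
    "latency",
    "transcription/replication",
    "host-virus interactions",
    "virus release",
]

# Ordered steps of the "entry" event for enterobacteria phage lambda.
LAMBDA_ENTRY_STEPS = [
    "viral attachment to host adhesion receptor",
    "viral attachment to host entry receptor",
    "viral genome ejection",
    "viral genome circularization",
]


def describe(virus, event, steps):
    """Render one life-cycle event of a virus as a numbered list of steps."""
    if event not in LIFE_CYCLE_EVENTS:
        raise ValueError(f"unknown life-cycle event: {event}")
    lines = [f"{virus}: {event}"]
    lines += [f"  {number}. {step}" for number, step in enumerate(steps, start=1)]
    return "\n".join(lines)


print(describe("Enterobacteria phage lambda", "entry", LAMBDA_ENTRY_STEPS))
```

Keeping the steps in an ordered list preserves the sequence of the process, which a plain set of annotation terms would lose.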
Subjects
Bacteriophages/genetics, Bacteriophages/physiology, Biological Ontologies, Bacteriophages/classification, Bacteriophages/growth & development, Factual Databases, Host-Pathogen Interactions, Terminology as Topic
ABSTRACT
Viruses are genetically diverse, infect a wide range of tissues and host cells and follow unique processes for replicating themselves. All these processes were investigated and indexed in the ViralZone knowledge base. To facilitate standardizing data, a simple ontology of viral life-cycle terms was developed to provide a common vocabulary for annotating data sets. New terminology was developed to address unique viral replication cycle processes, and existing terminology was modified and adapted. The virus life-cycle is classically described by schematic pictures. Using this ontology, it can be represented by a combination of successive terms: "entry", "latency", "transcription", "replication" and "exit". Each of these parts is broken down into discrete steps. For example, Zika virus "entry" is broken down into successive steps: "Attachment", "Apoptotic mimicry", "Viral endocytosis/macropinocytosis", "Fusion with host endosomal membrane", "Viral factory". To demonstrate the utility of a standard ontology for virus biology, this work was completed by annotating virus data in the ViralZone, UniProtKB and Gene Ontology databases.
Subjects
Eukaryotic Cells/virology, Terminology as Topic, Virus Diseases/virology, Virus Physiological Phenomena, Genetic Databases, Virus Replication, Viruses/genetics, Viruses/pathogenicity
ABSTRACT
HAMAP (High-quality Automated and Manual Annotation of Proteins, available at http://hamap.expasy.org/) is a system for the automatic classification and annotation of protein sequences. HAMAP provides annotation of the same quality and detail as UniProtKB/Swiss-Prot, using manually curated profiles for protein sequence family classification and expert curated rules for functional annotation of family members. HAMAP data and tools are made available through our website and as part of the UniRule pipeline of UniProt, providing annotation for millions of unreviewed sequences of UniProtKB/TrEMBL. Here we report on the growth of HAMAP and updates to the HAMAP system since our last report in the NAR Database Issue of 2013. We continue to augment HAMAP with new family profiles and annotation rules as new protein families are characterized and annotated in UniProtKB/Swiss-Prot; the latest version of HAMAP (as of 3 September 2014) contains 1983 family classification profiles and 1998 annotation rules (up from 1780 and 1720). We demonstrate how the complex logic of HAMAP rules allows for precise annotation of individual functional variants within large homologous protein families. We also describe improvements to our web-based tool HAMAP-Scan which simplify the classification and annotation of sequences, and the incorporation of an improved sequence-profile search algorithm.
Subjects
Protein Databases, Molecular Sequence Annotation, Amino Acid Sequence Homology, Humans, Internet, Proteins/classification
ABSTRACT
Our growing knowledge of viruses reveals how these pathogens manage to evade innate host defenses. A global scheme emerges in which many viruses usurp key cellular defense mechanisms and often inhibit the same components of antiviral signaling. To accurately describe these processes, we have generated a comprehensive dictionary for eukaryotic host-virus interactions. This controlled vocabulary has been detailed in 57 ViralZone resource web pages which contain a global description of all molecular processes. In order to annotate viral gene products with this vocabulary, an ontology has been built in a hierarchy of UniProt Knowledgebase (UniProtKB) keyword terms and corresponding Gene Ontology (GO) terms have been developed in parallel. The results are 65 UniProtKB keywords related to 57 GO terms, which have been used in 14,390 manual annotations and 908,723 automatic annotations, and propagated to an estimated 922,941 GO annotations. ViralZone pages, UniProtKB keywords and GO terms provide complementary tools to users, and the three resources have been linked to each other through host-virus vocabulary.
Subjects
Gene Ontology, Host-Pathogen Interactions/genetics, Adaptive Immunity/genetics, Animals, Nucleic Acid Databases, Gene Expression Regulation/immunology, Humans, Innate Immunity, Interferons/genetics, Virus Diseases/genetics, Virus Diseases/immunology, Virus Diseases/virology
ABSTRACT
BACKGROUND: Tef (Eragrostis tef), an indigenous cereal critical to food security in the Horn of Africa, is rich in minerals and protein, resistant to many biotic and abiotic stresses and safe for diabetics as well as sufferers of immune reactions to wheat gluten. We present the genome of tef, the first species in the grass subfamily Chloridoideae and the first allotetraploid assembled de novo. We sequenced the tef genome for marker-assisted breeding, to shed light on the molecular mechanisms conferring tef's desirable nutritional and agronomic properties, and to make its genome publicly available as a community resource. RESULTS: The draft genome contains 672 Mbp representing 87% of the genome size estimated from flow cytometry. We also sequenced two transcriptomes, one from a normalized RNA library and another from unnormalized RNASeq data. The normalized RNA library revealed around 38000 transcripts that were then annotated by the SwissProt group. The CoGe comparative genomics platform was used to compare the tef genome to other genomes, notably sorghum. Scaffolds comprising approximately half of the genome size were ordered by syntenic alignment to sorghum producing tef pseudo-chromosomes, which were sorted into A and B genomes as well as compared to the genetic map of tef. The draft genome was used to identify novel SSR markers, investigate target genes for abiotic stress resistance studies, and understand the evolution of the prolamin family of proteins that are responsible for the immune response to gluten. CONCLUSIONS: It is highly plausible that breeding targets previously identified in other cereal crops will also be valuable breeding targets in tef. The draft genome and transcriptome will be of great use for identifying these targets for genetic improvement of this orphan crop that is vital for feeding 50 million people in the Horn of Africa.
Subjects
Eragrostis/genetics, Plant Genome, Transcriptome, Chromosome Mapping, Eragrostis/classification, Gene Library, High-Throughput Nucleotide Sequencing, Microsatellite Repeats/genetics, Molecular Sequence Annotation, Phosphoric Monoester Hydrolases/classification, Phosphoric Monoester Hydrolases/genetics, Phylogeny, Plant Proteins/classification, Plant Proteins/genetics, Prolamins/classification, Prolamins/genetics, Untranslated RNA/genetics, Untranslated RNA/metabolism, RNA Sequence Analysis
ABSTRACT
HAMAP (High-quality Automated and Manual Annotation of Proteins, available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles.
Subjects
Protein Databases, Molecular Sequence Annotation, Proteins/classification, Eukaryota/genetics, Internet
ABSTRACT
ViralZone (http://viralzone.expasy.org) is a knowledge repository that allows users to learn about viruses including their virion structure, replication cycle and host-virus interactions. The information is divided into viral fact sheets that describe virion shape, molecular biology and epidemiology for each viral genus, with links to the corresponding annotated proteomes of UniProtKB. Each viral genus page contains detailed illustrations, text and PubMed references. This new update provides a linked view of viral molecular biology through 133 new viral ontology pages that describe common steps of viral replication cycles shared by several viral genera. This viral cell-cycle ontology is also represented in UniProtKB in the form of annotated keywords. In this way, users can navigate from the description of a replication-cycle event, to the viral genus concerned, and the associated UniProtKB protein records.
Subjects
Genetic Databases, Virus Physiological Phenomena, Viral Genome, Hepatitis B virus/physiology, Host-Pathogen Interactions, Internet, Viral Proteins/genetics, Virus Internalization, Virus Replication, Controlled Vocabulary
ABSTRACT
PROSITE (http://prosite.expasy.org/) consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules that increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE signatures, together with ProRule, are used for the annotation of domains and features of UniProtKB/Swiss-Prot entries. Here, we describe recent developments that allow users to perform whole-proteome annotation as well as a number of filtering options that can be combined to perform powerful targeted searches for biological discovery. The latest version of PROSITE (release 20.85, of 30 August 2012) contains 1308 patterns, 1039 profiles and 1041 ProRules.
Subjects
Amino Acid Motifs, Protein Databases, Tertiary Protein Structure, Protein Sequence Analysis, Amino Acid Sequence, Conserved Sequence, Internet, Molecular Sequence Annotation, Proteins/chemistry, Proteins/classification, Proteome/chemistry
ABSTRACT
ExPASy (http://www.expasy.org) has a worldwide reputation as one of the main bioinformatics resources for proteomics. It has now evolved, becoming an extensible and integrative portal accessing many scientific resources, databases and software tools in different areas of life sciences. Scientists can now seamlessly access a wide range of resources in many different domains, such as proteomics, genomics, phylogeny/evolution, systems biology, population genetics, transcriptomics, etc. The individual resources (databases, web-based and downloadable software tools) are hosted in a 'decentralized' way by different groups of the SIB Swiss Institute of Bioinformatics and partner institutions. Specifically, a single web portal provides a common entry point to a wide range of resources developed and operated by different SIB groups and external institutions. The portal features a search function across 'selected' resources. Additionally, the availability and usage of resources are monitored. The portal is aimed at both expert users and people who are not familiar with a specific domain in life sciences. The new web interface provides, in particular, visual guidance for newcomers to ExPASy.