Results 1 - 20 of 20
1.
ArXiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38903736

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
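The F1 scores quoted in this abstract combine precision and recall. As a minimal sketch of how such a score can be computed for relation extraction, the Python below compares a set of predicted chemical conversion pairs against a gold-standard set. The function name and the placeholder pair identifiers are hypothetical, and the abstract does not say how its averages were aggregated, so this shows only the basic per-set calculation.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted items.

    Items can be any hashable values, e.g. (substrate, product) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical conversion pairs (placeholders, not real ChEBI identifiers).
gold_pairs = {("s1", "p1"), ("s2", "p2"), ("s3", "p3"), ("s4", "p4")}
predicted_pairs = {("s1", "p1"), ("s2", "p2"), ("s9", "p9")}

precision, recall, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predictions are correct and two of the four gold pairs are recovered, so precision is 2/3, recall is 1/2, and F1 is their harmonic mean.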

2.
Database (Oxford) ; 2022, 2022 08 12.
Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.


Subject(s)
Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation
3.
Nucleic Acids Res ; 50(D1): D693-D700, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34755880

ABSTRACT

Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new Rhea website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of Rhea as a reference resource to study and engineer enzymes and the metabolic systems in which they function.


Subject(s)
Chemical Phenomena , Factual Databases , Software , Animals , Humans , Internet , Knowledge Bases
4.
Metabolites ; 11(1), 2021 Jan 12.
Article in English | MEDLINE | ID: mdl-33445429

ABSTRACT

The UniProt Knowledgebase UniProtKB is a comprehensive, high-quality, and freely accessible resource of protein sequences and functional annotation that covers genomes and proteomes from tens of thousands of taxa, including a broad range of plants and microorganisms producing natural products of medical, nutritional, and agronomical interest. Here we describe work that enhances the utility of UniProtKB as a support for both the study of natural products and for their discovery. The foundation of this work is an improved representation of natural product metabolism in UniProtKB using Rhea, an expert-curated knowledgebase of biochemical reactions, that is built on the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Knowledge of natural products and precursors is captured in ChEBI, enzyme-catalyzed reactions in Rhea, and enzymes in UniProtKB/Swiss-Prot, thereby linking chemical structure data directly to protein knowledge. We provide a practical demonstration of how users can search UniProtKB for protein knowledge relevant to natural products through interactive or programmatic queries using metabolite names and synonyms, chemical identifiers, chemical classes, and chemical structures and show how to federate UniProtKB with other data and knowledge resources and tools using semantic web technologies such as RDF and SPARQL. All UniProtKB data are freely available for download in a broad range of formats for users to further mine or exploit as an annotation source, to enrich other natural product datasets and databases.

5.
Bioinformatics ; 36(6): 1896-1901, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31688925

ABSTRACT

MOTIVATION: To provide high quality computationally tractable enzyme annotation in UniProtKB using Rhea, a comprehensive expert-curated knowledgebase of biochemical reactions which describes reaction participants using the ChEBI (Chemical Entities of Biological Interest) ontology. RESULTS: We replaced existing textual descriptions of biochemical reactions in UniProtKB with their equivalents from Rhea, which is now the standard for annotation of enzymatic reactions in UniProtKB. We developed improved search and query facilities for the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that Rhea and ChEBI provide. AVAILABILITY AND IMPLEMENTATION: UniProtKB at https://www.uniprot.org; UniProt REST API at https://www.uniprot.org/help/api; UniProt SPARQL endpoint at https://sparql.uniprot.org/; Rhea at https://www.rhea-db.org.


Subject(s)
Rheiformes , Animals , Protein Databases , Knowledge Bases
6.
Nucleic Acids Res ; 47(D1): D596-D600, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30272209

ABSTRACT

Rhea (http://www.rhea-db.org) is a comprehensive and non-redundant resource of over 11 000 expert-curated biochemical reactions that uses chemical entities from the ChEBI ontology to represent reaction participants. Originally designed as an annotation vocabulary for the UniProt Knowledgebase (UniProtKB), Rhea also provides reaction data for a range of other core knowledgebases and data repositories including ChEBI and MetaboLights. Here we describe recent developments in Rhea, focusing on a new resource description framework representation of Rhea reaction data and an SPARQL endpoint (https://sparql.rhea-db.org/sparql) that provides access to it. We demonstrate how federated queries that combine the Rhea SPARQL endpoint and other SPARQL endpoints such as that of UniProt can provide improved metabolite annotation and support integrative analyses that link the metabolome through the proteome to the transcriptome and genome. These developments will significantly boost the utility of Rhea as a means to link chemistry and biology for a more holistic understanding of biological systems and their function in health and disease.


Subject(s)
Chemical Databases , Protein Databases , Metabolomics/methods , Software/standards , Humans , Knowledge Bases , Systems Biology/methods
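The abstract above describes a SPARQL endpoint for Rhea at https://sparql.rhea-db.org/sparql. The sketch below shows one way a client might build a request for it in Python, using only the standard library. It assumes the `rh:` prefix `http://rdf.rhea-db.org/` and the `rh:Reaction` and `rh:equation` terms as they appear in Rhea's published example queries; the function names are hypothetical, and `run_query` needs network access and is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL as given in the abstract.
RHEA_SPARQL_ENDPOINT = "https://sparql.rhea-db.org/sparql"


def build_query(limit=5):
    """Return a SPARQL query listing a few reactions and their equations.

    The rh: vocabulary terms are assumptions based on Rhea's example queries.
    """
    return (
        "PREFIX rh: <http://rdf.rhea-db.org/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?reaction ?equation\n"
        "WHERE {\n"
        "  ?reaction rdfs:subClassOf rh:Reaction .\n"
        "  ?reaction rh:equation ?equation .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query, endpoint=RHEA_SPARQL_ENDPOINT):
    """Encode the query as a SPARQL-protocol GET URL."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(query):
    """Send the query and return the result bindings (requires network)."""
    request = Request(
        build_request_url(query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


url = build_request_url(build_query(limit=3))
print(url)
```

A federated query of the kind the abstract mentions would add a `SERVICE` clause pointing at a second endpoint, such as that of UniProt, inside the `WHERE` block.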
8.
Nucleic Acids Res ; 45(D1): D415-D418, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27789701

ABSTRACT

Rhea (http://www.rhea-db.org) is a comprehensive and non-redundant resource of expert-curated biochemical reactions designed for the functional annotation of enzymes and the description of metabolic networks. Rhea describes enzyme-catalyzed reactions covering the IUBMB Enzyme Nomenclature list as well as additional reactions, including spontaneously occurring reactions, using entities from the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Here we describe developments in Rhea since our last report in the database issue of Nucleic Acids Research. These include the first implementation of a simple hierarchical classification of reactions, improved coverage of the IUBMB Enzyme Nomenclature list and additional reactions through continuing expert curation, and the development of a new website to serve this improved dataset.

9.
Environ Microbiol ; 18(10): 3403-3424, 2016 10.
Article in English | MEDLINE | ID: mdl-26913973

ABSTRACT

By the time the complete genome sequence of the soil bacterium Pseudomonas putida KT2440 was published in 2002 (Nelson et al., ), this bacterium was considered a potential agent for environmental bioremediation of industrial waste and a good colonizer of the rhizosphere. However, neither the annotation tools available at that time nor the scarcely available omics data (let alone metabolic modeling and other now-common systems biology approaches) allowed them to anticipate the astonishing capacities that are encoded in the genetic complement of this unique microorganism. In this work we have adopted a suite of state-of-the-art genomic analysis tools to revisit the functional and metabolic information encoded in the chromosomal sequence of strain KT2440. We identified 242 new protein-coding genes and re-annotated the functions of 1548 genes, which are linked to almost 4900 PubMed references. Catabolic pathways for 92 compounds (carbon, nitrogen and phosphorus sources) that could not be accommodated by the previously constructed metabolic models were also predicted. The resulting examination not only accounts for some of the stress tolerance traits known in P. putida but also recognizes the capacity of this bacterium to perform difficult redox reactions, thereby multiplying its value as a platform microorganism for industrial biotechnology.


Subject(s)
Bacterial Genome , Pseudomonas putida/genetics , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Carbon/metabolism , Genomics , Nitrogen/metabolism , Pseudomonas putida/metabolism
10.
Nucleic Acids Res ; 44(D1): D523-6, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26527720

ABSTRACT

MetaNetX is a repository of genome-scale metabolic networks (GSMNs) and biochemical pathways from a number of major resources imported into a common namespace, namely MNXref, of chemical compounds, reactions, cellular compartments and proteins. The MetaNetX.org website (http://www.metanetx.org/) provides access to these integrated data as well as a variety of tools that allow users to import their own GSMNs, map them to the MNXref reconciliation, and manipulate, compare, analyze, simulate (using flux balance analysis) and export the resulting GSMNs. MNXref and MetaNetX are regularly updated and freely available.


Subject(s)
Chemical Databases , Genome , Metabolic Networks and Pathways/genetics , Molecular Structure , Software
11.
Nucleic Acids Res ; 43(Database issue): D459-64, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25332395

ABSTRACT

Rhea (http://www.ebi.ac.uk/rhea) is a comprehensive and non-redundant resource of expert-curated biochemical reactions described using species from the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Rhea has been designed for the functional annotation of enzymes and the description of genome-scale metabolic networks, providing stoichiometrically balanced enzyme-catalyzed reactions (covering the IUBMB Enzyme Nomenclature list and additional reactions), transport reactions and spontaneously occurring reactions. Rhea reactions are extensively curated with links to source literature and are mapped to other publicly available enzyme and pathway databases such as Reactome, BioCyc, KEGG and UniPathway, through manual curation and computational methods. Here we describe developments in Rhea since our last report in the 2012 database issue of Nucleic Acids Research. These include significant growth in the number of Rhea reactions and the inclusion of reactions involving complex macromolecules such as proteins, nucleic acids and other polymers that lie outside the scope of ChEBI. Together these developments will significantly increase the utility of Rhea as a tool for the description, analysis and reconciliation of genome-scale metabolic models.


Subject(s)
Chemical Databases , Enzymes/metabolism , Metabolic Networks and Pathways , Biochemical Phenomena , Biopolymers/metabolism , Genomics , Internet , Metabolic Networks and Pathways/genetics
12.
Brief Bioinform ; 15(1): 123-35, 2014 Jan.
Article in English | MEDLINE | ID: mdl-23172809

ABSTRACT

Genome-scale metabolic network reconstructions are now routinely used in the study of metabolic pathways, their evolution and design. The development of such reconstructions involves the integration of information on reactions and metabolites from the scientific literature as well as public databases and existing genome-scale metabolic models. The reconciliation of discrepancies between data from these sources generally requires significant manual curation, which constitutes a major obstacle in efforts to develop and apply genome-scale metabolic network reconstructions. In this work, we discuss some of the major difficulties encountered in the mapping and reconciliation of metabolic resources and review three recent initiatives that aim to accelerate this process, namely BKM-react, MetRxn and MNXref (presented in this article). Each of these resources provides a pre-compiled reconciliation of many of the most commonly used metabolic resources. By reducing the time required for manual curation of metabolite and reaction discrepancies, these resources aim to accelerate the development and application of high-quality genome-scale metabolic network reconstructions and models.


Subject(s)
Metabolic Networks and Pathways , Computational Biology , Computer Simulation , Factual Databases/statistics & numerical data , Genomics/statistics & numerical data , Metabolic Networks and Pathways/genetics , Biological Models , Molecular Structure , Software
13.
Microbiology (Reading) ; 159(Pt 4): 757-770, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23429746

ABSTRACT

Continuous updating of the genome sequence of Bacillus subtilis, the model of the Firmicutes, is a basic requirement needed by the biology community. In this work new genomic objects have been included (toxin/antitoxin genes and small RNA genes) and the metabolic network has been entirely updated. The curated view of the validated metabolic pathways present in the organism as of 2012 shows several significant differences from pathways present in the other bacterial reference, Escherichia coli: variants in synthesis of cofactors (thiamine, biotin, bacillithiol), amino acids (lysine, methionine), branched-chain fatty acids, tRNA modification and RNA degradation. In this new version, gene products that are enzymes or transporters are explicitly linked to the biochemical reactions of the RHEA reaction resource (http://www.ebi.ac.uk/rhea/), while novel compound entries have been created in the database Chemical Entities of Biological Interest (http://www.ebi.ac.uk/chebi/). The newly annotated sequence is deposited at the International Nucleotide Sequence Data Collaboration with accession number AL009126.4.


Subject(s)
Bacillus subtilis/metabolism , Bacterial Proteins/metabolism , Bacterial Genome , Metabolic Networks and Pathways/genetics , Bacillus subtilis/genetics , Bacterial Proteins/genetics , Genomics , Molecular Sequence Annotation , Molecular Sequence Data , DNA Sequence Analysis
14.
Nucleic Acids Res ; 40(Database issue): D754-60, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22135291

ABSTRACT

Rhea (http://www.ebi.ac.uk/rhea) is a comprehensive resource of expert-curated biochemical reactions. Rhea provides a non-redundant set of chemical transformations for use in a broad spectrum of applications, including metabolic network reconstruction and pathway inference. Rhea includes enzyme-catalyzed reactions (covering the IUBMB Enzyme Nomenclature list), transport reactions and spontaneously occurring reactions. Rhea reactions are described using chemical species from the Chemical Entities of Biological Interest ontology (ChEBI) and are stoichiometrically balanced for mass and charge. They are extensively manually curated with links to source literature and other public resources on metabolism including enzyme and pathway databases. This cross-referencing facilitates the mapping and reconciliation of common reactions and compounds between distinct resources, which is a common first step in the reconstruction of genome scale metabolic networks and models.


Subject(s)
Biochemical Phenomena , Factual Databases , Enzymes/metabolism , Internet , Metabolic Networks and Pathways , Software
15.
Nucleic Acids Res ; 40(Database issue): D761-9, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22102589

ABSTRACT

UniPathway (http://www.unipathway.org) is a fully manually curated resource for the representation and annotation of metabolic pathways. UniPathway provides explicit representations of enzyme-catalyzed and spontaneous chemical reactions, as well as a hierarchical representation of metabolic pathways. This hierarchy uses linear subpathways as the basic building block for the assembly of larger and more complex pathways, including species-specific pathway variants. All of the pathway data in UniPathway has been extensively cross-linked to existing pathway resources such as KEGG and MetaCyc, as well as sequence resources such as the UniProt KnowledgeBase (UniProtKB), for which UniPathway provides a controlled vocabulary for pathway annotation. We introduce here the basic concepts underlying the UniPathway resource, with the aim of allowing users to fully exploit the information provided by UniPathway.


Subject(s)
Factual Databases , Metabolic Networks and Pathways , Protein Databases , Enzymes/metabolism , Lysine/biosynthesis , Molecular Sequence Annotation
16.
Infect Genet Evol ; 8(4): 459-66, 2008 Jul.
Article in English | MEDLINE | ID: mdl-17644446

ABSTRACT

Ehrlichia ruminantium is the causative agent of heartwater, a major tick-borne disease of livestock in Africa that has been introduced in the Caribbean and is threatening to emerge and spread on the American mainland. Complete genome sequencing was done for two isolates of E. ruminantium of differing phenotype, isolates Gardel (Erga) from Guadeloupe Island and Welgevonden (Erwe) originating from South Africa and maintained in Guadeloupe. The type strain of E. ruminantium (Erwo), previously isolated and sequenced in South Africa, is identical to Erwe with respect to target genes. Together they form the Erwe/Erwo complex. Comparative analysis of the genomes shows the presence of 49 unique CDS and 28 truncated CDS differentiating Erga from Erwe/Erwo. Three regions of accumulated differences (RAD) acting as mutational hot spots were identified in E. ruminantium. Ten CDS, six unique CDS and four truncated CDS corresponding to major genomic changes (deletions or extensive mutations) were considered as targets for differential diagnosis on four isolates of E. ruminantium: Erga, Erwe/Erwo, Senegal and Umpala. Pairs of PCR primers were developed for each target gene. PCR analysis of the target genes generated strain-specific patterns on Erga and Erwe/Erwo as predicted by comparative genomics, but also for isolates Senegal and Umpala. The target genes identified by bacterial comparative genomics are shown to be highly efficient for strain-specific PCR diagnosis of E. ruminantium and further vaccine management tools.


Subject(s)
Ehrlichia ruminantium/isolation & purification , Heartwater Disease/diagnosis , Heartwater Disease/microbiology , Animals , Cattle , Cattle Diseases/diagnosis , Cattle Diseases/microbiology , Cultured Cells , Bacterial DNA/analysis , Bacterial DNA/isolation & purification , Ehrlichia ruminantium/genetics , Female , Bacterial Genome , Geography , Goats , Mice , Sheep , Species Specificity
17.
Ann N Y Acad Sci ; 1081: 417-33, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17135545

ABSTRACT

The tick-borne Rickettsiales bacterium Ehrlichia ruminantium (E. ruminantium) is the causative agent of heartwater in Africa and the Caribbean. Heartwater, responsible for major livestock losses in Africa, also represents a threat to the American mainland. Three complete genomes corresponding to two different groups of differing phenotypes, Gardel and Welgevonden, have been recently described. One genome (Erga) represents the Gardel group from Guadeloupe Island and two genomes (Erwo and Erwe) belong to the Welgevonden group. Erwo, isolated in South Africa, is the parental strain of Erwe, which was maintained for 18 years in Guadeloupe under culture conditions different from those of Erwo. The three strains display genomes of differing sizes with 1,499,920 bp, 1,512,977 bp, and 1,516,355 bp for Erga, Erwe, and Erwo, respectively. Gene sequences and order are highly conserved between the three strains, although several gene truncations could be pinpointed, most of them occurring within three regions of accumulated differences (RAD). E. ruminantium displays a strong leading/lagging compositional bias inducing a strand-specific codon usage. Finally, a striking feature of E. ruminantium is the presence of long intergenic regions containing tandem repeats. These repeats are at the origin of an active process, specific to E. ruminantium, of genome expansion/contraction based on the addition or removal of tandem units.


Subject(s)
Ehrlichia ruminantium/genetics , Molecular Evolution , Bacterial Genome , Tandem Repeat Sequences/genetics , Animals , Conserved Sequence , Molecular Sequence Data , Molecular Weight , Species Specificity
18.
J Bacteriol ; 188(7): 2533-42, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16547041

ABSTRACT

Ehrlichia ruminantium is the causative agent of heartwater, a major tick-borne disease of livestock in Africa that has been introduced in the Caribbean and is threatening to emerge and spread on the American mainland. We sequenced the complete genomes of two strains of E. ruminantium of differing phenotypes, strains Gardel (Erga; 1,499,920 bp), from the island of Guadeloupe, and Welgevonden (Erwe; 1,512,977 bp), originating in South Africa and maintained in Guadeloupe in a different cell environment. Comparative genomic analysis of these two strains was performed with the recently published parent strain of Erwe (Erwo) and other Rickettsiales (Anaplasma, Wolbachia, and Rickettsia spp.). Gene order is highly conserved between the E. ruminantium strains and with A. marginale. In contrast, there is very little conservation of gene order with members of the Rickettsiaceae. However, gene order may be locally conserved, as illustrated by the tuf operons. Eighteen truncated protein-encoding sequences (CDSs) differentiate Erga from Erwe/Erwo, whereas four other truncated CDSs differentiate Erwe from Erwo. Moreover, E. ruminantium displays the lowest coding ratio observed among bacteria due to unusually long intergenic regions. This is related to an active process of genome expansion/contraction targeted at tandem repeats in noncoding regions and based on the addition or removal of ca. 150-bp tandem units. This process seems to be specific to E. ruminantium and is not observed in the other Rickettsiales.


Subject(s)
Ehrlichia ruminantium/classification , Ehrlichia ruminantium/genetics , Molecular Evolution , Genetic Variation/genetics , Bacterial Genome , Mutagenesis/genetics , Conserved Sequence , Gene Order , Molecular Sequence Data , Phenotype , Species Specificity , Tandem Repeat Sequences/genetics
19.
Bioinformatics ; 21(23): 4209-15, 2005 Dec 01.
Article in English | MEDLINE | ID: mdl-16216829

ABSTRACT

MOTIVATION: Modern comparative genomics is not restricted to sequence comparison but also involves the comparison of metabolic pathways or protein-protein interactions. Central to this approach is the concept of neighbourhood between entities (genes, proteins, chemical compounds). There is therefore a growing need for new methods that merge the connectivity information from different biological sources in order to infer functional coupling. RESULTS: We present a generic approach to merge the information from two or more graphs representing biological data. The method is based on two concepts. The first one, the correspondence multigraph, precisely defines how correspondence is performed between the primary data-graphs. The second one, the common connected components, defines which property of the multigraph is searched for. Although this problem has already been informally stated in the past few years, we give here a formal and general statement together with an exact algorithm to solve it. AVAILABILITY: The algorithm presented in this paper has been implemented in C. Source code is freely available for download at: http://www.inrialpes.fr/helix/people/viari/cccpart.


Subject(s)
Computational Biology/methods , Genome , Genomics/methods , Protein Interaction Mapping , Algorithms , Cluster Analysis , Computer Graphics , Protein Databases , Escherichia coli/metabolism , Molecular Evolution , Bacterial Genes , Bacterial Genome , Biological Models , Genetic Models , Statistical Models , Ribosomal RNA/chemistry , Transfer RNA/chemistry , Software
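The idea of common connected components in the abstract above can be illustrated with a simplified sketch. The Python below assumes two undirected graphs that share the same vertex set (a one-to-one correspondence, which is narrower than the correspondence multigraph the abstract describes) and finds the maximal vertex sets that are connected in both. It uses a naive repeated-refinement strategy and is not the exact algorithm or the C implementation referred to in the abstract. All function names are hypothetical.

```python
from collections import deque


def adjacency_from_edges(edges):
    """Build an undirected adjacency mapping from (u, v) edge pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def components(vertices, adjacency):
    """Connected components of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    seen = set()
    result = []
    for start in vertices:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search restricted to `vertices`
            v = queue.popleft()
            for w in adjacency.get(v, ()):
                if w in vertices and w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        result.append(component)
    return result


def common_connected_components(vertices, adjacency1, adjacency2):
    """Maximal vertex sets whose induced subgraphs are connected in both graphs."""
    pending = [set(vertices)]
    finished = []
    while pending:
        block = pending.pop()
        # Split the block by graph 1, then split each piece by graph 2.
        parts = [
            part2
            for part1 in components(block, adjacency1)
            for part2 in components(part1, adjacency2)
        ]
        if len(parts) == 1:
            finished.append(parts[0])  # connected in both graphs: stable
        else:
            pending.extend(parts)  # smaller blocks must be re-checked
    return finished


# Hypothetical example: graph 1 could encode gene neighbourhood on the
# chromosome, graph 2 adjacency of the encoded enzymes in a metabolic network.
graph1 = adjacency_from_edges([("a", "b"), ("b", "c"), ("d", "e")])
graph2 = adjacency_from_edges([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
result = common_connected_components("abcde", graph1, graph2)
print(sorted(sorted(component) for component in result))
```

Blocks are re-checked after every split because removing vertices can disconnect a set that was connected in the larger graph. A set that is connected in both graphs is never split, so the final partition is the coarsest one possible.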
20.
Curr Opin Drug Discov Devel ; 6(3): 346-52, 2003 May.
Article in English | MEDLINE | ID: mdl-12833667

ABSTRACT

The development of genomic and post-genomic technologies has created an explosion in the quantity, diversity and availability of both biological data and methods of analysis. Biologists are currently facing the problem of using all these resources to convert raw data into new valuable knowledge. This review presents software platforms designed to handle data and/or methods in the context of genome analysis.


Subject(s)
Database Management Systems , Genetic Databases , Genome , DNA Sequence Analysis/methods , Animals , Database Management Systems/trends , Genetic Databases/trends , Human Genome , Humans , DNA Sequence Analysis/trends