Results 1 - 14 of 14
1.
Sci Data ; 11(1): 982, 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39251610

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.


Subject(s)
Enzymes , Natural Language Processing , Enzymes/chemistry , PubMed , Databases, Protein , Knowledge Bases
2.
ArXiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38903736

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

3.
Database (Oxford) ; 2022, 2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35411389

ABSTRACT

SwissBioPics (www.swissbiopics.org) is a freely available resource of interactive, high-resolution cell images designed for the visualization of subcellular location data. SwissBioPics provides images describing cell types from all kingdoms of life, from the specialized muscle, neuronal and epithelial cells of animals, to the rods, cocci, clubs and spirals of prokaryotes. All cell images in SwissBioPics are drawn in Scalable Vector Graphics (SVG), with each subcellular location tagged with a unique identifier from the controlled vocabulary of subcellular locations and organelles of UniProt (https://www.uniprot.org/locations/). Users can search and explore SwissBioPics cell images through our website, which provides a platform for users to learn more about how cells are organized. A web component allows developers to embed SwissBioPics images in their own websites, using the associated JavaScript and a styling template, and to highlight subcellular locations and organelles by simply providing the web component with the appropriate identifier(s) from the UniProt-controlled vocabulary or the 'Cellular Component' branch of the Gene Ontology (www.geneontology.org), as well as an organism identifier from the National Center for Biotechnology Information taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy). The UniProt website now uses SwissBioPics to visualize the subcellular locations and organelles where proteins function. SwissBioPics is freely available for anyone to use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. DATABASE URL: www.swissbiopics.org.


Subject(s)
Proteins , Vocabulary, Controlled , Animals
4.
J Alzheimers Dis ; 77(1): 257-273, 2020.
Article in English | MEDLINE | ID: mdl-32716361

ABSTRACT

BACKGROUND: The analysis and interpretation of data generated from patient-derived clinical samples relies on access to high-quality bioinformatics resources. These are maintained and updated by expert curators extracting knowledge from unstructured biological data described in free-text journal articles and converting this into more structured, computationally-accessible forms. This enables analyses such as functional enrichment of sets of genes/proteins using the Gene Ontology, and makes the searching of data more productive by managing issues such as gene/protein name synonyms, identifier mapping, and data quality. OBJECTIVE: To undertake a coordinated annotation update of key public-domain resources to better support Alzheimer's disease research. METHODS: We have systematically identified target proteins critical to disease process, in part by accessing informed input from the clinical research community. RESULTS: Data from 954 papers have been added to the UniProtKB, Gene Ontology, and the International Molecular Exchange Consortium (IMEx) databases, with 299 human proteins and 279 orthologs updated in UniProtKB. 745 binary interactions were added to the IMEx human molecular interaction dataset. CONCLUSION: This represents a significant enhancement in the expert curated data pertinent to Alzheimer's disease available in a number of biomedical databases. Relevant protein entries have been updated in UniProtKB and concomitantly in the Gene Ontology. Molecular interaction networks have been significantly extended in the IMEx Consortium dataset and a set of reference protein complexes created. All the resources described are open-source and freely available to the research community and we provide examples of how these data could be exploited by researchers.


Subject(s)
Alzheimer Disease/genetics , Computational Biology/methods , Databases, Protein , Expert Systems , Protein Interaction Maps/genetics , Public Sector , Alzheimer Disease/diagnosis , Humans
5.
PLoS Comput Biol ; 14(8): e1006390, 2018 08.
Article in English | MEDLINE | ID: mdl-30102703

ABSTRACT

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.


Subject(s)
Data Curation/methods , Information Storage and Retrieval/methods , Data Curation/statistics & numerical data , Databases, Genetic , Databases, Protein , Deep Learning , Genomics , Knowledge Bases , Machine Learning , Publications
6.
Article in English | MEDLINE | ID: mdl-26896845

ABSTRACT

Advances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We will show how manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation to build a comprehensive, high-quality and traceable resource. We will also illustrate how the complexity of the human proteome is captured and structured in UniProtKB. Database URL: www.uniprot.org.


Subject(s)
Databases, Protein , Proteome/genetics , Proteomics/methods , Automation , Genome , Humans , Knowledge Bases , Phenotype , Protein Processing, Post-Translational , Proteins/chemistry , RNA Editing , Software
7.
Hum Mutat ; 35(8): 927-35, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24848695

ABSTRACT

During the last few years, next-generation sequencing (NGS) technologies have accelerated the detection of genetic variants resulting in the rapid discovery of new disease-associated genes. However, the wealth of variation data made available by NGS alone is not sufficient to understand the mechanisms underlying disease pathogenesis and manifestation. Multidisciplinary approaches combining sequence and clinical data with prior biological knowledge are needed to unravel the role of genetic variants in human health and disease. In this context, it is crucial that these data are linked, organized, and made readily available through reliable online resources. The Swiss-Prot section of the Universal Protein Knowledgebase (UniProtKB/Swiss-Prot) provides the scientific community with a collection of information on protein functions, interactions, biological pathways, as well as human genetic diseases and variants, all manually reviewed by experts. In this article, we present an overview of the information content of UniProtKB/Swiss-Prot to show how this knowledgebase can support researchers in the elucidation of the mechanisms leading from a molecular defect to a disease phenotype.


Subject(s)
Databases, Protein/statistics & numerical data , Genetic Association Studies , Genetics, Medical , Knowledge Bases , Proteome , Software , Amino Acid Sequence , Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Internet , Molecular Sequence Annotation , Molecular Sequence Data , Terminology as Topic
8.
Nucleic Acids Res ; 42(Database issue): D358-63, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24234451

ABSTRACT

IntAct (freely available at http://www.ebi.ac.uk/intact) is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. IntAct has developed a sophisticated web-based curation tool, capable of supporting both IMEx- and MIMIx-level curation. This tool is now utilized by multiple additional curation teams, all of whom annotate data directly into the IntAct database. Members of the IntAct team supply appropriate levels of training, perform quality control on entries and take responsibility for long-term data maintenance. Recently, the MINT and IntAct databases decided to merge their separate efforts to make optimal use of limited developer resources and maximize the curation output. All data manually curated by the MINT curators have been moved into the IntAct database at EMBL-EBI and are merged with the existing IntAct dataset. Both IntAct and MINT are active contributors to the IMEx consortium (http://www.imexconsortium.org).


Subject(s)
Databases, Protein , Protein Interaction Mapping , Internet , Software
9.
Nucleic Acids Res ; 40(Database issue): D565-70, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22123736

ABSTRACT

The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidence-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set.


Subject(s)
Databases, Protein , Molecular Sequence Annotation , Vocabulary, Controlled , Molecular Sequence Annotation/standards
10.
Nucleic Acids Res ; 40(Database issue): D841-6, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22121220

ABSTRACT

IntAct is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. Two levels of curation are now available within the database, with both IMEx-level annotation and less detailed MIMIx-compatible entries currently supported. As of September 2011, IntAct contains approximately 275,000 curated binary interaction evidences from over 5000 publications. The IntAct website has been improved to enhance the search process and in particular the graphical display of the results. New data download formats are also available, which will facilitate the inclusion of IntAct's data in the Semantic Web. IntAct is an active contributor to the IMEx consortium (http://www.imexconsortium.org). IntAct source code and data are freely available at http://www.ebi.ac.uk/intact.


Subject(s)
Databases, Protein , Protein Interaction Mapping , Computer Graphics , Genes , Internet , Molecular Sequence Annotation , Sequence Analysis, Protein , Software
11.
J Biol Chem ; 279(45): 47242-53, 2004 Nov 05.
Article in English | MEDLINE | ID: mdl-15308636

ABSTRACT

Cycling proteins play important roles in the organization and function of the early secretory pathway by participating in membrane traffic and selective transport of cargo between the endoplasmic reticulum (ER), the intermediate compartment (ERGIC), and the Golgi. To identify new cycling proteins, we have developed a novel procedure for the purification of ERGIC membranes from HepG2 cells treated with brefeldin A, a drug known to accumulate cycling proteins in the ERGIC. Membranes enriched 110-fold over the homogenate for ERGIC-53 were obtained and analyzed by mass spectrometry. Major proteins corresponded to established and putative cargo receptors and components mediating protein maturation and membrane traffic. Among the uncharacterized proteins, a 32-kDa protein termed ERGIC-32 is a novel cycling membrane protein with sequence homology to Erv41p and Erv46p, two proteins enriched in COPII vesicles of yeast. ERGIC-32 localizes to the ERGIC and partially colocalizes with the human homologs of Erv41p and Erv46p, which mainly localize to the cis-Golgi. ERGIC-32 interacts with human Erv46 (hErv46) as revealed by covalent cross-linking and mistargeting experiments, and silencing of ERGIC-32 by small interfering RNAs increases the turnover of hErv46. We propose that ERGIC-32 functions as a modulator of the hErv41-hErv46 complex by stabilizing hErv46. Our novel approach for the isolation of the ERGIC from BFA-treated cells may ultimately lead to the identification of all proteins rapidly cycling early in the secretory pathway.


Subject(s)
Brefeldin A/pharmacology , Carrier Proteins/biosynthesis , Carrier Proteins/chemistry , Cell Membrane/metabolism , Endoplasmic Reticulum/metabolism , Golgi Apparatus/metabolism , Membrane Proteins/biosynthesis , Membrane Proteins/chemistry , Membrane Proteins/metabolism , Proteome/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Amino Acid Sequence , Animals , Cell Line , Cross-Linking Reagents/pharmacology , Cytoplasm/metabolism , Electrophoresis, Polyacrylamide Gel , HeLa Cells , Humans , Immunoblotting , Immunoprecipitation , Mass Spectrometry , Mice , Microscopy, Fluorescence , Models, Biological , Molecular Sequence Data , Phylogeny , Protein Binding , Protein Structure, Tertiary , RNA Interference , Saccharomyces cerevisiae Proteins/chemistry , Sequence Homology, Amino Acid , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Subcellular Fractions/metabolism , Vesicular Transport Proteins/chemistry
12.
J Cell Sci ; 115(Pt 23): 4457-67, 2002 Dec 01.
Article in English | MEDLINE | ID: mdl-12414992

ABSTRACT

In this work, we showed that in Caco-2 cells, a polarized cell line derived from human colon cancer that does not express caveolin 1 (Cav-1), there was no detectable expression of caveolin 2 (Cav-2). When Cav-2 was reintroduced in these cells, it accumulated in the Golgi complex. A chimera, in which the scaffolding domain of Cav-1 was replaced by the one from Cav-2, induced a prominent Golgi staining of Cav-1, strongly indicating that this domain was responsible for the accumulation of Cav-2 in the Golgi complex. Cav-2 was able to interact with Cav-1 in the Golgi complex but this interaction was not sufficient to export it from this compartment. Several chimeras between Cav-1 and 2 were used to show that surface expression of caveolin was necessary but not sufficient to promote caveolae formation. Interestingly, levels of incorporation of the chimeras into Triton insoluble rafts correlated with their ability to trigger caveolae formation raising the possibility that a critical concentration of caveolins to discrete domains of the plasma membrane might be necessary for caveolae formation.


Subject(s)
Caveolins/chemistry , Caveolins/metabolism , Golgi Apparatus/metabolism , Caco-2 Cells , Caveolae/metabolism , Caveolae/ultrastructure , Caveolin 1 , Caveolin 2 , Caveolins/genetics , Humans , Membrane Microdomains/metabolism , Microscopy, Immunoelectron , Protein Structure, Tertiary
13.
Exp Cell Res ; 273(2): 178-86, 2002 Feb 15.
Article in English | MEDLINE | ID: mdl-11822873

ABSTRACT

We have analyzed the respective roles of the stalk and/or the O-glycosylation sites in apical sorting by producing partially deleted mutants in this region of the human receptor for neurotrophins (P75(NTR)). The mere presence of O-glycosylations was not sufficient for efficient delivery to the apical surface since changing the stalk domain of P75(NTR) for the heavily O-glycosylated stalk from human decay-accelerating factor led to random distribution of the chimera. The presence of O-glycosylations, however, was a prerequisite for exit from the ER and protection from intracellular cleavage since a P75(NTR) containing the non O-glycosylated stalk of the human placental alkaline phosphatase was not transported to the cell surface but was cleaved and secreted from the basolateral side. Deletion of the membrane-proximal part of the stalk showed a more dramatic reversal of polarity of P75(NTR) than the deletion of the distal part. Furthermore, moving the first putative O-glycosylation site (T216) two amino acids away from the membrane resulted in a loss of apical polarity of P75(NTR), suggesting that an important clue for apical sorting resides in this part of the stalk. This loss of apical polarity paralleled a loss of association of P75(NTR) mutants with Lubrol rafts. These data indicate that the position of O-glycans in the proximal part of the stalk domain of P75(NTR) is crucial for apical sorting and may regulate association with apical rafts.


Subject(s)
Polysaccharides/metabolism , Receptors, Nerve Growth Factor/metabolism , Animals , Binding Sites , Caco-2 Cells , Cell Line , Cell Membrane/metabolism , Dogs , Endoplasmic Reticulum/metabolism , Glycosylation , Humans , Receptor, Nerve Growth Factor
14.
Biochem Soc Symp ; (69): 73-82, 2002.
Article in English | MEDLINE | ID: mdl-12655775

ABSTRACT

Lectins of the early secretory pathway are involved in selective transport of newly synthesized glycoproteins from the endoplasmic reticulum (ER) to the ER-Golgi intermediate compartment (ERGIC). The most prominent cycling lectin is the mannose-binding type I membrane protein ERGIC-53 (ERGIC protein of 53 kDa), a marker for the ERGIC, which functions as a cargo receptor to facilitate export of an increasing number of glycoproteins with different characteristics from the ER. Two ERGIC-53-related proteins, VIP36 (vesicular integral membrane protein 36) and a novel ERGIC-53-like protein, ERGL, are also found in the early secretory pathway. ERGL may act as a regulator of ERGIC-53. Studies of ERGIC-53 continue to provide new insights into the organization and dynamics of the early secretory pathway. Analysis of the cycling of ERGIC-53 uncovered a complex interplay of trafficking signals and revealed novel cytoplasmic ER-export motifs that interact with COP-II coat proteins. These motifs are common to type I and polytopic membrane proteins including presenilin 1 and presenilin 2. The results support the notion that protein export from the ER is selective.


Subject(s)
Lectins/metabolism , Mannose-Binding Lectins/metabolism , Membrane Proteins/metabolism , Amino Acid Sequence , Animals , Lectins/chemistry , Mannose-Binding Lectins/chemistry , Membrane Proteins/chemistry , Molecular Sequence Data , Protein Transport , Sequence Homology, Amino Acid , Signal Transduction