Search | VHL Search Portal

1.

BioRED: a rich biomedical relation extraction dataset.

Luo, Ling; Lai, Po-Ting; Wei, Chih-Hsuan; Arighi, Cecilia N; Lu, Zhiyong.

Brief Bioinform ; 23(5), 2022 09 20.

Article in English | MEDLINE | ID: mdl-35849818

ABSTRACT

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.

Subjects

Algorithms , Data Mining , Proteins , PubMed
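
The F-scores quoted in the abstract above are the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for document-level relation extraction, assuming relations are compared as exact-match tuples. The tuple layout, identifiers and relation labels are hypothetical and are not taken from the BioRED evaluation scripts.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical tuple layout: (document_id, entity_a, entity_b, relation_type)
Relation = Tuple[str, str, str, str]


def f_score(gold: Iterable[Relation], predicted: Iterable[Relation]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over sets of relation tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative (made-up) annotations for two documents.
gold = [
    ("doc1", "GENE:1", "DISEASE:A", "association"),
    ("doc1", "CHEM:X", "CHEM:Y", "association"),
    ("doc2", "GENE:2", "DISEASE:B", "positive_correlation"),
]
predicted = [
    ("doc1", "GENE:1", "DISEASE:A", "association"),
    # Right entity pair, wrong relation type: counts as an error here.
    ("doc2", "GENE:2", "DISEASE:B", "association"),
]

scores = f_score(gold, predicted)
print(scores)
```

Under exact matching, a prediction with the right entity pair but the wrong relation type earns no credit, which is one reason relation extraction scores tend to sit well below entity recognition scores.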

2.

A crowdsourcing open platform for literature curation in UniProt.

Wang, Yuqi; Wang, Qinghua; Huang, Hongzhan; Huang, Wei; Chen, Yongxing; McGarvey, Peter B; Wu, Cathy H; Arighi, Cecilia N.

PLoS Biol ; 19(12): e3001464, 2021 12.

Article in English | MEDLINE | ID: mdl-34871295

ABSTRACT

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.

Subjects

Crowdsourcing/methods , Data Curation/methods , Molecular Sequence Annotation/methods , Amino Acid Sequence/genetics , Computational Biology/methods , Databases, Protein/trends , Humans , Literature , Proteins/metabolism , Stakeholder Participation

3.

Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research.

Hufsky, Franziska; Lamkiewicz, Kevin; Almeida, Alexandre; Aouacheria, Abdel; Arighi, Cecilia; Bateman, Alex; Baumbach, Jan; Beerenwinkel, Niko; Brandt, Christian; Cacciabue, Marco; Chuguransky, Sara; Drechsel, Oliver; Finn, Robert D; Fritz, Adrian; Fuchs, Stephan; Hattab, Georges; Hauschild, Anne-Christin; Heider, Dominik; Hoffmann, Marie; Hölzer, Martin; Hoops, Stefan; Kaderali, Lars; Kalvari, Ioanna; von Kleist, Max; Kmiecinski, Renó; Kühnert, Denise; Lasso, Gorka; Libin, Pieter; List, Markus; Löchel, Hannah F; Martin, Maria J; Martin, Roman; Matschinske, Julian; McHardy, Alice C; Mendes, Pedro; Mistry, Jaina; Navratil, Vincent; Nawrocki, Eric P; O'Toole, Áine Niamh; Ontiveros-Palacios, Nancy; Petrov, Anton I; Rangel-Pineros, Guillermo; Redaschi, Nicole; Reimering, Susanne; Reinert, Knut; Reyes, Alejandro; Richardson, Lorna; Robertson, David L; Sadegh, Sepideh; Singer, Joshua B.

Brief Bioinform ; 22(2): 642-663, 2021 03 22.

Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@uni-jena.de.

Subjects

COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Genome, Viral , Humans , Pandemics , SARS-CoV-2/genetics

4.

Utilizing image and caption information for biomedical document classification.

Li, Pengyuan; Jiang, Xiangying; Zhang, Gongbo; Trabucco, Juan Trelles; Raciti, Daniela; Smith, Cynthia; Ringwald, Martin; Marai, G Elisabeta; Arighi, Cecilia; Shatkay, Hagit.

Bioinformatics ; 37(Suppl_1): i468-i476, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252939

ABSTRACT

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results. RESULTS: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance. AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.

Subjects

Biomedical Research , Databases, Factual
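
The two information integration methods named in the abstract above (one larger concatenated vector, and a meta-classification scheme) can be illustrated with a small sketch. This is not the authors' implementation: the function names, vector sizes, weights and bias below are hypothetical, and in practice the meta-level weights would be learned from base-classifier outputs on training data rather than fixed by hand.

```python
import math
from typing import List, Sequence


def concatenate_features(
    figure_words: Sequence[float],
    caption_vec: Sequence[float],
    title_abstract_vec: Sequence[float],
) -> List[float]:
    """Method 1 (sketch): join the three feature vectors into one larger vector."""
    return list(figure_words) + list(caption_vec) + list(title_abstract_vec)


def meta_classify(
    base_scores: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
    threshold: float = 0.5,
) -> bool:
    """Method 2 (sketch): a second-level classifier over base-classifier scores.

    Each base score is assumed to be the relevance probability produced by a
    classifier trained on a single information source. The meta-classifier
    here is a logistic combination with hand-picked (hypothetical) weights.
    """
    z = bias + sum(w * s for w, s in zip(weights, base_scores))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold


# Toy vectors standing in for Figure-words, caption and title-and-abstract features.
combined = concatenate_features([1.0, 0.0, 2.0], [0.3, 0.7], [0.1, 0.9, 0.5])

# Hypothetical per-source relevance scores for two documents.
relevant = meta_classify([0.9, 0.7, 0.2], weights=[2.0, 2.0, 2.0], bias=-3.0)
irrelevant = meta_classify([0.1, 0.2, 0.3], weights=[2.0, 2.0, 2.0], bias=-3.0)
print(len(combined), relevant, irrelevant)
```

The concatenation route lets a single classifier see all features at once, while the meta-classification route keeps one classifier per source and combines only their outputs.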

5.

UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase.

MacDougall, Alistair; Volynkin, Vladimir; Saidi, Rabie; Poggioli, Diego; Zellner, Hermann; Hatton-Ellis, Emma; Joshi, Vishal; O'Donovan, Claire; Orchard, Sandra; Auchincloss, Andrea H; Baratin, Delphine; Bolleman, Jerven; Coudert, Elisabeth; de Castro, Edouard; Hulo, Chantal; Masson, Patrick; Pedruzzi, Ivo; Rivoire, Catherine; Arighi, Cecilia; Wang, Qinghua; Chen, Chuming; Huang, Hongzhan; Garavelli, John; Vinayaka, C R; Yeh, Lai-Su; Natale, Darren A; Laiho, Kati; Martin, Maria-Jesus; Renaux, Alexandre; Pichler, Klemens.

Bioinformatics ; 36(17): 4643-4648, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32399560

ABSTRACT

MOTIVATION: The number of protein records in the UniProt Knowledgebase (UniProtKB: https://www.uniprot.org) continues to grow rapidly as a result of genome sequencing and the prediction of protein-coding genes. Providing functional annotation for these proteins presents a significant and continuing challenge. RESULTS: In response to this challenge, UniProt has developed a method of annotation, known as UniRule, based on expertly curated rules, which integrates related systems (RuleBase, HAMAP, PIRSR, PIRNR) developed by the members of the UniProt consortium. UniRule uses protein family signatures from InterPro, combined with taxonomic and other constraints, to select sets of reviewed proteins which have common functional properties supported by experimental evidence. This annotation is propagated to unreviewed records in UniProtKB that meet the same selection criteria, most of which do not have (and are never likely to have) experimentally verified functional annotation. Release 2020_01 of UniProtKB contains 6496 UniRule rules which provide annotation for 53 million proteins, accounting for 30% of the 178 million records in UniProtKB. UniRule provides scalable enrichment of annotation in UniProtKB. AVAILABILITY AND IMPLEMENTATION: UniRule rules are integrated into UniProtKB and can be viewed at https://www.uniprot.org/unirule/. UniRule rules and the code required to run the rules are publicly available for researchers who wish to annotate their own sequences. The implementation used to run the rules is known as UniFIRE and is available at https://gitlab.ebi.ac.uk/uniprot-public/unifire.

Subjects

Knowledge Bases , Proteins , Chromosome Mapping , Databases, Protein , Molecular Sequence Annotation , Proteins/genetics
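
The rule-based propagation described in the abstract above (family signatures plus a taxonomic constraint select the unreviewed records that receive an annotation) can be sketched as follows. This is an illustration only and does not reflect the UniRule rule format or the UniFIRE code: every class, field, identifier and annotation string is hypothetical. The last line checks the coverage figure quoted in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Rule:
    rule_id: str                   # hypothetical identifier
    required_signatures: Set[str]  # family signatures that must all be present
    taxon_scope: str               # taxon that must appear in the protein lineage
    annotation: str                # annotation text to propagate


@dataclass
class ProteinRecord:
    accession: str
    signatures: Set[str]
    lineage: List[str]
    reviewed: bool
    annotations: List[str] = field(default_factory=list)


def rule_matches(rule: Rule, protein: ProteinRecord) -> bool:
    """A rule applies when all signatures are present and the taxon is in scope."""
    has_signatures = rule.required_signatures <= protein.signatures
    in_scope = rule.taxon_scope in protein.lineage
    return has_signatures and in_scope


def propagate(rules: List[Rule], proteins: List[ProteinRecord]) -> int:
    """Annotate unreviewed records in place; return how many received an annotation."""
    annotated = 0
    for protein in proteins:
        if protein.reviewed:
            continue  # reviewed records already carry curated annotation
        matched = [rule for rule in rules if rule_matches(rule, protein)]
        for rule in matched:
            protein.annotations.append(f"{rule.annotation} [{rule.rule_id}]")
        if matched:
            annotated += 1
    return annotated


rules = [Rule("RULE_EXAMPLE_1", {"SIG_A", "SIG_B"}, "Bacteria", "Catalyzes example reaction")]
proteins = [
    ProteinRecord("P_UNREVIEWED_1", {"SIG_A", "SIG_B", "SIG_C"}, ["Bacteria", "Proteobacteria"], False),
    ProteinRecord("P_UNREVIEWED_2", {"SIG_A"}, ["Bacteria"], False),           # missing a signature
    ProteinRecord("P_UNREVIEWED_3", {"SIG_A", "SIG_B"}, ["Eukaryota"], False),  # out of taxonomic scope
    ProteinRecord("P_REVIEWED_1", {"SIG_A", "SIG_B"}, ["Bacteria"], True),      # skipped: reviewed
]
newly_annotated = propagate(rules, proteins)

# Coverage reported in the abstract: 53 million of 178 million records.
coverage_percent = 100 * 53_000_000 / 178_000_000
print(newly_annotated, round(coverage_percent, 1))
```

The computed coverage is about 29.8%, consistent with the rounded 30% stated in the abstract.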

6.

iPTMnet: an integrated resource for protein post-translational modification network discovery.

Huang, Hongzhan; Arighi, Cecilia N; Ross, Karen E; Ren, Jia; Li, Gang; Chen, Sheng-Chih; Wang, Qinghua; Cowart, Julie; Vijay-Shanker, K; Wu, Cathy H.

Nucleic Acids Res ; 46(D1): D542-D550, 2018 01 04.

Article in English | MEDLINE | ID: mdl-29145615

ABSTRACT

Protein post-translational modifications (PTMs) play a pivotal role in numerous biological processes by modulating regulation of protein function. We have developed iPTMnet (http://proteininformationresource.org/iPTMnet) for PTM knowledge discovery, employing an integrative bioinformatics approach that combines text mining, data mining, and ontological representation to capture rich PTM information, including PTM enzyme-substrate-site relationships, PTM-specific protein-protein interactions (PPIs) and PTM conservation across species. iPTMnet encompasses data from (i) our PTM-focused text mining tools, RLIMS-P and eFIP, which extract phosphorylation information from full-scale mining of PubMed abstracts and full-length articles; (ii) a set of curated databases with experimentally observed PTMs; and (iii) Protein Ontology that organizes proteins and PTM proteoforms, enabling their representation, annotation and comparison within and across species. Presently covering eight major PTM types (phosphorylation, ubiquitination, acetylation, methylation, glycosylation, S-nitrosylation, sumoylation and myristoylation), iPTMnet knowledgebase contains more than 654 500 unique PTM sites in over 62 100 proteins, along with more than 1200 PTM enzymes and over 24 300 PTM enzyme-substrate-site relations. The website supports online search, browsing, retrieval and visual analysis for scientific queries. Several examples, including functional interpretation of phosphoproteomic data, demonstrate iPTMnet as a gateway for visual exploration and systematic analysis of PTM networks and conservation, thereby enabling PTM discovery and hypothesis generation.

Subjects

Databases, Protein , Knowledge Bases , Protein Processing, Post-Translational , Animals , Computational Biology , Data Mining , Enzymes/metabolism , Humans , Internet , Phosphorylation , Protein Interaction Maps , Sequence Alignment

7.

Protein Ontology (PRO): enhancing and scaling up the representation of protein entities.

Natale, Darren A; Arighi, Cecilia N; Blake, Judith A; Bona, Jonathan; Chen, Chuming; Chen, Sheng-Chih; Christie, Karen R; Cowart, Julie; D'Eustachio, Peter; Diehl, Alexander D; Drabkin, Harold J; Duncan, William D; Huang, Hongzhan; Ren, Jia; Ross, Karen; Ruttenberg, Alan; Shamovsky, Veronica; Smith, Barry; Wang, Qinghua; Zhang, Jian; El-Sayed, Abdelrahman; Wu, Cathy H.

Nucleic Acids Res ; 45(D1): D339-D346, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899649

ABSTRACT

The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.

Subjects

Computational Biology/methods , Databases, Genetic , Proteins , Animals , Humans , Proteins/chemistry , Proteins/genetics , Web Browser

8.

On expert curation and scalability: UniProtKB/Swiss-Prot as a case study.

Poux, Sylvain; Arighi, Cecilia N; Magrane, Michele; Bateman, Alex; Wei, Chih-Hsuan; Lu, Zhiyong; Boutet, Emmanuel; Bye-A-Jee, Hema; Famiglietti, Maria Livia; Roechert, Bernd; UniProt Consortium, The.

Bioinformatics ; 33(21): 3454-3460, 2017 Nov 01.

Article in English | MEDLINE | ID: mdl-29036270

ABSTRACT

MOTIVATION: Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep up with the growth of biomedical literature is under scrutiny. Using UniProtKB/Swiss-Prot as a case study, we address this concern via multiple literature triage approaches. RESULTS: With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the ratio of papers relevant for curation. We first show that curators read and evaluate many more papers than they curate, and that measuring the number of curated publications is insufficient to provide a complete picture as demonstrated by the fact that 8000-10 000 papers are curated in UniProt each year while curators evaluate 50 000-70 000 papers per year. We show that 90% of the papers in PubMed are out of the scope of UniProt, that a maximum of 2-3% of the papers indexed in PubMed each year are relevant for UniProt curation, and that, despite appearances, expert curation in UniProt is scalable. AVAILABILITY AND IMPLEMENTATION: UniProt is freely available at http://www.uniprot.org/. CONTACT: sylvain.poux@sib.swiss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Data Curation , Databases, Protein , Data Curation/statistics & numerical data , Data Mining , Databases, Protein/statistics & numerical data , Humans , Knowledge Bases , PubMed/statistics & numerical data , Review Literature as Topic , Statistics as Topic

9.

Corrigendum to: Utilizing image and caption information for biomedical document classification.

Li, Pengyuan; Jiang, Xiangying; Zhang, Gongbo; Trabucco, Juan Trelles; Raciti, Daniela; Smith, Cynthia; Ringwald, Martin; Marai, G Elisabeta; Arighi, Cecilia; Shatkay, Hagit.

Bioinformatics ; 37(19): 3389, 2021 Oct 11.

Article in English | MEDLINE | ID: mdl-34453518

10.

UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase.

MacDougall, Alistair; Volynkin, Vladimir; Saidi, Rabie; Poggioli, Diego; Zellner, Hermann; Hatton-Ellis, Emma; Joshi, Vishal; O'Donovan, Claire; Orchard, Sandra; Auchincloss, Andrea H; Baratin, Delphine; Bolleman, Jerven; Coudert, Elisabeth; de Castro, Edouard; Hulo, Chantal; Masson, Patrick; Pedruzzi, Ivo; Rivoire, Catherine; Arighi, Cecilia; Wang, Qinghua; Chen, Chuming; Huang, Hongzhan; Garavelli, John; Vinayaka, C R; Yeh, Lai-Su; Natale, Darren A; Laiho, Kati; Martin, Maria-Jesus; Renaux, Alexandre; Pichler, Klemens.

Bioinformatics ; 36(22-23): 5562, 2021 04 01.

Article in English | MEDLINE | ID: mdl-33821964

11.

miRTex: A Text Mining System for miRNA-Gene Relation Extraction.

Li, Gang; Ross, Karen E; Arighi, Cecilia N; Peng, Yifan; Wu, Cathy H; Vijay-Shanker, K.

PLoS Comput Biol ; 11(9): e1004391, 2015.

Article in English | MEDLINE | ID: mdl-26407127

ABSTRACT

MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes.

Subjects

Computational Biology/methods , Data Mining/methods , Genes/genetics , MicroRNAs/genetics , Databases, Genetic , Humans , MicroRNAs/classification , Models, Genetic , Periodicals as Topic

12.

Protein Ontology: a controlled structured network of protein entities.

Natale, Darren A; Arighi, Cecilia N; Blake, Judith A; Bult, Carol J; Christie, Karen R; Cowart, Julie; D'Eustachio, Peter; Diehl, Alexander D; Drabkin, Harold J; Helfer, Olivia; Huang, Hongzhan; Masci, Anna Maria; Ren, Jia; Roberts, Natalia V; Ross, Karen; Ruttenberg, Alan; Shamovsky, Veronica; Smith, Barry; Yerramalla, Meher Shruti; Zhang, Jian; AlJanahi, Aisha; Çelen, Irem; Gan, Cynthia; Lv, Mengxi; Schuster-Lezell, Emily; Wu, Cathy H.

Nucleic Acids Res ; 42(Database issue): D415-21, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24270789

ABSTRACT

The Protein Ontology (PRO; http://proconsortium.org) formally defines protein entities and explicitly represents their major forms and interrelations. Protein entities represented in PRO corresponding to single amino acid chains are categorized by level of specificity into family, gene, sequence and modification metaclasses, and there is a separate metaclass for protein complexes. All metaclasses also have organism-specific derivatives. PRO complements established sequence databases such as UniProtKB, and interoperates with other biomedical and biological ontologies such as the Gene Ontology (GO). PRO relates to UniProtKB in that PRO's organism-specific classes of proteins encoded by a specific gene correspond to entities documented in UniProtKB entries. PRO relates to the GO in that PRO's representations of organism-specific protein complexes are subclasses of the organism-agnostic protein complex terms in the GO Cellular Component Ontology. The past few years have seen growth and changes to the PRO, as well as new points of access to the data and new applications of PRO in immunology and proteomics. Here we describe some of these developments.

Subjects

Biological Ontologies , Databases, Protein , Proteins/classification , Animals , Humans , Internet , Mice , Proteins/chemistry

13.

LARGE LANGUAGE MODELS (LLMS) AND CHATGPT FOR BIOMEDICINE.

Arighi, Cecilia; Brenner, Steven; Lu, Zhiyong.

Pac Symp Biocomput ; 29: 641-644, 2024.

Article in English | MEDLINE | ID: mdl-38160312

ABSTRACT

Large Language Models (LLMs) are a type of artificial intelligence that has been revolutionizing various fields, including biomedicine. They have the capability to process and analyze large amounts of data, understand natural language, and generate new content, making them highly desirable in many biomedical applications and beyond. In this workshop, we aim to introduce the attendees to an in-depth understanding of the rise of LLMs in biomedicine, and how they are being used to drive innovation and improve outcomes in the field, along with associated challenges and pitfalls.

Subjects

Artificial Intelligence , Computational Biology , Humans , Language

14.

Functional implications of glycans and their curation: insights from the workshop held at the 16th Annual International Biocuration Conference in Padua, Italy.

Martinez, Karina; Agirre, Jon; Akune, Yukie; Aoki-Kinoshita, Kiyoko F; Arighi, Cecilia; Axelsen, Kristian B; Bolton, Evan; Bordeleau, Emily; Edwards, Nathan J; Fadda, Elisa; Feizi, Ten; Hayes, Catherine; Ives, Callum M; Joshi, Hiren J; Krishna Prasad, Khakurel; Kossida, Sofia; Lisacek, Frederique; Liu, Yan; Lütteke, Thomas; Ma, Junfeng; Malik, Adnan; Martin, Maria; Mehta, Akul Y; Neelamegham, Sriram; Panneerselvam, Kalpana; Ranzinger, René; Ricard-Blum, Sylvie; Sanou, Gaoussou; Shanker, Vijay; Thomas, Paul D; Tiemeyer, Michael; Urban, James; Vita, Randi; Vora, Jeet; Yamamoto, Yasunori; Mazumder, Raja.

Database (Oxford) ; 2024, 2024 Aug 13.

Article in English | MEDLINE | ID: mdl-39137905

ABSTRACT

Dynamic changes in protein glycosylation impact human health and disease progression. However, current resources that capture disease and phenotype information focus primarily on the macromolecules within the central dogma of molecular biology (DNA, RNA, proteins). To gain a better understanding of organisms, there is a need to capture the functional impact of glycans and glycosylation on biological processes. A workshop titled "Functional impact of glycans and their curation" was held in conjunction with the 16th Annual International Biocuration Conference to discuss ongoing worldwide activities related to glycan function curation. This workshop brought together subject matter experts, tool developers, and biocurators from over 20 projects and bioinformatics resources. Participants discussed four key topics for each of their resources: (i) how they curate glycan function-related data from publications and other sources, (ii) what type of data they would like to acquire, (iii) what data they currently have, and (iv) what standards they use. Their answers contributed input that provided a comprehensive overview of state-of-the-art glycan function curation and annotations. This report summarizes the outcome of discussions, including potential solutions and areas where curators, data wranglers, and text mining experts can collaborate to address current gaps in glycan and glycosylation annotations, leveraging each other's work to improve their respective resources and encourage impactful data sharing among resources. Database URL: https://wiki.glygen.org/Glycan_Function_Workshop_2023.

Subjects

Data Curation , Polysaccharides , Polysaccharides/metabolism , Humans , Data Curation/methods , Glycosylation , Italy , Biocuration

15.

The Protein Ontology: a structured representation of protein forms and complexes.

Natale, Darren A; Arighi, Cecilia N; Barker, Winona C; Blake, Judith A; Bult, Carol J; Caudy, Michael; Drabkin, Harold J; D'Eustachio, Peter; Evsikov, Alexei V; Huang, Hongzhan; Nchoutmboube, Jules; Roberts, Natalia V; Smith, Barry; Zhang, Jian; Wu, Cathy H.

Nucleic Acids Res ; 39(Database issue): D539-45, 2011 Jan.

Article in English | MEDLINE | ID: mdl-20935045

ABSTRACT

The Protein Ontology (PRO) provides a formal, logically-based classification of specific protein classes including structured representations of protein isoforms, variants and modified forms. Initially focused on proteins found in human, mouse and Escherichia coli, PRO now includes representations of protein complexes. The PRO Consortium works in concert with the developers of other biomedical ontologies and protein knowledge bases to provide the ability to formally organize and integrate representations of precise protein forms so as to enhance accessibility to results of protein research. PRO (http://pir.georgetown.edu/pro) is part of the Open Biomedical Ontology Foundry.

Subjects

Databases, Protein , Proteins/classification , Animals , Escherichia coli Proteins/chemistry , Humans , Mice , Multiprotein Complexes/chemistry , Multiprotein Complexes/classification , Protein Isoforms/chemistry , Protein Isoforms/classification , Proteins/chemistry , Proteins/genetics , User-Computer Interface , Vocabulary, Controlled

16.

Enhancing biomedical search interfaces with images.

Trelles Trabucco, Juan; Arighi, Cecilia; Shatkay, Hagit; Marai, G Elisabeta.

Bioinform Adv ; 3(1): vbad095, 2023.

Article in English | MEDLINE | ID: mdl-37485423

ABSTRACT

Motivation: Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields. Results: We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution's benefits to biomedical search. Availability and implementation: A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.

17.

Overview of the COVID-19 text mining tool interactive demonstration track in BioCreative VII.

Chatr-Aryamontri, Andrew; Hirschman, Lynette; Ross, Karen E; Oughtred, Rose; Krallinger, Martin; Dolinski, Kara; Tyers, Mike; Korves, Tonia; Arighi, Cecilia N.

Database (Oxford) ; 2022, 2022 10 05.

Article in English | MEDLINE | ID: mdl-36197453

ABSTRACT

The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system's ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-4/.

Subjects

COVID-19 , COVID-19/epidemiology , Data Mining/methods , Databases, Factual , Documentation , Humans , Natural Language Processing

18.

A roadmap for the functional annotation of protein families: a community perspective.

de Crécy-Lagard, Valérie; Amorin de Hegedus, Rocio; Arighi, Cecilia; Babor, Jill; Bateman, Alex; Blaby, Ian; Blaby-Haas, Crysten; Bridge, Alan J; Burley, Stephen K; Cleveland, Stacey; Colwell, Lucy J; Conesa, Ana; Dallago, Christian; Danchin, Antoine; de Waard, Anita; Deutschbauer, Adam; Dias, Raquel; Ding, Yousong; Fang, Gang; Friedberg, Iddo; Gerlt, John; Goldford, Joshua; Gorelik, Mark; Gyori, Benjamin M; Henry, Christopher; Hutinet, Geoffrey; Jaroch, Marshall; Karp, Peter D; Kondratova, Liudmyla; Lu, Zhiyong; Marchler-Bauer, Aron; Martin, Maria-Jesus; McWhite, Claire; Moghe, Gaurav D; Monaghan, Paul; Morgat, Anne; Mungall, Christopher J; Natale, Darren A; Nelson, William C; O'Donoghue, Seán; Orengo, Christine; O'Toole, Katherine H; Radivojac, Predrag; Reed, Colbie; Roberts, Richard J; Rodionov, Dmitri; Rodionova, Irina A; Rudolf, Jeffrey D; Saleh, Lana; Sheynkman, Gloria.

Database (Oxford) ; 2022, 2022 08 12.

Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.

Subjects

Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation

19.

Overview of the BioCreative III Workshop.

Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin; Cohen, Kevin B; Wilbur, W John; Valencia, Alfonso; Hirschman, Lynette; Wu, Cathy H.

BMC Bioinformatics ; 12 Suppl 8: S1, 2011 Oct 03.

Article in English | MEDLINE | ID: mdl-22151647

ABSTRACT

BACKGROUND: The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools which are useful to the communities of researchers and database curators in the biological sciences. To this end BioCreative I was held in 2004, BioCreative II in 2007, and BioCreative II.5 in 2009. Each of these workshops involved humanly annotated test data for several basic tasks in text mining applied to the biomedical literature. Participants in the workshops were invited to compete in the tasks by constructing software systems to perform the tasks automatically and were given scores based on their performance. The results of these workshops have benefited the community in several ways. They have 1) provided evidence for the most effective methods currently available to solve specific problems; 2) revealed the current state of the art for performance on those problems; 3) and provided gold standard data and results on that data by which future advances can be gauged. This special issue contains overview papers for the three tasks of BioCreative III. RESULTS: The BioCreative III Workshop was held in September of 2010 and continued the tradition of a challenge evaluation on several tasks judged basic to effective text mining in biology, including a gene normalization (GN) task and two protein-protein interaction (PPI) tasks. In total the Workshop involved the work of twenty-three teams. Thirteen teams participated in the GN task which required the assignment of EntrezGene IDs to all named genes in full text papers without any species information being provided to a system. Ten teams participated in the PPI article classification task (ACT) requiring a system to classify and rank a PubMed® record as belonging to an article either having or not having "PPI relevant" information. 
Eight teams participated in the PPI interaction method task (IMT) where systems were given full text documents and were required to extract the experimental methods used to establish PPIs and a text segment supporting each such method. Gold standard data was compiled for each of these tasks and participants competed in developing systems to perform the tasks automatically. BioCreative III also introduced a new interactive task (IAT), run as a demonstration task. The goal was to develop an interactive system to facilitate a user's annotation of the unique database identifiers for all the genes appearing in an article. This task included ranking genes by importance (based preferably on the amount of described experimental information regarding genes). There was also an optional task to assist the user in finding the most relevant articles about a given gene. For BioCreative III, a user advisory group (UAG) was assembled and played an important role 1) in producing some of the gold standard annotations for the GN task, 2) in critiquing IAT systems, and 3) in providing guidance for a future more rigorous evaluation of IAT systems. Six teams participated in the IAT demonstration task and received feedback on their systems from the UAG. Besides innovations in the GN and PPI tasks making them more realistic and practical and the introduction of the IAT task, discussions were begun on community data standards to promote interoperability and on user requirements and evaluation metrics to address utility and usability of systems. CONCLUSIONS: In this paper we give a brief history of the BioCreative Workshops and how they relate to other text mining competitions in biology. This is followed by a synopsis of the three tasks GN, PPI, and IAT in BioCreative III with figures for best participant performance on the GN and PPI tasks. 
These results are discussed and compared with results from previous BioCreative Workshops and we conclude that the best performing systems for GN, PPI-ACT and PPI-IMT in realistic settings are not sufficient for fully automatic use. This provides evidence for the importance of interactive systems and we present our vision of how best to construct an interactive system for a GN or PPI like task in the remainder of the paper.

Subjects

Computational Biology/methods , Data Mining , Genes , Proteins/metabolism , Software , Animals , Computational Biology/standards , Humans , Periodicals as Topic , Proteins/genetics

20.

The representation of protein complexes in the Protein Ontology (PRO).

Bult, Carol J; Drabkin, Harold J; Evsikov, Alexei; Natale, Darren; Arighi, Cecilia; Roberts, Natalia; Ruttenberg, Alan; D'Eustachio, Peter; Smith, Barry; Blake, Judith A; Wu, Cathy.

BMC Bioinformatics ; 12: 371, 2011 Sep 19.

Article in English | MEDLINE | ID: mdl-21929785

ABSTRACT

BACKGROUND: Representing species-specific proteins and protein complexes in ontologies that are both human- and machine-readable facilitates the retrieval, analysis, and interpretation of genome-scale data sets. Although existing protein-centric informatics resources provide the biomedical research community with well-curated compendia of protein sequence and structure, these resources lack formal ontological representations of the relationships among the proteins themselves. The Protein Ontology (PRO) Consortium is filling this informatics resource gap by developing ontological representations and relationships among proteins and their variants and modified forms. Because proteins are often functional only as members of stable protein complexes, the PRO Consortium, in collaboration with existing protein and pathway databases, has launched a new initiative to implement logical and consistent representation of protein complexes. DESCRIPTION: We describe here how the PRO Consortium is meeting the challenge of representing species-specific protein complexes, how protein complex representation in PRO supports annotation of protein complexes and comparative biology, and how PRO is being integrated into existing community bioinformatics resources. The PRO resource is accessible at http://pir.georgetown.edu/pro/. CONCLUSION: PRO is a unique database resource for species-specific protein complexes. PRO facilitates robust annotation of variations in composition and function contexts for protein complexes within and between species.

Subjects

Databases, Protein , Multiprotein Complexes , Proteins/chemistry , Animals , Computational Biology , Humans , Internet , Multienzyme Complexes , Proteins/metabolism
