1.
PLoS Biol ; 15(6): e2001414, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28662064

ABSTRACT

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.


Subjects
Biological Science Disciplines/methods , Computational Biology/methods , Data Mining/methods , Software Design , Software , Biological Science Disciplines/statistics & numerical data , Biological Science Disciplines/trends , Computational Biology/trends , Data Mining/statistics & numerical data , Data Mining/trends , Databases, Factual/statistics & numerical data , Databases, Factual/trends , Forecasting , Humans , Internet
2.
BMC Bioinformatics ; 14: 104, 2013 Mar 22.
Article in English | MEDLINE | ID: mdl-23517090

ABSTRACT

BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.


Subjects
Data Mining/methods , Databases, Protein , Knowledge Bases , Protein Processing, Post-Translational , Humans , Molecular Sequence Annotation , Proteomics
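
The pattern-matching, rule-based extraction described in this abstract can be illustrated with a minimal sketch. The abstract does not give the system's actual patterns or rules, so the keyword list, the residue pattern, and the function name `extract_ptm_sentences` below are all assumptions made for illustration only.

```python
import re

# Hypothetical keyword patterns; the abstract does not enumerate the PTM types
# or the wording rules used by the real system.
PTM_KEYWORDS = {
    "phosphorylation": r"phosphorylat(?:ion|ed|es)",
    "acetylation": r"acetylat(?:ion|ed|es)",
    "methylation": r"methylat(?:ion|ed|es)",
}

# Assumed residue-mention pattern: matches forms such as "Ser-473", "Ser473",
# "serine 473" or "S473".
SITE = re.compile(
    r"\b(Ser|Thr|Tyr|Lys|Arg|serine|threonine|tyrosine|lysine|arginine|[STYKR])"
    r"[- ]?(\d+)\b"
)


def extract_ptm_sentences(text):
    """Return sentences that mention both a PTM keyword and a modification site."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        for ptm_type, pattern in PTM_KEYWORDS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                sites = [m.group(1) + m.group(2) for m in SITE.finditer(sentence)]
                if sites:
                    hits.append(
                        {"type": ptm_type, "sites": sites, "sentence": sentence}
                    )
    return hits


if __name__ == "__main__":
    sample = (
        "Akt is phosphorylated at Ser-473 by mTORC2. "
        "The protein localizes to the nucleus."
    )
    for hit in extract_ptm_sentences(sample):
        print(hit["type"], hit["sites"])
```

A sketch like this only covers sentence selection by type and site; ranking protein candidates, which the abstract also mentions, would need additional rules that are not described here.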
3.
Nucleic Acids Res ; 39(Database issue): D58-65, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21062818

ABSTRACT

UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which, by analogy with the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains 'Cited By' information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched.


Subjects
PubMed , Data Mining , Internet , Software , United Kingdom
4.
Trends Biochem Sci ; 29(12): 627-33, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15544947

ABSTRACT

Sequence similarities among proteins can be used to infer biological function and evolutionary relationships, a powerful approach for investigating new proteins and suggesting future experiments. The availability of public sequence databases and freely distributed tools for sequence analysis has meant that researchers from all over the world can use this approach. For the past 12 years, the Protein Sequence Motif column in TiBS has provided a platform for documenting interesting discoveries from sequence analyses. As the column comes to an end, we look at the published contributions over the years and reflect on sequence analysis through the beginning of the genomic era.


Subjects
Databases, Protein , Proteins/chemistry , Structural Homology, Protein , Animals , Humans
5.
J Biomed Semantics ; 6: 1, 2015.
Article in English | MEDLINE | ID: mdl-25789152

ABSTRACT

BACKGROUND: In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. RESULTS: Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. CONCLUSIONS: Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.
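
As a rough illustration of identifying database accession numbers in text, the sketch below scans a string with one regular expression per database. The abstract does not describe the study's actual pipeline, so the function name `extract_accessions` and the choice of three databases are assumptions; the patterns follow the commonly published identifier formats for UniProt, Pfam and Ensembl gene IDs and will produce false positives without the contextual checks a real pipeline would presumably apply.

```python
import re
from collections import defaultdict

# Assumed patterns based on the usual published accession formats; these are
# not taken from the study itself.
PATTERNS = {
    "uniprot": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
    "pfam": re.compile(r"\bPF\d{5}\b"),
    "ensembl_gene": re.compile(r"\bENS[A-Z]*G\d{11}\b"),
}


def extract_accessions(text):
    """Map each database name to the sorted, de-duplicated accessions found."""
    found = defaultdict(set)
    for database, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found[database].add(match.group(0))
    return {database: sorted(ids) for database, ids in found.items()}


if __name__ == "__main__":
    supplementary_text = (
        "Domains PF00069 and PF07714 were found in P31749 (ENSG00000142208)."
    )
    print(extract_accessions(supplementary_text))
```

Applied separately to an article body and to its supplementary files, a function of this kind would allow the two sources to be compared by the number of accessions each contains.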

6.
J Biomed Semantics ; 6: 7, 2015.
Article in English | MEDLINE | ID: mdl-25774284

ABSTRACT

BACKGROUND: As the availability of open access full text research articles increases, so does the need for sophisticated search services that make the most of this new content. Here, we present a new feature available in Europe PMC that allows selected sections of full text articles to be searched, including figures and reference lists. Users can now search particular parts of an article, reducing noise and allowing fine-tuning of searches. RESULTS: To the best of our knowledge, Europe PMC is the first service that provides a granular literature search by allowing users to target their search to particular sections of articles. This new functionality is based on a heuristic algorithm that identifies and categorises article sections into 17 pre-defined categories based on the section heading. Measured against a manually curated dataset of 100 full text articles, the tagger achieves an F-score of 98.02%. CONCLUSIONS: The section search is available from the advanced search within Europe PMC (http://europepmc.org). The source code is freely available from http://europepmc.org/ftp/oa/SectionTagger/.
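
A heading-based heuristic of the kind described can be sketched as a list of keyword rules applied to each section title. This is not the Europe PMC SectionTagger: the abstract mentions 17 pre-defined categories without listing them, so the category names, keywords and the function name `tag_heading` below are hypothetical.

```python
import re

# Hypothetical subset of categories and keyword rules, for illustration only.
RULES = [
    ("INTRODUCTION", r"\b(?:introduction|background)\b"),
    ("METHODS", r"\b(?:methods?|materials|experimental procedures)\b"),
    ("RESULTS", r"\bresults?\b"),
    ("DISCUSSION", r"\bdiscussion\b"),
    ("CONCLUSIONS", r"\bconclusions?\b"),
    ("REFERENCES", r"\b(?:references|bibliography)\b"),
]


def tag_heading(heading):
    """Return every category whose rule matches the heading, or ["OTHER"]."""
    # Drop leading numbering such as "2." or "2.1 " before matching.
    cleaned = re.sub(r"^\s*\d+(?:\.\d+)*[.)]?\s*", "", heading).lower()
    tags = [category for category, pattern in RULES if re.search(pattern, cleaned)]
    return tags or ["OTHER"]


if __name__ == "__main__":
    for title in ["2.1 Materials and Methods", "Results and Discussion", "Acknowledgements"]:
        print(title, "->", tag_heading(title))
```

Allowing more than one tag per heading is a design choice in this sketch, so that a combined section such as "Results and Discussion" can be found by a search targeting either category.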

7.
PLoS One ; 8(5): e63184, 2013.
Article in English | MEDLINE | ID: mdl-23734176

ABSTRACT

Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.


Subjects
Data Mining/methods , Data Mining/statistics & numerical data , Databases, Factual , Internet , Data Mining/trends , Databases, Bibliographic , Databases, Nucleic Acid , Databases, Protein , Europe , Humans , Reproducibility of Results , United States
8.
AMIA Annu Symp Proc ; 2009: 396-400, 2009 Nov 14.
Article in English | MEDLINE | ID: mdl-20351887

ABSTRACT

It is common for PubMed users to repeatedly modify their queries (search terms) before retrieving documents relevant to their information needs. To assist users in reformulating their queries, we report the implementation and usage analysis of a new component in PubMed called Related Queries, which automatically produces query suggestions in response to the user's original input. The proposed method is based on query log analysis and focuses on finding popular queries that contain the initial user search term with a goal of helping users describe their information needs in a more precise manner. This work has been integrated into PubMed since January 2009. Automatic assessment using clickthrough data shows that each day, the new feature is used consistently between 6% and 10% of the time when it is shown, suggesting that it has quickly become a popular new feature in PubMed.


Subjects
Information Storage and Retrieval/methods , PubMed , User-Computer Interface , Medical Subject Headings , Natural Language Processing , Terminology as Topic
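
The idea of suggesting popular logged queries that contain the user's initial search term can be sketched as a frequency count over a query log. This is a simplified illustration, not PubMed's implementation: the function name `related_queries`, the whole-word containment test and the tie-breaking order are assumptions.

```python
from collections import Counter


def related_queries(term, query_log, top_n=5):
    """Suggest the most frequent logged queries that contain every word of `term`."""
    normalized_term = term.strip().lower()
    term_tokens = normalized_term.split()
    # Count each distinct query after simple case and whitespace normalization.
    counts = Counter(query.strip().lower() for query in query_log)
    candidates = [
        (query, count)
        for query, count in counts.items()
        if query != normalized_term
        and all(token in query.split() for token in term_tokens)
    ]
    # Most frequent first; alphabetical order breaks ties deterministically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in candidates[:top_n]]


if __name__ == "__main__":
    log = [
        "asthma",
        "asthma treatment",
        "asthma treatment",
        "asthma children",
        "diabetes",
        "Asthma Treatment",
    ]
    print(related_queries("asthma", log))
```

The original query is excluded from the suggestions because the stated goal is to help users describe their information need more precisely than their first attempt.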