Results 1 - 8 of 8
1.
PLoS Biol ; 15(6): e2001414, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28662064

ABSTRACT

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.


Subject(s)
Biological Science Disciplines/methods; Computational Biology/methods; Data Mining/methods; Software Design; Software; Biological Science Disciplines/statistics & numerical data; Biological Science Disciplines/trends; Computational Biology/trends; Data Mining/statistics & numerical data; Data Mining/trends; Databases, Factual/statistics & numerical data; Databases, Factual/trends; Forecasting; Humans; Internet

2.
J Biomed Semantics ; 6: 7, 2015.
Article in English | MEDLINE | ID: mdl-25774284

ABSTRACT

BACKGROUND: As the availability of open access full text research articles increases, so does the need for sophisticated search services that make the most of this new content. Here, we present a new feature available in Europe PMC that allows selected sections of full text articles to be searched, including figures and reference lists. Users can now search particular parts of an article, reducing noise and allowing fine-tuning of searches. RESULTS: To the best of our knowledge, Europe PMC is the first service that provides a granular literature search by allowing users to target their search to particular sections of articles. This new functionality is based on a heuristic algorithm that identifies and categorises article sections into 17 pre-defined categories based on the section heading. The tagger's performance is measured against a manually curated dataset consisting of 100 full text articles with an F-score of 98.02%. CONCLUSIONS: The section search is available from the advanced search within Europe PMC (http://europepmc.org). The source code is freely available from http://europepmc.org/ftp/oa/SectionTagger/.

3.
J Biomed Semantics ; 6: 1, 2015.
Article in English | MEDLINE | ID: mdl-25789152

ABSTRACT

BACKGROUND: In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. RESULTS: Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. CONCLUSIONS: Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.

4.
PLoS One ; 8(5): e63184, 2013.
Article in English | MEDLINE | ID: mdl-23734176

ABSTRACT

Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high throughput of this text-mining pipeline make this activity possible both accurately and at low cost, which will allow the development of new integrated data services.


Subject(s)
Data Mining/methods; Data Mining/statistics & numerical data; Databases, Factual; Internet; Data Mining/trends; Databases, Bibliographic; Databases, Nucleic Acid; Databases, Protein; Europe; Humans; Reproducibility of Results; United States

5.
BMC Bioinformatics ; 14: 104, 2013 Mar 22.
Article in English | MEDLINE | ID: mdl-23517090

ABSTRACT

BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.


Subject(s)
Data Mining/methods; Databases, Protein; Knowledge Bases; Protein Processing, Post-Translational; Humans; Molecular Sequence Annotation; Proteomics

6.
Nucleic Acids Res ; 39(Database issue): D58-65, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21062818

ABSTRACT

UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains 'Cited By' information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched.


Subject(s)
PubMed; Data Mining; Internet; Software; United Kingdom

7.
AMIA Annu Symp Proc ; 2009: 396-400, 2009 Nov 14.
Article in English | MEDLINE | ID: mdl-20351887

ABSTRACT

It is common for PubMed users to repeatedly modify their queries (search terms) before retrieving documents relevant to their information needs. To assist users in reformulating their queries, we report the implementation and usage analysis of a new component in PubMed called Related Queries, which automatically produces query suggestions in response to the user's original input. The proposed method is based on query log analysis and focuses on finding popular queries that contain the initial user search term with a goal of helping users describe their information needs in a more precise manner. This work has been integrated into PubMed since January 2009. Automatic assessment using clickthrough data shows that each day, the new feature is used consistently between 6% and 10% of the time when it is shown, suggesting that it has quickly become a popular new feature in PubMed.


Subject(s)
Information Storage and Retrieval/methods; PubMed; User-Computer Interface; Medical Subject Headings; Natural Language Processing; Terminology as Topic

8.
Trends Biochem Sci ; 29(12): 627-33, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15544947

ABSTRACT

Sequence similarities among proteins can be used to infer biological function and evolutionary relationships, a powerful approach for investigating new proteins and suggesting future experiments. The availability of public sequence databases and freely distributed tools for sequence analysis has meant that researchers from all over the world can use this approach. For the past 12 years, the Protein Sequence Motif column in TiBS has provided a platform for documenting interesting discoveries from sequence analyses. As the column comes to an end, we look at the published contributions over the years and reflect on sequence analysis through the beginning of the genomic era.


Subject(s)
Databases, Protein; Proteins/chemistry; Structural Homology, Protein; Animals; Humans