Results 1 - 14 of 14
1.
Bioinformatics ; 32(6): 918-25, 2016 03 15.
Article in English | MEDLINE | ID: mdl-26556384

ABSTRACT

MOTIVATION: The increasing diversity of data available to the biomedical scientist holds promise for a better understanding of diseases and the discovery of new treatments for patients. In order to provide a complete picture of a biomedical question, data from many different origins need to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day. RESULTS: We present a new generic methodology to identify problematic records, causing what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in social sciences: the graph density and the betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses. CONTACT: samuel.croset@roche.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Curation, Humans, Knowledge Bases
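The two graph metrics named in this abstract are standard and easy to reproduce; a minimal stdlib sketch (the toy "hairball" graph and all names are illustrative, not from the paper):

```python
from collections import deque

def density(nodes, edges):
    # Undirected graph density: 2|E| / (|N| * (|N| - 1))
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

def betweenness(nodes, adj):
    # Brandes' algorithm for unweighted, undirected graphs.
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0); sigma[s] = 1   # shortest-path counts
        dist = dict.fromkeys(nodes, -1); dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:                 # w seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:                            # accumulate in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}    # undirected pairs counted twice

# Hub H connected to A, B, C: dense local star, H sits on all leaf-to-leaf paths.
nodes = ["H", "A", "B", "C"]
edges = [("H", "A"), ("H", "B"), ("H", "C")]
adj = {"H": ["A", "B", "C"], "A": ["H"], "B": ["H"], "C": ["H"]}
print(density(nodes, edges))         # 0.5
print(betweenness(nodes, adj)["H"])  # 3.0
```

Records producing subgraphs that score high on both measures are the kind of 'hairball' candidates the methodology flags for curation.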
2.
BMC Bioinformatics ; 15 Suppl 14: S6, 2014.
Article in English | MEDLINE | ID: mdl-25472638

ABSTRACT

Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them.


Subjects
Data Mining, Drug Discovery, Knowledge Discovery, Software
3.
Drug Discov Today ; 27(5): 1441-1447, 2022 05.
Article in English | MEDLINE | ID: mdl-35066138

ABSTRACT

Over recent years, there has been exciting growth in collaboration between academia and industry in the life sciences to make data more Findable, Accessible, Interoperable and Reusable (FAIR) and thereby achieve greater value. Despite considerable progress, the transformative shift from an application-centric to a data-centric perspective, enabled by FAIR implementation, remains very much a work in progress on the 'FAIR journey'. In this review, we consider use cases for FAIR implementation. These can be deployed alongside assessment of data quality to maximize the value of research, clinical trial, and real-world healthcare data, which are essential for the discovery and development of new medical treatments by biopharma.


Subjects
Biological Science Disciplines, Data Accuracy, Industry
4.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.


Subjects
Algorithms, Data Mining/methods, Genes, Animals, Data Mining/standards, Humans, National Library of Medicine (U.S.), Periodicals as Topic, United States
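TAP-k, the metric used in this challenge, truncates a ranked retrieval list at the rank where the k-th irrelevant record appears. The following is a simplified sketch under one reading of the metric (not the official implementation; the names and the example data are illustrative):

```python
def tap_k(ranked, relevant, k):
    """Simplified Threshold Average Precision (TAP-k) sketch.

    One reading of the metric: average the precision at each relevant
    record retrieved up to and including the rank of the k-th irrelevant
    record, add the precision at that cutoff once more as an extra term,
    and divide by (total relevant + 1). Illustrative only.
    """
    precisions, errors, hits = [], 0, 0
    cutoff_precision = 0.0
    for rank, record in enumerate(ranked, start=1):
        if record in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant hit
        else:
            errors += 1
        cutoff_precision = hits / rank       # precision at the current rank
        if errors == k:                      # k-th error reached: stop here
            break
    return (sum(precisions) + cutoff_precision) / (len(relevant) + 1)

# Ranked gene identifiers vs. a hypothetical relevant set, with k=2.
score = tap_k(["g1", "x", "g2", "y"], {"g1", "g2", "g3"}, k=2)
print(round(score, 4))  # 0.5417
```

Unlike plain average precision, the extra cutoff term penalizes systems whose relevant hits sit just above a tail of errors.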
5.
Drug Discov Today ; 24(10): 2068-2075, 2019 10.
Article in English | MEDLINE | ID: mdl-31158512

ABSTRACT

In this review, we provide a summary of recent progress in ontology mapping (OM) at a crucial time, when biomedical research faces a deluge of data of ever-increasing volume and variety. This is particularly important for realising the full potential of semantically enabled or enriched applications and for deriving meaningful insights, for example in drug discovery, using machine-learning technologies. We discuss challenges and solutions for better ontology mappings, as well as how to select ontologies before their application. In addition, we describe tools and algorithms for ontology mapping, including evaluation of tool capability and quality of mappings. Finally, we outline the requirements for an ontology mapping service (OMS) and the progress being made towards implementation of such sustainable services.


Subjects
Biological Ontologies, Drug Discovery/methods, Machine Learning, Semantics, Algorithms, Humans
6.
Drug Discov Today ; 24(4): 933-938, 2019 04.
Article in English | MEDLINE | ID: mdl-30690198

ABSTRACT

Biopharmaceutical industry R&D, and indeed other life sciences R&D such as biomedical, environmental, agricultural and food production, is becoming increasingly data-driven and can significantly improve its efficiency and effectiveness by implementing the FAIR (findable, accessible, interoperable, reusable) guiding principles for scientific data management and stewardship. By so doing, the plethora of new and powerful analytical tools such as artificial intelligence and machine learning will be able, automatically and at scale, to access the data from which they learn, and on which they thrive. FAIR is a fundamental enabler for digital transformation.


Subjects
Data Management, Drug Industry, Biological Products, Biomedical Research
7.
J Biomed Semantics ; 8(1): 55, 2017 Dec 02.
Article in English | MEDLINE | ID: mdl-29197409

ABSTRACT

BACKGROUND: The disease and phenotype track was designed to evaluate the relative performance of ontology matching systems that generate mappings between source ontologies. Disease and phenotype ontologies are important for applications such as data mining, data integration and knowledge management to support translational science in drug discovery and understanding the genetics of disease. RESULTS: Eleven systems (out of 21 OAEI participating systems) were able to cope with at least one of the tasks in the Disease and Phenotype track. The AML, FCA-Map, LogMap(Bio) and PhenoMF systems produced the top results for ontology matching in comparison to consensus alignments. Evaluation against manually curated mappings proved more difficult, most likely because these mapping sets comprised mostly subsumption relationships rather than equivalences. Manual assessment of unique equivalence mappings showed that the AML, LogMap(Bio) and PhenoMF systems have the highest precision. CONCLUSIONS: Four systems gave the highest performance for matching disease and phenotype ontologies. These systems coped well with the detection of equivalence matches, but struggled to detect semantic similarity, which deserves more attention in the future development of ontology matching systems. The findings of this evaluation show that such systems could help to automate equivalence matching in the workflow of curators, who maintain ontology mapping services in numerous domains such as disease and phenotype.


Subjects
Biological Ontologies, Disease, Phenotype, Consensus, Humans
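Scoring a system's equivalence mappings against a consensus alignment, as in this track, reduces to set overlap; a minimal sketch (the ontology class identifiers are made up for illustration):

```python
def evaluate_alignment(system, reference):
    """Precision/recall/F1 of equivalence mappings vs. a reference alignment.

    Each mapping is a (source_class, target_class) pair; the identifiers
    used below are hypothetical, not real DOID/HP mappings.
    """
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

system = {("DOID:0001", "HP:0001"), ("DOID:0002", "HP:0009")}
consensus = {("DOID:0001", "HP:0001"), ("DOID:0003", "HP:0003")}
print(evaluate_alignment(system, consensus))  # (0.5, 0.5, 0.5)
```

As the abstract notes, this works only for equivalence matches; subsumption mappings cannot be scored by exact pair overlap.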
8.
BMC Bioinformatics ; 7 Suppl 3: S3, 2006 Nov 24.
Article in English | MEDLINE | ID: mdl-17134476

ABSTRACT

BACKGROUND: The biomedical domain is witnessing a rapid growth of the amount of published scientific results, which makes it increasingly difficult to filter the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. RESULTS: We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. CONCLUSION: The experiments show that the approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.


Subjects
Abstracting and Indexing, Artificial Intelligence, Information Storage and Retrieval/methods, Natural Language Processing, Periodicals as Topic, Terminology as Topic, Controlled Vocabulary, Algorithms, Factual Databases, Semantics, Software
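The system above uses full deep-linguistic parses; as a rough illustration of the pattern idea only, here is a surface-level sketch with two hypothetical lexical alternations (far weaker than the paper's approach):

```python
import re

# Two surface patterns standing in for syntactic alternations; the actual
# system matches over deep parse structures, not regular expressions.
PATTERNS = [
    re.compile(r"\b([A-Za-z0-9]+) (?:interacts|associates) with ([A-Za-z0-9]+)"),
    re.compile(r"\binteraction (?:of|between) ([A-Za-z0-9]+) and ([A-Za-z0-9]+)"),
]

def extract_pairs(sentence):
    """Return candidate protein-pair mentions found by any pattern."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group(1), match.group(2)))
    return pairs

print(extract_pairs("RAD51 interacts with BRCA2 in vivo."))
# [('RAD51', 'BRCA2')]
```

The parsing-based rules subsume many such surface variants (passives, relative clauses, coordination) that a finite-state pattern list would have to enumerate one by one, which is the versatility the conclusion refers to.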
9.
Int J Med Inform ; 67(1-3): 63-74, 2002 Dec 04.
Article in English | MEDLINE | ID: mdl-12460632

ABSTRACT

MEDSYNDIKATE is a natural language processor which automatically acquires medical information from findings reports. In the course of text analysis, their content is transferred to conceptual representation structures, which constitute a corresponding text knowledge base. MEDSYNDIKATE is particularly adapted to deal properly with text structures, such as various forms of anaphoric reference relations spanning several sentences. The strong demands MEDSYNDIKATE poses on the availability of expressive knowledge sources are met by two alternative approaches to acquiring medical domain knowledge (semi)automatically. We also present data on the information extraction performance of MEDSYNDIKATE in terms of the semantic interpretation of three major syntactic patterns in medical documents.


Subjects
Information Storage and Retrieval, Medical Informatics, Natural Language Processing, Confidence Intervals, Humans, Linguistics, Computerized Medical Records Systems, Semantics, Software
10.
J Biomed Semantics ; 5(Suppl 1 Proceedings of the Bio-Ontologies Spec Interest G): S5, 2014.
Article in English | MEDLINE | ID: mdl-25093074

ABSTRACT

The lack of established standards to describe and annotate biological assays and screening outcomes in the domain of drug and chemical probe discovery is a severe limitation to utilizing public and proprietary drug screening data to their maximum potential. We have created the BioAssay Ontology (BAO) project (http://bioassayontology.org) to develop common reference metadata terms and definitions required for describing relevant information of low- and high-throughput drug and probe screening assays and results. The main objectives of BAO are to enable effective integration, aggregation, retrieval, and analyses of drug screening data. Since we first released BAO on the BioPortal in 2010, we have considerably expanded and enhanced BAO and have applied the ontology in several internal and external collaborative projects, for example the BioAssay Research Database (BARD). We describe the evolution of BAO with a design that enables modeling of complex assays, including profile and panel assays such as those in the Library of Integrated Network-based Cellular Signatures (LINCS). One of the critical questions in evolving BAO is the following: how can specific parts of our ontologies be efficiently reused and shared among various research projects without violating the integrity of the ontology and without creating redundancies? This paper provides a comprehensive answer to this question with a description of a methodology for ontology modularization using a layered architecture. Our modularization approach defines several distinct BAO components and separates internal from external modules and domain-level from structural components. This approach facilitates the generation/extraction of derived ontologies (or perspectives) that can suit particular use cases or software applications. We describe the evolution of BAO related to its formal structures, engineering approaches, and content to enable modeling of complex assays and integration with other ontologies and datasets.

11.
Database (Oxford) ; 2013: bas053, 2013.
Article in English | MEDLINE | ID: mdl-23396322

ABSTRACT

In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.


Subjects
Data Mining/methods, Molecular Sequence Annotation/methods, Genetic Databases, Information Storage and Retrieval, Toxicogenetics, User-Computer Interface
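The abstract does not state how the Lucene baseline and the relation-mining confidence are combined; a minimal convex-combination sketch (the weight `alpha`, the score normalization, and all data are assumptions, not values from the paper):

```python
def rerank(doc_ids, ir_scores, mining_scores, alpha=0.7):
    """Re-rank documents by alpha * IR score + (1 - alpha) * mining confidence.

    ir_scores / mining_scores: dicts of scores normalized to [0, 1];
    alpha is a hypothetical tuning weight. Documents the relation miner
    never scored default to 0.0 confidence.
    """
    def combined(doc):
        return alpha * ir_scores[doc] + (1 - alpha) * mining_scores.get(doc, 0.0)
    return sorted(doc_ids, key=combined, reverse=True)

docs = ["d1", "d2", "d3"]
ir = {"d1": 0.9, "d2": 0.4, "d3": 0.6}        # baseline Lucene-style scores
mining = {"d2": 1.0, "d3": 0.2}               # relation-mining confidences
print(rerank(docs, ir, mining))  # ['d1', 'd2', 'd3']
```

Tuning `alpha` on held-out triage judgments is one plausible way to trade off the baseline ranking against the relation evidence.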
12.
Article in English | MEDLINE | ID: mdl-20671319

ABSTRACT

We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation to the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.


Subjects
Computational Biology/methods, Data Mining/methods, Protein Interaction Mapping/methods, Bibliographic Databases, Natural Language Processing, Scientific Societies
13.
Genome Biol ; 9 Suppl 2: S13, 2008.
Article in English | MEDLINE | ID: mdl-18834491

ABSTRACT

BACKGROUND: Research scientists and companies working in the domains of biomedicine and genomics are increasingly faced with the problem of efficiently locating, within the vast body of published scientific findings, the critical pieces of information that are needed to direct current and future research investment. RESULTS: In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation. CONCLUSION: Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.


Subjects
Computational Biology/methods, Genes, Scientific Societies, Abstracting and Indexing, Internet, Protein Interaction Mapping, Reproducibility of Results
14.
Pac Symp Biocomput ; : 338-49, 2002.
Article in English | MEDLINE | ID: mdl-11928488

ABSTRACT

MEDSYNDIKATE is a natural language processor for automatically acquiring knowledge from medical finding reports. The content of these documents is transferred to formal representation structures which constitute a corresponding text knowledge base. The system architecture integrates requirements from the analysis of single sentences, as well as those of referentially linked sentences forming cohesive texts. The strong demands MEDSYNDIKATE poses on the availability of expressive knowledge sources are met by two alternative approaches to (semi)automatic ontology engineering. We also present data on the knowledge extraction performance of MEDSYNDIKATE for three major syntactic patterns in medical documents.


Subjects
Information Storage and Retrieval/methods, Medical Informatics Applications, Medical Informatics/methods, Confidence Intervals, Humans, Information Storage and Retrieval/standards, Knowledge, Medical Informatics/standards, Reproducibility of Results