Results 1 - 15 of 15
1.
Stud Health Technol Inform ; 186: 155-9, 2013.
Article in English | MEDLINE | ID: mdl-23542988

ABSTRACT

With the vast amount of biomedical data, we face the need to improve information retrieval processes in the biomedical domain. The use of biomedical ontologies facilitates the combination of various data sources (e.g. scientific literature, clinical data repositories) by increasing the quality of information retrieval and reducing maintenance efforts. In this context, we developed Ontology Look-up Services (OLS) based on the NEWT and MeSH vocabularies. Our services were used in information retrieval tasks such as gene/disease normalization. The implementation of the OLS services significantly accelerated the extraction of particular biomedical facts by structuring and enriching the data context. Precision in normalization tasks improved by about 20%.


Subject(s)
Abstracting and Indexing/methods , Data Mining/methods , Database Management Systems , Databases, Bibliographic , Medical Subject Headings , Natural Language Processing , Periodicals as Topic , Semantics , User-Computer Interface
2.
Stud Health Technol Inform ; 174: 89-93, 2012.
Article in English | MEDLINE | ID: mdl-22491118

ABSTRACT

We present a new approach for pathogen and gene product normalization in the biomedical literature. The approach was motivated by needs such as literature curation, in particular in the field of infectious diseases, where the literature contains variants of bacterial species names (S. aureus, Staphylococcus aureus, ...) and of their gene products (protein ArsC, Arsenical pump modifier, Arsenate reductase, ...). Our approach is based on the use of an Ontology Look-up Service (OLS), a Gene Ontology Categorizer (GOCat) and Gene Normalization (GN) methods. In the pathogen detection task, the OLS disambiguates the pathogen names that are found. GOCat results are incorporated into an overall scoring system to support and confirm decision-making in the normalization of pathogens and their genomes. The evaluation was done on two test sets of the BioCreative III benchmark: a gold standard of manual curation (50 articles) and a silver standard (507 articles) curated from the collective results of BCIII participants. For cross-species GN we achieved a precision of 46% on the silver set and 27% on the gold set. Pathogen normalization reached 95% precision and 93% recall. GOCat explicitly improves the results of pathogen and gene normalization, confirming identified pathogens and boosting correct gene identifiers to the top of the result list ranked by confidence. Correct identification of the pathogen can significantly improve normalization effectiveness and resolve gene ambiguity.
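The name-variant problem described in this abstract can be illustrated with a minimal dictionary look-up sketch. The lexicon, its entries and the function name are hypothetical and are not taken from the paper; the actual Ontology Look-up Service is not described in enough detail here to reproduce.

```python
import re

# Hypothetical mini-lexicon: surface variant -> canonical species name.
# Entries are illustrative only; a real service would draw on a full taxonomy.
SPECIES_VARIANTS = {
    "s. aureus": "Staphylococcus aureus",
    "staphylococcus aureus": "Staphylococcus aureus",
    "s. epidermidis": "Staphylococcus epidermidis",
    "staphylococcus epidermidis": "Staphylococcus epidermidis",
}


def normalize_species(text):
    """Return the sorted canonical names whose variants occur in the text."""
    lowered = text.lower()
    found = set()
    for variant, canonical in SPECIES_VARIANTS.items():
        # Lookarounds keep a variant from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(canonical)
    return sorted(found)
```

Both the abbreviated and the full form map to one canonical name, so a sentence mentioning "S. aureus" and "Staphylococcus aureus" yields a single normalized entity.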


Subject(s)
Bacteria/classification , Bacterial Proteins/classification , Data Mining/methods , Periodicals as Topic , Vocabulary, Controlled , Humans
3.
Stud Health Technol Inform ; 174: 121-5, 2012.
Article in English | MEDLINE | ID: mdl-22491124

ABSTRACT

Health-related information retrieval is complicated by the variety of nomenclatures available to name entities, since different communities of users name the same entity in different ways. We present in this report the development and evaluation of a user-friendly interactive Web application that aims to facilitate health-related patent search. Our tool, called TWINC, relies on a search engine tuned during several patent retrieval competitions, enhanced with intelligent interaction modules, such as chemical query, normalization and expansion. While the related-article search functionality showed promising performance, ad hoc search produced fairly mixed results. Nonetheless, TWINC performed well during the PatOlympics competition and was appreciated by intellectual property experts. This result should be weighed against the limited evaluation sample. We can also assume that the tool can be customized for corporate search environments to process domain- and company-specific vocabularies, including non-English literature and patent reports.


Subject(s)
Information Storage and Retrieval/methods , Internet , Patents as Topic , Search Engine/methods , User-Computer Interface , Artificial Intelligence , Humans
4.
Stud Health Technol Inform ; 180: 210-4, 2012.
Article in English | MEDLINE | ID: mdl-22874182

ABSTRACT

We present a new approach to biomedical document classification and prioritization for the Comparative Toxicogenomics Database (CTD). This approach is motivated by needs such as literature curation, in particular in the human health environment domain. The unique integration of chemical, gene/protein and disease data in the biomedical literature may advance the identification of exposure and disease biomarkers, mechanisms of chemical actions, and the complex aetiologies of chronic diseases. Our approach aims to assist biomedical researchers searching for articles relevant to CTD. The task is functionally defined as a binary classification task, where selected articles must also be ranked by order of relevance. We designed an SVM classifier, which combines three main feature sets: an information retrieval system (EAGLi), a biomedical named-entity recognizer (MeSH term extraction), a gene normalization (GN) service (NormaGene) and an ad hoc keyword recognizer for diseases and chemicals. The gene identification module was evaluated on BioCreative III test data. Disease normalization is achieved with 95% precision and 93% recall. The classification was evaluated on the corpus provided by the BioCreative organizers in 2012. The approach showed promising performance on the test data.


Subject(s)
Abstracting and Indexing/methods , Data Mining/methods , Databases, Chemical , Databases, Genetic , Drug-Related Side Effects and Adverse Reactions/classification , Periodicals as Topic/classification , Toxicogenetics/methods , Database Management Systems , Humans , User-Computer Interface
5.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
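The relative improvements quoted in the RESULTS follow directly from the reported TAP-k scores on the gold standard. The short check below recomputes them; the variable and function names are illustrative and not taken from the paper.

```python
# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# One rounded percentage per value of k.
improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
```

The resulting values match the 12.4%, 21.8% and 26.6% stated for k = 5, 10 and 20.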


Subject(s)
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
6.
Stud Health Technol Inform ; 169: 477-81, 2011.
Article in English | MEDLINE | ID: mdl-21893795

ABSTRACT

We present exploratory investigations of multimodal mining to help design clinical guidelines for antibiotherapy. Our approach is based on the assumption that combining various sources of data, such as the literature, a clinical data warehouse, and information regarding costs, will result in better recommendations. Compared to our baseline recommendation system based on a question-answering engine built on top of PubMed, an improvement of +16% is observed when clinical data (i.e. resistance profiles) are injected into the model. As a complement to PubMed, an alternative search strategy is reported, which is significantly improved by the use of the combined multimodal approach. These results suggest that combining literature-based discovery with structured data mining can significantly improve the effectiveness of decision-support systems for authors of clinical practice guidelines.


Subject(s)
Anti-Bacterial Agents/therapeutic use , Practice Guidelines as Topic , Statistics as Topic/methods , Algorithms , Anti-Bacterial Agents/economics , Computer Systems , Decision Support Systems, Clinical , Drug Costs , Humans , National Institutes of Health (U.S.) , PubMed , Staphylococcus aureus/metabolism , Staphylococcus epidermidis/metabolism , United States
7.
J Am Med Inform Assoc ; 26(10): 1037-1045, 2019 10 01.
Article in English | MEDLINE | ID: mdl-30958542

ABSTRACT

OBJECTIVE: Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation. MATERIALS AND METHODS: Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards. RESULTS: We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms. DISCUSSION AND CONCLUSION: An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/.


Subject(s)
Authorship , Data Mining/methods , MEDLINE , Machine Learning , Reference Standards , Algorithms , Crowdsourcing , Databases, Bibliographic/standards , MEDLINE/standards
8.
Stud Health Technol Inform ; 221: 97-101, 2016.
Article in English | MEDLINE | ID: mdl-27071885

ABSTRACT

To reuse data for clinical research, two main challenges must be overcome: formalizing data sources and increasing their portability. Once these challenges are resolved, research applications can reuse clinical data. This paper describes three data models: entity-attribute-value, ontological and data-driven. Their implementation at the University Hospitals of Geneva (HUG) is explained in the context of data integration methodologies for operational healthcare data sources in the European projects DebugIT and EHR4CR and in the national Swiss Transplant Cohort Study. In these methodologies, clinical data are either aligned with standardised terminologies using different processing techniques or transformed and loaded directly into data models. The models are then compared and discussed against quality criteria. The comparison shows that the choice of data model depends strongly on the objectives of the project.
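Of the three data models named in this abstract, the entity-attribute-value (EAV) model is the easiest to illustrate. The sketch below stores each fact as one (entity, attribute, value) row and pivots the rows into one record per entity. The patient identifiers, attribute names and function name are hypothetical and do not come from the HUG implementation.

```python
from collections import defaultdict

# Hypothetical EAV rows: each fact is one (entity, attribute, value) triple,
# so new attributes can be added without changing the storage schema.
ROWS = [
    ("patient-1", "birth_year", 1970),
    ("patient-1", "blood_group", "A+"),
    ("patient-2", "birth_year", 1985),
]


def pivot_eav(eav_rows):
    """Group EAV triples into one attribute dictionary per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in eav_rows:
        records[entity][attribute] = value
    return dict(records)
```

The pivoted form shows the usual trade-off of EAV storage: attributes that were never recorded for an entity are simply absent, which is flexible for loading but pushes the handling of missing data to query time.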


Subject(s)
Data Curation/standards , Electronic Health Records/organization & administration , Hospitals, University/organization & administration , Medical Record Linkage/standards , Models, Organizational , Practice Guidelines as Topic , Switzerland
9.
Article in English | MEDLINE | ID: mdl-26384372

ABSTRACT

Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this context, question-answering (QA) engines are designed to display answers that are automatically extracted from the retrieved documents. Standard QA engines in the literature process a user question, retrieve relevant documents and finally extract possible answers from these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step and exploits curated biological data to infer answers that are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly, the usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers, with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/.


Subject(s)
Data Mining/methods , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Animals , Humans
10.
Article in English | MEDLINE | ID: mdl-25190367

ABSTRACT

Gene function curation of the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and help from bioinformatics is much in demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task of automatic GO concept assignment from a full text. At that time, results were judged far from reaching the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this task, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
DATABASE URL: http://eagl.unige.ch/GOCat4FT/.


Subject(s)
Computational Biology/methods , Data Curation/methods , Gene Ontology , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Software
11.
Stud Health Technol Inform ; 197: 29-33, 2014.
Article in English | MEDLINE | ID: mdl-24743073

ABSTRACT

A bridge between a Clinical Information System (CIS) and a Clinical Research Environment (CRE) can provide functionality that is not easily implemented by a traditional legacy EHR system. This paper describes the experience of such an implementation at the University Hospitals of Geneva. A general overview of the mapping of data extracted from the CIS to the i2b2 Clinical Data Warehouse is provided. The resulting implementation provides interoperability for the CRE.


Subject(s)
Biomedical Research , Electronic Health Records , Information Storage and Retrieval/methods , Medical Informatics/methods , Medical Record Linkage/methods , Natural Language Processing , Switzerland , Systems Integration
12.
Stud Health Technol Inform ; 205: 995-9, 2014.
Article in English | MEDLINE | ID: mdl-25160337

ABSTRACT

We present an electronic capture tool to process informed consents, which must be recorded when running a clinical trial. The tool aims to extract information expressing the duration of the consent given by the patient to authorize the exploitation of biomarker-related information collected during clinical trials. The system integrates a language detection module (LDM) that routes a document to the appropriate information extraction module (IEM). The IEM is based on language-specific sets of linguistic rules for the identification of relevant textual facts. The achieved accuracy of both the LDM and IEM is 99%. The architecture of the system is described in detail.
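The two-stage architecture described in this abstract (language detection, then language-specific rules) can be sketched as follows. Everything in the block is an assumption made for illustration: the stopword lists, the regular expressions and the function names are hypothetical, and the abstract does not say which languages or rules the actual system uses.

```python
import re

# Hypothetical stopword lists for a toy language detection module (LDM).
STOPWORDS = {
    "en": {"the", "of", "and", "for", "is"},
    "fr": {"le", "la", "de", "et", "pour", "est"},
}

# Hypothetical language-specific rules for the information extraction
# module (IEM): a number followed by a unit of time.
DURATION_RULES = {
    "en": re.compile(r"(\d+)\s+(years?|months?)"),
    "fr": re.compile(r"(\d+)\s+(ans?|mois)"),
}


def detect_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(counts, key=counts.get)


def extract_duration(text):
    """Route the text to the rule set of its detected language."""
    lang = detect_language(text)
    match = DURATION_RULES[lang].search(text.lower())
    if match is None:
        return lang, None
    return lang, (int(match.group(1)), match.group(2))
```

Keeping detection and extraction separate means a new language only needs a new stopword list and a new rule set, without touching the routing logic.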


Subject(s)
Clinical Trials as Topic/statistics & numerical data , Consent Forms/classification , Consent Forms/standards , Databases, Factual , Drug Industry/statistics & numerical data , Information Storage and Retrieval/methods , Artificial Intelligence , Clinical Trials as Topic/legislation & jurisprudence , Consent Forms/legislation & jurisprudence , Database Management Systems , Drug Industry/legislation & jurisprudence , Internationality , Natural Language Processing , Vocabulary, Controlled
13.
Database (Oxford) ; 2013: bat041, 2013.
Article in English | MEDLINE | ID: mdl-23842461

ABSTRACT

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators in locating and accessing this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performance have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, ranging from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, rising from 0.38 in 2006 to 0.56 in 2012 for R20.
Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists. DATABASE URL: http://eagl.unige.ch/GOCat/
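The k-Nearest Neighbours idea and the Recall at 20 measure mentioned in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions: token-set Jaccard overlap stands in for whatever similarity the real system uses, and the knowledge base, term labels and function names are hypothetical placeholders rather than actual GO identifiers or GOCat code.

```python
from collections import defaultdict

# Hypothetical knowledge base: (curated abstract text, assigned term labels).
KNOWLEDGE_BASE = [
    ("arsenate reductase activity in bacteria", {"TERM_REDUCTASE"}),
    ("membrane transport of arsenite", {"TERM_TRANSPORT"}),
    ("dna repair after damage", {"TERM_REPAIR"}),
]


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propose_terms(text, knowledge_base, k=2):
    """Rank terms by the summed similarity of the k most similar instances."""
    query = tokens(text)
    scored = sorted(
        ((jaccard(query, tokens(doc)), terms) for doc, terms in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, terms in scored[:k]:
        if similarity == 0.0:
            continue  # a neighbour with no overlap carries no evidence
        for term in terms:
            votes[term] += similarity
    return sorted(votes, key=lambda term: (-votes[term], term))


def recall_at_k(ranked, relevant, k=20):
    """Fraction of the relevant terms found among the top k proposals."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)
```

Because the proposals come from curated neighbours rather than from the wording of the terms themselves, such a system can improve simply by adding instances to the knowledge base, which is consistent with the trend the abstract reports.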


Subject(s)
Data Mining/methods , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Knowledge Bases
14.
Stud Health Technol Inform ; 192: 1068, 2013.
Article in English | MEDLINE | ID: mdl-23920842

ABSTRACT

The high heterogeneity of biomedical vocabulary is a major obstacle for information retrieval in large biomedical collections. Therefore, using biomedical controlled vocabularies is crucial for managing these contents. We investigate the impact of query expansion based on controlled vocabularies on the effectiveness of two search engines. Our strategy relies on the enrichment of users' queries with additional terms, directly derived from such vocabularies applied to infectious diseases and chemical patents. We observed that query expansion based on pathogen names improved the top-precision of our first search engine, while the normalization of diseases degraded it. The expansion of chemical entities, which was performed on the second search engine, positively affected the mean average precision. We have shown that query expansion of some types of biomedical entities has great potential to improve search effectiveness; fine-tuning query expansion strategies could therefore help improve the performance of search engines.
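Query expansion with a controlled vocabulary, as described in this abstract, amounts to adding known synonyms of a recognized entity to the user's query. The sketch below is a minimal illustration: the two-entry vocabulary, the OR-joined output syntax and the function name are assumptions, and plain substring matching is used only to keep the example short.

```python
# Hypothetical controlled vocabulary: preferred term -> synonyms.
VOCABULARY = {
    "staphylococcus aureus": ["s. aureus"],
    "aspirin": ["acetylsalicylic acid"],
}


def expand_query(query, vocabulary):
    """Append quoted synonyms of every vocabulary entry found in the query."""
    lowered = query.lower()
    clauses = [query]
    for preferred, synonyms in vocabulary.items():
        names = [preferred] + synonyms
        if any(name in lowered for name in names):
            # Add only the names the user did not already type.
            clauses.extend(f'"{name}"' for name in names if name not in lowered)
    return " OR ".join(clauses)
```

For example, `expand_query("aspirin synthesis", VOCABULARY)` returns `aspirin synthesis OR "acetylsalicylic acid"`. Whether such an expansion helps or hurts depends on the entity type, which is the point the abstract makes about pathogens versus diseases.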


Subject(s)
Data Mining/methods , Database Management Systems , Databases, Factual , Medical Subject Headings , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Artificial Intelligence
15.
Database (Oxford) ; 2012: bas050, 2012.
Article in English | MEDLINE | ID: mdl-23221176

ABSTRACT

We report on the original integration of an automatic text categorization pipeline, called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can essentially be described as a binary classification task, where a scoring function is used to rank a selected set of articles. Components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and, finally, a set of answering components and an entity recognizer for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
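The combination of binary classification and ranking described in this abstract can be sketched with a linear scoring function over the outputs of the three modules. The feature names, weights and bias below are hypothetical stand-ins: in the described pipeline the scoring function is produced by a Support Vector Machine, whereas here the values are fixed by hand for illustration.

```python
# Hypothetical weights for the three module outputs; in the described
# pipeline these would be learned by an SVM rather than set by hand.
WEIGHTS = {"retrieval": 0.5, "genes": 0.3, "entities": 0.2}
BIAS = -0.4


def score(features):
    """Linear decision value: positive means 'relevant for curation'."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS) + BIAS


def rank_articles(articles):
    """Return (article_id, score, is_relevant) sorted by decreasing score."""
    scored = [(score(features), article_id) for article_id, features in articles.items()]
    scored.sort(reverse=True)
    return [(article_id, value, value > 0) for value, article_id in scored]
```

The same decision value serves both purposes: its sign gives the binary label and its magnitude gives the rank, so articles can be prioritized without a second model.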


Subject(s)
Data Mining/methods , Databases, Genetic/classification , Periodicals as Topic , Toxicogenetics , Internet , Molecular Sequence Annotation , Semantics , Support Vector Machine , Workflow