1.
J Am Med Inform Assoc ; 26(10): 1037-1045, 2019 10 01.
Article in English | MEDLINE | ID: mdl-30958542

ABSTRACT

OBJECTIVE: Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation. MATERIALS AND METHODS: Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards. RESULTS: We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms. DISCUSSION AND CONCLUSION: An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/.


Subjects
Authorship , Data Mining/methods , MEDLINE , Machine Learning , Reference Standards , Algorithms , Crowdsourcing , Databases, Bibliographic/standards , MEDLINE/standards
2.
Stud Health Technol Inform ; 221: 97-101, 2016.
Article in English | MEDLINE | ID: mdl-27071885

ABSTRACT

Reusing data for clinical research requires overcoming two main challenges: formalizing data sources and increasing their portability. Once these challenges are resolved, research applications can reuse clinical data. In this paper, three data models are described: entity-attribute-value, ontological, and data-driven. Their implementation at the University Hospitals of Geneva (HUG) is explained in the context of data integration methodologies for operational healthcare data sources in the European projects DebugIT and EHR4CR and in the national Swiss Transplant Cohort Study. In these methodologies, the clinical data are either aligned to standardised terminologies using different processing techniques or transformed and loaded directly into data models. The models are then compared and discussed against quality criteria. The comparison shows that the described data models depend strongly on the objectives of the projects.


Subjects
Data Curation/standards , Electronic Health Records/organization & administration , Hospitals, University/organization & administration , Medical Record Linkage/standards , Models, Organizational , Practice Guidelines as Topic , Switzerland
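The entity-attribute-value (EAV) model named in the abstract above can be illustrated with a minimal sketch. The row layout, the identifiers and the `pivot_eav` helper are hypothetical and are not taken from the HUG, DebugIT or EHR4CR implementations; the sketch only shows how generic (entity, attribute, value) triples can be regrouped into one record per entity.

```python
from collections import defaultdict

def pivot_eav(rows):
    """Group (entity, attribute, value) triples into one dict per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in rows:
        # A later triple for the same entity and attribute overwrites the earlier one.
        records[entity][attribute] = value
    return dict(records)

# Hypothetical EAV rows; real clinical sources would carry coded attributes.
rows = [
    ("patient-1", "birth_year", 1970),
    ("patient-1", "diagnosis", "hypothetical-code-A"),
    ("patient-2", "birth_year", 1985),
]

records = pivot_eav(rows)
print(records["patient-1"])
```

The appeal of EAV in this setting is that new attributes need no schema change; the cost is that queries must pivot the triples back into records, as the helper does.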
3.
Article in English | MEDLINE | ID: mdl-26384372

ABSTRACT

Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/.


Subjects
Data Mining/methods , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Animals , Humans
4.
Article in English | MEDLINE | ID: mdl-25190367

ABSTRACT

Gene function curation of the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and help from bioinformatics is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task of automatic GO concept assignment from full text. At that time, results were judged to be far from the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this task, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning has this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
DATABASE URL: http://eagl.unige.ch/GOCat4FT/.


Subjects
Computational Biology/methods , Data Curation/methods , Gene Ontology , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Software
5.
Stud Health Technol Inform ; 205: 995-9, 2014.
Article in English | MEDLINE | ID: mdl-25160337

ABSTRACT

We present an electronic capture tool to process informed consent forms, which must be recorded when running a clinical trial. The tool aims to extract information expressing the duration of the consent given by the patient to authorize the exploitation of biomarker-related information collected during clinical trials. The system integrates a language detection module (LDM) to route a document to the appropriate information extraction module (IEM). The IEM is based on language-specific sets of linguistic rules for the identification of relevant textual facts. Both the LDM and the IEM achieve an accuracy of 99%. The architecture of the system is described in detail.


Subjects
Clinical Trials as Topic/statistics & numerical data , Consent Forms/classification , Consent Forms/standards , Databases, Factual , Drug Industry/statistics & numerical data , Information Storage and Retrieval/methods , Artificial Intelligence , Clinical Trials as Topic/legislation & jurisprudence , Consent Forms/legislation & jurisprudence , Database Management Systems , Drug Industry/legislation & jurisprudence , Internationality , Natural Language Processing , Vocabulary, Controlled
6.
Stud Health Technol Inform ; 197: 29-33, 2014.
Article in English | MEDLINE | ID: mdl-24743073

ABSTRACT

A bridge between a Clinical Information System (CIS) and a Clinical Research Environment (CRE) can provide functionality that is not easily implemented by a traditional legacy EHR system. This paper describes the experience of such an implementation at the University Hospitals of Geneva. A general overview of the mapping of data extracted from the CIS to the i2b2 Clinical Data Warehouse is provided. The implementation provides interoperability for the CRE.


Subjects
Biomedical Research , Electronic Health Records , Information Storage and Retrieval/methods , Medical Informatics/methods , Medical Record Linkage/methods , Natural Language Processing , Switzerland , Systems Integration
7.
Stud Health Technol Inform ; 192: 1068, 2013.
Article in English | MEDLINE | ID: mdl-23920842

ABSTRACT

The high heterogeneity of biomedical vocabulary is a major obstacle for information retrieval in large biomedical collections. Therefore, using biomedical controlled vocabularies is crucial for managing these contents. We investigate the impact of query expansion based on controlled vocabularies on the effectiveness of two search engines. Our strategy relies on the enrichment of users' queries with additional terms, directly derived from such vocabularies applied to infectious diseases and chemical patents. We observed that query expansion based on pathogen names improved the top-precision of our first search engine, while the normalization of diseases degraded it. The expansion of chemical entities, which was performed on the second search engine, positively affected the mean average precision. We have shown that query expansion of some types of biomedical entities has great potential to improve search effectiveness; fine-tuning query expansion strategies could therefore help improve the performance of search engines.


Subjects
Data Mining/methods , Database Management Systems , Databases, Factual , Medical Subject Headings , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Artificial Intelligence
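The query expansion strategy described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `VOCABULARY` mapping and the `expand_query` helper are hypothetical, matching is plain case-insensitive substring search, and the single entry reuses the S. aureus / Staphylococcus aureus name variants mentioned elsewhere in this listing.

```python
# Hypothetical controlled vocabulary: preferred name -> known variants.
VOCABULARY = {
    "Staphylococcus aureus": ["S. aureus"],
}

def expand_query(query, vocabulary):
    """Return the query followed by vocabulary variants of any entity it mentions."""
    lowered = query.lower()
    expansions = []
    for preferred, variants in vocabulary.items():
        names = [preferred, *variants]
        # Expand only when at least one name of the entity occurs in the query.
        if any(name.lower() in lowered for name in names):
            for name in names:
                if name.lower() not in lowered and name not in expansions:
                    expansions.append(name)
    return [query, *expansions]

expanded = expand_query("S. aureus resistance", VOCABULARY)
print(expanded)
```

Returning the extra terms as a list leaves the choice of how to combine them (boolean OR, term weighting) to the search engine, which is the kind of tuning the abstract says affects precision.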
8.
Database (Oxford) ; 2013: bat041, 2013.
Article in English | MEDLINE | ID: mdl-23842461

ABSTRACT

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at providing assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/


Subjects
Data Mining/methods , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Knowledge Bases
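The k-Nearest Neighbours idea and the Recall at 20 (R20) measure from the abstract above can be illustrated with a minimal sketch. It is not the GOCat implementation: the similarity scores, the placeholder term identifiers (`GO:A`, `GO:B`, ...) and both helper functions are hypothetical, and scoring a term by summed neighbour similarity is an assumed, common k-NN variant.

```python
from collections import defaultdict

def knn_rank_terms(neighbours, k=3):
    """Rank terms by summed similarity over the k most similar curated abstracts.

    `neighbours` is a list of (similarity, terms) pairs, one per curated abstract.
    """
    nearest = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, terms in nearest:
        for term in terms:
            scores[term] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda term: (-scores[term], term))

def recall_at_k(ranked_terms, gold_terms, k=20):
    """Fraction of the gold terms that appear among the top-k ranked terms."""
    gold = set(gold_terms)
    if not gold:
        return 0.0
    return len(set(ranked_terms[:k]) & gold) / len(gold)

# Hypothetical neighbours of one unseen abstract (placeholder identifiers).
neighbours = [
    (0.9, ["GO:A", "GO:B"]),
    (0.7, ["GO:B", "GO:C"]),
    (0.2, ["GO:D"]),
    (0.1, ["GO:E"]),
]
ranked = knn_rank_terms(neighbours, k=3)
print(ranked)
print(recall_at_k(ranked, ["GO:B", "GO:E"], k=20))
```

In this toy case one of the two gold terms is only attached to a neighbour outside the top 3, so it is never proposed. The same mechanism explains why a larger knowledge base can raise R20 without any change to the algorithm.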
9.
Stud Health Technol Inform ; 186: 155-9, 2013.
Article in English | MEDLINE | ID: mdl-23542988

ABSTRACT

Given the vast amount of biomedical data, there is a need to improve information retrieval processes in the biomedical domain. The use of biomedical ontologies has facilitated the combination of various data sources (e.g. scientific literature, clinical data repositories) by increasing the quality of information retrieval and reducing maintenance effort. In this context, we developed Ontology Look-up Services (OLS) based on the NEWT and MeSH vocabularies. Our services were used in information retrieval tasks such as gene/disease normalization. The implementation of OLS significantly accelerated the extraction of particular biomedical facts by structuring and enriching the data context. Precision in normalization tasks improved by about 20%.


Subjects
Abstracting and Indexing/methods , Data Mining/methods , Database Management Systems , Databases, Bibliographic , Medical Subject Headings , Natural Language Processing , Periodicals as Topic , Semantics , User-Computer Interface
10.
Database (Oxford) ; 2012: bas050, 2012.
Article in English | MEDLINE | ID: mdl-23221176

ABSTRACT

We report on the original integration of an automatic text categorization pipeline called ToxiCat (Toxicogenomic Categorizer), which we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can essentially be described as a binary classification task, where a scoring function is used to rank a selected set of articles. Components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and, finally, a set of answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.


Subjects
Data Mining/methods , Databases, Genetic/classification , Periodicals as Topic , Toxicogenetics , Internet , Molecular Sequence Annotation , Semantics , Support Vector Machine , Workflow
11.
Stud Health Technol Inform ; 180: 210-4, 2012.
Article in English | MEDLINE | ID: mdl-22874182

ABSTRACT

We present a new approach to biomedical document classification and prioritization for the Comparative Toxicogenomics Database (CTD). This approach is motivated by needs such as literature curation, in particular applied to the human health environment domain. The unique integration of chemical, gene/protein and disease data in the biomedical literature may advance the identification of exposure and disease biomarkers, mechanisms of chemical actions, and the complex aetiologies of chronic diseases. Our approach aims to assist biomedical researchers when searching for relevant articles for CTD. The task is functionally defined as a binary classification task, where selected articles must also be ranked by order of relevance. We designed an SVM classifier, which combines three main feature sets: an information retrieval system (EAGLi), a biomedical named-entity recognizer (MeSH term extraction), a gene normalization (GN) service (NormaGene) and an ad hoc keyword recognizer for diseases and chemicals. The gene identification module was evaluated on BioCreativeIII test data. Disease normalization is achieved with 95% precision and 93% recall. The classification was evaluated on the corpus provided by the BioCreative organizers in 2012. The approach showed promising performance on the test data.


Subjects
Abstracting and Indexing/methods , Data Mining/methods , Databases, Chemical , Databases, Genetic , Drug-Related Side Effects and Adverse Reactions/classification , Periodicals as Topic/classification , Toxicogenetics/methods , Database Management Systems , Humans , User-Computer Interface
12.
Stud Health Technol Inform ; 174: 89-93, 2012.
Article in English | MEDLINE | ID: mdl-22491118

ABSTRACT

We present a new approach for pathogen and gene product normalization in the biomedical literature. The approach was motivated by needs such as literature curation, in particular in the field of infectious diseases, which involves variants of bacterial species names (S. aureus, Staphylococcus aureus, ...) and of their gene products (protein ArsC, Arsenical pump modifier, Arsenate reductase, ...). Our approach is based on an Ontology Look-up Service (OLS), a Gene Ontology Categorizer (GOCat) and Gene Normalization (GN) methods. In the pathogen detection task, OLS disambiguates the pathogen names found. GOCat results are incorporated into the overall scoring system to support and confirm decision-making in the normalization of pathogens and their genomes. The evaluation was done on two test sets of the BioCreativeIII benchmark: a gold standard of manual curation (50 articles) and a silver standard (507 articles) curated from the collective results of BCIII participants. For cross-species GN we achieved a precision of 46% on the silver set and 27% on the gold set. Pathogen normalization reached 95% precision and 93% recall. GOCat improves the results of pathogen and gene normalization, confirming identified pathogens and boosting correct gene identifiers to the top of the confidence-ranked results list. Correct identification of the pathogen significantly improves normalization effectiveness and helps solve the gene disambiguation problem.


Subjects
Bacteria/classification , Bacterial Proteins/classification , Data Mining/methods , Periodicals as Topic , Vocabulary, Controlled , Humans
13.
Stud Health Technol Inform ; 174: 121-5, 2012.
Article in English | MEDLINE | ID: mdl-22491124

ABSTRACT

Health-related information retrieval is complicated by the variety of nomenclatures available to name entities, since different communities of users name the same entity in different ways. We present in this report the development and evaluation of a user-friendly interactive Web application that aims to facilitate health-related patent search. Our tool, called TWINC, relies on a search engine tuned during several patent retrieval competitions, enhanced with intelligent interaction modules, such as chemical query, normalization and expansion. While related article search showed promising performance, ad hoc search produced fairly mixed results. Nonetheless, TWINC performed well during the PatOlympics competition and was appreciated by intellectual property experts. This result should be weighed against the limited evaluation sample. We can also assume that the tool can be customized for corporate search environments to process domain- and company-specific vocabularies, including non-English literature and patent reports.


Subjects
Information Storage and Retrieval/methods , Internet , Patents as Topic , Search Engine/methods , User-Computer Interface , Artificial Intelligence , Humans
14.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.


Subjects
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
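The improvement percentages quoted in the abstract above are relative gains of the composite system over the best single team on the gold standard. A short worked check follows; the TAP-k scores are copied from the abstract, while the dictionary layout and the `relative_improvement` helper are only illustrative.

```python
# Gold-standard TAP-k scores quoted in the abstract, keyed by k.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

gains = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(gains)
```

Rounded to one decimal place, the gains reproduce the 12.4%, 21.8% and 26.6% reported for k=5, 10 and 20.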
15.
Stud Health Technol Inform ; 169: 477-81, 2011.
Article in English | MEDLINE | ID: mdl-21893795

ABSTRACT

We present exploratory investigations of multimodal mining to help design clinical guidelines for antibiotherapy. Our approach is based on the assumption that combining various sources of data, such as the literature, a clinical data warehouse, and information regarding costs, will result in better recommendations. Compared to our baseline recommendation system based on a question-answering engine built on top of PubMed, an improvement of +16% is observed when clinical data (i.e. resistance profiles) are injected into the model. As a complement to PubMed, an alternative search strategy is reported, which is significantly improved by the use of the combined multimodal approach. These results suggest that combining literature-based discovery with structured data mining can significantly improve the effectiveness of decision-support systems for authors of clinical practice guidelines.


Subjects
Anti-Bacterial Agents/therapeutic use , Practice Guidelines as Topic , Statistics as Topic/methods , Algorithms , Anti-Bacterial Agents/economics , Computer Systems , Decision Support Systems, Clinical , Drug Costs , Humans , National Institutes of Health (U.S.) , PubMed , Staphylococcus aureus/metabolism , Staphylococcus epidermidis/metabolism , United States