Search | VHL Regional Portal

Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge.

Krallinger, Martin; Morgan, Alexander; Smith, Larry; Leitner, Florian; Tanabe, Lorraine; Wilbur, John; Hirschman, Lynette; Valencia, Alfonso.

Genome Biol ; 9 Suppl 2: S1, 2008.

Article in English | MEDLINE | ID: mdl-18834487

ABSTRACT

BACKGROUND: Genome sciences have experienced an increasing demand for efficient text-processing tools that can extract biologically relevant information from the growing amount of published literature. In response, a range of text-mining and information-extraction tools have recently been developed specifically for the biological domain. Such tools are only useful if they are designed to meet real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) consists of a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems. RESULTS: The Second BioCreative assessment (2006 to 2007) attracted 44 teams from 13 countries worldwide, with the aim of evaluating current information-extraction/text-mining technologies developed for one or more of the three tasks defined for this challenge evaluation. These tasks included the recognition of gene mentions in abstracts (gene mention task); the extraction of a list of unique identifiers for human genes mentioned in abstracts (gene normalization task); and finally the extraction of physical protein-protein interaction annotation-relevant information (protein-protein interaction task). The 'gold standard' data used for evaluating submissions for the third task was provided by the interaction databases MINT (Molecular Interaction Database) and IntAct. CONCLUSION: The Second BioCreative assessment almost doubled the number of participants for each individual task when compared with the first BioCreative assessment. An overall improvement in terms of balanced precision and recall was observed for the best submissions for the gene mention task (F score 0.87); for the gene normalization task, the best results (F score 0.81) were comparable to results obtained for similar tasks posed at the first BioCreative challenge. In the case of the protein-protein interaction task, the importance and difficulties of experimentally confirmed annotation extraction from full-text articles were explored, yielding different results depending on the step of the annotation extraction workflow. A common characteristic observed in all three tasks was that the combination of system outputs could yield better results than any single system. Finally, the development of the first text-mining meta-server was promoted within the context of this community challenge.

Subjects

Computational Biology/methods , Societies, Scientific , Computational Biology/instrumentation , Genes , Natural Language Processing , Protein Interaction Mapping

Overview of BioCreative II gene mention recognition.

Smith, Larry; Tanabe, Lorraine K; Ando, Rie Johnson nee; Kuo, Cheng-Ju; Chung, I-Fang; Hsu, Chun-Nan; Lin, Yu-Shi; Klinger, Roman; Friedrich, Christoph M; Ganchev, Kuzman; Torii, Manabu; Liu, Hongfang; Haddow, Barry; Struble, Craig A; Povinelli, Richard J; Vlachos, Andreas; Baumgartner, William A; Hunter, Lawrence; Carpenter, Bob; Tsai, Richard Tzong-Han; Dai, Hong-Jie; Liu, Feng; Chen, Yifei; Sun, Chengjie; Katrenko, Sophia; Adriaans, Pieter; Blaschke, Christian; Torres, Rafael; Neves, Mariana; Nakov, Preslav; Divoli, Anna; Maña-López, Manuel; Mata, Jacinto; Wilbur, W John.

Genome Biol ; 9 Suppl 2: S2, 2008.

Article in English | MEDLINE | ID: mdl-18834493

ABSTRACT

Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.
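For readers unfamiliar with the metric, the F1 score quoted above is the harmonic mean of precision and recall over the gene mentions a system returns. The following is a minimal Python sketch, not the official BioCreative II scoring program. The span representation (sentence ID, start, end), the function name `f1_score`, and the example data are all assumptions made for illustration, and only exact matches are counted.

```python
from typing import Set, Tuple

# Hypothetical span representation: (sentence_id, start_offset, end_offset).
Span = Tuple[str, int, int]


def f1_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """Return the exact-match F1 score of predicted spans against gold spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, not taken from the challenge corpus.
gold = {("s1", 0, 4), ("s1", 10, 15), ("s2", 3, 8)}
predicted = {("s1", 0, 4), ("s2", 3, 8), ("s2", 20, 25), ("s3", 1, 2)}
score = f1_score(predicted, gold)
print(round(score, 4))
```

Here two of four predictions are correct (precision 0.5) and two of three gold mentions are found (recall about 0.667), so the score is pulled toward the lower of the two values.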

Subjects

Computational Biology/methods , Genes , Societies, Scientific , Congresses as Topic

SemCat: semantically categorized entities for genomics.

Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne; Comeau, Donald C; Wilbur, W John.

AMIA Annu Symp Proc ; : 754-8, 2006.

Article in English | MEDLINE | ID: mdl-17238442

ABSTRACT

We describe the construction of a semantic database called SemCat consisting of a large number of semantically categorized names relevant to genomics. SemCat can be used to facilitate natural language processing in MEDLINE. We present suitable application areas including biomedical name classification and named entity recognition.

Subjects

Genomics , Natural Language Processing , Terminology as Topic , Algorithms , MEDLINE , Pattern Recognition, Automated , Semantics

GENETAG: a tagged corpus for gene/protein named entity recognition.

Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H; Matten, Wayne; Wilbur, W John.

BMC Bioinformatics ; 6 Suppl 1: S3, 2005.

Article in English | MEDLINE | ID: mdl-15960837

ABSTRACT

BACKGROUND: Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition. RESULTS: To ensure heterogeneity of the corpus, MEDLINE sentences were first scored for term similarity to documents with known gene names, and 10K high- and 10K low-scoring sentences were chosen at random. The original 20K sentences were run through a gene/protein name tagger, and the results were modified manually to reflect a wide definition of gene/protein names subject to a specificity constraint, a rule that required the tagged entities to refer to specific entities. Each sentence in GENETAG was annotated with acceptable alternatives to the gene/protein names it contained, allowing for partial matching with semantic constraints. Semantic constraints are rules requiring the tagged entity to contain its true meaning in the sentence context. Application of these constraints results in a more meaningful measure of the performance of an NER system than unrestricted partial matching. CONCLUSION: The annotation of GENETAG required intricate manual judgments by annotators, which hindered tagging consistency. The data were pre-segmented into words to provide indices supporting comparison of system responses to the "gold standard". However, character-based indices would have been more robust than word-based indices. GENETAG Train, Test and Round1 data and ancillary programs are freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz. A newer version, GENETAG-05, will be released later this year.
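The idea of scoring against acceptable alternatives can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the ancillary programs distributed with GENETAG: the `(start, end)` word-index representation, the dictionary layout, and the function name `count_matches` are hypothetical. A gold mention counts as found only if a predicted span equals the gold span or one of its annotated alternatives, which is what distinguishes this from unrestricted partial matching.

```python
from typing import Dict, Set, Tuple

# Hypothetical (start, end) word indices within a single sentence.
Span = Tuple[int, int]


def count_matches(predicted: Set[Span],
                  gold_with_alternatives: Dict[Span, Set[Span]]) -> int:
    """Count gold mentions hit by a prediction that equals the gold span
    or one of its annotator-approved alternatives."""
    matched = 0
    for gold_span, alternatives in gold_with_alternatives.items():
        acceptable = {gold_span} | alternatives
        if predicted & acceptable:
            matched += 1
    return matched


# Made-up example: the first gold mention allows a shorter alternative,
# the second allows none.
gold_with_alternatives = {(0, 3): {(1, 3)}, (7, 9): set()}
predicted = {(1, 3), (7, 8)}
matches = count_matches(predicted, gold_with_alternatives)
print(matches)
```

In this example the prediction `(1, 3)` is accepted because it is a listed alternative, while `(7, 8)` overlaps a gold mention but is rejected because no alternative permits it.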

Subjects

Genes , Proteins/classification , Proteins/genetics , Recognition, Psychology , Terminology as Topic , Animals , Humans , MEDLINE

Generation of a large gene/protein lexicon by morphological pattern analysis.

Tanabe, Lorraine; Wilbur, W John.

J Bioinform Comput Biol ; 1(4): 611-26, 2004 Jan.

Article in English | MEDLINE | ID: mdl-15290756

ABSTRACT

The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (+/-3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.

Subjects

Genes , Natural Language Processing , Proteins , Terminology as Topic , Algorithms , Artificial Intelligence , Computational Biology , MEDLINE , Pattern Recognition, Automated , Software Design

Tagging gene and protein names in biomedical text.

Tanabe, Lorraine; Wilbur, W John.

Bioinformatics ; 18(8): 1124-32, 2002 Aug.

Article in English | MEDLINE | ID: mdl-12176836

ABSTRACT

MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation. RESULTS: We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets. AVAILABILITY: The programs are available on request from the authors.

Subjects

Abstracting and Indexing/methods , Artificial Intelligence , DNA/classification , Information Storage and Retrieval/methods , Proteins/classification , Terminology as Topic , Abbreviations as Topic , Algorithms , DNA/genetics , False Negative Reactions , False Positive Reactions , Genes , Information Storage and Retrieval/statistics & numerical data , MEDLINE , Models, Statistical , National Library of Medicine (U.S.) , Pattern Recognition, Automated , Proteins/genetics , Reproducibility of Results , Sensitivity and Specificity , United States

The bioinformatics of microarray gene expression profiling.

Weinstein, John N; Scherf, Uwe; Lee, Jae K; Nishizuka, Satoshi; Gwadry, Fuad; Bussey, Ajay Kim; Kim, S; Smith, Lawrence H; Tanabe, Lorraine; Richman, Samuel; Alexander, Jessie; Kouros-Mehr, Hosein; Maunakea, Alika; Reinhold, William C.

Cytometry ; 47(1): 46-9, 2002 Jan 01.

Article in English | MEDLINE | ID: mdl-11774349

Subjects

Databases, Nucleic Acid , Gene Expression , Computational Biology , Computers , Gene Expression Profiling/methods , Humans , Oligonucleotide Array Sequence Analysis/methods , Software
