Results 1 - 20 of 41
1.
BMC Med Inform Decis Mak ; 19(Suppl 7): 273, 2019 12 23.
Article in English | MEDLINE | ID: mdl-31865903

ABSTRACT

BACKGROUND: Clinical Named Entity Recognition aims to find the names of diseases, body parts and other related terms in a given text. Because the Chinese language is quite different from English, a machine cannot simply obtain graphical and phonetic information from Chinese characters, so the method for Chinese should differ from that for English. Chinese characters carry abundant information in their graphical features, and recent research on Chinese word embedding tries to use graphical information as subword units. This paper uses both graphical and phonetic features to improve Chinese Clinical Named Entity Recognition, based on the presence of phono-semantic characters. METHODS: This paper proposes three different embedding models and tests them on annotated data. The data were divided into two sections to explore the effect of the proportion of phono-semantic characters. RESULTS: The model using the primary radical and pinyin can improve Clinical Named Entity Recognition in Chinese, achieving an F-measure of 0.712. A higher proportion of phono-semantic characters does not give a better result. CONCLUSIONS: The paper shows that combining graphical and phonetic features can improve Clinical Named Entity Recognition in Chinese.
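
The F-measure reported above is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of an entity-level computation, assuming exact-match scoring over (start, end, type) spans; the span format and the sample data are hypothetical and are not taken from the paper.

```python
def f_measure(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    # A prediction counts as correct only if boundaries and type both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, entity type)
gold = [(0, 2, "disease"), (5, 7, "body_part"), (9, 12, "symptom")]
predicted = [(0, 2, "disease"), (5, 7, "symptom"), (9, 12, "symptom")]

precision, recall, f1 = f_measure(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

The second predicted span has the right boundaries but the wrong type, so it counts as both a false positive and a false negative under exact matching.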


Subject(s)
Language , Machine Learning , Natural Language Processing , Phonetics , Data Curation , Electronic Health Records , Humans , Semantics
2.
BMC Bioinformatics ; 16 Suppl 10: S2, 2015.
Article in English | MEDLINE | ID: mdl-26202570

ABSTRACT

BACKGROUND: Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies. RESULTS: Six groups participated in the CG task and two groups in the PC task, together applying a wide range of extraction approaches including both established state-of-the-art systems and newly introduced extraction methods. The best-performing systems achieved F-scores of 55% on the CG task and 53% on the PC task, demonstrating a level of performance comparable to the best results achieved in similar previously proposed tasks. CONCLUSIONS: The results indicate that existing event extraction technology can generalize to meet the novel challenges represented by the CG and PC task settings, suggesting that extraction methods are capable of supporting the construction of knowledge bases on the molecular mechanisms of cancer and the curation of biomolecular pathway models. The CG and PC tasks continue as open challenges for all interested parties, with data, tools and resources available from the shared task homepage.


Subject(s)
Gene Regulatory Networks , Genes , Information Storage and Retrieval , Knowledge Bases , Models, Theoretical , Neoplasms/genetics , Neoplasms/pathology , Humans , Natural Language Processing
3.
Bioinformatics ; 28(18): i575-i581, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962484

ABSTRACT

MOTIVATION: Event extraction using expressive structured representations has been a significant focus of recent efforts in biomedical information extraction. However, event extraction resources and methods have so far focused almost exclusively on molecular-level entities and processes, limiting their applicability. RESULTS: We extend the event extraction approach to biomedical information extraction to encompass all levels of biological organization from the molecular to the whole organism. We present the ontological foundations, target types and guidelines for entity and event annotation and introduce the new multi-level event extraction (MLEE) corpus, manually annotated using a structured representation for event extraction. We further adapt and evaluate named entity and event extraction methods for the new task, demonstrating that both can be achieved with performance broadly comparable with that for established molecular entity and event extraction tasks. AVAILABILITY: The resources and methods introduced in this study are available from http://nactem.ac.uk/MLEE/. CONTACT: pyysalos@cs.man.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Mining/methods , Humans , Neoplasms
4.
BMC Bioinformatics ; 13 Suppl 11: S1, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759455

ABSTRACT

BACKGROUND: The Genia task, when it was introduced in 2009, was the first community-wide effort to address fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009, and to evaluate generalization of the technology to full text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by one of the lessons of the 2009 task, namely that the abundance of coreference structures in natural language text hinders further improvement on the Genia task. RESULTS: The Genia task received final submissions from 15 teams. The results show that the community has made significant progress, with the best F-score reaching 74% in extracting bio-molecular events of simple structure, e.g., gene expressions, and 45% to 48% in extracting those of complex structure, e.g., regulations. The Protein Coreference task received 6 final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain (50% vs. 66% in MUC score). In particular, for protein coreference resolution the best system achieved an F-score of 34%. CONCLUSIONS: Detailed analysis of the results improves our insight into the problem and suggests directions for further improvement.


Subject(s)
Information Systems , Natural Language Processing , Proteins/chemistry , Congresses as Topic , Gene Expression , Proteins/genetics , Proteins/metabolism
5.
BMC Bioinformatics ; 13 Suppl 11: S2, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759456

ABSTRACT

We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. 
In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.


Subject(s)
Epigenomics , Information Storage and Retrieval , Natural Language Processing , Protein Processing, Post-Translational , Proteins/metabolism , Communicable Diseases , DNA Methylation , Histone Code , Lipoproteins , Proteins/genetics
6.
Bioinformatics ; 27(13): i111-9, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685059

ABSTRACT

MOTIVATION: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. RESULTS: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. AVAILABILITY: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. CONTACT: tsuruoka@jaist.ac.jp.
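
The idea of discovering hidden associations from co-occurrence statistics can be illustrated with a small sketch. It assumes each document has already been reduced to a set of concept names, and it scores an indirect link by summing, over shared pivot concepts, the weaker of the two co-occurrence counts. That scoring rule, the function names and the toy documents are all assumptions made for illustration; the abstract does not describe the measure FACTA+ actually uses.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every pair of concepts, the number of documents mentioning both."""
    counts = defaultdict(lambda: defaultdict(int))
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def indirect_associations(documents, query):
    """Rank concepts that never co-occur with `query` but share a pivot concept with it."""
    counts = cooccurrence_counts(documents)
    direct = set(counts[query])
    scores = defaultdict(int)
    for pivot in direct:
        for target, n in counts[pivot].items():
            if target != query and target not in direct:
                # Hypothetical scoring: the weaker link of the query-pivot-target chain.
                scores[target] += min(counts[query][pivot], n)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy documents, each reduced to the set of concepts it mentions (hypothetical data).
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "Raynaud disease"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "Raynaud disease"},
    {"platelet aggregation", "aspirin"},
]

print(indirect_associations(docs, "fish oil"))
# [('Raynaud disease', 2), ('aspirin', 1)]
```

Here "Raynaud disease" never appears with "fish oil" in the same document, yet it ranks first because two pivot concepts connect them.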


Subject(s)
Artificial Intelligence , Data Mining , Medical Informatics Applications , Internet , MEDLINE , PubMed , United States
7.
Bioinformatics ; 27(19): 2730-7, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21821664

ABSTRACT

MOTIVATION: Understanding key biological processes (bioprocesses) and their relationships with constituent biological entities and pharmaceutical agents is crucial for drug design and discovery. One way to harvest such information is searching the literature. However, bioprocesses are difficult to capture because they may occur in text in a variety of textual expressions. Moreover, a bioprocess is often composed of a series of bioevents, where a bioevent denotes changes to one or a group of cells involved in the bioprocess. Such bioevents are often used to refer to bioprocesses in text, which current techniques, relying solely on specialized lexicons, struggle to find. RESULTS: This article presents a range of methods for finding bioprocess terms and events. To facilitate the study, we built a gold standard corpus in which terms and events related to angiogenesis, a key biological process of the growth of new blood vessels, were annotated. Statistics of the annotated corpus revealed that over 36% of the text expressions that referred to angiogenesis appeared as events. The proposed methods respectively employed domain-specific vocabularies, a manually annotated corpus and unstructured domain-specific documents. Evaluation results showed that, while a supervised machine-learning model yielded the best precision, recall and F1 scores, the other methods achieved reasonable performance and less cost to develop. AVAILABILITY: The angiogenesis vocabularies, gold standard corpus, annotation guidelines and software described in this article are available at http://text0.mib.man.ac.uk/~mbassxw2/angiogenesis/ CONTACT: xinglong.wang@gmail.com.


Subject(s)
Biological Phenomena , Data Mining/methods , Natural Language Processing , Angiogenesis Inhibitors , Artificial Intelligence , Documentation , Models, Statistical , Neovascularization, Pathologic/drug therapy , Neovascularization, Pathologic/genetics , Neovascularization, Physiologic/genetics , Software , Vocabulary
8.
Bioinformatics ; 27(8): 1185-6, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21349873

ABSTRACT

UNLABELLED: Often, the most informative genes have to be selected from different gene sets, and several computer gene ranking algorithms have been developed to cope with the problem. To help researchers decide which algorithm to use, we developed the analysis of gene ranking algorithms (AGRA) system that offers a novel technique for comparing ranked lists of genes. The most important feature of AGRA is that no previous knowledge of gene ranking algorithms is needed for their comparison. Using the text mining system Finding Associated Concepts with Text Analysis (FACTA), AGRA defines what we call a biomedical concept space (BCS) for each gene list and offers a comparison of the gene lists in six different BCS categories. The uploaded gene lists can be compared using two different methods. In the first method, the overlap between the BCSs of each pair of gene lists is calculated. The second method offers a text field where a specific biomedical concept can be entered. AGRA searches for this concept in each gene list's BCS, highlights the rank of the concept and offers a visual representation of concepts ranked above and below it. AVAILABILITY AND IMPLEMENTATION: Available at http://agra.fzv.uni-mb.si/, implemented in Java and running on the Glassfish server. CONTACT: simon.kocbek@uni-mb.si.
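
The first comparison method, pairwise overlap between concept spaces, might look like the sketch below. It treats each concept space as a plain set of concept names and uses the Jaccard index as the overlap measure. Both choices are assumptions for illustration, since the abstract does not say how AGRA computes overlap, and the ranker names and concepts are invented.

```python
from itertools import combinations


def pairwise_overlap(concept_spaces):
    """Jaccard overlap between the concept sets of every pair of gene lists."""
    overlaps = {}
    named_sets = sorted(concept_spaces.items(), key=lambda kv: kv[0])
    for (name_a, set_a), (name_b, set_b) in combinations(named_sets, 2):
        union = set_a | set_b
        overlaps[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps


# Hypothetical concept spaces derived from three ranked gene lists.
spaces = {
    "ranker_A": {"apoptosis", "breast cancer", "p53"},
    "ranker_B": {"apoptosis", "p53", "insulin"},
    "ranker_C": {"insulin", "glucose"},
}

for pair, score in pairwise_overlap(spaces).items():
    print(pair, score)
# ('ranker_A', 'ranker_B') 0.5
# ('ranker_A', 'ranker_C') 0.0
# ('ranker_B', 'ranker_C') 0.25
```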


Subject(s)
Algorithms , Genes , Data Mining , Software
9.
BMC Bioinformatics ; 12: 481, 2011 Dec 18.
Article in English | MEDLINE | ID: mdl-22177292

ABSTRACT

BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
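
One simple way to form an ensemble of several event extraction systems is voting over their outputs. The sketch below keeps an event when at least `min_votes` systems predict it. The voting rule, the (type, trigger, theme) event representation and the system names are hypothetical; the abstract does not specify which ensemble strategies the U-Compare meta-service provides.

```python
from collections import Counter


def vote_ensemble(system_outputs, min_votes):
    """Keep every event predicted by at least `min_votes` of the systems."""
    votes = Counter()
    for events in system_outputs.values():
        votes.update(set(events))  # each system votes at most once per event
    return {event for event, n in votes.items() if n >= min_votes}


# Hypothetical outputs; each event is a (type, trigger word, theme) tuple.
outputs = {
    "system_1": {("Phosphorylation", "phosphorylates", "STAT3"),
                 ("Expression", "expressed", "IL-2")},
    "system_2": {("Phosphorylation", "phosphorylates", "STAT3")},
    "system_3": {("Phosphorylation", "phosphorylates", "STAT3"),
                 ("Expression", "expressed", "IL-2"),
                 ("Binding", "binds", "TRAF2")},
}

majority = vote_ensemble(outputs, min_votes=2)
print(sorted(majority))
# [('Expression', 'expressed', 'IL-2'), ('Phosphorylation', 'phosphorylates', 'STAT3')]
```

Raising `min_votes` favours precision, while setting it to 1 yields the union of all outputs and favours recall.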


Subject(s)
Data Mining , Computer Systems , Periodicals as Topic , Software
10.
Bioinformatics ; 26(5): 661-7, 2010 Mar 01.
Article in English | MEDLINE | ID: mdl-20053840

ABSTRACT

MOTIVATION: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. RESULTS: We build a corpus for organism disambiguation where every occurrence of a protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. AVAILABILITY: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://u-compare.org/


Subject(s)
Data Mining/methods , Natural Language Processing , Artificial Intelligence , Databases, Factual , Species Specificity , Terminology as Topic
11.
Bioinformatics ; 26(9): 1246-53, 2010 May 01.
Article in English | MEDLINE | ID: mdl-20360059

ABSTRACT

MOTIVATION: The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions of bio-medical texts. However, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides possible concepts or senses for abbreviation disambiguation. RESULTS: A sense inventory is a key to robust management of abbreviations. Therefore, we present a supervised approach for clustering expanded forms. The experimental result reports 0.915 F1 score in clustering expanded forms. We then investigate the possibility of conflicts of protein and gene names with abbreviations. Finally, an experiment of abbreviation disambiguation on the sense inventory yielded 0.984 accuracy and 0.986 F1 score using the dataset obtained from MEDLINE abstracts. AVAILABILITY: The sense inventory and disambiguator of abbreviations are accessible at http://www.nactem.ac.uk/software/acromine/ and http://www.nactem.ac.uk/software/acromine_disambiguation/.
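
Recognizing abbreviations and their expanded forms in parenthetical expressions can be sketched with a simple heuristic: find a short token in parentheses and check whether its letters match the initials of the words immediately before it. This is a deliberately naive illustration, not the recognition method used in the study, and the pattern, function name and example sentence are assumptions.

```python
import re

# A short alphanumeric token enclosed in parentheses, e.g. "(PCR)".
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def find_abbreviations(text):
    """Return (abbreviation, expanded form) pairs found via an initial-letter match."""
    pairs = []
    for match in PAREN.finditer(text):
        abbreviation = match.group(1)
        words = text[:match.start()].split()
        # Candidate expansion: as many preceding words as the abbreviation has letters.
        candidate = words[-len(abbreviation):]
        initials = "".join(word[0] for word in candidate)
        if initials.lower() == abbreviation.lower():
            pairs.append((abbreviation, " ".join(candidate)))
    return pairs


sentence = ("The polymerase chain reaction (PCR) was used, "
            "and computed tomography (CT) was performed.")
print(find_abbreviations(sentence))
# [('PCR', 'polymerase chain reaction'), ('CT', 'computed tomography')]
```

Pairs collected this way still mix distinct senses with spelling variants of the same sense, which is why the abstract argues for clustering them into a sense inventory.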


Subject(s)
Computational Biology/methods , Algorithms , Cluster Analysis , Databases, Bibliographic , Dictionaries as Topic , MEDLINE , Models, Statistical , Natural Language Processing , Polymerase Chain Reaction/methods , Reproducibility of Results , Software , Tomography, X-Ray Computed/methods
12.
Bioinformatics ; 26(19): 2486-7, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20709690

ABSTRACT

UNLABELLED: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community. AVAILABILITY: http://u-compare.org/taverna.html, http://u-compare.org.


Subject(s)
Data Mining/methods , Computational Biology , Databases, Factual , User-Computer Interface , Workflow
13.
Bioinformatics ; 26(12): i382-90, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529932

ABSTRACT

MOTIVATION: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction. RESULTS: This study considers event-based IE at PubMed scale. We introduce a system combining publicly available, state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information. AVAILABILITY: The event detection system and extracted data are open source licensed and available at http://bionlp.utu.fi/.


Subject(s)
Data Mining/methods , PubMed , Natural Language Processing , Systems Biology
14.
Bioinformatics ; 26(12): i374-81, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529930

ABSTRACT

MOTIVATION: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations.


Subject(s)
Data Mining/methods , Software , Biological Phenomena , Systems Biology
15.
Bioinformatics ; 25(3): 394-400, 2009 Feb 01.
Article in English | MEDLINE | ID: mdl-19073593

ABSTRACT

MOTIVATION: While text mining technologies for biomedical research have gained popularity as a way to take advantage of the explosive growth of information in text form in biomedical papers, selecting appropriate natural language processing (NLP) tools is still difficult for researchers who are not familiar with recent advances in NLP. This article provides a comparative evaluation of several state-of-the-art natural language parsers, focusing on the task of extracting protein-protein interaction (PPI) from biomedical papers. We measure how each parser, and its output representation, contributes to accuracy improvement when the parser is used as a component in a PPI system. RESULTS: All the parsers attained improvements in accuracy of PPI extraction. The levels of accuracy obtained with these different parsers vary slightly, while differences in parsing speed are larger. The best accuracy in this work was obtained when we combined Miyao and Tsujii's Enju parser and Charniak and Johnson's reranking parser, and the accuracy is better than the state-of-the-art results on the same data. AVAILABILITY: The PPI extraction system used in this work (AkanePPI) is available online at http://www-tsujii.is.s.u-tokyo.ac.jp/downloads/downloads.cgi. The evaluated parsers are also available online from each developer's site.


Subject(s)
Natural Language Processing , Protein Interaction Mapping/methods , Algorithms , Databases, Protein , Proteins/chemistry , Proteins/metabolism
16.
Bioinformatics ; 25(15): 1997-8, 2009 Aug 01.
Article in English | MEDLINE | ID: mdl-19414535

ABSTRACT

SUMMARY: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. AVAILABILITY: http://u-compare.org/


Subject(s)
Computational Biology/methods , Software , Databases, Factual , Information Storage and Retrieval , User-Computer Interface
17.
BMC Bioinformatics ; 10: 403, 2009 Dec 09.
Article in English | MEDLINE | ID: mdl-19995463

ABSTRACT

BACKGROUND: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. RESULTS: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. 
Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Databases, Factual , Genes , Pattern Recognition, Automated , Proteins/genetics
18.
Bioinformatics ; 24(21): 2559-60, 2008 Nov 01.
Article in English | MEDLINE | ID: mdl-18772154

ABSTRACT

UNLABELLED: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank. AVAILABILITY: The system is available at http://www.nactem.ac.uk/software/facta/
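
Pre-indexing both words and concepts, as described above, can be pictured as a single inverted index whose keys are either plain words or concept IDs. The sketch below supports conjunctive (AND) queries mixing the two and ranks the concepts found in matching documents by raw co-occurrence count. The class, the ID scheme such as `DISEASE:asthma`, the ranking by raw count and the sample documents are assumptions for illustration and do not reflect FACTA's actual data structures or statistics.

```python
from collections import Counter, defaultdict


class ConceptIndex:
    """Toy inverted index over both words and concept IDs."""

    def __init__(self):
        self.postings = defaultdict(set)  # term or concept ID -> document IDs
        self.concepts = {}                # document ID -> concept IDs it mentions

    def add(self, doc_id, words, concepts):
        for term in set(words) | set(concepts):
            self.postings[term].add(doc_id)
        self.concepts[doc_id] = set(concepts)

    def search(self, *terms):
        """Documents containing all given words and/or concept IDs (Boolean AND)."""
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))

    def ranked_concepts(self, *terms):
        """Concepts in the matching documents, most frequently co-occurring first."""
        counts = Counter()
        for doc_id in self.search(*terms):
            counts.update(self.concepts[doc_id])
        for term in terms:
            counts.pop(term, None)  # do not report the query concept itself
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical documents: (ID, words, concept IDs)
index = ConceptIndex()
index.add(1, ["inhaled", "steroid", "asthma"], ["DISEASE:asthma", "DRUG:budesonide"])
index.add(2, ["asthma", "children"], ["DISEASE:asthma", "GENE:IL13"])
index.add(3, ["steroid", "asthma", "trial"], ["DISEASE:asthma", "DRUG:budesonide"])
index.add(4, ["diabetes"], ["DISEASE:diabetes", "DRUG:metformin"])

print(sorted(index.search("asthma", "steroid")))  # [1, 3]
print(index.ranked_concepts("DISEASE:asthma"))
# [('DRUG:budesonide', 2), ('GENE:IL13', 1)]
```

Because the concept postings are built ahead of time, a query only needs set intersections and a count at request time, which is the property that allows immediate responses on large result sets.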


Subject(s)
Abstracting and Indexing/methods , MEDLINE , Software , Database Management Systems
19.
BMC Bioinformatics ; 9 Suppl 11: S8, 2008 Nov 19.
Article in English | MEDLINE | ID: mdl-19025694

ABSTRACT

BACKGROUND: Previous studies of named entity recognition have shown that a reasonable level of recognition accuracy can be achieved by using machine learning models such as conditional random fields or support vector machines. However, the lack of training data (i.e. annotated corpora) makes it difficult for machine learning-based named entity recognizers to be used in building practical information extraction systems. RESULTS: This paper presents an active learning-like framework for reducing the human effort required to create named entity annotations in a corpus. In this framework, the annotation work is performed as an iterative and interactive process between the human annotator and a probabilistic named entity tagger. Unlike active learning, our framework aims to annotate all occurrences of the target named entities in the given corpus, so that the resulting annotations are free from the sampling bias which is inevitable in active learning approaches. CONCLUSION: We evaluate our framework by simulating the annotation process using two named entity corpora and show that our approach can reduce the number of sentences which need to be examined by the human annotator. The cost reduction achieved by the framework could be drastic when the target named entities are sparse.


Subject(s)
Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Terminology as Topic , Algorithms , Artificial Intelligence , Databases, Bibliographic , Natural Language Processing
20.
BMC Bioinformatics ; 9: 10, 2008 Jan 08.
Article in English | MEDLINE | ID: mdl-18182099

ABSTRACT

BACKGROUND: Advanced Text Mining (TM) tasks such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. RESULTS: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large-scale annotation. CONCLUSION: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.


Subject(s)
Abstracting and Indexing/methods , Artificial Intelligence , Database Management Systems , Documentation/methods , Information Storage and Retrieval/methods , Natural Language Processing , Periodicals as Topic , Vocabulary, Controlled