Búsqueda | Portal de Búsqueda de la BVS Colombia

1.

PheneBank: a literature-based database of phenotypes.

Pilehvar, Mohammad Taher; Bernard, Adam; Smedley, Damian; Collier, Nigel.

Bioinformatics ; 38(4): 1179-1180, 2022 01 27.

Artículo en Inglés | MEDLINE | ID: mdl-34788791

RESUMEN

MOTIVATION: Significant effort has been spent by curators to create coding systems for phenotypes such as the Human Phenotype Ontology, as well as disease-phenotype annotations. We aim to support the discovery of literature-based phenotypes and integrate them into the knowledge discovery process. RESULTS: PheneBank is a Web-portal for retrieving human phenotype-disease associations that have been text-mined from the whole of Medline. Our approach exploits state-of-the-art machine learning for concept identification by utilizing an expert annotated rare disease corpus from the PMC Text Mining subset. Evaluation of the system for entities is conducted on a gold-standard corpus of rare disease sentences and for associations against the Monarch initiative data. AVAILABILITY AND IMPLEMENTATION: The PheneBank Web-portal freely available at http://www.phenebank.org. Annotated Medline data is available from Zenodo at DOI: 10.5281/zenodo.1408800. Semantic annotation software is freely available for non-commercial use at GitHub: https://github.com/pilehvar/phenebank. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Enfermedades Raras , Programas Informáticos , Humanos , Algoritmos , Minería de Datos , Fenotipo

2.

BioCaster in 2021: automatic disease outbreaks detection from global news media.

Meng, Zaiqiao; Okhmatovskaia, Anya; Polleri, Maxime; Shen, Yannan; Powell, Guido; Fu, Zihao; Ganser, Iris; Zhang, Meiru; King, Nicholas B; Buckeridge, David; Collier, Nigel.

Bioinformatics ; 38(18): 4446-4448, 2022 09 15.

Artículo en Inglés | MEDLINE | ID: mdl-35900173

RESUMEN

SUMMARY: BioCaster was launched in 2008 to provide an ontology-based text mining system for early disease detection from open news sources. Following a 6-year break, we have re-launched the system in 2021. Our goal is to systematically upgrade the methodology using state-of-the-art neural network language models, whilst retaining the original benefits that the system provided in terms of logical reasoning and automated early detection of infectious disease outbreaks. Here, we present recent extensions such as neural machine translation in 10 languages, neural classification of disease outbreak reports and a new cloud-based visualization dashboard. Furthermore, we discuss our vision for further improvements, including combining risk assessment with event semantics and assessing the risk of outbreaks with multi-granularity. We hope that these efforts will benefit the global public health community. AVAILABILITY AND IMPLEMENTATION: BioCaster web-portal is freely accessible at http://biocaster.org.

Asunto(s)

Brotes de Enfermedades , Vigilancia de la Población , Vigilancia de la Población/métodos , Minería de Datos/métodos , Semántica

3.

A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics.

Gritta, Milan; Pilehvar, Mohammad Taher; Collier, Nigel.

Lang Resour Eval ; 54(3): 683-712, 2020.

Artículo en Inglés | MEDLINE | ID: mdl-32802011

RESUMEN

Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real world usage by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript introduces a new framework in three parts. (Part 1) Task Definition: clarified via corpus linguistic analysis proposing a fine-grained Pragmatic Taxonomy of Toponyms. (Part 2) Metrics: discussed and reviewed for a rigorous evaluation including recommendations for NER/Geoparsing practitioners. (Part 3) Evaluation data: shared via a new dataset called GeoWebNews to provide test/train examples and enable immediate use of our contributions. In addition to fine-grained Geotagging and Toponym Resolution (Geocoding), this dataset is also suitable for prototyping and evaluating machine learning NLP models.

4.

The digital revolution in phenotyping.

Oellrich, Anika; Collier, Nigel; Groza, Tudor; Rebholz-Schuhmann, Dietrich; Shah, Nigam; Bodenreider, Olivier; Boland, Mary Regina; Georgiev, Ivo; Liu, Hongfang; Livingston, Kevin; Luna, Augustin; Mallon, Ann-Marie; Manda, Prashanti; Robinson, Peter N; Rustici, Gabriella; Simon, Michelle; Wang, Liqin; Winnenburg, Rainer; Dumontier, Michel.

Brief Bioinform ; 17(5): 819-30, 2016 09.

Artículo en Inglés | MEDLINE | ID: mdl-26420780

RESUMEN

Phenotypes have gained increased notoriety in the clinical and biological domain owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and the data storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state-of-the-art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we provide visions of this field for future research work that could enable better applications of phenotype data.

Asunto(s)

Fenotipo , Humanos , Almacenamiento y Recuperación de la Información , Proyectos de Investigación , Investigación Biomédica Traslacional

5.

What's missing in geographical parsing?

Gritta, Milan; Pilehvar, Mohammad Taher; Limsopatham, Nut; Collier, Nigel.

Lang Resour Eval ; 52(2): 603-623, 2018.

Artículo en Inglés | MEDLINE | ID: mdl-31258456

RESUMEN

Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information in many real-world applications such as emergency responses, real-time social media geographical event analysis, understanding location instructions in auto-response systems and more. However, geoparsing is still widely regarded as a challenge because of domain language diversity, place name ambiguity, metonymic language and limited leveraging of context as we show in our analysis. Results to date, whilst promising, are on laboratory data and unlike in wider NLP are often not cross-compared. In this study, we evaluate and analyse the performance of a number of leading geoparsers on a number of corpora and highlight the challenges in detail. We also publish an automatically geotagged Wikipedia corpus to alleviate the dearth of (open source) corpora in this domain.

6.

Crowdsourcing Twitter annotations to identify first-hand experiences of prescription drug use.

Alvaro, Nestor; Conway, Mike; Doan, Son; Lofi, Christoph; Overington, John; Collier, Nigel.

J Biomed Inform ; 58: 280-287, 2015 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-26556646

RESUMEN

Self-reported patient data has been shown to be a valuable knowledge source for post-market pharmacovigilance. In this paper we propose using the popular micro-blogging service Twitter to gather evidence about adverse drug reactions (ADRs) after firstly having identified micro-blog messages (also know as "tweets") that report first-hand experience. In order to achieve this goal we explore machine learning with data crowdsourced from laymen annotators. With the help of lay annotators recruited from CrowdFlower we manually annotated 1548 tweets containing keywords related to two kinds of drugs: SSRIs (eg. Paroxetine), and cognitive enhancers (eg. Ritalin). Our results show that inter-annotator agreement (Fleiss' kappa) for crowdsourcing ranks in moderate agreement with a pair of experienced annotators (Spearman's Rho=0.471). We utilized the gold standard annotations from CrowdFlower for automatically training a range of supervised machine learning models to recognize first-hand experience. F-Score values are reported for 6 of these techniques with the Bayesian Generalized Linear Model being the best (F-Score=0.64 and Informedness=0.43) when combined with a selected set of features obtained by using information gain criteria.

Asunto(s)

Colaboración de las Masas , Prescripciones de Medicamentos , Medios de Comunicación Sociales , Humanos

7.

GENI-DB: a database of global events for epidemic intelligence.

Collier, Nigel; Doan, Son.

Bioinformatics ; 28(8): 1186-8, 2012 Apr 15.

Artículo en Inglés | MEDLINE | ID: mdl-22383735

RESUMEN

UNLABELLED: We present a novel public health database (GENI-DB) in which news events on the topic of over 176 infectious diseases and chemicals affecting human and animal health are compiled from surveillance of the global online news media in 10 languages. News event frequency data were gathered systematically through the BioCaster public health surveillance system from July 2009 to the present and is available to download by the research community for purposes of analyzing trends in the global burden of infectious diseases. Database search can be conducted by year, country, disease and language. AVAILABILITY: The GENI-DB is freely available via a web portal at http://born.nii.ac.jp/.

Asunto(s)

Bases de Datos Factuales , Vigilancia de la Población , Animales , Humanos , Internacionalidad , Internet , Medios de Comunicación de Masas , Medicina Veterinaria

8.

Semantic Speech Networks Linked to Formal Thought Disorder in Early Psychosis.

Nettekoven, Caroline R; Diederen, Kelly; Giles, Oscar; Duncan, Helen; Stenson, Iain; Olah, Julianna; Gibbs-Dean, Toni; Collier, Nigel; Vértes, Petra E; Spencer, Tom J; Morgan, Sarah E; McGuire, Philip.

Schizophr Bull ; 49(Suppl_2): S142-S152, 2023 03 22.

Artículo en Inglés | MEDLINE | ID: mdl-36946531

RESUMEN

BACKGROUND AND HYPOTHESIS: Mapping a patient's speech as a network has proved to be a useful way of understanding formal thought disorder in psychosis. However, to date, graph theory tools have not explicitly modelled the semantic content of speech, which is altered in psychosis. STUDY DESIGN: We developed an algorithm, "netts," to map the semantic content of speech as a network, then applied netts to construct semantic speech networks for a general population sample (N = 436), and a clinical sample comprising patients with first episode psychosis (FEP), people at clinical high risk of psychosis (CHR-P), and healthy controls (total N = 53). STUDY RESULTS: Semantic speech networks from the general population were more connected than size-matched randomized networks, with fewer and larger connected components, reflecting the nonrandom nature of speech. Networks from FEP patients were smaller than from healthy participants, for a picture description task but not a story recall task. For the former task, FEP networks were also more fragmented than those from controls; showing more connected components, which tended to include fewer nodes on average. CHR-P networks showed fragmentation values in-between FEP patients and controls. A clustering analysis suggested that semantic speech networks captured novel signals not already described by existing NLP measures. Network features were also related to negative symptom scores and scores on the Thought and Language Index, although these relationships did not survive correcting for multiple comparisons. CONCLUSIONS: Overall, these data suggest that semantic networks can enable deeper phenotyping of formal thought disorder in psychosis. Whilst here we focus on network fragmentation, the semantic speech networks created by Netts also contain other, rich information which could be extracted to shed further light on formal thought disorder. We are releasing Netts as an open Python package alongside this manuscript.

Asunto(s)

Trastornos Psicóticos , Habla , Humanos , Lenguaje , Trastornos Psicóticos/diagnóstico , Web Semántica , Semántica , Estudios de Casos y Controles

9.

Recognition of medication information from discharge summaries using ensembles of classifiers.

Doan, Son; Collier, Nigel; Xu, Hua; Pham, Hoang Duy; Tu, Minh Phuong.

BMC Med Inform Decis Mak ; 12: 36, 2012 May 07.

Artículo en Inglés | MEDLINE | ID: mdl-22564405

RESUMEN

BACKGROUND: Extraction of clinical information such as medications or problems from clinical text is an important task of clinical natural language processing (NLP). Rule-based methods are often used in clinical NLP systems because they are easy to adapt and customize. Recently, supervised machine learning methods have proven to be effective in clinical NLP as well. However, combining different classifiers to further improve the performance of clinical entity recognition systems has not been investigated extensively. Combining classifiers into an ensemble classifier presents both challenges and opportunities to improve performance in such NLP tasks. METHODS: We investigated ensemble classifiers that used different voting strategies to combine outputs from three individual classifiers: a rule-based system, a support vector machine (SVM) based system, and a conditional random field (CRF) based system. Three voting methods were proposed and evaluated using the annotated data sets from the 2009 i2b2 NLP challenge: simple majority, local SVM-based voting, and local CRF-based voting. RESULTS: Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the local CRF-based voting method achieved the best F-score of 90.84% (94.11% Precision, 87.81% Recall) for 10-fold cross-validation. We then compared our systems with the first-ranked system in the challenge by using the same training and test sets. Our system based on majority voting achieved a better F-score of 89.65% (93.91% Precision, 85.76% Recall) than the previously reported F-score of 89.19% (93.78% Precision, 85.03% Recall) by the first-ranked system in the challenge. CONCLUSIONS: Our experimental results using the 2009 i2b2 challenge datasets showed that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text. It suggests that simple strategies that can be easily implemented such as majority voting could have the potential to significantly improve clinical entity recognition.

Asunto(s)

Almacenamiento y Recuperación de la Información/métodos , Sistemas de Medicación , Procesamiento de Lenguaje Natural , Alta del Paciente , Reconocimiento de Normas Patrones Automatizadas , Algoritmos , Inteligencia Artificial , Técnicas de Apoyo para la Decisión , Femenino , Humanos , Equipos de Administración Institucional , Masculino , Preparaciones Farmacéuticas , Reproducibilidad de los Resultados , Semántica , Diseño de Software , Máquina de Vectores de Soporte

10.

Exploiting document graphs for inter sentence relation extraction.

Le, Hoang-Quynh; Can, Duy-Cat; Collier, Nigel.

J Biomed Semantics ; 13(1): 15, 2022 06 03.

Artículo en Inglés | MEDLINE | ID: mdl-35659292

RESUMEN

BACKGROUND: Most previous relation extraction (RE) studies have focused on intra sentence relations and have ignored relations that span sentences, i.e. inter sentence relations. Such relations connect entities at the document level rather than as relational facts in a single sentence. Extracting facts that are expressed across sentences leads to some challenges and requires different approaches than those usually applied in recent intra sentence relation extraction. Despite recent results, there are still limitations to be overcome. RESULTS: We present a novel representation for a sequence of consecutive sentences, namely document subgraph, to extract inter sentence relations. Experiments on the BioCreative V Chemical-Disease Relation corpus demonstrate the advantages and robustness of our novel system to extract both intra- and inter sentence relations in biomedical literature abstracts. The experimental results are comparable to state-of-the-art approaches and show the potential by demonstrating the effectiveness of graphs, deep learning-based model, and other processing techniques. Experiments were also carried out to verify the rationality and impact of various additional information and model components. CONCLUSIONS: Our proposed graph-based representation helps to extract â¼50% of inter sentence relations and boosts the model performance on both precision and recall compared to the baseline model.

Asunto(s)

Publicaciones

11.

A Conceptual Framework for Representing Events Under Public Health Surveillance.

Okhmatovskaia, Anya; Shen, Yannan; Ganser, Iris; Collier, Nigel; King, Nicholas B; Meng, Zaiqiao; Buckeridge, David L.

Stud Health Technol Inform ; 294: 387-391, 2022 May 25.

Artículo en Inglés | MEDLINE | ID: mdl-35612102

RESUMEN

Information integration across multiple event-based surveillance (EBS) systems has been shown to improve global disease surveillance in experimental settings. In practice, however, integration does not occur due to the lack of a common conceptual framework for encoding data within EBS systems. We aim to address this gap by proposing a candidate conceptual framework for representing events and related concepts in the domain of public health surveillance.

Asunto(s)

Brotes de Enfermedades , Vigilancia en Salud Pública , Vigilancia de la Población , Salud Pública

12.

A survey on clinical natural language processing in the United Kingdom from 2007 to 2022.

Wu, Honghan; Wang, Minhong; Wu, Jinge; Francis, Farah; Chang, Yun-Hsuan; Shavick, Alex; Dong, Hang; Poon, Michael T C; Fitzpatrick, Natalie; Levine, Adam P; Slater, Luke T; Handy, Alex; Karwath, Andreas; Gkoutos, Georgios V; Chelala, Claude; Shah, Anoop Dinesh; Stewart, Robert; Collier, Nigel; Alex, Beatrice; Whiteley, William; Sudlow, Cathie; Roberts, Angus; Dobson, Richard J B.

NPJ Digit Med ; 5(1): 186, 2022 Dec 21.

Artículo en Inglés | MEDLINE | ID: mdl-36544046

RESUMEN

Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, the effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.

13.

Developing a disease outbreak event corpus.

Conway, Mike; Kawazoe, Ai; Chanlekha, Hutchatai; Collier, Nigel.

J Med Internet Res ; 12(3): e43, 2010 Sep 28.

Artículo en Inglés | MEDLINE | ID: mdl-20876049

RESUMEN

BACKGROUND: In recent years, there has been a growth in work on the use of information extraction technologies for tracking disease outbreaks from online news texts, yet publicly available evaluation standards (and associated resources) for this new area of research have been noticeably lacking. OBJECTIVE: This study seeks to create a "gold standard" data set against which to test how accurately disease outbreak information extraction systems can identify the semantics of disease outbreak events. Additionally, we hope that the provision of an annotation scheme (and associated corpus) to the community will encourage open evaluation in this new and growing application area. METHODS: We developed an annotation scheme for identifying infectious disease outbreak events in news texts. An event--in the context of our annotation scheme--consists minimally of geographical (eg, country and province) and disease name information. However, the scheme also allows for the rich encoding of other domain salient concepts (eg, international travel, species, and food contamination). RESULTS: The work resulted in a 200-document corpus of event-annotated disease outbreak reports that can be used to evaluate the accuracy of event detection algorithms (in this case, for the BioCaster biosurveillance online news information extraction system). In the 200 documents, 394 distinct events were identified (mean 1.97 events per document, range 0-25 events per document). We also provide a download script and graphical user interface (GUI)-based event browsing software to facilitate corpus exploration. CONCLUSION: In summary, we present an annotation scheme and corpus that can be used in the evaluation of disease outbreak event extraction algorithms. The annotation scheme and corpus were designed both with the particular evaluation requirements of the BioCaster system in mind as well as the wider need for further evaluation resources in this growing research area.

Asunto(s)

Brotes de Enfermedades/estadística & datos numéricos , Sistemas en Línea , Animales , Brotes de Enfermedades/prevención & control , Documentación , Procesamiento Automatizado de Datos/métodos , Geografía , Humanos , Organización Mundial de la Salud

14.

A framework for enhancing spatial and temporal granularity in report-based health surveillance systems.

Chanlekha, Hutchatai; Kawazoe, Ai; Collier, Nigel.

BMC Med Inform Decis Mak ; 10: 1, 2010 Jan 12.

Artículo en Inglés | MEDLINE | ID: mdl-20067612

RESUMEN

BACKGROUND: Current public concern over the spread of infectious diseases has underscored the importance of health surveillance systems for the speedy detection of disease outbreaks. Several international report-based monitoring systems have been developed, including GPHIN, Argus, HealthMap, and BioCaster. A vital feature of these report-based systems is the geo-temporal encoding of outbreak-related textual data. Until now, automated systems have tended to use an ad-hoc strategy for processing geo-temporal information, normally involving the detection of locations that match pre-determined criteria, and the use of document publication dates as a proxy for disease event dates. Although these strategies appear to be effective enough for reporting events at the country and province levels, they may be less effective at discovering geo-temporal information at more detailed levels of granularity. In order to improve the capabilities of current Web-based health surveillance systems, we introduce the design for a novel scheme called spatiotemporal zoning. METHOD: The proposed scheme classifies news articles into zones according to the spatiotemporal characteristics of their content. In order to study the reliability of the annotation scheme, we analyzed the inter-annotator agreements on a group of human annotators for over 1000 reported events. Qualitative and quantitative evaluation is made on the results including the kappa and percentage agreement. RESULTS: The reliability evaluation of our scheme yielded very promising inter-annotator agreement, more than a 0.9 kappa and a 0.9 percentage agreement for event type annotation and temporal attributes annotation, respectively, with a slight degradation for the spatial attribute. However, for events indicating an outbreak situation, the annotators usually had inter-annotator agreements with the lowest granularity location. CONCLUSIONS: We developed and evaluated a novel spatiotemporal zoning annotation scheme. The results of the scheme evaluation indicate that our annotated corpus and the proposed annotation scheme are reliable and could be effectively used for developing an automatic system. Given the current advances in natural language processing techniques, including the availability of language resources and tools, we believe that a reliable automatic spatiotemporal zoning system can be achieved. In the next stage of this work, we plan to develop an automatic zoning system and evaluate its usability within an operational health surveillance system.

Asunto(s)

Brotes de Enfermedades/clasificación , Sistemas de Información Geográfica , Procesamiento de Lenguaje Natural , Vigilancia de la Población/métodos , Demografía , Humanos , Medios de Comunicación de Masas , Informática en Salud Pública , Reproducibilidad de los Resultados , Proyectos de Investigación

15.

BioCaster: detecting public health rumors with a Web-based text mining system.

Collier, Nigel; Doan, Son; Kawazoe, Ai; Goodwin, Reiko Matsuda; Conway, Mike; Tateno, Yoshio; Ngo, Quoc-Hung; Dien, Dinh; Kawtrakul, Asanee; Takeuchi, Koichi; Shigematsu, Mika; Taniguchi, Kiyosu.

Bioinformatics ; 24(24): 2940-1, 2008 Dec 15.

Artículo en Inglés | MEDLINE | ID: mdl-18922806

RESUMEN

SUMMARY: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between Layman's terms and formal-coding systems is contained in the freely available BioCaster ontology which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher order event analysis is used to detect more precisely specified warning signals that can then be notified to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles. AVAILABILITY: The BioCaster map and ontology are freely available via a web portal at http://www.biocaster.org.

Asunto(s)

Almacenamiento y Recuperación de la Información/métodos , Vigilancia de la Población , Programas Informáticos , Humanos , Internet , Salud Pública

16.

Towards role-based filtering of disease outbreak reports.

Doan, Son; Kawazoe, Ai; Conway, Mike; Collier, Nigel.

J Biomed Inform ; 42(5): 773-80, 2009 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-19171201

RESUMEN

This paper explores the role of named entities (NEs) in the classification of disease outbreak report. In the annotation schema of BioCaster, a text mining system for public health protection, important concepts that reflect information about infectious diseases were conceptually analyzed with a formal ontological methodology and classified into types and roles. Types are specified as NE classes and roles are integrated into NEs as attributes such as a chemical and whether it is being used as a therapy for some infectious disease. We focus on the roles of NEs and explore different ways to extract, combine and use them as features in a text classifier. In addition, we investigate the combination of roles with semantic categories of disease-related nouns and verbs. Experimental results using naïve Bayes and Support Vector Machine (SVM) algorithms show that: (1) roles in combination with NEs improve performance in text classification, (2) roles in combination with semantic categories of noun and verb features contribute substantially to the improvement of text classification. Both these results were statistically significant compared to the baseline "raw text" representation. We discuss in detail the effects of roles on each NE and on semantic categories of noun and verb features in terms of accuracy, precision/recall and F-score measures for the text classification task.

Asunto(s)

Brotes de Enfermedades , Almacenamiento y Recuperación de la Información/métodos , Informática Médica/métodos , Procesamiento de Lenguaje Natural , Algoritmos , Inteligencia Artificial , Teorema de Bayes , Humanos , Reconocimiento de Normas Patrones Automatizadas , Vigilancia de la Población

17.

Synonym set extraction from the biomedical literature by lexical pattern discovery.

McCrae, John; Collier, Nigel.

BMC Bioinformatics ; 9: 159, 2008 Mar 24.

Artículo en Inglés | MEDLINE | ID: mdl-18366721

RESUMEN

BACKGROUND: Although there are a large number of thesauri for the biomedical domain many of them lack coverage in terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst 1, but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular it is not certain which patterns are useful for capturing synonymy. The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactical analysis. Finally to give a more consistent and applicable result it is desirable to use these patterns to form synonym sets in a sound way. RESULTS: We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classifications of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. We achieved 73.2% precision and 29.7% recall by our method, out-performing hand-made resources such as MeSH and Wikipedia. CONCLUSION: We conclude that automatic methods can play a practical role in developing new thesauri or expanding on existing ones, and this can be done with only a small amount of training data and no need for resources such as parsers. We also concluded that the accuracy can be improved by grouping into synonym sets.

Asunto(s)

Inteligencia Artificial , Diccionarios como Asunto , Almacenamiento y Recuperación de la Información/métodos , Procesamiento de Lenguaje Natural , Publicaciones Periódicas como Asunto , Semántica , Vocabulario Controlado , Biología Computacional/métodos , Sistemas de Administración de Bases de Datos , Reconocimiento de Normas Patrones Automatizadas/métodos

18.

Structuring an event ontology for disease outbreak detection.

Kawazoe, Ai; Chanlekha, Hutchatai; Shigematsu, Mika; Collier, Nigel.

BMC Bioinformatics ; 9 Suppl 3: S8, 2008 Apr 11.

Artículo en Inglés | MEDLINE | ID: mdl-18426553

RESUMEN

BACKGROUND: This paper describes the design of an event ontology being developed for application in the machine understanding of infectious disease-related events reported in natural language text. This event ontology is designed to support timely detection of disease outbreaks and rapid judgment of their alerting status by 1) bridging a gap between layman's language used in disease outbreak reports and public health experts' deep knowledge, and 2) making multi-lingual information available. CONSTRUCTION AND CONTENT: This event ontology integrates a model of experts' knowledge for disease surveillance, and at the same time sets of linguistic expressions which denote disease-related events, and formal definitions of events. In this ontology, rather general event classes, which are suitable for application to language-oriented tasks such as recognition of event expressions, are placed on the upper-level, and more specific events of the experts' interest are in the lower level. Each class is related to other classes which represent participants of events, and linked with multi-lingual synonym sets and axioms. CONCLUSIONS: We consider that the design of the event ontology and the methodology introduced in this paper are applicable to other domains which require integration of natural language information and machine support for experts to assess them. The first version of the ontology, with about 40 concepts, will be available in March 2008.

Asunto(s)

Algoritmos , Inteligencia Artificial , Brotes de Enfermedades/prevención & control , Procesamiento de Lenguaje Natural , Reconocimiento de Normas Patrones Automatizadas/métodos , Vigilancia de la Población/métodos , Vocabulario Controlado

19.

TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

Alvaro, Nestor; Miyao, Yusuke; Collier, Nigel.

JMIR Public Health Surveill ; 3(2): e24, 2017 May 03.

Artículo en Inglés | MEDLINE | ID: mdl-28468748

RESUMEN

BACKGROUND: Work on pharmacovigilance systems using texts from PubMed and Twitter typically target at different elements and use different annotation guidelines resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner. OBJECTIVE: This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed. METHODS: We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines. RESULTS: The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes. CONCLUSIONS: We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP.

20.

A multilingual ontology for infectious disease surveillance: rationale, design and challenges.

Collier, Nigel; Kawazoe, Ai; Jin, Lihua; Shigematsu, Mika; Dien, Dinh; Barrero, Roberto A; Takeuchi, Koichi; Kawtrakul, Asanee.

Lang Resour Eval ; 40(3): 405, 2006.

Artículo en Inglés | MEDLINE | ID: mdl-32214930

RESUMEN

A lack of surveillance system infrastructure in the Asia-Pacific region is seen as hindering the global control of rapidly spreading infectious diseases such as the recent avian H5N1 epidemic. As part of improving surveillance in the region, the BioCaster project aims to develop a system based on text mining for automatically monitoring Internet news and other online sources in several regional languages. At the heart of the system is an application ontology which serves the dual purpose of enabling advanced searches on the mined facts and of allowing the system to make intelligent inferences for assessing the priority of events. However, it became clear early on in the project that existing classification schemes did not have the necessary language coverage or semantic specificity for our needs. In this article we present an overview of our needs and explore in detail the rationale and methods for developing a new conceptual structure and multilingual terminological resource that focusses on priority pathogens and the diseases they cause. The ontology is made freely available as an online database and downloadable OWL file.

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA