Results 1 - 20 of 30
1.
BMC Med Inform Decis Mak ; 19(Suppl 3): 79, 2019 04 04.
Article in English | MEDLINE | ID: mdl-30943954

ABSTRACT

BACKGROUND: Twitter messages (tweets) cover a wide range of everyday topics, including health-related ones. Analysis of health-related tweets would help us understand health conditions and concerns encountered in our daily lives. In this paper we evaluate an approach to extracting causalities from tweets using natural language processing (NLP) techniques. METHODS: Lexico-syntactic patterns based on dependency parser outputs are used for causality extraction. We focused on three health-related topics: "stress", "insomnia", and "headache." A large dataset consisting of 24 million tweets is used. RESULTS: The results show that the proposed approach achieved an average precision between 74.59% and 92.27% in comparison with human annotations. CONCLUSIONS: Manual analysis of extracted causalities in tweets reveals interesting findings about expressions on health-related topics posted by Twitter users.


Subject(s)
Causality , Information Storage and Retrieval , Natural Language Processing , Text Messaging , Datasets as Topic , Headache , Humans , Sleep Initiation and Maintenance Disorders , Social Media , Stress, Psychological
2.
J Biomed Inform ; 58 Suppl: S164-S170, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26279500

ABSTRACT

In the United States, about 600,000 people die of heart disease every year. The annual cost of care services, medications, and lost productivity reportedly exceeds 108.9 billion dollars. Effective disease risk assessment is critical to prevention, care, and treatment planning. Recent advancements in text analytics have opened up new possibilities of using the rich information in electronic medical records (EMRs) to identify relevant risk factors. The 2014 i2b2/UTHealth Challenge brought together researchers and practitioners of clinical natural language processing (NLP) to tackle the identification of heart disease risk factors reported in EMRs. We participated in this track and developed an NLP system by leveraging existing tools and resources, both public and proprietary. Our system was a hybrid of several machine-learning and rule-based components. The system achieved an overall F1 score of 0.9185, with a recall of 0.9409 and a precision of 0.8972.
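As a quick consistency check on the scores reported above, F1 is the harmonic mean of precision and recall. The sketch below is illustrative only; the function name `f1_score` is an assumption and is not part of the system described in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported = f1_score(precision=0.8972, recall=0.9409)
print(round(reported, 4))  # 0.9185, matching the reported overall F1
```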


Subject(s)
Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , California/epidemiology , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , Vocabulary, Controlled
3.
BMC Bioinformatics ; 15: 285, 2014 Aug 23.
Article in English | MEDLINE | ID: mdl-25149151

ABSTRACT

BACKGROUND: Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large numbers of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. RESULTS: A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. CONCLUSIONS: In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus.
Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand-crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.


Subject(s)
Biomedical Research/methods , Data Mining/methods , Pattern Recognition, Automated/methods , Language , Publications , Time Factors
4.
BMC Bioinformatics ; 12: 91, 2011 Apr 06.
Article in English | MEDLINE | ID: mdl-21466708

ABSTRACT

BACKGROUND: Protein O-GlcNAcylation (or O-GlcNAc-ylation) is an O-linked glycosylation involving the transfer of β-N-acetylglucosamine to the hydroxyl group of serine or threonine residues of proteins. Growing evidence suggests that protein O-GlcNAcylation is common and is analogous to phosphorylation in modulating broad ranges of biological processes. However, compared to phosphorylation, the amount of protein O-GlcNAcylation data is relatively limited and its annotation in databases is scarce. Furthermore, a bioinformatics resource for O-GlcNAcylation is lacking, and an O-GlcNAcylation site prediction tool is much needed. DESCRIPTION: We developed a database of O-GlcNAcylated proteins and sites, dbOGAP, primarily based on literature published since O-GlcNAcylation was first described in 1984. The database currently contains ~800 proteins with experimental O-GlcNAcylation information, of which ~61% are human, and 172 proteins have a total of ~400 O-GlcNAcylation sites identified. The O-GlcNAcylated proteins are primarily nucleocytoplasmic, including membrane- and non-membrane-bounded organelle-associated proteins. The known O-GlcNAcylated proteins exert a broad range of functions including transcriptional regulation, macromolecular complex assembly, intracellular transport, translation, and regulation of cell growth or death. The database also contains ~365 potential O-GlcNAcylated proteins inferred from known O-GlcNAcylated orthologs. Additional annotations, including other protein posttranslational modifications, biological pathways and disease information are integrated into the database. We developed an O-GlcNAcylation site prediction system, OGlcNAcScan, based on Support Vector Machine and trained using protein sequences with known O-GlcNAcylation sites from dbOGAP. The site prediction system achieved an area under ROC curve of 74.3% in five-fold cross-validation.
The dbOGAP website was developed to allow for performing search and query on O-GlcNAcylated proteins and associated literature, as well as for browsing by gene names, organisms or pathways, and downloading of the database. Also available from the website, the OGlcNAcScan tool presents a list of predicted O-GlcNAcylation sites for given protein sequences. CONCLUSIONS: dbOGAP is the first public bioinformatics resource to allow systematic access to the O-GlcNAcylated proteins, and related functional information and bibliography, as well as to an O-GlcNAcylation site prediction tool. The resource will facilitate research on O-GlcNAcylation and its proteomic identification.


Subject(s)
Computational Biology/methods , Acetylglucosamine/metabolism , Glycosylation , Humans , Phosphorylation , Protein Processing, Post-Translational , Proteins/metabolism , Proteomics
5.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
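The improvement percentages quoted in the results can be reproduced as relative gains of the composite system over the best single team at each cutoff k. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative assumptions, and only the TAP-k scores come from the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of baseline."""
    return (improved - baseline) / baseline * 100.0


# Gold-standard TAP-k scores reported in the abstract, keyed by k.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

for k in sorted(best_team):
    gain = relative_improvement(best_team[k], composite[k])
    print(f"k={k}: {gain:.1f}%")  # 12.4%, 21.8%, 26.6%
```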


Subject(s)
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
6.
J Am Med Inform Assoc ; 16(2): 247-55, 2009.
Article in English | MEDLINE | ID: mdl-19074302

ABSTRACT

OBJECTIVES: Biomedical named entity recognition (BNER) is a critical component in automated systems that mine biomedical knowledge in free text. Among the different types of entities in the domain, gene/protein is the most studied for BNER. Our goal is to develop a gene/protein name recognition system, BioTagger-GM, that exploits rich information in terminology sources using powerful machine learning frameworks and system combination. DESIGN: BioTagger-GM consists of four main components: (1) dictionary lookup: gene/protein names in BioThesaurus and biomedical terms in UMLS Metathesaurus are tagged in text; (2) machine learning: machine learning systems are trained using dictionary lookup results as one type of feature; (3) post-processing: heuristic rules are used to correct recognition errors; and (4) system combination: a voting scheme is used to combine recognition results from multiple systems. MEASUREMENTS: The BioCreAtIvE II Gene Mention (GM) corpus was used to evaluate the proposed method. To test its general applicability, the method was also evaluated on the JNLPBA corpus modified for gene/protein name recognition. The performance of the systems was evaluated through cross-validation tests and measured using precision, recall, and F-Measure. RESULTS: BioTagger-GM achieved an F-Measure of 0.8887 on the BioCreAtIvE II GM corpus, which is higher than that of the first-place system in the BioCreAtIvE II challenge. The applicability of the method was also confirmed on the modified JNLPBA corpus. CONCLUSION: The results suggest that terminology sources, powerful machine learning frameworks, and system combination can be integrated to build an effective BNER system.
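The abstract does not say how the voting scheme in component (4) works, so the following is only a generic sketch of one common option: a strict-majority vote over per-token labels from several taggers. The label set ("B", "I", "O"), the fallback to "O" on ties, and the function name are all assumptions, not BioTagger-GM's actual procedure.

```python
from collections import Counter
from typing import List


def majority_vote(system_labels: List[List[str]], default: str = "O") -> List[str]:
    """Combine per-token labels from several taggers by strict majority."""
    if not system_labels:
        return []
    n_tokens = len(system_labels[0])
    if any(len(labels) != n_tokens for labels in system_labels):
        raise ValueError("all systems must label the same token sequence")
    combined = []
    for token_labels in zip(*system_labels):
        top_label, top_count = Counter(token_labels).most_common(1)[0]
        # Keep the label only if more than half of the systems agree on it.
        combined.append(top_label if top_count * 2 > len(token_labels) else default)
    return combined


# Three hypothetical taggers labelling the tokens "the", "p53", "gene".
votes = [["O", "B", "I"], ["O", "B", "O"], ["O", "O", "O"]]
print(majority_vote(votes))  # ['O', 'B', 'O']
```

Token-level voting can yield label sequences that no single system produced, which is one reason a real combination step may vote on whole mentions instead.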


Subject(s)
Artificial Intelligence , Genes , Proteins , Databases, Genetic , Proteins/genetics , Unified Medical Language System
7.
AMIA Annu Symp Proc ; 2018: 1028-1035, 2018.
Article in English | MEDLINE | ID: mdl-30815146

ABSTRACT

Concept detection is an integral step in natural language processing (NLP) applications in the clinical domain. Clinical concepts are detailed (e.g., "pain in left/right upper/lower arm/leg") and expressed in diverse phrase types (e.g., noun, verb, adjective, or prepositional phrase). There are rich terminological resources in the clinical domain that include many concept synonyms. Even with these resources, concept detection remains challenging due to discontinuous and/or permuted phrase occurrences. To overcome this challenge, we investigated an approach to exploiting syntactic information. Syntactic patterns of concept phrases were mined from continuous, non-permuted forms of synonyms, and these patterns were used to detect discontinuous and/or permuted concept phrases. Experiments on 790 de-identified clinical notes showed that the proposed approach can potentially boost the recall of concept detection. Meanwhile, challenges and limitations were noticed. In this paper, we report and discuss our preliminary analysis and findings.


Subject(s)
Natural Language Processing , Pattern Recognition, Automated , Semantics , Unified Medical Language System , Algorithms , Electronic Health Records , Humans
8.
BMC Bioinformatics ; 8 Suppl 9: S5, 2007 Nov 27.
Article in English | MEDLINE | ID: mdl-18047706

ABSTRACT

MOTIVATION: With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. METHOD: We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. RESULTS: We found that detection systems agree with each other in most cases, and the existing terminological knowledge bases have good coverage of synonymous relationships for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. AVAILABILITY: The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
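To give a feel for what an alignment-based SF/LF detector does, the sketch below matches the characters of a short form right to left against a candidate text span, requiring the first character to start a word. It is a simplified illustration in the spirit of that approach, not the evaluated program; the function name is an assumption, and the step that picks the candidate span next to a parenthesized short form is omitted.

```python
from typing import Optional


def find_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Return the shortest suffix of `candidate` that aligns with `short_form`."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():  # skip punctuation inside the short form
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # ran out of text before aligning every character
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


print(find_long_form("UMLS", "the Unified Medical Language System"))
# Unified Medical Language System
```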


Subject(s)
Algorithms , Artificial Intelligence , Database Management Systems , MEDLINE , Natural Language Processing , Periodicals as Topic , Cluster Analysis , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Semantics , User-Computer Interface
9.
J Am Med Inform Assoc ; 13(5): 497-507, 2006.
Article in English | MEDLINE | ID: mdl-16799122

ABSTRACT

OBJECTIVE: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET) that identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources. METHODS: We evaluated the complexity through several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus, and measures were obtained twice, once before normalization and once after. RESULTS: The current version of BioThesaurus has over 2.6 million names or 2.1 million normalized names covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), ambiguity is 2.31 before normalization and 2.32 after, while the coverage is 94.0% based on the BioCreAtIvE data set comprising MEDLINE abstracts containing genes/proteins. CONCLUSION: The study indicated that names for genes/proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
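The ambiguity and synonymy measures defined above can be computed directly from (name, entry) pairs. The sketch below is a minimal illustration on toy data; the entry identifiers, the lowercase normalizer, and the function name are assumptions, since the abstract does not describe the normalization rules used for BioThesaurus.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple


def thesaurus_stats(
    pairs: Iterable[Tuple[str, str]],
    normalize: Callable[[str], str] = lambda name: name,
) -> Tuple[float, float]:
    """Return (average synonymy, average ambiguity) for (name, entry) pairs."""
    names_by_entry = defaultdict(set)
    entries_by_name = defaultdict(set)
    for name, entry in pairs:
        key = normalize(name)
        names_by_entry[entry].add(key)
        entries_by_name[key].add(entry)
    if not names_by_entry:
        raise ValueError("no (name, entry) pairs given")
    # Synonymy: distinct names per entry; ambiguity: distinct entries per name.
    synonymy = sum(len(v) for v in names_by_entry.values()) / len(names_by_entry)
    ambiguity = sum(len(v) for v in entries_by_name.values()) / len(entries_by_name)
    return synonymy, ambiguity


# Toy data with hypothetical entry identifiers E1..E3.
toy = [("p53", "E1"), ("P53", "E1"), ("TP53", "E1"), ("p53", "E2"), ("CAT", "E3")]
print(thesaurus_stats(toy))                       # before normalization
print(thesaurus_stats(toy, normalize=str.lower))  # after a lowercase normalization
```

On the toy data, normalization merges "p53" and "P53", so synonymy drops while ambiguity rises slightly, the same direction of change as the figures reported in the abstract.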


Subject(s)
Natural Language Processing , Proteins , Vocabulary, Controlled , Dictionaries as Topic , Genes , Names
10.
Biomed Inform Insights ; 8(Suppl 1): 1-11, 2016.
Article in English | MEDLINE | ID: mdl-27375358

ABSTRACT

In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.

11.
Article in English | MEDLINE | ID: mdl-26357075

ABSTRACT

We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system has integrated several new features, including the capability of processing full-text articles and generalizability towards different post-translational modifications (PTMs). To evaluate the system, sets of abstracts and full-text articles, containing a variety of textual expressions, were annotated. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. It was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, and achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on the corpus. Full-scale processing of all abstracts in MEDLINE and all articles in PubMed Central Open Access Subset has demonstrated scalability for mining rich information in literature, enabling its adoption for biocuration and for knowledge discovery. The new system is generalizable and it will be adapted to tackle other major PTM types. The RLIMS-P 2.0 system is available online (http://proteininformationresource.org/rlimsp/) and the developed corpora are available from iProLINK (http://proteininformationresource.org/iprolink/).


Subject(s)
Computational Biology/methods , Data Mining/methods , Natural Language Processing , Phosphoproteins/chemistry , Phosphoproteins/classification , Software , Databases, Protein , Phosphoproteins/analysis , Phosphorylation
12.
J Biomed Semantics ; 5(1): 3, 2014 Jan 17.
Article in English | MEDLINE | ID: mdl-24438362

ABSTRACT

BACKGROUND: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. RESULTS: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. CONCLUSIONS: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types in developing a new corpus for machine learning models. Further investigation is expected to yield insights into the underlying mechanism that achieves good performance when multiple concept types are considered.
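The difference between the two training strategies can be seen in how the same annotated sentence is turned into training labels. The sketch below assumes BIO-style labels such as "B-problem"; the label format, the type names, and the function name are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def project_labels(labels: List[str], keep_type: str) -> List[str]:
    """Keep BIO labels of one concept type and map every other label to 'O'."""
    projected = []
    for label in labels:
        if label != "O" and label.split("-", 1)[1] == keep_type:
            projected.append(label)
        else:
            projected.append("O")
    return projected


# All-types-at-once: the model is trained on the full label sequence.
all_types = ["B-problem", "I-problem", "O", "B-test"]

# One-type-at-a-time: a separate model sees only one type's labels.
print(project_labels(all_types, "problem"))  # ['B-problem', 'I-problem', 'O', 'O']
print(project_labels(all_types, "test"))     # ['O', 'O', 'O', 'B-test']
```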

13.
Article in English | MEDLINE | ID: mdl-24850848

ABSTRACT

This article reports the use of the BioC standard format in our sentence simplification system, iSimp, and demonstrates its general utility. iSimp is designed to simplify complex sentences commonly found in biomedical text, and has been shown to improve existing text mining applications that rely on the analysis of sentence structures. By adopting the BioC format, we aim to make iSimp readily interoperable with other applications in the biomedical domain. To examine the utility of iSimp in BioC, we implemented a rule-based relation extraction system that uses iSimp as a preprocessing module and BioC for data exchange. Evaluation on the training corpus of the BioNLP-ST 2011 GENIA Event Extraction (GE) task showed that iSimp sentence simplification improved the recall by 3.2% without reducing precision. The iSimp simplification-annotated corpora, both our previously used corpus and the GE corpus in the current study, have been converted into the BioC format and made publicly available at the project's Web site. Database URL: http://research.bioinformatics.udel.edu/isimp/


Subject(s)
Algorithms , Data Mining/methods , Natural Language Processing , Semantics , Internet
14.
Article in English | MEDLINE | ID: mdl-25122463

ABSTRACT

Protein phosphorylation is central to the regulation of most aspects of cell function. Given its importance, it has been the subject of active research as well as the focus of curation in several biological databases. We have developed Rule-based Literature Mining System for protein Phosphorylation (RLIMS-P), an online text-mining tool to help curators identify biomedical research articles relevant to protein phosphorylation. The tool presents information on protein kinases, substrates and phosphorylation sites automatically extracted from the biomedical literature. The utility of the RLIMS-P Web site has been evaluated by curators from Phospho.ELM, PhosphoGRID/BioGrid and Protein Ontology as part of the BioCreative IV user interactive task (IAT). The system achieved F-scores of 0.76, 0.88 and 0.92 for the extraction of kinase, substrate and phosphorylation sites, respectively, and a precision of 0.88 in the retrieval of relevant phosphorylation literature. The system also received highly favorable feedback from the curators in a user survey. Based on the curators' suggestions, the Web site has been enhanced to improve its usability. In the RLIMS-P Web site, phosphorylation information can be retrieved by PubMed IDs or keywords, with an option for selecting targeted species. The result page displays a sortable table with phosphorylation information. The text evidence page displays the abstract with color-coded entity mentions and includes links to UniProtKB entries via normalization, i.e., the linking of entity mentions to database identifiers, facilitated by the GenNorm tool and by the links to the bibliography in UniProt. Log in and editing capabilities are offered to any user interested in contributing to the validation of RLIMS-P results. Retrieved phosphorylation information can also be downloaded in CSV format and the text evidence in the BioC format. RLIMS-P is freely available. DATABASE URL: http://www.proteininformationresource.org/rlimsp/


Subject(s)
Computational Biology/methods , Data Mining/methods , Databases, Protein , Internet , Phosphoproteins , Animals , Humans , User-Computer Interface
15.
Med Decis Making ; 33(6): 860-8, 2013 08.
Article in English | MEDLINE | ID: mdl-23515214

ABSTRACT

OBJECTIVE: In the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE), influenza was originally defined by a list of 29 and later by a list of 12 diagnosis codes. This article describes a dependent Bayesian procedure designed to improve the ESSENCE system and exploit multiple sources of information without being biased by redundancy. METHODS: We obtained 13,096 cases within the Armed Forces Health Longitudinal Technological Application electronic medical records that included an influenza laboratory test. A Dependent Bayesian Expert System (D-BESt) was used to predict influenza from diagnoses, symptoms, reason for visit, temperature, month of visit, category of enrollment, and demographics. For each case, D-BESt sequentially selects the most discriminating piece of information, calculates its likelihood ratio conditioned on previously selected information, and updates the case's probability of influenza. RESULTS: When the analysis was limited to definitions based on diagnoses and was applied to a sample of patients for whom laboratory tests had been ordered, the areas under the receiver operating characteristic curve (AUCs) for the previous (29-diagnosis) and current (12-diagnosis) ESSENCE lists and the D-BESt algorithm were, respectively, 0.47, 0.36, and 0.77. Including other sources of information further improved the AUC for D-BESt to 0.79. At the best cutoff point for D-BESt, where the receiver operating characteristic curve for D-BESt is farthest from the diagonal line, the D-BESt algorithm correctly classified 84% of cases (specificity = 88%, sensitivity = 62%). In comparison, the current ESSENCE approach of using a list of 12 diagnoses correctly classified only 31% of this sample of cases (specificity = 29%, sensitivity = 42%). CONCLUSIONS: False alarms in ESSENCE surveillance systems can be reduced if a probabilistic dynamic learning system is used.
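The update step described in the methods, multiplying in one likelihood ratio at a time, corresponds to Bayes' rule in odds form. The sketch below shows only that arithmetic; how D-BESt selects findings and estimates each ratio conditioned on earlier findings is not given in the abstract, so the ratios and prior here are hypothetical inputs and the function name is an assumption.

```python
from typing import Iterable


def update_probability(prior: float, likelihood_ratios: Iterable[float]) -> float:
    """Apply likelihood ratios one at a time to a prior probability."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)  # convert probability to odds
    for ratio in likelihood_ratios:
        # Each ratio is assumed to be already conditioned on earlier findings.
        odds *= ratio
    return odds / (1.0 + odds)  # convert odds back to a probability


# Hypothetical case: prior 0.20, then one finding that raises the odds
# fourfold and one that halves them.
posterior = update_probability(0.20, [4.0, 0.5])
print(round(posterior, 3))  # 0.333
```

Conditioning each ratio on the findings already used is what keeps redundant evidence, such as two overlapping diagnosis codes, from being counted twice.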


Subject(s)
Bayes Theorem , Influenza, Human/epidemiology , Population Surveillance , Algorithms , Humans
16.
J Biomed Semantics ; 4(1): 3, 2013 Jan 08.
Article in English | MEDLINE | ID: mdl-23294871

ABSTRACT

BACKGROUND: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions. RESULTS: We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling. CONCLUSIONS: The effectiveness of pooling corpora is dependent on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora.
To facilitate the pooling and reuse of annotated corpora, we suggest that: i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.

17.
BMC Syst Biol ; 7 Suppl 4: S8, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-24565394

RESUMEN

The figures included in many biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., images of DNA sequences and protein sequences) and they differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking into consideration the document layout allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries.
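
The caption-side cue mentioned in the abstract can be illustrated with a minimal sketch. It covers only a naive caption-based estimate of the panel count, assuming panels are labeled with single letters in parentheses such as "(A)". The function name `estimate_panel_count` and the labeling convention are assumptions made for illustration; this is not the authors' system, which also uses pixel-level image features.

```python
import re


def estimate_panel_count(caption):
    """Estimate the number of panels from single-letter labels like "(A)".

    Hypothetical helper: counts distinct letter labels and falls back to 1
    when the caption contains no such labels.
    """
    labels = {label.upper() for label in re.findall(r"\(([A-Za-z])\)", caption)}
    return max(1, len(labels))


caption = (
    "(A) Western blot of the complex. (B) Co-immunoprecipitation. "
    "(C) Quantification of the bands shown in (A)."
)
panel_count = estimate_panel_count(caption)
single_panel = estimate_panel_count("Overview of the pipeline.")
```

A count obtained this way is only as good as the caption's labeling, which is consistent with the abstract's finding that a pure caption-based approach is weaker than the combined one.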


Subject(s)
Biomedical Research , Computational Biology/methods , Computer Graphics , Information Storage and Retrieval/methods , Image Processing, Computer-Assisted
18.
Database (Oxford) ; 2013: bat064, 2013.
Article in English | MEDLINE | ID: mdl-24048470

ABSTRACT

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
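
As a rough illustration of the kind of interchange format described above, the following sketch stores a text passage and its annotations in XML and reads them back. The element and attribute names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`) and the helper functions are assumptions chosen for this example; the abstract does not specify them, and this is not the code distributed by the authors.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, annotations):
    """Build an XML tree for one document.

    annotations: list of (type, offset, length) tuples over `text`.
    Element names are illustrative assumptions, not a published schema.
    """
    collection = ET.Element("collection")
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for index, (ann_type, offset, length) in enumerate(annotations):
        ann = ET.SubElement(passage, "annotation", id=str(index))
        ET.SubElement(ann, "infon", key="type").text = ann_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        # Store the covered text so a reader can check the offsets.
        ET.SubElement(ann, "text").text = text[offset:offset + length]
    return collection


def read_annotations(xml_string):
    """Return (type, offset, length, covered_text) for each annotation."""
    root = ET.fromstring(xml_string)
    result = []
    for ann in root.iter("annotation"):
        location = ann.find("location")
        result.append((
            ann.find("infon").text,
            int(location.get("offset")),
            int(location.get("length")),
            ann.find("text").text,
        ))
    return result


sentence = "BRCA1 mutations raise breast cancer risk."
tree = build_collection("doc1", sentence, [("gene", 0, 5), ("disease", 22, 13)])
xml_string = ET.tostring(tree, encoding="unicode")
round_trip = read_annotations(xml_string)
```

Using character offsets rather than inline tags keeps the original text untouched, so several tools can add overlapping annotations to the same passage.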


Subject(s)
Biomedical Research , Data Mining , Natural Language Processing , Software , Humans
19.
Qual Manag Health Care ; 21(1): 9-19, 2012.
Article in English | MEDLINE | ID: mdl-22207014

ABSTRACT

This article shows how sentiment analysis (an artificial intelligence procedure that classifies opinions expressed within the text) can be used to design real-time satisfaction surveys. To improve participation, real-time surveys must be radically short. The shortest possible survey is a comment card. Patients' comments can be found online at sites organized for rating clinical care, within e-mails, in hospital complaint registries, or through simplified satisfaction surveys such as "Minute Survey." Sentiment analysis uses patterns among words to classify a comment as a complaint or praise. It further classifies complaints into specific reasons for dissatisfaction, similar to broad categories found in longer surveys such as Consumer Assessment of Healthcare Providers and Systems. In this manner, sentiment analysis allows one to re-create responses to longer satisfaction surveys from a list of comments. To demonstrate, this article provides an analysis of sentiments expressed in 995 online comments made at the RateMDs.com Web site. We focused on pediatrician and obstetrician/gynecologist physicians in the District of Columbia, Maryland, and Virginia. We were able to classify patients' reasons for dissatisfaction and the analysis provided information on how practices can improve their care. This article reports the accuracy of classifications of comments. Accuracy will improve as the number of comments received increases. In addition, we ranked physicians using the concept of time-to-next complaint. A time-between control chart was used to assess whether time-to-next complaint exceeded historical patterns and therefore suggested a departure from norms. These findings suggest that (1) patients' comments are easily available, (2) sentiment analysis can classify these comments into complaints/praise, and (3) time-to-next complaint can turn these classifications into numerical benchmarks that can trace the impact of improvements over time. The procedures described in the article show that real-time satisfaction surveys are possible.
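
The time-to-next-complaint idea can be sketched as follows, assuming comments have already been labeled as "complaint" or "praise". The function names, the sample dates, and the control limit (baseline mean plus three standard deviations) are assumptions made for illustration; the abstract does not state how the chart limits were computed, and time-between charts are often built on other distributions.

```python
from datetime import date
from statistics import mean, pstdev


def days_between_complaints(comments):
    """Return the gaps in days between consecutive complaints.

    comments: list of (date, label) pairs, label is "complaint" or "praise".
    """
    complaint_dates = sorted(d for d, label in comments if label == "complaint")
    return [(later - earlier).days
            for earlier, later in zip(complaint_dates, complaint_dates[1:])]


def flag_long_gaps(gaps, baseline_size):
    """Return gaps after the baseline period that exceed the upper limit.

    Assumed limit: mean + 3 * population standard deviation of the baseline.
    """
    baseline = gaps[:baseline_size]
    upper_limit = mean(baseline) + 3 * pstdev(baseline)
    return [gap for gap in gaps[baseline_size:] if gap > upper_limit]


# Hypothetical labeled comments for one physician.
comments = [
    (date(2012, 1, 1), "complaint"),
    (date(2012, 1, 2), "praise"),
    (date(2012, 1, 4), "complaint"),
    (date(2012, 1, 8), "complaint"),
    (date(2012, 1, 11), "complaint"),
    (date(2012, 2, 20), "complaint"),
]
gaps = days_between_complaints(comments)
flagged = flag_long_gaps(gaps, baseline_size=3)
```

In this reading, an unusually long gap before the next complaint suggests an improvement relative to the historical pattern, which is how the measure can serve as a numerical benchmark over time.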


Subject(s)
Patient Satisfaction , Quality of Health Care , Research Design , Surveys and Questionnaires , Artificial Intelligence , Attitude , District of Columbia , Feasibility Studies , Gynecology , Health Care Surveys , Humans , Maryland , Obstetrics , Pediatrics , Time Factors , Virginia
20.
Article in English | MEDLINE | ID: mdl-22779047

ABSTRACT

The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, it is expensive to prepare annotated corpora in individual institutions, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora is found to decrease the F1-score. We examine the annotation guidelines to identify factors behind the incompatibility of the corpora and suggest that the clinical NLP community develop a standard annotation guideline to make annotated corpora compatible.
