1.
Pharmacoepidemiol Drug Saf ; 33(1): e5743, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38158381

ABSTRACT

BACKGROUND: Medication errors (MEs) are a major public health concern which can cause harm and financial burden within the healthcare system. Characterizing MEs is crucial to develop strategies to mitigate MEs in the future. OBJECTIVES: To characterize ME-associated reports, and investigate signals of disproportionate reporting (SDRs) on MEs in the Food and Drug Administration's Adverse Event Reporting System (FAERS). METHODS: FAERS data from 2004 to 2020 was used. ME reports were identified with the narrow Standardised Medical Dictionary for Regulatory Activities® (MedDRA®) Query (SMQ) for MEs. Drug names were converted to the Anatomical Therapeutic Chemical (ATC) classification. SDRs were investigated using the reporting odds ratio (ROR). RESULTS: In total 488 470 ME reports were identified, mostly (59%) submitted by consumers and mainly (55%) associated with females. Median age at time of ME was 57 years (interquartile range: 37-70 years). Approximately 1 out of 3 reports stated a serious health outcome. The most prevalent reported drug class was "antineoplastic and immunomodulating agents" (25%). The most common ME type was "incorrect dose administered" (9%). Of the 1659 SDRs obtained, adalimumab was the most common drug associated with MEs, noting a ROR of 1.22 (95% confidence interval: 1.21-1.24). CONCLUSION: This study offers a first of its kind characterization of MEs as reported to FAERS. Reported MEs are frequent and may be associated with serious health outcomes. This FAERS data provides insights on ME prevention and offers possibilities for additional in-depth analyses.


Subject(s)
Adverse Drug Reaction Reporting Systems , Medication Errors , Female , United States , Humans , Adult , Middle Aged , Aged , Pharmaceutical Preparations , United States Food and Drug Administration , Medication Errors/prevention & control , Adalimumab , Pharmacovigilance
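
The reporting odds ratio (ROR) used in the abstract above is computed from a 2x2 table of report counts. The following is a minimal sketch of the standard formula with a Wald-type 95% confidence interval; the function name, the cell labels, and the counts are illustrative assumptions and are not taken from FAERS or from the study.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) for a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with any other drug and the event of interest
    d: reports with any other drug and any other event
    z: normal quantile for the interval (1.96 gives roughly 95%)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ror = (a * d) / (b * c)
    # Standard error of ln(ROR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se)
    upper = math.exp(math.log(ror) + z * se)
    return ror, lower, upper


# Hypothetical counts, chosen only to illustrate the arithmetic
ror, lower, upper = reporting_odds_ratio(20, 80, 100, 800)
print(f"ROR = {ror:.2f} (95% CI: {lower:.2f}-{upper:.2f})")
```

A common convention, assumed here and not stated in the abstract, is to treat a drug-event pair as a signal of disproportionate reporting when the lower bound of the interval exceeds 1.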
2.
Front Pharmacol ; 14: 1276340, 2023.
Article in English | MEDLINE | ID: mdl-38035014

ABSTRACT

Introduction: Monoclonal antibodies (mAbs) targeting immunoglobulin E (IgE) [omalizumab], type 2 (T2) cytokine interleukin (IL) 5 [mepolizumab, reslizumab], IL-4 Receptor (R) α [dupilumab], and IL-5R [benralizumab] improve quality of life in patients with T2-driven inflammatory diseases. However, there is a concern for an increased risk of helminth infections. The aim was to explore safety signals of parasitic infections for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab. Methods: Spontaneous reports were used from the Food and Drug Administration's Adverse Event Reporting System (FAERS) database from 2004 to 2021. Parasitic infections were defined as any type of parasitic infection term obtained from the Standardised Medical Dictionary for Regulatory Activities® (MedDRA®). Safety signal strength was assessed by the Reporting Odds Ratio (ROR). Results: 15,502,908 reports were eligible for analysis. Amongst 175,888 reports for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab, there were 79 reports on parasitic infections. Median age was 55 years (interquartile range 24-63 years) and 59.5% were female. Indications were known in 26 (32.9%) reports; 14 (53.8%) biologicals were reportedly prescribed for asthma, 8 (30.7%) for various types of dermatitis, and 2 (7.6%) for urticaria. A safety signal was observed for each biological, except for reslizumab (due to lack of power), with the strongest signal attributed to benralizumab (ROR = 15.7, 95% Confidence Interval: 8.4-29.3). Conclusion: Parasitic infections were disproportionately reported for mAbs targeting IgE, T2 cytokines, or T2 cytokine receptors. While the number of adverse event reports on parasitic infections in the database was relatively low, resulting safety signals were disproportionate and warrant further investigation.

3.
J Am Med Inform Assoc ; 30(12): 1973-1984, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37587084

ABSTRACT

OBJECTIVE: This work aims to explore the value of Dutch unstructured data, in combination with structured data, for the development of prognostic prediction models in a general practitioner (GP) setting. MATERIALS AND METHODS: We trained and validated prediction models for 4 common clinical prediction problems using various sparse text representations, common prediction algorithms, and observational GP electronic health record (EHR) data. We trained and validated 84 models internally and externally on data from different EHR systems. RESULTS: On average, over all the different text representations and prediction algorithms, models only using text data performed better or similar to models using structured data alone in 2 prediction tasks. Additionally, in these 2 tasks, the combination of structured and text data outperformed models using structured or text data alone. No large performance differences were found between the different text representations and prediction algorithms. DISCUSSION: Our findings indicate that the use of unstructured data alone can result in well-performing prediction models for some clinical prediction problems. Furthermore, the performance improvement achieved by combining structured and text data highlights the added value. Additionally, we demonstrate the significance of clinical natural language processing research in languages other than English and the possibility of validating text-based prediction models across various EHR systems. CONCLUSION: Our study highlights the potential benefits of incorporating unstructured data in clinical prediction models in a GP setting. Although the added value of unstructured data may vary depending on the specific prediction task, our findings suggest that it has the potential to enhance patient care.


Subject(s)
General Practitioners , Humans , Electronic Health Records , Language , Algorithms , Software , Natural Language Processing
4.
J Biomed Semantics ; 13(1): 24, 2022 10 18.
Article in English | MEDLINE | ID: mdl-36258262

ABSTRACT

BACKGROUND: Vaccine information in European electronic health record (EHR) databases is represented using various clinical and database-specific coding systems and drug vocabularies. The lack of harmonization constitutes a challenge in reusing EHR data in collaborative benefit-risk studies about vaccines. METHODS: We designed an ontology of the properties that are commonly used in vaccine descriptions, called Ontology of Vaccine Descriptions (VaccO), with a dictionary for the analysis of multilingual vaccine descriptions. We implemented five algorithms for the alignment of vaccine coding systems, i.e., the identification of corresponding codes from different coding systems, based on an analysis of the code descriptors. The algorithms were evaluated by comparing their results with manually created alignments in two reference sets including clinical and database-specific coding systems with multilingual code descriptors. RESULTS: The best-performing algorithm represented code descriptors as logical statements about entities in the VaccO ontology and used an ontology reasoner to infer common properties and identify corresponding vaccine codes. The evaluation demonstrated excellent performance of the approach (F-scores 0.91 and 0.96). CONCLUSION: The VaccO ontology allows the identification, representation, and comparison of heterogeneous descriptions of vaccines. The automatic alignment of vaccine coding systems can accelerate the readiness of EHR databases in collaborative vaccine studies.


Subject(s)
Electronic Health Records , Vaccines , Databases, Factual , Algorithms
5.
PLoS One ; 17(7): e0271395, 2022.
Article in English | MEDLINE | ID: mdl-35830458

ABSTRACT

Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. With most of these SNPs located on the non-coding part of the genome, it is currently assumed that these SNPs influence the expression of nearby genes on the genome. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. In the past, protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare our performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graphs that include predicate information perform comparably to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graphs that lack predicate information perform comparably to our other baseline (genetic distance), which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to 84.9% AUC. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Pattern Recognition, Automated , Phenotype
6.
J Am Med Inform Assoc ; 29(7): 1292-1302, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35475536

ABSTRACT

OBJECTIVE: This systematic review aims to assess how information from unstructured text is used to develop and validate clinical prognostic prediction models. We summarize the prediction problems and methodological landscape and determine whether using text data in addition to more commonly used structured data improves the prediction performance. MATERIALS AND METHODS: We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies that developed prognostic prediction models using information extracted from unstructured text in a data-driven manner, published in the period from January 2005 to March 2021. Data items were extracted, analyzed, and a meta-analysis of the model performance was carried out to assess the added value of text to structured-data models. RESULTS: We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance, compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and attention to the explainability of the developed models were limited. CONCLUSION: The use of unstructured text in the development of prognostic prediction models has been found beneficial in addition to structured data in most studies. The text data are a source of valuable information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models, promoting robust and trustworthy prediction models in clinical practice.


Subject(s)
Machine Learning , Prognosis
7.
J Biomed Semantics ; 11(1): 9, 2020 08 20.
Article in English | MEDLINE | ID: mdl-32819419

ABSTRACT

BACKGROUND: Knowledge graphs can represent the contents of biomedical literature and databases as subject-predicate-object triples, thereby enabling comprehensive analyses that identify e.g. relationships between diseases. Some diseases are often diagnosed in patients in specific temporal sequences, which are referred to as disease trajectories. Here, we determine whether a sequence of two diseases forms a trajectory by leveraging the predicate information from paths between (disease) proteins in a knowledge graph. Furthermore, we determine the added value of directional information of predicates for this task. To do so, we create four feature sets, based on two methods for representing indirect paths, and both with and without directional information of predicates (i.e., which protein is considered subject and which object). The added value of the directional information of predicates is quantified by comparing the classification performance of the feature sets that include or exclude it. RESULTS: Our method achieved a maximum area under the ROC curve of 89.8% and 74.5% when evaluated with two different reference sets. Use of directional information of predicates significantly improved performance by 6.5 and 2.0 percentage points respectively. CONCLUSIONS: Our work demonstrates that predicates between proteins can be used to identify disease trajectories. Using the directional information of predicates significantly improved performance over not using this information.


Subject(s)
Biological Ontologies , Computer Graphics , Disease , Humans , Information Storage and Retrieval , ROC Curve , Semantics
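
Several abstracts in this listing report performance as the area under the ROC curve (AUC). As a rough illustration, the AUC can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition; the function name and the example data are assumptions made for illustration and do not reproduce any study's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels
    scores: iterable of classifier scores (higher means more likely positive)
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(example_auc)
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a sketch; a rank-based formulation would be preferable for large evaluation sets.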
8.
Sci Rep ; 9(1): 6281, 2019 04 18.
Article in English | MEDLINE | ID: mdl-31000794

ABSTRACT

Compounds that are candidates for drug repurposing can be ranked by leveraging knowledge available in the biomedical literature and databases. This knowledge, spread across a variety of sources, can be integrated within a knowledge graph, which thereby comprehensively describes known relationships between biomedical concepts, such as drugs, diseases, genes, etc. Our work uses the semantic information between drug and disease concepts as features, which are extracted from an existing knowledge graph that integrates 200 different biological knowledge sources. RepoDB, a standard drug repurposing database which describes drug-disease combinations that were approved or that failed in clinical trials, is used to train a random forest classifier. The 10-times repeated 10-fold cross-validation performance of the classifier achieves a mean area under the receiver operating characteristic curve (AUC) of 92.2%. We apply the classifier to prioritize 21 preclinical drug repurposing candidates that have been suggested for Autosomal Dominant Polycystic Kidney Disease (ADPKD). Mozavaptan, a vasopressin V2 receptor antagonist is predicted to be the drug most likely to be approved after a clinical trial, and belongs to the same drug class as tolvaptan, the only treatment for ADPKD that is currently approved. We conclude that semantic properties of concepts in a knowledge graph can be exploited to prioritize drug repurposing candidates for testing in clinical trials.


Subject(s)
Drug Repositioning/methods , Information Dissemination/methods , Polycystic Kidney, Autosomal Dominant/drug therapy , Semantics , Benzazepines/therapeutic use , Clinical Trials as Topic , Databases, Factual , Humans , Knowledge , Pattern Recognition, Automated
9.
J Biomed Semantics ; 9(1): 23, 2018 09 06.
Article in English | MEDLINE | ID: mdl-30189889

ABSTRACT

BACKGROUND: Biomedical knowledge graphs have become important tools to computationally analyse the comprehensive body of biomedical knowledge. They represent knowledge as subject-predicate-object triples, in which the predicate indicates the relationship between subject and object. A triple can also contain provenance information, which consists of references to the sources of the triple (e.g. scientific publications or database entries). Knowledge graphs have been used to classify drug-disease pairs for drug efficacy screening, but existing computational methods have often ignored predicate and provenance information. Using this information, we aimed to develop a supervised machine learning classifier and determine the added value of predicate and provenance information for drug efficacy screening. To ensure the biological plausibility of our method we performed our research at the protein level, where drugs are represented by their drug target proteins, and diseases by their disease proteins. RESULTS: Using random forests with repeated 10-fold cross-validation, our method achieved an area under the ROC curve (AUC) of 78.1% and 74.3% for two reference sets. We benchmarked against a state-of-the-art knowledge-graph technique that does not use predicate and provenance information, obtaining AUCs of 65.6% and 64.6%, respectively. Classifiers that only used predicate information outperformed classifiers that only used provenance information, but using both performed best. CONCLUSION: We conclude that both predicate and provenance information provide added value for drug efficacy screening.


Subject(s)
Biological Ontologies , Computer Graphics , Drug Evaluation, Preclinical , False Negative Reactions , ROC Curve
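
The subject-predicate-object triples with provenance described in the abstract above can be pictured with a small data structure. This is only a sketch under stated assumptions: the `Triple` class, the `pair_features` helper, the protein identifiers, and the source references are all hypothetical, and the feature definition (counts over direct edges only) is a simplification rather than the authors' method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    provenance: tuple = ()  # references to sources, e.g. publications


def pair_features(triples, drug_targets, disease_proteins):
    """Count predicates and provenance references on direct edges that
    link a drug target protein to a disease protein (either direction)."""
    predicate_counts = Counter()
    provenance_count = 0
    for t in triples:
        forward = t.subject in drug_targets and t.obj in disease_proteins
        backward = t.subject in disease_proteins and t.obj in drug_targets
        if forward or backward:
            predicate_counts[t.predicate] += 1
            provenance_count += len(t.provenance)
    return predicate_counts, provenance_count


# Hypothetical graph: P1 is a drug target, P2 a disease protein
graph = [
    Triple("P1", "inhibits", "P2", ("source-1", "source-2")),
    Triple("P2", "binds", "P1", ("source-3",)),
    Triple("P1", "binds", "P3"),  # P3 is unrelated to the disease
]
predicates, n_sources = pair_features(graph, {"P1"}, {"P2"})
print(dict(predicates), n_sources)
```

Feature vectors of this kind could then be passed to a classifier such as a random forest, which is the model family named in the abstract.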
10.
BMC Bioinformatics ; 19(1): 183, 2018 05 25.
Article in English | MEDLINE | ID: mdl-29801439

ABSTRACT

BACKGROUND: A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information about QTL mapping studies is described in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature. QTM is a command line tool written in the Java programming language. This tool takes scientific articles from the Europe PMC repository as input, extracts QTL tables using keyword matching and ontology-based concept identification. The tables are further normalized using rules derived from table properties such as captions, column headers and table footers. Furthermore, table columns are classified into three categories namely column descriptors, properties and values based on column headers and data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform and the results are stored in a relational database and a text file. RESULTS: The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall and in potato with 82.82% precision and 98.94% recall. CONCLUSION: QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.


Subject(s)
Data Mining/methods , Quantitative Trait Loci , Software , Algorithms , Computer Graphics , Databases, Factual , Solanum lycopersicum/genetics , Publications , Semantics , Solanum tuberosum/genetics
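
The abstract above reports precision and recall separately, while other entries in this listing report F-scores. The two can be related through the usual harmonic-mean formula. The sketch below applies it to the reported tomato and potato figures; the resulting F-scores are derived here for illustration and are not values stated by the authors.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract (converted to fractions)
tomato_f1 = f_score(0.7453, 0.9256)
potato_f1 = f_score(0.8282, 0.9894)
print(f"tomato F1 = {tomato_f1:.4f}, potato F1 = {potato_f1:.4f}")
```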
11.
J Biomed Inform ; 71: 178-189, 2017 07.
Article in English | MEDLINE | ID: mdl-28579531

ABSTRACT

PROBLEM: Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD: We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS: Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION: Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.


Subject(s)
Biomarkers , Data Mining , Databases, Factual , Migraine Disorders/diagnosis , Semantics , Automation , Humans , Publications
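
The evaluation in the abstract above reports how many reference compounds, and what share of their weight points, appear in the top of the ranked list. A minimal sketch of that kind of top-k coverage calculation follows; the function name, compound identifiers, and weights are hypothetical and serve only to show the arithmetic.

```python
def top_k_coverage(ranked, reference_weights, k):
    """Return (fraction of reference items in the top k,
    fraction of total reference weight in the top k)."""
    if not reference_weights:
        raise ValueError("reference set must not be empty")
    top = set(ranked[:k])
    found = [item for item in reference_weights if item in top]
    total_weight = sum(reference_weights.values())
    found_weight = sum(reference_weights[item] for item in found)
    return len(found) / len(reference_weights), found_weight / total_weight


# Hypothetical ranking and weighted reference set
ranking = ["c1", "c2", "c3", "c4", "c5"]
reference = {"c2": 3, "c5": 1, "c9": 1}
item_share, weight_share = top_k_coverage(ranking, reference, k=3)
print(item_share, weight_share)
```

Applied to the reported figures, the same calculation corresponds to 163 of 222 compounds and 547 of 644 weight points in the top 2000.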
12.
Pharmacoepidemiol Drug Saf ; 26(8): 998-1005, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28657162

ABSTRACT

BACKGROUND: Assessment of drug and vaccine effects by combining information from different healthcare databases in the European Union requires extensive efforts in the harmonization of codes as different vocabularies are being used across countries. In this paper, we present a web application called CodeMapper, which assists in the mapping of case definitions to codes from different vocabularies, while keeping a transparent record of the complete mapping process. METHODS: CodeMapper builds upon coding vocabularies contained in the Metathesaurus of the Unified Medical Language System. The mapping approach consists of three phases. First, medical concepts are automatically identified in a free-text case definition. Second, the user revises the set of medical concepts by adding or removing concepts, or expanding them to related concepts that are more general or more specific. Finally, the selected concepts are projected to codes from the targeted coding vocabularies. We evaluated the application by comparing codes that were automatically generated from case definitions by applying CodeMapper's concept identification and successive concept expansion, with reference codes that were manually created in a previous epidemiological study. RESULTS: Automated concept identification alone had a sensitivity of 0.246 and positive predictive value (PPV) of 0.420 for reproducing the reference codes. Three successive steps of concept expansion increased sensitivity to 0.953 and PPV to 0.616. CONCLUSIONS: Automatic concept identification in the case definition alone was insufficient to reproduce the reference codes, but CodeMapper's operations for concept expansion provide an effective, efficient, and transparent way for reproducing the reference codes.


Subject(s)
Databases, Factual/statistics & numerical data , International Classification of Diseases/statistics & numerical data , Medical Records Systems, Computerized/statistics & numerical data , Unified Medical Language System/statistics & numerical data , Europe/epidemiology , Humans
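
The sensitivity and positive predictive value (PPV) reported in the abstract above compare automatically generated codes with a manually created reference set. The comparison can be sketched with plain set operations, as below; the function name and the placeholder codes are assumptions for illustration, not CodeMapper's implementation or real study codes.

```python
def sensitivity_and_ppv(generated, reference):
    """Compare a generated code set with a reference code set.

    sensitivity = shared codes / reference codes
    PPV         = shared codes / generated codes
    """
    generated, reference = set(generated), set(reference)
    if not generated or not reference:
        raise ValueError("both code sets must be non-empty")
    true_positives = len(generated & reference)
    return true_positives / len(reference), true_positives / len(generated)


# Placeholder codes, not real vocabulary entries
reference_codes = {"A1", "A2", "B1", "B2"}
generated_codes = {"A1", "B1", "C1"}
sens, ppv = sensitivity_and_ppv(generated_codes, reference_codes)
print(f"sensitivity = {sens:.3f}, PPV = {ppv:.3f}")
```

Expanding concepts to related, more general or more specific concepts typically adds codes to the generated set, which tends to raise sensitivity and can move PPV in either direction, consistent with the pattern reported in the abstract.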
13.
Vaccine ; 34(50): 6166-6171, 2016 12 07.
Article in English | MEDLINE | ID: mdl-27840012

ABSTRACT

BACKGROUND: Public confidence in an immunization programme is a pivotal determinant of the programme's success. The mining of social media is increasingly employed to provide insight into the public's sentiment. This research further explores the value of monitoring social media to understand public sentiment about an international vaccination programme. OBJECTIVE: To gain insight into international public discussion on the paediatric pentavalent vaccine (DTP-HepB-Hib) programme by analysing Twitter messages. METHODS: Using a multilingual search, we retrospectively collected all public Twitter messages mentioning the DTP-HepB-Hib vaccine from July 2006 until May 2015. We analysed message characteristics by frequency of referencing other websites, type of websites, and geographic focus of the discussion. In addition, a sample of messages was manually annotated for positive or negative message tone. RESULTS: We retrieved 5771 messages. Only 3.1% of the messages were reactions to other messages, and 86.6% referred to websites, mostly news sites (70.7%), other social media (9.8%), and health-information sites (9.5%). Country mentions were identified in 70.4% of the messages, of which India (35.4%), Indonesia (18.3%), and Vietnam (13.9%) were the most prevalent. In the annotated sample, 63% of the messages showed a positive or neutral sentiment about DTP-HepB-Hib. Peaks in negative and positive messages could be related to country-specific programme events. CONCLUSIONS: Public messages about DTP-HepB-Hib were characterized by little interaction between tweeters, and by frequent referencing of websites and other information links. Twitter messages can indirectly reflect the public's opinion about major events in the debates about the DTP-HepB-Hib vaccine.


Subject(s)
Diphtheria-Tetanus-Pertussis Vaccine/adverse effects , Diphtheria-Tetanus-Pertussis Vaccine/immunology , Haemophilus Vaccines/adverse effects , Haemophilus Vaccines/immunology , Hepatitis B Vaccines/adverse effects , Hepatitis B Vaccines/immunology , Immunization/adverse effects , Immunization/psychology , Public Opinion , Social Media , Diphtheria-Tetanus-Pertussis Vaccine/administration & dosage , Haemophilus Vaccines/administration & dosage , Hepatitis B Vaccines/administration & dosage , Humans , Retrospective Studies
14.
Article in English | MEDLINE | ID: mdl-27141091

ABSTRACT

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents.


Subject(s)
Data Mining/methods , Databases, Chemical , Machine Learning , Patents as Topic , Models, Statistical , Software
15.
Stud Health Technol Inform ; 223: 93-9, 2016.
Article in English | MEDLINE | ID: mdl-27139390

ABSTRACT

The vast amount of clinical data in electronic health records constitutes a great potential for secondary use. However, most of this content consists of unstructured or semi-structured texts, which are difficult to process. Several challenges are still pending: medical language idiosyncrasies in different natural languages, and the large variety of medical terminology systems. In this paper we present SEMCARE, a European initiative designed to minimize these problems by providing a multi-lingual platform (English, German, and Dutch) that allows users to express complex queries and obtain relevant search results from clinical texts. SEMCARE is based on a selection of adapted biomedical terminologies, together with Apache UIMA and Apache Solr as open source state-of-the-art natural language pipeline and indexing technologies. SEMCARE has been deployed and is currently being tested at three medical institutions in the UK, Austria, and the Netherlands, showing promising results in a cardiology use case.


Subject(s)
Data Mining/methods , Electronic Health Records , Humans , Information Storage and Retrieval/methods , Language , Linguistics/methods , Natural Language Processing , Semantics
16.
Article in English | MEDLINE | ID: mdl-27081155

ABSTRACT

We describe our approach to the chemical-disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease-named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID). RELigator is available as a web service at http://biosemantics.org/index.php/software/religator.


Subject(s)
Computational Biology/methods , Data Mining/methods , Databases, Factual , Disease/etiology , Hazardous Substances/toxicity , Humans , Toxicogenetics
17.
PLoS One ; 11(2): e0149621, 2016.
Article in English | MEDLINE | ID: mdl-26919047

ABSTRACT

High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or are even unintended by the original authors, but they vastly extend the reach of existing biomedical knowledge for identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans
18.
Drug Saf ; 38(10): 921-30, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26242616

ABSTRACT

INTRODUCTION: There is growing interest in whether social media can capture patient-generated information relevant for medicines safety surveillance that cannot be found in traditional sources. OBJECTIVE: The aim of this study was to evaluate the potential contribution of mining social media networks for medicines safety surveillance using the following associations as case studies: (1) rosiglitazone and cardiovascular events (i.e. stroke and myocardial infarction); and (2) human papilloma virus (HPV) vaccine and infertility. METHODS: We collected publicly accessible, English-language posts on Facebook, Google+, and Twitter until September 2014. Data were queried for co-occurrence of keywords related to the drug/vaccine and event of interest within a post. Messages were analysed with respect to geographical distribution, context, linking to other web content, and author's assertion regarding the supposed association. RESULTS: A total of 2537 posts related to rosiglitazone/cardiovascular events and 2236 posts related to HPV vaccine/infertility were retrieved, with the majority of posts representing data from Twitter (98 and 85%, respectively) and originating from users in the US. Approximately 21% of rosiglitazone-related posts and 84% of HPV vaccine-related posts referenced other web pages, mostly news items, law firms' websites, or blogs. Assertion analysis predominantly showed affirmation of the association of rosiglitazone/cardiovascular events (72%; n = 1821) and of HPV vaccine/infertility (79%; n = 1758). Only ten posts described personal accounts of rosiglitazone/cardiovascular adverse event experiences, and nine posts described HPV vaccine problems related to infertility. CONCLUSIONS: Publicly available data from the considered social media networks were sparse and largely untrackable for the purpose of providing early clues of safety concerns regarding the prespecified case studies. 
Further research investigating other case studies and exploring other social media platforms is necessary to further characterise the usefulness of social media for safety surveillance.
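
The keyword co-occurrence query described in the methods can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the keyword lists, the `cooccurs` helper, and the sample posts are all hypothetical, and plain case-insensitive substring matching is assumed.

```python
# Hypothetical keyword lists; the study's actual search terms are not given in the abstract.
DRUG_TERMS = ["rosiglitazone", "avandia"]
EVENT_TERMS = ["stroke", "myocardial infarction", "heart attack"]


def cooccurs(post: str, drug_terms: list[str], event_terms: list[str]) -> bool:
    """Return True if the post mentions at least one drug term and one event term."""
    text = post.lower()
    has_drug = any(term in text for term in drug_terms)
    has_event = any(term in text for term in event_terms)
    return has_drug and has_event


# Made-up example posts, for illustration only.
posts = [
    "My father had a heart attack while taking Avandia.",
    "Rosiglitazone is a thiazolidinedione.",
    "Stroke awareness week starts today.",
]

# Keep only posts in which a drug keyword and an event keyword appear together.
matches = [p for p in posts if cooccurs(p, DRUG_TERMS, EVENT_TERMS)]
```

Substring matching of this kind retrieves any post that mentions both terms, whatever the author asserts about the association, which is why the study follows retrieval with a separate assertion analysis.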


Subject(s)
Drug-Related Side Effects and Adverse Reactions , Pharmaceutical Preparations/administration & dosage , Safety , Social Media , Blogging , Humans , Internet
19.
J Am Med Inform Assoc ; 22(5): 948-56, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25948699

ABSTRACT

OBJECTIVE: To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. RESULTS: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language. DISCUSSION: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. CONCLUSION: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated.
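
One common way to express inter-annotator agreement as an F-score is to treat one annotator's output as the reference and score the other against it. The sketch below assumes exact matching on (start, end, concept) triples; the abstract does not specify the matching criteria used, and the function name, offsets, and concept identifiers are made up for illustration.

```python
def pairwise_f_score(annotations_a: set, annotations_b: set) -> float:
    """F-score of annotator A scored against annotator B (symmetric in A and B)."""
    if not annotations_a and not annotations_b:
        return 1.0  # assumption: two empty annotation sets count as full agreement
    overlap = len(annotations_a & annotations_b)
    # Precision = overlap/|A| and recall = overlap/|B|; their harmonic mean
    # simplifies to 2 * overlap / (|A| + |B|).
    return 2 * overlap / (len(annotations_a) + len(annotations_b))


# Hypothetical annotations: (start offset, end offset, placeholder concept identifier).
annotator_1 = {(0, 7, "concept-1"), (12, 20, "concept-2"), (25, 30, "concept-3")}
annotator_2 = {(0, 7, "concept-1"), (12, 20, "concept-2"), (40, 48, "concept-4")}

# Two of the three annotations from each annotator coincide.
agreement = pairwise_f_score(annotator_1, annotator_2)
```

Because the formula is symmetric, it does not matter which annotator is taken as the reference; a median over all annotator pairs would then summarise agreement for a language.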


Subject(s)
Information Storage and Retrieval/methods , Multilingualism , Natural Language Processing , Terminology as Topic , Semantics , Unified Medical Language System
20.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S10, 2015.
Article in English | MEDLINE | ID: mdl-25810767

ABSTRACT

BACKGROUND: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals. RESULTS: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions, outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions. 
CONCLUSIONS: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
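
The reported test-set figures can be checked against the standard balanced F-score, the harmonic mean of precision and recall. The abstract does not state which F-measure variant was used, so the balanced form is an assumption here, and the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
cem = f_score(precision=0.858, recall=0.712)  # CEM task
cdi = f_score(precision=0.846, recall=0.717)  # CDI task

print(f"CEM F-score: {cem:.1%}")
print(f"CDI F-score: {cdi:.1%}")
```

Under that assumption the computed values round to the reported 77.8% and 77.6%.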
