1.
Pharmacoepidemiol Drug Saf ; 33(1): e5743, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38158381

ABSTRACT

BACKGROUND: Medication errors (MEs) are a major public health concern which can cause harm and financial burden within the healthcare system. Characterizing MEs is crucial to develop strategies to mitigate MEs in the future. OBJECTIVES: To characterize ME-associated reports, and investigate signals of disproportionate reporting (SDRs) on MEs in the Food and Drug Administration's Adverse Event Reporting System (FAERS). METHODS: FAERS data from 2004 to 2020 was used. ME reports were identified with the narrow Standardised Medical Dictionary for Regulatory Activities® (MedDRA®) Query (SMQ) for MEs. Drug names were converted to the Anatomical Therapeutic Chemical (ATC) classification. SDRs were investigated using the reporting odds ratio (ROR). RESULTS: In total 488 470 ME reports were identified, mostly (59%) submitted by consumers and mainly (55%) associated with females. Median age at time of ME was 57 years (interquartile range: 37-70 years). Approximately 1 out of 3 reports stated a serious health outcome. The most prevalent reported drug class was "antineoplastic and immunomodulating agents" (25%). The most common ME type was "incorrect dose administered" (9%). Of the 1659 SDRs obtained, adalimumab was the most common drug associated with MEs, noting a ROR of 1.22 (95% confidence interval: 1.21-1.24). CONCLUSION: This study offers a first of its kind characterization of MEs as reported to FAERS. Reported MEs are frequent and may be associated with serious health outcomes. This FAERS data provides insights on ME prevention and offers possibilities for additional in-depth analyses.


Subject(s)
Adverse Drug Reaction Reporting Systems , Medication Errors , Female , United States , Humans , Adult , Middle Aged , Aged , Pharmaceutical Preparations , United States Food and Drug Administration , Medication Errors/prevention & control , Adalimumab , Pharmacovigilance
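The reporting odds ratio (ROR) with its 95% confidence interval, as used in the FAERS study above, can be computed from a 2x2 contingency table of report counts. A minimal sketch (the function name and the counts are illustrative, not the study's data):

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR for a drug-event pair in a spontaneous-reporting database.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with the event, without the drug
    d: reports with neither
    Returns (ROR, lower and upper bound of the 95% CI).
    """
    ror = (a * d) / (b * c)
    # standard error of log(ROR), Woolf method
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper

# illustrative counts only
ror, lower, upper = reporting_odds_ratio(120, 880, 1000, 98000)
```

A signal of disproportionate reporting is typically flagged when the lower bound of the confidence interval exceeds 1.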
2.
BMC Bioinformatics ; 19(1): 183, 2018 05 25.
Article in English | MEDLINE | ID: mdl-29801439

ABSTRACT

BACKGROUND: A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information about QTL mapping studies is described in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature. QTM is a command line tool written in the Java programming language. This tool takes scientific articles from the Europe PMC repository as input, extracts QTL tables using keyword matching and ontology-based concept identification. The tables are further normalized using rules derived from table properties such as captions, column headers and table footers. Furthermore, table columns are classified into three categories namely column descriptors, properties and values based on column headers and data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform and the results are stored in a relational database and a text file. RESULTS: The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall and in potato with 82.82% precision and 98.94% recall. CONCLUSION: QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.


Subject(s)
Data Mining/methods , Quantitative Trait Loci , Software , Algorithms , Computer Graphics , Factual Databases , Solanum lycopersicum/genetics , Publications , Semantics , Solanum tuberosum/genetics
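The Schwartz and Hearst algorithm mentioned in the QTM abstract pairs a parenthesized short form with the long form preceding it by scanning right to left. A compact sketch of that core matching step (function name assumed; the full published algorithm has additional candidate-selection rules):

```python
def find_long_form(short_form, preceding_text):
    """Match each character of the short form, right to left, against the
    text preceding '(short_form)'; the first short-form character must
    begin a word. Returns the long form, or None if no match is found."""
    s = short_form.lower()
    t = preceding_text.lower()
    i, j = len(s) - 1, len(t) - 1
    while i >= 0:
        c = s[i]
        if not c.isalnum():
            i -= 1  # skip punctuation inside the short form
            continue
        # move left until the character matches; the first character
        # must additionally start a word
        while j >= 0 and (t[j] != c or
                          (i == 0 and j > 0 and t[j - 1].isalnum())):
            j -= 1
        if j < 0:
            return None
        i -= 1
        j -= 1
    return preceding_text[j + 1:]

long_form = find_long_form("QTL", "analysis of each quantitative trait locus")
```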
3.
J Biomed Inform ; 71: 178-189, 2017 07.
Article in English | MEDLINE | ID: mdl-28579531

ABSTRACT

PROBLEM: Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD: We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS: Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION: Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.


Subject(s)
Biomarkers , Data Mining , Factual Databases , Migraine Disorders/diagnosis , Semantics , Automation , Humans , Publications
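The biomarker study above ranks compound concepts by their connections to a migraine-related subgraph. A crude connectivity-based sketch of that idea (toy graph and concept names are invented; the paper's actual ranking over 68.4 million relationships is far more elaborate):

```python
# hypothetical toy graph: concept -> set of directly related concepts
graph = {
    "compound A": {"migraine", "vasodilation", "trigeminal nerve"},
    "compound B": {"migraine", "vasodilation"},
    "compound C": {"metabolism"},
}
migraine_subgraph = {"migraine", "vasodilation", "trigeminal nerve"}

def rank_candidates(graph, subgraph):
    """Rank candidate compounds by how many of their relationships point
    into the disease-related subgraph (a simple connectivity score)."""
    scores = {c: len(neighbours & subgraph)
              for c, neighbours in graph.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_candidates(graph, migraine_subgraph)
```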
4.
Pharmacoepidemiol Drug Saf ; 26(8): 998-1005, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28657162

ABSTRACT

BACKGROUND: Assessment of drug and vaccine effects by combining information from different healthcare databases in the European Union requires extensive efforts in the harmonization of codes as different vocabularies are being used across countries. In this paper, we present a web application called CodeMapper, which assists in the mapping of case definitions to codes from different vocabularies, while keeping a transparent record of the complete mapping process. METHODS: CodeMapper builds upon coding vocabularies contained in the Metathesaurus of the Unified Medical Language System. The mapping approach consists of three phases. First, medical concepts are automatically identified in a free-text case definition. Second, the user revises the set of medical concepts by adding or removing concepts, or expanding them to related concepts that are more general or more specific. Finally, the selected concepts are projected to codes from the targeted coding vocabularies. We evaluated the application by comparing codes that were automatically generated from case definitions by applying CodeMapper's concept identification and successive concept expansion, with reference codes that were manually created in a previous epidemiological study. RESULTS: Automated concept identification alone had a sensitivity of 0.246 and positive predictive value (PPV) of 0.420 for reproducing the reference codes. Three successive steps of concept expansion increased sensitivity to 0.953 and PPV to 0.616. CONCLUSIONS: Automatic concept identification in the case definition alone was insufficient to reproduce the reference codes, but CodeMapper's operations for concept expansion provide an effective, efficient, and transparent way for reproducing the reference codes.


Subject(s)
Factual Databases/statistics & numerical data , International Classification of Diseases/statistics & numerical data , Computerized Medical Record Systems/statistics & numerical data , Unified Medical Language System/statistics & numerical data , Europe/epidemiology , Humans
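The CodeMapper evaluation above compares automatically generated code sets against manually created reference codes using sensitivity and positive predictive value. A minimal sketch of that comparison (function name and codes are illustrative):

```python
def sensitivity_ppv(generated, reference):
    """Evaluate a generated code set against a manual reference set:
    sensitivity = fraction of reference codes recovered,
    PPV = fraction of generated codes that are correct."""
    true_positives = len(generated & reference)
    sensitivity = true_positives / len(reference)
    ppv = true_positives / len(generated)
    return sensitivity, ppv

# illustrative code sets, not the study's actual codes
generated = {"A01", "A02", "B10", "C33"}
reference = {"A01", "A02", "A03", "B10", "B11"}
sens, ppv = sensitivity_ppv(generated, reference)
```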
5.
Bioinformatics ; 30(23): 3365-71, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25143286

ABSTRACT

MOTIVATION: Knowledge of drug-drug interactions (DDIs) is crucial for health-care professionals to avoid adverse effects when co-administering drugs to patients. As most newly discovered DDIs are made available through scientific publications, automatic DDI extraction is highly relevant. RESULTS: We propose a novel feature-based approach to extract DDIs from text. Our approach consists of three steps. First, we apply text preprocessing to convert input sentences from a given dataset into structured representations. Second, we map each candidate DDI pair from that dataset into a suitable syntactic structure. Based on that, a novel set of features is used to generate feature vectors for these candidate DDI pairs. Third, the obtained feature vectors are used to train a support vector machine (SVM) classifier. When evaluated on two DDI extraction challenge test datasets from 2011 and 2013, our system achieves F-scores of 71.1% and 83.5%, respectively, outperforming any state-of-the-art DDI extraction system. AVAILABILITY AND IMPLEMENTATION: The source code is available for academic use at http://www.biosemantics.org/uploads/DDI.zip.


Subject(s)
Data Mining/methods , Drug Interactions , Humans , Support Vector Machine
6.
BMC Bioinformatics ; 15: 64, 2014 Mar 04.
Article in English | MEDLINE | ID: mdl-24593054

ABSTRACT

BACKGROUND: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system for the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge-base module that establishes whether a relation exists between the recognized concepts. The knowledge base was filled with information from the Unified Medical Language System. The performance of the system was evaluated on the ADE corpus, consisting of 1644 abstracts with manually annotated adverse drug events. Fifty abstracts were used for training, the remaining abstracts were used for testing. RESULTS: The knowledge-based system obtained an F-score of 50.5%, which was 34.4 percentage points better than the co-occurrence baseline. Increasing the training set to 400 abstracts improved the F-score to 54.3%. When the system was compared with a machine-learning system, jSRE, on a subset of the sentences in the ADE corpus, our knowledge-based system achieved an F-score that is 7 percentage points higher than the F-score of jSRE trained on 50 abstracts, and still 2 percentage points higher than jSRE trained on 90% of the corpus. CONCLUSION: A knowledge-based approach can be successfully used to extract adverse drug events from biomedical text without need for a large training set. Whether use of a knowledge base is equally advantageous for other biomedical relation-extraction tasks remains to be investigated.


Subject(s)
Artificial Intelligence , Data Mining/methods , Drug-Related Side Effects and Adverse Reactions , Knowledge Bases , Humans , Unified Medical Language System
7.
Article in English | MEDLINE | ID: mdl-38934643

ABSTRACT

OBJECTIVE: To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora. MATERIALS AND METHODS: Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English. RESULTS: The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision. DISCUSSION: Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools. CONCLUSION: This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.

8.
Int J Med Inform ; 189: 105506, 2024 May 29.
Article in English | MEDLINE | ID: mdl-38820647

ABSTRACT

OBJECTIVE: Observational studies using electronic health record (EHR) databases often face challenges due to unspecific clinical codes that can obscure detailed medical information, hindering precise data analysis. In this study, we aimed to assess the feasibility of refining these unspecific condition codes into more specific codes in a Dutch general practitioner (GP) EHR database by leveraging the available clinical free text. METHODS: We utilized three approaches for text classification (search queries, semi-supervised learning, and supervised learning) to improve the specificity of ten unspecific International Classification of Primary Care (ICPC-1) codes. Two text representations and three machine learning algorithms were evaluated for the (semi-)supervised models. Additionally, we measured the improvement achieved by the refinement process on all code occurrences in the database. RESULTS: The classification models performed well for most codes. In general, no single classification approach consistently outperformed the others. However, there were variations in the relative performance of the classification approaches within each code and in the use of different text representations and machine learning algorithms. Class imbalance and limited training data affected the performance of the (semi-)supervised models, yet the simple search queries remained particularly effective. Ultimately, the developed models improved the specificity of over half of all the unspecific code occurrences in the database. CONCLUSIONS: Our findings show the feasibility of using information from clinical text to improve the specificity of unspecific condition codes in observational healthcare databases, even with a limited range of machine-learning techniques and modest annotated training sets. Future work could investigate transfer learning, integration of structured data, alternative semi-supervised methods, and validation of models across healthcare settings. The improved level of detail enriches the interpretation of medical information and can benefit observational research and patient care.
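The search-query approach described above assigns a more specific code when keywords appear in the clinical free text, and falls back to the unspecific code otherwise. A minimal sketch (the codes, keywords, and function name are invented for illustration and are not taken from the ICPC-1 or the study):

```python
# hypothetical keyword queries mapping free text to more specific codes
queries = {
    "L03.01": ["lumbago", "low back pain"],
    "L03.02": ["sciatica", "radiating leg pain"],
}

def refine_code(note_text, queries, fallback="L03"):
    """Return the first specific code whose keywords occur in the note;
    keep the unspecific fallback code when nothing matches."""
    text = note_text.lower()
    for code, keywords in queries.items():
        if any(kw in text for kw in keywords):
            return code
    return fallback

code = refine_code("Patient reports sciatica on the left side", queries)
```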

9.
Pharmacoepidemiol Drug Saf ; 22(5): 459-67, 2013 May.
Article in English | MEDLINE | ID: mdl-23208789

ABSTRACT

PURPOSE: Pharmacovigilance methods have advanced greatly during the last decades, making post-market drug assessment an essential drug evaluation component. These methods mainly rely on the use of spontaneous reporting systems and health information databases to collect expertise from huge amounts of real-world reports. The EU-ADR Web Platform was built to further facilitate accessing, monitoring and exploring these data, enabling an in-depth analysis of adverse drug reactions risks. METHODS: The EU-ADR Web Platform exploits the wealth of data collected within a large-scale European initiative, the EU-ADR project. Millions of electronic health records, provided by national health agencies, are mined for specific drug events, which are correlated with literature, protein and pathway data, resulting in a rich drug-event dataset. Next, advanced distributed computing methods are tailored to coordinate the execution of data-mining and statistical analysis tasks. This permits obtaining a ranked drug-event list, removing spurious entries and highlighting relationships with high risk potential. RESULTS: The EU-ADR Web Platform is an open workspace for the integrated analysis of pharmacovigilance datasets. Using this software, researchers can access a variety of tools provided by distinct partners in a single centralized environment. Besides performing standalone drug-event assessments, they can also control the pipeline for an improved batch analysis of custom datasets. Drug-event pairs can be substantiated and statistically analysed within the platform's innovative working environment. CONCLUSIONS: A pioneering workspace that helps in explaining the biological path of adverse drug reactions was developed within the EU-ADR project consortium. This tool, targeted at the pharmacovigilance community, is available online at https://bioinformatics.ua.pt/euadr/.


Subject(s)
Adverse Drug Reaction Reporting Systems/organization & administration , Internet , Pharmacovigilance , Adverse Drug Reaction Reporting Systems/statistics & numerical data , Data Mining/methods , Factual Databases/statistics & numerical data , Drug-Related Side Effects and Adverse Reactions , Europe , Humans , Software
10.
J Am Med Inform Assoc ; 30(12): 1973-1984, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37587084

ABSTRACT

OBJECTIVE: This work aims to explore the value of Dutch unstructured data, in combination with structured data, for the development of prognostic prediction models in a general practitioner (GP) setting. MATERIALS AND METHODS: We trained and validated prediction models for 4 common clinical prediction problems using various sparse text representations, common prediction algorithms, and observational GP electronic health record (EHR) data. We trained and validated 84 models internally and externally on data from different EHR systems. RESULTS: On average, over all the different text representations and prediction algorithms, models only using text data performed better or similar to models using structured data alone in 2 prediction tasks. Additionally, in these 2 tasks, the combination of structured and text data outperformed models using structured or text data alone. No large performance differences were found between the different text representations and prediction algorithms. DISCUSSION: Our findings indicate that the use of unstructured data alone can result in well-performing prediction models for some clinical prediction problems. Furthermore, the performance improvement achieved by combining structured and text data highlights the added value. Additionally, we demonstrate the significance of clinical natural language processing research in languages other than English and the possibility of validating text-based prediction models across various EHR systems. CONCLUSION: Our study highlights the potential benefits of incorporating unstructured data in clinical prediction models in a GP setting. Although the added value of unstructured data may vary depending on the specific prediction task, our findings suggest that it has the potential to enhance patient care.


Subject(s)
General Practitioners , Humans , Electronic Health Records , Language , Algorithms , Software , Natural Language Processing
11.
Front Pharmacol ; 14: 1276340, 2023.
Article in English | MEDLINE | ID: mdl-38035014

ABSTRACT

Introduction: Monoclonal antibodies (mAbs) targeting immunoglobulin E (IgE) [omalizumab], type 2 (T2) cytokine interleukin (IL) 5 [mepolizumab, reslizumab], IL-4 Receptor (R) α [dupilumab], and IL-5R [benralizumab] improve quality of life in patients with T2-driven inflammatory diseases. However, there is a concern for an increased risk of helminth infections. The aim was to explore safety signals of parasitic infections for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab. Methods: Spontaneous reports were used from the Food and Drug Administration's Adverse Event Reporting System (FAERS) database from 2004 to 2021. Parasitic infections were defined as any type of parasitic infection term obtained from the Standardised Medical Dictionary for Regulatory Activities® (MedDRA®). Safety signal strength was assessed by the Reporting Odds Ratio (ROR). Results: 15,502,908 reports were eligible for analysis. Amongst 175,888 reports for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab, there were 79 reports on parasitic infections. Median age was 55 years (interquartile range 24-63 years) and 59.5% were female. Indications were known in 26 (32.9%) reports; 14 (53.8%) biologicals were reportedly prescribed for asthma, 8 (30.7%) for various types of dermatitis, and 2 (7.6%) for urticaria. A safety signal was observed for each biological, except for reslizumab (due to lack of power), with the strongest signal attributed to benralizumab (ROR = 15.7, 95% Confidence Interval: 8.4-29.3). Conclusion: Parasitic infections were disproportionately reported for mAbs targeting IgE, T2 cytokines, or T2 cytokine receptors. While the number of adverse event reports on parasitic infections in the database was relatively low, resulting safety signals were disproportionate and warrant further investigation.

12.
BMC Bioinformatics ; 13: 17, 2012 Jan 30.
Article in English | MEDLINE | ID: mdl-22289351

ABSTRACT

BACKGROUND: To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available, although small GSC but supplemented with an SSC. RESULTS: We have tested the two scenarios using three chunkers, Lingpipe, OpenNLP, and Yamcha, and two different corpora, GENIA and PennBioIE. For the first scenario, we showed that the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percentage points better in terms of F-score than the systems trained on the GSC in another domain, and only 0.2-0.8 percentage points less than when they were trained on a GSC in the same domain as the SSC. When the outputs of the chunkers were combined, the combined system showed little improvement when using the SSC. For the second scenario, the systems trained on a GSC supplemented with an SSC performed considerably better than systems that were trained on the GSC alone, especially when the GSC was small. For example, training the chunkers on a GSC consisting of only 10 abstracts but supplemented with an SSC yielded similar performance as training them on a GSC of 100-250 abstracts. The combined system even performed better than any of the individual chunkers trained on a GSC of 500 abstracts. CONCLUSIONS: We conclude that an SSC can be a viable alternative for or a supplement to a GSC when training chunkers in a biomedical domain. A combined system only shows improvement if the SSC is used to supplement a GSC. Whether the approach is applicable to other systems in a natural-language processing pipeline has to be further investigated.


Subject(s)
Computational Biology/methods , Computational Biology/standards , Natural Language Processing , Humans , Neoplasms/genetics
13.
Hum Mutat ; 33(11): 1503-12, 2012 Nov.
Article in English | MEDLINE | ID: mdl-22736453

ABSTRACT

The advances in bioinformatics required to annotate human genomic variants and to place them in public data repositories have not kept pace with their discovery. Moreover, a law of diminishing returns has begun to operate both in terms of data publication and submission. Although the continued deposition of such data in the public domain is essential to maximize both their scientific and clinical utility, rewards for data sharing are few, representing a serious practical impediment to data submission. To date, two main strategies have been adopted as a means to encourage the submission of human genomic variant data: (1) database journal linkups involving the affiliation of a scientific journal with a publicly available database and (2) microattribution, involving the unambiguous linkage of data to their contributors via a unique identifier. The latter could in principle lead to the establishment of a microcitation-tracking system that acknowledges individual endeavor and achievement. Both approaches could incentivize potential data contributors, thereby encouraging them to share their data with the scientific community. Here, we summarize and critically evaluate approaches that have been proposed to address current deficiencies in data attribution and discuss ways in which they could become more widely adopted as novel scientific publication modalities.


Subject(s)
Genetic Variation , Human Genome , Publishing , Computational Biology , Data Collection , Genetic Databases , Humans , Research Peer Review
14.
J Biomed Inform ; 45(3): 423-8, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22239956

ABSTRACT

Recognition of medical concepts is a basic step in information extraction from clinical records. We wished to improve on the performance of a variety of concept recognition systems by combining their individual results. We selected two dictionary-based systems and five statistical-based systems that were trained to annotate medical problems, tests, and treatments in clinical records. Manually annotated clinical records for training and testing were made available through the 2010 i2b2/VA (Informatics for Integrating Biology and the Bedside) challenge. Results of individual systems were combined by a simple voting scheme. The statistical systems were trained on a set of 349 records. Performance (precision, recall, F-score) was assessed on a test set of 477 records, using varying voting thresholds. The combined annotation system achieved a best F-score of 82.2% (recall 81.2%, precision 83.3%) on the test set, a score that ranks third among 22 participants in the i2b2/VA concept annotation task. The ensemble system had better precision and recall than any of the individual systems, yielding an F-score that is 4.6% point higher than the best single system. Changing the voting threshold offered a simple way to obtain a system with high precision (and moderate recall) or one with high recall (and moderate precision). The ensemble-based approach is straightforward and allows the balancing of precision versus recall of the combined system. The ensemble system is freely available and can easily be extended, integrated in other systems, and retrained.


Subject(s)
Data Mining/methods , Electronic Health Records , Humans , Natural Language Processing , Semantics
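The i2b2/VA ensemble above combines concept annotations from several systems by a simple voting scheme with an adjustable threshold. A minimal sketch of such voting (the span representation and function name are assumptions, not the paper's implementation):

```python
def ensemble_vote(annotations, threshold):
    """Keep a concept annotation if at least `threshold` systems
    proposed it. Raising the threshold favors precision over recall;
    lowering it favors recall over precision."""
    counts = {}
    for system_output in annotations:
        for span in system_output:
            counts[span] = counts.get(span, 0) + 1
    return {span for span, n in counts.items() if n >= threshold}

# illustrative (start, end, label) spans from three hypothetical systems
sys_a = {(0, 7, "problem"), (12, 20, "test")}
sys_b = {(0, 7, "problem"), (25, 30, "treatment")}
sys_c = {(0, 7, "problem"), (12, 20, "test")}
kept = ensemble_vote([sys_a, sys_b, sys_c], threshold=2)
```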
15.
J Biomed Inform ; 45(5): 879-84, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22554700

ABSTRACT

Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators revise this annotation using a web-based interface. The agreement figures achieved show that the inter-annotator agreement is much better than the agreement with the system provided annotations. The corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations three experts have annotated a set of 100 abstracts. These annotated relationships will be used to train and evaluate text-mining software to capture these relationships in texts.


Subject(s)
Data Mining/methods , Factual Databases , Medical Informatics/methods , Documentation , Drug Therapy/classification , Humans , Internet , Pharmaceutical Preparations/classification , User-Computer Interface
16.
J Biomed Semantics ; 13(1): 24, 2022 10 18.
Article in English | MEDLINE | ID: mdl-36258262

ABSTRACT

BACKGROUND: Vaccine information in European electronic health record (EHR) databases is represented using various clinical and database-specific coding systems and drug vocabularies. The lack of harmonization constitutes a challenge in reusing EHR data in collaborative benefit-risk studies about vaccines. METHODS: We designed an ontology of the properties that are commonly used in vaccine descriptions, called Ontology of Vaccine Descriptions (VaccO), with a dictionary for the analysis of multilingual vaccine descriptions. We implemented five algorithms for the alignment of vaccine coding systems, i.e., the identification of corresponding codes from different coding systems, based on an analysis of the code descriptors. The algorithms were evaluated by comparing their results with manually created alignments in two reference sets including clinical and database-specific coding systems with multilingual code descriptors. RESULTS: The best-performing algorithm represented code descriptors as logical statements about entities in the VaccO ontology and used an ontology reasoner to infer common properties and identify corresponding vaccine codes. The evaluation demonstrated excellent performance of the approach (F-scores 0.91 and 0.96). CONCLUSION: The VaccO ontology allows the identification, representation, and comparison of heterogeneous descriptions of vaccines. The automatic alignment of vaccine coding systems can accelerate the readiness of EHR databases in collaborative vaccine studies.


Subject(s)
Electronic Health Records , Vaccines , Factual Databases , Algorithms
17.
PLoS One ; 17(7): e0271395, 2022.
Article in English | MEDLINE | ID: mdl-35830458

ABSTRACT

Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. With most of these SNPs located on the non-coding part of the genome, it is currently assumed that these SNPs influence the expression of nearby genes on the genome. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. In the past, protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare our performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graphs that include predicate information perform comparable to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graphs that lack predicate information perform comparable to our other baseline (genetic distance) which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to 84.9% AUC. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Pattern Recognition, Automated , Phenotype
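The abstract above compares gene-ranking methods by the area under the receiver operating characteristic curve (AUC) against reference sets of known target genes. A minimal sketch of that rank-based evaluation is shown below; the gene names and scores are hypothetical illustrations, not data from the study.

```python
# Rank-based AUC (equivalent to the Mann-Whitney U statistic):
# the probability that a randomly chosen reference (positive) gene
# is scored higher than a randomly chosen non-reference gene.

def ranking_auc(scores, positives):
    """scores: dict gene -> method score; positives: set of reference target genes."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need both positive and negative genes")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = {"GENE_A": 0.92, "GENE_B": 0.71, "GENE_C": 0.40,
          "GENE_D": 0.35, "GENE_E": 0.10}
print(ranking_auc(scores, {"GENE_A", "GENE_C"}))  # ≈ 0.833
```

An AUC of 0.5 corresponds to random ranking, so the reported 79.6% vs. 75.7% comparison measures how much better each method separates target from non-target genes than chance.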
18.
J Am Med Inform Assoc ; 29(7): 1292-1302, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35475536

ABSTRACT

OBJECTIVE: This systematic review aims to assess how information from unstructured text is used to develop and validate clinical prognostic prediction models. We summarize the prediction problems and methodological landscape and determine whether using text data in addition to more commonly used structured data improves the prediction performance. MATERIALS AND METHODS: We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies that developed prognostic prediction models using information extracted from unstructured text in a data-driven manner, published in the period from January 2005 to March 2021. Data items were extracted and analyzed, and a meta-analysis of model performance was carried out to assess the added value of text to structured-data models. RESULTS: We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance, compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and attention for the explainability of the developed models were limited. CONCLUSION: The use of unstructured text in the development of prognostic prediction models has been found beneficial in addition to structured data in most studies. Text data are a source of valuable information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models, promoting robust and trustworthy prediction models in clinical practice.


Subject(s)
Machine Learning , Prognosis
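The review above concerns models that combine structured data with sparse numeric text representations. A minimal sketch of one such combination, concatenating structured fields with bag-of-words counts over a fixed vocabulary, is shown below; the vocabulary, field names, and toy record are hypothetical, not drawn from any reviewed study.

```python
# Combine structured clinical fields with a sparse text representation
# into a single feature vector, as a prediction model would consume.

from collections import Counter

VOCAB = ["dyspnea", "fever", "stable", "worsening"]

def text_features(note):
    """Bag-of-words counts over the fixed vocabulary."""
    tokens = note.lower().replace(",", " ").split()
    counts = Counter(tokens)
    return [counts[w] for w in VOCAB]

def combined_features(record):
    """Concatenate structured fields with text-derived counts."""
    structured = [record["age"], record["sys_bp"]]
    return structured + text_features(record["note"])

record = {"age": 67, "sys_bp": 142,
          "note": "worsening dyspnea and fever, fever persists"}
print(combined_features(record))  # [67, 142, 1, 2, 0, 1]
```

Real pipelines typically use TF-IDF weighting or learned embeddings instead of raw counts, but the principle of concatenating the two modalities into one input vector is the same.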
19.
J Biomed Inform ; 44(2): 354-60, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21056118

ABSTRACT

Text chunking is an essential pre-processing step in information extraction systems. No comparative studies of chunking systems, including sentence splitting, tokenization and part-of-speech tagging, are available for the biomedical domain. We compared the usability (ease of integration, speed, trainability) and performance of six state-of-the-art chunkers for the biomedical domain, and combined the chunker results in order to improve chunking performance. We investigated six frequently used chunkers: GATE chunker, Genia Tagger, Lingpipe, MetaMap, OpenNLP, and Yamcha. All chunkers were integrated into the Unstructured Information Management Architecture framework. The GENIA Treebank corpus was used for training and testing. Performance was assessed for noun-phrase and verb-phrase chunking. For both noun-phrase chunking and verb-phrase chunking, OpenNLP performed best (F-scores 89.7% and 95.7%, respectively), but differences with Genia Tagger and Yamcha were small. With respect to usability, Lingpipe and OpenNLP scored best. When combining the results of the chunkers by a simple voting scheme, the F-score of the combined system improved by 3.1 percentage points for noun phrases and 0.6 percentage points for verb phrases as compared to the best single chunker. Changing the voting threshold offered a simple way to obtain a system with high precision (and moderate recall) or high recall (and moderate precision). This study is the first to compare the performance of the whole chunking pipeline, and to combine different existing chunking systems. Several chunkers showed good performance, but OpenNLP scored best both in performance and usability. The combination of chunker results by a simple voting scheme can further improve performance and allows for different precision-recall settings.


Subject(s)
Abstracting and Indexing/methods , Algorithms , Databases, Bibliographic , Natural Language Processing , Terminology as Topic , Vocabulary, Controlled
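The simple voting scheme described in the abstract above can be sketched as follows: each chunker proposes phrase spans, and a span is accepted when at least `threshold` chunkers agree. Raising the threshold trades recall for precision, which is the precision-recall knob the study describes. The spans below are hypothetical, not from the GENIA evaluation.

```python
# Combine chunker outputs by voting over proposed (start, end) token spans.

def vote_chunks(chunker_outputs, threshold):
    """chunker_outputs: list of sets of (start, end) token spans,
    one set per chunker. Returns spans with >= threshold votes."""
    votes = {}
    for spans in chunker_outputs:
        for span in spans:
            votes[span] = votes.get(span, 0) + 1
    return {span for span, n in votes.items() if n >= threshold}

outputs = [
    {(0, 2), (5, 7)},           # chunker A
    {(0, 2), (5, 7), (9, 10)},  # chunker B
    {(0, 2), (9, 10)},          # chunker C
]
print(sorted(vote_chunks(outputs, 2)))  # majority vote: [(0, 2), (5, 7), (9, 10)]
print(sorted(vote_chunks(outputs, 3)))  # unanimity:     [(0, 2)]
```

With a low threshold the combined system keeps any span a single chunker found (high recall); with a high threshold it keeps only spans most chunkers agree on (high precision).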
20.
ALTEX ; 38(2): 187-197, 2021.
Article in English | MEDLINE | ID: mdl-33637997

ABSTRACT

Pre-competitive data sharing can offer the pharmaceutical industry significant benefits in terms of reducing the time and costs involved in getting a new drug to market through more informed testing strategies and knowledge gained by pooling data. If sufficient data is shared and can be co-analyzed, then it can also offer the potential for reduced animal usage and improvements in the in silico prediction of toxicological effects. Data sharing benefits can be further enhanced by applying the FAIR Guiding Principles, reducing time spent curating, transforming and aggregating datasets and allowing more time for data mining and analysis. We hope to facilitate data sharing by other organizations and initiatives by describing lessons learned as part of the Enhancing TRANslational SAFEty Assessment through Integrative Knowledge Management (eTRANSAFE) project, an Innovative Medicines Initiative (IMI) partnership which aims to integrate publicly available data sources with proprietary preclinical and clinical data donated by pharmaceutical organizations. Methods to foster trust and overcome non-technical barriers to data sharing such as legal and IPR (intellectual property rights) are described, including the security requirements that pharmaceutical organizations generally expect to be met. We share the consensus achieved among pharmaceutical partners on decision criteria to be included in internal clearance procedures used to decide if data can be shared. We also report on the consensus achieved on specific data fields to be excluded from sharing for sensitive preclinical safety and pharmacology data that could otherwise not be shared.


Subject(s)
Data Mining , Information Dissemination , Animals , Computer Simulation , Drug Industry