Results 1 - 20 of 45
1.
Pharmacoepidemiol Drug Saf ; 33(1): e5743, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38158381

ABSTRACT

BACKGROUND: Medication errors (MEs) are a major public health concern which can cause harm and financial burden within the healthcare system. Characterizing MEs is crucial to develop strategies to mitigate MEs in the future. OBJECTIVES: To characterize ME-associated reports, and investigate signals of disproportionate reporting (SDRs) on MEs in the Food and Drug Administration's Adverse Event Reporting System (FAERS). METHODS: FAERS data from 2004 to 2020 was used. ME reports were identified with the narrow Standardised Medical Dictionary for Regulatory Activities® (MedDRA®) Query (SMQ) for MEs. Drug names were converted to the Anatomical Therapeutic Chemical (ATC) classification. SDRs were investigated using the reporting odds ratio (ROR). RESULTS: In total 488 470 ME reports were identified, mostly (59%) submitted by consumers and mainly (55%) associated with females. Median age at time of ME was 57 years (interquartile range: 37-70 years). Approximately 1 out of 3 reports stated a serious health outcome. The most prevalent reported drug class was "antineoplastic and immunomodulating agents" (25%). The most common ME type was "incorrect dose administered" (9%). Of the 1659 SDRs obtained, adalimumab was the most common drug associated with MEs, noting a ROR of 1.22 (95% confidence interval: 1.21-1.24). CONCLUSION: This study offers a first of its kind characterization of MEs as reported to FAERS. Reported MEs are frequent and may be associated with serious health outcomes. This FAERS data provides insights on ME prevention and offers possibilities for additional in-depth analyses.
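
The reporting odds ratio (ROR) mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual 2x2 contingency table of spontaneous reports and a Wald-type 95% confidence interval on the log scale; the function name and the counts in the usage note are hypothetical and are not taken from the study.

```python
import math


def reporting_odds_ratio(a, b, c, d):
    """Return (ROR, lower, upper) for a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with any other drug and the event of interest
    d: reports with any other drug and any other event
    All four counts are assumed to be greater than zero.
    """
    ror = (a * d) / (b * c)
    # Standard error of log(ROR), used for the Wald-type interval.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper
```

For example, `reporting_odds_ratio(20, 80, 10, 90)` gives an ROR of 2.25. A common convention, not stated in the abstract, is to flag a signal only when the lower bound of the interval exceeds 1.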


Subjects
Adverse Drug Reaction Reporting Systems , Medication Errors , Female , United States , Humans , Adult , Middle Aged , Aged , Pharmaceutical Preparations , United States Food and Drug Administration , Medication Errors/prevention & control , Adalimumab , Pharmacovigilance
2.
BMC Bioinformatics ; 19(1): 183, 2018 05 25.
Article in English | MEDLINE | ID: mdl-29801439

ABSTRACT

BACKGROUND: A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information about QTL mapping studies is described in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature. QTM is a command line tool written in the Java programming language. This tool takes scientific articles from the Europe PMC repository as input, extracts QTL tables using keyword matching and ontology-based concept identification. The tables are further normalized using rules derived from table properties such as captions, column headers and table footers. Furthermore, table columns are classified into three categories namely column descriptors, properties and values based on column headers and data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform and the results are stored in a relational database and a text file. RESULTS: The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall and in potato with 82.82% precision and 98.94% recall. CONCLUSION: QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.


Subjects
Data Mining/methods , Quantitative Trait Loci , Software , Algorithms , Computer Graphics , Databases, Factual , Solanum lycopersicum/genetics , Publications , Semantics , Solanum tuberosum/genetics
3.
J Biomed Inform ; 71: 178-189, 2017 07.
Article in English | MEDLINE | ID: mdl-28579531

ABSTRACT

PROBLEM: Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD: We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS: Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION: Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.
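
The ROC-AUC reported in this abstract can be read as the probability that a randomly chosen reference compound is ranked above a randomly chosen non-reference compound. The sketch below computes that quantity by pairwise comparison. It is an illustration under that standard definition, not the authors' evaluation code, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    scores: ranking scores, higher means ranked nearer the top
    labels: truthy for reference (positive) items, falsy otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is adequate for small illustrative lists; a sort-based rank computation would be the usual choice for rankings with thousands of compounds.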


Subjects
Biomarkers , Data Mining , Databases, Factual , Migraine Disorders/diagnosis , Semantics , Automation , Humans , Publications
4.
Pharmacoepidemiol Drug Saf ; 26(8): 998-1005, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28657162

ABSTRACT

BACKGROUND: Assessment of drug and vaccine effects by combining information from different healthcare databases in the European Union requires extensive efforts in the harmonization of codes as different vocabularies are being used across countries. In this paper, we present a web application called CodeMapper, which assists in the mapping of case definitions to codes from different vocabularies, while keeping a transparent record of the complete mapping process. METHODS: CodeMapper builds upon coding vocabularies contained in the Metathesaurus of the Unified Medical Language System. The mapping approach consists of three phases. First, medical concepts are automatically identified in a free-text case definition. Second, the user revises the set of medical concepts by adding or removing concepts, or expanding them to related concepts that are more general or more specific. Finally, the selected concepts are projected to codes from the targeted coding vocabularies. We evaluated the application by comparing codes that were automatically generated from case definitions by applying CodeMapper's concept identification and successive concept expansion, with reference codes that were manually created in a previous epidemiological study. RESULTS: Automated concept identification alone had a sensitivity of 0.246 and positive predictive value (PPV) of 0.420 for reproducing the reference codes. Three successive steps of concept expansion increased sensitivity to 0.953 and PPV to 0.616. CONCLUSIONS: Automatic concept identification in the case definition alone was insufficient to reproduce the reference codes, but CodeMapper's operations for concept expansion provide an effective, efficient, and transparent way for reproducing the reference codes.
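
The sensitivity and positive predictive value (PPV) used in this evaluation can be sketched as a set comparison between generated codes and reference codes. This is a minimal illustration that assumes codes are compared by exact match; the function name and the example codes are hypothetical.

```python
def sensitivity_and_ppv(generated, reference):
    """Compare automatically generated codes with manually created reference codes.

    sensitivity = share of reference codes that were generated
    PPV         = share of generated codes that are in the reference set
    """
    generated, reference = set(generated), set(reference)
    true_positives = len(generated & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    ppv = true_positives / len(generated) if generated else 0.0
    return sensitivity, ppv
```

With generated codes `{"A", "B", "C", "D"}` and reference codes `{"B", "C", "E"}`, two codes overlap, so sensitivity is 2/3 and PPV is 0.5. Concept expansion adds generated codes, which can raise sensitivity and, when the added codes are mostly correct, PPV as well.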


Subjects
Databases, Factual/statistics & numerical data , International Classification of Diseases/statistics & numerical data , Medical Records Systems, Computerized/statistics & numerical data , Unified Medical Language System/statistics & numerical data , Europe/epidemiology , Humans
5.
Bioinformatics ; 30(23): 3365-71, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25143286

ABSTRACT

MOTIVATION: Knowledge of drug-drug interactions (DDIs) is crucial for health-care professionals to avoid adverse effects when co-administering drugs to patients. As most newly discovered DDIs are made available through scientific publications, automatic DDI extraction is highly relevant. RESULTS: We propose a novel feature-based approach to extract DDIs from text. Our approach consists of three steps. First, we apply text preprocessing to convert input sentences from a given dataset into structured representations. Second, we map each candidate DDI pair from that dataset into a suitable syntactic structure. Based on that, a novel set of features is used to generate feature vectors for these candidate DDI pairs. Third, the obtained feature vectors are used to train a support vector machine (SVM) classifier. When evaluated on two DDI extraction challenge test datasets from 2011 and 2013, our system achieves F-scores of 71.1% and 83.5%, respectively, outperforming any state-of-the-art DDI extraction system. AVAILABILITY AND IMPLEMENTATION: The source code is available for academic use at http://www.biosemantics.org/uploads/DDI.zip.


Subjects
Data Mining/methods , Drug Interactions , Humans , Support Vector Machine
6.
BMC Bioinformatics ; 15: 64, 2014 Mar 04.
Article in English | MEDLINE | ID: mdl-24593054

ABSTRACT

BACKGROUND: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system for the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge-base module that establishes whether a relation exists between the recognized concepts. The knowledge base was filled with information from the Unified Medical Language System. The performance of the system was evaluated on the ADE corpus, consisting of 1644 abstracts with manually annotated adverse drug events. Fifty abstracts were used for training, the remaining abstracts were used for testing. RESULTS: The knowledge-based system obtained an F-score of 50.5%, which was 34.4 percentage points better than the co-occurrence baseline. Increasing the training set to 400 abstracts improved the F-score to 54.3%. When the system was compared with a machine-learning system, jSRE, on a subset of the sentences in the ADE corpus, our knowledge-based system achieved an F-score that is 7 percentage points higher than the F-score of jSRE trained on 50 abstracts, and still 2 percentage points higher than jSRE trained on 90% of the corpus. CONCLUSION: A knowledge-based approach can be successfully used to extract adverse drug events from biomedical text without need for a large training set. Whether use of a knowledge base is equally advantageous for other biomedical relation-extraction tasks remains to be investigated.
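
A co-occurrence baseline of the kind this abstract compares against can be sketched as follows: every drug and every adverse effect recognized in the same sentence are assumed to be related, and the result is scored with the F-score. The exact baseline used in the study is not described in the abstract, so the function names and the example entities are assumptions made for illustration.

```python
from itertools import product


def cooccurrence_relations(drugs, effects):
    """Predict a relation for every drug/effect pair found in one sentence."""
    return set(product(drugs, effects))


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over sets of relations."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For a sentence mentioning two drugs and one adverse effect where only one pair is truly related, the baseline predicts both pairs: recall is 1.0, precision is 0.5, and the F-score is about 0.67. This shows why such a baseline tends to have high recall and low precision.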


Subjects
Artificial Intelligence , Data Mining/methods , Drug-Related Side Effects and Adverse Reactions , Knowledge Bases , Humans , Unified Medical Language System
7.
Article in English | MEDLINE | ID: mdl-38934643

ABSTRACT

OBJECTIVE: To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora. MATERIALS AND METHODS: Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English. RESULTS: The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision. DISCUSSION: Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools. CONCLUSION: This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.
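
The idea of embedding annotations in the text before translation can be sketched as a round trip: wrap each annotated span in an inline marker, translate the marked text, then strip the markers and recompute the character offsets. The `[[id|mention]]` marker syntax and both function names are hypothetical, chosen only for illustration; the abstract does not specify the study's actual scheme.

```python
import re

MARKER = re.compile(r"\[\[(\d+)\|(.*?)\]\]")


def embed_annotations(text, annotations):
    """Wrap each (start, end, label) span as [[id|mention]].

    Spans are assumed not to overlap. Returns the marked text and the
    labels in marker-id order.
    """
    parts, labels, cursor = [], [], 0
    for idx, (start, end, label) in enumerate(sorted(annotations)):
        parts.append(text[cursor:start])
        parts.append(f"[[{idx}|{text[start:end]}]]")
        labels.append(label)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), labels


def recover_annotations(translated, labels):
    """Strip markers from translated text and rebuild (start, end, label) spans."""
    parts, spans, cursor, length = [], [], 0, 0
    for match in MARKER.finditer(translated):
        before = translated[cursor:match.start()]
        mention = match.group(2)
        parts.extend([before, mention])
        start = length + len(before)
        spans.append((start, start + len(mention), labels[int(match.group(1))]))
        length = start + len(mention)
        cursor = match.end()
    parts.append(translated[cursor:])
    return "".join(parts), spans
```

In use, the marked text returned by `embed_annotations` would be sent to a translation service, and `recover_annotations` would be applied to the returned string. The sketch assumes the service leaves the markers intact, which would need to be checked in practice.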

8.
Int J Med Inform ; 189: 105506, 2024 May 29.
Article in English | MEDLINE | ID: mdl-38820647

ABSTRACT

OBJECTIVE: Observational studies using electronic health record (EHR) databases often face challenges due to unspecific clinical codes that can obscure detailed medical information, hindering precise data analysis. In this study, we aimed to assess the feasibility of refining these unspecific condition codes into more specific codes in a Dutch general practitioner (GP) EHR database by leveraging the available clinical free text. METHODS: We utilized three approaches for text classification (search queries, semi-supervised learning, and supervised learning) to improve the specificity of ten unspecific International Classification of Primary Care (ICPC-1) codes. Two text representations and three machine learning algorithms were evaluated for the (semi-)supervised models. Additionally, we measured the improvement achieved by the refinement process on all code occurrences in the database. RESULTS: The classification models performed well for most codes. In general, no single classification approach consistently outperformed the others. However, there were variations in the relative performance of the classification approaches within each code and in the use of different text representations and machine learning algorithms. Class imbalance and limited training data affected the performance of the (semi-)supervised models, yet the simple search queries remained particularly effective. Ultimately, the developed models improved the specificity of over half of all the unspecific code occurrences in the database. CONCLUSIONS: Our findings show the feasibility of using information from clinical text to improve the specificity of unspecific condition codes in observational healthcare databases, even with a limited range of machine-learning techniques and modest annotated training sets. Future work could investigate transfer learning, integration of structured data, alternative semi-supervised methods, and validation of models across healthcare settings. The improved level of detail enriches the interpretation of medical information and can benefit observational research and patient care.

9.
Pharmacoepidemiol Drug Saf ; 22(5): 459-67, 2013 May.
Article in English | MEDLINE | ID: mdl-23208789

ABSTRACT

PURPOSE: Pharmacovigilance methods have advanced greatly during the last decades, making post-market drug assessment an essential drug evaluation component. These methods mainly rely on the use of spontaneous reporting systems and health information databases to collect expertise from huge amounts of real-world reports. The EU-ADR Web Platform was built to further facilitate accessing, monitoring and exploring these data, enabling an in-depth analysis of adverse drug reactions risks. METHODS: The EU-ADR Web Platform exploits the wealth of data collected within a large-scale European initiative, the EU-ADR project. Millions of electronic health records, provided by national health agencies, are mined for specific drug events, which are correlated with literature, protein and pathway data, resulting in a rich drug-event dataset. Next, advanced distributed computing methods are tailored to coordinate the execution of data-mining and statistical analysis tasks. This permits obtaining a ranked drug-event list, removing spurious entries and highlighting relationships with high risk potential. RESULTS: The EU-ADR Web Platform is an open workspace for the integrated analysis of pharmacovigilance datasets. Using this software, researchers can access a variety of tools provided by distinct partners in a single centralized environment. Besides performing standalone drug-event assessments, they can also control the pipeline for an improved batch analysis of custom datasets. Drug-event pairs can be substantiated and statistically analysed within the platform's innovative working environment. CONCLUSIONS: A pioneering workspace that helps in explaining the biological path of adverse drug reactions was developed within the EU-ADR project consortium. This tool, targeted at the pharmacovigilance community, is available online at https://bioinformatics.ua.pt/euadr/.


Subjects
Adverse Drug Reaction Reporting Systems/organization & administration , Internet , Pharmacovigilance , Adverse Drug Reaction Reporting Systems/statistics & numerical data , Data Mining/methods , Databases, Factual/statistics & numerical data , Drug-Related Side Effects and Adverse Reactions , Europe , Humans , Software
10.
J Am Med Inform Assoc ; 30(12): 1973-1984, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37587084

ABSTRACT

OBJECTIVE: This work aims to explore the value of Dutch unstructured data, in combination with structured data, for the development of prognostic prediction models in a general practitioner (GP) setting. MATERIALS AND METHODS: We trained and validated prediction models for 4 common clinical prediction problems using various sparse text representations, common prediction algorithms, and observational GP electronic health record (EHR) data. We trained and validated 84 models internally and externally on data from different EHR systems. RESULTS: On average, over all the different text representations and prediction algorithms, models only using text data performed better or similar to models using structured data alone in 2 prediction tasks. Additionally, in these 2 tasks, the combination of structured and text data outperformed models using structured or text data alone. No large performance differences were found between the different text representations and prediction algorithms. DISCUSSION: Our findings indicate that the use of unstructured data alone can result in well-performing prediction models for some clinical prediction problems. Furthermore, the performance improvement achieved by combining structured and text data highlights the added value. Additionally, we demonstrate the significance of clinical natural language processing research in languages other than English and the possibility of validating text-based prediction models across various EHR systems. CONCLUSION: Our study highlights the potential benefits of incorporating unstructured data in clinical prediction models in a GP setting. Although the added value of unstructured data may vary depending on the specific prediction task, our findings suggest that it has the potential to enhance patient care.


Subjects
General Practitioners , Humans , Electronic Health Records , Language , Algorithms , Software , Natural Language Processing
11.
Front Pharmacol ; 14: 1276340, 2023.
Article in English | MEDLINE | ID: mdl-38035014

ABSTRACT

Introduction: Monoclonal antibodies (mAbs) targeting immunoglobulin E (IgE) [omalizumab], type 2 (T2) cytokine interleukin (IL) 5 [mepolizumab, reslizumab], IL-4 Receptor (R) α [dupilumab], and IL-5R [benralizumab] improve quality of life in patients with T2-driven inflammatory diseases. However, there is a concern for an increased risk of helminth infections. The aim was to explore safety signals of parasitic infections for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab. Methods: Spontaneous reports were used from the Food and Drug Administration's Adverse Event Reporting System (FAERS) database from 2004 to 2021. Parasitic infections were defined as any type of parasitic infection term obtained from the Standardised Medical Dictionary for Regulatory Activities® (MedDRA®). Safety signal strength was assessed by the Reporting Odds Ratio (ROR). Results: 15,502,908 reports were eligible for analysis. Amongst 175,888 reports for omalizumab, mepolizumab, reslizumab, dupilumab, and benralizumab, there were 79 reports on parasitic infections. Median age was 55 years (interquartile range 24-63 years) and 59.5% were female. Indications were known in 26 (32.9%) reports; 14 (53.8%) biologicals were reportedly prescribed for asthma, 8 (30.7%) for various types of dermatitis, and 2 (7.6%) for urticaria. A safety signal was observed for each biological, except for reslizumab (due to lack of power), with the strongest signal attributed to benralizumab (ROR = 15.7, 95% Confidence Interval: 8.4-29.3). Conclusion: Parasitic infections were disproportionately reported for mAbs targeting IgE, T2 cytokines, or T2 cytokine receptors. While the number of adverse event reports on parasitic infections in the database was relatively low, resulting safety signals were disproportionate and warrant further investigation.

12.
BMC Bioinformatics ; 13: 17, 2012 Jan 30.
Article in English | MEDLINE | ID: mdl-22289351

ABSTRACT

BACKGROUND: To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available, although small GSC but supplemented with an SSC. RESULTS: We have tested the two scenarios using three chunkers, Lingpipe, OpenNLP, and Yamcha, and two different corpora, GENIA and PennBioIE. For the first scenario, we showed that the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percentage points better in terms of F-score than the systems trained on the GSC in another domain, and only 0.2-0.8 percentage points less than when they were trained on a GSC in the same domain as the SSC. When the outputs of the chunkers were combined, the combined system showed little improvement when using the SSC. For the second scenario, the systems trained on a GSC supplemented with an SSC performed considerably better than systems that were trained on the GSC alone, especially when the GSC was small. For example, training the chunkers on a GSC consisting of only 10 abstracts but supplemented with an SSC yielded similar performance as training them on a GSC of 100-250 abstracts. The combined system even performed better than any of the individual chunkers trained on a GSC of 500 abstracts. CONCLUSIONS: We conclude that an SSC can be a viable alternative for or a supplement to a GSC when training chunkers in a biomedical domain. A combined system only shows improvement if the SSC is used to supplement a GSC. Whether the approach is applicable to other systems in a natural-language processing pipeline has to be further investigated.


Subjects
Computational Biology/methods , Computational Biology/standards , Natural Language Processing , Humans , Neoplasms/genetics
13.
J Biomed Inform ; 45(3): 423-8, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22239956

ABSTRACT

Recognition of medical concepts is a basic step in information extraction from clinical records. We wished to improve on the performance of a variety of concept recognition systems by combining their individual results. We selected two dictionary-based systems and five statistical-based systems that were trained to annotate medical problems, tests, and treatments in clinical records. Manually annotated clinical records for training and testing were made available through the 2010 i2b2/VA (Informatics for Integrating Biology and the Bedside) challenge. Results of individual systems were combined by a simple voting scheme. The statistical systems were trained on a set of 349 records. Performance (precision, recall, F-score) was assessed on a test set of 477 records, using varying voting thresholds. The combined annotation system achieved a best F-score of 82.2% (recall 81.2%, precision 83.3%) on the test set, a score that ranks third among 22 participants in the i2b2/VA concept annotation task. The ensemble system had better precision and recall than any of the individual systems, yielding an F-score that is 4.6 percentage points higher than the best single system. Changing the voting threshold offered a simple way to obtain a system with high precision (and moderate recall) or one with high recall (and moderate precision). The ensemble-based approach is straightforward and allows the balancing of precision versus recall of the combined system. The ensemble system is freely available and can easily be extended, integrated in other systems, and retrained.
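
A simple voting scheme with an adjustable threshold, as described in this abstract, can be sketched as follows. The sketch assumes that annotations from different systems are counted as the same vote only when their span and type match exactly. The abstract does not say how matches were determined, so that rule and the function name are assumptions.

```python
from collections import Counter


def vote(system_annotations, threshold):
    """Keep annotations proposed by at least `threshold` systems.

    system_annotations: one collection per system, each holding
    (start, end, concept_type) tuples.
    """
    counts = Counter()
    for annotations in system_annotations:
        # set() ensures a system cannot vote twice for the same annotation.
        counts.update(set(annotations))
    return {annotation for annotation, n in counts.items() if n >= threshold}
```

A low threshold accepts anything a single system proposes, which favours recall. A high threshold keeps only annotations most systems agree on, which favours precision. This matches the trade-off the abstract describes.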


Subjects
Data Mining/methods , Electronic Health Records , Humans , Natural Language Processing , Semantics
14.
J Biomed Inform ; 45(5): 879-84, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22554700

ABSTRACT

Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators revise this annotation using a web-based interface. The agreement figures achieved show that the inter-annotator agreement is much better than the agreement with the system provided annotations. The corpus has been annotated for drugs, disorders, genes and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations three experts have annotated a set of 100 abstracts. These annotated relationships will be used to train and evaluate text-mining software to capture these relationships in texts.


Subjects
Data Mining/methods , Databases, Factual , Medical Informatics/methods , Documentation , Drug Therapy/classification , Humans , Internet , Pharmaceutical Preparations/classification , User-Computer Interface
15.
J Biomed Semantics ; 13(1): 24, 2022 10 18.
Article in English | MEDLINE | ID: mdl-36258262

ABSTRACT

BACKGROUND: Vaccine information in European electronic health record (EHR) databases is represented using various clinical and database-specific coding systems and drug vocabularies. The lack of harmonization constitutes a challenge in reusing EHR data in collaborative benefit-risk studies about vaccines. METHODS: We designed an ontology of the properties that are commonly used in vaccine descriptions, called Ontology of Vaccine Descriptions (VaccO), with a dictionary for the analysis of multilingual vaccine descriptions. We implemented five algorithms for the alignment of vaccine coding systems, i.e., the identification of corresponding codes from different coding systems, based on an analysis of the code descriptors. The algorithms were evaluated by comparing their results with manually created alignments in two reference sets including clinical and database-specific coding systems with multilingual code descriptors. RESULTS: The best-performing algorithm represented code descriptors as logical statements about entities in the VaccO ontology and used an ontology reasoner to infer common properties and identify corresponding vaccine codes. The evaluation demonstrated excellent performance of the approach (F-scores 0.91 and 0.96). CONCLUSION: The VaccO ontology allows the identification, representation, and comparison of heterogeneous descriptions of vaccines. The automatic alignment of vaccine coding systems can accelerate the readiness of EHR databases in collaborative vaccine studies.


Subjects
Electronic Health Records , Vaccines , Databases, Factual , Algorithms
16.
PLoS One ; 17(7): e0271395, 2022.
Article in English | MEDLINE | ID: mdl-35830458

ABSTRACT

Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. With most of these SNPs located on the non-coding part of the genome, it is currently assumed that these SNPs influence the expression of nearby genes on the genome. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. In the past, protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare our performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graphs that include predicate information perform comparably to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graphs that lack predicate information perform comparably to our other baseline (genetic distance) which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to 84.9% AUC. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.
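
The nearest-gene baseline mentioned in this abstract can be sketched as a distance lookup along one chromosome. The sketch assumes base-pair coordinates and treats a SNP inside a gene as having distance zero; the abstract does not say how the study measured distance, and the function names and coordinates are hypothetical.

```python
def distance_to_gene(snp_position, gene_start, gene_end):
    """Base-pair distance from a SNP to a gene; 0 if the SNP lies inside it."""
    if gene_start <= snp_position <= gene_end:
        return 0
    return min(abs(snp_position - gene_start), abs(snp_position - gene_end))


def nearest_gene(snp_position, genes):
    """Return the name of the gene closest to the SNP.

    genes: dict mapping gene name -> (start, end) on the SNP's chromosome.
    """
    return min(genes, key=lambda name: distance_to_gene(snp_position, *genes[name]))
```

With genes `{"GENE_A": (100, 200), "GENE_B": (500, 900)}`, a SNP at position 260 is assigned to `GENE_A` (60 bp away) and a SNP at position 450 to `GENE_B` (50 bp away).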


Subjects
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Pattern Recognition, Automated , Phenotype
17.
J Am Med Inform Assoc ; 29(7): 1292-1302, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35475536

ABSTRACT

OBJECTIVE: This systematic review aims to assess how information from unstructured text is used to develop and validate clinical prognostic prediction models. We summarize the prediction problems and methodological landscape and determine whether using text data in addition to more commonly used structured data improves the prediction performance. MATERIALS AND METHODS: We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies that developed prognostic prediction models using information extracted from unstructured text in a data-driven manner, published in the period from January 2005 to March 2021. Data items were extracted, analyzed, and a meta-analysis of the model performance was carried out to assess the added value of text to structured-data models. RESULTS: We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance, compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and attention to the explainability of the developed models were limited. CONCLUSION: The use of unstructured text in the development of prognostic prediction models has been found beneficial in addition to structured data in most studies. The text data are a source of valuable information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models, promoting robust and trustworthy prediction models in clinical practice.


Subjects
Machine Learning , Prognosis
18.
J Biomed Inform ; 44(2): 354-60, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21056118

ABSTRACT

Text chunking is an essential pre-processing step in information extraction systems. No comparative studies of chunking systems, including sentence splitting, tokenization and part-of-speech tagging, are available for the biomedical domain. We compared the usability (ease of integration, speed, trainability) and performance of six state-of-the-art chunkers for the biomedical domain, and combined the chunker results in order to improve chunking performance. We investigated six frequently used chunkers: GATE chunker, Genia Tagger, Lingpipe, MetaMap, OpenNLP, and Yamcha. All chunkers were integrated into the Unstructured Information Management Architecture framework. The GENIA Treebank corpus was used for training and testing. Performance was assessed for noun-phrase and verb-phrase chunking. For both noun-phrase chunking and verb-phrase chunking, OpenNLP performed best (F-scores 89.7% and 95.7%, respectively), but differences with Genia Tagger and Yamcha were small. With respect to usability, Lingpipe and OpenNLP scored best. When combining the results of the chunkers by a simple voting scheme, the F-score of the combined system improved by 3.1 percentage points for noun phrases and 0.6 percentage points for verb phrases as compared to the best single chunker. Changing the voting threshold offered a simple way to obtain a system with high precision (and moderate recall) or high recall (and moderate precision). This study is the first to compare the performance of the whole chunking pipeline, and to combine different existing chunking systems. Several chunkers showed good performance, but OpenNLP scored best both in performance and usability. The combination of chunker results by a simple voting scheme can further improve performance and allows for different precision-recall settings.
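
The voting scheme mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything below is an assumption made for illustration: chunks are represented as (start token, end token, label) tuples, a chunk is kept when at least `threshold` systems propose exactly the same tuple, and the toy system outputs are invented.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical chunk representation: (start_token, end_token_exclusive, label).
Chunk = Tuple[int, int, str]


def vote_chunks(system_outputs: List[Set[Chunk]], threshold: int) -> Set[Chunk]:
    """Keep every chunk proposed by at least `threshold` systems.

    Note: at low thresholds the result may contain overlapping chunks;
    this sketch does not try to resolve such conflicts.
    """
    votes = Counter(chunk for output in system_outputs for chunk in output)
    return {chunk for chunk, count in votes.items() if count >= threshold}


def precision_recall_f(predicted: Set[Chunk], gold: Set[Chunk]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F-score of predicted chunks."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Invented toy data: three systems chunking a five-token sentence.
gold = {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")}
outputs = [
    {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")},
    {(0, 2, "NP"), (2, 3, "VP"), (3, 4, "NP")},
    {(0, 1, "NP"), (2, 3, "VP"), (3, 5, "NP")},
]

for threshold in (1, 2, 3):
    combined = vote_chunks(outputs, threshold)
    print(threshold, precision_recall_f(combined, gold))
```

In this toy setting, a low threshold keeps every proposed chunk (favoring recall) and a high threshold keeps only unanimous chunks (favoring precision), which mirrors the precision-recall trade-off the abstract attributes to changing the voting threshold.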


Subjects
Abstracting and Indexing/methods , Algorithms , Databases, Bibliographic , Natural Language Processing , Terminology as Topic , Vocabulary, Controlled
19.
Bioinformatics ; 25(12): i69-76, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19478019

ABSTRACT

MOTIVATION: For many years, the Unified Medical Language System (UMLS) semantic network (SN) has been used as an upper-level semantic framework for the categorization of terms from terminological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL DL as a formal representation language, which has become standard in the Semantic Web. In order to make logic-based reasoning available for the resources annotated or categorized with the SN, a mapping ontology was developed aligning the SN with BioTop. METHODS: The theoretical foundations and the practical realization of the alignment are described, with a focus on the design decisions taken, the problems encountered and the adaptations of BioTop that became necessary. For evaluation purposes, UMLS concept pairs obtained from MEDLINE abstracts by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic-type combinations that occur in the UMLS Metathesaurus were checked for satisfiability. RESULTS: The effort-intensive alignment process required major design changes and enhancements of BioTop and brought up several design errors that could be fixed. A comparison between a human curator and the ontology yielded only low agreement. Ontology reasoning was also used to successfully identify 133 inconsistent semantic-type combinations. AVAILABILITY: BioTop, the OWL DL representation of the UMLS SN, and the mapping ontology are available at http://www.purl.org/biotop/.


Subjects
Computational Biology/methods , Information Storage and Retrieval/methods , Unified Medical Language System/standards , Databases, Factual , Pattern Recognition, Automated , Semantics , Vocabulary, Controlled
20.
J Biomed Semantics ; 11(1): 9, 2020 08 20.
Article in English | MEDLINE | ID: mdl-32819419

ABSTRACT

BACKGROUND: Knowledge graphs can represent the contents of biomedical literature and databases as subject-predicate-object triples, thereby enabling comprehensive analyses that identify, e.g., relationships between diseases. Some diseases are often diagnosed in patients in specific temporal sequences, which are referred to as disease trajectories. Here, we determine whether a sequence of two diseases forms a trajectory by leveraging the predicate information from paths between (disease) proteins in a knowledge graph. Furthermore, we determine the added value of directional information of predicates for this task. To do so, we create four feature sets, based on two methods for representing indirect paths, and both with and without directional information of predicates (i.e., which protein is considered subject and which object). The added value of the directional information of predicates is quantified by comparing the classification performance of the feature sets that include or exclude it. RESULTS: Our method achieved a maximum area under the ROC curve of 89.8% and 74.5% when evaluated with two different reference sets. Use of directional information of predicates significantly improved performance by 6.5 and 2.0 percentage points, respectively. CONCLUSIONS: Our work demonstrates that predicates between proteins can be used to identify disease trajectories. Using the directional information of predicates significantly improved performance over not using this information.
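
As a rough illustration of what "directional information of predicates" can mean at the feature level, the sketch below counts predicates on triples that directly link the proteins of two diseases, once ignoring and once recording which side is the subject. This is not the authors' feature construction: the abstract does not specify its two methods for representing indirect paths, so only direct links are handled here, and the function name, feature-key format, protein identifiers and predicates are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A knowledge-graph statement: (subject, predicate, object).
Triple = Tuple[str, str, str]


def predicate_features(
    triples: Iterable[Triple],
    proteins_a: Set[str],
    proteins_b: Set[str],
    directional: bool,
) -> Counter:
    """Count predicates on direct links between proteins of disease A and disease B.

    With directional=True, the feature key also records whether the
    disease-A protein is the subject ("A->B") or the object ("B->A").
    Indirect (multi-step) paths are deliberately left out of this sketch.
    """
    features: Counter = Counter()
    for subj, pred, obj in triples:
        if subj in proteins_a and obj in proteins_b:
            key = f"{pred}:A->B" if directional else pred
        elif subj in proteins_b and obj in proteins_a:
            key = f"{pred}:B->A" if directional else pred
        else:
            continue  # triple does not connect the two protein sets
        features[key] += 1
    return features


# Invented toy graph with hypothetical protein identifiers and predicates.
triples = [
    ("P1", "activates", "P3"),
    ("P3", "activates", "P2"),
    ("P1", "binds", "P4"),
    ("P5", "inhibits", "P6"),
]
disease_a_proteins = {"P1", "P2"}
disease_b_proteins = {"P3", "P4"}

undirected = predicate_features(triples, disease_a_proteins, disease_b_proteins, False)
directed = predicate_features(triples, disease_a_proteins, disease_b_proteins, True)
print(dict(undirected))
print(dict(directed))
```

Without direction, the two "activates" links collapse into a single feature; with direction, they are separated by which disease's protein acts as subject, which is the kind of extra signal a classifier of disease order could exploit.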


Subjects
Biological Ontologies , Computer Graphics , Disease , Humans , Information Storage and Retrieval , ROC Curve , Semantics