Results 1 - 20 of 66
1.
Bioinformatics ; 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38924508

ABSTRACT

MOTIVATION: Citations have a fundamental role in scholarly communication and assessment. Citation accuracy and transparency are crucial for the integrity of scientific evidence. In this work, we focus on quotation errors, errors in citation content that can distort the scientific evidence and that are hard for humans to detect. We construct a corpus and propose natural language processing (NLP) methods to identify such errors in biomedical publications. RESULTS: We manually annotated 100 highly cited biomedical publications (reference articles) and citations to them. The annotation involved labeling the citation context in the citing article, relevant evidence sentences in the reference article, and the accuracy of the citation. A total of 3,063 citation instances were annotated (39.18% with accuracy errors). For NLP, we combined a sentence retriever with a fine-tuned claim verification model to label citations as accurate, not_accurate, or irrelevant. We also explored few-shot in-context learning with generative large language models. The best-performing model, which uses citation sentences as citation context, BM25 with a MonoT5 reranker to retrieve the top 20 sentences, and a fine-tuned MultiVerS model for accuracy label classification, yielded 0.59 micro-F1 and 0.52 macro-F1. GPT-4 in-context learning performed better in identifying accurate citations but lagged for erroneous citations (0.65 micro-F1, 0.45 macro-F1). Citation quotation errors are often subtle, and it is currently challenging for NLP models to identify erroneous citations. With further improvements, the models could serve to improve citation quality and accuracy. AVAILABILITY: We make the corpus and the best-performing NLP model publicly available at https://github.com/ScienceNLP-Lab/Citation-Integrity/.
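To make the retrieve-then-verify pipeline described above concrete, here is a minimal sketch of the sentence-retrieval step, assuming the third-party rank_bm25 package; the sentences are invented placeholders, and the MonoT5 reranking and MultiVerS verification stages used by the authors are only indicated in comments (their trained models are in the linked repository).

```python
# Minimal sketch of the BM25 retrieval step in a retrieve-then-verify pipeline.
# Assumes the third-party `rank_bm25` package; sentences are placeholders.
from rank_bm25 import BM25Okapi

reference_sentences = [
    "Treatment A reduced symptom scores by 30% compared with placebo.",
    "No significant difference in adverse events was observed between groups.",
    "The trial enrolled 120 participants across three sites.",
]
citation_sentence = "Treatment A was reported to reduce symptoms by about 30% [1]."

def tokenize(text):
    return text.lower().split()

bm25 = BM25Okapi([tokenize(s) for s in reference_sentences])
scores = bm25.get_scores(tokenize(citation_sentence))

# Keep the top-k evidence sentences; the paper retrieves the top 20 and then
# applies a MonoT5 reranker and a fine-tuned MultiVerS classifier (omitted).
top_k = sorted(zip(scores, reference_sentences), reverse=True)[:2]
for score, sentence in top_k:
    print(f"{score:.3f}  {sentence}")
```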

2.
J Biomed Inform ; 155: 104658, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38782169

ABSTRACT

OBJECTIVE: Relation extraction is an essential task in the field of biomedical literature mining and offers significant benefits for various downstream applications, including database curation, drug repurposing, and literature-based discovery. The broad-coverage natural language processing (NLP) tool SemRep has established a solid baseline for extracting subject-predicate-object triples from biomedical text and has served as the backbone of the Semantic MEDLINE Database (SemMedDB), a PubMed-scale repository of semantic triples. While SemRep achieves reasonable precision (0.69), its recall is relatively low (0.42). In this study, we aimed to enhance SemRep using a relation classification approach, in order to eventually increase the size and the utility of SemMedDB. METHODS: We combined and extended existing SemRep evaluation datasets to generate training data. We leveraged the pre-trained PubMedBERT model, enhancing it through additional contrastive pre-training and fine-tuning. We experimented with three entity representations: mentions, semantic types, and semantic groups. We evaluated the model performance on a portion of the SemRep Gold Standard dataset and compared it to SemRep performance. We also assessed the effect of the model on a larger set of 12K randomly selected PubMed abstracts. RESULTS: Our results show that the best model yields a precision of 0.62, recall of 0.81, and F1 score of 0.70. Assessment on the 12K abstracts shows that the model could double the size of SemMedDB when applied to the whole of PubMed. We also manually assessed the quality of 506 triples predicted by the model that SemRep had not previously identified, and found that 67% of these triples were correct. CONCLUSION: These findings underscore the promise of our model in achieving more comprehensive coverage of the relationships mentioned in the biomedical literature, thereby showing its potential to enhance various downstream applications of biomedical literature mining. Data and code related to this study are available at https://github.com/Michelle-Mings/SemRep_RelationClassification.
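As a rough illustration of the relation classification setup described above (not the authors' code), the sketch below loads a PubMedBERT-style checkpoint with Hugging Face transformers and classifies a sentence whose entity pair is wrapped in marker strings; the checkpoint name, marker convention, and label set are assumptions made for the example.

```python
# Sketch of a relation classifier over sentences with marked entities, using
# Hugging Face transformers. The checkpoint name, marker convention, and
# label set are illustrative assumptions, not the authors' exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
labels = ["NO_RELATION", "TREATS", "CAUSES"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

# Entity pair wrapped in marker strings; replacing the surface mentions with
# semantic types or groups (e.g. "@chemical$") gives the alternative entity
# representations compared in the abstract. In practice the markers would be
# added to the tokenizer vocabulary and the model fine-tuned on labeled pairs.
sentence = "@subject$ aspirin @/subject$ reduces the risk of @object$ stroke @/object$ ."

inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])  # classifier head is untrained here
```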


Subjects
Data Mining; Natural Language Processing; Semantics; Data Mining/methods; MEDLINE; PubMed; Algorithms; Humans; Databases, Factual
3.
Sci Rep ; 14(1): 8693, 2024 04 15.
Article in English | MEDLINE | ID: mdl-38622164

ABSTRACT

Non-pharmaceutical interventions (NPI) have great potential to improve cognitive function, but there has been limited investigation of NPI repurposing for Alzheimer's Disease (AD). This is the first study to develop an innovative framework to extract and represent NPI information from the biomedical literature in a knowledge graph (KG) and to train link prediction models to repurpose novel NPIs for AD prevention. We constructed a comprehensive KG, called ADInt, by extracting NPI information from the biomedical literature. We used the previously created SuppKG and an NPI lexicon to identify NPI entities. Four KG embedding models (TransE, RotatE, DistMult and ComplEX) and two graph convolutional network models (R-GCN and CompGCN) were trained and compared to learn the representation of ADInt. Models were evaluated and compared on two test sets (time slice and clinical trial ground truth), and the best-performing model was used to predict novel NPIs for AD. Discovery patterns were applied to generate mechanistic pathways for high-scoring candidates. ADInt has 162,212 nodes and 1,017,284 edges. R-GCN performed best on both the time slice (MR = 5.2054, Hits@10 = 0.8496) and clinical trial ground truth (MR = 3.4996, Hits@10 = 0.9192) test sets. After evaluation by domain experts, 10 novel dietary supplements and 10 complementary and integrative health interventions were proposed from the score table calculated by R-GCN. Among the proposed novel NPIs, we found plausible mechanistic pathways for photodynamic therapy and Choerospondias axillaris to prevent AD, and validated psychotherapy and manual therapy techniques using real-world data analysis. The proposed framework shows potential for discovering new NPIs for AD prevention and understanding their mechanistic pathways.
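For readers unfamiliar with the KG embedding baselines mentioned above, the following is a tiny numpy illustration of the TransE scoring idea (a triple (h, r, t) is plausible when h + r is close to t); the embeddings are random placeholders, not trained ADInt vectors.

```python
# Minimal illustration of the TransE scoring function used as one of the
# KG embedding baselines: a triple (h, r, t) is plausible when h + r is
# close to t. Embeddings are random placeholders, not trained ADInt vectors.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
entity_emb = {e: rng.normal(size=dim) for e in ["Vitamin_E", "Alzheimer_Disease"]}
relation_emb = {"PREVENTS": rng.normal(size=dim)}

def transe_score(h, r, t, norm=1):
    """Negative distance ||h + r - t||; higher means more plausible."""
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t], ord=norm)

print(transe_score("Vitamin_E", "PREVENTS", "Alzheimer_Disease"))
```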


Subjects
Alzheimer Disease; Humans; Alzheimer Disease/drug therapy; Learning
4.
medRxiv ; 2024 Apr 01.
Article in English | MEDLINE | ID: mdl-38633775

ABSTRACT

Objective: To develop text classification models for determining whether the checklist items in the CONSORT reporting guidelines are reported in randomized controlled trial publications. Materials and Methods: Using a corpus annotated at the sentence level with 37 fine-grained CONSORT items, we trained several sentence classification models (PubMedBERT fine-tuning, BioGPT fine-tuning, and in-context learning with GPT-4) and compared their performance. To address the problem of a small training dataset, we used several data augmentation methods (EDA, UMLS-EDA, text generation and rephrasing with GPT-4) and assessed their impact on the fine-tuned PubMedBERT model. We also fine-tuned PubMedBERT models limited to checklist items associated with specific sections (e.g., Methods) to evaluate whether such models could improve performance compared to the single full model. We performed 5-fold cross-validation and report precision, recall, F1 score, and area under the curve (AUC). Results: The fine-tuned PubMedBERT model that takes as input the sentence and the surrounding sentence representations and uses section headers yielded the best overall performance (0.71 micro-F1, 0.64 macro-F1). Data augmentation had a limited positive effect, with UMLS-EDA yielding slightly better results than data augmentation using GPT-4. BioGPT fine-tuning and GPT-4 in-context learning exhibited suboptimal results. The Methods-specific model yielded higher performance for methodology items; other section-specific models did not have a significant impact. Conclusion: Most CONSORT checklist items can be recognized reasonably well with the fine-tuned PubMedBERT model, but there is room for improvement. Improved models could underpin journal editorial workflows and CONSORT adherence checks and help authors improve the reporting quality and completeness of their manuscripts.
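The best configuration above feeds the classifier the target sentence together with its neighbouring sentences and the section header. A small sketch of one plausible way to assemble such an input string is shown below; the separator convention and example sentences are assumptions, not the paper's exact format.

```python
# One plausible way to assemble the classifier input described above:
# section header plus the target sentence with its neighbours. The separator
# convention is an assumption; the paper's exact input format may differ.
def build_input(sentences, idx, section_header, sep="[SEP]"):
    prev_sent = sentences[idx - 1] if idx > 0 else ""
    next_sent = sentences[idx + 1] if idx + 1 < len(sentences) else ""
    return f"{section_header} {sep} {prev_sent} {sep} {sentences[idx]} {sep} {next_sent}"

methods = [
    "Participants were recruited from three outpatient clinics.",
    "Randomization was performed using a computer-generated sequence.",
    "Outcome assessors were blinded to group allocation.",
]
print(build_input(methods, 1, "Methods"))
```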

5.
J Biomed Inform ; 152: 104628, 2024 04.
Article in English | MEDLINE | ID: mdl-38548008

ABSTRACT

OBJECTIVE: Acknowledging study limitations in a scientific publication is a crucial element in scientific transparency and progress. However, limitation reporting is often inadequate. Natural language processing (NLP) methods could support automated reporting checks, improving research transparency. In this study, our objective was to develop a dataset and NLP methods to detect and categorize self-acknowledged limitations (e.g., sample size, blinding) reported in randomized controlled trial (RCT) publications. METHODS: We created a data model of limitation types in RCT studies and annotated a corpus of 200 full-text RCT publications using this data model. We fine-tuned BERT-based sentence classification models to recognize the limitation sentences and their types. To address the small size of the annotated corpus, we experimented with data augmentation approaches, including Easy Data Augmentation (EDA) and Prompt-Based Data Augmentation (PromDA). We applied the best-performing model to a set of about 12K RCT publications to characterize self-acknowledged limitations at larger scale. RESULTS: Our data model consists of 15 categories and 24 sub-categories (e.g., Population and its sub-category DiagnosticCriteria). We annotated 1090 instances of limitation types in 952 sentences (4.8 limitation sentences and 5.5 limitation types per article). A fine-tuned PubMedBERT model for limitation sentence classification improved upon our earlier model by about 1.5 absolute percentage points in F1 score (0.821 vs. 0.8) with statistical significance (p<.001). Our best-performing limitation type classification model, PubMedBERT fine-tuning with PromDA (Output View), achieved an F1 score of 0.7, improving upon the vanilla PubMedBERT model by 2.7 percentage points, with statistical significance (p<.001). CONCLUSION: The model could support automated screening tools which can be used by journals to draw the authors' attention to reporting issues. Automatic extraction of limitations from RCT publications could benefit peer review and evidence synthesis, and support advanced methods to search and aggregate the evidence from the clinical trial literature.
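As a toy illustration of the Easy Data Augmentation (EDA) idea mentioned above, the sketch below performs only the synonym-replacement operation with a hand-made synonym table; EDA proper draws synonyms from WordNet, PromDA uses prompt-based generation, and both involve additional operations not shown here.

```python
# Toy sketch of the synonym-replacement operation at the heart of EDA.
# The synonym table is a hand-made placeholder; EDA proper draws synonyms
# from WordNet and also applies random insertion, swap, and deletion.
import random

SYNONYMS = {
    "small": ["limited", "modest"],
    "sample": ["cohort", "study population"],
}

def synonym_replace(sentence, n=1, seed=0):
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replace("The small sample limits generalizability", n=2))
```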


Subjects
Natural Language Processing; Publications; Sample Size; Language
6.
JAMA Netw Open ; 7(1): e2350688, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38190185

ABSTRACT

Importance: Publishing study protocols might reduce research waste because of unclear methods or incomplete reporting; on the other hand, there might be few additional benefits of publishing protocols for registered trials that are never completed or published. No study has investigated the proportion of published protocols associated with published results. Objective: To estimate the proportion of published trial protocols for which there are not associated published results. Design, Setting, and Participants: This cross-sectional study used stratified random sampling to identify registered clinical trials with protocols published between January 2011 and August 2022 and indexed in PubMed Central. Ongoing studies and those within 1 year of the primary completion date on ClinicalTrials.gov were excluded. Published results were sought from August 2022 to March 2023 by searching ClinicalTrials.gov, emailing authors, and using an automated tool, as well as through incidental discovery. Main Outcomes and Measures: The primary outcome was a weighted estimate of the proportion of registered trials with published protocols that also had published main results. The proportion of trials with unpublished results was estimated using a weighted mean. Results: From 1500 citations that were screened, 308 clinical trial protocols were included, and it was found that 87 trials had not published their main results. Most included trials were investigator-initiated evaluations of nonregulated products. When published, results appeared a mean (SD) of 3.4 (2.0) years after protocol publications. With the use of a weighted mean, an estimated 4754 (95% CI, 4296-5226) eligible clinical trial protocols were published and indexed in PubMed Central between 2011 and 2022. In the weighted analysis, 1708 of those protocols (36%; 95% CI, 31%-41%) were not associated with publication of main results. In a sensitivity analysis excluding protocols published after 2019, an estimated 25% (95% CI, 20%-30%) of 3670 (95% CI, 3310-4032) protocol publications were not associated with publication of main results. Conclusions and Relevance: This cross-sectional study of clinical trial protocols published on PubMed Central between 2011 and 2022 suggests that many protocols were not associated with subsequent publication of results. The overall benefits of publishing study protocols might outweigh the research waste caused by unnecessary protocol publications.
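The weighted estimates reported above come from stratified random sampling. The arithmetic sketch below shows how a stratum-weighted proportion of that kind is formed; the stratum sizes and counts are invented placeholders, not the study's data.

```python
# Arithmetic sketch of a stratum-weighted proportion from stratified sampling.
# Stratum sizes and sample counts are invented placeholders, not the study's
# data; they only show how a weighted estimate of this kind is formed.
strata = [
    # (protocols in stratum, sampled, sampled without published results)
    (2000, 120, 50),
    (1500, 100, 30),
    (1254, 88, 25),
]

total = sum(n for n, _, _ in strata)
weighted_unpublished = sum(n * (x / m) for n, m, x in strata)
print(f"Estimated proportion without published results: "
      f"{weighted_unpublished / total:.1%}")
```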


Subjects
Clinical Trial Protocols as Topic; Incidental Findings; Publishing; Humans; Cross-Sectional Studies; Research Design; Publishing/statistics & numerical data
8.
J Am Med Inform Assoc ; 31(2): 426-434, 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-37952122

ABSTRACT

OBJECTIVE: To construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to better represent the often underrepresented physical and psychological CIH approaches in standard terminologies, and to apply state-of-the-art natural language processing (NLP) techniques to recognize them in the biomedical literature. MATERIALS AND METHODS: We constructed CIHLex by compiling and integrating data from the biomedical literature and other relevant knowledge sources. The lexicon encompasses 724 unique concepts with 885 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS), and we developed BERT models and compared their performance in CIH named entity recognition with that of well-established models, including MetaMap and CLAMP, as well as the large language model GPT-3.5-turbo. RESULTS: Of the 724 unique concepts in CIHLex, 27.2% could be matched to at least one term in the UMLS. About 74.9% of the mapped UMLS Concept Unique Identifiers were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro-average F1 score of 0.91, surpassing the other models. CONCLUSION: CIHLex significantly augments the representation of CIH approaches in the biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing the standardization and recognition of CIH terminology in biomedical contexts.
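To illustrate the dictionary-matching side of the approach above (alongside which the BERT models were evaluated), here is a minimal longest-match lexicon tagger; the lexicon entries are placeholders, not actual CIHLex content.

```python
# Minimal sketch of dictionary-based recognition of CIH mentions via
# longest-match over token windows. Lexicon entries are placeholders, not
# actual CIHLex content; the BERT/MetaMap/CLAMP comparisons are omitted.
CIH_LEXICON = {"acupuncture", "music therapy", "tai chi", "mindfulness meditation"}
MAX_NGRAM = 3

def find_cih_mentions(text):
    tokens = text.lower().split()
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in CIH_LEXICON:
                spans.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1
    return spans

print(find_cih_mentions("Patients received music therapy and tai chi weekly"))
```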


Subjects
Algorithms; Unified Medical Language System; Natural Language Processing; Language
9.
J Clin Epidemiol ; 162: 19-28, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37562729

ABSTRACT

OBJECTIVES: To describe randomized controlled trial (RCT) methodology reporting over time. STUDY DESIGN AND SETTING: We used a deep learning-based sentence classification model based on the Consolidated Standards of Reporting Trials (CONSORT) statement, considered minimum requirements for reporting RCTs. We included 176,469 RCT reports published between 1966 and 2018. We analyzed the reporting trends over 5-year time periods, grouping trials from 1966 to 1990 in a single stratum. We also explored the effect of journal impact factor (JIF) and medical discipline. RESULTS: Population, Intervention, Comparator, Outcome (PICO) items were commonly reported during each period, and reporting increased over time (e.g., interventions: 79.1% during 1966-1990 to 87.5% during 2010-2018). Reporting of some methods information has increased, although there is room for improvement (e.g., sequence generation: 10.8-41.8%). Some items are reported infrequently (e.g., allocation concealment: 5.1-19.3%). The number of items reported and JIF are weakly correlated (Pearson's r (162,702) = 0.16, P < 0.001). The differences in the proportion of items reported between disciplines are small (<10%). CONCLUSION: Our analysis provides large-scale quantitative support for the hypothesis that RCT methodology reporting has improved over time. Extending these models to all CONSORT items could facilitate compliance checking during manuscript authoring and peer review, and support metaresearch.
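The journal impact factor analysis above reports a Pearson correlation; a minimal sketch of that computation with scipy follows, using invented data points.

```python
# Sketch of the correlation reported above (number of CONSORT items reported
# vs. journal impact factor) using scipy; the data points are invented.
from scipy.stats import pearsonr

items_reported = [12, 15, 9, 20, 17, 11, 22, 14]
impact_factor = [2.1, 3.4, 1.8, 6.2, 4.0, 2.5, 7.9, 3.1]

r, p_value = pearsonr(items_reported, impact_factor)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")
```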


Subjects
Journal Impact Factor; Publications; Humans; Randomized Controlled Trials as Topic; Reference Standards
10.
medRxiv ; 2023 May 21.
Article in English | MEDLINE | ID: mdl-37292731

ABSTRACT

Recently, computational drug repurposing has emerged as a promising method for identifying new pharmaceutical interventions (PI) for Alzheimer's Disease (AD). Non-pharmaceutical interventions (NPI), such as Vitamin E and music therapy, have great potential to improve cognitive function and slow the progression of AD, but have largely been unexplored. This study predicts novel NPIs for AD through link prediction on our developed biomedical knowledge graph. We constructed a comprehensive knowledge graph containing AD concepts and various potential interventions, called ADInt, by integrating a dietary supplement domain knowledge graph, SuppKG, with semantic relations from the SemMedDB database. Four knowledge graph embedding models (TransE, RotatE, DistMult and ComplEX) and two graph convolutional network models (R-GCN and CompGCN) were compared to learn the representation of ADInt. R-GCN outperformed the other models when evaluated on the time slice test set and the clinical trial test set, and was used to generate the score tables for the link prediction task. Discovery patterns were applied to generate mechanistic pathways for high-scoring triples. ADInt had 162,213 nodes and 1,017,319 edges. The graph convolutional network model, R-GCN, performed best on both the time slice test set (MR = 7.099, MRR = 0.5007, Hits@1 = 0.4112, Hits@3 = 0.5058, Hits@10 = 0.6804) and the clinical trials test set (MR = 1.731, MRR = 0.8582, Hits@1 = 0.7906, Hits@3 = 0.9033, Hits@10 = 0.9848). Among the high-scoring triples in the link prediction results, we found plausible mechanistic pathways for (Photodynamic therapy, PREVENTS, Alzheimer's Disease) and (Choerospondias axillaris, PREVENTS, Alzheimer's Disease) through discovery patterns and discuss them further. In conclusion, we presented a novel methodology to extend an existing knowledge graph and discover NPIs (dietary supplements (DS) and complementary and integrative health (CIH)) for AD. We used discovery patterns to find mechanisms for predicted triples in order to address the poor interpretability of artificial neural networks. Our method can potentially be applied to other clinical problems, such as discovering drug adverse reactions and drug-drug interactions.
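The link-prediction metrics reported above (MR, MRR, Hits@k) are simple functions of the rank assigned to each ground-truth entity; a small self-contained sketch with made-up ranks:

```python
# MR, MRR, and Hits@k computed from the rank of each ground-truth entity.
# The ranks below are made up for illustration.
def ranking_metrics(ranks, ks=(1, 3, 10)):
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics

print(ranking_metrics([1, 2, 1, 5, 12, 3, 1, 30]))
```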

11.
J Biomed Inform ; 140: 104341, 2023 04.
Article in English | MEDLINE | ID: mdl-36933632

ABSTRACT

BACKGROUND: Pharmacokinetic natural product-drug interactions (NPDIs) occur when botanical or other natural products are co-consumed with pharmaceutical drugs. With the growing use of natural products, the risk for potential NPDIs and consequent adverse events has increased. Understanding mechanisms of NPDIs is key to preventing or minimizing adverse events. Although biomedical knowledge graphs (KGs) have been widely used for drug-drug interaction applications, computational investigation of NPDIs is novel. We constructed NP-KG as a first step toward computational discovery of plausible mechanistic explanations for pharmacokinetic NPDIs that can be used to guide scientific research. METHODS: We developed a large-scale, heterogeneous KG with biomedical ontologies, linked data, and full texts of the scientific literature. To construct the KG, biomedical ontologies and drug databases were integrated with the Phenotype Knowledge Translator framework. The semantic relation extraction systems, SemRep and Integrated Network and Dynamic Reasoning Assembler, were used to extract semantic predications (subject-relation-object triples) from full texts of the scientific literature related to the exemplar natural products green tea and kratom. A literature-based graph constructed from the predications was integrated into the ontology-grounded KG to create NP-KG. NP-KG was evaluated with case studies of pharmacokinetic green tea- and kratom-drug interactions through KG path searches and meta-path discovery to determine congruent and contradictory information in NP-KG compared to ground truth data. We also conducted an error analysis to identify knowledge gaps and incorrect predications in the KG. RESULTS: The fully integrated NP-KG consisted of 745,512 nodes and 7,249,576 edges. Evaluation of NP-KG resulted in congruent (38.98% for green tea, 50% for kratom), contradictory (15.25% for green tea, 21.43% for kratom), and both congruent and contradictory (15.25% for green tea, 21.43% for kratom) information compared to ground truth data. Potential pharmacokinetic mechanisms for several purported NPDIs, including the green tea-raloxifene, green tea-nadolol, kratom-midazolam, kratom-quetiapine, and kratom-venlafaxine interactions were congruent with the published literature. CONCLUSION: NP-KG is the first KG to integrate biomedical ontologies with full texts of the scientific literature focused on natural products. We demonstrate the application of NP-KG to identify known pharmacokinetic interactions between natural products and pharmaceutical drugs mediated by drug metabolizing enzymes and transporters. Future work will incorporate context, contradiction analysis, and embedding-based methods to enrich NP-KG. NP-KG is publicly available at https://doi.org/10.5281/zenodo.6814507. The code for relation extraction, KG construction, and hypothesis generation is available at https://github.com/sanyabt/np-kg.
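As a toy illustration of the KG path searches used above to surface candidate mechanisms, the sketch below runs a simple path query with networkx; the endpoint nodes and the edge predicates are hypothetical stand-ins, not actual NP-KG content or SemRep output.

```python
# Toy sketch of a KG path search between a natural product and a drug.
# Nodes and predicates are hypothetical stand-ins, not actual NP-KG content.
import networkx as nx

G = nx.DiGraph()
G.add_edge("natural_product_X", "CYP3A4", predicate="INHIBITS")   # hypothetical
G.add_edge("CYP3A4", "drug_Y", predicate="INTERACTS_WITH")        # hypothetical

for path in nx.all_simple_paths(G, source="natural_product_X", target="drug_Y", cutoff=3):
    predicates = [G.edges[u, v]["predicate"] for u, v in zip(path, path[1:])]
    print(path, predicates)
```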


Subjects
Biological Ontologies; Biological Products; Pattern Recognition, Automated; Drug Interactions; Semantics; Pharmaceutical Preparations
12.
Health Res Policy Syst ; 21(1): 25, 2023 Mar 27.
Article in English | MEDLINE | ID: mdl-36973785

ABSTRACT

BACKGROUND: Comments in PubMed are usually short papers supporting or refuting claims, or discussing methods and findings in original articles. This study aims to explore whether they can be used as a quick and reliable evidence appraisal instrument for translating research findings into practice, especially in emergencies such as COVID-19, in which evidence is missing, incomplete or uncertain. METHODS: Evidence-comment networks (ECNs) were constructed by linking COVID-19-related articles to the commentaries (letters, editorials or brief correspondence) they received. PubTator Central was used to extract entities with a high volume of comments from the titles and abstracts of the articles. Among them, six drugs were selected, and their evidence assertions were analysed by exploring the structural information in the ECNs as well as the sentiment of the comments (positive, negative, neutral). Recommendations in WHO guidelines were used as the gold-standard control to validate the consistency, coverage and efficiency of comments in reshaping clinical knowledge claims. RESULTS: The overall positive/negative sentiments of comments were aligned with recommendations for/against the corresponding treatments in the WHO guidelines. Comment topics covered all significant points of evidence appraisal and beyond. Furthermore, comments may indicate uncertainty regarding drug use in clinical practice. Half of the critical comments emerged, on average, 4.25 months before the corresponding guideline release. CONCLUSIONS: Comments have potential as a support tool for rapid evidence appraisal, as they have a selection effect, appraising the benefits, limitations and other clinical practice concerns in existing evidence. As a future direction, we suggest an appraisal framework based on comment topics and sentiment orientations to leverage the potential of scientific commentaries for supporting evidence appraisal and decision-making.
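A minimal sketch of the kind of per-treatment sentiment aggregation that underlies the analysis above; the comment records are invented, and the sentiment labels are assumed to come from an upstream manual or automatic labelling step.

```python
# Minimal sketch: tally comment sentiment per treatment in an
# evidence-comment network. Records are invented; sentiment labels are
# assumed to come from an upstream manual or automatic labelling step.
from collections import Counter, defaultdict

comments = [
    {"drug": "drug_A", "sentiment": "positive"},
    {"drug": "drug_A", "sentiment": "negative"},
    {"drug": "drug_B", "sentiment": "negative"},
    {"drug": "drug_B", "sentiment": "negative"},
    {"drug": "drug_B", "sentiment": "neutral"},
]

tally = defaultdict(Counter)
for comment in comments:
    tally[comment["drug"]][comment["sentiment"]] += 1

for drug, counts in tally.items():
    total = sum(counts.values())
    print(drug, dict(counts), f"negative share = {counts['negative'] / total:.0%}")
```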


Subjects
COVID-19; Humans; Uncertainty
13.
Res Sq ; 2023 Feb 16.
Article in English | MEDLINE | ID: mdl-36824778

ABSTRACT

Background: Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection. Results: There were 56,553 chemical mentions in the titles of 44,812 unique PubMed article abstracts. Based on our gold standard, our method of disambiguation improved entity normalization by 25.3 percentage points compared to using only the dictionary-based approach with fuzzy-string matching for disambiguation. For our Alzheimer's and dementia cohort, we were able to add 47.1% more potential mappings between MeSH and ChEBI when compared to BioPortal. Conclusion: Use of natural language models like PubMedBERT and resources such as ChEBI and PubChem provide a beneficial way to link entity mentions to ontology terms, while further supporting downstream tasks like filtering ChEBI mentions based on roles and assertions to find beneficial therapies for Alzheimer's and dementia.
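As a sketch of the dictionary-based candidate selection with fuzzy string matching that the ontology-enhanced disambiguation step is compared against, the standard-library example below retrieves close matches for a misspelled mention; the term list is a placeholder, not ChEBI content.

```python
# Sketch of fuzzy-string candidate selection against a term dictionary,
# using only the standard library. The term list is a placeholder, not
# actual ChEBI labels; real candidate selection would map terms to ChEBI IDs.
import difflib

chebi_labels = ["donepezil", "memantine", "galantamine", "acetylcholine"]

mention = "donepizil"  # misspelled chemical mention from an abstract
candidates = difflib.get_close_matches(mention, chebi_labels, n=3, cutoff=0.8)
print(candidates)
```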

14.
AMIA Jt Summits Transl Sci Proc ; 2022: 254-263, 2022.
Article in English | MEDLINE | ID: mdl-35854729

ABSTRACT

Lack of large quantities of annotated data is a major barrier to developing effective text mining models of the biomedical literature. In this study, we explored weak supervision to improve the accuracy of text classification models for assessing the methodological transparency of randomized controlled trial (RCT) publications. Specifically, we used Snorkel, a framework for programmatically building training sets, and UMLS-EDA, a data augmentation method that leverages a small number of labeled examples to generate new training instances, and assessed their effect on a BioBERT-based text classification model proposed for the task in previous work. Performance improvements due to weak supervision were limited and were surpassed by gains from hyperparameter tuning. Our analysis suggests that refinements to the weak supervision strategies to better handle the multi-label case could be beneficial. Our code and data are available at https://github.com/kilicogluh/CONSORT-TM/tree/master/weakSupervision.
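A compact sketch of the Snorkel pattern referred to above, assuming the snorkel and pandas packages: heuristic labeling functions vote on unlabeled sentences, and a label model combines their votes. The rules, labels, and sentences are toy placeholders, not the study's labeling functions.

```python
# Compact sketch of the Snorkel weak-supervision pattern: heuristic labeling
# functions vote on unlabeled text, and a label model combines the votes.
# Rules, labels, and sentences are toy placeholders, not the study's setup.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_REPORTED, REPORTED = -1, 0, 1

@labeling_function()
def lf_mentions_random(x):
    return REPORTED if "random" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_blinding(x):
    return REPORTED if "blind" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": [
    "Participants were randomly assigned to two groups.",
    "Outcome assessors were blinded to allocation.",
    "The questionnaire took 20 minutes to complete.",
]})

L = PandasLFApplier(lfs=[lf_mentions_random, lf_mentions_blinding]).apply(df=df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100, seed=0)
print(label_model.predict(L=L))
```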

16.
BMC Res Notes ; 15(1): 203, 2022 Jun 11.
Article in English | MEDLINE | ID: mdl-35690782

ABSTRACT

The rising rate of preprints and publications, combined with persistent inadequate reporting practices and problems with study design and execution, have strained the traditional peer review system. Automated screening tools could potentially enhance peer review by helping authors, journal editors, and reviewers to identify beneficial practices and common problems in preprints or submitted manuscripts. Tools can screen many papers quickly, and may be particularly helpful in assessing compliance with journal policies and with straightforward items in reporting guidelines. However, existing tools cannot understand or interpret the paper in the context of the scientific literature. Tools cannot yet determine whether the methods used are suitable to answer the research question, or whether the data support the authors' conclusions. Editors and peer reviewers are essential for assessing journal fit and the overall quality of a paper, including the experimental design, the soundness of the study's conclusions, potential impact and innovation. Automated screening tools cannot replace peer review, but may aid authors, reviewers, and editors in improving scientific papers. Strategies for responsible use of automated tools in peer review may include setting performance criteria for tools, transparently reporting tool performance and use, and training users to interpret reports.


Subjects
Editorial Policies; Peer Review, Research; Research Design; Research Report
17.
J Biomed Inform ; 131: 104120, 2022 07.
Article in English | MEDLINE | ID: mdl-35709900

ABSTRACT

OBJECTIVE: To develop a novel methodology for creating a comprehensive knowledge graph (SuppKG) that represents a domain with limited coverage in the Unified Medical Language System (UMLS), specifically dietary supplement (DS) information for discovering drug-supplement interactions (DSI), by leveraging biomedical natural language processing (NLP) technologies and a DS domain terminology. MATERIALS AND METHODS: We created SemRepDS (an extension of the NLP tool SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT model to remove incorrect relations before generating SuppKG. Two discovery pathways were applied to SuppKG to identify potential DSIs, which were then compared with an existing DSI database and evaluated by medical professionals for mechanistic plausibility. RESULTS: SemRepDS returned 158.5% more DS entities and 206.9% more DS relations than SemRep. The fine-tuned PubMedBERT model (which significantly outperformed other machine learning and BERT models) obtained an F1 score of 0.8605 and removed 43.86% of the semantic relations, improving the precision of the remaining relations by 26.4% over pre-filtering. SuppKG consists of 56,635 nodes and 595,222 directed edges, with 2,928 DS-specific nodes and 164,738 edges. Manual review of findings identified 182 of 250 (72.8%) proposed DS-Gene-Drug and 77 of 100 (77%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. DISCUSSION: With DS terminology added to the UMLS, SemRepDS is able to find more DS-specific semantic relations in PubMed than SemRep. The utility of the resulting SuppKG was demonstrated using discovery patterns to find novel DSIs. CONCLUSION: For a domain with limited coverage in traditional terminologies (e.g., the UMLS), we demonstrated an approach that leverages a domain terminology and improves existing NLP tools to generate a more comprehensive knowledge graph for downstream tasks. Although this study focuses on DSI, the method may be adapted to other domains.


Subjects
Natural Language Processing; Unified Medical Language System; Dietary Supplements; PubMed; Semantics
18.
AMIA Annu Symp Proc ; 2022: 542-551, 2022.
Article in English | MEDLINE | ID: mdl-37128457

ABSTRACT

Most biomedical information extraction (IE) approaches focus on entity types such as diseases, drugs, and genes, and relations such as gene-disease associations. In this paper, we introduce the task of methodological IE to support fine-grained quality assessment of randomized controlled trial (RCT) publications. We draw from the Ontology of Clinical Research (OCRe) and the CONSORT reporting guidelines for RCTs to create a categorization of relevant methodological characteristics. In a pilot annotation study, we annotate a corpus of 70 full-text publications with these characteristics. We also train baseline named entity recognition (NER) models to recognize these items in RCT publications using several training sets with different negative sampling strategies. We evaluate the models at span and document levels. Our results show that it is feasible to use natural language processing (NLP) and machine learning for fine-grained extraction of methodological information. We propose that our models, after improvements, can support assessment of methodological quality in RCT publications. Our annotated corpus, models, and code are publicly available at https://github.com/kellyhoang0610/RCTMethodologyIE.


Subjects
Machine Learning; Natural Language Processing; Humans; Pilot Projects; Information Storage and Retrieval
19.
Article in English | MEDLINE | ID: mdl-34541578

ABSTRACT

Preventable adverse events as a result of medical errors present a growing concern in the healthcare system. As drug-drug interactions (DDIs) may lead to preventable adverse events, being able to extract DDIs from drug labels into a machine-processable form is an important step toward effective dissemination of drug safety information. Herein, we tackle the problem of jointly extracting mentions of drugs and their interactions, including interaction outcome, from drug labels. Our deep learning approach entails composing various intermediate representations, including graph-based context derived using graph convolutions (GCs) with a novel attention-based gating mechanism (holistically called GCA), which are combined in meaningful ways to predict on all subtasks jointly. Our model is trained and evaluated on the 2018 TAC DDI corpus. Our GCA model in conjunction with transfer learning performs at 39.20% F1 and 26.09% F1 on entity recognition (ER) and relation extraction (RE), respectively, on the first official test set and at 45.30% F1 and 27.87% F1 on ER and RE, respectively, on the second official test set. These updated results lead to improvements over our prior best by up to 6 absolute F1 points. After controlling for available training data, the proposed model exhibits state-of-the-art performance for this task.

20.
J Biomed Inform ; 116: 103717, 2021 04.
Article in English | MEDLINE | ID: mdl-33647518

ABSTRACT

OBJECTIVE: To annotate a corpus of randomized controlled trial (RCT) publications with the checklist items of CONSORT reporting guidelines and using the corpus to develop text mining methods for RCT appraisal. METHODS: We annotated a corpus of 50 RCT articles at the sentence level using 37 fine-grained CONSORT checklist items. A subset (31 articles) was double-annotated and adjudicated, while 19 were annotated by a single annotator and reconciled by another. We calculated inter-annotator agreement at the article and section level using MASI (Measuring Agreement on Set-Valued Items) and at the CONSORT item level using Krippendorff's α. We experimented with two rule-based methods (phrase-based and section header-based) and two supervised learning approaches (support vector machine and BioBERT-based neural network classifiers), for recognizing 17 methodology-related items in the RCT Methods sections. RESULTS: We created CONSORT-TM consisting of 10,709 sentences, 4,845 (45%) of which were annotated with 5,246 labels. A median of 28 CONSORT items (out of possible 37) were annotated per article. Agreement was moderate at the article and section levels (average MASI: 0.60 and 0.64, respectively). Agreement varied considerably among individual checklist items (Krippendorff's α= 0.06-0.96). The model based on BioBERT performed best overall for recognizing methodology-related items (micro-precision: 0.82, micro-recall: 0.63, micro-F1: 0.71). Combining models using majority vote and label aggregation further improved precision and recall, respectively. CONCLUSION: Our annotated corpus, CONSORT-TM, contains more fine-grained information than earlier RCT corpora. Low frequency of some CONSORT items made it difficult to train effective text mining models to recognize them. For the items commonly reported, CONSORT-TM can serve as a testbed for text mining methods that assess RCT transparency, rigor, and reliability, and support methods for peer review and authoring assistance. Minor modifications to the annotation scheme and a larger corpus could facilitate improved text mining models. CONSORT-TM is publicly available at https://github.com/kilicogluh/CONSORT-TM.
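For the item-level agreement figures above, here is a small sketch of computing Krippendorff's α for nominal labels, assuming the third-party krippendorff package; the two annotators' label matrix is invented.

```python
# Small sketch of the item-level agreement statistic reported above
# (Krippendorff's alpha for nominal labels), using the third-party
# `krippendorff` package. The two annotators' labels are invented.
import numpy as np
import krippendorff

# Rows are annotators, columns are sentences; values are label codes
# (np.nan marks a sentence one annotator did not label).
reliability_data = np.array([
    [1, 2, 2, 3, np.nan, 1],
    [1, 2, 3, 3, 2,      1],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```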


Subjects
Checklist; Serial Publications/standards; Support Vector Machine; Humans; Randomized Controlled Trials as Topic