Search | VHL Search Portal

1.

Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation.

Chen, Jiyu; Goudey, Benjamin; Geard, Nicholas; Verspoor, Karin.

Bioinformatics ; 40(Supplement_1): i390-i400, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940182

ABSTRACT

MOTIVATION: Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process. RESULTS: We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge. AVAILABILITY AND IMPLEMENTATION: https://github.com/jiyuc/de-inconsistency.

Subjects

Gene Ontology , Molecular Sequence Annotation , Molecular Sequence Annotation/methods , Databases, Genetic , Computational Biology/methods , Semantics , Humans

2.

'Fighting fire with fire' - using LLMs to combat LLM hallucinations.

Verspoor, Karin.

Nature ; 630(8017): 569-570, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38898288

3.

Propagation, detection and correction of errors using the sequence database network.

Goudey, Benjamin; Geard, Nicholas; Verspoor, Karin; Zobel, Justin.

Brief Bioinform ; 23(6)2022 11 19.

Article in English | MEDLINE | ID: mdl-36266246

ABSTRACT

Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.

Subjects

Computational Biology , Databases, Nucleic Acid , Amino Acid Sequence

4.

The gut microbiome is a significant risk factor for future chronic lung disease.

Liu, Yang; Teo, Shu Mei; Méric, Guillaume; Tang, Howard H F; Zhu, Qiyun; Sanders, Jon G; Vázquez-Baeza, Yoshiki; Verspoor, Karin; Vartiainen, Ville A; Jousilahti, Pekka; Lahti, Leo; Niiranen, Teemu; Havulinna, Aki S; Knight, Rob; Salomaa, Veikko; Inouye, Michael.

J Allergy Clin Immunol ; 151(4): 943-952, 2023 04.

Article in English | MEDLINE | ID: mdl-36587850

ABSTRACT

BACKGROUND: The gut-lung axis is generally recognized, but there are few large studies of the gut microbiome and incident respiratory disease in adults. OBJECTIVE: We sought to investigate the association and predictive capacity of the gut microbiome for incident asthma and chronic obstructive pulmonary disease (COPD). METHODS: Shallow metagenomic sequencing was performed for stool samples from a prospective, population-based cohort (FINRISK02; N = 7115 adults) with linked national administrative health register-derived classifications for incident asthma and COPD up to 15 years after baseline. Generalized linear models and Cox regressions were used to assess associations of microbial taxa and diversity with disease occurrence. Predictive models were constructed using machine learning with extreme gradient boosting. Models considered taxa abundances individually and in combination with other risk factors, including sex, age, body mass index, and smoking status. RESULTS: A total of 695 and 392 statistically significant associations were found between baseline taxonomic groups and incident asthma and COPD, respectively. Gradient boosting decision trees of baseline gut microbiome abundance predicted incident asthma and COPD in the validation data sets with mean area under the curves of 0.608 and 0.780, respectively. Cox analysis showed that the baseline gut microbiome achieved higher predictive performance than individual conventional risk factors, with C-indices of 0.623 for asthma and 0.817 for COPD. The integration of the gut microbiome and conventional risk factors further improved prediction capacities. CONCLUSIONS: The gut microbiome is a significant risk factor for incident asthma and incident COPD and is largely independent of conventional risk factors.
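The C-indices quoted in this abstract measure how well a risk score orders individuals by time to event. As an illustration only, the following plain-Python sketch computes Harrell's concordance index; it is not the study's analysis code, and the function name and toy inputs are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs that the risk score orders correctly.

    times  -- follow-up time for each individual
    events -- 1 if the event was observed, 0 if the individual was censored
    risks  -- predicted risk score (higher means an earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored individual cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.623 and 0.817 should be read.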

Subjects

Asthma , Gastrointestinal Microbiome , Pulmonary Disease, Chronic Obstructive , Adult , Humans , Prospective Studies , Risk Factors

5.

Evaluation of consensus strategies for haplotype phasing.

Al Bkhetan, Ziad; Chana, Gursharan; Ramamohanarao, Kotagiri; Verspoor, Karin; Goudey, Benjamin.

Brief Bioinform ; 22(4)2021 07 20.

Article in English | MEDLINE | ID: mdl-33236761

ABSTRACT

Haplotype phasing is a critical step for many genetic applications but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any individual estimate. However, such a strategy is yet to be thoroughly explored. This study provides a comprehensive evaluation of consensus strategies for haplotype phasing. We explore the performance of different consensus paradigms, and the effect of specific constituent tools, across several datasets with different characteristics and their impact on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two different strategies to construct haplotype consensus estimators: voting across outputs from multiple phasing tools and multiple outputs of a single non-deterministic tool. We find that the consensus approach from multiple tools reduces SE by an average of 10% compared to any constituent tool when applied to European populations and has the highest accuracy regardless of population ethnicity, sample size, variant density or variant frequency. Furthermore, the consensus estimator improves the accuracy of the downstream task of genotype imputation carried out by the widely used Minimac3, pbwt and BEAGLE5 tools. Our results provide guidance on how to produce the most accurate phasing estimates and the trade-offs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is available freely at https://github.com/ziadbkh/consHap. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
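The idea of voting across the outputs of several phasing tools can be sketched as follows. This is a simplified illustration under stated assumptions, not the consHap implementation: each tool's output for one individual is assumed to be a list of 0/1 alleles on the first haplotype at each heterozygous site, and the vote is taken on whether the phase switches between adjacent sites, so that the arbitrary labelling of the two haplotypes does not matter. The function name is hypothetical.

```python
from collections import Counter

def consensus_phase(phasings):
    """Majority-vote consensus over phasings of the same heterozygous sites.

    phasings -- list of equal-length lists of 0/1 (allele on haplotype 1 per site).
    Returns one consensus list, oriented so that the first site is 0.
    """
    n_sites = len(phasings[0])
    if any(len(p) != n_sites for p in phasings):
        raise ValueError("all phasings must cover the same sites")
    if n_sites == 0:
        return []
    consensus = [0]  # orientation is arbitrary, so fix the first site
    for i in range(1, n_sites):
        # Each tool votes on whether the phase flips between site i-1 and site i.
        votes = Counter(p[i] != p[i - 1] for p in phasings)
        flip = votes[True] > votes[False]
        consensus.append(consensus[-1] ^ 1 if flip else consensus[-1])
    return consensus
```

Using an odd number of inputs avoids tied votes; in this sketch a tie is resolved as "no flip".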

Subjects

Algorithms , Databases, Nucleic Acid , Polymorphism, Single Nucleotide , Sequence Analysis, DNA , Haplotypes , Humans

6.

Exploring automatic inconsistency detection for literature-based gene ontology annotation.

Chen, Jiyu; Goudey, Benjamin; Zobel, Justin; Geard, Nicholas; Verspoor, Karin.

Bioinformatics ; 38(Suppl 1): i273-i281, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758780

ABSTRACT

MOTIVATION: Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. RESULTS: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available on GitHub at https://github.com/jiyuc/AutoGOAConsistency.

Subjects

Publications , Gene Ontology , Molecular Sequence Annotation

7.

MPVNN: Mutated Pathway Visible Neural Network architecture for interpretable prediction of cancer-specific survival risk.

Ghosh Roy, Gourab; Geard, Nicholas; Verspoor, Karin; He, Shan.

Bioinformatics ; 38(22): 5026-5032, 2022 11 15.

Article in English | MEDLINE | ID: mdl-36124954

ABSTRACT

MOTIVATION: Survival risk prediction using gene expression data is important in making treatment decisions in cancer. Standard neural network (NN) survival analysis models are black boxes with a lack of interpretability. More interpretable visible neural network architectures are designed using biological pathway knowledge. But they do not model how pathway structures can change for particular cancer types. RESULTS: We propose a novel Mutated Pathway Visible Neural Network (MPVNN) architecture, designed using prior signaling pathway knowledge and random replacement of known pathway edges using gene mutation data simulating signal flow disruption. As a case study, we use the PI3K-Akt pathway and demonstrate overall improved cancer-specific survival risk prediction of MPVNN over other similar-sized NN and standard survival analysis methods. We show that trained MPVNN architecture interpretation, which points to smaller sets of genes connected by signal flow within the PI3K-Akt pathway that is important in risk prediction for particular cancer types, is reliable. AVAILABILITY AND IMPLEMENTATION: The data and code are available at https://github.com/gourabghoshroy/MPVNN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Neoplasms , Phosphatidylinositol 3-Kinases , Humans , Phosphatidylinositol 3-Kinases/genetics , Proto-Oncogene Proteins c-akt , Neural Networks, Computer , Neoplasms/genetics , Mutation

8.

Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease.

Pu, Yiyuan; Beck, Daniel; Verspoor, Karin.

J Biomed Inform ; 145: 104464, 2023 09.

Article in English | MEDLINE | ID: mdl-37541406

ABSTRACT

OBJECTIVE: We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology. METHODS: We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed. RESULTS: We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ~11 k nodes and ~394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years.
A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation. CONCLUSION: Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases. AVAILABILITY: Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
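The time-sliced evaluation and the graph-density figure mentioned in this abstract can be illustrated with a small sketch. It assumes, hypothetically, that each edge is recorded as `(concept_a, concept_b, year)` with the year in which the relation appears in the corpus; the function names and this data layout are not taken from the authors' repository.

```python
def time_slice(edges, cutoff_year, window):
    """Split dated edges into a training graph and the links to be predicted.

    edges       -- iterable of (node_a, node_b, year) tuples
    cutoff_year -- last year whose edges are visible at training time
    window      -- prediction window length in years
    Returns (train_pairs, test_pairs) as sets of unordered node pairs.
    """
    train, future = set(), set()
    for a, b, year in edges:
        pair = frozenset((a, b))
        if year <= cutoff_year:
            train.add(pair)
        elif year <= cutoff_year + window:
            future.add(pair)
    # Only links that are genuinely new within the window count as targets.
    return train, future - train

def density(num_nodes, num_edges):
    """Density of an undirected simple graph: edges present / edges possible."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))
```

With the approximate totals quoted in the abstract, `density(11_000, 394_000)` is about 0.0065, in line with the statement that graph densities are below 1%. Lengthening `window` enlarges the set of target links, which is one reason results can differ so much between short and long prediction windows.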

Subjects

Alzheimer Disease , Knowledge Discovery , Humans , Knowledge Discovery/methods , Alzheimer Disease/diagnosis , Neural Networks, Computer , Learning , Phenotype

9.

Attention-based multimodal fusion with contrast for robust clinical prediction in the face of missing modalities.

Liu, Jinghui; Capurro, Daniel; Nguyen, Anthony; Verspoor, Karin.

J Biomed Inform ; 145: 104466, 2023 09.

Article in English | MEDLINE | ID: mdl-37549722

ABSTRACT

OBJECTIVE: With the increasing amount and growing variety of healthcare data, multimodal machine learning supporting integrated modeling of structured and unstructured data is an increasingly important tool for clinical machine learning tasks. However, it is non-trivial to manage the differences in dimensionality, volume, and temporal characteristics of data modalities in the context of a shared target task. Furthermore, patients can have substantial variations in the availability of data, while existing multimodal modeling methods typically assume data completeness and lack a mechanism to handle missing modalities. METHODS: We propose a Transformer-based fusion model with modality-specific tokens that summarize the corresponding modalities to achieve effective cross-modal interaction accommodating missing modalities in the clinical context. The model is further refined by inter-modal, inter-sample contrastive learning to improve the representations for better predictive performance. We denote the model as Attention-based cRoss-MOdal fUsion with contRast (ARMOUR). We evaluate ARMOUR using two input modalities (structured measurements and unstructured text), six clinical prediction tasks, and two evaluation regimes, either including or excluding samples with missing modalities. RESULTS: Our model shows improved performances over unimodal or multimodal baselines in both evaluation regimes, including or excluding patients with missing modalities in the input. The contrastive learning improves the representation power and is shown to be essential for better results. The simple setup of modality-specific tokens enables ARMOUR to handle patients with missing modalities and allows comparison with existing unimodal benchmark results. CONCLUSION: We propose a multimodal model for robust clinical prediction to achieve improved performance while accommodating patients with missing modalities. 
This work could inspire future research to study the effective incorporation of multiple, more complex modalities of clinical data into a single model.

Subjects

Benchmarking , Machine Learning , Humans

10.

Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept-level annotations.

Rozova, Vlada; Khanina, Anna; Teng, Jasmine C; Teh, Joanne S K; Worth, Leon J; Slavin, Monica A; Thursky, Karin A; Verspoor, Karin.

J Biomed Inform ; 139: 104293, 2023 03.

Article in English | MEDLINE | ID: mdl-36682389

ABSTRACT

Invasive fungal infections (IFIs) are particularly dangerous to high-risk patients with haematological malignancies and are responsible for excessive mortality and delays in cancer therapy. Surveillance of IFI in clinical settings offers an opportunity to identify potential risk factors and evaluate new therapeutic strategies. However, manual surveillance is both time- and resource-intensive. As part of a broader project aimed to develop a system for automated IFI surveillance by leveraging electronic medical records, we present our approach to detecting evidence of IFI in the key diagnostic domain of histopathology. Using natural language processing (NLP), we analysed cytology and histopathology reports to identify IFI-positive reports. We compared a conventional bag-of-words classification model to a method that relies on concept-level annotations. Although the investment to prepare data supporting concept annotations is substantial, extracting targeted information specific to IFI as a pre-processing step increased the performance of the classifier from the PR AUC of 0.84 to 0.92 and enabled model interpretability. We have made publicly available the annotated dataset of 283 reports, the Cytology and Histopathology IFI Reports corpus (CHIFIR), to allow the clinical NLP research community to further build on our results.

Subjects

Invasive Fungal Infections , Humans , Invasive Fungal Infections/epidemiology , Electronic Health Records , Natural Language Processing , Risk Factors

11.

Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study.

Suster, Simon; Baldwin, Timothy; Lau, Jey Han; Jimeno Yepes, Antonio; Martinez Iraola, David; Otmakhova, Yulia; Verspoor, Karin.

J Med Internet Res ; 25: e35568, 2023 03 13.

Article in English | MEDLINE | ID: mdl-36722350

ABSTRACT

BACKGROUND: Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. OBJECTIVE: We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). METHODS: We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. RESULTS: Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. 
On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=.68; R=0.92) and imprecision at 0.75 F1 (P=.66; R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information that is provided as an input to the models. CONCLUSIONS: Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.
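The F1 values reported above follow from the quoted precision (P) and recall (R) through the harmonic mean, F1 = 2PR / (P + R). A short check, with a hypothetical helper name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Figures quoted in the abstract for the two best-detected criteria.
risk_of_bias = f1_score(0.68, 0.92)   # approximately 0.78
imprecision = f1_score(0.66, 0.86)    # approximately 0.75
```

Both results round to the F1 scores stated in the abstract, so the precision, recall and F1 figures are mutually consistent.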

Subjects

Machine Learning , Humans , Systematic Reviews as Topic , Bias

12.

Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT.

Elangovan, Aparna; Li, Yuan; Pires, Douglas E V; Davis, Melissa J; Verspoor, Karin.

BMC Bioinformatics ; 23(1): 4, 2022 Jan 04.

Article in English | MEDLINE | ID: mdl-34983371

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using deep learning on distantly supervised training data to aid human curation. METHOD: We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the use of ensemble average confidence approach with confidence variation to counteract the effects of class imbalance to extract high confidence predictions. RESULTS AND CONCLUSION: The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filtered [Formula: see text] (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration and highlights the challenges of generalisability beyond the test set even with confidence calibration.
We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
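Selecting predictions by combining high ensemble-average confidence with low variation across ensemble members can be sketched as below. This is a minimal illustration, not the PPI-BioBERT-x10 code: the input layout (one list of class probabilities per ensemble member per example), the function name, and both thresholds are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def select_confident(ensemble_probs, min_mean=0.9, max_std=0.05):
    """Keep predictions whose ensemble agrees confidently and consistently.

    ensemble_probs -- for each example, a list with one probability vector
                      per ensemble member
    min_mean       -- hypothetical floor on the average confidence
    max_std        -- hypothetical ceiling on the spread across members
    Returns a list of (example_index, label, mean_confidence, std).
    """
    selected = []
    for idx, member_probs in enumerate(ensemble_probs):
        n_classes = len(member_probs[0])
        # Average each class probability over the ensemble members.
        avg = [mean([m[c] for m in member_probs]) for c in range(n_classes)]
        label = max(range(n_classes), key=lambda c: avg[c])
        # Spread of the members' confidence in the winning class.
        spread = pstdev([m[label] for m in member_probs])
        if avg[label] >= min_mean and spread <= max_std:
            selected.append((idx, label, avg[label], spread))
    return selected
```

Raising `min_mean` or lowering `max_std` trades recall for precision, which mirrors the abstract's observation that only a fraction of test predictions is retained when tuning for precision.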

Subjects

Data Mining , Protein Processing, Post-Translational , Humans , Proteins , PubMed

13.

PoLoBag: Polynomial Lasso Bagging for signed gene regulatory network inference from expression data.

Ghosh Roy, Gourab; Geard, Nicholas; Verspoor, Karin; He, Shan.

Bioinformatics ; 36(21): 5187-5193, 2021 01 29.

Article in English | MEDLINE | ID: mdl-32697830

ABSTRACT

MOTIVATION: Inferring gene regulatory networks (GRNs) from expression data is a significant systems biology problem. A useful inference algorithm should not only unveil the global structure of the regulatory mechanisms but also the details of regulatory interactions such as edge direction (from regulator to target) and sign (activation/inhibition). Many popular GRN inference algorithms cannot infer edge signs, and those that can infer signed GRNs cannot simultaneously infer edge directions or network cycles. RESULTS: To address these limitations of existing algorithms, we propose Polynomial Lasso Bagging (PoLoBag) for signed GRN inference with both edge directions and network cycles. PoLoBag is an ensemble regression algorithm in a bagging framework where Lasso weights estimated on bootstrap samples are averaged. These bootstrap samples incorporate polynomial features to capture higher-order interactions. Results demonstrate that PoLoBag is consistently more accurate for signed inference than state-of-the-art algorithms on simulated and real-world expression datasets. AVAILABILITY AND IMPLEMENTATION: Algorithm and data are freely available at https://github.com/gourabghoshroy/PoLoBag. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Gene Regulatory Networks , Computational Biology , Systems Biology

14.

"Note Bloat" impacts deep learning-based NLP models for clinical prediction tasks.

Liu, Jinghui; Capurro, Daniel; Nguyen, Anthony; Verspoor, Karin.

J Biomed Inform ; 133: 104149, 2022 09.

Article in English | MEDLINE | ID: mdl-35878821

ABSTRACT

One unintended consequence of the Electronic Health Records (EHR) implementation is the overuse of content-importing technology, such as copy-and-paste, that creates "bloated" notes containing large amounts of textual redundancy. Despite the rising interest in applying machine learning models to learn from real-patient data, it is unclear how the phenomenon of note bloat might affect the Natural Language Processing (NLP) models derived from these notes. Therefore, in this work we examine the impact of redundancy on deep learning-based NLP models, considering four clinical prediction tasks using a publicly available EHR database. We applied two deduplication methods to the hospital notes, identifying large quantities of redundancy, and found that removing the redundancy usually has little negative impact on downstream performances, and can in certain circumstances assist models to achieve significantly better results. We also showed it is possible to attack model predictions by simply adding note duplicates, causing changes of correct predictions made by trained models into wrong predictions. In conclusion, we demonstrated that EHR text redundancy substantively affects NLP models for clinical prediction tasks, showing that the awareness of clinical contexts and robust modeling methods are important to create effective and reliable NLP systems in healthcare contexts.
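The abstract does not describe its two deduplication methods, so the following is only a generic sketch of one way to strip copy-and-paste redundancy: within a chronologically ordered list of notes for one patient, a line is dropped if an identical line (ignoring case and whitespace differences) has already appeared. The function name and the line-level granularity are assumptions.

```python
def deduplicate_notes(notes):
    """Remove lines that repeat earlier content across a patient's notes.

    notes -- list of note texts in chronological order
    Returns a list of the same length with repeated lines removed.
    """
    seen = set()
    cleaned = []
    for note in notes:
        kept = []
        for line in note.splitlines():
            key = " ".join(line.lower().split())  # normalise case and spacing
            if not key or key in seen:
                continue  # skip blank lines and content already seen
            seen.add(key)
            kept.append(line.strip())
        cleaned.append("\n".join(kept))
    return cleaned
```

For example, if a second note re-pastes the admission sentence from the first and adds one new finding, only the new finding survives in the second note.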

Subjects

Deep Learning , Natural Language Processing , Electronic Health Records , Humans , Machine Learning

15.

Use of a Victorian statewide surveillance programme to evaluate the burden of healthcare-associated Staphylococcus aureus bacteraemia and Clostridioides difficile infection in patients with cancer.

Valentine, Jake C; Hall, Lisa; Verspoor, Karin M; Gillespie, Elizabeth; Worth, Leon J.

Intern Med J ; 52(7): 1215-1224, 2022 07.

Article in English | MEDLINE | ID: mdl-33755285

ABSTRACT

BACKGROUND: Patients with cancer are at high risk for infection, but the epidemiology of healthcare-associated Staphylococcus aureus bacteraemia (HA-SAB) and Clostridioides difficile infection (HA-CDI) in Australian cancer patients has not previously been reported. AIMS: To compare the cumulative aggregate incidence and time trends of HA-SAB and HA-CDI in a predefined cancer cohort with a mixed statewide patient population in Victoria, Australia. METHODS: All SAB and CDI events in patients admitted to Victorian healthcare facilities between 1 July 2010 and 31 December 2018 were submitted to the Victorian Healthcare Associated Infection Surveillance System Coordinating Centre. Descriptive analyses and multilevel mixed-effects Poisson regression modelling were applied to a standardised data extract. RESULTS: In total, 10 608 and 13 118 SAB and CDI events were reported across 139 Victorian healthcare facilities, respectively. Of these, 89 (85%) and 279 (88%) were healthcare-associated in the cancer cohort compared with 34% (3561/10 503) and 66% (8403/12 802) in the statewide cohort. The aggregate incidence was more than twofold higher in the cancer cohort compared with the statewide cohort for HA-SAB (2.25 (95% confidence interval (CI): 1.74-2.77) vs 1.11 (95% CI: 1.07-1.15) HA-SAB/10 000 occupied bed-days) and threefold higher for HA-CDI (6.26 (95% CI: 5.12-7.41) vs 2.31 (95% CI: 2.21-2.42) HA-CDI/10 000 occupied bed-days). Higher quarterly diminishing rates were observed in the cancer cohort than the statewide data for both infections. CONCLUSIONS: Our findings demonstrate a higher burden of HA-SAB and HA-CDI in a cancer cohort when compared with state data and highlight the need for cancer-specific targets and benchmarks to meaningfully support quality improvement.

Subjects

Bacteremia; Clostridium Infections; Cross Infection; Neoplasms; Staphylococcal Infections; Bacteremia/epidemiology; Clostridium Infections/epidemiology; Cross Infection/epidemiology; Delivery of Health Care; Humans; Neoplasms/epidemiology; Staphylococcal Infections/epidemiology; Staphylococcus aureus; Victoria/epidemiology

16.

Predicting Publication of Clinical Trials Using Structured and Unstructured Data: Model Development and Validation Study.

Wang, Siyang; Suster, Simon; Baldwin, Timothy; Verspoor, Karin.

J Med Internet Res ; 24(12): e38859, 2022 12 23.

Article in English | MEDLINE | ID: mdl-36563029

ABSTRACT

BACKGROUND: Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial's publishability given an individual (planned) clinical trial description. OBJECTIVE: We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes. METHODS: In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. 
We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text. RESULTS: First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two. CONCLUSIONS: Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.

Subjects

Language; Research Design; Humans

17.

Automatic consistency assurance for literature-based gene ontology annotation.

Chen, Jiyu; Geard, Nicholas; Zobel, Justin; Verspoor, Karin.

BMC Bioinformatics ; 22(1): 565, 2021 Nov 25.

Article in English | MEDLINE | ID: mdl-34823464

ABSTRACT

BACKGROUND: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. RESULTS: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. CONCLUSIONS: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

Subjects

Computational Biology; Data Mining; Databases, Protein; Gene Ontology; Humans; Molecular Sequence Annotation

18.

Burden and clinical outcomes of hospital-coded infections in patients with cancer: an 11-year longitudinal cohort study at an Australian cancer centre.

Valentine, Jake C; Hall, Lisa; Spelman, Tim; Verspoor, Karin M; Seymour, John F; Rischin, Danny; Thursky, Karin A; Slavin, Monica A; Worth, Leon J.

Support Care Cancer ; 28(12): 6023-6034, 2020 Dec.

Article in English | MEDLINE | ID: mdl-32291600

ABSTRACT

PURPOSE: Patients with cancer are at increased risk for infection, but the relative morbidity and mortality of all infections is not well understood. The objectives of this study were to determine the prevalence, incidence, time-trends and risk of mortality of infections associated with hospital admissions in patients with haematological- and solid-tumour malignancies over 11 years. METHODS: A retrospective, longitudinal cohort study of inpatient admissions between 1 January 2007 and 31 December 2017 at the Peter MacCallum Cancer Centre was conducted using administratively coded and patient demographics data. Descriptive analyses, autoregressive integrated moving average, Kaplan-Meier and Cox regression modelling were applied. RESULTS: Of 45,116 inpatient hospitalisations consisting of 3033 haematological malignancy (HM), 18,372 solid tumour neoplasm (STN) patients and 953 autologous haematopoietic stem cell transplantation recipients, 67%, 29% and 88% were coded with ≥ 1 infection, respectively. Gastrointestinal tract and bloodstream infections were observed with the highest incidence, and bloodstream infection rates increased significantly over time in both HM- and STN-cohorts. Inpatient length of stay was significantly higher in exposed patients with coded infection compared to unexposed in HM- and STN-cohorts (22 versus 4 days [p < 0.001] and 15 versus 4 days [p < 0.001], respectively). Risk of in-hospital mortality was higher in exposed than unexposed patients in the STN-cohort (adjusted hazard ratio [aHR] 1.61 [95% CI 1.41-1.83]; p < 0.001) and HM-cohort (aHR 1.30 [95% CI 0.90-1.90]; p = 0.166). CONCLUSION: Infection burden among cancer patients is substantial and findings reflect the need for targeted surveillance in high-risk patient groups (e.g. haematological malignancy), in whom enhanced monitoring may be required to support infection prevention strategies.

Subjects

Cross Infection/epidemiology; Neoplasms/epidemiology; Adolescent; Adult; Aged; Aged, 80 and over; Australia/epidemiology; Cancer Care Facilities; Cohort Studies; Cross Infection/diagnosis; Female; Hospital Mortality; Humans; Incidence; Longitudinal Studies; Male; Middle Aged; Neoplasms/complications; Neoplasms/diagnosis; Prognosis; Retrospective Studies; Young Adult

19.

From POS tagging to dependency parsing for biomedical event extraction.

Nguyen, Dat Quoc; Verspoor, Karin.

BMC Bioinformatics ; 20(1): 72, 2019 Feb 12.

Article in English | MEDLINE | ID: mdl-30755172

ABSTRACT

BACKGROUND: Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance. RESULTS: We perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core natural language processing tasks of part-of-speech (POS) tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically no detailed analysis of neural models on this data is available. Experimental results show that in general, the neural models outperform the feature-based models on two benchmark biomedical corpora GENIA and CRAFT. We also perform a task-oriented evaluation to investigate the influences of these models in a downstream application on biomedical event extraction, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance. CONCLUSION: We have presented a detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context, and also investigated the influence of parser selection for a biomedical event extraction downstream task. AVAILABILITY OF DATA AND MATERIALS: We make the retrained models available at https://github.com/datquocnguyen/BioPosDep .

Subjects

Biomedical Research; Information Storage and Retrieval; Speech; Algorithms; Humans; Natural Language Processing; Neural Networks, Computer; Publications; Vocabulary

20.

Automated assessment of biological database assertions using the scientific literature.

Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Karin.

BMC Bioinformatics ; 20(1): 216, 2019 Apr 29.

Article in English | MEDLINE | ID: mdl-31035936

ABSTRACT

BACKGROUND: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.

Subjects

Algorithms; Data Mining/methods; Databases, Factual; Databases, Nucleic Acid; Humans; Protein Interaction Maps; Publishing
