Pesquisa | BVS Doenças Infecciosas e Parasitárias

1.

Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness.

Zhang, Gongbo; Jin, Qiao; Jered McInerney, Denis; Chen, Yong; Wang, Fei; Cole, Curtis L; Yang, Qian; Wang, Yanshan; Malin, Bradley A; Peleg, Mor; Wallace, Byron C; Lu, Zhiyong; Weng, Chunhua; Peng, Yifan.

J Biomed Inform ; 153: 104640, 2024 May.

Artigo em Inglês | MEDLINE | ID: mdl-38608915

RESUMO

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.

Assuntos

Inteligência Artificial , Medicina Baseada em Evidências , Humanos , Confiança , Processamento de Linguagem Natural

2.

Machine learning to help researchers evaluate biases in clinical trials: a prospective, randomized user study.

Soboczenski, Frank; Trikalinos, Thomas A; Kuiper, Joël; Bias, Randolph G; Wallace, Byron C; Marshall, Iain J.

BMC Med Inform Decis Mak ; 19(1): 96, 2019 05 08.

Artigo em Inglês | MEDLINE | ID: mdl-31068178

RESUMO

OBJECTIVE: Assessing risks of bias in randomized controlled trials (RCTs) is an important but laborious task when conducting systematic reviews. RobotReviewer (RR), an open-source machine learning (ML) system, semi-automates bias assessments. We conducted a user study of RobotReviewer, evaluating time saved and usability of the tool. MATERIALS AND METHODS: Systematic reviewers applied the Cochrane Risk of Bias tool to four randomly selected RCT articles. Reviewers judged: whether an RCT was at low, or high/unclear risk of bias for each bias domain in the Cochrane tool (Version 1); and highlighted article text justifying their decision. For a random two of the four articles, the process was semi-automated: users were provided with ML-suggested bias judgments and text highlights. Participants could amend the suggestions if necessary. We measured time taken for the task, ML suggestions, usability via the System Usability Scale (SUS) and collected qualitative feedback. RESULTS: For 41 volunteers, semi-automation was quicker than manual assessment (mean 755 vs. 824 s; relative time 0.75, 95% CI 0.62-0.92). Reviewers accepted 301/328 (91%) of the ML Risk of Bias (RoB) judgments, and 202/328 (62%) of text highlights without change. Overall, ML suggested text highlights had a recall of 0.90 (SD 0.14) and precision of 0.87 (SD 0.21) with respect to the users' final versions. Reviewers assigned the system a mean 77.7 SUS score, corresponding to a rating between "good" and "excellent". CONCLUSIONS: Semi-automation (where humans validate machine learning suggestions) can improve the efficiency of evidence synthesis. Our system was rated highly usable, and expedited bias assessment of RCTs.

Assuntos

Viés , Aprendizado de Máquina , Ensaios Clínicos Controlados Aleatórios como Assunto , Retroalimentação , Humanos , Estudos Prospectivos , Medição de Risco

3.

Phenotype Instance Verification and Evaluation Tool (PIVET): A Scaled Phenotype Evidence Generation Framework Using Web-Based Medical Literature.

Henderson, Jette; Ke, Junyuan; Ho, Joyce C; Ghosh, Joydeep; Wallace, Byron C.

J Med Internet Res ; 20(5): e164, 2018 05 04.

Artigo em Inglês | MEDLINE | ID: mdl-29728351

RESUMO

BACKGROUND: Researchers are developing methods to automatically extract clinically relevant and useful patient characteristics from raw healthcare datasets. These characteristics, often capturing essential properties of patients with common medical conditions, are called computational phenotypes. Being generated by automated or semiautomated, data-driven methods, such potential phenotypes need to be validated as clinically meaningful (or not) before they are acceptable for use in decision making. OBJECTIVE: The objective of this study was to present Phenotype Instance Verification and Evaluation Tool (PIVET), a framework that uses co-occurrence analysis on an online corpus of publically available medical journal articles to build clinical relevance evidence sets for user-supplied phenotypes. PIVET adopts a conceptual framework similar to the pioneering prototype tool PheKnow-Cloud that was developed for the phenotype validation task. PIVET completely refactors each part of the PheKnow-Cloud pipeline to deliver vast improvements in speed without sacrificing the quality of the insights PheKnow-Cloud achieved. METHODS: PIVET leverages indexing in NoSQL databases to efficiently generate evidence sets. Specifically, PIVET uses a succinct representation of the phenotypes that corresponds to the index on the corpus database and an optimized co-occurrence algorithm inspired by the Aho-Corasick algorithm. We compare PIVET's phenotype representation with PheKnow-Cloud's by using PheKnow-Cloud's experimental setup. In PIVET's framework, we also introduce a statistical model trained on domain expert-verified phenotypes to automatically classify phenotypes as clinically relevant or not. Additionally, we show how the classification model can be used to examine user-supplied phenotypes in an online, rather than batch, manner. RESULTS: PIVET maintains the discriminative power of PheKnow-Cloud in terms of identifying clinically relevant phenotypes for the same corpus with which PheKnow-Cloud was originally developed, but PIVET's analysis is an order of magnitude faster than that of PheKnow-Cloud. Not only is PIVET much faster, it can be scaled to a larger corpus and still retain speed. We evaluated multiple classification models on top of the PIVET framework and found ridge regression to perform best, realizing an average F1 score of 0.91 when predicting clinically relevant phenotypes. CONCLUSIONS: Our study shows that PIVET improves on the most notable existing computational tool for phenotype validation in terms of speed and automation and is comparable in terms of accuracy.

Assuntos

Armazenamento e Recuperação da Informação/métodos , Internet/instrumentação , MEDLARS/normas , Algoritmos , Humanos , Fenótipo

4.

Improving the utility of MeSH® terms using the TopicalMeSH representation.

Yu, Zhiguo; Bernstam, Elmer; Cohen, Trevor; Wallace, Byron C; Johnson, Todd R.

J Biomed Inform ; 61: 77-86, 2016 06.

Artigo em Inglês | MEDLINE | ID: mdl-27001195

RESUMO

OBJECTIVE: To evaluate whether vector representations encoding latent topic proportions that capture similarities to MeSH terms can improve performance on biomedical document retrieval and classification tasks, compared to using MeSH terms. MATERIALS AND METHODS: We developed the TopicalMeSH representation, which exploits the 'correspondence' between topics generated using latent Dirichlet allocation (LDA) and MeSH terms to create new document representations that combine MeSH terms and latent topic vectors. We used 15 systematic drug review corpora to evaluate performance on information retrieval and classification tasks using this TopicalMeSH representation, compared to using standard encodings that rely on either (1) the original MeSH terms, (2) the text, or (3) their combination. For the document retrieval task, we compared the precision and recall achieved by ranking citations using MeSH and TopicalMeSH representations, respectively. For the classification task, we considered three supervised machine learning approaches, Support Vector Machines (SVMs), logistic regression, and decision trees. We used these to classify documents as relevant or irrelevant using (independently) MeSH, TopicalMeSH, Words (i.e., n-grams extracted from citation titles and abstracts, encoded via bag-of-words representation), a combination of MeSH and Words, and a combination of TopicalMeSH and Words. We also used SVM to compare the classification performance of tf-idf weighted MeSH terms, LDA Topics, a combination of Topics and MeSH, and TopicalMeSH to supervised LDA's classification performance. RESULTS: For the document retrieval task, using the TopicalMeSH representation resulted in higher precision than MeSH in 11 of 15 corpora while achieving the same recall. For the classification task, use of TopicalMeSH features realized a higher F1 score in 14 of 15 corpora when used by SVMs, 12 of 15 corpora using logistic regression, and 12 of 15 corpora using decision trees. TopicalMeSH also had better document classification performance on 12 of 15 corpora when compared to Topics, tf-idf weighted MeSH terms, and a combination of Topics and MeSH using SVMs. Supervised LDA achieved the worst performance in most of the corpora. CONCLUSION: The proposed TopicalMeSH representation (which combines MeSH terms with latent topics) consistently improved performance on document retrieval and classification tasks, compared to using alternative standard representations using MeSH terms alone, as well as, several standard alternative approaches.

Assuntos

Armazenamento e Recuperação da Informação , Medical Subject Headings , Máquina de Vetores de Suporte , Árvores de Decisões , Humanos

5.

Question answering systems for health professionals at the point of care-a systematic review.

Kell, Gregory; Roberts, Angus; Umansky, Serge; Qian, Linglong; Ferrari, Davide; Soboczenski, Frank; Wallace, Byron C; Patel, Nikhil; Marshall, Iain J.

J Am Med Inform Assoc ; 31(4): 1009-1024, 2024 Apr 03.

Artigo em Inglês | MEDLINE | ID: mdl-38366879

RESUMO

OBJECTIVES: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. MATERIALS AND METHODS: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology, and forward and backward citations on February 7, 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. RESULTS: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. DISCUSSION: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.

Assuntos

Pessoal de Saúde , Sistemas Automatizados de Assistência Junto ao Leito , Humanos , Reprodutibilidade dos Testes , PubMed , Aprendizado de Máquina

6.

Revisiting Relation Extraction in the era of Large Language Models.

Wadhwa, Somin; Amir, Silvio; Wallace, Byron C.

Proc Conf Assoc Comput Linguist Meet ; 2023: 15566-15589, 2023 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-37674787

RESUMO

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.

7.

Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges.

Ramprasad, Sanjana; Marshall, Iain J; McInerney, Denis Jered; Wallace, Byron C.

Proc Conf Assoc Comput Linguist Meet ; 2023: 236-247, 2023 May.

Artigo em Inglês | MEDLINE | ID: mdl-37483390

RESUMO

We present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work (Marshall et al., 2020), the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: A standard sequence-to-sequence model based on BART (Lewis et al., 2019), and a multi-headed architecture intended to provide greater transparency to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements render them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs allowing users to trace generated tokens back to inputs. The demonstration video is available at: https://vimeo.com/735605060 The prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/.

8.

In a pilot study, automated real-time systematic review updates were feasible, accurate, and work-saving.

Marshall, Iain J; Trikalinos, Thomas A; Soboczenski, Frank; Yun, Hye Sun; Kell, Gregory; Marshall, Rachel; Wallace, Byron C.

J Clin Epidemiol ; 153: 26-33, 2023 01.

Artigo em Inglês | MEDLINE | ID: mdl-36150548

RESUMO

OBJECTIVES: The aim of this study is to describe and pilot a novel method for continuously identifying newly published trials relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise. STUDY DESIGN AND SETTING: We used RobotReviewer LIVE to keep a review of COVID-19 vaccination trials updated from February to August 2021. We compared the papers identified by the system with those found by the conventional manual process by the review team. RESULTS: The manual update searches (last search date July 2021) retrieved 135 abstracts, of which 31 were included after screening (23% precision, 100% recall). By the same date, the automated system retrieved 56 abstracts, of which 31 were included after manual screening (55% precision, 100% recall). Key limitations of the system include that it is limited to searches of PubMed/MEDLINE, and considers only randomized controlled trial reports. We aim to address these limitations in future. The system is available as open-source software for further piloting and evaluation. CONCLUSION: Our system identified all relevant studies, reduced manual screening work, and enabled rolling updates on publication of new primary research.

Assuntos

Inteligência Artificial , COVID-19 , Humanos , Projetos Piloto , Vacinas contra COVID-19 , COVID-19/epidemiologia , COVID-19/prevenção & controle , PubMed

9.

Single cell time-resolved quorum responses reveal dependence on cell density and configuration.

Whitaker, Ragnhild D; Pember, Steven; Wallace, Byron C; Brodley, Carla E; Walt, David R.

J Biol Chem ; 286(24): 21623-32, 2011 Jun 17.

Artigo em Inglês | MEDLINE | ID: mdl-21527637

RESUMO

Bacterial communication via quorum sensing has been extensively investigated in recent years. Bacteria communicate in a complex manner through the production, release, and reception of diffusible low molecular weight chemical signaling molecules. Much work has focused on understanding the basic mechanisms of quorum sensing. As more and more bacteria grow resistant to conventional antibiotics, the development of drugs that do not kill bacteria but instead interrupt their communication is of increasing interest. This study presents a method for analyzing bacterial communication by investigating single cell responses. Most conventional analysis methods for bacterial communication are based on the averaged response from many bacteria, masking how individual cells respond to their immediate environment. We applied a fiber-optic microarray to record cellular communication from single cells. Single cell quorum sensing systems have previously been employed, but the highly ordered array reported here is an improvement because it allows us to simultaneously investigate cellular communication in many different environments with known cellular densities and configurations. We employed this method to detect how genes under quorum regulation are induced or repressed over time on the single cell level and to determine whether cellular density and configuration are indicative of the single cell temporal patterns of gene expression.

Assuntos

Regulação Bacteriana da Expressão Gênica , Percepção de Quorum/fisiologia , Proteínas de Bactérias/metabolismo , Biofísica/métodos , Comunicação Celular , Escherichia coli/metabolismo , Tecnologia de Fibra Óptica , Modelos Biológicos , Modelos Químicos , Análise de Sequência com Séries de Oligonucleotídeos , Fatores de Tempo , Transcrição Gênica

10.

Toward modernizing the systematic review pipeline in genetics: efficient updating via data mining.

Wallace, Byron C; Small, Kevin; Brodley, Carla E; Lau, Joseph; Schmid, Christopher H; Bertram, Lars; Lill, Christina M; Cohen, Joshua T; Trikalinos, Thomas A.

Genet Med ; 14(7): 663-9, 2012 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-22481134

RESUMO

PURPOSE: The aim of this study was to demonstrate that modern data mining tools can be used as one step in reducing the labor necessary to produce and maintain systematic reviews. METHODS: We used four continuously updated, manually curated resources that summarize MEDLINE-indexed articles in entire fields using systematic review methods (PDGene, AlzGene, and SzGene for genetic determinants of Parkinson disease, Alzheimer disease, and schizophrenia, respectively; and the Tufts Cost-Effectiveness Analysis (CEA) Registry for cost-effectiveness analyses). In each data set, we trained a classification model on citations screened up until 2009. We then evaluated the ability of the model to classify citations published in 2010 as "relevant" or "irrelevant" using human screening as the gold standard. RESULTS: Classification models did not miss any of the 104, 65, and 179 eligible citations in PDGene, AlzGene, and SzGene, respectively, and missed only 1 of 79 in the CEA Registry (100% sensitivity for the first three and 99% for the fourth). The respective specificities were 90, 93, 90, and 73%. Had the semiautomated system been used in 2010, a human would have needed to read only 605/5,616 citations to update the PDGene registry (11%) and 555/7,298 (8%), 717/5,381 (13%), and 334/1,015 (33%) for the other three databases. CONCLUSION: Data mining methodologies can reduce the burden of updating systematic reviews, without missing more papers than humans.

Assuntos

Mineração de Dados , Revisões Sistemáticas como Assunto , Humanos , Doença de Alzheimer/genética , Análise Custo-Benefício , Mineração de Dados/métodos , Bases de Dados Factuais , Pesquisa Empírica , Metanálise como Assunto , Doença de Parkinson/genética , Publicações Periódicas como Assunto , Esquizofrenia/genética , Software , Avaliação da Tecnologia Biomédica

11.

Semi-automated Tools for Systematic Searches.

Adam, Gaelen P; Wallace, Byron C; Trikalinos, Thomas A.

Methods Mol Biol ; 2345: 17-40, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-34550582

RESUMO

Traditionally, literature identification for systematic reviews has relied on a two-step process: first, searching databases to identify potentially relevant citations, and then manually screening those citations. A number of tools have been developed to streamline and semi-automate this process, including tools to generate terms; to visualize and evaluate search queries; to trace citation linkages; to deduplicate, limit, or translate searches across databases; and to prioritize relevant abstracts for screening. Research is ongoing into tools that can unify searching and screening into a single step, and several protype tools have been developed. As this field grows, it is becoming increasingly important to develop and codify methods for evaluating the extent to which these tools fulfill their purpose.

Assuntos

Bases de Dados Factuais , Automação , Programas de Rastreamento , Publicações , Revisões Sistemáticas como Assunto

12.

Self-Repetition in Abstractive Neural Summarizers.

Salkar, Nikita; Trikalinos, Thomas; Wallace, Byron C; Nenkova, Ani.

Proc Conf Assoc Comput Linguist Meet ; 2022: 341-350, 2022 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-37484061

RESUMO

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5 and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.

13.

Evaluating Factuality in Text Simplification.

Devaraj, Ashwin; Sheffield, William; Wallace, Byron C; Li, Junyi Jessy.

Proc Conf Assoc Comput Linguist Meet ; 2022: 7331-7345, 2022 May.

Artigo em Inglês | MEDLINE | ID: mdl-36404800

RESUMO

Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

14.

That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data.

McInerney, Denis Jered; Young, Geoffrey; van de Meent, Jan-Willem; Wallace, Byron C.

Proc Conf Empir Methods Nat Lang Process ; 2022: 3626-3648, 2022 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-37103483

RESUMO

Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text has an often weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications - such as substituting "left" for "right" - do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision. We make our code and checkpoints open-source.

15.

What Would it Take to get Biomedical QA Systems into Practice?

Kell, Gregory; Marshall, Iain J; Wallace, Byron C; Jaun, André.

Proc Conf Empir Methods Nat Lang Process ; 2021: 28-41, 2021 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-35663506

RESUMO

Medical question answering (QA) systems have the potential to answer clinicians' uncertainties about treatment and diagnosis on-demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.

16.

Paragraph-level Simplification of Medical Texts.

Devaraj, Ashwin; Wallace, Byron C; Marshall, Iain J; Li, Junyi Jessy.

Proc Conf ; 2021: 4972-4984, 2021 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-35663507

RESUMO

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing 'jargon' terms; we find that this yields improvements over baselines in terms of readability.

17.

Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.

Wallace, Byron C; Saha, Sayantan; Soboczenski, Frank; Marshall, Iain J.

AMIA Jt Summits Transl Sci Proc ; 2021: 605-614, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34457176

RESUMO

We consider the problem of automatically generating a narrative biomedical evidence summary from multiple trial reports. We evaluate modern neural models for abstractive summarization of relevant article abstracts from systematic reviews previously conducted by members of the Cochrane collaboration, using the authors conclusions section of the review abstract as our target. We enlist medical professionals to evaluate generated summaries, and we find that summarization systems yield consistently fluent and relevant synopses, but these often contain factual inaccuracies. We propose new approaches that capitalize on domain-specific models to inform summarization, e.g., by explicitly demarcating snippets of inputs that convey key findings, and emphasizing the reports of large and high-quality trials. We find that these strategies modestly improve the factual accuracy of generated summaries. Finally, we propose a new method for automatically evaluating the factuality of generated narrative evidence syntheses using models that infer the directionality of reported findings.

18.

Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations.

Nye, Benjamin E; DeYoung, Jay; Lehman, Eric; Nenkova, Ani; Marshall, Iain J; Wallace, Byron C.

AMIA Jt Summits Transl Sci Proc ; 2021: 485-494, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34457164

RESUMO

The best evidence concerning comparative treatment effectiveness comes from clinical trials, the results of which are reported in unstructured articles. Medical experts must manually extract information from articles to inform decision-making, which is time-consuming and expensive. Here we consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and, (b) inferring the reported results for the former with respect to the latter (relation extraction). We introduce new data for this task, and evaluate models that have recently achieved state-of-the-art results on similar tasks in Natural Language Processing. We then propose a new method motivated by how trial results are typically presented that outperforms these purely data-driven baselines. Finally, we run a fielded evaluation of the model with a non-profit seeking to identify existing drugs that might be re-purposed for cancer, showing the potential utility of end-to-end evidence extraction systems.

Assuntos

Processamento de Linguagem Natural , Humanos

19.

State of the evidence: a survey of global disparities in clinical trials.

Marshall, Iain James; L'Esperance, Veline; Marshall, Rachel; Thomas, James; Noel-Storr, Anna; Soboczenski, Frank; Nye, Benjamin; Nenkova, Ani; Wallace, Byron C.

BMJ Glob Health ; 6(1)2021 01.

Artigo em Inglês | MEDLINE | ID: mdl-33402333

RESUMO

INTRODUCTION: Ideally, health conditions causing the greatest global disease burden should attract increased research attention. We conducted a comprehensive global study investigating the number of randomised controlled trials (RCTs) published on different health conditions, and how this compares with the global disease burden that they impose. METHODS: We use machine learning to monitor PubMed daily, and find and analyse RCT reports. We assessed RCTs investigating the leading causes of morbidity and mortality from the Global Burden of Disease study. Using regression models, we compared numbers of actual RCTs in different health conditions to numbers predicted from their global disease burden (disability-adjusted life years (DALYs)). We investigated whether RCT numbers differed for conditions disproportionately affecting countries with lower socioeconomic development. RESULTS: We estimate 463 000 articles describing RCTs (95% prediction interval 439 000 to 485 000) were published from 1990 to July 2020. RCTs recruited a median of 72 participants (IQR 32-195). 82% of RCTs were conducted by researchers in the top fifth of countries by socio-economic development. As DALYs increased for a particular health condition by 10%, the number of RCTs in the same year increased by 5% (3.2%-6.9%), but the association was weak (adjusted R2=0.13). Conditions disproportionately affecting countries with lower socioeconomic development, including respiratory infections and tuberculosis (7000 RCTs below predicted) and enteric infections (9700 RCTs below predicted), appear relatively under-researched for their disease burden. Each 10% shift in DALYs towards countries with low and middle socioeconomic development was associated with a 4% reduction in RCTs (3.7%-4.9%). These disparities have not changed substantially over time. CONCLUSION: Research priorities are not well optimised to reduce the global burden of disease. Most RCTs are produced by highly developed countries, and the health needs of these countries have been, on average, favoured.

Assuntos

Pessoas com Deficiência , Infecções Respiratórias , Carga Global da Doença , Saúde Global , Humanos , Anos de Vida Ajustados por Qualidade de Vida , Ensaios Clínicos Controlados Aleatórios como Assunto

20.

Semi-automated screening of biomedical citations for systematic reviews.

Wallace, Byron C; Trikalinos, Thomas A; Lau, Joseph; Brodley, Carla; Schmid, Christopher H.

BMC Bioinformatics ; 11: 55, 2010 Jan 26.

Artigo em Inglês | MEDLINE | ID: mdl-20102628

RESUMO

BACKGROUND: Systematic reviews address a specific clinical question by unbiasedly assessing and analyzing the pertinent literature. Citation screening is a time-consuming and critical step in systematic reviews. Typically, reviewers must evaluate thousands of citations to identify articles eligible for a given review. We explore the application of machine learning techniques to semi-automate citation screening, thereby reducing the reviewers' workload. RESULTS: We present a novel online classification strategy for citation screening to automatically discriminate "relevant" from "irrelevant" citations. We use an ensemble of Support Vector Machines (SVMs) built over different feature-spaces (e.g., abstract and title text), and trained interactively by the reviewer(s). Semi-automating the citation screening process is difficult because any such strategy must identify all citations eligible for the systematic review. This requirement is made harder still due to class imbalance; there are far fewer "relevant" than "irrelevant" citations for any given systematic review. To address these challenges we employ a custom active-learning strategy developed specifically for imbalanced datasets. Further, we introduce a novel undersampling technique. We provide experimental results over three real-world systematic review datasets, and demonstrate that our algorithm is able to reduce the number of citations that must be screened manually by nearly half in two of these, and by around 40% in the third, without excluding any of the citations eligible for the systematic review. CONCLUSIONS: We have developed a semi-automated citation screening algorithm for systematic reviews that has the potential to substantially reduce the number of citations reviewers have to manually screen, without compromising the quality and comprehensiveness of the review.

Assuntos

Armazenamento e Recuperação da Informação/métodos , Literatura de Revisão como Assunto , Publicações Periódicas como Assunto , Publicações

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA