Search | BVS - MINISTÉRIO DA SAÚDE

1.

Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges.

Ahsan, Hiba; McInerney, Denis Jered; Kim, Jisoo; Potter, Christopher; Young, Geoffrey; Amir, Silvio; Wallace, Byron C.

Proc Mach Learn Res ; 248: 489-505, 2024 Jun.

Article in English | MEDLINE | ID: mdl-39224857

ABSTRACT

Unstructured data in Electronic Health Records (EHRs) often contains critical information, complementary to imaging, that could inform radiologists' diagnoses. But the large volume of notes often associated with patients, together with time constraints, renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs as a mechanism to efficiently retrieve and summarize unstructured evidence in patient EHRs relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method using an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHRs, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs strongly correlates with faithful summaries, offering a practical means to limit confabulations.

2.

Literature search sandbox: a large language model that generates search queries for systematic reviews.

Adam, Gaelen P; DeYoung, Jay; Paul, Alice; Saldanha, Ian J; Balk, Ethan M; Trikalinos, Thomas A; Wallace, Byron C.

JAMIA Open ; 7(3): ooae098, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39323560

ABSTRACT

Objectives: Development of search queries for systematic reviews (SRs) is time-consuming. In this work, we capitalize on recent advances in large language models (LLMs) and a relatively large dataset of natural language descriptions of reviews and corresponding Boolean searches to generate Boolean search queries from SR titles and key questions. Materials and Methods: We curated a training dataset of 10 346 SR search queries registered in PROSPERO. We used this dataset to fine-tune a set of models to generate search queries based on Mistral-Instruct-7b. We evaluated the models quantitatively using an evaluation dataset of 57 SRs and qualitatively through semi-structured interviews with 8 experienced medical librarians. Results: The model-generated search queries had median sensitivity of 85% (interquartile range [IQR] 40%-100%) and number needed to read of 1206 citations (IQR 205-5810). The interviews suggested that the models lack both the necessary sensitivity and precision to be used without scrutiny but could be useful for topic scoping or as initial queries to be refined. Discussion: Future research should focus on improving the dataset with more high-quality search queries, assessing whether fine-tuning the model on other fields, such as the population and intervention, improves performance, and exploring the addition of interactivity to the interface. Conclusions: The datasets developed for this project can be used to train and evaluate LLMs that map review descriptions to Boolean search queries. The models cannot replace thoughtful search query design but may be useful in providing suggestions for key words and the framework for the query.
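The two metrics reported above can be illustrated with a small sketch. It assumes the usual definitions (sensitivity as the share of relevant citations that a query retrieves, and number needed to read as retrieved citations per relevant citation, i.e. the reciprocal of precision); the abstract does not spell out its exact formulas, and the function names and citation IDs below are hypothetical.

```python
def sensitivity(retrieved: set[str], relevant: set[str]) -> float:
    """Share of the relevant citations that the query retrieved (recall)."""
    if not relevant:
        raise ValueError("need at least one relevant citation")
    return len(retrieved & relevant) / len(relevant)


def number_needed_to_read(retrieved: set[str], relevant: set[str]) -> float:
    """Citations a reviewer must read per relevant citation found (1 / precision)."""
    hits = len(retrieved & relevant)
    if hits == 0:
        return float("inf")  # the query found nothing relevant
    return len(retrieved) / hits


# Hypothetical toy example: a query returns 10 citations, 3 of which are
# among the 4 citations the finished review actually included.
retrieved = {f"pmid{i}" for i in range(1, 11)}
relevant = {"pmid1", "pmid2", "pmid3", "pmid99"}

sens = sensitivity(retrieved, relevant)           # 3 / 4
nnr = number_needed_to_read(retrieved, relevant)  # 10 / 3
```

A query can score well on one metric and poorly on the other, which is consistent with the librarians' comment that both sensitivity and precision need scrutiny.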

3.

Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness.

Zhang, Gongbo; Jin, Qiao; Jered McInerney, Denis; Chen, Yong; Wang, Fei; Cole, Curtis L; Yang, Qian; Wang, Yanshan; Malin, Bradley A; Peleg, Mor; Wallace, Byron C; Lu, Zhiyong; Weng, Chunhua; Peng, Yifan.

J Biomed Inform ; 153: 104640, 2024 May.

Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.

Subjects

Artificial Intelligence , Evidence-Based Medicine , Humans , Trust , Natural Language Processing

4.

Question answering systems for health professionals at the point of care - a systematic review.

Kell, Gregory; Roberts, Angus; Umansky, Serge; Qian, Linglong; Ferrari, Davide; Soboczenski, Frank; Wallace, Byron C; Patel, Nikhil; Marshall, Iain J.

J Am Med Inform Assoc ; 31(4): 1009-1024, 2024 04 03.

Article in English | MEDLINE | ID: mdl-38366879

ABSTRACT

OBJECTIVES: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. MATERIALS AND METHODS: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology, and forward and backward citations on February 7, 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. RESULTS: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. DISCUSSION: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.

Subjects

Point-of-Care Systems , Humans , Health Personnel , Quality Assurance, Health Care

5.

Revisiting Relation Extraction in the era of Large Language Models.

Wadhwa, Somin; Amir, Silvio; Wallace, Byron C.

Proc Conf Assoc Comput Linguist Meet ; 2023: 15566-15589, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37674787

ABSTRACT

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.

6.

Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges.

Ramprasad, Sanjana; Marshall, Iain J; McInerney, Denis Jered; Wallace, Byron C.

Proc Conf Assoc Comput Linguist Meet ; 2023: 236-247, 2023 May.

Article in English | MEDLINE | ID: mdl-37483390

ABSTRACT

We present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work (Marshall et al., 2020), the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: a standard sequence-to-sequence model based on BART (Lewis et al., 2019), and a multi-headed architecture intended to provide greater transparency to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs by allowing them to trace generated tokens back to inputs. The demonstration video is available at https://vimeo.com/735605060; the prototype, source code, and model weights are available at https://sanjanaramprasad.github.io/trials-summarizer/.

7.

In a pilot study, automated real-time systematic review updates were feasible, accurate, and work-saving.

Marshall, Iain J; Trikalinos, Thomas A; Soboczenski, Frank; Yun, Hye Sun; Kell, Gregory; Marshall, Rachel; Wallace, Byron C.

J Clin Epidemiol ; 153: 26-33, 2023 01.

Article in English | MEDLINE | ID: mdl-36150548

ABSTRACT

OBJECTIVES: The aim of this study is to describe and pilot a novel method for continuously identifying newly published trials relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise. STUDY DESIGN AND SETTING: We used RobotReviewer LIVE to keep a review of COVID-19 vaccination trials updated from February to August 2021. We compared the papers identified by the system with those found by the conventional manual process by the review team. RESULTS: The manual update searches (last search date July 2021) retrieved 135 abstracts, of which 31 were included after screening (23% precision, 100% recall). By the same date, the automated system retrieved 56 abstracts, of which 31 were included after manual screening (55% precision, 100% recall). Key limitations of the system include that it is limited to searches of PubMed/MEDLINE, and considers only randomized controlled trial reports. We aim to address these limitations in future. The system is available as open-source software for further piloting and evaluation. CONCLUSION: Our system identified all relevant studies, reduced manual screening work, and enabled rolling updates on publication of new primary research.
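The precision figures above follow directly from the reported counts. The short sketch below reproduces the arithmetic; the helper name is illustrative, and the "abstracts saved" figure is derived here from the counts rather than quoted from the abstract.

```python
def precision(included: int, retrieved: int) -> float:
    """Fraction of retrieved abstracts that were included after screening."""
    return included / retrieved


# Counts reported in the abstract.
manual_retrieved, automated_retrieved, included = 135, 56, 31

manual_precision = precision(included, manual_retrieved)        # 31 / 135
automated_precision = precision(included, automated_retrieved)  # 31 / 56

# Derived quantity (not stated in the abstract): abstracts that did not
# need manual screening when the automated system was used.
abstracts_saved = manual_retrieved - automated_retrieved

print(f"manual: {manual_precision:.0%}, automated: {automated_precision:.0%}")
print(f"abstracts saved: {abstracts_saved}")
```

Because both approaches found the same 31 included studies, recall is identical and the difference in workload comes entirely from precision.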

Subjects

Artificial Intelligence , COVID-19 , Humans , Pilot Projects , COVID-19 Vaccines , COVID-19/epidemiology , COVID-19/prevention & control , PubMed

8.

Evaluating Factuality in Text Simplification.

Devaraj, Ashwin; Sheffield, William; Wallace, Byron C; Li, Junyi Jessy.

Proc Conf Assoc Comput Linguist Meet ; 2022: 7331-7345, 2022 May.

Article in English | MEDLINE | ID: mdl-36404800

ABSTRACT

Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

9.

Accuracy and Efficiency of Machine Learning-Assisted Risk-of-Bias Assessments in "Real-World" Systematic Reviews : A Noninferiority Randomized Controlled Trial.

Arno, Anneliese; Thomas, James; Wallace, Byron; Marshall, Iain J; McKenzie, Joanne E; Elliott, Julian H.

Ann Intern Med ; 175(7): 1001-1009, 2022 07.

Article in English | MEDLINE | ID: mdl-35635850

ABSTRACT

BACKGROUND: Automation is a proposed solution for the increasing difficulty of maintaining up-to-date, high-quality health evidence. Evidence assessing the effectiveness of semiautomated data synthesis, such as risk-of-bias (RoB) assessments, is lacking. OBJECTIVE: To determine whether RobotReviewer-assisted RoB assessments are noninferior in accuracy and efficiency to assessments conducted with human effort only. DESIGN: Two-group, parallel, noninferiority, randomized trial. (Monash Research Office Project 11256). SETTING: Health-focused systematic reviews using Covidence. PARTICIPANTS: Systematic reviewers, who had not previously used RobotReviewer, completing Cochrane RoB assessments between February 2018 and May 2020. INTERVENTION: In the intervention group, reviewers received an RoB form prepopulated by RobotReviewer; in the comparison group, reviewers received a blank form. Studies were assigned in a 1:1 ratio via simple randomization to receive RobotReviewer assistance for either Reviewer 1 or Reviewer 2. Participants were blinded to study allocation before starting work on each RoB form. MEASUREMENTS: Co-primary outcomes were the accuracy of individual reviewer RoB assessments and the person-time required to complete individual assessments. Domain-level RoB accuracy was a secondary outcome. RESULTS: Of the 15 recruited review teams, 7 completed the trial (145 included studies). Integration of RobotReviewer resulted in noninferior overall RoB assessment accuracy (risk difference, -0.014 [95% CI, -0.093 to 0.065]; intervention group: 88.8% accurate assessments; control group: 90.2% accurate assessments). Data were inconclusive for the person-time outcome (RobotReviewer saved 1.40 minutes [CI, -5.20 to 2.41 minutes]). LIMITATION: Variability in user behavior and a limited number of assessable reviews led to an imprecise estimate of the time outcome. 
CONCLUSION: In health-related systematic reviews, RoB assessments conducted with RobotReviewer assistance are noninferior in accuracy to those conducted without RobotReviewer assistance. PRIMARY FUNDING SOURCE: University College London and Monash University.

Subjects

Machine Learning , Research Design , Bias , Humans , Risk Assessment

10.

Semi-automated Tools for Systematic Searches.

Adam, Gaelen P; Wallace, Byron C; Trikalinos, Thomas A.

Methods Mol Biol ; 2345: 17-40, 2022.

Article in English | MEDLINE | ID: mdl-34550582

ABSTRACT

Traditionally, literature identification for systematic reviews has relied on a two-step process: first, searching databases to identify potentially relevant citations, and then manually screening those citations. A number of tools have been developed to streamline and semi-automate this process, including tools to generate terms; to visualize and evaluate search queries; to trace citation linkages; to deduplicate, limit, or translate searches across databases; and to prioritize relevant abstracts for screening. Research is ongoing into tools that can unify searching and screening into a single step, and several prototype tools have been developed. As this field grows, it is becoming increasingly important to develop and codify methods for evaluating the extent to which these tools fulfill their purpose.

Subjects

Databases, Factual , Automation , Mass Screening , Publications , Systematic Reviews as Topic

11.

That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data.

McInerney, Denis Jered; Young, Geoffrey; van de Meent, Jan-Willem; Wallace, Byron C.

Proc Conf Empir Methods Nat Lang Process ; 2022: 3626-3648, 2022 Dec.

Article in English | MEDLINE | ID: mdl-37103483

ABSTRACT

Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text has an often weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications - such as substituting "left" for "right" - do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision. We make our code and checkpoints open-source.

12.

Self-Repetition in Abstractive Neural Summarizers.

Salkar, Nikita; Trikalinos, Thomas; Wallace, Byron C; Nenkova, Ani.

Proc Conf Assoc Comput Linguist Meet ; 2022: 341-350, 2022 Nov.

Article in English | MEDLINE | ID: mdl-37484061

ABSTRACT

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5 and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
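The self-repetition measure described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes whitespace tokenization and lowercasing, and it counts only 4-grams (any longer repeated span necessarily contains a repeated 4-gram). The function names and example summaries are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Distinct n-grams occurring in one token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def repeated_ngrams(summaries: list[str], n: int = 4) -> set[tuple[str, ...]]:
    """N-grams that occur in at least two different summaries."""
    doc_freq: Counter = Counter()
    for summary in summaries:
        # Each summary contributes an n-gram at most once, so the counter
        # holds the number of summaries containing that n-gram.
        doc_freq.update(ngrams(summary.lower().split(), n))
    return {gram for gram, count in doc_freq.items() if count >= 2}


# Hypothetical system outputs for three different inputs.
summaries = [
    "more research is needed to confirm these findings",
    "the drug reduced pain but more research is needed",
    "no serious adverse events were reported",
]

repeats = repeated_ngrams(summaries)
```

Here the only 4-gram shared across outputs is the formulaic phrase "more research is needed", the kind of boilerplate that the qualitative analysis above associates with the fine-tuning domain.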

13.

Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations.

Nye, Benjamin E; DeYoung, Jay; Lehman, Eric; Nenkova, Ani; Marshall, Iain J; Wallace, Byron C.

AMIA Jt Summits Transl Sci Proc ; 2021: 485-494, 2021.

Article in English | MEDLINE | ID: mdl-34457164

ABSTRACT

The best evidence concerning comparative treatment effectiveness comes from clinical trials, the results of which are reported in unstructured articles. Medical experts must manually extract information from articles to inform decision-making, which is time-consuming and expensive. Here we consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and, (b) inferring the reported results for the former with respect to the latter (relation extraction). We introduce new data for this task, and evaluate models that have recently achieved state-of-the-art results on similar tasks in Natural Language Processing. We then propose a new method motivated by how trial results are typically presented that outperforms these purely data-driven baselines. Finally, we run a fielded evaluation of the model with a non-profit seeking to identify existing drugs that might be re-purposed for cancer, showing the potential utility of end-to-end evidence extraction systems.

Subjects

Natural Language Processing , Humans

14.

Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.

Wallace, Byron C; Saha, Sayantan; Soboczenski, Frank; Marshall, Iain J.

AMIA Jt Summits Transl Sci Proc ; 2021: 605-614, 2021.

Article in English | MEDLINE | ID: mdl-34457176

ABSTRACT

We consider the problem of automatically generating a narrative biomedical evidence summary from multiple trial reports. We evaluate modern neural models for abstractive summarization of relevant article abstracts from systematic reviews previously conducted by members of the Cochrane Collaboration, using the authors' conclusions section of the review abstract as our target. We enlist medical professionals to evaluate generated summaries, and we find that summarization systems yield consistently fluent and relevant synopses, but these often contain factual inaccuracies. We propose new approaches that capitalize on domain-specific models to inform summarization, e.g., by explicitly demarcating snippets of inputs that convey key findings, and emphasizing the reports of large and high-quality trials. We find that these strategies modestly improve the factual accuracy of generated summaries. Finally, we propose a new method for automatically evaluating the factuality of generated narrative evidence syntheses using models that infer the directionality of reported findings.

15.

The views of health guideline developers on the use of automation in health evidence synthesis.

Arno, Anneliese; Elliott, Julian; Wallace, Byron; Turner, Tari; Thomas, James.

Syst Rev ; 10(1): 16, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33419479

ABSTRACT

BACKGROUND: The increasingly rapid rate of evidence publication has made it difficult for evidence synthesis (systematic reviews and health guidelines) to be continually kept up to date. One proposed solution for this is the use of automation in health evidence synthesis. Guideline developers are key gatekeepers in the acceptance and use of evidence, and therefore, their opinions on the potential use of automation are crucial. METHODS: The objective of this study was to analyze the attitudes of guideline developers towards the use of automation in health evidence synthesis. The Diffusion of Innovations framework was chosen as an initial analytical framework because it encapsulates some of the core issues which are thought to affect the adoption of innovations in practice. This well-established theory posits five dimensions which affect the adoption of novel technologies: Relative Advantage, Compatibility, Complexity, Trialability, and Observability. Eighteen interviews were conducted with individuals who were currently working, or had previously worked, in guideline development. After transcription, a multiphase mixed deductive and grounded approach was used to analyze the data. First, transcripts were coded with a deductive approach using the dimensions of Rogers' Diffusion of Innovations as the top-level themes. Second, sub-themes within the framework were identified using a grounded approach. RESULTS: Participants were consistently most concerned with the extent to which an innovation is in line with current values and practices (i.e., Compatibility in the Diffusion of Innovations framework). Participants were also concerned with Relative Advantage and Observability, which were discussed in approximately equal amounts. For the latter, participants expressed a desire for transparency in the methodology of automation software. Participants were noticeably less interested in Complexity and Trialability, which were discussed infrequently. These results were reasonably consistent across all participants. CONCLUSIONS: If machine learning and other automation technologies are to be used more widely and to their full potential in systematic reviews and guideline development, it is crucial to ensure new technologies are in line with current values and practice. It will also be important to maximize the transparency of the methods of these technologies to address the concerns of guideline developers.

Subjects

Systematic Reviews as Topic , Automation , Humans

16.

State of the evidence: a survey of global disparities in clinical trials.

Marshall, Iain James; L'Esperance, Veline; Marshall, Rachel; Thomas, James; Noel-Storr, Anna; Soboczenski, Frank; Nye, Benjamin; Nenkova, Ani; Wallace, Byron C.

BMJ Glob Health ; 6(1)2021 01.

Article in English | MEDLINE | ID: mdl-33402333

ABSTRACT

INTRODUCTION: Ideally, health conditions causing the greatest global disease burden should attract increased research attention. We conducted a comprehensive global study investigating the number of randomised controlled trials (RCTs) published on different health conditions, and how this compares with the global disease burden that they impose. METHODS: We use machine learning to monitor PubMed daily, and find and analyse RCT reports. We assessed RCTs investigating the leading causes of morbidity and mortality from the Global Burden of Disease study. Using regression models, we compared numbers of actual RCTs in different health conditions to numbers predicted from their global disease burden (disability-adjusted life years (DALYs)). We investigated whether RCT numbers differed for conditions disproportionately affecting countries with lower socioeconomic development. RESULTS: We estimate 463 000 articles describing RCTs (95% prediction interval 439 000 to 485 000) were published from 1990 to July 2020. RCTs recruited a median of 72 participants (IQR 32-195). 82% of RCTs were conducted by researchers in the top fifth of countries by socio-economic development. As DALYs increased for a particular health condition by 10%, the number of RCTs in the same year increased by 5% (3.2%-6.9%), but the association was weak (adjusted R² = 0.13). Conditions disproportionately affecting countries with lower socioeconomic development, including respiratory infections and tuberculosis (7000 RCTs below predicted) and enteric infections (9700 RCTs below predicted), appear relatively under-researched for their disease burden. Each 10% shift in DALYs towards countries with low and middle socioeconomic development was associated with a 4% reduction in RCTs (3.7%-4.9%). These disparities have not changed substantially over time. CONCLUSION: Research priorities are not well optimised to reduce the global burden of disease. Most RCTs are produced by highly developed countries, and the health needs of these countries have been, on average, favoured.
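One way to read the "10% more DALYs, 5% more RCTs" result is as an elasticity. The sketch below assumes a log-log relationship between RCT counts and DALYs; the abstract says only that regression models were used, so this functional form, the function names, and the extrapolation to a doubling are assumptions made purely for illustration.

```python
import math


def elasticity(pct_change_x: float, pct_change_y: float) -> float:
    """Slope b of log(y) = a + b * log(x) implied by a pair of relative changes."""
    return math.log1p(pct_change_y) / math.log1p(pct_change_x)


def predicted_change(b: float, pct_change_x: float) -> float:
    """Relative change in y for a given relative change in x under the log-log model."""
    return (1.0 + pct_change_x) ** b - 1.0


# Reported point estimate: +10% DALYs is associated with +5% RCTs.
b = elasticity(0.10, 0.05)

# Illustrative extrapolation under the assumed model: a condition with
# twice the disease burden.
rct_change_if_dalys_double = predicted_change(b, 1.0)

print(f"implied elasticity: {b:.2f}")
print(f"predicted RCT change for doubled DALYs: {rct_change_if_dalys_double:.0%}")
```

Under this reading, an elasticity of roughly one half means trial numbers grow much more slowly than disease burden, and the low adjusted R² reported above is a reminder that burden explains only a small part of the variation.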

Subjects

Disabled Persons , Respiratory Tract Infections , Global Burden of Disease , Global Health , Humans , Quality-Adjusted Life Years , Randomized Controlled Trials as Topic

17.

What Would it Take to get Biomedical QA Systems into Practice?

Kell, Gregory; Marshall, Iain J; Wallace, Byron C; Jaun, André.

Proc Conf Empir Methods Nat Lang Process ; 2021: 28-41, 2021 Nov.

Article in English | MEDLINE | ID: mdl-35663506

ABSTRACT

Medical question answering (QA) systems have the potential to answer clinicians' uncertainties about treatment and diagnosis on-demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.

18.

Paragraph-level Simplification of Medical Texts.

Devaraj, Ashwin; Wallace, Byron C; Marshall, Iain J; Li, Junyi Jessy.

Proc Conf ; 2021: 4972-4984, 2021 Jun.

Article in English | MEDLINE | ID: mdl-35663507

ABSTRACT

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing 'jargon' terms; we find that this yields improvements over baselines in terms of readability.

19.

Predicting Unplanned Readmissions Following a Hip or Knee Arthroplasty: Retrospective Observational Study.

Mohammadi, Ramin; Jain, Sarthak; Namin, Amir T; Scholem Heller, Melissa; Palacholla, Ramya; Kamarthi, Sagar; Wallace, Byron.

JMIR Med Inform ; 8(11): e19761, 2020 Nov 27.

Article in English | MEDLINE | ID: mdl-33245283

ABSTRACT

BACKGROUND: Total joint replacements are high-volume and high-cost procedures that should be monitored for cost and quality control. Models that can identify patients at high risk of readmission might help reduce costs by suggesting who should be enrolled in preventive care programs. Previous models for risk prediction have relied on structured patient data rather than clinical notes in electronic health records (EHRs). The former approach requires manual feature extraction by domain experts, which may limit the applicability of these models. OBJECTIVE: This study aims to develop and evaluate a machine learning model for predicting the risk of 30-day readmission following knee and hip arthroplasty procedures. The input data for these models come from raw EHRs. We empirically demonstrate that unstructured free-text notes contain a reasonably predictive signal for this task. METHODS: We performed a retrospective analysis of data from 7174 patients at Partners Healthcare collected between 2006 and 2016. These data were split into train, validation, and test sets, which were used to build, validate, and test models to predict unplanned readmission within 30 days of hospital discharge. The proposed models made predictions on the basis of clinical notes, obviating the need for manual feature extraction by domain and machine learning experts. The notes that served as model inputs were written by physicians, nurses, pathologists, and others who diagnose and treat patients and may have their own predictions, even if these are not recorded. RESULTS: The proposed models output readmission risk scores (propensities) for each patient. The best models (as selected on a development set) yielded an area under the receiver operating characteristic curve of 0.846 (95% CI 0.8275-0.8711) for hip and 0.822 (95% CI 0.8094-0.8622) for knee surgery, indicating reasonable discriminative ability. CONCLUSIONS: Machine learning models can predict which patients are at a high risk of readmission within 30 days following hip and knee arthroplasty procedures on the basis of notes in EHRs with reasonable discriminative power. Following further validation and empirical demonstration that the models realize predictive performance above that which clinical judgment may provide, such models may be used to build an automated decision support tool to help caretakers identify at-risk patients.

20.

Trialstreamer: A living, automatically updated database of clinical trial reports.

Marshall, Iain J; Nye, Benjamin; Kuiper, Joël; Noel-Storr, Anna; Marshall, Rachel; Maclean, Rory; Soboczenski, Frank; Nenkova, Ani; Thomas, James; Wallace, Byron C.

J Am Med Inform Assoc ; 27(12): 1903-1912, 2020 Dec 09.

Article in English | MEDLINE | ID: mdl-32940710

ABSTRACT

OBJECTIVE: Randomized controlled trials (RCTs) are the gold standard method for evaluating whether a treatment works in health care but can be difficult to find and make use of. We describe the development and evaluation of a system to automatically find and categorize all new RCT reports. MATERIALS AND METHODS: Trialstreamer continuously monitors PubMed and the World Health Organization International Clinical Trials Registry Platform, looking for new RCTs in humans using a validated classifier. We combine machine learning and rule-based methods to extract information from the RCT abstracts, including free-text descriptions of trial PICO (populations, interventions/comparators, and outcomes) elements and map these snippets to normalized MeSH (Medical Subject Headings) vocabulary terms. We additionally identify sample sizes, predict the risk of bias, and extract text conveying key findings. We store all extracted data in a database, which we make freely available for download, and via a search portal, which allows users to enter structured clinical queries. Results are ranked automatically to prioritize larger and higher-quality studies. RESULTS: As of early June 2020, we have indexed 673 191 publications of RCTs, of which 22 363 were published in the first 5 months of 2020 (142 per day). We additionally include 304 111 trial registrations from the International Clinical Trials Registry Platform. The median trial sample size was 66. CONCLUSIONS: We present an automated system for finding and categorizing RCTs. This yields a novel resource: a database of structured information automatically extracted for all published RCTs in humans. We make daily updates of this database available on our website (https://trialstreamer.robotreviewer.net).

Subjects

Data Curation; Data Management; Databases, Factual; Randomized Controlled Trials as Topic; Bias; Evidence-Based Medicine; Humans; Medical Subject Headings
