Búsqueda | Portal Regional de la BVS

1.

A toolbox for surfacing health equity harms and biases in large language models.

Pfohl, Stephen R; Cole-Lewis, Heather; Sayres, Rory; Neal, Darlene; Asiedu, Mercy; Dieng, Awa; Tomasev, Nenad; Rashid, Qazi Mamunur; Azizi, Shekoofeh; Rostamzadeh, Negar; McCoy, Liam G; Celi, Leo Anthony; Liu, Yun; Schaekermann, Mike; Walton, Alanna; Parrish, Alicia; Nagpal, Chirag; Singh, Preeti; Dewitt, Akeiylah; Mansfield, Philip; Prakash, Sushant; Heller, Katherine; Karthikesalingam, Alan; Semturs, Christopher; Barral, Joelle; Corrado, Greg; Matias, Yossi; Smith-Loud, Jamila; Horn, Ivor; Singhal, Karan.

Nat Med ; 2024 Sep 23.

Artículo en Inglés | MEDLINE | ID: mdl-39313595

RESUMEN

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

2.

Quantifying urban park use in the USA at scale: empirical estimates of realised park usage using smartphone location data.

Young, Michael T; Vispute, Swapnil; Serghiou, Stylianos; Kumok, Akim; Shah, Yash; Lane, Kevin J; Black-Ingersoll, Flannery; Brochu, Paige; Bharel, Monica; Skenazy, Sarah; Karthikesalingam, Alan; Bavadekar, Shailesh; Kansal, Mansi; Shekel, Tomer; Gabrilovich, Evgeniy; Wellenius, Gregory A.

Lancet Planet Health ; 8(8): e564-e573, 2024 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-39122325

RESUMEN

BACKGROUND: A large body of evidence connects access to greenspace with substantial benefits to physical and mental health. In urban settings where access to greenspace can be limited, park access and use have been associated with higher levels of physical activity, improved physical health, and lower levels of markers of mental distress. Despite the potential health benefits of urban parks, little is known about how park usage varies across locations (between or within cities) or over time. METHODS: We estimated park usage among urban residents (identified as residents of urban census tracts) in 498 US cities from 2019 to 2021 from aggregated and anonymised opted-in smartphone location history data. We used descriptive statistics to quantify differences in park usage over time, between cities, and across census tracts within cities, and used generalised linear models to estimate the associations between park usage and census tract level descriptors. FINDINGS: In spring (March 1 to May 31) 2019, 18·9% of urban residents visited a park at least once per week, with average use higher in northwest and southwest USA, and lowest in the southeast. Park usage varied substantially both within and between cities; was unequally distributed across census tract-level markers of race, ethnicity, income, and social vulnerability; and was only moderately correlated with established markers of census tract greenspace. In spring 2019, a doubling of walking time to parks was associated with a 10·1% (95% CI 5·6-14·3) lower average weekly park usage, adjusting for city and social vulnerability index. The median decline in park usage from spring 2019 to spring 2020 was 38·0% (IQR 28·4-46·5), coincident with the onset of physical distancing policies across much of the country. We estimated that the COVID-19-related decline in park usage was more pronounced for those living further from a park and those living in areas of higher social vulnerability. INTERPRETATION: These estimates provide novel insights into the patterns and correlates of park use and could enable new studies of the health benefits of urban greenspace. In addition, the availability of an empirical park usage metric that varies over time could be a useful tool for assessing the effectiveness of policies intended to increase such activities. FUNDING: Google.

Asunto(s)

Ciudades , Parques Recreativos , Teléfono Inteligente , Parques Recreativos/estadística & datos numéricos , Estados Unidos , Humanos , Teléfono Inteligente/estadística & datos numéricos , COVID-19 , Población Urbana/estadística & datos numéricos , Recreación

3.

Generative models improve fairness of medical classifiers under distribution shifts.

Ktena, Ira; Wiles, Olivia; Albuquerque, Isabela; Rebuffi, Sylvestre-Alvise; Tanno, Ryutaro; Roy, Abhijit Guha; Azizi, Shekoofeh; Belgrave, Danielle; Kohli, Pushmeet; Cemgil, Taylan; Karthikesalingam, Alan; Gowal, Sven.

Nat Med ; 30(4): 1166-1173, 2024 Apr.

Artículo en Inglés | MEDLINE | ID: mdl-38600282

RESUMEN

Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and 'labeling' by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution.

Asunto(s)

Inteligencia Artificial , Aprendizaje Automático

4.

Infusing behavior science into large language models for activity coaching.

Hegde, Narayan; Vardhan, Madhurima; Nathani, Deepak; Rosenzweig, Emily; Speed, Cathy; Karthikesalingam, Alan; Seneviratne, Martin.

PLOS Digit Health ; 3(4): e0000431, 2024 Apr.

Artículo en Inglés | MEDLINE | ID: mdl-38564502

RESUMEN

Large language models (LLMs) have shown promise for task-oriented dialogue across a range of domains. The use of LLMs in health and fitness coaching is under-explored. Behavior science frameworks such as COM-B, which conceptualizes behavior change in terms of capability (C), Opportunity (O) and Motivation (M), can be used to architect coaching interventions in a way that promotes sustained change. Here we aim to incorporate behavior science principles into an LLM using two knowledge infusion techniques: coach message priming (where exemplar coach responses are provided as context to the LLM), and dialogue re-ranking (where the COM-B category of the LLM output is matched to the inferred user need). Simulated conversations were conducted between the primed or unprimed LLM and a member of the research team, and then evaluated by 8 human raters. Ratings for the primed conversations were significantly higher in terms of empathy and actionability. The same raters also compared a single response generated by the unprimed, primed and re-ranked models, finding a significant uplift in actionability and empathy from the re-ranking technique. This is a proof of concept of how behavior science frameworks can be infused into automated conversational agents for a more principled coaching experience.

5.

Metrics reloaded: recommendations for image analysis validation.

Maier-Hein, Lena; Reinke, Annika; Godau, Patrick; Tizabi, Minu D; Buettner, Florian; Christodoulou, Evangelia; Glocker, Ben; Isensee, Fabian; Kleesiek, Jens; Kozubek, Michal; Reyes, Mauricio; Riegler, Michael A; Wiesenfarth, Manuel; Kavur, A Emre; Sudre, Carole H; Baumgartner, Michael; Eisenmann, Matthias; Heckmann-Nötzel, Doreen; Rädsch, Tim; Acion, Laura; Antonelli, Michela; Arbel, Tal; Bakas, Spyridon; Benis, Arriel; Blaschko, Matthew B; Cardoso, M Jorge; Cheplygina, Veronika; Cimini, Beth A; Collins, Gary S; Farahani, Keyvan; Ferrer, Luciana; Galdran, Adrian; van Ginneken, Bram; Haase, Robert; Hashimoto, Daniel A; Hoffman, Michael M; Huisman, Merel; Jannin, Pierre; Kahn, Charles E; Kainmueller, Dagmar; Kainz, Bernhard; Karargyris, Alexandros; Karthikesalingam, Alan; Kofler, Florian; Kopp-Schneider, Annette; Kreshuk, Anna; Kurc, Tahsin; Landman, Bennett A; Litjens, Geert; Madani, Amin.

Nat Methods ; 21(2): 195-212, 2024 Feb.

Artículo en Inglés | MEDLINE | ID: mdl-38347141

RESUMEN

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint-a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.

Asunto(s)

Algoritmos , Procesamiento de Imagen Asistido por Computador , Aprendizaje Automático , Semántica

6.

An intentional approach to managing bias in general purpose embedding models.

Weng, Wei-Hung; Sellergen, Andrew; Kiraly, Atilla P; D'Amour, Alexander; Park, Jungyeon; Pilgrim, Rory; Pfohl, Stephen; Lau, Charles; Natarajan, Vivek; Azizi, Shekoofeh; Karthikesalingam, Alan; Cole-Lewis, Heather; Matias, Yossi; Corrado, Greg S; Webster, Dale R; Shetty, Shravya; Prabhakara, Shruthi; Eswaran, Krish; Celi, Leo A G; Liu, Yun.

Lancet Digit Health ; 6(2): e126-e130, 2024 Feb.

Artículo en Inglés | MEDLINE | ID: mdl-38278614

RESUMEN

Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components-GPPEs-from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.

Asunto(s)

Atención a la Salud , Aprendizaje Automático , Humanos , Sesgo , Algoritmos

7.

Understanding metric-related pitfalls in image analysis validation.

Reinke, Annika; Tizabi, Minu D; Baumgartner, Michael; Eisenmann, Matthias; Heckmann-Nötzel, Doreen; Kavur, A Emre; Rädsch, Tim; Sudre, Carole H; Acion, Laura; Antonelli, Michela; Arbel, Tal; Bakas, Spyridon; Benis, Arriel; Blaschko, Matthew; Buettner, Florian; Cardoso, M Jorge; Cheplygina, Veronika; Chen, Jianxu; Christodoulou, Evangelia; Cimini, Beth A; Collins, Gary S; Farahani, Keyvan; Ferrer, Luciana; Galdran, Adrian; van Ginneken, Bram; Glocker, Ben; Godau, Patrick; Haase, Robert; Hashimoto, Daniel A; Hoffman, Michael M; Huisman, Merel; Isensee, Fabian; Jannin, Pierre; Kahn, Charles E; Kainmueller, Dagmar; Kainz, Bernhard; Karargyris, Alexandros; Karthikesalingam, Alan; Kenngott, Hannes; Kleesiek, Jens; Kofler, Florian; Kooi, Thijs; Kopp-Schneider, Annette; Kozubek, Michal; Kreshuk, Anna; Kurc, Tahsin; Landman, Bennett A; Litjens, Geert; Madani, Amin; Maier-Hein, Klaus.

ArXiv ; 2024 Feb 23.

Artículo en Inglés | MEDLINE | ID: mdl-36945687

RESUMEN

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.

8.

Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians.

Dvijotham, Krishnamurthy Dj; Winkens, Jim; Barsbey, Melih; Ghaisas, Sumedh; Stanforth, Robert; Pawlowski, Nick; Strachan, Patricia; Ahmed, Zahra; Azizi, Shekoofeh; Bachrach, Yoram; Culp, Laura; Daswani, Mayank; Freyberg, Jan; Kelly, Christopher; Kiraly, Atilla; Kohlberger, Timo; McKinney, Scott; Mustafa, Basil; Natarajan, Vivek; Geras, Krzysztof; Witowski, Jan; Qin, Zhi Zhen; Creswell, Jacob; Shetty, Shravya; Sieniek, Marcin; Spitz, Terry; Corrado, Greg; Kohli, Pushmeet; Cemgil, Taylan; Karthikesalingam, Alan.

Nat Med ; 29(7): 1814-1820, 2023 07.

Artículo en Inglés | MEDLINE | ID: mdl-37460754

RESUMEN

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.

Asunto(s)

Inteligencia Artificial , Triaje , Reproducibilidad de los Resultados , Flujo de Trabajo , Humanos

9.

Detecting shortcut learning for fair medical AI using shortcut testing.

Brown, Alexander; Tomasev, Nenad; Freyberg, Jan; Liu, Yuan; Karthikesalingam, Alan; Schrouff, Jessica.

Nat Commun ; 14(1): 4314, 2023 07 18.

Artículo en Inglés | MEDLINE | ID: mdl-37463884

RESUMEN

Machine learning (ML) holds great promise for improving healthcare, but it is critical to ensure that its use will not propagate or amplify health disparities. An important step is to characterize the (un)fairness of ML models-their tendency to perform differently across subgroups of the population-and to understand its underlying mechanisms. One potential driver of algorithmic unfairness, shortcut learning, arises when ML models base predictions on improper correlations in the training data. Diagnosing this phenomenon is difficult as sensitive attributes may be causally linked with disease. Using multitask learning, we propose a method to directly test for the presence of shortcut learning in clinical ML systems and demonstrate its application to clinical tasks in radiology and dermatology. Finally, our approach reveals instances when shortcutting is not responsible for unfairness, highlighting the need for a holistic approach to fairness mitigation in medical AI.

Asunto(s)

Instituciones de Salud , Aprendizaje Automático

10.

Publisher Correction: Large language models encode clinical knowledge.

Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry; Seneviratne, Martin; Gamble, Paul; Kelly, Chris; Babiker, Abubakr; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, Philip; Demner-Fushman, Dina; Agüera Y Arcas, Blaise; Webster, Dale; Corrado, Greg S; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomasev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joelle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek.

Nature ; 620(7973): E19, 2023 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-37500979

11.

Large language models encode clinical knowledge.

Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry; Seneviratne, Martin; Gamble, Paul; Kelly, Chris; Babiker, Abubakr; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, Philip; Demner-Fushman, Dina; Agüera Y Arcas, Blaise; Webster, Dale; Corrado, Greg S; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomasev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joelle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek.

Nature ; 620(7972): 172-180, 2023 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-37438534

RESUMEN

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model1 (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM2 on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA3, MedMCQA4, PubMedQA5 and Measuring Massive Multitask Language Understanding (MMLU) clinical topics6), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Asunto(s)

Benchmarking , Simulación por Computador , Conocimiento , Medicina , Procesamiento de Lenguaje Natural , Sesgo , Competencia Clínica , Comprensión , Conjuntos de Datos como Asunto , Concesión de Licencias , Medicina/métodos , Medicina/normas , Seguridad del Paciente , Médicos

12.

Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging.

Azizi, Shekoofeh; Culp, Laura; Freyberg, Jan; Mustafa, Basil; Baur, Sebastien; Kornblith, Simon; Chen, Ting; Tomasev, Nenad; Mitrovic, Jovana; Strachan, Patricia; Mahdavi, S Sara; Wulczyn, Ellery; Babenko, Boris; Walker, Megan; Loh, Aaron; Chen, Po-Hsuan Cameron; Liu, Yuan; Bavishi, Pinal; McKinney, Scott Mayer; Winkens, Jim; Roy, Abhijit Guha; Beaver, Zach; Ryan, Fiona; Krogue, Justin; Etemadi, Mozziyar; Telang, Umesh; Liu, Yun; Peng, Lily; Corrado, Greg S; Webster, Dale R; Fleet, David; Hinton, Geoffrey; Houlsby, Neil; Karthikesalingam, Alan; Norouzi, Mohammad; Natarajan, Vivek.

Nat Biomed Eng ; 7(6): 756-779, 2023 06.

Artículo en Inglés | MEDLINE | ID: mdl-37291435

RESUMEN

Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such 'out of distribution' performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for 'Robust and Efficient Medical Imaging with Self-supervision'), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1-33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.

Asunto(s)

Aprendizaje Automático , Aprendizaje Automático Supervisado , Diagnóstico por Imagen

13.

Measure by measure: Resting heart rate across the 24-hour cycle.

Speed, Cathy; Arneil, Thomas; Harle, Robert; Wilson, Alex; Karthikesalingam, Alan; McConnell, Michael; Phillips, Justin.

PLOS Digit Health ; 2(4): e0000236, 2023 Apr.

Artículo en Inglés | MEDLINE | ID: mdl-37115739

RESUMEN

BACKGROUND: Photoplethysmography (PPG) sensors, typically found in wrist-worn devices, can continuously monitor heart rate (HR) in large populations in real-world settings. Resting heart rate (RHR) is an important biomarker of morbidities and mortality, but no universally accepted definition nor measurement criteria exist. In this study, we provide a working definition of RHR and describe a method for accurate measurement of this biomarker, recorded using PPG derived from wristband measurement across the 24-hour cycle. METHODS: 433 healthy subjects wore a wrist device that measured activity and HR for up to 3 months. HR during inactivity was recorded and the duration of inactivity needed for HR to stabilise was ascertained. We identified the lowest HR during each 24-hour cycle (true RHR) and examined the time of day or night this occurred. The variation of HR during inactivity through the 24-hour cycle was also assessed. The sample was also subdivided according to daily activity levels for subset analysis. FINDINGS: Adequate data was obtained for 19,242 days and 18,520 nights. HR stabilised in most subjects after 4 minutes of inactivity. Mean (SD) RHR for the sample was 54.5 (8.0) bpm (day) and 50.5 (7.6) bpm (night). RHR values were highest in the least active group (lowest MET quartile). A circadian variation of HR during inactivity was confirmed, with the lowest values being between 0300 and 0700 hours for most subjects. INTERPRETATION: RHR measured using a PPG-based wrist-worn device is significantly lower at night than in the day, and a circadian rhythm of HR during inactivity was confirmed. Since RHR is such an important health metric, clarity on the definition and measurement methodology used is important. For most subjects, a minimum rest time of 4 minutes provides a reliable measurement of HR during inactivity and true RHR in a 24-hour cycle is best measured between 0300 and 0700 hours. Funding: This study was funded by Google.

14.

Tackling bias in AI health datasets through the STANDING Together initiative.

Ganapathi, Shaswath; Palmer, Jo; Alderman, Joseph E; Calvert, Melanie; Espinoza, Cyrus; Gath, Jacqui; Ghassemi, Marzyeh; Heller, Katherine; Mckay, Francis; Karthikesalingam, Alan; Kuku, Stephanie; Mackintosh, Maxine; Manohar, Sinduja; Mateen, Bilal A; Matin, Rubeta; McCradden, Melissa; Oakden-Rayner, Lauren; Ordish, Johan; Pearson, Russell; Pfohl, Stephen R; Rostamzadeh, Negar; Sapey, Elizabeth; Sebire, Neil; Sounderajah, Viknesh; Summers, Charlotte; Treanor, Darren; Denniston, Alastair K; Liu, Xiaoxuan.

Nat Med ; 28(11): 2232-2233, 2022 11.

Artículo en Inglés | MEDLINE | ID: mdl-36163296

Asunto(s)

Inteligencia Artificial , Sesgo

15.

Developing Specific Reporting Standards in Artificial Intelligence Centred Research.

Sounderajah, Viknesh; Ashrafian, Hutan; Karthikesalingam, Alan; Markar, Sheraz R; Normahani, Pasha; Collins, Gary S; Bossuyt, Patrick M; Darzi, Ara.

Ann Surg ; 275(3): e547-e548, 2022 03 01.

Artículo en Inglés | MEDLINE | ID: mdl-35120063

Asunto(s)

Inteligencia Artificial , Investigación Biomédica , Proyectos de Investigación/normas

16.

Does your dermatology classifier know what it doesn't know? Detecting the long-tail of unseen conditions.

Guha Roy, Abhijit; Ren, Jie; Azizi, Shekoofeh; Loh, Aaron; Natarajan, Vivek; Mustafa, Basil; Pawlowski, Nick; Freyberg, Jan; Liu, Yuan; Beaver, Zach; Vo, Nam; Bui, Peggy; Winter, Samantha; MacWilliams, Patricia; Corrado, Greg S; Telang, Umesh; Liu, Yun; Cemgil, Taylan; Karthikesalingam, Alan; Lakshminarayanan, Balaji; Winkens, Jim.

Med Image Anal ; 75: 102274, 2022 01.

Artículo en Inglés | MEDLINE | ID: mdl-34731777

RESUMEN

Supervised deep learning models have proven to be highly effective in classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real-world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To prevent models from generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent 'outlier' conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model training, validation, and test sets. Unlike traditional OOD detection benchmarks where the task is to detect dataset distribution shift, we aim at the more challenging task of detecting subtle differences resulting from a different pathology or condition. We propose a novel hierarchical outlier detection (HOD) loss, which assigns multiple abstention classes corresponding to each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD loss based approach outperforms leading methods that leverage outlier data during training. Further, performance is significantly boosted by using recent representation learning methods (BiT, SimCLR, MICLe). Further, we explore ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. We also perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup and demonstrate the gains of our framework in comparison to baseline. Furthermore, we go beyond traditional performance metrics and introduce a cost matrix for model trust analysis to approximate downstream clinical impact. We use this cost matrix to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world scenarios.

Asunto(s)

Dermatología , Benchmarking , Humanos

17.

A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI.

Sounderajah, Viknesh; Ashrafian, Hutan; Rose, Sherri; Shah, Nigam H; Ghassemi, Marzyeh; Golub, Robert; Kahn, Charles E; Esteva, Andre; Karthikesalingam, Alan; Mateen, Bilal; Webster, Dale; Milea, Dan; Ting, Daniel; Treanor, Darren; Cushnan, Dominic; King, Dominic; McPherson, Duncan; Glocker, Ben; Greaves, Felix; Harling, Leanne; Ordish, Johan; Cohen, Jérémie F; Deeks, Jon; Leeflang, Mariska; Diamond, Matthew; McInnes, Matthew D F; McCradden, Melissa; Abràmoff, Michael D; Normahani, Pasha; Markar, Sheraz R; Chang, Stephanie; Liu, Xiaoxuan; Mallett, Susan; Shetty, Shravya; Denniston, Alastair; Collins, Gary S; Moher, David; Whiting, Penny; Bossuyt, Patrick M; Darzi, Ara.

Nat Med ; 27(10): 1663-1665, 2021 10.

Artículo en Inglés | MEDLINE | ID: mdl-34635854

Asunto(s)

Inteligencia Artificial/tendencias , Pruebas Diagnósticas de Rutina/tendencias , Humanos

18.

Clinically Applicable Segmentation of Head and Neck Anatomy for Radiotherapy: Deep Learning Algorithm Development and Validation Study.

Nikolov, Stanislav; Blackwell, Sam; Zverovitch, Alexei; Mendes, Ruheena; Livne, Michelle; De Fauw, Jeffrey; Patel, Yojan; Meyer, Clemens; Askham, Harry; Romera-Paredes, Bernadino; Kelly, Christopher; Karthikesalingam, Alan; Chu, Carlton; Carnell, Dawn; Boon, Cheng; D'Souza, Derek; Moinuddin, Syed Ali; Garie, Bethany; McQuinlan, Yasmin; Ireland, Sarah; Hampton, Kiarna; Fuller, Krystle; Montgomery, Hugh; Rees, Geraint; Suleyman, Mustafa; Back, Trevor; Hughes, Cían Owen; Ledsam, Joseph R; Ronneberger, Olaf.

J Med Internet Res ; 23(7): e26151, 2021 07 12.

Artículo en Inglés | MEDLINE | ID: mdl-34255661

RESUMEN

BACKGROUND: Over half a million individuals are diagnosed with head and neck cancer each year globally. Radiotherapy is an important curative treatment for this disease, but it requires manual time to delineate radiosensitive organs at risk. This planning process can delay treatment while also introducing interoperator variability, resulting in downstream radiation dose differences. Although auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying, and achieving expert performance remain. OBJECTIVE: Adopting a deep learning approach, we aim to demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck organs at risk commonly segmented in clinical practice. METHODS: The model was trained on a data set of 663 deidentified computed tomography scans acquired in routine clinical practice and with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus organ at risk definitions. RESULTS: We demonstrated the model's clinical applicability by assessing its performance on a test set of 21 computed tomography scans from clinical practice, each with 21 organs at risk segmented by 2 independent experts. We also introduced surface Dice similarity coefficient, a new metric for the comparison of organ delineation, to quantify the deviation between organ at risk surface contours rather than volumes, better reflecting the clinical task of correcting errors in automated organ segmentations. The model's generalizability was then demonstrated on 2 distinct open-source data sets, reflecting different centers and countries to model training. CONCLUSIONS: Deep learning is an effective and clinically applicable technique for the segmentation of the head and neck anatomy for radiotherapy. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways.

Asunto(s)

Aprendizaje Profundo , Neoplasias de Cabeza y Cuello , Algoritmos , Neoplasias de Cabeza y Cuello/diagnóstico por imagen , Neoplasias de Cabeza y Cuello/radioterapia , Humanos , Tomografía Computarizada por Rayos X

19.

Validation and Clinical Applicability of Whole-Volume Automated Segmentation of Optical Coherence Tomography in Retinal Disease Using Deep Learning.

Wilson, Marc; Chopra, Reena; Wilson, Megan Z; Cooper, Charlotte; MacWilliams, Patricia; Liu, Yun; Wulczyn, Ellery; Florea, Daniela; Hughes, Cían O; Karthikesalingam, Alan; Khalid, Hagar; Vermeirsch, Sandra; Nicholson, Luke; Keane, Pearse A; Balaskas, Konstantinos; Kelly, Christopher J.

JAMA Ophthalmol ; 139(9): 964-973, 2021 Sep 01.

Artículo en Inglés | MEDLINE | ID: mdl-34236406

RESUMEN

IMPORTANCE: Quantitative volumetric measures of retinal disease in optical coherence tomography (OCT) scans are infeasible to perform owing to the time required for manual grading. Expert-level deep learning systems for automatic OCT segmentation have recently been developed. However, the potential clinical applicability of these systems is largely unknown. OBJECTIVE: To evaluate a deep learning model for whole-volume segmentation of 4 clinically important pathological features and assess clinical applicability. DESIGN, SETTING, PARTICIPANTS: This diagnostic study used OCT data from 173 patients with a total of 15â¯558 B-scans, treated at Moorfields Eye Hospital. The data set included 2 common OCT devices and 2 macular conditions: wet age-related macular degeneration (107 scans) and diabetic macular edema (66 scans), covering the full range of severity, and from 3 points during treatment. Two expert graders performed pixel-level segmentations of intraretinal fluid, subretinal fluid, subretinal hyperreflective material, and pigment epithelial detachment, including all B-scans in each OCT volume, taking as long as 50 hours per scan. Quantitative evaluation of whole-volume model segmentations was performed. Qualitative evaluation of clinical applicability by 3 retinal experts was also conducted. Data were collected from June 1, 2012, to January 31, 2017, for set 1 and from January 1 to December 31, 2017, for set 2; graded between November 2018 and January 2020; and analyzed from February 2020 to November 2020. MAIN OUTCOMES AND MEASURES: Rating and stack ranking for clinical applicability by retinal specialists, model-grader agreement for voxelwise segmentations, and total volume evaluated using Dice similarity coefficients, Bland-Altman plots, and intraclass correlation coefficients. RESULTS: Among the 173 patients included in the analysis (92 [53%] women), qualitative assessment found that automated whole-volume segmentation ranked better than or comparable to at least 1 expert grader in 127 scans (73%; 95% CI, 66%-79%). A neutral or positive rating was given to 135 model segmentations (78%; 95% CI, 71%-84%) and 309 expert gradings (2 per scan) (89%; 95% CI, 86%-92%). The model was rated neutrally or positively in 86% to 92% of diabetic macular edema scans and 53% to 87% of age-related macular degeneration scans. Intraclass correlations ranged from 0.33 (95% CI, 0.08-0.96) to 0.96 (95% CI, 0.90-0.99). Dice similarity coefficients ranged from 0.43 (95% CI, 0.29-0.66) to 0.78 (95% CI, 0.57-0.85). CONCLUSIONS AND RELEVANCE: This deep learning-based segmentation tool provided clinically useful measures of retinal disease that would otherwise be infeasible to obtain. Qualitative evaluation was additionally important to reveal clinical applicability for both care management and research.

Asunto(s)

Aprendizaje Profundo , Retinopatía Diabética , Edema Macular , Degeneración Macular Húmeda , Retinopatía Diabética/diagnóstico por imagen , Femenino , Humanos , Edema Macular/diagnóstico por imagen , Masculino , Tomografía de Coherencia Óptica/métodos , Degeneración Macular Húmeda/diagnóstico

20.

Multitask prediction of organ dysfunction in the intensive care unit using sequential subnetwork routing.

Roy, Subhrajit; Mincu, Diana; Loreaux, Eric; Mottram, Anne; Protsyuk, Ivan; Harris, Natalie; Xue, Yuan; Schrouff, Jessica; Montgomery, Hugh; Connell, Alistair; Tomasev, Nenad; Karthikesalingam, Alan; Seneviratne, Martin.

J Am Med Inform Assoc ; 28(9): 1936-1946, 2021 08 13.

Artículo en Inglés | MEDLINE | ID: mdl-34151965

RESUMEN

OBJECTIVE: Multitask learning (MTL) using electronic health records allows concurrent prediction of multiple endpoints. MTL has shown promise in improving model performance and training efficiency; however, it often suffers from negative transfer - impaired learning if tasks are not appropriately selected. We introduce a sequential subnetwork routing (SeqSNR) architecture that uses soft parameter sharing to find related tasks and encourage cross-learning between them. MATERIALS AND METHODS: Using the MIMIC-III (Medical Information Mart for Intensive Care-III) dataset, we train deep neural network models to predict the onset of 6 endpoints including specific organ dysfunctions and general clinical outcomes: acute kidney injury, continuous renal replacement therapy, mechanical ventilation, vasoactive medications, mortality, and length of stay. We compare single-task (ST) models with naive multitask and SeqSNR in terms of discriminative performance and label efficiency. RESULTS: SeqSNR showed a modest yet statistically significant performance boost across 4 of 6 tasks compared with ST and naive multitasking. When the size of the training dataset was reduced for a given task (label efficiency), SeqSNR outperformed ST for all cases showing an average area under the precision-recall curve boost of 2.1%, 2.9%, and 2.1% for tasks using 1%, 5%, and 10% of labels, respectively. CONCLUSIONS: The SeqSNR architecture shows superior label efficiency compared with ST and naive multitasking, suggesting utility in scenarios in which endpoint labels are difficult to ascertain.

Asunto(s)

Aprendizaje Automático , Insuficiencia Multiorgánica , Registros Electrónicos de Salud , Humanos , Unidades de Cuidados Intensivos , Redes Neurales de la Computación

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

Asunto(s)

RESUMEN

Asunto(s)

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA