Results 1 - 20 of 56
1.
Nature ; 620(7972): 172-180, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37438534

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
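The multiple-choice portion of this evaluation reduces to scoring the model's chosen option against the keyed answer for each dataset. Below is a minimal Python sketch of that accuracy computation; `model_answer`, the record fields, and the prompt format are illustrative assumptions, not the paper's actual harness.

def accuracy(model_answer, questions):
    """Fraction of questions for which the model picks the keyed option."""
    correct = 0
    for q in questions:
        # Hypothetical record format: question text, lettered options, keyed answer.
        prompt = q["question"] + "\n" + "\n".join(
            f"({label}) {text}" for label, text in q["options"].items()
        )
        if model_answer(prompt) == q["answer"]:  # e.g. "A", "B", ...
            correct += 1
    return correct / len(questions)

# Usage: report per-dataset accuracy, e.g. over MedQA, MedMCQA and PubMedQA splits:
# for name, questions in datasets.items():
#     print(name, accuracy(model_answer, questions))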


Subjects
Benchmarking, Computer Simulation, Knowledge, Medicine, Natural Language Processing, Bias, Clinical Competence, Comprehension, Datasets as Topic, Licensure, Medicine/methods, Medicine/standards, Patient Safety, Physicians
2.
Nature ; 577(7788): 89-94, 2020 01.
Article in English | MEDLINE | ID: mdl-31894144

ABSTRACT

Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.
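The double-reading simulation mentioned above can be made concrete with a simple arbitration rule. The sketch below assumes one plausible protocol, in which the AI stands in as the second reader and a human second read happens only on disagreement; this illustrates the idea rather than the paper's exact procedure, and all names are invented.

import numpy as np

def simulate_double_reading(first_reader, ai_opinion, second_reader):
    """Accept the first reader's call when the AI agrees; escalate only
    disagreements to the human second reader (assumed arbitration rule)."""
    first = np.asarray(first_reader, dtype=bool)
    ai = np.asarray(ai_opinion, dtype=bool)
    second = np.asarray(second_reader, dtype=bool)
    needs_second_read = first != ai
    final = np.where(needs_second_read, second, first)
    workload_saved = 1.0 - needs_second_read.mean()  # fraction of second reads avoided
    return final, workload_saved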


Subjects
Artificial Intelligence/standards, Breast Neoplasms/diagnostic imaging, Early Detection of Cancer/methods, Early Detection of Cancer/standards, Female, Humans, Mammography/standards, Reproducibility of Results, United Kingdom, United States
4.
JAMA ; 331(3): 242-244, 2024 01 16.
Article in English | MEDLINE | ID: mdl-38227029

ABSTRACT

Importance: Interest in artificial intelligence (AI) has reached an all-time high, and health care leaders across the ecosystem are faced with questions about where, when, and how to deploy AI and how to understand its risks, problems, and possibilities. Observations: While AI as a concept has existed since the 1950s, not all AI is the same. Capabilities and risks of various kinds of AI differ markedly, and on examination 3 epochs of AI emerge. AI 1.0 includes symbolic AI, which attempts to encode human knowledge into computational rules, as well as probabilistic models. The era of AI 2.0 began with deep learning, in which models learn from examples labeled with ground truth. This era brought about many advances both in people's daily lives and in health care. Deep learning models are task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction. AI 3.0 is the era of foundation models and generative AI. Models in AI 3.0 have fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations. These models can do many different kinds of tasks without being retrained on a new dataset. For example, a simple text instruction will change the model's behavior. Prompts such as "Write this note for a specialist consultant" and "Write this note for the patient's mother" will produce markedly different content. Conclusions and Relevance: Foundation models and generative AI represent a major revolution in AI's capabilities, offering tremendous potential to improve care. Health care leaders are making decisions about AI today. While any heuristic omits details and loses nuance, the framework of AI 1.0, 2.0, and 3.0 may be helpful to decision-makers because each epoch has fundamentally different capabilities and risks.


Subjects
Artificial Intelligence, Delivery of Health Care, Humans, Artificial Intelligence/classification, Artificial Intelligence/history, Decision Making, Delivery of Health Care/history, 20th Century History, 21st Century History
5.
Radiology ; 306(1): 124-137, 2023 01.
Article in English | MEDLINE | ID: mdl-36066366

ABSTRACT

Background: The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose: To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods: A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning ("noisy-student") were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The authors studied the efficacy of the DLS compared with that of nine radiologists using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results: A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than those for all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%-80% per TB-positive patient detected. Conclusion: A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs. © RSNA, 2022. Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
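Prespecifying an operating point against the WHO sensitivity target, as described above, amounts to picking a score threshold from validation data. A generic Python sketch, assuming arrays of model scores and TB labels (not the authors' actual prespecification procedure or data):

import numpy as np

def pick_operating_point(scores, labels, target_sensitivity=0.90):
    """Highest threshold whose sensitivity still meets the target (the WHO
    target here is 90% sensitivity); specificity then follows from the data."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    for t in np.unique(scores)[::-1]:  # scan thresholds from high to low
        sensitivity = (scores[labels] >= t).mean()
        if sensitivity >= target_sensitivity:
            specificity = (scores[~labels] < t).mean()
            return t, sensitivity, specificity
    # Unreachable: at the minimum threshold every case is called positive,
    # so sensitivity reaches 1.0 and the loop always returns.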


Subjects
Deep Learning, Pulmonary Tuberculosis, Humans, Female, Middle Aged, Male, Thoracic Radiography/methods, Retrospective Studies, Radiography, Pulmonary Tuberculosis/diagnostic imaging, Radiologists, Sensitivity and Specificity
6.
Surg Endosc ; 36(12): 9215-9223, 2022 12.
Article in English | MEDLINE | ID: mdl-35941306

ABSTRACT

BACKGROUND: The potential role and benefits of AI in surgery have yet to be determined. This study is a first step in developing an AI system for minimizing adverse events and improving patient safety. We developed an Artificial Intelligence (AI) algorithm and evaluated its performance in recognizing surgical phases of laparoscopic cholecystectomy (LC) videos spanning a range of complexities. METHODS: A set of 371 LC videos with various complexity levels and containing adverse events was collected from five hospitals. Two expert surgeons segmented each video into 10 phases, including Calot's triangle dissection and clipping and cutting. For each video, adverse events (major bleeding; gallbladder perforation; major bile leakage; and incidental finding) were annotated when present, and the complexity level (on a scale of 1-5) was also recorded. The dataset was then split in an 80:20 ratio (294 and 77 videos), stratified by complexity, hospital, and adverse events, to train and test the AI model, respectively. The AI-surgeon agreement was then compared to the agreement between surgeons. RESULTS: The mean accuracy of the AI model for surgical phase recognition was 89% [95% CI 87.1%, 90.6%], comparable to the mean inter-annotator agreement of 90% [95% CI 89.4%, 90.5%]. The model's accuracy was inversely associated with procedure complexity, decreasing from 92% (complexity level 1) to 88% (complexity level 3) to 81% (complexity level 5). CONCLUSION: The AI model successfully identified surgical phases in both simple and complex LC procedures. Further validation and system training are warranted to evaluate its potential applications, such as increasing patient safety during surgery.
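The headline comparison here, model accuracy versus inter-annotator agreement, can be computed the same way for both. A brief sketch under the assumption of frame-level (or segment-level) phase labels; the study's exact accuracy definition may differ:

import numpy as np

def phase_accuracy(predicted, annotated):
    """Frame-level agreement between two phase label sequences."""
    predicted = np.asarray(predicted)
    annotated = np.asarray(annotated)
    return (predicted == annotated).mean()

# The same function applied to two surgeons' annotations gives the
# inter-annotator agreement against which the model is compared.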


Subjects
Laparoscopic Cholecystectomy, Gallbladder Diseases, Humans, Laparoscopic Cholecystectomy/methods, Artificial Intelligence, Gallbladder Diseases/surgery, Dissection
9.
Radiology ; 294(2): 421-431, 2020 02.
Article in English | MEDLINE | ID: mdl-31793848

ABSTRACT

Background: Deep learning has the potential to augment the use of chest radiography in clinical radiology, but challenges include poor generalizability, spectrum bias, and difficulty comparing across studies. Purpose: To develop and evaluate deep learning models for chest radiograph interpretation by using radiologist-adjudicated reference standards. Materials and Methods: Deep learning models were developed to detect four findings (pneumothorax, opacity, nodule or mass, and fracture) on frontal chest radiographs. This retrospective study used two data sets: data set 1 (DS1) consisted of 759 611 images from a multicity hospital network, and ChestX-ray14 is a publicly available data set with 112 120 images. Natural language processing and expert review of a subset of images provided labels for 657 954 training images. Test sets consisted of 1818 and 1962 images from DS1 and ChestX-ray14, respectively. Reference standards were defined by radiologist-adjudicated image review. Performance was evaluated by area under the receiver operating characteristic curve analysis, sensitivity, specificity, and positive predictive value. Four radiologists reviewed test set images for performance comparison. Inverse probability weighting was applied to DS1 to account for positive radiograph enrichment and estimate population-level performance. Results: In DS1, population-adjusted areas under the receiver operating characteristic curve for pneumothorax, nodule or mass, airspace opacity, and fracture were, respectively, 0.95 (95% confidence interval [CI]: 0.91, 0.99), 0.72 (95% CI: 0.66, 0.77), 0.91 (95% CI: 0.88, 0.93), and 0.86 (95% CI: 0.79, 0.92). With ChestX-ray14, areas under the receiver operating characteristic curve were 0.94 (95% CI: 0.93, 0.96), 0.91 (95% CI: 0.89, 0.93), 0.94 (95% CI: 0.93, 0.95), and 0.81 (95% CI: 0.75, 0.86), respectively. Conclusion: Expert-level models for detecting clinically relevant chest radiograph findings were developed for this study by using adjudicated reference standards and with population-level performance estimation. Radiologist-adjudicated labels for 2412 ChestX-ray14 validation set images and 1962 test set images are provided. © RSNA, 2019. Online supplemental material is available for this article. See also the editorial by Chang in this issue.
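The inverse probability weighting step in the methods corrects for the test set being enriched with positive radiographs: each image is weighted by the reciprocal of its selection probability so that estimates reflect the source population. A small illustrative sketch (variable names assumed, not the study code):

import numpy as np

def weighted_sensitivity_specificity(pred, label, sampling_prob):
    """Population-level sensitivity/specificity via inverse probability
    weighting: over-enriched findings no longer inflate the estimates."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    w = 1.0 / np.asarray(sampling_prob, dtype=float)  # weight = 1 / P(selected)
    tp = np.sum(w * (pred & label))
    fn = np.sum(w * (~pred & label))
    tn = np.sum(w * (~pred & ~label))
    fp = np.sum(w * (pred & ~label))
    return tp / (tp + fn), tn / (tn + fp)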


Subjects
Computer-Assisted Radiographic Image Interpretation/methods, Thoracic Radiography/methods, Respiratory Tract Diseases/diagnostic imaging, Thoracic Injuries/diagnostic imaging, Adolescent, Adult, Aged, Aged 80 and over, Child, Preschool Child, Deep Learning, Female, Humans, Infant, Male, Middle Aged, Pneumothorax, Radiologists, Reference Standards, Reproducibility of Results, Retrospective Studies, Sensitivity and Specificity, Young Adult
10.
Ophthalmology ; 126(12): 1627-1639, 2019 12.
Article in English | MEDLINE | ID: mdl-31561879

ABSTRACT

PURPOSE: To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with that of eye care providers. DESIGN: Development and validation of an algorithm. PARTICIPANTS: Fundus images from screening programs, studies, and a glaucoma clinic. METHODS: A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. MAIN OUTCOME MEASURES: The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. RESULTS: The algorithm's AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929-0.960) in dataset A, 0.855 (95% CI, 0.841-0.870) in dataset B, and 0.881 (95% CI, 0.838-0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. CONCLUSIONS: A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.


Subjects
Deep Learning, Open-Angle Glaucoma/diagnosis, Ophthalmologists, Optic Disk/pathology, Optic Nerve Diseases/diagnosis, Specialization, Aged, Area Under Curve, Datasets as Topic, Female, Humans, Male, Middle Aged, Nerve Fibers/pathology, ROC Curve, Referral and Consultation, Retinal Ganglion Cells/pathology, Retrospective Studies, Sensitivity and Specificity
11.
Ophthalmology ; 126(4): 552-564, 2019 04.
Article in English | MEDLINE | ID: mdl-30553900

ABSTRACT

PURPOSE: To understand the impact of deep learning diabetic retinopathy (DR) algorithms on physician readers in computer-assisted settings. DESIGN: Evaluation of diagnostic technology. PARTICIPANTS: One thousand seven hundred ninety-six retinal fundus images from 1612 diabetic patients. METHODS: Ten ophthalmologists (5 general ophthalmologists, 4 retina specialists, 1 retina fellow) read images for DR severity based on the International Clinical Diabetic Retinopathy disease severity scale in each of 3 conditions: unassisted, grades only, or grades plus heatmap. Grades-only assistance comprised a histogram of DR predictions (grades) from a trained deep-learning model. For grades plus heatmap, we additionally showed explanatory heatmaps. MAIN OUTCOME MEASURES: For each experiment arm, we computed sensitivity and specificity of each reader and the algorithm for different levels of DR severity against an adjudicated reference standard. We also measured accuracy (exact 5-class level agreement and Cohen's quadratically weighted κ), reader-reported confidence (5-point Likert scale), and grading time. RESULTS: Readers graded more accurately with model assistance than without for the grades-only condition (P < 0.001). Grades plus heatmaps improved accuracy for patients with DR (P < 0.001), but reduced accuracy for patients without DR (P = 0.006). Both forms of assistance increased readers' sensitivity for moderate-or-worse DR (unassisted: mean, 79.4% [95% confidence interval (CI), 72.3%-86.5%]; grades only: mean, 87.5% [95% CI, 85.1%-89.9%]; grades plus heatmap: mean, 88.7% [95% CI, 84.9%-92.5%]) without a corresponding drop in specificity (unassisted: mean, 96.6% [95% CI, 95.9%-97.4%]; grades only: mean, 96.1% [95% CI, 95.5%-96.7%]; grades plus heatmap: mean, 95.5% [95% CI, 94.8%-96.1%]). Algorithmic assistance increased the accuracy of retina specialists above that of the unassisted reader or model alone, and increased grading confidence and grading time across all readers. For most cases, grades plus heatmap was only as effective as grades only. Over the course of the experiment, grading time decreased across all conditions, although most sharply for grades plus heatmap. CONCLUSIONS: Deep learning algorithms can improve the accuracy of, and confidence in, DR diagnosis in an assisted read setting. They also may increase grading time, although these effects may be ameliorated with experience.
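Cohen's quadratically weighted kappa, the agreement measure used above, penalizes disagreements by the squared distance between ordinal grades. A toy illustration with scikit-learn; the grade vectors are made up, not study data:

from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-level DR grades (0-4) from a reader and an adjudicated standard.
reader_grades    = [0, 1, 2, 2, 4, 3, 0, 1]
reference_grades = [0, 1, 2, 3, 4, 3, 0, 0]
kappa = cohen_kappa_score(reader_grades, reference_grades, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")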


Subjects
Algorithms, Deep Learning, Diabetic Retinopathy/classification, Diabetic Retinopathy/diagnosis, Computer-Assisted Diagnosis/methods, Female, Humans, Male, Ophthalmologists/standards, Photography/methods, ROC Curve, Reference Standards, Reproducibility of Results, Sensitivity and Specificity
12.
Ophthalmology ; 125(8): 1264-1272, 2018 08.
Article in English | MEDLINE | ID: mdl-29548646

ABSTRACT

PURPOSE: To use adjudication to quantify errors in diabetic retinopathy (DR) grading based on individual graders and majority decision, and to train an improved automated algorithm for DR grading. DESIGN: Retrospective analysis. PARTICIPANTS: Retinal fundus images from DR screening programs. METHODS: Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard. MAIN OUTCOME MEASURES: For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity. RESULTS: Of the 193 discrepancies between adjudication by retinal specialists and majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa ranged from 0.82 to 0.91 for individual retinal specialists and from 0.80 to 0.84 for ophthalmologists, and was 0.84 for the algorithm. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR. CONCLUSIONS: Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.


Subjects
Algorithms, Clinical Competence/standards, Diabetic Retinopathy/diagnosis, Machine Learning, Mass Screening/standards, Ophthalmologists/standards, Female, Humans, Male, Middle Aged, ROC Curve, Reference Standards, Retrospective Studies
13.
EBioMedicine ; 102: 105075, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38565004

ABSTRACT

BACKGROUND: AI models have shown promise in performing many medical imaging tasks. However, our ability to explain what signals these models have learned is severely lacking. Explanations are needed in order to increase the trust of doctors in AI-based models, especially in domains where AI prediction capabilities surpass those of humans. Moreover, such explanations could enable novel scientific discovery by uncovering signals in the data that are not yet known to experts. METHODS: In this paper, we present a workflow for generating hypotheses to understand which visual signals in images are correlated with a classification model's predictions for a given task. This approach leverages an automatic visual explanation algorithm followed by interdisciplinary expert review. We propose the following 4 steps: (i) Train a classifier to perform a given task to assess whether the imagery indeed contains signals relevant to the task; (ii) Train a StyleGAN-based image generator with an architecture that enables guidance by the classifier ("StylEx"); (iii) Automatically detect, extract, and visualize the top visual attributes that the classifier is sensitive towards. For visualization, we independently modify each of these attributes to generate counterfactual visualizations for a set of images (i.e., what the image would look like with the attribute increased or decreased); (iv) Formulate hypotheses for the underlying mechanisms, to stimulate future research. Specifically, we present the discovered attributes and corresponding counterfactual visualizations to an interdisciplinary panel of experts so that hypotheses can account for social and structural determinants of health (e.g., whether the attributes correspond to known patho-physiological or socio-cultural phenomena, or could be novel discoveries). FINDINGS: To demonstrate the broad applicability of our approach, we present results on eight prediction tasks across three medical imaging modalities: retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples where many of the automatically learned attributes clearly capture clinically known features (e.g., types of cataract, enlarged heart), and demonstrate automatically learned confounders that arise from factors beyond physiological mechanisms (e.g., chest X-ray underexposure is correlated with the classifier predicting abnormality, and eye makeup is correlated with the classifier predicting low hemoglobin levels). We further show that our method reveals a number of physiologically plausible attributes that, based on the literature, were previously unknown (e.g., differences in the fundus associated with self-reported sex). INTERPRETATION: Our approach enables hypothesis generation via attribute visualizations and has the potential to enable researchers to better understand, improve their assessment of, and extract new knowledge from AI-based models, as well as debug and design better datasets. Though not designed to infer causality, importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real-world nature of healthcare delivery and socio-cultural factors, and hence interdisciplinary perspectives are critical in these investigations. Finally, we will release code to help researchers train their own StylEx models, analyze their predictive tasks of interest, and use the methodology presented in this paper for responsible interpretation of the revealed attributes. FUNDING: Google.
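Step (iii) above, generating counterfactuals by independently varying one attribute, has a compact core: move a latent code along an attribute direction and re-render. A schematic sketch; `generator`, `latent`, and `direction` are placeholders rather than the released StylEx interfaces:

def counterfactual_sweep(generator, latent, direction, strengths=(-2, -1, 0, 1, 2)):
    """Render a counterfactual series by varying a single attribute direction
    in a StyleGAN-style latent space (all arguments are hypothetical)."""
    return [generator(latent + s * direction) for s in strengths]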


Subjects
Algorithms, Cataract, Humans, Cardiomegaly, Fundus Oculi, Artificial Intelligence
14.
PLOS Glob Public Health ; 4(6): e0003204, 2024.
Article in English | MEDLINE | ID: mdl-38833495

ABSTRACT

Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential to use photoplethysmography (PPG), a sensing technology available on most smartphones that can potentially enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compare the DLS with the office-based refit-WHO score, which adopts the shared predictors from the WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but is refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically separate test dataset (54,856 participants), both from UKB. The DLS's C-statistic (71.1%, 95% CI 69.9-72.4) is non-inferior to that of the office-based refit-WHO score (70.9%, 95% CI 69.7-72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS is satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increases the C-statistic by 1.0% (95% CI 0.6-1.4). The DLS predicts ten-year MACE risk comparably with the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate. Our study provides a proof of concept and suggests the potential of PPG-based strategies for community-based primary prevention in resource-limited regions.
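The two headline metrics here, the C-statistic and mean absolute calibration error, are straightforward to compute for a binary ten-year outcome. A sketch under the simplifying assumption of fully observed (uncensored) outcomes, which the study itself may handle differently:

import numpy as np
from sklearn.metrics import roc_auc_score

def c_statistic(risk, event):
    """For a binary outcome the C-statistic reduces to the AUC: the
    probability that a random case outranks a random non-case."""
    return roc_auc_score(event, risk)

def mean_absolute_calibration_error(risk, event, n_bins=10):
    """Average |predicted risk - observed event rate| over equal-width risk bins."""
    risk = np.asarray(risk, dtype=float)
    event = np.asarray(event, dtype=float)
    bins = np.minimum((risk * n_bins).astype(int), n_bins - 1)
    errs = [abs(risk[bins == b].mean() - event[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)]
    return float(np.mean(errs))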

15.
Lancet Digit Health ; 6(2): e126-e130, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38278614

ABSTRACT

Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard for both algorithms and people to pinpoint. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components (GPPEs) from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.


Subjects
Delivery of Health Care, Machine Learning, Humans, Bias, Algorithms
16.
Radiol Artif Intell ; 6(3): e230079, 2024 05.
Article in English | MEDLINE | ID: mdl-38477661

ABSTRACT

Purpose: To evaluate the impact of an artificial intelligence (AI) assistant for lung cancer screening on multinational clinical workflows. Materials and Methods: An AI assistant for lung cancer screening was evaluated in two retrospective randomized multireader, multicase studies in which 627 low-dose chest CT cases (141 cancer positive) were each read twice (with and without AI assistance) by experienced thoracic radiologists (six U.S.-based or six Japan-based radiologists), resulting in a total of 7524 interpretations. Positive cases were defined as those within 2 years before a pathology-confirmed lung cancer diagnosis. Negative cases were defined as those without any subsequent cancer diagnosis for at least 2 years and were enriched for a spectrum of diverse nodules. The studies measured the readers' level of suspicion (on a 0-100 scale), country-specific screening system scoring categories, and management recommendations. Evaluation metrics included the area under the receiver operating characteristic curve (AUC) for level of suspicion and the sensitivity and specificity of recall recommendations. Results: With AI assistance, the radiologists' AUC increased by 0.023 (0.70 to 0.72; P = .02) for the U.S. study and by 0.023 (0.93 to 0.96; P = .18) for the Japan study. Scoring system specificity for actionable findings increased 5.5% (57% to 63%; P < .001) for the U.S. study and 6.7% (23% to 30%; P < .001) for the Japan study. There was no evidence of a difference in corresponding sensitivity between unassisted and AI-assisted reads for the U.S. (67.3% to 67.5%; P = .88) and Japan (98% to 100%; P > .99) studies. Corresponding stand-alone AI AUC system performance was 0.75 (95% CI: 0.70, 0.81) and 0.88 (95% CI: 0.78, 0.97) for the U.S.- and Japan-based datasets, respectively. Conclusion: The concurrent AI interface improved lung cancer screening specificity in both the U.S.- and Japan-based reader studies, meriting further study in additional international screening environments. Keywords: Assistive Artificial Intelligence, Lung Cancer Screening, CT. Supplemental material is available for this article. Published under a CC BY 4.0 license.


Subjects
Artificial Intelligence, Early Detection of Cancer, Lung Neoplasms, X-Ray Computed Tomography, Humans, Lung Neoplasms/diagnosis, Lung Neoplasms/epidemiology, Japan, United States/epidemiology, Retrospective Studies, Early Detection of Cancer/methods, Female, Male, Middle Aged, Aged, Sensitivity and Specificity, Computer-Assisted Radiographic Image Interpretation/methods
17.
EClinicalMedicine ; 70: 102479, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38685924

ABSTRACT

Background: Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods: Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., "R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings: Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0%, respectively, for prioritising AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance of age subpopulations based on DALYs. Interpretation: Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. Funding: Google LLC.
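As defined above, the HEAL metric is the bootstrap probability that the negated Spearman correlation between subgroup AI performance and subgroup health burden is positive. A compact sketch of that computation, assuming per-subgroup inputs (the published analysis bootstraps over the underlying cases and handles edge cases more carefully):

import numpy as np
from scipy.stats import spearmanr

def heal_metric(ai_performance, health_burden, n_boot=1000, seed=0):
    """p(R > 0), where R is the negated Spearman correlation between subgroup
    AI performance and subgroup health outcomes (e.g., DALYs or YLLs).
    Degenerate resamples with constant values yield NaN and count as misses."""
    perf = np.asarray(ai_performance, dtype=float)
    burden = np.asarray(health_burden, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(perf)
    hits = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample with replacement
        r = -spearmanr(perf[idx], burden[idx]).correlation
        hits += r > 0
    return hits / n_boot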

18.
Transl Vis Sci Technol ; 12(12): 11, 2023 12 01.
Article in English | MEDLINE | ID: mdl-38079169

ABSTRACT

Purpose: Real-world evaluation of a deep learning model that prioritizes patients based on risk of progression to moderate or worse (MOD+) diabetic retinopathy (DR). Methods: This nonrandomized, single-arm, prospective, interventional study included patients attending DR screening at four centers across Thailand from September 2019 to January 2020, with mild or no DR. Fundus photographs were input into the model, and patients were scheduled for their subsequent screening from September 2020 to January 2021 in order of predicted risk. Evaluation focused on model sensitivity, defined as correctly ranking patients who developed MOD+ within the first 50% of subsequent screens. Results: We analyzed 1,757 patients, of whom 52 (3.0%) developed MOD+. Using the model-proposed order, the model's sensitivity was 90.4%. Both the model-proposed order and mild/no DR plus HbA1c had significantly higher sensitivity than the random order (P < 0.001). Excluding one major (rural) site that had practical implementation challenges, the remaining sites included 567 patients, of whom 15 (2.6%) developed MOD+. Here, the model-proposed order achieved 86.7% versus 73.3% for the ranking that used DR grade and hemoglobin A1c. Conclusions: The model can help prioritize follow-up visits for the largest subgroups of DR patients (those with no or mild DR). Further research is needed to evaluate the impact on clinical management and outcomes. Translational Relevance: Deep learning demonstrated potential for risk stratification in DR screening. However, real-world practicalities must be resolved to fully realize the benefit.
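The sensitivity definition used here, the share of progressors ranked into the first half of screens, can be written directly. A short sketch with illustrative variable names:

import numpy as np

def ranking_sensitivity(predicted_risk, progressed, fraction=0.5):
    """Share of patients who developed MOD+ DR that fall within the first
    `fraction` of screens when visits are ordered by predicted risk."""
    order = np.argsort(-np.asarray(predicted_risk, dtype=float))  # highest risk first
    progressed = np.asarray(progressed, dtype=bool)[order]
    first_half = progressed[: int(np.ceil(fraction * len(progressed)))]
    return first_half.sum() / progressed.sum()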


Subjects
Deep Learning, Diabetes Mellitus, Diabetic Retinopathy, Humans, Diabetic Retinopathy/diagnosis, Diabetic Retinopathy/epidemiology, Prospective Studies, Glycated Hemoglobin, Risk Assessment
19.
Lancet Digit Health ; 5(5): e257-e264, 2023 05.
Article in English | MEDLINE | ID: mdl-36966118

ABSTRACT

BACKGROUND: Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions. METHODS: We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes). FINDINGS: Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36·0 U/L, calcium <8·6 mg/dL, eGFR <60·0 mL/min/1·73 m2, haemoglobin <11·0 g/dL, platelets <150·0 × 103/µL, ACR ≥300 mg/g, and WBC <4·0 × 103/µL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5·3-19·9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300·0 mg/g and haemoglobin <11·0 g/dL by 7·3-13·2%. INTERPRETATION: We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications. FUNDING: Google.


Subjects
Deep Learning, Diabetic Retinopathy, Humans, Retrospective Studies, Calcium, Diabetic Retinopathy/diagnosis, Biomarkers, Albumins
20.
JAMA Netw Open ; 6(3): e2254891, 2023 03 01.
Article in English | MEDLINE | ID: mdl-36917112

ABSTRACT

Importance: Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning-derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists. Objective: To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer. Design, Setting, and Participants: This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort. Main Outcomes and Measures: Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated. Results: A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80). Conclusions and Relevance: In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
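The stage-adjusted hazard ratios reported above come from standard Cox proportional hazards regression. A hedged sketch using the lifelines library with invented toy data; the column names and values are illustrative, not the study's dataset:

import pandas as pd
from lifelines import CoxPHFitter

# Toy cohort: follow-up time, death indicator, TAF presence, stage III vs II.
df = pd.DataFrame({
    "months": [24, 60, 48, 12, 36, 55, 30, 18],
    "death":  [1,  0,  0,  1,  1,  0,  0,  1],
    "taf":    [1,  0,  1,  1,  0,  0,  1,  0],
    "stage3": [0,  0,  1,  1,  1,  0,  0,  1],
})
cph = CoxPHFitter().fit(df, duration_col="months", event_col="death")
print(cph.summary[["exp(coef)", "p"]])  # hazard ratios for taf and stage3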


Subjects
Adenocarcinoma, Colonic Neoplasms, Male, Humans, Aged, Colonic Neoplasms/diagnosis, Pathologists, Artificial Intelligence, Machine Learning, Risk Assessment