ABSTRACT
PURPOSE: To understand the impact of deep learning diabetic retinopathy (DR) algorithms on physician readers in computer-assisted settings. DESIGN: Evaluation of diagnostic technology. PARTICIPANTS: One thousand seven hundred ninety-six retinal fundus images from 1612 diabetic patients. METHODS: Ten ophthalmologists (5 general ophthalmologists, 4 retina specialists, 1 retina fellow) read images for DR severity based on the International Clinical Diabetic Retinopathy disease severity scale in each of 3 conditions: unassisted, grades only, or grades plus heatmap. Grades-only assistance comprised a histogram of DR predictions (grades) from a trained deep-learning model. For grades plus heatmap, we additionally showed explanatory heatmaps. MAIN OUTCOME MEASURES: For each experiment arm, we computed sensitivity and specificity of each reader and the algorithm for different levels of DR severity against an adjudicated reference standard. We also measured accuracy (exact 5-class level agreement and Cohen's quadratically weighted κ), reader-reported confidence (5-point Likert scale), and grading time. RESULTS: Readers graded more accurately with model assistance than without for the grades-only condition (P < 0.001). Grades plus heatmaps improved accuracy for patients with DR (P < 0.001), but reduced accuracy for patients without DR (P = 0.006). Both forms of assistance increased readers' sensitivity for moderate-or-worse DR (unassisted: mean, 79.4% [95% confidence interval (CI), 72.3%-86.5%]; grades only: mean, 87.5% [95% CI, 85.1%-89.9%]; grades plus heatmap: mean, 88.7% [95% CI, 84.9%-92.5%]) without a corresponding drop in specificity (unassisted: mean, 96.6% [95% CI, 95.9%-97.4%]; grades only: mean, 96.1% [95% CI, 95.5%-96.7%]; grades plus heatmap: mean, 95.5% [95% CI, 94.8%-96.1%]). Algorithmic assistance increased the accuracy of retina specialists above that of the unassisted reader or model alone, and increased grading confidence and grading time across all readers. 
For most cases, grades plus heatmap was only as effective as grades only. Over the course of the experiment, grading time decreased across all conditions, although most sharply for grades plus heatmap. CONCLUSIONS: Deep learning algorithms can improve the accuracy of, and confidence in, DR diagnosis in an assisted read setting. They also may increase grading time, although these effects may be ameliorated with experience.
Subjects
Algorithms , Deep Learning , Diabetic Retinopathy/classification , Diabetic Retinopathy/diagnosis , Diagnosis, Computer-Assisted/methods , Female , Humans , Male , Ophthalmologists/standards , Photography/methods , ROC Curve , Reference Standards , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
PURPOSE: To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. DESIGN: Development and validation of an algorithm. PARTICIPANTS: Fundus images from screening programs, studies, and a glaucoma clinic. METHODS: A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. MAIN OUTCOME MEASURES: The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. RESULTS: The algorithm's AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929-0.960) in dataset A, 0.855 (95% CI, 0.841-0.870) in dataset B, and 0.881 (95% CI, 0.838-0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. 
For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. CONCLUSIONS: A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
Subjects
Deep Learning , Glaucoma, Open-Angle/diagnosis , Ophthalmologists , Optic Disk/pathology , Optic Nerve Diseases/diagnosis , Specialization , Aged , Area Under Curve , Datasets as Topic , Female , Humans , Male , Middle Aged , Nerve Fibers/pathology , ROC Curve , Referral and Consultation , Retinal Ganglion Cells/pathology , Retrospective Studies , Sensitivity and Specificity
ABSTRACT
BACKGROUND/AIMS: Deep learning systems (DLSs) for diabetic retinopathy (DR) detection show promising results but can underperform in racial and ethnic minority groups, therefore external validation within these populations is critical for health equity. This study evaluates the performance of a DLS for DR detection among Indigenous Australians, an understudied ethnic group who suffer disproportionately from DR-related blindness. METHODS: We performed a retrospective external validation study comparing the performance of a DLS against a retinal specialist for the detection of more-than-mild DR (mtmDR), vision-threatening DR (vtDR) and all-cause referable DR. The validation set consisted of 1682 consecutive, single-field, macula-centred retinal photographs from 864 patients with diabetes (mean age 54.9 years, 52.4% women) at an Indigenous primary care service in Perth, Australia. Three-person adjudication by a panel of specialists served as the reference standard. RESULTS: For mtmDR detection, sensitivity of the DLS was superior to the retina specialist (98.0% (95% CI, 96.5 to 99.4) vs 87.1% (95% CI, 83.6 to 90.6), McNemar's test p<0.001) with a small reduction in specificity (95.1% (95% CI, 93.6 to 96.4) vs 97.0% (95% CI, 95.9 to 98.0), p=0.006). For vtDR, the DLS's sensitivity was again superior to the human grader (96.2% (95% CI, 93.4 to 98.6) vs 84.4% (95% CI, 79.7 to 89.2), p<0.001) with a slight drop in specificity (95.8% (95% CI, 94.6 to 96.9) vs 97.8% (95% CI, 96.9 to 98.6), p=0.002). For all-cause referable DR, there was a substantial increase in sensitivity (93.7% (95% CI, 91.8 to 95.5) vs 74.4% (95% CI, 71.1 to 77.5), p<0.001) and a smaller reduction in specificity (91.7% (95% CI, 90.0 to 93.3) vs 96.3% (95% CI, 95.2 to 97.4), p<0.001). CONCLUSION: The DLS showed improved sensitivity and similar specificity compared with a retina specialist for DR detection. 
This demonstrates its potential to support DR screening among Indigenous Australians, an underserved population with a high burden of diabetic eye disease.
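The paired sensitivity and specificity comparisons above rely on McNemar's test, which looks only at the cases where the two graders disagree. The following is a minimal sketch of the exact (binomial) form of that test, not the study's analysis code; the function name `mcnemar_exact_p` and the discordant counts in the usage line are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases graded correctly by grader A only (hypothetical input)
    c: cases graded correctly by grader B only (hypothetical input)
    Under the null hypothesis, each discordant pair is equally likely
    to favour either grader, so min(b, c) follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    k = min(b, c)
    # Probability of a split at least as extreme as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test and cap at 1.
    return min(1.0, 2.0 * tail)


# Hypothetical example: 10 discordant cases, all favouring one grader.
p_value = mcnemar_exact_p(10, 0)
```

Concordant cases (both graders right or both wrong) do not enter the calculation, which is why the test suits designs where the same images are read by both the algorithm and a specialist.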
Subjects
Australasian People , Deep Learning , Diabetes Mellitus , Diabetic Retinopathy , Female , Humans , Male , Middle Aged , Australia , Diabetic Retinopathy/diagnosis , Diabetic Retinopathy/epidemiology , Ethnicity , Minority Groups , Retrospective Studies , Australian Aboriginal and Torres Strait Islander Peoples
ABSTRACT
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
ABSTRACT
BACKGROUND: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screenings in these settings, yet prospective studies assessing their usability and performance are scarce. METHODS: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes and listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) was measured against an adjudicated reference standard, provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002. FINDINGS: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion. 
7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0-96·2), sensitivity of 91·4% (87·1-95·0), and specificity of 95·4% (94·1-96·7). The retina specialist over-readers had an accuracy of 93·5% (91·7-95·0; p=0·17), a sensitivity of 84·8% (79·4-90·0; p=0·024), and specificity of 95·5% (94·1-96·7; p=0·98). The PPV for the deep-learning system was 79·2% (95% CI 73·8-84·3) compared with 75·6% (69·8-81·1) for the over-readers. The NPV for the deep-learning system was 95·5% (92·8-97·9) compared with 92·4% (89·3-95·5) for the over-readers. INTERPRETATION: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs. FUNDING: Google and Rajavithi Hospital, Bangkok, Thailand. TRANSLATION: For the Thai translation of the abstract see Supplementary Materials section.
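The five performance measures reported above (accuracy, sensitivity, specificity, PPV, NPV) all derive from one 2x2 table of predictions against the reference standard. A minimal sketch of those definitions follows; the function name `diagnostic_metrics` and the counts in the usage line are hypothetical illustrations, not data from the study.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard screening metrics from a 2x2 confusion table.

    tp/fn: diseased cases flagged / missed by the grader
    fp/tn: healthy cases flagged / correctly cleared by the grader
    """
    if min(tp + fn, fp + tn, tp + fp, tn + fn) == 0:
        raise ValueError("each row and column of the table must be non-empty")
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of diseased cases detected
        "specificity": tn / (tn + fp),  # share of healthy cases cleared
        "ppv": tp / (tp + fp),          # share of positive calls that are correct
        "npv": tn / (tn + fn),          # share of negative calls that are correct
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=90, fp=20, fn=10, tn=180)
```

Sensitivity and specificity are properties of the grader, whereas PPV and NPV also depend on disease prevalence in the screened population, so the latter two do not transfer directly between screening programmes.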
Subjects
Deep Learning , Diabetes Mellitus , Diabetic Retinopathy , Macular Edema , Cohort Studies , Diabetic Retinopathy/diagnosis , Humans , Macular Edema/diagnosis , Prospective Studies , Thailand
ABSTRACT
Importance: Most dermatologic cases are initially evaluated by nondermatologists such as primary care physicians (PCPs) or nurse practitioners (NPs). Objective: To evaluate an artificial intelligence (AI)-based tool that assists with diagnoses of dermatologic conditions. Design, Setting, and Participants: This multiple-reader, multiple-case diagnostic study developed an AI-based tool and evaluated its utility. Primary care physicians and NPs retrospectively reviewed an enriched set of cases representing 120 different skin conditions. Randomization was used to ensure each clinician reviewed each case either with or without AI assistance; each clinician alternated between batches of 50 cases in each modality. The reviews occurred from February 21 to April 28, 2020. Data were analyzed from May 26, 2020, to January 27, 2021. Exposures: An AI-based assistive tool for interpreting clinical images and associated medical history. Main Outcomes and Measures: The primary analysis evaluated agreement with reference diagnoses provided by a panel of 3 dermatologists for PCPs and NPs. Secondary analyses included diagnostic accuracy for biopsy-confirmed cases, biopsy and referral rates, review time, and diagnostic confidence. Results: Forty board-certified clinicians, including 20 PCPs (14 women [70.0%]; mean experience, 11.3 [range, 2-32] years) and 20 NPs (18 women [90.0%]; mean experience, 13.1 [range, 2-34] years) reviewed 1048 retrospective cases (672 female [64.2%]; median age, 43 [interquartile range, 30-56] years; 41 920 total reviews) from a teledermatology practice serving 11 sites and provided 0 to 5 differential diagnoses per case (mean [SD], 1.6 [0.7]). The PCPs were located across 12 states, and the NPs practiced in primary care without physician supervision across 9 states. 
Artificial intelligence assistance was significantly associated with higher agreement with reference diagnoses. For PCPs, the increase in diagnostic agreement was 10% (95% CI, 8%-11%; P < .001), from 48% to 58%; for NPs, the increase was 12% (95% CI, 10%-14%; P < .001), from 46% to 58%. In secondary analyses, agreement with biopsy-obtained diagnosis categories of malignant, precancerous, or benign increased by 3% (95% CI, -1% to 7%) for PCPs and by 8% (95% CI, 3%-13%) for NPs. Rates of desire for biopsies decreased by 1% (95% CI, 0-3%) for PCPs and 2% (95% CI, 1%-3%) for NPs; the rate of desire for referrals decreased by 3% (95% CI, 1%-4%) for PCPs and NPs. Diagnostic agreement on cases not indicated for a dermatologist referral increased by 10% (95% CI, 8%-12%) for PCPs and 12% (95% CI, 10%-14%) for NPs, and median review time increased slightly by 5 (95% CI, 0-8) seconds for PCPs and 7 (95% CI, 5-10) seconds for NPs per case. Conclusions and Relevance: Artificial intelligence assistance was associated with improved diagnoses by PCPs and NPs for 1 in every 8 to 10 cases, indicating potential for improving the quality of dermatologic care.
Subjects
Artificial Intelligence , Diagnosis, Computer-Assisted , Nurse Practitioners , Physicians, Primary Care , Skin Diseases/diagnosis , Adult , Dermatology , Female , Humans , Male , Middle Aged , Referral and Consultation , Telemedicine
ABSTRACT
Repeating object images produces stimulus-specific repetition suppression referred to as functional magnetic resonance imaging-adaptation (fMRI-A) in ventral temporal cortex (VTC). However, the effects of stimulus repetition on functional selectivity are largely unknown. We investigated the effects of short-lagged (SL, immediate) and long-lagged (LL, many intervening stimuli) repetitions on category selectivity in VTC using high-resolution fMRI. We asked whether repetition produces scaling or sharpening of fMRI responses both within category-selective regions as well as in the distributed response pattern across VTC. Results illustrate that repetition effects across time scales vary quantitatively along an anterior-posterior axis and qualitatively along a lateral-medial axis. In lateral VTC, both SL and LL repetitions produce proportional fMRI-A with no change in either selectivity or distributed responses as predicted by a scaling model. Further, there is larger fMRI-A in anterior subregions irrespective of category selectivity. Medial VTC exhibits similar scaling effects during SL repetitions. However, for LL repetitions, both the selectivity and distributed pattern of responses vary with category in medial VTC as predicted by a sharpening model. Specifically, there is larger fMRI-A for nonpreferred categories compared with the preferred category, and category selectivity does not predict fMRI-A across the pattern of distributed response. Finally, simulations indicate that different neural mechanisms likely underlie fMRI-A in medial compared to lateral VTC. These results have important implications for future fMRI-A experiments because they suggest that fMRI-A does not reflect a universal neural mechanism and that results of fMRI-A experiments will likely be paradigm independent in lateral VTC but paradigm dependent in medial VTC.
Subjects
Adaptation, Physiological/physiology , Brain Mapping , Magnetic Resonance Imaging , Pattern Recognition, Visual/physiology , Temporal Lobe/blood supply , Temporal Lobe/physiology , Adult , Analysis of Variance , Computer Simulation , Extremities , Face , Female , Humans , Image Processing, Computer-Assisted , Male , Models, Neurological , Oxygen/blood , Photic Stimulation/methods , Time Factors , Visual Pathways/blood supply , Visual Pathways/physiology , Young Adult
ABSTRACT
While the fourth human visual field map (hV4) has been studied for two decades, there remain uncertainties about its spatial organization. In analyzing fMRI measurements designed to resolve these issues, we discovered a significant problem that afflicts measurements from ventral occipital cortex, and particularly measurements near hV4. In most hemispheres the fMRI hV4 data are contaminated by artifacts from the transverse sinus (TS). We created a model of the TS artifact and showed that the model predicts the locations of anomalous fMRI responses to simple large-field on-off stimuli. In many subjects, and particularly the left hemisphere, the TS artifact masks fMRI responses specifically in the region of cortex that distinguishes the two main hV4 models. By selecting subjects with a TS displaced from the lateral edge of hV4, we were able to see around the vein. In these subjects, the visual field coverage extends to the lower meridian, or nearly so, consistent with a model in which hV4 is located on the ventral surface and responds to signals throughout the full contralateral hemifield.
Subjects
Brain Mapping/methods , Magnetic Resonance Imaging/methods , Transverse Sinuses/anatomy & histology , Visual Cortex/anatomy & histology , Visual Fields/physiology , Visual Pathways/anatomy & histology , Humans , Image Processing, Computer-Assisted , Photic Stimulation , Reference Values , Visual Cortex/physiology
ABSTRACT
A region in ventral human cortex (fusiform face area, FFA) thought to be important for face perception responds strongly to faces and less strongly to nonface objects. This pattern of response may reflect a uniform face-selective neural population or activity averaged across populations with heterogeneous selectivity. Using high-resolution functional magnetic resonance imaging (fMRI), we found that the FFA has a reliable heterogeneous structure: localized subregions within the FFA highly selective to faces are spatially interdigitated with localized subregions highly selective to different object categories. We found a preponderance of face-selective responses in the FFA, but no difference in selectivity to faces compared to nonfaces. Thus, standard fMRI of the FFA reflects averaging of heterogeneous highly selective neural populations of differing sizes, rather than higher selectivity to faces. These results suggest that visual processing in this region is not exclusive to faces. Overall, our approach provides a framework for understanding the fine-scale structure of neural representations in the human brain.
Subjects
Cerebral Cortex/physiology , Face , Magnetic Resonance Imaging/methods , Visual Perception/physiology , Adult , Animals , Cluster Analysis , Female , Form Perception/physiology , Humans , Male , Pattern Recognition, Visual/physiology , Photic Stimulation , Reproducibility of Results , Visual Pathways/physiology
ABSTRACT
OBJECTIVE: To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. METHODS: We randomly selected patients with diabetes screened twice, two years apart within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient's color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. RESULTS: There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p = 0.008; HG: from 74% to 57%, p < 0.001). At both the first and second screenings, the rate of false negatives for the DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). CONCLUSION: On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
Subjects
Deep Learning , Diabetic Retinopathy/diagnostic imaging , Fundus Oculi , Image Interpretation, Computer-Assisted , Macular Edema/diagnostic imaging , Mass Screening , Photography , Aged , Cell Proliferation , Diabetic Retinopathy/epidemiology , Female , Humans , Incidence , Longitudinal Studies , Macular Edema/epidemiology , Male , Middle Aged , National Health Programs , Predictive Value of Tests , Prevalence , Reproducibility of Results , Severity of Illness Index , Thailand/epidemiology
ABSTRACT
Importance: Expert-level artificial intelligence (AI) algorithms for prostate biopsy grading have recently been developed. However, the potential impact of integrating such algorithms into pathologist workflows remains largely unexplored. Objective: To evaluate an expert-level AI-based assistive tool when used by pathologists for the grading of prostate biopsies. Design, Setting, and Participants: This diagnostic study used a fully crossed multiple-reader, multiple-case design to evaluate an AI-based assistive tool for prostate biopsy grading. Retrospective grading of prostate core needle biopsies from 2 independent medical laboratories in the US was performed between October 2019 and January 2020. A total of 20 general pathologists reviewed 240 prostate core needle biopsies from 240 patients. Each pathologist was randomized to 1 of 2 study cohorts. The 2 cohorts reviewed every case in the opposite modality (with AI assistance vs without AI assistance) to each other, with the modality switching after every 10 cases. After a minimum 4-week washout period for each batch, the pathologists reviewed the cases for a second time using the opposite modality. The pathologist-provided grade group for each biopsy was compared with the majority opinion of urologic pathology subspecialists. Exposure: An AI-based assistive tool for Gleason grading of prostate biopsies. Main Outcomes and Measures: Agreement between pathologists and subspecialists with and without the use of an AI-based assistive tool for the grading of all prostate biopsies and Gleason grade group 1 biopsies. Results: Biopsies from 240 patients (median age, 67 years; range, 39-91 years) with a median prostate-specific antigen level of 6.5 ng/mL (range, 0.6-97.0 ng/mL) were included in the analyses. 
Artificial intelligence-assisted review by pathologists was associated with a 5.6% increase (95% CI, 3.2%-7.9%; P < .001) in agreement with subspecialists (from 69.7% for unassisted reviews to 75.3% for assisted reviews) across all biopsies and a 6.2% increase (95% CI, 2.7%-9.8%; P = .001) in agreement with subspecialists (from 72.3% for unassisted reviews to 78.5% for assisted reviews) for grade group 1 biopsies. A secondary analysis indicated that AI assistance was also associated with improvements in tumor detection, mean review time, mean self-reported confidence, and interpathologist agreement. Conclusions and Relevance: In this study, the use of an AI-based assistive tool for the review of prostate biopsies was associated with improvements in the quality, efficiency, and consistency of cancer detection and grading.
Subjects
Artificial Intelligence/standards , Pathology, Clinical/standards , Prostatic Neoplasms/diagnosis , Adult , Aged , Aged, 80 and over , Biopsy, Large-Core Needle/statistics & numerical data , Humans , Male , Middle Aged , Neoplasm Grading , Prostatic Neoplasms/pathology , Retrospective Studies
ABSTRACT
PURPOSE: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. METHODS: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. RESULTS: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). CONCLUSIONS: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. TRANSLATIONAL RELEVANCE: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows.
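Agreement between adjudication panels above is reported as Cohen's quadratically weighted kappa, which penalizes a disagreement by the squared distance between the two ordinal grades. The sketch below is a plain-Python illustration of that statistic under the assumption that grades are coded as integers 0 to 4 (a 5-level DR scale); the function name and the example grades are hypothetical and do not come from the study.

```python
def quadratic_weighted_kappa(grades_a, grades_b, n_classes=5):
    """Cohen's kappa with quadratic weights for two raters' ordinal grades.

    grades_a, grades_b: equal-length sequences of integer grades in
    range(n_classes), one pair per image (hypothetical coding).
    """
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty, equal-length grade lists")
    n = len(grades_a)
    k = n_classes
    # Observed contingency table of grade pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(grades_a, grades_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[i][j]
            # Count expected by chance if the raters graded independently.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical grades for five images from two panels.
kappa = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3])
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; because of the quadratic weights, a one-step disagreement (for example mild versus moderate) lowers the score far less than a disagreement between no DR and proliferative DR.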
ABSTRACT
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate NPDR or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001), and a slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, PDR, and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels by the algorithm and human graders was 0.85 and 0.78 respectively (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
ABSTRACT
[This corrects the article DOI: 10.1038/s41746-019-0099-8.].
ABSTRACT
What is the relationship between retinotopy and object selectivity in human lateral occipital (LO) cortex? We used functional magnetic resonance imaging (fMRI) to examine sensitivity to retinal position and category in LO, an object-selective region positioned posterior to MT along the lateral cortical surface. Six subjects participated in phase-encoded retinotopic mapping experiments as well as block-design experiments in which objects from six different categories were presented at six distinct positions in the visual field. We found substantial position modulation in LO using standard nonobject retinotopic mapping stimuli; this modulation extended beyond the boundaries of visual field maps LO-1 and LO-2. Further, LO showed a pronounced lower visual field bias: more LO voxels represented the lower contralateral visual field, and the mean LO response was higher to objects presented below fixation than above fixation. However, eccentricity effects produced by retinotopic mapping stimuli and objects differed. Whereas LO voxels preferred a range of eccentricities lying mostly outside the fovea in the retinotopic mapping experiment, LO responses were strongest to foveally presented objects. Finally, we found a stronger effect of position than category on both the mean LO response and the distributed response across voxels. Overall, these results demonstrate that retinal position exhibits strong effects on neural response in LO and indicate that these position effects may be explained by retinotopic organization.
Subjects
Brain Mapping , Occipital Lobe/physiology , Pattern Recognition, Visual/physiology , Retina/physiology , Visual Pathways/physiology , Adult , Analysis of Variance , Dominance, Ocular , Female , Humans , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Male , Occipital Lobe/blood supply , Oxygen/blood , Photic Stimulation/methods , Reaction Time/physiology , Visual Fields/physiology , Visual Pathways/blood supply
ABSTRACT
Object-selective cortical regions exhibit a decreased response when an object stimulus is repeated [repetition suppression (RS)]. RS is often associated with priming: reduced response times and increased accuracy for repeated stimuli. It is unknown whether RS reflects stimulus-specific repetition, the associated changes in response time, or the combination of the two. To address this question, we performed a rapid event-related functional MRI (fMRI) study in which we measured BOLD signal in object-selective cortex, as well as object recognition performance, while we manipulated stimulus repetition. Our design allowed us to examine separately the roles of response time and repetition in explaining RS. We found that repetition played a robust role in explaining RS: repeated trials produced weaker BOLD responses than nonrepeated trials, even when comparing trials with matched response times. In contrast, response time played a weak role in explaining RS when repetition was controlled for: it explained BOLD responses only for one region of interest (ROI) and one experimental condition. Thus repetition suppression seems to be mostly driven by repetition rather than performance changes. We further examined whether RS reflects processes occurring at the same time as recognition or after recognition by manipulating stimulus presentation duration. In one experiment, durations were longer than required for recognition (2 s), whereas in a second experiment, durations were close to the minimum time required for recognition (85-101 ms). We found significant RS for brief presentations (albeit with a reduced magnitude), which again persisted when controlling for performance. This suggests a substantial amount of RS occurs during recognition.