ABSTRACT
Background: The integration of artificial intelligence (AI) into medicine is growing, with some experts predicting its standalone use soon. However, skepticism remains because independent validations have yielded few positive outcomes. This study evaluates the effectiveness of AI software in analyzing chest X-rays (CXR) to identify lung nodules, a possible indicator of lung cancer. Methods: This retrospective study analyzed 7,670,212 record pairs from radiological exams conducted between 2020 and 2022 during the Moscow Computer Vision Experiment, focusing on CXR and computed tomography (CT) scans. All images were acquired during routine clinical practice. The final dataset comprised 100 CXR images (50 with lung nodules, 50 without), selected consecutively according to the inclusion and exclusion criteria, to evaluate the performance of all five CXR-analyzing AI-based solutions participating in the Moscow Computer Vision Experiment. The evaluation was performed in three stages. In the first stage, the probability of a lung nodule reported by each AI service was compared with the Ground Truth (1 = nodule present, 0 = no nodule). In the second stage, three radiologists evaluated the segmentation of nodules performed by the AI services (1 = nodule correctly segmented, 0 = nodule incorrectly segmented or not segmented at all). In the third stage, the same radiologists additionally evaluated the classification of the nodules (1 = nodule correctly segmented and classified, 0 = all other cases). The results obtained in stages 2 and 3 were compared with the Ground Truth, which was common to all three stages. For each stage, diagnostic accuracy metrics were calculated for each AI service. Results: Three software solutions (Celsus, Lunit INSIGHT CXR, and qXR) demonstrated diagnostic metrics that matched or surpassed the vendor specifications, achieving the highest area under the receiver operating characteristic curve (AUC) of 0.956 [95% confidence interval (CI): 0.918 to 0.994].
However, when evaluated by three radiologists for accurate nodule segmentation and classification, all solutions performed below the vendor-declared metrics, with the highest AUC reaching 0.812 (95% CI: 0.744 to 0.879). Notably, all AI services demonstrated 100% specificity in stages 2 and 3 of the study. Conclusions: To ensure the reliability and applicability of AI-based software, it is crucial to validate performance metrics using high-quality datasets and to engage radiologists in the evaluation process. Developers are recommended to improve the accuracy of the underlying models before allowing the standalone use of the software for lung nodule detection. The dataset created during the study may be accessed at https://mosmed.ai/datasets/mosmeddatargogksnalichiemiotsutstviemlegochnihuzlovtipvii/.
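The per-stage comparison against Ground Truth described above reduces to standard diagnostic accuracy calculations. The sketch below is illustrative, not taken from the study (function and variable names are assumptions); it computes sensitivity, specificity, and a rank-based AUROC from binary Ground Truth labels and AI outputs:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity from binary labels (1 = nodule, 0 = no nodule)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation: the probability that a
    randomly chosen positive case receives a higher score than a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For stage 1 the scores would be the per-study nodule probabilities; for stages 2 and 3 the radiologist-assigned 0/1 labels play the role of predictions.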
ABSTRACT
RATIONALE AND OBJECTIVES: Post-COVID condition (PCC) is associated with long-term neuropsychiatric symptoms. Advanced magnetic resonance imaging (MRI) techniques can examine brain metabolism, connectivity, and morphometry in PCC, but such techniques are not readily available in routine practice. We conducted a scoping review to determine what is known about routine MRI findings in PCC patients. MATERIALS AND METHODS: The PubMed database was searched up to 11 April 2023. We included cohort, cross-sectional, and before-after studies in English. Articles with only advanced MRI sequences (DTI, fMRI, VBM, PWI, ASL), preprints, and case reports were excluded. The National Heart, Lung, and Blood Institute and PRISMA Extension tools were used for quality assurance. RESULTS: A total of 7 citations out of 167 were included. The total sample size was 451 patients (average age 51 ± 8 years; 67% female). Five studies followed a single recovering cohort, while two studies compared findings between two severity groups. The most common MRI findings were perivascular spaces (47%), microbleeds (27%), and white matter lesions (10%). All the studies agreed that PCC manifestations are not associated with specific MRI findings. CONCLUSION: The results of the included studies are heterogeneous due to low agreement on the types of MRI abnormalities in PCC. Our findings indicate that the routine brain MRI protocol has little value for long COVID diagnostics.
ABSTRACT
PURPOSE: Replicability and generalizability of medical AI are recognized challenges that hinder broad AI deployment in clinical practice. Pulmonary nodule detection and characterization on chest CT images is one of the use cases in demand for automation by means of AI, and multiple AI solutions addressing this task are becoming available. Here, we evaluated and compared the performance of several commercially available radiological AI models on the same clinical task using the same external datasets acquired before and during the COVID-19 pandemic. APPROACH: Five commercially available AI models for pulmonary nodule detection were tested on two external datasets labelled by experts according to the intended clinical task. Dataset 1 was acquired before the pandemic and did not contain radiological signs of COVID-19; dataset 2 was collected during the pandemic and did contain such signs. ROC analysis was applied separately to dataset 1 and dataset 2 to select probability thresholds for each dataset. AUROC, sensitivity, and specificity metrics were used to assess and compare AI performance. RESULTS: Statistically significant differences in AUROC values were observed between the AI models on dataset 1, whereas on dataset 2 these differences became statistically insignificant. Sensitivity and specificity also differed statistically significantly between the AI models on dataset 1. This difference was insignificant on dataset 2 when we applied the probability threshold initially selected for dataset 1; updating the probability threshold based on dataset 2 created statistically significant differences in sensitivity and specificity between AI models on dataset 2. For 3 out of 5 AI models, updating the probability threshold was valuable to compensate for the degradation of performance caused by the pandemic-related population shift.
CONCLUSIONS: Population shift in the data can diminish performance differences between AI models. Updating the probability threshold in response to a population shift appears valuable for preserving AI model performance without retraining.
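Selecting an operating threshold from ROC analysis, as was done separately for each dataset above, is commonly implemented by maximizing Youden's J; the abstract does not state the exact criterion, so the following is a minimal sketch under that assumption:

```python
def youden_threshold(y_true, scores):
    """Return the candidate threshold maximizing Youden's J = sensitivity + specificity - 1.
    Youden's J is a common ROC-based criterion; the criterion actually used in the
    study is not specified, so this is illustrative only."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):  # each observed score is a candidate cut-off
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```

Re-running this procedure on the post-shift dataset is exactly the "threshold update" step that compensated for the population shift without retraining.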
Subjects
COVID-19, Radiology, Humans, Pandemics, COVID-19/diagnostic imaging, COVID-19/epidemiology, Radiography, Tomography, X-Ray Computed
ABSTRACT
An international reader study was conducted to gauge the average diagnostic accuracy of radiologists interpreting chest X-ray images, including those from fluorography and mammography, and to establish requirements for stand-alone radiological artificial intelligence (AI) models. The retrospective studies in the datasets were labelled as containing or not containing target pathological findings based on the consensus of two experienced radiologists and, where applicable, the results of laboratory tests and follow-up examinations. A total of 204 radiologists from 11 countries with varying levels of experience assessed the dataset on a 5-point Likert scale via a web platform. Eight commercial radiological AI models analyzed the same dataset. The AUROC was 0.87 (95% CI: 0.83-0.90) for AI versus 0.96 (95% CI: 0.94-0.97) for radiologists. The sensitivity and specificity were 0.71 (95% CI: 0.64-0.78) and 0.93 (95% CI: 0.89-0.96) for AI versus 0.91 (95% CI: 0.86-0.95) and 0.90 (95% CI: 0.85-0.94) for radiologists, respectively. The overall diagnostic accuracy of radiologists was superior to AI for chest X-ray and mammography. However, the accuracy of AI was noninferior to the least experienced radiologists for mammography and fluorography, and to all radiologists for chest X-ray. Therefore, an AI-based first reading could be recommended to reduce the workload burden of radiologists for the most common radiological studies such as chest X-ray and mammography.
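The confidence intervals quoted above can be obtained in several ways; a percentile bootstrap over cases is one common, distribution-free approach. The sketch below is illustrative and not the study's stated method (names and defaults are assumptions):

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(y_true, y_pred).
    A fixed seed keeps the interval reproducible across runs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(y_true, y_pred):
    """Simple example metric: fraction of cases where reader and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The same routine works for sensitivity, specificity, or AUROC by swapping the `metric` callable.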
ABSTRACT
We performed a multicenter external evaluation of the practical and clinical efficacy of a commercial AI algorithm for chest X-ray (CXR) analysis (Lunit INSIGHT CXR). A retrospective evaluation was performed with a multi-reader study. For a prospective evaluation, the AI model was run on CXR studies and the results were compared to the reports of 226 radiologists. In the multi-reader study, the area under the curve (AUC), sensitivity, and specificity of the AI were 0.94 (CI95%: 0.87-1.0), 0.90 (CI95%: 0.79-1.0), and 0.89 (CI95%: 0.79-0.98); the AUC, sensitivity, and specificity of the radiologists were 0.97 (CI95%: 0.94-1.0), 0.90 (CI95%: 0.79-1.0), and 0.95 (CI95%: 0.89-1.0). In most regions of the ROC curve, the AI performed slightly worse than or on par with an average human reader. The McNemar test showed no statistically significant differences between AI and radiologists. In the prospective study with 4752 cases, the AUC, sensitivity, and specificity of the AI were 0.84 (CI95%: 0.82-0.86), 0.77 (CI95%: 0.73-0.80), and 0.81 (CI95%: 0.80-0.82). The lower accuracy values obtained during the prospective validation were mainly associated with false-positive findings considered by experts to be clinically insignificant and with the false-negative omission of findings reported by humans as "opacity", "nodule", and "calcification". In this large-scale prospective validation of the commercial AI algorithm in clinical practice, lower sensitivity and specificity values were obtained compared to the prior retrospective evaluation of data from the same population.
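The McNemar test mentioned above compares two readers on the same cases using only the discordant pairs. A minimal exact version using just the standard library (a sketch; the study's actual implementation is not specified) can be written as:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b = cases one reader (e.g., AI) got right and the other missed; c = the reverse.
    Under H0, the discordant successes follow Binomial(n = b + c, p = 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the smaller tail, capped at 1
```

For example, 2 versus 8 discordant pairs gives p ≈ 0.109, consistent with "no statistically significant difference" at the usual 0.05 level.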
ABSTRACT
In this review, we focused on the applicability of artificial intelligence (AI) for opportunistic abdominal aortic aneurysm (AAA) detection in computed tomography (CT). We used the academic search system PubMed as the primary source for the literature search and Google Scholar as a supplementary source of evidence, searching through 2 February 2022. All studies on automated AAA detection or segmentation in noncontrast abdominal CT were included. For bias assessment, we developed and used an adapted version of the QUADAS-2 checklist. We included eight studies with 355 cases, of which 273 (77%) contained AAA. The highest risk of bias and level of applicability concerns were observed for the "patient selection" domain, due to the 100% pathology rate in the majority (75%) of the studies. The mean sensitivity was 95% (95% CI: 87-100%), the mean specificity was 96.6% (95% CI: 75.7-100%), and the mean accuracy was 95.2% (95% CI: 54.5-100%). Half of the included studies performed diagnostic accuracy estimation, with only one study reporting all diagnostic accuracy metrics; we therefore conducted a narrative synthesis. Our findings indicate high study heterogeneity, requiring further research with balanced noncontrast CT datasets and adherence to reporting standards in order to validate the high sensitivity value obtained.