ABSTRACT
Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful [1]. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives [2]. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.
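The second-reader workload figure from the double-reading simulation can be made concrete with a toy calculation. The sketch below assumes a simplified rule that the abstract does not spell out: the AI system stands in for the second reader, a case goes to the human second reader only when the AI's recall decision disagrees with the first reader's, and arbitration is ignored. The function names and decisions are hypothetical.

```python
from typing import Sequence


def second_reader_fraction(first_reader: Sequence[bool], ai: Sequence[bool]) -> float:
    """Fraction of cases that still need a human second reader.

    Assumed rule: when the AI's recall decision matches the first reader's,
    that decision stands; otherwise the case goes to the human second reader.
    """
    if len(first_reader) != len(ai):
        raise ValueError("decision lists must have the same length")
    if len(first_reader) == 0:
        raise ValueError("need at least one case")
    disagreements = sum(f != a for f, a in zip(first_reader, ai))
    return disagreements / len(first_reader)


def workload_reduction(first_reader: Sequence[bool], ai: Sequence[bool]) -> float:
    """Relative reduction in second-reader workload versus reading every case."""
    return 1.0 - second_reader_fraction(first_reader, ai)


# Toy recall decisions (True = recall); not study data.
first = [True, False, False, False, True, False, False, False]
ai_calls = [True, False, False, True, True, False, False, False]
print(workload_reduction(first, ai_calls))  # 7 of 8 cases agree -> 0.875
```

Under this rule the reduction equals the AI/first-reader agreement rate, so it depends on how often the two disagree rather than on either reader's accuracy alone.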
Subjects
Artificial Intelligence/standards, Breast Neoplasms/diagnostic imaging, Early Detection of Cancer/methods, Early Detection of Cancer/standards, Female, Humans, Mammography/standards, Reproducibility of Results, United Kingdom, United States
ABSTRACT
Background The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning ("noisy-student") were incorporated. The DLS was evaluated on a four-country test set (China, India, the United States, and Zambia) and on a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The DLS was also compared with nine radiologists by using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than those for all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. 
In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%-80% per TB-positive patient detected. Conclusion A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs. © RSNA, 2022. Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
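To illustrate the two evaluation quantities above, the sketch below computes the ROC AUC with the Mann-Whitney formulation and the sensitivity and specificity at a fixed operating point, then checks them against the WHO targets. It assumes a case is called TB-positive when its score is at or above the threshold; the scores are invented, and the study's confidence intervals and noninferiority tests are not reproduced.

```python
from typing import Sequence, Tuple


def auc_mann_whitney(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC: probability that a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative cases")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def meets_who_targets(sens: float, spec: float, min_sens: float = 0.90, min_spec: float = 0.70) -> bool:
    """Check an operating point against the 90% sensitivity / 70% specificity targets."""
    return sens >= min_sens and spec >= min_spec


scores = [0.91, 0.72, 0.48, 0.30, 0.55, 0.20, 0.10, 0.05]  # toy DLS outputs
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = TB-positive
sens, spec = sens_spec_at(scores, labels, threshold=0.45)
print(auc_mann_whitney(scores, labels), sens, spec, meets_who_targets(sens, spec))
# -> 0.875 0.75 0.75 False
```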
Subjects
Deep Learning, Pulmonary Tuberculosis, Humans, Female, Middle Aged, Male, Thoracic Radiography/methods, Retrospective Studies, Radiography, Pulmonary Tuberculosis/diagnostic imaging, Radiologists, Sensitivity and Specificity
ABSTRACT
Shoulder pain is common among elite swimmers because of the tremendous stress placed on the shoulders during swimming. The supraspinatus muscle is one of the major prime movers and stabilizers of the shoulder and is highly susceptible to overloading and tendinopathy. An understanding of the relationships between the supraspinatus tendon and pain, and between the supraspinatus tendon and strength, would assist health care practitioners in developing training regimes. The objectives of this study were to evaluate 1) the association between structural abnormality of the supraspinatus tendon and shoulder pain and 2) the association between structural abnormality of the supraspinatus tendon and shoulder strength. We hypothesized that structural abnormality of the supraspinatus tendon would be positively associated with shoulder pain and negatively associated with shoulder muscle strength among elite swimmers. Forty-four elite swimmers were recruited from the Hong Kong China Swimming Association. Supraspinatus tendon condition was evaluated using diagnostic ultrasound imaging, and shoulder internal and external rotation strength was evaluated with an isokinetic dynamometer. Pearson's r was used to study the correlation between shoulder pain and supraspinatus tendon condition and to evaluate the association between isokinetic shoulder strength and supraspinatus tendon condition. Eighty-two of the 88 shoulders (93.18%) had supraspinatus tendinopathy or a tendon tear. However, there was no statistically significant association between structural abnormality of the supraspinatus tendon and shoulder pain. 
In contrast, there was a significant correlation between left maximal supraspinatus tendon thickness (LMSTT) and left external rotation concentric (LER/Con) and left external rotation eccentric (LER/Ecc) shoulder strength (p < 0.05), and the internal rotation/external rotation (IR/ER) ratio was also a significant predictor of LMSTT > 6 mm (R² = 0.462, F = 7.016, df = 1, p = 0.038). Structural change of the supraspinatus tendon was not associated with shoulder pain but could be a predictor of MSTT > 6 mm in elite swimmers.
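Pearson's r, used in this study, can be computed directly from its definition. The sketch below is a minimal version with hypothetical tendon-thickness and torque values; it does not compute a p-value. For a simple linear regression with a single predictor, R² equals r², which is why the squared value is printed alongside r.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: tendon thickness (mm) and external-rotation torque (N·m).
thickness = [5.1, 5.8, 6.3, 6.9, 7.4]
torque = [42.0, 40.0, 37.0, 35.0, 30.0]
r = pearson_r(thickness, torque)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```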
Subjects
Rotator Cuff, Tendinopathy, Humans, Shoulder Pain, Cross-Sectional Studies, China
ABSTRACT
Background Developing deep learning models for radiology requires large data sets and substantial computational resources. Data set size limitations can be further exacerbated by distribution shifts, such as rapid changes in patient populations and standard of care during the COVID-19 pandemic. A common partial mitigation is transfer learning by pretraining a "generic network" on a large nonmedical data set and then fine-tuning on a task-specific radiology data set. Purpose To reduce data set size requirements for chest radiography deep learning models by using an advanced machine learning approach (supervised contrastive [SupCon] learning) to generate chest radiography networks. Materials and Methods SupCon was used to generate chest radiography networks from 821 544 chest radiographs from India and the United States. These chest radiography networks were used as a starting point for further machine learning model development for 10 prediction tasks (eg, airspace opacity, fracture, tuberculosis, and COVID-19 outcomes) by using five data sets comprising 684 955 chest radiographs from India, the United States, and China. Three model development setups were tested (linear classifier, nonlinear classifier, and fine-tuning the full network) with different data set sizes from eight to 85. Results Across a majority of tasks, compared with transfer learning from a nonmedical data set, SupCon reduced label requirements up to 688-fold and improved the area under the receiver operating characteristic curve (AUC) at matching data set sizes. In the extreme low-data regime, training small nonlinear models by using only 45 chest radiographs yielded an AUC of 0.95 (noninferior to radiologist performance) in classifying microbiology-confirmed tuberculosis in external validation. In a more moderate data regime, training small nonlinear models by using only 528 chest radiographs yielded an AUC of 0.75 in predicting severe COVID-19 outcomes. 
Conclusion Supervised contrastive learning enabled performance comparable to state-of-the-art deep learning models in multiple clinical tasks by using as few as 45 images and is a promising method for predictive modeling with small data sets and for predicting outcomes in shifting patient populations. © RSNA, 2022. Online supplemental material is available for this article.
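For readers unfamiliar with SupCon, the loss can be sketched in NumPy following the formulation of Khosla et al. (2020): embeddings are L2-normalized, and each anchor is pulled toward other samples with the same label and pushed away from the rest. This is an illustrative sketch only; the temperature of 0.1 and the toy embeddings are assumptions, and the abstract does not describe the authors' encoder, augmentations, or batch construction.

```python
import numpy as np


def supcon_loss(embeddings: np.ndarray, labels: np.ndarray, temperature: float = 0.1) -> float:
    """Supervised contrastive loss (the variant with 1/|P(i)| outside the log).

    embeddings: (n, d) array; labels: (n,) integer class labels.
    Anchors without any same-label partner in the batch are skipped.
    """
    n = embeddings.shape[0]
    if n < 2:
        raise ValueError("need at least two samples")
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-length rows
    logits = z @ z.T / temperature
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)  # an anchor is never its own contrast
    # Row-wise log-softmax over all other samples, computed stably.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        raise ValueError("no anchor has a positive partner")
    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


emb = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
print(supcon_loss(emb, np.array([0, 0, 1, 1])))  # small: same-label pairs are close
print(supcon_loss(emb, np.array([0, 1, 0, 1])))  # large: same-label pairs are far apart
```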
Subjects
COVID-19, Deep Learning, Humans, Thoracic Radiography/methods, Computer-Assisted Radiographic Image Interpretation/methods, Pandemics, COVID-19/diagnostic imaging, Retrospective Studies, Radiography, Machine Learning
ABSTRACT
Manikin carrying is a lifesaving sport technique in which athletes stroke with one arm and carry a 60-kg manikin with the other arm as they swim. Stabilizing the manikin places great demands on the shoulder muscles of the carrying arm; this study therefore aimed to investigate the muscle activation of the carrying shoulder and the factors possibly associated with it. In this cross-sectional study, 20 young elite lifesaving athletes were recruited from the Hong Kong Lifesaving Society. The muscle activity of the posterior deltoid (PD), teres major (TM), and middle trapezius (MT) was recorded with wireless surface electromyography (sEMG) during a 25-m manikin carry in a swimming pool. The 25-m manikin carry was divided into three phases for analysis: initial, middle, and end. The initial phase was defined as the period from the athlete's first swimming stroke to the end of the third stroke; the end phase as the period from the third-to-last stroke to the last stroke at the 25-m finishing line; and the middle phase as the period between the initial and end phases. The first web space and grip strength were measured, and swimming speed and the number of inhalations were calculated. PD showed muscle activity of 55.73% of maximal voluntary isometric contraction (MVIC) in the initial phase and 40.21% MVIC in the middle phase. TM showed muscle activity of 65.26% MVIC in the initial phase and 64.35% MVIC in the middle phase. MT showed 84.54% MVIC in the initial phase and 68.54% MVIC in the middle phase. Young elite athletes showed significant use of PD, TM, and MT during manikin carrying. The muscle activity levels correlated with the athletes' first web space, grip strength, speed, and number of inhalations.
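Expressing muscle activity as %MVIC is a normalization step. A common approach, assumed here because the abstract does not describe the processing pipeline, is to divide the root-mean-square (RMS) amplitude of the task signal by the RMS amplitude recorded during the maximal voluntary isometric contraction. The signals below are toy values, and filtering and windowing are omitted.

```python
import math
from typing import Sequence


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square amplitude of a (pre-filtered) EMG segment."""
    if len(signal) == 0:
        raise ValueError("empty signal")
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def percent_mvic(task_emg: Sequence[float], mvic_emg: Sequence[float]) -> float:
    """Task RMS amplitude as a percentage of the RMS amplitude during MVIC."""
    reference = rms(mvic_emg)
    if reference == 0:
        raise ValueError("MVIC reference amplitude is zero")
    return 100.0 * rms(task_emg) / reference


# Toy samples (mV) for one muscle in one phase; not study data.
initial_phase = [0.21, -0.35, 0.48, -0.40, 0.30]
mvic_trial = [0.62, -0.70, 0.66, -0.58, 0.71]
print(f"{percent_mvic(initial_phase, mvic_trial):.1f}% MVIC")
```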
Subjects
Shoulder, Superficial Back Muscles, Athletes, Cross-Sectional Studies, Humans, Manikins, Shoulder/physiology
ABSTRACT
BACKGROUND: Evidence has shown that velocity-specific exercise provides additional benefits for peripheral joint muscles by promoting their function; however, its effects on spinal muscles have yet to be investigated. This study aimed to examine the feasibility and effects of velocity-specific exercise compared with isometric exercise on cervical muscle function and performance in healthy individuals. METHODS: Thirty healthy adults were randomised to practise either velocity-specific exercise (VSE, n = 15) or isometric exercise (IE, n = 15) for 6 weeks. Function and performance of the cervical extensors and flexors were assessed pre- and post-program by analyzing peak torque and electromyography (EMG) during isokinetic testing, and the cross-sectional area of the deep cervical muscles at rest. Self-reported difficulty and post-exercise soreness were recorded to evaluate the feasibility and safety of the two exercise programs. RESULTS: Both VSE and IE resulted in significant improvement in muscle function and performance, with no between-group differences at reassessment in (a) the cross-sectional area of longus colli and semispinalis capitis; (b) EMG amplitude in sternocleidomastoid and cervical erector spinae; and (c) peak torque values. Further analysis revealed that the degree of correlation between extension torque and EMG amplitude of the cervical erector spinae increased in both groups; however, the correlation was significant only in the VSE group post-program. There were no significant between-group differences in the level of difficulty or post-exercise soreness. CONCLUSIONS: Both velocity-specific and isometric exercises significantly promoted cervical muscle function and performance. The present study confirms that velocity-specific exercise can be practised safely and that it also contributes to a greater enhancement in the neuromuscular efficiency of the cervical extensors. 
These findings indicate that velocity-specific exercise can be considered a safe alternative for training the cervical muscles. Further study is recommended to examine its benefits and applications for promoting muscle function and recovery in symptomatic individuals.
Subjects
Exercise, Neck Muscles, Adult, Electromyography, Exercise Therapy, Humans, Skeletal Muscle, Torque
ABSTRACT
The objectives of this systematic review were to summarize and evaluate the effectiveness of strength and conditioning training on front crawl swimming, start, and turn performance, together with relevant biomechanical parameters. Four online databases (PubMed, EBSCOhost, Web of Science, and SPORTDiscus) were searched using different combinations of keywords. A total of 954 articles were retrieved, and 15 articles were ultimately included after removal of duplicates and screening against the inclusion and exclusion criteria. Meta-analyses were performed where appropriate, and Egger's regression test was used to assess publication bias; results were presented as forest plots and funnel plots, respectively. The 15 included articles studied the effects of strength and resistance, core, and plyometric training. Study quality was assessed with the checklist developed by Downs and Black. Most studies found that the training programs benefited front crawl sprint swimming performance, stroke biomechanics, force, and muscle strength. First, strength and resistance training and core training were effective in enhancing sprint performance. Second, resistance training was found to have positive effects on stroke rate. Plyometric training was beneficial to start performance, whereas there was insufficient evidence to confirm improvements in turn biomechanics or overall swimming performance after weeks of plyometric training. Given these positive effects on swimming performance (including starts, turns, and front crawl swimming) and relevant biomechanical parameters, strength and conditioning training is suggested as part of the regular training regime rather than swimming training alone. Higher-quality research is recommended, as are further investigations of training effects on other stroke styles.
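Egger's regression test regresses each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero suggests funnel-plot asymmetry. The sketch below returns only the fitted intercept and slope, not the intercept's standard error or p-value, and the effect sizes are hypothetical.

```python
from typing import Sequence, Tuple

import numpy as np


def egger_test(effects: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the regression of effect/SE on 1/SE."""
    effects_arr = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    standardized = effects_arr / se  # standard normal deviate of each study
    precision = 1.0 / se
    X = np.column_stack([np.ones_like(precision), precision])
    coef, *_ = np.linalg.lstsq(X, standardized, rcond=None)
    return float(coef[0]), float(coef[1])


# Hypothetical studies in which smaller studies (larger SE) report larger effects.
print(egger_test([0.3, 0.4, 0.45, 0.7], [0.1, 0.2, 0.25, 0.5]))
```

If every study reported the same effect, the standardized effects would lie on a line through the origin and the intercept would be zero; a positive intercept reflects small-study effects.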
Subjects
Plyometric Exercise, Resistance Training, Biomechanical Phenomena, Humans, Muscle Strength/physiology, Resistance Training/methods, Swimming/physiology
ABSTRACT
Background Deep learning has the potential to augment the use of chest radiography in clinical radiology, but challenges include poor generalizability, spectrum bias, and difficulty comparing across studies. Purpose To develop and evaluate deep learning models for chest radiograph interpretation by using radiologist-adjudicated reference standards. Materials and Methods Deep learning models were developed to detect four findings (pneumothorax, opacity, nodule or mass, and fracture) on frontal chest radiographs. This retrospective study used two data sets. Data set 1 (DS1) consisted of 759 611 images from a multicity hospital network, and ChestX-ray14, a publicly available data set, consisted of 112 120 images. Natural language processing and expert review of a subset of images provided labels for 657 954 training images. Test sets consisted of 1818 and 1962 images from DS1 and ChestX-ray14, respectively. Reference standards were defined by radiologist-adjudicated image review. Performance was evaluated by area under the receiver operating characteristic curve analysis, sensitivity, specificity, and positive predictive value. Four radiologists reviewed test set images for performance comparison. Inverse probability weighting was applied to DS1 to account for positive radiograph enrichment and estimate population-level performance. Results In DS1, population-adjusted areas under the receiver operating characteristic curve for pneumothorax, nodule or mass, airspace opacity, and fracture were, respectively, 0.95 (95% confidence interval [CI]: 0.91, 0.99), 0.72 (95% CI: 0.66, 0.77), 0.91 (95% CI: 0.88, 0.93), and 0.86 (95% CI: 0.79, 0.92). 
With ChestX-ray14, areas under the receiver operating characteristic curve were 0.94 (95% CI: 0.93, 0.96), 0.91 (95% CI: 0.89, 0.93), 0.94 (95% CI: 0.93, 0.95), and 0.81 (95% CI: 0.75, 0.86), respectively. Conclusion Expert-level models for detecting clinically relevant chest radiograph findings were developed for this study by using adjudicated reference standards and with population-level performance estimation. Radiologist-adjudicated labels for 2412 ChestX-ray14 validation set images and 1962 test set images are provided. © RSNA, 2019. Online supplemental material is available for this article. See also the editorial by Chang in this issue.
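Inverse probability weighting can be illustrated for the AUC. If each test image i was sampled with a known probability p_i (higher for enriched, likely-positive images), it receives weight w_i = 1/p_i, and each positive-negative pair contributes w_pos·w_neg to a weighted Mann-Whitney estimate. The sketch assumes those probabilities are known; the numbers are hypothetical, and the study's exact weighting scheme is not reproduced. When weights are constant within each class, the result equals the unweighted AUC, so the adjustment matters when sampling rates differ among images with the same true label.

```python
from typing import Sequence


def weighted_auc(scores: Sequence[float], labels: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted Mann-Whitney AUC; each positive-negative pair counts with weight w_pos * w_neg."""
    positives = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 1]
    negatives = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 0]
    num = 0.0
    den = 0.0
    for s_pos, w_pos in positives:
        for s_neg, w_neg in negatives:
            pair_weight = w_pos * w_neg
            den += pair_weight
            if s_pos > s_neg:
                num += pair_weight
            elif s_pos == s_neg:
                num += 0.5 * pair_weight  # ties count half
    if den == 0:
        raise ValueError("need at least one positive and one negative with nonzero weight")
    return num / den


# Hypothetical sampling probabilities converted to inverse-probability weights.
sampling_prob = [1.0, 1 / 3, 1.0, 1.0]
weights = [1 / p for p in sampling_prob]
print(weighted_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0], weights))
```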
Subjects
Computer-Assisted Radiographic Image Interpretation/methods, Thoracic Radiography/methods, Respiratory Tract Diseases/diagnostic imaging, Thoracic Injuries/diagnostic imaging, Adolescent, Adult, Aged, Aged 80 and over, Child, Preschool Child, Deep Learning, Female, Humans, Infant, Male, Middle Aged, Pneumothorax, Radiologists, Reference Standards, Reproducibility of Results, Retrospective Studies, Sensitivity and Specificity, Young Adult
ABSTRACT
Purpose To evaluate the impact of an artificial intelligence (AI) assistant for lung cancer screening on multinational clinical workflows. Materials and Methods An AI assistant for lung cancer screening was evaluated in two retrospective randomized multireader multicase studies in which 627 low-dose chest CT cases (141 cancer-positive) were each read twice (with and without AI assistance) by experienced thoracic radiologists (six U.S.-based or six Japan-based radiologists), resulting in a total of 7524 interpretations. Positive cases were defined as those within 2 years before a pathology-confirmed lung cancer diagnosis. Negative cases were defined as those without any subsequent cancer diagnosis for at least 2 years and were enriched for a spectrum of diverse nodules. The studies measured the readers' level of suspicion (on a 0-100 scale), country-specific screening system scoring categories, and management recommendations. Evaluation metrics included the area under the receiver operating characteristic curve (AUC) for level of suspicion and the sensitivity and specificity of recall recommendations. Results With AI assistance, the radiologists' AUC increased by 0.023 (0.70 to 0.72; P = .02) for the U.S. study and by 0.023 (0.93 to 0.96; P = .18) for the Japan study. Scoring system specificity for actionable findings increased 5.5% (57% to 63%; P < .001) for the U.S. study and 6.7% (23% to 30%; P < .001) for the Japan study. There was no evidence of a difference in corresponding sensitivity between unassisted and AI-assisted reads for the U.S. (67.3% to 67.5%; P = .88) and Japan (98% to 100%; P > .99) studies. Corresponding stand-alone AI system AUCs were 0.75 (95% CI: 0.70, 0.81) and 0.88 (95% CI: 0.78, 0.97) for the U.S.- and Japan-based data sets, respectively. 
Conclusion The concurrent AI interface improved lung cancer screening specificity in both U.S.- and Japan-based reader studies, meriting further study in additional international screening environments. Keywords: Assistive Artificial Intelligence, Lung Cancer Screening, CT Supplemental material is available for this article. Published under a CC BY 4.0 license.
Subjects
Artificial Intelligence, Early Detection of Cancer, Lung Neoplasms, X-Ray Computed Tomography, Humans, Lung Neoplasms/diagnosis, Lung Neoplasms/epidemiology, Japan, United States/epidemiology, Retrospective Studies, Early Detection of Cancer/methods, Female, Male, Middle Aged, Aged, Sensitivity and Specificity, Computer-Assisted Radiographic Image Interpretation/methods
ABSTRACT
Importance: Fetal ultrasonography is essential for confirmation of gestational age (GA), and accurate GA assessment is important for providing appropriate care throughout pregnancy and for identifying complications, including fetal growth disorders. Derivation of GA from manual fetal biometry measurements (ie, head, abdomen, and femur) is operator dependent and time-consuming. Objective: To develop artificial intelligence (AI) models to estimate GA with higher accuracy and reliability, leveraging standard biometry images and fly-to ultrasonography videos. Design, Setting, and Participants: To improve GA estimates, this diagnostic study used AI to interpret standard plane ultrasonography images and fly-to ultrasonography videos, which are 5- to 10-second videos that can be automatically recorded as part of the standard of care before the still image is captured. Three AI models were developed and validated: (1) an image model using standard plane images, (2) a video model using fly-to videos, and (3) an ensemble model (combining both image and video models). The models were trained and evaluated on data from the Fetal Age Machine Learning Initiative (FAMLI) cohort, which included participants from 2 study sites at Chapel Hill, North Carolina (US), and Lusaka, Zambia. Participants were eligible to be part of this study if they received routine antenatal care at 1 of these sites, were aged 18 years or older, had a viable intrauterine singleton pregnancy, and could provide written consent. They were not eligible if they had known uterine or fetal abnormality, or had any other conditions that would make participation unsafe or complicate interpretation. Data analysis was performed from January to July 2022. Main Outcomes and Measures: The primary analysis outcome for GA was the mean difference in absolute error between the GA model estimate and the clinical standard estimate, with the ground truth GA extrapolated from the initial GA estimated at an initial examination. 
Results: Of the total cohort of 3842 participants, a test set of 404 participants, with a mean (SD) age of 28.8 (5.6) years at enrollment, was used for evaluation. All models were statistically superior to standard fetal biometry-based GA estimates derived from images captured by expert sonographers. The ensemble model had the lowest mean absolute error compared with the clinical standard fetal biometry (mean [SD] difference, -1.51 [3.96] days; 95% CI, -1.90 to -1.10 days). All 3 models outperformed standard biometry by a more substantial margin on fetuses that were predicted to be small for their GA. Conclusions and Relevance: These findings suggest that AI models have the potential to empower trained operators to estimate GA with higher accuracy.
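The primary outcome compares absolute errors participant by participant. A minimal sketch, assuming GA values in days and a percentile bootstrap for the confidence interval (the abstract does not say how the interval was computed): a negative mean difference means the model's absolute error is smaller than that of the clinical standard. All values and parameter defaults are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def abs_error_differences(model: Sequence[float], standard: Sequence[float],
                          truth: Sequence[float]) -> List[float]:
    """Per-participant |model - truth| - |standard - truth| (negative favours the model)."""
    return [abs(m - t) - abs(s - t) for m, s, t in zip(model, standard, truth)]


def bootstrap_ci(values: Sequence[float], n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = boot_means[round(alpha / 2 * (n_boot - 1))]
    hi = boot_means[round((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical GA values in days for five participants.
truth = [140, 182, 203, 161, 231]
model = [141, 180, 204, 160, 229]
standard = [144, 178, 200, 165, 236]
diffs = abs_error_differences(model, standard, truth)
print(mean(diffs), bootstrap_ci(diffs))
```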
Subjects
Artificial Intelligence, Machine Learning, Humans, Pregnancy, Female, Gestational Age, Reproducibility of Results, Zambia, Ultrasonography
ABSTRACT
Background: Fetal ultrasound is an important component of antenatal care, but a shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings. Methods: Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia and by novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared with a previously established ground truth and evaluated for differences in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared with sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones. Results: Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n = 406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n = 613); sonographers and novices have similar AUC-ROCs. Software run-times on mobile phones for both diagnostic models are less than 3 s after completion of a sweep. Conclusions: The gestational age model is non-inferior to the clinical standard, and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to help improve the capabilities of lightly trained ultrasound operators in low-resource settings.
ABSTRACT
OBJECTIVE: To demonstrate the importance of combining multiple readers' opinions, in a context-aware manner, when establishing the reference standard for validating artificial intelligence (AI) applications for, e.g., chest radiographs. By comparing individual readers, majority vote of a panel, and panel-based discussion, we identify methods that maximize interobserver agreement and label reproducibility. METHODS: A total of 1100 frontal chest radiographs were evaluated for six findings: airspace opacity, cardiomegaly, pulmonary edema, fracture, nodules, and pneumothorax. Each image was reviewed by six radiologists, first individually and then via asynchronous adjudication (web-based discussion) in two panels of three readers to resolve disagreements within each panel. We quantified the reproducibility of each method by measuring interreader agreement. RESULTS: Panel-based majority vote improved agreement relative to individual readers for all findings. Most disagreements were resolved with two rounds of adjudication, which further improved reproducibility for some findings, particularly by reducing misses. Improvements varied across finding categories, with adjudication improving agreement for cardiomegaly, fractures, and pneumothorax. CONCLUSION: The likelihood of interreader agreement, even within panels of US board-certified radiologists, must be considered before reads can be used as a reference standard for validation of proposed AI tools. Agreement and, by extension, reproducibility can be improved by applying majority vote, maximum sensitivity, or asynchronous adjudication for different findings, which supports the development of higher quality clinical research. ADVANCES IN KNOWLEDGE: A panel of three experts is a common technique for establishing reference standards when ground truth is not available for use in AI validation. The manner in which differing opinions are resolved is shown to be important and has not been previously explored.
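Two of the label-combination ideas above can be sketched for a single binary finding: a per-image majority vote within a three-reader panel, and Cohen's kappa as one possible measure of agreement between the two panels' labels. The abstract does not name its agreement statistic, so kappa is an assumption here, as are the toy reads.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(reads: Sequence[Sequence[int]]) -> List[int]:
    """Per-image majority label; reads[r][i] is reader r's 0/1 call on image i."""
    n_readers = len(reads)
    if n_readers == 0 or n_readers % 2 == 0:
        raise ValueError("use an odd, nonzero number of readers so there are no ties")
    # zip(*reads) yields one tuple of calls per image
    return [1 if 2 * sum(calls) > n_readers else 0 for calls in zip(*reads)]


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty label sequences of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy reads for one finding on six images (1 = present).
panel_1 = majority_vote([[1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0], [0, 0, 1, 0, 1, 1]])
panel_2 = majority_vote([[1, 0, 1, 0, 0, 0], [1, 0, 1, 1, 1, 0], [1, 0, 0, 0, 1, 0]])
print(panel_1, panel_2, cohen_kappa(panel_1, panel_2))
```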
Subjects
Artificial Intelligence/standards, Thoracic Radiography, Humans, Observer Variation, Quality Improvement, Radiologists, Reference Standards, Reproducibility of Results
ABSTRACT
Deriving interpretable prognostic features from deep-learning-based prognostic histopathology models remains a challenge. In this study, we developed a deep learning system (DLS) for predicting disease-specific survival for stage II and III colorectal cancer using 3652 cases (27,300 slides). When evaluated on two validation datasets containing 1239 cases (9340 slides) and 738 cases (7140 slides), respectively, the DLS achieved a 5-year disease-specific survival AUC of 0.70 (95% CI: 0.66-0.73) and 0.69 (95% CI: 0.64-0.72), and added significant predictive value to a set of nine clinicopathologic features. To interpret the DLS, we explored the ability of different human-interpretable features to explain the variance in DLS scores. We observed that clinicopathologic features such as T-category, N-category, and grade explained a small fraction of the variance in DLS scores (R² = 18% in both validation sets). Next, we generated human-interpretable histologic features by clustering embeddings from a deep-learning-based image-similarity model and showed that they explained the majority of the variance (R² of 73%-80%). Furthermore, the clustering-derived feature most strongly associated with high DLS scores was also highly prognostic in isolation. With a distinct visual appearance (poorly differentiated tumor cell clusters adjacent to adipose tissue), this feature was identified by annotators with 87.0%-95.5% accuracy. Our approach can be used to explain predictions from a prognostic deep learning model and uncover potentially novel prognostic features that can be reliably identified by people for future validation studies.
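The "variance explained" figures above are R² values. A minimal sketch, assuming the human-interpretable features are numeric columns (for example, the fraction of a case's tissue assigned to each cluster; this encoding is an assumption) and an ordinary least-squares fit with an intercept; all data are synthetic.

```python
import numpy as np


def r_squared(features: np.ndarray, target: np.ndarray) -> float:
    """Fraction of the variance in `target` explained by an OLS fit (with intercept) on `features`."""
    target = np.asarray(target, dtype=float)
    X = np.column_stack([np.ones(len(target)), features])  # intercept + feature columns
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
cluster_fractions = rng.random((200, 3))  # synthetic per-case feature values
dls_score = cluster_fractions @ np.array([1.5, -0.5, 0.0]) + rng.normal(0.0, 0.3, 200)
print(round(r_squared(cluster_fractions, dls_score), 2))
```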
ABSTRACT
Chest radiography (CXR) is the most widely used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to detect every possible condition by building multiple separate systems, each of which detects one or more pre-specified conditions. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For training and tuning the system, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system trained using a large dataset containing a diverse array of CXR abnormalities generalizes to new patient populations and unseen diseases. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases was reduced by 7%-28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist. Lastly, to facilitate the continued development of AI models for CXR, we release our collected labels for the publicly available dataset.
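The turnaround-time effect of prioritization can be illustrated with a deliberately simple queue: all studies are waiting at time zero, one radiologist reads them one at a time at a fixed read time, and AI-flagged studies are moved to the front. These simplifications (no arrivals over time, constant read time) and the toy data are assumptions; the abstract does not describe the simulation's design.

```python
from typing import Sequence


def mean_wait_abnormal(is_abnormal: Sequence[bool], flagged: Sequence[bool],
                       prioritize: bool, read_minutes: float = 1.0) -> float:
    """Mean time until abnormal studies are read, for a worklist that is full at time zero."""
    if not any(is_abnormal):
        raise ValueError("need at least one abnormal study")
    order = list(range(len(is_abnormal)))
    if prioritize:
        # Stable sort: AI-flagged studies first, original order kept within each group.
        order.sort(key=lambda i: not flagged[i])
    finish_times = [(pos + 1) * read_minutes for pos, i in enumerate(order) if is_abnormal[i]]
    return sum(finish_times) / len(finish_times)


abnormal = [False, False, True, False, True, False]
ai_flag = [False, True, True, False, True, False]  # one false-positive flag
print(mean_wait_abnormal(abnormal, ai_flag, prioritize=False),
      mean_wait_abnormal(abnormal, ai_flag, prioritize=True))  # -> 4.0 2.5
```

The same toy model also shows the cost of misses: an abnormal study the AI fails to flag is pushed behind every flagged study and waits longer than under first-in, first-out reading.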
Subjects
COVID-19/diagnostic imaging, Computer-Assisted Radiographic Image Interpretation/methods, Tuberculosis/diagnostic imaging, Adult, Aged, Algorithms, Case-Control Studies, China, Deep Learning, Female, Humans, India, Male, Middle Aged, Thoracic Radiography, United States
ABSTRACT
With an estimated 160,000 deaths in 2018, lung cancer is the most common cause of cancer death in the United States [1]. Lung cancer screening using low-dose computed tomography has been shown to reduce mortality by 20-43% and is now included in US screening guidelines [1-6]. Existing challenges include inter-grader variability and high false-positive and false-negative rates [7-10]. We propose a deep learning algorithm that uses a patient's current and prior computed tomography volumes to predict the risk of lung cancer. Our model achieves state-of-the-art performance (94.4% area under the curve) on 6,716 National Lung Screening Trial cases and performs similarly on an independent clinical validation set of 1,139 cases. We conducted two reader studies. When prior computed tomography imaging was not available, our model outperformed all six radiologists, with absolute reductions of 11% in false positives and 5% in false negatives. When prior computed tomography imaging was available, the model performance was on par with the same radiologists. This creates an opportunity to optimize the screening process via computer assistance and automation. While the vast majority of patients remain unscreened, we show the potential for deep learning models to increase the accuracy, consistency and adoption of lung cancer screening worldwide.
Subjects
Deep Learning, Computer-Assisted Diagnosis/methods, Lung Neoplasms/diagnostic imaging, Lung Neoplasms/diagnosis, Mass Screening/methods, X-Ray Computed Tomography, Algorithms, Factual Databases, Deep Learning/statistics & numerical data, Computer-Assisted Diagnosis/statistics & numerical data, Humans, Three-Dimensional Imaging/statistics & numerical data, Mass Screening/statistics & numerical data, Neural Networks (Computer), Retrospective Studies, Risk Factors, X-Ray Computed Tomography/statistics & numerical data, United States
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.