ABSTRACT
PURPOSE: To assess the ability of the Annalise Enterprise CXR Triage Trauma (Annalise AI Pty Ltd, Sydney, NSW, Australia) artificial intelligence model to identify vertebral compression fractures on chest radiographs and its potential to address undiagnosed osteoporosis and its treatment. MATERIALS AND METHODS: This retrospective study used a consecutive cohort of 596 chest radiographs from four US hospitals between 2015 and 2021. Each radiograph included both frontal (anteroposterior or posteroanterior) and lateral projections. These radiographs were assessed for the presence of vertebral compression fracture in a consensus manner by up to three thoracic radiologists. The model then performed inference on the cases. A chart review was also performed for the presence of osteoporosis-related International Classification of Diseases, 10th revision diagnostic codes and medication use for the study period and an additional year of follow-up. RESULTS: The model successfully completed inference on 595 cases (99.8%); these cases included 272 positive cases and 323 negative cases. The model performed with an area under the receiver operating characteristic curve of 0.955 (95% confidence interval [CI]: 0.939-0.968), a sensitivity of 89.3% (95% CI: 85.7%-92.7%), and a specificity of 89.2% (95% CI: 85.4%-92.3%). Of the 236 true-positive cases (ie, vertebral compression fractures correctly identified by the model) with available chart information, only 86 (36.4%) had a diagnosis of vertebral compression fracture and 140 (59.3%) had a diagnosis of either osteoporosis or osteopenia; only 78 (33.1%) were receiving a disease-modifying medication for osteoporosis. CONCLUSION: The model identified vertebral compression fracture accurately, with a sensitivity of 89.3% (95% CI: 85.7%-92.7%) and a specificity of 89.2% (95% CI: 85.4%-92.3%). Its automated use could help identify patients who have undiagnosed osteoporosis and who may benefit from taking disease-modifying medications.
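The abstract reports proportions with 95% CIs; below is a minimal sketch of how such estimates can be computed from confusion-matrix counts using the Wilson score interval. The CI method actually used by the authors is not stated, and the counts shown are back-calculated from the reported totals purely for illustration.

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative counts back-calculated from the reported totals (272 positives, 323 negatives);
# these are not the study's raw data.
tp, fn = 243, 29   # reference-positive cases called positive / negative by the model
tn, fp = 288, 35   # reference-negative cases called negative / positive by the model

for label, hits, total in [("sensitivity", tp, tp + fn), ("specificity", tn, tn + fp)]:
    lo, hi = wilson_ci(hits, total)
    print(f"{label}: {hits / total:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```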
ABSTRACT
The opportunistic use of radiological examinations for disease detection can potentially enable timely management. We assessed whether an index created by AI software to quantify chest radiography (CXR) findings associated with heart failure (HF) could distinguish patients who would develop HF within a year of the examination from those who would not. Our multicenter retrospective study included patients who underwent CXR and had no HF diagnosis at the time of the examination. We included 1117 patients (age 67.6 ± 13 years; m:f 487:630) who underwent CXR. A total of 413 patients had the CXR image taken within one year of their HF diagnosis; the rest (n = 704) had no HF diagnosis after the examination date. All CXR images were processed with the model (qXR-HF, Qure.AI) to obtain information on cardiac silhouette, pleural effusion, and the index. We calculated the accuracy, sensitivity, specificity, and area under the curve (AUC) of the index for distinguishing patients who developed HF within a year of the CXR from those who did not. We report an AUC of 0.798 (95%CI 0.77-0.82), accuracy of 0.73, sensitivity of 0.81, and specificity of 0.68 for overall AI performance. AI AUCs by lead time to diagnosis (<3 months: 0.85; 4-6 months: 0.82; 7-9 months: 0.75; 10-12 months: 0.71), accuracy (0.68-0.72), and specificity (0.68) remained stable. Our results support the ongoing efforts to investigate opportunistic screening in radiology.
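A sketch of how an AUC stratified by lead time to diagnosis can be computed is shown below; the file name and column names are assumptions for illustration, not the study's actual data schema.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-patient table (column names are assumptions):
# 'hf_index'     : continuous AI index from the CXR model
# 'hf_within_1y' : 1 if HF was diagnosed within a year of the CXR, else 0
# 'lead_months'  : months from CXR to HF diagnosis (NaN if never diagnosed)
df = pd.read_csv("cxr_hf_cohort.csv")

print("overall AUC:", roc_auc_score(df["hf_within_1y"], df["hf_index"]))

# AUC by lead-time bin: cases in each window vs. all never-diagnosed controls
controls = df[df["hf_within_1y"] == 0]
for lo, hi in [(0, 3), (4, 6), (7, 9), (10, 12)]:
    cases = df[(df["hf_within_1y"] == 1) & df["lead_months"].between(lo, hi)]
    subset = pd.concat([cases, controls])
    auc = roc_auc_score(subset["hf_within_1y"], subset["hf_index"])
    print(f"{lo}-{hi} months: AUC {auc:.2f}")
```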
ABSTRACT
PURPOSE: We compared the performance of a generative artificial intelligence (AI) tool (Augmented Transformer Assisted Radiology Intelligence [ATARI], Microsoft Nuance, Microsoft Corporation, Redmond, Washington) and a natural language processing (NLP) tool for identifying laterality errors in radiology reports and images. METHODS: We used an NLP-based tool (mPower, Microsoft Nuance) to identify radiology reports flagged for laterality errors in its Quality Assurance Dashboard. The NLP model detects and highlights laterality mismatches in radiology reports. From an initial pool of 1,124 radiology reports flagged by the NLP tool for laterality errors, we selected and evaluated 898 reports encompassing radiography, CT, MRI, and ultrasound to ensure comprehensive modality coverage. A radiologist reviewed each radiology report to assess whether the flagged laterality error was present (reporting error, a true-positive) or absent (NLP error, a false-positive). Next, we applied ATARI to 237 radiology reports and images with consecutive NLP true-positive (118 reports) and false-positive (119 reports) laterality errors. We estimated the accuracy of the NLP and generative AI tools for identifying laterality errors overall and by modality. RESULTS: Among the 898 NLP-flagged laterality errors, 64% (574 of 898) were NLP errors and 36% (324 of 898) were reporting errors. The ATARI text query feature correctly identified the absence of laterality mismatch (NLP false-positives) with 97.4% accuracy (115 of 118 reports; 95% confidence interval [CI] = 96.5%-98.3%). Combined vision and text query resulted in 98.3% accuracy (116 of 118 reports or images; 95% CI = 97.6%-99.0%), and vision query alone had 98.3% accuracy (116 of 118 images; 95% CI = 97.6%-99.0%). CONCLUSION: The generative AI-empowered ATARI prototype outperformed the assessed NLP tool for determining true and false laterality errors in radiology reports while also enabling image-based laterality determination. Errors made by the ATARI text query on complex radiology reports emphasize the need for further improvement of the technology.
Subject(s)
Artificial Intelligence, Natural Language Processing, Humans, Radiology Information Systems, Diagnostic Errors, Diagnostic Imaging
ABSTRACT
PURPOSE: We created the infrastructure for a no-code machine learning (NML) platform that allows non-programming physicians to create machine learning models. We tested the platform by creating an NML model for classifying radiographs by the presence or absence of clavicle fractures. METHODS: Our IRB-approved retrospective study included 4135 clavicle radiographs from 2039 patients (mean age 52 ± 20 years, F:M 1022:1017) from 13 hospitals. Each patient had two-view clavicle radiographs with axial and anterior-posterior projections. The positive radiographs had either displaced or non-displaced clavicle fractures. We configured the NML platform to automatically retrieve the eligible exams, using the series' unique identifiers, from the hospital virtual network archive via Web Access to DICOM Objects (WADO). The platform trained a model until the validation loss plateaued. Once testing was complete, the platform provided the receiver operating characteristic (ROC) curve and confusion matrix for estimating sensitivity, specificity, and accuracy. RESULTS: The NML platform successfully retrieved 3917 radiographs (3917/4135, 94.7%) and parsed them to create an ML classifier, with 2151 radiographs in the training dataset, 100 radiographs in the validation dataset, and 1666 radiographs in the testing dataset (772 radiographs with clavicle fracture, 894 without clavicle fracture). The network identified clavicle fracture with 90% sensitivity, 87% specificity, and 88% accuracy, with an AUC of 0.95 (confidence interval 0.94-0.96). CONCLUSION: An NML platform can help physicians create and test machine learning models from multicenter imaging datasets, such as the one in our study, for classifying radiographs based on the presence of clavicle fracture.
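The platform is described as training until the validation loss plateaus; the stopping rule itself is not specified. One common plateau criterion is a patience rule on the validation loss, sketched below in framework-agnostic form (the function names and thresholds are illustrative assumptions, not the platform's implementation).

```python
def train_until_plateau(train_one_epoch, validation_loss, patience=5,
                        min_delta=1e-4, max_epochs=200):
    """Keep training until the validation loss stops improving by at least min_delta."""
    best = float("inf")
    stale_epochs = 0
    for epoch in range(max_epochs):
        train_one_epoch()             # one pass over the training radiographs
        loss = validation_loss()      # loss on the held-out validation images
        if loss < best - min_delta:   # meaningful improvement: reset the counter
            best, stale_epochs = loss, 0
        else:                         # no improvement this epoch
            stale_epochs += 1
        if stale_epochs >= patience:  # plateau reached: stop training
            return epoch + 1, best
    return max_epochs, best
```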
Subject(s)
Clavicle, Fractures, Bone, Machine Learning, Humans, Clavicle/injuries, Clavicle/diagnostic imaging, Fractures, Bone/diagnostic imaging, Fractures, Bone/classification, Female, Middle Aged, Male, Retrospective Studies, Sensitivity and Specificity, Adult, Radiography/methods
ABSTRACT
BACKGROUND AND PURPOSE: Mass effect and vasogenic edema are critical findings on CT of the head. This study compared the accuracy of an artificial intelligence model (Annalise Enterprise CTB) with consensus neuroradiologists' interpretations in detecting mass effect and vasogenic edema. MATERIALS AND METHODS: A retrospective stand-alone performance assessment was conducted on data sets of noncontrast CT head cases acquired between 2016 and 2022 for each finding. The cases were obtained from patients 18 years of age or older from 5 hospitals in the United States. The positive cases were selected consecutively on the basis of the original clinical reports using natural language processing and manual confirmation. The negative cases were selected by taking the next negative case acquired from the same CT scanner after each positive case. Each case was interpreted independently by up to three neuroradiologists to establish consensus interpretations. Each case was then interpreted by the artificial intelligence model for the presence of the relevant finding. The neuroradiologists were provided with the entire CT study. The artificial intelligence model separately received thin (≤1.5 mm) and/or thick (>1.5 and ≤5 mm) axial series. RESULTS: The 2 cohorts included 818 cases for mass effect and 310 cases for vasogenic edema. The artificial intelligence model identified mass effect with a sensitivity of 96.6% (95% CI, 94.9%-98.2%) and a specificity of 89.8% (95% CI, 84.7%-94.2%) for the thin series, and a sensitivity of 95.3% (95% CI, 93.5%-96.8%) and a specificity of 93.1% (95% CI, 89.1%-96.6%) for the thick series. It identified vasogenic edema with a sensitivity of 90.2% (95% CI, 82.0%-96.7%) and a specificity of 93.5% (95% CI, 88.9%-97.2%) for the thin series, and a sensitivity of 90.0% (95% CI, 84.0%-96.0%) and a specificity of 95.5% (95% CI, 92.5%-98.0%) for the thick series. The corresponding areas under the curve were at least 0.980. CONCLUSIONS: The assessed artificial intelligence model accurately identified mass effect and vasogenic edema in this CT data set. It could assist the clinical workflow by prioritizing interpretation of cases with abnormal findings, possibly benefiting patients through earlier identification and subsequent treatment.
Subject(s)
Artificial Intelligence, Brain Edema, Tomography, X-Ray Computed, Humans, Brain Edema/diagnostic imaging, Retrospective Studies, Female, Tomography, X-Ray Computed/methods, Male, Middle Aged, Aged, Sensitivity and Specificity, Adult
ABSTRACT
Polypharmacy remains an important challenge for patients with extensive medical complexity. Given the primary care shortage and the aging population, effective polypharmacy management is crucial to managing the increasing burden of care. The capacity of large language model (LLM)-based artificial intelligence to aid in polypharmacy management has yet to be evaluated. Here, we evaluate ChatGPT's performance in polypharmacy management via its deprescribing decisions in standardized clinical vignettes. We inputted several clinical vignettes, originally from a study of general practitioners' deprescribing decisions, into ChatGPT 3.5, a publicly available LLM, and evaluated its capacity for yes/no binary deprescribing decisions as well as list-based prompts in which the model was asked to choose which of several medications to deprescribe. We recorded ChatGPT's responses to the yes/no binary deprescribing prompts and the number and types of medications deprescribed. In yes/no binary deprescribing decisions, ChatGPT universally recommended deprescribing medications regardless of activities of daily living (ADL) status in patients with no overlying cardiovascular disease (CVD) history; in patients with a CVD history, ChatGPT's answers varied across technical replicates. The total number of medications deprescribed ranged from 2.67 to 3.67 (out of 7) and did not vary with CVD status, but increased linearly with severity of ADL impairment. Among medication types, ChatGPT preferentially deprescribed pain medications. ChatGPT's deprescribing decisions vary along the axes of ADL status, CVD history, and medication type, indicating some concordance of internal logic between general practitioners and the model. These results indicate that specifically trained LLMs may provide useful clinical support in polypharmacy management for primary care physicians.
Subject(s)
Cardiovascular Diseases, Deprescriptions, General Practitioners, Humans, Aged, Polypharmacy, Artificial Intelligence
ABSTRACT
PURPOSE: Medical imaging accounts for 85% of digital health's venture capital funding. As funding grows, it is expected that the number of artificial intelligence (AI) products will increase commensurately. The study's objective is to project the number of new AI products given the statistical association between historical funding and FDA-approved AI products. METHODS: The study used data from the ACR Data Science Institute for the number of FDA-approved AI products (2008-2022) and data from Rock Health for AI funding (2013-2022). Employing a 6-year lag between funding and product approval, we used linear regression to estimate the association between the number of new products approved in a given year and the lagged funding (ie, product-year funding). Using this statistical relationship, we forecasted the number of new FDA-approved products. RESULTS: The results show that there are 11.33 (95% confidence interval: 7.03-15.64) new AI products for every $1 billion in funding, assuming a 6-year lag between funding and product approval. In 2022 there were 69 new FDA-approved products associated with $4.8 billion in funding. In 2035, product-year funding is projected to reach $30.8 billion, resulting in 350 new products that year. CONCLUSIONS: FDA-approved AI products are expected to grow from 69 in 2022 to 350 in 2035 given the expected funding growth in the coming years. AI is likely to change the practice of diagnostic radiology as new products are developed and integrated into practice. As more AI products are integrated, this may incentivize increased investment in future AI products.
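The projection follows from a simple lagged linear regression; the sketch below illustrates that calculation with placeholder funding and product counts (only the 2022 values appear in the abstract), along with the arithmetic behind the 2035 estimate of roughly 350 products.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder series for illustration; real values come from Rock Health (funding, $B)
# and the ACR DSI / FDA list (new products cleared per year).
funding_by_year = {2013: 0.6, 2014: 0.9, 2015: 1.2, 2016: 1.6}
products_by_year = {2019: 9, 2020: 14, 2021: 20, 2022: 69}

LAG = 6  # products approved in year t are paired with funding from year t - LAG
pairs = [(funding_by_year[y - LAG], products_by_year[y])
         for y in products_by_year if (y - LAG) in funding_by_year]
X = np.array([[f] for f, _ in pairs])
y = np.array([p for _, p in pairs])

fit = LinearRegression().fit(X, y)
print(f"~{fit.coef_[0]:.1f} new products per $1B of lagged funding (placeholder data)")

# The abstract's 2035 projection in the same terms:
# 11.33 products per $1B * $30.8B of product-year funding ≈ 349, i.e. ~350 products
print(11.33 * 30.8)
```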
Subject(s)
Artificial Intelligence, Capital Financing, Academies and Institutes, Data Science, Investments
ABSTRACT
PURPOSE: To evaluate the real-world performance of two FDA-cleared artificial intelligence (AI)-based computer-aided triage and notification (CADt) detection devices and compare it with the manufacturer-reported performance testing in the instructions for use. MATERIALS AND METHODS: The clinical performance of two FDA-cleared CADt large-vessel occlusion (LVO) devices was retrospectively evaluated at two separate stroke centers. Consecutive "code stroke" CT angiography examinations were included and assessed for patient demographics, scanner manufacturer, presence or absence of a CADt result, the CADt result itself, and LVO in the internal carotid artery (ICA), horizontal middle cerebral artery (MCA) segment (M1), Sylvian MCA segments after the bifurcation (M2), precommunicating part of the cerebral artery, postcommunicating part of the cerebral artery, vertebral artery, and basilar artery segments. The original radiology report served as the reference standard, and a study radiologist extracted the above data elements from the imaging examination and radiology report. RESULTS: At hospital A, the CADt algorithm manufacturer reports assessment of the intracranial ICA and MCA with a sensitivity of 97% and a specificity of 95.6%. Real-world performance included 704 cases, 79 of which had no CADt result available. Sensitivity and specificity in the ICA and M1 segments were 85.3% and 91.9%. Sensitivity decreased to 68.5% when M2 segments were included and to 59.9% when all proximal vessel segments were included. At hospital B, the CADt algorithm manufacturer reports a sensitivity of 87.8% and a specificity of 89.6%, without specifying the vessel segments. Real-world performance included 642 cases, 20 of which had no CADt result available. Sensitivity and specificity in the ICA and M1 segments were 90.7% and 97.9%. Sensitivity decreased to 76.4% when M2 segments were included and to 59.4% when all proximal vessel segments were included. DISCUSSION: Real-world testing of two CADt LVO detection algorithms identified gaps in the detection and communication of potentially treatable LVOs when considering vessels beyond the intracranial ICA and M1 segments and in cases with absent or uninterpretable data.
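Segment-tiered sensitivity of this kind can be recomputed by progressively widening the definition of a positive exam; a sketch follows under assumed column names and segment labels (not the registry's actual schema), treating exams with no CADt result as unflagged.

```python
import pandas as pd

# Hypothetical per-exam table (column names are assumptions):
# 'cadt_flag'   : True if the CADt device raised an LVO alert; False if negative or no result
# 'lvo_segment' : reference-standard occlusion site from the radiology report, or None
df = pd.read_csv("code_stroke_exams.csv")

tiers = {
    "ICA + M1":             {"ICA", "M1"},
    "+ M2":                 {"ICA", "M1", "M2"},
    "all proximal vessels": {"ICA", "M1", "M2", "A1", "P1", "vertebral", "basilar"},
}
for name, segments in tiers.items():
    positives = df[df["lvo_segment"].isin(segments)]
    sensitivity = positives["cadt_flag"].mean()  # flagged positives / all positives in tier
    print(f"{name}: sensitivity {sensitivity:.1%} (n={len(positives)})")
```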
Subject(s)
Artificial Intelligence, Stroke, Humans, Triage, Retrospective Studies, Stroke/diagnostic imaging, Algorithms, Computers
ABSTRACT
The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment.
Subject(s)
Empathy, Medicine, Humans, Health Facilities, Language, Delivery of Health Care
ABSTRACT
BACKGROUND: Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks, as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated. OBJECTIVE: This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. METHODS: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the factors contributing to ChatGPT's performance on clinical tasks. RESULTS: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI 67.8%-86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types. CONCLUSIONS: ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
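The question-type coefficients suggest a linear model of per-question accuracy on question type and vignette covariates; below is a sketch of one such regression using hypothetical column names (the authors' exact model specification is not given in the abstract).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-question scoring table (column names are assumptions):
# 'score'  : proportion of the question answered correctly (0-1), as graded by human scorers
# 'qtype'  : 'general', 'differential', 'testing', 'diagnosis', or 'management'
# 'age', 'gender', 'esi' : vignette-level covariates
scores = pd.read_csv("vignette_scores.csv")

# Linear model of accuracy on question type and vignette covariates,
# with general-knowledge questions as the reference level.
model = smf.ols(
    "score ~ C(qtype, Treatment(reference='general')) + age + C(gender) + esi",
    data=scores,
).fit()
print(model.summary())  # question-type coefficients play the role of the reported beta values
```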
Subject(s)
Artificial Intelligence, Humans, Clinical Decision-Making, Organizations, Workflow, User-Centered Design
ABSTRACT
OBJECTIVE: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT (Generative Pre-trained Transformer)-3.5 and GPT-4 (OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain. METHODS: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores. RESULTS: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, as compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%. DISCUSSION: Our results demonstrate the eventual feasibility of using large language models like ChatGPT for radiologic decision making, with the potential to improve clinical workflow and promote responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
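One way OE and SATA responses might be scored and averaged across the three replicates is sketched below; the 0-2 OE rubric shown here is only a plausible reading of the scoring criteria, not the study's published rubric.

```python
# Hypothetical grading helpers for the two prompt formats; the rubric details are assumptions.
def grade_open_ended(proposed: list[str], usually_appropriate: set[str]) -> int:
    """0 = no appropriate modality proposed, 1 = mix of appropriate and inappropriate, 2 = only appropriate."""
    proposed_set = set(proposed)
    if not proposed_set & usually_appropriate:
        return 0
    return 2 if proposed_set <= usually_appropriate else 1

def grade_sata(selected: set[str], correct: set[str], options: set[str]) -> float:
    """Fraction of answer options classified correctly (selected if and only if appropriate)."""
    right = sum((opt in selected) == (opt in correct) for opt in options)
    return right / len(options)

# Final score for a prompt: the mean over three replicate entries, as in the study design.
replicate_scores = [1.0, 2.0, 2.0]  # placeholder OE grades
print(sum(replicate_scores) / len(replicate_scores))
```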
Subject(s)
Breast Neoplasms, Mastodynia, Radiology, Humans, Female, Breast Neoplasms/diagnostic imaging, Decision Making
ABSTRACT
Radiologic tests often contain rich imaging data not relevant to the clinical indication. Opportunistic screening refers to the practice of systematically leveraging these incidental imaging findings. Although opportunistic screening can apply to imaging modalities such as conventional radiography, US, and MRI, most attention to date has focused on body CT by using artificial intelligence (AI)-assisted methods. Body CT represents an ideal high-volume modality whereby a quantitative assessment of tissue composition (eg, bone, muscle, fat, and vascular calcium) can provide valuable risk stratification and help detect unsuspected presymptomatic disease. The emergence of "explainable" AI algorithms that fully automate these measurements could eventually lead to their routine clinical use. Potential barriers to widespread implementation of opportunistic CT screening include the need for buy-in from radiologists, referring providers, and patients. Standardization of acquiring and reporting measures is needed, in addition to expanded normative data according to age, sex, and race and ethnicity. Regulatory and reimbursement hurdles are not insurmountable but pose substantial challenges to commercialization and clinical use. Through demonstration of improved population health outcomes and cost-effectiveness, these opportunistic CT-based measures should be attractive to both payers and health care systems as value-based reimbursement models mature. If highly successful, opportunistic screening could eventually justify a practice of standalone "intended" CT screening.
Subject(s)
Artificial Intelligence, Radiology, Humans, Algorithms, Radiologists, Mass Screening/methods, Radiology/methods
ABSTRACT
RATIONALE AND OBJECTIVES: Suboptimal chest radiographs (CXR) can limit interpretation of critical findings. Radiologist-trained AI models were evaluated for differentiating suboptimal (sCXR) and optimal (oCXR) chest radiographs. MATERIALS AND METHODS: Our IRB-approved study included 3278 CXRs from adult patients (mean age 55 ± 20 years) identified from a retrospective search of CXR radiology reports from 5 sites. A chest radiologist reviewed all CXRs for the cause of suboptimality. The de-identified CXRs were uploaded into an AI server application for training and testing 5 AI models. The training set consisted of 2202 CXRs (n = 807 oCXR; n = 1395 sCXR), while 1076 CXRs (n = 729 sCXR; n = 347 oCXR) were used for testing. Data were analyzed with the area under the curve (AUC) for the models' ability to classify oCXR and sCXR correctly. RESULTS: For the two-class classification into sCXR or oCXR across all sites, AI identified CXR with missing anatomy with 78% sensitivity, 95% specificity, 91% accuracy, and an AUC of 0.87 (95% CI 0.82-0.92). AI identified obscured thoracic anatomy with 91% sensitivity, 97% specificity, 95% accuracy, and an AUC of 0.94 (95% CI 0.90-0.97), and inadequate exposure with 90% sensitivity, 93% specificity, 92% accuracy, and an AUC of 0.91 (95% CI 0.88-0.95). The presence of low lung volume was identified with 96% sensitivity, 92% specificity, 93% accuracy, and an AUC of 0.94 (95% CI 0.92-0.96). The sensitivity, specificity, accuracy, and AUC of AI in identifying patient rotation were 92%, 96%, 95%, and 0.94 (95% CI 0.91-0.98), respectively. CONCLUSION: The radiologist-trained AI models can accurately classify optimal and suboptimal CXRs. Such AI models at the front end of radiographic equipment can enable radiographers to repeat sCXRs when necessary.
Subject(s)
Lung, Radiography, Thoracic, Adult, Humans, Middle Aged, Aged, Lung/diagnostic imaging, Retrospective Studies, Radiography, Radiologists
ABSTRACT
IMPORTANCE: Large language model (LLM) artificial intelligence (AI) chatbots direct the power of large training datasets towards successive, related tasks, as opposed to single-ask tasks, for which AI already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as virtual physicians, has not yet been evaluated. OBJECTIVE: To evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. DESIGN: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. SETTING: ChatGPT, a publicly available LLM. PARTICIPANTS: Clinical vignettes featured hypothetical patients with a variety of ages and gender identities, and a range of Emergency Severity Indices (ESIs) based on initial clinical presentation. EXPOSURES: MSD Clinical Manual vignettes. MAIN OUTCOMES AND MEASURES: We measured the proportion of correct responses to the questions posed within the clinical vignettes tested. RESULTS: ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) type questions. CONCLUSIONS AND RELEVANCE: ChatGPT achieves impressive accuracy in clinical decision making, with particular strengths emerging as it has more clinical information at its disposal.
ABSTRACT
The multitude of artificial intelligence (AI)-based solutions, vendors, and platforms poses a challenging proposition to an already complex clinical radiology practice. Beyond assessing and ensuring acceptable local performance and workflow fit to improve imaging services, AI tools require multiple stakeholders, including clinical, technical, and financial stakeholders, who collaborate to move potentially deployable applications to full clinical deployment in a structured and efficient manner. Postdeployment monitoring and surveillance of such tools require an infrastructure that ensures proper and safe use. Herein, the authors describe their experience and framework for implementing and supporting the use of AI applications in the radiology workflow.
Subject(s)
Artificial Intelligence, Radiology, Radiology/methods, Diagnostic Imaging, Workflow, Commerce
ABSTRACT
Purpose: Motion-impaired CT images can result in limited or suboptimal diagnostic interpretation (with missed or miscalled lesions) and patient recall. We trained and tested an artificial intelligence (AI) model for identifying substantial motion artifacts on CT pulmonary angiography (CTPA) that negatively affect diagnostic interpretation. Methods: With IRB approval and HIPAA compliance, we queried our multicenter radiology report database (mPower, Nuance) for CTPA reports between July 2015 and March 2022 containing the following terms: "motion artifacts", "respiratory motion", "technically inadequate", and "suboptimal" or "limited exam". All CTPA reports were from two quaternary healthcare sites (Site A, n = 335; Site B, n = 259) and one community site (Site C, n = 199). A thoracic radiologist reviewed the CT images of all positive hits for motion artifacts (present or absent) and their severity (no diagnostic effect or major diagnostic impairment). Coronal multiplanar images from 793 CTPA exams were de-identified and exported offline into an AI model-building prototype (Cognex Vision Pro, Cognex Corporation) to train an AI model to perform two-class classification ("motion" or "no motion") with data from the three sites (70% training dataset, n = 554; 30% validation dataset, n = 239). Separately, data from Site A and Site C were used for training and validation, and testing was performed on the Site B CTPA exams. A five-fold repeated cross-validation was performed to evaluate the model performance with accuracy and receiver operating characteristic (ROC) analysis. Results: Among the CTPA images from 793 patients (mean age 63 ± 17 years; 391 males, 402 females), 372 had no motion artifacts and 421 had substantial motion artifacts. The average performance of the AI model after five-fold repeated cross-validation for the two-class classification was 94% sensitivity, 91% specificity, 93% accuracy, and 0.93 area under the ROC curve (AUC: 95% CI 0.89-0.97). Conclusion: The AI model used in this study can successfully identify CTPA exams with interpretation-limiting motion artifacts in multicenter training and test datasets. Clinical relevance: The AI model used in the study can help alert technologists to the presence of substantial motion artifacts on CTPA, where a repeat image acquisition can help salvage diagnostic information.
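Repeated five-fold cross-validation of a two-class classifier can be set up as sketched below; the classifier, features, and number of repeats are placeholders (the study used a vendor prototype, and the repeat count is not stated in the abstract).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Placeholder features and labels purely to illustrate the evaluation scheme:
# y = 1 for exams with substantial motion artifacts, 0 for exams without.
rng = np.random.default_rng(0)
X = rng.normal(size=(793, 32))
y = rng.integers(0, 2, size=793)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv,
                        scoring=["accuracy", "roc_auc", "recall"])
print("mean accuracy:", scores["test_accuracy"].mean())
print("mean AUC:", scores["test_roc_auc"].mean())
print("mean sensitivity (recall):", scores["test_recall"].mean())
```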