ABSTRACT
Artificial intelligence (AI) algorithms are prone to bias at multiple stages of model development, with the potential to exacerbate health disparities. However, bias in imaging AI is a complex topic that encompasses multiple coexisting definitions. Bias may refer to an unequal preference for a person or group owing to preexisting attitudes or beliefs, whether intentional or unintentional. In contrast, cognitive bias refers to systematic deviation from objective judgment due to reliance on heuristics, and statistical bias refers to differences between true and expected values, commonly manifesting as systematic error in model prediction (ie, model output unrepresentative of real-world conditions). Clinical decisions informed by biased models may lead to patient harm due to action on inaccurate AI results or may exacerbate health inequities due to differing performance among patient populations. Although inequitable bias can harm patients in this context, a mindful approach leveraging equitable bias can address underrepresentation of minority groups or rare diseases. Radiologists should also be aware of bias after AI deployment, such as automation bias, a tendency to agree with automated decisions despite contrary evidence. Understanding common sources of imaging AI bias and the consequences of using biased models can guide preventive measures to mitigate its impact. Accordingly, the authors focus on sources of bias at stages along the imaging machine learning life cycle, attempting to simplify potentially intimidating technical terminology for general radiologists who use AI tools in practice or collaborate with data scientists and engineers for AI tool development. The authors review definitions of bias in AI, describe common sources of bias, and present recommendations to guide quality control measures to mitigate the impact of bias in imaging AI. Understanding the terms featured in this article will enable a proactive approach to identifying and mitigating bias in imaging AI. Published under a CC BY 4.0 license. Test Your Knowledge questions for this article are available in the supplemental material. See the invited commentary by Rouzrokh and Erickson in this issue.
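For readers who want a concrete handle on the statistical and inequitable forms of bias described above, the sketch below shows one common quality control step: stratifying model performance by a patient subgroup attribute. It is a minimal illustration, not from the article; the arrays, the subgroup variable, and the helper function subgroup_auc are all hypothetical.

```python
# Minimal sketch of a subgroup performance audit for an imaging AI classifier.
# Hypothetical inputs: y_true (labels), y_score (model probabilities), and
# `group` (a demographic attribute); none of these come from the article itself.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_score, group):
    """Return AUC per subgroup so performance gaps (one face of bias) become visible."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    results = {}
    for g in np.unique(group):
        mask = group == g
        # AUC is undefined if a subgroup contains only one class.
        if len(np.unique(y_true[mask])) < 2:
            results[g] = float("nan")
        else:
            results[g] = roc_auc_score(y_true[mask], y_score[mask])
    return results

# Toy example: a large gap between subgroups would prompt further review.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
group = rng.choice(["A", "B"], 200)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)
print(subgroup_auc(y_true, y_score, group))
```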
Subject(s)
Algorithms, Artificial Intelligence, Humans, Automation, Machine Learning, Bias
ABSTRACT
OBJECTIVE: To develop machine learning (ML) models capable of predicting ICU admission and extended length of stay (LOS) after torso (chest, abdomen, or pelvis) trauma, by using clinical and/or imaging data. MATERIALS AND METHODS: This was a retrospective study of 840 adult patients admitted to a level 1 trauma center after injury to the torso over the course of 1 year. Clinical parameters included age, sex, vital signs, clinical scores, and laboratory values. Imaging data consisted of any injury present on CT. The two outcomes of interest were ICU admission and extended LOS, defined as an LOS greater than the median in the dataset. We developed and tested artificial neural network (ANN) and support vector machine (SVM) models, and predictive performance was evaluated by area under the receiver operating characteristic (ROC) curve (AUC). RESULTS: The AUCs of SVM and ANN models to predict ICU admission were up to 0.87 ± 0.03 and 0.78 ± 0.02, respectively. The AUCs of SVM and ANN models to predict extended LOS were up to 0.80 ± 0.04 and 0.81 ± 0.05, respectively. Predictions based on imaging alone or imaging with clinical parameters were consistently more accurate than those based solely on clinical parameters. CONCLUSIONS: The best performing models incorporated imaging findings and outperformed those with clinical findings alone. ML models have the potential to help predict outcomes in trauma by integrating clinical and imaging findings, although further research may be needed to optimize their performance. KEY POINTS:
• Artificial neural network and support vector machine models were used to predict intensive care unit admission and extended length of stay after trauma to the torso.
• Our input data consisted of clinical parameters and CT imaging findings derived from radiology reports, and we found that combining the two significantly enhanced the prediction of both outcomes with either model.
• The highest accuracy (83%) and highest area under the receiver operating characteristic curve (0.87) were obtained for artificial neural networks and support vector machines, respectively, by combining clinical and imaging features in the prediction of intensive care unit admission.
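As a rough illustration of the modeling setup described above (not the authors' actual pipeline), the sketch below trains SVM and small neural network classifiers on a clinical-only feature set versus a combined clinical-plus-imaging feature set and compares cross-validated AUCs with scikit-learn; the feature matrices and outcome labels are synthetic stand-ins.

```python
# Illustrative sketch: compare SVM and neural network classifiers on
# clinical-only vs. clinical+imaging feature sets using cross-validated AUC.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 840  # cohort size from the abstract; the features below are synthetic stand-ins
clinical = rng.normal(size=(n, 6))       # e.g., age, vitals, lab values (hypothetical)
imaging = rng.integers(0, 2, (n, 10))    # e.g., binary CT injury findings (hypothetical)
y = rng.integers(0, 2, n)                # e.g., ICU admission

feature_sets = {
    "clinical only": clinical,
    "clinical + imaging": np.hstack([clinical, imaging]),
}
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True)),
    "ANN": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000)),
}
for fname, X in feature_sets.items():
    for mname, model in models.items():
        aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{mname}, {fname}: AUC = {aucs.mean():.2f} ± {aucs.std():.2f}")
```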
Subject(s)
Intensive Care Units, Machine Learning, Adult, Humans, Length of Stay, Retrospective Studies, Tomography, X-Ray Computed, Torso
ABSTRACT
BACKGROUND: Recently developed convolutional neural network (CNN) models determine bone age more accurately than radiologists. OBJECTIVE: The purpose of this study was to determine whether a CNN and radiologists can accurately predict bone age from radiographs using only the index finger rather than the whole hand. MATERIALS AND METHODS: We used a public anonymized dataset provided by the Radiological Society of North America (RSNA) pediatric bone age challenge. The dataset contains 12,611 hand radiographs for training and 200 radiographs for testing. The index finger was cropped from these images to create a second dataset. Separate CNN models were trained using the whole-hand radiographs and the cropped second-digit dataset, using the consensus ground truth provided by the RSNA bone age challenge. Bone age determination using both models was compared with the ground truth provided by the RSNA dataset. Separately, three pediatric radiologists determined bone age from the whole-hand and index-finger radiographs, and their consensus was compared to the ground truth and the CNN-model-determined bone ages. RESULTS: The mean absolute difference between the ground truth and CNN bone age was similar for whole-hand and index-finger radiographs (4.7 months vs. 5.1 months, P=0.14), and both values were significantly smaller than that for radiologist bone age determination from the single-finger radiographs (8.0 months, P<0.0001). CONCLUSION: CNN-model-determined bone ages from index-finger radiographs are similar to whole-hand bone ages interpreted by radiologists in the dataset, as well as to those from a model trained on whole-hand radiographs. In addition, the index-finger model agreed more closely with the ground truth than did subspecialty-trained pediatric radiologists who also used only the index finger to determine bone age. Radiologists interpreting bone age can use the second digit as a reliable starting point in their search pattern.
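A minimal sketch of the regression setup implied above, assuming a generic ImageNet-pretrained backbone and L1 (mean absolute error) loss in months; the architecture, preprocessing, and hyperparameters are assumptions rather than the study's reported implementation.

```python
# Illustrative only: fine-tune a generic CNN to regress bone age (in months)
# from cropped index-finger images. Crop logic and hyperparameters are assumed.
import torch
import torch.nn as nn
from torchvision import models

class BoneAgeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # single continuous output
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x).squeeze(1)  # predicted bone age in months

model = BoneAgeRegressor()
criterion = nn.L1Loss()  # mean absolute error, matching the months-based metric
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One hypothetical training step on a batch of cropped index-finger images.
images = torch.randn(8, 3, 224, 224)   # stand-in for preprocessed crops
ages_months = torch.rand(8) * 216      # stand-in labels (0-18 years)
loss = criterion(model(images), ages_months)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"batch MAE: {loss.item():.1f} months")
```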
Subject(s)
Age Determination by Skeleton, Finger Phalanges/diagnostic imaging, Neural Networks, Computer, Adolescent, Child, Child, Preschool, Datasets as Topic, Female, Humans, Image Processing, Computer-Assisted, Infant, Male, Retrospective Studies
ABSTRACT
Purpose: To develop and evaluate domain-specific and pretrained bidirectional encoder representations from transformers (BERT) models in a transfer learning task on varying training dataset sizes to annotate a larger overall dataset. Materials and Methods: The authors retrospectively reviewed 69 095 anonymized adult chest radiograph reports (reports dated April 2020-March 2021). From the overall cohort, 1004 reports were randomly selected and labeled for the presence or absence of each of the following devices: endotracheal tube (ETT), enterogastric tube (NGT, or Dobhoff tube), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Pretrained transformer models (BERT, PubMedBERT, DistilBERT, RoBERTa, and DeBERTa) were trained, validated, and tested on 60%, 20%, and 20%, respectively, of these reports through fivefold cross-validation. Additional training involved varying dataset sizes with 5%, 10%, 15%, 20%, and 40% of the 1004 reports. The best-performing epochs were used to assess area under the receiver operating characteristic curve (AUC) and determine run time on the overall dataset. Results: The highest average AUCs from fivefold cross-validation were 0.996 for ETT (RoBERTa), 0.994 for NGT (RoBERTa), 0.991 for CVC (PubMedBERT), and 0.98 for SGC (PubMedBERT). DeBERTa demonstrated the highest AUC for each support device when trained on 5% of the training set. PubMedBERT maintained a higher AUC than BERT as the training set size decreased. Training and validation time was shortest for DistilBERT, at 3 minutes 39 seconds on the annotated cohort. Conclusion: Pretrained and domain-specific transformer models required small training datasets and short training times to create a highly accurate final model that expedites autonomous annotation of large datasets. Keywords: Informatics, Named Entity Recognition, Transfer Learning. Supplemental material is available for this article. ©RSNA, 2022. See also the commentary by Zech in this issue.
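The fine-tuning workflow described above can be sketched with the Hugging Face transformers library as a multi-label classification problem over the four devices; the model choice, label encoding, and example report below are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch: fine-tune a pretrained transformer for multi-label
# classification of support devices (ETT, NGT, CVC, SGC) in chest radiograph reports.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["ETT", "NGT", "CVC", "SGC"]
model_name = "roberta-base"  # could be swapped for a domain-specific model such as PubMedBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid output with per-device BCE loss
)

# Hypothetical labeled report: ETT and CVC present, NGT and SGC absent.
report = "Endotracheal tube tip 4 cm above the carina. Right IJ central line in place."
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0]])

inputs = tokenizer(report, truncation=True, padding=True, return_tensors="pt")
outputs = model(**inputs, labels=targets)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
probs = torch.sigmoid(outputs.logits).detach().squeeze().tolist()
print(dict(zip(labels, [round(p, 3) for p in probs])))
```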
ABSTRACT
PURPOSE: To determine the feasibility of using deep learning with a multiview approach, similar to how a human radiologist reviews multiple images, for binomial classification of acute pediatric elbow radiographic abnormalities. MATERIALS AND METHODS: A total of 21 456 radiographic studies containing 58 817 images of the elbow and associated radiology reports over a 4-year period from January 2014 through December 2017 at a dedicated children's hospital were retrospectively retrieved. Mean age was 7.2 years, and 43% were female patients. The studies were binomially classified, based on the reports, as either positive or negative for acute or subacute traumatic abnormality. The studies were randomly divided into a training set containing 20 350 studies and a validation set containing the remaining 1106 studies. A multiview approach was used, combining a convolutional neural network and a recurrent neural network to interpret an entire series of three radiographs together. Sensitivity, specificity, positive predictive value, negative predictive value, area under the receiver operating characteristic curve (AUC), and their 95% confidence intervals were calculated. RESULTS: AUC was 0.95, and accuracy was 88% for the model on the studied dataset. Sensitivity for the model was 91% (536 of 590), while specificity was 84% (434 of 516). Of 241 supracondylar fractures, one was missed. Of 88 lateral condylar fractures, one was missed. Of 77 elbow effusions without fracture, 15 were missed. Of 184 other abnormalities, 37 were missed. CONCLUSION: Deep learning can effectively classify acute and nonacute pediatric elbow abnormalities on radiographs in the setting of trauma. A recurrent neural network was used to classify an entire radiographic series, arrive at a decision based on all views, and identify fractures in pediatric patients with variable skeletal immaturity. Supplemental material is available for this article. © RSNA, 2019.
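A minimal sketch of the multiview idea, with a CNN encoding each radiograph and an LSTM aggregating the per-view features into a single study-level prediction; the backbone, hidden size, and input shapes are assumptions rather than the authors' exact architecture.

```python
# Illustrative sketch: CNN extracts per-view features, a recurrent network
# aggregates the three-view elbow series into one study-level decision.
import torch
import torch.nn as nn
from torchvision import models

class MultiViewElbowClassifier(nn.Module):
    def __init__(self, hidden_size=128):
        super().__init__()
        cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.feature_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()              # use the CNN as a per-view feature extractor
        self.cnn = cnn
        self.rnn = nn.LSTM(self.feature_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, 1)  # positive vs. negative study

    def forward(self, views):               # views: (batch, n_views, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.cnn(views.reshape(b * v, c, h, w)).reshape(b, v, self.feature_dim)
        _, (hidden, _) = self.rnn(feats)    # final hidden state summarizes all views
        return self.classifier(hidden[-1]).squeeze(1)

model = MultiViewElbowClassifier()
study = torch.randn(2, 3, 3, 224, 224)      # two studies, three views each (synthetic)
probs = torch.sigmoid(model(study))
print(probs)  # study-level probability of acute/subacute abnormality
```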