Results 1 - 20 of 119

1.
Nat Methods ; 21(2): 182-194, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38347140

ABSTRACT

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.


Subject(s)
Artificial Intelligence
2.
Nat Methods ; 21(2): 195-212, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38347141

ABSTRACT

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
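
To make the fingerprint idea concrete, here is a minimal sketch in Python; the field names and toy selection rules are illustrative assumptions, not the schema of the actual Metrics Reloaded tool:

```python
from dataclasses import dataclass

# Illustrative sketch of a Metrics Reloaded-style problem fingerprint.
# Fields and rules are assumptions for exposition only.
@dataclass
class ProblemFingerprint:
    task: str                       # e.g. "semantic_segmentation"
    class_imbalance: bool           # is the target class rare?
    small_structures: bool          # are targets only a few pixels wide?
    boundary_quality_matters: bool

def suggest_metrics(fp: ProblemFingerprint) -> list[str]:
    """Toy rule set mapping fingerprint properties to candidate metrics."""
    if fp.task == "semantic_segmentation":
        metrics = ["Dice"]
        if fp.boundary_quality_matters:
            metrics.append("NSD")       # normalized surface distance
        if fp.small_structures:
            metrics.remove("Dice")      # overlap metrics punish small objects
            metrics.append("boundary IoU")
        return metrics
    if fp.task == "image_classification":
        return ["balanced accuracy"] if fp.class_imbalance else ["accuracy"]
    return ["task-specific metrics"]

print(suggest_metrics(ProblemFingerprint(
    task="semantic_segmentation", class_imbalance=True,
    small_structures=True, boundary_quality_matters=True)))
```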


Subject(s)
Algorithms , Image Processing, Computer-Assisted , Machine Learning , Semantics
3.
BMC Cancer ; 23(1): 460, 2023 May 19.
Article in English | MEDLINE | ID: mdl-37208717

ABSTRACT

BACKGROUND: Double reading (DR) in screening mammography increases cancer detection and lowers recall rates, but has sustainability challenges due to workforce shortages. Artificial intelligence (AI) as an independent reader (IR) in DR may provide a cost-effective solution with the potential to improve screening performance. Evidence for AI to generalise across different patient populations, screening programmes and equipment vendors, however, is still lacking. METHODS: This retrospective study simulated DR with AI as an IR, using data representative of real-world deployments (275,900 cases, 177,882 participants) from four mammography equipment vendors, seven screening sites, and two countries. Non-inferiority and superiority were assessed for relevant screening metrics. RESULTS: DR with AI, compared with human DR, showed at least non-inferior recall rate, cancer detection rate, sensitivity, specificity and positive predictive value (PPV) for each mammography vendor and site, and superior recall rate, specificity, and PPV for some. The simulation indicates that using AI would have increased the arbitration rate (from 3.3% to 12.3%), but could have reduced human workload by 30.0% to 44.8%. CONCLUSIONS: AI has potential as an IR in the DR workflow across different screening programmes, mammography equipment and geographies, substantially reducing human reader workload while maintaining or improving the standard of care. TRIAL REGISTRATION: ISRCTN18056078 (20/03/2019; retrospectively registered).
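
The simulated workflow can be sketched as follows; the decision rule (discordant reads go to human arbitration) follows standard double-reading practice, and the field names are illustrative:

```python
def simulate_dr_with_ai(cases):
    """Schematic simulation of double reading with AI as the second,
    independent reader. Each case carries the first human reader's
    opinion, the AI's opinion, and a pre-recorded arbitration outcome.
    Discordant reads are sent to human arbitration, as in standard DR."""
    recalls, arbitrations, human_reads = 0, 0, 0
    for case in cases:
        human_reads += 1                        # reader 1 always reads
        if case["reader1_recall"] == case["ai_recall"]:
            recalled = case["ai_recall"]        # concordant: decision stands
        else:
            arbitrations += 1                   # discordant: human arbitration
            human_reads += 1
            recalled = case["arbitration_recall"]
        recalls += recalled
    n = len(cases)
    return {"recall_rate": recalls / n,
            "arbitration_rate": arbitrations / n,
            # human DR would need two human reads per case
            "workload_saving": 1 - human_reads / (2 * n)}
```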


Subject(s)
Breast Neoplasms , Humans , Female , Breast Neoplasms/diagnostic imaging , Mammography , Artificial Intelligence , Retrospective Studies , Early Detection of Cancer , Mass Screening
4.
Brain ; 145(6): 2064-2076, 2022 06 30.
Article in English | MEDLINE | ID: mdl-35377407

ABSTRACT

There is substantial interest in the potential for traumatic brain injury to result in progressive neurological deterioration. While blood biomarkers such as glial fibrillary acidic protein (GFAP) and neurofilament light have been widely explored in characterizing acute traumatic brain injury (TBI), their use in the chronic phase is limited. Given increasing evidence that these proteins may be markers of ongoing neurodegeneration in a range of diseases, we examined their relationship to imaging changes and functional outcome in the months to years following TBI. Two hundred and three patients were recruited in two separate cohorts: 6 months post-injury (n = 165) and >5 years post-injury (n = 38; 12 of whom also provided data ∼8 months post-TBI). Subjects underwent blood biomarker sampling (n = 199) and MRI (n = 172; including diffusion tensor imaging). Data from patient cohorts were compared to 59 healthy volunteers and 21 non-brain-injury trauma controls. Mean diffusivity and fractional anisotropy were calculated in cortical grey matter, deep grey matter and whole brain white matter. Accelerated brain ageing was calculated at a whole brain level as the predicted age difference defined using T1-weighted images, and at a voxel-based level as the annualized Jacobian determinants in white matter and grey matter, referenced to a population of 652 healthy control subjects. Serum neurofilament light concentrations were elevated in the early chronic phase. While GFAP values were within the normal range at ∼8 months, many patients showed secondary and temporally distinct elevations up to >5 years after injury. Biomarker elevation at 6 months was significantly related to metrics of microstructural injury on diffusion tensor imaging. Biomarker levels at ∼8 months predicted white matter volume loss at >5 years, and annualized brain volume loss between ∼8 months and 5 years. Patients who worsened functionally between ∼8 months and >5 years showed higher than predicted brain age and elevated neurofilament light levels. GFAP and neurofilament light levels can remain elevated months to years after TBI, and show distinct temporal profiles. These elevations correlate closely with microstructural injury in both grey and white matter on contemporaneous quantitative diffusion tensor imaging. Neurofilament light elevations at ∼8 months may predict ongoing white matter and brain volume loss over >5 years of follow-up. If confirmed, these findings suggest that blood biomarker levels at late time points could be used to identify TBI survivors who are at high risk of progressive neurological damage.
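
A minimal sketch of the brain-age component described above, using stand-in features and a generic regressor; the study's actual model, referenced to 652 healthy controls, is more sophisticated:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Minimal brain-age sketch: train an age regressor on healthy-control
# imaging features, then compute the brain-predicted age difference
# (PAD = predicted - chronological) for patients. Features and ages
# here are synthetic stand-ins for T1-derived measures.
rng = np.random.default_rng(0)
X_controls = rng.normal(size=(652, 50))         # stand-in T1 features
age_controls = rng.uniform(20, 80, size=652)    # stand-in ages

model = GradientBoostingRegressor().fit(X_controls, age_controls)

X_patients = rng.normal(size=(38, 50))
age_patients = rng.uniform(20, 80, size=38)
pad = model.predict(X_patients) - age_patients  # positive PAD = "older" brain
print(pad.mean())
```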


Subject(s)
Brain Injuries, Traumatic , Brain Injuries , White Matter , Biomarkers , Brain Injuries/complications , Brain Injuries, Traumatic/complications , Brain Injuries, Traumatic/diagnostic imaging , Diffusion Tensor Imaging/methods , Disease Progression , Glial Fibrillary Acidic Protein/metabolism , Humans
5.
Hum Brain Mapp ; 41(15): 4406-4418, 2020 10 15.
Article in English | MEDLINE | ID: mdl-32643852

ABSTRACT

Multiple biomarkers can capture different facets of Alzheimer's disease. However, statistical models of biomarkers to predict outcomes in Alzheimer's rarely model nonlinear interactions between these measures. Here, we used Gaussian Processes to address this, modelling nonlinear interactions to predict progression from mild cognitive impairment (MCI) to Alzheimer's over 3 years, using Alzheimer's Disease Neuroimaging Initiative (ADNI) data. Measures included: demographics, APOE4 genotype, CSF (amyloid-β42, total tau, phosphorylated tau), [18F]florbetapir, hippocampal volume and brain-age. We examined: (a) the independent value of each biomarker; and (b) whether modelling nonlinear interactions between biomarkers improved predictions. Each measure added complementary information when predicting conversion to Alzheimer's. A linear model classifying stable from progressive MCI explained over half the variance (R2 = 0.51, p < .001); the strongest independently contributing biomarker was hippocampal volume (R2 = 0.13). When comparing sensitivity of different models to progressive MCI (independent biomarker models, additive models, nonlinear interaction models), we observed a significant improvement (p < .001) for various two-way interaction models. The best performing model included an interaction between amyloid-β PET and P-tau, while accounting for hippocampal volume (sensitivity = 0.77, AUC = 0.826). Closely related biomarkers contributed uniquely to predicting conversion to Alzheimer's. Nonlinear biomarker interactions were also implicated, and results showed that although for some patients adding additional biomarkers may add little value (i.e., when hippocampal volume is high), for others (i.e., with low hippocampal volume) further invasive and expensive examination may be warranted. Our framework enables visualisation of these interactions, in individual patient biomarker 'space', providing information for personalised or stratified healthcare or clinical trial design.
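
A hedged sketch of the modelling idea, using scikit-learn's Gaussian process classifier on synthetic stand-ins for the biomarker triplet discussed above; the paper's kernels and model comparison are more elaborate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# A GP classifier whose anisotropic RBF kernel over the biomarkers can
# capture nonlinear two-way interactions (here, stand-ins for
# amyloid PET, p-tau and hippocampal volume). Data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # [amyloid_pet, p_tau, hippo_vol]
y = (X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 200)) > 0

gp = GaussianProcessClassifier(kernel=RBF(length_scale=[1.0, 1.0, 1.0]))
gp.fit(X, y)
# Per-subject conversion probabilities let interactions be visualised
# in individual biomarker "space", as the paper describes.
print(gp.predict_proba(X[:5])[:, 1])
```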


Subject(s)
Alzheimer Disease/diagnosis , Cognitive Dysfunction/diagnosis , Disease Progression , Models, Theoretical , Aged , Aged, 80 and over , Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Alzheimer Disease/pathology , Biomarkers , Cognitive Dysfunction/genetics , Cognitive Dysfunction/metabolism , Cognitive Dysfunction/pathology , Female , Follow-Up Studies , Humans , Magnetic Resonance Imaging , Male , Positron-Emission Tomography , Sensitivity and Specificity
6.
Radiol Med ; 125(1): 48-56, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31522345

ABSTRACT

PURPOSE: Development of a fully automatic algorithm for the localization and identification of vertebral bodies in computed tomography (CT). MATERIALS AND METHODS: The algorithm was developed using a retrospectively collected, real-world dataset of 232 thoraco-abdominopelvic CT scans. In order to achieve an accurate solution, a two-stage automated method was developed: decision forests for a rough prediction of vertebral body positions, and morphological image processing techniques to refine the detection by locating the position of the spinal canal. RESULTS: The mean distance error between the predicted vertebral centroid positions and the ground truth was 13.7 mm. The identification rate was 79.6% on the thoracic region and 74.8% on the lumbar segment. CONCLUSION: The algorithm provides a new method to detect and identify vertebral bodies from arbitrary field-of-view body CT scans.
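
The two-stage idea can be sketched as follows; the threshold, features and structuring element are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestRegressor

# Schematic two-stage pipeline from the abstract: (1) a decision forest
# regresses a rough vertebral centroid from image features; (2) simple
# morphology locates the spinal canal to pull the estimate onto the
# midline.
def rough_centroid(features, forest: RandomForestRegressor):
    return forest.predict(features)              # stage 1: (y, x) estimates

def refine_with_canal(ct_slice_hu, rough_yx):
    # stage 2: the canal is a low-intensity region enclosed by bone;
    # a crude below-bone mask plus a morphological opening approximates it
    soft = ct_slice_hu < 100                     # below typical bone HU
    canal_mask = ndimage.binary_opening(soft, structure=np.ones((5, 5)))
    labels, n = ndimage.label(canal_mask)
    if n == 0:
        return np.asarray(rough_yx)
    centers = np.array(
        ndimage.center_of_mass(canal_mask, labels, range(1, n + 1)))
    dists = np.linalg.norm(centers - rough_yx, axis=1)
    return centers[np.argmin(dists)]             # snap to nearest component
```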


Subject(s)
Algorithms , Decision Trees , Machine Learning , Multidetector Computed Tomography/methods , Spine/diagnostic imaging , Adult , Aged , Aged, 80 and over , Anatomic Landmarks/diagnostic imaging , Datasets as Topic , Humans , Middle Aged , Retrospective Studies , Young Adult
7.
J Cardiovasc Magn Reson ; 21(1): 18, 2019 03 14.
Article in English | MEDLINE | ID: mdl-30866968

ABSTRACT

BACKGROUND: The trend towards large-scale studies including population imaging poses new challenges in terms of quality control (QC). This is a particular issue when automatic processing tools such as image segmentation methods are employed to derive quantitative measures or biomarkers for further analyses. Manual inspection and visual QC of each segmentation result is not feasible at large scale. However, it is important to be able to automatically detect when a segmentation method fails, in order to avoid inclusion of wrong measurements into subsequent analyses which could otherwise lead to incorrect conclusions. METHODS: To overcome this challenge, we explore an approach for predicting segmentation quality based on Reverse Classification Accuracy, which enables us to discriminate between successful and failed segmentations on a per-case basis. We validate this approach on a new, large-scale manually annotated set of 4,800 cardiovascular magnetic resonance (CMR) scans. We then apply our method to a large cohort of 7,250 CMR scans on which we have performed manual QC. RESULTS: We report results for predicting segmentation quality metrics, including the Dice Similarity Coefficient (DSC) and surface-distance measures. As initial validation, we present data for 400 scans demonstrating 99% accuracy for classifying low- and high-quality segmentations using the predicted DSC scores. As further validation, we show high correlation between real and predicted scores and 95% classification accuracy on 4,800 scans for which manual segmentations were available. We mimic real-world application of the method on 7,250 CMR scans, where we show good agreement between predicted quality metrics and manual visual QC scores. CONCLUSIONS: We show that Reverse Classification Accuracy has the potential for accurate and fully automatic segmentation QC on a per-case basis in the context of large-scale population imaging, as in the UK Biobank Imaging Study.
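
A compact sketch of the Reverse Classification Accuracy idea, with the single-image training routine left abstract:

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2 * inter / denom if denom else 1.0

def rca_predict_dsc(test_image, test_seg, reference_set, train_reverse):
    """Sketch of Reverse Classification Accuracy: train a 'reverse'
    segmenter on the single test case, using its (unverified) predicted
    segmentation as pseudo-ground-truth, then evaluate that reverse
    model on a reference database with known ground truth. The best
    achievable DSC on the references serves as a proxy for the quality
    of the test segmentation. `train_reverse` is any single-image
    training routine (e.g. atlas registration or a small forest)."""
    reverse_model = train_reverse(test_image, test_seg)
    scores = [dice(reverse_model(img), gt) for img, gt in reference_set]
    return max(scores)
```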


Subject(s)
Heart/diagnostic imaging , Image Interpretation, Computer-Assisted/standards , Magnetic Resonance Imaging/standards , Automation , Humans , Predictive Value of Tests , Quality Control , Reproducibility of Results , United Kingdom
8.
Neuroimage ; 181: 521-538, 2018 11 01.
Article in English | MEDLINE | ID: mdl-30048747

ABSTRACT

Predictive models allow subject-specific inference when analyzing disease-related alterations in neuroimaging data. Given a subject's data, inference can be made at two levels: global, i.e. identifying condition presence for the subject, and local, i.e. detecting the condition's effect on each individual measurement extracted from the subject's data. While global inference is widely used, local inference, which can be used to form subject-specific effect maps, is rarely used because existing models often yield noisy detections composed of dispersed isolated islands. In this article, we propose a reconstruction method, named RSM, to improve subject-specific detections of predictive modeling approaches, in particular binary classifiers. RSM specifically aims to reduce noise due to sampling error associated with using a finite sample of examples to train classifiers. The proposed method is a wrapper-type algorithm that can be used with different binary classifiers in a diagnostic manner, i.e. without information on condition presence. Reconstruction is posed as a Maximum-A-Posteriori problem with a prior model whose parameters are estimated from training data in a classifier-specific fashion. Experimental evaluation is performed on synthetically generated data and data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Results on synthetic data demonstrate that using RSM yields higher detection accuracy compared to using models directly or with bootstrap averaging. Analyses on the ADNI dataset show that RSM can also improve correlation between subject-specific detections in cortical thickness data and non-imaging markers of Alzheimer's disease (AD), such as the Mini-Mental State Examination score and cerebrospinal fluid amyloid-β levels. Further reliability studies on the longitudinal ADNI dataset show improved detection reliability when RSM is used.


Subject(s)
Alzheimer Disease , Amyloid beta-Peptides/cerebrospinal fluid , Image Processing, Computer-Assisted/methods , Mental Status and Dementia Tests , Models, Theoretical , Neuroimaging/methods , Alzheimer Disease/cerebrospinal fluid , Alzheimer Disease/diagnostic imaging , Alzheimer Disease/physiopathology , Computer Simulation , Datasets as Topic , Humans
9.
Neuroimage ; 169: 431-442, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29278772

ABSTRACT

Graph representations are often used to model structured data at an individual or population level and have numerous applications in pattern recognition problems. In the field of neuroscience, where such representations are commonly used to model structural or functional connectivity between a set of brain regions, graphs have proven to be of great importance, mainly due to their capability of revealing previously unknown patterns related to brain development and disease. Evaluating similarity between these brain connectivity networks in a manner that accounts for the graph structure and is tailored for a particular application is, however, non-trivial. Most existing methods fail to accommodate the graph structure, discarding information that could be beneficial for further classification or regression analyses based on these similarities. We propose to learn a graph similarity metric using a siamese graph convolutional neural network (s-GCN) in a supervised setting. The proposed framework takes the graph structure into consideration for the evaluation of similarity between a pair of graphs, by employing spectral graph convolutions that allow the generalisation of traditional convolutions to irregular graphs and operate in the graph spectral domain. We apply the proposed model on two datasets: the challenging ABIDE database, which comprises functional MRI data of 403 patients with autism spectrum disorder (ASD) and 468 healthy controls aggregated from multiple acquisition sites, and a set of 2,500 subjects from UK Biobank. We demonstrate the performance of the method for the tasks of classification between matching and non-matching graphs, as well as individual subject classification and manifold learning, showing that it leads to significantly improved results compared to traditional methods.
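
A minimal sketch of the siamese setup, using a first-order graph convolution as a simplification of the spectral (Chebyshev) filters used in the paper:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """First-order approximation of a spectral graph convolution
    (a simplification of the Chebyshev filters in the paper)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))  # mean-aggregate, transform

class SiameseGCN(nn.Module):
    """Two graphs pass through the *same* GCN (shared weights); their
    pooled embeddings are compared with cosine similarity, trained so
    matching pairs score high and non-matching pairs low."""
    def __init__(self, in_dim, hid=32):
        super().__init__()
        self.gcn = GCNLayer(in_dim, hid)

    def forward(self, x1, adj1, x2, adj2):
        e1 = self.gcn(x1, adj1).mean(0)   # pool node features to a vector
        e2 = self.gcn(x2, adj2).mean(0)
        return torch.cosine_similarity(e1, e2, dim=0)

def pair_loss(sim, y, margin=0.5):
    """Hinge loss on labelled pairs (y = +1 matching, -1 non-matching)."""
    return torch.clamp(margin - y * sim, min=0)
```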


Subject(s)
Autism Spectrum Disorder/physiopathology , Connectome/methods , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Models, Theoretical , Nerve Net/physiology , Neural Networks, Computer , Autism Spectrum Disorder/diagnostic imaging , Databases, Factual , Datasets as Topic , Humans , Nerve Net/diagnostic imaging , Nerve Net/physiopathology
10.
Neuroimage ; 167: 453-465, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29100940

ABSTRACT

In brain imaging, accurate alignment of cortical surfaces is fundamental to the statistical sensitivity and spatial localisation of group studies, and cortical surface-based alignment has generally been accepted to be superior to volume-based approaches at aligning cortical areas. However, human subjects have considerable variation in cortical folding, and in the location of functional areas relative to these folds. This makes alignment of cortical areas a challenging problem. The Multimodal Surface Matching (MSM) tool is a flexible, spherical registration approach that enables accurate registration of surfaces based on a variety of different features. Using MSM, we have previously shown that driving cross-subject surface alignment, using areal features, such as resting state-networks and myelin maps, improves group task fMRI statistics and map sharpness. However, the initial implementation of MSM's regularisation function did not penalize all forms of surface distortion evenly. In some cases, this allowed peak distortions to exceed neurobiologically plausible limits, unless regularisation strength was increased to a level which prevented the algorithm from fully maximizing surface alignment. Here we propose and implement a new regularisation penalty, derived from physically relevant equations of strain (deformation) energy, and demonstrate that its use leads to improved and more robust alignment of multimodal imaging data. In addition, since spherical warps incorporate projection distortions that are unavoidable when mapping from a convoluted cortical surface to the sphere, we also propose constraints that enforce smooth deformation of cortical anatomies. We test the impact of this approach for longitudinal modelling of cortical development for neonates (born between 31 and 43 weeks of post-menstrual age) and demonstrate that the proposed method increases the biological interpretability of the distortion fields and improves the statistical significance of population-based analysis relative to other spherical methods.
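
For orientation, the standard small-strain energy density from linear elasticity, the family of expressions such penalties are derived from, is sketched below; MSM's exact published form should be taken from the paper:

```latex
% Small-strain energy density from linear elasticity. u is the
% deformation field; \lambda and \mu are the Lame constants.
\varepsilon = \tfrac{1}{2}\left(\nabla u + \nabla u^{\top}\right), \qquad
W(\varepsilon) = \frac{\lambda}{2}\,\bigl(\operatorname{tr}\varepsilon\bigr)^{2}
  + \mu\,\operatorname{tr}\bigl(\varepsilon^{2}\bigr)
```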


Subject(s)
Cerebral Cortex/anatomy & histology , Cerebral Cortex/diagnostic imaging , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Neuroimaging/methods , Cerebral Cortex/growth & development , Humans , Infant, Newborn , Longitudinal Studies , Models, Theoretical
11.
J Cardiovasc Magn Reson ; 20(1): 65, 2018 09 14.
Article in English | MEDLINE | ID: mdl-30217194

ABSTRACT

BACKGROUND: Cardiovascular magnetic resonance (CMR) imaging is a standard imaging modality for assessing cardiovascular diseases (CVDs), the leading cause of death globally. CMR enables accurate quantification of the cardiac chamber volume, ejection fraction and myocardial mass, providing information for diagnosis and monitoring of CVDs. However, for years, clinicians have been relying on manual approaches for CMR image analysis, which is time-consuming and prone to subjective errors. It is a major clinical challenge to automatically derive quantitative and clinically relevant information from CMR images. METHODS: Deep neural networks have shown great potential in image pattern recognition and segmentation for a variety of tasks. Here we demonstrate an automated analysis method for CMR images, which is based on a fully convolutional network (FCN). The network is trained and evaluated on a large-scale dataset from the UK Biobank, consisting of 4,875 subjects with 93,500 pixelwise annotated images. The performance of the method has been evaluated using a number of technical metrics, including the Dice metric, mean contour distance and Hausdorff distance, as well as clinically relevant measures, including left ventricle (LV) end-diastolic volume (LVEDV), end-systolic volume (LVESV) and LV mass (LVM); and right ventricle (RV) end-diastolic volume (RVEDV) and end-systolic volume (RVESV). RESULTS: By combining FCN with a large-scale annotated dataset, the proposed automated method achieves high performance in segmenting the LV and RV on short-axis CMR images and the left atrium (LA) and right atrium (RA) on long-axis CMR images. On a short-axis image test set of 600 subjects, it achieves an average Dice metric of 0.94 for the LV cavity, 0.88 for the LV myocardium and 0.90 for the RV cavity. The mean absolute difference between automated and manual measurements is 6.1 mL for LVEDV, 5.3 mL for LVESV, 6.9 g for LVM, 8.5 mL for RVEDV and 7.2 mL for RVESV. On long-axis image test sets, the average Dice metric is 0.93 for the LA cavity (2-chamber view), 0.95 for the LA cavity (4-chamber view) and 0.96 for the RA cavity (4-chamber view). The performance is comparable to human inter-observer variability. CONCLUSIONS: We show that an automated method achieves a performance on par with human experts in analysing CMR images and deriving clinically relevant measures.
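
For reference, the two kinds of measures evaluated above can be sketched in a few lines (Dice overlap and cavity volume in mL):

```python
import numpy as np

def dice_metric(pred, gt):
    """Dice similarity coefficient, the headline technical metric above."""
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())

def cavity_volume_ml(seg, voxel_dims_mm):
    """Cavity volume from a binary segmentation: voxel count times voxel
    volume, converted from mm^3 to mL. Evaluating this at the
    end-diastolic and end-systolic frames yields LVEDV and LVESV, from
    which ejection fraction follows as (LVEDV - LVESV) / LVEDV."""
    voxel_mm3 = float(np.prod(voxel_dims_mm))
    return seg.sum() * voxel_mm3 / 1000.0   # 1 mL = 1000 mm^3
```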


Subject(s)
Heart Diseases/diagnostic imaging , Image Interpretation, Computer-Assisted/methods , Magnetic Resonance Imaging, Cine/methods , Myocardial Contraction , Neural Networks, Computer , Stroke Volume , Ventricular Function, Left , Ventricular Function, Right , Aged , Automation , Databases, Factual , Deep Learning , Female , Heart Diseases/physiopathology , Humans , Male , Middle Aged , Observer Variation , Predictive Value of Tests , Reproducibility of Results
12.
Neuroimage ; 162: 226-248, 2017 11 15.
Article in English | MEDLINE | ID: mdl-28889005

ABSTRACT

Advances in neuroimaging have provided a tremendous amount of in-vivo information on the brain's organisation. Its anatomy and cortical organisation can be investigated from the point of view of several imaging modalities, many of which have been studied for mapping functionally specialised cortical areas. There is strong evidence that a single modality is not sufficient to fully identify the brain's cortical organisation. Combining multiple modalities in the same parcellation task has the potential to provide more accurate and robust subdivisions of the cortex. Nonetheless, existing brain parcellation methods are typically developed and tested on single modalities using a specific type of information. In this paper, we propose Graph-based Multi-modal Parcellation (GraMPa), an iterative framework designed to handle the large variety of available input modalities to tackle the multi-modal parcellation task. At each iteration, we compute a set of parcellations from different modalities and fuse them based on their local reliabilities. The fused parcellation is used to initialise the next iteration, forcing the parcellations to converge towards a set of mutually informed modality specific parcellations, where correspondences are established. We explore two different multi-modal configurations for group-wise parcellation using resting-state fMRI, diffusion MRI tractography, myelin maps and task fMRI. Quantitative and qualitative results on the Human Connectome Project database show that integrating multi-modal information yields a stronger agreement with well established atlases and more robust connectivity networks that provide a better representation of the population.
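
One fusion step can be sketched as a locally reliability-weighted vote; the reliability estimation and iteration scheme of the actual method are described in the paper:

```python
import numpy as np

def fuse_parcellations(parcels, reliabilities):
    """Sketch of a GraMPa-style fusion step: at each vertex, take a vote
    over the modality-specific parcellations, weighted by each
    modality's local reliability. `parcels` is (M, V) integer labels
    for M modalities over V vertices; `reliabilities` is (M, V)
    non-negative weights."""
    n_labels = int(parcels.max()) + 1
    votes = np.zeros((n_labels, parcels.shape[1]))
    for labels, w in zip(parcels, reliabilities):
        votes[labels, np.arange(parcels.shape[1])] += w
    return votes.argmax(0)    # fused label per vertex
```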


Subject(s)
Brain Mapping/methods , Cerebral Cortex/anatomy & histology , Image Processing, Computer-Assisted/methods , Humans
13.
Med Image Anal ; 97: 103260, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38970862

ABSTRACT

Robustness of deep learning segmentation models is crucial for their safe incorporation into clinical practice. However, these models can falter when faced with distributional changes. This challenge is evident in magnetic resonance imaging (MRI) scans due to the diverse acquisition protocols across various domains, leading to differences in image characteristics such as textural appearances. We posit that the restricted anatomical differences between subjects could be harnessed to refine the latent space into a set of shape components. The learned set then aims to encompass the relevant anatomical shape variation found within the patient population. We explore this by utilising multiple MRI sequences to learn texture-invariant and shape-equivariant features, which are used to construct a shape dictionary using vector quantisation. We investigate shape equivariance to a number of different types of groups. We hypothesise and prove that the greater the group order, i.e., the denser the constraint, the better the model's robustness becomes. We achieve shape equivariance either with a contrastive approach or by imposing equivariant constraints on the convolutional kernels. The resulting shape-equivariant dictionary is then sampled to compose the segmentation output. Our method achieves state-of-the-art performance for the task of single-domain generalisation for prostate and cardiac MRI segmentation. Code is available at https://github.com/AinkaranSanthi/A_Geometric_Perspective_For_Robust_Segmentation.
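
The dictionary-building step rests on vector quantisation, which can be sketched as follows; the equivariance constraints on the encoder are a separate ingredient described in the paper:

```python
import torch

def vector_quantise(z, codebook):
    """Nearest-neighbour vector quantisation as used to build a shape
    dictionary: each feature vector in z (N, D) is replaced by its
    closest entry of the codebook (K, D). The straight-through trick
    keeps the operation differentiable for end-to-end training."""
    d = torch.cdist(z, codebook)               # (N, K) pairwise distances
    idx = d.argmin(dim=1)                      # nearest code per vector
    z_q = codebook[idx]
    return z + (z_q - z).detach(), idx         # straight-through estimator
```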


Subject(s)
Deep Learning , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Image Processing, Computer-Assisted/methods , Male , Algorithms
14.
Article in English | MEDLINE | ID: mdl-38740720

ABSTRACT

PURPOSE: Automated prostate disease classification on multi-parametric MRI has recently shown promising results with the use of convolutional neural networks (CNNs). The vision transformer (ViT) is a convolution-free architecture that exploits only the self-attention mechanism and has surpassed CNNs in some natural imaging classification tasks. However, these models are not very robust to textural shifts in the input space. In MRI, we often have to deal with textural shift arising from varying acquisition protocols. Here, we focus on the ability of models to generalise well to new magnet strengths for MRI. METHOD: We propose a new framework to improve the robustness of vision transformer-based models for disease classification by constructing discrete representations of the data using vector quantisation. We sample a subset of the discrete representations to form the input into a transformer-based model. We use cross-attention in our transformer model to combine the discrete representations of T2-weighted and apparent diffusion coefficient (ADC) images. RESULTS: We analyse the robustness of our model by training on a 1.5 T scanner and testing on a 3 T scanner, and vice versa. Our approach achieves state-of-the-art performance for classification of lesions on prostate MRI and outperforms various other CNN and transformer-based models in terms of robustness to domain shift and perturbations in the input space. CONCLUSION: We develop a method to improve the robustness of transformer-based disease classification of prostate lesions on MRI using discrete representations of the T2-weighted and ADC images.
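
The fusion step can be sketched with a standard cross-attention module; dimensions and token counts are illustrative, and the paper's tokens are sampled from discrete, vector-quantised representations:

```python
import torch
import torch.nn as nn

# Sketch of the cross-attention fusion described above: tokens from the
# T2-weighted image attend to tokens from the ADC map.
dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

t2_tokens = torch.randn(1, 64, dim)    # queries: T2-weighted tokens
adc_tokens = torch.randn(1, 64, dim)   # keys/values: ADC tokens

fused, _ = cross_attn(query=t2_tokens, key=adc_tokens, value=adc_tokens)
print(fused.shape)                     # (1, 64, 256), fed to the classifier
```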

15.
Commun Med (Lond) ; 4(1): 21, 2024 Feb 19.
Article in English | MEDLINE | ID: mdl-38374436

ABSTRACT

BACKGROUND: Breast density is an important risk factor for breast cancer, compounded by a higher risk of cancers being missed during screening of dense breasts due to reduced sensitivity of mammography. Automated, deep learning-based prediction of breast density could provide subject-specific risk assessment and flag difficult cases during screening. However, there is a lack of evidence for generalisability across imaging techniques and, importantly, across race. METHODS: This study used a large, racially diverse dataset with 69,697 mammographic studies comprising 451,642 individual images from 23,057 female participants. A deep learning model was developed for four-class BI-RADS density prediction. A comprehensive performance evaluation assessed the generalisability across two imaging techniques, full-field digital mammography (FFDM) and two-dimensional synthetic (2DS) mammography. A detailed subgroup performance and bias analysis assessed the generalisability across participants' race. RESULTS: Here we show that a model trained on FFDM only achieves a four-class BI-RADS classification accuracy of 80.5% (79.7-81.4) on FFDM and 79.4% (78.5-80.2) on unseen 2DS data. When trained on both FFDM and 2DS images, the performance increases to 82.3% (81.4-83.0) and 82.3% (81.3-83.1), respectively. Racial subgroup analysis shows unbiased performance across Black, White, and Asian participants, despite a separate analysis confirming that race can be predicted from the images with a high accuracy of 86.7% (86.0-87.4). CONCLUSIONS: Deep learning-based breast density prediction generalises across imaging techniques and race. No substantial disparities are found for any subgroup, including races that were never seen during model development, suggesting that density predictions are unbiased.
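
The subgroup analysis pattern can be sketched as follows; the study's exact confidence-interval method is not restated here, so a percentile bootstrap is assumed for illustration:

```python
import numpy as np

def subgroup_accuracy(y_true, y_pred, groups, n_boot=1000, seed=0):
    """Per-subgroup accuracy with bootstrap confidence intervals: the
    kind of analysis behind the per-race figures quoted above."""
    rng = np.random.default_rng(seed)
    results = {}
    for g in np.unique(groups):
        t, p = y_true[groups == g], y_pred[groups == g]
        acc = float((t == p).mean())
        boots = []
        for _ in range(n_boot):
            i = rng.integers(0, len(t), len(t))  # resample with replacement
            boots.append((t[i] == p[i]).mean())
        lo, hi = np.percentile(boots, [2.5, 97.5])
        results[g] = (acc, lo, hi)
    return results
```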


Women with dense breasts have a higher risk of breast cancer. For dense breasts, it is also more difficult to spot cancer in mammograms, which are the X-ray images commonly used for breast cancer screening. Thus, knowing about an individual's breast density provides important information to doctors and screening participants. This study investigated whether an artificial intelligence (AI) algorithm can be used to accurately determine breast density by analysing mammograms. The study tested whether such an algorithm performs equally well across different imaging devices, and importantly, across individuals from different self-reported race groups. A large, racially diverse dataset was used to evaluate the algorithm's performance. The results show that there were no substantial differences in the accuracy for any of the groups, providing important assurances that AI can be used safely and ethically for automated prediction of breast density.

16.
JMIR Res Protoc ; 13: e51614, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38941147

ABSTRACT

BACKGROUND: Artificial intelligence (AI) medical devices have the potential to transform existing clinical workflows and ultimately improve patient outcomes. AI medical devices have shown potential for a range of clinical tasks such as diagnostics, prognostics, and therapeutic decision-making such as drug dosing. There is, however, an urgent need to ensure that these technologies remain safe for all populations. Recent literature demonstrates the need for rigorous performance error analysis to identify issues such as algorithmic encoding of spurious correlations (eg, protected characteristics) or specific failure modes that may lead to patient harm. Guidelines for reporting on studies that evaluate AI medical devices require the mention of performance error analysis; however, there is still a lack of understanding around how performance errors should be analyzed in clinical studies, and what harms authors should aim to detect and report. OBJECTIVE: This systematic review will assess the frequency and severity of AI errors and adverse events (AEs) in randomized controlled trials (RCTs) investigating AI medical devices as interventions in clinical settings. The review will also explore how performance errors are analyzed, including whether the analysis includes the investigation of subgroup-level outcomes. METHODS: This systematic review will identify and select RCTs assessing AI medical devices. Search strategies will be deployed in MEDLINE (Ovid), Embase (Ovid), Cochrane CENTRAL, and clinical trial registries to identify relevant papers. RCTs identified in bibliographic databases will be cross-referenced with clinical trial registries. The primary outcomes of interest are the frequency and severity of AI errors, patient harms, and reported AEs. Quality assessment of RCTs will be based on version 2 of the Cochrane risk-of-bias tool (RoB2). Data analysis will include a comparison of error rates and patient harms between study arms, and a meta-analysis of the rates of patient harm in control versus intervention arms will be conducted if appropriate. RESULTS: The project was registered on PROSPERO in February 2023. Preliminary searches have been completed and the search strategy has been designed in consultation with an information specialist and methodologist. Title and abstract screening started in September 2023. Full-text screening is ongoing and data collection and analysis began in April 2024. CONCLUSIONS: Evaluations of AI medical devices have shown promising results; however, reporting of studies has been variable. Detection, analysis, and reporting of performance errors and patient harms is vital to robustly assess the safety of AI medical devices in RCTs. Scoping searches have illustrated that the reporting of harms is variable, often with no mention of AEs. The findings of this systematic review will identify the frequency and severity of AI performance errors and patient harms and generate insights into how errors should be analyzed to account for both overall and subgroup performance. TRIAL REGISTRATION: PROSPERO CRD42023387747; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=387747. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/51614.


Subject(s)
Algorithms , Artificial Intelligence , Randomized Controlled Trials as Topic , Humans , Randomized Controlled Trials as Topic/methods , Systematic Reviews as Topic , Patient Harm/prevention & control , Equipment and Supplies/adverse effects , Equipment and Supplies/standards , Research Design
17.
JMIR Res Protoc ; 13: e48156, 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-38990628

ABSTRACT

BACKGROUND: The reporting of adverse events (AEs) relating to medical devices is a long-standing area of concern, with suboptimal reporting due to a range of factors including a failure to recognize the association of AEs with medical devices, lack of knowledge of how to report AEs, and a general culture of nonreporting. The introduction of artificial intelligence as a medical device (AIaMD) requires a robust safety monitoring environment that recognizes both generic risks of a medical device and some of the increasingly recognized risks of AIaMD (such as algorithmic bias). There is an urgent need to understand the limitations of current AE reporting systems and explore potential mechanisms for how AEs could be detected, attributed, and reported with a view to improving the early detection of safety signals. OBJECTIVE: The systematic review outlined in this protocol aims to yield insights into the frequency and severity of AEs while characterizing the events using existing regulatory guidance. METHODS: Publicly accessible AE databases will be searched to identify AE reports for AIaMD. Scoping searches have identified 3 regulatory territories for which public access to AE reports is provided: the United States, the United Kingdom, and Australia. AEs will be included for analysis if an artificial intelligence (AI) medical device is involved. Software as a medical device without AI is not within the scope of this review. Data extraction will be conducted using a data extraction tool designed for this review and will be done independently by AUK and a second reviewer. Descriptive analysis will be conducted to identify the types of AEs being reported, and their frequency, for different types of AIaMD. AEs will be analyzed and characterized according to existing regulatory guidance. RESULTS: Scoping searches are being conducted with screening to begin in April 2024. Data extraction and synthesis will commence in May 2024, with planned completion by August 2024. The review will highlight the types of AEs being reported for different types of AI medical devices and where the gaps are. It is anticipated that there will be particularly low rates of reporting for indirect harms associated with AIaMD. CONCLUSIONS: To our knowledge, this will be the first systematic review of 3 different regulatory sources reporting AEs associated with AIaMD. The review will focus on real-world evidence, which brings certain limitations, compounded by the opacity of regulatory databases generally. The review will outline the characteristics and frequency of AEs reported for AIaMD and help regulators and policy makers to continue developing robust safety monitoring processes. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/48156.


Subject(s)
Artificial Intelligence , Systematic Reviews as Topic , Humans , Equipment and Supplies/adverse effects , Equipment and Supplies/standards , Databases, Factual , United States , United Kingdom , Australia
18.
Insights Imaging ; 15(1): 47, 2024 Feb 16.
Article in English | MEDLINE | ID: mdl-38361108

ABSTRACT

OBJECTIVES: MAchine Learning In MyelomA Response (MALIMAR) is an observational clinical study combining "real-world" and clinical trial data, both retrospective and prospective. Images were acquired on three MRI scanners over a 10-year window at two institutions, leading to a need for extensive curation. METHODS: Curation involved image aggregation, pseudonymisation, allocation between project phases, data cleaning, upload to an XNAT repository visible from multiple sites, annotation, incorporation of machine learning research outputs and quality assurance using programmatic methods. RESULTS: A total of 796 whole-body MR imaging sessions from 462 subjects were curated. A major change in scan protocol part way through the retrospective window meant that approximately 30% of available imaging sessions had properties that differed significantly from the remainder of the data. Issues were found with a vendor-supplied clinical algorithm for "composing" whole-body images from multiple imaging stations. Historic weaknesses in a digital video disk (DVD) research archive (already addressed by the mid-2010s) were highlighted by incomplete datasets, some of which could not be completely recovered. The final dataset contained 736 imaging sessions for 432 subjects. Software was written to clean and harmonise data. Implications for the subsequent machine learning activity are considered. CONCLUSIONS: MALIMAR exemplifies the vital role that curation plays in machine learning studies that use real-world data. A research repository such as XNAT facilitates day-to-day management, ensures robustness and consistency and enhances the value of the final dataset. The types of process described here will be vital for future large-scale multi-institutional and multi-national imaging projects. CRITICAL RELEVANCE STATEMENT: This article showcases innovative data curation methods using a state-of-the-art image repository platform; such tools will be vital for managing the large multi-institutional datasets required to train and validate generalisable ML algorithms and future foundation models in medical imaging. KEY POINTS: • Heterogeneous data in the MALIMAR study required the development of novel curation strategies. • Correction of multiple problems affecting the real-world data was successful, but implications for machine learning are still being evaluated. • Modern image repositories have rich application programming interfaces enabling data enrichment and programmatic QA, making them much more than simple "image marts".

19.
ArXiv ; 2024 Feb 23.
Article in English | MEDLINE | ID: mdl-36945687

ABSTRACT

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.

20.
Radiol Artif Intell ; 5(6): e230060, 2023 Nov.
Article in English | MEDLINE | ID: mdl-38074789

ABSTRACT

Purpose: To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biologic sex and race. Materials and Methods: This Health Insurance Portability and Accountability Act-compliant retrospective study used 127 118 chest radiographs from 42 884 patients (mean age, 63 years ± 17 [SD]; 23 623 male, 19 261 female) from the CheXpert dataset that were collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov-Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features to specific disparities in classification performance across patient subgroups. Results: Ten of 12 pairwise comparisons across biologic sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female (P < .001) and Asian and Black (P < .001) patients in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the "no finding" label decreased between 6.8% and 7.8% for female patients, and performance in detecting "pleural effusion" decreased between 10.7% and 11.6% for Black patients. Conclusion: The studied chest radiography foundation model demonstrated racial and sex-related bias, which led to disparate performance across patient subgroups; thus, this model may be unsafe for clinical applications. Keywords: Conventional Radiography, Computer Application-Detection/Diagnosis, Chest Radiography, Bias, Foundation Models. Supplemental material is available for this article. Published under a CC BY 4.0 license. See also commentary by Czum and Parr in this issue.
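
The core statistical check can be sketched in a few lines with SciPy:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_shift(projection, groups, a, b):
    """Two-sample Kolmogorov-Smirnov test on a 1-D feature projection
    (e.g. a component from the dimensionality reduction step),
    comparing two patient subgroups as in the bias analysis above."""
    res = ks_2samp(projection[groups == a], projection[groups == b])
    return res.statistic, res.pvalue

# Illustrative use on synthetic data: a shifted distribution for one
# subgroup yields a small p-value, flagging subgroup-encoded features.
rng = np.random.default_rng(0)
proj = np.concatenate([rng.normal(0, 1, 500), rng.normal(0.5, 1, 500)])
grp = np.array(["male"] * 500 + ["female"] * 500)
print(detect_feature_shift(proj, grp, "male", "female"))
```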
