Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study.

Koch, Lisa M; Baumgartner, Christian F; Berens, Philipp

Affiliation

Koch LM; Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany. lisa.koch@uni-tuebingen.de.
Baumgartner CF; Cluster of Excellence Machine Learning: New Perspectives for Science, University of Tübingen, Tübingen, Germany.
Berens P; Faculty of Health Sciences and Medicine, University of Lucerne, Lucerne, Switzerland.

NPJ Digit Med ; 7(1): 120, 2024 May 09.

Article in En | MEDLINE | ID: mdl-38724581

ABSTRACT

Distribution shifts remain a problem for the safe application of regulated medical AI systems, and may impact their real-world performance if undetected. Postmarket shifts can occur, for example, if algorithms developed on data from various acquisition settings and a heterogeneous population are predominantly applied in hospitals with lower-quality data acquisition or other centre-specific acquisition factors, or where some ethnicities are over-represented. Therefore, distribution shift detection could be important for monitoring AI-based medical products during postmarket surveillance. We implemented and evaluated three deep-learning-based shift detection techniques (classifier-based, deep kernel, and multiple univariate Kolmogorov-Smirnov tests) on simulated shifts in a dataset of 130'486 retinal images. We trained a deep learning classifier for diabetic retinopathy grading. We then simulated population shifts by changing the prevalence of patients' sex, ethnicity, and co-morbidities, and example acquisition shifts by changes in image quality. We observed disparities in classification performance between subgroups defined by image quality, patient sex, ethnicity, and co-morbidity presence. The sensitivity at detecting referable diabetic retinopathy ranged from 0.50 to 0.79 for different ethnicities. This motivates the need for detecting shifts after deployment. Classifier-based tests performed best overall, with perfect detection rates for quality and co-morbidity subgroup shifts at a sample size of 1000. They were the only method to detect shifts in patient sex, but required large sample sizes (>30'000). All methods identified easier-to-detect out-of-distribution shifts with small (≤300) sample sizes. We conclude that effective tools exist for detecting clinically relevant distribution shifts. In particular, classifier-based tests can be easily implemented components in the postmarket surveillance strategy of medical device manufacturers.
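
As an illustration of one of the three techniques named in the abstract, the following is a minimal sketch of a multiple univariate Kolmogorov-Smirnov shift test. It is not the authors' implementation. The function names (`ks_statistic`, `ks_pvalue`, `multi_ks_shift_test`), the Bonferroni correction, the asymptotic p-value approximation, and the synthetic Gaussian features are all assumptions made for this example; in practice the per-sample features would typically come from a trained network rather than a random number generator.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for v in xs + ys:
        gap = abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        d = max(d, gap)
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic (large-sample) p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The survival function is numerically 1 here, and the truncated
        # series below is not reliable for very small arguments.
        return 1.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
            for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def multi_ks_shift_test(source, target, alpha=0.05):
    """Run one KS test per feature; flag a shift if any p-value falls below
    the Bonferroni-corrected level alpha / n_features (an assumed correction)."""
    n_features = len(source[0])
    pvals = []
    for k in range(n_features):
        xs = [row[k] for row in source]
        ys = [row[k] for row in target]
        pvals.append(ks_pvalue(ks_statistic(xs, ys), len(xs), len(ys)))
    return min(pvals) < alpha / n_features, pvals


# Synthetic demonstration (hypothetical data, not from the study):
# 500 samples with 5 features each; the target shifts feature 0 by one
# standard deviation.
rng = random.Random(0)
source = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(500)]
target = [[rng.gauss(1.0 if k == 0 else 0.0, 1.0) for k in range(5)]
          for _ in range(500)]

shift_detected, p_values = multi_ks_shift_test(source, target)
print("shift detected:", shift_detected)
print("per-feature p-values:", p_values)
```

The per-feature tests are simple to run, but the Bonferroni correction makes the combined test conservative when many features are tested, which is one plausible reason a method of this kind may need larger samples to detect subtle shifts.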

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Language: En | Journal: NPJ Digit Med | Year: 2024 | Document type: Article | Affiliation country: Germany | Country of publication: United Kingdom