ABSTRACT
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
Subject(s)
Algorithms; Image Processing, Computer-Assisted; Machine Learning; Semantics
ABSTRACT
Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.
Subject(s)
Artificial Intelligence
ABSTRACT
We aimed to explore the underlying reasons that estimates of overdiagnosis vary across and within low-dose computed tomography (LDCT) lung cancer screening trials. We conducted a systematic review to identify estimates of overdiagnosis from randomised controlled trials of LDCT screening. We then analysed the association of Ps (the excess incidence of lung cancer as a proportion of screen-detected cases) with postscreening follow-up time using a linear random effects meta-regression model. Separately, we analysed annual Ps estimates from the US National Lung Screening Trial (NLST) and German Lung Cancer Screening Intervention Trial (LUSI) using exponential decay models with asymptotes. We conducted stratified analyses to investigate participant characteristics associated with Ps using the extended follow-up data from NLST. Among 12 overdiagnosis estimates from 8 trials, the postscreening follow-up ranged from 3.8 to 9.3 years, and Ps ranged from -27.0% (ITALUNG, 8.3 years follow-up) to 67.2% (DLCST, 5.0 years follow-up). Across trials, 39.1% of the variation in Ps was explained by postscreening follow-up time. The annual changes in Ps were -3.5% and -3.9% in the NLST and LUSI trials, respectively. Ps was predicted to plateau at 2.2% for NLST and 9.2% for LUSI with hypothetical infinite follow-up. In NLST, Ps increased with age from -14.9% (55-59 years) to 21.7% (70-74 years), and time trends in Ps varied by histological type. The findings suggest that differences in postscreening follow-up time partially explain variation in overdiagnosis estimates across lung cancer screening trials. Estimates of overdiagnosis should be interpreted in the context of postscreening follow-up and population characteristics.
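The "exponential decay model with asymptote" mentioned above can be illustrated with a minimal sketch. It assumes the common parameterization Ps(t) = asymptote + amplitude * exp(-rate * t); the abstract does not state the exact form used, and the function name, amplitude and rate below are hypothetical. Only the 2.2% plateau is taken from the abstract (NLST).

```python
import math


def ps_with_asymptote(t_years, asymptote, amplitude, rate):
    """Exponential decay towards an asymptote.

    Ps(t) = asymptote + amplitude * exp(-rate * t), with Ps in percent.
    As t grows, the exponential term vanishes and Ps approaches `asymptote`.
    """
    return asymptote + amplitude * math.exp(-rate * t_years)


# Hypothetical parameters for illustration only (not trial estimates);
# the 2.2% plateau is the value the abstract reports for NLST.
params = dict(asymptote=2.2, amplitude=30.0, rate=0.3)

for t in (0, 5, 10, 20, 50):
    print(t, round(ps_with_asymptote(t, **params), 2))
```

Under this form, the predicted value at "hypothetical infinite follow-up" is simply the asymptote parameter, which is why the estimate keeps falling as follow-up lengthens and then levels off.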
Subject(s)
Early Detection of Cancer; Lung Neoplasms; Early Detection of Cancer/methods; Follow-Up Studies; Humans; Lung Neoplasms/diagnostic imaging; Lung Neoplasms/epidemiology; Mass Screening/methods; Middle Aged; Overdiagnosis
ABSTRACT
INTRODUCTION: There is no global agreement on how best to determine the viability and location of a pregnancy of unknown location using biomarkers. Measurements of progesterone and β human chorionic gonadotropin (βhCG) are still used in clinical practice to exclude the possibility of a viable intrauterine pregnancy (VIUP). We evaluate the predictive value of progesterone, βhCG, and βhCG ratio cut-off levels to exclude a VIUP in women with a pregnancy of unknown location. MATERIAL AND METHODS: This was a secondary analysis of prospective multicenter study data of consecutive women with a pregnancy of unknown location between January 2015 and 2017 collected from dedicated early pregnancy assessment units of eight hospitals. Single progesterone and serial βhCG measurements were taken. Women were followed up until the final pregnancy outcome was confirmed by transvaginal ultrasonography between 11 and 14 weeks of gestation: (1) VIUP, (2) non-viable intrauterine pregnancy or failed pregnancy of unknown location, and (3) ectopic pregnancy or persisting pregnancy of unknown location. The predictive value of cut-off levels for ruling out VIUP was evaluated across a range of values likely to be encountered clinically for progesterone, βhCG, and βhCG ratio. RESULTS: Data from 2507 of 3272 (76.6%) women were suitable for analysis. All had data for βhCG levels, 2248 (89.7%) had progesterone levels, and 1809 (72.2%) had βhCG ratio. The likelihood of viability falls with the progesterone level. Although the median progesterone level associated with viability was 59 nmol/L, VIUP were identified with levels as low as 5 nmol/L. No single βhCG cut-off reliably ruled out viability: even when the level was more than 3000 IU/L, 39/358 (11%) women had a VIUP. The probability of viability decreases with the βhCG ratio. Although the median βhCG ratio associated with viability was 2.26, VIUP were identified with ratios as low as 1.02.
A progesterone level below 2 nmol/L and a βhCG ratio below 0.87 were unlikely to be associated with viability but were not definitive when considering multiple imputation. CONCLUSIONS: Cut-off levels for βhCG, βhCG ratio, and progesterone are not safe to use clinically to exclude viability in early pregnancy. Although βhCG ratio and progesterone perform slightly better in comparison, a single βhCG measurement used in this manner is highly unreliable.
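As a small worked example of why a cut-off fails as a rule-out test, the sketch below recomputes the one proportion the abstract reports in full (39 of 358 women with βhCG above 3000 IU/L had a VIUP). Reading the complement as the predictive value of a "no VIUP" call is an interpretation made here for illustration, and the helper name is hypothetical.

```python
def proportion(events, total):
    """Return events / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * events / total


# Counts reported in the abstract: 39 of 358 women above the cut-off had a VIUP.
viable_above_cutoff = proportion(39, 358)

# Share of women above the cut-off without a VIUP, i.e. how often a
# "no VIUP" call at this cut-off would have been correct (interpretation).
rule_out_predictive_value = 100.0 - viable_above_cutoff

print(f"{viable_above_cutoff:.1f}% viable, {rule_out_predictive_value:.1f}% not viable")
```

Roughly one in nine women above the cut-off still had a viable pregnancy, which is the basis for the abstract's conclusion that the cut-off is unsafe for excluding viability.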
Subject(s)
Pregnancy, Ectopic/diagnosis; Prenatal Diagnosis; Adult; Chorionic Gonadotropin/metabolism; Chorionic Gonadotropin, beta Subunit, Human/metabolism; Cohort Studies; Female; Humans; London; Predictive Value of Tests; Pregnancy; Pregnancy, Ectopic/blood; Progesterone/metabolism; Prospective Studies; State Medicine
ABSTRACT
BACKGROUND: In estrogen receptor-positive (ER+), human epidermal growth factor receptor 2 (HER-2) negative breast cancers, the progesterone receptor (PR) is an independent prognostic marker. Little is known about the prognostic value of PR by tumor grade. We assessed this in two independent datasets. PATIENTS AND METHODS: Women with primary operable, invasive ER+ HER-2 negative breast cancer diagnosed between 2000 and 2012, treated at University Hospitals Leuven, were included. We assessed the association of PR status and subtype (grade 1-2 vs. grade 3) with distant recurrence-free interval (DRFI) and breast cancer-specific survival. The interaction between PR status and subtype was investigated, and associations of PR status by subtype were calculated. The BIG 1-98 data set was used for validation. RESULTS: In total, 4,228 patients from Leuven and 5,419 from BIG 1-98 were analyzed. In the Leuven cohort, the adjusted hazard ratio (HR) of PR-positive versus PR-negative tumors for DRFI was 0.66 (95% confidence interval [CI], 0.50-0.89). For the interaction with subtype (p = .34), the HR of PR status was 0.79 (95% CI, 0.61-1.01) in luminal A-like and 0.59 (95% CI, 0.46-0.76) in luminal B-like tumors. In luminal A-like tumors, observed 5-year cumulative incidences of distant recurrence were 4.1% for PR-negative and 2.8% for PR-positive tumors, and in luminal B-like 18.7% and 9.2%, respectively. In the BIG 1-98 cohort, similar results were observed; for the interaction with subtype (p = .12), the adjusted HR of PR status for DRFI was 0.88 (95% CI, 0.57-1.35) in luminal A-like and 0.58 (95% CI, 0.43-0.77) in luminal B-like tumors. Observed 5-year cumulative incidences were similar. CONCLUSION: PR positivity may be more protective against metastatic relapse in luminal B-like versus luminal A-like breast cancer, but no strong conclusions can be made. 
In terms of absolute risk, the results suggest that an absent PR is clinically more important in high-proliferative than in low-proliferative ER+ HER-2 negative tumors. IMPLICATIONS FOR PRACTICE: An absent progesterone receptor (PR) predicts a worse outcome in women treated for an estrogen receptor-positive, human epidermal growth factor receptor 2 negative breast cancer. As low proliferative tumors lacking PR are now also classified as high risk, the prognostic value of PR across risk groups was studied. Despite a negative test for interaction of the prognostic value of PR by tumor grade, the effect of an absent PR on breast cancer relapse is much larger in high than in low proliferative breast cancers.
Subject(s)
Breast Neoplasms/genetics; Receptors, Progesterone/metabolism; Breast Neoplasms/mortality; Breast Neoplasms/pathology; Female; Humans; Prognosis; Survival Analysis
ABSTRACT
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
ABSTRACT
Background: Lung cancer screening may provide a favorable opportunity for a spirometry examination, to diagnose participants with undiagnosed lung function impairments, or to improve targeting of computed tomography (CT) screening intensity in view of expected net benefit. Methods: Spirometry was performed in the CT screening arm (n=2,029) of the German Lung Cancer Screening Intervention Study (LUSI), a trial examining the effects of annual CT screening on lung cancer mortality in long-term smokers aged 50-69 years. Participants were classified as having chronic obstructive pulmonary disease (COPD) [forced expiratory volume in one second (FEV1)/forced vital capacity (FVC) <0.7], preserved ratio impaired spirometry (PRISm; FEV1/FVC ≥0.7 and FEV1% predicted <80%), or normal spirometry. Descriptive statistics were used to examine associations of COPD or PRISm with respiratory symptoms, and self-reported medical diagnoses of respiratory and other morbidities. Logistic regression and proportional hazards regression were used to examine associations of COPD and PRISm, as well as their self-reported medical diagnoses, with risks of lung cancer and all-cause mortality. Results: A total of 1,987 screening arm participants (98%) provided interpretable spirometry measurements; of these, 34.3% had spirometric patterns consistent with either COPD (18.6%) or PRISm (15.7%). Two thirds of participants with COPD or PRISm were asymptomatic, and only 23% reported a previous medical diagnosis concordant with COPD. Participants reporting a diagnosis tended to be more often current and heavier smokers, and more often had respiratory symptoms, cardiovascular comorbidities, or more severe lung function impairments. Independently of smoking history, moderate-to-severe (GOLD 2-4) COPD (OR =2.14; 95% CI: 1.54-2.98), and PRISm (OR =2.68; 95% CI: 1.61-4.40), were associated with increased lung cancer risk.
Lung cancer patients with PRISm less frequently had adenocarcinomas, and more often squamous cell or small cell tumors, compared to those with normal spirometry (n=45), and both PRISm and COPD were associated with more advanced lung cancer tumor stage for screen-detected cancers. PRISm and COPD, depending on GOLD stage, were also associated with about 2- to 4-fold increases in risk of overall mortality, 87% of which had causes other than lung cancer. Conclusions: About one third of smokers eligible for lung cancer screening in Germany have COPD or PRISm. As these conditions were associated with detection of lung cancer, spirometry may help identify populations who are at high risk of death from lung cancer or other causes, and who might particularly benefit from CT screening.
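The spirometric classification rule stated in the Methods can be written down directly. The following is a minimal sketch using the thresholds given in the abstract; the function and argument names are hypothetical, and it ignores quality grading of the measurements and GOLD staging.

```python
def classify_spirometry(fev1_litres, fvc_litres, fev1_percent_predicted):
    """Classify a spirometry result using the thresholds quoted in the abstract.

    COPD:   FEV1/FVC < 0.7
    PRISm:  FEV1/FVC >= 0.7 and FEV1 % predicted < 80
    normal: everything else
    """
    if fvc_litres <= 0:
        raise ValueError("FVC must be positive")
    ratio = fev1_litres / fvc_litres
    if ratio < 0.7:
        return "COPD"
    if fev1_percent_predicted < 80:
        return "PRISm"
    return "normal"


# Hypothetical measurements, for illustration only.
print(classify_spirometry(2.0, 4.0, 60))   # ratio 0.50
print(classify_spirometry(3.0, 4.0, 70))   # ratio 0.75, reduced FEV1 % predicted
print(classify_spirometry(3.0, 4.0, 95))   # ratio 0.75, preserved FEV1 % predicted
```

Note that the ratio is checked first, so a participant with an obstructive ratio is classed as COPD regardless of FEV1 % predicted, which matches the order of the definitions in the abstract.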
ABSTRACT
The button sequestrum sign is seen in a number of medical conditions and refers to a lesion of devascularised bone surrounded by lucency. Although it may be difficult to arrive at a single diagnosis based on this sign alone, the combination of clinical and paraclinical findings, the patient's medical history and the imaging presentation of this sign can be highly specific for chronic osteomyelitis, even when osteomyelitis is accompanied by osteopetrosis, as in the present case.
ABSTRACT
Enostoses, also known as bone islands, are common benign sclerotic bone lesions that usually represent incidental findings. Each constitutes a small focus of compact bone within cancellous bone. Enostoses can be seen on radiographs, CT, and MRI, and are considered one of the skeletal "do not touch" lesions.
ABSTRACT
This paper aims to characterize the imaging manifestations, together with the clinical presentation, of sacrococcygeal chordoma in a patient with referred back pain. It also describes the steps leading to the final diagnosis and thereby demonstrates the crucial role of magnetic resonance imaging, computed tomography-guided biopsy and histopathological examination in narrowing the differential diagnosis and reaching the correct diagnosis.
ABSTRACT
BACKGROUND: We suggest an adaptive sample size calculation method for developing clinical prediction models, in which model performance is monitored sequentially as new data comes in. METHODS: We illustrate the approach using data for the diagnosis of ovarian cancer (n = 5914, 33% event fraction) and obstructive coronary artery disease (CAD; n = 4888, 44% event fraction). We used logistic regression to develop a prediction model consisting only of a priori selected predictors and assumed linear relations for continuous predictors. We mimicked prospective patient recruitment by developing the model on 100 randomly selected patients, and we used bootstrapping to internally validate the model. We sequentially added 50 random new patients until we reached a sample size of 3000 and re-estimated model performance at each step. We examined the required sample size for satisfying the following stopping rule: obtaining a calibration slope ≥ 0.9 and optimism in the c-statistic (or AUC) ≤ 0.02 at two consecutive sample sizes. This procedure was repeated 500 times. We also investigated the impact of alternative modeling strategies: modeling nonlinear relations for continuous predictors and correcting for bias on the model estimates (Firth's correction). RESULTS: Better discrimination was achieved in the ovarian cancer data (c-statistic 0.9 with 7 predictors) than in the CAD data (c-statistic 0.7 with 11 predictors). Adequate calibration and limited optimism in discrimination were achieved after a median of 450 patients (interquartile range 450-500) for the ovarian cancer data (22 events per parameter (EPP), 20-24) and 850 patients (750-900) for the CAD data (33 EPP, 30-35). A stricter criterion, requiring AUC optimism ≤ 0.01, was met with a median of 500 (23 EPP) and 1500 (59 EPP) patients, respectively.
These sample sizes were much higher than the well-known 10 EPP rule of thumb and slightly higher than a recently published fixed sample size calculation method by Riley et al. Higher sample sizes were required when nonlinear relationships were modeled, and lower sample sizes when Firth's correction was used. CONCLUSIONS: Adaptive sample size determination can be a useful supplement to fixed a priori sample size calculations, because it allows the sample size to be tailored dynamically to the specific prediction modeling context.
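The stopping rule described in the Methods can be sketched as a small function. This is an illustration under stated assumptions, not the authors' implementation: the function name and the example trajectory are hypothetical, and it assumes the sample size reported is the second of the two consecutive sizes at which both criteria hold.

```python
def first_stopping_size(trajectory, min_slope=0.9, max_optimism=0.02, consecutive=2):
    """Return the first sample size at which the rule has held `consecutive` times in a row.

    trajectory: iterable of (sample_size, calibration_slope, auc_optimism),
    ordered by increasing sample size. Returns None if the rule is never met.
    """
    streak = 0
    for n, slope, optimism in trajectory:
        if slope >= min_slope and optimism <= max_optimism:
            streak += 1
            if streak >= consecutive:
                return n
        else:
            # A single failure resets the run: the criteria must hold
            # at consecutive sample sizes, not just twice overall.
            streak = 0
    return None


# Hypothetical performance estimates after each batch of 50 new patients.
example = [
    (100, 0.70, 0.060),
    (150, 0.85, 0.030),
    (200, 0.91, 0.019),  # criteria met once
    (250, 0.88, 0.021),  # criteria fail, run resets
    (300, 0.92, 0.018),  # met
    (350, 0.93, 0.015),  # met again, so stop here
]
print(first_stopping_size(example))
```

Passing `max_optimism=0.01` gives the stricter criterion mentioned in the Results.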
ABSTRACT
OBJECTIVES: The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modeling in the literature. STUDY DESIGN AND SETTING: We conducted a Medline literature search (1/2016 to 8/2017) and extracted comparisons between LR and ML models for binary outcomes. RESULTS: We included 71 of 927 studies. The median sample size was 1,250 (range 72-3,994,872), with 19 predictors considered (range 5-563) and eight events per predictor (range 0.3-6,697). The most common ML methods were classification trees, random forests, artificial neural networks, and support vector machines. In 48 (68%) studies, we observed potential bias in the validation procedures. Sixty-four (90%) studies used the area under the receiver operating characteristic curve (AUC) to assess discrimination. Calibration was not addressed in 56 (79%) studies. We identified 282 comparisons between an LR and ML model (AUC range, 0.52-0.99). For 145 comparisons at low risk of bias, the difference in logit(AUC) between LR and ML was 0.00 (95% confidence interval, -0.18 to 0.18). For 137 comparisons at high risk of bias, logit(AUC) was 0.34 (0.20-0.47) higher for ML. CONCLUSION: We found no evidence of superior performance of ML over LR. Improvements in methodology and reporting are needed for studies that compare modeling algorithms.
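To give a feel for what a difference of 0.34 in logit(AUC) means on the AUC scale, the sketch below applies the standard logit transform, logit(p) = ln(p / (1 - p)), and its inverse. The baseline AUC of 0.75 is a hypothetical starting point chosen for illustration; it is not a figure from the review.

```python
import math


def logit(p):
    """Log-odds transform, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-x))


baseline_auc = 0.75  # hypothetical LR performance, for illustration only
shifted_auc = inv_logit(logit(baseline_auc) + 0.34)  # 0.34 is the reported difference
print(round(baseline_auc, 3), "->", round(shifted_auc, 3))
```

Because the logit scale stretches values near 0.5 less than values near 1, the same logit difference corresponds to a smaller AUC gain when the baseline AUC is already high.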
Subject(s)
Logistic Models; Models, Theoretical; Supervised Machine Learning; Algorithms; Area Under Curve; Humans; Outcome Assessment, Health Care; Predictive Value of Tests; Sensitivity and Specificity
ABSTRACT
CONTEXT: Urologists regularly develop clinical risk prediction models to support clinical decisions. In contrast to traditional performance measures, decision curve analysis (DCA) can assess the utility of models for decision making. DCA plots net benefit (NB) at a range of clinically reasonable risk thresholds. OBJECTIVE: To provide recommendations on interpreting and reporting DCA when evaluating prediction models. EVIDENCE ACQUISITION: We informally reviewed the urological literature to determine investigators' understanding of DCA. To illustrate, we use data from 3616 patients to develop risk models for high-grade prostate cancer (n=313, 9%) to decide who should undergo a biopsy. The baseline model includes prostate-specific antigen and digital rectal examination; the extended model adds two predictors based on transrectal ultrasound (TRUS). EVIDENCE SYNTHESIS: We explain risk thresholds, NB, default strategies (treat all, treat no one), and test tradeoff. To use DCA, first determine whether a model is superior to all other strategies across the range of reasonable risk thresholds. If so, that model appears to improve decisions irrespective of threshold. Second, consider if there are important extra costs to using the model. If so, obtain the test tradeoff to check whether the increase in NB versus the best other strategy is worth the additional cost. In our case study, addition of TRUS improved NB by 0.0114, equivalent to 1.1 more detected high-grade prostate cancers per 100 patients. Hence, adding TRUS would be worthwhile if we accept subjecting 88 patients to TRUS to find one additional high-grade prostate cancer or, alternatively, subjecting 10 patients to TRUS to avoid one unnecessary biopsy. CONCLUSIONS: The proposed guidelines can help researchers understand DCA and improve application and reporting. PATIENT SUMMARY: Decision curve analysis can identify risk models that can help us make better clinical decisions. 
We illustrate appropriate reporting and interpretation of decision curve analysis.
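The arithmetic behind the case study can be reproduced with a short sketch. It uses the usual definition of net benefit, NB = TP/n - FP/n * pt / (1 - pt), where pt is the risk threshold, and takes the test tradeoff as 1 / (difference in NB). The function names and the counts passed to `net_benefit` are hypothetical; only the NB difference of 0.0114 comes from the abstract.

```python
def net_benefit(true_positives, false_positives, n, threshold):
    """Net benefit of a decision strategy at a given risk threshold.

    False positives are weighted by the odds of the threshold,
    threshold / (1 - threshold).
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    weight = threshold / (1.0 - threshold)
    return true_positives / n - weight * false_positives / n


def tradeoff_per_true_positive(delta_net_benefit):
    """Number of patients who must get the extra test per additional true positive."""
    if delta_net_benefit <= 0:
        raise ValueError("the extended model must improve net benefit")
    return 1.0 / delta_net_benefit


# Hypothetical counts, only to show how the formula is applied.
print(net_benefit(true_positives=30, false_positives=100, n=1000, threshold=0.1))

# Reported improvement in NB from adding TRUS.
delta_nb = 0.0114
print(round(100 * delta_nb, 1), "more true positives per 100 patients")
print(round(tradeoff_per_true_positive(delta_nb)), "patients tested per extra true positive")
```

The second figure in the abstract (10 patients per avoided unnecessary biopsy) additionally depends on the risk threshold, which the abstract does not state, so it is not reproduced here.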