Results 1 - 20 of 33
1.
BMC Med Inform Decis Mak ; 24(1): 111, 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38664664

ABSTRACT

In cancer research there is much interest in building and validating outcome prediction models to support treatment decisions. However, because most outcome prediction models are developed and validated without regard to the causal aspects of treatment decision making, many published outcome prediction models may cause harm when used for decision making, despite being found accurate in validation studies. Guidelines on prediction model validation and the checklist for risk model endorsement by the American Joint Committee on Cancer do not protect against prediction models that are accurate during development and validation but harmful when used for decision making. We explain why this is the case and how to build and validate models that are useful for decision making.
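To make the failure mode concrete, here is a minimal simulation (invented for illustration; the data-generating process and treatment policy are assumptions, not from the paper): a prediction model fit to observational data validates well, yet its scores reflect outcomes under the historical treatment policy, so acting on them can cause harm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
severity = rng.normal(size=n)                              # prognostic factor
treated = rng.random(n) < 1 / (1 + np.exp(-3 * severity))  # sicker -> treated
p_death = 1 / (1 + np.exp(-(severity - 2.0 * treated)))    # treatment helps
death = rng.random(n) < p_death

X = severity[:, None]                                      # treatment ignored
model = LogisticRegression().fit(X[:50_000], death[:50_000])
auc = roc_auc_score(death[50_000:], model.predict_proba(X[50_000:])[:, 1])
print(f"validation AUC: {auc:.2f}")                        # looks respectable

# The score encodes outcomes *under the historical treatment policy*: severe
# patients look lower-risk only because they were usually treated. Using the
# model to withhold treatment from "low-risk" patients would harm exactly them.
```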


Subject(s)
Algorithms , Humans , Causality , Clinical Decision-Making , Neoplasms/therapy , Quality Improvement
2.
Am J Hematol ; 95(3): 302-309, 2020 03.
Article in English | MEDLINE | ID: mdl-31849101

ABSTRACT

Iron deficiency contributes to ∼50% of anemia prevalence worldwide, but reference intervals for iron status tests are not optimized for anemia diagnosis. To address this limitation, we identified the serum ferritin (SF) thresholds associated with hematologic decline in iron-deficient patients, and the SF thresholds from which an SF increase was associated with hematologic improvement. Paired red blood cell and SF measurements were analysed from two adult cohorts at Massachusetts General Hospital (MGH), from 2008-2011 (N = 48 409) and 2016-2018 (N = 10 042). Inter-patient measurements in the first cohort were used to define optimal SF thresholds based on the physiologic relationship between SF and red cell measurements. Intra-patient measurements (1-26 weeks apart) in the second cohort were used to identify SF thresholds from which an SF increase was associated with an increase in red cell measurements. The identified optimal SF thresholds varied with age, sex and red cell measure. Thresholds associated with a ∼5% decline in red cell index were typically in the range 10-25 ng/mL. Thresholds for younger women (18-45 years) were ∼5 ng/mL lower than for older women (60-95 years), and ∼10 ng/mL lower than for men. Thresholds from which a subsequent increase in SF was associated with a concomitant increase in red cell measure showed similar patterns: younger women had lower thresholds (∼15 ng/mL) than older women (∼25 ng/mL) or men (∼35 ng/mL). These results suggest that diagnostic accuracy may be improved by setting different SF thresholds for younger women, older women, and men. This study illustrates how clinical databases may provide physiologic evidence for improved diagnostic thresholds.
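A rough sketch of the threshold logic described here, with synthetic data standing in for the MGH cohorts (the binning scheme, plateau definition, and functional form are my assumptions):

```python
import numpy as np
import pandas as pd

def sf_threshold(sf, rbc_measure, n_bins=40, drop=0.05):
    # Bin SF into quantiles, estimate mean red-cell measure per bin, and
    # report the SF value below which the mean falls >=5% under its plateau.
    df = pd.DataFrame({"sf": sf, "y": rbc_measure})
    df["bin"] = pd.qcut(df["sf"], n_bins)
    curve = df.groupby("bin", observed=True).agg(sf=("sf", "median"), y=("y", "mean"))
    plateau = curve["y"].iloc[-n_bins // 4 :].mean()     # high-SF plateau
    below = curve[curve["y"] < (1 - drop) * plateau]
    return below["sf"].max() if len(below) else np.nan   # highest SF still declining

# Synthetic data shaped like the physiologic relationship described above.
rng = np.random.default_rng(1)
sf = rng.lognormal(3.0, 1.0, 20_000)
hgb = 14.5 - 3.0 * np.exp(-sf / 15.0) + rng.normal(0, 1.0, sf.size)
print(sf_threshold(sf, hgb))     # ≈ 18 ng/mL with this synthetic shape
```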


Subject(s)
Anemia, Iron-Deficiency/blood , Erythrocytes/metabolism , Ferritins/blood , Adolescent , Adult , Age Factors , Aged , Anemia, Iron-Deficiency/physiopathology , Erythrocyte Count , Erythrocytes/pathology , Female , Hemoglobins/metabolism , Humans , Male , Middle Aged , Prevalence , Retrospective Studies
3.
J Biomed Inform ; 109: 103515, 2020 09.
Article in English | MEDLINE | ID: mdl-32771540

ABSTRACT

Causal inference often relies on the counterfactual framework, which requires that treatment assignment be independent of the potential outcomes given observed covariates, a condition known as strong ignorability. Approaches to enforcing strong ignorability in causal analyses of observational data include weighting and matching methods. Effect estimates, such as the average treatment effect (ATE), are then computed as expectations under the re-weighted or matched distribution, P. The choice of P is important and can affect both the interpretation and the variance of the effect estimate. In this work, instead of specifying P, we learn a distribution that simultaneously maximizes coverage and minimizes the variance of ATE estimates. To learn this distribution, we propose a generative adversarial network (GAN)-based model called the Counterfactual χ-GAN (cGAN), which also learns feature-balancing weights and supports unbiased causal estimation in the absence of unobserved confounding. Our model minimizes the Pearson χ²-divergence, which we show simultaneously maximizes coverage and minimizes the variance of importance sampling estimates. To our knowledge, this is the first such application of the Pearson χ²-divergence. We demonstrate the effectiveness of cGAN in achieving feature balance relative to established weighting methods in simulation and with real-world medical data.
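The coverage/variance claim rests on a standard identity: for importance weights $w = p/q$, the Pearson χ²-divergence equals $\mathbb{E}_q[w^2] - 1$, which governs both the variance of importance-sampling estimates and the effective sample size. A quick numerical check (not the paper's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000
q = stats.norm(0, 1.5)        # observed/proposal distribution
p = stats.norm(0.5, 1.0)      # target distribution P

x = q.rvs(n, random_state=rng)
w = p.pdf(x) / q.pdf(x)       # importance weights p/q

chi2 = np.mean(w**2) - 1                  # Monte Carlo estimate of chi2(p||q)
ess_kish = w.sum() ** 2 / (w**2).sum()    # Kish effective sample size
print(f"chi2 ≈ {chi2:.3f}")
print(f"n/(1+chi2) ≈ {n / (1 + chi2):.0f} vs Kish ESS ≈ {ess_kish:.0f}")
# The two ESS numbers agree closely: smaller chi2 -> larger effective sample
# ("coverage") and lower-variance importance-sampling (hence ATE) estimates.
```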


Subject(s)
Causality , Computer Simulation , Humans
4.
Neuroimage ; 180(Pt A): 243-252, 2018 10 15.
Article in English | MEDLINE | ID: mdl-29448074

ABSTRACT

Recent research shows that the covariance structure of functional magnetic resonance imaging (fMRI) data - commonly described as functional connectivity - can change as a function of the participant's cognitive state (for review see Turk-Browne, 2013). Here we present a Bayesian hierarchical matrix factorization model, termed hierarchical topographic factor analysis (HTFA), for efficiently discovering full-brain networks in large multi-subject neuroimaging datasets. HTFA approximates each subject's network by first re-representing each brain image in terms of the activities of a set of localized nodes, and then computing the covariance of the activity time series of these nodes. The number of nodes, along with their locations, sizes, and activities over time, is learned from the data. Because the number of nodes is typically substantially smaller than the number of fMRI voxels, HTFA can be orders of magnitude more efficient than traditional voxel-based functional connectivity approaches. In one case study, we show that HTFA recovers the known connectivity patterns underlying a collection of synthetic datasets. In a second case study, we illustrate how HTFA may be used to discover dynamic full-brain activity and connectivity patterns in real fMRI data, collected as participants listened to a story. In a third case study, we carried out a similar series of analyses on fMRI data collected as participants viewed an episode of a television show. In these latter case studies, we found that the HTFA-derived activity and connectivity patterns can be used to reliably decode which moments in the story or show the participants were experiencing. Further, we found that these two classes of patterns contained partially non-overlapping information, such that decoders trained on combinations of activity-based and dynamic connectivity-based features performed better than decoders trained on activity or connectivity patterns alone. We replicated this latter result with two additional (previously developed) methods for efficiently characterizing full-brain activity and connectivity patterns.
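A simplified sketch of the two-step pipeline (node locations and widths are fixed here, whereas HTFA learns them; the data are random placeholders):

```python
import numpy as np

def rbf_images(centers, widths, voxel_coords):
    # (K, V) matrix: each row is one node's radial-basis image over V voxels.
    d2 = ((voxel_coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
    return np.exp(-d2 / widths[:, None] ** 2)

rng = np.random.default_rng(0)
V, T, K = 5000, 300, 20
voxel_coords = rng.uniform(0, 100, size=(V, 3))    # fake voxel positions
centers = rng.uniform(0, 100, size=(K, 3))         # node locations
widths = rng.uniform(8, 15, size=K)                # node sizes

F = rbf_images(centers, widths, voxel_coords)      # (K, V)
Y = rng.normal(size=(T, V))                        # fake fMRI data (T, V)

# Step 1: node activities via least squares, Y ≈ A @ F with A of shape (T, K).
A = np.linalg.lstsq(F.T, Y.T, rcond=None)[0].T
# Step 2: connectivity among K nodes instead of V voxels (K << V).
connectivity = np.corrcoef(A.T)                    # (K, K)
print(connectivity.shape)
```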


Subject(s)
Brain Mapping/methods , Brain/physiology , Nerve Net/physiology , Factor Analysis, Statistical , Humans , Image Processing, Computer-Assisted , Magnetic Resonance Imaging/methods
5.
NPJ Digit Med ; 7(1): 180, 2024 Jul 06.
Article in English | MEDLINE | ID: mdl-38969786

ABSTRACT

Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30-45 minutes, which restricts monitoring frequency and precludes physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors (ρ = 0.814, 95% CI [0.700, 0.888]) and video (ρ = 0.736, 95% CI [0.584, 0.838]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment (ρ = 0.644, 95% CI [0.585, 0.696]).
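A hedged sketch of the scoring idea, with a stand-in task and model rather than the paper's wearable-sensor or video pipelines (the auxiliary label, features, and classifier are all invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cobra_score(model, subject_windows):
    # Drop in mean top-class confidence over a subject's samples; larger
    # values = larger deviation from the healthy population.
    confidence = model.predict_proba(subject_windows).max(axis=1).mean()
    return 1.0 - confidence

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 32))                       # class-dependent signal
y_train = rng.integers(0, 4, 2000)                 # e.g. which action is performed
X_train = rng.normal(size=(2000, 32)) + np.eye(4)[y_train] @ M
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Healthy subjects retain the structure the model learned; patients do not.
X_healthy = rng.normal(size=(50, 32)) + np.eye(4)[rng.integers(0, 4, 50)] @ M
X_patient = rng.normal(size=(50, 32))              # healthy structure absent
print(cobra_score(clf, X_healthy), cobra_score(clf, X_patient))  # low vs high
```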

6.
JACC Clin Electrophysiol ; 10(5): 956-966, 2024 May.
Article in English | MEDLINE | ID: mdl-38703162

ABSTRACT

BACKGROUND: Prediction of drug-induced long QT syndrome (diLQTS) is of critical importance given its association with torsades de pointes. There is no reliable method for the outpatient prediction of diLQTS. OBJECTIVES: This study sought to evaluate the use of a convolutional neural network (CNN) applied to electrocardiograms (ECGs) to predict diLQTS in an outpatient population. METHODS: We identified all adult outpatients newly prescribed a QT-prolonging medication between January 1, 2003, and March 31, 2022, who had a 12-lead sinus ECG in the preceding 6 months. Using risk factor data and the ECG signal as inputs, the CNN QTNet was implemented in TensorFlow to predict diLQTS. RESULTS: Models were evaluated in a held-out test dataset of 44,386 patients (57% female) with a median age of 62 years. Compared with 3 other models relying on risk factors or ECG signal or baseline QTc alone, QTNet achieved the best (P < 0.001) performance with a mean area under the curve of 0.802 (95% CI: 0.786-0.818). In a survival analysis, QTNet also had the highest inverse probability of censorship-weighted area under the receiver-operating characteristic curve at day 2 (0.875; 95% CI: 0.848-0.904) and up to 6 months. In a subgroup analysis, QTNet performed best among males and patients ≤50 years or with baseline QTc <450 ms. In an external validation cohort of solely suburban outpatient practices, QTNet similarly maintained the highest predictive performance. CONCLUSIONS: An ECG-based CNN can accurately predict diLQTS in the outpatient setting while maintaining its predictive performance over time. In the outpatient setting, our model could identify higher-risk individuals who would benefit from closer monitoring.
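As an architectural sketch only: the abstract specifies a TensorFlow CNN taking the 12-lead ECG signal plus risk factors, so the sampling rate, layer counts, and filter sizes below are assumptions, not the published QTNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

ecg_in = layers.Input(shape=(2500, 12), name="ecg")      # ~10 s at 250 Hz (assumed)
rf_in = layers.Input(shape=(16,), name="risk_factors")   # age, sex, QTc, drugs, ...

x = ecg_in
for filters in (32, 64, 128):                            # temporal feature extractor
    x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(4)(x)
x = layers.GlobalAveragePooling1D()(x)

h = layers.Concatenate()([x, rf_in])                     # fuse signal + risk factors
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="diLQTS_risk")(h)

qtnet_sketch = tf.keras.Model([ecg_in, rf_in], out)
qtnet_sketch.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=[tf.keras.metrics.AUC(name="auroc")])
qtnet_sketch.summary()
```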


Subject(s)
Artificial Intelligence , Electrocardiography , Long QT Syndrome , Neural Networks, Computer , Humans , Female , Male , Long QT Syndrome/chemically induced , Long QT Syndrome/diagnosis , Middle Aged , Aged , Adult , Risk Factors
7.
Eur Heart J Acute Cardiovasc Care ; 13(6): 472-480, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38518758

ABSTRACT

AIMS: Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the USA, with morbidity and mortality highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock allows prompt implementation of treatment measures. Our objective is to develop a new dynamic risk score, called CShock, to improve early detection of cardiogenic shock in the cardiac intensive care unit (ICU). METHODS AND RESULTS: We developed and externally validated a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict the onset of cardiogenic shock. We prepared a cardiac ICU dataset using the Medical Information Mart for Intensive Care-III database by annotating with physician-adjudicated outcomes. This dataset, which consisted of 1500 patients (204 with cardiogenic/mixed shock), was then used to train CShock. The features used to train the model for CShock included patient demographics, cardiac ICU admission diagnoses, routinely measured laboratory values and vital signs, and relevant features manually extracted from echocardiogram and left heart catheterization reports. We externally validated the risk model on the New York University (NYU) Langone Health cardiac ICU database, which was also annotated with physician-adjudicated outcomes. The external validation cohort consisted of 131 patients, 25 of whom experienced cardiogenic/mixed shock. CShock achieved an area under the receiver operator characteristic curve (AUROC) of 0.821 (95% CI 0.792-0.850). CShock was externally validated in the more contemporary NYU cohort and achieved an AUROC of 0.800 (95% CI 0.717-0.884), demonstrating its generalizability in other cardiac ICUs. Based on Shapley values, an elevated heart rate is most predictive of cardiogenic shock development. The other top 10 predictors are an admission diagnosis of myocardial infarction with ST-segment elevation, an admission diagnosis of acute decompensated heart failure, Braden Scale, Glasgow Coma Scale, blood urea nitrogen, systolic blood pressure, serum chloride, serum sodium, and arterial blood pH. CONCLUSION: The novel CShock score has the potential to provide automated detection and early warning for cardiogenic shock and improve the outcomes for millions of patients who suffer from myocardial infarction and heart failure.


Subject(s)
Machine Learning , Shock, Cardiogenic , Humans , Shock, Cardiogenic/diagnosis , Male , Female , Risk Assessment/methods , Aged , Middle Aged , Coronary Care Units , Early Diagnosis , Retrospective Studies , Risk Factors , ROC Curve , Hospital Mortality/trends , Myocardial Infarction/diagnosis , Myocardial Infarction/complications , Intensive Care Units
8.
Proc AAAI Conf Artif Intell ; 37(12): 15305-15312, 2023 Jun 27.
Article in English | MEDLINE | ID: mdl-38464961

ABSTRACT

Methods which utilize the outputs or feature representations of predictive models have emerged as promising approaches for out-of-distribution (OOD) detection of image inputs. However, these methods struggle to detect OOD inputs that share nuisance values (e.g. background) with in-distribution inputs. The detection of shared-nuisance out-of-distribution (SN-OOD) inputs is particularly relevant in real-world applications, as anomalies and in-distribution inputs tend to be captured in the same settings during deployment. In this work, we provide a possible explanation for SN-OOD detection failures and propose nuisance-aware OOD detection to address them. Nuisance-aware OOD detection substitutes a classifier trained via Empirical Risk Minimization (ERM) and cross-entropy loss with one that (1) is trained under a distribution where the nuisance-label relationship is broken and (2) yields representations that are independent of the nuisance under this distribution, both marginally and conditioned on the label. We can train a classifier to achieve these objectives using Nuisance-Randomized Distillation (NURD), an algorithm developed for OOD generalization under spurious correlations. Output- and feature-based nuisance-aware OOD detection perform substantially better than their original counterparts, succeeding even when detection based on domain generalization algorithms fails to improve performance.
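A toy sketch of the first ingredient, breaking the nuisance-label relationship by reweighting training examples to $p(y)p(z)/p(y,z)$ before training the classifier used for OOD scoring (NURD's representation-independence condition is not shown; the data are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
z = rng.integers(0, 2, n)                        # nuisance, e.g. background
y = np.where(rng.random(n) < 0.9, z, 1 - z)      # spuriously correlated label
X = np.c_[y + rng.normal(0, 1, n),               # core feature
          z + rng.normal(0, 0.3, n)]             # nuisance feature

# Importance weights that make y independent of z under the new distribution.
p_y = np.bincount(y) / n
p_z = np.bincount(z) / n
p_yz = np.histogram2d(y, z, bins=2)[0] / n
w = p_y[y] * p_z[z] / p_yz[y, z]

clf = LogisticRegression().fit(X, y, sample_weight=w)
print(clf.coef_)    # weighting pushes the nuisance coefficient toward zero
```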

9.
Proc Mach Learn Res ; 206: 10343-10367, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37681192

ABSTRACT

Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: $F_{x \mid z}(x \mid z)$ and $F_{y \mid z}(y \mid z)$, where $F_{\cdot \mid z}(\cdot \mid z)$ is a conditional cumulative distribution function (CDF) for the distribution $p(\cdot \mid z)$. These variables are termed "information residuals." We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
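A simplified sketch of the information residuals, using Gaussian regression models as a crude stand-in for the paper's conditional-CDF estimators, and a correlation test in place of mutual information:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

def information_residual(v, z):
    # F_{v|z}(v|z) assuming v | z ~ Normal(m(z), s^2) with m fit by regression.
    reg = LinearRegression().fit(z, v)
    resid = v - reg.predict(z)
    return stats.norm.cdf(resid / resid.std())

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=(n, 3))
x = z @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)
y = z @ np.array([0.3, 0.8, -1.0]) + 0.5 * x + rng.normal(size=n)  # x -> y given z

rx = information_residual(x, z)
ry = information_residual(y, z)
print(stats.pearsonr(rx, ry))   # dependence survives conditioning on z
```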

10.
Article in English | MEDLINE | ID: mdl-38645403

ABSTRACT

Deep Neural Networks (DNNs) are prone to learning spurious features that correlate with the label during training but are irrelevant to the learning problem. This hurts model generalization and poses problems when deploying them in safety-critical applications. This paper aims to better understand the effects of spurious features through the lens of the learning dynamics of the internal neurons during the training process. We make the following observations: (1) While previous works highlight the harmful effects of spurious features on the generalization ability of DNNs, we emphasize that not all spurious features are harmful. Spurious features can be "benign" or "harmful" depending on whether they are "harder" or "easier" to learn than the core features for a given model. This definition is model and dataset dependent. (2) We build upon this premise and use instance difficulty methods (like Prediction Depth (Baldock et al., 2021)) to quantify "easiness" for a given model and to identify this behavior during the training phase. (3) We empirically show that the harmful spurious features can be detected by observing the learning dynamics of the DNN's early layers. In other words, easy features learned by the initial layers of a DNN early during the training can (potentially) hurt model generalization. We verify our claims on medical and vision datasets, both simulated and real, and justify the empirical success of our hypothesis by showing the theoretical connections between Prediction Depth and information-theoretic concepts like 𝒱-usable information (Ethayarajh et al., 2021). Lastly, our experiments show that monitoring only accuracy during training (as is common in machine learning pipelines) is insufficient to detect spurious features. We, therefore, highlight the need for monitoring early training dynamics using suitable instance difficulty metrics.
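A rough sketch of how prediction depth can be probed in practice, imitating the k-NN probes of Baldock et al. on a toy model and dataset (both invented here):

```python
import numpy as np
import tensorflow as tf
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 20)).astype("float32")
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # an "easy" core feature

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(32, activation="relu") for _ in range(4)]
    + [tf.keras.layers.Dense(2)])
model.compile("adam", tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(X, y, epochs=5, verbose=0)

# Collect activations at the input and after each hidden layer.
feats, h = [X], X
for layer in model.layers[:-1]:
    h = layer(h).numpy()
    feats.append(h)

final = model.predict(X, verbose=0).argmax(1)       # network's final predictions
probes = np.stack([KNeighborsClassifier(30).fit(f, final).predict(f) for f in feats])

# Prediction depth: first layer from which all deeper probes already agree
# with the final prediction (assumes the last probe agrees, as it does here).
agrees = probes == final
suffix_agree = np.logical_and.accumulate(agrees[::-1], axis=0)[::-1]
print("mean prediction depth:", suffix_agree.argmax(0).mean())
```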

11.
ArXiv ; 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-38045479

ABSTRACT

Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a novel framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30-45 minutes, which restricts monitoring frequency and precludes physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors ($\rho = 0.845$, 95% CI [0.743, 0.908]) and video ($\rho = 0.746$, 95% CI [0.594, 0.847]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment ($\rho = 0.644$, 95% CI [0.585, 0.696]).

12.
Proc Mach Learn Res ; 182: 224-248, 2022 Aug.
Article in English | MEDLINE | ID: mdl-37706207

ABSTRACT

Survival analysis, the art of time-to-event modeling, plays an important role in clinical treatment decisions. Recently, continuous time models built from neural ODEs have been proposed for survival analysis. However, the training of neural ODEs is slow due to the high computational complexity of neural ODE solvers. Here, we propose an efficient alternative for flexible continuous time models, called Survival Mixture Density Networks (Survival MDNs). Survival MDN applies an invertible positive function to the output of Mixture Density Networks (MDNs). While MDNs produce flexible real-valued distributions, the invertible positive function maps the model into the time domain while preserving a tractable density. Using four datasets, we show that Survival MDN performs better than, or similarly to, continuous and discrete time baselines on concordance, integrated Brier score and integrated binomial log-likelihood. Meanwhile, Survival MDNs are also faster than ODE-based models and circumvent binning issues in discrete models.
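The construction reduces to a change of variables. A worked sketch, taking softplus as one plausible choice of invertible positive map $g$, so that $f_T(t) = f_Z(g^{-1}(t)) / g'(g^{-1}(t))$:

```python
import numpy as np
from scipy import stats, integrate

def softplus_inv(t): return np.log(np.expm1(t))     # g^{-1}
def softplus_grad(z): return 1 / (1 + np.exp(-z))   # g' (the sigmoid)

# Mixture parameters as an MDN head might emit for one input x (made up here).
w = np.array([0.3, 0.7]); mu = np.array([-1.0, 2.0]); sd = np.array([0.5, 1.0])

def f_T(t):                      # density of T = softplus(Z)
    z = softplus_inv(t)
    return (w * stats.norm.pdf(z, mu, sd)).sum() / softplus_grad(z)

def S_T(t):                      # survival function of T
    z = softplus_inv(t)
    return 1 - (w * stats.norm.cdf(z, mu, sd)).sum()

ts = np.linspace(0.01, 15, 3000)
print(integrate.trapezoid([f_T(t) for t in ts], ts))   # ≈ 1: valid density
print(S_T(1.0))                                        # P(T > 1)
```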

13.
Proc Mach Learn Res ; 162: 26559-26574, 2022 Jul.
Article in English | MEDLINE | ID: mdl-37645424

ABSTRACT

Permutation invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. Additionally, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information useful for prediction. To address these issues, we introduce the "clean path principle" for equivariant residual connections and develop set norm (SN), a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that reach high depths with performance better than or comparable to their original counterparts on a diverse suite of tasks. We additionally introduce Flow-RBC, a new single-cell dataset and real-world application of permutation invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permutation_invariant.
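A sketch of set norm as described above: standardize each set over all of its elements and features jointly (one mean and variance per set), which keeps the operation permutation-equivariant; the per-feature affine parameters are my reading of the standard design.

```python
import numpy as np

def set_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, n_elements, d_features); one mean/variance per set.
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(2.0, 3.0, size=(4, 100, 8))
gamma, beta = np.ones(8), np.zeros(8)     # learnable in a real network
out = set_norm(x, gamma, beta)
print(out.mean(), out.std())              # ≈ 0, ≈ 1 per set

# Permutation equivariance: shuffling set elements commutes with the norm.
perm = np.random.default_rng(1).permutation(100)
assert np.allclose(set_norm(x[:, perm], gamma, beta), out[:, perm])
```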

14.
Proc Mach Learn Res ; 177: 290-301, 2022 Apr.
Article in English | MEDLINE | ID: mdl-37646010

ABSTRACT

Spurious correlations allow flexible models to predict well during training but poorly on related test populations. Recent work has shown that models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive maximum mean discrepancy (MMD) estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
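For context, a standard unbiased RBF-kernel MMD² estimator of the kind such invariance objectives build on; the paper's actual contribution, estimators that remain valid when nuisance labels are missing, is not reproduced in this generic sketch.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    # Unbiased estimate of MMD^2 between samples X and Y with an RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0); np.fill_diagonal(Kyy, 0.0)  # drop i == j terms
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 5)), rng.normal(size=(500, 5)))
diff = mmd2_unbiased(rng.normal(size=(500, 5)), rng.normal(0.5, 1, (500, 5)))
print(f"same dist: {same:.4f}, shifted dist: {diff:.4f}")   # ≈ 0 vs clearly > 0
```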

15.
Sci Rep ; 12(1): 5848, 2022 04 07.
Article in English | MEDLINE | ID: mdl-35393451

ABSTRACT

Randomized Controlled Trials (RCTs) are the gold standard for estimating treatment effects, but some important situations in cancer care require treatment effect estimates from observational data. We developed "Proxy based individual treatment effect modeling in cancer" (PROTECT) to estimate treatment effects from observational data when there are unobserved confounders but proxy measurements of these confounders exist. We identified an unobserved confounder in observational cancer research: overall fitness. Proxy measurements of overall fitness, such as performance score, exist, but fitness as observed by the treating physician is unavailable for research. PROTECT reconstructs the distribution of the unobserved confounder based on these proxy measurements to estimate the treatment effect. PROTECT was applied to an observational cohort of 504 stage III non-small cell lung cancer (NSCLC) patients, treated with concurrent chemoradiation or sequential chemoradiation. Whereas conventional confounding adjustment methods seemed to overestimate the treatment effect, PROTECT provided credible treatment effect estimates.


Subject(s)
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Carcinoma, Non-Small-Cell Lung/drug therapy , Chemoradiotherapy , Cohort Studies , Humans , Lung Neoplasms/therapy
16.
Proc Mach Learn Res ; 139: 12427-12436, 2021 Jul.
Article in English | MEDLINE | ID: mdl-35860036

ABSTRACT

Deep generative models (DGMs) seem a natural fit for detecting out-of-distribution (OOD) inputs, but such models have been shown to assign higher probabilities or densities to OOD images than images from the training distribution. In this work, we explain why this behavior should be attributed to model misestimation. We first prove that no method can guarantee performance beyond random chance without assumptions on which out-distributions are relevant. We then interrogate the typical set hypothesis, the claim that relevant out-distributions can lie in high-likelihood regions of the data distribution, and that OOD detection should be defined based on the data distribution's typical set. We highlight the consequences implied by assuming support overlap between in- and out-distributions, as well as the arbitrariness of the typical set for OOD detection. Our results suggest that estimation error is a more plausible explanation than the misalignment between likelihood-based OOD detection and out-distributions of interest, and we illustrate how even minimal estimation error can lead to OOD detection failures, yielding implications for future work in deep generative modeling and OOD detection.
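The gap between high density and typicality that this debate turns on is easy to see numerically; a classic illustration (not from the paper) under a standard Gaussian in high dimension:

```python
import numpy as np
from scipy import stats

d = 784                     # e.g. a flattened 28x28 image
rv = stats.multivariate_normal(mean=np.zeros(d))
samples = np.random.default_rng(0).normal(size=(5, d))

print("log density at mode (origin):", rv.logpdf(np.zeros(d)))
print("log density of typical samples:", rv.logpdf(samples))
print("sample radii:", np.linalg.norm(samples, axis=1), "≈ sqrt(d) =", d**0.5)
# A density-threshold OOD rule would call the origin maximally in-distribution,
# even though no draw from the model ever lands near it: samples concentrate
# on a thin shell at radius ~sqrt(d), far from the highest-density point.
```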

17.
Proc Mach Learn Res ; 130: 1459-1467, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33954293

ABSTRACT

While the need for interpretable machine learning has been established, many common approaches are slow, lack fidelity, or are hard to evaluate. Amortized explanation methods reduce the cost of providing interpretations by learning a global selector model that returns feature importances for a single instance of data. The selector model is trained to optimize the fidelity of the interpretations, as evaluated by a predictor model for the target. Popular methods learn the selector and predictor model in concert, which we show allows predictions to be encoded within interpretations. We introduce EVAL-X, a method to quantitatively evaluate interpretations, and REAL-X, an amortized explanation method that learns a predictor model approximating the true data-generating distribution given any subset of the input. We show EVAL-X can detect when predictions are encoded in interpretations and show the advantages of REAL-X through quantitative and radiologist evaluation.

18.
Annu Rev Biomed Data Sci ; 4: 393-415, 2021 07 20.
Article in English | MEDLINE | ID: mdl-34465179

ABSTRACT

Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial, including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.


Subject(s)
Delivery of Health Care , Machine Learning , Health Facilities , Models, Statistical
19.
Proc Mach Learn Res ; 130: 1900-1908, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34522887

ABSTRACT

The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two "contrarian" probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art CVS methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset.

20.
Adv Neural Inf Process Syst ; 34: 2160-2172, 2021 Dec.
Article in English | MEDLINE | ID: mdl-35859987

ABSTRACT

Deep models trained through maximum likelihood have achieved state-of-the-art results for survival analysis. Despite this training scheme, practitioners evaluate models under other criteria, such as binary classification losses at a chosen set of time horizons, e.g. Brier score (BS) and Bernoulli log likelihood (BLL). Models trained with maximum likelihood may have poor BS or BLL since maximum likelihood does not directly optimize these criteria. Directly optimizing criteria like BS requires inverse-weighting by the censoring distribution. However, estimating the censoring model under these metrics requires inverse-weighting by the failure distribution. The objective for each model requires the other, but neither are known. To resolve this dilemma, we introduce Inverse-Weighted Survival Games. In these games, objectives for each model are built from re-weighted estimates featuring the other model, where the latter is held fixed during training. When the loss is proper, we show that the games always have the true failure and censoring distributions as a stationary point. This means models in the game do not leave the correct distributions once reached. We construct one case where this stationary point is unique. We show that these games optimize BS on simulations and then apply these principles on real world cancer and critically-ill patient data.
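A sketch of the circularity the games resolve: computing the IPCW Brier score at a horizon requires a censoring survival model G, here a Kaplan-Meier fit on the censoring times via lifelines (the data and predictions are synthetic; the paper learns the failure and censoring models jointly).

```python
import numpy as np
from lifelines import KaplanMeierFitter

def ipcw_brier(pred_surv_at_t, time, event, t):
    # Censoring survival G via Kaplan-Meier, treating censoring as the event.
    G = KaplanMeierFitter().fit(time, event_observed=1 - event)
    died_by_t = (time <= t) & (event == 1)
    alive_at_t = time > t
    G_at_T = np.clip(G.survival_function_at_times(time).values, 1e-8, None)
    G_at_t = max(G.survival_function_at_times([t]).values[0], 1e-8)
    # Predicted survival scored against the binary status at horizon t.
    return np.mean(died_by_t / G_at_T * pred_surv_at_t**2
                   + alive_at_t / G_at_t * (1 - pred_surv_at_t)**2)

rng = np.random.default_rng(0)
n = 2000
true_t = rng.exponential(5, n)                    # failure times
cens_t = rng.exponential(8, n)                    # censoring times
time = np.minimum(true_t, cens_t)
event = (true_t <= cens_t).astype(int)

pred = np.full(n, np.exp(-3 / 5))                 # true S(3) under Exp(5)
print(ipcw_brier(pred, time, event, t=3.0))
```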
