ABSTRACT
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework that uses synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method. The framework targets uncertainty quantification in complex data scenarios, particularly those involving unstructured data analyzed with deep learning models. On one hand, PASS employs a generative model to create synthetic data that closely mirror the raw data while preserving their rank properties through data perturbation, thereby enhancing data diversity and bolstering privacy. By incorporating knowledge transfer from large pre-trained generative models, PASS improves estimation accuracy, yielding refined distributional estimates of various statistics via Monte Carlo experiments. On the other hand, PAI offers statistically guaranteed validity. In pivotal inference, it enables precise conclusions even without prior knowledge of the pivot's distribution. In non-pivotal situations, it improves the reliability of synthetic data generation by training the generator on an independent holdout sample. We demonstrate the effectiveness of PAI in advancing uncertainty quantification in complex, data-driven tasks by applying it to diverse areas such as image synthesis, sentiment word analysis, multimodal inference, and the construction of prediction intervals.
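To make the rank-preserving perturbation idea concrete, the following minimal Python sketch perturbs data in a latent Gaussian score space and feeds Monte Carlo copies into a PAI-style percentile interval. It is a toy univariate illustration under our own assumptions: the function name `pass_synthesize`, the `noise_scale` parameter, and the empirical-quantile generator are hypothetical stand-ins for the paper's deep generative models.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def pass_synthesize(x, noise_scale=0.1):
    """Rank-preserving perturbation (toy, univariate): map observations to
    latent Gaussian scores, add noise, then map back through the empirical
    quantiles so the synthetic data mirror the raw distribution."""
    n = len(x)
    ranks = np.argsort(np.argsort(x))            # 0 .. n-1
    z = norm.ppf((ranks + 0.5) / n)              # latent Gaussian scores
    u = norm.cdf(z + noise_scale * rng.standard_normal(n))
    return np.quantile(x, u)                     # back to the data scale

# PAI-style Monte Carlo: distribution of a statistic over synthetic copies
x = rng.lognormal(size=200)
medians = [np.median(pass_synthesize(x)) for _ in range(1000)]
lo, hi = np.quantile(medians, [0.025, 0.975])
print(f"Monte Carlo interval for the median: [{lo:.3f}, {hi:.3f}]")
```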
ABSTRACT
Post-market active safety monitoring is important for the timely capture of safety signals associated with exposure to a new vaccine or drug. Group sequential analysis is a common method in safety surveillance: it compares the post-vaccination incidence of adverse events (AEs) in a vaccinated population with a pre-specified reference level by sequentially conducting hypothesis tests during the surveillance period. When the number of AEs is "too high", a safety signal is identified; if the null hypothesis is never rejected, the vaccine is considered safe. Such an approach accounts neither for the variation in estimating the reference risk from a control population nor for seasonality effects. Furthermore, failure to reject the null may be due to a lack of power and cannot always be interpreted as proof of safety. In this paper, we propose a new group sequential test procedure that fully accounts for both seasonality and variation from the historical controls. More importantly, we propose constructing a confidence interval for the relative AE risk between the exposed and control groups at the end of the study, which can be used to quantify the safety of the vaccine. The proposed method is illustrated via real-data examples on anaphylaxis and examined by extensive simulation studies.
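As a rough illustration of the surveillance logic, the sketch below runs a one-sided Poisson test at each interim look against expected counts from a historical control and reports an end-of-study confidence interval for the relative risk. It is a simplified stand-in, not the paper's procedure: the per-look `alpha_spend` values, the crude spending rule, and the Wald interval (which ignores the historical-control variation the paper explicitly models) are all our assumptions.

```python
import numpy as np
from scipy.stats import poisson, norm

def sequential_poisson_test(observed, expected, alpha_spend):
    """At each look, run a one-sided exact Poisson test of the cumulative AE
    count against the cumulative expected count; flag a signal when the
    p-value drops below the alpha spent so far (a deliberately crude rule)."""
    cum_obs = np.cumsum(observed)
    cum_exp = np.cumsum(expected)
    for k, (o, e, a) in enumerate(zip(cum_obs, cum_exp, alpha_spend), start=1):
        pval = poisson.sf(o - 1, e)              # P(X >= o) under H0
        if pval < a:
            return k, pval                       # signal at look k
    return None, pval                            # no signal

def relative_risk_ci(obs, exp, level=0.95):
    """Wald interval for the relative risk obs/exp, treating obs as Poisson.
    Toy version: ignores the historical-control variation the paper models."""
    rr, se = obs / exp, 1.0 / np.sqrt(obs)
    z = norm.ppf(0.5 + level / 2)
    return rr * np.exp(-z * se), rr * np.exp(z * se)

# four monthly looks; expected counts reflect a seasonal historical baseline
observed = [3, 5, 9, 12]
expected = [2.0, 3.5, 5.0, 6.0]
look, p = sequential_poisson_test(observed, expected,
                                  alpha_spend=[0.01, 0.02, 0.03, 0.05])
print("signal at look:", look, "p-value:", round(p, 4))
print("95% CI for relative risk:", relative_risk_ci(sum(observed), sum(expected)))
```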
Subjects
Adverse Drug Reaction Reporting Systems, Vaccines, Vaccines/adverse effects, Vaccination/adverse effects, Computer Simulation, Risk, Product Surveillance, Postmarketing/methods
ABSTRACT
Data perturbation is a technique for generating synthetic data by adding "noise" to raw data; it has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that it typically incurs information loss in exchange for privacy protection. This information loss, in turn, reduces the accuracy of any statistical or machine learning method based on the synthetic data, weakening downstream analysis and degrading the performance of learned models. In this article, we introduce and advocate a fundamental principle of data perturbation: the distribution of the raw data must be preserved. To achieve this, we propose a new scheme, named data flush, which guarantees the validity of downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating strict privacy-protection requirements such as differential privacy. We highlight multiple facets of data flush through examples.
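The distribution-preservation principle can be seen in a small experiment: additive noise distorts the law of a nonnegative variable (variance inflates, impossible negative values appear), whereas a resampling-based perturbation keeps the marginal distribution intact. The "flush-style" step below is plain bootstrap resampling used only to illustrate the principle; it is not the authors' data-flush scheme, which is nonlinear and accommodates differential privacy.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=5000)        # nonnegative raw data

# Naive additive perturbation: the released distribution is distorted
# (variance inflates and impossible negative values appear).
noisy = x + rng.normal(scale=1.0, size=x.size)

# Distribution-preserving perturbation, illustrated by bootstrap resampling:
# each released value is an independent draw from the empirical distribution,
# so the synthetic data follow the raw data's law.
flushed = rng.choice(x, size=x.size, replace=True)

for name, s in [("raw", x), ("additive noise", noisy), ("flush-style", flushed)]:
    print(f"{name:15s} mean={s.mean():.3f}  var={s.var():.3f}  min={s.min():.3f}")
```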
ABSTRACT
BACKGROUND: This article concerns the identification of gene pairs or combinations of gene pairs associated with a biological phenotype or clinical outcome, allowing for the construction of predictive models that are not only robust to normalization but also easily validated and measured by qPCR techniques. However, given a small number of biological samples yet a large number of genes, this problem suffers from high computational complexity and poses statistical challenges to accurate identification. RESULTS: In this paper, we propose a parsimonious model representation and develop efficient algorithms for identification. In particular, we derive an equivalent model subject to a sum-to-zero constraint in penalized linear regression, and we establish the correspondence between nonzero coefficients in the two models. Most importantly, the new formulation reduces the model complexity of the traditional approach from quadratic to linear order in the number of candidate genes, while overcoming the difficulty of model nonidentifiability. Computationally, we develop an algorithm based on the alternating direction method of multipliers (ADMM) to handle the constraint. Numerically, we demonstrate that the proposed method outperforms the traditional method in statistical accuracy, and that our ADMM algorithm is more computationally efficient than a coordinate descent algorithm with a local search. Finally, we illustrate the proposed method on a prostate cancer dataset, identifying gene pairs associated with pre-operative prostate-specific antigen. CONCLUSION: Our findings demonstrate the feasibility and utility of using gene pairs as biomarkers.
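For concreteness, one plausible ADMM formulation of the sum-to-zero penalized regression described above is sketched below: the lasso problem min 0.5||y - Xb||^2 + lam*||z||_1 subject to b = z and sum(b) = 0, with the zero-sum constraint enforced in the b-update through its KKT system. Parameter names and the toy data are our assumptions; the paper's exact algorithm may differ.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sum_to_zero_lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for  min 0.5*||y - X b||^2 + lam*||z||_1
                 s.t.  b = z  and  sum(b) = 0.
    The equality constraint sum(b)=0 is enforced exactly in the b-update
    via its KKT system; the l1 penalty is handled by soft-thresholding z."""
    n, p = X.shape
    A_inv = np.linalg.inv(X.T @ X + rho * np.eye(p))
    ones = np.ones(p)
    z = np.zeros(p)
    u = np.zeros(p)
    for _ in range(n_iter):
        c = X.T @ y + rho * (z - u)
        mu = (ones @ (A_inv @ c)) / (ones @ (A_inv @ ones))  # KKT multiplier
        b = A_inv @ (c - mu * ones)                          # satisfies sum(b)=0
        z = soft_threshold(b + u, lam / rho)                 # sparse iterate
        u = u + b - z
    return z

# toy data: outcome driven by the difference of genes 0 and 1 (a "gene pair")
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)
b = sum_to_zero_lasso_admm(X, y, lam=5.0)
print("nonzero coefficients:", np.flatnonzero(np.abs(b) > 1e-6))
print(f"sum of coefficients: {b.sum():.2e} (approximately 0 at convergence)")
```

The sum-to-zero constraint is what makes the pair interpretation possible: nonzero coefficients must balance one another, so a fitted solution can be read as differences between candidate genes, consistent with the correspondence the abstract describes.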