ABSTRACT
The rich longitudinal individual-level data available from electronic health records (EHRs) can be used to examine treatment effect heterogeneity. However, estimating treatment effects using EHR data poses several challenges, including time-varying confounding; repeated and temporally non-aligned measurements of covariates, treatment assignments, and outcomes; and loss to follow-up due to dropout. Here, we develop the subgroup discovery for longitudinal data algorithm, a tree-based method for discovering subgroups with heterogeneous treatment effects from longitudinal data, which combines the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus who are at higher risk of weight gain when receiving dolutegravir (DTG)-containing antiretroviral therapies (ARTs) than when receiving non-DTG-containing ARTs.
Subjects
Electronic Health Records, HIV Infections, 3-Ring Heterocyclic Compounds, Piperazines, Pyridones, Humans, Treatment Effect Heterogeneity, Oxazines, HIV Infections/drug therapy
ABSTRACT
Targeted maximum likelihood estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on data (1992-1998) from the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate 8 missing-data methods in this context: complete-case analysis, extended TMLE incorporating an outcome-missingness model, the missing indicator method for missing covariates, and 5 multiple imputation (MI) approaches using parametric or machine-learning models. We considered 6 scenarios that varied in terms of the exposure/outcome generation models (presence of confounder-confounder interactions) and the missingness mechanisms (whether the outcome influenced missingness in other variables, and the presence of interaction/nonlinear terms in the missingness models). Complete-case analysis and extended TMLE had small biases when the outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when the exposure/outcome generation models included interactions. Parametric MI including interactions performed best in terms of bias and variance reduction across all settings, except when the missingness models included a nonlinear term. When choosing a method for handling missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and nonlinearities is expected to perform well.
Subjects
Causality, Humans, Likelihood Functions, Adolescent, Statistical Data Interpretation, Bias, Statistical Models, Computer Simulation
ABSTRACT
Metagenomic next-generation sequencing (mNGS) enables comprehensive pathogen detection and has become increasingly popular in clinical diagnosis. The distinct pathogenic traits between strains require mNGS to achieve strain-level resolution, but an equivocal concept of 'strain', together with the low pathogen loads in most clinical specimens, hinders such strain awareness. Here we introduce a metagenomic intra-species typing (MIST) tool (https://github.com/pandafengye/MIST), which hierarchically organizes reference genomes based on average nucleotide identity (ANI) and performs maximum likelihood estimation to infer strain-level compositional abundance. In silico analysis using synthetic datasets showed that MIST accurately predicted the strain composition at a 99.9% ANI resolution with a sequencing depth of merely 0.001×. When applying MIST to 359 culture-positive and 359 culture-negative real-world specimens of infected body fluids, we found that multiple-strain infections reached considerable frequencies (30.39%-93.22%), which are otherwise underestimated by current diagnostic techniques due to their limited resolution. Several high-risk clones were found to be prevalent across samples, including Acinetobacter baumannii sequence type (ST)208/ST195, Staphylococcus aureus ST22/ST398 and Klebsiella pneumoniae ST11/ST15, indicating potential outbreak events occurring in clinical settings. Interestingly, contamination caused by the engineered Escherichia coli strains K-12 and BL21 throughout the mNGS datasets was also identified by MIST rather than by the statistical decontamination approach. Our study systematically characterized infected body fluids at the strain level for the first time. Extending mNGS testing to the strain level can greatly benefit the clinical diagnosis of bacterial infections, including the identification of multi-strain infections, decontamination and infection-control surveillance.
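The abundance inference step in tools of this kind can be illustrated with a generic mixture-model MLE fitted by expectation-maximization. The sketch below is not the MIST implementation (which operates on ANI-clustered reference genomes); it only assumes a per-read likelihood matrix against candidate strains is already available, and the function name is illustrative.

```python
import numpy as np

def em_strain_abundance(read_likelihoods, n_iter=500, tol=1e-10):
    """Maximum likelihood compositional abundances via EM.

    read_likelihoods : (n_reads, n_strains) array, entry (i, j) = P(read_i | strain_j).
    Returns the abundance vector pi maximizing sum_i log sum_j pi_j * L[i, j].
    """
    L = np.asarray(read_likelihoods, dtype=float)
    n_reads, n_strains = L.shape
    pi = np.full(n_strains, 1.0 / n_strains)  # start from a uniform mixture
    for _ in range(n_iter):
        # E-step: posterior responsibility of each strain for each read
        w = L * pi
        w /= w.sum(axis=1, keepdims=True)
        # M-step: abundances are the mean responsibilities across reads
        pi_new = w.mean(axis=0)
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi
```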
Subjects
Bacterial Infections, Body Fluids, Bacterial Infections/diagnosis, High-Throughput Nucleotide Sequencing/methods, Humans, Metagenomics/methods, Nucleotides
ABSTRACT
Consider the problem of estimating the branch lengths in a symmetric 2-state substitution model with a known topology and a general, clock-like or star-shaped tree with three leaves. We show that the maximum likelihood estimates are analytically tractable and can be obtained from pairwise sequence comparisons. Furthermore, we demonstrate that this property does not generalize to larger state spaces, more complex models or larger trees. Our arguments are based on an enumeration of the free parameters of the model and the dimension of the minimal sufficient data vector. Our interest in this problem arose from discussions with our former colleague Freddy Bugge Christiansen.
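For a concrete sense of why pairwise comparisons suffice in the simplest case, recall a standard parameterization of the symmetric 2-state model: a branch of length d produces a mismatch between its endpoints with probability p(d), and inverting this relation at the observed mismatch fraction (assumed below 1/2) gives the pairwise maximum likelihood branch length. This is a textbook identity, not a result specific to the three-leaf trees analyzed in the paper.

```latex
\[
p(d) \;=\; \frac{1 - e^{-2d}}{2},
\qquad
\hat{d} \;=\; -\tfrac{1}{2}\,\log\!\bigl(1 - 2\hat{p}\bigr), \quad \hat{p} < \tfrac{1}{2}.
\]
```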
Subjects
Molecular Evolution, Genetic Models, Likelihood Functions, Phylogeny
ABSTRACT
Estimating the parameters of amino acid substitution models is a crucial task in bioinformatics. The maximum likelihood (ML) approach has been proposed to estimate amino acid substitution models from large datasets. The quality of newly estimated models is normally assessed by comparing them with existing models in building ML trees. Two important questions that remain are how well the estimated models correlate with the true models and how large the training datasets must be to estimate reliable models. In this article, we performed a simulation study to answer these two questions. We simulated genome datasets with different numbers of genes/alignments based on predefined models (called true models) and predefined trees (called true trees). The simulated datasets were used to estimate amino acid substitution models using the ML estimation methods. Our experiments showed that models estimated by the ML methods from simulated datasets with more than 100 genes have high correlations with the true models. The estimated models performed well in building ML trees in comparison with the true models. The results suggest that amino acid substitution models estimated by the ML methods from large genome datasets are a reliable tool for analyzing amino acid sequences.
Subjects
Algorithms, Genome, Amino Acid Substitution, Phylogeny, Computer Simulation, Genetic Models
ABSTRACT
Interval-censored failure time data frequently arise in scientific studies in which each subject undergoes periodic examinations for the occurrence of the failure event of interest, so the failure time is only known to lie in a specific time interval. In addition, the collected data may include multiple observed variables with a certain degree of correlation, leading to severe multicollinearity issues. This work proposes a factor-augmented transformation model to analyze interval-censored failure time data while reducing model dimensionality and avoiding the multicollinearity induced by multiple correlated covariates. We provide a joint modeling framework comprising a factor analysis model, which groups the multiple observed variables into a few latent factors, and a class of semiparametric transformation models with the augmented factors, which examines their effects, together with those of other covariates, on the failure event. Furthermore, we propose a nonparametric maximum likelihood estimation approach and develop a computationally stable and reliable expectation-maximization algorithm for its implementation. We establish the asymptotic properties of the proposed estimators and conduct simulation studies to assess the empirical performance of the proposed method. An application to the Alzheimer's Disease Neuroimaging Initiative (ADNI) study is provided. An R package, ICTransCFA, is also available for practitioners. Data used in preparation of this article were obtained from the ADNI database.
Subjects
Alzheimer Disease, Computer Simulation, Statistical Models, Humans, Likelihood Functions, Algorithms, Neuroimaging, Factor Analysis, Statistical Data Interpretation, Time Factors
ABSTRACT
People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but "latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. The quantitative viral outgrowth assay (QVOA) is commonly used to estimate the reservoir size, that is, the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. A new variation of the QVOA, the ultra deep sequencing assay of the outgrowth virus (UDSA), was recently developed that further quantifies the number of viral lineages within a subset of infected wells. Performing the UDSA on a subset of wells provides additional information that can improve IUPM estimation. This paper considers statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. Methods are proposed to accommodate assays with wells sequenced at multiple dilution levels and with imperfect sensitivity and specificity, and a novel bias-corrected estimator is included for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.
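For background, the baseline IUPM estimate from dilution-assay (QVOA) data alone maximizes a single-hit Poisson likelihood across dilution levels. The sketch below shows only that baseline estimator; it does not implement the paper's combined QVOA/UDSA methods, the small-sample bias correction, or the SLDeepAssay package, and the function name and example well counts are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def iupm_mle(cells_per_well, n_wells, n_positive):
    """MLE of infectious units per million (IUPM) from a limiting dilution assay.

    cells_per_well : input cell count per well at each dilution level
    n_wells        : total wells assayed at each dilution level
    n_positive     : wells showing viral outgrowth at each dilution level

    Single-hit Poisson model: P(well positive) = 1 - exp(-IUPM * cells / 1e6).
    """
    cells = np.asarray(cells_per_well, dtype=float)
    n = np.asarray(n_wells, dtype=float)
    y = np.asarray(n_positive, dtype=float)

    def neg_log_lik(log_iupm):
        p = 1.0 - np.exp(-np.exp(log_iupm) * cells / 1e6)
        p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
        return -np.sum(y * np.log(p) + (n - y) * np.log(1.0 - p))

    res = minimize_scalar(neg_log_lik, bounds=(-10.0, 10.0), method="bounded")
    return float(np.exp(res.x))

# Hypothetical example: three dilution levels with 6 wells each
# iupm_mle([2.5e6, 0.5e6, 0.1e6], [6, 6, 6], [5, 2, 0])
```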
Subjects
HIV Infections, HIV-1, Humans, Virus Latency, HIV-1/genetics, CD4-Positive T-Lymphocytes, Computer Simulation, Viral Load
ABSTRACT
In this study, we develop a new method for the meta-analysis of mixed aggregate data (AD) and individual participant data (IPD). The method is an adaptation of inverse probability weighted targeted maximum likelihood estimation (IPW-TMLE), which was initially proposed for two-stage sampled data. Our method is motivated by a systematic review investigating treatment effectiveness for multidrug-resistant tuberculosis (MDR-TB), where the available data include IPD from some studies but only AD from others. One complication in this application is that participants with MDR-TB are typically treated with multiple antimicrobial agents, many of which were not observed in all studies considered in the meta-analysis. We focus here on the estimation of the expected potential outcome while intervening on a specific medication but not intervening on any others. Our method involves the implementation of a TMLE that transports the estimation from studies where the treatment is observed to the full target population. A second weighting component adjusts for the studies with missing (inaccessible) IPD. We demonstrate the properties of the proposed method and contrast it with alternative approaches in a simulation study. We finally apply the method to estimate treatment effectiveness in the MDR-TB case study.
Subjects
Multidrug-Resistant Tuberculosis, Humans, Likelihood Functions, Multidrug-Resistant Tuberculosis/drug therapy, Multidrug-Resistant Tuberculosis/epidemiology, Treatment Outcome, Computer Simulation
ABSTRACT
Tolerance intervals from quality attribute measurements are used to establish specification limits for drug products. Some attribute measurements may be below the reporting limits, that is, left-censored data. When the data have a long, right-skewed tail, a gamma distribution may be applicable. This paper compares maximum likelihood estimation (MLE) and Bayesian methods for estimating the shape and scale parameters of censored gamma distributions and for calculating tolerance intervals under varying sample sizes and extents of censoring. The noninformative reference prior and the maximal data information prior (MDIP) are used to compare the impact of the prior choice. The metrics used are bias and root mean square error for parameter estimation, and average length and confidence coefficient for the tolerance interval evaluation. It is shown that the Bayesian method using a reference prior overall performs better than MLE for the scenarios evaluated. When the sample size is small, the Bayesian method using the MDIP yields overly conservative, too-wide tolerance intervals that are an unsuitable basis for specification setting. The metrics for all methods worsened with increasing extent of censoring but improved with increasing sample size, as expected. This study demonstrates that although MLE is relatively simple and available in user-friendly statistical software, depending on the scenario it falls short of producing accurate and precise tolerance limits that maintain the stated confidence. The Bayesian method using a noninformative prior, even though it is computationally intensive and requires considerable statistical programming, produces tolerance limits that are practically useful for specification setting. Real-world examples are provided to illustrate the findings from the simulation study.
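To make the MLE arm of this comparison concrete, the sketch below shows maximum likelihood estimation for a gamma distribution with left-censored values: each result below the reporting limit contributes the gamma CDF at that limit to the likelihood. The interface is illustrative, and the tolerance-interval calculation itself is not shown.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

def censored_gamma_mle(observed, n_censored, reporting_limit):
    """MLE of gamma shape and scale with left-censored (below-limit) data.

    observed        : values reported above the reporting limit
    n_censored      : number of results reported only as "< reporting_limit"
    reporting_limit : the censoring threshold
    """
    x = np.asarray(observed, dtype=float)

    def neg_log_lik(params):
        shape, scale = np.exp(params)  # optimize on the log scale for positivity
        ll = gamma.logpdf(x, shape, scale=scale).sum()
        if n_censored > 0:
            # each censored observation contributes P(X < reporting_limit)
            ll += n_censored * gamma.logcdf(reporting_limit, shape, scale=scale)
        return -ll

    start = np.array([0.0, np.log(x.mean())])
    res = minimize(neg_log_lik, start, method="Nelder-Mead")
    shape_hat, scale_hat = np.exp(res.x)
    return shape_hat, scale_hat
```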
Subjects
Statistical Models, Software, Humans, Bayes Theorem, Limit of Detection, Computer Simulation
ABSTRACT
One basic limitation of using the periodogram as a frequency estimator is that any of its significant peaks may result from a diffuse (or spread) frequency component rather than a pure one. Diffuse components are common in applications such as channel estimation, in which a given periodogram peak reveals the presence of a complex multipath distribution (unresolvable propagation paths or diffuse scattering, for example). We present a method to detect the presence of a diffuse component in a given peak based on analyzing the projection of the data vector onto the span of the signature's derivatives up to a given order. Fundamentally, a diffuse component is detected if the energy in the derivatives' subspace is too high at the peak's frequency, and its spread is estimated as the ratio between this last energy and the peak's energy. The method is based on exploiting the signature's Vandermonde structure through the properties of discrete Chebyshev polynomials. We also present an efficient numerical procedure for computing the data component in the derivatives' span based on barycentric interpolation. The paper contains a numerical assessment of the proposed estimator and detector.
ABSTRACT
In this paper, we propose a method for the three-dimensional (3D) visualization of objects under photon-starved conditions using multiple observations and statistical estimation. To visualize 3D objects under these conditions, we use photon-counting integral imaging, which extracts photons from 3D objects according to a Poisson random process. However, this process may fail to reconstruct 3D images under severely photon-starved conditions because too few photons are recorded. To solve this problem, we propose N-observation photon-counting integral imaging with statistical estimation. Since photons are extracted randomly according to the Poisson distribution, increasing the number of photon samples improves the accuracy of photon extraction. In addition, 3D images can then be reconstructed using a statistical estimation method such as maximum likelihood estimation. To validate the proposed method, we performed an optical experiment and computed performance metrics, including the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), peak-to-correlation energy (PCE), and peak sidelobe ratio (PSR).
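The statistical estimation step rests on an elementary fact: for N independent Poisson photon-counting observations of the same scene, the maximum likelihood estimate of each pixel's intensity is the pixel-wise sample mean, which is why additional observations improve the reconstruction. The sketch below shows only that pooling step, not the integral-imaging reconstruction itself; the function name is illustrative.

```python
import numpy as np

def poisson_mle_intensity(photon_counts):
    """MLE of per-pixel intensity from N photon-counting observations.

    photon_counts : array of shape (N, H, W) with Poisson counts per pixel.
    For i.i.d. Poisson data the MLE of the mean is the sample mean, so pooling
    the N observations amounts to averaging counts pixel-wise.
    """
    counts = np.asarray(photon_counts, dtype=float)
    return counts.mean(axis=0)
```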
ABSTRACT
Atmospheric phase error is the main factor affecting the accuracy of ground-based synthetic aperture radar (GB-SAR). The atmospheric phase screen (APS) may be very complicated, so the atmospheric phase correction (APC) model is very important; in particular, the parameters to be estimated in the model are key to improving the accuracy of APC. However, the conventional APC method first performs phase unwrapping and then removes the APS using the least-squares method (LSM), and general phase unwrapping methods are prone to introducing unwrapping errors. In particular, the LSM is difficult to apply directly because of the phase wrapping of permanent scatterers (PSs). Therefore, this paper proposes a novel methodology for estimating the parameters of the APC model based on maximum likelihood estimation (MLE) and the Gauss-Newton algorithm. The MLE step provides a suitable objective function for the parameter estimation of the nonlinear far-end and near-end correction models. Then, based on the Gauss-Newton algorithm, the parameters of the objective function are iteratively estimated from suitable initial values, and the Matthews and Davies algorithm is used to optimize the Gauss-Newton iteration and improve the accuracy of parameter estimation. Finally, the parameter estimation performance is evaluated with Monte Carlo simulation experiments. The experiments verify the feasibility and superiority of the proposed method, which, unlike the conventional method, avoids phase unwrapping.
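The iterative estimation step can be sketched generically: given a residual function and Jacobian for whichever correction model is used (the far-end and near-end APC models from the paper are not reproduced here), a plain Gauss-Newton update proceeds as below. The Matthews and Davies refinement mentioned in the abstract is omitted, so this is only an assumed minimal version.

```python
import numpy as np

def gauss_newton(residual_fn, jacobian_fn, theta0, n_iter=50, tol=1e-10):
    """Minimize ||r(theta)||^2 by Gauss-Newton iteration.

    residual_fn : theta -> residual vector r(theta)
    jacobian_fn : theta -> Jacobian matrix dr/dtheta
    theta0      : initial guess (good initial values matter, as the paper notes)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        r = residual_fn(theta)
        J = jacobian_fn(theta)
        # Solve the normal equations J^T J * step = -J^T r for the update
        step = np.linalg.solve(J.T @ J, -J.T @ r)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```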
ABSTRACT
The added value of candidate predictors for risk modeling is routinely evaluated by comparing the performance of models with and without the candidate predictors. Such comparisons are most meaningful when the risks estimated by the two models are both unbiased in the target population. Very often, data for candidate predictors are sourced from nonrepresentative convenience samples. Updating the base model using the study data without acknowledging the discrepancy between the underlying distribution of the study data and that of the target population can lead to biased risk estimates and therefore an unfair evaluation of candidate predictors. To address this issue, assuming access to a well-calibrated base model, we propose a semiparametric method for model fitting that enforces good calibration. The central idea is to calibrate the fitted model against the base model by enforcing suitable constraints when maximizing the likelihood function. This approach enables unbiased assessment of the model improvement offered by candidate predictors without requiring a representative sample from the target population, thus overcoming a significant practical challenge. We study the theoretical properties of the model parameter estimates and demonstrate improvement in model calibration via extensive simulation studies. Finally, we apply the proposed method to data extracted from the Penn Medicine Biobank to assess the added value of breast density for breast cancer risk assessment among Caucasian women.
Subjects
Breast Neoplasms, Statistical Models, Humans, Likelihood Functions, Female, Computer Simulation, Risk Assessment/methods, Calibration
ABSTRACT
Semiparametric transformation models for failure time data consist of a parametric regression component and an unspecified cumulative baseline hazard. The nonparametric maximum likelihood estimator (NPMLE) of the cumulative baseline hazard can be summarized in terms of weights introduced into a Breslow-type estimator (Weighted Breslow). At any given time point, the weights invoke an integral over the future of the cumulative baseline hazard, which presents theoretical and computational challenges. A simpler non-MLE Breslow-type estimator (Breslow) was derived earlier from a martingale estimating equation (MEE) that sets observed and expected counts of failures equal, conditional on the past history. Despite much successful theoretical and computational development, the simpler Breslow estimator continues to be commonly used as a compromise between simplicity and a perceived loss of full efficiency. In this paper we derive the relative efficiency of the Breslow estimator and examine the properties of the two estimators using simulations and real data on prostate cancer survival.
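For reference, in the familiar Cox proportional hazards setting the unweighted Breslow estimator of the cumulative baseline hazard takes the form below, where t_i are the distinct failure times, d_i the number of failures at t_i, and R(t_i) the risk set just before t_i; the paper's Weighted Breslow NPMLE modifies this expression through weights that involve the future of the cumulative baseline hazard.

```latex
\[
\hat{\Lambda}_0(t) \;=\; \sum_{i:\, t_i \le t}
\frac{d_i}{\sum_{j \in R(t_i)} \exp\!\bigl(\mathbf{x}_j^{\top}\hat{\beta}\bigr)}.
\]
```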
Subjects
Prostatic Neoplasms, Male, Humans, Likelihood Functions
ABSTRACT
In studies with time-to-event outcomes, multiple, inter-correlated, and time-varying covariates are commonly observed. It is of great interest to model their joint effects by allowing a flexible functional form and to delineate their relative contributions to survival risk. The class of semiparametric transformation (ST) models offers flexible specifications of the intensity function and can serve as a general framework to accommodate nonlinear covariate effects. In this paper, we propose a partial-linear single-index (PLSI) transformation model that reduces the dimensionality of multiple covariates into a single index and provides interpretable estimates of the covariate effects. We develop an iterative algorithm using the regression spline technique to model the nonparametric single-index function for possibly nonlinear joint effects, followed by nonparametric maximum likelihood estimation. We also propose a nonparametric testing procedure to formally examine the linearity of covariate effects. We conduct Monte Carlo simulation studies to compare the PLSI transformation model with the standard ST model, and we apply it to de-identified NYU Langone Health electronic health record data on the mortality of hospitalized COVID-19 patients and to a Veterans Administration lung cancer trial.
Subjects
COVID-19, Lung Neoplasms, Monte Carlo Method, Humans, COVID-19/mortality, Lung Neoplasms/mortality, Likelihood Functions, Algorithms, Computer Simulation, Statistical Models, SARS-CoV-2, Linear Models, Survival Analysis
ABSTRACT
The aging intensity (AI), defined as the ratio of the instantaneous hazard rate to a baseline hazard rate, is a useful tool for describing the reliability properties of a random variable corresponding to a lifetime. In this work, the concept of AI is introduced into step-stress accelerated life testing (SSALT) experiments, providing new insights into the model and further clarifying the differences between the two commonly employed models, the cumulative exposure (CE) and tampered failure rate (TFR) models. New AI-based estimators for the parameters of a SSALT model are proposed and compared with the MLEs through examples and a simulation study.
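For orientation, one classical definition of the aging intensity takes the time-averaged hazard as the baseline, in which case it reads as below (with h the hazard rate and H the cumulative hazard); the SSALT-specific AI used in the paper generalizes the choice of baseline hazard.

```latex
\[
\mathrm{AI}(t) \;=\; \frac{h(t)}{\frac{1}{t}\int_0^t h(u)\,du} \;=\; \frac{t\,h(t)}{H(t)}.
\]
```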
ABSTRACT
BACKGROUND: Gay and bisexual men (GBM) are at increased risk of human papillomavirus (HPV)-associated anal high-grade squamous intraepithelial lesions (HSILs). Understanding the fractions of HSILs attributable to HPV genotypes is important to inform the potential impacts of screening and vaccination strategies. However, multiple infections are common, making attribution of causative types difficult. Algorithms developed for predicting HSIL-causative genotype fractions have never been compared with a reference standard in GBM. METHODS: Samples were from the Study of the Prevention of Anal Cancer. Baseline HPV genotypes detected in anal swab samples (160 participants) were compared with HPV genotypes in anal HSILs (222 lesions) determined by laser capture microdissection (LCM). Five algorithms were compared: proportional, hierarchical, maximum, minimum, and maximum likelihood estimation. RESULTS: All algorithms predicted HPV-16 as the most common HSIL-causative genotype, and the proportions differed from LCM detection (37.8%) by algorithm (with differences of -6.1%, +20.9%, -20.4%, +2.9%, and +2.2%, respectively). Fractions predicted using the proportional method showed a strong positive correlation with LCM, overall (R = 0.73, P = .002) and by human immunodeficiency virus (HIV) status (HIV-positive: R = 0.74, P = .001; HIV-negative: R = 0.68, P = .005). CONCLUSIONS: The algorithms produced a range of inaccurate estimates of HSIL attribution, with the proportional algorithm performing best. The high occurrence of multiple HPV infections means that these algorithms may be of limited use in GBM.
Subjects
Anus Neoplasms, HIV Infections, HIV Seropositivity, Papillomavirus Infections, Squamous Intraepithelial Lesions, Male, Humans, Human Papillomavirus, Male Homosexuality, Papillomavirus Infections/epidemiology, Genotype, Anus Neoplasms/diagnosis, Papillomaviridae/genetics, HIV Infections/complications
ABSTRACT
Mixed evidence exists of associations between mobility data and coronavirus disease 2019 (COVID-19) case rates. We aimed to evaluate the county-level impact of reducing mobility on new COVID-19 cases in summer/fall of 2020 in the United States and to demonstrate modified treatment policies to define causal effects with continuous exposures. Specifically, we investigated the impact of shifting the distribution of 10 mobility indexes on the number of newly reported cases per 100,000 residents 2 weeks ahead. Primary analyses used targeted minimum loss-based estimation with Super Learner to avoid parametric modeling assumptions during statistical estimation and flexibly adjust for a wide range of confounders, including recent case rates. We also implemented unadjusted analyses. For most weeks, unadjusted analyses suggested strong associations between mobility indexes and subsequent new case rates. However, after confounder adjustment, none of the indexes showed consistent associations under mobility reduction. Our analysis demonstrates the utility of this novel distribution-shift approach to defining and estimating causal effects with continuous exposures in epidemiology and public health.
Subjects
COVID-19, Health Policy, Local Government, Humans, Causality, COVID-19/epidemiology, Public Health, United States/epidemiology, Machine Learning, Public Policy
ABSTRACT
BACKGROUND: Low-value healthcare is costly and inefficient and may adversely affect patient outcomes. Despite increases in low-value service use, little is known about how the receipt of low-value care differs across payers. OBJECTIVE: To evaluate differences in the use of low-value care between patients with commercial versus Medicaid coverage. DESIGN: Retrospective observational analysis of the 2017 Rhode Island All-Payer Claims Database, estimating the probability of receiving each of 14 low-value services for commercial and Medicaid enrollees, adjusting for patient sociodemographic and clinical characteristics. Ensemble machine learning minimized the possibility of model misspecification. PARTICIPANTS: Medicaid and commercial enrollees aged 18-64 with continuous coverage and an encounter at which they were at risk of receiving a low-value service. INTERVENTION: Enrollment in Medicaid or commercial insurance. MAIN MEASURES: Use of one of 14 validated measures of low-value care. KEY RESULTS: Among 110,609 patients, Medicaid enrollees were younger, had more comorbidities, and were more likely to be female than commercial enrollees. Medicaid enrollees had higher rates of use for 7 low-value care measures, and those with commercial coverage had higher rates for 5 measures. Across all measures of low-value care, commercial enrollees received more low-value services than their counterparts with Medicaid (risk difference [RD] 6.8 percentage points; CI: 6.6 to 7.0). Commercial enrollees were also more likely to receive low-value services typically performed in the emergency room (RD 11.4 percentage points; CI: 10.7 to 12.2) and services that were less expensive (RD 15.3 percentage points; CI: 14.6 to 16.0). CONCLUSION: Differences in the provision of low-value care varied across measures, although average use was slightly higher among commercial than Medicaid enrollees. This difference was more pronounced for less expensive services, indicating that financial incentives may not be the sole driver of low-value care.
Subjects
Low-Value Care, Medicaid, United States/epidemiology, Humans, Female, Male, Retrospective Studies, Delivery of Health Care, Rhode Island
ABSTRACT
Recent evidence suggests that nongenetic (epigenetic) mechanisms play an important role at all stages of cancer evolution. In many cancers, these mechanisms have been observed to induce dynamic switching between two or more cell states, which commonly show differential responses to drug treatments. To understand how these cancers evolve over time, and how they respond to treatment, we need to understand the state-dependent rates of cell proliferation and phenotypic switching. In this work, we propose a rigorous statistical framework for estimating these parameters, using data from commonly performed cell line experiments, where phenotypes are sorted and expanded in culture. The framework explicitly models the stochastic dynamics of cell division, cell death and phenotypic switching, and it provides likelihood-based confidence intervals for the model parameters. The input data can be either the fraction of cells or the number of cells in each state at one or more time points. Through a combination of theoretical analysis and numerical simulations, we show that when cell fraction data is used, the rates of switching may be the only parameters that can be estimated accurately. On the other hand, using cell number data enables accurate estimation of the net division rate for each phenotype, and it can even enable estimation of the state-dependent rates of cell division and cell death. We conclude by applying our framework to a publicly available dataset.
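To illustrate the kind of model such a framework builds on, consider the mean dynamics of a two-phenotype population with state-dependent birth, death, and switching rates: the expected cell numbers evolve linearly and can be evaluated with a matrix exponential. The sketch below shows only these mean dynamics with assumed parameter names; it is not the paper's full stochastic likelihood machinery or its confidence-interval construction.

```python
import numpy as np
from scipy.linalg import expm

def expected_cell_numbers(n0, t, birth, death, switch_ab, switch_ba):
    """Expected counts in a two-phenotype birth-death-switching model.

    n0        : initial counts [n_A, n_B]
    t         : elapsed time
    birth     : per-state birth rates [b_A, b_B]
    death     : per-state death rates [d_A, d_B]
    switch_ab : switching rate A -> B;  switch_ba : switching rate B -> A

    The mean m(t) of the branching process solves dm/dt = A^T m, where A has
    net growth minus outgoing switching on the diagonal and switching rates
    off the diagonal; hence m(t) = expm(A^T t) m(0).
    """
    A = np.array([
        [birth[0] - death[0] - switch_ab, switch_ab],
        [switch_ba, birth[1] - death[1] - switch_ba],
    ])
    return expm(A.T * t) @ np.asarray(n0, dtype=float)
```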