Results 1 - 20 of 61
1.
Stat Med ; 43(2): 419-434, 2024 01 30.
Article in English | MEDLINE | ID: mdl-37994214

ABSTRACT

Accurate assessment of the mean-variance relation can benefit subsequent analysis in biomedical research. However, in most biomedical data, both the true mean and the true variance are unavailable. Instead, raw data are typically used to allow forming sample mean and sample variance in practice. In addition, different experimental conditions sometimes cause a slightly different mean-variance relation from the majority of the data in the same data set. To address these issues, we propose a semiparametric estimator, where we treat the uncertainty in the sample mean as a measurement error problem, the uncertainty in the sample variance as model error, and use a mixture model to account for different mean-variance relations. Asymptotic normality of the proposed method is established and its finite sample properties are demonstrated by simulation studies. The data application shows that the proposed method produces sensible results compared with methods either ignoring the uncertainty in the sample means or ignoring the potential different mean-variance relations.


Subjects
Statistical Models , Humans , Computer Simulation , Uncertainty
2.
Ann Stat ; 51(1): 233-259, 2023 Feb.
Article in English | MEDLINE | ID: mdl-37602147

ABSTRACT

We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment the target function with an amenable penalty term. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members of the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.

3.
Stat Med ; 42(18): 3145-3163, 2023 08 15.
Article in English | MEDLINE | ID: mdl-37458069

ABSTRACT

Expression quantitative trait loci (eQTL) studies utilize regression models to explain the variance of gene expressions with genetic loci or single nucleotide polymorphisms (SNPs). However, regression models for eQTL are challenged by the presence of high dimensional non-sparse and correlated SNPs with small effects, and nonlinear relationships between responses and SNPs. Principal component analyses are commonly conducted for dimension reduction without considering responses. Because of that, this non-supervised learning method often does not work well when the focus is on discovery of the response-covariate relationship. We propose a new supervised structural dimensional reduction method for semiparametric regression models with high dimensional and correlated covariates; we extract low-dimensional latent features from a vast number of correlated SNPs while accounting for their relationships, possibly nonlinear, with gene expressions. Our model identifies important SNPs associated with gene expressions and estimates the association parameters via a likelihood-based algorithm. A GTEx data application on a cancer related gene is presented with 18 novel eQTLs detected by our method. In addition, extensive simulations show that our method outperforms the other competing methods in bias, efficiency, and computational cost.


Subjects
Single Nucleotide Polymorphism , Quantitative Trait Loci , Humans , Quantitative Trait Loci/genetics , Likelihood Functions , Genome-Wide Association Study/methods
4.
Biometrics ; 79(4): 2974-2986, 2023 12.
Article in English | MEDLINE | ID: mdl-36632649

ABSTRACT

Identifying a patient's disease/health status from electronic medical records is a frequently encountered task in electronic health records (EHR) related research, and estimation of a classification model often requires a benchmark training data with patients' known phenotype statuses. However, assessing a patient's phenotype is costly and labor intensive, hence a proper selection of EHR records as a training set is desired. We propose a procedure to tailor the best training subsample with limited sample size for a classification model, minimizing its mean-squared phenotyping/classification error (MSE). Our approach incorporates "positive only" information, an approximation of the true disease status without false alarm, when it is available. In addition, our sampling procedure is applicable for training a chosen classification model which can be misspecified. We provide theoretical justification on its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real-data example, and is found often satisfactory under criteria beyond MSE.


Subjects
Electronic Health Records , Humans , Phenotype
5.
Biometrics ; 79(3): 2023-2035, 2023 09.
Article in English | MEDLINE | ID: mdl-35841231

ABSTRACT

We consider analyses of case-control studies assembled from electronic health records (EHRs) where the pool of cases is contaminated by patients who are ineligible for the study. These ineligible patients, referred to as "false cases," should be excluded from the analyses if known. However, the true outcome status of a patient in the case pool is unknown except in a subset whose size may be arbitrarily small compared to the entire pool. To effectively remove the influence of the false cases on estimating odds ratio parameters defined by a working association model of the logistic form, we propose a general strategy to adaptively impute the unknown case status without requiring a correct phenotyping model to help discern the true and false case statuses. Our method estimates the target parameters as the solution to a set of unbiased estimating equations constructed using all available data. It outperforms existing methods by achieving robustness to mismodeling the relationship between the outcome status and covariates of interest, as well as improved estimation efficiency. We further show that our estimator is root-n-consistent and asymptotically normal. Through extensive simulation studies and analysis of real EHR data, we demonstrate that our method has desirable robustness to possible misspecification of both the association and phenotyping models, along with statistical efficiency superior to the competitors.


Subjects
Electronic Health Records , Statistical Models , Humans , Computer Simulation , Case-Control Studies
6.
Lifetime Data Anal ; 28(3): 428-491, 2022 07.
Article in English | MEDLINE | ID: mdl-35753014

ABSTRACT

Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes such as time to cancer progression is not readily available in these databases. The true clinical event times typically cannot be approximated well based on simple extracts of billing or procedure codes. Moreover, annotating event times manually is prohibitively time- and resource-intensive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In step I, we employ a functional principal component analysis approach to estimate the underlying intensity functions based on observed point processes from the unlabeled patients. In step II, we fit a penalized proportional odds model to the event time outcomes with features derived in step I in the labeled data, where the non-parametric baseline function is approximated using B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach relative to existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.


Subjects
Electronic Health Records , Lung Neoplasms , Algorithms , Factual Databases , Humans , Lung Neoplasms/therapy , Local Neoplasm Recurrence
7.
Biom J ; 64(6): 1007-1022, 2022 08.
Article in English | MEDLINE | ID: mdl-35524713

ABSTRACT

We propose a two-way additive model with group-specific interactions, where the group information is unknown. We treat the group membership as latent information and propose an EM algorithm for estimation. With a single observation matrix and under the situation of diverging row and column numbers, we rigorously establish the estimation consistency and asymptotic normality of our estimator. Extensive simulation studies are conducted to demonstrate the finite sample performance. We apply the model to the triple negative breast cancer (TNBC) gene expression data and provide a new way to classify patients into different subtypes. Our analysis detects the potential genes that may be associated with TNBC.


Subjects
Triple Negative Breast Neoplasms , Algorithms , Computer Simulation , Gene Expression , Humans , Triple Negative Breast Neoplasms/genetics
8.
Biostatistics ; 23(3): 844-859, 2022 07 18.
Article in English | MEDLINE | ID: mdl-33616157

ABSTRACT

Validation of phenotyping models using Electronic Health Records (EHRs) data conventionally requires gold-standard case and control labels. The labeling process requires clinical experts to retrospectively review patients' medical charts, therefore is labor intensive and time consuming. For some disease conditions, it is prohibitive to identify the gold-standard controls because routine clinical assessments are performed for selective patients who are deemed to possibly have the condition. To build a model for phenotyping patients in EHRs, the most readily accessible data are often for a cohort consisting of a set of gold-standard cases and a large number of unlabeled patients. Hereby, we propose methods for assessing model calibration and discrimination using such "positive-only" EHR data that does not require gold-standard controls, provided that the labeled cases are representative of all cases. For model calibration, we propose a novel statistic that aggregates differences between model-free and model-based estimated numbers of cases across risk subgroups, which asymptotically follows a Chi-squared distribution. We additionally demonstrate that the calibration slope can also be estimated using such "positive-only" data. We propose consistent estimators for discrimination measures and derive their large sample properties. We demonstrate performances of the proposed methods through extensive simulation studies and apply them to Penn Medicine EHRs to validate two preliminary models for predicting the risk of primary aldosteronism.


Subjects
Algorithms , Electronic Health Records , Calibration , Humans , Phenotype , Retrospective Studies
9.
Stat Sin ; 31(2): 821-842, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34526756

ABSTRACT

When estimating the treatment effect in an observational study, we use a semiparametric locally efficient dimension reduction approach to assess both the treatment assignment mechanism and the average responses in both treated and non-treated groups. We then integrate all results through imputation, inverse probability weighting and double robust augmentation estimators. Double robust estimators are locally efficient while imputation estimators are super-efficient when the response models are correct. To take advantage of both procedures, we introduce a shrinkage estimator to automatically combine the two, which retains the double robustness property while improving on the variance when the response model is correct. We demonstrate the performance of these estimators through simulated experiments and a real dataset concerning the effect of maternal smoking on baby birth weight.
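
The doubly robust (augmented inverse probability weighting) idea named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' semiparametric dimension reduction procedure or their shrinkage estimator: the function name `aipw_ate` and all inputs are hypothetical, and the propensity scores `e` and the outcome-model predictions `m1` and `m0` are assumed to come from some earlier fitting step.

```python
from typing import Sequence


def aipw_ate(
    y: Sequence[float],   # observed responses
    t: Sequence[int],     # treatment indicators (1 = treated, 0 = not treated)
    e: Sequence[float],   # assumed propensity scores P(T = 1 | X), strictly in (0, 1)
    m1: Sequence[float],  # assumed predictions of E[Y | X, T = 1]
    m0: Sequence[float],  # assumed predictions of E[Y | X, T = 0]
) -> float:
    """Augmented inverse probability weighted estimate of the average treatment effect."""
    total = 0.0
    for yi, ti, ei, m1i, m0i in zip(y, t, e, m1, m0):
        # Outcome-model prediction plus an inverse-probability-weighted residual.
        treated = m1i + ti * (yi - m1i) / ei
        control = m0i + (1 - ti) * (yi - m0i) / (1 - ei)
        total += treated - control
    return total / len(y)
```

If the outcome predictions are correct the weighted residual terms average out, and if the propensity scores are correct the weighting compensates for a wrong outcome model; this is the double robustness property the abstract refers to.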

10.
Econom J ; 24(1): 177-197, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33746562

ABSTRACT

In this paper, we develop a model averaging method to estimate a high-dimensional covariance matrix, where the candidate models are constructed by different orders of polynomial functions. We propose a Mallows-type model averaging criterion and select the weights by minimizing this criterion, which is an unbiased estimator of the expected in-sample squared error plus a constant. Then, we prove the asymptotic optimality of the resulting model average covariance estimators. Finally, we conduct numerical simulations and a case study on Chinese airport network structure data to demonstrate the usefulness of the proposed approaches.

11.
Article in English | MEDLINE | ID: mdl-32153310

ABSTRACT

Field studies in ecology often make use of data collected in a hierarchical fashion, and may combine studies that vary in sampling design. For example, studies of tree recruitment after disturbance may use counts of individual seedlings from plots that vary in spatial arrangement and sampling density. To account for the multi-level design and the fact that more than a few plots usually yield no individuals, a mixed effects zero inflated Poisson model is often adopted. Although it is a convenient modeling strategy, various aspects of the model could be misspecified. A comprehensive test procedure, based on the cumulative sum of the residuals, is proposed. The test is proven to be consistent, and its convergence properties are established as well. The application of the proposed test is illustrated by a real data example and simulation studies.
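
For readers unfamiliar with the zero inflated Poisson distribution mentioned in the abstract above, a minimal Python sketch of its probability mass function follows. It leaves out the random effects and the cumulative-residual test that the paper is about; the name `zip_pmf` and the parameters `lam` (Poisson mean) and `pi` (probability of a structural zero) are illustrative assumptions.

```python
import math


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """P(Y = k) for a zero inflated Poisson variable.

    With probability `pi` the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean `lam`.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise either structurally or from the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson
```

The extra mass at zero is what lets the model accommodate the many empty plots described in the abstract.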

12.
J Am Med Inform Assoc ; 27(1): 119-126, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31722396

ABSTRACT

OBJECTIVE: Phenotyping patients using electronic health record (EHR) data conventionally requires labeled cases and controls. Assigning labels requires manual medical chart review and therefore is labor intensive. For some phenotypes, identifying gold-standard controls is prohibitive. We developed an accurate EHR phenotyping approach that does not require labeled controls. MATERIALS AND METHODS: Our framework relies on a random subset of cases, which can be specified using an anchor variable that has excellent positive predictive value and sensitivity independent of predictors. We proposed a maximum likelihood approach that efficiently leverages data from the specified cases and unlabeled patients to develop logistic regression phenotyping models, and compare model performance with existing algorithms. RESULTS: Our method outperformed the existing algorithms on predictive accuracy in Monte Carlo simulation studies, application to identify hypertension patients with hypokalemia requiring oral supplementation using a simulated anchor, and application to identify primary aldosteronism patients using real-world cases and anchor variables. Our method additionally generated consistent estimates of 2 important parameters, phenotype prevalence and the proportion of true cases that are labeled. DISCUSSION: Upon identification of an anchor variable that is scalable and transferable to different practices, our approach should facilitate development of scalable, transferable, and practice-specific phenotyping models. CONCLUSIONS: Our proposed approach enables accurate semiautomated EHR phenotyping with minimal manual labeling and therefore should greatly facilitate EHR clinical decision support and research.


Subjects
Algorithms , Electronic Health Records/classification , Likelihood Functions , Humans , Monte Carlo Method
13.
Biometrics ; 76(4): 1340-1350, 2020 12.
Article in English | MEDLINE | ID: mdl-31860141

ABSTRACT

High-dimensional gene expression data often exhibit intricate correlation patterns as the result of coordinated genetic regulation. In practice, however, it is difficult to directly measure these coordinated underlying activities. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes. Compared to existing approaches, our proposed procedure has several unique characteristics. In the first stage, an important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. Second, to characterize the molecular heterogeneity of breast cancer, our approach provides estimates specific to each cancer subtype. Finally, our proposed framework incorporates a sparsity condition, because genetic networks are often sparse. In the second stage, we investigate the relationship between latent factor activity levels and survival time with censoring using a general dimension reduction model in the survival analysis context. Combining the factor model and sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data.


Subjects
Breast Neoplasms , Breast Neoplasms/genetics , Female , Gene Regulatory Networks , Humans , Survival Analysis
14.
J Multivar Anal ; 173: 38-50, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31680705

ABSTRACT

Case-control studies are popular epidemiological designs for detecting gene-environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene-environment independence to achieve better efficiency, all of which require either a rare disease assumption or a distributional assumption on the genetic variables. We relax both assumptions. We construct a semiparametric estimator in case-control studies exploiting gene-environment independence, while the distributions of genetic susceptibility and environmental exposures are both unspecified and the disease rate is assumed unknown and is not required to be close to zero. The resulting estimator is semiparametric efficient and its superiority over prospective logistic regression, the usual analysis in case-control studies, is demonstrated in various numerical illustrations.

15.
Can J Stat ; 47(2): 140-156, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31274953

ABSTRACT

We propose a consistent and locally efficient estimator to estimate the model parameters for a logistic mixed effect model with random slopes. Our approach relaxes two typical assumptions: the random effects being normally distributed, and the covariates and random effects being independent of each other. Adhering to these assumptions is particularly difficult in health studies where in many cases we have limited resources to design experiments and gather data in long-term studies, while new findings from other fields might emerge, suggesting the violation of such assumptions. It is therefore crucial to have an estimator that is robust to such violations, so that we can make better use of current data harvested using various valuable resources. Our method generalizes the framework presented in Garcia & Ma (2016), which also deals with a logistic mixed effect model but only considers a random intercept. A simulation study reveals that our proposed estimator remains consistent even when the independence and normality assumptions are violated. This contrasts with the traditional maximum likelihood estimator, which is likely to be inconsistent when there is dependence between the covariates and random effects. Application of this work to a Huntington disease study reveals that disease diagnosis can be further improved using assessments of cognitive performance.

16.
J Multivar Anal ; 171: 320-338, 2019 May.
Article in English | MEDLINE | ID: mdl-30799885

ABSTRACT

Covariate measurement error is a common problem. Improper treatment of measurement errors may affect the quality of estimation and the accuracy of inference. Extensive literature exists on homoscedastic measurement error models, but little research exists on heteroscedastic measurement. In this paper, we consider a general parametric regression model allowing for a covariate measured with heteroscedastic error. We allow both the variance function of the measurement errors and the conditional density function of the error-prone covariate given the error-free covariates to be completely unspecified. We treat the variance function using B-spline approximation and propose a semiparametric estimator based on efficient score functions to deal with the heteroscedasticity of the measurement error. The resulting estimator is consistent and enjoys good inference properties. Its finite-sample performance is demonstrated through simulation studies and a real data example.

17.
J R Stat Soc Series B Stat Methodol ; 81(4): 763-779, 2019 Sep.
Article in English | MEDLINE | ID: mdl-32863735

ABSTRACT

We develop model averaging estimation in the linear regression model where some covariates are subject to measurement error. The absence of the true covariates in this framework makes the calculation of the standard residual-based loss function impossible. We take advantage of the explicit form of the parameter estimators and construct a weight choice criterion. It is asymptotically equivalent to the unknown model average estimator minimizing the loss function. When the true model is not included in the set of candidate models, the method achieves optimality in terms of minimizing the relative loss, whereas, when the true model is included, the method estimates the model parameter with root n rate. Simulation results in comparison with existing Bayesian information criterion and Akaike information criterion model selection and model averaging methods strongly favour our model averaging method. The method is applied to a study on health.

18.
J R Stat Soc Series B Stat Methodol ; 80(4): 625-648, 2018 Sep.
Article in English | MEDLINE | ID: mdl-30337833

ABSTRACT

Analysing secondary outcomes is a common practice for case-control studies. Traditional secondary analysis employs either completely parametric models or conditional mean regression models to link the secondary outcome to covariates. In many situations, quantile regression models complement mean-based analyses and provide alternative new insights on the associations of interest. For example, biomedical outcomes are often highly asymmetric, and median regression is more useful in describing the 'central' behaviour than mean regressions. There are also cases where the research interest is to study the high or low quantiles of a population, as they are more likely to be at risk. We approach the secondary quantile regression problem from a semiparametric perspective, allowing the covariate distribution to be completely unspecified. We derive a class of consistent semiparametric estimators and identify the efficient member. The asymptotic properties of the resulting estimators are established. Simulation results and a real data analysis are provided to demonstrate the superior performance of our approach with a comparison with the only existing approach so far in the literature.

19.
Electron J Stat ; 12(1): 1782-1821, 2018.
Article in English | MEDLINE | ID: mdl-30100949

ABSTRACT

Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment and semi-parametric efficient estimation generates various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits with the specified primary and secondary models in the original problem setting, and use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against the misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.

20.
Biometrics ; 74(3): 910-923, 2018 09.
Article in English | MEDLINE | ID: mdl-29441521

ABSTRACT

The problem of estimating the average treatment effects is important when evaluating the effectiveness of medical treatments or social intervention policies. Most of the existing methods for estimating the average treatment effect rely on some parametric assumptions about the propensity score model or the outcome regression model one way or the other. In reality, both models are prone to misspecification, which can have undue influence on the estimated average treatment effect. We propose an alternative robust approach to estimating the average treatment effect based on observational data in the challenging situation when neither a plausible parametric outcome model nor a reliable parametric propensity score model is available. Our estimator can be considered as a robust extension of the popular class of propensity score weighted estimators. This approach has the advantage of being robust, flexible, data adaptive, and it can handle many covariates simultaneously. Adopting a dimension reduction approach, we estimate the propensity score weights semiparametrically by using a non-parametric link function to relate the treatment assignment indicator to a low-dimensional structure of the covariates which are formed typically by several linear combinations of the covariates. We develop a class of consistent estimators for the average treatment effect and study their theoretical properties. We demonstrate the robust performance of the estimators on simulated data and a real data example of investigating the effect of maternal smoking on babies' birth weight.


Subjects
Statistical Models , Propensity Score , Statistics as Topic/methods , Birth Weight , Computer Simulation , Female , Humans , Maternal-Fetal Exchange , Pregnancy , Smoking , Treatment Outcome