Results 1 - 20 of 61
1.
Stat Med ; 43(2): 419-434, 2024 01 30.
Article in English | MEDLINE | ID: mdl-37994214

ABSTRACT

Accurate assessment of the mean-variance relation can benefit subsequent analysis in biomedical research. However, in most biomedical data, both the true mean and the true variance are unavailable; instead, the sample mean and sample variance are computed from the raw data. In addition, different experimental conditions sometimes induce a mean-variance relation that departs slightly from that of the majority of the data in the same data set. To address these issues, we propose a semiparametric estimator in which we treat the uncertainty in the sample mean as a measurement error problem, treat the uncertainty in the sample variance as model error, and use a mixture model to account for different mean-variance relations. Asymptotic normality of the proposed method is established, and its finite sample properties are demonstrated by simulation studies. The data application shows that the proposed method produces sensible results compared with methods that ignore either the uncertainty in the sample means or the potentially different mean-variance relations.
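A naive sketch of the ingredients in Python (it ignores the measurement-error correction and the mixture component that are the paper's actual contributions; all names and the power-law form variance = a * mean**b are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Simulated raw data: 500 features, 6 replicates each, generated with a
# power-law mean-variance relation (variance = a * mean ** b).
n_features, n_reps = 500, 6
true_means = rng.uniform(1.0, 20.0, size=n_features)
true_vars = 0.5 * true_means ** 1.5
raw = rng.normal(true_means[:, None], np.sqrt(true_vars)[:, None],
                 size=(n_features, n_reps))

# Sample means and variances -- both are noisy proxies of the truth.
m = raw.mean(axis=1)
v = raw.var(axis=1, ddof=1)

# Naive fit of log(v) = log(a) + b * log(m) by least squares, ignoring
# the uncertainty in m (the paper treats that as a measurement-error
# problem and allows a mixture of mean-variance relations instead).
X = np.column_stack([np.ones_like(m), np.log(m)])
coef, *_ = np.linalg.lstsq(X, np.log(v), rcond=None)
print(f"estimated a = {np.exp(coef[0]):.3f}, b = {coef[1]:.3f}")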


Subject(s)
Statistical Models, Humans, Computer Simulation, Uncertainty
2.
Biostatistics ; 23(3): 844-859, 2022 07 18.
Article in English | MEDLINE | ID: mdl-33616157

ABSTRACT

Validation of phenotyping models using Electronic Health Records (EHR) data conventionally requires gold-standard case and control labels. The labeling process requires clinical experts to retrospectively review patients' medical charts and is therefore labor intensive and time consuming. For some disease conditions, it is prohibitive to identify gold-standard controls because routine clinical assessments are performed only for selected patients deemed to possibly have the condition. To build a model for phenotyping patients in EHRs, the most readily accessible data are often a cohort consisting of a set of gold-standard cases and a large number of unlabeled patients. We therefore propose methods for assessing model calibration and discrimination using such "positive-only" EHR data, without gold-standard controls, provided that the labeled cases are representative of all cases. For model calibration, we propose a novel statistic that aggregates differences between model-free and model-based estimated numbers of cases across risk subgroups and asymptotically follows a chi-squared distribution. We additionally demonstrate that the calibration slope can also be estimated from such "positive-only" data. We propose consistent estimators for discrimination measures and derive their large sample properties. We demonstrate the performance of the proposed methods through extensive simulation studies and apply them to Penn Medicine EHRs to validate two preliminary models for predicting the risk of primary aldosteronism.
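As a point of reference, the fully labeled analogue of the proposed calibration check groups patients by predicted risk and compares model-free and model-based case counts; the paper's contribution is making this work when only cases are labeled. The Python sketch below is the simpler, fully labeled Hosmer-Lemeshow-style version, with the chi-squared degrees of freedom appropriate when the risks are supplied externally; variable names are illustrative:

import numpy as np
from scipy.stats import chi2

def calibration_chi2(y, p, n_groups=10):
    """Compare observed and expected case counts within risk subgroups.

    y: 0/1 outcomes, p: predicted risks. This is the classical fully
    labeled check, not the positive-only statistic proposed in the paper.
    """
    order = np.argsort(p)
    stat = 0.0
    for g in np.array_split(order, n_groups):
        obs = y[g].sum()                 # model-free count of cases
        exp = p[g].sum()                 # model-based expected count
        var = (p[g] * (1 - p[g])).sum()  # binomial variance
        stat += (obs - exp) ** 2 / max(var, 1e-12)
    return stat, chi2.sf(stat, n_groups)

# Toy usage with a well calibrated model.
rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.5, size=5000)
y = rng.binomial(1, p)
print(calibration_chi2(y, p))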


Subject(s)
Algorithms, Electronic Health Records, Calibration, Humans, Phenotype, Retrospective Studies
3.
Biometrics ; 79(3): 2023-2035, 2023 09.
Article in English | MEDLINE | ID: mdl-35841231

ABSTRACT

We consider analyses of case-control studies assembled from electronic health records (EHRs) where the pool of cases is contaminated by patients who are ineligible for the study. These ineligible patients, referred to as "false cases," should be excluded from the analyses if known. However, the true outcome status of a patient in the case pool is unknown except in a subset whose size may be arbitrarily small compared to the entire pool. To effectively remove the influence of the false cases on estimating odds ratio parameters defined by a working association model of the logistic form, we propose a general strategy to adaptively impute the unknown case status without requiring a correct phenotyping model to discern the true and false case statuses. Our method estimates the target parameters as the solution to a set of unbiased estimating equations constructed using all available data. It outperforms existing methods by being robust to mismodeling of the relationship between the outcome status and the covariates of interest, while also improving estimation efficiency. We further show that our estimator is root-n-consistent and asymptotically normal. Through extensive simulation studies and analysis of real EHR data, we demonstrate that our method has desirable robustness to possible misspecification of both the association and phenotyping models, along with statistical efficiency superior to its competitors.
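A crude baseline for this setting (not the paper's adaptive estimating-equation method, which avoids relying on a correct phenotyping model) is to fit a phenotyping model on the validated subset, impute each unvalidated pool member's probability of being a true case, and solve a weighted logistic regression. The inputs and names in the Python sketch below are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_imputed_logistic(X_case, Z_case, val_idx, y_val, X_ctrl):
    """X_case/Z_case: outcome-model covariates and phenotyping features
    for the case pool; val_idx, y_val: validated subset and its true/false
    case labels; X_ctrl: covariates for the controls."""
    # Step 1: phenotyping model fitted on the validated subset only.
    pheno = LogisticRegression(max_iter=1000).fit(Z_case[val_idx], y_val)

    # Step 2: imputed probability of being a true case for every pool
    # member; validated members keep their known status.
    w_case = pheno.predict_proba(Z_case)[:, 1]
    w_case[val_idx] = y_val

    # Step 3: weighted logistic regression of the outcome on covariates.
    # Case-pool members enter as cases with weight = P(true case); the
    # false-case mass is dropped (ineligible patients are not controls).
    X = np.vstack([X_case, X_ctrl])
    y = np.concatenate([np.ones(len(X_case)), np.zeros(len(X_ctrl))])
    w = np.concatenate([w_case, np.ones(len(X_ctrl))])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

This baseline inherits bias whenever the phenotyping model is wrong, which is exactly the failure mode the proposed method is designed to guard against.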


Subject(s)
Electronic Health Records, Statistical Models, Humans, Computer Simulation, Case-Control Studies
4.
Biometrics ; 79(4): 2974-2986, 2023 12.
Article in English | MEDLINE | ID: mdl-36632649

ABSTRACT

Identifying a patient's disease/health status from electronic medical records is a frequently encountered task in electronic health records (EHR) research, and estimation of a classification model often requires a benchmark training data set with patients' known phenotype statuses. However, assessing a patient's phenotype is costly and labor intensive, so a careful selection of EHR records for the training set is desirable. We propose a procedure that tailors the best training subsample of limited size for a chosen classification model, minimizing its mean squared phenotyping/classification error (MSE). Our approach incorporates "positive-only" information, an approximation of the true disease status without false alarms, when it is available. In addition, our sampling procedure remains applicable when the chosen classification model is misspecified. We provide theoretical justification of its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real-data example, and is often satisfactory even under criteria beyond MSE.


Subject(s)
Electronic Health Records, Humans, Phenotype
5.
Stat Med ; 42(18): 3145-3163, 2023 08 15.
Article in English | MEDLINE | ID: mdl-37458069

ABSTRACT

Expression quantitative trait loci (eQTL) studies use regression models to explain the variance of gene expressions with genetic loci or single nucleotide polymorphisms (SNPs). However, regression models for eQTL are challenged by the presence of high dimensional, non-sparse, and correlated SNPs with small effects, and by nonlinear relationships between responses and SNPs. Principal component analysis is commonly used for dimension reduction without considering the responses; as an unsupervised learning method, it often does not work well when the focus is on discovering the response-covariate relationship. We propose a new supervised structural dimension reduction method for semiparametric regression models with high dimensional and correlated covariates: we extract low-dimensional latent features from a vast number of correlated SNPs while accounting for their relationships, possibly nonlinear, with gene expressions. Our model identifies important SNPs associated with gene expressions and estimates the association parameters via a likelihood-based algorithm. A GTEx data application on a cancer-related gene is presented, with 18 novel eQTLs detected by our method. In addition, extensive simulations show that our method outperforms competing methods in bias, efficiency, and computational cost.
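For a generic sense of why supervision matters in this reduction step, the Python sketch below contrasts unsupervised PCA with partial least squares, a simple supervised reduction from scikit-learn; it is a baseline illustration, not the proposed likelihood-based structural method:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 300, 500

# Correlated "SNP" matrix: latent blocks plus noise, discretized to 0/1/2.
latent = rng.normal(size=(n, 5))
liability = latent @ rng.normal(size=(5, p)) + rng.normal(size=(n, p))
snps = np.digitize(liability, np.quantile(liability, [0.5, 0.85])).astype(float)

# Gene expression depends on the SNPs only through a 1-dimensional index.
beta = rng.normal(size=p) / np.sqrt(p)
expr = np.sin(snps @ beta) + 0.3 * rng.normal(size=n)

# Unsupervised reduction (ignores expr) vs. supervised reduction (uses expr).
pcs = PCA(n_components=2).fit_transform(snps)
pls_scores = PLSRegression(n_components=2).fit(snps, expr).transform(snps)

for name, feats in [("PCA", pcs), ("PLS", pls_scores)]:
    r2 = LinearRegression().fit(feats, expr).score(feats, expr)
    print(f"{name} components: in-sample R^2 = {r2:.3f}")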


Subject(s)
Single Nucleotide Polymorphism, Quantitative Trait Loci, Humans, Quantitative Trait Loci/genetics, Likelihood Functions, Genome-Wide Association Study/methods
6.
Ann Stat ; 51(1): 233-259, 2023 Feb.
Article in English | MEDLINE | ID: mdl-37602147

ABSTRACT

We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. To handle the high dimensionality, we further augment the target function with an amenable penalty term. We propose to estimate the regression parameter by minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members of the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.
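For intuition about where the non-convexity comes from, one standard corrected-score construction for Poisson regression with additive normal covariate noise (in the spirit of Nakamura's corrected score; whether this is the paper's exact target function is an assumption) is as follows. With true covariates x_i, observed w_i = x_i + u_i, u_i ~ N(0, \Sigma), one has E[\exp(w_i^\top\beta - \tfrac12\beta^\top\Sigma\beta) \mid x_i] = \exp(x_i^\top\beta), so an unbiased surrogate of the Poisson negative log-likelihood, augmented with an amenable penalty p_\lambda, is

\[
\hat\beta \;=\; \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\Big\{\exp\!\big(w_i^\top\beta - \tfrac12\,\beta^\top\Sigma\,\beta\big) \;-\; y_i\, w_i^\top\beta\Big\} \;+\; \sum_{j=1}^{p} p_\lambda\big(|\beta_j|\big),
\]

which is non-convex in \beta because of the -\tfrac12\,\beta^\top\Sigma\beta term inside the exponential.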

7.
Biom J ; 64(6): 1007-1022, 2022 08.
Article in English | MEDLINE | ID: mdl-35524713

ABSTRACT

We propose a two-way additive model with group-specific interactions, where the group information is unknown. We treat the group membership as latent information and propose an EM algorithm for estimation. With a single observation matrix and diverging numbers of rows and columns, we rigorously establish the estimation consistency and asymptotic normality of our estimator. Extensive simulation studies are conducted to demonstrate the finite sample performance. We apply the model to triple negative breast cancer (TNBC) gene expression data and provide a new way to classify patients into different subtypes. Our analysis detects potential genes that may be associated with TNBC.
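The abstract does not display the model; one form consistent with the description (an assumption on my part, not the paper's exact specification) is, for entry (i, j) of the observation matrix,

\[
x_{ij} \;=\; \mu + \alpha_i + \beta_j + \gamma_{z_i j} + \varepsilon_{ij}, \qquad z_i \in \{1, \dots, K\},
\]

where z_i is the latent group label of row i. The EM algorithm then alternates between computing posterior probabilities of the labels z_i (E-step) and updating (\mu, \alpha, \beta, \gamma) together with the mixing proportions (M-step).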


Subject(s)
Triple Negative Breast Neoplasms, Algorithms, Computer Simulation, Gene Expression, Humans, Triple Negative Breast Neoplasms/genetics
8.
Lifetime Data Anal ; 28(3): 428-491, 2022 07.
Article in English | MEDLINE | ID: mdl-35753014

ABSTRACT

Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models from real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes, such as time to cancer progression, is not readily available in these databases, and the true clinical event times typically cannot be approximated well from simple extracts of billing or procedure codes. Annotating event times manually, on the other hand, is prohibitively time- and resource-intensive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In step I, we employ a functional principal component analysis approach to estimate the underlying intensity functions based on observed point processes from the unlabeled patients. In step II, we fit a penalized proportional odds model to the event time outcomes with the features derived in step I in the labeled data, where the nonparametric baseline function is approximated using B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach relative to existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.
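A rough Python sketch of the flavor of step I (not the paper's exact functional principal component estimator; grid size, bandwidth, and names are illustrative choices): bin each patient's encounter times on a common grid, smooth the counts into a crude intensity estimate, and take principal component scores of the smoothed curves as the features passed to the step-II survival model.

import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.decomposition import PCA

def encounter_fpc_scores(encounter_times, t_max, n_bins=100, sigma=2.0,
                         n_components=3):
    """encounter_times: list of per-patient arrays of encounter times in [0, t_max]."""
    grid = np.linspace(0.0, t_max, n_bins + 1)
    # Crude per-patient intensity: smoothed histogram of encounter times.
    curves = np.vstack([
        gaussian_filter1d(np.histogram(t, bins=grid)[0].astype(float), sigma)
        for t in encounter_times
    ])
    # Principal component scores of the smoothed curves are the
    # low-dimensional features used downstream.
    return PCA(n_components=n_components).fit_transform(curves)

# Toy usage: 200 patients with Poisson-process encounter histories.
rng = np.random.default_rng(3)
times = [np.sort(rng.uniform(0, 365, size=rng.poisson(20))) for _ in range(200)]
print(encounter_fpc_scores(times, t_max=365).shape)   # (200, 3)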


Subject(s)
Electronic Health Records, Lung Neoplasms, Algorithms, Factual Databases, Humans, Lung Neoplasms/therapy, Local Neoplasm Recurrence
9.
Stat Sin ; 31(2): 821-842, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34526756

ABSTRACT

When estimating the treatment effect in an observational study, we use a semiparametric locally efficient dimension reduction approach to model both the treatment assignment mechanism and the average responses in the treated and non-treated groups. We then integrate the results through imputation, inverse probability weighting, and doubly robust augmentation estimators. Doubly robust estimators are locally efficient, while imputation estimators are super-efficient when the response models are correct. To take advantage of both procedures, we introduce a shrinkage estimator that automatically combines the two; it retains the double robustness property while improving on the variance when the response model is correct. We demonstrate the performance of these estimators through simulated experiments and a real dataset concerning the effect of maternal smoking on baby birth weight.
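The abstract does not give the combination rule; a generic shrinkage combination of the two estimators of a treatment effect \tau (an illustrative form, not necessarily the paper's exact weight) is

\[
\hat\tau_{\text{shrink}} \;=\; \hat w\,\hat\tau_{\text{imp}} \;+\; (1-\hat w)\,\hat\tau_{\text{DR}}, \qquad \hat w \in [0,1],
\]

where the data-driven weight \hat w is chosen, for example, to minimize an estimate of the mean squared error of the combination, so that the estimator leans toward the imputation estimator when the response model appears correct and toward the doubly robust estimator otherwise.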

10.
Biometrics ; 76(4): 1340-1350, 2020 12.
Article in English | MEDLINE | ID: mdl-31860141

ABSTRACT

High-dimensional gene expression data often exhibit intricate correlation patterns as the result of coordinated genetic regulation. In practice, however, it is difficult to directly measure these coordinated underlying activities. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes. Compared to existing approaches, our proposed procedure has several unique characteristics. In the first stage, an important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. Second, to characterize the molecular heterogeneity of breast cancer, our approach provides estimates specific to each cancer subtype. Finally, our proposed framework incorporates a sparsity condition, reflecting the fact that genetic networks are often sparse. In the second stage, we investigate the relationship between latent factor activity levels and censored survival time using a general dimension reduction model in the survival analysis context. Combining the factor model and the sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data.
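A schematic of the first-stage model consistent with this description (the abstract does not give the exact specification, so the form below is an assumption) is

\[
x_i \;=\; \Lambda f_i + \varepsilon_i, \qquad \Lambda_{gk} = 0 \ \text{unless gene } g \text{ belongs to pathway } k,
\]

so each latent factor f_{ik} represents the activity of pathway k, with sparsity imposed on the nonzero loadings and estimates obtained separately within each cancer subtype; the second stage then relates f_i to the censored survival time through a sufficient dimension reduction regression.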


Subject(s)
Breast Neoplasms, Breast Neoplasms/genetics, Female, Gene Regulatory Networks, Humans, Survival Analysis
11.
Article in English | MEDLINE | ID: mdl-32153310

ABSTRACT

Field studies in ecology often make use of data collected in a hierarchical fashion, and may combine studies that vary in sampling design. For example, studies of tree recruitment after disturbance may use counts of individual seedlings from plots that vary in spatial arrangement and sampling density. To account for the multi-level design and the fact that more than a few plots usually yield no individuals, a mixed-effects zero-inflated Poisson model is often adopted. Although it is a convenient modeling strategy, various aspects of the model can be misspecified. A comprehensive test procedure, based on the cumulative sum of the residuals, is proposed. The test is proven to be consistent, and its convergence properties are established as well. The application of the proposed test is illustrated by a real data example and simulation studies.
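For a concrete starting point, the Python sketch below fits a plain zero-inflated Poisson model with statsmodels — without the random effects of the mixed-effects model studied here — and computes a cumulative sum of residuals ordered by a covariate, the kind of process the proposed test is built from; the actual test statistic, its multilevel structure, and its reference distribution are developed in the paper.

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(-1, 1, size=n)
lam = np.exp(0.5 + 1.0 * x)
structural_zero = rng.binomial(1, 0.3, size=n)
y = np.where(structural_zero == 1, 0, rng.poisson(lam))

X = sm.add_constant(x)
fit = ZeroInflatedPoisson(y, X, exog_infl=np.ones((n, 1))).fit(maxiter=200, disp=0)

# Cumulative sum of raw residuals ordered by the covariate: under a
# correctly specified model this process should hover around zero.
resid = y - fit.predict()
cusum = np.cumsum(resid[np.argsort(x)]) / np.sqrt(n)
print(f"max |cusum| = {np.abs(cusum).max():.3f}")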

12.
Can J Stat ; 47(2): 140-156, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31274953

ABSTRACT

We propose a consistent and locally efficient estimator of the model parameters in a logistic mixed effect model with random slopes. Our approach relaxes two typical assumptions: that the random effects are normally distributed, and that the covariates and random effects are independent of each other. Verifying these assumptions is particularly difficult in health studies, where resources to design experiments and gather data in long-term studies are often limited, and new findings from other fields may suggest that such assumptions are violated. An estimator that is robust to such violations therefore allows better use of data collected with valuable resources. Our method generalizes the framework presented in Garcia & Ma (2016), which also deals with a logistic mixed effect model but only considers a random intercept. A simulation study reveals that our proposed estimator remains consistent even when the independence and normality assumptions are violated. This contrasts with the traditional maximum likelihood estimator, which is likely to be inconsistent when there is dependence between the covariates and random effects. Application of this work to a Huntington disease study reveals that disease diagnosis can be further improved using assessments of cognitive performance.

13.
Biometrics ; 74(3): 910-923, 2018 09.
Article in English | MEDLINE | ID: mdl-29441521

ABSTRACT

The problem of estimating average treatment effects is important when evaluating the effectiveness of medical treatments or social intervention policies. Most existing methods for estimating the average treatment effect rely, in one way or another, on parametric assumptions about the propensity score model or the outcome regression model. In reality, both models are prone to misspecification, which can have undue influence on the estimated average treatment effect. We propose an alternative robust approach to estimating the average treatment effect from observational data in the challenging situation when neither a plausible parametric outcome model nor a reliable parametric propensity score model is available. Our estimator can be considered a robust extension of the popular class of propensity score weighted estimators. The approach is robust, flexible, and data adaptive, and it can handle many covariates simultaneously. Adopting a dimension reduction approach, we estimate the propensity score weights semiparametrically by using a nonparametric link function to relate the treatment assignment indicator to a low-dimensional structure of the covariates, typically formed by several linear combinations of the covariates. We develop a class of consistent estimators for the average treatment effect and study their theoretical properties. We demonstrate the robust performance of the estimators on simulated data and a real data example investigating the effect of maternal smoking on babies' birth weight.
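A one-dimensional illustration of the idea in Python (the paper allows a multi-dimensional reduced structure and develops the supporting theory) is to estimate a single linear index of the covariates, relate the treatment indicator to that index through a nonparametric link, and plug the resulting propensity scores into the usual inverse probability weighted estimator:

import numpy as np
from sklearn.linear_model import LogisticRegression
from statsmodels.nonparametric.smoothers_lowess import lowess

def ipw_ate_single_index(X, treat, y, frac=0.3):
    # Step 1: a linear index of the covariates (from a working logistic
    # fit; only the direction of the index is used).
    index = X @ LogisticRegression(max_iter=1000).fit(X, treat).coef_.ravel()

    # Step 2: nonparametric link of the treatment indicator on the index
    # gives the propensity score, clipped away from 0 and 1.
    ps = np.clip(lowess(treat, index, frac=frac, return_sorted=False), 0.01, 0.99)

    # Step 3: Horvitz-Thompson style inverse probability weighting.
    return np.mean(treat * y / ps) - np.mean((1 - treat) * y / (1 - ps))

# Toy usage with a known average treatment effect of 2.
rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 5))
index_true = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0])
treat = rng.binomial(1, 1 / (1 + np.exp(-index_true)))
y = 2.0 * treat + X[:, 0] + rng.normal(size=n)
print(ipw_ate_single_index(X, treat, y))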


Subject(s)
Statistical Models, Propensity Score, Statistics as Topic/methods, Birth Weight, Computer Simulation, Female, Humans, Maternal-Fetal Exchange, Pregnancy, Smoking, Treatment Outcome
14.
J Econom ; 200(2): 194-206, 2017 Oct.
Article in English | MEDLINE | ID: mdl-29200600

ABSTRACT

We develop consistent and efficient estimation of parameters in general regression models with mismeasured covariates. We assume the model error and covariate distributions are unspecified, and the measurement error distribution is a general parametric distribution with unknown variance-covariance. We construct root-n consistent, asymptotically normal and locally efficient estimators using the semiparametric efficient score. We do not estimate any unknown distribution or model error heteroskedasticity. Instead, we form the estimator under possibly incorrect working distribution models for the model error, error-prone covariate, or both. Empirical results demonstrate robustness to different incorrect working models in homoscedastic and heteroskedastic models with error-prone covariates.

15.
Stat Sin ; 27(4): 1857-1878, 2017 Oct.
Article in English | MEDLINE | ID: mdl-29097879

ABSTRACT

Some biomedical studies lead to mixture data. When a discrete covariate defining subgroup membership is missing for some of the subjects in a study, the distribution of the outcome follows a mixture of the subgroup-specific distributions. Taking into account the uncertain distribution of the group membership and the covariates, we model the relation between the disease onset time and the covariates through transformation models in each sub-population, and develop nonparametric maximum likelihood estimation, implemented through an EM algorithm, along with its inference procedure. We further propose methods to identify the covariates that have different effects or common effects across the distinct populations, which enables parsimonious modeling and a better understanding of the differences across populations. The methods are illustrated through extensive simulation studies and a real data example.

16.
Stat Med ; 34(16): 2427-43, 2015 Jul 20.
Article in English | MEDLINE | ID: mdl-25847392

ABSTRACT

We propose a simple approach for predicting the cumulative risk of disease that accommodates predictors with time-varying effects and outcomes subject to censoring. We use a nonparametric function for the coefficient of the time-varying effect and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is extremely convenient and can be implemented with standard software. We prove large sample properties of the proposed estimator and evaluate its finite sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing Huntington's disease (HD) for subjects carrying a huntingtin gene mutation, using data from a large collaborative HD study, and illustrate an inverse relationship between the cumulative risk of HD and the length of cytosine-adenine-guanine (CAG) repeats in the huntingtin gene.
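The redistribution-to-the-right device mentioned here can be written in a few lines: each censored observation passes its probability mass to the observations with larger follow-up times, and the mass left on the event times yields the (Kaplan-Meier-equivalent) cumulative risk estimate. Below is a minimal covariate-free version in Python; the paper combines this with nonparametric estimation of the time-varying coefficient.

import numpy as np

def redistribute_to_the_right(time, event):
    """Return event times and the probability masses left on them after
    redistributing the mass of censored observations to the right."""
    # Sort by time; at ties, process events before censored observations.
    order = np.lexsort((1 - event, time))
    t, d = time[order], event[order]
    mass = np.full(len(t), 1.0 / len(t))
    for i in range(len(t)):
        if d[i] == 0 and i + 1 < len(t):   # censored: push mass to the right
            mass[i + 1:] += mass[i] / (len(t) - i - 1)
            mass[i] = 0.0
    return t[d == 1], mass[d == 1]

# Toy usage: the cumulative risk F(t) is the sum of masses at event times <= t.
rng = np.random.default_rng(6)
x = rng.exponential(5, size=200)            # latent event times
c = rng.exponential(8, size=200)            # censoring times
time, event = np.minimum(x, c), (x <= c).astype(int)
et, m = redistribute_to_the_right(time, event)
print(f"estimated P(T <= 5) = {m[et <= 5].sum():.3f}")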


Subject(s)
Age of Onset, Risk, Adult, Biostatistics, Computer Simulation, Humans, Huntingtin Protein, Huntington Disease/genetics, Middle Aged, Statistical Models, Monte Carlo Method, Nerve Tissue Proteins/genetics, Observational Studies as Topic/statistics & numerical data, Nonparametric Statistics, Trinucleotide Repeat Expansion
17.
Ann Stat ; 43(5): 1929-1958, 2015.
Article in English | MEDLINE | ID: mdl-26283801

ABSTRACT

We propose a generalized partially linear functional single index risk score model for repeatedly measured outcomes where the index itself is a function of time. We fuse the nonparametric kernel method and the regression spline method, and modify the generalized estimating equation to facilitate estimation and inference. We use local kernel smoothing to estimate the unspecified coefficient functions of time, and use B-splines to estimate the unspecified function of the single index component. The covariance structure is taken into account via a working model, which provides a valid estimation and inference procedure whether or not it captures the true covariance. The estimation method is applicable to both continuous and discrete outcomes. We derive the large sample properties of the estimation procedure and show the different convergence rates of each component of the model. The asymptotic properties of the kernel and regression spline methods combined in a nested fashion had not been studied prior to this work, even in the independent data case.

18.
Biometrics ; 70(1): 21-32, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24350758

ABSTRACT

We take a semiparametric approach to fitting a linear transformation model to right-censored data when predictor variables are subject to measurement errors. We construct consistent estimating equations when repeated measurements of a surrogate of the unobserved true predictor are available. The proposed approach applies under minimal assumptions on the distributions of the true covariate and the measurement errors. We derive the asymptotic properties of the estimator and illustrate its finite sample performance via simulation studies. We apply the method to analyze an AIDS clinical trial data set that motivated the work.


Subject(s)
Biomarkers/analysis, Statistical Data Interpretation, Linear Models, Longitudinal Studies/methods, Anti-HIV Agents/pharmacology, CD4 Lymphocyte Count, Computer Simulation, HIV Infections/drug therapy, HIV-1/immunology, Humans
19.
Stat Med ; 33(8): 1369-82, 2014 Apr 15.
Article in English | MEDLINE | ID: mdl-24027120

ABSTRACT

Huntington's disease (HD) is a neurodegenerative disorder with a dominant genetic mode of inheritance caused by an expansion of CAG repeats on chromosome 4. Typically, a longer CAG repeat length is associated with an increased risk of earlier HD onset. Previous studies of the association between HD onset age and CAG length have favored a logistic model, where the CAG repeat length enters the mean and variance components of the logistic model in a complex exponential-linear form. To relax this parametric exponential-linear assumption on the true HD onset distribution, we propose to leave both the mean and variance functions of the CAG repeat length unspecified and to perform semiparametric estimation in this context through a local kernel and backfitting procedure. Motivated by the family history of HD available from the family members of participants in the Cooperative Huntington's Observational Research Trial (COHORT), we develop the methodology in the context of mixture data, where some subjects have a positive probability of being risk free. We also allow censoring of the age at onset of disease and accommodate covariates other than the CAG length. We study the theoretical properties of the proposed estimator and derive its asymptotic distribution. Finally, we apply the proposed methods to the COHORT data to estimate the HD onset distribution using a group of study participants and the disease family history information available on their family members.
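For orientation, the parametric model being relaxed takes the onset-age distribution, given CAG repeat length c, to be logistic,

\[
\Pr(T \le t \mid C = c) \;=\; \Big[\,1 + \exp\Big\{-\,\frac{t - \mu(c)}{s(c)}\Big\}\Big]^{-1},
\]

with \mu(c) and the variance written in earlier parametric analyses as exponential-linear functions of c (generically, \mu(c) = a_1 + \exp(a_2 - a_3 c); the exact fitted constants are not reproduced here). The present paper instead leaves \mu(\cdot) and the variance function unspecified and estimates them by local kernel smoothing with backfitting.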


Subject(s)
Age of Onset, Huntington Disease/genetics, Genetic Models, Trinucleotide Repeats/genetics, Computer Simulation, Family, Female, Humans, Male
20.
Ann Stat ; 41(1): 250-268, 2013 Feb.
Article in English | MEDLINE | ID: mdl-24058219

ABSTRACT

We develop an efficient estimation procedure for identifying and estimating the central subspace. Using a new parameterization, we convert the problem of identifying the central subspace into the problem of estimating a finite dimensional parameter in a semiparametric model. This conversion allows us to derive an efficient estimator that reaches the optimal semiparametric efficiency bound. The resulting efficient estimator can exhaustively estimate the central subspace without imposing any distributional assumptions. Our proposed efficient estimation also makes it possible to perform inference on the parameters that uniquely identify the central subspace. We conduct simulation studies and a real data analysis to demonstrate the finite sample performance in comparison with several existing methods.
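As a classical (non-efficient) baseline for the same target, sliced inverse regression estimates central-subspace directions in a few lines; the estimator proposed here instead attains the semiparametric efficiency bound and can recover the subspace exhaustively. The Python sketch below is standard SIR on simulated data, not the paper's estimator.

import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sliced inverse regression estimate of central subspace directions."""
    n, p = X.shape
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(Sigma))   # so that Z below is whitened
    Z = (X - mu) @ L
    # Average the standardized covariates within slices of the response.
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        zbar = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(zbar, zbar)
    # Leading eigenvectors, mapped back to the original covariate scale.
    vecs = np.linalg.eigh(M)[1][:, ::-1][:, :n_dirs]
    dirs = L @ vecs
    return dirs / np.linalg.norm(dirs, axis=0)

# Toy example: y depends on X only through one direction.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 6))
beta = np.array([1.0, 2.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(5.0)
y = (X @ beta) ** 3 + 0.2 * rng.normal(size=1000)
print(np.round(sir_directions(X, y).ravel(), 2))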
