Results 1 - 17 of 17
1.
Proc Natl Acad Sci U S A ; 115(7): 1481-1486, 2018 Feb 13.
Article in English | MEDLINE | ID: mdl-29386387

ABSTRACT

When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and in the distribution of participants or observations between datasets, especially distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions under which the distributional shift can be corrected. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonization of protein biomarkers even when the assays differ across sites. Our contribution may, in part, mitigate a bottleneck that researchers face when pooling smaller clinical datasets, and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies.
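The abstract does not reproduce the correction algorithm itself; as a hedged illustration of the pre-pooling diagnostic it motivates, the sketch below flags predictors whose marginal distributions differ across two sites using a two-sample Kolmogorov-Smirnov test. All data, sizes, and the 0.05 cutoff are illustrative assumptions, not the paper's procedure.

```python
# A minimal pre-pooling diagnostic (NOT the paper's correction method):
# flag predictors whose marginal distributions shift between two sites.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(80, 3))   # e.g., 3 CSF biomarkers, site A
site_b = rng.normal(0.3, 1.2, size=(60, 3))   # site B, shifted assay

for j in range(site_a.shape[1]):
    stat, p = ks_2samp(site_a[:, j], site_b[:, j])
    flag = "shifted" if p < 0.05 else "comparable"
    print(f"predictor {j}: KS={stat:.3f}, p={p:.3g} -> {flag}")
```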


Subject(s)
Alzheimer Disease/cerebrospinal fluid , Databases, Factual/statistics & numerical data , Aged , Aged, 80 and over , Algorithms , Biomarkers/cerebrospinal fluid , Data Interpretation, Statistical , Humans , Middle Aged
2.
Proc Mach Learn Res ; 70: 4170-4179, 2017 Aug.
Article in English | MEDLINE | ID: mdl-31742253

ABSTRACT

Many studies in the biomedical and health sciences involve small sample sizes due to logistical or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning addressing distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it is not), independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question for both classical and high-dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and show how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer's disease studies, we present empirical results showing that, in regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.
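As a minimal sketch of the kind of site-level check the abstract describes (not the paper's actual test statistic), a classical Chow-type F-test asks whether two sites plausibly share one set of linear-regression coefficients; in practice it can be computed from per-site summary statistics without moving raw data, though the toy below simply stacks the arrays. All sizes and coefficients are illustrative assumptions.

```python
# Chow-type F-test: do two sites share the same regression coefficients?
import numpy as np
from scipy.stats import f as f_dist

def chow_test(X1, y1, X2, y2):
    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    k = X1.shape[1]
    n = len(y1) + len(y2)
    rss_pooled = rss(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    rss_sep = rss(X1, y1) + rss(X2, y2)
    F = ((rss_pooled - rss_sep) / k) / (rss_sep / (n - 2 * k))
    return F, f_dist.sf(F, k, n - 2 * k)

rng = np.random.default_rng(1)
X1 = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
X2 = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
beta = np.array([1.0, 0.5, -0.3])
y1 = X1 @ beta + rng.normal(scale=0.5, size=50)
y2 = X2 @ beta + rng.normal(scale=0.5, size=40)
print(chow_test(X1, y1, X2, y2))  # large p-value: pooling looks sensible
```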

3.
Adv Neural Inf Process Syst ; 29: 2496-2504, 2016.
Article in English | MEDLINE | ID: mdl-29308004

ABSTRACT

Consider samples x ~ P_source and y ~ P_target from two different data sources, where we only observe their transformed versions h(x) and g(y), for functions h(·) and g(·) from some known class. Our goal is to perform a statistical test checking whether P_source = P_target while removing the distortions induced by the transformations. This problem is closely related to domain adaptation and, in our case, is motivated by the need to combine clinical and imaging-based biomarkers from multiple sites and/or batches, a fairly common impediment to conducting analyses with much larger sample sizes. We address this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated in tandem with the testing. We derive a simple algorithm, study its convergence and consistency properties in detail, and provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for Alzheimer's disease, our framework is competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.
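A much-simplified stand-in for this procedure (the paper estimates the distortions jointly with the test; here the distortion is assumed known and inverted first): a permutation test on the one-dimensional energy distance between the recovered samples. The affine g and all constants below are illustrative assumptions.

```python
# Permutation two-sample test after removing a known affine distortion.
import numpy as np
from scipy.stats import energy_distance

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)                # observed h(x); h = identity here
y = 2.0 * rng.normal(0, 1, 180) + 1.0    # observed g(y), with g(t) = 2t + 1
y_corrected = (y - 1.0) / 2.0            # invert the assumed distortion

obs = energy_distance(x, y_corrected)
pooled = np.concatenate([x, y_corrected])
perm = []
for _ in range(999):
    rng.shuffle(pooled)
    perm.append(energy_distance(pooled[:200], pooled[200:]))
p_value = (1 + sum(s >= obs for s in perm)) / 1000
print(f"energy distance {obs:.3f}, permutation p = {p_value:.3f}")
```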

4.
Proc Natl Acad Sci U S A ; 112(39): 12069-74, 2015 Sep 29.
Article in English | MEDLINE | ID: mdl-26371300

ABSTRACT

The conditional lifetime expectancy function (LEF) is the expected lifetime of a subject given survival past a certain time point and the values of a set of explanatory variables. This function is attractive to researchers because it summarizes the entire residual life distribution and is easier to interpret than the widely used hazard function. In this paper, we propose a general framework of backward multiple imputation for estimating the conditional LEF and the variance of the estimator in the right-censoring setting. Simulation studies are conducted to investigate the empirical properties of the proposed estimator and the corresponding variance estimator. We demonstrate the method on the Beaver Dam Eye Study data, where the expected human lifetime is modeled with a smoothing-spline ANOVA given covariate information including sex, lifestyle factors, and disease variables.
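In symbols, with T the lifetime, t the conditioning time, and x the covariate vector, the quantity being estimated is

$$ \mathrm{LEF}(t \mid x) \;=\; \mathbb{E}\big[\, T \mid T > t,\ X = x \,\big] \;=\; t + \mathbb{E}\big[\, T - t \mid T > t,\ X = x \,\big], $$

the second form showing its relation to the mean residual life function.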


Subject(s)
Life Expectancy , Longevity/physiology , Models, Biological , Analysis of Variance , Body Mass Index , Humans , Sex Factors , Smoking , Social Class
5.
Stat Med ; 34(10): 1708-20, 2015 May 10.
Article in English | MEDLINE | ID: mdl-25640961

ABSTRACT

Variable selection is of increasing importance for addressing the difficulties of high dimensionality in many scientific areas. In this paper, we demonstrate a property of distance covariance, which we incorporate into a novel feature screening procedure together with distance correlation. The approach makes no distributional assumptions about the variables and does not require the specification of a regression model, and hence is especially attractive for variable selection given an enormous number of candidate attributes and little information about the true model relating them to the response. The method is applied to two genetic risk problems, where issues including the uncertainty of variable selection under cross-validation, a subgroup of hard-to-classify cases, and the application of a reject option are discussed.
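For concreteness, here is a minimal implementation of the screening statistic, using the standard double-centered sample estimator of distance correlation; the paper's contribution is the screening theory built on top of this statistic, not the computation itself. Sizes and the toy signal are illustrative.

```python
# Rank features by empirical distance correlation with the response.
import numpy as np

def dist_corr(x, y):
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                       # squared distance covariance
    dvar2 = (A * A).mean() * (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar2)) if dvar2 > 0 else 0.0

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)   # nonlinear signal in X0
scores = [dist_corr(X[:, j], y) for j in range(X.shape[1])]
print(np.argsort(scores)[::-1][:5])  # X0 should rank near the top
```

Note that the screen needs no regression model: it ranks each candidate by a single dependence statistic, which is what makes it usable with enormous attribute counts.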


Subject(s)
Genes, Neoplasm/drug effects , Genetic Predisposition to Disease , Models, Genetic , Ovarian Neoplasms/genetics , Analysis of Variance , Antineoplastic Agents/therapeutic use , Binomial Distribution , Diagnosis, Differential , Female , Genetic Testing/methods , Genetic Testing/statistics & numerical data , Humans , Ovarian Neoplasms/pathology , Ovarian Neoplasms/therapy , Pharmacogenetics , Risk Assessment/methods , Risk Assessment/statistics & numerical data
6.
Biometrics ; 71(1): 53-62, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25257196

ABSTRACT

In many scientific and engineering applications, covariates are naturally grouped. When such group structure is available, people are usually interested in identifying both important groups and important variables within the selected groups. Among existing successful group variable selection methods, some fail to conduct within-group selection; others can conduct both group and within-group selection, but their objective functions are non-convex, which may require extra numerical effort. In this article, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex and can identify important groups as well as select important variables within those groups. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting, where the number of covariates can be much larger than the sample size. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life of breast cancer survivors.
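The abstract does not display the penalty itself; the following form, a weighted log of a sum of exponentials of the absolute coefficients within each group g, is a sketch consistent with the name, the stated convexity, and the within-group selectivity, and should be read as an assumption rather than the paper's exact display:

$$ \Omega_{\mathrm{LES}}(\beta) \;=\; \lambda \sum_{g=1}^{G} w_g \,\log\!\Big( \sum_{j \in g} e^{\,|\beta_j|} \Big). $$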


Subject(s)
Breast Neoplasms/epidemiology , Breast Neoplasms/therapy , Outcome Assessment, Health Care/methods , Quality of Life/psychology , Survivors/psychology , Survivors/statistics & numerical data , Adolescent , Adult , Breast Neoplasms/psychology , Computer Simulation , Female , Humans , Middle Aged , Models, Statistical , Prevalence , Survival Rate , Treatment Outcome , United States/epidemiology , Young Adult
7.
Proc Natl Acad Sci U S A ; 109(50): 20352-7, 2012 Dec 11.
Article in English | MEDLINE | ID: mdl-23175793

ABSTRACT

We present a method for examining mortality as it is seen to run in families, and lifestyle factors that are also seen to run in families, in a subpopulation of the Beaver Dam Eye Study. We observe that pairwise distance between death age in related persons is on average less than pairwise distance in death age between random pairs of unrelated persons. Our goal is to examine the hypothesis that pairwise differences in lifestyle factors correlate with the observed pairwise differences in death age that run in families. Szekely and Rizzo [Szekely GJ, Rizzo ML (2009) Ann Appl Stat 3(4): 1236-1265] have recently developed a method called distance correlation, which is suitable for this task with some enhancements. We build a Smoothing Spline ANOVA (SS-ANOVA) model for predicting death age based on four major lifestyle factors generally known to be related to mortality and four major diseases contributing to mortality, to develop a lifestyle mortality risk vector and a disease mortality risk vector. We then examine to what extent pairwise differences in these scores correlate with pairwise differences in mortality as they occur between family members and between unrelated persons. We find significant distance correlations between death ages, lifestyle factors, and family relationships. Considering only sib pairs compared with unrelated persons, distance correlation between siblings and mortality is, not surprisingly, stronger than that between more distantly related family members and mortality. The methodological approach here adapts to exploring relationships between multiple clusters of variables with observable (real-valued) attributes, and other factors for which only possibly nonmetric pairwise dissimilarities are observed.
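For reference, the Szekely-Rizzo population quantities underlying this analysis, with (X', Y') and (X'', Y'') independent copies of (X, Y):

$$ \mathrm{dCov}^2(X, Y) \;=\; \mathbb{E}\,\lVert X - X' \rVert\,\lVert Y - Y' \rVert \;+\; \mathbb{E}\lVert X - X' \rVert\;\mathbb{E}\lVert Y - Y' \rVert \;-\; 2\,\mathbb{E}\,\lVert X - X' \rVert\,\lVert Y - Y'' \rVert, $$

$$ \mathrm{dCor}(X, Y) \;=\; \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dCov}(X, X)\,\mathrm{dCov}(Y, Y)}} \;\in\; [0, 1], $$

which is zero exactly when X and Y are independent (given finite first moments). Because it is built from pairwise distances, it extends naturally to the possibly nonmetric pairwise dissimilarities (such as family relationships) used here.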


Subject(s)
Family Relations , Life Style , Mortality , Adult , Aged , Aged, 80 and over , Analysis of Variance , Disease/etiology , Epidemiologic Factors , Eye Diseases/epidemiology , Female , Humans , Male , Middle Aged , Models, Statistical , Pedigree , Risk Factors , Wisconsin/epidemiology
8.
BMC Bioinformatics ; 13: 98, 2012 May 15.
Article in English | MEDLINE | ID: mdl-22587526

ABSTRACT

BACKGROUND: In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious associations driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings. RESULTS: The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage, where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by the gene expression barcode. Most of the genes and second-order interactions thus selected are known to be related to the outcomes. CONCLUSIONS: We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies.
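A toy version of the core idea (not the partitioned solver itself): expand barcode-style dichotomous inputs into main effects plus second-order patterns (products of pairs), then fit an L1-penalized logistic regression so only a few patterns survive. Sizes and the regularization strength C are illustrative assumptions.

```python
# Pattern expansion + L1-penalized logistic regression on binary calls.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(300, 10))            # barcode-style 0/1 calls
logit = 1.5 * X[:, 0] * X[:, 1] - 1.0 * X[:, 2]   # one true interaction pattern
y = rng.random(300) < 1 / (1 + np.exp(-logit))

pairs = list(combinations(range(10), 2))
X_full = np.hstack([X] + [X[:, [i]] * X[:, [j]] for i, j in pairs])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(X_full, y)
print(np.flatnonzero(model.coef_[0]))  # expect column 2 and the (0,1) pattern
```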


Subject(s)
Algorithms , Computer Simulation , Gene Expression Profiling/statistics & numerical data , Gene Expression , Models, Genetic , Breast Neoplasms/genetics , Breast Neoplasms/mortality , Female , Genomics , Humans
9.
J Stat Plan Inference ; 140(12): 3580-3596, 2010 Dec 01.
Article in English | MEDLINE | ID: mdl-20814436

ABSTRACT

We summarize, review, and comment upon three papers that discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space that respects the dissimilarity information while controlling the dimension of the space. A "newbie" algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a Smoothing Spline ANOVA penalized likelihood model, a Support Vector Machine, or any model that will admit Reproducing Kernel Hilbert Space components, for nonparametric regression, supervised learning, or semi-supervised learning. Future work and open questions are discussed. The papers are: (1) F. Lu, S. Keles, S. Wright and G. Wahba (2005), A framework for kernel regularization with application to protein clustering, Proceedings of the National Academy of Sciences 102, 12332-12337; (2) G. Corrada Bravo, G. Wahba, K. Lee, B. Klein, R. Klein and S. Iyengar (2009), Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models, Proceedings of the National Academy of Sciences 106, 8128-8133; (3) F. Lu, Y. Lin and G. Wahba, Robust manifold unfolding with kernel regularization, TR 1008, Department of Statistics, University of Wisconsin-Madison.

10.
Proc Natl Acad Sci U S A ; 106(20): 8128-33, 2009 May 19.
Article in English | MEDLINE | ID: mdl-19420224

ABSTRACT

We present a method for examining the relative influence of familial, genetic, and environmental covariate information in flexible nonparametric risk models. Our goal is to investigate the relative importance of these three sources of information as they are associated with a particular outcome. To that end, we developed a method for incorporating arbitrary pedigree information in a smoothing spline ANOVA (SS-ANOVA) model. By expressing pedigree data as a positive semidefinite kernel matrix, the SS-ANOVA model is able to estimate a log-odds ratio as a multicomponent function of several variables: one or more functional components representing information from environmental covariates and/or genetic marker data, and another representing pedigree relationships. We report a case study on models for retinal pigmentary abnormalities in the Beaver Dam Eye Study. Our model verifies known facts about the epidemiology of this eye lesion--found in eyes with early age-related macular degeneration--and shows significantly increased predictive ability in models that include all three of the genetic, environmental, and familial data sources. The case study also shows that models containing only two of these data sources (pedigree and environmental covariates, pedigree and genetic markers, or environmental covariates and genetic markers) have comparable predictive ability, but less than the model with all three. This result is consistent with the notions that genetic marker data encode--at least in part--pedigree data, and that familial correlations encode shared environment data as well.
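A hedged sketch of the kernel-combination idea: encode pedigree relatedness as a positive semidefinite kernel (here, the numerator relationship matrix of a tiny parent/child toy pedigree), add an RBF kernel on environmental covariates, and fit with a precomputed-kernel learner. The pedigree values, covariates, bandwidth, and the ridge learner standing in for the SS-ANOVA fit are all illustrative assumptions.

```python
# Additive multi-component kernel: pedigree kernel + covariate kernel.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

# 4 subjects: two unrelated parents, two full sibs (2x kinship coefficients)
K_ped = np.array([[1.0, 0.0, 0.5, 0.5],
                  [0.0, 1.0, 0.5, 0.5],
                  [0.5, 0.5, 1.0, 0.5],
                  [0.5, 0.5, 0.5, 1.0]])
X_env = np.array([[58, 1], [55, 0], [30, 1], [28, 0]], dtype=float)  # age, smoker
K = K_ped + rbf_kernel(X_env / X_env.std(0))      # combine the two components

y = np.array([0.9, 0.1, 0.6, 0.4])               # toy risk scores
model = KernelRidge(kernel="precomputed", alpha=0.1).fit(K, y)
print(model.predict(K))
```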


Subject(s)
Disease Susceptibility/etiology , Models, Theoretical , Risk , Adult , Aged , Aged, 80 and over , Analysis of Variance , Computer Simulation , Environment , Family Health , Genetic Markers , Humans , Macular Degeneration/etiology , Middle Aged , Pedigree , Polymorphism, Single Nucleotide , ROC Curve
11.
J Mach Learn Res ; 5: 41-48, 2009.
Article in English | MEDLINE | ID: mdl-22081761

ABSTRACT

We present a novel method for estimating tree-structured covariance matrices directly from observed continuous data. Specifically, we estimate a covariance matrix from observations of p continuous random variables encoding a stochastic process over a tree with p leaves. A representation of these classes of matrices as linear combinations of rank-one matrices indicating object partitions is used to formulate estimation as instances of well-studied numerical optimization problems. In particular, our estimates are based on projection, where the covariance estimate is the nearest tree-structured covariance matrix to an observed sample covariance matrix. The problem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of the integer variables in the MIP specifies a set of tree topologies of the structured covariance matrix. We solve these problems to optimality using efficient and robust existing MIP solvers. We present a case study in phylogenetic analysis of gene expression and a simulation study comparing our method to distance-based tree estimating procedures.
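To make the representation concrete, the sketch below builds a tree-structured covariance over four leaves as a nonnegative combination of rank-one partition-indicator matrices, one per tree node. The tree and branch lengths are arbitrary illustrative values; the paper's contribution is recovering such a structure by MIP projection, which is not attempted here.

```python
# Tree-structured covariance as a sum of rank-one partition indicators.
import numpy as np

p = 4
# clades of the tree ((0,1),(2,3)): root, two internal nodes, four leaves
clades = [[0, 1, 2, 3], [0, 1], [2, 3], [0], [1], [2], [3]]
branch_len = [0.2, 0.5, 0.3, 0.1, 0.4, 0.2, 0.6]   # weight per node

Sigma = np.zeros((p, p))
for clade, w in zip(clades, branch_len):
    u = np.zeros(p)
    u[clade] = 1.0
    Sigma += w * np.outer(u, u)        # rank-one indicator of the clade
print(Sigma)  # Sigma[i, j] = shared root-to-leaf path length of leaves i, j
```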

12.
Stat Interface ; 1(1): 137-153, 2008.
Article in English | MEDLINE | ID: mdl-18852828

ABSTRACT

The LASSO-Patternsearch algorithm is proposed to efficiently identify patterns of multiple dichotomous risk factors for outcomes of interest in demographic and genomic studies. The patterns considered are those that arise naturally from the log-linear expansion of the multivariate Bernoulli density. The method is designed for the case where there is a possibly very large number of candidate patterns but it is believed that only a relatively small number are important. A LASSO is used to greatly reduce the number of candidate patterns, using a novel computational algorithm that can handle an extremely large number of unknowns simultaneously. The patterns surviving the LASSO are further pruned in the framework of (parametric) generalized linear models. A novel tuning procedure based on the GACV for Bernoulli outcomes, modified to act as a model selector, is used at both steps. We applied the method to myopia data from the population-based Beaver Dam Eye Study, exposing physiologically interesting interacting risk factors. We then applied the method to data from a generative model of rheumatoid arthritis based on Problem 3 of Genetic Analysis Workshop 15, successfully demonstrating its potential to efficiently recover higher-order patterns from attribute vectors of length typical of genomic studies.
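The patterns in question are the monomials of the log-linear expansion of the multivariate Bernoulli density for x in {0,1}^p,

$$ \log p(x) \;=\; c \;+\; \sum_{j} f_j\, x_j \;+\; \sum_{j<k} f_{jk}\, x_j x_k \;+\; \sum_{j<k<l} f_{jkl}\, x_j x_k x_l \;+\; \cdots, $$

and the LASSO step drives most of the pattern coefficients f to exactly zero, leaving a small set of interpretable interactions.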

13.
Genet Epidemiol ; 31 Suppl 1: S51-60, 2007.
Article in English | MEDLINE | ID: mdl-18046765

ABSTRACT

Genome-wide association studies using thousands to hundreds of thousands of single-nucleotide polymorphism (SNP) markers and region-wide association studies using a dense panel of SNPs are already in use to identify disease susceptibility genes and to predict disease risk in individuals. Because these tasks are becoming increasingly important, three different data sets were provided for Genetic Analysis Workshop 15, allowing examination of various novel and existing data mining methods for classification and for identification of disease susceptibility genes and gene-by-gene or gene-by-environment interactions. The approach most often applied in this presentation group was random forests, because of its simplicity, elegance, and robustness. It was used for prediction and, as a first step, for screening for interesting SNPs. The logistic tree with unbiased selection approach appeared to be an interesting alternative for efficiently selecting interesting SNPs. Machine learning methods, specifically ensemble methods, might be useful as pre-screening tools for large-scale association studies because, compared with standard statistical approaches, they can be less prone to overfitting, can be less computationally intensive, can easily include pairwise and higher-order interactions, and can also have high classification capability. However, improved implementations that are able to deal with hundreds of thousands of SNPs at a time are required.
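A hedged sketch of the pre-screening use described above: rank SNPs by random-forest importance and keep the top slice for follow-up modeling. The data are simulated; sizes, genotype coding, and the cutoff are illustrative assumptions, not a recipe from the workshop papers.

```python
# Random-forest importance as a SNP pre-screen.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
snps = rng.integers(0, 3, size=(500, 1000))        # 0/1/2 genotype coding
y = (snps[:, 7] + snps[:, 42] + rng.normal(size=500) > 2.5).astype(int)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(snps, y)
top = np.argsort(rf.feature_importances_)[::-1][:20]
print(top)  # SNPs 7 and 42 should appear near the front
```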


Subject(s)
Neural Networks, Computer , Polymorphism, Single Nucleotide , Genetic Predisposition to Disease , Genome, Human , Humans , Regression Analysis
14.
BMC Proc ; 1 Suppl 1: S60, 2007.
Article in English | MEDLINE | ID: mdl-18466561

ABSTRACT

The Genetic Analysis Workshop 15 Problem 3 simulated rheumatoid arthritis data set provided 100 replicates of simulated single-nucleotide polymorphism (SNP) and covariate data for 1500 families with an affected sib pair and 2000 controls, modeled after real rheumatoid arthritis data. The data generation model included nine unobserved trait loci, most of which have one or more of the generated SNPs associated with them. These data sets provide an ideal experimental test bed for evaluating new and old algorithms for selecting SNPs and covariates that can separate cases from controls, because the case and control labels are known, as are the identities of the trait loci. LASSO-Patternsearch is a new multi-step algorithm, with a LASSO-type penalized likelihood method at its core, specifically designed to detect and model interactions between important predictor variables. In this article the original LASSO-Patternsearch algorithm is modified to handle the large number of SNPs plus covariates. We start with a screen step within the framework of parametric logistic regression. The patterns that survive the screen step are further selected by a penalized logistic regression with the LASSO penalty. Finally, a parametric logistic regression model is built on the patterns that survive the LASSO step. In our analysis of the Genetic Analysis Workshop 15 Problem 3 data we identified most of the associated SNPs and relevant covariates. When the model was used as a classifier, very competitive error rates were obtained.
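The three stages above, compressed into a toy sketch on raw binary columns rather than patterns: (1) a univariate chi-square screen, (2) L1-penalized logistic selection among the survivors, (3) an ordinary logistic refit on what the LASSO keeps. Sizes and thresholds are illustrative assumptions.

```python
# Screen -> LASSO -> parametric refit, as a minimal pipeline.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.integers(0, 2, size=(400, 200))
y = (X[:, 3] + X[:, 9] + rng.normal(scale=0.7, size=400) > 1.2).astype(int)

# stage 1: univariate chi-square screen
keep = [j for j in range(X.shape[1])
        if chi2_contingency(np.histogram2d(X[:, j], y, bins=2)[0])[1] < 0.05]
# stage 2: LASSO-penalized logistic regression on the survivors
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(X[:, keep], y)
survivors = [keep[i] for i in np.flatnonzero(lasso.coef_[0])]
# stage 3: unpenalized parametric refit
final = LogisticRegression(penalty=None).fit(X[:, survivors], y)
print(survivors, final.coef_)
```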

15.
Proc Natl Acad Sci U S A ; 102(35): 12332-7, 2005 Aug 30.
Article in English | MEDLINE | ID: mdl-16109767

ABSTRACT

We develop and apply a previously undescribed framework that is designed to extract information in the form of a positive definite kernel matrix from possibly crude, noisy, incomplete, inconsistent dissimilarity information between pairs of objects, obtainable in a variety of contexts. Any positive definite kernel defines a consistent set of distances, and the fitted kernel provides a set of coordinates in Euclidean space that attempts to respect the information available while controlling for complexity of the kernel. The resulting set of coordinates is highly appropriate for visualization and as input to classification and clustering algorithms. The framework is formulated in terms of a class of optimization problems that can be solved efficiently by using modern convex cone programming software. The power of the method is illustrated in the context of protein clustering based on primary sequence data. An application to the globin family of proteins resulted in a readily visualizable 3D sequence space of globins, where several subfamilies and subgroupings consistent with the literature were easily identifiable.
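A compact sketch of the optimization idea under stated assumptions: fit a positive semidefinite kernel K whose induced squared distances K_ii + K_jj - 2K_ij match the noisy dissimilarities, trading an L1 fit against trace(K) as the complexity control, solved as a semidefinite program. The paper's formulation may differ in loss and constraints; lambda and the toy data are illustrative.

```python
# Regularized kernel estimation from noisy pairwise dissimilarities.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
pts = rng.normal(size=(8, 2))
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
D += rng.normal(scale=0.05, size=D.shape)          # noisy dissimilarities

n, lam = D.shape[0], 0.1
K = cp.Variable((n, n), PSD=True)
ones = np.ones((n, 1))
d = cp.reshape(cp.diag(K), (n, 1))                 # column of K_ii
fit = cp.sum(cp.abs(d @ ones.T + ones @ d.T - 2 * K - D))
prob = cp.Problem(cp.Minimize(fit + lam * cp.trace(K)))
prob.solve()
print(K.value[:3, :3])  # eigenvectors of K give embedding coordinates
```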


Subject(s)
Biometry , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Algorithms , Animals , Cluster Analysis , Globins/chemistry , Globins/genetics , Humans , Software
16.
Neuroimage ; 18(4): 950-61, 2003 Apr.
Article in English | MEDLINE | ID: mdl-12725770

ABSTRACT

Linear parametric regression models of fMRI time series have correlated residuals. One approach to address this problem is to condition the autocorrelation structure by temporal smoothing. Smoothing splines with the degree of smoothing selected by generalized cross-validation (GCV-spline) provide a method to find an optimal smoother for an fMRI time series. The purpose of this study was to determine if GCV-spline of fMRI time series yields unbiased variance estimates of linear regression model parameters. GCV-spline was evaluated with a real fMRI data set and bias of the variance estimator was computed for simulated time series with autocorrelation structures derived from fMRI data. This study only considered fMRI experimental designs of boxcar type. The results from the real data suggest that GCV-spline determines appropriate amounts of smoothing. The simulations show that the variance estimates are, on average, unbiased. The unbiased variance estimates come at some cost to signal detection efficiency. This study demonstrates that GCV-spline is an appropriate method for smoothing fMRI time series.
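Today the GCV-spline step can be reproduced with SciPy, whose make_smoothing_spline selects the smoothing penalty by generalized cross-validation when lam is omitted. The boxcar series below is an illustrative stand-in for a voxel time course, not the study's data.

```python
# GCV-selected smoothing spline for a boxcar-design time series.
import numpy as np
from scipy.interpolate import make_smoothing_spline

t = np.arange(120, dtype=float)                    # scan times
boxcar = (t % 40 < 20).astype(float)               # on/off task design
y = boxcar + np.random.default_rng(8).normal(scale=0.4, size=t.size)

spline = make_smoothing_spline(t, y)               # lam=None -> GCV choice
y_smooth = spline(t)
print(np.round(y_smooth[:10], 2))
```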


Subject(s)
Brain Mapping/methods , Brain/anatomy & histology , Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Bias , Brain/physiology , Computer Simulation , Darkness , Data Interpretation, Statistical , Humans , Linear Models , Photic Stimulation , Reference Values , Regression Analysis , Reproducibility of Results , Sensitivity and Specificity , Time Factors , Visual Cortex/anatomy & histology , Visual Cortex/physiology
17.
Proc Natl Acad Sci U S A ; 99(26): 16524-30, 2002 Dec 24.
Article in English | MEDLINE | ID: mdl-12477931

ABSTRACT

Reproducing kernel Hilbert space (RKHS) methods provide a unified context for solving a wide variety of statistical modelling and function estimation problems. We consider two such problems: We are given a training set {y_i, t_i, i = 1, ..., n}, where y_i is the response for the ith subject, and t_i is a vector of attributes for this subject. The value of y_i is a label that indicates which category it came from. For the first problem, we wish to build a model from the training set that assigns to each t in an attribute domain of interest an estimate of the probability p_j(t) that a (future) subject with attribute vector t is in category j. The second problem is in some sense less ambitious; it is to build a model that assigns to each t a label, which classifies a future subject with that t into one of the categories or possibly "none of the above." The approach to the first of these two problems discussed here is a special case of what is known as penalized likelihood estimation. The approach to the second problem is known as the support vector machine. We also note some alternate but closely related approaches to the second problem. These approaches are all obtained as solutions to optimization problems in RKHS. Many other problems, in particular the solution of ill-posed inverse problems, can be obtained as solutions to optimization problems in RKHS and are mentioned in passing. We caution the reader that although a large literature exists in all of these topics, in this inaugural article we are selectively highlighting work of the author, former students, and other collaborators.
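A toy contrast of the two problems in scikit-learn terms (an illustrative stand-in, not the article's estimators): an RBF support vector machine returns labels only, while the first problem requires probabilities p_j(t). Since scikit-learn has no exact kernel penalized-likelihood estimator, Platt scaling on the SVM margin is used below purely as a placeholder for problem 1.

```python
# Problem 2 (labels) vs. problem 1 (probabilities), both RKHS optimizations.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)           # labels only
print(svm.predict(X[:5]))

svm_p = SVC(kernel="rbf", C=1.0, probability=True).fit(X, y)
print(np.round(svm_p.predict_proba(X[:5]), 2))     # probability estimates
```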


Subject(s)
Classification/methods , Models, Statistical , Probability , Likelihood Functions , Mathematics