ABSTRACT
Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To accurately predict clinical outcomes, it is vital to build a predictive model that relates patients' molecular profiles to their survival. With complex relationships between survival and high-dimensional molecular predictors, it is challenging to conduct nonparametric modeling and remove irrelevant predictors simultaneously. In this article, we build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and nonparametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparisons with competing methods in simulations show that the proposed method consistently achieves better predictive accuracy. We apply this method to a multiple myeloma dataset and predict patients' death burden from their gene expression profiles. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
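A minimal sketch of the kind of model described above — not the authors' RegGKM algorithm — assuming a garrotized Gaussian kernel with per-predictor scale parameters, an RKHS-norm penalty on the fitted risk score, and a LASSO penalty on the scales; the toy data, optimizer, and all function names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def garrote_kernel(X, delta):
    # Gaussian kernel with nonnegative per-feature scales delta ("garrote" weights);
    # delta_j = 0 effectively removes predictor j from the nonparametric effect
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * delta[None, None, :]).sum(-1)
    return np.exp(-d2)

def neg_penalized_cox(params, X, time, status, lam_f, lam_delta):
    n, p = X.shape
    alpha, delta = params[:n], np.abs(params[n:])
    K = garrote_kernel(X, delta)
    f = K @ alpha                                  # kernel-machine risk score
    order = np.argsort(-time)                      # risk sets via reverse-time accumulation
    f_o, s_o = f[order], status[order]
    log_risk = np.logaddexp.accumulate(f_o)        # log sum of exp(f) over each risk set
    partial_loglik = np.sum(s_o * (f_o - log_risk))
    return (-partial_loglik
            + lam_f * alpha @ K @ alpha            # RKHS-norm penalty on the fitted function
            + lam_delta * np.sum(np.abs(delta)))   # LASSO penalty removing irrelevant predictors

# toy usage: 60 subjects, 5 candidate predictors (crude derivative-free optimization, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
time = rng.exponential(np.exp(-X[:, 0]))
status = rng.integers(0, 2, size=60)
x0 = np.concatenate([np.zeros(60), np.full(5, 0.5)])
fit = minimize(neg_penalized_cox, x0, args=(X, time, status, 1.0, 0.5), method="L-BFGS-B")
```

Predictors whose estimated scales shrink to (near) zero are treated as irrelevant and dropped from the nonparametric part of the model.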
Subjects
Algorithms, Neoplasms, Humans, Linear Models, Proportional Hazards Models, Computer Simulation, Neoplasms/genetics
ABSTRACT
Nonparametric feature selection for high-dimensional data is an important and challenging problem in statistics and machine learning. Most existing feature selection methods focus on parametric or additive models, which may suffer from model misspecification. In this paper, we propose a new framework for nonparametric feature selection in both regression and classification problems. Under this framework, we learn prediction functions through empirical risk minimization over a reproducing kernel Hilbert space. The space is generated by a novel tensor product kernel, which depends on a set of parameters that determine the importance of the features. Computationally, we minimize the empirical risk with a penalty to estimate the prediction and kernel parameters simultaneously. The solution can be obtained by iteratively solving convex optimization problems. We study the theoretical properties of the kernel feature space and prove the oracle selection property and Fisher consistency of the proposed method. Finally, we demonstrate the superior performance of our approach compared to existing methods via extensive simulation studies and applications to two real studies.
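A rough sketch of the ingredients described above, assuming a product-of-Gaussians tensor kernel whose per-feature parameters encode importance, a squared-error empirical risk, and a simple alternating scheme; the penalty choices and optimizer are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def importance_kernel(X, Z, theta):
    # product of per-feature Gaussian kernels; theta_j >= 0 encodes feature j's importance,
    # and theta_j = 0 drops feature j from the fitted function
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 * theta[None, None, :]).sum(-1)
    return np.exp(-d2)

def fit(X, y, lam=1e-2, mu=0.1, n_iter=10):
    n, p = X.shape
    theta = np.full(p, 1.0)
    for _ in range(n_iter):
        K = importance_kernel(X, X, theta)
        alpha = np.linalg.solve(K + lam * np.eye(n), y)   # convex step: kernel ridge for fixed theta

        def obj(t):                                       # penalized empirical risk in the kernel parameters
            Kt = importance_kernel(X, X, np.abs(t))
            r = y - Kt @ alpha
            return np.mean(r ** 2) + mu * np.sum(np.abs(t))

        theta = np.abs(minimize(obj, theta, method="Nelder-Mead").x)
    return alpha, theta
```

Features whose estimated importance parameters are driven to zero by the penalty are screened out.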
Subjects
Algorithms, Machine Learning, Computer Simulation
ABSTRACT
Motivated by the analysis of longitudinal neuroimaging studies, we study the longitudinal functional linear regression model under an asynchronous data setting for modeling the association between clinical outcomes and functional (or imaging) covariates. In the asynchronous data setting, both covariates and responses may be measured at irregular and mismatched time points, posing methodological challenges to existing statistical methods. We develop a kernel-weighted loss function with a roughness penalty to obtain the functional estimator and derive its representer theorem. The rate of convergence, a Bahadur representation, and the asymptotic pointwise distribution of the functional estimator are obtained under the reproducing kernel Hilbert space framework. We propose a penalized likelihood ratio test for the nullity of the functional coefficient, derive its asymptotic distribution under the null hypothesis, and investigate the separation rate under the alternative hypotheses. Simulation studies are conducted to examine the finite-sample performance of the proposed procedure. We apply the proposed methods to the analysis of multitype data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, which reveals a significant association between 21 regional brain volume density curves and cognitive function. Data used in the preparation of this paper were obtained from the ADNI database (adni.loni.usc.edu).
Subjects
Alzheimer Disease, Humans, Linear Models, Alzheimer Disease/diagnostic imaging, Computer Simulation, Algorithms, Likelihood Functions
ABSTRACT
Generalized linear models are flexible tools for the analysis of diverse datasets, but the classical formulation requires that the parametric component is correctly specified and the data contain no atypical observations. To address these shortcomings, we introduce and study a family of nonparametric full-rank and lower-rank spline estimators that result from the minimization of a penalized density power divergence. The proposed class of estimators is easily implementable, offers high protection against outlying observations, and can be tuned for arbitrarily high efficiency in the case of clean data. We show that under weak assumptions these estimators converge at a fast rate, and we illustrate their highly competitive performance in a simulation study and two real-data examples. Supplementary Information: The online version contains supplementary material available at 10.1007/s11749-023-00866-x.
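For concreteness, a hedged sketch of the density power divergence objective under a Gaussian error working model; the divergence parameter alpha and the Gaussian model are assumptions of this illustration. Minimizing it over spline coefficients (which determine the residuals), together with a roughness penalty, gives a robust fit in the spirit of the abstract:

```python
import numpy as np

def dpd_objective(resid, sigma, alpha=0.5):
    # Density power divergence objective for a Gaussian error model.
    # alpha -> 0 recovers the (non-robust) maximum likelihood fit; larger alpha
    # downweights outlying residuals at some cost in efficiency on clean data.
    dens = np.exp(-resid**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    integral_term = 1.0 / ((2 * np.pi) ** (alpha / 2) * sigma**alpha * np.sqrt(1 + alpha))
    return np.mean(integral_term - (1 + 1 / alpha) * dens**alpha)
```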
ABSTRACT
The goal of nonparametric regression is to recover an underlying regression function from noisy observations, under the assumption that the regression function belongs to a prespecified infinite-dimensional function space. In the online setting, in which the observations come in a stream, it is generally computationally infeasible to refit the whole model repeatedly. As yet, there are no methods that are both computationally efficient and statistically rate optimal. In this paper, we propose an estimator for online nonparametric regression. Notably, our estimator is an empirical risk minimizer in a deterministic linear space, which is quite different from existing methods that use random features and a functional stochastic gradient. Our theoretical analysis shows that this estimator obtains a rate-optimal generalization error when the regression function is known to live in a reproducing kernel Hilbert space. We also show, theoretically and empirically, that the computational cost of our estimator is much lower than that of other rate-optimal estimators proposed for this online setting.
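For context, a sketch of the kind of existing baseline the abstract contrasts with — online regression using random Fourier features and a functional stochastic gradient — rather than the proposed deterministic-linear-space estimator; the feature count, learning rate, and Gaussian-kernel bandwidth are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(x, W, b):
    # random Fourier features approximating a Gaussian-kernel RKHS
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

def online_rff_sgd(stream, dim, n_features=200, lr=0.05, gamma=1.0):
    # one pass over the data stream: each observation (x, y) is seen once and never refit
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(dim, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    theta = np.zeros(n_features)
    for x, y in stream:
        phi = rff(x, W, b)
        theta -= lr * (phi @ theta - y) * phi      # functional stochastic gradient step
    return lambda x_new: rff(x_new, W, b) @ theta  # fitted regression function
```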
ABSTRACT
Many methods have been developed to study nonparametric function-on-function regression models. Nevertheless, model selection approaches are lacking for a regression function that is itself functional and takes functional covariates as inputs. To study interaction effects among these functional covariates, in this article we first construct a tensor product space of reproducing kernel Hilbert spaces and build an analysis of variance (ANOVA) decomposition of the tensor product space. We then use a model selection method with the L1 criterion to estimate the functional regression function with functional covariate inputs and detect interaction effects among the functional covariates. The proposed method is evaluated using simulations and stroke rehabilitation data.
ABSTRACT
Sufficient dimension reduction (SDR) embodies a family of methods that aim to reduce dimensionality without loss of information in a regression setting. In this article, we propose a new method for nonparametric function-on-function SDR, where both the response and the predictor are functions. We first develop the notions of the functional central mean subspace and the functional central subspace, which form the population targets of our functional SDR. We then introduce an average Fréchet derivative estimator, which extends the gradient of the regression function to the operator level and enables us to develop estimators for our functional dimension reduction spaces. We show that the resulting functional SDR estimators are unbiased and exhaustive and, more importantly, do not impose any distributional assumptions such as the linearity or constant variance conditions commonly imposed by existing functional SDR methods. We establish the uniform convergence of the estimators for the functional dimension reduction spaces, while allowing both the number of Karhunen-Loève expansions and the intrinsic dimension to diverge with the sample size. We demonstrate the efficacy of the proposed methods through both simulations and two real data examples.
ABSTRACT
Brain-computer interface (BCI) technology allows people with disabilities to communicate with the physical environment. One of the most promising signals is the non-invasive electroencephalogram (EEG) signal. However, due to the non-stationary nature of EEGs, a subject's signal may change over time, which poses a challenge for models that work across time. Recently, domain adaptive learning (DAL) has shown superior performance in various classification tasks. In this paper, we propose a regularized reproducing kernel Hilbert space (RKHS) subspace learning algorithm with K-nearest neighbors (KNN) as the classifier for the task of motor imagery signal classification. First, we reformulate the framework of RKHS subspace learning with a rigorous mathematical derivation. Second, since the commonly used maximum mean discrepancy (MMD) criterion measures the discrepancy between distributions based on their means only and ignores the local information of the distributions, a regularization term based on source-domain linear discriminant analysis (SLDA) is proposed for the first time; it reduces the variance of similar data and increases the variance of dissimilar data to optimize the distribution of the source-domain data. Finally, the RKHS subspace framework is constructed sparsely in view of the sensitivity of BCI data. We test the proposed algorithm first on four standard datasets, and the experimental results show that adding SLDA improves the average accuracy of the baseline algorithms by 2-9%. In the motor imagery classification experiments, the average accuracy of our algorithm is 3% higher than that of the other algorithms, demonstrating the adaptability and effectiveness of the proposed algorithm.
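A small sketch of the maximum mean discrepancy criterion discussed above (the quantity the proposed SLDA regularizer is meant to complement), assuming a Gaussian kernel; it compares only the RKHS mean embeddings of the two domains, which is exactly the limitation the abstract points out:

```python
import numpy as np

def gaussian_gram(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X_source, X_target, gamma=1.0):
    # squared maximum mean discrepancy between source- and target-domain samples;
    # only the mean embeddings enter, so within-class structure is ignored
    k_ss = gaussian_gram(X_source, X_source, gamma).mean()
    k_tt = gaussian_gram(X_target, X_target, gamma).mean()
    k_st = gaussian_gram(X_source, X_target, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```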
ABSTRACT
KEY MESSAGE: Complementing or replacing genetic markers with transcriptomic data and using reproducing kernel Hilbert space regression based on Gaussian kernels increase hybrid prediction accuracies for complex agronomic traits in canola. In plant breeding, hybrids gained particular importance due to heterosis, the superior performance of offspring compared to their inbred parents. Since the development of new top-performing hybrids requires labour-intensive and costly breeding programmes, including testing of large numbers of experimental hybrids, the prediction of hybrid performance is of utmost interest to plant breeders. In this study, we tested the effectiveness of hybrid prediction models in spring-type oilseed rape (Brassica napus L./canola) employing different omics profiles, individually and in combination. To this end, a population of 950 F1 hybrids was evaluated for seed yield and six other agronomically relevant traits in commercial field trials at several locations throughout Europe. A subset of these hybrids was also evaluated in a climatized glasshouse for early biomass production. For each of the 477 parental rapeseed lines, 13,201 single nucleotide polymorphisms (SNPs), 154 primary metabolites, and 19,479 transcripts were determined and used as predictive variables. Both SNP markers and transcripts effectively predict hybrid performance using (genomic) best linear unbiased prediction (gBLUP) models. Compared to models using genetic markers alone, models incorporating transcriptome data resulted in significantly higher prediction accuracies for five out of seven agronomic traits, indicating that transcripts carry important information beyond genomic data. Notably, reproducing kernel Hilbert space regression based on Gaussian kernels significantly exceeded the predictive abilities of gBLUP models for six of the seven agronomic traits, demonstrating its potential for implementation in future canola breeding programmes.
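A hedged sketch of Gaussian-kernel reproducing kernel Hilbert space regression of the kind referred to above, in kernel-ridge form; the bandwidth, shrinkage parameter, and the idea of feeding in standardized SNP or transcript profiles are illustrative assumptions, not the study's exact pipeline:

```python
import numpy as np

def gaussian_kernel(M, bandwidth):
    # Gaussian kernel on rows of M (e.g., standardized SNP or transcript profiles of the lines)
    d2 = ((M[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / bandwidth)

def rkhs_predict(K, y, train_idx, test_idx, lam=1.0):
    # kernel ridge / RKHS regression: fit on phenotyped hybrids, predict untested ones
    K_tr = K[np.ix_(train_idx, train_idx)]
    alpha = np.linalg.solve(K_tr + lam * np.eye(len(train_idx)), y[train_idx])
    return K[np.ix_(test_idx, train_idx)] @ alpha
```

Replacing the Gaussian kernel matrix with a marker-based genomic relationship matrix recovers a gBLUP-style linear predictor, which is the comparison the study draws.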
Subjects
Brassica napus/genetics, Genetic Crosses, Plant Genome, Hybrid Vigor, Metabolome, Single Nucleotide Polymorphism, Transcriptome, Brassica napus/growth & development, Brassica napus/metabolism, Genetic Hybridization, Genetic Models, Phenotype, Plant Breeding, Quantitative Trait Loci, Seeds/genetics, Seeds/growth & development, Seeds/metabolism
ABSTRACT
Many dimensionality and model reduction techniques rely on estimating dominant eigenfunctions of associated dynamical operators from data. Important examples include the Koopman operator and its generator, but also the Schrödinger operator. We propose a kernel-based method for the approximation of differential operators in reproducing kernel Hilbert spaces and show how eigenfunctions can be estimated by solving auxiliary matrix eigenvalue problems. The resulting algorithms are applied to molecular dynamics and quantum chemistry examples. Furthermore, we exploit that, under certain conditions, the Schrödinger operator can be transformed into a Kolmogorov backward operator corresponding to a drift-diffusion process and vice versa. This allows us to apply methods developed for the analysis of high-dimensional stochastic differential equations to quantum mechanical systems.
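A bare-bones, kernel-EDMD-style sketch of estimating operator eigenfunctions from snapshot pairs by solving an auxiliary matrix eigenvalue problem; the Gaussian kernel, the ridge regularization, and the kernel-expansion convention for the eigenfunctions are assumptions of this illustration rather than the paper's exact construction:

```python
import numpy as np

def gram(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_operator_eigs(X, Y, sigma=1.0, reg=1e-6):
    # X: states at time t; Y: the same states propagated to time t + tau (snapshot pairs)
    G = gram(X, X, sigma)                               # Gram matrix on the inputs
    A = gram(X, Y, sigma)                               # cross Gram matrix with the propagated points
    M = np.linalg.solve(G + reg * np.eye(len(X)), A)    # regularized auxiliary matrix
    evals, V = np.linalg.eig(M)                         # auxiliary matrix eigenvalue problem
    # eigenfunctions are represented as kernel expansions phi_l(x) ~ sum_j V[j, l] * k(x, x_j)
    return evals, V
```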
ABSTRACT
Model-free variable selection has attracted increasing interest recently due to its flexibility in algorithmic design and outstanding performance in real-world applications. However, most existing statistical methods are formulated under the mean square error (MSE) criterion and are susceptible to non-Gaussian noise and outliers. Because the MSE criterion requires the data to satisfy a Gaussian noise condition, it can hamper the effectiveness of model-free methods in complex circumstances. To circumvent this issue, we present a new model-free variable selection algorithm that integrates kernel modal regression and gradient-based variable identification. The derived modal regression estimator is closely related to information-theoretic learning under the maximum correntropy criterion, and ensures algorithmic robustness to complex noise by replacing learning of the conditional mean with the conditional mode. The gradient information of the estimator offers a model-free metric for screening the key variables. Theoretically, we investigate the foundations of the new model in terms of generalization bounds and variable selection consistency. In applications, the effectiveness of the proposed method is verified by data experiments.
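A minimal sketch of the maximum-correntropy (modal regression) objective mentioned above, assuming a Gaussian correntropy kernel with bandwidth sigma; the bandwidth and the plain averaging are assumptions of this illustration:

```python
import numpy as np

def correntropy_objective(resid, sigma=1.0):
    # maximum-correntropy / modal-regression objective: a bounded criterion whose maximization
    # targets the conditional mode rather than the conditional mean, so large residuals from
    # outliers or heavy-tailed noise receive vanishing influence
    return np.mean(np.exp(-resid**2 / (2.0 * sigma**2)))
```

Maximizing this quantity over a kernel expansion of the regression function and then screening the input gradients of the fit reflects the spirit of the approach described in the abstract.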
ABSTRACT
In this paper, we consider a surrogate modeling approach using a data-driven nonparametric likelihood function constructed on a manifold on which the data lie (or to which they are close). The proposed method represents the likelihood function using a spectral expansion formulation known as the kernel embedding of the conditional distribution. To respect the geometry of the data, we employ this spectral expansion using a set of data-driven basis functions obtained from the diffusion maps algorithm. The theoretical error estimate suggests that the error bound of the approximate data-driven likelihood function is independent of the variance of the basis functions, which allows us to determine the amount of training data needed for accurate likelihood function estimation. Supporting numerical results demonstrating the robustness of the data-driven likelihood functions for parameter estimation are given on instructive examples involving stochastic and deterministic differential equations. When the dimension of the data manifold is strictly less than the dimension of the ambient space, we find that the proposed approach (which does not require knowledge of the data manifold) is superior to likelihood functions constructed using standard parametric basis functions defined on the ambient coordinates. In an example where the data manifold is not smooth and is unknown, the proposed method is more robust than an existing polynomial chaos surrogate model, the non-intrusive spectral projection, which assumes a parametric likelihood. In fact, the estimation accuracy is comparable to direct MCMC estimates with only eight likelihood function evaluations that can be done offline, as opposed to 4,000 sequential function evaluations, whenever direct MCMC can be performed. A robust and accurate estimation is also obtained using a likelihood function trained on statistical averages of the chaotic 40-dimensional Lorenz-96 model over a wide parameter domain.
ABSTRACT
We introduce a new semi-supervised classification method that extensively exploits knowledge. The method has three steps. First, the manifold regularization mechanism, adapted from the Laplacian support vector machine (LapSVM), is adopted to mine the manifold structure embedded in all training data, especially in the numerous label-unknown data. Meanwhile, by converting the labels into pairwise constraints, the pairwise constraint regularization formula (PCRF) is designed to compensate for the few but valuable labelled data. Second, by further combining the PCRF with the manifold regularization, the precise manifold and pairwise constraint jointly regularized formula (MPCJRF) is achieved. Third, by incorporating the MPCJRF into the framework of the conventional SVM, our approach, referred to as semi-supervised classification with extensive knowledge exploitation (SSC-EKE), is developed. The significance of our research is fourfold: 1) The MPCJRF is an underlying adjustment, with respect to the pairwise constraints, to the graph Laplacian enlisted for approximating the potential data manifold. This adjustment plays a correction role, as an unbiased estimation of the data manifold is difficult to obtain, whereas the pairwise constraints, converted from the given labels, have an overall high confidence level. 2) By transforming the values of the two terms in the MPCJRF so that they have the same range, with a trade-off factor varying within the interval [0, 1), the appropriate impact of the pairwise constraints on the graph Laplacian can be determined self-adaptively. 3) The implication of extensive knowledge exploitation is embodied in SSC-EKE. That is, the labelled examples are used not only to control the empirical risk but also to constitute the MPCJRF. Moreover, all data, both labelled and unlabelled, are recruited for model smoothness and manifold regularization. 4) The complete framework of SSC-EKE organically incorporates multiple theories, such as joint manifold and pairwise constraint-based regularization, smoothness in the reproducing kernel Hilbert space, empirical risk minimization, and spectral methods, which underpins the favourable classification accuracy as well as the generalizability of SSC-EKE.
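As a reference point, a hedged sketch of the graph-Laplacian manifold-regularization building block in a least-squares (LapRLS-style) form — not the full SSC-EKE formulation with pairwise constraints; the RBF similarity graph, the two regularization weights, and the constant factors on the penalties are illustrative assumptions:

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def laprls_fit(X_all, y_labelled, n_labelled, gamma=1.0, lam_rkhs=1e-2, lam_manifold=1e-1):
    # manifold-regularized least squares over labelled + unlabelled points:
    # empirical risk on the labels + RKHS-norm penalty + graph-Laplacian smoothness term
    n = X_all.shape[0]
    K = rbf(X_all, X_all, gamma)
    W = K                                             # similarity graph (here: the same RBF weights)
    L = np.diag(W.sum(axis=1)) - W                    # unnormalized graph Laplacian
    J = np.zeros((n, n))
    J[:n_labelled, :n_labelled] = np.eye(n_labelled)  # selects the labelled examples
    y = np.zeros(n)
    y[:n_labelled] = y_labelled
    alpha = np.linalg.solve(J @ K + lam_rkhs * np.eye(n) + lam_manifold * L @ K, y)
    return lambda X_new: rbf(X_new, X_all, gamma) @ alpha
```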
ABSTRACT
In this paper, we examine two widely used approaches for the development of surrogate models: the polynomial chaos expansion (PCE) and Gaussian process (GP) regression. The theoretical differences between the PCE and GP approximations are discussed. A state-of-the-art PCE approach is constructed based on high-precision quadrature points; however, the need for truncation may result in precision loss. The GP approach performs well on small datasets and allows a fine and precise trade-off between fitting the data and smoothing, but its overall performance depends largely on the training dataset. The reproducing kernel Hilbert space (RKHS) and Mercer's theorem are introduced to form a link between the two methods. The theorem shows that the two surrogates can be embedded in two isomorphic RKHSs, based on which we propose a novel method named Gaussian process on polynomial chaos basis (GPCB) that incorporates the PCE and GP. A theoretical comparison is made between the PCE and GPCB with the help of the Kullback-Leibler divergence. We show that the GPCB is as stable and accurate as the PCE method. Furthermore, the GPCB is a one-step Bayesian method that chooses the best subset of the RKHS in which the true function should lie, whereas the PCE method requires an adaptive procedure. Simulations of 1D and 2D benchmark functions show that GPCB outperforms both the PCE and classical GP methods. In order to solve high-dimensional problems, a random sampling scheme with a constructive design (i.e., a tensor product of quadrature points) is proposed to generate a valid training dataset for the GPCB method. This approach exploits the high numerical accuracy of the quadrature points while ensuring computational feasibility. Finally, the experimental results show that our sampling strategy achieves higher accuracy than classical experimental designs and is suitable for solving high-dimensional problems.
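A small sketch of the idea behind combining the two surrogates: build a Mercer-type covariance from a polynomial chaos basis and use it inside ordinary GP regression. The one-dimensional probabilists' Hermite basis, equal basis weights, and the noise level are assumptions of this illustration, not the paper's GPCB construction:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def pce_kernel(x, z, degree=5):
    # Mercer-type kernel assembled from the probabilists' Hermite polynomial chaos basis:
    # k(x, z) = sum_j He_j(x) He_j(z); x and z are one-dimensional input arrays
    basis = np.eye(degree + 1)                      # row j selects the j-th Hermite polynomial
    Phi_x = np.stack([hermeval(x, c) for c in basis], axis=1)
    Phi_z = np.stack([hermeval(z, c) for c in basis], axis=1)
    return Phi_x @ Phi_z.T

def gp_on_pce_mean(x_train, y_train, x_test, degree=5, noise=1e-6):
    # posterior mean of a zero-mean GP whose covariance is induced by the PCE basis
    K = pce_kernel(x_train, x_train, degree) + noise * np.eye(len(x_train))
    K_star = pce_kernel(x_test, x_train, degree)
    return K_star @ np.linalg.solve(K, y_train)
```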
ABSTRACT
Inference in mechanistic models of non-linear differential equations is a challenging problem in current computational statistics. Due to the high computational costs of numerically solving the differential equations in every step of an iterative parameter adaptation scheme, approximate methods based on gradient matching have become popular. However, these methods critically depend on the smoothing scheme used for function interpolation. The present article adapts an idea from manifold learning and demonstrates that a time warping approach aiming to homogenize intrinsic length scales can lead to a significant improvement in parameter estimation accuracy. We demonstrate the effectiveness of this scheme on noisy data from two dynamical systems with periodic limit cycles, a biopathway, and an application from soft-tissue mechanics. Our study also provides a comparative evaluation over a wide range of signal-to-noise ratios.
ABSTRACT
Genotype by environment interaction (G × E) in dairy cattle productive traits has been shown to exist, but current genetic evaluation methods do not take this component into account. As several environmental descriptors (e.g., climate, farming system) are known to vary within the United States, not accounting for G × E could lead to reranking of bulls and loss in genetic gain. Using test-day records on milk yield, somatic cell score, and fat and protein percentages from all over the United States, we computed within-herd-year-season daughter yield deviations for 1,087 Holstein bulls and regressed them on genetic and environmental information to estimate variance components and to assess prediction accuracy. Genomic information was obtained from a 50k SNP marker panel. Environmental inputs included herd (160 levels), geographical region (7 levels), geographical location (2 variables), climate information (7 variables), and management conditions of the herds (16 variables divided into 4 subgroups). For each set of environmental descriptors, environmental, genomic, and G × E components were fitted sequentially. Variance component estimates confirmed the presence of G × E for milk yield, with its effect being larger than the main genetic effect and the environmental effect for some models. Conversely, G × E was moderate for somatic cell score and small for milk composition. Genotype by environment interaction, when included, partially eroded the genomic effect (compared with the models where G × E was not included), suggesting that the genomic variance could at least in part be attributed to G × E not appropriately accounted for. Model predictive ability was assessed using 3 cross-validation schemes (new bulls, incomplete progeny test, and new environmental conditions), and performance was compared with a reference model including only the main genomic effect. In each scenario, at least 1 of the models including G × E performed better than the reference model, although no single model, based on one set of environmental descriptors, was best across all scenarios. In general, the methodology used is promising for accounting for G × E in genomic predictions, but challenges remain in identifying a unique set of covariates capable of describing the entire variety of environments.
Subjects
Cattle/genetics, Gene-Environment Interaction, Animals, Breeding, Climate, Environment, Female, Genome, Genomics, Genotype, Lactation/genetics, Male, Milk/metabolism, Phenotype
ABSTRACT
We consider a partially linear framework for modelling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring the heterogeneity of each sub-population. In particular, we propose an aggregation-type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracular result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of sub-populations. All of the above results require each sub-estimation to be regularized as though it used the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is the statistical inference for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
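A hedged sketch of the divide-and-conquer kernel ridge regression setting the theory covers: fit each sub-population (or data split) separately, regularizing as though it had the full sample size, and aggregate the predictions; the Gaussian kernel and simple averaging are illustrative choices:

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_sub(X, y, lam_full, gamma=1.0):
    # kernel ridge fit on one sub-population; the regularization level lam_full is
    # chosen as if the sub-sample had the entire sample size, as the theory requires
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + lam_full * np.eye(len(y)), y)
    return X, alpha

def dc_predict(sub_fits, X_new, gamma=1.0):
    # divide-and-conquer aggregation: average the sub-population predictions
    preds = [rbf(X_new, Xs, gamma) @ a for Xs, a in sub_fits]
    return np.mean(preds, axis=0)
```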
ABSTRACT
In Magnetic Resonance Imaging (MRI), data samples are collected in the spatial frequency domain (k-space), typically by time-consuming line-by-line scanning on a Cartesian grid. Scans can be accelerated by simultaneous acquisition of data using multiple receivers (parallel imaging) and by using more efficient non-Cartesian sampling schemes. To understand and design k-space sampling patterns, a theoretical framework is needed to analyze how well arbitrary sampling patterns reconstruct unsampled k-space using receive coil information. As shown here, reconstruction from samples at arbitrary locations can be understood as the approximation of vector-valued functions from the acquired samples and formulated using a Reproducing Kernel Hilbert Space (RKHS) with a matrix-valued kernel defined by the spatial sensitivities of the receive coils. This establishes a formal connection between approximation theory and parallel imaging. Theoretical tools from approximation theory can then be used to understand reconstruction in k-space and to extend the analysis of the effects of sample selection beyond the traditional image-domain g-factor noise analysis to both noise amplification and approximation errors in k-space. This is demonstrated with numerical examples.
ABSTRACT
In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased from non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In the literature, estimation of the Youden index has been widely studied via various statistical modeling strategies for the conditional density. This paper proposes a new model-free estimation method, which directly estimates the covariate-adjusted cut-point without estimating the conditional density. Consequently, the covariate-adjusted Youden index can be estimated based on the estimated cut-point. The proposed method formulates the estimation problem in a large-margin classification framework, which allows flexible modeling of the covariate-adjusted Youden index through kernel machines. The advantage of the proposed method is demonstrated in a variety of simulated experiments as well as a real application to the Pima Indians diabetes study.
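For reference, a sketch of the plain empirical Youden index and its optimal cut-point — the unadjusted quantity that the proposed kernel-machine method generalizes to covariate-adjusted cut-points; the convention that larger marker values indicate disease is an assumption of this illustration:

```python
import numpy as np

def youden_index(marker, diseased):
    # empirical Youden index J = max_c {sensitivity(c) + specificity(c) - 1}
    # and the cut-point c* at which it is attained (higher marker = more disease-like)
    cuts = np.unique(marker)
    sens = np.array([(marker[diseased == 1] >= c).mean() for c in cuts])
    spec = np.array([(marker[diseased == 0] < c).mean() for c in cuts])
    j = sens + spec - 1.0
    best = int(np.argmax(j))
    return j[best], cuts[best]
```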
Subjects
Biomarkers/analysis, Statistical Data Interpretation, Routine Diagnostic Tests/methods, Blood Glucose/analysis, Computer Simulation, Diabetes Mellitus/blood, Female, Glucose Tolerance Test, Humans, North American Indians, Male, Middle Aged
ABSTRACT
In this article, we establish a connection between a stochastic dynamic model (SDM) driven by a linear stochastic differential equation (SDE) and a Chebyshev spline, which enables researchers to borrow strength across fields both theoretically and numerically. We construct a differential operator for the penalty function and develop a reproducing kernel Hilbert space (RKHS) induced by the SDM and the Chebyshev spline. The general form of the linear SDE allows us to extend the well-known connection between an integrated Brownian motion and a polynomial spline to a connection between more complex diffusion processes and Chebyshev splines. One interesting special case is the connection between an integrated Ornstein-Uhlenbeck process and an exponential spline. We use two real data sets to illustrate the integrated Ornstein-Uhlenbeck process model and the exponential spline model and show that their estimates are almost identical.
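A simplified sketch of penalized regression in the RKHS induced by an Ornstein-Uhlenbeck covariance; the abstract's connection involves the integrated OU process and exponential splines, so the plain (non-integrated) OU kernel, the bandwidth, and the ridge-type penalty below are stand-in assumptions for illustration only:

```python
import numpy as np

def ou_kernel(s, t, theta=1.0, sigma=1.0):
    # covariance kernel of a stationary Ornstein-Uhlenbeck process
    return sigma**2 / (2.0 * theta) * np.exp(-theta * np.abs(s[:, None] - t[None, :]))

def penalized_smoother(t_obs, y, t_new, theta=1.0, lam=1e-2):
    # penalized regression in the RKHS induced by the OU covariance; the integrated OU
    # process of the abstract instead yields an exponential spline, for which this is a crude stand-in
    K = ou_kernel(t_obs, t_obs, theta)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return ou_kernel(t_new, t_obs, theta) @ alpha
```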