Results 1 - 20 of 26
1.
Am J Epidemiol ; 193(7): 1010-1018, 2024 07 08.
Article in English | MEDLINE | ID: mdl-38375692

ABSTRACT

The statistical analysis of omics data poses a great computational challenge given their ultra-high-dimensional nature and frequent between-features correlation. In this work, we extended the iterative sure independence screening (ISIS) algorithm by pairing ISIS with elastic-net (Enet) and 2 versions of adaptive elastic-net (adaptive elastic-net (AEnet) and multistep adaptive elastic-net (MSAEnet)) to efficiently improve feature selection and effect estimation in omics research. We subsequently used genome-wide human blood DNA methylation data from American Indian participants in the Strong Heart Study (n = 2235 participants; measured in 1989-1991) to compare the performance (predictive accuracy, coefficient estimation, and computational efficiency) of ISIS-paired regularization methods with that of a Bayesian shrinkage approach and traditional linear regression to identify an epigenomic multimarker of body mass index (BMI). ISIS-AEnet outperformed the other methods in prediction. In biological pathway enrichment analysis of genes annotated to BMI-related differentially methylated positions, ISIS-AEnet captured most of the enriched pathways in common for at least 2 of all the evaluated methods. ISIS-AEnet can favor biological discovery because it identifies the most robust biological pathways while achieving an optimal balance between bias and efficient feature selection. In the extended SIS R package, we also implemented ISIS paired with Cox and logistic regression for time-to-event and binary endpoints, respectively, and a bootstrap approach for the estimation of regression coefficients.


Subjects
Algorithms; Body Mass Index; DNA Methylation; Epigenomics; Humans; Epigenomics/methods; Female; Male; Bayes Theorem; Middle Aged; Epigenesis, Genetic; Aged; Biomarkers/blood
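
The screening step that the abstract above builds on can be sketched in a few lines. This is a minimal illustration of plain (non-iterative) sure independence screening, not the SIS R package: features are ranked by absolute marginal correlation with the response and the top d are kept. The function names `pearson` and `sis`, the default d = n / log(n), and the toy data are assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis(columns, y, d=None):
    """Return the indices of the d features with the largest |marginal correlation|."""
    n = len(y)
    if d is None:
        d = max(1, int(n / math.log(n)))  # assumed default screening size
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d]


# Toy data (hypothetical): 200 features, only features 0 and 1 drive the response.
rng = random.Random(0)
n, p = 100, 200
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * columns[0][i] - 2 * columns[1][i] + rng.gauss(0, 1) for i in range(n)]
kept = sis(columns, y)
```

A regularized model such as the elastic net would then be fitted on the `kept` columns only; the iterative variant repeats the screening on residuals from that fit.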
2.
J Bus Econ Stat ; 41(4): 1157-1172, 2023.
Article in English | MEDLINE | ID: mdl-38046827

ABSTRACT

Modeling and inference for heterogeneous data have gained great interest recently due to rapid developments in personalized marketing. Most existing regression approaches are based on the conditional mean and may require additional cluster information to accommodate data heterogeneity. In this paper, we propose a novel nonparametric resolution-wise regression procedure to provide an estimated distribution of the response instead of one single value. We achieve this by decomposing the information of the response and the predictors into resolutions and patterns respectively based on marginal binary expansions. The relationships between resolutions and patterns are modeled by penalized logistic regressions. Combining the resolution-wise prediction, we deliver a histogram of the conditional response to approximate the distribution. Moreover, we show a sure independence screening property and the consistency of the proposed method for growing dimensions. Simulations and a real estate valuation dataset further illustrate the effectiveness of the proposed method.

3.
Stat Med ; 42(13): 2082-2100, 2023 06 15.
Article in English | MEDLINE | ID: mdl-36951373

ABSTRACT

The increased availability of ultrahigh-dimensional biomarker data and the high demand of identifying biomarkers importantly related to survival outcomes made feature screening methods commonplace in the analysis of cancer genome data. When survival outcomes include endpoints of overall survival (OS) and time-to-progression (TTP), a high concordance is typically found in both endpoints in cancer studies, namely, patients' OS would most likely be extended when tumour progression is delayed. Existing screening procedures are often performed on a single survival endpoint only and may result in biased selection of features for OS in ignorance of disease progression. We propose a novel feature screening method by incorporating information of TTP into the selection of important biomarker predictors for more accurate inference of OS subsequent to disease progression. The proposal is based on the rank of correlation between individual features and the conditional distribution of OS given observations of TTP. It is advantageous for its flexible model nature, which requires no marginal model assumption for each endpoint, and its minimal computational cost for implementation. Theoretical results show its ranking consistency, sure screening and false rate control properties. Simulation results demonstrate that the proposed screener leads to more accurate feature selection than the method without considering the prior observations of disease progression. An application to breast cancer genome data illustrates its practical utility and facilitates disease classification using selected biomarker predictors.


Subjects
Breast Neoplasms; Humans; Female; Biomarkers; Disease Progression; Computer Simulation; Breast Neoplasms/diagnosis; Breast Neoplasms/genetics
4.
bioRxiv ; 2023 Feb 07.
Article in English | MEDLINE | ID: mdl-36798366

ABSTRACT

Mediation analysis is a useful tool in biomedical research to investigate how molecular phenotypes, such as gene expression, mediate the effect of an exposure on health outcomes. However, commonly used mean-based total mediation effect measures may suffer from cancellation of component-wise mediation effects of opposite directions in the presence of high-dimensional omics mediators. To overcome this limitation, a variance-based R-squared total mediation effect measure has been recently proposed, which, nevertheless, relies on the computationally intensive nonparametric bootstrap for confidence interval estimation. In this work, we formulate a more efficient two-stage cross-fitted estimation procedure for the R-squared measure. To avoid potential bias, we perform iterative Sure Independence Screening (iSIS) in two subsamples to exclude the non-mediators, followed by ordinary least squares (OLS) regressions for the variance estimation. We then construct confidence intervals based on the newly-derived closed-form asymptotic distribution of the R-squared measure. Extensive simulation studies demonstrate that the proposed procedure is hundreds of times more computationally efficient than the resampling-based method with comparable coverage probability. Furthermore, when applied to the Framingham Heart Study, the proposed method replicated the established finding of gene expression mediating age-related variation in systolic blood pressure and discovered the role of gene expression profiles in the relationship between sex and high-density lipoprotein cholesterol. The proposed cross-fitted interval estimation procedure is implemented in R package RsqMed.

5.
Biometrics ; 79(2): 926-939, 2023 06.
Article in English | MEDLINE | ID: mdl-35191015

ABSTRACT

Microarray studies, in order to identify genes associated with an outcome of interest, usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables (EIV) models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely, corrected penalized marginal screening (PMSc) and corrected sure independence screening (SISc), to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features substantially, even when the number of covariates grows exponentially with sample size. In addition, if the true covariates are weakly correlated, we show that PMSc can achieve full variable selection consistency. Through a simulation study and an analysis of gene expression data for bone mineral density of Norwegian women, we demonstrate that the two new screening procedures make estimation of linear EIV models computationally scalable in high-dimensional settings, and improve finite sample estimation and selection performance compared with estimators that do not employ a screening stage.


Subjects
Computer Simulation; Female; Humans; Microarray Analysis; Sample Size
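
To illustrate why a marginal regression needs correcting when a covariate is measured with error, the sketch below uses the classical attenuation correction: the measurement-error variance is subtracted from the observed covariate variance before forming the slope. This is a textbook device shown under stated assumptions, not the PMSc or SISc procedure from the abstract above; the function names and the assumption that the error variance `sigma_u2` is known are hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance with divisor n - 1."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)


def naive_slope(w, y):
    """Marginal least-squares slope of y on the contaminated covariate w."""
    return cov(w, y) / cov(w, w)


def corrected_slope(w, y, sigma_u2):
    """Slope after removing the (assumed known) error variance from var(w)."""
    denom = cov(w, w) - sigma_u2
    if denom <= 0:
        raise ValueError("error variance exceeds observed covariate variance")
    return cov(w, y) / denom


# Toy data (hypothetical): true slope 2, covariate observed with additive noise.
rng = random.Random(1)
n = 5000
sigma_u2 = 0.5
x = [rng.gauss(0, 1) for _ in range(n)]
w = [xi + rng.gauss(0, sigma_u2 ** 0.5) for xi in x]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
naive = naive_slope(w, y)
corrected = corrected_slope(w, y, sigma_u2)
```

The naive slope is shrunk toward zero because var(w) includes the noise variance; a screening rule that ranks features by the corrected statistic avoids penalizing noisier covariates.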
6.
J Am Stat Assoc ; 117(539): 1516-1529, 2022.
Article in English | MEDLINE | ID: mdl-36172297

ABSTRACT

Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p >> n), only a small share of which is truly associated with the response. In these settings, major concerns on computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis. Along the lines of Sure Independence Screening (Fan and Lv, 2008) and other model- and correlation-based feature screening methods, we propose a model-free procedure called Covariate Information Number - Sure Independence Screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher Information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods.

7.
J Appl Stat ; 49(7): 1848-1864, 2022.
Article in English | MEDLINE | ID: mdl-35707564

ABSTRACT

In recent years, numerous feature screening schemes have been developed for ultra-high dimensional standard survival data with only one failure event. Nevertheless, existing literature pays little attention to related investigations for competing risks data, in which subjects suffer from multiple mutually exclusive failures. In this article, we develop a new marginal feature screening for ultra-high dimensional time-to-event data to allow for competing risks. The proposed procedure is model-free, and robust against heavy-tailed distributions and potential outliers for time to the type of failure of interest. Apart from this, it is invariant to any monotone transformation of event time of interest. Under rather mild assumptions, it is shown that the newly suggested approach possesses the ranking consistency and sure independence screening properties. Some numerical studies are conducted to evaluate the finite-sample performance of our method and make a comparison with its competitor, while an application to a real data set is provided to serve as an illustration.

8.
Onco (Basel) ; 2(4): 305-318, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37066112

ABSTRACT

Background: Advances in sequencing technologies have allowed collection of massive genome-wide information that substantially advances lung cancer diagnosis and prognosis. Identifying influential markers for clinical endpoints of interest has been an indispensable and critical component of the statistical analysis pipeline. However, classical variable selection methods are not feasible or reliable for high-throughput genetic data. Our objective is to propose a model-free gene screening procedure for high-throughput right-censored data, and to develop a predictive gene signature for lung squamous cell carcinoma (LUSC) with the proposed procedure. Methods: A gene screening procedure was developed based on a recently proposed independence measure. The Cancer Genome Atlas (TCGA) data on LUSC was then studied. The screening procedure was conducted to narrow down the set of influential genes to 378 candidates. A penalized Cox model was then fitted to the reduced set, which further identified a 6-gene signature for LUSC prognosis. The 6-gene signature was validated on datasets from the Gene Expression Omnibus. Results: Both model-fitting and validation results reveal that our method selected influential genes that lead to biologically sensible findings as well as better predictive performance, compared to existing alternatives. According to our multivariable Cox regression analysis, the 6-gene signature was indeed a significant prognostic factor (p-value < 0.001) while controlling for clinical covariates. Conclusions: Gene screening as a fast dimension reduction technique plays an important role in analyzing high-throughput data. The main contribution of this paper is to introduce a fundamental yet pragmatic model-free gene screening approach that aids statistical analysis of right-censored cancer data, and provide a lateral comparison with other available methods in the context of LUSC.

9.
BMC Bioinformatics ; 22(1): 414, 2021 Aug 23.
Article in English | MEDLINE | ID: mdl-34425752

ABSTRACT

BACKGROUND: Environmental exposures can regulate intermediate molecular phenotypes, such as gene expression, by different mechanisms and thereby lead to various health outcomes. It is of significant scientific interest to unravel the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposure and traits. Mediation analysis is an important tool for investigating such relationships. However, it has mainly focused on low-dimensional settings, and there is a lack of a good measure of the total mediation effect. Here, we extend an R-squared (R²) effect size measure, originally proposed in the single-mediator setting, to the moderate- and high-dimensional mediator settings in the mixed model framework. RESULTS: Based on extensive simulations, we compare our measure and estimation procedure with several frequently used mediation measures, including product, proportion, and ratio measures. Our R²-based second-moment measure has small bias and variance under the correctly specified model. To mitigate potential bias induced by non-mediators, we examine two variable selection procedures, i.e., iterative sure independence screening and false discovery rate control, to exclude the non-mediators. We establish the consistency of the proposed estimation procedures and introduce a resampling-based confidence interval. By applying the proposed estimation procedure, we found that 38% of the age-related variations in systolic blood pressure can be explained by gene expression profiles in the Framingham Heart Study of 1711 individuals. An R package "RsqMed" is available on CRAN. CONCLUSION: R-squared (R²) is an effective and efficient measure for total mediation effect, especially in high-dimensional settings.


Subjects
Longitudinal Studies; Humans
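
For orientation, the single-mediator R-squared measure that the abstract above extends is commonly written as follows. The notation is an assumption introduced here, since the abstract gives no formula: Y is the outcome, X the exposure, M the mediator (or set of mediators), and each term is the proportion of variance in Y explained by the regressors in its subscript.

```latex
R^2_{\mathrm{Med}} = R^2_{Y,M} + R^2_{Y,X} - R^2_{Y,MX}
```

Read this way, the measure is the share of outcome variance that the exposure and the mediators explain in common. Because it is built from variances rather than signed effects, component effects of opposite direction do not cancel.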
10.
Comb Chem High Throughput Screen ; 23(8): 740-756, 2020.
Article in English | MEDLINE | ID: mdl-32342803

ABSTRACT

AIM AND OBJECTIVE: Near Infrared (NIR) spectroscopy data are characterized by a few dozen to many thousands of samples and highly correlated variables. Quantitative analysis of such data usually requires a combination of analytical methods with variable selection or screening methods. Commonly-used variable screening methods fail to recover the true model when (i) some of the variables are highly correlated, and (ii) the sample size is less than the number of relevant variables. In these cases, Partial Least Squares (PLS) regression based approaches can be useful alternatives. MATERIALS AND METHODS: In this research, a fast variable screening strategy, namely the preconditioned screening for ridge partial least squares regression (PSRPLS), is proposed for modelling NIR spectroscopy data with high-dimensional and highly correlated covariates. Under rather mild assumptions, we prove that using the Puffer transformation, the proposed approach successfully transforms the problem of variable screening with highly correlated predictor variables to that of weakly correlated covariates with little extra computational effort. RESULTS: We show that our proposed method leads to theoretically consistent model selection results. Four simulation studies and two real examples are then analyzed to illustrate the effectiveness of the proposed approach. CONCLUSION: By introducing the Puffer transformation, the high-correlation problem can be mitigated using the PSRPLS procedure we construct. By employing RPLS regression in our approach, it becomes simpler and more computationally efficient at coping with situations where the model size is larger than the sample size, while maintaining high prediction precision.


Subjects
Soil/chemistry; Spectroscopy, Near-Infrared/methods; Computer Simulation; Databases, Chemical; Least-Squares Analysis; Models, Theoretical; Monte Carlo Method
11.
Proc IEEE Int Symp Biomed Imaging ; 2019: 404-408, 2019 Apr.
Article in English | MEDLINE | ID: mdl-32256966

ABSTRACT

Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder, and behavioral treatment interventions have shown promise for young children with ASD. However, there is limited progress in understanding the effect of each type of treatment. In this project, we aim to detect structural changes in the brain after treatment and select structural features associated with treatment outcomes. The difficulty in building large databases of patients who have received specific treatments and the high dimensionality of medical image analysis problems are the challenges in this work. To select predictive features and build accurate models, we use the sure independence screening (SIS) method. SIS is a theoretically and empirically validated method for ultra-high dimensional general linear models, and it achieves both predictive accuracy and correct feature selection by iterative feature selection. Compared with step-wise feature selection methods, SIS removes multiple features in each iteration and is computationally efficient. Compared with other linear models such as elastic-net regression, support vector regression (SVR) and partial least squares regression (PLSR), SIS achieves higher accuracy. We validated the superior performance of SIS in various experiments: First, we extract brain structural features from FreeSurfer, including cortical thickness, surface area, mean curvature and cortical volume. Next, we predict different measures of treatment outcomes based on structural features. We show that SIS achieves the highest correlation between prediction and measurements in all tasks. Furthermore, we report regions selected by SIS as biomarkers for ASD.

12.
J Stat Comput Simul ; 87(14): 2708-2723, 2017.
Article in English | MEDLINE | ID: mdl-29075047

ABSTRACT

Screening procedures play an important role in data analysis, especially in high-throughput biological studies where the datasets consist of more covariates than independent subjects. In this article, a Bayesian screening procedure is introduced for the binary response models with logit and probit links. In contrast to many screening rules based on marginal information involving one or a few covariates, the proposed Bayesian procedure simultaneously models all covariates and uses closed-form screening statistics. Specifically, we use the posterior means of the regression coefficients as screening statistics; by imposing a generalized g-prior on the regression coefficients, we derive the analytical form of their posterior means and compute the screening statistics without Markov chain Monte Carlo implementation. We evaluate the utility of the proposed Bayesian screening method using simulations and real data analysis. When the sample size is small, the simulation results suggest improved performance with comparable computational cost.

13.
BMC Genomics ; 18(Suppl 1): 950, 2017 01 25.
Article in English | MEDLINE | ID: mdl-28198665

ABSTRACT

BACKGROUND: This study explores the key genes and signaling transduction pathways related to the survival time of glioblastoma multiforme (GBM) patients. RESULTS: Our results not only showed that the mutually explored GBM survival-time-related genes and signaling transduction pathways are closely related to GBM, but also demonstrated that our new constrained optimization algorithm (the CoxSisLasso strategy) performs better than the classical methods (the CoxLasso and CoxSis strategies). CONCLUSION: We analyzed why the CoxSisLasso strategy can outperform the existing classical methods and discuss how to extend this research in the distant future.


Subjects
Brain Neoplasms/genetics; Brain Neoplasms/metabolism; Brain Neoplasms/mortality; Gene Expression Regulation, Neoplastic; Glioblastoma/genetics; Glioblastoma/metabolism; Glioblastoma/mortality; Signal Transduction; Survival Analysis; Algorithms; Gene Expression Profiling; Genetic Association Studies; Humans; Prognosis; Proportional Hazards Models; ROC Curve; Workflow
14.
J Comput Graph Stat ; 26(4): 803-813, 2017.
Article in English | MEDLINE | ID: mdl-30532512

ABSTRACT

Feature screening plays an important role in dimension reduction for ultrahigh-dimensional data. In this paper, we introduce a new feature screening method and establish its sure independence screening property under the ultrahigh-dimensional setting. The proposed method works based on the nonparanormal transformation and Henze-Zirkler's test; that is, it first transforms the response variable and features to Gaussian random variables using the nonparanormal transformation and then tests the dependence between the response variable and features using the Henze-Zirkler's test. The proposed method enjoys at least two merits. First, it is model-free, which avoids the specification of a particular model structure. Second, it is condition-free, which does not require any extra conditions except for some regularity conditions for high-dimensional feature screening. The numerical results indicate that, compared to the existing methods, the proposed method is more robust to the data generated from heavy-tailed distributions and/or complex models with interaction variables. The proposed method is applied to screening of anticancer drug response genes.
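
The first step described above, mapping each variable to an approximately Gaussian one, can be sketched with a rank-based normal-scores transform. This is a simplified stand-in for the nonparanormal transformation: it uses the plain rank divided by n + 1 where the published estimator uses a truncated empirical distribution function, and the function name `normal_scores` is hypothetical. The Henze-Zirkler test itself is not shown.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to Gaussian scores through their ranks (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is always defined
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


# Toy input (hypothetical): a heavily right-skewed sample.
skewed = [0.1, 2.5, 0.3, 40.0, 1.2]
z = normal_scores(skewed)
```

Because only ranks enter the transform, an extreme value such as 40.0 is pulled in to an ordinary Gaussian quantile, which is one reason rank-based screeners tolerate heavy tails.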

15.
Ann Stat ; 44(2): 515-539, 2016.
Article in English | MEDLINE | ID: mdl-27242388

ABSTRACT

We consider an independence feature screening technique for identifying explanatory variables that locally contribute to the response variable in high-dimensional regression analysis. Without requiring a specific parametric form of the underlying data model, our approach accommodates a wide spectrum of nonparametric and semiparametric model families. To detect the local contributions of explanatory variables, our approach constructs empirical likelihood locally in conjunction with marginal nonparametric regressions. Since our approach actually requires no estimation, it is advantageous in scenarios such as the single-index models where even specification and identification of a marginal model is an issue. By automatically incorporating the level of variation of the nonparametric regression and directly assessing the strength of data evidence supporting local contribution from each explanatory variable, our approach provides a unique perspective for solving feature screening problems. Theoretical analysis shows that our approach can handle data dimensionality growing exponentially with the sample size. With extensive theoretical illustrations and numerical examples, we show that the local independence screening approach performs promisingly.

16.
Biometrics ; 72(4): 1145-1154, 2016 12.
Article in English | MEDLINE | ID: mdl-26910137

ABSTRACT

Motivated by ultrahigh-dimensional biomarker screening studies, we propose a model-free screening approach tailored to censored lifetime outcomes. Our proposal is built upon the introduction of a new measure, survival impact index (SII). By its design, SII sensibly captures the overall influence of a covariate on the outcome distribution, and can be estimated with familiar nonparametric procedures that do not require smoothing and are readily adaptable to handle lifetime outcomes under various censoring and truncation mechanisms. We provide large sample distributional results that facilitate the inference on SII in classical multivariate settings. More importantly, we investigate SII as an effective screener for ultrahigh-dimensional data, not relying on rigid regression model assumptions for real applications. We establish the sure screening property of the proposed SII-based screener. Extensive numerical studies are carried out to assess the performance of our method compared with other existing screening methods. A lung cancer microarray data set is analyzed to demonstrate the practical utility of our proposals.


Subjects
Biometry/methods; Lung Neoplasms/mortality; Models, Statistical; Survival Analysis; Biomarkers; Computer Simulation; Humans; Microarray Analysis; Statistics, Nonparametric
17.
BMC Syst Biol ; 10(Suppl 4): 118, 2016 Dec 23.
Article in English | MEDLINE | ID: mdl-28155690

ABSTRACT

BACKGROUND: High-throughput technology can generate thousands to millions of biomarker measurements in one experiment. However, results from high-throughput analysis are often barely reproducible due to small sample size. Different statistical methods have been proposed to tackle this "small n and large p" scenario; for example, different datasets can be pooled or integrated to improve reproducibility. However, the raw data are often either unavailable or hard to integrate due to different experimental conditions, so there is an emerging need to develop a method for "knowledge integration" in high-throughput data analysis. RESULTS: In this study, we proposed an integrative prescreening approach, SKI, for high-throughput data analysis. A new rank is generated based on two initial ranks: (1) knowledge based rank; and (2) marginal correlation based rank. Our simulation shows that SKI outperforms other methods without knowledge integration, achieving a higher true positive rate given the same number of variables selected. We also applied our method in a drug response study and found its performance to be better than regular screening methods. CONCLUSION: The proposed method provides an effective way to integrate knowledge for high-throughput analysis. It can easily be implemented with our provided R package named SKI.


Subjects
Data Interpretation, Statistical; Genomics/methods; Neoplasms/drug therapy; Neoplasms/genetics; Systems Integration; Treatment Outcome
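
The idea of generating a new rank from a knowledge-based rank and a marginal-correlation-based rank can be sketched as below. The combination rule (a weighted geometric mean of the two ranks), the weight `alpha`, and the function names are assumptions chosen for illustration; the abstract does not state the rule used by the SKI package, and this sketch does not reproduce that package.

```python
def ranks_desc(scores):
    """Rank 1 goes to the largest score; ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def integrated_order(knowledge_scores, marginal_corrs, alpha=0.5):
    """Order features by a weighted geometric mean of the two ranks (assumed rule).

    alpha = 1 uses prior knowledge only; alpha = 0 uses marginal correlation only.
    """
    rk = ranks_desc(knowledge_scores)
    rc = ranks_desc([abs(c) for c in marginal_corrs])
    combined = [(rk[j] ** alpha) * (rc[j] ** (1 - alpha)) for j in range(len(rk))]
    return sorted(range(len(combined)), key=lambda j: combined[j])


# Toy input (hypothetical): four features with prior scores and correlations.
knowledge = [0.5, 0.1, 0.9, 0.2]
corrs = [0.2, -0.8, 0.6, 0.05]
order = integrated_order(knowledge, corrs)
```

Feature 2 comes first here because it ranks well on both criteria, while feature 1, which has the strongest correlation but weak prior support, is moved down.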
18.
J Am Stat Assoc ; 111(513): 169-179, 2016.
Article in English | MEDLINE | ID: mdl-28127109

ABSTRACT

This paper is concerned with the problem of feature screening for multi-class linear discriminant analysis under ultrahigh dimensional setting. We allow the number of classes to be relatively large. As a result, the total number of relevant features is larger than usual. This makes the related classification problem much more challenging than the conventional one, where the number of classes is small (very often two). To solve the problem, we propose a novel pairwise sure independence screening method for linear discriminant analysis with an ultrahigh dimensional predictor. The proposed procedure is directly applicable to the situation with many classes. We further prove that the proposed method is screening consistent. Simulation studies are conducted to assess the finite sample performance of the new procedure. We also demonstrate the proposed methodology via an empirical analysis of a real life example on handwritten Chinese character recognition.
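
A pairwise flavor of screening for many classes can be sketched as follows: a feature is kept if at least one pair of classes differs noticeably in its mean. The statistic (largest absolute difference in class means), the fixed `threshold`, and the assumption that features are already on a common scale are illustrative choices, not the exact procedure or tuning from the paper above.

```python
from itertools import combinations


def pairwise_screen(columns, labels, threshold):
    """Keep feature j if some pair of classes differs in mean by more than threshold."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    kept = []
    for j, col in enumerate(columns):
        # Class-wise means of feature j
        means = {}
        for k in classes:
            vals = [v for v, lab in zip(col, labels) if lab == k]
            means[k] = sum(vals) / len(vals)
        # Largest gap over all class pairs
        gap = max(abs(means[a] - means[b]) for a, b in combinations(classes, 2))
        if gap > threshold:
            kept.append(j)
    return kept


# Toy input (hypothetical): feature 0 separates class "c"; feature 1 separates nothing.
labels = ["a", "a", "b", "b", "c", "c"]
columns = [
    [0.0, 0.2, 0.1, -0.1, 3.0, 3.2],
    [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],
]
kept = pairwise_screen(columns, labels, threshold=1.0)
```

Screening pair by pair lets a feature that distinguishes only two of many classes survive, whereas a single overall statistic could dilute that signal.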

19.
Genetics ; 201(3): 865-70, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26405029

ABSTRACT

There has been a continuing interest in approaches that analyze pairwise locus-by-locus (epistasis) interactions using multilocus association models in genome-wide data sets. In this paper, we suggest an approach that uses sure independence screening to first lower the dimension of the problem by considering the marginal importance of each interaction term within the huge loop. Subsequent multilocus association steps are executed using an extended Bayesian least absolute shrinkage and selection operator (LASSO) model and fast generalized expectation-maximization estimation algorithms. The potential of this approach is illustrated and compared with PLINK software using data examples where phenotypes have been simulated conditionally on marker data from the Quantitative Trait Loci Mapping and Marker Assisted Selection (QTLMAS) Workshop 2008 and real pig data sets.


Subjects
Chromosome Mapping; Epistasis, Genetic; Swine/genetics; Animals; Genome; Models, Genetic; Quantitative Trait Loci; Software
20.
Anal Chim Acta ; 870: 45-55, 2015 Apr 22.
Article in English | MEDLINE | ID: mdl-25819786

ABSTRACT

Bioactive component identification is a crucial issue in the search for new drug leads. We provide a new strategy to search for bioactive components based on Sure Independence Screening (SIS) and interval PLS (iPLS). The method, termed SIS-iPLS, is able not only to find the chief bioactive components, but also to judge how many components are responsible for the total bioactivity. The method is entirely "data-driven", with no need for prior knowledge about the unknown mixture analyzed, and is therefore especially suitable for effect-directed work such as bioassay-guided fractionation. Two data sets, a synthetic mixture system of twelve components and a suite of Radix Puerariae Lobatae extract samples, are used to test the identification ability of the SIS-iPLS method.


Subjects
Biological Products/analysis; Biological Products/pharmacology; Chromatography/methods; Analytic Sample Preparation Methods; Antioxidants/analysis; Antioxidants/pharmacology; Biological Assay; Iron/chemistry; Least-Squares Analysis; Oxidation-Reduction/drug effects; Pueraria/chemistry; Reproducibility of Results