Results 1 - 20 of 182
1.
Proc Natl Acad Sci U S A ; 121(33): e2403210121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110727

ABSTRACT

Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, including computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real-data analyses of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory usage of current state-of-the-art methods, with robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
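As a rough illustration of the core building block, the sketch below fits an L0L2-penalized regression path with the L0Learn R package on simulated individual-level data; the summary-statistics input and the ensembling across the tuning grid that define ALL-Sum are not reproduced here, so treat this as a sketch of the penalty, not of the ALL-Sum API.

```r
# Sketch: L0L2-penalized regression over a grid of tuning parameters,
# the penalty ALL-Sum builds on (illustrative; simulated data).
library(L0Learn)

set.seed(1)
n <- 500; p <- 2000
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(20, sd = 0.5), rep(0, p - 20))   # sparse "genetic" effects
y <- drop(X %*% beta + rnorm(n))

# L0L2 path: nGamma ridge (L2) levels crossed with a path of L0 levels
fit <- L0Learn.fit(X, y, penalty = "L0L2", nGamma = 5, maxSuppSize = 100)

# Cross-validate over (gamma, lambda); ALL-Sum instead ensembles
# predictions across the whole tuning grid
cvfit <- L0Learn.cvfit(X, y, penalty = "L0L2", nGamma = 5, maxSuppSize = 100)
```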


Subjects
Genome-Wide Association Study, Multifactorial Inheritance, Humans, Multifactorial Inheritance/genetics, Genome-Wide Association Study/methods, Machine Learning, Genetic Predisposition to Disease, Single Nucleotide Polymorphism
2.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38836403

ABSTRACT

In precision medicine, predicting an individual's disease susceptibility and forecasting their disease-free survival are key research areas. Besides the classical epidemiological predictor variables, data from multiple (omic) platforms are increasingly available. To integrate this wealth of information, we propose new methodology that combines cooperative learning, a recent approach to leveraging the predictive power of several datasets, with polygenic hazard score models. Polygenic hazard score models provide a practitioner with a more differentiated view of the predicted disease-free survival than a mere point estimate, for instance one computed with a polygenic risk score. Our aim is to leverage the advantages of cooperative learning for the computation of polygenic hazard score models via Cox's proportional hazards model, thereby improving the prediction of disease-free survival. In our experimental study, we apply the methodology to forecast disease-free survival for Alzheimer's disease (AD) using three layers of data. One layer contains epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, and 10 leading principal components. Another layer contains selected genomic loci, and the last layer contains methylation data for selected CpG sites. We demonstrate that the survival curves computed via cooperative learning yield an AUC of around 0.7, above the state-of-the-art performance of competing methods. Importantly, the proposed methodology returns (1) a linear score that can be easily interpreted (in contrast to machine learning approaches), and (2) a weighting of the predictive power of the involved data layers, allowing an assessment of the importance of each omic (or other) platform. As with polygenic hazard score models, our methodology also allows one to compute individual survival curves for each patient.
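The cooperative learning construction can be sketched for a squared-error loss, where the agreement penalty between two data views reduces to a plain lasso on row-augmented data (following Ding and Tibshirani's formulation); the Cox/hazard extension proposed in the paper is not reproduced here.

```r
# Sketch: cooperative learning for two views via an augmented lasso.
# Objective: ||y - X bx - Z bz||^2/2 + rho/2 ||X bx - Z bz||^2 + penalty,
# solved by an ordinary lasso on augmented rows (squared-error case only).
library(glmnet)

set.seed(1)
n <- 200; px <- 50; pz <- 30
X <- matrix(rnorm(n * px), n, px)   # e.g. genomic layer
Z <- matrix(rnorm(n * pz), n, pz)   # e.g. methylation layer
y <- X[, 1] + Z[, 1] + rnorm(n)

rho <- 0.5   # strength of the agreement penalty between views
Xaug <- rbind(cbind(X, Z),
              cbind(-sqrt(rho) * X, sqrt(rho) * Z))
yaug <- c(y, rep(0, n))

fit <- cv.glmnet(Xaug, yaug, standardize = FALSE)
head(coef(fit, s = "lambda.min"))
```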


Subjects
Alzheimer Disease, Precision Medicine, Humans, Precision Medicine/methods, Alzheimer Disease/genetics, Alzheimer Disease/mortality, Disease-Free Survival, Machine Learning, Proportional Hazards Models, Multifactorial Inheritance, Male, Female, Multiomics
3.
Genet Epidemiol ; 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38982682

ABSTRACT

The prediction of an individual's susceptibility to a certain disease is an important and timely research area. An established technique is to estimate the risk of an individual with the help of an integrated risk model, that is, a polygenic risk score with added epidemiological covariates. However, integrated risk models do not capture any time dependence and provide only a point estimate of the relative risk with respect to a reference population. The aim of this work is twofold. First, we explore and advocate the idea of predicting the time-dependent hazard and survival (defined as disease-free time) of an individual for the onset of a disease. This provides a practitioner with a much more differentiated view of absolute survival as a function of time. Second, to compute the time-dependent risk of an individual, we use published methodology to fit a Cox proportional hazards model to data from a genetic SNP study of time to Alzheimer's disease (AD) onset, using the lasso to incorporate further epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, 10 leading principal components, and selected genomic loci. We apply the lasso for Cox's proportional hazards to a data set of 6792 subjects (4102 AD cases and 2690 controls) and 87 covariates. We demonstrate that fitting a lasso model for Cox's proportional hazards allows one to obtain more accurate survival curves than state-of-the-art (likelihood-based) methods. Moreover, the methodology allows one to obtain personalized survival curves for a patient, thus giving a much more differentiated view of the expected progression of a disease than the view offered by integrated risk models. The runtime to compute personalized survival curves is under a minute for the entire AD data set, making it feasible to handle datasets with 60,000-100,000 subjects in less than 1 h.
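A minimal sketch of the two computational steps described here, a lasso-penalized Cox fit followed by per-subject survival curves, using glmnet and survival on simulated data (not the authors' exact pipeline or data):

```r
# Sketch: lasso Cox model plus personalized survival curves (illustrative).
library(glmnet)
library(survival)

set.seed(1)
n <- 400; p <- 87
X <- matrix(rnorm(n * p), n, p)
y <- Surv(rexp(n, rate = exp(0.5 * X[, 1])), rbinom(n, 1, 0.7))

cvfit <- cv.glmnet(X, y, family = "cox")

# Individual survival curves for the first three subjects
sf <- survfit(cvfit, s = "lambda.min", x = X, y = y, newx = X[1:3, ])
plot(sf, xlab = "time", ylab = "survival probability")
```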

4.
Proc Natl Acad Sci U S A ; 119(31): e2121279119, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35905320

ABSTRACT

Genetically informed, deep-phenotyped biobanks are an important research resource, and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency-linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R2 was 47% in a UK Biobank holdout sample, which was 76% of the estimated SNP heritability. We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 53% and 60% increase, respectively. The average chi-squared value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.
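The MAF-LD grouping at the heart of GMRM's annotation categories can be illustrated in a few lines, binning SNPs by quantiles of minor allele frequency and LD score; the Bayesian mixture model fitted within each group is far more involved and is not sketched here.

```r
# Sketch: forming MAF-LD annotation groups for SNP markers (illustrative;
# the MAF and LD-score values are simulated).
set.seed(1)
m <- 10000
maf <- runif(m, 0.01, 0.5)       # minor allele frequencies
ldscore <- rexp(m, rate = 0.1)   # per-SNP LD scores

maf_bin <- cut(maf, quantile(maf, 0:4 / 4), include.lowest = TRUE)
ld_bin  <- cut(ldscore, quantile(ldscore, 0:4 / 4), include.lowest = TRUE)
group   <- interaction(maf_bin, ld_bin)   # 16 MAF-LD groups
table(group)
```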


Subjects
Genetic Databases, Genome-Wide Association Study, Precision Medicine, Heritable Quantitative Trait, Bayes Theorem, England, Estonia, Genomics, Genotype, Humans, Phenotype, Single Nucleotide Polymorphism
5.
Stat Med ; 43(6): 1119-1134, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38189632

ABSTRACT

Tuning hyperparameters, such as the regularization parameter in Ridge or Lasso regression, is often aimed at improving the predictive performance of risk prediction models. In this study, various hyperparameter tuning procedures for clinical prediction models were systematically compared and evaluated in low-dimensional data. The focus was on out-of-sample predictive performance (discrimination, calibration, and overall prediction error) of risk prediction models developed using Ridge, Lasso, Elastic Net, or Random Forest. The influence of sample size, number of predictors, and events fraction on the performance of the hyperparameter tuning procedures was studied using extensive simulations. The results indicate important differences between tuning procedures in calibration performance, while discriminative performance was generally similar. The one-standard-error rule applied to cross-validation (1SE CV) often resulted in severe miscalibration. Standard non-repeated and repeated cross-validation (both 5-fold and 10-fold) performed similarly well and outperformed the other tuning procedures. The bootstrap showed a slight tendency toward more severe miscalibration than standard cross-validation-based tuning procedures. Differences between tuning procedures were larger for smaller sample sizes, lower events fractions, and fewer predictors. These results imply that the choice of tuning procedure can have a profound influence on the predictive performance of prediction models. The results support the application of standard 5-fold or 10-fold cross-validation that minimizes out-of-sample prediction error. Despite an increased computational burden, we found no clear benefit of repeated over non-repeated cross-validation for hyperparameter tuning. We warn against the potentially detrimental effects on model calibration of the popular 1SE CV rule for tuning prediction models in low-dimensional settings.
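For the two cross-validation rules contrasted in this study, glmnet exposes both directly: lambda.min (CV-minimum) and lambda.1se (the 1SE rule). A small sketch on simulated binary-outcome data:

```r
# Sketch: CV-minimum vs. one-standard-error tuning of a lasso logistic model.
library(glmnet)

set.seed(1)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(drop(X %*% c(1, -1, 0.5, rep(0, p - 3)))))

cvfit <- cv.glmnet(X, y, family = "binomial", nfolds = 10,
                   type.measure = "deviance")

# The 1SE rule selects a heavier penalty; the resulting shrinkage toward
# the mean is one route to the miscalibration reported in the paper.
c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se)
```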


Subjects
Research Design, Humans, Computer Simulation, Sample Size
6.
J Biopharm Stat ; : 1-7, 2024 Apr 05.
Article in English | MEDLINE | ID: mdl-38578223

ABSTRACT

We describe an approach for combining and analyzing high-dimensional genomic and low-dimensional phenotypic data. The approach leverages a scheme of weights applied to the variables rather than to the observations and hence permits incorporation of the information provided by the low-dimensional data source. It can also be combined with commonly used downstream techniques, such as random forests or penalized regression. Finally, simulated lupus studies involving genetic and clinical data are used to illustrate the overall idea and to show that the proposed enriched penalized method can select significant genetic variables while keeping several important clinical variables in the final model.
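The paper's precise weighting scheme is not spelled out in the abstract, but variable-level weights plug directly into standard penalized-regression software; a generic sketch using glmnet's penalty.factor, with clinical covariates exempted from the penalty:

```r
# Sketch: variable-level weights in penalized regression (a generic device,
# not the authors' exact enrichment scheme).
library(glmnet)

set.seed(1)
n <- 150; p_clin <- 5; p_gen <- 500
X <- cbind(matrix(rnorm(n * p_clin), n, p_clin),   # clinical variables
           matrix(rnorm(n * p_gen), n, p_gen))     # genomic variables
y <- X[, 1] + X[, p_clin + 1] + rnorm(n)

w <- c(rep(0, p_clin),   # weight 0: unpenalized, always kept in the model
       rep(1, p_gen))    # weight 1: ordinary lasso penalty
fit <- cv.glmnet(X, y, penalty.factor = w)
```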

7.
Int J Biometeorol ; 2024 Aug 31.
Article in English | MEDLINE | ID: mdl-39215818

ABSTRACT

Crop yield prediction is of growing importance to all stakeholders in agriculture. Because crop growth and development are closely tied to many weather factors, meteorological information must be incorporated into any yield prediction mechanism. Changes in the climate-yield relationship are more pronounced at a local level than across relatively large regions, so district- or sub-region-level modeling may be an appropriate approach. To obtain a location- and crop-specific model, different models with different functional forms have to be explored. This systematic review discusses research papers on the statistical and machine-learning models commonly used to predict crop yield from weather factors. Artificial Neural Networks (ANN) and Multiple Linear Regression were the most applied models. The Support Vector Regression (SVR) model has a high success ratio, performing well in most of the cases. The optimization options in ANN and SVR models allow models to be tuned to the specific patterns of association between the weather conditions of a location and crop yield. An ANN can be trained using different activation functions with an optimized learning rate and number of hidden-layer neurons. Similarly, the SVR model can be trained with different kernel functions and various combinations of hyperparameters. Penalized regression models, namely LASSO and Elastic Net, are better alternatives to simple linear regression. The nonlinear machine-learning models, namely SVR and ANN, performed better in most of the cases, indicating a complex nonlinear association between crop yield and weather factors.
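As an illustration of the SVR tuning the review highlights, the sketch below grid-searches RBF-kernel hyperparameters with the e1071 package; the weather variables and yield relationship are invented for the example.

```r
# Sketch: tuning SVR for yield ~ weather (hypothetical variables and data).
library(e1071)

set.seed(1)
n <- 200
weather <- data.frame(rainfall = rnorm(n, 600, 100),
                      tmax     = rnorm(n, 32, 2),
                      humidity = rnorm(n, 70, 8))
yield <- 5 + 0.004 * weather$rainfall -
         0.05 * (weather$tmax - 32)^2 + rnorm(n, sd = 0.5)

# Cross-validated grid search over cost and RBF gamma
tuned <- tune(svm, train.x = weather, train.y = yield,
              ranges = list(cost = 2^(0:4), gamma = 2^(-4:0)))
tuned$best.parameters
```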

8.
Biom J ; 66(1): e2200092, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37068189

ABSTRACT

Quantifying drug potency, which requires an accurate estimation of the dose-response relationship, is essential for drug development in biomedical research and the life sciences. However, the standard estimation procedure for the median-effect equation used to describe the dose-response curve is vulnerable to extreme observations in common experimental data. To facilitate appropriate statistical inference, many powerful estimation tools have been developed in R, including various dose-response packages based on the nonlinear least squares method with different optimization strategies. Recently, beta regression-based methods have also been introduced for estimating the median-effect equation. In theory, they can overcome nonnormality, heteroscedasticity, and asymmetry, and accommodate flexible robust frameworks and coefficient penalization. To identify reliable methods for estimating dose-response curves even with extreme observations, we conducted a comparative study reviewing 14 different tools in R and examined their robustness and efficiency via Monte Carlo simulation under a comprehensive list of scenarios. The simulation results demonstrate that penalized beta regression using the mgcv package outperforms the other methods in terms of stable, accurate estimation and reliable uncertainty quantification.
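A minimal sketch of the winning approach, penalized beta regression via mgcv's betar family, on a simulated median-effect-style dose-response curve (the paper's simulation scenarios are not reproduced):

```r
# Sketch: penalized beta regression for a dose-response curve with mgcv.
library(mgcv)

set.seed(1)
dose <- exp(seq(log(0.01), log(10), length.out = 100))
mu <- 1 / (1 + (dose / 1)^(-1.2))            # median-effect-style curve
fa <- rbeta(100, mu * 20, (1 - mu) * 20)     # fraction affected, in (0, 1)

# Penalized spline on log-dose with a beta-distributed response
fit <- gam(fa ~ s(log(dose)), family = betar(link = "logit"), method = "REML")
plot(fit, shade = TRUE)
```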


Subjects
Computer Simulation, Regression Analysis, Uncertainty, Monte Carlo Method
9.
BMC Bioinformatics ; 24(1): 82, 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36879227

ABSTRACT

BACKGROUND: One of the main challenges of microbiome analysis is its compositional nature, which, if ignored, can lead to spurious results. Addressing the compositional structure of microbiome data is particularly critical in longitudinal studies, where abundances measured at different times can correspond to different sub-compositions. RESULTS: We developed coda4microbiome, a new R package for analyzing microbiome data within the Compositional Data Analysis (CoDA) framework in both cross-sectional and longitudinal studies. The aim of coda4microbiome is prediction: the method is designed to identify a model (microbial signature) containing the minimum number of features with the maximum predictive power. The algorithm relies on the analysis of log-ratios between pairs of components, and variable selection is addressed through penalized regression on the "all-pairs log-ratio model", the model containing all possible pairwise log-ratios. For longitudinal data, the algorithm infers dynamic microbial signatures by performing penalized regression over a summary of the log-ratio trajectories (the area under these trajectories). In both cross-sectional and longitudinal studies, the inferred microbial signature is expressed as the (weighted) balance between two groups of taxa: those that contribute positively to the microbial signature and those that contribute negatively. The package provides several graphical representations that facilitate interpretation of the analysis and the identified microbial signatures. We illustrate the new method with data from a Crohn's disease study (cross-sectional data) and from the developing microbiome of infants (longitudinal data). CONCLUSIONS: coda4microbiome is a new algorithm for the identification of microbial signatures in both cross-sectional and longitudinal studies. The algorithm is implemented as an R package available on CRAN ( https://cran.r-project.org/web/packages/coda4microbiome/ ) and is accompanied by a vignette with a detailed description of its functions. The project website contains several tutorials: https://malucalle.github.io/coda4microbiome/.
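A minimal cross-sectional call is sketched below, assuming the package's main entry point coda_glmnet (check the package vignette for the exact arguments and return value); the toy abundance table is simulated.

```r
# Sketch: microbial signature selection with coda4microbiome
# (cross-sectional case; toy data, hedged on exact return fields).
library(coda4microbiome)

set.seed(1)
n <- 60; k <- 30
x <- matrix(rpois(n * k, lambda = 50), n, k)   # abundance table
colnames(x) <- paste0("taxon", 1:k)
y <- factor(rep(c("case", "control"), each = n / 2))

# Penalized regression on all pairwise log-ratios; the result encodes the
# balance between positively and negatively contributing taxa
sig <- coda_glmnet(x, y)
str(sig)
```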


Subjects
Algorithms, Microbiota, Infant, Humans, Cross-Sectional Studies, Data Analysis, Longitudinal Studies
10.
Biometrics ; 79(2): 926-939, 2023 06.
Article in English | MEDLINE | ID: mdl-35191015

ABSTRACT

Microarray studies, which aim to identify genes associated with an outcome of interest, usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables (EIV) models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely corrected penalized marginal screening (PMSc) and corrected sure independence screening (SISc), to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features substantially, even when the number of covariates grows exponentially with the sample size. In addition, if the true covariates are weakly correlated, we show that PMSc can achieve full variable selection consistency. Through a simulation study and an analysis of gene expression data on bone mineral density in Norwegian women, we demonstrate that the two new screening procedures make estimation of linear EIV models computationally scalable in high-dimensional settings and improve finite-sample estimation and selection performance compared with estimators that do not employ a screening stage.
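The corrected marginal fits that both procedures build on have a simple closed form: with additive measurement error of known variance, the naive marginal slope is rescaled to undo attenuation. A base-R sketch of this construction (simplified relative to PMSc/SISc, which also handle penalization and estimated noise variance):

```r
# Sketch: attenuation-corrected marginal screening under additive
# measurement error with known variance sigma_u^2 (illustrative).
set.seed(1)
n <- 100; p <- 5000; sigma_u <- 0.5
X <- matrix(rnorm(n * p), n, p)                     # true covariates
W <- X + matrix(rnorm(n * p, sd = sigma_u), n, p)   # contaminated covariates
y <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(n))

# cov(w_j, y) is unbiased for cov(x_j, y); dividing by var(w_j) - sigma_u^2
# (instead of var(w_j)) removes the attenuation bias in each marginal slope
beta_corr <- drop(crossprod(W, y - mean(y)) / (n - 1)) /
             (apply(W, 2, var) - sigma_u^2)

keep <- order(abs(beta_corr), decreasing = TRUE)[1:50]   # screened features
```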


Subjects
Computer Simulation, Female, Humans, Microarray Analysis, Sample Size
11.
Stat Med ; 42(9): 1412-1429, 2023 04 30.
Article in English | MEDLINE | ID: mdl-36737800

ABSTRACT

Penalized regression methods such as the lasso are a popular approach to analyzing high-dimensional data. One attractive property of the lasso is that it naturally performs variable selection. An important area of concern, however, is the reliability of these selections. Motivated by local false discovery rate methodology from the large-scale hypothesis testing literature, we propose a method for calculating a local false discovery rate for each variable under consideration by the lasso model. These rates can be used to assess the reliability of an individual feature or to estimate the model's overall false discovery rate, and the method can be used at any level of regularization. This is particularly useful for models with a few highly significant features but a high overall false discovery rate, a relatively common occurrence when using cross-validation to select a model. The method is also flexible enough to be applied to many varieties of penalized likelihoods, including generalized linear models and Cox regression, and a variety of penalties, including the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD) penalty. We demonstrate the validity of this approach and contrast it with other inferential methods for penalized regression, as well as with local false discovery rates for univariate hypothesis tests. Finally, we show the practical utility of our method by applying it to a case study involving gene expression in breast cancer patients.
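This line of work is associated with the ncvreg package, whose summary method reports feature-level (local) mfdr estimates at a chosen lambda; the call below is a sketch under that assumption, so consult the installed version's documentation for the exact interface.

```r
# Sketch: local false discovery rates for lasso-selected features,
# assuming ncvreg's summary() reports per-feature local mfdr (check docs).
library(ncvreg)

data(Prostate)               # example data shipped with ncvreg
X <- Prostate$X
y <- Prostate$y

fit <- ncvreg(X, y, penalty = "lasso")
summary(fit, lambda = 0.05)  # selected features with local mfdr estimates
```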


Subjects
Breast Neoplasms, Humans, Female, Reproducibility of Results, Regression Analysis, Linear Models, Probability, Breast Neoplasms/genetics
12.
Health Econ ; 32(6): 1305-1322, 2023 06.
Article in English | MEDLINE | ID: mdl-36857288

ABSTRACT

We develop a flexible two-equation copula model to address endogeneity of medical expenditures in a distribution regression for health. The expenditure margin uses the compound gamma distribution, a special case of the Tweedie family of distributions, to account for a spike at zero and a highly skewed continuous part. An efficient estimation algorithm offers flexible choices of copulae and link functions, including logit, probit, and cloglog for the health margin. Our empirical application revisits data from the RAND Health Insurance Experiment. In the joint model, using random insurance plan assignment as an instrument for spending, a $1000 increase in spending is estimated to reduce the probability of a low post-program mental health index by 1.9 percentage points. The effect is not statistically significant. Ignoring endogeneity leads to a spurious positive effect estimate.
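The expenditure margin on its own is easy to sketch: a Tweedie GLM with power between 1 and 2 is a compound gamma with a point mass at zero. A toy fit with the statmod family follows (the paper's full two-equation copula model and instrumenting strategy are not reproduced):

```r
# Sketch: compound-gamma (Tweedie, 1 < var.power < 2) margin for medical
# spending -- zero spike plus right-skewed continuous part (toy data).
library(statmod)   # provides the tweedie() family for glm()

set.seed(1)
n <- 1000
x <- rnorm(n)
spend <- ifelse(runif(n) < 0.35, 0,
                rgamma(n, shape = 2, scale = 500 * exp(0.3 * x)))

fit <- glm(spend ~ x, family = tweedie(var.power = 1.5, link.power = 0))
summary(fit)
```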


Subjects
Health Insurance, Mental Health, Humans, Health Expenditures, Probability, Algorithms
13.
J Biopharm Stat ; : 1-25, 2023 Jul 17.
Article in English | MEDLINE | ID: mdl-37455635

ABSTRACT

We propose a new approach to selecting the regularization parameter in penalized regression, using a new version of the generalized information criterion (GIC). We prove the identifiability of the bridge regression model as a prerequisite for statistical modeling. We then propose an asymptotically efficient generalized information criterion (AGIC) and prove that it has asymptotic loss efficiency. We also verify that AGIC performs better than older versions of GIC. Furthermore, based on numerical studies, we propose MSE search paths to order the features selected by lasso regression; these search paths compensate for the lack of feature ordering in the lasso regression model. The performance of AGIC is compared with that of other types of GIC in terms of MSE and model utility in a simulation study. We apply AGIC and the other criteria to breast cancer, prostate cancer, and Parkinson's disease datasets. The results confirm the superiority of AGIC in almost all situations.
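A generic GIC-style selection over a lasso path is easy to sketch: compute GIC(lambda) = n log(RSS/n) + a_n df(lambda) and take the minimizer. The penalty sequence a_n below (log(log n) * log p) is one common choice, not the AGIC penalty from the paper.

```r
# Sketch: regularization-parameter selection by a generalized information
# criterion over the lasso path (a_n is illustrative, not the paper's AGIC).
library(glmnet)

set.seed(1)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))

fit <- glmnet(X, y)
rss <- colSums((y - predict(fit, X))^2)
a_n <- log(log(n)) * log(p)
gic <- n * log(rss / n) + a_n * fit$df
lambda_gic <- fit$lambda[which.min(gic)]
```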

14.
Stat Sin ; 33(1): 27-53, 2023 Jan.
Article in English | MEDLINE | ID: mdl-37854586

ABSTRACT

In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition into heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model and a group-specific model: the global model ignores the data heterogeneity, while the group-specific model fits each subgroup separately. We prove estimation and prediction consistency for our proposed estimators and show that they have better convergence rates than the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a mis-specified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer's Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.

15.
Stat Sin ; 33(2): 633-662, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37197479

ABSTRACT

Recent technological advances have made it possible to measure multiple types of many features in biomedical studies. However, some data types or features may not be measured for all study subjects because of cost or other constraints. We use a latent variable model to characterize the relationships across and within data types and to infer missing values from observed data. We develop a penalized-likelihood approach for variable selection and parameter estimation and devise an efficient expectation-maximization algorithm to implement our approach. We establish the asymptotic properties of the proposed estimators when the number of features increases at a polynomial rate in the sample size. Finally, we demonstrate the usefulness of the proposed methods using extensive simulation studies and provide an application to a motivating multi-platform genomics study.

16.
Genet Epidemiol ; 45(5): 427-444, 2021 07.
Article in English | MEDLINE | ID: mdl-33998038

ABSTRACT

Many genetic studies that aim to identify genetic variants associated with complex phenotypes are subject to unobserved confounding factors arising from environmental heterogeneity. This poses a challenge to detecting associations of interest and is known to induce spurious associations when left unaccounted for. Penalized linear mixed models (LMMs) are an attractive method to correct for unobserved confounding. These methods correct for varying levels of relatedness and population structure by modeling it as a random effect with a covariance structure estimated from observed genetic data. Despite an extensive literature on penalized regression and LMMs separately, the two are rarely discussed together. The aim of this review is to do so while examining the statistical properties of penalized LMMs in the genetic association setting. Specifically, the ability of penalized LMMs to accurately estimate genetic effects in the presence of environmental confounding has not been well studied. To clarify the important yet subtle distinction between population structure and environmental heterogeneity, we present a detailed review of relevant concepts and methods. In addition, we evaluate the performance of penalized LMMs and competing methods in terms of estimation and selection accuracy in the presence of a number of confounding structures.
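One standard way to combine the two ingredients reviewed here is a two-stage penalized LMM: estimate the relatedness-induced covariance, rotate the data so the residuals are uncorrelated, then apply the lasso to the rotated data. A sketch with the variance components taken as known (in practice they are estimated, e.g. by REML):

```r
# Sketch: two-stage penalized LMM -- whiten by the covariance
# V = sg2 * K + se2 * I, then lasso on the rotated data (illustrative).
library(glmnet)

set.seed(1)
n <- 200; p <- 1000
G <- matrix(rbinom(n * p, 2, 0.3), n, p)   # genotypes coded 0/1/2
Gs <- scale(G)
K <- tcrossprod(Gs) / p                    # genetic relatedness matrix

V <- 0.5 * K + 0.5 * diag(n)               # variance components taken as known
y <- Gs[, 1] + drop(crossprod(chol(V), rnorm(n)))  # outcome with confounding

R <- chol(V)                               # V = t(R) %*% R
ytil <- backsolve(R, y, transpose = TRUE)  # V^{-1/2} y
Xtil <- backsolve(R, Gs, transpose = TRUE) # V^{-1/2} X
fit <- cv.glmnet(Xtil, ytil)
```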


Assuntos
Modelos Genéticos , Polimorfismo de Nucleotídeo Único , Humanos , Modelos Lineares , Fenótipo
17.
Biostatistics ; 22(2): 348-364, 2021 04 10.
Article in English | MEDLINE | ID: mdl-31596468

ABSTRACT

Penalization schemes like the lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive, the question of the relative strength of penalization is often glossed over, and the strength is only implicitly determined by the scale of the individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set, our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to improved prediction performance in situations where the groups have strong differences in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability, and can improve prediction performance.
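A crude stand-in for the idea, without the Bayesian machinery, is to give each covariate-defined group its own penalty multiplier and choose the relative multiplier by cross-validation; a sketch with glmnet's penalty.factor:

```r
# Sketch: differential penalization of two feature groups, with the relative
# penalty chosen by CV (a simple stand-in for the paper's Bayesian procedure).
library(glmnet)

set.seed(1)
n <- 150; p <- 200
X <- matrix(rnorm(n * p), n, p)
grp <- rep(1:2, each = p / 2)                     # groups from an external covariate
y <- drop(X[, 1:10] %*% rep(0.8, 10) + rnorm(n))  # signal sits in group 1

mult <- c(0.25, 0.5, 1, 2, 4)                     # candidate relative penalties
cv_err <- sapply(mult, function(w) {
  pf <- ifelse(grp == 1, 1, w)                    # group 2 penalized w times as hard
  min(cv.glmnet(X, y, penalty.factor = pf)$cvm)
})
mult[which.min(cv_err)]
```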


Subjects
Bayes Theorem, Humans
18.
BMC Cancer ; 22(1): 1045, 2022 Oct 05.
Article in English | MEDLINE | ID: mdl-36199072

ABSTRACT

BACKGROUND: Prediction of patient survival from tumor molecular '-omics' data is a key step toward personalized medicine. Cox models fitted to RNA profiling datasets are popular for clinical outcome prediction. But these models are applied in a "high-dimension" context, as the number p of covariates (gene expressions) greatly exceeds the number n of patients and the number e of events. Thus, pre-screening together with penalization methods is widely used for dimension reduction. METHODS: In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single-variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). RESULTS: First, we show that integrating mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure yields only moderate improvements in the C-index and/or the integrated Brier score, while excluding irrelevant genes from prediction. We demonstrate that the different penalization methods reach comparable prediction performance, with slight differences among datasets. Finally, we provide advice for the case of multi-omics data integration. CONCLUSIONS: Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and ridge penalizations perform similarly to elastic net penalizations for Cox models in high dimension. Pre-screening the top 200 genes in terms of single-variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics data.
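The recommended pipeline, univariate Cox p-value screening of the top 200 genes followed by a penalized Cox fit, can be sketched as follows on simulated data (illustrative, not the paper's TCGA code):

```r
# Sketch: pre-screen genes by single-variable Cox p-values, then fit a
# penalized Cox model on the top 200 (toy data).
library(survival)
library(glmnet)

set.seed(1)
n <- 300; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- Surv(rexp(n, rate = exp(0.5 * X[, 1])), rbinom(n, 1, 0.6))

pvals <- apply(X, 2, function(g)
  summary(coxph(y ~ g))$coefficients[, "Pr(>|z|)"])
top200 <- order(pvals)[1:200]

fit <- cv.glmnet(X[, top200], y, family = "cox", alpha = 0.5)  # elastic net
```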


Subjects
Neoplasms, Humans, Neoplasms/diagnosis, Neoplasms/genetics, Prognosis, Proportional Hazards Models, RNA, Messenger RNA
19.
Biometrics ; 78(4): 1365-1376, 2022 12.
Article in English | MEDLINE | ID: mdl-34190337

ABSTRACT

We introduce a statistical procedure that integrates datasets from multiple biomedical studies to predict patients' survival, based on individual clinical and genomic profiles. The proposed procedure accounts for potential differences in the relation between predictors and outcomes across studies, due to distinct patient populations, treatments, and technologies for measuring outcomes and biomarkers. These differences are modeled explicitly with study-specific parameters. We use hierarchical regularization to shrink the study-specific parameters towards each other and to borrow information across studies. The estimation of the study-specific parameters utilizes a similarity matrix, which summarizes differences and similarities in the relations between covariates and outcomes across studies. We illustrate the method in a simulation study and on a collection of gene expression datasets in ovarian cancer. We show that the proposed model increases the accuracy of survival predictions compared to alternative meta-analytic methods.
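One data-augmentation route to this kind of hierarchical shrinkage, decomposing each study's coefficients into a shared part plus a study-specific deviation and penalizing the deviations more heavily, is sketched below with glmnet; it is a simplified stand-in, not the authors' estimator or similarity-matrix weighting.

```r
# Sketch: shared-plus-deviation coefficients across two studies
# (beta_k = b + d_k), with deviations penalized harder (illustrative).
library(glmnet)

set.seed(1)
p <- 20; nk <- 100
X1 <- matrix(rnorm(nk * p), nk, p); y1 <- X1[, 1] + rnorm(nk)
X2 <- matrix(rnorm(nk * p), nk, p); y2 <- 1.5 * X2[, 1] + rnorm(nk)

# Columns: shared b (stacked X), then d_1 and d_2 (block-diagonal)
Xall <- rbind(cbind(X1, X1, matrix(0, nk, p)),
              cbind(X2, matrix(0, nk, p), X2))
yall <- c(y1, y2)

pf <- c(rep(1, p), rep(3, 2 * p))   # study-specific deviations penalized 3x
fit <- cv.glmnet(Xall, yall, penalty.factor = pf)
```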


Subjects
Ovarian Neoplasms, Female, Humans, Computer Simulation, Biomarkers, Ovarian Neoplasms/genetics
20.
BMC Med Res Methodol ; 22(1): 206, 2022 07 26.
Article in English | MEDLINE | ID: mdl-35883041

ABSTRACT

BACKGROUND: Variable selection for regression models plays a key role in the analysis of biomedical data. However, inference after selection is not covered by classical frequentist statistical theory, which assumes a fixed set of covariates in the model. This leads to over-optimistic selection and replicability issues. METHODS: We compared proposals for selective inference targeting the submodel parameters of the Lasso and its extension, the adaptive Lasso: sample splitting, selective inference conditional on the Lasso selection (SI), and universally valid post-selection inference (PoSI). We studied the properties of the proposed selective confidence intervals, available via R software packages, using a neutral simulation study inspired by real data commonly seen in biomedical studies. Furthermore, we present an exemplary application of these methods to a publicly available dataset to discuss their practical usability. RESULTS: The frequentist properties of selective confidence intervals from the SI method were generally acceptable, but the claimed selective coverage levels were not attained in all scenarios, in particular with the adaptive Lasso. The actual coverage of the extremely conservative PoSI method exceeded the nominal levels, and this method also required the greatest computational effort. Sample splitting achieved acceptable actual selective coverage levels, but the method is inefficient and leads to less accurate point estimates. The choice of inference method had a large impact on the resulting interval estimates, so the user must be acutely aware of the goal of inference in order to interpret and communicate the results. CONCLUSIONS: Despite violating nominal coverage levels in some scenarios, selective inference conditional on the Lasso selection is our recommended approach for most cases. If simplicity is strongly favoured over efficiency, then sample splitting is an alternative. If only a few predictors undergo variable selection (i.e., up to 5) or the avoidance of false positive claims of significance is a concern, then the conservative PoSI approach may be useful. For the adaptive Lasso, SI should be avoided; only PoSI and sample splitting are recommended. In summary, we find selective inference useful for assessing the uncertainty in the importance of individual selected predictors for future applications.
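The SI approach recommended here is implemented in the selectiveInference R package; a sketch of the basic call follows (note the lambda-scaling convention: fixedLassoInf expects the penalty on the sum-of-squares scale, i.e. glmnet's lambda times n).

```r
# Sketch: selective inference conditional on the lasso selection (SI).
library(glmnet)
library(selectiveInference)

set.seed(1)
n <- 100; p <- 20
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:2] %*% c(1, -1) + rnorm(n))

lam <- 0.1   # glmnet-scale lambda
b <- coef(glmnet(X, y, standardize = FALSE),
          s = lam, exact = TRUE, x = X, y = y)[-1]

# Confidence intervals and p-values conditional on the selected model
out <- fixedLassoInf(X, y, b, lambda = lam * n)
out
```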


Subjects
Biomedical Research, Computer Simulation, Humans