Results 1-20 of 1,114

1.
Am J Hum Genet ; 111(2): 213-226, 2024 02 01.
Article in English | MEDLINE | ID: mdl-38171363

ABSTRACT

The aim of fine mapping is to identify genetic variants causally contributing to complex traits or diseases. Existing fine-mapping methods employ Bayesian discrete mixture priors and depend on a pre-specified maximum number of causal variants, which may lead to sub-optimal solutions. In this work, we propose a Bayesian fine-mapping method called h2-D2, utilizing a continuous global-local shrinkage prior. We also present an approach to define credible sets of causal variants in continuous prior settings. Simulation studies demonstrate that h2-D2 outperforms current state-of-the-art fine-mapping methods such as SuSiE and FINEMAP in accurately identifying causal variants and estimating their effect sizes. We further applied h2-D2 to prostate cancer analysis and discovered some previously unknown causal variants. In addition, we inferred 369 target genes associated with the detected causal variants and several pathways that were significantly over-represented by these genes, shedding light on their potential roles in prostate cancer development and progression.
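The notion of a credible set can be illustrated with a generic sketch (not the paper's exact h2-D2 construction): given posterior inclusion probabilities (PIPs) for each variant, collect variants in decreasing PIP order until a target posterior mass is covered. The PIP values below are made up for illustration.

```python
import numpy as np

def credible_set(pips, rho=0.95):
    """Collect variants in decreasing PIP order until their
    cumulative inclusion probability reaches the level rho."""
    pips = np.asarray(pips, dtype=float)
    order = np.argsort(pips)[::-1]
    k = int(np.searchsorted(np.cumsum(pips[order]), rho)) + 1
    return sorted(order[:k].tolist())

# Toy PIPs for five variants (illustrative values only)
print(credible_set([0.02, 0.55, 0.01, 0.40, 0.05], rho=0.90))  # [1, 3]
```

Here variants 1 and 3 together carry 0.95 of the inclusion mass, so they form the 90% credible set; defining such sets under continuous priors, where PIPs are not tied to a discrete causal configuration, is the methodological point the abstract addresses.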


Subject(s)
Prostatic Neoplasms , Quantitative Trait Loci , Male , Humans , Bayes Theorem , Polymorphism, Single Nucleotide/genetics , Computer Simulation , Prostatic Neoplasms/genetics , Genome-Wide Association Study/methods
2.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38436558

ABSTRACT

Recently, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection and causal effect estimation might be challenging. Here, we introduce the generalized median adaptive lasso (GMAL) for covariate selection to achieve an accurate estimation of causal effect even when the outcome follows skewed distributions. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby maintaining the accuracy of variable selection and causal effect estimation even when the outcome presents extremely skewed distributions. Simulation results showed that our proposed method performs comparably to existing methods in variable selection when the outcome follows a symmetric distribution. Moreover, the proposed method exhibited clear superiority over the existing methods when the outcome follows a skewed distribution. Meanwhile, our proposed method consistently outperformed the existing methods in causal estimation, as indicated by smaller root-mean-square errors. We also applied the GMAL method to a DNA methylation dataset from the Alzheimer's disease (AD) neuroimaging initiative database to investigate the association between cerebrospinal fluid tau protein levels and the severity of AD.


Subject(s)
Alzheimer Disease , Humans , Alzheimer Disease/genetics , Computer Simulation , Databases, Factual , Linear Models , Protein Processing, Post-Translational
3.
Genet Epidemiol ; 2024 Oct 06.
Article in English | MEDLINE | ID: mdl-39370608

ABSTRACT

The main goal of fine-mapping is the identification of relevant genetic variants that have a causal effect on some trait of interest, such as the presence of a disease. From a statistical point of view, fine mapping can be seen as a variable selection problem. Fine-mapping methods are often challenging to apply because of the presence of linkage disequilibrium (LD), that is, regions of the genome where the variants interrogated have high correlation. Several methods have been proposed to address this issue. Here we explore the 'Sum of Single Effects' (SuSiE) method, applied to real data (summary statistics) from a genome-wide meta-analysis of the autoimmune liver disease primary biliary cholangitis (PBC). Fine-mapping in this data set was previously performed using the FINEMAP program; we compare these previous results with those obtained from SuSiE, which provides an arguably more convenient and principled way of generating 'credible sets', that is, sets of predictors that are correlated with the response variable. This allows us to appropriately acknowledge the uncertainty when selecting the causal effects for the trait. We focus on the results from SuSiE-RSS, which fits the SuSiE model to summary statistics, such as z-scores, along with a correlation matrix. We also compare the SuSiE results to those obtained using a more recently developed method, h2-D2, which uses the same inputs. Overall, we find the results from SuSiE-RSS and, to a lesser extent, h2-D2, to be quite concordant with those previously obtained using FINEMAP. The resulting genes and biological pathways implicated are therefore also similar to those previously obtained, providing valuable confirmation of these previously reported results. Detailed examination of the credible sets identified suggests that, although for the majority of the loci (33 out of 56) the results from SuSiE-RSS seem most plausible, there are some loci (5 out of 56) where the results from h2-D2 seem more compelling.
Computer simulations suggest that, overall, SuSiE-RSS generally has slightly higher power, better precision, and better ability to identify the true number of causal variants in a region than h2-D2, although there are some scenarios where the power of h2-D2 is higher. Thus, in real data analysis, the use of complementary approaches such as SuSiE and h2-D2 is potentially warranted.

4.
Biostatistics ; 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38916966

ABSTRACT

Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modeling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximization algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases.

5.
Cereb Cortex ; 34(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38813966

ABSTRACT

A multitude of factors are associated with the symptoms of post-traumatic stress disorder. However, establishing which predictors are most strongly associated with post-traumatic stress disorder symptoms is complicated because few studies are able to consider multiple factors simultaneously across the biopsychosocial domains that are implicated by existing theoretical models. Further, post-traumatic stress disorder is heterogeneous, and studies using case-control designs may obscure which factors relate uniquely to symptom dimensions. Here we used Bayesian variable selection to identify the most important predictors for overall post-traumatic stress disorder symptoms and individual symptom dimensions in a community sample of 569 adults (18 to 85 yr of age). Candidate predictors were selected from previously established risk factors relevant for post-traumatic stress disorder and included psychological measures, behavioral measures, and resting state functional connectivity among brain regions. In a follow-up analysis, we compared results controlling for current depression symptoms in order to examine specificity. Poor sleep quality and dimensions of temperament and impulsivity were consistently associated with greater post-traumatic stress disorder symptom severity. In addition to self-report measures, brain functional connectivity among regions commonly ascribed to the default mode network, central executive network, and salience network explained the unique variability of post-traumatic stress disorder symptoms. This study demonstrates the unique contributions of psychological measures and neural substrates to post-traumatic stress disorder symptoms.


Subject(s)
Brain , Magnetic Resonance Imaging , Stress Disorders, Post-Traumatic , Humans , Stress Disorders, Post-Traumatic/psychology , Stress Disorders, Post-Traumatic/physiopathology , Stress Disorders, Post-Traumatic/diagnostic imaging , Adult , Male , Female , Middle Aged , Aged , Young Adult , Brain/physiopathology , Brain/diagnostic imaging , Aged, 80 and over , Adolescent , Bayes Theorem , Depression/psychology , Depression/physiopathology , Impulsive Behavior/physiology , Temperament/physiology
6.
Genet Epidemiol ; 47(1): 3-25, 2023 02.
Article in English | MEDLINE | ID: mdl-36273411

ABSTRACT

Mendelian randomization (MR) is the use of genetic variants to assess the existence of a causal relationship between a risk factor and an outcome of interest. Here, we focus on two-sample summary-data MR analyses with many correlated variants from a single gene region, particularly on cis-MR studies which use protein expression as a risk factor. Such studies must rely on a small, curated set of variants from the studied region; using all variants in the region requires inverting an ill-conditioned genetic correlation matrix and results in numerically unstable causal effect estimates. We review methods for variable selection and estimation in cis-MR with summary-level data, ranging from stepwise pruning and conditional analysis to principal components analysis, factor analysis, and Bayesian variable selection. In a simulation study, we show that the various methods have comparable performance in analyses with large sample sizes and strong genetic instruments. However, when weak instrument bias is suspected, factor analysis and Bayesian variable selection produce more reliable inferences than simple pruning approaches, which are often used in practice. We conclude by examining two case studies, assessing the effects of low-density lipoprotein-cholesterol and serum testosterone on coronary heart disease risk using variants in the HMGCR and SHBG gene regions, respectively.
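The simplest approach reviewed, stepwise pruning, can be sketched as a greedy pass over variants ranked by association strength, dropping any variant in strong LD with one already kept. This is a generic illustration with made-up numbers, not the paper's exact procedure or thresholds.

```python
import numpy as np

def ld_prune(z, R, r2_max=0.1):
    """Keep variants in decreasing |z|-score order, dropping any
    variant whose squared LD correlation with an already kept
    variant exceeds r2_max."""
    keep = []
    for j in np.argsort(-np.abs(np.asarray(z))):
        if all(R[j, k] ** 2 <= r2_max for k in keep):
            keep.append(int(j))
    return sorted(keep)

# Three variants: the first two are in strong LD (r = 0.9)
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(ld_prune([5.0, 4.0, 3.0], R))  # [0, 2]
```

Pruning sidesteps inverting the ill-conditioned correlation matrix at the cost of discarding information, which is why the abstract finds factor analysis and Bayesian variable selection more reliable under weak instruments.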


Subject(s)
Mendelian Randomization Analysis , Models, Genetic , Humans , Mendelian Randomization Analysis/methods , Bayes Theorem , Risk Factors , Causality
7.
Am J Epidemiol ; 193(2): 370-376, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-37771042

ABSTRACT

Variable selection in regression models is a particularly important issue in epidemiology, where one usually encounters observational studies. In contrast to randomized trials or experiments, confounding is often not controlled by the study design, but has to be accounted for by suitable statistical methods. For instance, when risk factors should be identified with unconfounded effect estimates, multivariable regression techniques can help to adjust for confounders. We investigated the current practice of variable selection in 4 major epidemiologic journals in 2019 and found that the majority of articles used subject-matter knowledge to determine a priori the set of included variables. In comparison with previous reviews from 2008 and 2015, fewer articles applied data-driven variable selection. Furthermore, for most articles the main aim of analysis was hypothesis-driven effect estimation in rather low-dimensional data situations (i.e., large sample size compared with the number of variables). Based on our results, we discuss the role of data-driven variable selection in epidemiology.


Subject(s)
Research Design , Humans , Regression Analysis , Sample Size
8.
Biostatistics ; 24(2): 295-308, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34494086

ABSTRACT

Support vector regression (SVR) is particularly beneficial when the outcome and predictors are nonlinearly related. However, when many covariates are available, the method's flexibility can lead to overfitting and an overall loss in predictive accuracy. To overcome this drawback, we develop a feature selection method for SVR based on a genetic algorithm that iteratively searches across potential subsets of covariates to find those that yield the best performance according to a user-defined fitness function. We evaluate the performance of our feature selection method for SVR, comparing it to alternate methods including LASSO and random forest, in a simulation study. We find that our method yields higher predictive accuracy than SVR without feature selection. Our method outperforms LASSO when the relationship between covariates and outcome is nonlinear. Random forest performs equivalently to our method in some scenarios, but more poorly when covariates are correlated. We apply our method to predict donor kidney function 1 year after transplant using data from the United Network for Organ Sharing national registry.
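A minimal genetic-algorithm sketch of this idea is given below, with simulated data and cross-validated R² as a generic fitness function; it is not the authors' implementation, and the population size, mutation rate, and data-generating model are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 150, 8
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)  # nonlinear in 2 features

def fitness(mask):
    """Cross-validated R^2 of an RBF-kernel SVR on the masked features."""
    if not mask.any():
        return -np.inf
    return cross_val_score(SVR(kernel="rbf"), X[:, mask], y, cv=3, scoring="r2").mean()

pop = rng.random((20, p)) < 0.5                      # random initial feature masks
for _ in range(10):                                  # a few GA generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]          # truncation selection
    kids = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, p)                     # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        kids.append(child ^ (rng.random(p) < 0.05))  # bit-flip mutation
    pop = np.array(kids)

best = pop[np.argmax([fitness(m) for m in pop])]
print(np.flatnonzero(best))
```

Each candidate solution is a boolean inclusion mask over the covariates; selection, crossover, and mutation iteratively concentrate the population on subsets with high out-of-sample fit.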


Subject(s)
Algorithms , Regression Analysis , Humans , Support Vector Machine
9.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36184192

ABSTRACT

For many high-dimensional genomic and epigenomic datasets, the outcome of interest is ordinal. While these ordinal outcomes are often thought of as the observed cutpoints of some latent continuous variable, some ordinal outcomes are truly discrete and comprise a subjective combination of several factors. The nonlinear stereotype logistic model, which does not assume proportional odds, was developed for these 'assessed' ordinal variables. It has previously been extended to the frequentist high-dimensional feature selection setting, but the Bayesian framework provides some distinct advantages in terms of simultaneous uncertainty quantification and variable selection. Here, we review the stereotype model and Bayesian variable selection methods and demonstrate how to combine them to select genomic features associated with discrete ordinal outcomes. We compared the Bayesian and frequentist methods in terms of variable selection performance. We additionally applied the Bayesian stereotype method to an acute myeloid leukemia RNA-sequencing dataset to further demonstrate its variable selection abilities by identifying features associated with the European LeukemiaNet prognostic risk score.


Subject(s)
Genomics , Logistic Models , Bayes Theorem , Risk Factors
10.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35667004

ABSTRACT

In recent work, researchers have paid considerable attention to the estimation of causal effects in observational studies with a large number of covariates, which makes the unconfoundedness assumption plausible. In this paper, we review propensity score (PS) methods developed in high-dimensional settings and broadly group them into model-based methods that extend models for prediction to causal inference and balance-based methods that combine covariate balancing constraints. We conducted systematic simulation experiments to evaluate these two types of methods, and studied whether the use of balancing constraints further improved estimation performance. Our comparison methods were post-double-selection (PDS), double-index PS (DiPS), outcome-adaptive LASSO (OAL), group LASSO and doubly robust estimation (GLiDeR), high-dimensional covariate balancing PS (hdCBPS), regularized calibrated estimators (RCAL) and approximate residual balancing method (balanceHD). For the four model-based methods, simulation studies showed that GLiDeR was the most stable approach, with high estimation accuracy and precision, followed by PDS, OAL and DiPS. For balance-based methods, hdCBPS performed similarly to GLiDeR in terms of accuracy, and outperformed balanceHD and RCAL. These findings imply that PS methods do not benefit appreciably from covariate balancing constraints in high-dimensional settings. In conclusion, we recommend the preferential use of GLiDeR and hdCBPS approaches for estimating causal effects in high-dimensional settings; however, further studies on the construction of valid confidence intervals are required.


Subject(s)
Models, Statistical , Causality , Computer Simulation , Propensity Score
11.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38497825

ABSTRACT

Modern biomedical datasets are increasingly high-dimensional and exhibit complex correlation structures. Generalized linear mixed models (GLMMs) have long been employed to account for such dependencies. However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high-dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches.


Subject(s)
Algorithms , Computer Simulation , Linear Models , Monte Carlo Method
12.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39282732

ABSTRACT

We develop a methodology for valid inference after variable selection in logistic regression when the responses are partially observed, that is, when one observes a set of error-prone testing outcomes instead of the true values of the responses. Aiming at selecting important covariates while accounting for missing information in the response data, we apply the expectation-maximization algorithm to compute maximum likelihood estimators subject to LASSO penalization. Subsequent to variable selection, we make inferences on the selected covariate effects by extending post-selection inference methodology based on the polyhedral lemma. Empirical evidence from our extensive simulation study suggests that our post-selection inference results are more reliable than those from naive inference methods that use the same data to perform variable selection and inference without adjusting for variable selection.


Subject(s)
Algorithms , Computer Simulation , Likelihood Functions , Humans , Logistic Models , Data Interpretation, Statistical , Biometry/methods , Models, Statistical
13.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465986

ABSTRACT

This paper proposes a novel likelihood-based boosting method for the selection of the random effects in linear mixed models. The nonconvexity of the objective function to minimize, which is the negative profile log-likelihood, requires the adoption of new solutions. In this respect, our optimization approach employs directions of negative curvature in addition to the usual Newton directions. A simulation study and a real-data application show the good performance of the proposal.


Subject(s)
Likelihood Functions , Linear Models , Computer Simulation
14.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465987

ABSTRACT

High-dimensional data sets are often available in genome-enabled predictions. Such data sets include nonlinear relationships with complex dependence structures. For such situations, vine copula-based (quantile) regression is an important tool. However, the current vine copula-based regression approaches do not scale up to high and ultra-high dimensions. To perform high-dimensional sparse vine copula-based regression, we propose 2 methods. First, we show their superiority regarding computational complexity over the existing methods. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. Then, we show our method's power in selecting relevant variables and its prediction accuracy in high-dimensional sparse data sets via simulation studies. Next, we apply the proposed methods to high-dimensional real data, aiming at genomic prediction of maize traits. Some data processing and feature extraction steps for the real data are further discussed. Finally, we show the advantage of our methods over linear models and quantile regression forests in simulation studies and real data applications.


Subject(s)
Genome , Genomics , Genomics/methods , Computer Simulation , Linear Models , Phenotype
15.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38412301

ABSTRACT

Ordinal class labels are frequently observed in classification studies across various fields. In medical science, patients' responses to a drug can be arranged in the natural order, reflecting their recovery postdrug administration. The severity of the disease is often recorded using an ordinal scale, such as cancer grades or tumor stages. We propose a method based on the linear discriminant analysis (LDA) that generates a sparse, low-dimensional discriminant subspace reflecting the class orders. Unlike existing approaches that focus on predictors marginally associated with ordinal labels, our proposed method selects variables that collectively contribute to the ordinal labels. We employ the optimal scoring approach for LDA as a regularization framework, applying an ordinality penalty to the optimal scores and a sparsity penalty to the coefficients for the predictors. We demonstrate the effectiveness of our approach using a glioma dataset, where we predict cancer grades based on gene expression. A simulation study with various settings validates the competitiveness of our classification performance and demonstrates the advantages of our approach in terms of the interpretability of the estimated classifier with respect to the ordinal class labels.


Subject(s)
Algorithms , Neoplasms , Humans , Discriminant Analysis , Computer Simulation , Neoplasms/genetics , Neoplasms/metabolism
16.
Biometrics ; 80(4)2024 Oct 03.
Article in English | MEDLINE | ID: mdl-39377518

ABSTRACT

In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates, including sex and dietary intake variables to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.


Subject(s)
Bayes Theorem , Computer Simulation , Gastrointestinal Microbiome , Obesity , Humans , Regression Analysis , Models, Statistical
17.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465988

ABSTRACT

Mixed panel count data represent a common complex data structure in longitudinal survey studies. A major challenge in analyzing such data is variable selection and estimation while efficiently incorporating both the panel count and panel binary data components. Analyses in the medical literature have often ignored the panel binary component and treated it as missing with the unknown panel counts, while obviously such a simplification does not effectively utilize the original data information. In this research, we put forward a penalized likelihood variable selection and estimation procedure under the proportional mean model. A computationally efficient EM algorithm is developed that ensures sparse estimation for variable selection, and the resulting estimator is shown to have the desirable oracle property. Simulation studies assessed and confirmed the good finite-sample properties of the proposed method, and the method is applied to analyze a motivating dataset from the Health and Retirement Study.


Subject(s)
Algorithms , Likelihood Functions , Computer Simulation , Longitudinal Studies
18.
Stat Med ; 43(1): 61-88, 2024 01 15.
Article in English | MEDLINE | ID: mdl-37927105

ABSTRACT

Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of accessing the confidence of individual hypotheses. However, most methods estimate fdr through P-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of P-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
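The competition-based idea can be illustrated with a generic target-decoy sketch on synthetic scores (not the TDfdr estimator itself): decoy scores stand in for the unknown null distribution, and the local fdr in a score bin is estimated by the decoy-to-target count ratio there.

```python
import numpy as np

rng = np.random.default_rng(2)
# Targets: 4000 null scores plus 1000 true signals; decoys mimic the null
target = np.concatenate([rng.normal(0, 1, 4000), rng.normal(3, 1, 1000)])
decoy = rng.normal(0, 1, 4000)

bins = np.linspace(-4, 7, 23)                      # bins of width 0.5
t_counts, _ = np.histogram(target, bins)
d_counts, _ = np.histogram(decoy, bins)
# Local fdr per bin: estimated fraction of nulls among the targets there
fdr = np.clip(d_counts / np.maximum(t_counts, 1), 0.0, 1.0)

hi = np.searchsorted(bins, 3.2) - 1                # a bin of high scores
lo = np.searchsorted(bins, 0.2) - 1                # a bin near the null mode
print(fdr[hi] < 0.2, fdr[lo] > 0.7)                # small fdr at high scores, near 1 at the null mode
```

Because the decoy scores play the role of a known null sample, no P-values or parametric null distribution are needed, which is the property the abstract highlights.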


Subject(s)
Algorithms , Research Design , Humans , Computer Simulation
19.
Stat Med ; 43(20): 3792-3814, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-38923006

ABSTRACT

Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small n, large p" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve feature selection, an ensemble learning method draws on the rich body of previous research to identify "prior information". This leads to the proposed prior-assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.


Subject(s)
Deep Learning , Melanoma , Skin Neoplasms , Humans , Melanoma/genetics , Computer Simulation , Algorithms
20.
Stat Med ; 43(14): 2713-2733, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38690642

ABSTRACT

This article presents a novel method for learning time-varying dynamic Bayesian networks. The proposed method breaks down the dynamic Bayesian network learning problem into a sequence of regression inference problems and tackles each problem using the Markov neighborhood regression technique. Notably, the method demonstrates scalability concerning data dimensionality, accommodates time-varying network structure, and naturally handles multi-subject data. The proposed method exhibits consistency and offers superior performance compared to existing methods in terms of estimation accuracy and computational efficiency, as supported by extensive numerical experiments. To showcase its effectiveness, we apply the proposed method to an fMRI study investigating the effective connectivity among various regions of interest (ROIs) during an emotion-processing task. Our findings reveal the pivotal role of the subcortical-cerebellum in emotion processing.


Subject(s)
Bayes Theorem , Emotions , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Emotions/physiology , Markov Chains , Brain/diagnostic imaging , Brain/physiology , Computer Simulation