Results 1 - 20 of 1,083
1.
Am J Hum Genet ; 111(2): 213-226, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38171363

ABSTRACT

The aim of fine mapping is to identify genetic variants causally contributing to complex traits or diseases. Existing fine-mapping methods employ Bayesian discrete mixture priors and depend on a pre-specified maximum number of causal variants, which may lead to sub-optimal solutions. In this work, we propose a Bayesian fine-mapping method called h2-D2, utilizing a continuous global-local shrinkage prior. We also present an approach to define credible sets of causal variants in continuous prior settings. Simulation studies demonstrate that h2-D2 outperforms current state-of-the-art fine-mapping methods such as SuSiE and FINEMAP in accurately identifying causal variants and estimating their effect sizes. We further applied h2-D2 to prostate cancer analysis and discovered some previously unknown causal variants. In addition, we inferred 369 target genes associated with the detected causal variants and several pathways that were significantly over-represented by these genes, shedding light on their potential roles in prostate cancer development and progression.


Subjects
Prostatic Neoplasms , Quantitative Trait Loci , Male , Humans , Bayes Theorem , Polymorphism, Single Nucleotide/genetics , Computer Simulation , Prostatic Neoplasms/genetics , Genome-Wide Association Study/methods
2.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38436558

ABSTRACT

Recently, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection and causal effect estimation might be challenging. Here, we introduce the generalized median adaptive lasso (GMAL) for covariate selection to achieve an accurate estimation of causal effect even when the outcome follows skewed distributions. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby maintaining the accuracy of variable selection and causal effect estimation even when the outcome presents extremely skewed distributions. Simulation results showed that our proposed method performs comparably to existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the proposed method exhibited obvious superiority over the existing methods when the outcome follows a skewed distribution. Meanwhile, our proposed method consistently outperformed the existing methods in causal estimation, as indicated by smaller root-mean-square error. We also utilized the GMAL method on a deoxyribonucleic acid methylation dataset from the Alzheimer's disease (AD) neuroimaging initiative database to investigate the association between cerebrospinal fluid tau protein levels and the severity of AD.


Subjects
Alzheimer Disease , Humans , Alzheimer Disease/genetics , Computer Simulation , Databases, Factual , Linear Models , Protein Processing, Post-Translational
3.
Biostatistics ; 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38916966

ABSTRACT

Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modeling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximization algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases.

4.
Cereb Cortex ; 34(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38813966

ABSTRACT

A multitude of factors are associated with the symptoms of post-traumatic stress disorder. However, establishing which predictors are most strongly associated with post-traumatic stress disorder symptoms is complicated because few studies are able to consider multiple factors simultaneously across the biopsychosocial domains that are implicated by existing theoretical models. Further, post-traumatic stress disorder is heterogeneous, and studies using case-control designs may obscure which factors relate uniquely to symptom dimensions. Here we used Bayesian variable selection to identify the most important predictors for overall post-traumatic stress disorder symptoms and individual symptom dimensions in a community sample of 569 adults (18 to 85 yr of age). Candidate predictors were selected from previously established risk factors relevant for post-traumatic stress disorder and included psychological measures, behavioral measures, and resting state functional connectivity among brain regions. In a follow-up analysis, we compared results controlling for current depression symptoms in order to examine specificity. Poor sleep quality and dimensions of temperament and impulsivity were consistently associated with greater post-traumatic stress disorder symptom severity. In addition to self-report measures, brain functional connectivity among regions commonly ascribed to the default mode network, central executive network, and salience network explained the unique variability of post-traumatic stress disorder symptoms. This study demonstrates the unique contributions of psychological measures and neural substrates to post-traumatic stress disorder symptoms.


Subjects
Brain , Magnetic Resonance Imaging , Stress Disorders, Post-Traumatic , Humans , Stress Disorders, Post-Traumatic/psychology , Stress Disorders, Post-Traumatic/physiopathology , Stress Disorders, Post-Traumatic/diagnostic imaging , Adult , Male , Female , Middle Aged , Aged , Young Adult , Brain/physiopathology , Brain/diagnostic imaging , Aged, 80 and over , Adolescent , Bayes Theorem , Depression/psychology , Depression/physiopathology , Impulsive Behavior/physiology , Temperament/physiology
5.
Genet Epidemiol ; 47(1): 3-25, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36273411

ABSTRACT

Mendelian randomization (MR) is the use of genetic variants to assess the existence of a causal relationship between a risk factor and an outcome of interest. Here, we focus on two-sample summary-data MR analyses with many correlated variants from a single gene region, particularly on cis-MR studies which use protein expression as a risk factor. Such studies must rely on a small, curated set of variants from the studied region; using all variants in the region requires inverting an ill-conditioned genetic correlation matrix and results in numerically unstable causal effect estimates. We review methods for variable selection and estimation in cis-MR with summary-level data, ranging from stepwise pruning and conditional analysis to principal components analysis, factor analysis, and Bayesian variable selection. In a simulation study, we show that the various methods have comparable performance in analyses with large sample sizes and strong genetic instruments. However, when weak instrument bias is suspected, factor analysis and Bayesian variable selection produce more reliable inferences than simple pruning approaches, which are often used in practice. We conclude by examining two case studies, assessing the effects of low-density lipoprotein-cholesterol and serum testosterone on coronary heart disease risk using variants in the HMGCR and SHBG gene regions, respectively.
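The "simple pruning approaches" mentioned in this abstract can be illustrated with a minimal sketch. It assumes a greedy rule that is common in practice: visit variants from strongest to weakest association and keep a variant only if its squared correlation with every variant already kept stays below a threshold. The function name, the correlation matrix, the p-values, and the threshold of 0.1 are all hypothetical and are not taken from the paper.

```python
def greedy_prune(order, r, r2_max=0.1):
    """Greedy correlation pruning (illustrative, not the paper's code).

    order  -- variant indices, strongest association first
    r      -- square matrix (list of lists) of pairwise correlations
    r2_max -- largest squared correlation allowed between kept variants
    """
    kept = []
    for j in order:
        # Keep variant j only if it is weakly correlated with all kept variants.
        if all(r[j][k] ** 2 <= r2_max for k in kept):
            kept.append(j)
    return kept


# Hypothetical toy region with four variants.
r_example = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.30, 0.05],
    [0.20, 0.30, 1.00, 0.80],
    [0.10, 0.05, 0.80, 1.00],
]
p_values = [1e-8, 1e-10, 1e-5, 1e-6]

# Rank variants by ascending p-value, then prune.
order = sorted(range(len(p_values)), key=lambda j: p_values[j])
kept = greedy_prune(order, r_example, r2_max=0.1)
print(kept)
```

Pruning of this kind keeps the retained correlation matrix well conditioned, at the price of discarding information carried by the dropped variants, which is the trade-off the abstract contrasts with factor analysis and Bayesian variable selection.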


Subjects
Mendelian Randomization Analysis , Models, Genetic , Humans , Mendelian Randomization Analysis/methods , Bayes Theorem , Risk Factors , Causality
6.
Am J Epidemiol ; 193(2): 370-376, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-37771042

ABSTRACT

Variable selection in regression models is a particularly important issue in epidemiology, where one usually encounters observational studies. In contrast to randomized trials or experiments, confounding is often not controlled by the study design, but has to be accounted for by suitable statistical methods. For instance, when risk factors should be identified with unconfounded effect estimates, multivariable regression techniques can help to adjust for confounders. We investigated the current practice of variable selection in 4 major epidemiologic journals in 2019 and found that the majority of articles used subject-matter knowledge to determine a priori the set of included variables. In comparison with previous reviews from 2008 and 2015, fewer articles applied data-driven variable selection. Furthermore, for most articles the main aim of analysis was hypothesis-driven effect estimation in rather low-dimensional data situations (i.e., large sample size compared with the number of variables). Based on our results, we discuss the role of data-driven variable selection in epidemiology.


Subjects
Research Design , Humans , Regression Analysis , Sample Size
7.
Biostatistics ; 24(2): 295-308, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-34494086

ABSTRACT

Support vector regression (SVR) is particularly beneficial when the outcome and predictors are nonlinearly related. However, when many covariates are available, the method's flexibility can lead to overfitting and an overall loss in predictive accuracy. To overcome this drawback, we develop a feature selection method for SVR based on a genetic algorithm that iteratively searches across potential subsets of covariates to find those that yield the best performance according to a user-defined fitness function. We evaluate the performance of our feature selection method for SVR, comparing it to alternate methods including LASSO and random forest, in a simulation study. We find that our method yields higher predictive accuracy than SVR without feature selection. Our method outperforms LASSO when the relationship between covariates and outcome is nonlinear. Random forest performs equivalently to our method in some scenarios, but more poorly when covariates are correlated. We apply our method to predict donor kidney function 1 year after transplant using data from the United Network for Organ Sharing national registry.


Subjects
Algorithms , Regression Analysis , Humans , Support Vector Machine
8.
Brief Bioinform ; 23(6)2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36184192

ABSTRACT

For many high-dimensional genomic and epigenomic datasets, the outcome of interest is ordinal. While these ordinal outcomes are often thought of as the observed cutpoints of some latent continuous variable, some ordinal outcomes are truly discrete and are comprised of the subjective combination of several factors. The nonlinear stereotype logistic model, which does not assume proportional odds, was developed for these 'assessed' ordinal variables. It has previously been extended to the frequentist high-dimensional feature selection setting, but the Bayesian framework provides some distinct advantages in terms of simultaneous uncertainty quantification and variable selection. Here, we review the stereotype model and Bayesian variable selection methods and demonstrate how to combine them to select genomic features associated with discrete ordinal outcomes. We compared the Bayesian and frequentist methods in terms of variable selection performance. We additionally applied the Bayesian stereotype method to an acute myeloid leukemia RNA-sequencing dataset to further demonstrate its variable selection abilities by identifying features associated with the European LeukemiaNet prognostic risk score.


Subjects
Genomics , Logistic Models , Bayes Theorem , Risk Factors
9.
Brief Bioinform ; 23(4)2022 Jul 18.
Article in English | MEDLINE | ID: mdl-35667004

ABSTRACT

In recent work, researchers have paid considerable attention to the estimation of causal effects in observational studies with a large number of covariates, which makes the unconfoundedness assumption plausible. In this paper, we review propensity score (PS) methods developed in high-dimensional settings and broadly group them into model-based methods that extend models for prediction to causal inference and balance-based methods that combine covariate balancing constraints. We conducted systematic simulation experiments to evaluate these two types of methods, and studied whether the use of balancing constraints further improved estimation performance. Our comparison methods were post-double-selection (PDS), double-index PS (DiPS), outcome-adaptive LASSO (OAL), group LASSO and doubly robust estimation (GLiDeR), high-dimensional covariate balancing PS (hdCBPS), regularized calibrated estimators (RCAL) and approximate residual balancing method (balanceHD). For the four model-based methods, simulation studies showed that GLiDeR was the most stable approach, with high estimation accuracy and precision, followed by PDS, OAL and DiPS. For balance-based methods, hdCBPS performed similarly to GLiDeR in terms of accuracy, and outperformed balanceHD and RCAL. These findings imply that PS methods do not benefit appreciably from covariate balancing constraints in high-dimensional settings. In conclusion, we recommend the preferential use of GLiDeR and hdCBPS approaches for estimating causal effects in high-dimensional settings; however, further studies on the construction of valid confidence intervals are required.


Subjects
Models, Statistical , Causality , Computer Simulation , Propensity Score
10.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38497825

ABSTRACT

Modern biomedical datasets are increasingly high-dimensional and exhibit complex correlation structures. Generalized linear mixed models (GLMMs) have long been employed to account for such dependencies. However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high-dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches.


Subjects
Algorithms , Computer Simulation , Linear Models , Monte Carlo Method
11.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465986

ABSTRACT

This paper proposes a novel likelihood-based boosting method for the selection of the random effects in linear mixed models. The nonconvexity of the objective function to minimize, which is the negative profile log-likelihood, requires the adoption of new solutions. In this respect, our optimization approach also employs the directions of negative curvature besides the usual Newton directions. A simulation study and a real-data application show the good performance of the proposal.


Subjects
Likelihood Functions , Linear Models , Computer Simulation
12.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465987

ABSTRACT

High-dimensional data sets are often available in genome-enabled predictions. Such data sets include nonlinear relationships with complex dependence structures. For such situations, vine copula-based (quantile) regression is an important tool. However, the current vine copula-based regression approaches do not scale up to high and ultra-high dimensions. To perform high-dimensional sparse vine copula-based regression, we propose 2 methods. First, we show their superiority regarding computational complexity over the existing methods. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. Then, we show our method's power in selecting relevant variables and prediction accuracy in high-dimensional sparse data sets via simulation studies. Next, we apply the proposed methods to the high-dimensional real data, aiming at the genomic prediction of maize traits. Some data processing and feature extraction steps for the real data are further discussed. Finally, we show the advantage of our methods over linear models and quantile regression forests in simulation studies and real data applications.


Subjects
Genome , Genomics , Genomics/methods , Computer Simulation , Linear Models , Phenotype
13.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38412301

ABSTRACT

Ordinal class labels are frequently observed in classification studies across various fields. In medical science, patients' responses to a drug can be arranged in the natural order, reflecting their recovery postdrug administration. The severity of the disease is often recorded using an ordinal scale, such as cancer grades or tumor stages. We propose a method based on the linear discriminant analysis (LDA) that generates a sparse, low-dimensional discriminant subspace reflecting the class orders. Unlike existing approaches that focus on predictors marginally associated with ordinal labels, our proposed method selects variables that collectively contribute to the ordinal labels. We employ the optimal scoring approach for LDA as a regularization framework, applying an ordinality penalty to the optimal scores and a sparsity penalty to the coefficients for the predictors. We demonstrate the effectiveness of our approach using a glioma dataset, where we predict cancer grades based on gene expression. A simulation study with various settings validates the competitiveness of our classification performance and demonstrates the advantages of our approach in terms of the interpretability of the estimated classifier with respect to the ordinal class labels.


Subjects
Algorithms , Neoplasms , Humans , Discriminant Analysis , Computer Simulation , Neoplasms/genetics , Neoplasms/metabolism
14.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38465988

ABSTRACT

Mixed panel count data represent a common complex data structure in longitudinal survey studies. A major challenge in analyzing such data is variable selection and estimation while efficiently incorporating both the panel count and panel binary data components. Analyses in the medical literature have often ignored the panel binary component and treated it as missing with the unknown panel counts, while obviously such a simplification does not effectively utilize the original data information. In this research, we put forward a penalized likelihood variable selection and estimation procedure under the proportional mean model. A computationally efficient EM algorithm is developed that ensures sparse estimation for variable selection, and the resulting estimator is shown to have the desirable oracle property. Simulation studies assessed and confirmed the good finite-sample properties of the proposed method, and the method is applied to analyze a motivating dataset from the Health and Retirement Study.


Subjects
Algorithms , Likelihood Functions , Computer Simulation , Longitudinal Studies
15.
Stat Med ; 43(1): 61-88, 2024 Jan 15.
Article in English | MEDLINE | ID: mdl-37927105

ABSTRACT

Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate fdr through $P$-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of $P$-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
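To make "competition-based procedures" concrete, the sketch below implements the standard knockoff+ selection threshold of the knockoff filter, which targets the global FDR. It is not the TDfdr estimator proposed in this paper, whose details the abstract does not give. It assumes each feature already has a signed statistic W_j, where a large positive value means the real feature beat its knockoff copy and the sign is symmetric under the null; the function names and the toy statistics are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t whose estimated false discovery proportion is <= q.

    The estimate is (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}):
    negative statistics act as a stand-in count for false positives.
    Returns infinity when no threshold qualifies (nothing is selected).
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of the features selected at nominal FDR level q."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics: six clear signals followed by four near-null features.
w_example = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 0.1]
selected = knockoff_select(w_example, q=0.25)
print(selected)
```

No P-values or known null distributions enter the calculation; only the signs and magnitudes of the competition statistics are used, which is the property the abstract highlights for this family of methods.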


Subjects
Algorithms , Research Design , Humans , Computer Simulation
16.
Stat Med ; 2024 Jun 23.
Article in English | MEDLINE | ID: mdl-38923006

ABSTRACT

Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small $n$ and large $p$" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, the rich body of previous research is exploited through an ensemble learning method to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.

17.
Stat Med ; 43(14): 2713-2733, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38690642

ABSTRACT

This article presents a novel method for learning time-varying dynamic Bayesian networks. The proposed method breaks down the dynamic Bayesian network learning problem into a sequence of regression inference problems and tackles each problem using the Markov neighborhood regression technique. Notably, the method demonstrates scalability concerning data dimensionality, accommodates time-varying network structure, and naturally handles multi-subject data. The proposed method exhibits consistency and offers superior performance compared to existing methods in terms of estimation accuracy and computational efficiency, as supported by extensive numerical experiments. To showcase its effectiveness, we apply the proposed method to an fMRI study investigating the effective connectivity among various regions of interest (ROIs) during an emotion-processing task. Our findings reveal the pivotal role of the subcortical-cerebellum in emotion processing.


Subjects
Bayes Theorem , Emotions , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Emotions/physiology , Markov Chains , Brain/diagnostic imaging , Brain/physiology , Computer Simulation
18.
Stat Med ; 43(8): 1509-1526, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38320545

ABSTRACT

We propose a new simultaneous variable selection and estimation procedure with the Gaussian seamless-$L_0$ (GSELO) penalty for the Cox proportional hazards model and the additive hazards model. The GSELO procedure shows good potential to improve the existing variable selection methods by taking strength from both best subset selection (BSS) and regularization. In addition, we develop an iterative algorithm to implement the proposed procedure in a computationally efficient way. Theoretically, we establish the convergence properties of the algorithm and asymptotic theoretical properties of the proposed procedure. Since parameter tuning is crucial to the performance of the GSELO procedure, we also propose an extended Bayesian information criterion (EBIC) parameter selector for the GSELO procedure. Simulated and real data studies have demonstrated the prediction performance and effectiveness of the proposed method over several state-of-the-art methods.


Subjects
Algorithms , Humans , Bayes Theorem , Proportional Hazards Models
19.
Stat Med ; 2024 May 28.
Article in English | MEDLINE | ID: mdl-38807296

ABSTRACT

Cox models with time-dependent coefficients and covariates are widely used in survival analysis. In high-dimensional settings, sparse regularization techniques are employed for variable selection, but existing methods for time-dependent Cox models lack flexibility in enforcing specific sparsity patterns (i.e., covariate structures). We propose a flexible framework for variable selection in time-dependent Cox models, accommodating complex selection rules. Our method can adapt to arbitrary grouping structures, including interaction selection, temporal, spatial, tree, and directed acyclic graph structures. It achieves accurate estimation with low false alarm rates. We develop the sox package, implementing a network flow algorithm for efficiently solving models with complex covariate structures. sox offers a user-friendly interface for specifying grouping structures and delivers fast computation. Through examples, including a case study on identifying predictors of time to all-cause death in atrial fibrillation patients, we demonstrate the practical application of our method with specific selection rules.

20.
Stat Appl Genet Mol Biol ; 22(1)2023 Jan 01.
Article in English | MEDLINE | ID: mdl-38015771

ABSTRACT

High-throughput technologies have made high-dimensional settings increasingly common, providing opportunities for the development of high-dimensional mediation methods. We aimed to provide useful guidance for researchers using high-dimensional mediation analysis and ideas for biostatisticians to develop it by summarizing and discussing recent advances in high-dimensional mediation analysis. The method still faces many challenges when single and multiple mediation analyses are extended to high-dimensional settings. The development of high-dimensional mediation methods attempts to address these issues, such as screening true mediators, estimating mediation effects by variable selection, reducing the mediation dimension to resolve correlations between variables, and utilizing composite null hypothesis testing to test them. Although these problems regarding high-dimensional mediation have been solved to some extent, some challenges remain. First, the correlation between mediators is rarely considered when the variables are selected for mediation. Second, downscaling without incorporating prior biological knowledge makes the results difficult to interpret. In addition, a method of sensitivity analysis for the strict sequential ignorability assumption in high-dimensional mediation analysis is still lacking. An analyst needs to consider the applicability of each method when utilizing them, while a biostatistician could consider extensions and improvements in the methodology.


Subjects
Mediation Analysis , Models, Statistical , Research Design