Results 1 - 9 of 9
1.
Ann Appl Stat ; 16(3): 1891-1918, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36091495

ABSTRACT

In high-dimensional regression problems, often only a relatively small subset of the features is relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
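
The alternating scheme described in the abstract above can be illustrated with a small dense example. This is only a sketch under stated assumptions, not the multiSnpnet implementation: the coefficient matrix is factored as U V^T with V orthonormal, the sparse step uses plain proximal gradient with a row-wise (group) penalty instead of the paper's adaptive screening, and the function names `row_soft_threshold`, `objective`, and `sparse_reduced_rank` are hypothetical.

```python
import numpy as np

def row_soft_threshold(U, t):
    """Shrink each row of U toward zero by t in Euclidean norm (group-lasso prox)."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return U * scale

def objective(X, Y, U, V, lam):
    """0.5 * ||Y - X U V^T||_F^2 + lam * sum of row norms of U."""
    resid = Y - X @ U @ V.T
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.linalg.norm(U, axis=1))

def sparse_reduced_rank(X, Y, rank, lam, n_outer=20, n_inner=50):
    """Alternate a row-sparse regression step (U) with a rank-constrained step (V).

    Assumed model for this sketch: Y ~ X U V^T with V^T V = I.
    """
    p = X.shape[1]
    # Start V from the leading right singular vectors of Y.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:rank].T
    U = np.zeros((p, rank))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    history = []
    for _ in range(n_outer):
        # Sparse step: with V fixed, regress the projected outcomes Y V on X.
        Z = Y @ V
        for _ in range(n_inner):
            grad = X.T @ (X @ U - Z)
            U = row_soft_threshold(U - step * grad, step * lam)
        # Reduced-rank step: with U fixed, the best orthonormal V is an
        # orthogonal Procrustes solution obtained from one SVD.
        P, _, Qt = np.linalg.svd(Y.T @ X @ U, full_matrices=False)
        V = P @ Qt
        history.append(objective(X, Y, U, V, lam))
    return U, V, history
```

Rows of U that are exactly zero correspond to features dropped from every outcome at once, which is what makes the fitted model sparse at the variant level. Each half-step cannot increase the objective, so `history` can serve as a simple convergence check.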

2.
PLoS Genet ; 18(3): e1010105, 2022 03.
Article in English | MEDLINE | ID: mdl-35324888

ABSTRACT

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10⁻⁵⁹ for quantitative traits; ρ = 0.21, p = 9.6 × 10⁻⁴ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).


Subjects
Genome-Wide Association Study , Multifactorial Inheritance , Biological Specimen Banks , Genetic Predisposition to Disease , Humans , Multifactorial Inheritance/genetics , Phenotype , Risk Factors , United Kingdom
3.
Biostatistics ; 23(2): 522-540, 2022 04 13.
Article in English | MEDLINE | ID: mdl-32989444

ABSTRACT

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, that is, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
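
As a small illustration of the two quantities named in the abstract above, the sketch below evaluates an L1-penalized negative log partial likelihood and a concordance index. It is not the snpnet-Cox code. It assumes no tied event times, divides the likelihood by the sample size (one common convention, assumed here), and treats a higher risk score as predicting earlier events; both function names are hypothetical.

```python
import numpy as np

def penalized_neg_log_partial_likelihood(X, time, event, beta, lam):
    """Cox negative log partial likelihood / n plus lam * ||beta||_1.

    Assumes no tied event times; event is 1 for observed, 0 for censored.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    eta = X @ beta
    # Sort by decreasing time so a running log-sum-exp covers each risk set.
    order = np.argsort(-time)
    eta_sorted, event_sorted = eta[order], event[order]
    log_risk_set = np.logaddexp.accumulate(eta_sorted)
    nll = -np.sum((eta_sorted - log_risk_set)[event_sorted]) / len(time)
    return nll + lam * np.sum(np.abs(beta))

def concordance_index(time, event, risk):
    """Fraction of comparable pairs ordered correctly by the risk score.

    A pair is comparable when the subject with the shorter time had an event.
    Ties in the risk score count as half-concordant.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in np.flatnonzero(event):
        later = time > time[i]
        comparable += int(later.sum())
        concordant += np.sum(risk[i] > risk[later])
        concordant += 0.5 * np.sum(risk[i] == risk[later])
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

In a validation setting, `risk` would typically be the linear predictor `X_val @ beta` for each regularization parameter on the path, and the parameter with the highest C-index would be retained.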


Subjects
Algorithms , Biological Specimen Banks , Humans , Likelihood Functions , Proportional Hazards Models , United Kingdom
6.
Nat Genet ; 53(2): 185-194, 2021 02.
Article in English | MEDLINE | ID: mdl-33462484

ABSTRACT

Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.


Subjects
Biomarkers/blood , Biomarkers/urine , HLA Antigens/genetics , Proteins/genetics , Biological Specimen Banks , Cardiovascular Diseases/genetics , Cardiovascular Diseases/metabolism , DNA Copy Number Variations , Type 2 Diabetes Mellitus/genetics , Type 2 Diabetes Mellitus/metabolism , Genetic Pleiotropy , Humans , Linkage Disequilibrium , Liver-Specific Organic Anion Transporter 1/genetics , Mendelian Randomization Analysis , Single Nucleotide Polymorphism , Chronic Renal Insufficiency , Serine Endopeptidases/genetics , United Kingdom
7.
PLoS Genet ; 16(10): e1009141, 2020 10.
Article in English | MEDLINE | ID: mdl-33095761

ABSTRACT

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
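
The screen, fit, and check cycle that the abstract above attributes to BASIL can be sketched on an in-memory matrix. This is a simplified illustration and not the snpnet API. It assumes a squared-error lasso without intercept, handles one regularization value at a time, and adds only the variables that violate the optimality (KKT) condition. The actual method is described as working with any lasso solver and with data too large for memory, which this sketch does not attempt. The helper names `lasso_cd` and `basil_like_path` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0=None, n_iter=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta

def basil_like_path(X, y, lambdas, batch_size=10, kkt_tol=1e-8):
    """Solve the lasso along decreasing lambdas using only a screened subset.

    For each lambda: fit on the current subset, compute the gradient for all
    variables, and accept the fit only if no excluded variable violates
    |x_j^T r| / n <= lambda. Otherwise add the worst violators and refit.
    """
    n, p = X.shape
    strong = np.zeros(p, dtype=bool)  # variables admitted so far
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        while True:
            idx = np.flatnonzero(strong)
            if idx.size:
                beta_sub = lasso_cd(X[:, idx], y, lam, beta0=beta[idx])
                beta = np.zeros(p)
                beta[idx] = beta_sub
            # Full pass over all variables; in a large-data setting this is
            # the only step that needs to touch every column.
            resid = y - X @ beta
            score = np.abs(X.T @ resid) / n
            violators = np.flatnonzero((score > lam + kkt_tol) & ~strong)
            if violators.size == 0:
                break  # KKT holds, so the subset solution is valid for the full problem
            worst = violators[np.argsort(-score[violators])[:batch_size]]
            strong[worst] = True
        path.append(beta.copy())
    return np.array(path), strong
```

Because the check covers every excluded variable, an accepted fit satisfies the optimality conditions of the full problem, so the screened path should agree with a solver run on all columns while the inner fits only ever see the admitted subset.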


Subjects
Asthma/epidemiology , Biological Specimen Banks , Population Genetics , Genome-Wide Association Study , Algorithms , Asthma/blood , Asthma/genetics , Body Height/genetics , Body Mass Index , Cholesterol/blood , Cohort Studies , Genotype , Humans , Logistic Models , Phenotype , Single Nucleotide Polymorphism/genetics , Proportional Hazards Models , United Kingdom/epidemiology
8.
Proc Natl Acad Sci U S A ; 115(35): E8172-E8180, 2018 08 28.
Article in English | MEDLINE | ID: mdl-30104359

ABSTRACT

Despite not spanning phospholipid bilayers, monotopic integral proteins (MIPs) play critical roles in organizing biochemical reactions on membrane surfaces. Defining the structural basis by which these proteins are anchored to membranes has been hampered by the paucity of unambiguously identified MIPs and a lack of computational tools that accurately distinguish monolayer-integrating motifs from bilayer-spanning transmembrane domains (TMDs). We used quantitative proteomics and statistical modeling to identify 87 high-confidence candidate MIPs in lipid droplets, including 21 proteins with predicted TMDs that cannot be accommodated in these monolayer-enveloped organelles. Systematic cysteine-scanning mutagenesis showed the predicted TMD of one candidate MIP, DHRS3, to be a partially buried amphipathic α-helix in both lipid droplet monolayers and the cytoplasmic leaflet of endoplasmic reticulum membrane bilayers. Coarse-grained molecular dynamics simulations support these observations, suggesting that this helix is most stable at the solvent-membrane interface. The simulations also predicted similar interfacial amphipathic helices when applied to seven additional MIPs from our dataset. Our findings suggest that interfacial helices may be a common motif by which MIPs are integrated into membranes, and provide high-throughput methods to identify and study MIPs.


Subjects
Membrane Proteins/chemistry , Proteomics , HEK293 Cells , Humans , Lipid Droplets , Membrane Proteins/genetics , Membrane Proteins/metabolism , Mutagenesis , Protein Domains , Protein Secondary Structure
9.
Stat Med ; 37(11): 1767-1787, 2018 05 20.
Article in English | MEDLINE | ID: mdl-29508417

ABSTRACT

When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.


Subjects
Biostatistics/methods , Decision Making , Treatment Outcome , Algorithms , Causality , Computer Simulation , Electronic Health Records/statistics & numerical data , Humans , Machine Learning/statistics & numerical data , Observational Studies as Topic/statistics & numerical data , Patient-Specific Modeling/statistics & numerical data , Precision Medicine/statistics & numerical data , Propensity Score , Randomized Controlled Trials as Topic/statistics & numerical data , Regression Analysis