Results 1 - 11 of 11
1.
Am J Hum Genet ; 110(2): 314-325, 2023 02 02.
Article in English | MEDLINE | ID: mdl-36610401

ABSTRACT

Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10^5 to 10^6 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken-and-egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.


Subject(s)
Biological Specimen Banks , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Likelihood Functions , Population Groups , Software , Genetics, Population
2.
Am J Hum Genet ; 109(3): 433-445, 2022 03 03.
Article in English | MEDLINE | ID: mdl-35196515

ABSTRACT

Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. One among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and HbA1c levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, has recently come to be considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWASs) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed effect model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements and robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals loci that significantly alter the mean or WS variability.


Subject(s)
Biological Specimen Banks , Genome-Wide Association Study , Biomarkers , Cross-Sectional Studies , Electronic Health Records , Humans , Longitudinal Studies
3.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37067496

ABSTRACT

MOTIVATION: In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. RESULTS: We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. AVAILABILITY AND IMPLEMENTATION: Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.
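For readers unfamiliar with iterative hard thresholding, the core idea named in the abstract can be sketched for a single trait as projected gradient descent on a least-squares loss: take a gradient step, then keep only the k largest coefficients. This is a minimal illustration under stated assumptions, not MendelIHT.jl's implementation or API; the function names `hard_threshold` and `iht`, the fixed step size, and the toy data are all hypothetical, and the multivariate version described in the abstract is not shown.

```python
def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and set the rest to zero."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def iht(X, y, k, step, iters=100):
    """Illustrative single-trait IHT for least squares (hypothetical helper).

    X is a list of rows, y a list of responses, k the sparsity level,
    and step a fixed step size assumed small enough for convergence.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residuals r = y - X beta
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Gradient of 0.5 * ||y - X beta||^2 is -X^T r
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by projection onto k-sparse vectors
        beta = hard_threshold([beta[j] - step * grad[j] for j in range(p)], k)
    return beta


# Toy example: identity design, so the least-squares fit is y itself
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [2.0, 0.1, -3.0]
estimate = iht(X, y, k=2, step=1.0)
```

With the identity design, the projection simply retains the two largest responses and zeroes the small one, which shows how the sparsity constraint acts as model selection across all predictors at once.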


Subject(s)
Genome-Wide Association Study , Software , Algorithms , Computer Simulation , Phenotype , Polymorphism, Single Nucleotide
4.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34254998

ABSTRACT

Statistical analysis of ultrahigh-dimensional omics scale data has long depended on univariate hypothesis testing. With growing data features and samples, the obvious next step is to establish multivariable association analysis as a routine method to describe genotype-phenotype association. Here we present ParProx, a state-of-the-art implementation to optimize overlapping and non-overlapping group lasso regression models for time-to-event and classification analysis, with selection of variables grouped by biological priors. ParProx enables multivariable model fitting for ultrahigh-dimensional data within an architecture for parallel or distributed computing via latent variable group representation. It thereby aims to produce interpretable regression models consistent with known biological relationships among independent variables, a property often explored post hoc, not during model estimation. Simulation studies clearly demonstrate the scalability of ParProx with graphics processing units in comparison to existing implementations. We illustrate the tool using three different omics data sets featuring moderate to large numbers of variables, where we use genomic regions and biological pathways as variable groups, rendering the selected independent variables directly interpretable with respect to those groups. ParProx is applicable to a wide range of studies using ultrahigh-dimensional omics data, from genome-wide association analysis to multi-omics studies where model estimation is computationally intractable with existing implementations.
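Group lasso models of the kind mentioned in the abstract are commonly fitted with proximal gradient methods, whose key step is block soft-thresholding: each group of coefficients is shrunk toward zero as a unit and dropped entirely if its norm falls below the threshold. The sketch below shows that step for non-overlapping groups only. It is an assumption-laden illustration, not ParProx's code: the function name `prox_group_lasso` and the example values are hypothetical, and the latent variable representation used for overlapping groups is not shown.

```python
import math


def prox_group_lasso(beta, groups, threshold):
    """Block soft-thresholding for non-overlapping groups (hypothetical helper).

    beta is a list of coefficients, groups a list of index lists that
    partition the coefficients, and threshold the step size times the penalty.
    """
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        # Shrink the whole group; zero it out if its norm is below the threshold
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out


# Two groups: the first has norm 5 and is shrunk, the second is removed
shrunk = prox_group_lasso([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], threshold=1.0)
```

Because whole groups enter or leave the model together, the selected variables can be read directly in terms of the groups (for example genomic regions or pathways) that defined them.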


Subject(s)
Algorithms , Computational Biology/methods , Genomics/methods , Regression Analysis , Software , Biomarkers , Disease Susceptibility , Gene Expression Profiling , Humans , Mutation , Prognosis , Proportional Hazards Models , Protein Interaction Mapping
5.
Bioinformatics ; 37(24): 4756-4763, 2021 12 11.
Article in English | MEDLINE | ID: mdl-34289008

ABSTRACT

MOTIVATION: Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers. RESULTS: We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing. AVAILABILITY AND IMPLEMENTATION: Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression , Software , Genotype , Haplotypes , Polymorphism, Single Nucleotide
6.
Stat Sci ; 37(4): 494-518, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37168541

ABSTRACT

Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. We review these advances from a statistical computing perspective. Cloud computing makes access to supercomputers affordable. Deep learning software libraries make programming statistical algorithms easy and enable users to write code once and run it anywhere, from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. Highlighting how these developments benefit statisticians, we review recent optimization algorithms that are useful for high-dimensional models and can harness the power of HPC. Code snippets are provided to demonstrate the ease of programming. We also provide an easy-to-use distributed matrix data structure suitable for HPC. Employing this data structure, we illustrate various statistical applications including large-scale positron emission tomography and ℓ1-regularized Cox regression. Our examples easily scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of type-2 diabetes from the UK Biobank with 200,000 subjects and about 500,000 single nucleotide polymorphisms using the HPC ℓ1-regularized Cox regression. Fitting this half-million-variate model takes less than 45 minutes and reconfirms known associations. To our knowledge, this is the first demonstration of the feasibility of penalized regression of survival outcomes at this scale.

7.
Hum Genet ; 139(1): 61-71, 2020 Jan.
Article in English | MEDLINE | ID: mdl-30915546

ABSTRACT

Statistical methods for genome-wide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.


Subject(s)
Computational Biology/methods , Genome, Human , Genome-Wide Association Study , Models, Statistical , Programming Languages , Algorithms , Humans , Polymorphism, Single Nucleotide , Software
8.
Genes Chromosomes Cancer ; 54(11): 681-91, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26227178

ABSTRACT

Relatively few recurrent gene fusion events have been associated with breast cancer to date. In an effort to uncover novel fusion transcripts, we performed whole-transcriptome sequencing of 120 fresh-frozen primary breast cancer samples and five adjacent normal breast tissues using the Illumina HiSeq2000 platform. Three different fusion-detecting tools (deFuse, Chimerascan, and TopHatFusion) were used, and the results were compared. These tools detected 3,831, 6,630 and 516 fusion transcripts (FTs) overall, respectively. We primarily focused on the results obtained using the deFuse software. More FTs were identified from HER2 subtype breast cancer samples than from the luminal or triple-negative subtypes (P < 0.05). Seventy fusion candidates were selected for validation, and 32 (45.7%) were confirmed by RT-PCR and Sanger sequencing. Of the validated fusions, six were recurrent (found in 2 or more samples), three were in-frame (PRDX1-AKR1A1, TACSTD2-OMA1, and C2CD2-TFF1) and three were off-frame (CEACAM7-CEACAM6, CYP4X1-CYP4Z2P, and EEF1DP3-FRY). Notably, the novel read-through fusion, EEF1DP3-FRY, was identified and validated in 6.7% (8/120) of the breast cancer samples. This off-frame fusion results in early truncation of the FRY gene, which plays a key role in structural integrity during mitosis. Three previously reported fusions, PPP1R1B-STARD3, MFGE8-HAPL, and ETV6-NTRK3, were detected in 8.3, 3.3, and 0.8% of the 120 samples, respectively, by both deFuse and Chimerascan. The recently reported MAGI3-AKT3 fusion was not detected in our analysis. Although future work will be needed to examine the biological significance of our new findings, we identified a number of novel fusions and confirmed some previously reported fusions.


Subject(s)
Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Gene Fusion , RNA, Messenger/genetics , RNA, Messenger/metabolism , Transcriptome , Female , Gene Expression Profiling , Humans , Sequence Analysis, RNA/methods , Software
9.
Diabetes ; 71(5): 1137-1148, 2022 05 01.
Article in English | MEDLINE | ID: mdl-35133398

ABSTRACT

Diabetes-related complications reflect longstanding damage to small and large vessels throughout the body. In addition to the duration of diabetes and poor glycemic control, genetic factors are important contributors to the variability in the development of vascular complications. Early heritability studies found strong familial clustering of both macrovascular and microvascular complications. However, they were limited by small sample sizes and large phenotypic heterogeneity, leading to less accurate estimates. We take advantage of two independent studies, UK Biobank and the Action to Control Cardiovascular Risk in Diabetes trial, to survey the single nucleotide polymorphism heritability for diabetes microvascular (diabetic kidney disease and diabetic retinopathy) and macrovascular (cardiovascular events) complications. Heritability for diabetic kidney disease was estimated at 29%. The heritability estimate for microalbuminuria ranged from 24 to 60% and was 41% for macroalbuminuria. Heritability estimates of diabetic retinopathy ranged from 6 to 33%, depending on the phenotype definition. More severe diabetic retinopathy possessed higher genetic contributions. We show, for the first time, that rare variants account for much of the heritability of diabetic retinopathy. This study suggests that a large portion of the genetic risk of diabetes complications is yet to be discovered and emphasizes the need for additional genetic studies of diabetes complications.


Subject(s)
Diabetes Mellitus, Type 2 , Diabetic Nephropathies , Diabetic Retinopathy , Albuminuria , Biological Specimen Banks , Diabetes Mellitus, Type 2/genetics , Diabetic Nephropathies/genetics , Diabetic Retinopathy/complications , Diabetic Retinopathy/epidemiology , Diabetic Retinopathy/genetics , Female , Humans , Male , Risk Factors , United Kingdom/epidemiology
10.
Ann Appl Stat ; 15(4): 1652-1672, 2021 Dec.
Article in English | MEDLINE | ID: mdl-35198092

ABSTRACT

Single nucleotide polymorphism (SNP) set analysis aggregates both common and rare variants and tests for association between phenotype(s) of interest and a set. However, multiple SNP-sets, such as genes, pathways, or sliding windows, are usually investigated across the whole genome, with all groups tested separately and multiple testing adjustments applied afterward. We propose a novel method to prioritize SNP-sets in a joint multivariate variance component model. Each SNP-set corresponds to a variance component (or kernel), and model selection is achieved by incorporating either convex or nonconvex penalties. The uniqueness of this variance component selection framework, which we call VCSEL, is that it naturally encompasses multivariate traits (VCSEL-M) and SNP-set-treatment or -environment interactions (VCSEL-I). We devise an optimization algorithm scalable to many variance components, based on the majorization-minimization (MM) principle. Simulation studies demonstrate the superiority of our methods in model selection performance, as measured by the area under the precision-recall (PR) curve, compared to the commonly used marginal testing and group penalization methods. Finally, we apply our methods to a real pharmacogenomics study and a real whole exome sequencing study. Some genes ranked highly by VCSEL are deemed insignificant by the marginal test methods, which emphasize formal inference of individual genes under a strict significance threshold. This provides alternative insights for biologists to prioritize follow-up studies and develop polygenic risk score models.

11.
Autophagy ; 11(5): 796-811, 2015.
Article in English | MEDLINE | ID: mdl-25946189

ABSTRACT

The EWSR1 (EWS RNA-binding protein 1/Ewing Sarcoma Break Point Region 1) gene encodes an RNA/DNA-binding protein that is ubiquitously expressed and involved in various cellular processes. EWSR1 deficiency leads to impairment of development and accelerated senescence, but the mechanism is not known. Herein, we found that EWSR1 modulates the Uvrag (UV radiation resistance associated) gene at the post-transcriptional level. Interestingly, EWSR1 deficiency led to the activation of the DROSHA-mediated microprocessor complex and increased the level of Mir125a and Mir351, which directly target Uvrag. Moreover, the Mir125a- and Mir351-mediated reduction of Uvrag was associated with the inhibition of autophagy that was confirmed in ewsr1 knockout (KO) MEFs and ewsr1 KO mice. Taken together, our data indicate that EWSR1 is involved in the post-transcriptional regulation of Uvrag via a miRNA-dependent pathway, resulting in the deregulation of autophagy inhibition. The mechanism of Uvrag and autophagy regulation by EWSR1 provides new insights into the role of EWSR1 deficiency-related cellular dysfunction.


Subject(s)
Autophagy , Calmodulin-Binding Proteins/deficiency , MicroRNAs/metabolism , Tumor Suppressor Proteins/metabolism , Animals , Autophagy/genetics , Base Sequence , Calmodulin-Binding Proteins/metabolism , Down-Regulation/genetics , Embryo, Mammalian/cytology , Fibroblasts/metabolism , Mice , Mice, Knockout , Molecular Sequence Data , NIH 3T3 Cells , RNA-Binding Protein EWS , RNA-Binding Proteins , Transcription, Genetic