Search | VHL CLAP/WR-PAHO/WHO

1.

Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets.

Ko, Seyoon; Chu, Benjamin B; Peterson, Daniel; Okenwa, Chidera; Papp, Jeanette C; Alexander, David H; Sobel, Eric M; Zhou, Hua; Lange, Kenneth L.

Am J Hum Genet ; 110(2): 314-325, 2023 02 02.

Article in English | MEDLINE | ID: mdl-36610401

ABSTRACT

Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10⁵ to 10⁶ samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken-and-egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.

Subject(s)

Biological Specimen Banks , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Likelihood Functions , Population Groups , Software , Genetics, Population
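
To make "likelihood-based estimation of admixture proportions" concrete, here is a minimal Python sketch of the binomial admixture log-likelihood commonly associated with programs such as ADMIXTURE. It is an illustration only and does not reproduce the OpenADMIXTURE API; the function name `admixture_loglik` and the nested-list data layout are assumptions made for this example.

```python
import math

def admixture_loglik(genotypes, q, f):
    """Binomial log-likelihood of genotype counts under an admixture model.

    genotypes[i][j]: allele count (0, 1, or 2) for individual i at SNP j
    q[i][k]: ancestry proportion of individual i from population k (rows sum to 1)
    f[k][j]: allele frequency of SNP j in population k (strictly between 0 and 1)
    Constant binomial coefficients are omitted because they do not depend on q or f.
    """
    total = 0.0
    for g_row, q_row in zip(genotypes, q):
        for j, g in enumerate(g_row):
            # Mixture allele frequency for this individual at SNP j.
            p = sum(q_ik * f_k[j] for q_ik, f_k in zip(q_row, f))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total

# One individual, half from each of two populations, homozygous at one SNP.
value = admixture_loglik([[2]], [[0.5, 0.5]], [[0.2], [0.8]])
```

An estimation program would maximize this quantity over `q` and `f`; restricting the SNP index `j` to a set of AIMs shrinks the inner loop, which is the computational motivation given in the abstract.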

2.

MM optimization: Proximal distance algorithms, path following, and trust regions.

Landeros, Alfonso; Xu, Jason; Lange, Kenneth.

Proc Natl Acad Sci U S A ; 120(27): e2303168120, 2023 Jul 04.

Article in English | MEDLINE | ID: mdl-37339185

ABSTRACT

We briefly review the majorization-minimization (MM) principle and elaborate on the closely related notion of proximal distance algorithms, a generic approach for solving constrained optimization problems via quadratic penalties. We illustrate how the MM and proximal distance principles apply to a variety of problems from statistics, finance, and nonlinear optimization. Drawing from our selected examples, we also sketch a few ideas pertinent to the acceleration of MM algorithms: a) structuring updates around efficient matrix decompositions, b) path following in proximal distance iteration, and c) cubic majorization and its connections to trust region methods. These ideas are put to the test on several numerical examples, but for the sake of brevity, we omit detailed comparisons to competing methods. The current article, which is a mix of review and current contributions, celebrates the MM principle as a powerful framework for designing optimization algorithms and reinterpreting existing ones.
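
As a small illustration of the majorization-minimization idea (a textbook example, not one taken from the article), the Python sketch below computes a sample median by repeatedly majorizing each absolute deviation |x - m| with a quadratic anchored at the current iterate. Minimizing the quadratic surrogate gives a weighted mean. The function name `mm_median` and the guard `eps` are assumptions made for this example.

```python
def mm_median(xs, iters=100, eps=1e-12):
    """Minimize sum(|x - m|) over m by majorization-minimization.

    At the current iterate m_k, |r| <= r**2 / (2 * |r_k|) + |r_k| / 2
    (arithmetic-geometric mean inequality), so the surrogate is a weighted
    least-squares problem whose minimizer is a weighted mean.
    """
    m = sum(xs) / len(xs)  # start from the sample mean
    for _ in range(iters):
        # eps guards against division by zero when m lands on a data point.
        weights = [1.0 / max(abs(x - m), eps) for x in xs]
        m = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return m

estimate = mm_median([1.0, 2.0, 3.0, 10.0, 100.0])
```

Each update can only decrease the objective, which is the descent property that makes MM algorithms numerically stable. An iterate that lands exactly on a non-median data point can stall, so a production version would need a more careful guard than `eps`.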

3.

Multivariate genome-wide association analysis by iterative hard thresholding.

Chu, Benjamin B; Ko, Seyoon; Zhou, Jin J; Jensen, Aubrey; Zhou, Hua; Sinsheimer, Janet S; Lange, Kenneth.

Bioinformatics ; 39(4), 2023 04 03.

Article in English | MEDLINE | ID: mdl-37067496

ABSTRACT

MOTIVATION: In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. RESULTS: We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. AVAILABILITY AND IMPLEMENTATION: Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.

Subject(s)

Genome-Wide Association Study , Software , Algorithms , Computer Simulation , Phenotype , Polymorphism, Single Nucleotide

4.

Pooled analysis of radiation hybrids identifies loci for growth and drug action in mammalian cells.

Khan, Arshad H; Lin, Andy; Wang, Richard T; Bloom, Joshua S; Lange, Kenneth; Smith, Desmond J.

Genome Res ; 30(10): 1458-1467, 2020 10.

Article in English | MEDLINE | ID: mdl-32878976

ABSTRACT

Genetic screens in mammalian cells commonly focus on loss-of-function approaches. To evaluate the phenotypic consequences of extra gene copies, we used bulk segregant analysis (BSA) of radiation hybrid (RH) cells. We constructed six pools of RH cells, each consisting of ~2500 independent clones, and placed the pools under selection in media with or without paclitaxel. Low pass sequencing identified 859 growth loci, 38 paclitaxel loci, 62 interaction loci, and three loci for mitochondrial abundance at genome-wide significance. Resolution was measured as ~30 kb, close to single-gene. Divergent properties were displayed by the RH-BSA growth genes compared to those from loss-of-function screens, refuting the balance hypothesis. In addition, enhanced retention of human centromeres in the RH pools suggests a new approach to functional dissection of these chromosomal elements. Pooled analysis of RH cells showed high power and resolution and should be a useful addition to the mammalian genetic toolkit.

Subject(s)

Cell Growth Processes/genetics , Radiation Hybrid Mapping/methods , Animals , Centromere , Cricetinae , DNA , Disease/genetics , Genetic Loci , HEK293 Cells , Humans , Mitochondria , Mycoplasma/isolation & purification , Paclitaxel/pharmacology

5.

Differential methods for assessing sensitivity in biological models.

Mester, Rachel; Landeros, Alfonso; Rackauckas, Chris; Lange, Kenneth.

PLoS Comput Biol ; 18(6): e1009598, 2022 06.

Article in English | MEDLINE | ID: mdl-35696417

ABSTRACT

Differential sensitivity analysis is indispensable in fitting parameters, understanding uncertainty, and forecasting the results of both thought and lab experiments. Although there are many methods currently available for performing differential sensitivity analysis of biological models, it can be difficult to determine which method is best suited for a particular model. In this paper, we explain a variety of differential sensitivity methods and assess their value in some typical biological models. First, we explain the mathematical basis for three numerical methods: adjoint sensitivity analysis, complex perturbation sensitivity analysis, and forward mode sensitivity analysis. We then carry out four instructive case studies. (a) The CARRGO model for tumor-immune interaction highlights the additional information that differential sensitivity analysis provides beyond traditional naive sensitivity methods, (b) the deterministic SIR model demonstrates the value of using second-order sensitivity in refining model predictions, (c) the stochastic SIR model shows how differential sensitivity can be attacked in stochastic modeling, and (d) a discrete birth-death-migration model illustrates how the complex perturbation method of differential sensitivity can be generalized to a broader range of biological models. Finally, we compare the speed, accuracy, and ease of use of these methods. We find that forward mode automatic differentiation has the quickest computational time, while the complex perturbation method is the simplest to implement and the most generalizable.

Subject(s)

Models, Biological , Stochastic Processes , Uncertainty
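
The sketch below assumes that the "complex perturbation" method named in the abstract is the standard complex-step derivative approximation, f'(x) ≈ Im f(x + ih) / h. It is a Python illustration rather than the authors' code; `complex_step_derivative` and the toy exponential-growth model `population` are hypothetical names introduced here.

```python
import cmath
import math

def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im(f(x + i*h)) / h.

    No subtraction of nearly equal numbers occurs, so h can be tiny
    without the cancellation error that affects finite differences.
    f must accept complex input and be analytic near x.
    """
    return f(complex(x, h)).imag / h

def population(r, n0=100.0, t=3.0):
    """Toy model output: exponential growth n0 * exp(r * t) at time t."""
    return n0 * cmath.exp(r * t)

# Sensitivity of the model output to the growth rate r at r = 0.5.
sensitivity = complex_step_derivative(population, 0.5)
exact = 3.0 * 100.0 * math.exp(1.5)  # analytic value t * n0 * exp(r * t)
```

The only requirement on the model code is that it runs on complex numbers, which is consistent with the abstract's remark that this method is the simplest to implement.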

6.

A fast data-driven method for genotype imputation, phasing and local ancestry inference: MendelImpute.jl.

Chu, Benjamin B; Sobel, Eric M; Wasiolek, Rory; Ko, Seyoon; Sinsheimer, Janet S; Zhou, Hua; Lange, Kenneth.

Bioinformatics ; 37(24): 4756-4763, 2021 12 11.

Article in English | MEDLINE | ID: mdl-34289008

ABSTRACT

MOTIVATION: Current methods for genotype imputation and phasing exploit the volume of data in haplotype reference panels and rely on hidden Markov models (HMMs). Existing programs all have essentially the same imputation accuracy, are computationally intensive and generally require prephasing the typed markers. RESULTS: We introduce a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for HMM calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs (single nucleotide polymorphisms). Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing. AVAILABILITY AND IMPLEMENTATION: Software, documentation and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelImpute.jl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Compression , Software , Genotype , Haplotypes , Polymorphism, Single Nucleotide

7.

Postnatal persistence of nonhuman primate sex-dependent renal structural and molecular changes programmed by intrauterine growth restriction.

Bishop, Andrew C; Spradling-Reeves, Kimberly D; Shade, Robert E; Lange, Kenneth J; Birnbaum, Shifra; Favela, Kristin; Dick, Edward J; Nijland, Mark J; Li, Cun; Nathanielsz, Peter W; Cox, Laura A.

J Med Primatol ; 51(6): 329-344, 2022 12.

Article in English | MEDLINE | ID: mdl-35855511

ABSTRACT

BACKGROUND: Poor nutrition during fetal development programs postnatal kidney function. Understanding postnatal consequences in nonhuman primates (NHP) is important for translation to our understanding of the impact on human kidney function and disease risk. We hypothesized that intrauterine growth restriction (IUGR) in NHP persists postnatally, with potential molecular mechanisms revealed by Western-type diet challenge. METHODS: IUGR juvenile baboons were fed a 7-week Western diet, with kidney biopsies, blood, and urine collected before and after challenge. Transcriptomics and metabolomics were used to analyze biosamples. RESULTS: Pre-challenge IUGR kidney transcriptome and urine metabolome differed from controls. Post-challenge, sex and diet-specific responses in urine metabolite and renal signaling pathways were observed. Dysregulated mTOR signaling persisted postnatally in female pre-challenge. Post-challenge IUGR male response showed uncoordinated signaling suggesting proximal tubule injury. CONCLUSION: Fetal undernutrition impacts juvenile offspring kidneys at the molecular level suggesting early-onset blood pressure dysregulation.

Subject(s)

Fetal Growth Retardation , Kidney , Humans , Animals , Female , Male , Fetal Growth Retardation/etiology , Fetal Growth Retardation/veterinary , Kidney/pathology , Papio , Blood Pressure

8.

A Legacy of EM Algorithms.

Lange, Kenneth; Zhou, Hua.

Int Stat Rev ; 90(Suppl 1): S52-S66, 2022 Dec.

Article in English | MEDLINE | ID: mdl-37204987

ABSTRACT

Nan Laird has an enormous and growing impact on computational statistics. Her paper with Dempster and Rubin on the expectation-maximisation (EM) algorithm is the second most cited paper in statistics. Her papers and book on longitudinal modelling are nearly as impressive. In this brief survey, we revisit the derivation of some of her most useful algorithms from the perspective of the minorisation-maximisation (MM) principle. The MM principle generalises the EM principle and frees it from the shackles of missing data and conditional expectations. Instead, the focus shifts to the construction of surrogate functions via standard mathematical inequalities. The MM principle can deliver a classical EM algorithm with less fuss or an entirely new algorithm with a faster rate of convergence. In any case, the MM principle enriches our understanding of the EM principle and suggests new algorithms of considerable potential in high-dimensional settings where standard algorithms such as Newton's method and Fisher scoring falter.

9.

Modern simulation utilities for genetic analysis.

Ji, Sarah S; German, Christopher A; Lange, Kenneth; Sinsheimer, Janet S; Zhou, Hua; Zhou, Jin; Sobel, Eric M.

BMC Bioinformatics ; 22(1): 228, 2021 May 03.

Article in English | MEDLINE | ID: mdl-33941078

ABSTRACT

BACKGROUND: Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools. RESULTS: We present TraitSimulation, an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. This package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, easy access to multi-CPU and GPU hardware, and to distributed and cloud-based parallelization. TraitSimulation is designed to encourage flexible trait simulation, including via the standard devices of applied statistics, generalized linear models (GLMs) and generalized linear mixed models (GLMMs). TraitSimulation also accommodates many study designs: unrelateds, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on Github. These interactive notebooks are ideal for reproducible research. CONCLUSION: The TraitSimulation package has three main advantages. 
(1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally (3), by allowing a wider range of more realistic phenotype models, TraitSimulation brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.

Subject(s)

Cloud Computing , Genetic Testing , Aged , Computer Simulation , Humans , Pedigree , Phenotype

10.

Linear mixed models for association analysis of quantitative traits with next-generation sequencing data.

Chiu, Chi-Yang; Yuan, Fang; Zhang, Bing-Song; Yuan, Ao; Li, Xin; Fang, Hong-Bin; Lange, Kenneth; Weeks, Daniel E; Wilson, Alexander F; Bailey-Wilson, Joan E; Musolf, Anthony M; Stambolian, Dwight; Lakhal-Chaieb, M'Hamed Lajmi; Cook, Richard J; McMahon, Francis J; Amos, Christopher I; Xiong, Momiao; Fan, Ruzong.

Genet Epidemiol ; 43(2): 189-206, 2019 Mar.

Article in English | MEDLINE | ID: mdl-30537345

ABSTRACT

We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F-distributed statistics provide a good control of the type I error rate. The F-test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F-statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error rate well at the nominal levels α = 0.01 and 0.05. For moderate/large samples, the LRT statistics of the FLMMs control the type I error rates well. The LRT statistics of the LMMs can lead to inflated type I error rates. The proposed models are useful in whole genome and whole exome association studies of complex traits.

Subject(s)

Genetic Association Studies , High-Throughput Nucleotide Sequencing/methods , Models, Genetic , Quantitative Trait, Heritable , Computer Simulation , Family , Humans , Linear Models , Myopia/genetics

11.

OPENMENDEL: a cooperative programming project for statistical genetics.

Zhou, Hua; Sinsheimer, Janet S; Bates, Douglas M; Chu, Benjamin B; German, Christopher A; Ji, Sarah S; Keys, Kevin L; Kim, Juhyun; Ko, Seyoon; Mosher, Gordon D; Papp, Jeanette C; Sobel, Eric M; Zhai, Jing; Zhou, Jin J; Lange, Kenneth.

Hum Genet ; 139(1): 61-71, 2020 Jan.

Article in English | MEDLINE | ID: mdl-30915546

ABSTRACT

Statistical methods for genome-wide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.

Subject(s)

Computational Biology/methods , Genome, Human , Genome-Wide Association Study , Models, Statistical , Programming Languages , Algorithms , Humans , Polymorphism, Single Nucleotide , Software

12.

Iterative hard thresholding for model selection in genome-wide association studies.

Keys, Kevin L; Chen, Gary K; Lange, Kenneth.

Genet Epidemiol ; 41(8): 756-768, 2017 12.

Article in English | MEDLINE | ID: mdl-28875524

ABSTRACT

A genome-wide association study (GWAS) correlates marker and trait variation in a study sample. Each subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here, we assume that subjects are randomly collected unrelateds and that trait values are normally distributed or can be transformed to normality. Over the past decade, geneticists have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with the ℓ1 penalty (LASSO) or the minimax concave penalty (MCP) is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Here, we compare LASSO and MCP penalized regression to iterative hard thresholding (IHT). On GWAS regression data, IHT is better at model selection and comparable in speed to both methods of penalized regression. This conclusion holds for both simulated and real GWAS data. IHT fosters parallelization and scales well in problems with large numbers of causal markers. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage commodity desktop computers in GWAS analysis and to avoid supercomputing. AVAILABILITY: Source code is freely available at https://github.com/klkeys/IHT.jl.

Subject(s)

Genome-Wide Association Study , Models, Genetic , Algorithms , Body Mass Index , Cholesterol, HDL/genetics , Cholesterol, LDL/genetics , Humans , Phenotype , Polymorphism, Single Nucleotide , Triglycerides/genetics
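
Iterative hard thresholding can be sketched in a few lines: take a gradient step on the least-squares loss, then keep only the k largest coefficients in magnitude. The Python example below is a toy version written for this listing and is not the IHT.jl implementation. The names `iht`, `hard_threshold`, `loss`, and `demo_data` are hypothetical, and the fixed step size (one over the squared Frobenius norm of the design matrix) is a conservative assumption chosen so that the loss never increases.

```python
import random

def hard_threshold(v, k):
    """Keep the k entries of v that are largest in magnitude; zero the rest."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in keep:
        out[j] = v[j]
    return out

def loss(X, y, b):
    """Least-squares loss 0.5 * ||y - X b||^2."""
    return 0.5 * sum(
        (yi - sum(x * bj for x, bj in zip(row, b))) ** 2 for row, yi in zip(X, y)
    )

def iht(X, y, k, iters=300):
    """Projected gradient descent onto the set of k-sparse vectors."""
    n, p = len(X), len(X[0])
    # Squared Frobenius norm bounds the largest eigenvalue of X^T X from above.
    step = 1.0 / sum(x * x for row in X for x in row)
    b = [0.0] * p
    for _ in range(iters):
        resid = [yi - sum(x * bj for x, bj in zip(row, b)) for row, yi in zip(X, y)]
        neg_grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        b = hard_threshold([bj + step * g for bj, g in zip(b, neg_grad)], k)
    return b

def demo_data(n=50, p=10, seed=0):
    """Simulated noiseless data with two nonzero effects (columns 2 and 7)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    truth = [0.0] * p
    truth[2], truth[7] = 3.0, -2.0
    y = [sum(x * b for x, b in zip(row, truth)) for row in X]
    return X, y

X, y = demo_data()
b_hat = iht(X, y, k=2)
```

Unlike LASSO, the sparsity level k is enforced exactly at every iteration and the retained coefficients are not shrunk toward zero. Practical implementations typically choose the step size adaptively and select k by cross-validation.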

13.

Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data.

Zhou, Hua; Blangero, John; Dyer, Thomas D; Chan, Kei-Hang K; Lange, Kenneth; Sobel, Eric M.

Genet Epidemiol ; 41(3): 174-186, 2017 04.

Article in English | MEDLINE | ID: mdl-27943406

ABSTRACT

Since most analysis software for genome-wide association studies (GWAS) currently exploit only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even datasets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper, we reexamine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (six CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1,388 individuals in 124 pedigrees) takes less than 2 min and 1.5 GB of memory. Complete multivariate QTL analysis of the three time-points of the longitudinal HDL multivariate trait takes less than 5 min and 1.5 GB of memory. 
The algorithm is implemented as the Ped-GWAS Analysis (Option 29) in the Mendel statistical genetics package, which is freely available for Macintosh, Linux, and Windows platforms from http://genetics.ucla.edu/software/mendel.

Subject(s)

Genetic Linkage , Genome, Human , Genome-Wide Association Study , Pedigree , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci , Humans , Models, Genetic , Models, Statistical , Software

14.

The non-human primate kidney transcriptome in fetal development.

Spradling-Reeves, Kimberly D; Glenn, Jeremy P; Lange, Kenneth J; Kuhn, Natalia; Coalson, Jacqueline J; Nijland, Mark J; Li, Cun; Nathanielsz, Peter W; Cox, Laura A.

J Med Primatol ; 47(3): 157-171, 2018 06.

Article in English | MEDLINE | ID: mdl-29603257

ABSTRACT

BACKGROUND: Little is known about the repertoire of non-human primate kidney genes expressed throughout development. The present work establishes an understanding of the primate renal transcriptome during fetal development in the context of renal maturation. METHODS: The baboon kidney transcriptome was characterized at 60-day gestation (DG), 90 DG, 125 DG, 140 DG, 160 DG and adulthood (6-12 years) using gene arrays and validated by QRT-PCR. Pathway and cluster analyses were used to characterize gene expression in the context of biological pathways. RESULTS: Pathway analysis indicated activation of pathways not previously reported as relevant to kidney development. Cluster analysis also revealed gene splice variants with discordant expression profiles during development. CONCLUSIONS: This study provides the first detailed genetic analysis of the developing primate kidney, and our findings of discordant expression of gene splice variants suggest that gene arrays likely provide a simplified view and demonstrate the need to study the fetal renal proteome.

Subject(s)

Fetal Development/genetics , Kidney/growth & development , Papio hamadryas/genetics , Transcriptome , Animals , Kidney/embryology , Papio hamadryas/embryology , Papio hamadryas/growth & development , RNA, Messenger/genetics

15.

Efficient analysis of large datasets and sex bias with ADMIXTURE.

Shringarpure, Suyash S; Bustamante, Carlos D; Lange, Kenneth; Alexander, David H.

BMC Bioinformatics ; 17: 218, 2016 May 23.

Article in English | MEDLINE | ID: mdl-27216439

ABSTRACT

BACKGROUND: A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. RESULTS: We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5× speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes project using this extension. CONCLUSIONS: These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.

Subject(s)

Genetics, Population , Genomics/methods , Software , Black or African American/genetics , Female , Gene Frequency , HapMap Project , Humans , Male , Southwestern United States

16.

Genotype imputation via matrix completion.

Chi, Eric C; Zhou, Hua; Chen, Gary K; Del Vecchyo, Diego Ortega; Lange, Kenneth.

Genome Res ; 23(3): 509-18, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23233546

ABSTRACT

Most current genotype imputation methods are model-based and computationally intensive, taking days to impute one chromosome pair on 1000 people. We describe an efficient genotype imputation method based on matrix completion. Our matrix completion method is implemented in MATLAB and tested on real data from HapMap 3, simulated pedigree data, and simulated low-coverage sequencing data derived from the 1000 Genomes Project. Compared with leading imputation programs, the matrix completion algorithm embodied in our program MENDEL-IMPUTE achieves comparable imputation accuracy while reducing run times significantly. Implementation in a lower-level language such as Fortran or C is apt to further improve computational efficiency.

Subject(s)

Artificial Intelligence , Genotype , Models, Genetic , Software , Algorithms , Computer Simulation , Genome, Human , HapMap Project , Humans , Microarray Analysis , Polymorphism, Single Nucleotide

17.

A multivariate Bernoulli model to predict DNaseI hypersensitivity status from haplotype data.

Shi, Huwenbo; Pasaniuc, Bogdan; Lange, Kenneth L.

Bioinformatics ; 31(21): 3514-21, 2015 Nov 01.

Article in English | MEDLINE | ID: mdl-26139633

ABSTRACT

MOTIVATION: Haplotype models enjoy a wide range of applications in population inference and disease gene discovery. The hidden Markov models traditionally used for haplotypes are hindered by the dubious assumption that dependencies occur only between consecutive pairs of variants. In this article, we apply the multivariate Bernoulli (MVB) distribution to model haplotype data. The MVB distribution relies on interactions among all sets of variants, thus allowing for the detection and exploitation of long-range and higher-order interactions. We discuss penalized estimation and present an efficient algorithm for fitting sparse versions of the MVB distribution to haplotype data. Finally, we showcase the benefits of the MVB model in predicting DNaseI hypersensitivity (DH) status--an epigenetic mark describing chromatin accessibility--from population-scale haplotype data. RESULTS: We fit the MVB model to real data from 59 individuals on whom both haplotypes and DH status in lymphoblastoid cell lines are publicly available. The model allows prediction of DH status from genetic data (prediction R²=0.12 in cross-validations). Comparisons of prediction under the MVB model with prediction under linear regression (best linear unbiased prediction) and logistic regression demonstrate that the MVB model achieves about 10% higher prediction R² than the two competing methods in empirical data. AVAILABILITY AND IMPLEMENTATION: Software implementing the method described can be downloaded at http://bogdan.bioinformatics.ucla.edu/software/. CONTACT: shihuwenbo@ucla.edu or pasaniuc@ucla.edu.

Subject(s)

Deoxyribonuclease I , Haplotypes , Models, Statistical , Algorithms , Cell Line , Humans , Linear Models , Logistic Models , Multivariate Analysis , Software

18.

Convex clustering: an attractive alternative to hierarchical clustering.

Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth.

PLoS Comput Biol ; 11(5): e1004228, 2015 May.

Article in English | MEDLINE | ID: mdl-25965340

ABSTRACT

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.

Subject(s)

Cluster Analysis , Computational Biology/methods , Pattern Recognition, Automated/methods , Algorithms , Databases, Genetic , Gene Expression Profiling/methods , Humans , Software
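Illustrative note on entry 18: the abstract refers to "the objective function of convex clustering" without stating it. The sketch below evaluates the standard formulation from the convex clustering literature, a squared-error fit term plus a weighted sum of distances between cluster centers. It is an assumption that the paper uses this exact form; the function name, the pairwise weights dictionary, and the tuning constant `gamma` are hypothetical names chosen for illustration. This is not the CONVEXCLUSTER program or its proximal distance algorithm.

```python
import math

def convex_clustering_objective(points, centers, weights, gamma):
    """Evaluate 0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij * ||u_i - u_j||.

    points  : list of coordinate tuples x_i (the data)
    centers : list of coordinate tuples u_i (one center per point)
    weights : dict mapping index pairs (i, j) to nonnegative weights w_ij
    gamma   : nonnegative penalty strength (hypothetical name)
    """
    # Fit term: how far each center has moved from its data point.
    fit = 0.5 * sum(
        sum((x - u) ** 2 for x, u in zip(point, center))
        for point, center in zip(points, centers)
    )
    # Fusion term: penalizes distance between centers, pulling them together.
    fusion = sum(
        w * math.dist(centers[i], centers[j]) for (i, j), w in weights.items()
    )
    return fit + gamma * fusion

points = [(0.0, 0.0), (2.0, 0.0)]
weights = {(0, 1): 1.0}
separate = convex_clustering_objective(points, points, weights, gamma=1.0)
merged = convex_clustering_objective(points, [(1.0, 0.0), (1.0, 0.0)], weights, gamma=1.0)
```

In this toy case the fully merged centers give a lower objective than leaving each center at its data point, which is the mechanism by which increasing `gamma` traces out a solution path of progressively fused clusters.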

19.

Coupling bounds for approximating birth-death processes by truncation.

Crawford, Forrest W; Stutz, Timothy C; Lange, Kenneth.

Stat Probab Lett ; 109: 30-38, 2016 Feb 01.

Article in English | MEDLINE | ID: mdl-26622074

ABSTRACT

Birth-death processes are continuous-time Markov counting processes. Approximate moments can be computed by truncating the transition rate matrix. Using a coupling argument, we derive bounds for the total variation distance between the process and its finite approximation.
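Illustrative note on entry 19: the idea of computing approximate moments from a truncated rate matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: it truncates the state space to 0..n_max by setting the birth rate at n_max to zero (one of several possible truncation rules) and evaluates the transient distribution by uniformization. The function and parameter names are hypothetical, and the code assumes the product of the largest total rate and the time t is moderate so that exp(-rate * t) does not underflow.

```python
import math

def truncated_distribution(birth, death, n_max, start, t, tol=1e-12, max_terms=100_000):
    """Transient distribution at time t of a birth-death process truncated to 0..n_max."""
    # Truncation assumption: no births out of the top state, no deaths out of state 0.
    lam = [birth(n) if n < n_max else 0.0 for n in range(n_max + 1)]
    mu = [death(n) if n > 0 else 0.0 for n in range(n_max + 1)]
    rate = max(l + m for l, m in zip(lam, mu))  # uniformization rate

    v = [0.0] * (n_max + 1)
    v[start] = 1.0
    if rate == 0.0 or t == 0.0:
        return v

    weight = math.exp(-rate * t)          # Poisson weight for zero jumps
    result = [weight * x for x in v]
    covered = weight                      # total Poisson mass accumulated so far
    k = 0
    while covered < 1.0 - tol and k < max_terms:
        k += 1
        # One step of the uniformized discrete-time chain P = I + Q / rate.
        nxt = [0.0] * (n_max + 1)
        for n, mass in enumerate(v):
            if mass == 0.0:
                continue
            nxt[n] += mass * (1.0 - (lam[n] + mu[n]) / rate)
            if n < n_max:
                nxt[n + 1] += mass * lam[n] / rate
            if n > 0:
                nxt[n - 1] += mass * mu[n] / rate
        v = nxt
        weight *= rate * t / k            # Poisson weight for k jumps
        covered += weight
        for n in range(n_max + 1):
            result[n] += weight * v[n]
    return result

# Immigration-death example: constant birth rate 1, death rate n, started at 0.
dist = truncated_distribution(lambda n: 1.0, lambda n: float(n), 50, 0, 1.0)
approx_mean = sum(n * p for n, p in enumerate(dist))
```

For this immigration-death example the untruncated process has a Poisson distribution at time t with mean 1 - exp(-t), so the approximate mean can be checked against a closed form. The coupling bounds in the paper address how far such a finite approximation can be from the full process.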

20.

Fast spatial ancestry via flexible allele frequency surfaces.

Rañola, John Michael; Novembre, John; Lange, Kenneth.

Bioinformatics ; 30(20): 2915-22, 2014 Oct 15.

Article in English | MEDLINE | ID: mdl-25012181

ABSTRACT

MOTIVATION: Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize-maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes' rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person's genome. This estimation problem also succumbs to a penalized MM algorithm.

RESULTS: We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software.

AVAILABILITY AND IMPLEMENTATION: Software will be freely available as the OriGen package in R.

CONTACT: ranolaj@uw.edu or klange@ucla.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Gene Frequency , Phylogeography/methods , Algorithms , Bayes Theorem , Genome, Human/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Software , Time Factors
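Illustrative note on entry 20: the Bayes' rule step mentioned in the abstract, computing the posterior probability that each pixel is a person's pixel of origin once allele frequency surfaces are available, can be sketched as below. This is a minimal sketch and not the OriGen package: it assumes an unmixed individual, independent SNPs, a binomial likelihood with two allele copies per SNP, allele frequencies strictly between 0 and 1, and a uniform prior over pixels unless one is supplied. All function and variable names are hypothetical.

```python
import math

def pixel_posterior(freqs, genotype, prior=None):
    """Posterior probability of each pixel being the pixel of origin.

    freqs    : freqs[k][s] is the allele frequency at pixel k for SNP s, in (0, 1)
    genotype : genotype[s] is the allele count (0, 1, or 2) carried at SNP s
    prior    : optional prior probabilities over pixels (uniform if omitted)
    """
    n_pixels = len(freqs)
    if prior is None:
        prior = [1.0 / n_pixels] * n_pixels

    # Work in log space so that products over many SNPs do not underflow.
    log_post = []
    for k in range(n_pixels):
        total = math.log(prior[k])
        for p, g in zip(freqs[k], genotype):
            total += (
                math.log(math.comb(2, g))
                + g * math.log(p)
                + (2 - g) * math.log(1.0 - p)
            )
        log_post.append(total)

    # Normalize with the log-sum-exp trick.
    shift = max(log_post)
    unnormalized = [math.exp(v - shift) for v in log_post]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]

# Two pixels, two SNPs; the person carries two copies of the allele at both SNPs.
posterior = pixel_posterior([[0.9, 0.8], [0.1, 0.2]], [2, 2])
```

In this toy input the first pixel has high frequencies for the alleles the person carries, so nearly all posterior mass falls on it. Placement of admixed individuals, which the abstract describes as requiring a penalized MM algorithm, is not covered by this sketch.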
