Search | VHL Regional Portal

Regularized multi-trait multi-locus linear mixed models for genome-wide association studies and genomic selection in crops.

Lozano, Aurélie C; Ding, Hantian; Abe, Naoki; Lipka, Alexander E.

BMC Bioinformatics ; 24(1): 399, 2023 Oct 26.

Article in English | MEDLINE | ID: mdl-37884874

ABSTRACT

BACKGROUND: We consider two key problems in genomics involving multiple traits: multi-trait genome-wide association studies (GWAS), where the goal is to detect genetic variants associated with the traits; and multi-trait genomic selection (GS), where the emphasis is on accurately predicting trait values. Multi-trait linear mixed models build on the linear mixed model to jointly model multiple traits. Existing estimation methods, however, are limited to the joint analysis of a small number of genotypes; in fact, most approaches consider one SNP at a time. Estimating multi-dimensional genetic and environment effects also results in considerable computational burden. Efficient approaches that incorporate regularization into multi-trait linear models (no random effects) have been recently proposed to identify genomic loci associated with multiple traits (Yu et al. in Multitask learning using task clustering with applications to predictive modeling and GWAS of plant varieties. arXiv:1710.01788, 2017; Yu et al. in Front Big Data 2:27, 2019), but these ignore population structure and familial relatedness (Yu et al. in Nat Genet 38:203-208, 2006).

RESULTS: This work addresses this gap by proposing a novel class of regularized multi-trait linear mixed models along with scalable approaches for estimation in the presence of high-dimensional genotypes and a large number of traits. We evaluate the effectiveness of the proposed methods using datasets in maize and sorghum diversity panels, and demonstrate benefits both in achieving high prediction accuracy in GS and in identifying relevant marker-trait associations.

CONCLUSIONS: The proposed regularized multivariate linear mixed models are relevant for both GWAS and GS. We hope that they will facilitate agronomy-related research in plant biology and crop breeding endeavors.

Subject(s)

Genome-Wide Association Study; Plant Breeding; Genome-Wide Association Study/methods; Linear Models; Phenotype; Genomics/methods; Crops, Agricultural; Polymorphism, Single Nucleotide; Models, Genetic

Simultaneous Parameter Learning and Bi-clustering for Multi-Response Models.

Yu, Ming; Natesan Ramamurthy, Karthikeyan; Thompson, Addie; Lozano, Aurélie C.

Front Big Data ; 2: 27, 2019.

Article in English | MEDLINE | ID: mdl-33693350

ABSTRACT

We consider multi-response and multi-task regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, or features, or both, the last one indicating a bi-cluster or "checkerboard" structure. Discovering this grouping structure along with parameter inference makes sense in several applications, such as multi-response Genome-Wide Association Studies (GWAS). By inferring this additional structure we can obtain valuable information on the underlying data mechanisms (e.g., relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations to simultaneously learn the parameter matrix and its group structures, based on convex regularization penalties. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Extensive experiments demonstrate much better clustering quality compared to other methods, and our approaches are also validated on real datasets concerning phenotypes and genotypes of plant varieties.

Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods.

Haws, David C; Rish, Irina; Teyssedre, Simon; He, Dan; Lozano, Aurelie C; Kambadur, Prabhanjan; Karaman, Zivan; Parida, Laxmi.

PLoS One ; 10(10): e0138903, 2015.

Article in English | MEDLINE | ID: mdl-26439851

ABSTRACT

Accurate prediction of complex traits based on whole-genome data is a computational problem of paramount importance, particularly to plant and animal breeders. However, the number of genetic markers is typically orders of magnitude larger than the number of samples (p >> n), amongst other challenges. We assessed the effectiveness of a diverse set of state-of-the-art methods on publicly accessible real data. The most surprising finding was that approaches with feature selection performed better than others on average, in contrast to the expectation in the community that variable selection is mostly ineffective, i.e. that it does not improve accuracy of prediction, in spite of p >> n. We observed superior performance despite a somewhat simplistic approach to variable selection, possibly suggesting an inherent robustness. This bodes well in general, since variable selection methods usually improve interpretability without loss of prediction power. Apart from identifying a set of benchmark data sets (including one simulated data set), we also discuss the performance analysis for each data set in terms of the input characteristics.

Subject(s)

Genetic Markers/genetics; Models, Genetic; Quantitative Trait Loci/genetics; Algorithms; Animals; Genome/genetics; Swine; Zea mays/genetics
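
To make the p >> n setting in the abstract above concrete, the following is a minimal sketch of a deliberately simple variable-selection pipeline: rank markers by absolute marginal association with the trait, keep the top k, and fit ridge regression on the survivors. This is an illustrative assumption, not the procedure evaluated in the paper; the function name `screen_then_ridge` and the parameters `k` and `alpha` are hypothetical.

```python
import numpy as np

def screen_then_ridge(X_train, y_train, X_test, k=50, alpha=1.0):
    """Marginal screening followed by ridge regression (illustrative only).

    X_train: (n, p) marker matrix with p possibly much larger than n.
    Returns the indices of the k retained markers and test predictions.
    """
    # Standardize markers using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against monomorphic (constant) markers
    Xs = (X_train - mu) / sd
    yc = y_train - y_train.mean()

    # Variable selection: keep the k markers with the largest
    # absolute marginal association with the centered trait.
    score = np.abs(Xs.T @ yc)
    keep = np.argsort(score)[-k:]

    # Ridge regression restricted to the selected markers.
    A = Xs[:, keep]
    w = np.linalg.solve(A.T @ A + alpha * np.eye(k), A.T @ yc)

    # Predict on held-out samples with the same scaling and selection.
    pred = ((X_test - mu) / sd)[:, keep] @ w + y_train.mean()
    return keep, pred
```

Selection here uses only the training samples, so the held-out predictions are not contaminated by the screening step; screening on the full data before splitting would overstate accuracy.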

Grouped graphical Granger modeling for gene expression regulatory networks discovery.

Lozano, Aurélie C; Abe, Naoki; Liu, Yan; Rosset, Saharon.

Bioinformatics ; 25(12): i110-8, 2009 Jun 15.

Article in English | MEDLINE | ID: mdl-19477976

ABSTRACT

We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods to time-series data and invoke the notion of 'Granger causality' to make assertions on causality through inference on time-lagged effects. Existing algorithms, however, have neglected an important aspect of the problem: the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations that prevent their effective application to large datasets with many genes and many data points. In the present article, we propose a novel methodology, which we term the 'grouped graphical Granger modeling method', which overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. The results on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships not captured by earlier works.

Subject(s)

Computational Biology/methods; Gene Regulatory Networks; Gene Expression Profiling/methods; HeLa Cells; Humans
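
The idea of grouping lagged variables by the time series they belong to, described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes group lasso as the grouped regression method (the abstract does not name one), solves it with plain proximal gradient descent, and uses hypothetical names (`lagged_design`, `group_lasso`, `grouped_granger`) and parameters (`max_lag`, `lam`). Series j is declared a Granger cause of series i when the whole block of lags of j receives a nonzero coefficient vector in the regression for i.

```python
import numpy as np

def lagged_design(X, max_lag):
    """X: (T, p) array. Returns Z of shape (T - max_lag, p * max_lag),
    with columns ordered series by series, and targets Y = X[max_lag:]."""
    T, p = X.shape
    cols = [X[max_lag - lag:T - lag, j]
            for j in range(p) for lag in range(1, max_lag + 1)]
    return np.column_stack(cols), X[max_lag:]

def group_lasso(Z, y, groups, lam, n_iter=300):
    """Proximal gradient for (1/2n)||y - Zb||^2 + lam * sum_g ||b_g||_2."""
    n = Z.shape[0]
    step = n / np.linalg.norm(Z, 2) ** 2  # 1 / Lipschitz constant
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        b = b - step * (Z.T @ (Z @ b - y)) / n  # gradient step
        for g in groups:  # block soft-thresholding, one group at a time
            norm = np.linalg.norm(b[g])
            b[g] *= max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
    return b

def grouped_granger(X, max_lag, lam):
    """Returns a boolean (p, p) matrix; adj[j, i] is True when the lags
    of series j are selected as a group in the regression for series i."""
    T, p = X.shape
    Z, Y = lagged_design(X, max_lag)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # assumes no constant series
    groups = [np.arange(j * max_lag, (j + 1) * max_lag) for j in range(p)]
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        b = group_lasso(Z, Y[:, i] - Y[:, i].mean(), groups, lam)
        for j, g in enumerate(groups):
            adj[j, i] = np.linalg.norm(b[g]) > 0
    return adj
```

Because the penalty acts on each series' lags as a block, a series is either included with all of its lags or excluded entirely, which avoids the ambiguous case where only an isolated lag of a series is selected.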
