Fast computation of the eigensystem of genomic similarity matrices.

Hahn, Georg; Lutz, Sharon M; Hecker, Julian; Prokopenko, Dmitry; Cho, Michael H; Silverman, Edwin K; Weiss, Scott T; Lange, Christoph

Hahn, Georg; Lutz, Sharon M; Hecker, Julian; Prokopenko, Dmitry; Cho, Michael H; Silverman, Edwin K; Weiss, Scott T; Lange, Christoph.

Affiliation

Hahn G; T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA. ghahn@hsph.harvard.edu.
Lutz SM; T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA.
Hecker J; Channing Divsion of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA.
Prokopenko D; Massachusetts General Hospital, Harvard University, Boston, MA, 02114, USA.
Cho MH; Channing Divsion of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA.
Silverman EK; Channing Divsion of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA.
Weiss ST; Channing Divsion of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA.
Lange C; T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA.

BMC Bioinformatics ; 25(1): 43, 2024 Jan 25.

Article in En | MEDLINE | ID: mdl-38273228

ABSTRACT

ABSTRACT

The computation of a similarity measure for genomic data is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases due to confounding by population stratification, for instance in linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) are computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (called the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way which allows for the application of a randomized SVD algorithm, which is faster than the traditional computation. The fast SVD algorithm we present is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The algorithm only assumes that row-wise and column-wise subtraction and multiplication of a vector with a sparse matrix is available, an operation that is efficiently implemented in common sparse matrix packages. An exception is the so-called Jaccard matrix, which does not have a structure applicable for the fast SVD algorithm. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the accuracy (in [Formula see text] norm and angle) between the principal components of the Jaccard matrix and the ones of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation, and derive the theoretical runtime of our algorithm. We illustrate that the approximation error is low in practice and empirically verify the theoretical runtime scalings on both simulated data and data of the 1000 Genome Project.

Subject(s)
Key words

Covariance matrix; Fast SVD; Genomic relationship matrix; Jaccard matrix; Principal components; Weighted Jaccard matrix

Fulltext

Add to My VHL

XML

PubMed Links

Search on Google

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Genome / Genomics Type of study: Clinical_trials Language: En Journal: BMC Bioinformatics Journal subject: INFORMATICA MEDICA Year: 2024 Document type: Article Affiliation country: Estados Unidos

Fulltext

Add to My VHL

XML

PubMed Links

Search on Google