Genotype prediction of 336,463 samples from public expression data.
Razi, Afrooz; Lo, Christopher C; Wang, Siruo; Leek, Jeffrey T; Hansen, Kasper D.
Affiliation
  • Razi A; Department of Genetic Medicine, Johns Hopkins University School of Medicine.
  • Lo CC; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health.
  • Wang S; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health.
  • Leek JT; Biostatistics Program, Division of Public Health Sciences, Fred Hutchinson Cancer Center.
  • Hansen KD; Department of Genetic Medicine, Johns Hopkins University School of Medicine.
bioRxiv; 2024 Mar 13.
Article in En | MEDLINE | ID: mdl-38559266
ABSTRACT
Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples. Most of these samples include only RNA expression measurements; genotype data for the same samples would enable a wide range of analyses, including variant prioritization, eQTL analysis, and studies of allele-specific expression. Here, we developed a statistical model that uses the reference and alternative read counts from the RNA-seq experiments available through Recount3 to predict genotypes at autosomal biallelic loci in coding regions. We demonstrate the accuracy of our model using large-scale studies that measured both gene expression and genotype genome-wide. Our predictive model is highly accurate, with 99.5% overall accuracy, 99.6% major allele accuracy, and 90.4% minor allele accuracy. The model is robust to tissue and study effects, provided the coverage is high enough. We applied this model to genotype all the samples in Recount3 and provide the largest ready-to-use expression repository containing genotype information. We illustrate that genotypes predicted from RNA-seq data are sufficient to unravel the underlying population structure of samples in Recount3 using Principal Component Analysis.
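The abstract does not spell out the statistical model, so the following is only a minimal sketch of the general idea of calling a genotype at a biallelic locus from reference and alternative read counts. It assumes a plain binomial likelihood over three genotypes, which is a common textbook approach and not necessarily the authors' method. The function names, the `error_rate` value, and the `min_coverage` threshold are hypothetical choices made for illustration.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def call_genotype(ref_count, alt_count, error_rate=0.01, min_coverage=8):
    """Return the most likely number of alternative alleles (0, 1 or 2).

    Returns None when coverage is below min_coverage.
    error_rate and min_coverage are illustrative assumptions,
    not values taken from the paper.
    """
    n = ref_count + alt_count
    if n < min_coverage:
        return None  # too few reads to make a call

    # Expected fraction of alternative reads under each genotype
    # (assumed: no allelic imbalance, symmetric sequencing error).
    alt_fraction = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}

    # Pick the genotype with the highest binomial log-likelihood.
    scores = {g: log_binom_pmf(alt_count, n, p) for g, p in alt_fraction.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for ref, alt in [(30, 0), (15, 14), (0, 25), (2, 1)]:
        print(ref, alt, call_genotype(ref, alt))
```

A real RNA-seq model would also need to account for effects this sketch ignores, such as allele-specific expression, which shifts the alternative read fraction at heterozygous sites away from 0.5. The coverage cutoff reflects the abstract's remark that accuracy depends on sufficient coverage.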

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: BioRxiv Year: 2024 Document type: Article Country of publication: United States
