ABSTRACT
NIAGADS is the National Institute on Aging (NIA) designated national data repository for human genetics research on Alzheimer's Disease and related dementia (ADRD). NIAGADS maintains a high-quality data collection for ADRD genetic/genomic research and supports genetics data production and analysis. NIAGADS hosts whole genome and exome sequence data from the Alzheimer's Disease Sequencing Project (ADSP) and other genotype/phenotype data, encompassing 209,000 samples. NIAGADS shares these data with hundreds of research groups around the world via the Data Sharing Service, a FISMA moderate compliant cloud-based platform that fully supports the NIH Genome Data Sharing Policy. NIAGADS Open Access consists of multiple knowledge bases with genome-wide association summary statistics and rich annotations on the biological significance of genetic variants and genes across the human genome. NIAGADS stands as a keystone in promoting collaborations to advance the understanding and treatment of Alzheimer's disease.
ABSTRACT
The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge to joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer's Disease Sequencing Project. The joint-genotype called variant call format (VCF) file contains only positions within the union of capture kits. The VCF was then processed specifically to account for the batch effects arising from the use of different capture kits from different studies. We identified 8.2 million autosomal variants. 96.82% of the variants are high-quality and are located in 28,579 Ensembl transcripts. 41% of the variants are intronic and 1.8% have CADD > 30, indicating high predicted pathogenicity. Here we show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.
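The restriction of the joint VCF to "positions within the union of capture kits" can be illustrated with a small interval-merging sketch. This is not the ADSP pipeline: the kit names, coordinates, and helper functions below are hypothetical, and intervals are assumed to be half-open (start, end) ranges on a single chromosome.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def in_union(merged, pos):
    """Return True if pos lies inside any of the merged intervals."""
    starts = [s for s, _ in merged]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < merged[i][1]

# Hypothetical target regions for two capture kits (illustrative values only).
kit_a = [(100, 200), (500, 600)]
kit_b = [(150, 250), (700, 800)]

union = merge_intervals(kit_a + kit_b)
# Keep only candidate variant positions covered by at least one kit.
kept = [p for p in (120, 240, 300, 750) if in_union(union, p)]
print(union, kept)
```

In practice this step would more likely be done with an interval tool operating on BED files; the sketch only shows the logic of taking a union and filtering positions against it.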
Subject(s)
Alzheimer Disease; Humans; Exome; Computational Biology; Data Accuracy; Genotype
ABSTRACT
INTRODUCTION: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site Alzheimer's Genomics Database (GenomicsDB) is a public knowledge base of Alzheimer's disease (AD) genetic datasets and genomic annotations. METHODS: GenomicsDB uses a custom systems architecture to adopt and enforce rigorous standards that facilitate harmonization of AD-relevant genome-wide association study summary statistics datasets with functional annotations, including over 230 million annotated variants from the AD Sequencing Project. RESULTS: GenomicsDB generates interactive reports compiled from the harmonized datasets and annotations. These reports contextualize AD-risk associations in a broader functional genomic setting and summarize them in the context of functionally annotated genes and variants. DISCUSSION: Created to make AD-genetics knowledge more accessible to AD researchers, the GenomicsDB is designed to guide users unfamiliar with genetic data in not only exploring but also interpreting this ever-growing volume of data. Scalable and interoperable with other genomics resources using data technology standards, the GenomicsDB can serve as a central hub for research and data analysis on AD and related dementias. HIGHLIGHTS: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) offers to the public a unique, disease-centric collection of AD-relevant GWAS summary statistics datasets. Interpreting these data is challenging and requires significant bioinformatics expertise to standardize datasets and harmonize them with functional annotations on genome-wide scales. The NIAGADS Alzheimer's GenomicsDB helps overcome these challenges by providing a user-friendly public knowledge base for AD-relevant genetics that shares harmonized, annotated summary statistics datasets from the NIAGADS repository in an interpretable, easily searchable format.
Subject(s)
Alzheimer Disease; United States; Humans; Alzheimer Disease/genetics; Genome-Wide Association Study; National Institute on Aging (U.S.); Genomics; Databases, Factual; Genetic Predisposition to Disease/genetics
ABSTRACT
Querying massive functional genomic and annotation data collections, linking and summarizing the query results across data sources/data types are important steps in high-throughput genomic and genetic analytical workflows. However, these steps are made difficult by the heterogeneity and breadth of data sources, experimental assays, biological conditions/tissues/cell types and file formats. FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface. FILER uniquely provides: (i) streamlined access to >50 000 harmonized, annotated genomic datasets across >20 integrated data sources, >1100 tissues/cell types and >20 experimental assays; (ii) a scalable genomic querying interface; and (iii) ability to analyze and annotate user's experimental data. This rich resource spans >17 billion GRCh37/hg19 and GRCh38/hg38 genomic records. Our benchmark querying 7 × 10^9 hg19 FILER records shows FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals. Together, these features facilitate reproducible research and streamline integrating/querying large-scale genomic data within analyses/workflows. FILER can be deployed on cloud or local servers (https://bitbucket.org/wanglab-upenn/FILER) for integration with custom pipelines and is freely available (https://lisanwanglab.org/FILER).
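To make the "sub-linear" claim concrete, the reported ratios can be turned into an approximate scaling exponent. The sketch below assumes that query time follows a power law in the number of query intervals (time ∝ n^k); that model and the function name are assumptions made for illustration, not something stated in the abstract.

```python
import math

def scaling_exponent(workload_ratio, time_ratio):
    """Exponent k such that time_ratio == workload_ratio ** k (power-law assumption)."""
    return math.log(time_ratio) / math.log(workload_ratio)

# Ratios reported in the abstract: 1000-fold more intervals, 32-fold more time.
k = scaling_exponent(1000, 32)
print(f"approximate scaling exponent: {k:.2f}")
```

Under this assumption the exponent comes out near 0.5, well below the value of 1 that linear scaling would give.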
ABSTRACT
SUMMARY: We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline characterizing non-coding genome-wide association study (GWAS) association findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional datasets across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that SparkINFERNO is more than 60 times more efficient and scales with data size and amount of computational resources. AVAILABILITY AND IMPLEMENTATION: SparkINFERNO runs on clusters or a single server with Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno. CONTACT: lswang@pennmedicine.upenn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study; Quantitative Trait Loci; Algorithms; Genomics; Software
ABSTRACT
SUMMARY: We report VCPA, our SNP/Indel Variant Calling Pipeline and data management tool used for the analysis of whole genome and exome sequencing (WGS/WES) for the Alzheimer's Disease Sequencing Project. VCPA consists of two independent but linkable components: pipeline and tracking database. The pipeline, implemented using the Workflow Description Language and fully optimized for the Amazon elastic compute cloud environment, includes steps from aligning raw sequence reads to variant calling using GATK. The tracking database allows users to view job running status in real time and visualize >100 quality metrics per genome. VCPA is functionally equivalent to the CCDG/TOPMed pipeline. Users can use the pipeline and the dockerized database to process large WGS/WES datasets on Amazon cloud with minimal configuration. AVAILABILITY AND IMPLEMENTATION: VCPA is released under the MIT license and is available for academic and nonprofit use for free. The pipeline source code and step-by-step instructions are available from the National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (http://www.niagads.org/VCPA). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Alzheimer Disease; Data Management; Genomics; High-Throughput Nucleotide Sequencing; Humans; Software
ABSTRACT
Importance: It is unclear whether female carriers of the apolipoprotein E (APOE) ε4 allele are at greater risk of developing Alzheimer disease (AD) than men, and the sex-dependent association of mild cognitive impairment (MCI) and APOE has not been established. Objective: To determine how sex and APOE genotype affect the risks for developing MCI and AD. Data Sources: Twenty-seven independent research studies in the Global Alzheimer's Association Interactive Network with data on nearly 58 000 participants. Study Selection: Non-Hispanic white individuals with clinical diagnostic and APOE genotype data. Data Extraction and Synthesis: Homogeneous data sets were pooled in case-control analyses, and logistic regression models were used to compute risks. Main Outcomes and Measures: Age-adjusted odds ratios (ORs) and 95% confidence intervals for developing MCI and AD were calculated for men and women across APOE genotypes. Results: Participants were men and women between ages 55 and 85 years. Across data sets most participants were white, and for many participants, racial/ethnic information was either not collected or not known. Men (OR, 3.09; 95% CI, 2.79-3.42) and women (OR, 3.31; 95% CI, 3.03-3.61) with the APOE ε3/ε4 genotype from ages 55 to 85 years did not show a difference in AD risk; however, women had an increased risk compared with men between the ages of 65 and 75 years (women, OR, 4.37; 95% CI, 3.82-5.00; men, OR, 3.14; 95% CI, 2.68-3.67; P = .002). Men with APOE ε3/ε4 had an increased risk of AD compared with men with APOE ε3/ε3. The APOE ε2/ε3 genotype conferred a protective effect on women (OR, 0.51; 95% CI, 0.43-0.61) decreasing their risk of AD more (P value = .01) than men (OR, 0.71; 95% CI, 0.60-0.85).
There was no difference between men with APOE ε3/ε4 (OR, 1.55; 95% CI, 1.36-1.76) and women (OR, 1.60; 95% CI, 1.43-1.81) in their risk of developing MCI between the ages of 55 and 85 years, but women had an increased risk between 55 and 70 years (women, OR, 1.43; 95% CI, 1.19-1.73; men, OR, 1.07; 95% CI, 0.87-1.30; P = .05). There were no significant differences between men and women in their risks for converting from MCI to AD between the ages of 55 and 85 years. Individuals with APOE ε4/ε4 showed increased risks vs individuals with ε3/ε4, but no significant differences between men and women with ε4/ε4 were seen. Conclusions and Relevance: Contrary to long-standing views, men and women with the APOE ε3/ε4 genotype have nearly the same odds of developing AD from age 55 to 85 years, but women have an increased risk at younger ages.
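The sex comparison reported for ages 65 to 75 years can be roughly checked from the published odds ratios and confidence intervals alone. The sketch below is an approximation, not the authors' analysis: it assumes the two estimates are independent, that the intervals are symmetric on the log scale with a 1.96 critical value, and it applies a two-sided z-test to the difference of log odds ratios. The function names are illustrative.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Standard error of log(OR), recovered from a 95% CI (assumes log-scale symmetry)."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)

def compare_odds_ratios(or1, ci1, or2, ci2):
    """Two-sided z-test for a difference between two independent odds ratios."""
    diff = math.log(or1) - math.log(or2)
    se = math.sqrt(se_from_ci(*ci1) ** 2 + se_from_ci(*ci2) ** 2)
    z = diff / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# APOE e3/e4 and AD risk, ages 65 to 75 years, values taken from the abstract.
z, p = compare_odds_ratios(4.37, (3.82, 5.00), 3.14, (2.68, 3.67))
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these inputs the approximation gives a p-value of about .002, in line with the value reported in the abstract.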