ABSTRACT
The NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) is a FAIR knowledgebase providing detailed, structured, standardised and interoperable genome-wide association study (GWAS) data to >200 000 users per year from academic research, healthcare and industry. The Catalog contains variant-trait associations and supporting metadata for >45 000 published GWAS across >5000 human traits, and >40 000 full P-value summary statistics datasets. Content is curated from publications or acquired via author submission of prepublication summary statistics through a new submission portal and validation tool. GWAS data volume has vastly increased in recent years. We have updated our software to meet this scaling challenge and to enable rapid release of submitted summary statistics. The scope of the repository has expanded to include additional data types of high interest to the community, including sequencing-based GWAS, gene-based analyses and copy number variation analyses. Community outreach has increased the number of shared datasets from under-represented traits, e.g. cancer, and we continue to contribute to awareness of the lack of population diversity in GWAS. Interoperability of the Catalog has been enhanced through links to other resources including the Polygenic Score Catalog and the International Mouse Phenotyping Consortium, refinements to GWAS trait annotation, and the development of a standard format for GWAS data.
Subject(s)
Genome-Wide Association Study , Knowledge Bases , Animals , Humans , Mice , DNA Copy Number Variations , National Human Genome Research Institute (U.S.) , Phenotype , Polymorphism, Single Nucleotide , Software , United States
ABSTRACT
Open Targets Genetics (https://genetics.opentargets.org) is an open-access integrative resource that aggregates human GWAS and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes. This enables systematic identification and prioritisation of likely causal variants and genes across all published trait-associated loci. In this paper, we describe the public resources we aggregate, the technology and analyses we use, and the functionality that the portal offers. Open Targets Genetics can be searched by variant, gene or study/phenotype. It offers tools that enable users to prioritise causal variants and genes at disease-associated loci and access systematic cross-disease and disease-molecular trait colocalization analysis across 92 cell types and tissues including the eQTL Catalogue. Data visualizations such as Manhattan-like plots, regional plots, credible sets overlap between studies and PheWAS plots enable users to explore GWAS signals in depth. The integrated data is made available through the web portal, for bulk download and via a GraphQL API, and the software is open source. Applications of this integrated data include identification of novel targets for drug discovery and drug repurposing.
Subject(s)
Databases, Genetic , Genome, Human , Inflammatory Bowel Diseases/genetics , Molecular Targeted Therapy/methods , Quantitative Trait Loci , Software , Chromatin/chemistry , Chromatin/metabolism , Datasets as Topic , Drug Discovery/methods , Drug Repositioning/methods , Genome-Wide Association Study , Genotype , Humans , Inflammatory Bowel Diseases/drug therapy , Inflammatory Bowel Diseases/metabolism , Inflammatory Bowel Diseases/pathology , Internet , Phenotype , Quantitative Trait, Heritable
ABSTRACT
The use of data from smartphones and wearable devices has huge potential for population health research, given the high level of device ownership; the range of novel health-relevant data types available from consumer devices; and the frequency and duration with which data are, or could be, collected. Yet, the uptake and success of large-scale mobile health research in the last decade have not matched this intensely promoted opportunity. We argue that digital person-generated health data are necessary to answer many top priority research questions, using illustrative examples taken from the James Lind Alliance Priority Setting Partnerships. We then summarize the findings from 2 UK initiatives that considered the challenges, the possible solutions, and how such solutions can be implemented to realize the future opportunities of digital person-generated health data for clinically important population health research. Examples of important areas that must be addressed to advance the field include digital inequality and possible selection bias; easy access for researchers to the appropriate data collection tools, including how best to harmonize data items; analysis methodologies for time series data; patient and public involvement and engagement methods for optimizing recruitment, retention, and public trust; and methods for providing research participants with greater control over their data. There is also a major opportunity, provided through the linkage of digital person-generated health data to routinely collected data, to support novel population health research, bringing together clinician-reported and patient-reported measures.
We recognize that well-conducted studies need a wide range of diverse challenges to be skillfully addressed in unison (eg, challenges regarding epidemiology, data science and biostatistics, psychometrics, behavioral and social science, software engineering, user interface design, information governance, data management, and patient and public involvement and engagement). Consequently, progress would be accelerated by the establishment of a new interdisciplinary community where all relevant and necessary skills are brought together to allow for excellence throughout the life cycle of a research study. This will require a partnership of diverse people, methods, and technologies. If done right, the synergy of such a partnership has the potential to transform many millions of people's lives for the better.
Subject(s)
Telemedicine , Wearable Electronic Devices , Humans , Smartphone , Research Design
ABSTRACT
The GWAS Catalog delivers a high-quality curated collection of all published genome-wide association studies enabling investigations to identify causal variants, understand disease mechanisms, and establish targets for novel therapies. The scope of the Catalog has also expanded to targeted and exome arrays with 1000 new associations added for these technologies. As of September 2018, the Catalog contains 5687 GWAS comprising 71673 variant-trait associations from 3567 publications. New content includes 284 full P-value summary statistics datasets for genome-wide and new targeted array studies, representing 6 × 10^9 individual variant-trait statistics. In the last 12 months, the Catalog's user interface was accessed by ~90000 unique users who viewed >1 million pages. We have improved data access with the release of a new RESTful API to support high-throughput programmatic access, an improved web interface and a new summary statistics database. Summary statistics provision is supported by a new format proposed as a community standard for summary statistics data representation. This format was derived from our experience in standardizing heterogeneous submissions, mapping formats and in harmonizing content. Availability: https://www.ebi.ac.uk/gwas/.
Subject(s)
Databases, Genetic , Genome-Wide Association Study , Disease/genetics , Genetic Variation , Humans , Microarray Analysis , Publications , Software , User-Computer Interface
ABSTRACT
Nonallelic homologous recombination (NAHR) between highly similar duplicated sequences generates chromosomal deletions, duplications and inversions, which can cause diverse genetic disorders. Little is known about interindividual variation in NAHR rates and the factors that influence this. We estimated the rate of deletion at the CMT1A-REP NAHR hotspot in sperm DNA from 34 male donors, including 16 monozygotic (MZ) co-twins (8 twin pairs) aged 24 to 67 years old. The average NAHR rate was 3.5 × 10^-5 with a seven-fold variation across individuals. Despite good statistical power to detect even a subtle correlation, we observed no relationship between age of unrelated individuals and the rate of NAHR in their sperm, likely reflecting the meiotic-specific origin of these events. We then estimated the heritability of deletion rate by calculating the intraclass correlation (ICC) within MZ co-twins, revealing a significant correlation between MZ co-twins (ICC = 0.784, p = 0.0039), with MZ co-twins being significantly more correlated than unrelated pairs. We showed that this heritability cannot be explained by variation in PRDM9, a known regulator of NAHR, or variation within the NAHR hotspot itself. We also did not detect any correlation between Body Mass Index (BMI), smoking status or alcohol intake and rate of NAHR. Our results suggest that other, as yet unidentified, genetic or environmental factors play a significant role in the regulation of NAHR and are responsible for the extensive variation in the population for the probability of fathering a child with a genomic disorder resulting from a pathogenic deletion.
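As an illustration of the intraclass correlation mentioned in the abstract above, the following is a minimal sketch of a one-way random-effects ICC for twin pairs. The abstract does not state which ICC form or software the authors used, so the choice of estimator is an assumption, as are the function name `icc_oneway` and the example rates, which are invented purely for demonstration and are not data from the study.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for equally sized groups (e.g. twin pairs).

    Assumption: this is the classic ANOVA-based estimator,
    (MSB - MSW) / (MSB + (k - 1) * MSW); the source abstract does not
    say which ICC variant was actually used.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two groups")
    k = len(pairs[0])
    if k < 2 or any(len(p) != k for p in pairs):
        raise ValueError("all groups must have the same size (>= 2)")

    grand_mean = mean(x for p in pairs for x in p)
    group_means = [mean(p) for p in pairs]

    # Between-group mean square: spread of pair means around the grand mean.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    # Within-group mean square: spread of co-twins around their pair mean.
    msw = sum(
        (x - m) ** 2 for p, m in zip(pairs, group_means) for x in p
    ) / (n * (k - 1))

    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("no variance in the data; ICC is undefined")
    return (msb - msw) / denom


if __name__ == "__main__":
    # Hypothetical deletion rates (x 10^-5) for four made-up twin pairs.
    example_pairs = [(2.1, 2.4), (5.0, 4.6), (3.3, 3.0), (1.2, 1.5)]
    print(f"ICC = {icc_oneway(example_pairs):.3f}")
```

An ICC near 1 means co-twins resemble each other far more than they resemble other pairs, whereas a value near 0 or below means within-pair differences are as large as between-pair differences. Assessing significance, as the abstract reports, would need an additional test (for example a permutation of pair labels) that is not shown here.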
Subject(s)
Homologous Recombination/genetics , Neurofibromatosis 1/genetics , Twins, Monozygotic/genetics , Adult , Aged , Alleles , Chromosome Deletion , Gene Duplication , Humans , INDEL Mutation/genetics , Male , Middle Aged , Sequence Deletion/genetics , Spermatozoa
ABSTRACT
Locus Reference Genomic (LRG; http://www.lrg-sequence.org/) records contain internationally recognized stable reference sequences designed specifically for reporting clinically relevant sequence variants. Each LRG is contained within a single file consisting of a stable 'fixed' section and a regularly updated 'updatable' section. The fixed section contains stable genomic DNA sequence for a genomic region, essential transcripts and proteins for variant reporting and an exon numbering system. The updatable section contains mapping information, annotation of all transcripts and overlapping genes in the region and legacy exon and amino acid numbering systems. LRGs provide a stable framework that is vital for reporting variants, according to Human Genome Variation Society (HGVS) conventions, in genomic DNA, transcript or protein coordinates. To enable translation of information between LRG and genomic coordinates, LRGs include mapping to the human genome assembly. LRGs are compiled and maintained by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI). LRG reference sequences are selected in collaboration with the diagnostic and research communities, locus-specific database curators and mutation consortia. Currently >700 LRGs have been created, of which >400 are publicly available. The aim is to create an LRG for every locus with clinical implications.
Subject(s)
Databases, Genetic , Genetic Variation , Genome, Human , Exons , Genetic Loci , Genomics/standards , Humans , Internet , Proteins/genetics , RNA, Messenger/chemistry , Reference Standards
ABSTRACT
A forum of the Human Variome Project (HVP) was held as a satellite to the 2012 Annual Meeting of the American Society of Human Genetics in San Francisco, California. The theme of this meeting was "Getting Ready for the Human Phenome Project." Understanding the genetic contribution to both rare single-gene "Mendelian" disorders and more complex common diseases will require integration of research efforts among many fields and better defined phenotypes. The HVP is dedicated to bringing together researchers and research populations throughout the world to provide the resources to investigate the impact of genetic variation on disease. To this end, there needs to be a greater sharing of phenotype and genotype data. For this to occur, many databases that currently exist will need to become interoperable to allow for the combining of cohorts with similar phenotypes to increase statistical power for studies attempting to identify novel disease genes or causative genetic variants. Improved systems and tools that enhance the collection of phenotype data from clinicians are urgently needed. This meeting begins the HVP's effort toward this important goal.
Subject(s)
Databases, Genetic , Human Genome Project , Phenotype , Computational Biology , Humans
ABSTRACT
Genome sequencing has recently become a viable genotyping technology for use in genome-wide association studies (GWASs), offering the potential to analyze a broader range of genome-wide variation, including rare variants. To survey current standards, we assessed the content and quality of reporting of statistical methods, analyses, results, and datasets in 167 exome- or genome-wide-sequencing-based GWAS publications published from 2014 to 2020; 81% of publications included tests of aggregate association across multiple variants, with multiple test models frequently used. We observed a lack of standardized terms and incomplete reporting of datasets, particularly for variants analyzed in aggregate tests. We also find a lower frequency of sharing of summary statistics compared with array-based GWASs. Reporting standards and increased data sharing are required to ensure sequencing-based association study data are findable, interoperable, accessible, and reusable (FAIR). To support that, we recommend adopting the standard terminology of sequencing-based GWAS (seqGWAS). Further, we recommend that single-variant analyses be reported following the same standards and conventions as standard array-based GWASs and be shared in the GWAS Catalog. We also provide initial recommended standards for aggregate analyses metadata and summary statistics.
ABSTRACT
Genome-wide association studies (GWASs) have enabled robust mapping of complex traits in humans. The open sharing of GWAS summary statistics (SumStats) is essential in facilitating the larger meta-analyses needed for increased power in resolving the genetic basis of disease. However, most GWAS SumStats are not readily accessible because of limited sharing and a lack of defined standards. With the aim of increasing the availability, quality, and utility of GWAS SumStats, the National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) GWAS Catalog organized a community workshop to address the standards, infrastructure, and incentives required to promote and enable sharing. We evaluated the barriers to SumStats sharing, both technological and sociological, and developed an action plan to address those challenges and ensure that SumStats and study metadata are findable, accessible, interoperable, and reusable (FAIR). We encourage early deposition of datasets in the GWAS Catalog as the recognized central repository. We recommend standard requirements for reporting elements and formats for SumStats and accompanying metadata as guidelines for community standards and a basis for submission to the GWAS Catalog. Finally, we provide recommendations to enable, promote, and incentivize broader data sharing, standards and FAIRness in order to advance genomic medicine.
ABSTRACT
The accurate description of ancestry is essential to interpret, access, and integrate human genomics data, and to ensure that these benefit individuals from all ancestral backgrounds. However, there are no established guidelines for the representation of ancestry information. Here we describe a framework for the accurate and standardized description of sample ancestry, and validate it by application to the NHGRI-EBI GWAS Catalog. We confirm known biases and gaps in diversity, and find that African and Hispanic or Latin American ancestry populations contribute a disproportionately high number of associations. It is our hope that widespread adoption of this framework will lead to improved analysis, interpretation, and integration of human genomics data.
Subject(s)
Genome-Wide Association Study/standards , Genomics/standards , Genetic Variation , Humans , Racial Groups
Subject(s)
Colorectal Neoplasms/genetics , Databases, Genetic , Metadata/statistics & numerical data , Multifactorial Inheritance , Software , Benchmarking , Colorectal Neoplasms/metabolism , Colorectal Neoplasms/pathology , Computational Biology/methods , Female , Genetic Predisposition to Disease , Genome, Human , Genome-Wide Association Study , Humans , Male , Reproducibility of Results
ABSTRACT
The recent discovery of heterozygous human mutations that truncate full-length titin (TTN, an abundant structural, sensory, and signaling filament in muscle) as a common cause of end-stage dilated cardiomyopathy (DCM) promises new prospects for improving heart failure management. However, realization of this opportunity has been hindered by the burden of TTN-truncating variants (TTNtv) in the general population and uncertainty about their consequences in health or disease. To elucidate the effects of TTNtv, we coupled TTN gene sequencing with cardiac phenotyping in 5267 individuals across the spectrum of cardiac physiology and integrated these data with RNA and protein analyses of human heart tissues. We report diversity of TTN isoform expression in the heart, define the relative inclusion of TTN exons in different isoforms (using the TTN transcript annotations available at http://cardiodb.org/titin), and demonstrate that these data, coupled with the position of the TTNtv, provide a robust strategy to discriminate pathogenic from benign TTNtv. We show that TTNtv is the most common genetic cause of DCM in ambulant patients in the community, identify clinically important manifestations of TTNtv-positive DCM, and define the penetrance and outcomes of TTNtv in the general population. By integrating genetic, transcriptome, and protein analyses, we provide evidence for a length-dependent mechanism of disease. These data inform diagnostic criteria and management strategies for TTNtv-positive DCM patients and for TTNtv that are identified as incidental findings.
Subject(s)
Alleles , Connectin/genetics , Heart/physiology , Mutation , Transcription, Genetic , Adolescent , Adult , Aged , Cardiomyopathy, Dilated/genetics , Cardiomyopathy, Dilated/pathology , Cohort Studies , Connectin/physiology , Exons , Genetic Variation , Healthy Volunteers , Heart Failure/genetics , Heart Failure/therapy , Humans , Immunoglobulins/metabolism , Middle Aged , Protein Isoforms/genetics , Protein Isoforms/physiology , Young Adult
ABSTRACT
β-defensins are a family of important peptides of innate immunity, involved in host defense, immunomodulation, reproduction, and pigmentation. Genes encoding β-defensins show evidence of birth-and-death evolution, adaptation by amino acid sequence changes, and extensive copy number variation (CNV) within humans and other species. The role of CNV in the adaptation of β-defensins to new functions remains unclear, as does the adaptive role of CNV in general. Here, we fine-map CNV of a cluster of β-defensins in humans and rhesus macaques. Remarkably, we found that the structure of the CNV is different between primates, with distinct mutational origins and CNV boundaries defined by retroviral long terminal repeat elements. Although the human β-defensin CNV region is 322 kb and encompasses several genes, including β-defensins, a long noncoding RNA gene, and testes-specific zinc-finger transcription factors, the orthologous region in the rhesus macaque shows CNV of a 20-kb region, containing only a single gene, the ortholog of the human β-defensin-2 gene. Despite its independent origins, the range of gene copy numbers in the rhesus macaque is similar to humans. In addition, the rhesus macaque gene has been subject to divergent positive selection at the amino acid level following its initial duplication event between 3 and 9.5 Ma, suggesting adaptation of this gene as the macaque successfully colonized novel environments outside Africa. Therefore, the molecular phenotype of β-defensin-2 CNV has undergone convergent evolution, and this gene shows evidence of adaptation at the amino acid level in rhesus macaques.