ABSTRACT
Primary open-angle glaucoma (POAG), the leading cause of irreversible blindness worldwide, disproportionately affects individuals of African ancestry. We conducted a genome-wide association study (GWAS) for POAG in 11,275 individuals of African ancestry (6,003 cases; 5,272 controls). We detected 46 risk loci associated with POAG at genome-wide significance. Replication and post-GWAS analyses, including functionally informed fine-mapping, multiple trait co-localization, and in silico validation, implicated two previously undescribed variants (rs1666698 mapping to DBF4P2; rs34957764 mapping to ROCK1P1) and one previously associated variant (rs11824032 mapping to ARHGEF12) as likely causal. For individuals of African ancestry, a polygenic risk score (PRS) for POAG from our mega-analysis (African ancestry individuals) outperformed a PRS from summary statistics of a much larger GWAS derived from European ancestry individuals. This study quantifies the genetic architecture similarities and differences between African and non-African ancestry populations for this blinding disease.
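As a rough illustration of what a polygenic risk score computes, the sketch below uses the standard additive form: a weighted sum of risk-allele dosages with per-variant effect sizes as weights. The function name, the dosages, and the weights are hypothetical placeholders, not values or methods taken from the study.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Additive PRS: sum of (risk-allele dosage x effect-size weight) over variants.

    dosages: risk-allele counts per variant (0, 1, or 2; imputed values may be fractional).
    weights: per-variant effect sizes (e.g., log odds ratios from GWAS summary statistics).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical example: three variants with made-up effect sizes.
example_score = polygenic_risk_score([0, 1, 2], [0.1, 0.2, 0.3])
```

Under this form, a score built from ancestry-matched weights can differ from one built from another population's summary statistics simply because the weights and the variants included differ.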
Subject(s)
Genome-Wide Association Study; Glaucoma, Open-Angle; Humans; Genetic Predisposition to Disease; Glaucoma, Open-Angle/genetics; Black People/genetics; Polymorphism, Single Nucleotide/genetics
ABSTRACT
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.
Subject(s)
Genome, Human; Whole Genome Sequencing; Female; High-Throughput Nucleotide Sequencing/methods; Humans; INDEL Mutation; Male; Polymorphism, Single Nucleotide
ABSTRACT
Cancer genomes often harbor hundreds of somatic DNA rearrangement junctions, many of which cannot be easily classified into simple (e.g., deletion) or complex (e.g., chromothripsis) structural variant classes. Applying a novel genome graph computational paradigm to analyze the topology of junction copy number (JCN) across 2,778 tumor whole-genome sequences, we uncovered three novel complex rearrangement phenomena: pyrgo, rigma, and tyfonas. Pyrgo are "towers" of low-JCN duplications associated with early-replicating regions, superenhancers, and breast or ovarian cancers. Rigma comprise "chasms" of low-JCN deletions enriched in late-replicating fragile sites and gastrointestinal carcinomas. Tyfonas are "typhoons" of high-JCN junctions and fold-back inversions associated with expressed protein-coding fusions, breakend hypermutation, and acral, but not cutaneous, melanomas. Clustering of tumors according to genome graph-derived features identified subgroups associated with DNA repair defects and poor prognosis.
Subject(s)
Genomic Structural Variation/genetics; Genomics/methods; Neoplasms/genetics; Chromosome Inversion/genetics; Chromothripsis; DNA Copy Number Variations/genetics; Gene Rearrangement/genetics; Genome, Human/genetics; Humans; Mutation/genetics; Whole Genome Sequencing/methods
ABSTRACT
To further our understanding of the genetic etiology of autism, we generated and analyzed genome sequence data from 516 idiopathic autism families (2,064 individuals). This resource includes >59 million single-nucleotide variants (SNVs) and 9,212 private copy number variants (CNVs), of which 133,992 and 88 are de novo mutations (DNMs), respectively. We estimate a mutation rate of ~1.5 × 10⁻⁸ SNVs per site per generation with a significantly higher mutation rate in repetitive DNA. Comparing probands and unaffected siblings, we observe several DNM trends. Probands carry more gene-disruptive CNVs and SNVs, resulting in severe missense mutations and mapping to predicted fetal brain promoters and embryonic stem cell enhancers. These differences become more pronounced for autism genes (p = 1.8 × 10⁻³, OR = 2.2). Patients are more likely to carry multiple coding and noncoding DNMs in different genes, which are enriched for expression in striatal neurons (p = 3 × 10⁻³), suggesting a path forward for genetically characterizing more complex cases of autism.
Subject(s)
Autistic Disorder/genetics; DNA Copy Number Variations; Polymorphism, Single Nucleotide; Animals; DNA Mutational Analysis; Female; Genome-Wide Association Study; Humans; INDEL Mutation; Male; Mice
ABSTRACT
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
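The read N50 quoted above is a length-weighted summary statistic: the read length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that calculation follows; the function name and the toy read lengths are illustrative assumptions and are not drawn from the consortium's pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read (or contig) lengths.

    Reads are sorted from longest to shortest and their lengths accumulated;
    the N50 is the length of the read at which the running total first
    reaches half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty collection")


# Toy example with hypothetical read lengths (in bp).
toy_n50 = n50([2, 3, 4, 5, 6])
```

Because the statistic is weighted by bases rather than by read count, a small number of very long reads can raise the N50 well above the median read length.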
ABSTRACT
A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline¹ to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.
Subject(s)
Genetic Variation; Genome, Human/genetics; Whole Genome Sequencing; Alleles; Case-Control Studies; Epigenesis, Genetic; Female; Gene Dosage/genetics; Genetics, Population; High-Throughput Nucleotide Sequencing; Humans; Male; Molecular Sequence Annotation; Quantitative Trait Loci; Racial Groups/genetics; Software
ABSTRACT
Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10⁻⁸ substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increases the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.
Subject(s)
Genomics; High-Throughput Nucleotide Sequencing; Female; Humans; Mutation/genetics; Nucleotides; Sequence Analysis, DNA; Software
ABSTRACT
SUMMARY: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. AVAILABILITY AND IMPLEMENTATION: Lancet is implemented in C++ and available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/lancet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
High-Throughput Nucleotide Sequencing; Software; Algorithms; Diploidy; Sequence Analysis, DNA
ABSTRACT
BACKGROUND: Previous research in autism and other neurodevelopmental disorders (NDDs) has indicated an important contribution of protein-coding (coding) de novo variants (DNVs) within specific genes. The role of de novo noncoding variation has been observable as a general increase in genetic burden but has yet to be resolved to individual functional elements. In this study, we assessed whole-genome sequencing data in 2671 families with autism (discovery cohort of 516 families, replication cohort of 2155 families). We focused on DNVs in enhancers with characterized in vivo activity in the brain and identified an excess of DNVs in an enhancer named hs737. RESULTS: We adapted the fitDNM statistical model to work in noncoding regions and tested enhancers for excess of DNVs in families with autism. We found only one enhancer (hs737) with nominal significance in the discovery (p = 0.0172), replication (p = 2.5 × 10⁻³), and combined dataset (p = 1.1 × 10⁻⁴). Each individual with a DNV in hs737 had shared phenotypes including being male, intact cognitive function, and hypotonia or motor delay. Our in vitro assessment of the DNVs showed they all reduce enhancer activity in a neuronal cell line. By epigenomic analyses, we found that hs737 is brain-specific and targets the transcription factor gene EBF3 in human fetal brain. EBF3 is genome-wide significant for coding DNVs in NDDs (missense p = 8.12 × 10⁻³⁵, loss-of-function p = 2.26 × 10⁻¹³) and is widely expressed in the body. Through characterization of promoters bound by EBF3 in neuronal cells, we saw enrichment for binding to NDD genes (p = 7.43 × 10⁻⁶, OR = 1.87) involved in gene regulation. Individuals with coding DNVs have greater phenotypic severity (hypotonia, ataxia, and delayed development syndrome [HADDS]) in comparison to individuals with noncoding DNVs that have autism and hypotonia. CONCLUSIONS: In this study, we identify DNVs in the hs737 enhancer in individuals with autism. Through multiple approaches, we find hs737 targets the gene EBF3 that is genome-wide significant in NDDs. By assessment of noncoding variation and the genes they affect, we are beginning to understand their impact on gene regulatory networks in NDDs.
Subject(s)
Autistic Disorder/genetics; Genetic Predisposition to Disease; Muscle Hypotonia/genetics; Neurodevelopmental Disorders/genetics; Transcription Factors/genetics; Autistic Disorder/epidemiology; Autistic Disorder/pathology; Enhancer Elements, Genetic/genetics; Exome/genetics; Female; Gene Regulatory Networks/genetics; Humans; Male; Muscle Hypotonia/epidemiology; Muscle Hypotonia/pathology; Mutation/genetics; Neurodevelopmental Disorders/epidemiology; Neurodevelopmental Disorders/pathology; Neurons/metabolism; Neurons/pathology
ABSTRACT
High-throughput sequencing is a revolutionary technology for the analysis of metagenomic samples. However, querying large volumes of reads against comprehensive DNA/RNA databases in a sensitive manner can be compute-intensive. Here, we present taxMaps, a highly efficient, sensitive, and fully scalable taxonomic classification tool. Using a combination of simulated and real metagenomics data sets, we demonstrate that taxMaps is more sensitive and more precise than widely used taxonomic classifiers and is capable of delivering classification accuracy comparable to that of BLASTN, but at up to three orders of magnitude less computational cost.
Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Metagenomics/methods; Software; Bacteria/classification; Bacteria/genetics; Databases, Nucleic Acid; Humans; Microbiota/genetics; Reproducibility of Results; Rivers/microbiology; Species Specificity; Water Microbiology
ABSTRACT
DNA methylation patterns in the genome both reflect and help to mediate transcriptional regulatory processes. The digital nature of DNA methylation, present or absent on each allele, makes this assay capable of quantifying events in subpopulations of cells, whereas genome-wide chromatin studies lack the same quantitative capacity. Testing DNA methylation throughout the genome is possible using whole-genome bisulfite sequencing (WGBS), but the high costs associated with the assay have made it impractical for studies involving more than limited numbers of samples. We have optimized a new transposase-based library preparation assay for the Illumina HiSeq X platform suitable for limited amounts of DNA and providing a major cost reduction for WGBS. By incorporating methylated cytosines during fragment end repair, we reveal an end-repair artifact affecting 1%-2% of reads that we can remove analytically. We show that the use of a high (G + C) content spike-in performs better than PhiX in terms of bisulfite sequencing quality. As expected, the loci with transposase-accessible chromatin are DNA hypomethylated and enriched in flanking regions by post-translational modifications of histones usually associated with positive effects on gene expression. Using these transposase-accessible loci to represent the cis-regulatory loci in the genome, we compared the representation of these loci between WGBS and other genome-wide DNA methylation assays, showing WGBS to outperform substantially all of the alternatives. We conclude that it is now technologically and financially feasible to perform WGBS in larger numbers of samples with greater accuracy than previously possible.
Subject(s)
Whole Genome Sequencing/methods; Base Composition; Cell Line; Costs and Cost Analysis; DNA Methylation; Histone Code; Humans; Reproducibility of Results; Sulfites/chemistry; Whole Genome Sequencing/economics; Whole Genome Sequencing/standards
ABSTRACT
Rationale: Puerto Ricans have the highest childhood asthma prevalence in the United States (23.6%); however, the etiology is uncertain. Objectives: In this study, we sought to uncover the genetic architecture of lung function in Puerto Rican youth with and without asthma who were recruited from the island (n = 836). Methods: We used admixture-mapping and whole-genome sequencing data to discover genomic regions associated with lung function. Functional roles of the prioritized candidate SNPs were examined with chromatin immunoprecipitation sequencing, RNA sequencing, and expression quantitative trait loci data. Measurements and Main Results: We discovered a genomic region at 1q32 that was significantly associated with a 0.12-L decrease in the lung volume of exhaled air (95% confidence interval, -0.17 to -0.07; P = 6.62 × 10⁻⁸) with each allele of African ancestry. Within this region, two SNPs were expression quantitative trait loci of TMEM9 in nasal airway epithelial cells and MROH3P in esophagus mucosa. The minor alleles of these SNPs were associated with significantly decreased lung function and decreased TMEM9 gene expression. Another admixture-mapping peak was observed on chromosome 5q35.1, indicating that each Native American ancestry allele was associated with a 0.15-L increase in lung function (95% confidence interval, 0.08-0.21; P = 5.03 × 10⁻⁶). The region-based association tests identified four suggestive windows that harbored candidate rare variants associated with lung function. Conclusions: We identified common and rare genetic variants that may play a critical role in lung function among Puerto Rican youth. We independently validated an inflammatory pathway that could potentially be used to develop more targeted treatments and interventions for patients with asthma.
Subject(s)
Asthma/genetics; Black People/genetics; Chromosomes, Human, Pair 1/genetics; Chromosomes, Human, Pair 5/genetics; Forced Expiratory Volume/genetics; Indians, North American/genetics; Lung/physiopathology; Adolescent; Asthma/physiopathology; Bronchi/cytology; Case-Control Studies; Cell Line; Child; Chromatin Immunoprecipitation; Chromosome Mapping; Esophageal Mucosa/metabolism; Female; Gene Expression; Humans; Linkage Disequilibrium; Lung/physiology; Male; Membrane Proteins/genetics; Membrane Proteins/metabolism; Myocytes, Smooth Muscle; Nasal Mucosa/metabolism; Polymorphism, Single Nucleotide; Puerto Rico; Quantitative Trait Loci; Sequence Analysis, RNA; White People/genetics; Whole Genome Sequencing; Young Adult
ABSTRACT
We performed whole-genome sequencing (WGS) of 208 genomes from 53 families affected by simplex autism. For the majority of these families, no copy-number variant (CNV) or candidate de novo gene-disruptive single-nucleotide variant (SNV) had been detected by microarray or whole-exome sequencing (WES). We integrated multiple CNV and SNV analyses and extensive experimental validation to identify additional candidate mutations in eight families. We report that compared to control individuals, probands showed a significant (p = 0.03) enrichment of de novo and private disruptive mutations within fetal CNS DNase I hypersensitive sites (i.e., putative regulatory regions). This effect was only observed within 50 kb of genes that have been previously associated with autism risk, including genes where dosage sensitivity has already been established by recurrent disruptive de novo protein-coding mutations (ARID1B, SCN2A, NR3C2, PRKCA, and DSCAM). In addition, we provide evidence of gene-disruptive CNVs (in DISC1, WNT7A, RBFOX1, and MBD5), as well as smaller de novo CNVs and exon-specific SNVs missed by exome sequencing in neurodevelopmental genes (e.g., CANX, SAE1, and PIK3CA). Our results suggest that the detection of smaller, often multiple CNVs affecting putative regulatory elements might help explain additional risk of simplex autism.
Subject(s)
Autistic Disorder/genetics; DNA/genetics; Genome, Human; Exome; Female; Humans; Male; Pedigree; Polymorphism, Single Nucleotide
ABSTRACT
PURPOSE: To maximize the discovery of potentially pathogenic variants to better understand the diagnostic utility of genome sequencing (GS) and to assess how the presence of multiple risk events might affect the phenotypic severity in autism spectrum disorders (ASD). METHODS: GS was applied to 180 simplex and multiplex ASD families (578 individuals, 213 patients) with exome sequencing and array comparative genomic hybridization further applied to a subset for validation and cross-platform comparisons. RESULTS: We found that 40.8% of patients carried variants with evidence of disease risk, including a de novo frameshift variant in NR4A2 and two de novo missense variants in SYNCRIP, while 21.1% carried clinically relevant pathogenic or likely pathogenic variants. Patients with more than one risk variant (9.9%) were more severely affected with respect to cognitive ability compared with patients with a single or no-risk variant. We observed no instance among the 27 multiplex families where a pathogenic or likely pathogenic variant was transmitted to all affected members in the family. CONCLUSION: The study demonstrates the diagnostic utility of GS, especially for multiple risk variants that contribute to the phenotypic severity, shows the genetic heterogeneity in multiplex families, and provides evidence for new genes for follow up.
Subject(s)
Autistic Disorder/genetics; Exome Sequencing; Child; Comparative Genomic Hybridization; DNA Copy Number Variations; DNA Mutational Analysis; Female; Humans; Male; Phenotype
ABSTRACT
Marine stickleback fish have colonized and adapted to thousands of streams and lakes formed since the last ice age, providing an exceptional opportunity to characterize genomic mechanisms underlying repeated ecological adaptation in nature. Here we develop a high-quality reference genome assembly for threespine sticklebacks. By sequencing the genomes of twenty additional individuals from a global set of marine and freshwater populations, we identify a genome-wide set of loci that are consistently associated with marine-freshwater divergence. Our results indicate that reuse of globally shared standing genetic variation, including chromosomal inversions, has an important role in repeated evolution of distinct marine and freshwater sticklebacks, and in the maintenance of divergent ecotypes during early stages of reproductive isolation. Both coding and regulatory changes occur in the set of loci underlying marine-freshwater evolution, but regulatory changes appear to predominate in this well known example of repeated adaptive evolution in nature.
Subject(s)
Adaptation, Physiological/genetics; Biological Evolution; Genome/genetics; Smegmamorpha/genetics; Alaska; Animals; Aquatic Organisms/genetics; Chromosome Inversion/genetics; Chromosomes/genetics; Conserved Sequence/genetics; Ecotype; Female; Fresh Water; Genetic Variation/genetics; Genomics; Molecular Sequence Data; Seawater; Sequence Analysis, DNA
ABSTRACT
The introduction of West Nile virus (WNV) into North America in 1999 is a classic example of viral emergence in a new environment, with its subsequent dispersion across the continent having a major impact on local bird populations. Despite the importance of this epizootic, the pattern, dynamics, and determinants of WNV spread in its natural hosts remain uncertain. In particular, it is unclear whether the virus encountered major barriers to transmission, or spread in an unconstrained manner, and if specific viral lineages were favored over others indicative of intrinsic differences in fitness. To address these key questions in WNV evolution and ecology, we sequenced the complete genomes of approximately 300 avian isolates sampled across the United States between 2001 and 2012. Phylogenetic analysis revealed a relatively star-like tree structure, indicative of explosive viral spread in the United States, although with some replacement of viral genotypes through time. These data are striking in that viral sequences exhibit relatively limited clustering according to geographic region, particularly for those viruses sampled from birds, and no strong phylogenetic association with well-sampled avian species. The genome sequence data analyzed here also contain relatively little evidence for adaptive evolution, particularly of structural proteins, suggesting that most viral lineages are of similar fitness and that WNV is well adapted to the ecology of mosquito vectors and diverse avian hosts in the United States. In sum, the molecular evolution of WNV in North America depicts a largely unfettered expansion within a permissive host and geographic population with little evidence of major adaptive barriers. IMPORTANCE: How viruses spread in new host and geographic environments is central to understanding the emergence and evolution of novel infectious diseases and for predicting their likely impact. The emergence of the vector-borne West Nile virus (WNV) in North America in 1999 represents a classic example of this process. Using approximately 300 new viral genomes sampled from wild birds, we show that WNV experienced an explosive spread with little geographical or host constraints within birds and relatively low levels of adaptive evolution. From its introduction into the state of New York, WNV spread across the United States, reaching California and Florida within 4 years, a migration that is clearly reflected in our genomic sequence data, and with a general absence of distinct geographical clusters of bird viruses. However, some geographically distinct viral lineages were found to circulate in mosquitoes, likely reflecting their limited long-distance movement compared to avian species.
Subject(s)
Bird Diseases/epidemiology; Bird Diseases/transmission; Disease Transmission, Infectious; Phylogeography; West Nile Fever/veterinary; Animals; Bird Diseases/virology; Cluster Analysis; Evolution, Molecular; Genetic Variation; Genome, Viral; Genotype; Molecular Epidemiology; Molecular Sequence Data; Sequence Analysis, DNA; Sequence Homology; United States/epidemiology; West Nile Fever/epidemiology; West Nile Fever/transmission; West Nile virus/classification; West Nile virus/genetics; West Nile virus/isolation & purification
ABSTRACT
MOTIVATION: Sequencing of matched tumor and normal samples is the standard study design for reliable detection of somatic alterations. However, even very low levels of cross-sample contamination significantly impact calling of somatic mutations, because contaminant germline variants can be incorrectly interpreted as somatic. There are currently no sequence-only based methods that reliably estimate contamination levels in tumor samples, which frequently display copy number changes. As a solution, we developed Conpair, a tool for detection of sample swaps and cross-individual contamination in whole-genome and whole-exome tumor-normal sequencing experiments. RESULTS: On a ladder of in silico contaminated samples, we demonstrated that Conpair reliably measures contamination levels as low as 0.1%, even in presence of copy number changes. We also estimated contamination levels in glioblastoma WGS and WXS tumor-normal datasets from TCGA and showed that they strongly correlate with tumor-normal concordance, as well as with the number of germline variants called as somatic by several widely-used somatic callers. AVAILABILITY AND IMPLEMENTATION: The method is available at: https://github.com/nygenome/conpair. CONTACT: egrabowska@gmail.com or mczody@nygenome.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Computer Simulation; DNA, Neoplasm; Neoplasms; Algorithms; High-Throughput Nucleotide Sequencing; Humans; Neoplasms/pathology
ABSTRACT
The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.
Subject(s)
Evolution, Molecular; Genome, Human/genetics; Genome/genetics; Mammals/genetics; Animals; Disease; Exons/genetics; Genomics; Health; Humans; Molecular Sequence Annotation; Phylogeny; RNA/classification; RNA/genetics; Selection, Genetic/genetics; Sequence Alignment; Sequence Analysis, DNA
ABSTRACT
Using comparative sequencing approaches, we investigated the evolutionary history of the European-enriched 17q21.31 MAPT inversion polymorphism. We present a detailed, BAC-based sequence assembly of the inverted human H2 haplotype and compare it to the sequence structure and genetic variation of the corresponding 1.5-Mb region for the noninverted H1 human haplotype and that of chimpanzee and orangutan. We found that inversion of the MAPT region is similarly polymorphic in other great ape species, and we present evidence that the inversions occurred independently in chimpanzees and humans. In humans, the inversion breakpoints correspond to core duplications with the LRRC37 gene family. Our analysis favors the H2 configuration and sequence haplotype as the likely great ape and human ancestral state, with inversion recurrences during primate evolution. We show that the H2 architecture has evolved more extensive sequence homology, perhaps explaining its tendency to undergo microdeletion associated with mental retardation in European populations.
Subject(s)
Chromosome Inversion; Chromosomes, Human, Pair 17; Evolution, Molecular; Polymorphism, Genetic; tau Proteins/genetics; Animals; Base Sequence; Gene Duplication; Humans; Models, Biological; Molecular Sequence Data; Pan troglodytes/genetics; Phylogeny; Pongo pygmaeus/genetics; Sequence Analysis, DNA
ABSTRACT
Following recent success in genome-wide association studies, a critical focus of human genetics is to understand how genetic variation at implicated loci influences cellular and disease processes. Crohn's disease (CD) is associated with SNPs around IRGM, but coding-sequence variation has been excluded as a source of this association. We identified a common, 20-kb deletion polymorphism, immediately upstream of IRGM and in perfect linkage disequilibrium (r² = 1.0) with the most strongly CD-associated SNP, that causes IRGM to segregate in the population with two distinct upstream sequences. The deletion (CD risk) and reference (CD protective) haplotypes of IRGM showed distinct expression patterns. Manipulation of IRGM expression levels modulated cellular autophagy of internalized bacteria, a process implicated in CD. These results suggest that the CD association at IRGM arises from an alteration in IRGM regulation that affects the efficacy of autophagy and identify a common deletion polymorphism as a likely causal variant.
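To make the linkage disequilibrium statistic in this abstract concrete, the sketch below computes the standard r² measure between two biallelic loci from haplotype and allele frequencies. The function name and the frequencies are hypothetical and chosen only to show that two loci whose alleles always co-occur yield a value of 1; they are not data from the study.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between alleles at two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2.
    p_a:  frequency of allele A at locus 1.
    p_b:  frequency of allele B at locus 2.
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d * d / denominator


# Hypothetical frequencies: the two alleles always travel together (perfect LD).
perfect = ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical frequencies: the haplotype occurs exactly as often as chance predicts.
independent = ld_r_squared(p_ab=0.09, p_a=0.3, p_b=0.3)
```

With r² equal to 1, genotyping either variant fully predicts the other, which is why association data alone cannot distinguish a tag SNP from a deletion in perfect linkage disequilibrium with it.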