ABSTRACT
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Subject(s)
Genetic Variation/genetics , Genome, Human/genetics , Physical Chromosome Mapping , Amino Acid Sequence , Genetic Predisposition to Disease , Genetics, Medical , Genetics, Population , Genome-Wide Association Study , Genomics , Genotype , Haplotypes/genetics , Homozygote , Humans , Molecular Sequence Data , Mutation Rate , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Sequence Analysis, DNA , Sequence Deletion/genetics
ABSTRACT
Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately. Results: We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows the uncertainty inherent to sparse Strand-seq data to be assessed at the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly. Availability and implementation: https://github.com/daewoooo/SaaRclust.
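Illustration: the idea of soft-clustering reads with Expectation Maximization and reading off a per-read posterior can be sketched with a generic binomial mixture. This is not the SaaRclust model or its code; the function name `em_binomial_mixture`, the input layout (per-read, per-cell Watson-strand counts `w` out of totals `n`) and all parameters are assumptions made for this toy example only.

```python
import numpy as np


def em_binomial_mixture(w, n, k, n_iter=100, seed=0):
    """Toy EM for a mixture of binomials (hypothetical, not SaaRclust itself).

    w, n : integer arrays of shape (reads, cells); w[i, j] is the number of
           Watson-strand observations for read i in cell j, out of n[i, j].
    k    : number of clusters.
    Returns (posterior, pi, theta), where posterior[i, c] is the probability
    that read i belongs to cluster c.
    """
    rng = np.random.default_rng(seed)
    n_cells = w.shape[1]
    pi = np.full(k, 1.0 / k)                        # mixing weights
    theta = rng.uniform(0.3, 0.7, (k, n_cells))     # P(Watson) per cluster and cell

    for _ in range(n_iter):
        # E-step: log-likelihood of each read under each cluster. The binomial
        # coefficient is identical across clusters, so it is omitted.
        log_lik = w @ np.log(theta).T + (n - w) @ np.log1p(-theta).T
        log_post = log_lik + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights and per-cell Watson probabilities,
        # with a small pseudocount to keep theta away from 0 and 1.
        pi = np.clip(post.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        theta = (post.T @ w + 1e-3) / (post.T @ n + 2e-3)
        theta = np.clip(theta, 1e-6, 1 - 1e-6)

    return post, pi, theta
```

Each row of the returned `posterior` sums to one, so a read can be assigned only when its largest entry exceeds a chosen confidence threshold, and otherwise left unassigned.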
Subject(s)
Chromosomes, Human , Computer Simulation , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Female , Genome, Human , Humans , Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: Active LINE-1 (L1) elements can mobilize flanking sequences to different genomic loci through a process termed transduction, thereby influencing genomic content and structure. However, an approach for detecting polymorphic germline non-reference transductions in massively parallel sequencing data has been lacking. RESULTS: Here we present the computational approach TIGER (Transduction Inference in GERmline genomes), enabling the discovery of non-reference L1-mediated transductions by combining L1 discovery with detection of unique insertion sequences and detailed characterization of insertion sites. We employed TIGER to characterize polymorphic transductions in fifteen genomes from non-human primate species (chimpanzee, orangutan and rhesus macaque), as well as in a human genome. We achieved high accuracy, as confirmed by PCR and two single-molecule DNA sequencing techniques, and uncovered differences in relative rates of transduction between primate species. CONCLUSIONS: By enabling detection of polymorphic transductions, TIGER makes this relevant form of structural variation amenable to population and personal genome analysis.
Subject(s)
Germ Cells/metabolism , High-Throughput Nucleotide Sequencing , Long Interspersed Nucleotide Elements , Transduction, Genetic , Animals , Base Sequence , Computational Biology/methods , Genome , Humans , Macaca mulatta/genetics , Pan troglodytes/genetics
ABSTRACT
Structural variation (SV), involving deletions, duplications, inversions and translocations of DNA segments, is a major source of genetic variability in somatic cells and can dysregulate cancer-related pathways. However, discovering somatic SVs in single cells has been challenging, with copy-number-neutral and complex variants typically escaping detection. Here we describe single-cell tri-channel processing (scTRIP), a computational framework that integrates read depth, template strand and haplotype phase to comprehensively discover SVs in individual cells. We surveyed SV landscapes of 565 single cells, including transformed epithelial cells and patient-derived leukemic samples, to discover abundant SV classes, including inversions, translocations and complex DNA rearrangements. Analysis of the leukemic samples revealed four times more somatic SVs than cytogenetic karyotyping, submicroscopic copy-number alterations, oncogenic copy-neutral rearrangements and a subclonal chromothripsis event. Advancing current methods, single-cell tri-channel processing can directly measure SV mutational processes in individual cells, such as breakage-fusion-bridge cycles, facilitating studies of clonal evolution, genetic mosaicism and SV formation mechanisms, which could improve disease classification for precision medicine.
Subject(s)
Computational Biology/methods , Genomic Structural Variation , Leukemia/genetics , Single-Cell Analysis/methods , Cell Line , Chromothripsis , Clonal Evolution , Gene Rearrangement , Humans , INDEL Mutation , Sequence Inversion , Translocation, Genetic
ABSTRACT
Chromatin topology is intricately linked to gene expression, yet its functional requirement remains unclear. Here, we comprehensively assessed the interplay between genome topology and gene expression using highly rearranged chromosomes (balancers) spanning ~75% of the Drosophila genome. Using trans-heterozygous (balancer/wild-type) embryos, we measured allele-specific changes in topology and gene expression in cis, while minimizing trans effects. Through genome sequencing, we resolved eight large nested inversions, smaller inversions, duplications and thousands of deletions. These extensive rearrangements caused many changes to chromatin topology, disrupting long-range loops, topologically associating domains (TADs) and promoter interactions, yet these changes are not predictive of changes in expression. Gene expression is generally not altered around inversion breakpoints, indicating that mis-appropriate enhancer-promoter activation is a rare event. Similarly, shuffling or fusing TADs, changing intra-TAD connections and disrupting long-range inter-TAD loops do not alter expression for the majority of genes. Our results suggest that properties other than chromatin topology ensure productive enhancer-promoter interactions.