ABSTRACT
A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline (ref. 1) to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.
Subject(s)
Genetic Variation , Genome, Human/genetics , Whole Genome Sequencing , Alleles , Case-Control Studies , Epigenesis, Genetic , Female , Gene Dosage/genetics , Genetics, Population , High-Throughput Nucleotide Sequencing , Humans , Male , Molecular Sequence Annotation , Quantitative Trait Loci , Racial Groups/genetics , Software
ABSTRACT
Identification of rare-variant associations is crucial to full characterization of the genetic architecture of complex traits and diseases. Essential in this process is the evaluation of novel methods in simulated data that mirror the distribution of rare variants and haplotype structure in real data. Additionally, importing real-variant annotation enables in silico comparison of methods, such as rare-variant association tests and polygenic scoring methods, that focus on putative causal variants. Existing simulation methods are either unable to employ real-variant annotation or severely under- or overestimate the number of singletons and doubletons, thereby reducing the ability to generalize simulation results to real studies. We present RAREsim, a flexible and accurate rare-variant simulation algorithm. Using parameters and haplotypes derived from real sequencing data, RAREsim efficiently simulates the expected variant distribution and enables real-variant annotations. We highlight RAREsim's utility across various genetic regions, sample sizes, ancestries, and variant classes.
Subject(s)
Genetic Variation , Research Design , Computer Simulation , Genetic Variation/genetics , Haplotypes/genetics , Humans , Models, Genetic , Multifactorial Inheritance
ABSTRACT
Structural variants are associated with cancers and developmental disorders, but challenges with estimating population frequency remain a barrier to prioritizing mutations over inherited variants. In particular, variability in variant calling heuristics and filtering limits the use of current structural variant catalogs. We present STIX, a method that, instead of relying on variant calls, indexes and searches the raw alignments from thousands of samples to enable more comprehensive allele frequency estimation.
Subject(s)
Genome , Genomic Structural Variation , Neoplasms , Algorithms , Genomic Structural Variation/genetics , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/genetics , Software
ABSTRACT
The human genome encodes an order of magnitude more gene expression enhancers than promoters, suggesting that most genes are regulated by the combined action of multiple enhancers. We have previously shown that neighboring estrogen-responsive enhancers exhibit complex synergistic contributions to the production of an estrogenic transcriptional response. Here we sought to determine the molecular underpinnings of this enhancer cooperativity. We generated genetic deletions of four estrogen receptor α (ER) bound enhancers that regulate two genes and found that enhancers containing full estrogen response element (ERE) motifs control ER binding at neighboring sites, while enhancers with pre-existing histone acetylation/accessibility confer a permissible chromatin environment to the neighboring enhancers. Genome engineering revealed that two enhancers with half EREs could not compensate for the lack of a full ERE site within the cluster. In contrast, two enhancers with full EREs produced a transcriptional response greater than the wild-type locus. By swapping genomic sequences, we found that the genomic location of a full ERE strongly influences enhancer activity. Our results lead to a model in which a full ERE is required for ER recruitment, but the presence of a pre-existing permissible chromatin environment can also be needed for estrogen-driven gene regulation to occur.
Subject(s)
Enhancer Elements, Genetic/genetics , Estrogen Receptor alpha/genetics , Nucleotide Motifs/genetics , Transcription, Genetic , Acetylation , Chromatin/genetics , DNA-Binding Proteins/genetics , Gene Expression Regulation/genetics , Genome, Human/genetics , Humans , Promoter Regions, Genetic/genetics
ABSTRACT
GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
Subject(s)
Breast Neoplasms/genetics , Genome, Human , Genomics/methods , Search Engine/methods , Sequence Analysis, DNA/methods , Software , Databases, Genetic , Female , Humans , Internet
ABSTRACT
SUMMARY: Large-scale human genetics studies are now employing whole genome sequencing with the goal of conducting comprehensive trait mapping analyses of all forms of genome variation. However, methods for structural variation (SV) analysis have lagged far behind those for smaller scale variants, and there is an urgent need to develop more efficient tools that scale to the size of human populations. Here, we present a fast and highly scalable software toolkit (svtools) and cloud-based pipeline for assembling high quality SV maps (including deletions, duplications, mobile element insertions, inversions and other rearrangements) in many thousands of human genomes. We show that this pipeline achieves similar variant detection performance to established per-sample methods (e.g. LUMPY), while providing fast and affordable joint analysis at the scale of ≥100 000 genomes. These tools will help enable the next generation of human genetics studies. AVAILABILITY AND IMPLEMENTATION: svtools is implemented in Python and freely available (MIT) from https://github.com/hall-lab/svtools. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome, Human , Software , Humans , Sequence Deletion , Whole Genome Sequencing
ABSTRACT
Functional genomics assays produce sets of genomic regions as one of their main outputs. To biologically interpret such region-sets, researchers often use colocalization analysis, where the statistical significance of colocalization (overlap, spatial proximity) between two or more region-sets is tested. Existing colocalization analysis tools vary in the statistical methodology and analysis approaches, thus potentially providing different conclusions for the same research question. As the findings of colocalization analysis are often the basis for follow-up experiments, it is helpful to use several tools in parallel and to compare the results. We developed the Coloc-stats web service to facilitate such analyses. Coloc-stats provides a unified interface to perform colocalization analysis across various analytical methods and method-specific options (e.g. colocalization measures, resolution, null models). Coloc-stats helps the user to find a method that supports their experimental requirements and allows for a straightforward comparison across methods. Coloc-stats is implemented as a web server with a graphical user interface that assists users with configuring their colocalization analyses. Coloc-stats is freely available at https://hyperbrowser.uio.no/coloc-stats/.
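To make the idea of testing colocalization against a null model concrete, here is a minimal Python sketch of one common approach: a permutation test that compares the observed number of overlapping regions with counts obtained after randomly relocating the query regions. It is not taken from Coloc-stats or from any of the tools it wraps. The function names, the uniform-placement null model, the single linear genome of length `genome_length`, and the half-open coordinates are all assumptions made for illustration.

```python
import random

def count_overlapping(query, reference):
    """Count query intervals that overlap at least one reference interval.

    Intervals are half-open (start, end) tuples on one linear coordinate axis.
    """
    return sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )

def permutation_p_value(query, reference, genome_length, n_perm=1000, seed=0):
    """Empirical p-value for overlap under a uniform random-placement null."""
    rng = random.Random(seed)
    observed = count_overlapping(query, reference)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in query:
            length = end - start
            # Relocate each query interval uniformly, preserving its length.
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlapping(shuffled, reference) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)

reference = [(100, 200), (500, 600)]
query = [(150, 160), (550, 560)]
print(permutation_p_value(query, reference, genome_length=10_000))
```

Different null models (for example, preserving inter-region distances or restricting placement to mappable sequence) can give different p-values for the same data, which is the kind of discrepancy the abstract suggests comparing across tools.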
Subject(s)
Genomics/methods , Software , Chromatin Immunoprecipitation , GATA1 Transcription Factor/metabolism , Internet , Sequence Analysis, DNA , User-Computer Interface
ABSTRACT
Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.
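As a rough illustration of why a genotype index organized by sample can answer cohort queries quickly, the Python sketch below stores one bitmap per sample and genotype class and answers "which variants are heterozygous in all of these samples" with bitwise ANDs. This is a generic, uncompressed bitmap index written for this listing, not GQT's on-disk format or API. The function names and the genotype codes (0 hom-ref, 1 het, 2 hom-alt) are assumptions.

```python
def build_bitmaps(genotypes):
    """Build per-sample, per-genotype bitmaps.

    genotypes maps a sample name to a list of codes, one per variant
    (assumed encoding: 0 hom-ref, 1 het, 2 hom-alt). In the result,
    bitmaps[sample][code] is an int whose bit v is set when that sample
    carries that code at variant v.
    """
    bitmaps = {}
    for sample, calls in genotypes.items():
        per_code = {0: 0, 1: 0, 2: 0}
        for v, code in enumerate(calls):
            per_code[code] |= 1 << v
        bitmaps[sample] = per_code
    return bitmaps

def variants_where_all(bitmaps, samples, code, n_variants):
    """Return indices of variants at which every listed sample has `code`."""
    mask = (1 << n_variants) - 1
    for sample in samples:
        mask &= bitmaps[sample][code]  # one AND per sample, no per-variant loop
    return [v for v in range(n_variants) if (mask >> v) & 1]

genotypes = {
    "s1": [0, 1, 2, 1],
    "s2": [1, 1, 0, 1],
    "s3": [0, 1, 1, 2],
}
bitmaps = build_bitmaps(genotypes)
print(variants_where_all(bitmaps, ["s1", "s2"], code=1, n_variants=4))
```

The query cost grows with the number of samples in the query rather than with the number of genotype records scanned, which is the general property that makes this layout attractive for large cohorts.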
Subject(s)
Genetic Variation , Genotype , Datasets as Topic
ABSTRACT
SpeedSeq is an open-source genome analysis platform that accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 h on a low-cost server and alleviates a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement. SpeedSeq offers performance competitive with or superior to current methods for detecting germline and somatic single-nucleotide variants, structural variants, insertions and deletions, and it includes novel functionality for streamlined interpretation.
Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Software , Genetic Variation , Humans , Neoplasms/genetics , Polymorphism, Single Nucleotide , Precision Medicine/methods , Workflow
ABSTRACT
The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that is, if feature sets contain overlapping intervals then it is inferred that they share a common biological function or origin. Using this technique, researchers identify genomic regions that are common among multiple (or unique to individual) datasets. While there have been recent advances in algorithms for pairwise intersections between two sets of genomic intervals, few advances have been made to the intersection of many sets of genomic intervals. Identifying intersections among many interval sets is particularly important when attempting to distill biological insights from the massive, multi-dimensional datasets that are common to modern genome research. For such analyses, speed and efficiency are crucial given the size and sheer number of datasets involved. To solve this problem, we present a novel "slice-then-sweep" algorithm that, given N interval sets, efficiently reveals the subset of intervals that are common to all N sets. We demonstrate that our algorithm is more efficient in the sequential case and has a vastly higher capacity for parallelization with a 19x speedup over the existing algorithm.
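For readers unfamiliar with the N-way problem, the Python sketch below finds the regions covered by all N interval sets using a plain endpoint sweep. It is a baseline illustration only and is not the "slice-then-sweep" algorithm described in the abstract, whose details are not given here. The function name is hypothetical, intervals are assumed half-open, and intervals within any one set are assumed not to overlap each other, so that a coverage depth of N means every set is present.

```python
def common_regions(interval_sets):
    """Return regions covered by every set.

    Each set is a list of half-open (start, end) tuples; intervals inside a
    single set are assumed to be non-overlapping.
    """
    n = len(interval_sets)
    events = []
    for intervals in interval_sets:
        for start, end in intervals:
            events.append((start, 1))   # interval opens
            events.append((end, -1))    # interval closes
    # Tuple sort puts closes (-1) before opens (+1) at the same position,
    # so intervals that merely touch are not treated as overlapping.
    events.sort()

    depth = 0
    region_start = None
    result = []
    for pos, delta in events:
        if depth == n and delta == -1 and region_start < pos:
            result.append((region_start, pos))
        depth += delta
        if depth == n:
            region_start = pos
    return result

sets = [
    [(1, 10), (20, 30)],
    [(5, 25)],
    [(0, 8), (22, 40)],
]
print(common_regions(sets))
```

The sweep sorts all endpoints once, so it runs in O(M log M) time for M intervals in total, but it is inherently sequential, which is the kind of limitation a parallel approach would aim to remove.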
ABSTRACT
Tumor genomes are generally thought to evolve through a gradual accumulation of mutations, but the observation that extraordinarily complex rearrangements can arise through single mutational events suggests that evolution may be accelerated by punctuated changes in genome architecture. To assess the prevalence and origins of complex genomic rearrangements (CGRs), we mapped 6179 somatic structural variation breakpoints in 64 cancer genomes from seven tumor types and screened for clusters of three or more interconnected breakpoints. We find that complex breakpoint clusters are extremely common: 154 clusters comprise 25% of all somatic breakpoints, and 75% of tumors exhibit at least one complex cluster. Based on copy number state profiling, 63% of breakpoint clusters are consistent with being CGRs that arose through a single mutational event. CGRs have diverse architectures including focal breakpoint clusters, large-scale rearrangements joining clusters from one or more chromosomes, and staggeringly complex chromothripsis events. Notably, chromothripsis has a significantly higher incidence in glioblastoma samples (39%) relative to other tumor types (9%). Chromothripsis breakpoints also show significantly elevated intra-tumor allele frequencies relative to simple SVs, which indicates that they arise early during tumorigenesis or confer selective advantage. Finally, assembly and analysis of 4002 somatic and 6982 germline breakpoint sequences reveal that somatic breakpoints show significantly less microhomology and fewer templated insertions than germline breakpoints, and this effect is stronger at CGRs than at simple variants. These results are inconsistent with replication-based models of CGR genesis and strongly argue that nonhomologous repair of concurrently arising DNA double-strand breaks is the predominant mechanism underlying complex cancer genome rearrangements.
Subject(s)
Chromosome Aberrations , Chromosome Breakpoints , Mutation/genetics , Neoplasms/genetics , Base Sequence , DNA Breaks, Double-Stranded , DNA Replication/genetics , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/pathology
ABSTRACT
MOTIVATION: The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. RESULTS: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. AVAILABILITY: https://github.com/arq5x/bits.
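One way to count interval intersections with binary searches, in the spirit of the approach named in the abstract, is to subtract the intervals that cannot overlap a query from the total: those that end before the query starts and those that start after it ends. The Python sketch below shows that idea. It is a simplified illustration rather than the authors' implementation (see the repository linked above for that); the function names are hypothetical and intervals are assumed to be closed with integer coordinates.

```python
from bisect import bisect_left, bisect_right

def build_index(intervals):
    """Return independently sorted start and end coordinates."""
    starts = sorted(start for start, _ in intervals)
    ends = sorted(end for _, end in intervals)
    return starts, ends

def count_overlaps(index, query):
    """Count indexed intervals that overlap the closed query interval."""
    starts, ends = index
    q_start, q_end = query
    # Intervals entirely to the left of the query: end < q_start.
    left = bisect_left(ends, q_start)
    # Intervals entirely to the right of the query: start > q_end.
    right = len(starts) - bisect_right(starts, q_end)
    return len(starts) - left - right

database = [(1, 5), (4, 10), (12, 20), (30, 40)]
index = build_index(database)
print(count_overlaps(index, (5, 12)))  # prints 3
```

Because each query needs only two independent binary searches over read-only arrays, many queries can be evaluated at the same time, which is consistent with the abstract's point about suitability for parallel hardware and repeated Monte Carlo trials.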
Subject(s)
Algorithms , Genomics/methods , Monte Carlo Method , Sequence Alignment , Sequence Analysis, DNA
ABSTRACT
Many nocturnally active fireflies use precisely timed bioluminescent patterns to identify mates, making them especially vulnerable to light pollution. As urbanization continues to brighten the night sky, firefly populations are under constant stress, and close to half of the species are now threatened. Ensuring the survival of firefly biodiversity depends on a large-scale conservation effort to monitor and protect thousands of populations. While species can be identified by their flash patterns, current methods require expert measurement and manual classification and are infeasible given the number and geographic distribution of fireflies. Here we present the application of a recurrent neural network (RNN) for accurate automated firefly flash pattern classification. Using recordings from commodity cameras, we can extract flash trajectories of individuals within a swarm and classify their species with an accuracy of approximately seventy percent. In addition to its potential in population monitoring, automated classification provides the means to study firefly behavior at the population level. We employ the classifier to measure and characterize the variability within and between swarms, unlocking a new dimension of their behavior. Our method is open source, and deployment in community science applications could revolutionize our ability to monitor and understand firefly populations.
Subject(s)
Fireflies , Sexual Behavior, Animal , Humans , Animals
ABSTRACT
Comprehensive characterization of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation, reproducible and high-confidence structural variation callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrows (Passer domesticus). To produce a consensus callset across all samples using short-read data, we compare heuristic-based quality filtering and visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of structural variants is important for reducing putative false positives and that the time invested in this step outweighs the potential costs of analyzing short-read-discovered structural variation data sets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, in applying a lenient manual curation strategy with a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the expected population structure from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence structural variation callsets.
Subject(s)
Genome , Genomics , Metagenomics , Polymorphism, Single Nucleotide
ABSTRACT
There is an unmet need to improve the efficacy of platinum-based cancer chemotherapy, which is used in primary and metastatic settings in many cancer types. In bladder cancer, platinum-based chemotherapy leads to better outcomes in a subset of patients when used in the neoadjuvant setting or in combination with immunotherapy for advanced disease. Despite such promising results, extending the benefits of platinum drugs to a greater number of patients is highly desirable. Using the multiomic assessment of cisplatin-responsive and -resistant human bladder cancer cell lines and whole-genome CRISPR screens, we identified puromycin-sensitive aminopeptidase (NPEPPS) as a driver of cisplatin resistance. NPEPPS depletion sensitized resistant bladder cancer cells to cisplatin in vitro and in vivo. Conversely, overexpression of NPEPPS in sensitive cells increased cisplatin resistance. NPEPPS affected treatment response by regulating intracellular cisplatin concentrations. Patient-derived organoids (PDO) generated from bladder cancer samples before and after cisplatin-based treatment, and from patients who did not receive cisplatin, were evaluated for sensitivity to cisplatin, which was concordant with clinical response. In the PDOs, depletion or pharmacologic inhibition of NPEPPS increased cisplatin sensitivity, while NPEPPS overexpression conferred resistance. Our data present NPEPPS as a druggable driver of cisplatin resistance by regulating intracellular cisplatin concentrations. SIGNIFICANCE: Targeting NPEPPS, which induces cisplatin resistance by controlling intracellular drug concentrations, is a potential strategy to improve patient responses to platinum-based therapies and lower treatment-associated toxicities.
Subject(s)
Cisplatin , Drug Resistance, Neoplasm , Urinary Bladder Neoplasms , Humans , Cisplatin/pharmacology , Urinary Bladder Neoplasms/drug therapy , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/pathology , Urinary Bladder Neoplasms/metabolism , Animals , Mice , Cell Line, Tumor , Aminopeptidases/genetics , Aminopeptidases/metabolism , Xenograft Model Antitumor Assays , Antineoplastic Agents/pharmacology , Organoids/drug effects , Organoids/metabolism
ABSTRACT
Insulinomas are rare neuroendocrine tumors arising from pancreatic β cells, characterized by aberrant proliferation and altered insulin secretion, leading to glucose homeostasis failure. With the aim of uncovering the role of noncoding regulatory regions and their aberrations in the development of these tumors, we coupled epigenetic and transcriptome profiling with whole-genome sequencing. As a result, we unraveled somatic mutations associated with changes in regulatory functions. Critically, these regions impact insulin secretion, tumor development, and epigenetic modifying genes, including polycomb complex components. Chromatin remodeling is apparent in insulinoma-selective domains shared across patients, containing a specific set of regulatory sequences dominated by the SOX17 binding motif. Moreover, many of these regions are H3K27me3 repressed in β cells, suggesting that tumoral transition involves derepression of polycomb-targeted domains. Our work provides a compendium of aberrant cis-regulatory elements affecting the function and fate of β cells in their progression to insulinomas and a framework to identify coding and noncoding driver mutations.
Subject(s)
Insulinoma , Humans , Insulinoma/genetics , Insulinoma/pathology , Insulinoma/metabolism , Insulin-Secreting Cells/metabolism , Insulin-Secreting Cells/pathology , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology , Mutation , Gene Expression Regulation, Neoplastic , Epigenesis, Genetic , Chromatin Assembly and Disassembly/genetics
ABSTRACT
BACKGROUND: Notch3 is expressed in myogenic precursors, but its function is not well known. RESULTS: Notch3 represses the activity of Mef2c and is in turn inhibited by the microRNAs-1 and -206. CONCLUSION: Notch3 serves as a regulator for preventing premature myogenic differentiation. SIGNIFICANCE: Understanding how precocious differentiation is prevented is critical for designing therapy for skeletal muscle regeneration. The Notch signaling pathway is a well known regulator of skeletal muscle stem cells known as satellite cells. Loss of Notch1 signaling leads to spontaneous myogenic differentiation. Notch1, normally expressed in satellite cells, is targeted for proteasomal degradation by Numb during differentiation. A homolog of Notch1, Notch3, is also expressed in these cells but is not inhibited by Numb. We find that Notch3 is paradoxically up-regulated during the early stages of differentiation by an enhancer that requires both MyoD and activated Notch1. Notch3 itself strongly inhibits the myogenic transcription factor Mef2c, most likely by increasing the p38 phosphatase Mkp1, which inhibits the Mef2c activator p38 MAP kinase. Active Notch3 decreases differentiation. Mef2c, however, induces microRNAs miR-1 and miR-206, which directly down-regulate Notch3 and allow differentiation to proceed. Thus, the myogenic differentiation-induced microRNAs miR-1 and -206 are important for differentiation at least partly because they turn off Notch3. We suggest that the transient expression of Notch3 early in differentiation generates a temporal lag between myoblast activation by MyoD and terminal differentiation into myotubes directed by Mef2c.
Subject(s)
Cell Differentiation , Dual Specificity Phosphatase 1/metabolism , MicroRNAs/metabolism , Myoblasts/cytology , Myogenic Regulatory Factors/metabolism , Receptors, Notch/metabolism , Animals , Cell Line , Down-Regulation , Dual Specificity Phosphatase 1/genetics , MEF2 Transcription Factors , Mice , MicroRNAs/genetics , MyoD Protein/genetics , MyoD Protein/metabolism , Myoblasts/metabolism , Myogenic Regulatory Factors/antagonists & inhibitors , Myogenic Regulatory Factors/genetics , Receptor, Notch3 , Receptors, Notch/antagonists & inhibitors , Receptors, Notch/genetics , Signal Transduction
ABSTRACT
MicroRNAs play important roles in many cell processes, including the differentiation process in several different lineages. For example, microRNAs can promote differentiation by repressing negative regulators of transcriptional activity. These regulated transcription factors can further up-regulate levels of the microRNA in a feed-forward mechanism. Here we show that MyoD up-regulates miR-378 during myogenic differentiation in C2C12 cells. ChIP and high throughput sequencing analysis shows that MyoD binds in close proximity to the miR-378 gene and causes both transactivation and chromatin remodeling. Overexpression of miR-378 increases the transcriptional activity of MyoD, in part by repressing an antagonist, MyoR. The 3' untranslated region of MyoR contains a direct binding site for miR-378. The presence of this binding site significantly reduces the ability of MyoR to prevent the MyoD-driven transdifferentiation of fibroblasts. MyoR and miR-378 were anticorrelated during cardiotoxin-induced adult muscle regeneration in mice. Taken together, this shows a feed-forward loop where MyoD indirectly down-regulates MyoR via miR-378.
Subject(s)
Cell Differentiation/physiology , MicroRNAs/metabolism , Muscle, Skeletal/metabolism , Myoblasts, Skeletal/metabolism , Regeneration/physiology , Transcription Factors/metabolism , 3' Untranslated Regions/physiology , Animals , Basic Helix-Loop-Helix Transcription Factors , Cell Line , Cell Transdifferentiation/physiology , Dogs , Fibroblasts/cytology , Fibroblasts/metabolism , Humans , Male , Mice , MicroRNAs/genetics , Muscle Development/physiology , Muscle, Skeletal/cytology , MyoD Protein/genetics , MyoD Protein/metabolism , Myoblasts, Skeletal/cytology , Rats , Transcription Factors/genetics
ABSTRACT
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.