ABSTRACT
A comprehensive catalog of cancer driver mutations is essential for understanding tumorigenesis and developing therapies. Exome-sequencing studies have mapped many protein-coding drivers, yet few non-coding drivers are known because genome-wide discovery is challenging. We developed a driver discovery method, ActiveDriverWGS, and analyzed 120,788 cis-regulatory modules (CRMs) across 1,844 whole tumor genomes from the ICGC-TCGA PCAWG project. We found 30 CRMs with enriched SNVs and indels (FDR < 0.05). These frequently mutated regulatory elements (FMREs) were ubiquitously active in human tissues, showed long-range chromatin interactions and mRNA abundance associations with target genes, and were enriched in motif-rewiring mutations and structural variants. Genomic deletion of one FMRE in human cells caused proliferative deficiencies and transcriptional deregulation of cancer genes CCNB1IP1, CDH1, and CDKN2B, validating observations in FMRE-mutated tumors. Pathway analysis revealed further sub-significant FMREs at cancer genes and processes, indicating an unexplored landscape of infrequent driver mutations in the non-coding genome.
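The FDR < 0.05 cutoff above can be illustrated with a small sketch. It assumes the Benjamini-Hochberg procedure, which the abstract does not name, and the p-values are invented for illustration; the function name `benjamini_hochberg` is hypothetical and is not part of ActiveDriverWGS.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as a common FDR procedure; the
    abstract does not state which correction was applied.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented per-element p-values (one per hypothetical regulatory element).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
example_fdr = benjamini_hochberg(example_pvalues)
significant = [i for i, q in enumerate(example_fdr) if q < 0.05]
```

In this toy input only the first two elements pass the 0.05 threshold after adjustment, although four raw p-values are below 0.05.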
Subject(s)
Biomarkers, Tumor/genetics , Chromatin/metabolism , Gene Regulatory Networks , Mutation , Neoplasms/genetics , Neoplasms/pathology , Regulatory Sequences, Nucleic Acid , Cell Proliferation , Chromatin/genetics , Computational Biology/methods , DNA Mutational Analysis , Genome, Human , HEK293 Cells , Humans
ABSTRACT
Probing epigenetic features on DNA has tremendous potential to advance our understanding of the phased epigenome. In this study, we use nanopore sequencing to evaluate CpG methylation and chromatin accessibility simultaneously on long strands of DNA by applying GpC methyltransferase to exogenously label open chromatin. We performed nanopore sequencing of nucleosome occupancy and methylome (nanoNOMe) on four human cell lines (GM12878, MCF-10A, MCF-7 and MDA-MB-231). The single-molecule resolution allows footprinting of protein and nucleosome binding, and determination of the combinatorial promoter epigenetic signature on individual molecules. Long-read sequencing makes it possible to robustly assign reads to haplotypes, allowing us to generate a fully phased human epigenome, consisting of chromosome-level allele-specific profiles of CpG methylation and chromatin accessibility. We further apply this to a breast cancer model to evaluate differential methylation and accessibility between cancerous and noncancerous cells.
Subject(s)
Breast Neoplasms/genetics , Chromatin/genetics , DNA Methylation/genetics , Nanopore Sequencing/methods , Cell Line, Tumor , CpG Islands/genetics , DNA/metabolism , Epigenome/genetics , Female , Genome, Human/genetics , Humans , MCF-7 Cells , Methyltransferases/metabolism , Promoter Regions, Genetic/genetics , Sequence Analysis, DNA
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
ABSTRACT
Replication of eukaryotic genomes is highly stochastic, making it difficult to determine the replication dynamics of individual molecules with existing methods. We report a sequencing method for the measurement of replication fork movement on single molecules by detecting nucleotide analog signal currents on extremely long nanopore traces (D-NAscent). Using this method, we detect 5-bromodeoxyuridine (BrdU) incorporated by Saccharomyces cerevisiae to reveal, at a genomic scale and on single molecules, the DNA sequences replicated during a pulse-labeling period. Under conditions of limiting BrdU concentration, D-NAscent detects the differences in BrdU incorporation frequency across individual molecules to reveal the location of active replication origins, fork direction, termination sites, and fork pausing/stalling events. We used sequencing reads of 20-160 kilobases to generate a whole-genome single-molecule map of DNA replication dynamics and discover a class of low-frequency stochastic origins in budding yeast. The D-NAscent software is available at https://github.com/MBoemo/DNAscent.git .
Subject(s)
DNA Replication , Genome, Fungal , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Nanopores , Saccharomyces cerevisiae/genetics , Bromodeoxyuridine/metabolism , DNA, Fungal/genetics , Genome , Software
ABSTRACT
High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies. Our study generated 9.9 million aligned sequence reads for the human cell line GM12878, using thirty MinION flow cells at six institutions. These native RNA reads had a median length of 771 bases, and a maximum aligned length of over 21,000 bases. Mitochondrial poly(A) reads provided an internal measure of read-length quality. We combined these long nanopore reads with higher accuracy short-reads and annotated GM12878 promoter regions to identify 33,984 plausible RNA isoforms. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes.
Subject(s)
Nanopore Sequencing/methods , Poly A/genetics , Sequence Analysis, RNA/methods , Transcriptome , Cells, Cultured , Humans
ABSTRACT
Pancreatic cancer, a highly aggressive tumour type with uniformly poor prognosis, exemplifies the classically held view of stepwise cancer development. The current model of tumorigenesis, based on analyses of precursor lesions, termed pancreatic intraepithelial neoplasms (PanINs), makes two predictions: first, that pancreatic cancer develops through a particular sequence of genetic alterations (KRAS, followed by CDKN2A, then TP53 and SMAD4); and second, that the evolutionary trajectory of pancreatic cancer progression is gradual because each alteration is acquired independently. A shortcoming of this model is that clonally expanded precursor lesions do not always belong to the tumour lineage, indicating that the evolutionary trajectory of the tumour lineage and precursor lesions can be divergent. This prevailing model of tumorigenesis has contributed to the clinical notion that pancreatic cancer evolves slowly and presents at a late stage. However, the propensity for this disease to rapidly metastasize and the inability to improve patient outcomes, despite efforts aimed at early detection, suggest that pancreatic cancer progression is not gradual. Here, using newly developed informatics tools, we tracked changes in DNA copy number and their associated rearrangements in tumour-enriched genomes and found that pancreatic cancer tumorigenesis is neither gradual nor follows the accepted mutation order. Two-thirds of tumours harbour complex rearrangement patterns associated with mitotic errors, consistent with punctuated equilibrium as the principal evolutionary trajectory. In a subset of cases, the consequence of such errors is the simultaneous, rather than sequential, knockout of canonical preneoplastic genetic drivers that are likely to set off invasive cancer growth. These findings challenge the current progression model of pancreatic cancer and provide insights into the mutational processes that give rise to these aggressive tumours.
Subject(s)
Carcinogenesis/genetics , Carcinogenesis/pathology , Gene Rearrangement/genetics , Genome, Human/genetics , Models, Biological , Mutagenesis/genetics , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology , Carcinoma in Situ/genetics , Chromothripsis , DNA Copy Number Variations/genetics , Disease Progression , Evolution, Molecular , Female , Genes, Neoplasm/genetics , Humans , Male , Mitosis/genetics , Mutation/genetics , Neoplasm Invasiveness/genetics , Neoplasm Invasiveness/pathology , Neoplasm Metastasis/genetics , Neoplasm Metastasis/pathology , Polyploidy , Precancerous Conditions/genetics
ABSTRACT
The Ebola virus disease epidemic in West Africa is the largest on record, responsible for over 28,599 cases and more than 11,299 deaths. Genome sequencing in viral outbreaks is desirable to characterize the infectious agent and determine its evolutionary rate. Genome sequencing also allows the identification of signatures of host adaptation, identification and monitoring of diagnostic targets, and characterization of responses to vaccines and treatments. The Ebola virus (EBOV) genome substitution rate in the Makona strain has been estimated at between 0.87 × 10⁻³ and 1.42 × 10⁻³ mutations per site per year. This is equivalent to 16-27 mutations in each genome, meaning that sequences diverge rapidly enough to identify distinct sub-lineages during a prolonged epidemic. Genome sequencing provides a high-resolution view of pathogen evolution and is increasingly sought after for outbreak surveillance. Sequence data may be used to guide control measures, but only if the results are generated quickly enough to inform interventions. Genomic surveillance during the epidemic has been sporadic owing to a lack of local sequencing capacity coupled with practical difficulties transporting samples to remote sequencing facilities. To address this problem, here we devise a genomic surveillance system that utilizes a novel nanopore DNA sequencing instrument. In April 2015 this system was transported in standard airline luggage to Guinea and used for real-time genomic surveillance of the ongoing epidemic. We present sequence data and analysis of 142 EBOV samples collected during the period March to October 2015. We were able to generate results less than 24 h after receiving an Ebola-positive sample, with the sequencing process taking as little as 15-60 min. We show that real-time genomic surveillance is possible in resource-limited settings and can be established rapidly to monitor outbreaks.
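The step from a per-site substitution rate to "16-27 mutations in each genome" can be checked with a short calculation. This is a sketch that assumes an EBOV genome length of about 18,959 bases (a figure not given in the abstract) and a one-year interval; the function name is hypothetical.

```python
# Assumption: approximate EBOV genome length in bases (not stated in the abstract).
EBOV_GENOME_LENGTH = 18_959


def expected_substitutions(rate_per_site_per_year, genome_length, years=1.0):
    """Expected number of substitutions across the genome over the interval."""
    return rate_per_site_per_year * genome_length * years


# Rate bounds quoted in the abstract, in mutations per site per year.
low_estimate = expected_substitutions(0.87e-3, EBOV_GENOME_LENGTH)
high_estimate = expected_substitutions(1.42e-3, EBOV_GENOME_LENGTH)

print(f"{low_estimate:.1f} to {high_estimate:.1f} substitutions per genome per year")
```

Under these assumptions the bounds come out at roughly 16.5 and 26.9 substitutions per genome per year, consistent with the 16-27 range quoted above.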
Subject(s)
Ebolavirus/genetics , Epidemiological Monitoring , Genome, Viral/genetics , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/virology , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods , Aircraft , Disease Outbreaks/statistics & numerical data , Ebolavirus/classification , Ebolavirus/pathogenicity , Guinea/epidemiology , Humans , Mutagenesis/genetics , Mutation Rate , Time Factors
ABSTRACT
BACKGROUND: Nanopore sequencing enables portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these outcomes requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. However, comparing raw nanopore signals to a biological reference sequence is a computationally complex task. The dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) is a crucial step in polishing sequencing data and identifying non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. RESULTS: By optimising memory, computations and load balancing between CPU and GPU, we demonstrate how f5c can perform ~3-5× faster than an optimised version of the original CPU-only implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. CONCLUSIONS: Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with GPU optimised ABEA is available at https://github.com/hasindu2008/f5c .
Subject(s)
Computer Graphics , Nanopores , Signal Processing, Computer-Assisted , Algorithms , Computational Biology , Databases as Topic , Genome, Human , Humans , Sequence Analysis
ABSTRACT
We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, finding that 3.2 Mbp with population support was lost in the transition from GRCh37 to GRCh38, while 13.7 Mbp was added. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
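How a BWT with an FM-index answers "how many times does this k-mer occur?" can be sketched in a few lines. The example below is a toy illustration on one short string, not the population BWT implementation: it builds the transform by sorting rotations (impractical at scale), and the names `bwt`, `build_fm_index` and `count_occurrences` are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-scale only)."""
    text += "$"  # unique terminator that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_string):
    """Build the C table and occurrence table used by backward search."""
    counts = Counter(bwt_string)
    # c_table[s]: number of symbols in the text strictly smaller than s.
    c_table = {}
    total = 0
    for symbol in sorted(counts):
        c_table[symbol] = total
        total += counts[symbol]
    # occ[s][i]: number of occurrences of s in bwt_string[:i].
    occ = {symbol: [0] * (len(bwt_string) + 1) for symbol in counts}
    for i, ch in enumerate(bwt_string):
        for symbol in counts:
            occ[symbol][i + 1] = occ[symbol][i] + (1 if ch == symbol else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Count exact matches of pattern using FM-index backward search."""
    lo, hi = 0, n  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


toy_sequence = "ACGTACGTTACG"  # invented sequence, stands in for read data
transformed = bwt(toy_sequence)
c_table, occ = build_fm_index(transformed)
acg_count = count_occurrences("ACG", c_table, occ, len(transformed))
```

The same backward search works for a 31-mer query; only the pattern length changes, and the cost depends on the pattern length rather than on the size of the indexed text.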
Subject(s)
Genome, Human/genetics , Genomics , Sequence Alignment/methods , Whole Genome Sequencing/methods , Alleles , Data Compression , Genotype , Humans , INDEL Mutation/genetics , Sequence Analysis, DNA , Software
ABSTRACT
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
Subject(s)
Contig Mapping/methods , Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Software , Contig Mapping/standards , Genomics/standards , Haploidy , Haplotypes , Humans , Polymorphism, Genetic , Reference Standards , Sequence Analysis, DNA/standards
ABSTRACT
In nanopore sequencing devices, electrolytic current signals are sensitive to base modifications, such as 5-methylcytosine (5-mC). Here we quantified the strength of this effect for the Oxford Nanopore Technologies MinION sequencer. By using synthetically methylated DNA, we were able to train a hidden Markov model to distinguish 5-mC from unmethylated cytosine. We applied our method to sequence the methylome of human DNA, without requiring special steps for library preparation.
Subject(s)
5-Methylcytosine/analysis , Cytosine/metabolism , DNA Methylation , Genome, Human , Cell Line, Tumor , CpG Islands , Cytosine/analysis , Escherichia coli/genetics , Humans , Markov Chains , Nanopores
ABSTRACT
The current genomic revolution was made possible by joint advances in genome sequencing technologies and computational approaches for analyzing sequence data. The close interaction between biologists and computational scientists is perhaps most apparent in the development of approaches for sequencing entire genomes, a feat that would not be possible without sophisticated computational tools called genome assemblers (short for genome sequence assemblers). Here, we survey the key developments in algorithms for assembling genome sequences since the development of the first DNA sequencing methods more than 35 years ago.
Subject(s)
Algorithms , Genomics/methods , Sequence Analysis, DNA/methods , Chromosomes, Artificial, Bacterial , Cloning, Molecular , Computer Graphics , Genome , Humans
ABSTRACT
We have assembled de novo the Escherichia coli K-12 MG1655 chromosome in a single 4.6-Mb contig using only nanopore data. Our method has three stages: (i) overlaps are detected between reads and then corrected by a multiple-alignment process; (ii) corrected reads are assembled using the Celera Assembler; and (iii) the assembly is polished using a probabilistic model of the signal-level data. The assembly reconstructs gene order and has 99.5% nucleotide identity.
Subject(s)
Computational Biology/methods , Escherichia coli K12/genetics , Genome, Bacterial , Nanopores , Nanotechnology/methods , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Reproducibility of Results , Software
ABSTRACT
MOTIVATION: The highly portable Oxford Nanopore MinION sequencer has enabled new applications of genome sequencing directly in the field. However, the MinION currently relies on a cloud computing platform, Metrichor (metrichor.com), for translating locally generated sequencing data into basecalls. RESULTS: To allow offline and private analysis of MinION data, we created Nanocall. Nanocall is the first freely available, open-source basecaller for Oxford Nanopore sequencing data and does not require an internet connection. Using R7.3 chemistry, on two E. coli and two human samples, with natural as well as PCR-amplified DNA, Nanocall reads have ~68% identity, directly comparable to Metrichor '1D' data. Further, Nanocall is efficient, processing ~2500 Kbp of sequence per core hour using the fastest settings, and fully parallelized. Using a 4 core desktop computer, Nanocall could basecall a MinION sequencing run in real time. Metrichor provides the ability to integrate the '1D' sequencing of template and complement strands of a single DNA molecule, and create a '2D' read. Nanocall does not currently integrate this technology, and addition of this capability will be an important future development. In summary, Nanocall is the first open-source, freely available, off-line basecaller for Oxford Nanopore sequencing data. AVAILABILITY AND IMPLEMENTATION: Nanocall is available at github.com/mateidavid/nanocall, released under the MIT license. CONTACT: matei.david@oicr.on.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
DNA/analysis , Sequence Analysis, DNA/methods , Software , Escherichia coli/genetics , Humans , Polymerase Chain Reaction
ABSTRACT
Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Subject(s)
Evolution, Molecular , Genetic Speciation , Genome/genetics , Gorilla gorilla/genetics , Animals , Female , Gene Expression Regulation , Genetic Variation/genetics , Genomics , Humans , Macaca mulatta/genetics , Molecular Sequence Data , Pan troglodytes/genetics , Phylogeny , Pongo/genetics , Proteins/genetics , Sequence Alignment , Species Specificity , Transcription, Genetic
ABSTRACT
The question of how genetic variation in a population influences phenotypic variation and evolution is of major importance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metabolism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies.
Subject(s)
Genes, Fungal , Saccharomyces cerevisiae/genetics , Arsenites/pharmacology , DNA Copy Number Variations , Drug Resistance, Fungal/genetics , Evolution, Molecular , Genetic Linkage , Genetic Speciation , Genome, Fungal , Molecular Sequence Annotation , Multigene Family , Phylogeny , Polymorphism, Single Nucleotide , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/growth & development , Sequence Analysis, DNA , Sodium Compounds/pharmacology
ABSTRACT
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
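The overlap-based model mentioned above treats reads as graph nodes and suffix-prefix overlaps as edges. The sketch below finds those overlaps with a naive pairwise comparison, which is a swapped-in simplification: SGA computes overlaps with the FM-index, and this toy version neither does that nor removes transitive edges. The reads and the function names are invented for illustration.

```python
def longest_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_length exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_edges(reads, min_length=3):
    """Naive all-pairs overlap graph as (source, target, overlap) tuples."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = longest_overlap(a, b, min_length)
            if length > 0:
                edges.append((i, j, length))
    return edges


# Invented error-free reads that tile a short sequence left to right.
toy_reads = ["ACGTAC", "TACGGA", "GGATTC"]
toy_edges = overlap_edges(toy_reads)
```

The all-pairs loop is quadratic in the number of reads, which is the cost an indexed approach is designed to avoid on billions of reads.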
Subject(s)
Genomics/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Computational Biology/methods , Data Compression , Humans , Internet , Reproducibility of Results
ABSTRACT
MOTIVATION: The de novo assembly of large, complex genomes is a significant challenge with currently available DNA sequencing technology. While many de novo assembly software packages are available, comparatively little attention has been paid to assisting the user with the assembly. RESULTS: This article addresses the practical aspects of de novo assembly by introducing new ways to perform quality assessment on a collection of sequence reads. The software implementation calculates per-base error rates, paired-end fragment-size distributions and coverage metrics in the absence of a reference genome. Additionally, the software will estimate characteristics of the sequenced genome, such as repeat content and heterozygosity that are key determinants of assembly difficulty.