ABSTRACT
The predominant methodology for DNA methylation analysis relies on the chemical deamination by sodium bisulfite of unmodified cytosine to uracil to permit the differential readout of methylated cytosines. Bisulfite treatment damages the DNA, leading to fragmentation and loss of long-range methylation information. To overcome this limitation of bisulfite-treated DNA, we applied a new enzymatic deamination approach, termed enzymatic methyl-seq (EM-seq), to long-range sequencing technologies. Our methodology, named long-read enzymatic modification sequencing (LR-EM-seq), preserves the integrity of DNA, allowing long-range methylation profiling of 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) over multikilobase lengths of genomic DNA. When applied to known differentially methylated regions (DMRs), LR-EM-seq achieves phasing of >5 kb, resulting in broader and better defined DMRs than those previously reported. This result shows the importance of phasing methylation for biologically relevant questions and the applicability of LR-EM-seq for long-range epigenetic analysis at single-molecule and single-nucleotide resolution.
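The deamination readout described above can be illustrated with a minimal sketch. It assumes a read already aligned base-for-base to the reference on the same strand; the function name `call_methylation` and the toy sequences are hypothetical and are not taken from the LR-EM-seq pipeline. A reference cytosine that still reads as C was protected from deamination (5mC or 5hmC), whereas one that reads as T was unmodified and converted to uracil.

```python
def call_methylation(reference: str, read: str) -> list[tuple[int, bool]]:
    """Return (position, is_modified) for each reference cytosine covered by the read.

    Assumes the read is aligned to the reference with no gaps (illustrative only).
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # only cytosines carry the methylation signal
        if read_base == "C":
            calls.append((pos, True))   # protected from deamination: modified
        elif read_base == "T":
            calls.append((pos, False))  # deaminated to U, read as T: unmodified
        # any other base is treated as a sequencing error or variant and skipped
    return calls


# Toy example: one molecule, three reference cytosines.
reference = "ACGTCCGA"
read = "ACGTTCGA"
calls = call_methylation(reference, read)
```

Because every call comes from a single read, the list keeps the modification states of one molecule together, which is the per-molecule information that phasing relies on.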
ABSTRACT
RAre DAmage and Repair sequencing (RADAR-seq) is a highly adaptable sequencing method that enables the identification and detection of rare DNA damage events for a wide variety of DNA lesions at single-molecule resolution on a genome-wide scale. In RADAR-seq, DNA lesions are replaced with a patch of modified bases that can be directly detected by Pacific Biosciences Single Molecule Real-Time (SMRT) sequencing. RADAR-seq enables dynamic detection over a wide range of DNA damage frequencies, including low physiological levels. Furthermore, without the need for DNA amplification and enrichment steps, RADAR-seq provides sequencing coverage of damaged and undamaged DNA across an entire genome. Here, we use RADAR-seq to measure the frequency and map the location of ribonucleotides in wild-type and RNaseH2-deficient Escherichia coli (E. coli) and Thermococcus kodakarensis strains. Additionally, by tracking ribonucleotides incorporated during in vivo lagging strand DNA synthesis, we determined the replication initiation point in E. coli and its relation to the origin of replication (oriC). RADAR-seq was also used to map cyclobutane pyrimidine dimers (CPDs) in E. coli genomic DNA exposed to UV radiation. On a broader scale, RADAR-seq can be applied to understand formation and repair of DNA damage, the correlation between DNA damage and disease initiation and progression, and complex biological pathways, including DNA replication.
Subject(s)
DNA Damage , DNA Repair , Genome, Archaeal , Genome, Bacterial , Mutagenicity Tests/methods , Sequence Analysis, DNA/methods , DNA Replication , DNA, Archaeal , DNA, Bacterial/radiation effects , Escherichia coli/genetics , Escherichia coli/radiation effects , High-Throughput Nucleotide Sequencing/methods , Pyrimidine Dimers , Ribonucleotides , Thermococcus/genetics , Ultraviolet Rays
ABSTRACT
The use of next-generation sequencing (NGS) has been instrumental in advancing biological research and clinical diagnostics. To fully utilize the power of NGS, complete, uniform coverage of the entire genome is required. In this study, we identified the primary sources of bias observed in sequence coverage across AT-rich regions of the human genome with existing amplification-free DNA library preparation methods. We have found evidence that a major source of bias is the inefficient processing of AT-rich DNA in end repair and 3' A-tailing, causing under-representation of extremely AT-rich regions. We have employed immobilized DNA modifying enzymes to catalyze end repair and 3' A-tailing reactions, to notably reduce the GC bias observed with existing library construction methods.
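As a rough illustration of how extremely AT-rich regions can be flagged before comparing their coverage, the following sketch computes GC fraction in non-overlapping windows. The function names, the window size, and the `max_gc` threshold are hypothetical choices for this example and are not parameters from the study.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the unambiguous bases (A, C, G, T) in seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def at_rich_windows(seq: str, window: int = 10, max_gc: float = 0.2):
    """Return (start, gc) for non-overlapping windows whose GC fraction is <= max_gc."""
    hits = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc <= max_gc:
            hits.append((start, gc))
    return hits


# Toy sequence: an AT-only window, a GC-rich window, and a mostly-AT window.
sequence = "ATATATATAT" + "GCGCGCATAT" + "AATTAATTAC"
hits = at_rich_windows(sequence, window=10, max_gc=0.2)
```

In practice the flagged windows would be intersected with per-base read depth to check whether AT-rich intervals are under-represented relative to the genome-wide mean.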
Subject(s)
DNA Repair Enzymes/metabolism , DNA Repair , DNA/metabolism , Genome, Human , Base Composition , DNA/chemistry , DNA Repair Enzymes/chemistry , Enzymes, Immobilized/chemistry , Enzymes, Immobilized/metabolism , Gene Library , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Temperature
ABSTRACT
Following the Comment of Stewart et al., we repeated our analysis on sequencing runs from The Cancer Genome Atlas (TCGA) using their suggested parameters. We found signs of oxidative damage in all sequence contexts and irrespective of the sequencing date, reaffirming that DNA damage impairs the ability of mutation-calling pipelines to accurately identify somatic variations.
Subject(s)
DNA Damage , Software , Databases, Genetic , Genome , High-Throughput Nucleotide Sequencing , Mutation , Sequence Analysis, DNA
ABSTRACT
Mutations in somatic cells generate a heterogeneous genomic population and may result in serious medical conditions. Although cancer is typically associated with somatic variations, advances in DNA sequencing indicate that cell-specific variants affect a number of phenotypes and pathologies. Here, we show that mutagenic damage accounts for the majority of the erroneous identification of variants with low to moderate (1 to 5%) frequency. More important, we found signatures of damage in most sequencing data sets in widely used resources, including the 1000 Genomes Project and The Cancer Genome Atlas, establishing damage as a pervasive cause of sequencing errors. The extent of this damage directly confounds the determination of somatic variants in these data sets.
Subject(s)
Artifacts , DNA Damage , DNA Mutational Analysis/standards , Genetic Variation , Neoplasms/genetics , DNA Mutational Analysis/statistics & numerical data , Human Genome Project , Humans , Mutation
ABSTRACT
We combined functional information such as protein-protein interactions or metabolic networks with genome information in Saccharomyces cerevisiae to predict cis-regulatory motifs in the upstream region of genes. We developed a new scoring metric combining these two information sources and used this metric in motif discovery. To estimate the statistical significance of this metric, we used brute-force randomization, which shows a consistent well-behaved trend. In contrast, real data showed complex nonrandom behavior. With conservative parameters we were able to find 42 degenerate motifs (that touch 40% of yeast genes) based on 647 original patterns, five of which are well known. Some of these motifs also show limited spatial position in the promoter, indicative of a true motif. We also tested the metric on other known motifs and show that this metric is a good discriminator of real motifs. In addition to providing a pragmatic motif discovery method with many applications beyond this work, these results show that interacting proteins are often coordinated at the level of transcription, even in the absence of obvious coregulation in gene expression data sets.
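The idea of estimating significance by brute-force randomization can be sketched as follows. This is a generic empirical p-value calculation under stated assumptions, not the scoring metric of the study: `pair_score` is a hypothetical stand-in that counts interacting gene pairs in which both genes carry a candidate motif, and the gene names and interactions are toy data.

```python
import random


def pair_score(genes_with_motif: set, interactions: list) -> int:
    """Hypothetical metric: number of interacting pairs where both genes carry the motif."""
    return sum(1 for a, b in interactions
               if a in genes_with_motif and b in genes_with_motif)


def empirical_p_value(genes_with_motif, all_genes, interactions,
                      n_random=1000, seed=0):
    """Compare the observed score with scores of random gene sets of the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = pair_score(genes_with_motif, interactions)
    at_least_as_high = 0
    for _ in range(n_random):
        sample = set(rng.sample(all_genes, len(genes_with_motif)))
        if pair_score(sample, interactions) >= observed:
            at_least_as_high += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (at_least_as_high + 1) / (n_random + 1)


# Toy data: 20 genes, a few interactions, and a motif found upstream of three genes.
all_genes = [f"g{i}" for i in range(20)]
interactions = [("g0", "g1"), ("g1", "g2"), ("g0", "g2"),
                ("g5", "g9"), ("g12", "g17")]
genes_with_motif = {"g0", "g1", "g2"}
observed, p_value = empirical_p_value(genes_with_motif, all_genes, interactions)
```

Sampling random gene sets of the same size is only one possible null model; shuffling the interaction network or the upstream sequences would test different assumptions.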