ABSTRACT
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Subject(s)
Chromosome Mapping , Diploidy , Human Genome , Genomics , Humans , Chromosome Mapping/standards , Human Genome/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , DNA Sequence Analysis/methods , DNA Sequence Analysis/standards , Reference Standards , Genomics/methods , Genomics/standards , Human Chromosomes/genetics , Genetic Variation/genetics
ABSTRACT
Egg-laying mammals (monotremes) are the only extant mammalian outgroup to therians (marsupial and eutherian animals) and provide key insights into mammalian evolution [1,2]. Here we generate and analyse reference genomes of the platypus (Ornithorhynchus anatinus) and echidna (Tachyglossus aculeatus), which represent the only two extant monotreme lineages. The nearly complete platypus genome assembly has anchored almost the entire genome onto chromosomes, markedly improving the genome continuity and gene annotation. Together with our echidna sequence, the genomes of the two species allow us to detect the ancestral and lineage-specific genomic changes that shape both monotreme and mammalian evolution. We provide evidence that the monotreme sex chromosome complex originated from an ancestral chromosome ring configuration. The formation of such a unique chromosome complex may have been facilitated by the unusually extensive interactions between the multi-X and multi-Y chromosomes that are shared by the autosomal homologues in humans. Further comparative genomic analyses unravel marked differences between monotremes and therians in haptoglobin genes, lactation genes and chemosensory receptor genes for smell and taste that underlie the ecological adaptation of monotremes.
Subject(s)
Biological Evolution , Genome , Platypus/genetics , Tachyglossidae/genetics , Animals , Female , Male , Mammals/genetics , Phylogeny , Sex Chromosomes/genetics
ABSTRACT
Decrypting the rearrangements that drive mammalian chromosome evolution is critical to understanding the molecular bases of speciation, adaptation, and disease susceptibility. Using 8 scaffolded and 26 chromosome-scale genome assemblies representing 23/26 mammal orders, we computationally reconstructed ancestral karyotypes and syntenic relationships at 16 nodes along the mammalian phylogeny. Three different reference genomes (human, sloth, and cattle) representing phylogenetically distinct mammalian superorders were used to assess reference bias in the reconstructed ancestral karyotypes and to expand the number of clades with reconstructed genomes. The mammalian ancestor likely had 19 pairs of autosomes, with nine of the smallest chromosomes shared with the common ancestor of all amniotes (three still conserved in extant mammals), demonstrating a striking conservation of synteny for ~320 My of vertebrate evolution. The numbers and types of chromosome rearrangements were classified for transitions between the ancestral mammalian karyotype, descendent ancestors, and extant species. For example, 94 inversions, 16 fissions, and 14 fusions that occurred over 53 My differentiated the therian from the descendent eutherian ancestor. The highest breakpoint rate was observed between the mammalian and therian ancestors (3.9 breakpoints/My). Reconstructed mammalian ancestor chromosomes were found to have distinct evolutionary histories reflected in their rates and types of rearrangements. The distributions of genes, repetitive elements, topologically associating domains, and actively transcribed regions in multispecies homologous synteny blocks and evolutionary breakpoint regions indicate that purifying selection acted over millions of years of vertebrate evolution to maintain syntenic relationships of developmentally important genes and regulatory landscapes of gene-dense chromosomes.
Subject(s)
Molecular Evolution , Karyotype , Mammals , Synteny , Animals , Cattle/genetics , Mammalian Chromosomes/genetics , Eutheria/genetics , Humans , Mammals/genetics , Phylogeny , Sloths/genetics , Synteny/genetics
ABSTRACT
To determine the phase of NUDT15 sequence variants for more comprehensive star (*) allele diplotyping, we developed a novel long-read single-molecule real-time HiFi amplicon sequencing method. A 10.5 kb NUDT15 amplicon assay was validated using reference material positive controls and additional samples for specimen type and blinded accuracy assessment. Triplicate NUDT15 HiFi sequencing of two reference material samples had nonreference genotype concordances of >99.9%, indicating that the assay is robust. Notably, short-read genome sequencing of a subset of samples was unable to determine the phase of star (*) allele-defining NUDT15 variants, resulting in ambiguous diplotype results. In contrast, long-read HiFi sequencing phased all variants across the NUDT15 amplicons, including a *2/*9 diplotype that previously was characterized as *1/*2 in the 1000 Genomes Project v3 data set. Assay throughput was also tested using 8.5 kb amplicons from 100 Ashkenazi Jewish individuals, which identified a novel NUDT15 *1 suballele (c.-121G>A) and a rare likely deleterious coding variant (p.Pro129Arg). Both novel alleles were Sanger confirmed and assigned as *1.007 and *20, respectively, by the PharmVar Consortium. Taken together, NUDT15 HiFi amplicon sequencing is an innovative method for phased full-gene characterization and novel allele discovery, which could improve NUDT15 pharmacogenomic testing and subsequent phenotype prediction.
Subject(s)
Pharmacogenetics , Alleles , Genotype , Haplotypes , Humans , DNA Sequence Analysis/methods
ABSTRACT
PURPOSE: This study aimed to provide comprehensive diagnostic and candidate analyses in a pediatric rare disease cohort through the Genomic Answers for Kids program. METHODS: Extensive analyses of 960 families with suspected genetic disorders included short-read exome sequencing and short-read genome sequencing (srGS); PacBio HiFi long-read genome sequencing (HiFi-GS); variant calling for single nucleotide variants (SNVs), structural variants (SVs), and repeat variants; and machine-learning variant prioritization. Structured phenotypes, prioritized variants, and pedigrees were stored in the PhenoTips database, with controlled-access data sharing through the database of Genotypes and Phenotypes. RESULTS: Diagnostic rates ranged from 11% in patients with prior negative genetic testing to 34.5% in naive patients. Incorporating SVs from genome sequencing added up to 13% of new diagnoses in previously unsolved cases. HiFi-GS yielded an increased discovery rate, with >4-fold more rare coding SVs compared with srGS. Variants and genes of unknown significance remain the most common finding (58% of nondiagnostic cases). CONCLUSION: Computational prioritization is efficient for diagnostic SNVs. Thorough identification of non-SNVs remains challenging and is partly mitigated by using HiFi-GS. Importantly, community research is supported by sharing real-time data to accelerate gene validation and by providing HiFi variant (SNV/SV) resources from >1000 human alleles to facilitate implementation of new sequencing platforms for rare disease diagnoses.
Subject(s)
Genomics , Rare Diseases , Child , Genome , High-Throughput Nucleotide Sequencing , Humans , Pedigree , Rare Diseases/diagnosis , Rare Diseases/genetics , DNA Sequence Analysis
ABSTRACT
Currently, protein-coding de novo variants and large copy number variants have been identified as important for ~30% of individuals with autism. One approach to identify relevant variation in individuals who lack these types of events is by utilizing newer genomic technologies. In this study, highly accurate PacBio HiFi long-read sequencing was applied to a family with autism, epileptic encephalopathy, cognitive impairment, and mild dysmorphic features (two affected female siblings, unaffected parents, and one unaffected male sibling) with no known clinical variant. From our long-read sequencing data, a de novo missense variant in the KCNC2 gene (encodes Kv3.2) was identified in both affected children. This variant was phased to the paternal chromosome of origin and is likely a germline mosaic. In silico assessment revealed the variant was not in controls, highly conserved, and predicted damaging. This specific missense variant (Val473Ala) has been shown in both an ortholog and paralog of Kv3.2 to accelerate current decay, shift the voltage dependence of activation, and prevent the channel from entering a long-lasting open state. Seven additional missense variants have been identified in other individuals with neurodevelopmental disorders (p = 1.03 × 10⁻⁵). KCNC2 is most highly expressed in the brain, in particular in the thalamus, and is enriched in GABAergic neurons. Long-read sequencing was useful in discovering the relevant variant in this family with autism that had remained a mystery for several years and will potentially have great benefits in the clinic once it is widely available.
Subject(s)
Autistic Disorder , Epilepsy , Shaw Potassium Channels , Autistic Disorder/genetics , Child , Epilepsy/genetics , Female , Germ Cells , Humans , Male , Mosaicism , Missense Mutation , Shaw Potassium Channels/genetics
ABSTRACT
Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.
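As a point of reference for the contig, scaffold, and phased-block N50 figures quoted in this abstract, the following minimal Python sketch shows how an N50 is conventionally computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the AK1 analysis.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one sequence length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Toy example with hypothetical contig lengths (arbitrary units).
print(n50([10, 8, 6, 4, 2]))  # 8
```

The same function applies unchanged to contigs, scaffolds, or phased blocks; only the input list differs.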
Subject(s)
Asian People/genetics , Contig Mapping , Human Genome/genetics , Genomics , Haplotypes/genetics , DNA Sequence Analysis , Alleles , Bacterial Artificial Chromosomes/genetics , Cytochrome P-450 CYP2D6/genetics , Diploidy , Genetic Variation/genetics , Histocompatibility Antigens Class II/genetics , Humans , Precision Medicine , Reference Standards , Republic of Korea
ABSTRACT
The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
Subject(s)
Genetic Variation/genetics , Human Genome/genetics , Genomics , DNA Sequence Analysis/methods , Chromosome Inversion/genetics , Human Chromosomes Pair 10/genetics , Molecular Cloning , GC Rich Sequence/genetics , Haploidy , Humans , Insertional Mutagenesis/genetics , Reference Standards , Tandem Repeat Sequences/genetics
ABSTRACT
In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels, resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
Subject(s)
Contig Mapping/methods , Human Genome , Genomic Structural Variation , Haploidy , DNA Sequence Analysis/methods , Contig Mapping/standards , Human Genome Project , Humans , DNA Sequence Analysis/standards
ABSTRACT
We have recently shown that nanopore zero-mode waveguides are effective tools for capturing picogram levels of long DNA fragments for single-molecule DNA sequencing. Despite these key advantages, the manufacturing of large arrays is not practical due to the need for serial nanopore fabrication. To overcome this challenge, we have developed an approach for the wafer-scale fabrication of waveguide arrays on low-cost porous membranes, which are deposited using molecular-layer deposition. The membrane at each waveguide base contains a network of serpentine pores that allows for efficient electrophoretic DNA capture at picogram levels while eliminating the need for prohibitive serial pore milling. Here, we show that the loading efficiency of these porous waveguides is up to 2 orders of magnitude greater than their nanopore predecessors. This new device facilitates the scaling-up of the process, greatly reducing the cost and effort of manufacturing. Furthermore, the porous zero-mode waveguides can be used for applications that benefit from low-input single-molecule real-time sequencing.
Subject(s)
DNA/chemistry , Nanopores/ultrastructure , Nanotechnology/instrumentation , DNA Sequence Analysis/instrumentation , Serpentine Asbestos/chemistry , DNA/isolation & purification , Equipment Design , Artificial Membranes , Porosity
ABSTRACT
BACKGROUND & AIMS: Helicobacter pylori is remarkable for its genetic variation; yet, little is known about its genetic changes during early stages of human infection, as the bacteria adapt to their new environment. We analyzed genome and methylome variations in a fully virulent strain of H pylori during experimental infection. METHODS: We performed a randomized Phase I/II, observer-blind, placebo-controlled study of 12 healthy, H pylori-negative adults in Germany from October 2008 through March 2010. The volunteers were given a prophylactic vaccine candidate (n = 7) or placebo (n = 5) and then challenged with H pylori strain BCM-300. Biopsy samples were collected and H pylori were isolated. Genomes of the challenge strain and 12 reisolates, obtained 12 weeks after (or in 1 case, 62 weeks after) infection were sequenced by single-molecule, real-time technology, which, in parallel, permitted determination of genome-wide methylation patterns for all strains. Functional effects of genetic changes observed in H pylori strains during human infection were assessed by measuring release of interleukin 8 from AGS cells (to detect cag pathogenicity island function), neutral red uptake (to detect vacuolating cytotoxin activity), and adhesion assays. RESULTS: The observed mutation rate was in agreement with rates previously determined from patients with chronic H pylori infections, without evidence of a mutation burst. A loss of cag pathogenicity island function was observed in 3 reisolates. In addition, 3 reisolates from the vaccine group acquired mutations in the vacuolating cytotoxin gene vacA, resulting in loss of vacuolization activity. We observed interstrain variation in methylomes due to phase variation in genes encoding methyltransferases. CONCLUSIONS: We analyzed adaptation of a fully virulent strain of H pylori to 12 different volunteers to obtain a robust estimate of the frequency of genetic and epigenetic changes in the absence of interstrain recombination. 
Our findings indicate that the large amount of genetic variation in H pylori poses a challenge to vaccine development. ClinicalTrials.gov no: NCT00736476.
Subject(s)
DNA Methylation , Genetic Epigenesis , Bacterial Genome , Genomic Islands , Helicobacter Infections/microbiology , Helicobacter pylori/genetics , Bacterial Antigens/administration & dosage , Bacterial Antigens/genetics , Bacterial Antigens/immunology , Bacterial Adhesion , Bacterial Proteins/administration & dosage , Bacterial Proteins/genetics , Bacterial Proteins/immunology , Bacterial Vaccines/administration & dosage , Bacterial Vaccines/genetics , Bacterial Vaccines/immunology , Biopsy , Bacterial Gene Expression Regulation , Genotype , Germany , Helicobacter Infections/diagnosis , Helicobacter Infections/immunology , Helicobacter Infections/prevention & control , Helicobacter pylori/immunology , Helicobacter pylori/pathogenicity , Host-Pathogen Interactions , Humans , Interleukin-8/immunology , Interleukin-8/metabolism , Mutation , Phenotype , Single Nucleotide Polymorphism , Time Factors , Virulence
ABSTRACT
DNA methylation acts in concert with restriction enzymes to protect the integrity of prokaryotic genomes. Studies in a limited number of organisms suggest that methylation also contributes to prokaryotic genome regulation, but the prevalence and properties of such non-restriction-associated methylation systems remain poorly understood. Here, we used single molecule, real-time sequencing to map DNA modifications including m6A, m4C, and m5C across the genomes of 230 diverse bacterial and archaeal species. We observed DNA methylation in nearly all (93%) organisms examined, and identified a total of 834 distinct reproducibly methylated motifs. These data enabled annotation of the DNA binding specificities of 620 DNA methyltransferases (MTases), doubling known specificities for previously hard to study Type I, IIG and III MTases, and revealing their extraordinary diversity. Strikingly, 48% of organisms harbor active Type II MTases with no apparent cognate restriction enzyme. These active 'orphan' MTases are present in diverse bacterial and archaeal phyla and show motif specificities and methylation patterns consistent with functions in gene regulation and DNA replication. Our results reveal the pervasive presence of DNA methylation throughout the prokaryotic kingdoms, as well as the diversity of sequence specificities and potential functions of DNA methylation systems.
Subject(s)
Epigenomics , Prokaryotic Cells/metabolism , Conserved Sequence , DNA Methylation/genetics , DNA Replication/genetics , DNA Restriction-Modification Enzymes/classification , DNA Restriction-Modification Enzymes/metabolism , Molecular Evolution , Gene Expression Regulation , Genome , Methyltransferases/metabolism , Molecular Sequence Annotation , Multigene Family , Nucleotide Motifs/genetics , Phylogeny , Substrate Specificity
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pgen.1005954.].
ABSTRACT
We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of the L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades, with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
Subject(s)
Bass/genetics , Chromosome Mapping , Animals , Bass/classification , Genome , Fluorescence In Situ Hybridization , Phylogeny
ABSTRACT
PURPOSE: Current clinical genomics assays primarily utilize short-read sequencing (SRS), but SRS has limited ability to evaluate repetitive regions and structural variants. Long-read sequencing (LRS) has complementary strengths, and we aimed to determine whether LRS could offer a means to identify overlooked genetic variation in patients undiagnosed by SRS. METHODS: We performed low-coverage genome LRS to identify structural variants in a patient who presented with multiple neoplasia and cardiac myxomata, in whom the results of targeted clinical testing and genome SRS were negative. RESULTS: This LRS approach yielded 6,971 deletions and 6,821 insertions > 50 bp. Filtering for variants that are absent in an unrelated control and overlap a disease gene coding exon identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. RNA sequencing demonstrated decreased PRKAR1A expression. The deletion was classified as pathogenic based on guidelines for interpretation of sequence variants. CONCLUSION: This first successful application of genome LRS to identify a pathogenic variant in a patient suggests that LRS has significant potential for the identification of disease-causing structural variation. Larger studies will ultimately be required to evaluate the potential clinical utility of LRS.
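The filtering step described in the results (keep variants > 50 bp that are absent from an unrelated control and overlap a disease-gene coding exon) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `SV` and `Exon` record types, the rule that a control call "matches" when it has the same type and an overlapping reference span, and all coordinates and gene labels are hypothetical.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based start on the reference
    end: int    # exclusive end (start + 1 for an insertion site)
    kind: str   # "DEL" or "INS"
    size: int   # variant length in bp


class Exon(NamedTuple):
    chrom: str
    start: int
    end: int
    gene: str   # hypothetical disease-gene label


def spans_overlap(a, b) -> bool:
    """True if two records on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def candidate_svs(patient: List[SV], control: List[SV],
                  exons: List[Exon], min_size: int = 50) -> List[SV]:
    kept = []
    for sv in patient:
        if sv.size <= min_size:
            continue  # keep only variants larger than the size cutoff
        # Assumption: "present in control" means same type, overlapping span.
        if any(c.kind == sv.kind and spans_overlap(sv, c) for c in control):
            continue
        if any(spans_overlap(sv, exon) for exon in exons):
            kept.append(sv)
    return kept


# Hypothetical toy data; coordinates are invented for illustration.
patient = [
    SV("chr17", 1000, 3184, "DEL", 2184),  # exon overlap, not in control
    SV("chr1", 500, 700, "DEL", 200),      # also seen in control
    SV("chr2", 100, 101, "INS", 300),      # no exon overlap
    SV("chr17", 1100, 1130, "DEL", 30),    # below the size cutoff
]
control = [SV("chr1", 550, 720, "DEL", 170)]
exons = [Exon("chr17", 2000, 2200, "GENE_A"), Exon("chr1", 600, 650, "GENE_B")]
hits = candidate_svs(patient, control, exons)
```

A production workflow would typically use reciprocal-overlap or breakpoint-distance thresholds rather than any-overlap matching; the simpler rule is used here only to keep the sketch short.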
Subject(s)
Genetic Association Studies , Inborn Genetic Diseases/diagnosis , Inborn Genetic Diseases/genetics , Genetic Predisposition to Disease , Genetic Variation , Human Genome , Genomics , DNA Sequence Analysis , Child , Cyclic AMP-Dependent Protein Kinase RIalpha Subunit/genetics , Echocardiography , Genomics/methods , Humans , Male , Phenotype , DNA Sequence Analysis/methods , Sequence Deletion
ABSTRACT
The initiation of translation establishes the reading frame for protein synthesis and is a key point of regulation. Initiation involves factor-driven assembly at a start codon of a messenger RNA of an elongation-competent 70S ribosomal particle (in bacteria) from separated 30S and 50S subunits and initiator transfer RNA. Here we establish in Escherichia coli, using direct single-molecule tracking, the timing of initiator tRNA, initiation factor 2 (IF2; encoded by infB) and 50S subunit joining during initiation. Our results show multiple pathways to initiation, with orders of arrival of tRNA and IF2 dependent on factor concentration and composition. IF2 accelerates 50S subunit joining and stabilizes the assembled 70S complex. Transition to elongation is gated by the departure of IF2 after GTP hydrolysis, allowing efficient arrival of elongator tRNAs to the second codon presented in the aminoacyl-tRNA binding site (A site). These experiments highlight the power of single-molecule approaches to delineate mechanisms in complex multicomponent systems.
Subject(s)
Escherichia coli/genetics , Escherichia coli/metabolism , Translational Peptide Chain Initiation/physiology , Prokaryotic Initiation Factor-2/metabolism , Transfer RNA/metabolism , Bacterial Large Ribosome Subunits/metabolism , Time Factors
ABSTRACT
Obtaining high-quality sequence continuity of complex regions of recent segmental duplication remains one of the major challenges of finishing genome assemblies. In the human and mouse genomes, this was achieved by targeting large-insert clones using costly and laborious capillary-based sequencing approaches. Sanger shotgun sequencing of clone inserts, however, has now been largely abandoned, leaving most of these regions unresolved in newer genome assemblies generated primarily by next-generation sequencing hybrid approaches. Here we show that it is possible to resolve regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology from Pacific Biosciences (PacBio). We sequenced and assembled BAC clones corresponding to a 1.3-Mbp complex region of chromosome 17q21.31, demonstrating 99.994% identity to Sanger assemblies of the same clones. We targeted 44 differences using Illumina sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases. Finally, we targeted a poorly assembled 766-kbp duplicated region of the chimpanzee genome and resolved the structure and organization for a fraction of the cost and time of traditional finishing approaches. Our data suggest a straightforward path for upgrading genomes to a higher quality finished state.
Subject(s)
Human Chromosomes Pair 17/genetics , Bacterial Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Animals , Bacterial Artificial Chromosomes/genetics , Humans , Mice , Molecular Sequence Data , Pan troglodytes/genetics
ABSTRACT
MOTIVATION: Determining the methylation state of regions with high copy numbers is challenging for second-generation sequencing, because the read length is insufficient to map reads uniquely, especially when repetitive regions are long and nearly identical to each other. Single-molecule real-time (SMRT) sequencing is a promising method for observing such regions, because it is not vulnerable to GC bias, it produces long read lengths, and its kinetic information is sensitive to DNA modifications. RESULTS: We propose a novel linear-time algorithm that combines the kinetic information for neighboring CpG sites and increases the confidence in identifying the methylation states of those sites. Using a practical read coverage of ~30-fold from an inbred strain medaka (Oryzias latipes), we observed that both the sensitivity and precision of our method on individual CpG sites were ~93.7%. We also observed a high correlation coefficient (R = 0.884) between our method and bisulfite sequencing, and for 92.0% of CpG sites, methylation levels ranging over [0,1] were in concordance within an acceptable difference of 0.25. Using this method, we characterized the landscape of the methylation status of repetitive elements, such as LINEs, in the human genome, thereby revealing the strong correlation between CpG density and hypomethylation and detecting hypomethylation hot spots of LTRs and LINEs. We uncovered the methylation states for nearly identical active transposons, two novel LINE insertions of identity ~99% and length 6050 base pairs (bp) in the human genome, and 16 Tol2 elements of identity >99.8% and length 4682 bp in the medaka genome. AVAILABILITY AND IMPLEMENTATION: AgIn (Aggregate on Intervals) is available at: https://github.com/hacone/AgIn CONTACT: ysuzuki@cb.k.u-tokyo.ac.jp or moris@cb.k.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , CpG Islands , DNA Methylation , Human Genome , Humans , DNA Sequence Analysis
ABSTRACT
Listeria monocytogenes is a bacterial pathogen that is found in a wide variety of anthropogenic and natural environments. Genome sequencing technologies are rapidly becoming a powerful tool in facilitating our understanding of how genotype, classification phenotypes, and virulence phenotypes interact to predict the health risks of individual bacterial isolates. Currently, 57 closed L. monocytogenes genomes are publicly available, representing three of the four phylogenetic lineages, and they suggest that L. monocytogenes has high genomic synteny. This study contributes an additional 15 closed L. monocytogenes genomes that were used to determine the associations between the genome and methylome with host invasion magnitude. In contrast to previous findings, large chromosomal inversions and rearrangements were detected in five isolates at the chromosome terminus and within rRNA genes, including a previously undescribed inversion within rRNA-encoding regions. Each isolate's epigenome contained highly diverse methyltransferase recognition sites, even within the same serotype and methylation pattern. Eleven strains contained a single chromosomally encoded methyltransferase, one strain contained two methylation systems (one system on a plasmid), and three strains exhibited no methylation, despite the occurrence of methyltransferase genes. In three isolates a new, unknown DNA modification was observed in addition to diverse methylation patterns, accompanied by a novel methylation system. Neither chromosome rearrangement nor strain-specific patterns of epigenome modification observed within virulence genes were correlated with serotype designation, clonal complex, or in vitro infectivity. These data suggest that genome diversity is larger than previously considered in L. monocytogenes and that as more genomes are sequenced, additional structure and methylation novelty will be observed in this organism. 
IMPORTANCE: Listeria monocytogenes is the causative agent of listeriosis, a disease which manifests as gastroenteritis, meningoencephalitis, and abortion. Among Salmonella, Escherichia coli, Campylobacter, and Listeria (the causes of the most prevalent foodborne illnesses), infection by L. monocytogenes carries the highest mortality rate. The ability of L. monocytogenes to regulate its response to various harsh environments enables its persistence and transmission. Small-scale comparisons of L. monocytogenes focusing solely on genome contents reveal a highly syntenic genome yet fail to address the observed diversity in phenotypic regulation. This study provides a large-scale comparison of 302 L. monocytogenes isolates, revealing the importance of the epigenome and restriction-modification systems as major determinants of L. monocytogenes phylogenetic grouping and subsequent phenotypic expression. Further examination of virulence genes of select outbreak strains reveals an unprecedented diversity in methylation statuses despite high degrees of genome conservation.