ABSTRACT
The adoption of agriculture triggered a rapid shift towards starch-rich diets in human populations [1]. Amylase genes facilitate starch digestion, and increased amylase copy number has been observed in some modern human populations with high-starch intake [2], although evidence of recent selection is lacking [3,4]. Here, using 94 long-read haplotype-resolved assemblies and short-read data from approximately 5,600 contemporary and ancient humans, we resolve the diversity and evolutionary history of structural variation at the amylase locus. We find that amylase genes have higher copy numbers in agricultural populations than in fishing, hunting and pastoral populations. We identify 28 distinct amylase structural architectures and demonstrate that nearly identical structures have arisen recurrently on different haplotype backgrounds throughout recent human history. AMY1 and AMY2A genes each underwent multiple duplication/deletion events with mutation rates up to more than 10,000-fold the single-nucleotide polymorphism mutation rate, whereas AMY2B gene duplications share a single origin. Using a pangenome-based approach, we infer structural haplotypes across thousands of humans, identifying extensively duplicated haplotypes at higher frequency in modern agricultural populations. Leveraging 533 ancient human genomes, we find that duplication-containing haplotypes (with more gene copies than the ancestral haplotype) have rapidly increased in frequency over the past 12,000 years in West Eurasians, suggestive of positive selection. Together, our study highlights the potential effects of the agricultural revolution on human genomes and the importance of structural variation in human adaptation.
Subjects
Agriculture, Amylases, Molecular Evolution, Gene Dosage, Human Genome, Haplotypes, Genetic Selection, Humans, Agriculture/history, Agriculture/statistics & numerical data, Amylases/genetics, Amylases/chemistry, Gene Dosage/genetics, Gene Duplication/genetics, Genetic Loci/genetics, Human Genome/genetics, Haplotypes/genetics, Ancient History, Mutation Rate, Single Nucleotide Polymorphism/genetics, Hunting/statistics & numerical data, Gene Deletion, Ancient DNA/analysis
ABSTRACT
Apes possess two sex chromosomes: the male-specific Y chromosome and the X chromosome, which is present in both males and females. The Y chromosome is crucial for male reproduction, with deletions being linked to infertility [1]. The X chromosome is vital for reproduction and cognition [2]. Variation in mating patterns and brain function among apes suggests corresponding differences in their sex chromosomes. However, owing to their repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the methodology developed for the telomere-to-telomere (T2T) human genome, we produced gapless assemblies of the X and Y chromosomes for five great apes (bonobo (Pan paniscus), chimpanzee (Pan troglodytes), western lowland gorilla (Gorilla gorilla gorilla), Bornean orangutan (Pongo pygmaeus) and Sumatran orangutan (Pongo abelii)) and a lesser ape (the siamang gibbon (Symphalangus syndactylus)), and untangled the intricacies of their evolution. Compared with the X chromosomes, the ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements, owing to the accumulation of lineage-specific ampliconic regions, palindromes, transposable elements and satellites. Many Y chromosome genes expand in multi-copy families and some evolve under purifying selection. Thus, the Y chromosome exhibits dynamic evolution, whereas the X chromosome is more stable. Mapping short-read sequencing data to these assemblies revealed diversity and selection patterns on sex chromosomes of more than 100 individual great apes. These reference assemblies are expected to inform human evolution and conservation genetics of non-human apes, all of which are endangered species.
Subjects
Hominidae, X Chromosome, Y Chromosome, Animals, Female, Male, Gorilla gorilla/genetics, Hominidae/genetics, Hominidae/classification, Hylobatidae/genetics, Pan paniscus/genetics, Pan troglodytes/genetics, Phylogeny, Pongo abelii/genetics, Pongo pygmaeus/genetics, Telomere/genetics, X Chromosome/genetics, Y Chromosome/genetics, Molecular Evolution, DNA Copy Number Variations/genetics, Humans, Endangered Species, Reference Standards
ABSTRACT
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Subjects
Human Y Chromosomes, Genomics, DNA Sequence Analysis, Humans, Base Sequence, Human Y Chromosomes/genetics, Satellite DNA/genetics, Genetic Variation/genetics, Population Genetics, Genomics/methods, Genomics/standards, Heterochromatin/genetics, Multigene Family/genetics, Reference Standards, Genomic Segmental Duplications/genetics, DNA Sequence Analysis/standards, Tandem Repeat Sequences/genetics, Telomere/genetics
ABSTRACT
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Subjects
Chromosome Mapping, Diploidy, Human Genome, Genomics, Humans, Chromosome Mapping/standards, Human Genome/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, High-Throughput Nucleotide Sequencing/standards, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards, Reference Standards, Genomics/methods, Genomics/standards, Human Chromosomes/genetics, Genetic Variation/genetics
ABSTRACT
Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
Subjects
Human Genome, Genomics, Male, Humans, Major Histocompatibility Complex
ABSTRACT
Complete and accurate reference genomes and annotations provide fundamental tools for characterization of genetic and functional variation. These resources facilitate the determination of biological processes and support translation of research findings into improved and sustainable agricultural technologies. Many reference genomes for crop plants have been generated over the past decade, but these genomes are often fragmented and missing complex repeat regions. Here we report the assembly and annotation of a reference genome of maize, a genetic and agricultural model species, using single-molecule real-time sequencing and high-resolution optical mapping. Relative to the previous reference genome, our assembly features a 52-fold increase in contig length and notable improvements in the assembly of intergenic spaces and centromeres. Characterization of the repetitive portion of the genome revealed more than 130,000 intact transposable elements, allowing us to identify transposable element lineage expansions that are unique to maize. Gene annotations were updated using 111,000 full-length transcripts obtained by single-molecule real-time sequencing. In addition, comparative optical mapping of two other inbred maize lines revealed a prevalence of deletions in regions of low gene density and maize lineage-specific genes.
Subjects
Plant Genome/genetics, High-Throughput Nucleotide Sequencing/methods, Single Molecule Imaging/methods, Zea mays/genetics, Centromere/genetics, Plant Chromosomes/genetics, Contig Mapping, Agricultural Crops/genetics, DNA Transposable Elements/genetics, Intergenic DNA/genetics, Plant Genes/genetics, Molecular Sequence Annotation, Optics and Photonics, Phylogeny, Messenger RNA/analysis, Messenger RNA/genetics, Reference Standards, Sorghum/genetics
ABSTRACT
SUMMARY: Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons. AVAILABILITY AND IMPLEMENTATION: Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genomics, Software, Genome
ABSTRACT
Although the gold standard of monitoring kidney transplant function relies on glomerular filtration rate (GFR), little is known about GFR trajectories after transplantation, their determinants, and their association with outcomes. To evaluate these parameters, we examined kidney transplant recipients receiving care at 15 academic centers. Patients underwent prospective monitoring of estimated GFR (eGFR) measurements, with assessment of clinical, functional, histological and immunological parameters. Additional validation took place in seven randomized controlled trials that included a total of 14,132 patients with 403,497 eGFR measurements. After a median follow-up of 6.5 years, 1,688 patients developed end-stage kidney disease. Using unsupervised latent class mixed models, we identified eight distinct eGFR trajectories. Multinomial regression models identified seven significant determinants of eGFR trajectories including donor age, eGFR, proteinuria, and several significant histological features: graft scarring, graft interstitial inflammation and tubulitis, microcirculation inflammation, and circulating anti-HLA donor-specific antibodies. The eGFR trajectories were associated with progression to end-stage kidney disease. These trajectories, their determinants and respective associations with end-stage kidney disease were similar across cohorts, as well as in diverse clinical scenarios, therapeutic eras and in the seven randomized controlled trials. Thus, our results provide the basis for a trajectory-based assessment of kidney transplant patients for risk stratification and monitoring.
Subjects
Chronic Kidney Failure, Kidney Transplantation, Glomerular Filtration Rate, Humans, Chronic Kidney Failure/diagnosis, Chronic Kidney Failure/surgery, Kidney Transplantation/adverse effects, Prospective Studies
ABSTRACT
The SK-BR-3 cell line is one of the most important models for HER2+ breast cancers, which affect one in five breast cancer patients. SK-BR-3 is known to be highly rearranged, although much of the variation is in complex and repetitive regions that may be underreported. Addressing this, we sequenced SK-BR-3 using long-read single-molecule sequencing from Pacific Biosciences and developed one of the most detailed maps of structural variations (SVs) in a cancer genome available, with nearly 20,000 variants present, most of which were missed by short-read sequencing. Surrounding the important ERBB2 oncogene (also known as HER2), we discover a complex sequence of nested duplications and translocations, suggesting a punctuated progression. Full-length transcriptome sequencing further revealed several novel gene fusions within the nested genomic variants. Combining long-read genome and transcriptome sequencing enables an in-depth analysis of how SVs disrupt the genome and sheds new light on the complex mechanisms involved in cancer genome evolution.
Subjects
Breast Neoplasms/genetics, Gene Amplification/genetics, Gene Rearrangement/genetics, Oncogenes/genetics, Breast Neoplasms/pathology, Female, Human Genome, Genomic Structural Variation, High-Throughput Nucleotide Sequencing, Humans, MCF-7 Cells, ErbB-2 Receptor/genetics, Repetitive Nucleic Acid Sequences/genetics, Transcriptome/genetics
ABSTRACT
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
Subjects
Contig Mapping/methods, Human Genome, Genomics/methods, DNA Sequence Analysis/methods, Software, Contig Mapping/standards, Genomics/standards, Haploidy, Haplotypes, Humans, Genetic Polymorphism, Reference Standards, DNA Sequence Analysis/standards
ABSTRACT
While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
Subjects
Diploidy, Fungal Genome/genetics, Plant Genome/genetics, Genomics/methods, Single Nucleotide Polymorphism/genetics, Algorithms, Arabidopsis/genetics, Basidiomycota/genetics, Fungal DNA/genetics, Plant DNA/genetics, Haplotypes, Heterozygote, Humans, DNA Sequence Analysis, Vitis/genetics
ABSTRACT
Genomic studies have revealed significant branching heterogeneity in cancer. Studies of resistance to tyrosine kinase inhibitor therapy have not fully reflected this heterogeneity because resistance in individual patients has been ascribed to largely mutually exclusive on-target or off-target mechanisms in which tumors either retain dependency on the target oncogene or subvert it through a parallel pathway. Using targeted sequencing from single cells and colonies from patient samples, we demonstrate tremendous clonal diversity in the majority of acute myeloid leukemia (AML) patients with activating FLT3 internal tandem duplication mutations at the time of acquired resistance to the FLT3 inhibitor quizartinib. These findings establish that clinical resistance to quizartinib is highly complex and reflects the underlying clonal heterogeneity of AML.
Subjects
Benzothiazoles/administration & dosage, Antineoplastic Drug Resistance, High-Throughput Nucleotide Sequencing, INDEL Mutation, Acute Myeloid Leukemia, Phenylurea Compounds/administration & dosage, fms-Like Tyrosine Kinase 3/genetics, Antineoplastic Drug Resistance/drug effects, Antineoplastic Drug Resistance/genetics, Female, Humans, Acute Myeloid Leukemia/drug therapy, Acute Myeloid Leukemia/genetics, Male
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pgen.1005954.].
ABSTRACT
We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
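The contig and scaffold N50 values quoted above follow the usual definition: the length L such that sequences of length L or longer together cover at least half of the total assembly length. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions, not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy lengths in bases (hypothetical, for illustration only).
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about the correctness of the joins.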
Subjects
Bass/genetics, Chromosome Mapping, Animals, Bass/classification, Genome, Fluorescence In Situ Hybridization, Phylogeny
ABSTRACT
We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
Subjects
Computational Biology/methods, Human Genome, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism, Algorithms, Chromosome Mapping, Diploidy, Gene Library, Genetic Variation, Genome, Haplotypes, Humans, Nucleotides/genetics, Reproducibility of Results, DNA Sequence Analysis, Tandem Repeat Sequences
ABSTRACT
Effective targeted cancer therapeutic development depends upon distinguishing disease-associated 'driver' mutations, which have causative roles in malignancy pathogenesis, from 'passenger' mutations, which are dispensable for cancer initiation and maintenance. Translational studies of clinically active targeted therapeutics can definitively discriminate driver from passenger lesions and provide valuable insights into human cancer biology. Activating internal tandem duplication (ITD) mutations in FLT3 (FLT3-ITD) are detected in approximately 20% of acute myeloid leukaemia (AML) patients and are associated with a poor prognosis. Abundant scientific and clinical evidence, including the lack of convincing clinical activity of early FLT3 inhibitors, suggests that FLT3-ITD probably represents a passenger lesion. Here we report point mutations at three residues within the kinase domain of FLT3-ITD that confer substantial in vitro resistance to AC220 (quizartinib), an active investigational inhibitor of FLT3, KIT, PDGFRA, PDGFRB and RET; evolution of AC220-resistant substitutions at two of these amino acid positions was observed in eight of eight FLT3-ITD-positive AML patients with acquired resistance to AC220. Our findings demonstrate that FLT3-ITD can represent a driver lesion and valid therapeutic target in human AML. AC220-resistant FLT3 kinase domain mutants represent high-value targets for future FLT3 inhibitor development efforts.
Subjects
Benzothiazoles/therapeutic use, Acute Myeloid Leukemia/drug therapy, Acute Myeloid Leukemia/genetics, Molecular Targeted Therapy, Mutation/genetics, Phenylurea Compounds/therapeutic use, fms-Like Tyrosine Kinase 3/antagonists & inhibitors, fms-Like Tyrosine Kinase 3/genetics, Benzothiazoles/pharmacology, Tumor Cell Line, DNA Mutational Analysis, Antineoplastic Drug Resistance/genetics, Humans, Acute Myeloid Leukemia/metabolism, Molecular Models, Molecular Structure, Phenylurea Compounds/pharmacology, Protein Binding, Tertiary Protein Structure/genetics, Recurrence, Reproducibility of Results, fms-Like Tyrosine Kinase 3/metabolism
ABSTRACT
BACKGROUND: Long-read technologies have revolutionized de novo genome assembly by generating contigs orders of magnitude longer than those of short-read assemblies. Although assembly contiguity has increased, assemblies usually do not reconstruct a full chromosome or chromosome arm, resulting in an unfinished chromosome-level assembly. To increase the contiguity of the assembly to the chromosome level, different strategies are used which exploit long-range contact information between chromosomes in the genome. METHODS: We develop a scalable and computationally efficient scaffolding method that can boost the assembly contiguity to a large extent using genome-wide chromatin interaction data such as Hi-C. RESULTS: We demonstrate an algorithm that uses Hi-C data for longer-range scaffolding of de novo long-read genome assemblies. We tested our methods on the human and goat genome assemblies. We compare our scaffolds with the scaffolds generated by LACHESIS based on various metrics. CONCLUSION: Our new algorithm SALSA produces more accurate scaffolds compared to the existing state-of-the-art method LACHESIS.
Subjects
Contig Mapping/methods, Algorithms, Animals, Genomics, Goats/genetics, Humans
ABSTRACT
MOTIVATION: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Unlike reads shorter than the underlying repeat structure that rely on indirect inference methods, e.g. assembly, long reads allow direct inference of satellite higher order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), that takes advantage of Pacific Bioscience long-reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly. RESULTS: We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere high order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a method to generate a sequence repeat resolution that could be readily performed using consensus sequences available for other satellite families in genomes without high-quality reference assemblies. 
AVAILABILITY AND IMPLEMENTATION: Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI. CONTACT: ali.bashir@mssm.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Centromere/genetics, Computational Biology/methods, Genomics, DNA Sequence Analysis/methods, Tandem Repeat Sequences, Algorithms, Consensus Sequence, Female, Humans, Hydatidiform Mole/genetics, Pregnancy
ABSTRACT
We present a hierarchical genome-assembly process (HGAP) for high-quality de novo microbial genome assemblies using only a single, long-insert shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT) DNA sequencing. Our method uses the longest reads as seeds to recruit all other reads for construction of highly accurate preassembled reads through a directed acyclic graph-based consensus procedure, which we follow with assembly using off-the-shelf long-read assemblers. In contrast to hybrid approaches, HGAP does not require highly accurate raw reads for error correction. We demonstrate efficient genome assembly for several microorganisms using as few as three SMRT Cell zero-mode waveguide arrays of sequencing and for BACs using just one SMRT Cell. Long repeat regions can be successfully resolved with this workflow. We also describe a consensus algorithm that incorporates SMRT sequencing primary quality values to produce de novo genome sequence exceeding 99.999% accuracy.
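The first step described above, using the longest reads as seeds, can be pictured as choosing a read-length cutoff so that the reads above it reach some target depth over the genome. The sketch below illustrates only that idea and is not the HGAP implementation: the function names, the `seed_coverage` parameter and the toy numbers are all assumptions introduced for illustration.

```python
def choose_seed_cutoff(read_lengths, genome_size, seed_coverage):
    """Pick the read-length cutoff at which the longest reads reach the
    target seed coverage (hypothetical helper, not HGAP's own code).

    Reads are taken longest first until their summed length reaches
    genome_size * seed_coverage; the length of the last read taken is
    the cutoff. If the data never reach the target, every read is a seed.
    """
    if not read_lengths:
        raise ValueError("need at least one read length")
    target_bases = genome_size * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return min(read_lengths)


def split_seeds(read_lengths, cutoff):
    """Partition read lengths into seeds (>= cutoff) and the remaining
    shorter reads that would be recruited to the seeds."""
    seeds = [n for n in read_lengths if n >= cutoff]
    others = [n for n in read_lengths if n < cutoff]
    return seeds, others


if __name__ == "__main__":
    # Toy read lengths and genome size (hypothetical values).
    reads = [1000, 5000, 3000, 8000, 2000]
    cutoff = choose_seed_cutoff(reads, genome_size=1000, seed_coverage=10)
    seeds, others = split_seeds(reads, cutoff)
    print(cutoff, seeds, others)
```

In a real pipeline, the shorter reads would then be aligned to the seeds and a consensus computed for each seed, which is the part of the process the abstract attributes to a directed acyclic graph-based procedure.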