1.
Nature ; 621(7978): 344-354, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37612512

ABSTRACT

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.


Subject(s)
Chromosomes, Human, Y , Genomics , Sequence Analysis, DNA , Humans , Base Sequence , Chromosomes, Human, Y/genetics , DNA, Satellite/genetics , Genetic Variation/genetics , Genetics, Population , Genomics/methods , Genomics/standards , Heterochromatin/genetics , Multigene Family/genetics , Reference Standards , Segmental Duplications, Genomic/genetics , Sequence Analysis, DNA/standards , Tandem Repeat Sequences/genetics , Telomere/genetics
2.
Nature ; 617(7960): 312-324, 2023 05.
Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals [1]. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Subject(s)
Genome, Human , Genomics , Humans , Diploidy , Genome, Human/genetics , Haplotypes/genetics , Sequence Analysis, DNA , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation
3.
Nature ; 611(7936): 519-531, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Subject(s)
Chromosome Mapping , Diploidy , Genome, Human , Genomics , Humans , Chromosome Mapping/standards , Genome, Human/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Reference Standards , Genomics/methods , Genomics/standards , Chromosomes, Human/genetics , Genetic Variation/genetics
4.
Genome Res ; 34(3): 454-468, 2024 04 25.
Article in English | MEDLINE | ID: mdl-38627094

ABSTRACT

Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.


Subject(s)
Nanopores , Humans , Sequence Analysis, DNA/methods , Nanopore Sequencing/methods , High-Throughput Nucleotide Sequencing/methods , Software , Genomics/methods
5.
Nat Methods ; 20(10): 1483-1492, 2023 10.
Article in English | MEDLINE | ID: mdl-37710018

ABSTRACT

Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.


Subject(s)
Genome, Human , Nanopore Sequencing , Humans , Sequence Analysis, DNA/methods , Haplotypes , Methylation , Pilot Projects , High-Throughput Nucleotide Sequencing/methods
6.
Am J Hum Genet ; 108(5): 809-824, 2021 05 06.
Article in English | MEDLINE | ID: mdl-33794196

ABSTRACT

Variable number tandem repeats (VNTRs) are composed of large tandemly repeated motifs, many of which are highly polymorphic in copy number. However, because of their large size and repetitive nature, they remain poorly studied. To investigate the regulatory potential of VNTRs, we used read-depth data from Illumina whole-genome sequencing to perform association analysis between copy number of ∼70,000 VNTRs (motif size ≥ 10 bp) with both gene expression (404 samples in 48 tissues) and DNA methylation (235 samples in peripheral blood), identifying thousands of VNTRs that are associated with local gene expression (eVNTRs) and DNA methylation levels (mVNTRs). Using an independent cohort, we validated 73%-80% of signals observed in the two discovery cohorts, while allelic analysis of VNTR length and CpG methylation in 30 Oxford Nanopore genomes gave additional support for mVNTR loci, thus providing robust evidence to support that these represent genuine associations. Further, conditional analysis indicated that many eVNTRs and mVNTRs act as QTLs independently of other local variation. We also observed strong enrichments of eVNTRs and mVNTRs for regulatory features such as enhancers and promoters. Using the Human Genome Diversity Panel, we define sets of VNTRs that show highly divergent copy numbers among human populations and show that these are enriched for regulatory effects and preferentially associate with genes that have been linked with human phenotypes through GWASs. Our study provides strong evidence supporting functional variation at thousands of VNTRs and defines candidate sets of VNTRs, copy number variation of which potentially plays a role in numerous human phenotypes.


Subject(s)
DNA Copy Number Variations/genetics , DNA Methylation , Gene Expression Regulation , Minisatellite Repeats/genetics , Quantitative Trait Loci/genetics , Adolescent , Adult , Algorithms , Child , Child, Preschool , Chromosomes, Human, X/genetics , Cohort Studies , CpG Islands/genetics , Enhancer Elements, Genetic/genetics , Female , Genome-Wide Association Study , Genotype , Humans , Infant , Infant, Newborn , Male , Middle Aged , Phenotype , Promoter Regions, Genetic/genetics , Young Adult
7.
Nat Methods ; 18(11): 1322-1332, 2021 11.
Article in English | MEDLINE | ID: mdl-34725481

ABSTRACT

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).


Subject(s)
Genes , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Nanopores , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software , Genome, Human , Humans , Molecular Sequence Annotation
8.
RNA ; 28(2): 162-176, 2022 02.
Article in English | MEDLINE | ID: mdl-34728536

ABSTRACT

Nanopore sequencing devices read individual RNA strands directly. This facilitates identification of exon linkages and nucleotide modifications; however, using conventional direct RNA nanopore sequencing, the 5' and 3' ends of poly(A) RNA cannot be identified unambiguously. This is due in part to RNA degradation in vivo and in vitro that can obscure transcription start and end sites. In this study, we aimed to identify individual full-length human RNA isoforms among ∼4 million nanopore poly(A)-selected RNA reads. First, to identify RNA strands bearing 5' m7G caps, we exchanged the biological cap for a modified cap attached to a 45-nt oligomer. This oligomer adaptation method improved 5' end sequencing and ensured correct identification of the 5' m7G capped ends. Second, among these 5'-capped nanopore reads, we screened for features consistent with a 3' polyadenylation site. Combining these two steps, we identified 294,107 individual high-confidence full-length RNA scaffolds from human GM12878 cells, most of which (257,721) aligned to protein-coding genes. Of these, 4876 scaffolds indicated unannotated isoforms that were often internal to longer, previously identified RNA isoforms. Orthogonal data for m7G caps and open chromatin, such as CAGE and DNase-HS seq, confirmed the validity of these high-confidence RNA scaffolds.


Subject(s)
RNA Isoforms/chemistry , RNA, Messenger/chemistry , Cell Line, Tumor , Humans , Nanopore Sequencing/methods , RNA 3' Polyadenylation Signals , RNA Isoforms/genetics , RNA, Messenger/genetics , Transcriptome
9.
Nucleic Acids Res ; 50(6): 3475-3489, 2022 04 08.
Article in English | MEDLINE | ID: mdl-35244721

ABSTRACT

The SARS-CoV-2 virus has a complex transcriptome characterised by multiple, nested subgenomic RNAs used to express structural and accessory proteins. Long-read sequencing technologies such as nanopore direct RNA sequencing can recover full-length transcripts, greatly simplifying the assembly of structurally complex RNAs. However, these techniques do not detect the 5' cap, thus preventing reliable identification and quantification of full-length, coding transcript models. Here we used Nanopore ReCappable Sequencing (NRCeq), a new technique that can identify capped full-length RNAs, to assemble a complete annotation of SARS-CoV-2 sgRNAs and annotate the location of capping sites across the viral genome. We obtained robust estimates of sgRNA expression across cell lines and viral isolates and identified novel canonical and non-canonical sgRNAs, including one that uses a previously un-annotated leader-to-body junction site. The data generated in this work constitute a useful resource for the scientific community and provide important insights into the mechanisms that regulate the transcription of SARS-CoV-2 sgRNAs.


Subject(s)
COVID-19 , Nanopores , RNA, Guide, Kinetoplastida/chemistry , COVID-19/genetics , Genome, Viral/genetics , Humans , RNA Caps , RNA, Viral/genetics , RNA, Viral/metabolism , SARS-CoV-2/genetics
10.
Proc Natl Acad Sci U S A ; 118(13), 2021 03 30.
Article in English | MEDLINE | ID: mdl-33758101

ABSTRACT

Among the large, diverse set of mammalian long noncoding RNAs (lncRNAs), long noncoding primary microRNAs (lnc-pri-miRNAs) are those that host miRNAs. Whether lnc-pri-miRNA loci have important biological function independent of their cognate miRNAs is poorly understood. From a genome-scale lncRNA screen, lnc-pri-miRNA loci were enriched for function in cell proliferation, and in glioblastoma (i.e., GBM) cells with DGCR8 or DROSHA knockdown, lnc-pri-miRNA screen hits still regulated cell growth. To molecularly dissect the function of a lnc-pri-miRNA locus, we studied LOC646329 (also known as MIR29HG), which hosts the miR-29a/b1 cluster. In GBM cells, LOC646329 knockdown reduced miR-29a/b1 levels, and these cells exhibited decreased growth. However, genetic deletion of the miR-29a/b1 cluster (LOC646329-miR29Δ) did not decrease cell growth, while knockdown of LOC646329-miR29Δ transcripts reduced cell proliferation. The miR-29a/b1-independent activity of LOC646329 corresponded to enhancer-like activation of a neighboring oncogene (MKLN1), regulating cell propagation. The LOC646329 locus interacts with the MKLN1 promoter, and antisense oligonucleotide knockdown of the lncRNA disrupts these interactions and reduces the enhancer-like activity. More broadly, analysis of genome-wide data from multiple human cell types showed that lnc-pri-miRNA loci are significantly enriched for DNA looping interactions with gene promoters as well as genomic and epigenetic characteristics of transcriptional enhancers. Functional studies of additional lnc-pri-miRNA loci demonstrated cognate miRNA-independent enhancer-like activity. Together, these data demonstrate that lnc-pri-miRNA loci can regulate cell biology via both miRNA-dependent and miRNA-independent mechanisms.


Subject(s)
Cell Proliferation/genetics , Genetic Loci , RNA, Long Noncoding/metabolism , Apoptosis/genetics , Gene Knockdown Techniques , HEK293 Cells , Humans , MicroRNAs/genetics , MicroRNAs/metabolism , RNA, Long Noncoding/genetics , RNA-Seq
11.
Am J Hum Genet ; 107(4): 654-669, 2020 10 01.
Article in English | MEDLINE | ID: mdl-32937144

ABSTRACT

There is growing recognition that epivariations, most often recognized as promoter hypermethylation events that lead to gene silencing, are associated with a number of human diseases. However, little information exists on the prevalence and distribution of rare epigenetic variation in the human population. In order to address this, we performed a survey of methylation profiles from 23,116 individuals using the Illumina 450k array. Using a robust outlier approach, we identified 4,452 unique autosomal epivariations, including potentially inactivating promoter methylation events at 384 genes linked to human disease. For example, we observed promoter hypermethylation of BRCA1 and LDLR at population frequencies of ∼1 in 3,000 and ∼1 in 6,000, respectively, suggesting that epivariations may underlie a fraction of human disease which would be missed by purely sequence-based approaches. Using expression data, we confirmed that many epivariations are associated with outlier gene expression. Analysis of variation data and monozygous twin pairs suggests that approximately two-thirds of epivariations segregate in the population secondary to underlying sequence mutations, while one-third are likely sporadic events that occur post-zygotically. We identified 25 loci where rare hypermethylation coincided with the presence of an unstable CGG tandem repeat, validated the presence of CGG expansions at several loci, and identified the putative molecular defect underlying most of the known folate-sensitive fragile sites in the genome. Our study provides a catalog of rare epigenetic changes in the human genome, gives insight into the underlying origins and consequences of epivariations, and identifies many hypermethylated CGG repeat expansions.


Subject(s)
BRCA1 Protein/genetics , Epigenesis, Genetic , Genetic Diseases, Inborn/genetics , Genome, Human , Receptors, LDL/genetics , Trinucleotide Repeat Expansion , BRCA1 Protein/metabolism , DNA Methylation , Female , Folic Acid/metabolism , Gene Silencing , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genetic Loci , Genetic Variation , High-Throughput Nucleotide Sequencing , Humans , Male , Promoter Regions, Genetic , Receptors, LDL/metabolism , Twins, Monozygotic
13.
RNA ; 27(12): 1497-1511, 2021 12.
Article in English | MEDLINE | ID: mdl-34446532

ABSTRACT

Understanding transcriptomes requires documenting the structures, modifications, and abundances of RNAs as well as their proximity to other molecules. The methods that make this possible depend critically on enzymes (including mutant derivatives) that act on nucleic acids for capturing and sequencing RNA. We tested two 3' nucleotidyl transferases, Saccharomyces cerevisiae poly(A) polymerase and Schizosaccharomyces pombe Cid1, for the ability to add base and sugar modified rNTPs to free RNA 3' ends, eventually focusing on Cid1. Although unable to polymerize ΨTP or 1meΨTP, Cid1 can use 5meUTP and 4thioUTP. Surprisingly, Cid1 can use inosine triphosphate to add poly(I) to the 3' ends of a wide variety of RNA molecules. Most poly(A) mRNAs efficiently acquire a uniform tract of about 50 inosine residues from Cid1, whereas non-poly(A) RNAs acquire longer, more heterogeneous tails. Here we test these activities for use in direct RNA sequencing on nanopores, and find that Cid1-mediated poly(I)-tailing permits detection and quantification of both mRNAs and non-poly(A) RNAs simultaneously, as well as enabling the analysis of nascent RNAs associated with RNA polymerase II. Poly(I) produces a different current trace than poly(A), enabling recognition of native RNA 3' end sequence lost by in vitro poly(A) addition. Addition of poly(I) by Cid1 offers a broadly useful alternative to poly(A) capture for direct RNA sequencing on nanopores.


Subject(s)
Nanopores , Nucleotides/chemistry , Nucleotidyltransferases/metabolism , Polymers/chemistry , Polynucleotide Adenylyltransferase/metabolism , Saccharomyces cerevisiae/enzymology , Schizosaccharomyces pombe Proteins/metabolism , Schizosaccharomyces/enzymology , Sequence Analysis, RNA/methods , Nucleotidyltransferases/genetics , Polynucleotide Adenylyltransferase/genetics , Schizosaccharomyces pombe Proteins/genetics
14.
Nat Methods ; 16(12): 1297-1305, 2019 12.
Article in English | MEDLINE | ID: mdl-31740818

ABSTRACT

High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies. Our study generated 9.9 million aligned sequence reads for the human cell line GM12878, using thirty MinION flow cells at six institutions. These native RNA reads had a median length of 771 bases, and a maximum aligned length of over 21,000 bases. Mitochondrial poly(A) reads provided an internal measure of read-length quality. We combined these long nanopore reads with higher accuracy short-reads and annotated GM12878 promoter regions to identify 33,984 plausible RNA isoforms. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes.


Subject(s)
Nanopore Sequencing/methods , Poly A/genetics , Sequence Analysis, RNA/methods , Transcriptome , Cells, Cultured , Humans
15.
Genome Res ; 28(2): 266-274, 2018 02.
Article in English | MEDLINE | ID: mdl-29273626

ABSTRACT

Advances in long-read single molecule sequencing have opened new possibilities for 'benchtop' whole-genome sequencing. The Oxford Nanopore Technologies MinION is a portable device that uses nanopore technology that can directly sequence DNA molecules. MinION single molecule long sequence reads are well suited for de novo assembly of complex genomes as they facilitate the construction of highly contiguous physical genome maps obviating the need for labor-intensive physical genome mapping. Long sequence reads can also be used to delineate complex chromosomal rearrangements, such as those that occur in tumor cells, that can confound analysis using short reads. Here, we assessed MinION long-read-derived sequences for feasibility concerning: (1) the de novo assembly of a large complex genome, and (2) the elucidation of complex rearrangements. The genomes of two Caenorhabditis elegans strains, a wild-type strain and a strain containing two complex rearrangements, were sequenced with MinION. Up to 42-fold coverage was obtained from a single flow cell, and the best pooled data assembly produced a highly contiguous wild-type C. elegans genome containing 48 contigs (N50 contig length = 3.99 Mb) covering >99% of the 100,286,401-base reference genome. Further, the MinION-derived genome assembly expanded the C. elegans reference genome by >2 Mb due to a more accurate determination of repetitive sequence elements and assembled the complete genomes of two co-extracted bacteria. MinION long-read sequence data also facilitated the elucidation of complex rearrangements in a mutagenized strain. The sequence accuracy of the MinION long-read contigs (∼98%) was improved using Illumina-derived sequence data to polish the final genome assembly to 99.8% nucleotide accuracy when compared to the reference assembly.


Subject(s)
Caenorhabditis elegans/genetics , Genome/genetics , Molecular Sequence Annotation , Animals , Chromosome Mapping , Gene Rearrangement/genetics , High-Throughput Nucleotide Sequencing , Repetitive Sequences, Nucleic Acid/genetics
16.
Bioinformatics ; 36(19): 4928-4934, 2020 12 08.
Article in English | MEDLINE | ID: mdl-32597959

ABSTRACT

MOTIVATION: Nucleotide modification status can be decoded from the Oxford Nanopore Technologies nanopore-sequencing ionic current signals. Although various algorithms have been developed for nanopore-sequencing-based modification analysis, more detailed characterizations, such as modification numbers, corresponding signal levels and proportions are still lacking. RESULTS: We present a framework for the unsupervised determination of the number of nucleotide modifications from nanopore-sequencing readouts. We demonstrate the approach can effectively recapitulate the number of modifications, the corresponding ionic current signal levels, as well as mixing proportions under both DNA and RNA contexts. We further show, by integrating information from multiple detected modification regions, that the modification status of DNA and RNA molecules can be inferred. This method forms a key step of de novo characterization of nucleotide modifications, shedding light on the interpretation of various biological questions. AVAILABILITY AND IMPLEMENTATION: Modified nanopolish: https://github.com/adbailey4/nanopolish/tree/cigar_output. All other codes used to reproduce the results: https://github.com/hd2326/ModificationNumber. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Nanopores , High-Throughput Nucleotide Sequencing , Nucleotides/genetics , Sequence Analysis, DNA , Software
19.
Nat Methods ; 14(4): 411-413, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28218897

ABSTRACT

DNA chemical modifications regulate genomic function. We present a framework for mapping cytosine and adenosine methylation with the Oxford Nanopore Technologies MinION using this nanopore sequencer's ionic current signal. We map three cytosine variants and two adenine variants. The results show that our model is sensitive enough to detect changes in genomic DNA methylation levels as a function of growth phase in Escherichia coli.


Subject(s)
5-Methylcytosine/metabolism , DNA Methylation , High-Throughput Nucleotide Sequencing/methods , Nanopores , 5-Methylcytosine/analysis , Escherichia coli/genetics , Genome, Bacterial , High-Throughput Nucleotide Sequencing/instrumentation , Markov Chains , Models, Genetic
20.
Nat Methods ; 12(4): 351-6, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25686389

ABSTRACT

Speed, single-base sensitivity and long read lengths make nanopores a promising technology for high-throughput sequencing. We evaluated and optimized the performance of the MinION nanopore sequencer using M13 genomic DNA and used expectation maximization to obtain robust maximum-likelihood estimates for insertion, deletion and substitution error rates (4.9%, 7.8% and 5.1%, respectively). Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%. We present a single-nucleotide-variant detection tool that uses maximum-likelihood parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer-testis gene family (CT47) within an unresolved region of human chromosome Xq24.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Nanopores , Algorithms , Gene Dosage , Humans , Neoplasms/genetics