Results 1 - 20 of 28
1.
Nature ; 617(7960): 312-324, 2023 05.
Article in English | MEDLINE | ID: mdl-37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Subject(s)
Genome, Human , Genomics , Humans , Diploidy , Genome, Human/genetics , Haplotypes/genetics , Sequence Analysis, DNA , Genomics/standards , Reference Standards , Cohort Studies , Alleles , Genetic Variation
2.
Nature ; 611(7936): 519-531, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Subject(s)
Chromosome Mapping , Diploidy , Genome, Human , Genomics , Humans , Chromosome Mapping/standards , Genome, Human/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Reference Standards , Genomics/methods , Genomics/standards , Chromosomes, Human/genetics , Genetic Variation/genetics
3.
Genome Res ; 34(3): 454-468, 2024 04 25.
Article in English | MEDLINE | ID: mdl-38627094

ABSTRACT

Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.


Subject(s)
Nanopores , Humans , Sequence Analysis, DNA/methods , Nanopore Sequencing/methods , High-Throughput Nucleotide Sequencing/methods , Software , Genomics/methods
4.
RNA ; 28(2): 162-176, 2022 02.
Article in English | MEDLINE | ID: mdl-34728536

ABSTRACT

Nanopore sequencing devices read individual RNA strands directly. This facilitates identification of exon linkages and nucleotide modifications; however, using conventional direct RNA nanopore sequencing, the 5' and 3' ends of poly(A) RNA cannot be identified unambiguously. This is due in part to RNA degradation in vivo and in vitro that can obscure transcription start and end sites. In this study, we aimed to identify individual full-length human RNA isoforms among ∼4 million nanopore poly(A)-selected RNA reads. First, to identify RNA strands bearing 5' m7G caps, we exchanged the biological cap for a modified cap attached to a 45-nt oligomer. This oligomer adaptation method improved 5' end sequencing and ensured correct identification of the 5' m7G capped ends. Second, among these 5'-capped nanopore reads, we screened for features consistent with a 3' polyadenylation site. Combining these two steps, we identified 294,107 individual high-confidence full-length RNA scaffolds from human GM12878 cells, most of which (257,721) aligned to protein-coding genes. Of these, 4876 scaffolds indicated unannotated isoforms that were often internal to longer, previously identified RNA isoforms. Orthogonal data for m7G caps and open chromatin, such as CAGE and DNase-HS seq, confirmed the validity of these high-confidence RNA scaffolds.


Subject(s)
RNA Isoforms/chemistry , RNA, Messenger/chemistry , Cell Line, Tumor , Humans , Nanopore Sequencing/methods , RNA 3' Polyadenylation Signals , RNA Isoforms/genetics , RNA, Messenger/genetics , Transcriptome
5.
Nucleic Acids Res ; 50(6): 3475-3489, 2022 04 08.
Article in English | MEDLINE | ID: mdl-35244721

ABSTRACT

The SARS-CoV-2 virus has a complex transcriptome characterised by multiple, nested subgenomic RNAs used to express structural and accessory proteins. Long-read sequencing technologies such as nanopore direct RNA sequencing can recover full-length transcripts, greatly simplifying the assembly of structurally complex RNAs. However, these techniques do not detect the 5' cap, thus preventing reliable identification and quantification of full-length, coding transcript models. Here we used Nanopore ReCappable Sequencing (NRCeq), a new technique that can identify capped full-length RNAs, to assemble a complete annotation of SARS-CoV-2 sgRNAs and annotate the location of capping sites across the viral genome. We obtained robust estimates of sgRNA expression across cell lines and viral isolates and identified novel canonical and non-canonical sgRNAs, including one that uses a previously un-annotated leader-to-body junction site. The data generated in this work constitute a useful resource for the scientific community and provide important insights into the mechanisms that regulate the transcription of SARS-CoV-2 sgRNAs.


Subject(s)
COVID-19 , Nanopores , RNA, Guide, Kinetoplastida/chemistry , COVID-19/genetics , Genome, Viral/genetics , Humans , RNA Caps , RNA, Viral/genetics , RNA, Viral/metabolism , SARS-CoV-2/genetics
6.
Am J Hum Genet ; 107(4): 654-669, 2020 10 01.
Article in English | MEDLINE | ID: mdl-32937144

ABSTRACT

There is growing recognition that epivariations, most often recognized as promoter hypermethylation events that lead to gene silencing, are associated with a number of human diseases. However, little information exists on the prevalence and distribution of rare epigenetic variation in the human population. In order to address this, we performed a survey of methylation profiles from 23,116 individuals using the Illumina 450k array. Using a robust outlier approach, we identified 4,452 unique autosomal epivariations, including potentially inactivating promoter methylation events at 384 genes linked to human disease. For example, we observed promoter hypermethylation of BRCA1 and LDLR at population frequencies of ∼1 in 3,000 and ∼1 in 6,000, respectively, suggesting that epivariations may underlie a fraction of human disease which would be missed by purely sequence-based approaches. Using expression data, we confirmed that many epivariations are associated with outlier gene expression. Analysis of variation data and monozygous twin pairs suggests that approximately two-thirds of epivariations segregate in the population secondary to underlying sequence mutations, while one-third are likely sporadic events that occur post-zygotically. We identified 25 loci where rare hypermethylation coincided with the presence of an unstable CGG tandem repeat, validated the presence of CGG expansions at several loci, and identified the putative molecular defect underlying most of the known folate-sensitive fragile sites in the genome. Our study provides a catalog of rare epigenetic changes in the human genome, gives insight into the underlying origins and consequences of epivariations, and identifies many hypermethylated CGG repeat expansions.


Subject(s)
BRCA1 Protein/genetics , Epigenesis, Genetic , Genetic Diseases, Inborn/genetics , Genome, Human , Receptors, LDL/genetics , Trinucleotide Repeat Expansion , BRCA1 Protein/metabolism , DNA Methylation , Female , Folic Acid/metabolism , Gene Silencing , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genetic Loci , Genetic Variation , High-Throughput Nucleotide Sequencing , Humans , Male , Promoter Regions, Genetic , Receptors, LDL/metabolism , Twins, Monozygotic
8.
Nat Methods ; 16(12): 1297-1305, 2019 12.
Article in English | MEDLINE | ID: mdl-31740818

ABSTRACT

High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies. Our study generated 9.9 million aligned sequence reads for the human cell line GM12878, using thirty MinION flow cells at six institutions. These native RNA reads had a median length of 771 bases, and a maximum aligned length of over 21,000 bases. Mitochondrial poly(A) reads provided an internal measure of read-length quality. We combined these long nanopore reads with higher accuracy short-reads and annotated GM12878 promoter regions to identify 33,984 plausible RNA isoforms. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes.


Subject(s)
Nanopore Sequencing/methods , Poly A/genetics , Sequence Analysis, RNA/methods , Transcriptome , Cells, Cultured , Humans
9.
Genome Res ; 28(2): 266-274, 2018 02.
Article in English | MEDLINE | ID: mdl-29273626

ABSTRACT

Advances in long-read single molecule sequencing have opened new possibilities for 'benchtop' whole-genome sequencing. The Oxford Nanopore Technologies MinION is a portable device that uses nanopore technology that can directly sequence DNA molecules. MinION single molecule long sequence reads are well suited for de novo assembly of complex genomes as they facilitate the construction of highly contiguous physical genome maps obviating the need for labor-intensive physical genome mapping. Long sequence reads can also be used to delineate complex chromosomal rearrangements, such as those that occur in tumor cells, that can confound analysis using short reads. Here, we assessed MinION long-read-derived sequences for feasibility concerning: (1) the de novo assembly of a large complex genome, and (2) the elucidation of complex rearrangements. The genomes of two Caenorhabditis elegans strains, a wild-type strain and a strain containing two complex rearrangements, were sequenced with MinION. Up to 42-fold coverage was obtained from a single flow cell, and the best pooled data assembly produced a highly contiguous wild-type C. elegans genome containing 48 contigs (N50 contig length = 3.99 Mb) covering >99% of the 100,286,401-base reference genome. Further, the MinION-derived genome assembly expanded the C. elegans reference genome by >2 Mb due to a more accurate determination of repetitive sequence elements and assembled the complete genomes of two co-extracted bacteria. MinION long-read sequence data also facilitated the elucidation of complex rearrangements in a mutagenized strain. The sequence accuracy of the MinION long-read contigs (∼98%) was improved using Illumina-derived sequence data to polish the final genome assembly to 99.8% nucleotide accuracy when compared to the reference assembly.
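
The N50 contig length quoted in this abstract is a standard contiguity statistic: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly size. The following is a minimal sketch of that definition only. The function name `n50` is hypothetical and the contig lengths are made-up toy values, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, arbitrary units):
# total = 290, half = 145; 80 + 70 = 150 >= 145, so N50 is 70.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 70
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract reports reference coverage and nucleotide accuracy alongside it.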


Subject(s)
Caenorhabditis elegans/genetics , Genome/genetics , Molecular Sequence Annotation , Animals , Chromosome Mapping , Gene Rearrangement/genetics , High-Throughput Nucleotide Sequencing , Repetitive Sequences, Nucleic Acid/genetics
10.
Bioinformatics ; 36(19): 4928-4934, 2020 12 08.
Article in English | MEDLINE | ID: mdl-32597959

ABSTRACT

MOTIVATION: Nucleotide modification status can be decoded from the Oxford Nanopore Technologies nanopore-sequencing ionic current signals. Although various algorithms have been developed for nanopore-sequencing-based modification analysis, more detailed characterizations, such as modification numbers, corresponding signal levels and proportions are still lacking. RESULTS: We present a framework for the unsupervised determination of the number of nucleotide modifications from nanopore-sequencing readouts. We demonstrate the approach can effectively recapitulate the number of modifications, the corresponding ionic current signal levels, as well as mixing proportions under both DNA and RNA contexts. We further show, by integrating information from multiple detected modification regions, that the modification status of DNA and RNA molecules can be inferred. This method forms a key step of de novo characterization of nucleotide modifications, shedding light on the interpretation of various biological questions. AVAILABILITY AND IMPLEMENTATION: Modified nanopolish: https://github.com/adbailey4/nanopolish/tree/cigar_output. All other codes used to reproduce the results: https://github.com/hd2326/ModificationNumber. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Nanopores , High-Throughput Nucleotide Sequencing , Nucleotides/genetics , Sequence Analysis, DNA , Software
12.
Nat Methods ; 14(4): 411-413, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28218897

ABSTRACT

DNA chemical modifications regulate genomic function. We present a framework for mapping cytosine and adenosine methylation with the Oxford Nanopore Technologies MinION using this nanopore sequencer's ionic current signal. We map three cytosine variants and two adenine variants. The results show that our model is sensitive enough to detect changes in genomic DNA methylation levels as a function of growth phase in Escherichia coli.


Subject(s)
5-Methylcytosine/metabolism , DNA Methylation , High-Throughput Nucleotide Sequencing/methods , Nanopores , 5-Methylcytosine/analysis , Escherichia coli/genetics , Genome, Bacterial , High-Throughput Nucleotide Sequencing/instrumentation , Markov Chains , Models, Genetic
13.
Nat Methods ; 12(4): 351-6, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25686389

ABSTRACT

Speed, single-base sensitivity and long read lengths make nanopores a promising technology for high-throughput sequencing. We evaluated and optimized the performance of the MinION nanopore sequencer using M13 genomic DNA and used expectation maximization to obtain robust maximum-likelihood estimates for insertion, deletion and substitution error rates (4.9%, 7.8% and 5.1%, respectively). Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%. We present a single-nucleotide-variant detection tool that uses maximum-likelihood parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer-testis gene family (CT47) within an unresolved region of human chromosome Xq24.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Nanopores , Algorithms , Gene Dosage , Humans , Neoplasms/genetics
14.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865218

ABSTRACT

As a step towards simplifying and reducing the cost of haplotype resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation and show that newer, higher accuracy ONT reads substantially improve assembly quality.

15.
Methods Mol Biol ; 2298: 53-74, 2021.
Article in English | MEDLINE | ID: mdl-34085238

ABSTRACT

Historically, RNA has been sequenced as cDNA copies derived from reverse transcription of cellular RNA followed by PCR amplification. Recently, RNA sequencing using nanopores has emerged as an alternative. Using this technology, individual cellular RNA strands are read directly as they are driven through nanoscale pores by an applied voltage. The speed of translocation is regulated by a helicase that is loaded onto each RNA strand by an adapter that also facilitates capture by the nanopore electric field. Here we describe a technique for adapting human ribosomal RNA subunits for nanopore sequencing. Using this strategy, a single Oxford Nanopore MinION run delivered 470,907 sequence reads of which 396,048 aligned to ribosomal RNA, with 28S, 18S, 5.8S, and 5S coverage of 6053, 369,472, 16,058, and 4465 reads, respectively. Example alignments that reveal putative nucleotide modifications are provided.
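
The read counts reported in this abstract can be cross-checked with simple arithmetic: the four per-subunit counts add up exactly to the 396,048 rRNA-aligned reads, which is about 84% of the 470,907 reads in the run. The sketch below only restates the numbers from the abstract; the variable names are illustrative.

```python
# Counts taken directly from the abstract.
total_reads = 470_907
rrna_aligned = 396_048
per_subunit = {"28S": 6_053, "18S": 369_472, "5.8S": 16_058, "5S": 4_465}

# The per-subunit counts should account for every rRNA-aligned read.
subunit_sum = sum(per_subunit.values())

# Fraction of all reads in the run that aligned to rRNA.
aligned_fraction = rrna_aligned / total_reads

# Share of the rRNA-aligned reads contributed by each subunit.
shares = {name: count / rrna_aligned for name, count in per_subunit.items()}

print(subunit_sum == rrna_aligned)
print(f"aligned to rRNA: {aligned_fraction:.1%}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The breakdown makes the skew visible: 18S accounts for roughly 93% of the aligned reads, while 28S contributes under 2%.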


Subject(s)
Nanopore Sequencing/methods , Nucleotides/genetics , RNA, Ribosomal/genetics , Sequence Analysis, RNA/methods , Humans , Nanopores , Sequence Analysis, DNA/methods
16.
ACS Nano ; 15(10): 16642-16653, 2021 10 26.
Article in English | MEDLINE | ID: mdl-34618430

ABSTRACT

We describe a method for direct tRNA sequencing using the Oxford Nanopore MinION. The principal technical advance is custom adapters that facilitate end-to-end sequencing of individual transfer RNA (tRNA) molecules at subnanometer precision. A second advance is a nanopore sequencing pipeline optimized for tRNA. We tested this method using purified E. coli tRNAfMet, tRNALys, and tRNAPhe samples. 76-92% of individual aligned tRNA sequence reads were full length. As a proof of concept, we showed that nanopore sequencing detected all 43 expected isoacceptors in total E. coli MRE600 tRNA as well as isodecoders that further define that tRNA population. Alignment-based comparisons between the three purified tRNAs and their synthetic controls revealed systematic nucleotide miscalls that were diagnostic of known modifications. Systematic miscalls were also observed proximal to known modifications in total E. coli tRNA alignments, including a highly conserved pseudouridine in the T loop. This work highlights the potential of nanopore direct tRNA sequencing as well as improvements needed to implement tRNA sequencing for human healthcare applications.


Subject(s)
Nanopore Sequencing , Nanopores , Escherichia coli/genetics , High-Throughput Nucleotide Sequencing , Humans , Nucleotides
17.
bioRxiv ; 2021 Apr 06.
Article in English | MEDLINE | ID: mdl-33851162

ABSTRACT

We report a SARS-CoV-2 lineage that shares N501Y, P681H, and other mutations with known variants of concern, such as B.1.1.7. This lineage, which we refer to as B.1.x (COG-UK sometimes references similar samples as B.1.324.1), is present in at least 20 states across the USA and in at least six countries. However, a large deletion causes the sequence to be automatically rejected from repositories, suggesting that the frequency of this new lineage is underestimated using public data. Recent dynamics based on 339 samples obtained in Santa Cruz County, CA, USA suggest that B.1.x may be increasing in frequency at a rate similar to that of B.1.1.7 in Southern California. At present the functional differences between this variant B.1.x and other circulating SARS-CoV-2 variants are unknown, and further studies on secondary attack rates, viral loads, immune evasion and/or disease severity are needed to determine if it poses a public health concern. Nonetheless, given what is known from well-studied circulating variants of concern, it seems unlikely that the lineage could pose larger concerns for human health than many already globally distributed lineages. Our work highlights a need for rapid turnaround time from sequence generation to submission and improved sequence quality control that removes submission bias. We identify promising paths toward this goal.

18.
Genome Biol ; 21(1): 83, 2020 03 31.
Article in English | MEDLINE | ID: mdl-32234056

ABSTRACT

BACKGROUND: Long non-coding RNAs (lncRNAs) exhibit highly cell type-specific expression and function, making this class of transcript attractive for targeted cancer therapy. However, the vast majority of lncRNAs have not been tested as potential therapeutic targets, particularly in the context of currently used cancer treatments. Malignant glioma is rapidly fatal, and ionizing radiation is part of the current standard-of-care used to slow tumor growth in both adult and pediatric patients. RESULTS: We use CRISPR interference (CRISPRi) to screen 5689 lncRNA loci in human glioblastoma (GBM) cells, identifying 467 hits that modify cell growth in the presence of clinically relevant doses of fractionated radiation. Thirty-three of these lncRNA hits sensitize cells to radiation, and based on their expression in adult and pediatric gliomas, nine of these hits are prioritized as lncRNA Glioma Radiation Sensitizers (lncGRS). Knockdown of lncGRS-1, a primate-conserved, nuclear-enriched lncRNA, inhibits the growth and proliferation of primary adult and pediatric glioma cells, but not the viability of normal brain cells. Using human brain organoids comprised of mature neural cell types as a three-dimensional tissue substrate to model the invasive growth of glioma, we find that antisense oligonucleotides targeting lncGRS-1 selectively decrease tumor growth and sensitize glioma cells to radiation therapy. CONCLUSIONS: These studies identify lncGRS-1 as a glioma-specific therapeutic target and establish a generalizable approach to rapidly identify novel therapeutic targets in the vast non-coding genome to enhance radiation therapy.


Subject(s)
Brain Neoplasms/therapy , CRISPR-Cas Systems , Glioblastoma/therapy , RNA, Long Noncoding/antagonists & inhibitors , Adult , Astrocytes , Brain , Brain Neoplasms/genetics , Brain Neoplasms/pathology , Brain Neoplasms/radiotherapy , Cell Line, Tumor , Combined Modality Therapy , Glioblastoma/genetics , Glioblastoma/pathology , Glioblastoma/radiotherapy , Humans , Oligonucleotides, Antisense , Organoids , Radiation Tolerance
19.
Nat Biotechnol ; 38(9): 1044-1053, 2020 09.
Article in English | MEDLINE | ID: mdl-32686750

ABSTRACT

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
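
The abstract equates 99.9% identity with a Phred quality score of QV = 30. That follows from the usual Phred definition, QV = -10 * log10(error rate), with the error rate taken as 1 minus identity. A minimal sketch of the conversion in both directions is shown here; the function names are hypothetical and are not part of Shasta, MarginPolish or HELEN.

```python
import math


def phred_from_identity(identity):
    """Convert a fractional identity (e.g. 0.999) to a Phred-scaled QV."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)


def identity_from_phred(qv):
    """Convert a Phred-scaled QV back to a fractional identity."""
    return 1.0 - 10.0 ** (-qv / 10.0)


# 99.9% identity is one error per 1,000 bases, i.e. QV of about 30.
print(round(phred_from_identity(0.999), 2))
# QV 30 corresponds to 99.9% identity.
print(round(identity_from_phred(30), 4))
```

Each additional 10 QV points is a tenfold reduction in error rate, so QV 40 would correspond to 99.99% identity.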


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing , Sequence Analysis, DNA/methods , Algorithms , Benchmarking , Chromosomes, Human/genetics , Deep Learning , Genomics , HLA Antigens/genetics , Haploidy , High-Throughput Nucleotide Sequencing/standards , Humans , Sequence Analysis, DNA/standards
20.
Microb Genom ; 4(11)2018 11.
Article in English | MEDLINE | ID: mdl-30461375

ABSTRACT

The genome of Bordetella pertussis is complex, with high G+C content and many repeats, each longer than 1000 bp. Long-read sequencing offers the opportunity to produce single-contig B. pertussis assemblies using sequencing reads which are longer than the repetitive sections, with the potential to reveal genomic features which were previously unobservable in multi-contig assemblies produced by short-read sequencing alone. We used an R9.4 MinION flow cell and barcoding to sequence five B. pertussis strains in a single sequencing run. We then trialled combinations of the many nanopore user community-built long-read analysis tools to establish the current optimal assembly pipeline for B. pertussis genome sequences. This pipeline produced closed genome sequences for four strains, allowing visualization of inter-strain genomic rearrangement. Read mapping to the Tohama I reference genome suggests that the remaining strain contains an ultra-long duplicated region (almost 200 kbp), which was not resolved by our pipeline; further investigation also revealed that a second strain that was seemingly resolved by our pipeline may contain an even longer duplication, albeit in a small subset of cells. We have therefore demonstrated the ability to resolve the structure of several B. pertussis strains per single barcoded nanopore flow cell, but the genomes with highest complexity (e.g. very large duplicated regions) remain only partially resolved using the standard library preparation and will require an alternative library preparation method. For full strain characterization, we recommend hybrid assembly of long and short reads together; for comparison of genome arrangement, assembly using long reads alone is sufficient.


Subject(s)
Bordetella pertussis/genetics , Genome, Bacterial , Sequence Analysis, DNA/methods , Molecular Sequence Annotation , Nanopores