Results 1 - 20 of 234
1.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
2.
Nat Rev Genet ; 21(4): 243-254, 2020 04.
Article in English | MEDLINE | ID: mdl-32034321

ABSTRACT

Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.


Subject(s)
Genome, Human , Genomics , Genome, Bacterial , Genome, Plant , Humans
3.
Genome Res ; 31(2): 301-308, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33361112

ABSTRACT

RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.

4.
PLoS Comput Biol ; 19(3): e1011032, 2023 03.
Article in English | MEDLINE | ID: mdl-37000853

ABSTRACT

Advances in long-read sequencing technologies have dramatically improved the contiguity and completeness of genome assemblies. Using the latest nanopore-based sequencers, we can generate enough data for the assembly of a human genome from a single flow cell. With the long-read data from these sequencers, we can now routinely produce de novo genome assemblies in which half or more of a genome is contained in megabase-scale contigs. Assemblies produced from nanopore data alone, though, have relatively high error rates and can benefit from a process called polishing, in which more-accurate reads are used to correct errors in the consensus sequence. In this manuscript, we present a novel tool for genome polishing called JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction). In contrast to many other polishing methods, JASPER gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus. Our experiments demonstrate that JASPER is faster than alignment-based polishers, and both faster and more accurate than other k-mer based polishing methods. We also introduce the idea of using a polishing tool to create population-specific reference genomes, and illustrate this idea using sequence data from multiple individuals from Tokyo, Japan.


Subject(s)
High-Throughput Nucleotide Sequencing , Nanopores , Humans , Sequence Analysis, DNA , Genome, Human/genetics , Metagenomics
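The JASPER abstract above describes detecting consensus errors from a database of k-mer counts rather than from read alignments. The following is only a toy sketch of that general idea, not JASPER's implementation: it counts k-mers with a plain Python `Counter` instead of Jellyfish, ignores reverse complements, and only flags suspicious positions without correcting them. The function names, the `min_count` threshold, and the example sequences are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(consensus, counts, k, min_count=2):
    """Return start positions of consensus k-mers seen fewer than min_count times in the reads."""
    return [
        i
        for i in range(len(consensus) - k + 1)
        if counts[consensus[i:i + k]] < min_count
    ]


# Hypothetical accurate reads and a consensus carrying one substitution (index 4).
reads = ["ACGTACGTGA", "CGTACGTGAC", "ACGTACGTGA"]
consensus = "ACGTTCGTGA"

counts = count_kmers(reads, k=4)
flagged = suspicious_positions(consensus, counts, k=4)
print(flagged)  # every 4-mer overlapping the substituted base is flagged
```

A single wrong base makes each k-mer that overlaps it rare or absent in the reads, so a run of consecutive flagged positions points at the likely error site.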
5.
Plant J ; 109(1): 7-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34800071

ABSTRACT

Drought is a major limitation for survival and growth in plants. With more frequent and severe drought episodes occurring due to climate change, it is imperative to understand the genomic and physiological basis of drought tolerance to be able to predict how species will respond in the future. In this study, univariate and multitrait multivariate genome-wide association study methods were used to identify candidate genes in two iconic and ecosystem-dominating species of the western USA, coast redwood and giant sequoia, using 10 drought-related physiological and anatomical traits and genome-wide sequence-capture single nucleotide polymorphisms. Population-level phenotypic variation was found in carbon isotope discrimination, osmotic pressure at full turgor, xylem hydraulic diameter, and total area of transporting fibers in both species. Our study identified 78 new marker × trait associations in coast redwood and six in giant sequoia, with genes involved in a range of metabolic, stress, and signaling pathways, among other functions. This study contributes to a better understanding of the genomic basis of drought tolerance in long-generation conifers and helps guide current and future conservation efforts in the species.


Subject(s)
Adaptation, Physiological/genetics , Genome, Plant/genetics , Sequoia/genetics , Sequoiadendron/genetics , Signal Transduction/genetics , Carbon Isotopes/analysis , Conservation of Natural Resources , Droughts , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Osmotic Pressure , Phenotype , Plant Stomata/genetics , Plant Stomata/physiology , Sequoia/physiology , Sequoiadendron/physiology , Xylem/genetics , Xylem/physiology
6.
Bioinformatics ; 38(5): 1440-1442, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34734986

ABSTRACT

SUMMARY: PhyloCSF++ is an efficient and parallelized C++ implementation of the popular PhyloCSF method to distinguish protein-coding and non-coding regions in a genome based on multiple sequence alignments (MSAs). It can score alignments or produce browser tracks for entire genomes in the wig file format. Additionally, PhyloCSF++ annotates coding sequences in GFF/GTF files using precomputed tracks or computes and scores MSAs on the fly with MMseqs2. AVAILABILITY AND IMPLEMENTATION: PhyloCSF++ is released under the AGPLv3 license. Binaries and source code are available at https://github.com/cpockrandt/PhyloCSFpp. The software can be installed through bioconda. A variety of tracks can be accessed through ftp://ftp.ccb.jhu.edu/pub/software/phylocsfpp/.


Subject(s)
Genome , Software , Sequence Alignment , Exons
7.
PLoS Comput Biol ; 18(2): e1009860, 2022 02.
Article in English | MEDLINE | ID: mdl-35120119

ABSTRACT

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software
8.
Nature ; 551(7681): 498-502, 2017 11 23.
Article in English | MEDLINE | ID: mdl-29143815

ABSTRACT

Aegilops tauschii is the diploid progenitor of the D genome of hexaploid wheat (Triticum aestivum, genomes AABBDD) and an important genetic resource for wheat. The large size and highly repetitive nature of the Ae. tauschii genome have until now precluded the development of a reference-quality genome sequence. Here we use an array of advanced technologies, including ordered-clone genome sequencing, whole-genome shotgun sequencing, and BioNano optical genome mapping, to generate a reference-quality genome sequence for Ae. tauschii ssp. strangulata accession AL8/78, which is closely related to the wheat D genome. We show that compared to other sequenced plant genomes, including a much larger conifer genome, the Ae. tauschii genome contains unprecedented amounts of very similar repeated sequences. Our genome comparisons reveal that the Ae. tauschii genome has a greater number of dispersed duplicated genes than other sequenced genomes and its chromosomes have been structurally evolving an order of magnitude faster than those of other grass genomes. The decay of colinearity with other grass genomes correlates with recombination rates along chromosomes. We propose that the vast amounts of very similar repeated sequences cause frequent errors in recombination and lead to gene duplications and structural chromosome changes that drive fast genome evolution.


Subject(s)
Genome, Plant , Phylogeny , Poaceae/genetics , Triticum/genetics , Chromosome Mapping , Diploidy , Evolution, Molecular , Gene Duplication , Genes, Plant/genetics , Genomics/standards , Poaceae/classification , Recombination, Genetic/genetics , Sequence Analysis, DNA/standards , Triticum/classification
9.
PLoS Genet ; 16(1): e1008571, 2020 01.
Article in English | MEDLINE | ID: mdl-31986137

ABSTRACT

Long-read sequencing facilitates assembly of complex genomic regions. In plants, loci containing nucleotide-binding, leucine-rich repeat (NLR) disease resistance genes are an important example of such regions. NLR genes constitute one of the largest gene families in plants and are often clustered, evolving via duplication, contraction, and transposition. We recently mapped the Xo1 locus for resistance to bacterial blight and bacterial leaf streak, found in the American heirloom rice variety Carolina Gold Select, to a region that in the Nipponbare reference genome is NLR gene-rich. Here, toward identification of the Xo1 gene, we combined Nanopore and Illumina reads and generated a high-quality Carolina Gold Select genome assembly. We identified 529 complete or partial NLR genes and discovered, relative to Nipponbare, an expansion of NLR genes at the Xo1 locus. One of these has high sequence similarity to the cloned, functionally similar Xa1 gene. Both harbor an integrated zfBED domain, and the repeats within each protein are nearly perfect. Across diverse Oryzeae, we identified two sub-clades of NLR genes with these features, varying in the presence of the zfBED domain and the number of repeats. The Carolina Gold Select genome assembly also uncovered at the Xo1 locus a rice blast resistance gene and a gene encoding a polyphenol oxidase (PPO). PPO activity has been used as a marker for blast resistance at the locus in some varieties; however, the Carolina Gold Select sequence revealed a loss-of-function mutation in the PPO gene that breaks this association. Our results demonstrate that whole genome sequencing combining Nanopore and Illumina reads effectively resolves NLR gene loci. Our identification of an Xo1 candidate is an important step toward mechanistic characterization, including the role(s) of the zfBED domain. Finally, the Carolina Gold Select genome assembly will facilitate identification of other useful traits in this historically important variety.


Subject(s)
Disease Resistance , NLR Proteins/genetics , Oryza/genetics , Plant Proteins/genetics , Molecular Sequence Annotation , NLR Proteins/chemistry , NLR Proteins/metabolism , Nanopore Sequencing/methods , Oryza/immunology , Plant Proteins/chemistry , Plant Proteins/metabolism , Whole Genome Sequencing/methods , Zinc Fingers
10.
Genome Res ; 29(6): 954-960, 2019 06.
Article in English | MEDLINE | ID: mdl-31064768

ABSTRACT

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.


Subject(s)
DNA Contamination , Genome, Bacterial , Genome, Human , Genomics , Databases, Genetic , Genetic Variation , Genome, Archaeal , Genomics/methods , Genomics/standards , High-Throughput Nucleotide Sequencing , Humans , Open Reading Frames , Repetitive Sequences, Nucleic Acid
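The contamination abstract above suggests that filtering out small contigs from draft assemblies may remove most contaminants while keeping nearly all genuine sequence. A minimal sketch of such a filter follows. The abstract gives no length cutoff, so `min_length=1000` is purely an illustrative assumption, as are the function names and the example contigs.

```python
def filter_short_contigs(contigs, min_length=1000):
    """Keep only contigs whose sequence length is at least min_length.

    contigs maps contig name -> sequence string. The default threshold is a
    hypothetical value chosen for illustration, not one taken from the paper.
    """
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


def retained_fraction(original, kept):
    """Fraction of all assembled bases that survive the filter."""
    total = sum(len(seq) for seq in original.values())
    return sum(len(seq) for seq in kept.values()) / total if total else 0.0


# Hypothetical draft assembly: two long contigs and one short one.
contigs = {"c1": "A" * 5000, "c2": "C" * 300, "c3": "G" * 1200}

kept = filter_short_contigs(contigs)
fraction = retained_fraction(contigs, kept)
print(sorted(kept), round(fraction, 3))
```

Reporting the retained fraction alongside the filtered set makes it easy to check how much sequence a given cutoff discards.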
11.
Bioinformatics ; 37(12): 1639-1643, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33320174

ABSTRACT

MOTIVATION: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. RESULTS: One strategy to annotate new or improved genome assemblies is to map or 'lift over' the genes from a previously annotated reference genome. Here, we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. AVAILABILITY AND IMPLEMENTATION: Liftoff can be installed via bioconda and PyPI. In addition, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

12.
PLoS Comput Biol ; 17(2): e1008727, 2021 02.
Article in English | MEDLINE | ID: mdl-33635857

ABSTRACT

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.


Subject(s)
Gene Expression Profiling , Genome, Bacterial , Genome, Microbial/genetics , Genomics/methods , Prokaryotic Cells , Algorithms , Computational Biology , Computer Simulation , Genome , Genome, Archaeal , Molecular Sequence Annotation , Open Reading Frames , Programming Languages , Protein Biosynthesis , Software
13.
Plant J ; 104(2): 365-376, 2020 10.
Article in English | MEDLINE | ID: mdl-32654344

ABSTRACT

The genomic architecture and molecular mechanisms controlling variation in quantitative disease resistance loci are not well understood in plant species and have been barely studied in long-generation trees. Quantitative trait loci mapping and genome-wide association studies were combined to test a large single nucleotide polymorphism (SNP) set for association with quantitative and qualitative white pine blister rust resistance in sugar pine. In the absence of a chromosome-scale reference genome, a high-density consensus linkage map was generated to obtain locations for associated SNPs. Newly discovered associations for white pine blister rust quantitative disease resistance included 453 SNPs involved in wide biological functions, including genes associated with disease resistance and others involved in morphological and developmental processes. In addition, NBS-LRR pathogen recognition genes were found to be involved in quantitative disease resistance, suggesting these newly reported genes are qualitative genes with partial resistance, they are the result of defeated qualitative resistance due to avirulent races, or they have epistatic effects on qualitative disease resistance genes. This study is a step forward in our understanding of the complex genomic architecture of quantitative disease resistance in long-generation trees, and constitutes the first step towards marker-assisted disease resistance breeding in white pine species.


Subject(s)
Basidiomycota/physiology , Disease Resistance/genetics , Pinus/genetics , Pinus/microbiology , Chromosome Mapping , Genes, Plant , Genetics, Population , Genome, Plant , Genome-Wide Association Study , Phenotype , Plant Diseases/microbiology , Polymorphism, Single Nucleotide , Quantitative Trait Loci
14.
Brief Bioinform ; 20(4): 1125-1136, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29028872

ABSTRACT

Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.


Subject(s)
Databases, Genetic , Metagenomics/methods , Algorithms , Computational Biology/methods , Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Genetic Markers , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenome , Metagenomics/statistics & numerical data , Microbiota/genetics , Phylogeny , Sequence Alignment/statistics & numerical data
15.
Bioinformatics ; 36(4): 1303-1304, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31553437

ABSTRACT

SUMMARY: Pavian is a web application for exploring classification results from metagenomics experiments. With Pavian, researchers can analyze, visualize and transform results from various classifiers, such as Kraken, Centrifuge and MetaPhlAn, using interactive data tables, heatmaps and Sankey flow diagrams. An interactive alignment coverage viewer can help in the validation of matches to a particular genome, which can be crucial when using metagenomics experiments for pathogen detection. AVAILABILITY AND IMPLEMENTATION: Pavian is implemented in the R language as a modular Shiny web app and is freely available under GPL-3 from http://github.com/fbreitwieser/pavian.


Subject(s)
Metagenomics , Microbiota , Data Interpretation, Statistical , Software
16.
PLoS Comput Biol ; 16(12): e1008439, 2020 12.
Article in English | MEDLINE | ID: mdl-33275607

ABSTRACT

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI's RefSeq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app (https://jenniferlu717.shinyapps.io/SkewIT/) that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the RefSeq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.


Subject(s)
Cytosine/chemistry , Genome, Bacterial , Guanine/chemistry , DNA Replication , DNA, Bacterial/chemistry , Mutation
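The SkewIT abstract above builds on GC skew, the imbalance between guanine and cytosine along a strand. As a rough illustration of the underlying quantity only, the sketch below computes the conventional windowed skew, (G - C) / (G + C). It does not compute the SkewI metric, which the abstract does not define; the function name, window size, and example sequence are assumptions made for this example.

```python
def gc_skew_windows(seq, window):
    """Return (G - C) / (G + C) for each non-overlapping window of the sequence.

    Windows containing no G or C get a skew of 0.0; a trailing partial window
    is ignored.
    """
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Hypothetical sequence: a G-rich window, a C-rich window, then an AT-only window.
sequence = "GGGCAT" + "CCCGAT" + "ATATAT"
skews = gc_skew_windows(sequence, window=6)
print(skews)
```

In a real bacterial chromosome the sign of the windowed skew tends to flip near the replication origin and terminus, which is what makes the pattern informative for spotting mis-assemblies.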
17.
PLoS Comput Biol ; 16(6): e1007981, 2020 06.
Article in English | MEDLINE | ID: mdl-32589667

ABSTRACT

The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.


Subject(s)
Genome, Bacterial , Algorithms , Biopolymers/genetics , Sequence Analysis, DNA/methods
18.
Nature ; 517(7534): 381-5, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25561180

ABSTRACT

Despite antiretroviral therapy (ART), human immunodeficiency virus (HIV)-1 persists in a stable latent reservoir, primarily in resting memory CD4+ T cells. This reservoir presents a major barrier to the cure of HIV-1 infection. To purge the reservoir, pharmacological reactivation of latent HIV-1 has been proposed and tested both in vitro and in vivo. A key remaining question is whether virus-specific immune mechanisms, including cytotoxic T lymphocytes (CTLs), can clear infected cells in ART-treated patients after latency is reversed. Here we show that there is a striking all-or-none pattern for CTL escape mutations in HIV-1 Gag epitopes. Unless ART is started early, the vast majority (>98%) of latent viruses carry CTL escape mutations that render infected cells insensitive to CTLs directed at common epitopes. To solve this problem, we identified CTLs that could recognize epitopes from latent HIV-1 that were unmutated in every chronically infected patient tested. Upon stimulation, these CTLs eliminated target cells infected with autologous virus derived from the latent reservoir, both in vitro and in patient-derived humanized mice. The predominance of CTL-resistant viruses in the latent reservoir poses a major challenge to viral eradication. Our results demonstrate that chronically infected patients retain a broad-spectrum viral-specific CTL response and that appropriate boosting of this response may be required for the elimination of the latent reservoir.


Subject(s)
Genes, Dominant/genetics , Genes, Viral/genetics , HIV-1/genetics , HIV-1/immunology , Mutation/genetics , T-Lymphocytes, Cytotoxic/immunology , Virus Latency/immunology , Acute Disease/therapy , Animals , Anti-HIV Agents/administration & dosage , Anti-HIV Agents/pharmacology , Anti-HIV Agents/therapeutic use , CD4-Positive T-Lymphocytes/cytology , CD4-Positive T-Lymphocytes/immunology , CD4-Positive T-Lymphocytes/virology , Chronic Disease/drug therapy , Epitopes, T-Lymphocyte/genetics , Epitopes, T-Lymphocyte/immunology , Female , HIV Infections/blood , HIV Infections/drug therapy , HIV Infections/immunology , HIV Infections/virology , HIV-1/drug effects , HIV-1/growth & development , Humans , Male , Mice , RNA, Viral/blood , Viral Load/drug effects , Virus Latency/genetics , Virus Replication/immunology , gag Gene Products, Human Immunodeficiency Virus/genetics , gag Gene Products, Human Immunodeficiency Virus/immunology
20.
Genome Res ; 27(5): 787-792, 2017 05.
Article in English | MEDLINE | ID: mdl-28130360

ABSTRACT

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use these data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.


Subject(s)
Contig Mapping/methods , Genome, Plant , Genomics/methods , Poaceae/genetics , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA/methods , Software , Contig Mapping/standards , Genome Size , Genomics/standards , Sequence Analysis, DNA/standards
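The mega-reads abstract above reports contiguity as an N50 contig size. For readers unfamiliar with the statistic, here is a small sketch of the usual definition: the largest length L such that contigs of length at least L together contain at least half of the assembled bases. The function name and example lengths are hypothetical and unrelated to the Ae. tauschii assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig with non-zero length")


# Hypothetical contig lengths totalling 290 bases.
example_lengths = [80, 70, 50, 40, 30, 20]
result = n50(example_lengths)
print(result)
```

Here the two longest contigs (80 + 70 = 150) already cover more than half of the 290 total bases, so the N50 is the length of the second one.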