ABSTRACT
Ensembl (https://www.ensembl.org) is a freely available genomic resource that has produced high-quality annotations, tools, and services for vertebrates and model organisms for more than two decades. In recent years, there has been a dramatic shift in the genomic landscape, with a large increase in the number and phylogenetic breadth of high-quality reference genomes, alongside major advances in the pan-genome representations of higher species. In order to support these efforts and accelerate downstream research, Ensembl continues to focus on scaling for the rapid annotation of new genome assemblies, developing new methods for comparative analysis, and expanding the depth and quality of our genome annotations. This year we have continued our expansion to support global biodiversity research, doubling the number of annotated genomes we support on our Rapid Release site to over 1700, driven by our close collaboration with biodiversity projects such as Darwin Tree of Life. We have also strengthened support for key agricultural species, including the first regulatory builds for farmed animals, and have updated key tools and resources that support the global scientific community, notably the Ensembl Variant Effect Predictor. Ensembl data, software, and tools are freely available.
Subject(s)
Genetic Databases , Genomics , Animals , Genome , Molecular Sequence Annotation , Phylogeny , Software , Humans
ABSTRACT
BACKGROUND: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation in genetic studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The Genome Analysis Toolkit (GATK) is one of the most widely used publicly available SNP calling software tools, but high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms, from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy, with results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable, as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences.
Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.
Subject(s)
Plant Genome , Single Nucleotide Polymorphism , Workflow , Plant Breeding , Software , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis complementing vertebrate resources developed by the Ensembl project (https://www.ensembl.org). The two resources collectively present genome annotation through a consistent set of interfaces spanning the tree of life presenting genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception creating one of the world's most comprehensive genomic resources and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource and our new AlphaFold visualization. Finally, we present details of our future plans including updates on our integration with Ensembl, and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.
Subject(s)
Genetic Databases , Genomics , Internet , Software , Animals , Computational Biology , Bacterial Genome/genetics , Fungal Genome/genetics , Plant Genome/genetics , Plants/classification , Plants/genetics , Vertebrates/classification , Vertebrates/genetics
ABSTRACT
The genetic basis of general plant vigor is of major interest to food producers, yet the trait is recalcitrant to genetic mapping because of the number of loci involved, their small effects, and linkage. Observations of heterosis in many crops suggest that recessive, malfunctioning versions of genes are a major cause of poor performance, yet we have little information on the mutational spectrum underlying these disruptions. To address this question, we generated a long-read assembly of a tropical japonica rice (Oryza sativa) variety, Carolina Gold, which allowed us to identify structural mutations (>50 bp) and orient them with respect to their ancestral state using the outgroup, Oryza glaberrima. Supporting prior work, we find substantial genome expansion in the sativa branch. While transposable elements (TEs) account for the largest share of size variation, the majority of events are not directly TE-mediated. Tandem duplications are the most common source of insertions and are highly enriched among 50-200 bp mutations. To explore the relative impact of various mutational classes on crop fitness, we then track these structural events over the last century of US rice improvement using 101 resequenced varieties. Within this material, a pattern of temporary hybridization between medium and long-grain varieties was followed by recent divergence. During this long-term selection, structural mutations that impact gene exons have been removed at a greater rate than intronic indels and single-nucleotide mutations. These results support the use of ab initio estimates of mutational burden, based on structural data, as an orthogonal predictor in genomic selection.
Subject(s)
Plant Genes , Mutation , Oryza/genetics , Plant Breeding , Genetic Selection , Agricultural Crops/genetics , DNA Repair , DNA Transposable Elements , Environment , Gene-Environment Interaction , Plant Genome , Genetic Hybridization , INDEL Mutation , Seeds/genetics
ABSTRACT
Numerous staple crops exhibit polyploidy and are difficult to genetically modify. However, recent advances in genome sequencing and editing have enabled polyploid genome engineering. The hexaploid black nightshade species Solanum nigrum has immense potential as a beneficial food supplement. We assembled its genome at the scaffold level. After functional annotations, we identified homoeologous gene sets, with similar sequence and expression profiles, based on comparative analyses of orthologous genes with close diploid relatives Solanum americanum and S. lycopersicum. Using CRISPR-Cas9-mediated mutagenesis, we generated various mutation combinations in homoeologous genes. Multiple mutants showed quantitative phenotypic changes based on the genotype, resulting in a broad-spectrum effect on the quantitative traits of hexaploid S. nigrum. Furthermore, we successfully improved the fruit productivity of Boranong, an orphan cultivar of S. nigrum, suggesting that engineering homoeologous genes could be useful for agricultural improvement of polyploid crops.
Subject(s)
Agricultural Crops , Polyploidy , Base Sequence , Chromosome Mapping/methods , Mutation , Phenotype , Agricultural Crops/genetics , Plant Genome/genetics , Gene Editing
ABSTRACT
Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR data principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting gene function information from publications. The current release, #63 (October 2020), hosts 93 reference genomes with over 3.9 million genes in 122 947 families with orthologous and paralogous classifications. Plant Reactome portrays pathway networks using a combination of manual biocuration in rice (320 reference pathways) and orthology-based projections to 106 species. The Reactome platform facilitates comparison between reference and projected pathways, gene expression analyses and overlays of gene-gene interactions. Gramene integrates ontology-based protein structure-function annotation; information on genetic, epigenetic, expression, and phenotypic diversity; and gene functional annotations extracted from plant-focused journals using DIVE. We train plant researchers in biocuration of genes and pathways; host curated maize gene structures as tracks in the maize genome browser; and integrate curated rice genes and pathways in the Plant Reactome.
Subject(s)
Genetic Databases , Plant Gene Expression Regulation , Plant Genome , Genomics/methods , Plant Proteins/genetics , Plants/genetics , Agricultural Crops , DNA Transposable Elements , Gene Duplication , Gene Ontology , Gene Regulatory Networks , Internet , Knowledge Bases , Metabolic Networks and Pathways , Molecular Sequence Annotation , Oryza/genetics , Oryza/metabolism , Plant Proteins/metabolism , Plants/classification , Plants/metabolism , Polyploidy , Protein Interaction Mapping , Software , Zea mays/genetics , Zea mays/metabolism
ABSTRACT
MAIN CONCLUSION: SorghumBase provides a community portal that integrates genetic, genomic, and breeding resources for sorghum germplasm improvement. Public research and development in agriculture rely on proper data and resource sharing within stakeholder communities. For plant breeders, agronomists, molecular biologists, geneticists, and bioinformaticians, centralizing desirable data into a user-friendly hub for crop systems is essential for successful collaborations and breakthroughs in germplasm development. Here, we present the SorghumBase web portal (https://www.sorghumbase.org), a resource for the sorghum research community. SorghumBase hosts a wide range of sorghum genomic information in a modular framework, built with open-source software, to provide a sustainable platform. This initial release of SorghumBase includes: (1) five sorghum reference genome assemblies in a pan-genome browser; (2) genetic variant information for natural diversity panels and ethyl methanesulfonate (EMS)-induced mutant populations; (3) search interface and integrated views of various data types; (4) links supporting interconnectivity with other repositories including genebank, QTL, and gene expression databases; and (5) a content management system to support access to community news and training materials. SorghumBase offers sorghum investigators improved data collation and access that will facilitate the growth of a robust research community to support genomics-assisted breeding.
Subject(s)
Sorghum , Genetic Databases , Edible Grain , Plant Genome/genetics , Genomics , Internet , Plant Breeding , Sorghum/genetics
ABSTRACT
Gramene (http://www.gramene.org) is a knowledgebase for comparative functional analysis in major crops and model plant species. The current release, #54, includes over 1.7 million genes from 44 reference genomes, most of which were organized into 62,367 gene families through orthologous and paralogous gene classification, whole-genome alignments, and synteny. Additional gene annotations include ontology-based protein structure and function; genetic, epigenetic, and phenotypic diversity; and pathway associations. Gramene's Plant Reactome provides a knowledgebase of cellular-level plant pathway networks. Specifically, it uses curated rice reference pathways to derive pathway projections for an additional 66 species based on gene orthology, and facilitates display of gene expression, gene-gene interactions, and user-defined omics data in the context of these pathways. As a community portal, Gramene integrates best-of-class software and infrastructure components including the Ensembl genome browser, Reactome pathway browser, and Expression Atlas widgets, and undergoes periodic data and software upgrades. Via powerful, intuitive search interfaces, users can easily query across various portals and interactively analyze search results by clicking on diverse features such as genomic context, highly augmented gene trees, gene expression anatomograms, associated pathways, and external informatics resources. All data in Gramene are accessible through both visual and programmatic interfaces.
Subject(s)
Genetic Databases , Plant Gene Expression Regulation , Genomics/methods , Knowledge Bases , Plants/genetics , Genetic Epigenesis , Gene Ontology , Genetic Research , Genetic Variation , Plant Genome , Metabolic Networks and Pathways/genetics , Molecular Sequence Annotation , Plants/metabolism , Software , User-Computer Interface
ABSTRACT
Gramene (http://www.gramene.org) is an online resource for comparative functional genomics in crops and model plant species. Its two main frameworks are genomes (collaboration with Ensembl Plants) and pathways (The Plant Reactome and archival BioCyc databases). Since our last NAR update, the database website adopted a new Drupal management platform. The genomes section features 39 fully assembled reference genomes that are integrated using ontology-based annotation and comparative analyses, and accessed through both visual and programmatic interfaces. Additional community data, such as genetic variation, expression and methylation, are also mapped for a subset of genomes. The Plant Reactome pathway portal (http://plantreactome.gramene.org) provides a reference resource for analyzing plant metabolic and regulatory pathways. In addition to ~200 curated rice reference pathways, the portal hosts gene homology-based pathway projections for 33 plant species. Both the genome and pathway browsers interface with the EMBL-EBI's Expression Atlas to enable the projection of baseline and differential expression data from curated expression studies in plants. Gramene's archive website (http://archive.gramene.org) continues to provide previously reported resources on comparative maps, markers and QTL. To further aid our users, we have also introduced a live monthly educational webinar series and a Gramene YouTube channel carrying video tutorials.
Subject(s)
Genetic Databases , Plant Genome , Plants/metabolism , Gene Expression , Genetic Variation , Genomics , Internet , Metabolic Networks and Pathways , Molecular Sequence Annotation , Plants/genetics
ABSTRACT
Sorghum bicolor (L.) Moench is a significant grass crop globally, known for its genetic diversity. High-quality genome sequences are needed to capture this diversity. We constructed high-quality, chromosome-level genome assemblies for two vital sorghum inbred lines, Tx2783 and RTx436. Using advanced single-molecule techniques, long-read sequencing, and optical maps, we improved average sequence continuity 19-fold and 11-fold relative to the existing BTx623 v3.0 reference genome and obtained 19 and 18 scaffolds (N50 of 25.6 and 14.4) for Tx2783 and RTx436, respectively. Our gene annotation efforts resulted in 29 612 protein-coding genes for the Tx2783 genome and 29 265 protein-coding genes for the RTx436 genome. Comparative analyses with 26 plant genomes, which included 18 sorghum genomes and 8 outgroup species, identified around 31 210 protein-coding gene families, with about 13 956 specific to sorghum. Using representative models from gene trees across the 18 sorghum genomes, a total of 72 579 pan-genes were identified, with 14% core, 60% softcore and 26% shell genes. We identified 99 genes in Tx2783 and 107 genes in RTx436 that showed functional enrichment specifically in binding and metabolic processes, as revealed by the GO enrichment Pearson Chi-Square test. We detected 36 potential large inversions in the comparison between the BTx623 Bionano map and the BTx623 v3.1 reference sequence. Strikingly, these inversions were notably absent when comparing Tx2783 or RTx436 with the BTx623 Bionano map. These inversions were mostly in the pericentromeric region, which is known to contain low-complexity regions and to be harder to assemble, suggesting the presence of potential artifacts in the public BTx623 reference assembly. Furthermore, in comparison to Tx2783, RTx436 exhibited 324 883 additional single nucleotide polymorphisms (SNPs) and 16 506 more insertions/deletions (INDELs) when using BTx623 as the reference genome.
We also characterized approximately 348 nucleotide-binding leucine-rich repeat (NLR) disease resistance genes in the two genomes. These high-quality genomes serve as valuable resources for discovering agronomic traits and structural variation studies.
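The core, softcore, and shell pan-gene categories reported above are usually assigned from the number of genomes in which each pan-gene is found. The abstract does not state the cutoffs used, so the sketch below is only an illustration under assumed thresholds: "core" means present in every genome, "softcore" means present in at least a chosen fraction of genomes (the `softcore_fraction` parameter, a hypothetical name and default), and everything else is "shell".

```python
from typing import Dict, Set


def classify_pan_genes(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    softcore_fraction: float = 0.9,  # assumed cutoff, not taken from the abstract
) -> Dict[str, str]:
    """Label each pan-gene as core, softcore, or shell by genome occupancy."""
    categories: Dict[str, str] = {}
    for gene_id, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            categories[gene_id] = "core"       # found in every genome
        elif count >= softcore_fraction * n_genomes:
            categories[gene_id] = "softcore"   # found in most genomes
        else:
            categories[gene_id] = "shell"      # found in only some genomes
    return categories


# Toy presence table with hypothetical pan-gene and genome names.
toy_presence = {
    "panA": {"g1", "g2", "g3", "g4"},
    "panB": {"g1", "g2", "g3"},
    "panC": {"g2"},
}
toy_labels = classify_pan_genes(toy_presence, n_genomes=4, softcore_fraction=0.75)
```

With real data, the presence table would come from the gene-tree-based pan-gene set, and the cutoff should be taken from the full paper rather than from this sketch.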
ABSTRACT
Whole-genome sequencing and assembly have revolutionized plant genetics and molecular biology over the last two decades. However, significant shortcomings in first- and second-generation technology resulted in imperfect reference genomes: numerous and large gaps of low quality or undeterminable sequence in areas of highly repetitive DNA along with limited chromosomal phasing restricted the ability of researchers to characterize regulatory noncoding elements and genic regions that underwent recent duplication events. Recently, advances in long-read sequencing have resulted in the first gapless, telomere-to-telomere (T2T) assemblies of plant genomes. This leap forward has the potential to increase the speed and confidence of genomics and molecular experimentation while reducing costs for the research community.
Subject(s)
Genomics , Plant Breeding , DNA Sequence Analysis/methods , Genomics/methods , Plant Genome/genetics , Plants/genetics , High-Throughput Nucleotide Sequencing/methods , Technology
ABSTRACT
African rice (Oryza glaberrima Steud), a short-day cereal crop closely related to Asian rice (Oryza sativa L.), has been cultivated in Sub-Saharan Africa for ~3,000 years. Although less cultivated globally, it is a valuable genetic resource in creating high-yielding cultivars that are better adapted to diverse biotic and abiotic stresses. While inflorescence architecture, a key trait for rice grain yield improvement, has been extensively studied in Asian rice, the morphological and genetic determinants of this complex trait are less understood in African rice. In this study, using a previously developed association panel of 162 O. glaberrima accessions and new SNP variants characterized through mapping to a new version of the O. glaberrima reference genome, we conducted a genome-wide association study of four major morphological panicle traits. We have found a total of 41 stable genomic regions that are significantly associated with these traits, of which 13 co-localized with previously identified QTLs in O. sativa populations and 28 were unique for this association panel. Additionally, we found a genomic region of interest on chromosome 3 that was associated with the number of spikelets and primary and secondary branches. Within this region was localized the O. sativa ortholog of the PHYTOCHROME B gene (Oglab_006903/OgPHYB). Haplotype analysis revealed the occurrence of natural sequence variants at the OgPHYB locus associated with panicle architecture variation through modulation of the flowering time phenotype, whereas no equivalent alleles were found in O. sativa. The identification in this study of genomic regions specific to O. glaberrima indicates panicle-related intra-specific genetic variation in this species, increasing our understanding of the underlying molecular processes governing panicle architecture. Identified candidate genes and major haplotypes may facilitate the breeding of new African rice cultivars with preferred panicle traits.
Subject(s)
Oryza , Oryza/genetics , Genome-Wide Association Study , Alleles , Plant Breeding , Quantitative Trait Loci , Edible Grain/genetics
ABSTRACT
Understanding and exploiting genetic diversity is a key factor for the productive and stable production of rice. Here, we utilize 73 high-quality genomes that encompass the subpopulation structure of Asian rice (Oryza sativa), plus the genomes of two wild relatives (O. rufipogon and O. punctata), to build a pan-genome inversion index of 1769 non-redundant inversions that span an average of ~29% of the O. sativa cv. Nipponbare reference genome sequence. Using this index, we estimate an inversion rate of ~700 inversions per million years in Asian rice, which is 16 to 50 times higher than previously estimated for plants. Detailed analyses of these inversions show evidence of their effects on gene expression, recombination rate, and linkage disequilibrium. Our study uncovers the prevalence and scale of large inversions (≥100 bp) across the pan-genome of Asian rice and hints at their largely unexplored role in functional biology and crop performance.
Subject(s)
Oryza , Oryza/genetics , DNA Sequence Analysis , Plant Genome/genetics , Biological Evolution , Phylogeny
ABSTRACT
Efficient acquisition and use of available phosphorus from the soil is crucial for plant growth, development, and yield. With an ever-increasing acreage of croplands with suboptimal available soil phosphorus, genetic improvement of sorghum germplasm for enhanced phosphorus acquisition from soil is crucial to increasing agricultural output and reducing inputs, while confronted with a growing world population and uncertain climate. Sorghum bicolor is a globally important commodity for food, fodder, and forage. Known for robust tolerance to heat, drought, and other abiotic stresses, its capacity for optimal phosphorus use efficiency (PUE) is still being investigated for optimized root system architectures (RSA). Whilst a few RSA-influencing genes have been identified in sorghum and other grasses, the epigenetic impact on expression and tissue-specific activation of candidate PUE genes remains elusive. Here, we present transcriptomic, epigenetic, and regulatory network profiling of RSA modulation in the BTx623 sorghum background in response to limiting phosphorus (LP) conditions. We show that during LP, sorghum RSA is remodeled to increase root length and surface area, likely enhancing its ability to acquire P. Global DNA 5-methylcytosine and H3K4 and H3K27 trimethylation levels decrease in response to LP, while H3K4me3 peaks and DNA hypomethylated regions contain recognition motifs of numerous developmental and nutrient responsive transcription factors that display disparate expression patterns between different root tissues (primary root apex, elongation zone, and lateral root apex).
ABSTRACT
Introduction: Sorghum (Sorghum bicolor (L.) Moench) is an agriculturally and economically important staple crop that has immense potential as a bioenergy feedstock due to its relatively high productivity on marginal lands. To capitalize on and further improve sorghum as a potential source of sustainable biofuel, it is essential to understand the genomic mechanisms underlying complex traits related to yield, composition, and environmental adaptations. Methods: Expanding on a recently developed mapping population, we generated de novo genome assemblies for 10 parental genotypes from this population and identified a comprehensive set of over 24 thousand large structural variants (SVs) and over 10.5 million single nucleotide polymorphisms (SNPs). Results: We show that SVs and nonsynonymous SNPs are enriched in different gene categories, emphasizing the need for long read sequencing in crop species to identify novel variation. Furthermore, we highlight SVs and SNPs occurring in genes and pathways with known associations to critical bioenergy-related phenotypes and characterize the landscape of genetic differences between sweet and cellulosic genotypes. Discussion: These resources can be integrated into both ongoing and future mapping and trait discovery for sorghum and its myriad uses including food, feed, bioenergy, and increasingly as a carbon dioxide removal mechanism.
ABSTRACT
We report de novo genome assemblies, transcriptomes, annotations, and methylomes for the 26 inbreds that serve as the founders for the maize nested association mapping population. The number of pan-genes in these diverse genomes exceeds 103,000, with approximately a third found across all genotypes. The results demonstrate that the ancient tetraploid character of maize continues to degrade by fractionation to the present day. Excellent contiguity over repeat arrays and complete annotation of centromeres revealed additional variation in major cytological landmarks. We show that combining structural variation with single-nucleotide polymorphisms can improve the power of quantitative mapping studies. We also document variation at the level of DNA methylation and demonstrate that unmethylated regions are enriched for cis-regulatory elements that contribute to phenotypic variation.
Subject(s)
Plant Genome , Molecular Sequence Annotation , Zea mays/genetics , Centromere/genetics , Chromosome Mapping , Plant Chromosomes , DNA Methylation , Disease Resistance/genetics , Plant Genes , Genetic Variation , Genotype , High-Throughput Nucleotide Sequencing , Multifactorial Inheritance/genetics , Phenotype , Plant Diseases , Single Nucleotide Polymorphism , Nucleic Acid Regulatory Sequences , DNA Sequence Analysis , Tetraploidy , Transcriptome , Whole Genome Sequencing
ABSTRACT
Haplotype phasing maize genetic variants is important for genome interpretation, population genetic analysis and functional analysis of allelic activity. We performed an isoform-level phasing study using two maize inbred lines and their reciprocal crosses, based on single-molecule, full-length cDNA sequencing. To phase and analyze transcripts between hybrids and parents, we developed IsoPhase. Using this tool, we validated the majority of SNPs called against matching short-read data from embryo, endosperm and root tissues, and identified allele-specific, gene-level and isoform-level differential expression between the inbred parental lines and hybrid offspring. After phasing 6907 genes in the reciprocal hybrids, we annotated the SNPs and identified large-effect genes. In addition, we identified parent-of-origin isoforms, distinct novel isoforms in maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase accuracy in studies of allelic expression.
Subject(s)
RNA Sequence Analysis/methods , Zea mays/genetics , Alleles , Endosperm/genetics , Gene Expression Profiling/methods , Plant Gene Expression Regulation , Plant Genes , Plant Genome , Haplotypes , Mutation , Plant Proteins/genetics , Genetically Modified Plants , Messenger RNA/analysis , Messenger RNA/genetics , Zea mays/physiology
ABSTRACT
Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We use two independent assemblies and an optical map-based merging pipeline to produce a maize genome (B73-Ab10) composed of 63 contigs and a contig N50 of 162 Mb. This genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), and 53 Mb of the Ab10 meiotic drive haplotype. The data also reveal the internal structure of seven centromeres and five heterochromatic knobs, showing that the major tandem repeat arrays (CentC, knob180, and TR-1) are discontinuous and frequently interspersed with retroelements.
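The contig N50 quoted above is a standard contiguity statistic: the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal Python sketch of that calculation follows. The function name `n50` and the example lengths are illustrative assumptions and are not taken from the B73-Ab10 assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (arbitrary units), for illustration only.
example_lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2]
example_n50 = n50(example_lengths)
```

The same function applies to read sets, for example when comparing subread N50 values across sequencing runs.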
Subject(s)
Plant Chromosomes , Plant Genome , Genomics/methods , Physical Chromosome Mapping/methods , Zea mays/genetics
ABSTRACT
Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20× to 75× genomic depth and with N50 subread lengths of 11-21 kb. Assemblies with ≤30× depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20× depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Inbreeding , Zea mays/genetics , Base Sequence , DNA Transposable Elements/genetics , Plant Genome , Nucleic Acid Repetitive Sequences/genetics
ABSTRACT
BACKGROUND: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. RESULTS: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. CONCLUSIONS: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
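The six performance metrics listed in the RESULTS above can all be derived from the four cells of a confusion matrix. The sketch below shows the standard definitions. It assumes that true/false positives and negatives have already been counted against a curated annotation (for example, as numbers of base pairs); the function name `annotation_metrics` is hypothetical, and this is not the benchmarking code distributed with EDTA.

```python
from typing import Dict


def annotation_metrics(tp: float, fp: float, tn: float, fn: float) -> Dict[str, float]:
    """Compute standard benchmarking metrics from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),            # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),            # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),              # positive predictive value
        "fdr": ratio(fp, tp + fp),                    # false discovery rate = 1 - precision
        "f1": ratio(2 * tp, 2 * tp + fp + fn),        # harmonic mean of precision and sensitivity
    }


# Hypothetical counts, for illustration only.
example = annotation_metrics(tp=80, fp=20, tn=880, fn=20)
```

Reporting precision and FDR together is redundant in a strict sense, since they sum to one, but listing both matches the set of metrics named in the abstract.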