Results 1 - 20 of 27
1.
BMC Biol ; 22(1): 13, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273258

ABSTRACT

BACKGROUND: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation in genetic studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The Genome Analysis Toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy, with results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.


Subjects
Genome, Plant; Polymorphism, Single Nucleotide; Workflow; Plant Breeding; Software; High-Throughput Nucleotide Sequencing/methods
2.
Nucleic Acids Res ; 52(D1): D891-D899, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953337

ABSTRACT

Ensembl (https://www.ensembl.org) is a freely available genomic resource that has produced high-quality annotations, tools, and services for vertebrates and model organisms for more than two decades. In recent years, there has been a dramatic shift in the genomic landscape, with a large increase in the number and phylogenetic breadth of high-quality reference genomes, alongside major advances in the pan-genome representations of higher species. In order to support these efforts and accelerate downstream research, Ensembl continues to focus on scaling for the rapid annotation of new genome assemblies, developing new methods for comparative analysis, and expanding the depth and quality of our genome annotations. This year we have continued our expansion to support global biodiversity research, doubling the number of annotated genomes we support on our Rapid Release site to over 1700, driven by our close collaboration with biodiversity projects such as Darwin Tree of Life. We have also strengthened support for key agricultural species, including the first regulatory builds for farmed animals, and have updated key tools and resources that support the global scientific community, notably the Ensembl Variant Effect Predictor. Ensembl data, software, and tools are freely available.


Subjects
Databases, Genetic; Genomics; Animals; Genome; Molecular Sequence Annotation; Phylogeny; Software; Humans
3.
Plant Biotechnol J ; 21(12): 2458-2472, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37530518

ABSTRACT

Numerous staple crops exhibit polyploidy and are difficult to genetically modify. However, recent advances in genome sequencing and editing have enabled polyploid genome engineering. The hexaploid black nightshade species Solanum nigrum has immense potential as a beneficial food supplement. We assembled its genome at the scaffold level. After functional annotations, we identified homoeologous gene sets, with similar sequence and expression profiles, based on comparative analyses of orthologous genes with close diploid relatives Solanum americanum and S. lycopersicum. Using CRISPR-Cas9-mediated mutagenesis, we generated various mutation combinations in homoeologous genes. Multiple mutants showed quantitative phenotypic changes based on the genotype, resulting in a broad-spectrum effect on the quantitative traits of hexaploid S. nigrum. Furthermore, we successfully improved the fruit productivity of Boranong, an orphan cultivar of S. nigrum, suggesting that engineering homoeologous genes could be useful for agricultural improvement of polyploid crops.


Subjects
Crops, Agricultural; Polyploidy; Base Sequence; Chromosome Mapping/methods; Mutation; Phenotype; Crops, Agricultural/genetics; Genome, Plant/genetics; Gene Editing
4.
G3 (Bethesda) ; 13(10), 2023 09 30.
Article in English | MEDLINE | ID: mdl-37535690

ABSTRACT

African rice (Oryza glaberrima Steud), a short-day cereal crop closely related to Asian rice (Oryza sativa L.), has been cultivated in Sub-Saharan Africa for ∼ 3,000 years. Although less cultivated globally, it is a valuable genetic resource in creating high-yielding cultivars that are better adapted to diverse biotic and abiotic stresses. While inflorescence architecture, a key trait for rice grain yield improvement, has been extensively studied in Asian rice, the morphological and genetic determinants of this complex trait are less understood in African rice. In this study, using a previously developed association panel of 162 O. glaberrima accessions and new SNP variants characterized through mapping to a new version of the O. glaberrima reference genome, we conducted a genome-wide association study of four major morphological panicle traits. We have found a total of 41 stable genomic regions that are significantly associated with these traits, of which 13 co-localized with previously identified QTLs in O. sativa populations and 28 were unique for this association panel. Additionally, we found a genomic region of interest on chromosome 3 that was associated with the number of spikelets and primary and secondary branches. Within this region was localized the O. sativa ortholog of the PHYTOCHROME B gene (Oglab_006903/OgPHYB). Haplotype analysis revealed the occurrence of natural sequence variants at the OgPHYB locus associated with panicle architecture variation through modulation of the flowering time phenotype, whereas no equivalent alleles were found in O. sativa. The identification in this study of genomic regions specific to O. glaberrima indicates panicle-related intra-specific genetic variation in this species, increasing our understanding of the underlying molecular processes governing panicle architecture. Identified candidate genes and major haplotypes may facilitate the breeding of new African rice cultivars with preferred panicle traits.


Subjects
Oryza; Oryza/genetics; Genome-Wide Association Study; Alleles; Plant Breeding; Quantitative Trait Loci; Edible Grain/genetics
5.
Nat Commun ; 14(1): 1567, 2023 03 21.
Article in English | MEDLINE | ID: mdl-36944612

ABSTRACT

Understanding and exploiting genetic diversity is a key factor for the productive and stable production of rice. Here, we utilize 73 high-quality genomes that encompass the subpopulation structure of Asian rice (Oryza sativa), plus the genomes of two wild relatives (O. rufipogon and O. punctata), to build a pan-genome inversion index of 1769 non-redundant inversions that span an average of ~29% of the O. sativa cv. Nipponbare reference genome sequence. Using this index, we estimate an inversion rate of ~700 inversions per million years in Asian rice, which is 16 to 50 times higher than previously estimated for plants. Detailed analyses of these inversions show evidence of their effects on gene expression, recombination rate, and linkage disequilibrium. Our study uncovers the prevalence and scale of large inversions (≥100 bp) across the pan-genome of Asian rice and hints at their largely unexplored role in functional biology and crop performance.


Subjects
Oryza; Oryza/genetics; Sequence Analysis, DNA; Genome, Plant/genetics; Biological Evolution; Phylogeny
6.
Curr Opin Biotechnol ; 79: 102886, 2023 02.
Article in English | MEDLINE | ID: mdl-36640454

ABSTRACT

Whole-genome sequencing and assembly have revolutionized plant genetics and molecular biology over the last two decades. However, significant shortcomings in first- and second-generation technology resulted in imperfect reference genomes: numerous and large gaps of low quality or undeterminable sequence in areas of highly repetitive DNA along with limited chromosomal phasing restricted the ability of researchers to characterize regulatory noncoding elements and genic regions that underwent recent duplication events. Recently, advances in long-read sequencing have resulted in the first gapless, telomere-to-telomere (T2T) assemblies of plant genomes. This leap forward has the potential to increase the speed and confidence of genomics and molecular experimentation while reducing costs for the research community.


Subjects
Genomics; Plant Breeding; Sequence Analysis, DNA/methods; Genomics/methods; Genome, Plant/genetics; Plants/genetics; High-Throughput Nucleotide Sequencing/methods; Technology
7.
Plant Direct ; 6(5): e393, 2022 May.
Article in English | MEDLINE | ID: mdl-35600998

ABSTRACT

Efficient acquisition and use of available phosphorus from the soil is crucial for plant growth, development, and yield. With an ever-increasing acreage of croplands with suboptimal available soil phosphorus, genetic improvement of sorghum germplasm for enhanced phosphorus acquisition from soil is crucial to increasing agricultural output and reducing inputs, while confronted with a growing world population and uncertain climate. Sorghum bicolor is a globally important commodity for food, fodder, and forage. Known for robust tolerance to heat, drought, and other abiotic stresses, its capacity for optimal phosphorus use efficiency (PUE) is still being investigated for optimized root system architectures (RSA). Whilst a few RSA-influencing genes have been identified in sorghum and other grasses, the epigenetic impact on expression and tissue-specific activation of candidate PUE genes remains elusive. Here, we present transcriptomic, epigenetic, and regulatory network profiling of RSA modulation in the BTx623 sorghum background in response to limiting phosphorus (LP) conditions. We show that during LP, sorghum RSA is remodeled to increase root length and surface area, likely enhancing its ability to acquire P. Global DNA 5-methylcytosine and H3K4 and H3K27 trimethylation levels decrease in response to LP, while H3K4me3 peaks and DNA hypomethylated regions contain recognition motifs of numerous developmental and nutrient responsive transcription factors that display disparate expression patterns between different root tissues (primary root apex, elongation zone, and lateral root apex).

9.
Planta ; 255(2): 35, 2022 Jan 11.
Article in English | MEDLINE | ID: mdl-35015132

ABSTRACT

MAIN CONCLUSION: SorghumBase provides a community portal that integrates genetic, genomic, and breeding resources for sorghum germplasm improvement. Public research and development in agriculture rely on proper data and resource sharing within stakeholder communities. For plant breeders, agronomists, molecular biologists, geneticists, and bioinformaticians, centralizing desirable data into a user-friendly hub for crop systems is essential for successful collaborations and breakthroughs in germplasm development. Here, we present the SorghumBase web portal (https://www.sorghumbase.org), a resource for the sorghum research community. SorghumBase hosts a wide range of sorghum genomic information in a modular framework, built with open-source software, to provide a sustainable platform. This initial release of SorghumBase includes: (1) five sorghum reference genome assemblies in a pan-genome browser; (2) genetic variant information for natural diversity panels and ethyl methanesulfonate (EMS)-induced mutant populations; (3) search interface and integrated views of various data types; (4) links supporting interconnectivity with other repositories including genebank, QTL, and gene expression databases; and (5) a content management system to support access to community news and training materials. SorghumBase offers sorghum investigators improved data collation and access that will facilitate the growth of a robust research community to support genomics-assisted breeding.


Subjects
Sorghum; Databases, Genetic; Edible Grain; Genome, Plant/genetics; Genomics; Internet; Plant Breeding; Sorghum/genetics
10.
Nucleic Acids Res ; 50(D1): D996-D1003, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791415

ABSTRACT

Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis complementing vertebrate resources developed by the Ensembl project (https://www.ensembl.org). The two resources collectively present genome annotation through a consistent set of interfaces spanning the tree of life presenting genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception creating one of the world's most comprehensive genomic resources and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource and our new AlphaFold visualization. Finally, we present details of our future plans including updates on our integration with Ensembl, and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.


Subjects
Databases, Genetic; Genomics; Internet; Software; Animals; Computational Biology; Genome, Bacterial/genetics; Genome, Fungal/genetics; Genome, Plant/genetics; Plants/classification; Plants/genetics; Vertebrates/classification; Vertebrates/genetics
11.
Front Plant Sci ; 13: 1040909, 2022.
Article in English | MEDLINE | ID: mdl-36684744

ABSTRACT

Introduction: Sorghum (Sorghum bicolor (L.) Moench) is an agriculturally and economically important staple crop that has immense potential as a bioenergy feedstock due to its relatively high productivity on marginal lands. To capitalize on and further improve sorghum as a potential source of sustainable biofuel, it is essential to understand the genomic mechanisms underlying complex traits related to yield, composition, and environmental adaptations. Methods: Expanding on a recently developed mapping population, we generated de novo genome assemblies for 10 parental genotypes from this population and identified a comprehensive set of over 24 thousand large structural variants (SVs) and over 10.5 million single nucleotide polymorphisms (SNPs). Results: We show that SVs and nonsynonymous SNPs are enriched in different gene categories, emphasizing the need for long read sequencing in crop species to identify novel variation. Furthermore, we highlight SVs and SNPs occurring in genes and pathways with known associations to critical bioenergy-related phenotypes and characterize the landscape of genetic differences between sweet and cellulosic genotypes. Discussion: These resources can be integrated into both ongoing and future mapping and trait discovery for sorghum and its myriad uses including food, feed, bioenergy, and increasingly as a carbon dioxide removal mechanism.

12.
Science ; 373(6555): 655-662, 2021 08 06.
Article in English | MEDLINE | ID: mdl-34353948

ABSTRACT

We report de novo genome assemblies, transcriptomes, annotations, and methylomes for the 26 inbreds that serve as the founders for the maize nested association mapping population. The number of pan-genes in these diverse genomes exceeds 103,000, with approximately a third found across all genotypes. The results demonstrate that the ancient tetraploid character of maize continues to degrade by fractionation to the present day. Excellent contiguity over repeat arrays and complete annotation of centromeres revealed additional variation in major cytological landmarks. We show that combining structural variation with single-nucleotide polymorphisms can improve the power of quantitative mapping studies. We also document variation at the level of DNA methylation and demonstrate that unmethylated regions are enriched for cis-regulatory elements that contribute to phenotypic variation.


Subjects
Genome, Plant; Molecular Sequence Annotation; Zea mays/genetics; Centromere/genetics; Chromosome Mapping; Chromosomes, Plant; DNA Methylation; Disease Resistance/genetics; Genes, Plant; Genetic Variation; Genotype; High-Throughput Nucleotide Sequencing; Multifactorial Inheritance/genetics; Phenotype; Plant Diseases; Polymorphism, Single Nucleotide; Regulatory Sequences, Nucleic Acid; Sequence Analysis, DNA; Tetraploidy; Transcriptome; Whole Genome Sequencing
13.
PLoS Genet ; 17(3): e1009389, 2021 03.
Article in English | MEDLINE | ID: mdl-33735256

ABSTRACT

The genetic basis of general plant vigor is of major interest to food producers, yet the trait is recalcitrant to genetic mapping because of the number of loci involved, their small effects, and linkage. Observations of heterosis in many crops suggest that recessive, malfunctioning versions of genes are a major cause of poor performance, yet we have little information on the mutational spectrum underlying these disruptions. To address this question, we generated a long-read assembly of a tropical japonica rice (Oryza sativa) variety, Carolina Gold, which allowed us to identify structural mutations (>50 bp) and orient them with respect to their ancestral state using the outgroup, Oryza glaberrima. Supporting prior work, we find substantial genome expansion in the sativa branch. While transposable elements (TEs) account for the largest share of size variation, the majority of events are not directly TE-mediated. Tandem duplications are the most common source of insertions and are highly enriched among 50-200 bp mutations. To explore the relative impact of various mutational classes on crop fitness, we then track these structural events over the last century of US rice improvement using 101 resequenced varieties. Within this material, a pattern of temporary hybridization between medium and long-grain varieties was followed by recent divergence. During this long-term selection, structural mutations that impact gene exons have been removed at a greater rate than intronic indels and single-nucleotide mutations. These results support the use of ab initio estimates of mutational burden, based on structural data, as an orthogonal predictor in genomic selection.


Subjects
Genes, Plant; Mutation; Oryza/genetics; Plant Breeding; Selection, Genetic; Crops, Agricultural/genetics; DNA Repair; DNA Transposable Elements; Environment; Gene-Environment Interaction; Genome, Plant; Hybridization, Genetic; INDEL Mutation; Seeds/genetics
14.
Nucleic Acids Res ; 49(D1): D1452-D1463, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33170273

ABSTRACT

Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR data principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting gene function information from publications. The current release, #63 (October 2020), hosts 93 reference genomes (over 3.9 million genes in 122,947 families with orthologous and paralogous classifications). Plant Reactome portrays pathway networks using a combination of manual biocuration in rice (320 reference pathways) and orthology-based projections to 106 species. The Reactome platform facilitates comparison between reference and projected pathways, gene expression analyses and overlays of gene-gene interactions. Gramene integrates ontology-based protein structure-function annotation; information on genetic, epigenetic, expression, and phenotypic diversity; and gene functional annotations extracted from plant-focused journals using DIVE. We train plant researchers in biocuration of genes and pathways; host curated maize gene structures as tracks in the maize genome browser; and integrate curated rice genes and pathways in the Plant Reactome.


Subjects
Databases, Genetic; Gene Expression Regulation, Plant; Genome, Plant; Genomics/methods; Plant Proteins/genetics; Plants/genetics; Crops, Agricultural; DNA Transposable Elements; Gene Duplication; Gene Ontology; Gene Regulatory Networks; Internet; Knowledge Bases; Metabolic Networks and Pathways; Molecular Sequence Annotation; Oryza/genetics; Oryza/metabolism; Plant Proteins/metabolism; Plants/classification; Plants/metabolism; Polyploidy; Protein Interaction Mapping; Software; Zea mays/genetics; Zea mays/metabolism
15.
Nat Commun ; 11(1): 2288, 2020 05 08.
Article in English | MEDLINE | ID: mdl-32385271

ABSTRACT

Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20× to 75× genomic depth and with N50 subread lengths of 11-21 kb. Assemblies with ≤30× depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20× depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.


Subjects
High-Throughput Nucleotide Sequencing/methods; Inbreeding; Zea mays/genetics; Base Sequence; DNA Transposable Elements/genetics; Genome, Plant; Repetitive Sequences, Nucleic Acid/genetics
16.
Genome Biol ; 21(1): 121, 2020 05 20.
Article in English | MEDLINE | ID: mdl-32434565

ABSTRACT

Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We use two independent assemblies and an optical map-based merging pipeline to produce a maize genome (B73-Ab10) composed of 63 contigs and a contig N50 of 162 Mb. This genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), and 53 Mb of the Ab10 meiotic drive haplotype. The data also reveal the internal structure of seven centromeres and five heterochromatic knobs, showing that the major tandem repeat arrays (CentC, knob180, and TR-1) are discontinuous and frequently interspersed with retroelements.
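
The contig N50 quoted in this abstract is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    length of all contigs.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (arbitrary units).
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

Because a few very long contigs dominate the running total, N50 rewards contiguity rather than contig count, which is why it is commonly reported alongside the number of contigs.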


Subjects
Chromosomes, Plant; Genome, Plant; Genomics/methods; Physical Chromosome Mapping/methods; Zea mays/genetics
17.
Commun Biol ; 3(1): 78, 2020 02 18.
Article in English | MEDLINE | ID: mdl-32071408

ABSTRACT

Haplotype phasing of maize genetic variants is important for genome interpretation, population genetic analysis and functional analysis of allelic activity. We performed an isoform-level phasing study using two maize inbred lines and their reciprocal crosses, based on single-molecule, full-length cDNA sequencing. To phase and analyze transcripts between hybrids and parents, we developed IsoPhase. Using this tool, we validated the majority of SNPs called against matching short-read data from embryo, endosperm and root tissues, and identified allele-specific, gene-level and isoform-level differential expression between the inbred parental lines and hybrid offspring. After phasing 6907 genes in the reciprocal hybrids, we annotated the SNPs and identified large-effect genes. In addition, we identified parent-of-origin isoforms, distinct novel isoforms in maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase accuracy in studies of allelic expression.


Subjects
Sequence Analysis, RNA/methods; Zea mays/genetics; Alleles; Endosperm/genetics; Gene Expression Profiling/methods; Gene Expression Regulation, Plant; Genes, Plant; Genome, Plant; Haplotypes; Mutation; Plant Proteins/genetics; Plants, Genetically Modified; RNA, Messenger/analysis; RNA, Messenger/genetics; Zea mays/physiology
18.
Genome Biol ; 20(1): 275, 2019 12 16.
Article in English | MEDLINE | ID: mdl-31843001

ABSTRACT

BACKGROUND: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. RESULTS: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. CONCLUSIONS: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
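
The six performance metrics named in this abstract can all be derived from the four cells of a confusion matrix. The abstract does not give the formulas, so the sketch below assumes the textbook definitions; the function name `annotation_metrics`, its parameters, and the example counts are hypothetical and are not taken from EDTA or its benchmark.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def annotation_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are true positives, false positives, true negatives and
    false negatives (for example, counted per base or per element; the unit
    is an assumption left to the caller).
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),          # recall
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "fdr": _ratio(fp, tp + fp),                  # 1 - precision
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),      # harmonic mean of precision and recall
    }


# Made-up counts for illustration only.
for name, value in annotation_metrics(tp=80, fp=20, tn=90, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Reporting precision and FDR together is redundant by definition (they sum to one), but listing both matches the set of metrics named in the abstract.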


Subjects
DNA Transposable Elements; Molecular Sequence Annotation/methods; Animals; Benchmarking; Humans; Software
19.
PLoS One ; 14(10): e0224086, 2019.
Article in English | MEDLINE | ID: mdl-31658277

ABSTRACT

The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors, including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100% and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.


Subjects
Data Curation/methods; Plant Proteins/genetics; Zea mays/genetics; Algorithms; Databases, Genetic; Education, Graduate; Humans; Models, Genetic; Molecular Sequence Annotation; Students
20.
Nat Ecol Evol ; 3(4): 679-690, 2019 04.
Article in English | MEDLINE | ID: mdl-30858588

ABSTRACT

New protein-coding genes that arise de novo from non-coding DNA sequences contribute to protein diversity. However, de novo gene origination is challenging to study as it requires high-quality reference genomes for closely related species, evidence for ancestral non-coding sequences, and transcription and translation of the new genes. High-quality genomes of 13 closely related Oryza species provide unprecedented opportunities to understand de novo origination events. Here, we identify a large number of young de novo genes with discernible recent ancestral non-coding sequences and evidence of translation. Using pipelines examining the synteny relationship between genomes and reciprocal-best whole-genome alignments, we detected at least 175 de novo open reading frames in the focal species O. sativa subspecies japonica, which were all detected in RNA sequencing-based transcriptomes. Mass spectrometry-based targeted proteomics and ribosomal profiling show translational evidence for 57% of the de novo genes. In recent divergence of Oryza, an average of 51.5 de novo genes per million years were generated and retained. We observed evolutionary patterns in which excess indels and early transcription were favoured in origination with a stepwise formation of gene structure. These data reveal that de novo genes contribute to the rapid evolution of protein diversity under positive selection.


Subjects
Oryza/genetics; Plant Proteins/genetics; Evolution, Molecular; Open Reading Frames; Phylogeny