ABSTRACT
Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, suggesting that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average accuracy of 63.1% on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making this approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.
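To illustrate how a genomic sequence might be presented to an NLP model, here is a minimal sketch that splits a DNA sequence into overlapping k-mer "words". The abstract does not state which tokenization scheme was used, so the k-mer approach, the value k = 3 and the function name `kmer_tokens` are all assumptions made for illustration.

```python
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of 1.

    Hypothetical helper: the k-mer scheme is an assumption, not the
    tokenization confirmed by the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("ATGCGTA", k=3)
print(tokens)
```

Each token can then be mapped to an integer ID and fed to a sequence model in the same way as words in a sentence.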
ABSTRACT
Many genome annotations include false-positive gene models, leading to errors in phylogenetic and comparative studies. Here, we propose a method to support gene model prediction based on evolutionary conservation and use it to identify potentially erroneous annotations. Using this method, we developed a set of 15,345 representative gene models from 12 legume assemblies that can be used to support genome annotations for other legumes.
Subjects
Fabaceae, Phylogeny
ABSTRACT
SUMMARY: Genome-wide association studies (GWAS) excel at harnessing dense genomic variant datasets to identify candidate regions responsible for producing a given phenotype. However, GWAS and traditional fine-mapping methods do not provide insight into the complex local landscape of linkage that contains and has been shaped by the causal variant(s). Here, we present crosshap, an R package that performs robust density-based clustering of variants based on their linkage profiles to capture haplotype structures in a local genomic region of interest. Following this, crosshap is equipped with visualization tools for choosing optimal clustering parameters (ε) before producing an intuitive figure that provides an overview of the complex relationships between linked variants, haplotype combinations, phenotype, and metadata traits. AVAILABILITY AND IMPLEMENTATION: The crosshap package is freely available under the MIT license and can be downloaded directly from CRAN with R >4.0.0. The development version is available on GitHub alongside issue support (https://github.com/jacobimarsh/crosshap). Tutorial vignettes and documentation are available (https://jacobimarsh.github.io/crosshap/).
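The idea of density-based clustering of variants by linkage can be sketched as follows. This is not the crosshap implementation (crosshap is an R package); it is a small pure-Python illustration that assumes linkage is measured as the squared correlation (r²) between genotype vectors, that distance is 1 - r², and that a DBSCAN-style procedure with a neighbourhood radius `eps` and a `min_pts` threshold is applied. The genotype matrix and all function names are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # monomorphic variant carries no linkage information
    return cov * cov / (vx * vy)


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbours = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbours) < min_pts:
            labels[p] = -1  # provisionally noise
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise point becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = [s for s in range(n) if dist[q][s] <= eps]
            if len(q_neighbours) >= min_pts:
                queue.extend(q_neighbours)  # q is a core point: expand
    return labels


# Hypothetical genotypes: rows are variants, columns are individuals (0/1/2).
genotypes = [
    [0, 0, 1, 2, 2, 1],
    [0, 0, 1, 2, 2, 1],
    [0, 0, 1, 2, 2, 2],
    [2, 0, 2, 0, 2, 0],
    [2, 0, 2, 0, 2, 0],
    [1, 1, 0, 0, 1, 2],
]
dist = [[1.0 - r_squared(a, b) for b in genotypes] for a in genotypes]
labels = dbscan(dist, eps=0.2, min_pts=2)
print(labels)
```

Variants that share a label are tightly linked to one another; changing `eps` changes how permissive the grouping is, which is why a tool for choosing this parameter is useful.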
Subjects
Documentation, Genome-Wide Association Study, Cluster Analysis, Haplotypes, Phenotype
ABSTRACT
Pod dehiscence is a major source of yield loss in legumes, which is exacerbated by aridity. Disruptive mutations in "Pod indehiscent 1" (PDH1), a pod sclerenchyma-specific lignin biosynthesis gene, have been linked to significant reductions in dehiscence in several legume species. We compared syntenic PDH1 regions across 12 legumes and two outgroups to uncover key historical evolutionary trends at this important locus. Our results clarified the extent to which PDH1 orthologs are present in legumes, showing that the typical genomic context surrounding PDH1 has only arisen relatively recently in certain phaseoloid species (Vigna, Phaseolus, Glycine). The notable absence of PDH1 in Cajanus cajan may be a major contributor to its indehiscent phenotype compared with other phaseoloids. In addition, we identified a novel PDH1 ortholog in Vigna angularis and detected remarkable increases in PDH1 transcript abundance during Vigna unguiculata pod development. Investigation of the shared genomic context of PDH1 revealed that it lies in a hotspot of transcription factors and signaling gene families that respond to abscisic acid and drought stress, which we hypothesize may be an additional factor influencing expression of PDH1 under specific environmental conditions. Our findings provide key insights into the evolutionary history of PDH1 and lay the foundation for optimizing the pod dehiscence role of PDH1 in major and understudied legume species.
Subjects
Phaseolus, Vigna, Vigna/genetics, Quantitative Trait Loci, Plant Genome/genetics, Phaseolus/genetics, Genomics
ABSTRACT
Copy number variations (CNVs) are defined as deletions, duplications and insertions among individuals of a species. There is growing evidence that CNV is a major factor underlying various autoimmune disorders and diseases in humans; however, in plants, especially oilseed crops, the role of CNVs in disease resistance is not well studied. Here, we investigate the genome-wide diversity and genetic properties of CNVs in resistance gene analogues (RGAs) across eight Brassica napus lines. A total of 1137 CNV events (704 deletions and 433 duplications) were detected across 563 RGAs. The results show CNVs are more likely to occur across clustered RGAs compared to singletons. In addition, 112 RGAs were linked to a blackleg resistance QTL, of which 25 were affected by CNV. Overall, we show that the presence and abundance of CNVs differ between lines, suggesting that in B. napus, the distribution of CNVs depends on genetic background. Our findings advance the understanding of CNV as an important type of genomic structural variation in B. napus and provide a resource to support breeding of advanced canola lines.
Subjects
Brassica napus, Humans, Brassica napus/genetics, DNA Copy Number Variations/genetics, Plant Breeding, Disease Resistance/genetics, Genome
ABSTRACT
Rye (Secale cereale) is a climate-resilient cereal grown extensively as a grain or forage crop in Northern and Eastern Europe. In addition to being an important crop, it has been used to improve wheat through introgression of genomic regions for improved yield and disease resistance. Understanding the genomic diversity of rye will both assist the improvement of this crop and facilitate the introgression of more valuable traits into wheat. Here, we isolated and sequenced the short arm of rye chromosome 7 (7RS) from Triticale 380SD using flow cytometry and compared it to the public Lo7 rye whole genome reference assembly. We identified 2747 Lo7 genes present on the isolated chromosome arm and two clusters containing seven and sixty-five genes that are present on Triticale 380SD 7RS, but absent from Lo7 7RS. We identified 29 genes that are not assigned to chromosomal locations in the Lo7 assembly but are present on Triticale 380SD 7RS, suggesting a chromosome arm location for these genes. Our study supports the Lo7 reference assembly and provides a repertoire of genes on Triticale 7RS.
Subjects
Secale, Triticale, Plant Chromosomes/genetics, Disease Resistance/genetics, Edible Grain/genetics, Secale/genetics, Triticale/genetics, Triticum/genetics
ABSTRACT
During crop domestication and breeding, wild plant species have been shaped into modern high-yield crops and adapted to the main agro-ecological regions. However, climate change will impact crop productivity in these regions, and agriculture needs to adapt to support future food production. On a global scale, crop wild relatives grow in more diverse environments than crop species, and so may host genes that could support the adaptation of crops to new and variable environments. Through identification of individuals with increased climate resilience we may gain a greater understanding of the genomic basis for this resilience and transfer this to crops. Pangenome analysis can help to identify the genes underlying stress responses in individuals harbouring untapped genomic diversity in crop wild relatives. The information gained from the analysis of these pangenomes can then be applied towards breeding climate resilience into existing crops or to re-domesticating crops, combining environmental adaptation traits with crop productivity.
ABSTRACT
Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low confidence gene models using 14 gene and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high confidence) and non-conserved (low confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boosting (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91-0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes.
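Because the abstract reports both accuracy and an F1 score, a short sketch of how the two metrics are computed from binary predictions may help. The labels below are invented for illustration and are not Truegene results; the function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Made-up example: 1 = "real" gene model, 0 = likely false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

F1 ignores true negatives, so it can be higher than accuracy when the positive class is the majority, as in this toy example.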
ABSTRACT
Bread wheat (Triticum aestivum L.) is one of humanity's most important staple crops, characterized by a large and complex genome with a high level of gene presence-absence variation (PAV) between cultivars, hampering genomic approaches for crop improvement. With the growing global population and the increasing impact of climate change on crop yield, there is an urgent need to apply genomic approaches to accelerate wheat breeding. With recent advances in DNA sequencing technology, a growing number of high-quality reference genomes are becoming available, reflecting the genetic content of a diverse range of cultivars. However, information on the presence or absence of genomic regions has been hard to visualize and interrogate because of the size of these genomes and the lack of suitable bioinformatics tools. To address this limitation, we have produced a wheat pangenome graph maintained within an online database to facilitate interrogation and comparison of wheat cultivar genomes. The database allows users to visualize regions of the pangenome to assess PAV between bread wheat genomes.
Graph pangenomes represent more genomic variants than reference genomes. We present a wheat graph pangenome based on 16 public assemblies. We present Wheat Panache, an online visual representation of this graph. Wheat Panache lets users search the graph for presence-absence variants. We also distribute the graph preindexed for use with Giraffe.
Subjects
Bread, Triticum, Plant Genome, Plant Breeding, DNA Sequence Analysis, Triticum/genetics
ABSTRACT
Soybean (Glycine max) is a legume species of significant economic and nutritional value. The yield of soybean continues to increase with the breeding of improved varieties, and this is likely to continue with the application of advanced genetic and genomic approaches for breeding. Genome technologies continue to advance rapidly, with an increasing number of high-quality genome assemblies becoming available. With accumulating data from marker arrays and whole-genome resequencing, studying variations between individuals and populations is becoming increasingly accessible. Furthermore, the recent development of soybean pangenomes has highlighted the significant structural variation between individuals, together with knowledge of what has been selected for or lost during domestication and breeding, information that can be applied for the breeding of improved cultivars. Because of this, resources such as genome assemblies, SNP datasets, pangenomes and associated databases are becoming increasingly important for research underlying soybean crop improvement.
ABSTRACT
Recent growth in crop genomic and trait data has opened opportunities for the application of novel approaches to accelerate crop improvement. Machine learning and deep learning are at the forefront of prediction-based data analysis. However, few approaches for genotype to phenotype prediction compare machine learning with deep learning and further interpret the models that support the predictions. This study uses genome-wide molecular markers and traits across 1110 soybean individuals to develop accurate prediction models. For 13/14 sets of predictions, XGBoost or random forest outperformed deep learning models in prediction performance. Top-ranked SNPs by F-score were identified from XGBoost, and further investigation found that they overlap with significantly associated loci identified from GWAS and previous literature. Feature importance rankings were used to reduce marker input by up to 90%, and subsequent models maintained or improved their prediction performance. These findings support interpretable machine learning as an approach for genomic-based prediction of traits in soybean and other crops.
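The marker-reduction step described above can be sketched as ranking markers by an importance score and keeping only the top fraction. The scores, marker names and the helper `top_markers` below are hypothetical; in the study the rankings came from trained models, which this sketch does not reproduce.

```python
import math


def top_markers(importance: dict[str, float], keep_fraction: float) -> list[str]:
    """Return marker names ranked by importance, keeping the top fraction."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Always keep at least one marker, rounding the count up.
    n_keep = max(1, math.ceil(len(importance) * keep_fraction))
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n_keep]


# Invented importance scores for ten markers.
values = [0.02, 0.31, 0.05, 0.00, 0.12, 0.01, 0.27, 0.03, 0.09, 0.10]
scores = {f"snp_{i}": v for i, v in enumerate(values)}

# Keeping 20% of markers corresponds to an 80% reduction in input.
selected = top_markers(scores, keep_fraction=0.2)
print(selected)
```

A model retrained on only the selected markers can then be compared against the full-marker model to check whether prediction performance is maintained.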
Subjects
Deep Learning, Glycine max, Genotype, Machine Learning, Phenotype, Glycine max/genetics
ABSTRACT
Pangenomes are a rich resource to examine the genomic variation observed within a species or genera, supporting population genetics studies, with applications for the improvement of crop traits. Major crop species such as maize (Zea mays), rice (Oryza sativa), Brassica (Brassica spp.), and soybean (Glycine max) have had pangenomes constructed and released, and this has led to the discovery of valuable genes associated with disease resistance and yield components. However, pangenome data are not available for many less prominent crop species that are currently under-utilised. Despite many under-utilised species being important food sources in regional populations, the scarcity of genomic data for these species hinders their improvement. Here, we assess several under-utilised crops and review the pangenome approaches that could be used to build resources for their improvement. Many of these under-utilised crops are cultivated in arid or semi-arid environments, suggesting that novel genes related to drought tolerance may be identified and used for introgression into related major crop species. In addition, we discuss how previously collected data could be used to enrich pangenome functional analysis in genome-wide association studies (GWAS) based on studies in major crops. Considering the technological advances in genome sequencing, pangenome references for under-utilised species are becoming more obtainable, offering the opportunity to identify novel genes related to agro-morphological traits in these species.
Subjects
Genome-Wide Association Study, Oryza, Chromosome Mapping, Agricultural Crops/genetics, Plant Genome, Oryza/genetics, Plant Breeding, Glycine max/genetics, Zea mays/genetics
ABSTRACT
Pangenomes aim to represent the complete repertoire of the genome diversity present within a species or cohort of species, capturing the genomic structural variance between individuals. This genomic information coupled with phenotypic data can be applied to identify genes and alleles involved with abiotic stress tolerance, disease resistance, and other desirable traits. The characterisation of novel structural variants from pangenomes can support genome editing approaches such as Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR associated protein Cas (CRISPR-Cas), providing functional information on gene sequences and new target sites in variant-specific genes with increased efficiency. This review discusses the application of pangenomes in genome editing and crop improvement, focusing on the potential of pangenomes to accurately identify target genes for CRISPR-Cas editing of plant genomes while avoiding adverse off-target effects. We consider the limitations of applying CRISPR-Cas editing with pangenome references and potential solutions to overcome these limitations.
Subjects
CRISPR-Cas Systems/genetics, Agricultural Crops/genetics, Plant Genome/genetics, Gene Editing/methods, Phenotype, Plant Breeding/methods, Genetically Modified Plants/genetics
ABSTRACT
KEY MESSAGE: The major soy protein QTL, cqProt-003, was analysed for haplotype diversity and global distribution, and results indicate that a 304 bp deletion and variable tandem repeats in protein coding regions are likely causal candidates. Here, we present association and linkage analysis of 985 wild, landrace and cultivar soybean accessions in a pangenomic dataset to characterize the major high-protein/low-oil associated locus cqProt-003 located on chromosome 20. A significant trait-associated region within a 173 kb linkage block was identified, and variants in the region were characterized, identifying 34 high confidence SNPs, 4 insertions, 1 deletion and a larger 304 bp structural variant in the high-protein haplotype. Trinucleotide tandem repeats of variable length present in the second exon of gene Glyma.20G085100 are strongly correlated with the high-protein phenotype and likely represent causal variation. Structural variation has previously been found in the same gene, for which we report the global distribution of the 304 bp deletion and have identified additional nested variation present in high-protein individuals. Mapping variation at the cqProt-003 locus across demographic groups suggests that the high-protein haplotype is common in wild accessions (94.7%), rare in landraces (10.6%) and near absent in cultivated breeding pools (4.1%), suggesting that its decrease in frequency primarily correlates with domestication and continued during subsequent improvement. However, the variation that has persisted in under-utilized wild and landrace populations holds high breeding potential for breeders willing to forego seed oil to maximize protein content. The results of this study include the identification of distinct haplotype structures within the high-protein population, and a broad characterization of the genomic context and linkage patterns of cqProt-003 across global populations, supporting future functional characterization and modification.
Subjects
Fabaceae, Glycine max, Fabaceae/genetics, Haplotypes, Plant Breeding, Single Nucleotide Polymorphism, Quantitative Trait Loci, Seeds/metabolism, Glycine max/genetics, Glycine max/metabolism
ABSTRACT
With third generation DNA sequencing and a general reduction of sequencing costs, the production of bioinformatic data has become easier than ever. Several pipeline automation tools have emerged to ease data processing through a multitude of steps. Here, we describe the setup and use of Snakemake, a pipeline automation tool derived from GNU MAKE.
Subjects
High-Throughput Nucleotide Sequencing, Software, Automation, Computational Biology, DNA Sequence Analysis
ABSTRACT
Genomic selection approaches have increased the speed of plant breeding, leading to growing crop yields over the last decade. However, climate change is impacting current and future yields, resulting in the need to further accelerate breeding efforts to cope with these changing conditions. Here we present approaches to accelerate plant breeding and incorporate nonadditive effects in genomic selection by applying state-of-the-art machine learning approaches. These approaches are made more powerful by the inclusion of pangenomes, which represent the entire genome content of a species. Understanding the strengths and limitations of machine learning methods, compared with more traditional genomic selection efforts, is paramount to the successful application of these methods in crop breeding. We describe examples of genomic selection and pangenome-based approaches in crop breeding, discuss machine learning-specific challenges, and highlight the potential for the application of machine learning in genomic selection. We believe that careful implementation of machine learning approaches will support crop improvement to help counter the adverse outcomes of climate change on crop production.
Subjects
Plant Genome, Genomics, Genomics/methods, Machine Learning, Plant Breeding/methods
ABSTRACT
Caseinolytic protease (CLPP) is an energy-dependent serine-type protease that plays a role in protein quality control. The CLPP gene is highly conserved across kingdoms and the protein is present in both bacteria and eukaryote organelles like mitochondria across a wide phylogenetic range. This pedigree has all the hallmarks of CLPP being an essential gene. However, in plants, disruption of mitochondrial CLPP has no impact on growth, reminiscent of its nonessential role in some model fungi. Deletion of mitochondrial CLPP improves health and increases life span in the filamentous fungus Podospora anserina, while loss of human mitochondrial CLPP leads to infertility and hearing loss. Recently it was revealed that both plant and human CLPP share a similar role in maintenance of the N-module of respiratory complex I. In addition, plant mitochondrial CLPP also coordinates the homeostasis of other mitochondrial protein complexes encoded by genes across mitochondrial and nuclear genomes. Understanding the contextual role of mitochondrial CLPP across kingdoms may help to understand these diverse sets of clpp phenotypes and the widespread conservation of CLPP genes.
Subjects
Bacteria/enzymology, Conserved Sequence, Mitochondria/metabolism, Plants/enzymology, Amino Acid Sequence, Endopeptidase Clp/chemistry, Endopeptidase Clp/metabolism, Humans, Phenotype
ABSTRACT
Multiple studies have shown ribulose-1,5-bisphosphate carboxylase/oxygenase (E.C. 4.1.1.39; Rubisco) to be subject to Lys-acetylation at various residues; however, opposing reports exist about the biological significance of these post-translational modifications (PTMs). One aspect of Lys-acetylation that has not been addressed in plants generally, or with Rubisco specifically, is the stoichiometry at which these Lys-acetylation events occur. To ascertain which Lys-acetylation sites on Arabidopsis Rubisco might be of regulatory importance to its catalytic function in the Calvin-Benson cycle, we purified Rubisco from leaves during both the day and the night and used independent mass spectrometry-based methods to determine the stoichiometry of Rubisco Lys-acetylation events. The results indicate that Rubisco is acetylated at most Lys residues, but each acetylation event occurs at very low stoichiometry. Furthermore, in vitro treatments that increased the extent of Lys-acetylation on purified Rubisco had no effect on Rubisco maximal activity. Therefore, we are unable to confirm that Lys-acetylation at low stoichiometries can be a regulatory mechanism controlling Rubisco maximal activity. The results highlight the need for further use of stoichiometry measurements when determining the biological significance of reversible PTMs like acetylation.
Subjects
Arabidopsis Proteins/metabolism, Arabidopsis/enzymology, Photosynthesis/physiology, Plant Leaves/enzymology, Ribulose-Bisphosphate Carboxylase/metabolism, Acetylation, Arabidopsis/genetics, Arabidopsis Proteins/genetics, Plant Leaves/genetics, Ribulose-Bisphosphate Carboxylase/genetics
ABSTRACT
Protein homeostasis in eukaryotic organelles and their progenitor prokaryotes is regulated by a series of proteases including the caseinolytic protease (CLPP). CLPP has essential roles in chloroplast biogenesis and maintenance, but the significance of the plant mitochondrial CLPP remains unknown and factors that aid coordination of nuclear- and mitochondrial-encoded subunits for complex assembly in mitochondria await discovery. We generated knockout lines of the single gene for the mitochondrial CLP protease subunit, CLPP2, in Arabidopsis (Arabidopsis thaliana). Mutants showed a higher abundance of transcripts from mitochondrial genes encoding oxidative phosphorylation protein complexes, whereas nuclear genes encoding other subunits of the same complexes showed no change in transcript abundance. By contrast, the protein abundance of specific nuclear-encoded subunits in oxidative phosphorylation complexes I and V increased in CLPP2 knockouts, without accumulation of mitochondrial-encoded counterparts in the same complex. Complexes with subunits mainly or entirely encoded in the nucleus were unaffected. Analysis of protein import and function of complex I revealed that while function was retained, protein homeostasis was disrupted, leading to accumulation of soluble subcomplexes of nuclear-encoded subunits. Therefore, CLPP2 contributes to the mitochondrial protein degradation network through supporting coordination and homeostasis of protein complexes encoded across mitochondrial and nuclear genomes.
Subjects
Arabidopsis Proteins/metabolism, Arabidopsis/metabolism, Mitochondria/metabolism, Mitochondrial Proteins/metabolism, Cell Nucleus/metabolism, Electron Transport Complex I/metabolism, Endopeptidase Clp/metabolism, Plant Gene Expression Regulation, Oxidative Phosphorylation
ABSTRACT
Plant mitochondria operate as the powerhouses for cellular energy production by using the combined functions of both imported and organelle-synthesised proteins. Homeostasis of mitochondrial proteins is controlled by both synthesis and degradation processes which together define protein turnover in this organelle. Better understanding of plant mitochondrial protein turnover will provide information on protein quality control inside these organelles and its importance for proper function and regulation of mitochondrial machinery. This review discusses methods used for measurement of turnover rates of plant mitochondrial proteins and presents our current understanding of these rates for key mitochondrial proteins and protein complexes. The assembly and maintenance of mitochondrial OXPHOS complexes, in particular Complexes I and V, will be discussed in detail based on the evidence for differential protein turnover rates of the same protein subunits in different mitochondrial fractions. The impact of the loss of specific plant mitochondrial proteases on proteolysis events and rates of mitochondrial protein turnover will be highlighted. The challenges and future directions for investigation of plant mitochondrial protein turnover are also discussed.