ABSTRACT
Recent growth in crop genomic and trait data has opened opportunities for the application of novel approaches to accelerate crop improvement. Machine learning and deep learning are at the forefront of prediction-based data analysis. However, few approaches for genotype-to-phenotype prediction compare machine learning with deep learning and further interpret the models that support the predictions. This study uses genome-wide molecular markers and traits across 1110 soybean individuals to develop accurate prediction models. For 13 of 14 sets of predictions, XGBoost or random forest outperformed deep learning models in prediction performance. Top-ranked SNPs by F-score were identified from XGBoost and, on further investigation, were found to overlap with significantly associated loci identified from GWAS and previous literature. Feature importance rankings were used to reduce marker input by up to 90%, and subsequent models maintained or improved their prediction performance. These findings support interpretable machine learning as an approach for genomic-based prediction of traits in soybean and other crops.
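The marker-reduction step described above can be illustrated with a minimal sketch. It assumes that a per-marker importance score (for example, an F-score exported from a trained tree model) is already available as a dictionary; the function name `select_top_markers`, the marker IDs, and the score values are hypothetical and are not taken from the study.

```python
def select_top_markers(importance, keep_fraction=0.1):
    """Return the marker IDs with the highest importance scores.

    importance: dict mapping marker ID -> importance score (assumed precomputed).
    keep_fraction: share of markers to keep; 0.1 corresponds to a 90% reduction.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank by descending score; ties are broken by marker ID for determinism.
    ranked = sorted(importance, key=lambda m: (-importance[m], m))
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical scores for ten markers (illustrative values only).
scores = {f"snp{i}": s for i, s in enumerate([5, 0, 12, 3, 0, 7, 1, 0, 9, 2])}
top_markers = select_top_markers(scores, keep_fraction=0.1)
print(top_markers)
```

A model would then be retrained on the retained markers only and its performance compared with that of the full-marker model.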
Subjects
Deep Learning , Glycine max , Genotype , Machine Learning , Phenotype , Glycine max/genetics
ABSTRACT
Pangenomes are a rich resource to examine the genomic variation observed within a species or genus, supporting population genetics studies, with applications for the improvement of crop traits. Major crop species such as maize (Zea mays), rice (Oryza sativa), Brassica (Brassica spp.), and soybean (Glycine max) have had pangenomes constructed and released, and this has led to the discovery of valuable genes associated with disease resistance and yield components. However, pangenome data are not available for many less prominent crop species that are currently under-utilised. Despite many under-utilised species being important food sources in regional populations, the scarcity of genomic data for these species hinders their improvement. Here, we assess several under-utilised crops and review the pangenome approaches that could be used to build resources for their improvement. Many of these under-utilised crops are cultivated in arid or semi-arid environments, suggesting that novel genes related to drought tolerance may be identified and used for introgression into related major crop species. In addition, we discuss how previously collected data could be used to enrich pangenome functional analysis in genome-wide association studies (GWAS) based on studies in major crops. Considering the technological advances in genome sequencing, pangenome references for under-utilised species are becoming more obtainable, offering the opportunity to identify novel genes related to agro-morphological traits in these species.
Subjects
Genome-Wide Association Study , Oryza , Chromosome Mapping , Agricultural Crops/genetics , Plant Genome , Oryza/genetics , Plant Breeding , Glycine max/genetics , Zea mays/genetics
ABSTRACT
KEY MESSAGE: Safeguarding crop yields in a changing climate requires bioinformatics advances in harnessing data from vast phenomics and genomics datasets to translate research findings into climate smart crops in the field. Climate change and an additional 3 billion mouths to feed by 2050 raise serious concerns over global food security. Crop breeding and land management strategies will need to evolve to maximize the utilization of finite resources in coming years. High-throughput phenotyping and genomics technologies are providing researchers with the information required to guide and inform the breeding of climate smart crops adapted to the environment. Bioinformatics has a fundamental role to play in integrating and exploiting this fast accumulating wealth of data, through association studies to detect genomic targets underlying key adaptive climate-resilient traits. These data provide tools for breeders to tailor crops to their environment and can be introduced using advanced selection or genome editing methods. To effectively translate research into the field, genomic and phenomic information will need to be integrated into comprehensive clade-specific databases and platforms alongside accessible tools that can be used by breeders to inform the selection of climate adaptive traits. Here we discuss the role of bioinformatics in extracting, analysing, integrating and managing genomic and phenomic data to improve climate resilience in crops, including current, emerging and potential approaches, applications and bottlenecks in the research and breeding pipeline.
Subjects
Climate Change , Computational Biology , Agricultural Crops/genetics , Genomics , Phenomics , Plant Breeding/methods , Physiological Adaptation , Gene Editing , Phenotype
ABSTRACT
Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, suggesting that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average of 63.1% accuracy using target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.
ABSTRACT
Single-nucleotide polymorphisms (SNPs) have become the primary type of molecular genetic marker used in a diverse range of genetic and genomic studies. SNPs can be used to identify genomic regions linked to traits such as disease in genome-wide association studies, to understand population structure and diversity, or to understand mechanisms of genome evolution. One of the first steps of any SNP-based workflow, following SNP discovery, is quality control of SNP data. The protocol described here details how to perform quality control on SNP data to minimise errors in downstream analysis.
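As a rough illustration of what SNP quality control can involve, the sketch below filters markers on missing-call rate and minor allele frequency (MAF), two commonly used criteria. The abstract does not state which filters or thresholds the protocol applies, so the function name `snp_passes_qc`, the genotype encoding (0, 1 or 2 copies of the alternate allele, `None` for a missing call), and the cut-offs of 10% missing data and 5% MAF are all assumptions made for this example.

```python
def snp_passes_qc(genotypes, max_missing=0.1, min_maf=0.05):
    """Return True if a SNP passes the missing-rate and MAF filters.

    genotypes: list of alternate-allele counts (0, 1, 2) for diploid
    individuals, with None marking a missing call (assumed encoding).
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or not called:
        return False  # nothing genotyped, so the SNP cannot be assessed
    missing_rate = 1 - len(called) / len(genotypes)
    # Each diploid individual contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return missing_rate <= max_missing and maf >= min_maf


# Hypothetical genotype calls for three SNPs across ten individuals.
snps = {
    "snp_a": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],           # complete, common variant
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],           # monomorphic, MAF = 0
    "snp_c": [0, 1, None, None, 2, None, 1, 0, 1, 2],  # 30% missing calls
}
kept = [name for name, calls in snps.items() if snp_passes_qc(calls)]
print(kept)
```

In practice, such filters are usually applied with dedicated tools to variant files rather than to in-memory lists; the sketch only shows the logic of the two checks.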
Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Data Accuracy , Genome , Genomics
ABSTRACT
Genomic prediction tools support crop breeding based on statistical methods, such as the genomic best linear unbiased prediction (GBLUP). However, these tools are not designed to capture non-linear relationships within multi-dimensional datasets, or to deal with high-dimensional datasets such as imagery collected by unmanned aerial vehicles. Machine learning (ML) algorithms have the potential to surpass the prediction accuracy of current tools used for genotype-to-phenotype prediction, due to their capacity to autonomously extract data features and represent their relationships at multiple levels of abstraction. This review addresses the challenges of applying statistical and machine learning methods for predicting phenotypic traits based on genetic markers, environment data, and imagery for crop breeding. We present the advantages and disadvantages of explainable model structures, discuss the potential of machine learning models for genotype-to-phenotype prediction in crop breeding, and outline the challenges, including the scarcity of high-quality datasets, inconsistent metadata annotation and the requirements of ML models.
ABSTRACT
Presence-absence variants (PAV) are genomic regions present in some individuals of a species, but not others. PAVs have been shown to contribute to genomic diversity, especially in bacteria and plants. These structural variations have been linked to traits and can be used to track a species' evolutionary history. PAVs are usually called by aligning short read sequence data from one or more individuals to a reference genome or pangenome assembly, and then comparing coverage. Regions where reads do not align define absence in that individual, and the regions are classified as PAVs. The method below details how to align sequence reads to a reference and how to use the sequencing-coverage calculator Mosdepth to identify PAVs and construct a PAV table for use in downstream comparative genome analysis.
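The coverage comparison described above can be sketched as follows. The example assumes that mean read depth per region has already been computed for each individual (for instance from a coverage calculator's per-region output) and loaded into nested dictionaries; the function names, the region and individual names, the depth values, and the presence threshold of 1.0x are hypothetical and are not taken from the protocol.

```python
def build_pav_table(coverage, min_coverage=1.0):
    """Build a presence/absence table from per-region mean depth.

    coverage: {individual: {region: mean_depth}} (assumed precomputed).
    Returns {region: {individual: 1 or 0}}, where 1 means present.
    A region with no recorded depth for an individual is treated as absent.
    """
    regions = sorted({r for per_ind in coverage.values() for r in per_ind})
    return {
        region: {
            ind: int(per_ind.get(region, 0.0) >= min_coverage)
            for ind, per_ind in coverage.items()
        }
        for region in regions
    }


def variable_regions(table):
    """Return regions that are present in some individuals and absent in others."""
    return [r for r, calls in table.items() if len(set(calls.values())) > 1]


# Hypothetical mean depths for three gene regions in two individuals.
coverage = {
    "ind1": {"geneA": 25.3, "geneB": 0.0, "geneC": 18.1},
    "ind2": {"geneA": 22.0, "geneB": 19.4, "geneC": 0.2},
}
pav_table = build_pav_table(coverage)
print(variable_regions(pav_table))
```

A fixed depth cut-off is only one possible presence criterion; the fraction of a region covered by reads is another, and the appropriate choice depends on sequencing depth.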
Subjects
Genome , Genomics , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans , DNA Sequence Analysis/methods
ABSTRACT
OBJECTIVE: To determine the mortality rate of patients treated for gastroschisis at a Jamaican pediatric hospital, and to identify factors that contribute significantly to mortality. METHODS: Eighty-five patients were treated for gastroschisis between November 1, 2006 and November 30, 2015. Of these, 80 records were recovered and reviewed retrospectively. Records were analyzed for maternal and patient characteristics, and details of the clinical course. Death during admission was the primary outcome measure. RESULTS: 63 of the 80 patients died during admission, giving a mortality rate of 78.8%. Sepsis was the main cause of death (82.4%). 27 patients (33.8%) had complicated gastroschisis (necrosis, perforation and/or atresia), all of whom died. Only preterm gestational age, complicated gastroschisis, and the lack of parenteral nutrition were found to be statistically associated with increased mortality. CONCLUSION: Our mortality rate is higher than those quoted in high-income countries, and is comparable to those found in low- to middle-income countries. Mortality in our cohort was significantly associated with prematurity, complicated gastroschisis, and the lack of parenteral nutrition. Efforts to improve outcomes must focus on improving antenatal care, establishing transfer protocols, and optimizing nutrition for all patients with gastroschisis. STUDY TYPE: Prognostic/Retrospective Study LEVEL OF EVIDENCE: Level II.