Results 1 - 20 of 209
1.
Proc Natl Acad Sci U S A ; 121(26): e2319811121, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38889146

ABSTRACT

Rational design of plant cis-regulatory DNA sequences without expert intervention or prior domain knowledge is still a daunting task. Here, we developed PhytoExpr, a deep learning framework capable of predicting both mRNA abundance and plant species using the proximal regulatory sequence as the sole input. PhytoExpr was trained over 17 species representative of major clades of the plant kingdom to enhance its generalizability. Via input perturbation, quantitative functional annotation of the input sequence was achieved at single-nucleotide resolution, revealing an abundance of predicted high-impact nucleotides in conserved noncoding sequences and transcription factor binding sites. Evaluation of maize HapMap3 single-nucleotide polymorphisms (SNPs) by PhytoExpr demonstrates an enrichment of predicted high-impact SNPs in cis-eQTL. Additionally, we provided two algorithms that harnessed the power of PhytoExpr in designing functional cis-regulatory variants, and de novo creation of species-specific cis-regulatory sequences through in silico evolution of random DNA sequences. Our model represents a general and robust approach for functional variant discovery in population genetics and rational design of regulatory sequences for genome editing and synthetic biology.
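As an illustration of the in silico evolution idea mentioned in this abstract, the following is a minimal sketch only. It assumes a greedy single-nucleotide mutation loop and uses a hypothetical `toy_score` function (a count of TATA motifs) in place of a trained model; neither the function names nor the acceptance rule come from the paper.

```python
import random

BASES = "ACGT"

def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor: counts TATA motifs."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))

def evolve(seq: str, score, rounds: int = 50, seed: int = 0) -> str:
    """Greedy in silico evolution: accept a random single-base
    substitution whenever it does not lower the score."""
    rng = random.Random(seed)
    best, best_score = seq, score(seq)
    for _ in range(rounds):
        pos = rng.randrange(len(best))
        base = rng.choice(BASES)
        candidate = best[:pos] + base + best[pos + 1:]
        cand_score = score(candidate)
        if cand_score >= best_score:
            best, best_score = candidate, cand_score
    return best

# Start from a random 60-nt sequence and evolve it against the toy score.
start_rng = random.Random(1)
start = "".join(start_rng.choice(BASES) for _ in range(60))
final = evolve(start, toy_score, rounds=500)
print(toy_score(start), toy_score(final))
```

Swapping `toy_score` for a real model's prediction would keep the loop unchanged; the published algorithms may use a different search strategy.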


Subject(s)
Polymorphism, Single Nucleotide , Regulatory Sequences, Nucleic Acid , Zea mays , Regulatory Sequences, Nucleic Acid/genetics , Zea mays/genetics , Quantitative Trait Loci , Algorithms , Gene Expression Regulation, Plant , Deep Learning , Plants/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Models, Genetic , Genes, Plant , Binding Sites/genetics
2.
bioRxiv ; 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38895432

ABSTRACT

Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize that diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.

3.
Genetics ; 227(1)2024 05 07.
Article in English | MEDLINE | ID: mdl-38469622

ABSTRACT

Design randomizations and spatial corrections have increased understanding of genotypic, spatial, and residual effects in field experiments, but precisely measuring spatial heterogeneity in the field remains a challenge. To this end, our study evaluated approaches to improve spatial modeling using high-throughput phenotypes (HTP) via unoccupied aerial vehicle (UAV) imagery. The normalized difference vegetation index was measured by a multispectral MicaSense camera and processed using ImageBreed. In contrast to baseline agronomic trait spatial correction and a baseline multitrait model, a two-stage approach was proposed. Using longitudinal normalized difference vegetation index data, plot-level permanent environment effects estimated spatial patterns in the field throughout the growing season. Normalized difference vegetation index permanent environment effects were separated from additive genetic effects using 2D spline, separable autoregressive models, or random regression models. The permanent environment effects were leveraged within agronomic trait genomic best linear unbiased prediction either by modeling an empirical covariance for random effects, or by modeling fixed effects as an average of permanent environment effects across time or split among three growth phases. Modeling approaches were tested using simulation data and Genomes-to-Fields hybrid maize (Zea mays L.) field experiments in 2015, 2017, 2019, and 2020 for grain yield, grain moisture, and ear height. The two-stage approach improved heritability, model fit, and genotypic effect estimation compared to baseline models. Electrical conductance and elevation from a 2019 soil survey significantly improved model fit, while 2D spline permanent environment effects were most strongly correlated with the soil parameters. Simulation of field effects demonstrated improved specificity for random regression models. In summary, the use of longitudinal normalized difference vegetation index measurements increased experimental accuracy and understanding of field spatio-temporal heterogeneity.
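For readers unfamiliar with the index, the normalized difference vegetation index (NDVI) is conventionally computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The sketch below shows that standard formula; the reflectance values are hypothetical and the function is not taken from ImageBreed.

```python
def ndvi(nir: float, red: float) -> float:
    """Standard NDVI: (NIR - Red) / (NIR + Red).

    For non-negative reflectances the result lies in [-1, 1].
    """
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total

# Hypothetical plot-level reflectances for a dense canopy.
print(round(ndvi(nir=0.5, red=0.1), 3))
```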


Subject(s)
Zea mays , Zea mays/genetics , Phenotype , Models, Genetic , Spatio-Temporal Analysis , Genome, Plant , Genomics/methods , Genotype , Quantitative Trait, Heritable
4.
bioRxiv ; 2024 Jan 08.
Article in English | MEDLINE | ID: mdl-37398426

ABSTRACT

The 3' end of a gene, often called a terminator, modulates mRNA stability, localization, translation, and polyadenylation. Here, we adapted Plant STARR-seq, a massively parallel reporter assay, to measure the activity of over 50,000 terminators from the plants Arabidopsis thaliana and Zea mays. We characterize thousands of plant terminators, including many that outperform bacterial terminators commonly used in plants. Terminator activity is species-specific, differing in tobacco leaf and maize protoplast assays. While recapitulating known biology, our results reveal the relative contributions of polyadenylation motifs to terminator strength. We built a computational model to predict terminator strength and used it to conduct in silico evolution that generated optimized synthetic terminators. Additionally, we discover alternative polyadenylation sites across tens of thousands of terminators; however, the strongest terminators tend to have a dominant cleavage site. Our results establish features of plant terminator function and identify strong naturally occurring and synthetic terminators.

5.
Trends Plant Sci ; 29(3): 355-369, 2024 03.
Article in English | MEDLINE | ID: mdl-37749022

ABSTRACT

Genome alignment is one of the most foundational methods for genome sequence studies. With rapid advances in sequencing and assembly technologies, these newly assembled genomes present challenges for alignment tools to meet the increased complexity and scale. Plant genome alignment is technologically challenging because of frequent whole-genome duplications (WGDs) as well as chromosome rearrangements and fractionation, high nucleotide diversity, widespread structural variation, and high transposable element (TE) activity causing large proportions of repeat elements. We summarize classical pairwise and multiple genome alignment (MGA) methods, and highlight techniques that are widely used or are being developed by the plant research community. We also outline the remaining challenges for precise genome alignment and the interpretation of alignment results in plants.


Subject(s)
Genome, Plant , Plants , Plants/genetics , Genome, Plant/genetics , DNA Transposable Elements/genetics
6.
BMC Res Notes ; 16(1): 219, 2023 Sep 14.
Article in English | MEDLINE | ID: mdl-37710302

ABSTRACT

OBJECTIVES: This release note describes the Maize GxE project datasets within the Genomes to Fields (G2F) Initiative. The Maize GxE project aims to understand genotype by environment (GxE) interactions and use the information collected to improve resource allocation efficiency and increase genotype predictability and stability, particularly in scenarios of variable environmental patterns. Hybrids and inbreds are evaluated across multiple environments and phenotypic, genotypic, environmental, and metadata information are made publicly available. DATA DESCRIPTION: The datasets include phenotypic data of the hybrids and inbreds evaluated in 30 locations across the US and one location in Germany in 2020 and 2021, soil and climatic measurements and metadata information for all environments (combination of year and location), ReadMe, and description files for each data type. A set of common hybrids is present in each environment to connect with previous evaluations. Each environment had a collaborator responsible for collecting and submitting the data; the GxE coordination team combined all the collected information and removed obviously erroneous data. Collaborators received the combined data to use, verify and declare that the data generated in their own environments was accurate. Combined data is released to the public with minimal filtering to maintain fidelity to the original data.


Subject(s)
Resource Allocation , Zea mays , Zea mays/genetics , Seasons , Genotype , Germany
8.
BMC Res Notes ; 16(1): 148, 2023 Jul 17.
Article in English | MEDLINE | ID: mdl-37461058

ABSTRACT

OBJECTIVES: The Genomes to Fields (G2F) 2022 Maize Genotype by Environment (GxE) Prediction Competition aimed to develop models for predicting grain yield for the 2022 Maize GxE project field trials, leveraging the datasets previously generated by this project and other publicly available data. DATA DESCRIPTION: This resource used data from the Maize GxE project within the G2F Initiative [1]. The dataset included phenotypic and genotypic data of the hybrids evaluated in 45 locations from 2014 to 2022. It also included soil, weather, and environmental covariate data and metadata information for all environments (combination of year and location). Competitors also had access to ReadMe files which described all the files provided. The Maize GxE is a collaborative project and all the data generated becomes publicly available [2]. The dataset used in the 2022 Prediction Competition was curated and lightly filtered for quality and to ensure naming uniformity across years.


Subject(s)
Genome, Plant , Zea mays , Phenotype , Zea mays/genetics , Genotype , Genome, Plant/genetics , Edible Grain/genetics
9.
BMC Genom Data ; 24(1): 29, 2023 05 25.
Article in English | MEDLINE | ID: mdl-37231352

ABSTRACT

OBJECTIVES: This report provides information about the public release of the 2018-2019 Maize G X E project of the Genomes to Fields (G2F) Initiative datasets. G2F is an umbrella initiative that evaluates maize hybrids and inbred lines across multiple environments and makes available phenotypic, genotypic, environmental, and metadata information. The initiative understands the necessity to characterize and deploy public sources of genetic diversity to face the challenges for more sustainable agriculture in the context of variable environmental conditions. DATA DESCRIPTION: Datasets include phenotypic, climatic, and soil measurements, metadata information, and inbred genotypic information for each combination of location and year. Collaborators in the G2F initiative collected data for each location and year; members of the group responsible for coordination and data processing combined all the collected information and removed obviously erroneous data. The collaborators received the data before the DOI release to verify and declare that the data generated in their own locations was accurate. ReadMe and description files are available for each dataset. Previous years of evaluation are already publicly available, with common hybrids present to connect across all locations and years evaluated since this project's inception.


Subject(s)
Genome, Plant , Zea mays , Phenotype , Zea mays/genetics , Seasons , Genotype , Genome, Plant/genetics
10.
Cell ; 186(11): 2313-2328.e15, 2023 05 25.
Article in English | MEDLINE | ID: mdl-37146612

ABSTRACT

Hybrid potato breeding will transform the crop from a clonally propagated tetraploid to a seed-reproducing diploid. Historical accumulation of deleterious mutations in potato genomes has hindered the development of elite inbred lines and hybrids. Utilizing a whole-genome phylogeny of 92 Solanaceae and its sister clade species, we employ an evolutionary strategy to identify deleterious mutations. The deep phylogeny reveals the genome-wide landscape of highly constrained sites, comprising ∼2.4% of the genome. Based on a diploid potato diversity panel, we infer 367,499 deleterious variants, of which 50% occur at non-coding and 15% at synonymous sites. Counterintuitively, diploid lines with relatively high homozygous deleterious burden can be better starting material for inbred-line development, despite showing less vigorous growth. Inclusion of inferred deleterious mutations increases genomic-prediction accuracy for yield by 24.7%. Our study generates insights into the genome-wide incidence and properties of deleterious mutations and their far-reaching consequences for breeding.


Subject(s)
Plant Breeding , Solanum tuberosum , Diploidy , Mutation , Phylogeny , Solanum tuberosum/genetics
11.
Int J Mol Sci ; 24(7)2023 Mar 25.
Article in English | MEDLINE | ID: mdl-37047206

ABSTRACT

Maximizing soil exploration through modifications of the root system is a strategy for plants to overcome phosphorus (P) deficiency. Genome-wide association with 561 tropical maize inbred lines from Embrapa and DTMA panels was undertaken for root morphology and P acquisition traits under low- and high-P concentrations, with 353,540 SNPs. P supply modified root morphology traits, biomass and P content in the global maize panel, but root length and root surface area changed differentially in Embrapa and DTMA panels. This suggests that different root plasticity mechanisms exist for maize adaptation to low-P conditions. A total of 87 SNPs were associated with phenotypic traits in both P conditions at -log10(p-value) ≥ 5, whereas only seven SNPs reached Bonferroni significance. Among these SNPs, S9_137746077, which is located upstream of the gene GRMZM2G378852 that encodes a MAPKKK protein kinase, was significantly associated with total seedling dry weight, with the same allele increasing root length and root surface area under P deficiency. The C allele of S8_88600375, mapped within GRMZM2G044531 that encodes an AGC kinase, significantly enhanced root length under low P, positively affecting root surface area and seedling weight. The broad genetic diversity evaluated in this panel suggests that candidate genes and favorable alleles could be exploited to improve P efficiency in maize breeding programs of Africa and Latin America.
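To see why far fewer SNPs pass a Bonferroni cutoff than the -log10(p-value) ≥ 5 cutoff, the corrected threshold can be worked out from the number of SNPs tested. This sketch assumes a family-wise error rate of 0.05, which the abstract does not state.

```python
import math

n_snps = 353_540   # number of SNPs reported in the abstract
alpha = 0.05       # assumed family-wise error rate (not given in the abstract)

bonferroni_p = alpha / n_snps            # per-test p-value cutoff
threshold = -math.log10(bonferroni_p)    # same cutoff on the -log10 scale

print(f"-log10(p) threshold: {threshold:.2f}")  # about 6.85 under these assumptions
```

Under that assumption, a SNP needs a p-value roughly 70 times smaller than the suggestive cutoff of 1e-5 to reach Bonferroni significance.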


Subject(s)
Genome-Wide Association Study , Zea mays , Zea mays/metabolism , Phosphorus/metabolism , Plant Breeding , Phenotype , Seedlings/metabolism , Polymorphism, Single Nucleotide
12.
G3 (Bethesda) ; 13(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37002915

ABSTRACT

Poa pratensis, commonly known as Kentucky bluegrass, is a popular cool-season grass species used as turf in lawns and recreation areas globally. Despite its substantial economic value, a reference genome had not previously been assembled due to the genome's relatively large size and biological complexity that includes apomixis, polyploidy, and interspecific hybridization. We report here a fortuitous de novo assembly and annotation of a P. pratensis genome. Instead of sequencing the genome of a C4 grass, we accidentally sampled and sequenced tissue from a weedy P. pratensis whose stolon was intertwined with that of the C4 grass. The draft assembly consists of 6.09 Gbp with an N50 scaffold length of 65.1 Mbp, and a total of 118 scaffolds, generated using PacBio long reads and Bionano optical map technology. We annotated 256K gene models and found 58% of the genome to be composed of transposable elements. To demonstrate the applicability of the reference genome, we evaluated population structure and estimated genetic diversity in P. pratensis collected from three North American prairies, two in Manitoba, Canada and one in Colorado, USA. Our results support previous studies that found high genetic diversity and population structure within the species. The reference genome and annotation will be an important resource for turfgrass breeding and study of bluegrasses.
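The N50 statistic quoted in this abstract is the scaffold length at which scaffolds of that length or longer cover at least half of the assembly. A minimal sketch of the calculation follows; the scaffold lengths are made-up toy values, not data from the P. pratensis assembly.

```python
def n50(lengths):
    """Return the N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Toy scaffold lengths (arbitrary units): total is 300, half is 150.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```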


Subject(s)
Plant Breeding , Poa , Genome , Poa/genetics , Plant Weeds/genetics , Base Sequence , Molecular Sequence Annotation
13.
PLoS Genet ; 19(3): e1010664, 2023 03.
Article in English | MEDLINE | ID: mdl-36943844

ABSTRACT

Pleiotropy, in which a single gene controls two or more seemingly unrelated traits, has been shown to impact genes with effects on flowering time, leaf architecture, and inflorescence morphology in maize. However, the genome-wide impact of biological pleiotropy across all maize phenotypes is largely unknown. Here, we investigate the extent to which biological pleiotropy impacts phenotypes within maize using GWAS summary statistics reanalyzed from previously published metabolite, field, and expression phenotypes across the Nested Association Mapping population and Goodman Association Panel. Through phenotypic saturation of 120,597 traits, we obtain over 480 million significant quantitative trait nucleotides. We estimate that only 1.56-32.3% of intervals show some degree of pleiotropy. We then assess the relationship between pleiotropy and various biological features such as gene expression, chromatin accessibility, sequence conservation, and enrichment for gene ontology terms. We find very little relationship between pleiotropy and these variables when compared to permuted pleiotropy. We hypothesize that biological pleiotropy of common alleles is not widespread in maize and is highly impacted by nuisance terms such as population structure and linkage disequilibrium. Natural selection on large standing natural variation in maize populations may target wide and large effect variants, leaving the prevalence of detectable pleiotropy relatively low.


Subject(s)
Genome-Wide Association Study , Zea mays , Chromosome Mapping , Zea mays/genetics , Phenotype , Linkage Disequilibrium , Polymorphism, Single Nucleotide , Genetic Pleiotropy
15.
Proc Natl Acad Sci U S A ; 120(10): e2216894120, 2023 03 07.
Article in English | MEDLINE | ID: mdl-36848555

ABSTRACT

Drought tolerance is a highly complex trait controlled by numerous interconnected pathways with substantial variation within and across plant species. This complexity makes it difficult to distill individual genetic loci underlying tolerance, and to identify core or conserved drought-responsive pathways. Here, we collected drought physiology and gene expression datasets across diverse genotypes of the C4 cereals sorghum and maize and searched for signatures defining water-deficit responses. Differential gene expression identified few overlapping drought-associated genes across sorghum genotypes, but using a predictive modeling approach, we found a shared core drought response across development, genotype, and stress severity. Our model had similar robustness when applied to datasets in maize, reflecting a conserved drought response between sorghum and maize. The top predictors are enriched in functions associated with various abiotic stress-responsive pathways as well as core cellular functions. These conserved drought response genes were less likely to contain deleterious mutations than other gene sets, suggesting that core drought-responsive genes are under evolutionary and functional constraints. Our findings support a broad evolutionary conservation of drought responses in C4 grasses regardless of innate stress tolerance, which could have important implications for developing climate resilient cereals.


Subject(s)
Sorghum , Zea mays , Zea mays/genetics , Sorghum/genetics , Droughts , Edible Grain/genetics , Poaceae
16.
Plant J ; 112(6): 1525-1542, 2022 12.
Article in English | MEDLINE | ID: mdl-36353749

ABSTRACT

Linking genotype with phenotype is a fundamental goal in biology and requires robust data for both. Recent advances in plant-genome sequencing have expedited comparisons among multiple-related individuals. The abundance of structural genomic within-species variation that has been discovered indicates that a single reference genome cannot represent the complete sequence diversity of a species, leading to the expansion of the pan-genome concept. For high-resolution forward genetics, this unprecedented access to genomic variation should be paralleled and integrated with phenotypic characterization of genetic diversity. We developed a multi-parental framework for trait dissection in melon (Cucumis melo), leveraging a novel pan-genome constructed for this highly variable cucurbit crop. A core subset of 25 diverse founders (MelonCore25), consisting of 24 accessions from the two widely cultivated subspecies of C. melo, encompassing 12 horticultural groups, and 1 feral accession was sequenced using a combination of short- and long-read technologies, and their genomes were assembled de novo. The construction of this melon pan-genome exposed substantial variation in genome size and structure, including detection of ~300 000 structural variants and ~9 million SNPs. A half-diallel derived set of 300 F2 populations, representing all possible MelonCore25 parental combinations, was constructed as a framework for trait dissection through integration with the pan-genome. We demonstrate the potential of this unified framework for genetic analysis of various melon traits, including rind color intensity and pattern, fruit sugar content, and resistance to fungal diseases. We anticipate that utilization of this integrated resource will enhance genetic dissection of important traits and accelerate melon breeding.
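The figure of 300 F2 populations follows directly from the half-diallel design: every unordered pair of the 25 founders is crossed once, with no selfs and no reciprocals, giving C(25, 2) = 300. The short check below uses hypothetical founder labels.

```python
from itertools import combinations
from math import comb

# Hypothetical labels for the 25 MelonCore25 founders.
founders = [f"F{i:02d}" for i in range(1, 26)]

# Half-diallel: each unordered parental pair once, excluding selfs and reciprocals.
crosses = list(combinations(founders, 2))

print(len(crosses), comb(25, 2))  # prints 300 300
```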


Subject(s)
Cucumis melo , Cucurbitaceae , Cucumis melo/genetics , Cucurbitaceae/genetics , Plant Breeding , Chromosome Mapping , Phenotype
17.
Genome Biol ; 23(1): 183, 2022 09 01.
Article in English | MEDLINE | ID: mdl-36050782

ABSTRACT

BACKGROUND: Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at high resolution, within fewer than hundreds of base pairs. Most genetic mapping studies have generally lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we use genomic annotations to accurately predict nucleotide conservation across angiosperms, as a proxy for fitness effect of mutations. RESULTS: Using only sequence analysis, we annotate nonsynonymous mutations in 25,824 maize gene models, with information from bioinformatics and deep learning. Our predictions are validated by experimental information: within-species conservation, chromatin accessibility, and gene expression. According to gene ontology and pathway enrichment analyses, predicted nucleotide conservation points to genes in central carbon metabolism. Importantly, it improves genomic prediction for fitness-related traits such as grain yield, in elite maize panels, by stringent prioritization of fewer than 1% of single-site variants. CONCLUSIONS: Our results suggest that predicting nucleotide conservation across angiosperms may effectively prioritize sites most likely to impact fitness-related traits in crops, without being limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Our approach, Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC), could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing. The trained PICNC models and predicted nucleotide conservation at protein-coding SNPs in maize are publicly available in CyVerse (https://doi.org/10.25739/hybz-2957).


Subject(s)
Genomics , Zea mays , Genome , Genomics/methods , Nucleotides , Phenotype , Polymorphism, Single Nucleotide , Zea mays/genetics
18.
Plant Genome ; 15(3): e20249, 2022 09.
Article in English | MEDLINE | ID: mdl-35924336

ABSTRACT

Accessible chromatin regions (ACRs) are critical components of gene regulation but modeling them directly from sequence remains challenging, especially within plants, whose mechanisms of chromatin remodeling are less understood than in animals. We trained an existing deep-learning architecture, DanQ, on data from 12 angiosperm species to predict the chromatin accessibility in leaf of sequence windows within and across species. We also trained DanQ on DNA methylation data from 10 angiosperms because unmethylated regions have been shown to overlap significantly with ACRs in some plants. The across-species models have comparable or even superior performance to a model trained within species, suggesting strong conservation of chromatin mechanisms across angiosperms. Testing a maize (Zea mays L.) held-out model on a multi-tissue chromatin accessibility panel revealed our models are best at predicting constitutively accessible chromatin regions, with diminishing performance as cell-type specificity increases. Using a combination of interpretation methods, we ranked JASPAR motifs by their importance to each model and saw that the TCP and AP2/ERF transcription factor (TF) families consistently ranked highly. We embedded the top three JASPAR motifs for each model at all possible positions on both strands in our sequence window and observed position- and strand-specific patterns in their importance to the model. With our publicly available across-species 'a2z' model it is now feasible to predict the chromatin accessibility and methylation landscape of any angiosperm genome.


Subject(s)
Chromatin , Magnoliopsida , Animals , Genome , Magnoliopsida/genetics , Neural Networks, Computer , Transcription Factors/genetics , Zea mays/genetics
19.
Proc Natl Acad Sci U S A ; 119(27): e2100036119, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35771940

ABSTRACT

Native Americans domesticated maize (Zea mays ssp. mays) from lowland teosinte parviglumis (Zea mays ssp. parviglumis) in the warm Mexican southwest and brought it to the highlands of Mexico and South America where it was exposed to lower temperatures that imposed strong selection on flowering time. Phospholipids are important metabolites in plant responses to low-temperature and phosphorus availability and have been suggested to influence flowering time. Here, we combined linkage mapping with genome scans to identify High PhosphatidylCholine 1 (HPC1), a gene that encodes a phospholipase A1 enzyme, as a major driver of phospholipid variation in highland maize. Common garden experiments demonstrated strong genotype-by-environment interactions associated with variation at HPC1, with the highland HPC1 allele leading to higher fitness in highlands, possibly by hastening flowering. The highland maize HPC1 variant resulted in impaired function of the encoded protein due to a polymorphism in a highly conserved sequence. A meta-analysis across HPC1 orthologs indicated a strong association between the identity of the amino acid at this position and optimal growth in prokaryotes. Mutagenesis of HPC1 via genome editing validated its role in regulating phospholipid metabolism. Finally, we showed that the highland HPC1 allele entered cultivated maize by introgression from the wild highland teosinte Zea mays ssp. mexicana and has been maintained in maize breeding lines from the Northern United States, Canada, and Europe. Thus, HPC1 introgressed from teosinte mexicana underlies a large metabolic QTL that modulates phosphatidylcholine levels and has an adaptive effect at least in part via induction of early flowering time.


Subject(s)
Adaptation, Physiological , Flowers , Gene-Environment Interaction , Phosphatidylcholines , Phospholipases A1 , Plant Proteins , Zea mays , Alleles , Chromosome Mapping , Flowers/genetics , Flowers/metabolism , Genes, Plant , Genetic Linkage , Phosphatidylcholines/metabolism , Phospholipases A1/classification , Phospholipases A1/genetics , Phospholipases A1/metabolism , Plant Proteins/classification , Plant Proteins/genetics , Plant Proteins/metabolism , Zea mays/genetics , Zea mays/growth & development
20.
Plant Genome ; 15(2): e20204, 2022 06.
Article in English | MEDLINE | ID: mdl-35416423

ABSTRACT

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) to allow practical and sensitive multiple alignment of diverged plant genomes and calculation of conservation scores with minimal user inputs. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the effect of different masking approaches and parameters of the LAST aligner using genome assemblies of 33 grass species. Compared with conventional masking with RepeatMasker, a masking approach based on k-mers (nucleotide sequences of k length) increased the alignment rate of coding sequence and noncoding functional regions by 25 and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for noncoding functional regions by over 52% compared with default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of noncoding sites that can be scored for conservation by over 76%. Overall, tuning of masking and alignment parameters can generate optimized multiple alignments to drive biological discovery in plants.
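The general idea behind k-mer-based masking can be sketched as follows: count every k-mer in a sequence and soft-mask (lowercase) the positions covered by k-mers that occur more often than a cutoff. This is a toy illustration only; the function name, the values of `k` and `max_count`, and the single-sequence counting are assumptions and do not describe how msa_pipeline implements masking.

```python
from collections import Counter

def kmer_softmask(seq: str, k: int = 4, max_count: int = 2) -> str:
    """Lowercase every base covered by a k-mer seen more than max_count times."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    masked = list(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] > max_count:
            for j in range(i, i + k):
                masked[j] = masked[j].lower()
    return "".join(masked)

# Toy sequence: unique flanks around a short AT dinucleotide repeat.
example = "ACGTTGCA" + "ATATATATATAT" + "GGCTAGCT"
print(kmer_softmask(example))
```

Soft-masking keeps the original bases recoverable, so an aligner can skip repeats when seeding while still extending alignments through them.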


Subject(s)
Genome, Plant , Genomics , Base Sequence , Workflow