Search | VHL Regional Portal

1.

Chromosome-scale genome assembly of bread wheat's wild relative Triticum timopheevii.

Grewal, Surbhi; Yang, Cai-Yun; Scholefield, Duncan; Ashling, Stephen; Ghosh, Sreya; Swarbreck, David; Collins, Joanna; Yao, Eric; Sen, Taner Z; Wilson, Michael; Yant, Levi; King, Ian P; King, Julie.

Sci Data ; 11(1): 420, 2024 Apr 23.

Article in English | MEDLINE | ID: mdl-38653999

ABSTRACT

Wheat (Triticum aestivum) is one of the most important food crops with an urgent need for increase in its production to feed the growing world. Triticum timopheevii (2n = 4x = 28) is an allotetraploid wheat wild relative species containing the At and G genomes that has been exploited in many pre-breeding programmes for wheat improvement. In this study, we report the generation of a chromosome-scale reference genome assembly of T. timopheevii accession PI 94760 based on PacBio HiFi reads and chromosome conformation capture (Hi-C). The assembly comprised a total size of 9.35 Gb, featuring a contig N50 of 42.4 Mb and included the mitochondrial and plastid genome sequences. Genome annotation predicted 166,325 gene models including 70,365 genes with high confidence. DNA methylation analysis showed that the G genome had on average more methylated bases than the At genome. In summary, the T. timopheevii genome assembly provides a valuable resource for genome-informed discovery of agronomically important genes for food security.
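The contig N50 quoted in the abstract above is a standard contiguity statistic. A minimal sketch of how it is usually computed follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in Mb (made-up values, not from the assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: the two longest contigs reach half of 300
```

A larger N50 means that fewer, longer contigs account for half of the assembly.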

Subject(s)

Chromosomes, Plant , Genome, Plant , Triticum , Triticum/genetics , Chromosomes, Plant/genetics , DNA Methylation

2.

Harnessing the predicted maize pan-interactome for putative gene function prediction and prioritization of candidate genes for important traits.

Poretsky, Elly; Cagirici, Halise Busra; Andorf, Carson M; Sen, Taner Z.

G3 (Bethesda) ; 14(5), 2024 05 07.

Article in English | MEDLINE | ID: mdl-38492232

ABSTRACT

The recent assembly and annotation of the 26 maize nested association mapping population founder inbreds have enabled large-scale pan-genomic comparative studies. These studies have expanded our understanding of agronomically important traits by integrating pan-transcriptomic data with trait-specific gene candidates from previous association mapping results. In contrast to the availability of pan-transcriptomic data, obtaining reliable protein-protein interaction (PPI) data has remained a challenge due to its high cost and complexity. We generated predicted PPI networks for each of the 26 genomes using the established STRING database. The individual genome-interactomes were then integrated to generate core- and pan-interactomes. We deployed the PPI clustering algorithm ClusterONE to identify numerous PPI clusters that were functionally annotated using gene ontology (GO) functional enrichment, demonstrating a diverse range of enriched GO terms across different clusters. Additional cluster annotations were generated by integrating gene coexpression data and gene description annotations, providing additional useful information. We show that the functionally annotated PPI clusters establish a useful framework for protein function prediction and prioritization of candidate genes of interest. Our study not only provides a comprehensive resource of predicted PPI networks for 26 maize genomes but also offers annotated interactome clusters for predicting protein functions and prioritizing gene candidates. The source code for the Python implementation of the analysis workflow and a standalone web application for accessing the analysis results are available at https://github.com/eporetsky/PanPPI.
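One way to picture the integration step described above is as set operations over per-genome edge lists. The sketch below assumes that proteins have already been mapped to shared pan-gene identifiers, that the pan-interactome is the union of edges, and that the core-interactome is the set of edges present in every genome. These definitions, the function names, and the toy identifiers are assumptions for illustration and are not taken from the PanPPI code.

```python
from collections import Counter


def normalize(edge):
    """Treat an interaction as undirected: (a, b) equals (b, a)."""
    a, b = edge
    return (a, b) if a <= b else (b, a)


def integrate(networks):
    """Combine per-genome edge lists into pan and core edge sets.

    networks: dict mapping genome name -> iterable of (protein, protein)
    pairs, with proteins already mapped to shared identifiers (assumption).
    """
    counts = Counter()
    for edges in networks.values():
        # Count each edge at most once per genome.
        counts.update({normalize(e) for e in edges})
    pan = set(counts)
    core = {e for e, n in counts.items() if n == len(networks)}
    return pan, core


# Hypothetical toy input: three genomes, identifiers are made up.
toy_networks = {
    "genome_A": [("pg1", "pg2"), ("pg2", "pg3")],
    "genome_B": [("pg2", "pg1"), ("pg3", "pg4")],
    "genome_C": [("pg1", "pg2"), ("pg2", "pg3")],
}
pan_edges, core_edges = integrate(toy_networks)
print(sorted(pan_edges))
print(sorted(core_edges))
```

Counting how many genomes each edge appears in, rather than only intersecting, leaves room for intermediate categories between core and genome-specific edges.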

Subject(s)

Zea mays , Zea mays/genetics , Protein Interaction Maps/genetics , Molecular Sequence Annotation , Gene Ontology , Genome, Plant , Quantitative Trait Loci , Computational Biology/methods , Algorithms , Genes, Plant , Quantitative Trait, Heritable , Phenotype , Databases, Genetic , Genomics/methods

3.

PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models.

Poretsky, Elly; Andorf, Carson M; Sen, Taner Z.

Plant Direct ; 7(12): e554, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38124705

ABSTRACT

Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remains limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Trained on data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall score of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence to show that PhosBoost models are transferable across species and scalable for genome-wide protein phosphorylation predictions. PhosBoost is freely and publicly available on GitHub.
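The precision-recall tradeoff mentioned in the abstract can be made concrete with a small calculation. The sketch below uses made-up labels and scores (not PhosBoost output) and hypothetical helper names to show how lowering the decision threshold raises recall at the cost of precision.

```python
def predict_labels(scores, threshold):
    """Call a site positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = phosphosite)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 4 true sites and 4 non-sites with classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.45, 0.32, 0.1, 0.35, 0.7]

for threshold in (0.5, 0.3):
    p, r = precision_recall(y_true, predict_labels(scores, threshold))
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.3: precision=0.57, recall=1.00
```

The area under the precision-recall curve reported in the abstract summarizes this tradeoff across all thresholds rather than at a single one.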

4.

Data sharing and ontology use among agricultural genetics, genomics, and breeding databases and resources of the Agbiodata Consortium.

Clarke, Jennifer L; Cooper, Laurel D; Poelchau, Monica F; Berardini, Tanya Z; Elser, Justin; Farmer, Andrew D; Ficklin, Stephen; Kumari, Sunita; Laporte, Marie-Angélique; Nelson, Rex T; Sadohara, Rie; Selby, Peter; Thessen, Anne E; Whitehead, Brandon; Sen, Taner Z.

Database (Oxford) ; 2023, 2023 11 15.

Article in English | MEDLINE | ID: mdl-37971715

ABSTRACT

Over the last couple of decades, there has been a rapid growth in the number and scope of agricultural genetics, genomics and breeding (GGB) databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as 'databases' throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets, which requires data sharing, along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, respectively, conducted a Consortium-wide survey to assess the current status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data-sharing practices by AgBioData databases are in a fairly healthy state, but it is not clear whether this is true for all metadata and data types across all databases, and that ontology use has not substantially changed since a similar survey was conducted in 2017. Based on our evaluation of the survey results, we recommend (i) providing training for database personnel in specific data-sharing techniques, as well as in ontology use; (ii) further study on what metadata is shared, and how well it is shared among databases; (iii) promoting an understanding of data sharing and ontologies in the stakeholder community; (iv) improving data sharing and ontologies for specific phenotypic data types and formats; and (v) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards. Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means. Database URL: https://www.agbiodata.org/databases.

Subject(s)

Data Management , Plant Breeding , Animals , Genomics/methods , Databases, Factual , Information Dissemination

5.

Co-expression pan-network reveals genes involved in complex traits within maize pan-genome.

Cagirici, H Busra; Andorf, Carson M; Sen, Taner Z.

BMC Plant Biol ; 22(1): 595, 2022 Dec 19.

Article in English | MEDLINE | ID: mdl-36529716

ABSTRACT

BACKGROUND: With advances in high-throughput next-generation sequencing technologies, genome-wide association studies (GWAS) have identified a large set of variants associated with complex phenotypic traits at a very fine scale. Despite the progress in GWAS, identifying genotype-phenotype relationships remains challenging in maize because dozens of variants can control the same trait. As causal variations result in changes in expression, gene expression analyses play a pivotal role in unraveling the transcriptional regulatory mechanisms behind the phenotypes. RESULTS: To address these challenges, we incorporated gene expression and GWAS-driven traits to extend the knowledge of genotype-phenotype relationships and transcriptional regulatory mechanisms behind the phenotypes. We constructed a large collection of gene co-expression networks and identified more than 2 million co-expressing gene pairs in the GWAS-driven pan-network, which contains all the gene pairs in individual genomes of the nested association mapping (NAM) population. We defined four sub-categories for the pan-network: (1) the core-network contains the highest represented ~1% of the gene pairs, (2) the near-core network contains the next highest represented 1-5% of the gene pairs, (3) the private-network contains ~50% of the gene pairs that are unique to individual genomes, and (4) the dispensable-network contains the remaining 50-95% of the gene pairs in the maize pan-genome. Strikingly, the private-network contained almost all the genes in the pan-network but lacked half of the interactions. We performed gene ontology (GO) enrichment analysis for the pan-, core-, and private-networks and compared the contributions of variants overlapping with genes and promoters to the GWAS-driven pan-network. CONCLUSIONS: Gene co-expression networks revealed meaningful information about groups of co-regulated genes that play a central role in regulatory processes. The pan-network approach enabled us to visualize the global view of the gene regulatory network for the studied system that could not be well inferred by the core-network alone.

Subject(s)

Genome-Wide Association Study , Zea mays , Zea mays/genetics , Genome-Wide Association Study/methods , Multifactorial Inheritance , Phenotype , Gene Regulatory Networks , Polymorphism, Single Nucleotide/genetics

6.

Capturing Wheat Phenotypes at the Genome Level.

Hussain, Babar; Akpinar, Bala A; Alaux, Michael; Algharib, Ahmed M; Sehgal, Deepmala; Ali, Zulfiqar; Aradottir, Gudbjorg I; Batley, Jacqueline; Bellec, Arnaud; Bentley, Alison R; Cagirici, Halise B; Cattivelli, Luigi; Choulet, Fred; Cockram, James; Desiderio, Francesca; Devaux, Pierre; Dogramaci, Munevver; Dorado, Gabriel; Dreisigacker, Susanne; Edwards, David; El-Hassouni, Khaoula; Eversole, Kellye; Fahima, Tzion; Figueroa, Melania; Gálvez, Sergio; Gill, Kulvinder S; Govta, Liubov; Gul, Alvina; Hensel, Goetz; Hernandez, Pilar; Crespo-Herrera, Leonardo Abdiel; Ibrahim, Amir; Kilian, Benjamin; Korzun, Viktor; Krugman, Tamar; Li, Yinghui; Liu, Shuyu; Mahmoud, Amer F; Morgounov, Alexey; Muslu, Tugdem; Naseer, Faiza; Ordon, Frank; Paux, Etienne; Perovic, Dragan; Reddy, Gadi V P; Reif, Jochen Christoph; Reynolds, Matthew; Roychowdhury, Rajib; Rudd, Jackie; Sen, Taner Z.

Front Plant Sci ; 13: 851079, 2022.

Article in English | MEDLINE | ID: mdl-35860541

ABSTRACT

Recent technological advances in next-generation sequencing (NGS) technologies have dramatically reduced the cost of DNA sequencing, allowing species with large and complex genomes to be sequenced. Although bread wheat (Triticum aestivum L.) is one of the world's most important food crops, efficient exploitation of molecular marker-assisted breeding approaches has lagged behind that achieved in other crop species, due to its large polyploid genome. However, an international public-private effort spanning 9 years reported over 65% draft genome of bread wheat in 2014, and finally, after more than a decade, culminated in the release of a gold-standard, fully annotated reference wheat-genome assembly in 2018. Shortly thereafter, in 2020, genome assemblies of 15 additional global wheat accessions were released. As a result, wheat has now entered the pan-genomic era, where basic resources can be efficiently exploited. Wheat genotyping with a few hundred markers has been replaced by genotyping arrays, capable of characterizing hundreds of wheat lines, using thousands of markers, providing fast, relatively inexpensive, and reliable data for exploitation in wheat breeding. These advances have opened up new opportunities for marker-assisted selection (MAS) and genomic selection (GS) in wheat. Herein, we review the advances and perspectives in wheat genetics and genomics, with a focus on key traits, including grain yield, yield-related traits, end-use quality, and resistance to biotic and abiotic stresses. We also focus on reported candidate genes cloned and linked to traits of interest. Furthermore, we report on the improvement in the aforementioned quantitative traits, through the use of (i) clustered regularly interspaced short-palindromic repeats/CRISPR-associated protein 9 (CRISPR/Cas9)-mediated gene-editing and (ii) positional cloning methods, and of genomic selection. Finally, we examine the utilization of genomics for next-generation wheat breeding, providing a practical example of using in silico bioinformatics tools that are based on the wheat reference-genome sequence.

7.

Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach.

Cho, Kyoung Tak; Sen, Taner Z; Andorf, Carson M.

Front Artif Intell ; 5: 830170, 2022.

Article in English | MEDLINE | ID: mdl-35719692

ABSTRACT

Machine learning and modeling approaches have been used to classify protein sequences for a broad set of tasks including predicting protein function, structure, expression, and localization. Some recent studies have successfully predicted whether a given gene is expressed as mRNA or even potentially translated into protein, but given that not all genes are expressed in every condition and tissue, the challenge remains to predict condition-specific expression. To address this gap, we developed a machine learning approach to predict tissue-specific gene expression across 23 different tissues in maize, solely based on DNA promoter and protein sequences. For class labels, we defined high and low expression levels for mRNA and protein abundance and optimized classifiers by systematically exploring various methods and combinations of k-mer sequences in a two-phase approach. In the first phase, we developed Markov model classifiers for each tissue and built a feature vector based on the predictions. In the second phase, the feature vector was used as an input to a Bayesian network for final classification. Our results show that these methods can achieve high classification accuracy of up to 95% for predicting gene expression for individual tissues. By relying on sequence alone, our method works in settings where costly experimental data are unavailable and reveals useful insights into the functional, evolutionary, and regulatory characteristics of genes.
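The first phase described above, a Markov model classifier per tissue, can be sketched as two Markov chains (one trained on high-expression sequences, one on low) compared by log-likelihood. The model order, the add-one smoothing, the class and function names, and the toy sequences are all assumptions made for illustration. The authors' actual models and the second-phase Bayesian network are not reproduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class MarkovModel:
    """Order-k Markov chain over DNA with add-one (Laplace) smoothing."""

    def __init__(self, k=2):
        self.k = k
        # counts[context][next_base] = number of times observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        total = 0.0
        for i in range(self.k, len(seq)):
            seen = self.counts.get(seq[i - self.k:i], {})
            numerator = seen.get(seq[i], 0) + 1
            denominator = sum(seen.values()) + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def classify(seq, high_model, low_model):
    """Label a sequence by which model gives it the higher likelihood."""
    score = high_model.log_likelihood(seq) - low_model.log_likelihood(seq)
    return ("high" if score > 0 else "low"), score


# Toy training data (made up): GC-rich = "high", AT-rich = "low".
high = MarkovModel(k=2).fit(["GCGCGCGCGC", "CGCGCGCGGC"])
low = MarkovModel(k=2).fit(["ATATATATAT", "TATATATAAT"])

print(classify("GCGCGCGC", high, low)[0])  # high
print(classify("ATATATAT", high, low)[0])  # low
```

In a two-phase design like the one in the abstract, the per-tissue scores from such models would form the feature vector passed to the second classifier.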

8.

G4Boost: a machine learning-based tool for quadruplex identification and stability prediction.

Cagirici, H Busra; Budak, Hikmet; Sen, Taner Z.

BMC Bioinformatics ; 23(1): 240, 2022 Jun 18.

Article in English | MEDLINE | ID: mdl-35717172

ABSTRACT

BACKGROUND: G-quadruplexes (G4s), formed within guanine-rich nucleic acids, are secondary structures involved in important biological processes. Although every G4 motif has the potential to form a stable G4 structure, not every G4 motif would, and accurate energy-based methods are needed to assess their structural stability. Here, we present a decision tree-based prediction tool, G4Boost, to identify G4 motifs and predict their secondary structure folding probability and thermodynamic stability based on their sequences, nucleotide compositions, and estimated structural topologies. RESULTS: G4Boost predicted the quadruplex folding state with an accuracy greater than 93% and an F1-score of 0.96, and the folding energy with an RMSE of 4.28 and R² of 0.95, by means of sequence-intrinsic features alone. G4Boost was successfully applied and validated to predict the stability of experimentally determined G4 structures, including for plants and humans. CONCLUSION: G4Boost outperformed the three machine learning-based prediction tools, DeepG4, Quadron, and G4RNA Screener, in terms of both accuracy and F1-score, and can be highly useful for G4 prediction to understand gene regulation across species including plants and humans.

Subject(s)

G-Quadruplexes , Gene Expression Regulation , Guanine/chemistry , Humans , Machine Learning , Thermodynamics

9.

GrainGenes: a data-rich repository for small grains genetics and genomics.

Yao, Eric; Blake, Victoria C; Cooper, Laurel; Wight, Charlene P; Michel, Steve; Cagirici, H Busra; Lazo, Gerard R; Birkett, Clay L; Waring, David J; Jannink, Jean-Luc; Holmes, Ian; Waters, Amanda J; Eickholt, David P; Sen, Taner Z.

Database (Oxford) ; 2022, 2022 05 25.

Article in English | MEDLINE | ID: mdl-35616118

ABSTRACT

As one of the US Department of Agriculture-Agricultural Research Service flagship databases, GrainGenes (https://wheat.pw.usda.gov) serves the data and community needs of globally distributed small grains researchers for the genetic improvement of the Triticeae family and Avena species that include wheat, barley, rye and oat. GrainGenes accomplishes its mission by continually enriching its cross-linked data content following the findable, accessible, interoperable and reusable principles, enhancing and maintaining an intuitive web interface, creating tools to enable easy data access and establishing data connections within and between GrainGenes and other biological databases to facilitate knowledge discovery. GrainGenes operates within the biological database community, collaborates with curators and genome sequencing groups and contributes to the AgBioData Consortium and the International Wheat Initiative through the Wheat Information System (WheatIS). Interactive and linked content is paramount for successful biological databases and GrainGenes now has 2917 manually curated gene records, including 289 genes and 254 alleles from the Wheat Gene Catalogue (WGC). There are >4.8 million gene models in 51 genome browser assemblies, 6273 quantitative trait loci and >1.4 million genetic loci on 4756 genetic and physical maps contained within 443 mapping sets, complete with standardized metadata. Most notably, 50 new genome browsers that include outputs from the Wheat and Barley PanGenome projects have been created. We provide an example of an expression quantitative trait loci track on the International Wheat Genome Sequencing Consortium Chinese Spring wheat browser to demonstrate how genome browser tracks can be adapted for different data types. To help users benefit more from its data, GrainGenes created four tutorials available on YouTube. GrainGenes is executing its vision of service by continuously responding to the needs of the global small grains community by creating a centralized, long-term, interconnected data repository. Database URL: https://wheat.pw.usda.gov.

Subject(s)

Genome, Plant , Hordeum , Avena/genetics , Chromosome Mapping , Databases, Genetic , Genome, Plant/genetics , Genomics , Hordeum/genetics , Quantitative Trait Loci , Triticum/genetics

10.

GrainGenes: Tools and Content to Assist Breeders Improving Oat Quality.

Blake, Victoria C; Wight, Charlene P; Yao, Eric; Sen, Taner Z.

Foods ; 11(7), 2022 Mar 23.

Article in English | MEDLINE | ID: mdl-35407001

ABSTRACT

GrainGenes is the USDA-ARS database and Web resource for wheat, barley, oat, rye, and their relatives. As a community Web hub and database for small grains, GrainGenes strives to provide resources for researchers, students, and plant breeders to improve traits such as quality, yield, and disease resistance. Quantitative trait loci (QTL), genes, and genetic maps for quality attributes in GrainGenes represent the historical approach to mapping genes for groat percentage, test weight, protein, fat, and β-glucan content in oat (Avena spp.). Genetic maps are viewable in CMap, the comparative mapping tool that enables researchers to take advantage of highly populated consensus maps to increase the marker density around their genes-of-interest. GrainGenes hosts over 50 genome browsers and is launching an effort for community curation, including the manually curated tracks with β-glucan QTL and significant markers found via GWAS and cloned cellulose synthase-like AsClF6 alleles.

11.

Multiple Variant Calling Pipelines in Wheat Whole Exome Sequencing.

Cagirici, H Busra; Akpinar, Bala Ani; Sen, Taner Z; Budak, Hikmet.

Int J Mol Sci ; 22(19), 2021 Sep 27.

Article in English | MEDLINE | ID: mdl-34638743

ABSTRACT

The highly challenging hexaploid wheat (Triticum aestivum) genome is becoming ever more accessible due to the continued development of multiple reference genomes, a factor which aids the effort to better understand variation in important traits. Although the process of variant calling is relatively straightforward, selecting the best combination of computational tools for the read alignment and variant calling stages of the analysis, and efficiently filtering false variant calls, are not always easy tasks. Previous studies have analyzed the impact of methods on the quality metrics in diploid organisms. Given that variant identification in wheat largely relies on accurate mining of exome data, there is a critical need to better understand how different methods affect the analysis of whole exome sequencing (WES) data in polyploid species. This study aims to address this by performing whole exome sequencing of 48 wheat cultivars and assessing the performance of various variant calling pipelines at their suggested settings. The results show that all the pipelines require filtering to eliminate false-positive calls. The high consensus among the reference SNPs called by the best-performing pipelines suggests that filtering provides accurate and reproducible results. This study also provides detailed comparisons for high sensitivity and precision at individual and population levels for the raw and filtered SNP calls.

Subject(s)

Exome Sequencing , Genome, Plant , Polymorphism, Single Nucleotide , Polyploidy , Triticum/genetics

12.

mirMachine: A One-Stop Shop for Plant miRNA Annotation.

Cagirici, H Busra; Sen, Taner Z; Budak, Hikmet.

J Vis Exp ; (171), 2021 05 01.

Article in English | MEDLINE | ID: mdl-33999024

ABSTRACT

Of the different types of noncoding RNAs, microRNAs (miRNAs) have arguably been in the spotlight over the last decade. As post-transcriptional regulators of gene expression, miRNAs play key roles in various cellular pathways, including both development and response to a/biotic stress, such as drought and diseases. Having high-quality reference genome sequences enabled identification and annotation of miRNAs in several plant species, where miRNA sequences are highly conserved. As computational miRNA identification and annotation are mostly error-prone processes, homology-based predictions increase prediction accuracy. We developed the miRNA annotation pipeline SUmir and have improved it over the last decade, during which it has been used for several plant genomes. This study presents a fully automated, new miRNA pipeline, mirMachine (miRNA Machine), by (i) adding an additional filtering step on the secondary structure predictions, (ii) making it fully automated, and (iii) introducing new options to predict either known miRNAs based on homology or novel miRNAs based on small RNA sequencing reads using the previous pipeline. The new miRNA pipeline, mirMachine, was tested using The Arabidopsis Information Resource, TAIR10, release of the Arabidopsis genome and the International Wheat Genome Sequencing Consortium (IWGSC) wheat reference genome v2.

Subject(s)

Arabidopsis , MicroRNAs , Arabidopsis/genetics , Base Sequence , Gene Expression Regulation, Plant , Genome, Plant/genetics , High-Throughput Nucleotide Sequencing , MicroRNAs/genetics , RNA, Plant/genetics , Sequence Analysis, RNA

13.

FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences.

Banerjee, Sagnik; Bhandary, Priyanka; Woodhouse, Margaret; Sen, Taner Z; Wise, Roger P; Andorf, Carson M.

BMC Bioinformatics ; 22(1): 205, 2021 Apr 20.

Article in English | MEDLINE | ID: mdl-33879057

ABSTRACT

BACKGROUND: Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative. RESULTS: We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species. CONCLUSIONS: FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision, making it ideal for bench researchers with limited experience in handling computational tools.

Subject(s)

Eukaryota , Software , Eukaryota/genetics , Genome , Molecular Sequence Annotation , RNA-Seq , Sequence Analysis, RNA

14.

Genome-wide discovery of G-quadruplexes in barley.

Cagirici, H Busra; Budak, Hikmet; Sen, Taner Z.

Sci Rep ; 11(1): 7876, 2021 04 12.

Article in English | MEDLINE | ID: mdl-33846409

ABSTRACT

G-quadruplexes (G4s) are four-stranded nucleic acid structures with closely spaced guanine bases forming square planar G-quartets. Aberrant formation of G4 structures has been associated with genomic instability. However, most plant species are lacking comprehensive studies of G4 motifs. In this study, genome-wide identification of G4 motifs in barley was performed, followed by a comparison of genomic distribution and molecular functions to other monocot species, such as wheat, maize, and rice. Similar to the reports on human and some plants like wheat, G4 motifs peaked around the 5' untranslated region (5' UTR), the first coding domain sequence, and the first intron start sites on antisense strands. Our comparative analyses in human, Arabidopsis, maize, rice, and sorghum demonstrated that the peak points could be erroneously merged into a single peak when large window sizes are used. We also showed that the G4 distributions around genic regions are relatively similar in the species studied, except in the case of Arabidopsis. G4 containing genes in monocots showed conserved molecular functions for transcription initiation and hydrolase activity. Additionally, we provided examples of imperfect G4 motifs.

Subject(s)

G-Quadruplexes , Hordeum/genetics , Arabidopsis/genetics , Genome, Human , Genome, Plant , Humans , Polymorphism, Single Nucleotide , Zea mays/genetics

15.

LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants.

Cagirici, H Busra; Galvez, S; Sen, Taner Z; Budak, Hikmet.

Funct Integr Genomics ; 21(2): 195-204, 2021 Mar.

Article in English | MEDLINE | ID: mdl-33635499

ABSTRACT

Following the elucidation of the critical roles they play in numerous important biological processes, long noncoding RNAs (lncRNAs) have gained vast attention in recent years. Manual annotation of lncRNAs is restricted by known gene annotations and is prone to false prediction due to the incompleteness of available data. However, with the advent of high-throughput sequencing technologies, a large volume of high-quality data has become available for annotation, especially for plant species such as wheat. Here, we compared prediction accuracies of several machine learning algorithms using 10-fold cross-validation. This study includes a comprehensive feature selection step to filter out irrelevant and repeated features. We present a crop-specific, alignment-free coding potential prediction tool, LncMachine, that performs at higher prediction accuracies than the currently available popular tools (CPC2, CPAT, and CNIT) when used with the Random Forest algorithm. Further, LncMachine with Random Forest performed well on human and mouse data, with an average accuracy of 92.67%. LncMachine only requires either a FASTA file or a TAB-separated CSV file containing features as input. LncMachine can deploy several user-provided algorithms in real time and can therefore be effortlessly applied to a wide range of studies.

Subject(s)

Computational Biology , Molecular Sequence Annotation , Plants/genetics , RNA, Long Noncoding/genetics , Algorithms , High-Throughput Nucleotide Sequencing , Machine Learning , RNA, Long Noncoding/classification

16.

JBrowse Connect: A server API to connect JBrowse instances and users.

Yao, Eric; Buels, Robert; Stein, Lincoln; Sen, Taner Z; Holmes, Ian.

PLoS Comput Biol ; 16(8): e1007261, 2020 08.

Article in English | MEDLINE | ID: mdl-32810130

ABSTRACT

We describe JBrowse Connect, an optional expansion to the JBrowse genome browser, targeted at developers. JBrowse Connect allows live messaging, notifications for new annotation tracks, heavy-duty analyses initiated by the user from within the browser, and other dynamic features. We present example applications of JBrowse Connect that allow users 1) to specify and execute BLAST searches by either running on the same host as the webserver, with a self-contained BLAST module leveraging NCBI Blast+ commands, or via a managed Galaxy instance that can optionally run on a different host, and 2) to run the primer design service Primer3. JBrowse Connect allows users to track job progress and view results in the context of the browser. The software is available under a choice of open source licenses including LGPL and the Artistic License.

Subject(s)

Databases, Genetic , Genomics/methods , Software , Internet

17.

Genome-Wide Discovery of G-Quadruplexes in Wheat: Distribution and Putative Functional Roles.

Cagirici, H Busra; Sen, Taner Z.

G3 (Bethesda) ; 10(6): 2021-2032, 2020 06 01.

Article in English | MEDLINE | ID: mdl-32295768

ABSTRACT

G-quadruplexes are nucleic acid secondary structures formed by a stack of square planar G-quartets. G-quadruplexes have been implicated in many biological functions, including telomere maintenance, replication, transcription, and translation, in many species including humans and plants. For wheat, however, though it is one of the world's most important staple foods, no G-quadruplex studies have been reported to date. Here, we computationally identify putative G4 structures (G4s) in the wheat genome for the first time and compare their distribution across the genome against five other genomes (human, maize, Arabidopsis, rice, and sorghum). We identified close to 1 million G4 motifs with a density of 76 G4s/Mb across the whole genome and 93 G4s/Mb over genic regions. Remarkably, G4s were enriched around three regions, two located on the antisense and one on the sense strand at the following positions: 1) the transcription start site (TSS) (antisense), 2) the first coding domain sequence (CDS) (antisense), and 3) the start codon (sense). Functional enrichment analysis revealed that the gene models containing G4 motifs within these peaks were associated with specific gene ontology (GO) terms, such as developmental process, localization, and cellular component organization or biogenesis. We investigated genes encoding MADS-box transcription factors and showed examples of G4 motifs within critical regulatory regions in the VRN-1 genes in wheat. Furthermore, comparison with other plants showed that monocots share a similar distribution of G4s, but Arabidopsis shows a unique G4 distribution. Our study shows for the first time the prevalence and possible functional roles of G4s in wheat.
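As a rough illustration of how putative G4 motifs and a per-megabase density can be computed, the sketch below scans both strands with a commonly used pattern (four runs of at least three guanines separated by loops of 1 to 7 nucleotides). The pattern, the function names, and the toy sequence are assumptions; the abstract does not state which motif definition or tool the authors used.

```python
import re

# Assumed motif definition: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# A C-rich match on the given strand implies a G4 motif on the opposite strand.
G4_SENSE = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
G4_ANTISENSE = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")

def find_g4_motifs(sequence):
    """Return (start, end, strand) for non-overlapping putative G4 motifs."""
    sequence = sequence.upper()
    hits = [(m.start(), m.end(), "+") for m in G4_SENSE.finditer(sequence)]
    hits += [(m.start(), m.end(), "-") for m in G4_ANTISENSE.finditer(sequence)]
    return sorted(hits)

def g4_density_per_mb(sequence):
    """Number of putative G4 motifs per million bases of the given sequence."""
    return len(find_g4_motifs(sequence)) * 1e6 / len(sequence)

# Hypothetical toy sequence with one motif on each strand.
toy = "AT" + "GGGAGGGTGGGCGGG" + "TTTT" + "CCCACCCTCCCACCC" + "AT"
motifs = find_g4_motifs(toy)
```

Because `finditer` reports non-overlapping matches, long G-rich stretches are counted conservatively; tools that report overlapping motifs would give higher counts.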

Subject(s)

G-Quadruplexes , Humans , Regulatory Sequences, Nucleic Acid , Transcription Initiation Site , Triticum/genetics , Zea mays

18.

Tissue-specific gene expression and protein abundance patterns are associated with fractionation bias in maize.

Walsh, Jesse R; Woodhouse, Margaret R; Andorf, Carson M; Sen, Taner Z.

BMC Plant Biol ; 20(1): 4, 2020 Jan 03.

Article in English | MEDLINE | ID: mdl-31900107

ABSTRACT

BACKGROUND: Maize experienced a whole-genome duplication event approximately 5 to 12 million years ago. Because this event occurred after speciation from sorghum, the pre-duplication subgenomes can be partially reconstructed by mapping syntenic regions to the sorghum chromosomes. During evolution, maize has undergone uneven gene loss between its ancient subgenomes. Fractionation and divergence between these genomes continue today, constantly changing genetic make-up and phenotypes and influencing agronomic traits. RESULTS: Here we regenerate the subgenome reconstructions for the most recent maize reference genome assembly. Based on both expression and abundance data for homeologous gene pairs across multiple tissues, we observed functional divergence of genes across subgenomes. Although the genes in the larger maize subgenome are often expressed more highly than their homeologs in the smaller subgenome, we observed cases where homeolog expression dominance switches in different tissues. We demonstrate for the first time that protein abundances are higher in the larger subgenome, but they also show tissue-specific dominance, a pattern similar to RNA expression dominance. We also find that pollen expression is uniquely decoupled from protein abundance. CONCLUSION: Our study shows that the larger subgenome has a greater range of functional assignments and that there is less overlap between the subgenomes in terms of gene functions than would be suggested by similar patterns of gene expression and protein abundance. Our study also revealed that some reactions are catalyzed uniquely by the larger or the smaller subgenome. The tissue-specific, nonequivalent expression-level dominance pattern observed here implies a change in regulatory control which favors differentiated selective pressure on the retained duplicates leading to eventual change in gene functions.
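The idea of homeolog expression dominance switching between tissues can be made concrete with a small sketch. The two-fold cutoff, the function names, and the expression values are hypothetical and are not taken from the study.

```python
def dominance_by_tissue(pair_expression, min_fold=2.0):
    """Label each tissue by which homeolog is expressed at least min_fold higher.

    pair_expression maps tissue -> (larger_subgenome_expr, smaller_subgenome_expr).
    """
    calls = {}
    for tissue, (larger, smaller) in pair_expression.items():
        if larger > 0 and larger >= min_fold * smaller:
            calls[tissue] = "larger"
        elif smaller > 0 and smaller >= min_fold * larger:
            calls[tissue] = "smaller"
        else:
            calls[tissue] = "balanced"
    return calls

def switches_dominance(calls):
    """True if each subgenome copy is dominant in at least one tissue."""
    return {"larger", "smaller"} <= set(calls.values())

# Hypothetical expression values for one homeologous gene pair.
example_pair = {"root": (40.0, 10.0), "leaf": (5.0, 30.0), "pollen": (12.0, 10.0)}
example_calls = dominance_by_tissue(example_pair)
```

The same function could be applied to protein abundance values to compare the two dominance patterns tissue by tissue.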

Subject(s)

Gene Expression Regulation, Plant/genetics , Gene Expression/genetics , Zea mays/genetics , Chromosome Mapping/methods , Evolution, Molecular , Gene Duplication , Gene Ontology , Genes, Plant , Genome, Plant , Phylogeny , Plant Proteins/biosynthesis , Plant Proteins/genetics , Pollen/genetics , Polyploidy

19.

Building a successful international research community through data sharing: The case of the Wheat Information System (WheatIS).

Sen, Taner Z; Caccamo, Mario; Edwards, David; Quesneville, Hadi.

F1000Res ; 9: 536, 2020.

Article in English | MEDLINE | ID: mdl-33763204

ABSTRACT

The International Wheat Information System (WheatIS) Expert Working Group (EWG) was initiated in 2012 under the Wheat Initiative with a broad range of contributing organizations. The mission of the WheatIS EWG was to create an informational infrastructure, establish data standards, and build a single portal that allows search, retrieval, and display of globally distributed wheat data sets that are indexed in standard data formats at servers around the world. The web portal at WheatIS.org was released publicly in 2015, and by 2020, it expanded to 8 geographically distributed nodes and around 20 organizations under its umbrella. In this paper, we present our experience, the challenges we faced, and the solutions we found in establishing an international research community to build an informational infrastructure. Our hope is that our experience with building wheatis.org will guide current and future research communities in navigating institutional and international challenges and in creating global tools and resources to help their respective scientific communities.

Subject(s)

Information Dissemination , Research/organization & administration , Triticum , Information Storage and Retrieval , Information Systems

20.

GrainGenes: centralized small grain resources and digital platform for geneticists and breeders.

Blake, Victoria C; Woodhouse, Margaret R; Lazo, Gerard R; Odell, Sarah G; Wight, Charlene P; Tinker, Nicholas A; Wang, Yi; Gu, Yong Q; Birkett, Clay L; Jannink, Jean-Luc; Matthews, Dave E; Hane, David L; Michel, Steve L; Yao, Eric; Sen, Taner Z.

Database (Oxford) ; 2019, 2019 01 01.

Article in English | MEDLINE | ID: mdl-31210272

ABSTRACT

GrainGenes (https://wheat.pw.usda.gov or https://graingenes.org) is an international centralized repository for curated, peer-reviewed datasets useful to researchers working on wheat, barley, rye and oat. GrainGenes manages genomic, genetic, germplasm and phenotypic datasets through a dynamically generated web interface that facilitates data discovery. Since 1992, GrainGenes has served geneticists and breeders in both the public and private sectors on six continents. Recently, several new datasets were curated into the database along with new tools for analysis. The GrainGenes homepage was enhanced by making it more visually intuitive and by adding links to commonly used pages. Several genome assemblies and genomic tracks are displayed through the genome browsers at GrainGenes, including the Triticum aestivum (bread wheat) cv. 'Chinese Spring' IWGSC RefSeq v1.0 genome assembly, the Aegilops tauschii (D genome progenitor) Aet v4.0 genome assembly, the Triticum turgidum ssp. dicoccoides (wild emmer wheat) cv. 'Zavitan' WEWSeq v.1.0 genome assembly, a T. aestivum (bread wheat) pangenome, the Hordeum vulgare (barley) cv. 'Morex' IBSC genome assembly, the Secale cereale (rye) select 'Lo7' assembly, a partial hexaploid Avena sativa (oat) assembly and the Triticum durum cv. 'Svevo' (durum wheat) RefSeq Release 1.0 assembly. New genetic maps and markers were added and can be displayed through CMAP. Quantitative trait loci, genetic maps and genes from the Wheat Gene Catalogue are indexed and linked through the Wheat Information System (WheatIS) portal. Training videos were created to help users query and reach the data they need. The GSP (Genome Specific Primers) and PIECE2 (Plant Intron Exon Comparison and Evolution) tools were implemented and are available for use. As more small grains reference sequences become available, GrainGenes will play an increasingly vital role in helping researchers improve crops.

Subject(s)

Databases, Genetic , Edible Grain/genetics , Genome, Plant , Plant Breeding , Poaceae/genetics , Quantitative Trait Loci

