ABSTRACT
BACKGROUND: Lung cancer is the leading cause of cancer death in both men and women. The most common lung cancer subtype is non-small cell lung carcinoma (NSCLC) comprising about 85% of all cases. NSCLC can be further divided into three subtypes: adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), and large cell lung carcinoma. Specific genetic mutations and epigenetic aberrations play an important role in the developmental transition to a specific tumor subtype. The elucidation of normal lung versus lung tumor gene expression patterns and regulatory targets yields biomarker systems that discriminate lung phenotypes (i.e., biomarkers) and provide a foundation for the discovery of normal and aberrant gene regulatory mechanisms. RESULTS: We built condition-specific gene co-expression networks (csGCNs) for normal lung, LUAD, and LUSC conditions. Then, we integrated normal lung tissue-specific gene regulatory networks (tsGRNs) to elucidate control-target biomarker systems for normal and cancerous lung tissue. We characterized co-expressed gene edges, possibly under common regulatory control, for relevance in lung cancer. CONCLUSIONS: Our approach demonstrates the ability to elucidate csGCN:tsGRN merged biomarker systems based on gene expression correlation and regulation. The biomarker systems we describe can be used to classify and further describe lung specimens. Our approach is generalizable and can be used to discover and interpret complex gene expression patterns for any condition or species.
Subjects
Adenocarcinoma of Lung , Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Adenocarcinoma of Lung/genetics , Adenocarcinoma of Lung/pathology , Biomarkers , Biomarkers, Tumor/genetics , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Non-Small-Cell Lung/pathology , Female , Gene Expression Regulation, Neoplastic , Humans , Lung/pathology , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Prognosis
ABSTRACT
Motivation: The ability to centralize and store data for long periods on an end user's computational resources is increasingly difficult for many scientific disciplines. For example, genomics data is increasingly large and distributed, and the data needs to be moved into workflow execution sites ranging from lab workstations to the cloud. However, the typical user is not always informed on emerging network technology or the most efficient methods to move and share data. Thus, the user defaults to using inefficient methods for transfer across the commercial internet. Results: To accelerate large data transfer, we created a tool called the Big Data Smart Socket (BDSS) that abstracts data transfer methodology from the user. The user provides BDSS with a manifest of datasets stored in a remote storage repository. BDSS then queries a metadata repository for curated data transfer mechanisms and optimal path to move each of the files in the manifest to the site of workflow execution. BDSS functions as a standalone tool or can be directly integrated into a computational workflow such as provided by the Galaxy Project. To demonstrate applicability, we use BDSS within a biological context, although it is applicable to any scientific domain. Availability and Implementation: BDSS is available under version 2 of the GNU General Public License at https://github.com/feltus/BDSS . Contact: ffeltus@clemson.edu.
Subjects
Computational Biology/methods , Databases, Factual , Software , Habits , Internet , Workflow
ABSTRACT
Uterine cancer is the fourth most common cancer among women, projected to affect 66,000 US women in 2021. Uterine cancer often arises in the inner lining of the uterus, known as the endometrium, but can present as several different types of cancer, including endometrioid cancer, serous adenocarcinoma, and uterine carcinosarcoma. Previous studies have analyzed the genetic changes between normal and cancerous uterine tissue to identify specific genes of interest, including TP53 and PTEN. Here we used Gaussian Mixture Models to build condition-specific gene coexpression networks for endometrial cancer, uterine carcinosarcoma, and normal uterine tissue. We then incorporated uterine regulatory edges and investigated potential coregulation relationships. These networks were further validated using differential expression analysis, functional enrichment, and a statistical analysis comparing the expression of transcription factors and their target genes across cancerous and normal uterine samples. These networks give a more comprehensive view of the biological networks and pathways affected in uterine cancer than previous single-gene analyses. We hope that the findings of this study can be incorporated into existing knowledge of the genetics of uterine cancer and that the networks can soon yield clinical biomarkers for better prognosis and treatment.
Subjects
Cystadenocarcinoma, Serous , Endometrial Neoplasms , Uterine Neoplasms , Biomarkers , Cystadenocarcinoma, Serous/genetics , Endometrial Neoplasms/genetics , Endometrial Neoplasms/pathology , Female , Humans , Uterine Neoplasms/genetics , Uterine Neoplasms/metabolism , Uterine Neoplasms/pathology , Uterus/metabolism
ABSTRACT
Identification of genes and pathways involved in diseases and physiological conditions is a major task in systems biology. In this study, we developed a novel non-parameter Ising model to integrate protein-protein interaction network and microarray data for identifying differentially expressed (DE) genes. We also proposed a simulated annealing algorithm to find the optimal configuration of the Ising model. The Ising model was applied to two breast cancer microarray data sets. The results showed that more cancer-related DE sub-networks and genes were identified by the Ising model than those by the Markov random field model. Furthermore, cross-validation experiments showed that DE genes identified by Ising model can improve classification performance compared with DE genes identified by Markov random field model.
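The general idea of annealing an Ising-style labeling over an interaction network can be sketched as follows. This is a minimal illustration under stated assumptions, not the model described in the abstract: the energy function, the `coupling` weight, the cooling schedule, and the toy gene scores and edges are all hypothetical choices made for the example.

```python
import math
import random


def energy(spins, scores, edges, coupling=1.0):
    """Ising-style energy (lower is better); an assumed form for illustration.

    spins[g] is +1 (gene called DE) or -1 (not DE), scores[g] is a signed
    evidence score (positive favors DE), and edges are interacting gene pairs.
    """
    field = -sum(scores[g] * s for g, s in spins.items())
    smooth = -coupling * sum(spins[a] * spins[b] for a, b in edges)
    return field + smooth


def anneal(scores, edges, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing with single-gene flips and geometric cooling."""
    rng = random.Random(seed)
    genes = sorted(scores)
    spins = {g: rng.choice((-1, 1)) for g in genes}
    current = energy(spins, scores, edges)
    best, best_energy = dict(spins), current
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        g = rng.choice(genes)
        spins[g] = -spins[g]  # propose flipping one gene's label
        proposed = energy(spins, scores, edges)
        delta = proposed - current
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = proposed  # accept the move
            if current < best_energy:
                best, best_energy = dict(spins), current
        else:
            spins[g] = -spins[g]  # reject: undo the flip
    return best, best_energy


# Hypothetical toy input: per-gene evidence scores and interaction edges.
scores = {"A": 1.5, "B": 0.2, "C": -0.1, "D": -2.0, "E": -1.0}
edges = [("A", "B"), ("B", "C"), ("D", "E")]
best, best_energy = anneal(scores, edges)
print(best, best_energy)
```

In this toy input, gene C has weak evidence against differential expression, but its network neighbors favor it, which is the kind of network-informed call that motivates combining interaction data with expression data.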
Subjects
Breast Neoplasms/genetics , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Protein Interaction Maps , Algorithms , Computer Simulation , Female , Humans , Models, Biological , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Sensitivity and Specificity
ABSTRACT
BACKGROUND: Switchgrass, a C4 species and a warm-season grass native to the prairies of North America, has been targeted for development into an herbaceous biomass fuel crop. Genetic improvement of switchgrass feedstock traits through marker-assisted breeding and biotechnology approaches calls for the development of genomic tools. Establishment of integrated physical and genetic maps for switchgrass will accelerate the mapping of value-added traits useful to breeding programs and the isolation of important target genes by map-based cloning. The reported polyploidy series in switchgrass ranges from diploid (2X = 18) to duodecaploid (12X = 108). As in other large, repeat-rich plant genomes, this genomic complexity will hinder whole-genome sequencing efforts. An extensive physical map providing enough information to resolve the homoeologous genomes would provide the necessary framework for accurate assembly of the switchgrass genome. RESULTS: A switchgrass BAC library constructed by partial digestion of nuclear DNA with EcoRI contains 147,456 clones covering the effective genome approximately 10 times based on a genome size of 3.2 Gigabases (~1.6 Gb effective). Restriction digestion and PFGE analysis of 234 randomly chosen BACs indicated that 95% of the clones contained inserts, ranging from 60 to 180 kb with an average of 120 kb. Comparative sequence analysis of two homoeologous genomic regions harboring orthologs of the rice OsBRI1 locus, a low-copy gene encoding a putative protein kinase and associated with biomass, revealed that orthologous clones from homoeologous chromosomes can be unambiguously distinguished from each other and correctly assembled to respective fingerprint contigs. Thus, the data obtained not only provide genomic resources for further analysis of the switchgrass genome, but also improve efforts for an accurate genome sequencing strategy.
CONCLUSIONS: The construction of the first switchgrass BAC library and comparative analysis of homoeologous regions harboring OsBRI1 orthologs present a glimpse into the switchgrass genome structure and complexity. Data obtained demonstrate the feasibility of using HICF fingerprinting to resolve the homoeologous chromosomes of the two distinct genomes in switchgrass, providing a robust and accurate BAC-based physical platform for this species. The genomic resources and sequence data generated will lay the foundation for deciphering the switchgrass genome and lead the way for an accurate genome sequencing strategy.
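The roughly tenfold coverage quoted in the RESULTS can be checked with a short calculation. The sketch below assumes that coverage is estimated as (clones × fraction with inserts × mean insert size) / effective genome size; the abstract does not state whether empty clones were discounted, so that factor is an assumption, and the function name is illustrative.

```python
def library_coverage(n_clones, insert_fraction, mean_insert_bp, genome_bp):
    """Estimated genome equivalents held in a clone library."""
    return n_clones * insert_fraction * mean_insert_bp / genome_bp


# Figures taken from the abstract: 147,456 clones, 95% with inserts,
# 120 kb average insert, ~1.6 Gb effective genome size.
coverage = library_coverage(147_456, 0.95, 120_000, 1.6e9)
print(f"{coverage:.1f}x")
```

Under these assumptions the estimate comes out slightly above 10 genome equivalents, consistent with the "approximately 10 times" stated in the abstract.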
Subjects
Chromosomes, Artificial, Bacterial/genetics , Energy Metabolism/genetics , Gene Library , Genome, Plant/genetics , Panicum/genetics , Panicum/metabolism , Sequence Homology, Nucleic Acid , DNA Fingerprinting , Feasibility Studies , Genetic Loci/genetics , Oryza/genetics , Polyploidy , Protein Serine-Threonine Kinases/genetics , Reproducibility of Results , Sequence Analysis
ABSTRACT
BACKGROUND: The fermented dried seeds of Theobroma cacao (cacao tree) are the main ingredient in chocolate. World cocoa production was estimated to be 3 million tons in 2010 with an annual estimated average growth rate of 2.2%. The cacao bean production industry is currently under threat from a rise in fungal diseases including black pod, frosty pod, and witches' broom. In order to address these issues, genome-sequencing efforts have been initiated recently to facilitate identification of genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars. However, problems inherent with assembly and resolution of distal regions of complex eukaryotic genomes, such as gaps, chimeric joins, and unresolvable repeat-induced compressions, have been unavoidably encountered with the sequencing strategies selected. RESULTS: Here, we describe the construction of a BAC-based integrated genetic-physical map of the T. cacao cultivar Matina 1-6 which is designed to augment and enhance these sequencing efforts. Three BAC libraries, each comprised of 10× coverage, were constructed and fingerprinted. 230 genetic markers from a high-resolution genetic recombination map and 96 Arabidopsis-derived conserved ortholog set (COS) II markers were anchored using pooled overgo hybridization. A dense tile path consisting of 29,383 BACs was selected and end-sequenced. The physical map consists of 154 contigs and 4,268 singletons. Forty-nine contigs are genetically anchored and ordered to chromosomes for a total span of 307.2 Mbp. The unanchored contigs (105) span 67.4 Mbp and therefore the estimated genome size of T. cacao is 374.6 Mbp. A comparative analysis with A. thaliana, V. vinifera, and P. trichocarpa suggests that comparisons of the genome assemblies of these distantly related species could provide insights into genome structure, evolutionary history, conservation of functional sites, and improvements in physical map assembly. 
A comparison between the two T. cacao cultivars Matina 1-6 and Criollo indicates a high degree of collinearity in their genomes, yet rearrangements were also observed. CONCLUSIONS: The results presented in this study are a stand-alone resource for functional exploitation and enhancement of Theobroma cacao but are also expected to complement and augment ongoing genome-sequencing efforts. This resource will serve as a template for refinement of the T. cacao genome through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays.
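The genome size estimate in the RESULTS is the sum of the anchored and unanchored contig spans. A minimal check of that arithmetic, using only figures from the abstract (variable names are illustrative):

```python
# Contig counts and spans (Mbp) as reported in the abstract.
anchored_contigs, anchored_span_mbp = 49, 307.2
unanchored_contigs, unanchored_span_mbp = 105, 67.4

total_contigs = anchored_contigs + unanchored_contigs
genome_size_mbp = round(anchored_span_mbp + unanchored_span_mbp, 1)
anchored_fraction = anchored_span_mbp / genome_size_mbp

print(total_contigs, genome_size_mbp, f"{anchored_fraction:.1%}")
```

The totals agree with the 154 contigs and 374.6 Mbp reported, and show that about four fifths of the estimated genome span is genetically anchored.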
Subjects
Cacao/genetics , Physical Chromosome Mapping/methods , Chromosomes, Artificial, Bacterial/genetics , Contig Mapping , Genetic Markers/genetics , Genome, Plant/genetics , Sequence Alignment , Sequence Tagged Sites
ABSTRACT
BACKGROUND: BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. RESULTS: This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. CONCLUSIONS: Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed.
Subjects
Cacao/genetics , Chromosomes, Artificial, Bacterial , Genome, Plant , Quantitative Trait Loci , Genomic Library , Sequence Alignment , Sequence Analysis, DNA
ABSTRACT
From noble beginnings as a prospective forage, polyploid Sorghum halepense ('Johnsongrass') is both an invasive species and one of the world's worst agricultural weeds. We show that S. halepense, formed by S. bicolor × S. propinquum hybridization, has S. bicolor-enriched allele composition and striking mutations in 5,957 genes that differentiate it from representatives of its progenitor species and an outgroup. The spread of S. halepense may have been facilitated by introgression from closely-related cultivated sorghum near genetic loci affecting rhizome development, seed size, and levels of lutein, a photochemical protectant and abscisic acid precursor. Rhizomes, subterranean stems that store carbohydrates and spawn clonal propagules, have growth correlated with reproductive rather than other vegetative tissues, and increase survival of both temperate cold seasons and tropical dry seasons. Rhizomes of S. halepense are more extensive than those of its rhizomatous progenitor S. propinquum, with gene expression including many alleles from its non-rhizomatous S. bicolor progenitor. S. halepense is the first surviving polyploid in its lineage in ~96 million years, and its post-Columbian spread across six continents carried rich genetic diversity that in the United States has facilitated transition from agricultural to non-agricultural niches. Projected to spread another 200-600 km northward in the coming century, S. halepense may, despite its drawbacks, offer novel alleles and traits of value to the improvement of sorghum.
ABSTRACT
Genome duplication is potentially a good source of new genes, but such genes take time to evolve. We have found a group of "duplication-resistant" genes, which have undergone convergent restoration to singleton status following several independent genome duplications. Restoration of duplication-resistant genes to singleton status could be important to long-term survival of a polyploid lineage. Angiosperms show more frequent polyploidization and a higher degree of duplicate gene preservation than other paleopolyploids, making them well-suited to further study of duplication-resistant genes.
Subjects
Arabidopsis/genetics , Gene Duplication , Oryza/genetics , Saccharomyces/genetics , Tetraodontiformes/genetics , Animals , Evolution, Molecular , Genes, Plant , Genome, Fungal , Polyploidy , Protein Structure, Tertiary
ABSTRACT
Many genes duplicated by whole-genome duplications (WGDs) are more similar to one another than expected. We investigated whether concerted evolution through conversion and crossing over, well-known to affect tandem gene clusters, also affects dispersed paralogs. Genome sequences for two Oryza subspecies reveal appreciable gene conversion in the approximately 0.4 MY since their divergence, with a gradual progression toward independent evolution of older paralogs. Since divergence from subspecies indica, approximately 8% of japonica paralogs produced 5-7 MYA on chromosomes 11 and 12 have been affected by gene conversion and several reciprocal exchanges of chromosomal segments, while approximately 70-MY-old "paleologs" resulting from a genome duplication (GD) show much less conversion. Sequence similarity analysis in proximal gene clusters also suggests more conversion between younger paralogs. About 8% of paleologs may have been converted since rice-sorghum divergence approximately 41 MYA. Domain-encoding sequences are more frequently converted than nondomain sequences, suggesting a sort of circularity--that sequences conserved by selection may be further conserved by relatively frequent conversion. The higher level of concerted evolution in the 5-7 MY-old segmental duplication may reflect the behavior of many genomes within the first few million years after duplication or polyploidization.
Subjects
Evolution, Molecular , Oryza/genetics , Crossing Over, Genetic , Gene Conversion , Gene Duplication , Genome, Plant , Models, Genetic , Oryza/classification , Plant Proteins/genetics , Polyploidy , Sorghum/classification , Sorghum/genetics , Species Specificity , Time Factors
ABSTRACT
We applied two state-of-the-art, knowledge independent data-mining methods - Dynamic Quantum Clustering (DQC) and t-Distributed Stochastic Neighbor Embedding (t-SNE) - to data from The Cancer Genome Atlas (TCGA). We showed that the RNA expression patterns for a mixture of 2,016 samples from five tumor types can sort the tumors into groups enriched for relevant annotations including tumor type, gender, tumor stage, and ethnicity. DQC feature selection analysis discovered 48 core biomarker transcripts that clustered tumors by tumor type. When these transcripts were removed, the geometry of tumor relationships changed, but it was still possible to classify the tumors using the RNA expression profiles of the remaining transcripts. We continued to remove the top biomarkers for several iterations and performed cluster analysis. Even though the most informative transcripts were removed from the cluster analysis, the sorting ability of remaining transcripts remained strong after each iteration. Further, in some iterations we detected a repeating pattern of biological function that wasn't detectable with the core biomarker transcripts present. This suggests the existence of a "background classification" potential in which the pattern of gene expression after continued removal of "biomarker" transcripts could still classify tumors in agreement with the tumor type.
Subjects
Biomarkers, Tumor/genetics , Computational Biology , Neoplasms/classification , Neoplasms/genetics , Cluster Analysis , Female , Gene Expression Profiling , Humans , Male , Neoplasm Staging , Neoplasms/pathology
ABSTRACT
[This corrects the article on p. 2008 in vol. 8, PMID: 29259608.].
ABSTRACT
Forest trees are valued sources of pulp, timber and biofuels, and serve a role in carbon sequestration, biodiversity maintenance and watershed stability. Examining the relationships among genetic, phenotypic and environmental factors for these species provides insight on the areas of concern for breeders and researchers alike. The TreeGenes database is a web-based repository that is home to 1790 tree species and over 1500 registered users. The database provides a curated archive for high-throughput genomics, including reference genomes, transcriptomes, genetic maps and variant data. These resources are paired with extensive phenotypic information and environmental layers. TreeGenes recently migrated to Tripal, an integrated and open-source database schema and content management system. This migration enabled developments focused on data exchange, data transfer and improved analytical capacity, as well as providing TreeGenes the opportunity to communicate with the following partner databases: Hardwood Genomics Web, Genome Database for Rosaceae, and the Citrus Genome Database. Recent development in TreeGenes has focused on coordinating information for georeferenced accessions, including metadata acquisition and ontological frameworks, to improve integration across studies combining genetic, phenotypic and environmental data. This focus was paired with the development of tools to enable comparative genomics and data visualization. By combining advanced data importers, relevant metadata standards and integrated analytical frameworks, TreeGenes provides a platform for researchers to store, submit and analyze forest tree data.
Subjects
Databases, Genetic , Forests , Genomics , Data Mining , Gene Ontology , Phenotype , Phylogeny , Search Engine , Software , Trees/genetics , Trees/growth & development
ABSTRACT
Cacao (Theobroma cacao L.) is an important cash crop in tropical regions around the world and has a rich agronomic history in South America. As a key component in the cosmetic and confectionery industries, cacao is used in products ranging from shampoo to chocolate by millions of people worldwide. An Illumina Infinium II array was created using 13,530 SNPs identified within a small diversity panel of cacao. Of these SNPs, 12,643 derive from variation within annotated cacao genes. The genotypes of 3,072 trees were obtained, including two mapping populations from Ecuador. High-density linkage maps for these two populations were generated and compared to the cacao genome assembly. Phenotypic data from these populations were combined with the linkage maps to identify QTLs for yield and disease resistance.
ABSTRACT
High-throughput DNA sequencing technology has revolutionized the study of gene expression while introducing significant computational challenges for biologists. These computational challenges include access to sufficient computer hardware and functional data processing workflows. Both these challenges are addressed with our scalable, open-source Pegasus workflow for processing high-throughput DNA sequence datasets into a gene expression matrix (GEM) using computational resources available to U.S.-based researchers on the Open Science Grid (OSG). We describe the usage of the workflow (OSG-GEM), discuss workflow design, inspect performance data, and assess accuracy in mapping paired-end sequencing reads to a reference genome. A target OSG-GEM user is proficient with the Linux command line and possesses basic bioinformatics experience. The user may run this workflow directly on the OSG or adapt it to novel computing environments.
ABSTRACT
In the last decade, high-throughput DNA sequencing has become a disruptive technology and pushed the life sciences into a distributed ecosystem of sequence data producers and consumers. Given the power of genomics and declining sequencing costs, biology is an emerging "Big Data" discipline that will soon enter the exabyte data range when all subdisciplines are combined. These datasets must be transferred across commercial and research networks in creative ways since sending data without thought can have serious consequences on data processing time frames. Thus, it is imperative that biologists, bioinformaticians, and information technology engineers recalibrate data processing paradigms to fit this emerging reality. This review attempts to provide a snapshot of Big Data transfer across networks, which is often overlooked by many biologists. Specifically, we discuss four key areas: 1) data transfer networks, protocols, and applications; 2) data transfer security including encryption, access, firewalls, and the Science DMZ; 3) data flow control with software-defined networking; and 4) data storage, staging, archiving and access. A primary intention of this article is to orient the biologist in key aspects of the data transfer process in order to frame their genomics-oriented needs to enterprise IT professionals.
ABSTRACT
Tripal is an open-source freely available toolkit for construction of online genomic and genetic databases. It aims to facilitate development of community-driven biological websites by integrating the GMOD Chado database schema with Drupal, a popular website creation and content management software. Tripal provides a suite of tools for interaction with a Chado database and display of content therein. The tools are designed to be generic to support the various ways in which data may be stored in Chado. Previous releases of Tripal have supported organisms, genomic libraries, biological stocks, stock collections and genomic features, their alignments and annotations. Also, Tripal and its extension modules provided loaders for commonly used file formats such as FASTA, GFF, OBO, GAF, BLAST XML, KEGG heir files and InterProScan XML. Default generic templates were provided for common views of biological data, which could be customized using an open Application Programming Interface to change the way data are displayed. Here, we report additional tools and functionality that are part of release v1.1 of Tripal. These include (i) a new bulk loader that allows a site curator to import data stored in a custom tab delimited format; (ii) full support of every Chado table for Drupal Views (a powerful tool allowing site developers to construct novel displays and search pages); (iii) new modules including 'Feature Map', 'Genetic', 'Publication', 'Project', 'Contact' and the 'Natural Diversity' modules. Tutorials, mailing lists, download and set-up instructions, extension modules and other documentation can be found at the Tripal website located at http://tripal.info. DATABASE URL: http://tripal.info/.
Subjects
Databases, Genetic , Genome/genetics , Genomics/methods , Genomics/standards , Internet , Software , Genetic Variation , Genotype , Plants/genetics , Publications , Reference Standards , Seeds/genetics , User-Computer Interface
ABSTRACT
The study of gene relationships and their effect on biological function and phenotype is a focal point in systems biology. Gene co-expression networks built using microarray expression profiles are one technique for discovering and interpreting gene relationships. A knowledge-independent thresholding technique, such as Random Matrix Theory (RMT), is useful for identifying meaningful relationships. Highly connected genes in the thresholded network are then grouped into modules that provide insight into their collective functionality. While it has been shown that co-expression networks are biologically relevant, it has not been determined to what extent any given network is functionally robust given perturbations in the input sample set. For such a test, hundreds of networks are needed and hence a tool to rapidly construct these networks. To examine functional robustness of networks with varying input, we enhanced an existing RMT implementation for improved scalability and tested functional robustness of human (Homo sapiens), rice (Oryza sativa) and budding yeast (Saccharomyces cerevisiae). We demonstrate dramatic decrease in network construction time and computational requirements and show that despite some variation in global properties between networks, functional similarity remains high. Moreover, the biological function captured by co-expression networks thresholded by RMT is highly robust.
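The thresholding and module-grouping steps can be illustrated with a small sketch. It uses a fixed Pearson correlation cutoff of 0.8 as a stand-in for the threshold that an RMT analysis would select, and treats connected components as modules; both simplifications, the function names, and the toy expression profiles are assumptions for illustration and do not reproduce the implementation described in the abstract.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expr, threshold):
    """Keep gene pairs whose absolute correlation meets the threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if abs(pearson(expr[g1], expr[g2])) >= threshold
    ]


def modules(genes, edges):
    """Group genes into connected components of the thresholded network."""
    neighbors = {g: set() for g in genes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, found = set(), []
    for start in sorted(genes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            g = stack.pop()
            if g in component:
                continue
            component.add(g)
            stack.extend(neighbors[g] - component)
        seen |= component
        found.append(sorted(component))
    return found


# Hypothetical expression profiles: five genes measured in five samples.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 3, 4, 2, 1],
    "g4": [2, 1, 2, 1, 2],
    "g5": [1, 2, 1, 2, 1],
}
edges = coexpression_edges(expr, threshold=0.8)
print(edges)
print(modules(expr, edges))
```

Rebuilding the network on perturbed sample subsets and comparing the resulting modules is one way to probe the robustness question the abstract raises; a gene with no retained edges appears here as a module of size one.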
Subjects
Gene Expression Profiling , Gene Expression , Gene Regulatory Networks , Models, Theoretical , Algorithms , Humans , Oryza/genetics , Saccharomyces cerevisiae/genetics
ABSTRACT
BACKGROUND: Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. RESULTS: We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than that of a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 of the Criollo assembly, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. CONCLUSIONS: We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.
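For readers unfamiliar with the N50 statistic used to compare the assemblies, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are illustrative and are not taken from the cacao assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the total span
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths (arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because N50 depends on the total span as well as on the length distribution, comparisons such as the one in the abstract are most informative when the assemblies being compared cover similar amounts of sequence.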