ABSTRACT
Pod quality and yield traits in snap bean (Phaseolus vulgaris L.) influence consumer preferences, crop adoption by farmers, and the commercial competitiveness of the product in local and global markets. The objective of the study was to identify quantitative trait loci (QTL) for pod quality and yield traits in a snap × dry bean recombinant inbred line (RIL) population. A total of 184 F6 RILs derived from a cross between Vanilla (snap bean) and MCM5001 (dry bean) were grown in three field sites in Kenya and one greenhouse environment in Davis, CA, USA. They were genotyped at 5,951 single nucleotide polymorphisms (SNPs), and composite interval mapping was conducted to identify QTL for 16 pod quality and yield traits, including pod wall fiber, pod string, pod size, and harvest metrics. A combined total of 44 QTL were identified in field and greenhouse trials. QTL for pod quality were identified on chromosomes Pv01, Pv02, Pv03, Pv04, Pv06, and Pv07, and QTL for pod yield on Pv08. Co-localization of QTL was observed for pod quality and yield traits. Some of the identified QTL overlapped with previously mapped QTL for pod quality and yield traits, and several others were novel. The identified QTL can be used in future marker-assisted selection in snap bean.
ABSTRACT
The UV resistance of bacterial endospores is an important quality supporting their survival in inhospitable environments and therefore constitutes an essential driver of the ecological success of spore-forming bacteria. Nevertheless, the variability and evolvability of this trait are poorly understood. In this study, directed evolution and genetics approaches revealed that the Bacillus cereus pdaA gene (encoding the endospore-specific peptidoglycan-N-acetylmuramic acid deacetylase) serves as a contingency locus in which the expansion and contraction of short tandem repeats can readily compromise (PdaAOFF) or restore (PdaAON) the pdaA open reading frame. Compared with B. cereus populations in the PdaAON state, populations in the PdaAOFF state produced a lower yield of viable endospores but endowed them with vastly increased UV resistance. Moreover, selection pressures based on either quantity (i.e., yield of viable endospores) or quality (i.e., UV resistance of viable endospores) aspects could readily shift populations between PdaAON and PdaAOFF states, respectively. Bioinformatic analysis also revealed that pdaA homologs within the Bacillus and Clostridium genera are often equipped with several short tandem repeat regions, suggesting a wider implementation of the pdaA-mediated phase variability in other sporeformers as well. These results for the first time reveal (1) pdaA as a phase-variable contingency locus in the adaptive evolution of endospore properties and (2) bet-hedging between what appears to be a quantity versus quality trade-off in endospore crops.
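The ON/OFF switching described above comes down to reading-frame arithmetic: a change in repeat copy number leaves the open reading frame intact only when the number of bases gained or lost is a multiple of three. The following Python sketch illustrates this under stated assumptions. The function name `orf_state` and the 5-bp repeat unit are hypothetical, since the abstract does not give the repeat unit size in pdaA, and an in-frame result does not by itself guarantee a functional protein.

```python
def orf_state(unit_len: int, copy_change: int) -> str:
    """Classify the reading frame after a tandem-repeat expansion or contraction.

    unit_len    -- length in bp of one repeat unit (hypothetical value)
    copy_change -- repeat copies gained (+) or lost (-) relative to the intact ORF
    """
    shifted_bases = unit_len * copy_change
    # The frame is preserved only if the indel length is a multiple of 3.
    return "ON" if shifted_bases % 3 == 0 else "OFF"


# Illustrative 5-bp unit: gaining or losing one copy shifts the frame,
# while gaining three copies (15 bp) keeps it.
for change in (0, 1, -1, 3):
    print(change, orf_state(5, change))
```

With a repeat unit whose length is not a multiple of three, most single-copy slippage events toggle the state, which is consistent with the readily reversible behaviour the abstract reports.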
Subjects
Bacillus cereus, Bacterial Spores, Bacterial Spores/genetics, Bacillus cereus/genetics, Biological Evolution, Bacterial Proteins/genetics, Bacterial Proteins/metabolism, Molecular Evolution, Ultraviolet Rays
ABSTRACT
The Terminal Fusarium Clade (TFC) is a group in the Nectriaceae family with agricultural and clinical relevance. In recent years, various phylogenies with conflicting topologies have been presented in the literature, but only a few studies have analyzed the divergence time scale of the group. The evolutionary history of this group therefore remains unresolved. This study aimed to understand the evolutionary history of the TFC from a phylogenomic perspective. To achieve this objective, we performed a phylogenomic analysis using the genomes available in GenBank and ran eight different pipelines. We present a new robust topology of the TFC that differs at some nodes from previous studies. These new relationships allowed us to formulate new hypotheses about the evolutionary history of the TFC. We also inferred new divergence time estimates, which differ from those of previous studies because of differences in topology and taxon sampling. The results suggest an important diversification process in the Neogene period, likely associated with the diversification of angiosperms and their rise to predominance in terrestrial ecosystems. In conclusion, we present a robust time-scaled phylogeny that allowed us to formulate new hypotheses regarding the evolutionary history of the TFC.
ABSTRACT
BACKGROUND: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed. FINDINGS: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulated and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths. CONCLUSION: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
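The clustering step described above can be sketched with a minimal pure-Python DBSCAN. This is an illustration under stated assumptions, not the platform's implementation. Each signature is reduced to a hypothetical `(position, length)` point, whereas the abstract only says that coordinates are calculated from lengths and genomic positions, and the `eps` and `min_pts` values are chosen for the toy data.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical deletion signatures as (start position, length) in bp.
signatures = [(1000, 300), (1005, 310), (998, 295), (1002, 900), (50000, 120)]
print(dbscan(signatures, eps=30, min_pts=2))
```

In this toy input the first three signatures agree in both position and length and form one cluster, while the signature with a very different length at the same locus and the distant signature are left as noise. Using length as a coordinate is what lets nearby but distinct events stay separate.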
Subjects
Algorithms, Benchmarking, Bayes Theorem, Genotype, Cluster Analysis
ABSTRACT
Chagas disease is endemic in tropical regions of Latin America and is caused by the parasite Trypanosoma cruzi. High intraspecies variability and genome complexity have made it challenging to assemble the high-quality genomes needed for studies in evolution, population genomics, diagnosis and drug development. Here we present a chromosome-level phased assembly of a TcI T. cruzi strain (Dm25). While 29 chromosomes show a large collinearity with the assembly of the Brazil A4 strain, three chromosomes show both large heterozygosity and large divergence, compared to previous assemblies of TcI T. cruzi strains. Nucleotide and protein evolution statistics indicate that T. cruzi marinkellei separated before the diversification of T. cruzi into the known DTUs. Interchromosomal paralogs of dispersed gene families and histones appeared earlier but are at the same time under stricter purifying selection, compared to other repeat families. Previously unreported large tandem arrays of protein kinases and histones were identified in this assembly. Over one million variants obtained from Illumina reads aligned to the primary assembly clearly separate the main DTUs. We expect that this new assembly will be a valuable resource for further studies on evolution and functional genomics of trypanosomatids.
Subjects
Chagas Disease, Trypanosoma cruzi, Humans, Trypanosoma cruzi/genetics, Colombia, Histones, Brazil
ABSTRACT
The domestication process in lima bean (Phaseolus lunatus L.) involved two independent events, within the Mesoamerican and Andean gene pools. This makes lima bean an excellent model to understand convergent evolution. The mechanisms of adaptation followed by Mesoamerican and Andean landraces are largely unknown. Genes related to these adaptations can be selected by identification of selective sweeps within gene pools. Previous genetic analyses in lima bean have relied on Single Nucleotide Polymorphism (SNP) loci and have ignored transposable elements (TEs). Here we analyze whole-genome sequencing data from 61 lima bean accessions to characterize a genomic variation database including TEs and SNPs, to associate selective sweeps with variable TEs and to predict candidate domestication genes. A small percentage of genes under selection are shared between gene pools, suggesting that domestication followed different genetic avenues in the two gene pools. About 75% of TEs are located close to genes, which shows their potential to affect gene functions. The genetic structure inferred from variable TEs is consistent with that obtained from SNP markers, suggesting that TE dynamics can be related to the demographic history of wild and domesticated lima bean and its adaptive processes, in particular selection processes during domestication.
Subjects
Phaseolus, Phaseolus/genetics, DNA Transposable Elements/genetics, Single Nucleotide Polymorphism, Population Dynamics
ABSTRACT
Premise: Transposable elements (TEs) make up more than half of the genomes of complex plant species and can modulate the expression of neighboring genes, producing significant variability of agronomically relevant traits. The availability of long-read sequencing technologies allows the building of genome assemblies for plant species with large and complex genomes. Unfortunately, TE annotation currently represents a bottleneck in the annotation of genome assemblies. Methods and Results: We present a new functionality of the Next-Generation Sequencing Experience Platform (NGSEP) to perform efficient homology-based TE annotation. Sequences in a reference library are treated as long reads and mapped to an input genome assembly. A hierarchical annotation is then assigned by homology using the annotation of the reference library. We tested the performance of our algorithm on genome assemblies of different plant species, including Arabidopsis thaliana, Oryza sativa, Coffea humblotiana, and Triticum aestivum (bread wheat). Our algorithm outperforms traditional homology-based annotation tools in speed by a factor of three to >20, reducing the annotation time of the T. aestivum genome from months to hours, and recovering up to 80% of TEs annotated with RepeatMasker with a precision of up to 0.95. Conclusions: NGSEP allows rapid analysis of TEs, especially in very large and TE-rich plant genomes.
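The annotation-by-homology step described above can be sketched as follows: once library sequences have been mapped to the assembly, each mapped interval inherits the hierarchical label of its library sequence. The sketch below is written under stated assumptions. The `Hit` record, the library identifiers, the label format and the rule of keeping the best-scoring hit among overlapping ones are all hypothetical, because the abstract does not describe how competing hits are resolved.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str
    start: int
    end: int
    library_id: str
    score: float


# Hypothetical reference library: sequence id -> class/order/superfamily label.
LIBRARY = {
    "TE_001": "ClassI/LTR/Copia",
    "TE_002": "ClassI/LTR/Gypsy",
    "TE_003": "ClassII/TIR/hAT",
}


def annotate(hits, library):
    """Attach library labels to hits, keeping the best-scoring hit per overlap."""
    annotated = []
    for hit in sorted(hits, key=lambda h: (h.contig, h.start, -h.score)):
        last = annotated[-1][0] if annotated else None
        overlaps = (
            last is not None and last.contig == hit.contig and hit.start < last.end
        )
        if overlaps:
            if hit.score > last.score:
                annotated[-1] = (hit, library[hit.library_id])
            continue
        annotated.append((hit, library[hit.library_id]))
    return annotated


hits = [
    Hit("chr1", 100, 600, "TE_001", 90.0),
    Hit("chr1", 550, 900, "TE_002", 95.0),
    Hit("chr1", 2000, 2400, "TE_003", 80.0),
]
for hit, label in annotate(hits, LIBRARY):
    print(hit.contig, hit.start, hit.end, label)
```

Because the labels are hierarchical strings, an annotation can be summarized at any level (class, order or superfamily) by splitting on the separator.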
ABSTRACT
BACKGROUND: Danger-associated molecular patterns (DAMPs) may be implicated in the pathophysiological pathways associated with an unfavorable outcome after acute brain injury (ABI). METHODS: We collected samples of ventricular cerebrospinal fluid (vCSF) for 5 days in 50 consecutive patients at risk of intracranial hypertension after traumatic and nontraumatic ABI. Differences in vCSF protein expression over time were evaluated using linear models and selected for functional network analysis using the PANTHER and STRING databases. The primary exposure of interest was the type of brain injury (traumatic vs. nontraumatic), and the primary outcome was the vCSF expression of DAMPs. Secondary exposures of interest included the occurrence of intracranial pressure ≥ 20 or ≥ 30 mm Hg during the 5 days post-ABI, intensive care unit (ICU) mortality, and neurological outcome (assessed using the Glasgow Outcome Score) at 3 months post-ICU discharge. Secondary outcomes included associations of these exposures with the vCSF expression of DAMPs. RESULTS: A network of 6 DAMPs (DAMP_trauma; protein-protein interaction [PPI] P = 0.04) was differentially expressed in patients with ABI of traumatic origin compared with those with nontraumatic ABI. ABI patients with intracranial pressure ≥ 30 mm Hg differentially expressed a set of 38 DAMPs (DAMP_ICP30; PPI P < 0.001). Proteins in DAMP_ICP30 are involved in cellular proteolysis, complement pathway activation, and post-translational modifications. There were no relationships between DAMP expression and ICU mortality or unfavorable versus favorable outcomes. CONCLUSIONS: Specific patterns of vCSF DAMP expression differentiated between traumatic and nontraumatic types of ABI and were associated with increased episodes of severe intracranial hypertension.
ABSTRACT
The global chocolate market has grown during the last decade and is expected to reach a value of USD 200 billion by 2028. Chocolate is obtained from different varieties of Theobroma cacao L., a plant domesticated more than 4000 years ago in the Amazon rainforest. However, chocolate production is a complex process requiring extensive post-harvest processing, mainly cocoa bean fermentation, drying, and roasting. These steps have a critical impact on chocolate quality. Standardizing and better understanding cocoa processing is, therefore, a current challenge to boost the global production of high-quality cocoa. This knowledge can also help cocoa producers improve cocoa processing management and obtain better chocolate. Several recent studies have been conducted to dissect cocoa processing via omics analysis, and a vast amount of data has been produced by omics studies of cocoa processing performed worldwide. This review systematically analyzes the current data on cocoa omics using data mining techniques and discusses opportunities and gaps for cocoa processing standardization based on these data. First, we observed recurrent reports in metagenomics studies of species of the fungal genera Candida and Pichia as well as bacteria from the genera Lactobacillus, Acetobacter, and Bacillus. Second, our analyses of the available metabolomics data showed clear differences in the metabolites identified in cocoa and chocolate of different geographical origins, cocoa types, and processing stages. Finally, our analysis of peptidomics data revealed characteristic patterns in the gathered data, including higher diversity and a lower size distribution of peptides in fine-flavor cocoa. In addition, we discuss the current challenges in cocoa omics research. More research is still required to fill gaps in matters central to chocolate production, such as starter cultures for cocoa fermentation, flavor evolution of cocoa, and the role of peptides in the development of specific flavor notes. We also offer the most comprehensive collection of multi-omics data in cocoa processing gathered from different research articles.
Subjects
Bacillus, Cacao, Chocolate, Food, Candida
ABSTRACT
Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
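The minimizer selection mentioned above can be illustrated with a small sketch. It assumes, hypothetically, that k-mers are ranked by their count across the reads (rarer first, ties broken lexicographically). The abstract only states that the hash function is derived from the k-mer distribution, so the ranking rule, the function names and the toy reads are illustrative.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w, counts):
    """Return sorted (position, k-mer) minimizers of seq.

    In every window of w consecutive k-mers, the k-mer with the lowest
    rank is selected; rank is (count, k-mer, position), an assumed ordering.
    """
    ks = kmers(seq, k)
    chosen = set()
    for start in range(len(ks) - w + 1):
        best = min(
            range(start, start + w),
            key=lambda i: (counts[ks[i]], ks[i], i),
        )
        chosen.add((best, ks[best]))
    return sorted(chosen)


# Two toy reads that overlap by ten bases.
reads = ["ACGTTGCATGAC", "GTTGCATGACCA"]
counts = Counter(km for read in reads for km in kmers(read, 4))
for read in reads:
    print(minimizers(read, 4, 3, counts))
```

Reads that overlap select some of the same k-mers as minimizers, so candidate overlaps (edges of the assembly graph) can be found by comparing a small sample of each read instead of every k-mer.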
Subjects
Algorithms, High-Throughput Nucleotide Sequencing, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Genome, Software
ABSTRACT
The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis for generating phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. Using these techniques as background, the chapter reviews the current algorithms to generate phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. This allows large presence-absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with improved sequencing technologies, has been crucial for finishing chromosome-level assemblies of complex genomes.
Subjects
Algorithms, Computational Biology, DNA Sequence Analysis/methods, Haplotypes, Alleles, High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Whole-genome alignment allows researchers to understand the genomic structure and variation among genomes. Approaches based on direct pairwise comparisons of DNA sequences require large computational capacities. As a consequence, pipelines combining tools for orthologous gene identification and synteny have been developed. In this manuscript, we present the latest functionalities implemented in NGSEP 4 to identify orthogroups and perform whole-genome alignments. NGSEP implements functionalities for identification of clusters of homologous genes, synteny analysis and whole-genome alignment. Our results showed that the NGSEP algorithm for orthogroup identification has competitive accuracy and efficiency in comparison to commonly used tools. The implementation also includes a visualization of the whole-genome alignment based on synteny of the identified orthogroups, and a reconstruction of the pangenome based on the frequencies of the orthogroups among the genomes. NGSEP 4 also includes a new graphical user interface based on the JavaFX technology. We expect that these new developments will be very useful for several studies in evolutionary biology and population genomics.
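The pangenome reconstruction based on orthogroup frequencies can be illustrated with a short sketch that labels each orthogroup by the fraction of genomes in which it occurs. The category names, the `core_fraction` threshold, the function name and the toy data are assumptions made for illustration, since the abstract does not specify how frequencies are turned into categories.

```python
def classify_orthogroups(presence, n_genomes, core_fraction=1.0):
    """Label orthogroups by how many genomes carry at least one member.

    presence      -- orthogroup id -> set of genome names
    n_genomes     -- total number of genomes compared
    core_fraction -- minimum frequency to call an orthogroup "core" (assumed)
    """
    classes = {}
    for orthogroup, genomes in presence.items():
        frequency = len(genomes) / n_genomes
        if frequency >= core_fraction:
            classes[orthogroup] = "core"
        elif len(genomes) == 1:
            classes[orthogroup] = "unique"
        else:
            classes[orthogroup] = "accessory"
    return classes


# Hypothetical presence table for three genomes.
presence = {
    "OG1": {"genomeA", "genomeB", "genomeC"},
    "OG2": {"genomeA", "genomeC"},
    "OG3": {"genomeB"},
}
print(classify_orthogroups(presence, n_genomes=3))
```

Lowering `core_fraction` (for example to 0.95) gives a "soft core" definition, which is often preferred when some assemblies are incomplete.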
Subjects
Genome, Software, Genomics/methods, Algorithms, Metagenomics
ABSTRACT
The impact of cocoa lipid content on chocolate quality has been extensively described. Nevertheless, few studies have elucidated the cocoa lipid composition and its bioactive properties, and those have focused only on specific lipids. In the present study, the lipidome of fine-flavor cocoa fermentation was analyzed using LC-MS-QTOF, and a Machine Learning model to assess potential bioactivity was developed. Our results revealed that the cocoa lipidome, composed mainly of fatty acyls and glycerophospholipids, remains stable during fine-flavor cocoa fermentations. Several Machine Learning algorithms were also trained to explore potential biological activity among the identified lipids. We found that K-Nearest Neighbors had the best performance. This model was used to classify the identified lipids as bioactive or non-bioactive, nominating 28 molecules as potential bioactive lipids. None of these compounds has been previously reported as bioactive. Our work is the first untargeted lipidomic study and systematic effort to investigate potential bioactivity in fine-flavor cocoa lipids.
Subjects
Cacao, Chocolate, Fermentation, Lipidomics, Lipids, Taste
ABSTRACT
Fruit development has been central in the evolution and domestication of flowering plants. In common bean (Phaseolus vulgaris), the principal global grain legume staple, two main production categories are distinguished by fibre deposition in pods: dry beans, with fibrous, stringy pods; and stringless snap/green beans, with reduced fibre deposition, which frequently revert to the ancestral stringy state. Here, we identify genetic and developmental patterns associated with pod fibre deposition. Transcriptional, anatomical, epigenetic and genetic regulation of pod strings were explored through RNA-seq, RT-qPCR, fluorescence microscopy, bisulfite sequencing and whole-genome sequencing. Overexpression of the INDEHISCENT ('PvIND') orthologue was observed in stringless types compared with isogenic stringy lines, associated with overspecification of weak dehiscence-zone cells throughout the pod vascular sheath. No differences in DNA methylation were correlated with this phenotype. Nonstringy varieties showed a direct tandem duplication of PvIND and a Ty1-copia retrotransposon inserted between the two repeats. These sequence features are lost during pod reversion and are predictive of pod phenotype in diverse materials, supporting their role in PvIND overexpression and the reversible string phenotype. Our results give insight into reversible gain-of-function mutations and possible genetic solutions to the reversion problem, which is of considerable economic value for green bean production.
Subjects
Phaseolus, Domestication, Gene Duplication, Phaseolus/genetics, Phenotype, Retroelements/genetics
ABSTRACT
BACKGROUND: Quantitative analysis of ventricular cerebrospinal fluid (vCSF) proteins following acute brain injury (ABI) may help identify pathophysiological pathways and potential biomarkers that can predict unfavorable outcome. METHODS: In this prospective proteomic analysis study, consecutive patients with severe ABI expected to require intraventricular catheterization for intracranial pressure (ICP) monitoring for at least 5 days and patients without ABI admitted for elective clipping of an unruptured cerebral aneurysm were included. vCSF samples were collected within the first 24 h after ABI and ventriculostomy insertion and then every 24 h for 5 days. In patients without ABI, a single vCSF sample was collected at the time of elective clipping. Data-independent acquisition and sequential window acquisition of all theoretical spectra (SWATH) mass spectrometry were used to compare differences in protein expression in patients with ABI and patients without ABI and in patients with traumatic and nontraumatic ABI. Differences in protein expression according to different ICP values, intensive care unit outcome, subarachnoid hemorrhage (SAH) versus traumatic brain injury (TBI), and good versus poor 3-month functional status (assessed by using the Glasgow Outcome Scale) were also evaluated. vCSF proteins with significant differences between groups were compared by using linear models and selected for gene ontology analysis using R Language and the Panther database. RESULTS: We included 50 patients with ABI (SAH n = 23, TBI n = 15, intracranial hemorrhage n = 6, ischemic stroke n = 3, others n = 3) and 12 patients without ABI. There were significant differences in the expression of 255 proteins between patients with and without ABI (p < 0.01). There were intraday and interday differences in expression of seven proteins related to increased inflammation, apoptosis, oxidative stress, and cellular response to hypoxia and injury. Among these, glial fibrillary acidic protein expression was higher in patients with ABI with severe intracranial hypertension (ICH) (ICP ≥ 30 mm Hg) or death compared to those without (log2 fold change: +2.4; p < 0.001), suggesting extensive primary astroglial injury or death. There were differences in the expression of 96 proteins between patients with traumatic and nontraumatic ABI (p < 0.05); intraday and interday differences were observed for six proteins related to structural damage, complement activation, and cholesterol metabolism. Thirty-nine vCSF proteins were associated with an increased risk of severe ICH (ICP ≥ 30 mm Hg) in patients with traumatic compared with nontraumatic ABI (p < 0.05). No significant differences were found in protein expression between patients with SAH versus TBI or between those with good versus poor 3-month Glasgow Outcome Scale score. CONCLUSIONS: Dysregulated vCSF protein expression after ABI may be associated with an increased risk of severe ICH and death.
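As a reading aid for the effect size reported above, a log2 fold change converts to a linear ratio by raising 2 to that power. The short Python sketch below does the conversion for the reported value of +2.4. The function names are illustrative, and the abstract does not describe how the underlying group means were computed.

```python
import math


def fold_change(log2_fc: float) -> float:
    """Convert a log2 fold change to a linear expression ratio."""
    return 2.0 ** log2_fc


def log2_fold_change(case_mean: float, control_mean: float) -> float:
    """Log2 ratio of two positive mean expression values."""
    return math.log2(case_mean / control_mean)


print(round(fold_change(2.4), 2))  # 5.28
```

A log2 fold change of +2.4 therefore corresponds to roughly a 5.3-fold higher expression in the first group than in the second.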
Subjects
Traumatic Brain Injuries, Brain Injuries, Intracranial Hypertension, Subarachnoid Hemorrhage, Biomarkers, Cholesterol, Glial Fibrillary Acidic Protein, Humans, Intracranial Hypertension/etiology, Intracranial Pressure/physiology, Prospective Studies, Proteomics, Subarachnoid Hemorrhage/complications
ABSTRACT
The growing use of next-generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of variants of uncertain significance (VUS). In this manuscript, we compare three machine learning methods to classify VUS as pathogenic or non-pathogenic: a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron. To train the models, we extracted high-quality variants from ClinVar that were previously classified as VUS. For each variant, we retrieved nine conservation scores, the loss-of-function tool, and allele frequencies. For the RF and SVM models, hyperparameters were tuned using cross-validation with a grid search. The three models were tested on a nonoverlapping set of variants that had been classified as VUS over the last 3 years, but had been reclassified in August 2020. The three models yielded superior accuracy on this set compared to the benchmarked tools. The RF-based model yielded the best performance across different variant types and was used to create VusPrize, an open-source software tool for prioritization of VUS. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
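The tuning procedure named above, a grid search scored by cross-validation, can be sketched in plain Python. This is a generic illustration rather than the authors' pipeline. The `fit` and `predict` callables, the single-score "model" and the toy data are hypothetical stand-ins, and in practice `fit` would train an RF or SVM on the variant features.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds (deterministic, no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def grid_search_cv(fit, predict, X, y, grid, k=5):
    """Return (best_params, best_mean_accuracy) over every combination in grid."""
    folds = k_fold_indices(len(X), k)
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            correct = sum(predict(model, X[i]) == y[i] for i in held_out)
            scores.append(correct / len(held_out))
        if mean(scores) > best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score


# Stand-in "model": call a variant pathogenic when its score exceeds a threshold.
def fit(X_train, y_train, threshold):
    return threshold  # nothing is learned here; the threshold is the hyperparameter


def predict(model, x):
    return x > model


X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.35, 0.65]  # toy scores
y = [x > 0.5 for x in X]                                   # toy labels
print(grid_search_cv(fit, predict, X, y, {"threshold": [0.2, 0.5, 0.8]}))
```

Each hyperparameter combination is scored only on folds held out from training, which is what keeps the selected setting from simply memorizing the training variants. The final evaluation in the abstract uses a separate, nonoverlapping set for the same reason.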
Subjects
High-Throughput Nucleotide Sequencing, Machine Learning, High-Throughput Nucleotide Sequencing/methods, Humans, Computer Neural Networks, Software, Support Vector Machine
ABSTRACT
Genotyping-by-sequencing (GBS) is a widely used and cost-effective technique for obtaining large numbers of genetic markers from populations by sequencing regions adjacent to restriction cut sites. Although a standard reference-based pipeline can be followed to analyse GBS reads, a reference genome is still not available for a large number of species. Hence, reference-free approaches are required to generate the genetic variability information that can be obtained from a GBS experiment. Unfortunately, available tools to perform de novo analysis of GBS reads face issues of usability, accuracy and performance. Furthermore, few available tools are suitable for analysing data sets from polyploid species. In this manuscript, we describe a novel algorithm to perform reference-free variant detection and genotyping from GBS reads. Nonexact searches on a dynamic hash table of consensus sequences allow for efficient read clustering and sorting. This algorithm was integrated into the Next Generation Sequencing Experience Platform (NGSEP) so that it can use the state-of-the-art variant detector already implemented in this tool. We performed benchmark experiments with three different empirical data sets of plants and animals with different population structures and ploidies, sequenced with different GBS protocols at different read depths. These experiments show that NGSEP has comparable and in some cases better accuracy, and always better computational efficiency, compared to existing solutions. We expect that this new development will be useful for many research groups conducting population genetic studies in a wide variety of species.
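The idea of clustering reads through inexact searches on a hash table of consensus sequences can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm. Reads are assumed to have equal length, the table is keyed by an exact read prefix, the first read of each cluster serves as its consensus, and the names `cluster_reads`, `prefix_len` and `max_mismatches` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, prefix_len=6, max_mismatches=2):
    """Greedy read clustering with a hash table of cluster representatives."""
    table = {}      # read prefix -> ids of clusters whose consensus starts with it
    consensus = []  # cluster id -> representative sequence (first read seen)
    members = []    # cluster id -> list of reads assigned to the cluster
    for read in reads:
        key = read[:prefix_len]
        for cluster_id in table.get(key, []):
            # Inexact match: tolerate a few mismatches against the consensus.
            if hamming(read, consensus[cluster_id]) <= max_mismatches:
                members[cluster_id].append(read)
                break
        else:
            # No compatible cluster found: the table grows dynamically.
            cluster_id = len(consensus)
            consensus.append(read)
            members.append([read])
            table.setdefault(key, []).append(cluster_id)
    return members


reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACTTAC", "TTGTACGTAC", "ACGTACCCCC"]
for cluster in cluster_reads(reads):
    print(cluster)
```

The hash lookup limits the expensive sequence comparison to a handful of candidate clusters per read. One limitation of this simplification is that a mismatch inside the prefix sends a read to a new cluster, which a production implementation would need to handle.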
Subjects
Diploidy, Polyploidy, Genomics, Genotype, Humans, Software
ABSTRACT
Solanum betaceum is a tree from the Andean region bearing edible fruits, considered an exotic export. Although there has been renewed interest in its commercialization, sustainability and disease management have been limiting factors. Phytophthora betacei is a recently described species that causes late blight in S. betaceum. There is no general study of the response of S. betaceum to this pathogen, particularly of the changes in expression of pathogenesis-related genes. In this manuscript we present a comprehensive RNA-seq time-series study of the plant response to infection by P. betacei. Following six time points of infection, the differentially expressed genes (DEGs) involved in the plant's defense were contextualized in a sequential manner. We documented 5,628 DEGs across all time points. From 6 to 24 h post-inoculation, we highlighted DEGs involved in the recognition of the pathogen by the likely activation of pattern-triggered immunity (PTI) genes. We also describe the possible effect of the pathogen effectors on the host during the effector-triggered response. Finally, we reveal genes related to the susceptible outcome of the interaction caused by the onset of necrotrophy and the sharp transcriptional changes as a response to the pathogen. This is the first report of the transcriptome of the tree tomato in response to the newly described pathogen P. betacei.
ABSTRACT
Recent developments in High Throughput Sequencing (HTS) technologies and bioinformatics, including improved read lengths and genome assemblers, allow the reconstruction of complex genomes with unprecedented quality and contiguity. Sugarcane has one of the most complicated genomes among grasses, with a haploid length of 1 Gbp and a ploidy between 8 and 12. In this work, we present a genome assembly of the Colombian sugarcane hybrid CC 01-1940. Three types of sequencing technologies were combined for this assembly: PacBio long reads, Illumina paired short reads, and Hi-C reads. We achieved a median contig length of 34.94 Mbp and a total genome assembly of 903.2 Mbp. We annotated a total of 63,724 protein-coding genes and performed a reconstruction and comparative analysis of the sucrose metabolism pathway. Nucleotide evolution measurements between orthologs with close species suggest that divergence between Saccharum officinarum and Saccharum spontaneum occurred <2 million years ago. Synteny analysis between CC 01-1940 and the S. spontaneum genome confirms the presence of translocation events between the species and a random contribution throughout the entire genome in current sugarcane hybrids. Analysis of RNA-Seq data from leaf and root tissue of contrasting sugarcane genotypes subjected to water stress treatments revealed 17,490 differentially expressed genes, of which 3,633 are expressed exclusively in tolerant genotypes. We expect the resources presented here to serve as a source of information to improve the selection of new varieties in sugarcane breeding programs.