RESUMEN
In the present work, we performed a comparative genome-wide analysis of 22 species representative of the main clades and lifestyles of the phylum Platyhelminthes. We selected a set of 700 orthologous genes conserved in all species, measuring changes in GC content, codon, and amino acid usage in orthologous positions. Values of 3rd codon position GC spanned over a wide range, allowing to discriminate two distinctive clusters within freshwater turbellarians, Cestodes and Trematodes respectively. Furthermore, a hierarchical clustering of codon usage data differs remarkably from the phylogenetic tree. Additionally, we detected a synonymous codon usage bias that was more dramatic in extreme GC-poor or GC-rich genomes, i.e., GC-poor Schistosomes preferred to use AT-rich terminated synonymous codons, while GC-rich M. lignano showed the opposite behavior. Interestingly, these biases impacted the amino acidic usage, with preferred amino acids encoded by codons following the GC content trend. These are associated with non-synonymous substitutions at orthologous positions. The detailed analysis of the synonymous and non-synonymous changes provides evidence for a two-hit mechanism where both mutation and selection forces drive the diverse coding strategies of flatworms.
RESUMEN
Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (the abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require high degrees of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated with T. cruzi's genome since they permit direct determination of the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (TcVI) and the non-hybrid Dm28c (TcI), determined by PacBio Single Molecular Real-Time (SMRT) technology. The improved assemblies herein obtained permitted us to accurately estimate gene copy numbers, abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a 'core compartment' and a 'disruptive compartment' which exhibit opposite GC content and gene composition. Novel tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of surface multigene families, mucins and trans-sialidases allows now a better overview of these complex groups of genes.