ABSTRACT
The Terminal Fusarium Clade (TFC) is a group in the family Nectriaceae with agricultural and clinical relevance. In recent years, various phylogenies with conflicting topologies have been presented in the literature, but only a few studies have estimated divergence times for the group. The evolutionary history of the group therefore remains unresolved. This study aimed to understand the evolutionary history of the TFC from a phylogenomic perspective. To achieve this objective, we performed a phylogenomic analysis using the available genomes in GenBank and ran eight different pipelines. We present a new robust topology of the TFC that differs at some nodes from previous studies. These new relationships allowed us to formulate new hypotheses about the evolutionary history of the TFC. We also inferred new divergence time estimates, which differ from those of previous studies because of topological discordance and differences in taxon sampling. The results suggest an important diversification process in the Neogene period, likely associated with the diversification of angiosperms and their rise to dominance in terrestrial ecosystems. In conclusion, we present a robust time-scaled phylogeny that allowed us to formulate new hypotheses regarding the evolutionary history of the TFC.
ABSTRACT
Chagas disease is endemic in tropical regions of Latin America and is caused by the parasite Trypanosoma cruzi. High intraspecies variability and genome complexity have made it difficult to assemble the high-quality genomes needed for studies in evolution, population genomics, diagnosis and drug development. Here we present a chromosome-level phased assembly of a TcI T. cruzi strain (Dm25). While 29 chromosomes show large collinearity with the assembly of the Brazil A4 strain, three chromosomes show both large heterozygosity and large divergence compared to previous assemblies of TcI T. cruzi strains. Nucleotide and protein evolution statistics indicate that T. cruzi marinkellei separated before the diversification of T. cruzi into the known DTUs. Interchromosomal paralogs of dispersed gene families and histones appeared earlier but are at the same time under stricter purifying selection than other repeat families. Previously unreported large tandem arrays of protein kinases and histones were identified in this assembly. Over one million variants obtained from Illumina reads aligned to the primary assembly clearly separate the main DTUs. We expect that this new assembly will be a valuable resource for further studies on evolution and functional genomics of trypanosomatids.
Subject(s)
Chagas Disease , Trypanosoma cruzi , Humans , Trypanosoma cruzi/genetics , Colombia , Histones , Brazil
ABSTRACT
BACKGROUND: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed. FINDINGS: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates the integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulated and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths. CONCLUSION: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
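The clustering step described in this abstract can be illustrated with a minimal, pure-Python DBSCAN over SV signatures. This is only a sketch under stated assumptions: each signature is reduced to a hypothetical (start position, length) pair used directly as Euclidean coordinates, and the `eps` and `min_pts` values are arbitrary. The abstract does not specify how the platform derives its coordinates or parameters, so none of this should be read as its actual implementation.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical deletion signatures as (start position, length) pairs.
signatures = [(1000, 300), (1004, 310), (998, 295), (1002, 305),
              (5000, 80), (5003, 82), (5001, 79),
              (9000, 1200)]
labels = dbscan(signatures, eps=20, min_pts=3)
print(labels)
```

Signatures that share a label would then be merged into one candidate SV, and the isolated signature is left as noise rather than forced into a cluster, which is the property of density-based clustering the abstract highlights.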
Subject(s)
Algorithms , Benchmarking , Bayes Theorem , Genotype , Cluster Analysis
ABSTRACT
The domestication process in lima bean (Phaseolus lunatus L.) involved two independent events, within the Mesoamerican and Andean gene pools. This makes lima bean an excellent model to understand convergent evolution. The mechanisms of adaptation followed by Mesoamerican and Andean landraces are largely unknown. Genes related to these adaptations can be selected by identifying selective sweeps within gene pools. Previous genetic analyses in lima bean have relied on Single Nucleotide Polymorphism (SNP) loci and have ignored transposable elements (TEs). Here we present an analysis of whole-genome sequencing data from 61 lima bean accessions to characterize a genomic variation database including TEs and SNPs, to associate selective sweeps with variable TEs and to predict candidate domestication genes. A small percentage of genes under selection are shared among gene pools, suggesting that domestication followed different genetic avenues in the two gene pools. About 75% of TEs are located close to genes, which shows their potential to affect gene functions. The genetic structure inferred from variable TEs is consistent with that obtained from SNP markers, suggesting that TE dynamics can be related to the demographic history of wild and domesticated lima bean and its adaptive processes, in particular selection processes during domestication.
Subject(s)
Phaseolus , Phaseolus/genetics , DNA Transposable Elements/genetics , Polymorphism, Single Nucleotide , Population Dynamics
ABSTRACT
The global chocolate market has grown during the last decade and is expected to reach a value of USD 200 billion by 2028. Chocolate is obtained from different varieties of Theobroma cacao L., a plant domesticated more than 4000 years ago in the Amazon rainforest. However, chocolate production is a complex process requiring extensive post-harvest processing, mainly cocoa bean fermentation, drying, and roasting. These steps have a critical impact on chocolate quality. Standardizing and better understanding cocoa processing is, therefore, a current challenge to boost the global production of high-quality cocoa. This knowledge can also help cocoa producers improve cocoa processing management and obtain better chocolate. Several recent studies have dissected cocoa processing via omics analysis, and a vast amount of data has been produced by omics studies of cocoa processing performed worldwide. This review systematically analyzes the current data on cocoa omics using data mining techniques and discusses opportunities and gaps for cocoa processing standardization based on these data. First, we observed recurrent reports in metagenomics studies of species of the fungal genera Candida and Pichia as well as bacteria from the genera Lactobacillus, Acetobacter, and Bacillus. Second, our analyses of the available metabolomics data showed clear differences in the metabolites identified in cocoa and chocolate of different geographical origins, cocoa types, and processing stages. Finally, our analysis of peptidomics data revealed characteristic patterns in the gathered data, including higher diversity and a lower size distribution of peptides in fine-flavor cocoa. In addition, we discuss the current challenges in cocoa omics research. 
More research is still required to fill gaps in central aspects of chocolate production, such as starter cultures for cocoa fermentation, the flavor evolution of cocoa, and the role of peptides in the development of specific flavor notes. We also offer the most comprehensive collection of multi-omics data on cocoa processing gathered from different research articles.
Subject(s)
Bacillus , Cacao , Chocolate , Food , Candida
ABSTRACT
Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
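The minimizer idea mentioned in this abstract can be sketched in a few lines. In the sketch below, each window of `W` consecutive k-mers contributes the k-mer with the smallest rank, and shared minimizers between two reads suggest an overlap offset. The ranking used here (k-mer count across the reads, ties broken alphabetically) is an assumption standing in for the hash function "derived from the k-mer distribution", which the abstract does not define; the reads, `K`, and `W` are toy values.

```python
from collections import Counter

def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizers(seq, k, w, rank):
    """Sorted (position, k-mer) pairs: lowest-ranked k-mer of every window."""
    kms = kmers(seq, k)
    selected = set()
    for start in range(len(kms) - w + 1):
        best = min(range(start, start + w), key=lambda i: (rank(kms[i]), i))
        selected.add((best, kms[best]))
    return sorted(selected)

def shared_offsets(mins_a, mins_b):
    """Offsets (position in a minus position in b) implied by shared minimizers."""
    positions_b = {}
    for pos, kmer in mins_b:
        positions_b.setdefault(kmer, []).append(pos)
    return {pa - pb for pa, kmer in mins_a for pb in positions_b.get(kmer, [])}

K, W = 3, 3  # toy parameters, far smaller than anything used on real reads
reads = ["ACGTTGCA", "GTTGCATC"]  # the second read starts at position 2 of the first
counts = Counter(km for read in reads for km in kmers(read, K))

def rank(kmer):
    # Assumed ordering: rarer k-mers first, ties broken alphabetically.
    return (counts[kmer], kmer)

mins_a = minimizers(reads[0], K, W, rank)
mins_b = minimizers(reads[1], K, W, rank)
offsets = shared_offsets(mins_a, mins_b)
print(mins_a, mins_b, offsets)
```

Because both reads select the same k-mers in their shared region, only a small subset of k-mers has to be indexed to detect the overlap, which is what makes minimizer-based graph construction efficient.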
Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Genome , Software
ABSTRACT
The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis to generate phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. Using these techniques as a knowledge background, the chapter reviews the current algorithms to generate phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. This allows large presence-absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with improved sequencing technologies, has been crucial to finishing chromosome-level assemblies of complex genomes.
Subject(s)
Algorithms , Computational Biology , Sequence Analysis, DNA/methods , Haplotypes , Alleles , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Whole-genome alignment allows researchers to understand the genomic structure and variation among genomes. Approaches based on direct pairwise comparisons of DNA sequences require large computational capacities. As a consequence, pipelines combining tools for orthologous gene identification and synteny have been developed. In this manuscript, we present the latest functionalities implemented in NGSEP 4 to identify orthogroups and perform whole-genome alignments. NGSEP implements functionalities for identification of clusters of homologous genes, synteny analysis and whole-genome alignment. Our results showed that the NGSEP algorithm for orthogroup identification has competitive accuracy and efficiency in comparison to commonly used tools. The implementation also includes a visualization of the whole-genome alignment based on synteny of the orthogroups that were identified, and a reconstruction of the pangenome based on frequencies of the orthogroups among the genomes. NGSEP 4 also includes a new graphical user interface based on the JavaFX technology. We expect that these new developments will be very useful for several studies in evolutionary biology and population genomics.
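A frequency-based pangenome reconstruction, as mentioned in this abstract, can be pictured with the short sketch below. It labels each orthogroup by the fraction of genomes in which it occurs. The category names, the `core_fraction` threshold, and the input layout are illustrative assumptions; the abstract does not describe how NGSEP itself defines or reports these classes.

```python
def classify_orthogroups(presence, core_fraction=1.0):
    """Label orthogroups by how many genomes contain them.

    presence maps an orthogroup id to the set of genomes with a member gene.
    """
    genomes = set().union(*presence.values())
    classes = {}
    for orthogroup, members in presence.items():
        frequency = len(members) / len(genomes)
        if frequency >= core_fraction:
            classes[orthogroup] = "core"
        elif len(members) == 1:
            classes[orthogroup] = "unique"
        else:
            classes[orthogroup] = "accessory"
    return classes

# Hypothetical orthogroups across three genomes named A, B and C.
presence = {
    "OG1": {"A", "B", "C"},
    "OG2": {"A", "B"},
    "OG3": {"C"},
}
classes = classify_orthogroups(presence)
print(classes)
```

Lowering `core_fraction` (for example to 0.95) gives a "soft core" definition that tolerates an orthogroup being missed in a few incomplete assemblies.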
Subject(s)
Genome , Software , Genomics/methods , Algorithms , Metagenomics
ABSTRACT
The impact of cocoa lipid content on chocolate quality has been extensively described. Nevertheless, few studies have elucidated the cocoa lipid composition and its bioactive properties, and those have focused only on specific lipids. In the present study, the lipidome of fine-flavor cocoa fermentation was analyzed using LC-MS-QTOF, and a Machine Learning model to assess potential bioactivity was developed. Our results revealed that the cocoa lipidome, composed mainly of fatty acyls and glycerophospholipids, remains stable during fine-flavor cocoa fermentations. Also, several Machine Learning algorithms were trained to explore potential biological activity among the identified lipids. We found that K-Nearest Neighbors had the best performance. This model was used to classify the identified lipids as bioactive or non-bioactive, nominating 28 molecules as potential bioactive lipids. None of these compounds have been previously reported as bioactive. Our work is the first untargeted lipidomic study and systematic effort to investigate potential bioactivity in fine-flavor cocoa lipids.
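For readers unfamiliar with K-Nearest Neighbors, the classification step named in this abstract works roughly as in the sketch below: a molecule receives the majority label of its `k` closest training examples in descriptor space. The two-dimensional descriptors, labels, and `k` are invented for illustration; the abstract does not list the features or hyperparameters used in the study.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """Majority label among the k training points closest to query."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled molecular descriptors with known labels.
training = [
    ((0.10, 0.20), "non-bioactive"),
    ((0.20, 0.10), "non-bioactive"),
    ((0.15, 0.25), "non-bioactive"),
    ((0.90, 0.80), "bioactive"),
    ((0.80, 0.90), "bioactive"),
    ((0.85, 0.75), "bioactive"),
]
prediction_a = knn_predict(training, (0.80, 0.80))
prediction_b = knn_predict(training, (0.20, 0.20))
print(prediction_a, prediction_b)
```

Because the method relies on distances, descriptors are normally scaled to comparable ranges before training, which is why the toy values above all lie between 0 and 1.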
Subject(s)
Cacao , Chocolate , Fermentation , Lipidomics , Lipids , Taste
ABSTRACT
The growing use of next-generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of variants of uncertain significance (VUS). In this manuscript, we compare three machine learning methods to classify VUS as pathogenic or non-pathogenic: a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron. To train the models, we extracted high-quality variants from ClinVar that were previously classified as VUS. For each variant, we retrieved nine conservation scores, the loss-of-function tool, and allele frequencies. For the RF and SVM models, hyperparameters were tuned using cross-validation with a grid search. The three models were tested on a nonoverlapping set of variants that had been classified as VUS over the last 3 years but had been reclassified in August 2020. The three models yielded superior accuracy on this set compared to the benchmarked tools. The RF-based model yielded the best performance across different variant types and was used to create VusPrize, an open-source software tool for prioritization of VUS. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
Subject(s)
High-Throughput Nucleotide Sequencing , Machine Learning , High-Throughput Nucleotide Sequencing/methods , Humans , Neural Networks, Computer , Software , Support Vector Machine
ABSTRACT
Genotyping-by-sequencing (GBS) is a widely used and cost-effective technique for obtaining large numbers of genetic markers from populations by sequencing regions adjacent to restriction cut sites. Although a standard reference-based pipeline can be followed to analyse GBS reads, a reference genome is still not available for a large number of species. Hence, reference-free approaches are required to generate the genetic variability information that can be obtained from a GBS experiment. Unfortunately, available tools to perform de novo analysis of GBS reads face issues of usability, accuracy and performance. Furthermore, few available tools are suitable for analysing data sets from polyploid species. In this manuscript, we describe a novel algorithm to perform reference-free variant detection and genotyping from GBS reads. Nonexact searches on a dynamic hash table of consensus sequences allow for efficient read clustering and sorting. This algorithm was integrated into the Next Generation Sequencing Experience Platform (NGSEP) so that it can use the state-of-the-art variant detector already implemented in this tool. We performed benchmark experiments with three different empirical data sets of plants and animals with different population structures and ploidies, sequenced with different GBS protocols at different read depths. These experiments show that NGSEP has comparable and in some cases better accuracy, and always better computational efficiency, compared to existing solutions. We expect that this new development will be useful for many research groups conducting population genetic studies in a wide variety of species.
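One simple way to picture a non-exact search on a hash table of sequences is sketched below: a read is looked up exactly first, and if that fails, every sequence at one mismatch from the read is looked up before a new cluster is opened. This is a deliberately simplified assumption, not the published algorithm: the table is keyed by the first sequence seen in each cluster rather than by a maintained consensus, reads are assumed to have equal length, and only a single substitution is tolerated.

```python
def one_mismatch_neighbors(seq, alphabet="ACGT"):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]

def cluster_reads(reads):
    """Group reads by exact match or a single mismatch to a cluster key."""
    clusters = {}  # cluster key (first sequence seen) -> member reads
    for read in reads:
        if read in clusters:  # exact hit: one hash lookup
            clusters[read].append(read)
            continue
        # Non-exact search: probe the table with each one-mismatch neighbor.
        hit = next((n for n in one_mismatch_neighbors(read) if n in clusters), None)
        if hit is None:
            clusters[read] = [read]  # no similar cluster yet: open a new one
        else:
            clusters[hit].append(read)
    return clusters

# Hypothetical equal-length GBS reads from two loci.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTACGT", "TTTTGGGC"]
clusters = cluster_reads(reads)
cluster_sizes = {key: len(members) for key, members in clusters.items()}
print(cluster_sizes)
```

Each read costs a number of hash lookups proportional to its length, instead of a comparison against every existing cluster, which is the kind of saving that makes hash-based clustering attractive for large read sets.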
Subject(s)
Diploidy , Polyploidy , Genomics , Genotype , Humans , Software
ABSTRACT
Solanum betaceum is a tree from the Andean region bearing edible fruits, considered an exotic export. Although there has been renewed interest in its commercialization, sustainability and disease management have been limiting factors. Phytophthora betacei is a recently described species that causes late blight in S. betaceum. There is no general study of the response of S. betaceum to this pathogen, particularly of the changes in expression of pathogenesis-related genes. In this manuscript we present a comprehensive RNA-seq time-series study of the plant response to infection by P. betacei. Following six time points of infection, the differentially expressed genes (DEGs) involved in the plant defense were contextualized in a sequential manner. We documented 5,628 DEGs across all time points. From 6 to 24 h post-inoculation, we highlighted DEGs involved in the recognition of the pathogen by the likely activation of pattern-triggered immunity (PTI) genes. We also describe the possible effect of the pathogen effectors on the host during the effector-triggered response. Finally, we reveal genes related to the susceptible outcome of the interaction caused by the onset of necrotrophy and the sharp transcriptional changes as a response to the pathogen. This is the first report of the transcriptome of the tree tomato in response to the newly described pathogen P. betacei.
ABSTRACT
Recent developments in High Throughput Sequencing (HTS) technologies and bioinformatics, including improved read lengths and genome assemblers, allow the reconstruction of complex genomes with unprecedented quality and contiguity. Sugarcane has one of the most complicated genomes among grasses, with a haploid length of 1 Gbp and a ploidy between 8 and 12. In this work, we present a genome assembly of the Colombian sugarcane hybrid CC 01-1940. Three types of sequencing technologies were combined for this assembly: PacBio long reads, Illumina paired short reads, and Hi-C reads. We achieved a median contig length of 34.94 Mbp and a total genome assembly of 903.2 Mbp. We annotated a total of 63,724 protein coding genes and performed a reconstruction and comparative analysis of the sucrose metabolism pathway. Nucleotide evolution measurements between orthologs with close species suggest that divergence between Saccharum officinarum and Saccharum spontaneum occurred <2 million years ago. Synteny analysis between CC 01-1940 and the S. spontaneum genome confirms the presence of translocation events between the species and a random contribution throughout the entire genome in current sugarcane hybrids. Analysis of RNA-Seq data from leaf and root tissue of contrasting sugarcane genotypes subjected to water stress treatments revealed 17,490 differentially expressed genes, of which 3,633 correspond to genes expressed exclusively in tolerant genotypes. We expect the resources presented here to serve as a source of information to improve the selection of new varieties in sugarcane breeding programs.
ABSTRACT
Fragment-based drug design (FBDD) and pharmacophore modeling have proven to be efficient tools to discover novel drugs. However, these approaches may become limited if the collection of fragments is highly repetitive, poorly diverse, or excessively simple. In this article, we propose that combining pharmacophore modeling and a non-classical type of fragmentation (herein called non-extensive) to screen a natural product (NP) library may provide fragments predicted as potent, diverse, and developable. Initially, we applied retrosynthetic combinatorial analysis procedure (RECAP) rules in two versions, extensive and non-extensive, in order to deconstruct a virtual library of NPs formed by the databases Traditional Chinese Medicine (TCM), AfroDb (African Medicinal Plants database), NuBBE (Nuclei of Bioassays, Biosynthesis, and Ecophysiology of Natural Products), and UEFS (Universidade Estadual de Feira de Santana). We then performed a virtual screening (VS) using two groups of natural-product-derived fragments (extensive and non-extensive NPDFs) and two overlapping pharmacophore models for each of 20 different proteins of therapeutic interest. Molecular weight, lipophilicity, and molecular complexity were estimated and compared for both types of NPDFs (and their original NPs) before and after the VS procedures. As a result, we found that non-extensive NPDFs comprised a much higher number of chemical entities than extensive NPDFs (45,355 vs. 11,525 compounds), accounting for the larger part of the hits recovered and being far less repetitive than extensive NPDFs. The structural diversity of both types of NPDFs and the NPs was shown to diminish slightly after VS procedures. Finally, and most interestingly, the pharmacophore fit score of the non-extensive NPDFs proved to be not only higher, on average, than that of extensive NPDFs (56% of cases) but also higher than that of their original NPs (69% of cases) when all of them were also recognized as hits after the VS. 
The findings obtained in this study indicated that the proposed cascade approach was useful to enhance the probability of identifying innovative chemical scaffolds, which deserve further development to become drug-sized candidate compounds. We consider that the knowledge about the deconstruction degree required to produce NPDFs of interest represents a good starting point for eventual synthesis, characterization, and biological activity studies.
ABSTRACT
TILLING (Targeting Induced Local Lesions IN Genomes) is a powerful reverse genetics method in plant functional genomics and breeding to identify mutagenized individuals with improved behavior for a trait of interest. Pooled high throughput sequencing (HTS) of the targeted genes allows efficient identification and sample assignment of variants within genes of interest in hundreds of individuals. Although TILLING has been used successfully in different crops and even applied to natural populations, one of the main issues for a successful TILLING experiment is that most currently available bioinformatics tools for variant detection are not designed to identify mutations with low frequencies in pooled samples or to perform sample identification from variants identified in overlapping pools. Our research group maintains the Next Generation Sequencing Experience Platform (NGSEP), an open source solution for analysis of HTS data. In this manuscript, we present three novel components within NGSEP to facilitate the design and analysis of TILLING experiments: a pooled variants detector, a sample identifier from variants detected in overlapping pools and a simulator of TILLING experiments. A new implementation of the NGSEP calling model for variant detection allows accurate detection of low frequency mutations within pools. The samples identifier implements the process to triangulate the mutations called within overlapping pools in order to assign mutations to single individuals whenever possible. Finally, we developed a complete simulator of TILLING experiments to enable benchmarking of different tools and to facilitate the design of experimental alternatives varying the number of pools and individuals per pool. Simulation experiments based on genes from the common bean genome indicate that NGSEP provides similar accuracy and better efficiency than other tools to perform pooled variants detection. 
To the best of our knowledge, NGSEP is currently the only tool that generates individual assignments of the mutations discovered from the pooled data. We expect that this development will be of great use for different groups implementing TILLING as an alternative for plant breeding and even to research groups performing pooled sequencing for other applications.
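The triangulation of mutations called in overlapping pools can be illustrated with a two-dimensional pooling design, in which every individual belongs to exactly one row pool and one column pool. The grid layout, pool names, and individual identifiers below are hypothetical, and the abstract does not state which pooling designs the NGSEP sample identifier supports; this is only a sketch of the general idea.

```python
def triangulate(variant_pools, row_of, col_of, grid):
    """Return the individual carrying a variant, or None if it is ambiguous.

    variant_pools: names of the pools in which the variant was called.
    row_of / col_of: pool name -> row or column index in the grid.
    """
    rows = {row_of[p] for p in variant_pools if p in row_of}
    cols = {col_of[p] for p in variant_pools if p in col_of}
    if len(rows) == 1 and len(cols) == 1:
        return grid[rows.pop()][cols.pop()]
    return None  # missing or multiple rows/columns: cannot assign uniquely

# Hypothetical 3 x 3 design: 9 individuals, 3 row pools and 3 column pools.
grid = [["ind1", "ind2", "ind3"],
        ["ind4", "ind5", "ind6"],
        ["ind7", "ind8", "ind9"]]
row_of = {"R0": 0, "R1": 1, "R2": 2}
col_of = {"C0": 0, "C1": 1, "C2": 2}

unique_carrier = triangulate({"R1", "C2"}, row_of, col_of, grid)
ambiguous = triangulate({"R0", "R1", "C2"}, row_of, col_of, grid)
print(unique_carrier, ambiguous)
```

The ambiguous case, where a variant appears in two row pools, corresponds to the "whenever possible" caveat in the abstract: the same mutation may be carried by more than one individual, or one of the calls may be an error.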
ABSTRACT
Lima bean (Phaseolus lunatus L.), one of the five domesticated Phaseolus bean crops, shows a wide range of ecological adaptations along its distribution range from Mexico to Argentina. These adaptations make it a promising crop for improving food security under predicted scenarios of climate change in Latin America and elsewhere. In this work, we combine long and short read sequencing technologies with a dense genetic map from a biparental population to obtain a chromosome-level genome assembly for Lima bean. Annotation of 28,326 gene models shows high diversity among 1917 genes with conserved domains related to disease resistance. Structural comparison across 22,180 orthologs with common bean reveals high genome synteny and five large intrachromosomal rearrangements. Population genomic analyses show that wild Lima bean is organized into six clusters with mostly non-overlapping distributions and that Mesoamerican landraces can be further subdivided into three subclusters. RNA-seq data reveal 4275 differentially expressed genes, which can be related to pod dehiscence and seed development. We expect the resources presented here to serve as a solid basis to achieve a comprehensive view of the degree of convergent evolution of Phaseolus species under domestication and to provide tools and information for breeding for climate change resiliency.
Subject(s)
Acclimatization/genetics , Crops, Agricultural/genetics , Phaseolus/genetics , Plant Breeding , Quantitative Trait Loci , Argentina , Chromosome Mapping , Climate Change , Domestication , Genes, Plant/genetics , Mexico , Plant Dispersal , RNA-Seq , Seeds , Synteny
ABSTRACT
Cyclin-Dependent Kinase 2 (CDK2) and Vascular Endothelial Growth Factor Receptor 2 (VEGFR2) have long been considered attractive targets for developing anticancer agents. However, there is no commercially available dual inhibitor that interacts simultaneously with the allosteric back pocket of these enzymes. We applied a combined computational strategy that started with the generation of two overlapping pharmacophore models of both kinases in the 'inactive' conformation. Next, several virtual libraries of natural products, including the databases TCM (Traditional Chinese Medicine), UEFS (Universidade Estadual de Feira de Santana), NuBBE (Nuclei of Bioassays, Biosynthesis, and Ecophysiology of Natural Products) and AfroDb (African Medicinal Plants Database), were deconstructed using a non-extensive version of the RECAP (retrosynthetic combinatorial analysis procedure) approach. These natural-product-derived fragments (NPDFs) were screened and merged into drug-sized compounds, which were filtered by Lipinski's Rule-of-five (Ro5) and docking. As a result, two pharmacophore models, namely Hypo1 and Hypo2, were developed with an accuracy of 0.94 and 0.84, respectively. Deconstruction of natural products produced a set of 16,655 unique non-extensive NPDFs that were screened against both pharmacophore models. Finally, after merging, Ro5-filtering and docking, we obtained a set of 20 hit compounds predicted to be diverse, developable, synthesizable and potent. The computational strategy proved successful in finding virtual candidates of kinase inhibitors and therefore contributes to the identification of innovative multi-target compounds with potential anticancer activity. Communicated by Ramaswamy H. Sarma.
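The Ro5 filter mentioned in this abstract checks four descriptors: molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The sketch below applies those thresholds to precomputed descriptors. The `Descriptors` class and the two example molecules are hypothetical, and the tolerance of one violation is a common reading of the rule; the abstract does not say which cutoff the study applied.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Precomputed properties of one merged compound (hypothetical values)."""
    mol_weight: float   # Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors

def ro5_violations(d):
    """Count how many of Lipinski's four thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])

def passes_ro5(d, max_violations=1):
    # Assumption: at most one violation is tolerated; set 0 for a strict filter.
    return ro5_violations(d) <= max_violations

drug_like = Descriptors(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=5)
too_large = Descriptors(mol_weight=720.9, logp=6.3, h_donors=6, h_acceptors=12)
print(passes_ro5(drug_like), passes_ro5(too_large))
```

In a pipeline like the one described, such a filter would typically run before docking because it is far cheaper to evaluate, so that docking time is spent only on compounds that already look developable.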
Subject(s)
Antineoplastic Agents , Biological Products , Cyclin-Dependent Kinase 2/antagonists & inhibitors , Vascular Endothelial Growth Factor Receptor-2/antagonists & inhibitors , Molecular Docking Simulation
ABSTRACT
BACKGROUND: Common bean is an important staple crop in the tropics of Africa, Asia and the Americas. Smallholder farmers in particular rely on bean as a source of calories, protein and micronutrients. Drought is a major production constraint for common bean, a situation that will be aggravated under current climate change scenarios. In this context, new tools designed to understand the genetic basis governing the phenotypic responses to abiotic stress are required to improve transfer of desirable traits into cultivated beans. RESULTS: A multiparent advanced generation intercross (MAGIC) population of common bean was generated from eight Mesoamerican breeding lines representing the phenotypic and genotypic diversity of the CIAT Mesoamerican breeding program. This population was assessed under drought conditions in two field trials for yield, 100-seed weight, iron and zinc accumulation, phenology and pod harvest index. Transgressive segregation was observed for most of these traits. Yield was positively correlated with yield components and pod harvest index (PHI), and negative correlations were found with phenology traits and micromineral contents. Founder haplotypes in the population were identified using Genotyping by Sequencing (GBS). No major population structure was observed in the population. Whole Genome Sequencing (WGS) data from the founder lines were used to impute genotyping data for GWAS. Genetic mapping was carried out with two methods, association mapping with GWAS and linkage mapping with haplotype-based interval screening. Thirteen high-confidence QTL were identified using both methods, and several QTL hotspots were found controlling multiple traits. A major QTL hotspot located on chromosome Pv01 for phenology traits and yield was identified. Further hotspots affecting several traits were observed on chromosomes Pv03 and Pv08. A major QTL for seed Fe content was contributed by MIB778, the founder line with highest micromineral accumulation. 
Based on imputed WGS data, candidate genes are reported for the identified major QTL, and sequence changes were identified that could cause the phenotypic variation. CONCLUSIONS: This work demonstrates the importance of this common bean MAGIC population for genetic mapping of agronomic traits, to identify trait associations for molecular breeding tool design and as a new genetic resource for the bean research community.
Subject(s)
Phaseolus , Africa , Asia , Chromosome Mapping , Droughts , Phaseolus/genetics , Phenotype , Plant Breeding , Quantitative Trait Loci
ABSTRACT
Palm oil is the most consumed vegetable oil globally, and Colombia is the largest palm oil producer in South America and fourth worldwide. However, oil palm plantations in Colombia are affected by bud rot disease caused by the oomycete Phytophthora palmivora, leading to significant economic losses. Infection processes by plant pathogens involve the secretion of effector molecules, which alter the functioning or structure of host cells. Current long-read sequencing technologies provide the information needed to produce high-quality genome assemblies, enabling a comprehensive annotation of effectors. Here, we describe the development of genomic resources for P. palmivora, including a high-quality genome assembly based on long and short-read sequencing data, intraspecies variability for 12 isolates from different oil palm cultivation regions in Colombia, and a catalog of over 1,000 candidate effector proteins. A total of 45,416 genes were annotated from the new genome assembled in 2,322 contigs adding to 165.5 Mbp, which represents an improvement of two times more gene models, 33 times better contiguity, and 11 times less fragmentation compared with currently available genomic resources for the species. Analysis of nucleotide evolution in paralogs suggests a recent whole-genome duplication event. Genetic differences were identified among isolates showing variable virulence levels. We expect that these novel genomic resources contribute to the characterization of the species and the understanding of the interaction of P. palmivora with oil palm and could be further exploited as tools for the development of effective strategies for disease control.