ABSTRACT
Prokaryotic genomes are often considered to be mosaics of genes that do not necessarily share the same evolutionary history due to widespread horizontal gene transfers (HGTs). Consequently, representing evolutionary relationships of prokaryotes as bifurcating trees has long been controversial. However, studies reporting conflicts among gene trees derived from phylogenomic data sets have shown that these conflicts can be the result of artifacts or evolutionary processes other than HGT, such as incomplete lineage sorting, low phylogenetic signal, and systematic errors due to substitution model misspecification. Here, we present the results of an extensive exploration of phylogenetic conflicts in the cyanobacterial order Nostocales, for which previous studies have inferred strongly supported conflicting relationships when using different concatenated phylogenomic data sets. We found that most of these conflicts are concentrated in deep clusters of short internodes of the Nostocales phylogeny, where the great majority of individual genes have low resolving power. We then inferred phylogenetic networks to detect HGT events while also accounting for incomplete lineage sorting. Our results indicate that most conflicts among gene trees are likely due to incomplete lineage sorting linked to an ancient rapid radiation, rather than to HGTs. Moreover, the short internodes of this radiation fit the expectations of the anomaly zone, i.e., a region of the tree parameter space where a species tree is discordant with its most likely gene tree. We demonstrated that concatenation of different sets of loci can recover up to 17 distinct and well-supported relationships within the putative anomaly zone of Nostocales, corresponding to the observed conflicts among well-supported trees based on concatenated data sets from previous studies. 
Our findings highlight the important role of rapid radiations as a potential cause of strongly conflicting phylogenetic relationships when using phylogenomic data sets of bacteria. We propose that polytomies may be the most appropriate phylogenetic representation of these rapid radiations that are part of anomaly zones, especially when all possible genomic markers have been considered to infer these phylogenies. [Anomaly zone; bacteria; horizontal gene transfer; incomplete lineage sorting; Nostocales; phylogenomic conflict; rapid radiation; Rhizonema.]
Subjects
Cyanobacteria, Genome, Phylogeny, Biological Evolution, Prokaryotic Cells, Cyanobacteria/genetics
ABSTRACT
SUMMARY: To support small and large-scale genome mining projects, we present Post-processing Analysis tooLbox for ANTIsmash Reports (Palantir), a dedicated software suite for handling and refining secondary metabolite biosynthetic gene cluster (BGC) data annotated with the popular antiSMASH pipeline. Palantir provides new functionalities building on NRPS/PKS predictions from antiSMASH, such as improved BGC annotation, module delineation and easy access to sub-sequences at different levels (cluster, gene, module and domain). Moreover, it can parse user-provided antiSMASH reports and reformat them for direct use or storage in a relational database. AVAILABILITY AND IMPLEMENTATION: Palantir is released both as a Perl API available on CPAN (https://metacpan.org/release/Bio-Palantir) and as a web application (http://palantir.uliege.be). As a practical use case, the web interface also features a database built from the mining of 1616 cyanobacterial genomes, of which 1488 were predicted to encode at least one BGC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Biosynthetic Pathways, Software, Bacteria/genetics, Molecular Sequence Annotation, Multigene Family
ABSTRACT
Understanding the evolutionary history of symbiotic Cyanobacteria at a fine scale is essential to unveil patterns of associations with their hosts and factors driving their spatiotemporal interactions. As for bacteria in general, Horizontal Gene Transfers (HGT) are expected to be rampant throughout their evolution, which justified the use of single-locus phylogenies in macroevolutionary studies of these photoautotrophic bacteria. Genomic approaches have greatly increased the amount of molecular data available, but the selection of orthologous, congruent genes that are more likely to reflect bacterial macroevolutionary histories remains problematic. In this study, we developed a synteny-based approach and searched for Collinear Orthologous Regions (COR), under the assumption that genes that are present in the same order and orientation across a wide monophyletic clade are less likely to have undergone HGT. We searched sixteen reference Nostocales genomes and identified 99 genes, part of 28 COR comprising three to eight genes each. We then developed a bioinformatic pipeline, designed to minimize inter-genome contamination and processed twelve Nostoc-associated lichen metagenomes. This reduced our original dataset to 90 genes representing 25 COR, which were used to infer phylogenetic relationships within Nostocales and among lichenized Cyanobacteria. This dataset was narrowed down further to 71 genes representing 22 COR by selecting only genes part of one (largest) operon per COR. We found a relatively high level of congruence among trees derived from the 90-gene dataset, but congruence was only slightly higher among genes within a COR compared to genes across COR. However, topological congruence was significantly higher among the 71 genes part of one operon per COR. 
Nostocales phylogenies resulting from concatenation and species tree approaches based on the 90- and 71-gene datasets were highly congruent, but the most highly supported result was obtained when using synteny, collinearity, and operon information (i.e., 71-gene dataset) as gene selection criteria, which outperformed larger datasets with more genes.
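The idea of a collinear orthologous region can be illustrated with a minimal sketch. It assumes single-copy orthologs, represents each genome as an ordered list of (ortholog, strand) pairs, and ignores inverted blocks; the function names and the toy data are hypothetical and do not reproduce the pipeline used in the study.

```python
def adjacent_pairs(genome):
    """Return the set of consecutive (gene, strand) pairs found in one genome."""
    return set(zip(genome, genome[1:]))


def collinear_regions(reference, others, min_genes=3):
    """List runs of reference genes kept in the same order and strand in all genomes.

    reference, others[i]: ordered lists of (ortholog_id, strand) tuples.
    min_genes: smallest run reported (three genes, as in the abstract).
    """
    if not reference:
        return []
    shared = [adjacent_pairs(g) for g in others]
    regions = []
    run = [reference[0]]
    for prev, cur in zip(reference, reference[1:]):
        if all((prev, cur) in pairs for pairs in shared):
            run.append(cur)  # adjacency conserved everywhere: extend the run
        else:
            if len(run) >= min_genes:
                regions.append([gene for gene, _ in run])
            run = [cur]  # adjacency broken: start a new run
    if len(run) >= min_genes:
        regions.append([gene for gene, _ in run])
    return regions
```

Comparing adjacent pairs rather than absolute positions keeps the check independent of where a block sits in each genome.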
Subjects
Cyanobacteria/genetics, Horizontal Gene Transfer, Phylogeny, Synteny, Molecular Evolution, Genomics
ABSTRACT
The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.
Subjects
Genomics, Software, Algorithms, Benchmarking, Genome
ABSTRACT
Snodgrassella is a genus of Betaproteobacteria that lives in the gut of honeybees (Apis spp.) and bumblebees (Bombus spp.). It is part of a conserved microbiome that is composed of a few core phylotypes and is essential for bee health and metabolism. Phylogenomic analyses using whole-genome sequences of 75 Snodgrassella strains from 4 species of honeybees and 14 species of bumblebees showed that these strains formed a monophyletic lineage within the Neisseriaceae family, that Snodgrassella isolates from Asian honeybees diverged early from the other species in their evolution, that isolates from honeybees and bumblebees were well separated, and that this genus consists of at least seven species. We propose to formally name two new Snodgrassella species that were isolated from bumblebees, i.e., Snodgrassella gandavensis sp. nov. and Snodgrassella communis sp. nov. Possible evolutionary scenarios for 107 species- or group-specific genes revealed very limited evidence for horizontal gene transfer. Functional analyses revealed the importance of small proteins, defense mechanisms, amino acid transport and metabolism, inorganic ion transport and metabolism and carbohydrate transport and metabolism among these 107 specific genes. IMPORTANCE The microbiome of honeybees (Apis spp.) and bumblebees (Bombus spp.) is highly conserved and represented by few phylotypes. This simplicity in taxon composition makes the bee's microbiome an emergent model organism for the study of gut microbial communities. Since the description of the Snodgrassella genus, which was isolated from the gut of honeybees and bumblebees in 2013, a single species (i.e., Snodgrassella alvi) has been named. Here, we demonstrate that this genus is actually composed of at least seven species, two of which (Snodgrassella gandavensis sp. nov. and Snodgrassella communis sp. nov.) are formally described and named in the present publication. 
We also report the presence of 107 genes specific to Snodgrassella species, showing notably the importance of small proteins and defense mechanisms in this genus.
Subjects
Microbiota, Neisseriaceae, Animals, Bees, Phylogeny, Neisseriaceae/genetics
ABSTRACT
BACKGROUND: Microbial culture collections play a key role in taxonomy by studying the diversity of their strains and providing well-characterized biological material to the scientific community for fundamental and applied research. These microbial resource centers thus need to implement new standards in species delineation, including whole-genome sequencing and phylogenomics. In this context, the genomic needs of the Belgian Coordinated Collections of Microorganisms were studied, resulting in the GEN-ERA toolbox. The latter is a unified cluster of bioinformatic workflows dedicated to both bacteria and small eukaryotes (e.g., yeasts). FINDINGS: This public toolbox allows researchers without a specific training in bioinformatics to perform robust phylogenomic analyses. Hence, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to tree reconstruction. It also offers workflows for average nucleotide identity comparisons and metabolic modeling. TECHNICAL DETAILS: Nextflow workflows are launched by a single command and are available on the GEN-ERA GitHub repository (https://github.com/Lcornet/GENERA). All the workflows are based on Singularity containers to increase reproducibility. TESTING: The toolbox was developed for a diversity of microorganisms, including bacteria and fungi. It was further tested on an empirical dataset of 18 (meta)genomes of early branching Cyanobacteria, providing the most up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria. CONCLUSION: The GEN-ERA toolbox can be used to infer completely reproducible comparative genomic and metabolic analyses on prokaryotes and small eukaryotes. Although designed for routine bioinformatics of culture collections, it can also be used by all researchers interested in microbial taxonomy, as exemplified by our case study on Gloeobacterales.
Subjects
Computational Biology, Genomics, Workflow, Reproducibility of Results, Genomics/methods, Computational Biology/methods, Microbial Genome, Phylogeny
ABSTRACT
The continuous increase in sequenced genomes in public repositories makes the choice of interesting bacterial strains for future sequencing projects ever more complicated, as it is difficult to estimate the redundancy between these strains and the already available genomes. Therefore, we developed the Nextflow workflow "ORPER", for "ORganism PlacER", containerized in Singularity, which allows the determination of the phylogenetic position of a collection of organisms in the genomic landscape. ORPER constrains the phylogenetic placement of SSU (16S) rRNA sequences in a multilocus reference tree based on ribosomal protein genes extracted from public genomes. We demonstrate the utility of ORPER on the Cyanobacteria phylum, by placing 152 strains of the BCCM/ULC collection.
Subjects
Automation/methods, Cyanobacteria/genetics, Phylogeny, 16S Ribosomal RNA/genetics, Ribosomal Proteins/genetics, Ribotyping/methods, DNA Sequence Analysis/methods, Bacterial DNA, Electronic Data Processing/methods, Workflow
ABSTRACT
TQMD is a tool for high-performance computing clusters which downloads, stores and produces lists of dereplicated prokaryotic genomes. It has been developed to counter the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It is based on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by verifying the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that TQMD is primarily optimized to dereplicate at higher taxonomic levels (phylum/class), as opposed to the other dereplication tools, but also works at lower taxonomic levels (species/strain) like the other dereplication tools. TQMD is available from source and as a Singularity container at https://bitbucket.org/phylogeno/tqmd.
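The combination of k-mer profiles and single-linkage clustering can be sketched as follows. This is an illustration only: the Jaccard distance, the values of k and of the threshold, and all function names are assumptions made for the example, not TQMD's actual metric, parameters or code, and the iterative and divide-and-conquer aspects are left out.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(genomes, k=4, threshold=0.5):
    """Group genomes linked by at least one pairwise distance <= threshold."""
    names = list(genomes)
    profiles = {n: kmers(genomes[n], k) for n in names}
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


def dereplicate(genomes, k=4, threshold=0.5):
    """Keep one representative per cluster (here, the alphabetically first name)."""
    return [c[0] for c in single_linkage_clusters(genomes, k, threshold)]
```

Raising the threshold merges more distant genomes into the same cluster, which is one way such a tool could be tuned toward higher taxonomic levels.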
ABSTRACT
The presence of contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
ABSTRACT
The medically relevant Trichophyton rubrum species complex has a variety of phenotypic presentations but shows relatively few genetic differences. Conventional barcodes, such as the internal transcribed spacer (ITS) region or the beta-tubulin gene, are not able to completely resolve the relationships between these closely related taxa. T. rubrum, T. soudanense and T. violaceum are currently accepted as separate species. However, the status of certain variants, including the T. rubrum morphotypes megninii and kuryangei and the T. violaceum morphotype yaoundei, remains to be deciphered. We conducted the first phylogenomic analysis of the T. rubrum species complex by studying 3105 core genes of 18 new strains from the BCCM/IHEM culture collection and nine publicly available genomes. Our analyses revealed a highly resolved phylogenomic tree with six separate clades. Trichophyton rubrum, T. violaceum and T. soudanense were confirmed in their status of species. The morphotypes T. megninii, T. kuryangei and T. yaoundei each grouped in their own clade with high support, suggesting that these morphotypes should be reinstated at the species level. Robinson-Foulds distance analyses showed that a combination of two markers (a ubiquitin-protein transferase and a MYB DNA-binding domain-containing protein) can mirror the phylogeny obtained using genomic data, and thus represent potential new markers to accurately distinguish the species belonging to the T. rubrum complex.
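For readers unfamiliar with the Robinson-Foulds distance, a minimal sketch follows. It assumes unrooted comparison of trees written as nested tuples of leaf names and counts the non-trivial bipartitions found in only one of the two trees; the representation and function names are choices made for this example, and the software used in the study is not stated in the abstract.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`; append each internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Return the non-trivial splits of a tree, each stored as the side lacking a reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # fixed reference leaf so each split has one canonical form
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees (same leaf set assumed)."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

A distance of zero between a marker-based tree and the genome-based tree means the two topologies agree on every internal branch.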
Subjects
Arthrodermataceae, Trichophyton, Arthrodermataceae/genetics, Phylogeny, Trichophyton/genetics
ABSTRACT
Circular DNA is ubiquitous in nature in the form of plasmids, circular DNA viruses, and extrachromosomal circular DNA (eccDNA) in eukaryotes. Sequencing of such molecules is essential to profiling virus distributions, discovering new viruses and understanding the roles of eccDNAs in eukaryotic cells. Circular DNA enrichment sequencing (CIDER-Seq) is a technique to enrich and accurately sequence circular DNA without the need for polymerase chain reaction amplification, cloning, and computational sequence assembly. The approach is based on randomly primed circular DNA amplification, which is followed by several enzymatic DNA repair steps and then by long-read sequencing. CIDER-Seq includes a custom data analysis package (CIDER-Seq Data Analysis Software 2) that implements the DeConcat algorithm to deconcatenate the long sequencing products of random circular DNA amplification into the intact sequences of the input circular DNA. The CIDER-Seq data analysis package can generate full-length annotated virus genomes, as well as circular DNA sequences of novel viruses. Applications of CIDER-Seq also include profiling of eccDNA molecules such as transposable elements (TEs) from biological samples. The method takes ~2 weeks to complete, depending on the computational resources available. Owing to the present constraints of long-read single-molecule sequencing, the accuracy of circular virus and eccDNA sequences generated by the CIDER-Seq method scales with sequence length, and the greatest accuracy is obtained for molecules <10 kb long.
Subjects
Circular DNA/analysis, Viral DNA/analysis, Nucleic Acid Amplification Techniques/methods, DNA Sequence Analysis/methods, Arabidopsis
ABSTRACT
Cyanobacteria played an important role in the evolution of Early Earth and the biosphere. They are responsible for the oxygenation of the atmosphere and oceans since the Great Oxidation Event around 2.4 Ga, debatably earlier. They are also major primary producers in past and present oceans, and the ancestors of the chloroplast. Nevertheless, the identification of cyanobacteria in the early fossil record remains ambiguous because the morphological criteria commonly used are not always reliable for microfossil interpretation. Recently, new biosignatures specific to cyanobacteria were proposed. Here, we review the classic and new cyanobacterial biosignatures. We also assess the reliability of the previously described cyanobacteria fossil record and the challenges of molecular approaches on modern cyanobacteria. Finally, we suggest possible new calibration points for molecular clocks, and strategies to improve our understanding of the timing and pattern of the evolution of cyanobacteria and oxygenic photosynthesis.
Subjects
Biological Evolution, Chloroplasts/metabolism, Cyanobacteria/metabolism, Oxygen/metabolism, Cyanobacteria/genetics, Fossils, Oxidation-Reduction, Photosynthesis
ABSTRACT
OBJECTIVE: Cyanobacteria are an ancient phylum of prokaryotes that contains the class Oxyphotobacteria. This group has been extensively studied by phylogenomics, notably because it is widely accepted that Cyanobacteria were responsible for the spread of photosynthesis to the eukaryotic domain. The aim of this study was to evaluate the fraction of the oxyphotobacterial diversity for which sequenced genomes are available for genomic studies. For this, we built a phylogenomic-constrained SSU rRNA (16S) tree to pinpoint unexploited clusters of Oxyphotobacteria that should be targeted for future genome sequencing, so as to improve our understanding of Oxyphotobacteria evolution. RESULTS: We show that only a small fraction of the oxyphotobacterial diversity has been sequenced so far. Indeed, 31 rRNA clusters of the 60 composing the photosynthetic Cyanobacteria have a fraction of sequenced genomes < 1%. This fraction remains low (min = 1%, median = 11.1%, IQR = 7.3%) within the remaining "sequenced" clusters that already contain some representative genomes. The "unsequenced" clusters are scattered across the whole Oxyphotobacteria tree, with the exception of very basal clades. Yet, these clades still feature some (sub)clusters without any representative genome. This last result is especially important, as these basal clades are prime candidates for plastid emergence.
Subjects
Cyanobacteria/genetics, Photosynthesis/genetics, Phylogeny, Ribosomal RNA/analysis, Base Sequence
ABSTRACT
Cyanobacteria form one of the most diversified phyla of Bacteria. They are ecologically important as primary producers, and are also relevant to the evolution of Earth and to biotechnological applications. Yet, Cyanobacteria are notably difficult to purify and grow axenically, and most strains in culture collections contain heterotrophic bacteria that were probably associated with Cyanobacteria in the environment. Obtaining cyanobacterial DNA without contaminant sequences is thus a challenging and time-consuming task. Here, we describe a metagenomic pipeline that enables the easy recovery of genomes from non-axenic cultures. We tested this pipeline on 17 cyanobacterial cultures from the BCCM/ULC public collection and generated novel genome sequences for 12 polar or subpolar strains and three temperate ones, including three early-branching organisms that will be useful for phylogenomics. In parallel, we assembled 31 co-cultivated bacteria (12 nearly complete) from the same cultures and showed that they mostly belong to Bacteroidetes and Proteobacteria, some of them being very closely related in spite of geographically distant sampling sites.
Subjects
Cyanobacteria/classification, Cyanobacteria/genetics, Metagenome, Microbiota/genetics, Antarctic Regions, Arctic Regions, Cyanobacteria/isolation & purification, Metagenomics, Phylogeny, 16S Ribosomal RNA/genetics
ABSTRACT
Moonmilk deposits are cave carbonate formations that host a rich microbiome, including antibiotic-producing Actinobacteria, making these speleothems appealing for bioprospecting. Here, we investigated the taxonomic profile of the actinobacterial community of three moonmilk deposits of the cave "Grotte des Collemboles" via high-throughput sequencing of 16S rRNA amplicons. Actinobacteria was the most common phylum after Proteobacteria, ranging from 9% to 23% of the total bacterial population. Next to actinobacterial operational taxonomic units (OTUs) attributed to uncultured organisms at the genus level (~44%), we identified 47 actinobacterial genera with Rhodococcus (4 OTUs, 17%) and Pseudonocardia (9 OTUs, ~16%) as the most abundant in terms of the absolute number of sequences. Streptomycetes presented the highest diversity (19 OTUs, 3%), with most of the OTUs unlinked to the culturable Streptomyces strains that were previously isolated from the same deposits. Furthermore, 43% of the OTUs were shared between the three studied collection points, while 34% were exclusive to one deposit, indicating that distinct speleothems host their own population, despite their nearby localization. This important spatial diversity suggests that prospecting within different moonmilk deposits should result in the isolation of unique and novel Actinobacteria. These speleothems also host a wide range of non-streptomycetes antibiotic-producing genera, and should therefore be subjected to methodologies for isolating rare Actinobacteria.
ABSTRACT
Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all the six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.
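The consensus rule stated above (contaminants detected by 5 or 6 of the six methods) can be written as a short sketch. The method keys, the boolean per-method calls and the function names are hypothetical simplifications introduced here; the study itself worked from the rankings produced by each method, which this example does not reproduce.

```python
# Hypothetical keys standing for the six methods named in the abstract.
METHODS = (
    "ssu_rrna",
    "ribosomal_proteins",
    "checkm",
    "kraken",
    "diamond_blastx",
    "concoct",
)


def count_detections(flags):
    """Count how many of the six methods flagged a genome as contaminated."""
    return sum(1 for method in METHODS if flags.get(method, False))


def highly_contaminated(calls, min_methods=5):
    """Return the sorted names of genomes flagged by at least `min_methods` methods.

    calls: dict mapping genome name -> dict mapping method key -> bool.
    """
    return sorted(
        genome
        for genome, flags in calls.items()
        if count_detections(flags) >= min_methods
    )
```

Requiring agreement between several independent methods trades sensitivity for specificity, which suits a claim as strong as "highly contaminated".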
Subjects
Cyanobacteria/genetics, DNA Contamination, Bacterial Genome/genetics, Consensus, Bacterial DNA/genetics, rRNA Genes/genetics, Genetic Markers/genetics
ABSTRACT
Phormidesmis priestleyi ULC007 is an Antarctic freshwater cyanobacterium. Its draft genome is 5,684,389 bp long. It contains a total of 5,604 protein-encoding genes, of which 22.2% have no clear homologues in known genomes. To date, this draft genome is the first one ever determined for an axenic cyanobacterium from Antarctica.