ABSTRACT
Plasmids are a key vector of antibiotic resistance, but the current bioinformatics toolkit is not well suited to tracking them. The rapid structural changes seen in plasmid genomes present considerable challenges to evolutionary and epidemiological analysis. Typical approaches are either low resolution (replicon typing) or use shared k-mer content to define a genetic distance. However, this distance can both overestimate plasmid relatedness by ignoring rearrangements and underestimate it by over-penalizing gene gain/loss. Therefore a model is needed that captures the key components of how plasmid genomes evolve structurally: gene/block gain or loss, and rearrangement. A secondary requirement is to prevent promiscuous transposable elements (TEs) from leading to over-clustering of unrelated plasmids. We choose the 'Double Cut and Join Indel' (DCJ-Indel) model, in which plasmids are studied at a coarse level, as a sequence of signed integers (representing genes or aligned blocks), and the distance between two plasmids is the minimum number of rearrangement events or indels needed to transform one into the other. We show how this gives much more meaningful distances between plasmids. We introduce a software workflow, pling (https://github.com/iqbal-lab-org/pling), which uses the DCJ-Indel model to calculate distances between plasmids and then cluster them. In our approach, we combine containment distances and DCJ-Indel distances to build a TE-aware plasmid network. We demonstrate superior performance and interpretability compared with other plasmid clustering tools on the 'Russian Doll' dataset and a hospital transmission dataset.
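To make the model concrete: under DCJ, a circular plasmid is a signed sequence of gene/block identifiers, and for two plasmids over the same markers the distance reduces to counting cycles in the adjacency graph. The sketch below is a minimal illustration of that special case (equal content, no duplications, no indels), not pling's implementation; all names are ours.

```python
def adjacencies(genome):
    """Adjacency set of a circular genome given as signed integers.

    Gene g has two extremities, tail (g, 't') and head (g, 'h');
    +g is read tail->head, -g head->tail.  Each circularly consecutive
    pair of genes contributes one adjacency."""
    adjs = []
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        right = (abs(a), 'h' if a > 0 else 't')
        left = (abs(b), 't' if b > 0 else 'h')
        adjs.append((right, left))
    return adjs

def dcj_distance(genome_a, genome_b):
    """DCJ distance d = n - c for two circular genomes over the same
    n genes, where c is the number of cycles obtained by alternately
    following adjacencies of A and B."""
    partner = {0: {}, 1: {}}
    for which, genome in enumerate((genome_a, genome_b)):
        for x, y in adjacencies(genome):
            partner[which][x] = y
            partner[which][y] = x
    unvisited = set(partner[0])
    cycles = 0
    while unvisited:
        start = cur = next(iter(unvisited))
        which = 0
        while True:
            unvisited.discard(cur)
            cur = partner[which][cur]
            which = 1 - which
            if cur == start and which == 0:
                break
        cycles += 1
    return len(genome_a) - cycles

# One inversion away: distance 1.
print(dcj_distance([1, 2, 3, 4], [1, -3, -2, 4]))
```

Here [1, -3, -2, 4] is [1, 2, 3, 4] with the segment 2 3 inverted, so the distance is 1; the DCJ-Indel model used by pling additionally charges indel operations when gene content differs.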
Subjects
Plasmids, Software, Plasmids/genetics, DNA Transposable Elements, Gene Rearrangement, Computational Biology/methods, Humans, Molecular Evolution, INDEL Mutation
ABSTRACT
Motivation: Using a single linear reference genome limits exploration of the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics for overcoming these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth). Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour. Availability: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
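As a rough illustration of the two statistics (not Panacus itself, which scales to graphs with hundreds of millions of nodes and offers finer-grained controls), a set-based sketch in Python, assuming a GFA1 file in which each haplotype is one P-line; the file name is hypothetical:

```python
def paths_from_gfa(filename):
    """Map each P-line (one haplotype/sample) to the set of segment
    IDs it traverses; GFA1 path steps look like '12+' or '12-'."""
    samples = {}
    with open(filename) as gfa:
        for line in gfa:
            if line.startswith('P\t'):
                _, name, steps = line.rstrip('\n').split('\t')[:3]
                samples[name] = {s[:-1] for s in steps.split(',')}
    return samples

def core_size_and_growth(samples):
    """Number of segments shared by all samples, plus the cumulative
    count of distinct segments as samples are added in input order."""
    node_sets = list(samples.values())
    core = set.intersection(*node_sets)
    seen, growth = set(), []
    for nodes in node_sets:
        seen |= nodes
        growth.append(len(seen))
    return len(core), growth

samples = paths_from_gfa("pangenome.gfa")  # hypothetical input file
core, growth = core_size_and_growth(samples)
print(core, growth)
```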
ABSTRACT
Computational pangenomics deals with the joint analysis of all genomic sequences of a species. It has already been successfully applied to various tasks in many research areas. Further advances in DNA sequencing technologies constantly let more and more genomic sequences become available for many species, leading to an increasing attractiveness of pangenomic studies. At the same time, larger datasets also pose new challenges for data structures and algorithms that are needed to handle the data. Efficient methods oftentimes make use of the concept of k-mers. Core detection is a common way of analyzing a pangenome. The pangenome's core is defined as the subset of genomic information shared among all individual members. Classically, it is not only determined on the abstract level of genes but can also be described on the sequence level. In this chapter, we provide an overview of k-mer-based methods in the context of pangenomics studies. We first revisit existing software solutions for k-mer counting and k-mer set representation. Afterward, we describe the usage of two k-mer-based approaches, Pangrowth and Corer, for pangenomic core detection.
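As a baseline for the methods discussed in the chapter, the sequence-level core can be phrased directly in terms of canonical k-mer sets; this naive sketch trades the memory efficiency of dedicated k-mer counters and colored de Bruijn graphs for brevity, and assumes an uppercase ACGT alphabet:

```python
def canonical_kmers(seq, k):
    """Set of canonical k-mers of a DNA string: each k-mer is
    represented by the lexicographic minimum of itself and its
    reverse complement, so both strands count as one."""
    comp = str.maketrans("ACGT", "TGCA")
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out.add(min(kmer, kmer.translate(comp)[::-1]))
    return out

def core_kmers(genomes, k=31):
    """Sequence-level core: k-mers present in every genome."""
    sets = [canonical_kmers(g, k) for g in genomes]
    return set.intersection(*sets)

print(len(core_kmers(["ACGTACGTACGT", "TACGTACGTACG"], k=5)))
```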
Subjects
Algorithms, Computational Biology, Genomics, Software, Genomics/methods, Computational Biology/methods, DNA Sequence Analysis/methods, Humans, High-Throughput Nucleotide Sequencing/methods
ABSTRACT
The comparison of large-scale genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC, which, relying on built-in interfaces with a sequence comparison tool (either BLAST+ or DIAMOND) and with an ILP solver (either CPLEX or Gurobi), provides several methods for analyses that do not require prior classification of genes across the studied genomes. Taking annotated genome sequences as input, FFGC is a complete workflow for genome comparison that allows not only the computation of measures of similarity and dissimilarity but also the inference of gene families, based simultaneously on sequence similarities and large-scale genomic features.
Subjects
Genomics, Phylogeny, Software, Genomics/methods, Genome, Computational Biology/methods, Humans
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pone.0107014.].
ABSTRACT
BACKGROUND: Two genomes $A$ and $B$ over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by $n_*$ the number of common families of $A$ and $B$. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let $c_i$ and $p_j$ be respectively the numbers of cycles of length $i$ and of paths of length $j$ in the breakpoint graph of genomes $A$ and $B$. Then, the breakpoint distance of $A$ and $B$ is equal to $n_* - \left(c_2 + \frac{p_0}{2}\right)$. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of $A$ and $B$ is $n_* - \left(c + \frac{p_e}{2}\right)$, where $c$ is the total number of cycles and $p_e$ is the total number of paths of even length. MOTIVATION: The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a $\sigma_k$ distance, defined to be $n_* - \left(c_2 + c_4 + \dots + c_k + \frac{p_0 + p_2 + \dots + p_{k-2}}{2}\right)$, and increasingly investigate the complexities of median and double distance for the $\sigma_4$ distance, then the $\sigma_6$ distance, and so on. RESULTS: While for the median no progress was obtained even for the $\sigma_4$ distance, despite much effort in our and other research groups, for the double distance under the $\sigma_4$ and $\sigma_6$ distances we could devise linear-time algorithms, which we present here.
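A direct transcription of these closed formulas into code; note the assumption, also used in the reconstruction above, that the $\sigma_k$ path term sums even path lengths up to $k-2$, which is the reading consistent with $\sigma_2$ being the breakpoint distance and the limit of growing $k$ being the DCJ distance:

```python
def sigma_k_distance(n_common, cycles, paths, k):
    """sigma_k distance from breakpoint-graph statistics.

    cycles[i]: number of cycles of length i (cycle lengths are even);
    paths[j]:  number of paths of length j.
    k == 2 yields the breakpoint distance n* - (c_2 + p_0 / 2);
    letting k grow recovers the DCJ distance n* - (c + p_e / 2)."""
    c = sum(v for i, v in cycles.items() if i <= k)
    p = sum(v for j, v in paths.items() if j % 2 == 0 and j <= k - 2)
    return n_common - (c + p / 2)

# Toy statistics for n* = 4 families: one 2-cycle, one 4-cycle,
# and two paths of length 0 (8 breakpoint-graph vertices in total).
cycles, paths = {2: 1, 4: 1}, {0: 2}
print(sigma_k_distance(4, cycles, paths, k=2))  # breakpoint: 4-(1+1) = 2
print(sigma_k_distance(4, cycles, paths, k=4))  # sigma_4:    4-(2+1) = 1
```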
ABSTRACT
Genomic regions under positive selection harbor variation linked, for example, to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.
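The pattern-matching half of the approach rests on the positional Burrows-Wheeler transform; the sketch below (after Durbin's prefix-array construction, simplified and unoptimized) shows only that combinatorial core, with the closed-form statistical inference omitted:

```python
def pbwt_orders(haplotypes):
    """Positional Burrows-Wheeler transform, prefix arrays only.

    haplotypes: list of equal-length 0/1 sequences.  Yields for each
    site k the ordering a_k of haplotype indices sorted by their
    reversed prefixes h[k-1], h[k-2], ...; haplotypes sharing a long
    match ending at site k are adjacent in a_k."""
    order = list(range(len(haplotypes)))
    for k in range(len(haplotypes[0]) + 1):
        yield k, order
        if k < len(haplotypes[0]):
            # Stable counting sort on column k keeps the prior order
            # within the 0-group and within the 1-group.
            order = ([i for i in order if haplotypes[i][k] == 0] +
                     [i for i in order if haplotypes[i][k] == 1])

haps = [[0, 1, 0, 1], [0, 1, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
for k, a in pbwt_orders(haps):
    print(k, a)
```

Because identical recent haplotype histories end up adjacent in each prefix array, long runs of neighbors, the signature scanned for in selection statistics, can be detected with a single linear pass per site.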
Subjects
Population Genetics, Metagenomics, Genomics, Genome, Haplotypes
ABSTRACT
One of the most basic kinds of analysis to be performed on a pangenome is the detection of its core, i.e., the information shared among all members. Pangenomic core detection is classically done on the gene level, and many tools focus exclusively on core detection in prokaryotes. Here, we present a new method for sequence-based pangenomic core detection. Our model generalizes from a strict core definition, allowing us to flexibly determine suitable core properties depending on the research question and the dataset under consideration. We propose an algorithm based on a colored de Bruijn graph that runs in linear time with respect to the number of k-mers in the graph. An implementation of our method is called Corer. Thanks to the use of a colored de Bruijn graph, it works alignment-free, has a small memory footprint, and accepts as input assembled genomes as well as sequencing reads.
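One way to read the generalized core definition is via a quorum parameter; the sketch below illustrates this with plain sets, whereas Corer obtains the same information compactly from the colors of a colored de Bruijn graph:

```python
from collections import Counter

def quorum_core(kmer_sets, quorum):
    """k-mers contained in at least `quorum` of the input genomes.
    quorum == len(kmer_sets) is the strict core; smaller values give
    the relaxed cores a generalized core definition permits."""
    counts = Counter()
    for kmers in kmer_sets:
        counts.update(kmers)
    return {km for km, n in counts.items() if n >= quorum}

sets = [{"ACG", "CGT", "GTT"}, {"ACG", "CGT"}, {"ACG", "GTA"}]
print(quorum_core(sets, quorum=3))  # {'ACG'}
print(quorum_core(sets, quorum=2))  # {'ACG', 'CGT'}
```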
ABSTRACT
MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
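The score statistics referred to here parallel classical Karlin-Altschul theory for BLAST; as a hedged sketch of how an exponential-tail fit separates homology from noise, with the tail parameters lambda and K assumed to be estimated from the empirical score distribution (all values below are illustrative only):

```python
import math

def expected_hits(score, m, n, lam, K):
    """Karlin-Altschul-style expected number of chance alignments of
    at least this score between a query of length m and a target
    (here: pangenome) of total length n: E = K * m * n * exp(-lam*S)."""
    return K * m * n * math.exp(-lam * score)

def significant(score, m, n, lam, K, e_cutoff=1e-5):
    """Accept a local alignment as likely homology if its E-value
    falls below the chosen cutoff."""
    return expected_hits(score, m, n, lam, K) < e_cutoff

# Illustrative parameters; the paper fits its own tail distribution.
print(significant(120, m=500, n=5_000_000, lam=0.25, K=0.04))  # True
```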
ABSTRACT
Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data, and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. We also introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. We show through extensive noise-free simulation, a large lung cancer dataset, and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than 1 GB of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine.
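One of the commonly used validity conditions the abstract alludes to is, in most formulations, a sum condition: in every sample, a parent clone's cell frequency must cover the summed frequencies of its children. A minimal checker, with an illustrative data layout rather than SubMARine's API:

```python
def satisfies_sum_condition(children, freqs, tol=1e-6):
    """children: parent clone ID -> list of child clone IDs;
    freqs: clone ID -> per-sample subclonal frequencies.
    Returns False if any child set overshoots its parent anywhere."""
    for parent, kids in children.items():
        for s in range(len(freqs[parent])):
            if sum(freqs[k][s] for k in kids) > freqs[parent][s] + tol:
                return False
    return True

freqs = {"germline": [1.0, 1.0], "A": [0.7, 0.4], "B": [0.2, 0.5]}
print(satisfies_sum_condition({"germline": ["A", "B"]}, freqs))  # True
print(satisfies_sum_condition({"A": ["B"]}, freqs))              # False
```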
Subjects
Algorithms, Computational Biology/methods, Mutation/genetics, Neoplasms, Computer Simulation, Molecular Evolution, High-Throughput Nucleotide Sequencing, Humans, Neoplasms/classification, Neoplasms/genetics, Whole Genome Sequencing
ABSTRACT
The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relaxed this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they gave an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes that have equal numbers of duplicates of any marker. Therefore, it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few tens of thousands of markers, which we demonstrate on simulated and real data.
Subjects
Computational Biology, Gene Rearrangement/genetics, Genome/genetics, Genomics, Algorithms, Genetic Models, Linear Programming
ABSTRACT
The inversion distance, that is, the distance between two unichromosomal genomes with the same content when only inversions of DNA segments are allowed, can be computed exactly thanks to a pioneering approach of Hannenhalli and Pevzner from 1995. In 2000, El-Mabrouk extended the inversion model to perform the comparison of unichromosomal genomes with unequal contents, combining inversions with insertions and deletions (indels) of DNA segments, giving rise to the inversion-indel distance. However, only a heuristic was provided for its computation. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, that can represent several genome rearrangements (including inversions). In 2006, Bergeron, Mixtacki and Stoye showed that the DCJ distance can be computed in linear time with a very simple procedure. As a consequence, in 2010 we gave a linear-time algorithm to compute the DCJ-indel distance. This result allowed the inversion-indel model to be revisited from another angle. In 2013, we could show that, when the diagram that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. In the present work we complete the study of the inversion-indel distance by giving the first algorithm to compute it exactly even in the presence of bad components.
Subjects
Genomics/methods, INDEL Mutation/genetics, Algorithms, Gene Rearrangement/genetics
ABSTRACT
Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short- to mid-range repeats. Currently, accurate long-read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.
ABSTRACT
BACKGROUND: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for their reconstruction from highly diverged genomes such as those of plants. RESULTS: Our work relies on an established workflow for the reconstruction of ancestral plant genomes, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. CONCLUSIONS: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported by Badouin et al. (2017). Using local genome rearrangement, not only the marker-based but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.
Subjects
Chromosome Mapping/methods, Genomics/methods, Magnoliopsida/genetics, Computer Simulation, Molecular Evolution, Gene Order, Plant Genome, Genetic Models, Phylogeny, Synteny/genetics
ABSTRACT
Recent large-scale community sequencing efforts allow the identification of genomic regions that show signatures of natural selection at an unprecedented level of detail. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2018, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound: a conceptually very simple one using suffix trees, and a second one using the positional Burrows-Wheeler Transform that is also very efficient in practice.
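To fix the pattern in question: a maximal perfect haplotype block is a set of at least two haplotypes agreeing on a column interval, extendable neither by rows nor by columns. Below is a naive definitional checker under one common formalization (the paper's contribution is finding all such blocks in linear time, e.g. via the positional Burrows-Wheeler transform, which is not shown here):

```python
def is_maximal_perfect_block(H, rows, i, j):
    """H: list of haplotype sequences (0/1 lists); rows: candidate
    haplotype indices (>= 2); [i, j]: column interval (inclusive).
    Checks equality on the interval plus row- and column-maximality."""
    word = tuple(H[rows[0]][i:j + 1])
    if len(rows) < 2 or any(tuple(H[r][i:j + 1]) != word for r in rows):
        return False
    # Row-maximal: no haplotype outside `rows` carries the same word.
    if any(tuple(H[r][i:j + 1]) == word
           for r in set(range(len(H))) - set(rows)):
        return False
    # Column-maximal: extending the interval breaks the agreement.
    extend_left = i > 0 and len({H[r][i - 1] for r in rows}) == 1
    extend_right = j < len(H[0]) - 1 and len({H[r][j + 1] for r in rows}) == 1
    return not (extend_left or extend_right)

H = [[0, 1, 0, 1], [1, 1, 0, 0], [1, 1, 0, 1]]
print(is_maximal_perfect_block(H, rows=[1, 2], i=0, j=2))  # True
```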
ABSTRACT
The dramatic decrease in time and cost for generating genetic sequence data has opened up vast opportunities in molecular systematics, one of which is the ability to decipher the evolutionary history of strains of a species. At this fine systematic resolution, the standard markers are too crude to provide a phylogenetic signal. Nevertheless, among prokaryotes, genome dynamics in the form of horizontal gene transfer (HGT) between organisms and gene loss seem to provide far richer information by affecting both gene order and gene content. The "synteny index" (SI) between a pair of genomes combines these latter two factors, allowing comparison of genomes with unequal gene content, together with order considerations of their common genes. Although this approach is useful for classifying close relatives, no rigorous statistical modeling of it has been suggested. Such modeling is valuable, as it allows observed measures to be transformed into estimates of time periods during evolution, yielding the "additivity" of the measure. To the best of our knowledge, there is no other additivity proof for other gene order/content measures under HGT. Here, we provide a first statistical model and analysis for the SI measure. We model the "gene neighborhood" as a "birth-death-immigration" process affected by the HGT activity over the genome, and analytically relate the HGT rate and time to the expected SI. This model is asymptotic and thus provides accurate results, assuming infinite-size genomes. Therefore, we also developed a heuristic model following an "exponential decay" function, accounting for biologically realistic values, which performed well in simulations. Applying this model to 1,133 prokaryotes partitioned into 39 clusters at the rank of genus yields that the average number of genome dynamics events per gene at the phylogenetic depth of genus is around one half, with significant variability between genera. This result extends and confirms similar results obtained for individual genera in different manners.
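For concreteness, a common formalization of the SI averages, over all shared genes, the overlap between a gene's 2k-neighborhoods in the two genomes; the following sketch assumes circular genomes and unique gene labels, and the exact definitional details may differ from the paper's:

```python
def synteny_index(g1, g2, k=5):
    """SI of two genomes given as lists of gene identifiers: for each
    shared gene, the fraction of its 2k neighbors in g1 that are also
    among its 2k neighbors in g2, averaged over all shared genes."""
    pos1 = {g: i for i, g in enumerate(g1)}
    pos2 = {g: i for i, g in enumerate(g2)}
    shared = set(pos1) & set(pos2)

    def neighborhood(genome, i):
        n = len(genome)
        return {genome[(i + d) % n] for d in range(-k, k + 1) if d != 0}

    return sum(
        len(neighborhood(g1, pos1[g]) & neighborhood(g2, pos2[g])) / (2 * k)
        for g in shared) / len(shared)

g1 = list(range(20))
g2 = list(range(20)); g2[5], g2[6] = g2[6], g2[5]  # one local swap
print(round(synteny_index(g1, g2, k=3), 3))
```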
Subjects
Horizontal Gene Transfer, Genetic Techniques, Genetic Models, Synteny, Microbial Genome, Phylogeny
ABSTRACT
Mesenteric infection by the parasitic blood fluke Schistosoma bovis is a common veterinary problem in Africa and the Middle East, and occasionally in the Mediterranean Region. The species also has the ability to form interspecific hybrids with the human parasite S. haematobium, with natural hybridization observed in West Africa, presenting possible zoonotic transmission. Additionally, this exchange of alleles between species may dramatically influence disease dynamics and parasite evolution. We have generated a 374 Mb assembly of the S. bovis genome using Illumina and PacBio-based technologies. Despite infecting different hosts and organs, the genome sequences of S. bovis and S. haematobium appeared strikingly similar, with 97% sequence identity. The two species share 98% of protein-coding genes, with an average sequence identity of 97.3% at the amino acid level. Genome comparison identified large continuous parts of the genome (up to several hundred kb) showing almost 100% sequence identity between S. bovis and S. haematobium. This similarity is unlikely to be a result of genome conservation and provides further evidence of natural interspecific hybridization between S. bovis and S. haematobium. Our results suggest that foreign DNA obtained by interspecific hybridization was maintained in the population through multiple meiosis cycles and that hybrids were sexually reproductive, producing viable offspring. The S. bovis genome assembly forms a highly valuable resource for studying schistosome evolution and exploring genetic regions that are associated with species-specific phenotypic traits.
Subjects
Genetic Hybridization/genetics, Schistosoma/genetics, Africa, West Africa, Animals, Base Sequence/genetics, Cattle, Chromosome Mapping/methods, DNA/genetics, Genome/genetics, Mitochondrial Genome/genetics, Genetic Hybridization/physiology, Middle East, Phylogeny, Proteome/genetics, Species Specificity, Trematoda/genetics, Whole Genome Sequencing/methods
ABSTRACT
The advent of high-throughput sequencing (HTS) technologies raises a major concern about storage and transmission of the data they produce. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences, ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new compression method that addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa dataset, our method outperforms all other tested tools, providing a 30% compression ratio improvement in single-end mode compared with the best-performing state-of-the-art HTS-specific compression method in our experiments.
Subjects
Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/trends, Software, Algorithms, Data Compression, Genome/genetics
ABSTRACT
Ancestral genome reconstruction is an important task in analyzing the evolution of genomes. Recent progress in sequencing ancient DNA has led to the publication of so-called paleogenomes and allows the integration of this sequencing data into genome evolution analyses. However, the de novo assembly of ancient genomes is usually fragmented, due to, among other factors, DNA degradation over time. Integrated phylogenetic assembly addresses the issue of genome fragmentation in the ancient DNA assembly while aiming to improve the reconstruction of all ancient genomes in the phylogeny simultaneously. The fragmented assembly of an ancient genome can be represented as an assembly graph, indicating contradicting ordering information of contigs. In this setting, our approach is to compare the ancient data with extant finished genomes. We generalize a reconstruction approach minimizing the Single-Cut-or-Join rearrangement distance to multifurcating trees and include edge lengths to improve the reconstruction in practice. The result is a polynomial-time algorithm that incorporates additional ancient DNA data at one node of the tree, yielding consistent reconstructions of ancestral genomes.
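The distance underlying this approach is simple to state: with a genome represented as its set of adjacencies between gene extremities, every adjacency present in exactly one of two genomes costs a single cut or join, so the distance is the size of the symmetric difference of the adjacency sets. A minimal sketch (representation chosen by us for illustration):

```python
def scj_distance(adjacencies_a, adjacencies_b):
    """Single-Cut-or-Join distance between two genomes given as sets
    of adjacencies (frozensets of gene extremities): the symmetric
    difference counts the cuts and joins needed to transform one
    adjacency set into the other."""
    return len(adjacencies_a ^ adjacencies_b)

# Gene extremities written as (gene, 'h') / (gene, 't').
a = {frozenset([(1, 'h'), (2, 't')]), frozenset([(2, 'h'), (3, 't')])}
b = {frozenset([(1, 'h'), (3, 't')]), frozenset([(2, 'h'), (3, 'h')])}
print(scj_distance(a, b))  # 4: cut two adjacencies of A, join two of B
```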