RESUMO
BACKGROUND: Given two genomes that have diverged by a series of rearrangements, we infer minimum Double Cut-and-Join (DCJ) scenarios to explain their organization differences, coupled with indel scenarios to explain their intergene size distribution, where DCJs themselves also alter the sizes of broken intergenes. RESULTS: We give a polynomial-time algorithm that, given two genomes with arbitrary intergene size distributions, outputs a DCJ scenario which optimizes on the number of DCJs, and given this optimal number of DCJs, optimizes on the total sum of the sizes of the indels. CONCLUSIONS: We show that there is a valuable information in the intergene sizes concerning the rearrangement scenario itself. On simulated data we show that statistical properties of the inferred scenarios are closer to the true ones than DCJ only scenarios, i.e. scenarios which do not handle intergene sizes.
Assuntos
Rearranjo Gênico/genética , Genoma , Modelos Genéticos , Algoritmos , Mutação INDELRESUMO
Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes. In contrast, we generalize and fully automatize the design of DP pseudoknot prediction algorithms. For this purpose, we formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the treewidth tw of the fatgraph, and its output represents a [Formula: see text] algorithm (and even possibly [Formula: see text] in simple energy models) for predicting the MFE folding of an RNA of length n. We demonstrate, for the most common pseudoknot classes, that our automatically generated algorithms achieve the same complexities as reported in the literature for hand-crafted schemes. Our framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
RESUMO
Hard graph problems are ubiquitous in Bioinformatics, inspiring the design of specialized Fixed-Parameter Tractable algorithms, many of which rely on a combination of tree-decomposition and dynamic programming. The time/space complexities of such approaches hinge critically on low values for the treewidth tw of the input graph. In order to extend their scope of applicability, we introduce the TREE-DIET problem, i.e. the removal of a minimal set of edges such that a given tree-decomposition can be slimmed down to a prescribed treewidth [Formula: see text]. Our rationale is that the time gained thanks to a smaller treewidth in a parameterized algorithm compensates the extra post-processing needed to take deleted edges into account. Our core result is an FPT dynamic programming algorithm for TREE-DIET, using [Formula: see text] time and space. We complement this result with parameterized complexity lower-bounds for stronger variants (e.g., NP-hardness when [Formula: see text] or [Formula: see text] is constant). We propose a prototype implementation for our approach which we apply on difficult instances of selected RNA-based problems: RNA design, sequence-structure alignment, and search of pseudoknotted RNAs in genomes, revealing very encouraging results. This work paves the way for a wider adoption of tree-decomposition-based algorithms in Bioinformatics.
RESUMO
Genome rearrangements are mutations affecting large portions of a genome, and a reversal is one of the most studied genome rearrangements in the literature through the Sorting by Reversals (SbR) problem. SbR is solvable in polynomial time on signed permutations (i.e., the gene orientation is known), and it is NP-hard on unsigned permutations. This problem (and many others considering genome rearrangements) models genome as a list of its genes in the order they appear, ignoring all other information present in the genome. Recent works claimed that the incorporation of the size of intergenic regions, i.e., sequences of nucleotides between genes, may result in better estimators for the real distance between genomes. Here we introduce the Sorting Signed Permutations by Intergenic Reversals problem, that sorts a signed permutation using reversals both on gene order and intergenic sizes. We show that this problem is NP-hard by a reduction from the 3-partition problem. Then, we propose a 2-approximation algorithm for it. Finally, we also incorporate intergenic indels (i.e., insertions or deletions of intergenic regions) to overcome a limitation of sorting by conservative events (such as reversals) and propose two approximation algorithms.
Assuntos
DNA Intergênico/genética , Rearranjo Gênico/genética , Genômica/legislação & jurisprudência , Algoritmos , Mutação INDEL/genética , Modelos Genéticos , Mutação/genéticaRESUMO
Synthetic biology has boomed since the early 2000s when it started being shown that it was possible to efficiently synthetize compounds of interest in a much more rapid and effective way by using other organisms than those naturally producing them. However, to thus engineer a single organism, often a microbe, to optimise one or a collection of metabolic tasks may lead to difficulties when attempting to obtain a production system that is efficient, or to avoid toxic effects for the recruited microorganism. The idea of using instead a microbial consortium has thus started being developed in the last decade. This was motivated by the fact that such consortia may perform more complicated functions than could single populations and be more robust to environmental fluctuations. Success is however not always guaranteed. In particular, establishing which consortium is best for the production of a given compound or set thereof remains a great challenge. This is the problem we address in this paper. We thus introduce an initial model and a method that enable to propose a consortium to synthetically produce compounds that are either exogenous to it, or are endogenous but where interaction among the species in the consortium could improve the production line.
Assuntos
Algoritmos , Consórcios Microbianos , Biologia Sintética/métodos , Acetatos/metabolismo , Bactérias/metabolismo , Biotecnologia , Glicerol/metabolismo , Propilenoglicóis/metabolismoRESUMO
Given two genomes possibly with duplicate genes, the exemplar distance problem is that of removing all but one copy of each gene in each genome, so as to minimize the distance between the two reduced genomes according to some measure. Let $((s,t))$-exemplar distance denote the exemplar distance problem on two genomes $(G_1)$ and $(G_2)$, where each gene occurs at most $(s)$ times in $(G_1)$ and at most $(t)$ times in $(G_2)$. We show that the simplest nontrivial variant of the exemplar distance problem, $((1,2))$-Exemplar Distance, is already hard to approximate for a wide variety of distance measures, including both popular genome rearrangement measures such as adjacency disruptions, signed reversals, and signed double-cut-and-joins, and classic string edit distance measures such as Levenshtein and Hamming distances.