Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 28
Filtrar
1.
Genome Res ; 33(7): 1198-1207, 2023 07.
Artigo em Inglês | MEDLINE | ID: mdl-37253540

RESUMO

Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences and associate to each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach merging the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3× to 21× compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5× to 39× compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.


Assuntos
Algoritmos , Software , Análise de Sequência de DNA , Genômica
2.
Bioinformatics ; 40(2)2024 02 01.
Artigo em Inglês | MEDLINE | ID: mdl-38262343

RESUMO

MOTIVATION: Recent advancements in long-read RNA sequencing have enabled the examination of full-length isoforms, previously uncaptured by short-read sequencing methods. An alternative powerful method for studying isoforms is through the use of barcoded short-read RNA reads, for which a barcode indicates whether two short-reads arise from the same molecule or not. Such techniques included the 10x Genomics linked-read based SParse Isoform Sequencing (SPIso-seq), as well as Loop-Seq, or Tell-Seq. Some applications, such as novel-isoform discovery, require very high coverage. Obtaining high coverage using long reads can be difficult, making barcoded RNA-seq data a valuable alternative for this task. However, most annotation pipelines are not able to work with a set of short reads instead of a single transcript, also not able to work with coverage gaps within a molecule if any. In order to overcome this challenge, we present an RNA-seq assembler that allows the determination of the expressed isoform per barcode. RESULTS: In this article, we present cloudrnaSPAdes, a tool for assembling full-length isoforms from barcoded RNA-seq linked-read data in a reference-free fashion. Evaluating it on simulated and real human data, we found that cloudrnaSPAdes accurately assembles isoforms, even for genes with high isoform diversity. AVAILABILITY AND IMPLEMENTATION: cloudrnaSPAdes is a feature release of a SPAdes assembler and version used for this article is available at https://github.com/1dayac/cloudrnaSPAdes-release.


Assuntos
Genômica , RNA , Humanos , RNA/genética , Análise de Sequência de RNA/métodos , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo , RNA-Seq , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Transcriptoma
3.
Bioinformatics ; 39(11)2023 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-37862229

RESUMO

MOTIVATION: Many important problems in Bioinformatics (e.g. assembly or multiassembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g. contigs) which are common to all solutions. Previous research on safety has focused on polynomially time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed to reduce the total ILP time, the most efficient of these being based on a recursive group-testing procedure. RESULTS: Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27 000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than thought in the notoriously hard RNA assembly problem. AVAILABILITY AND IMPLEMENTATION: https://github.com/algbio/mfd-safety.


Assuntos
Algoritmos , Programação Linear , Biologia Computacional , RNA
4.
Bioinformatics ; 39(8)2023 08 01.
Artigo em Inglês | MEDLINE | ID: mdl-37494467

RESUMO

MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS: We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. AVAILABILITY AND IMPLEMENTATION: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.


Assuntos
Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA , Alinhamento de Sequência , Biologia Computacional , Software
5.
Physiol Plant ; 176(3): e14398, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-38894544

RESUMO

RNA-seq data is currently generated in numerous non-model organisms that lack a reference genome. Nevertheless, the confirmation of gene expression levels using RT-qPCR remains necessary, and the existing techniques do not seamlessly interface with the omics pipeline workflow. Developing primers for many targets by utilising orthologous genes can be a laborious, imprecise, and subjective process, particularly for plant species that are not commonly studied and do not have a known genome. We have developed a primer design tool, named PABLOG, that analyses the alignments generated from long or short RNA-seq reads and a reference orthologous gene. PABLOG scans, much like a bee searching several flowers for pollen, and presents a sorted list of potential exon-exon junction locations, ranked according to their reliability. Through computational analysis across the whole genomes of several non-model species, we demonstrate that PABLOG performs more effectively than other methods in identifying exon-exon junctions since it generates significantly fewer false-positive results. Examination of candidate regions at the gene level, in conjunction with laboratory studies, shows that the suggested primers successfully amplified particular targets in non-model plants without any presence of genomic contamination. Our tool includes a consensus sequence feature that enables the complete process of primer design, from aligning with the target gene to determining amplification parameters. The utility can be accessed via the GitHub repository located at: https://github.com/tools4plant-omics/PABLOG.


Assuntos
Primers do DNA , Abelhas/genética , Primers do DNA/genética , Éxons/genética , Software , Animais , Genes de Plantas/genética , Genoma de Planta/genética , Biologia Computacional/métodos
6.
Bioinformatics ; 35(5): 769-777, 2019 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-30101335

RESUMO

MOTIVATION: Discovering the evolution of a tumor may help identify driver mutations and provide a more comprehensive view on the history of the tumor. Recent studies have tackled this problem using multiple samples sequenced from a tumor, and due to clinical implications, this has attracted great interest. However, such samples usually mix several distinct tumor subclones, which confounds the discovery of the tumor phylogeny. RESULTS: We study a natural problem formulation requiring to decompose the tumor samples into several subclones with the objective of forming a minimum perfect phylogeny. We propose an Integer Linear Programming formulation for it, and implement it into a method called MIPUP. We tested the ability of MIPUP and of four popular tools LICHeE, AncesTree, CITUP, Treeomics to reconstruct the tumor phylogeny. On simulated data, MIPUP shows up to a 34% improvement under the ancestor-descendant relations metric. On four real datasets, MIPUP's reconstructions proved to be generally more faithful than those of LICHeE. AVAILABILITY AND IMPLEMENTATION: MIPUP is available at https://github.com/zhero9/MIPUP as open source. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Neoplasias , Humanos , Mutação , Neoplasias/genética , Filogenia , Programação Linear , Software
7.
Bioinformatics ; 31(7): 1133-5, 2015 Apr 01.
Artigo em Inglês | MEDLINE | ID: mdl-25398608

RESUMO

MOTIVATION: Recent studies sequenced tumor samples from the same progenitor at different development stages and showed that by taking into account the phylogeny of this development, single-nucleotide variant (SNV) calling can be improved. Accurate SNV calls can better reveal early-stage tumors, identify mechanisms of cancer progression or help in drug targeting. RESULTS: We present SNV-PPILP, a fast and easy to use tool for refining GATK's Unified Genotyper SNV calls, for multiple samples assumed to form a phylogeny. We tested SNV-PPILP on simulated data, with a varying number of samples, SNVs, read coverage and violations of the perfect phylogeny assumption. We always match or improve the accuracy of GATK, with a significant improvement on low read coverage. AVAILABILITY AND IMPLEMENTATION: SNV-PPILP, available at cs.helsinki.fi/gsa/snv-ppilp/, is written in Python and requires the free ILP solver lp_solve. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Neoplasias/genética , Filogenia , Polimorfismo de Nucleotídeo Único/genética , Programação Linear , Análise de Sequência de DNA/métodos , Software , Biologia Computacional , Simulação por Computador , Humanos
8.
BMC Bioinformatics ; 15 Suppl 9: S5, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-25252805

RESUMO

BACKGROUND: Multi-assembly problems have gathered much attention in the last years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq), or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. RESULTS: In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems where also a set of subpaths, or pairs of subpaths, of the graph have to be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial-time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial-time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de RNA/métodos , Transcriptoma , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/economia , Análise de Sequência de RNA/economia
9.
Artigo em Inglês | MEDLINE | ID: mdl-39269812

RESUMO

Minimum flow decomposition (MFD) is a common problem across various fields of Computer Science, where a flow is decomposed into a minimum set of weighted paths. However, in Bioinformatics applications, such as RNA transcript or quasi-species assembly, the flow is erroneous since it is obtained from noisy read coverages. Typical generalizations of the MFD problem to handle errors are based on least-squares formulations or modelling the erroneous flow values as ranges. All of these are thus focused on error handling at the level of individual edges. In this paper, we interpret the flow decomposition problem as a robust optimization problem and lift error-handling from individual edges to solution paths. As such, we introduce a new minimum path-error flow decomposition problem, for which we give an Integer Linear Programming formulation. Our experimental results reveal that our formulation can account for errors significantly better, by lowering the inaccuracy rate by 30-50% compared to previous error-handling formulations, with computational requirements that remain practical.

10.
BMC Bioinformatics ; 14 Suppl 5: S15, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-23734627

RESUMO

BACKGROUND: Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms), depending on the individual, on the tissue the cell is in, or in response to some stimuli. Recent RNA-Seq technology allows for new high-throughput ways for isoform identification and quantification based on short reads, and various methods have been put forward for this non-trivial problem. RESULTS: In this paper we propose a novel radically different method based on minimum-cost network flows. This has a two-fold advantage: on the one hand, it translates the problem as an established one in the field of network flows, which can be solved in polynomial time, with different existing solvers; on the other hand, it is general enough to encompass many of the previous proposals under the least sum of squares model. Our method works as follows: in order to find the transcripts which best explain, under a given fitness model, a splicing graph resulting from an RNA-Seq experiment, we find a min-cost flow in an offset flow network, under an equivalent cost model. Under very weak assumptions on the fitness model, the optimal flow can be computed in polynomial time. Parsimoniously splitting the flow back into few path transcripts can be done with any of the heuristics and approximations available from the theory of network flows. In the present implementation, we choose the simple strategy of repeatedly removing the heaviest path. CONCLUSIONS: We proposed a new very general method based on network flows for a multiassembly problem arising from isoform identification and quantification with RNA-Seq. Experimental results on prediction accuracy show that our method is very competitive with popular tools such as Cufflinks and IsoLasso. Our tool, called Traph (Transcrips in gRAPHs), is available at: http://www.cs.helsinki.fi/gsa/traph/.


Assuntos
Perfilação da Expressão Gênica/métodos , Isoformas de RNA/metabolismo , Análise de Sequência de RNA/métodos , Algoritmos , Processamento Alternativo , Humanos , Modelos Estatísticos , Software
11.
Bioinformatics ; 28(1): 123-4, 2012 Jan 01.
Artigo em Inglês | MEDLINE | ID: mdl-22084252

RESUMO

SUMMARY: The advent of high-throughput sequencers (HTS) introduced the need of new tools in order to analyse the large amount of data that those machines are able to produce. The mandatory first step for a wide range of analyses is the alignment of the sequences against a reference genome. We present a major update to our rNA (randomized Numerical Aligner) tool. The main feature of rNA is the fact that it achieves an accuracy greater than the majority of other tools in a feasible amount of time. rNA executables and source codes are freely downloadable at http://iga-rna.sourceforge.net/. CONTACT: vezzi@appliedgenomics.org; delfabbro@appliedgenomics.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Software , Humanos
12.
bioRxiv ; 2023 Feb 02.
Artigo em Inglês | MEDLINE | ID: mdl-36778435

RESUMO

Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs, giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the Drosophilia melanogaster and the Caenorhabditis elegans genome. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible computational costs and either no or a small increase in the number of misassemblies.

13.
Artigo em Inglês | MEDLINE | ID: mdl-35104222

RESUMO

Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.


Assuntos
Algoritmos , Heurística , RNA/genética
14.
Genome Biol ; 24(1): 168, 2023 07 17.
Artigo em Inglês | MEDLINE | ID: mdl-37461051

RESUMO

Sequence alignments are the foundations of life science research, but most innovation so far focuses on optimal alignments, while information derived from suboptimal solutions is ignored. We argue that one optimal alignment per pairwise sequence comparison is a reasonable approximation when dealing with very similar sequences but is insufficient when exploring the biodiversity of the protein universe at tree-of-life scale. To overcome this limitation, we introduce pairwise alignment-safety to uncover the amino acid positions robustly shared across all suboptimal solutions. We implement EMERALD, a software library for alignment-safety inference, and apply it to 400k sequences from the SwissProt database.


Assuntos
Algoritmos , Software , Animais , Sequência de Aminoácidos , Alinhamento de Sequência , Proteínas/genética , Proteínas/química , Aves
15.
bioRxiv ; 2023 Jul 27.
Artigo em Inglês | MEDLINE | ID: mdl-37546844

RESUMO

Motivation: Recent advancements in long-read RNA sequencing have enabled the examination of full-length isoforms, previously uncaptured by short-read sequencing methods. An alternative powerful method for studying isoforms is through the use of barcoded short-read RNA reads, for which a barcode indicates whether two short-reads arise from the same molecule or not. Such techniques included the 10x Genomics linked-read based SParse Isoform Sequencing (SPIso-seq), as well as Loop-Seq, or Tell-Seq. Some applications, such as novel-isoform discovery, require very high coverage. Obtaining high coverage using long reads can be difficult, making barcoded RNA-seq data a valuable alternative for this task. However, most annotation pipelines are not able to work with a set of short reads instead of a single transcript, also not able to work with coverage gaps within a molecule if any. In order to overcome this challenge, we present an RNA-seq assembler allowing the determination of the expressed isoform per barcode. Results: In this paper, we present cloudrnaSPAdes, a tool for assembling full-length isoforms from barcoded RNA-seq linked-read data in a reference-free fashion. Evaluating it on simulated and real human data, we found that cloudrnaSPAdes accurately assembles isoforms, even for genes with high isoform diversity. Availability: cloudrnaSPAdes is a feature release of a SPAdes assembler and available at https://cab.spbu.ru/software/cloudrnaspades/.

16.
Genome Biol ; 24(1): 136, 2023 Jun 09.
Artigo em Inglês | MEDLINE | ID: mdl-37296461

RESUMO

We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.


Assuntos
Algoritmos , Software , Análise de Sequência de DNA , Bactérias
17.
J Comput Biol ; 29(12): 1270-1287, 2022 12.
Artigo em Inglês | MEDLINE | ID: mdl-36288562

RESUMO

Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning, to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm although reporting a better coverage, it also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by ≈20%) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time (4-5×) and space performance (1.2-2.2×), resulting in a better and more practical approach for bioinformatic applications of flow decomposition.


Assuntos
Algoritmos , RNA , RNA/genética , Biologia Computacional/métodos , Heurística , Análise de Sequência de RNA
18.
J Comput Biol ; 29(11): 1252-1267, 2022 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-36260412

RESUMO

Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.


Assuntos
Biologia Computacional , Programação Linear , Algoritmos , RNA
19.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3673-3684, 2022.
Artigo em Inglês | MEDLINE | ID: mdl-34847041

RESUMO

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths), that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporate practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.


Assuntos
Algoritmos , Genômica , Genoma , Sequência de Bases , RNA
20.
Artigo em Inglês | MEDLINE | ID: mdl-29994355

RESUMO

Gap filling has emerged as a natural sub-problem of many de novo genome assembly projects. The gap filling problem generally asks for an $s$s-$t$t path in an assembly graph whose length matches the gap length estimate. Several methods have addressed it, but only few have focused on strategies for dealing with multiple gap filling solutions and for guaranteeing reliable results. Such strategies include reporting only unique solutions, or exhaustively enumerating all filling solutions and heuristically creating their consensus. Our main contribution is a new method for reliable gap filling: filling gaps with those sub-paths common to all gap filling solutions. We call these partial solutions safe, following the framework of (Tomescu and Medvedev, RECOMB 2016). We give an efficient safe algorithm running in $O(dm)$O(dm) time and space, where $d$d is the gap length estimate and $m$m is the number of edges of the assembly graph. To show the benefits of this method, we implemented this algorithm for the problem of filling gaps in scaffolds. Our experimental results on bacterial and on conservative human assemblies show that, on average, our method can retrieve over 73 percent more safe and correct bases as compared to previous methods, with a similar precision.


Assuntos
Genômica/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Genoma Bacteriano/genética , Genoma Humano/genética , Humanos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA