Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 29
Filtrar
Mais filtros

Base de dados
Tipo de documento
País/Região como assunto
Intervalo de ano de publicação
1.
Bioinformatics ; 39(8)2023 08 01.
Artigo em Inglês | MEDLINE | ID: mdl-37494467

RESUMO

MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS: We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. AVAILABILITY AND IMPLEMENTATION: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.


Assuntos
Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA , Alinhamento de Sequência , Biologia Computacional , Software
2.
BMC Bioinformatics ; 23(1): 167, 2022 May 07.
Artigo em Inglês | MEDLINE | ID: mdl-35525918

RESUMO

BACKGROUND: De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus additional data such as genetic linkage maps, optical maps, or Hi-C data is needed to resolve the complete structure of the genome. Most of the previous work uses the additional data to order and orient contigs. RESULTS: Here we introduce a framework to guide genome assembly with additional data. Our approach is based on clustering the reads, such that each read in each cluster originates from nearby positions in the genome according to the additional data. These sets are then assembled independently and the resulting contigs are further assembled in a hierarchical manner. We implemented our approach for genetic linkage maps in a tool called HGGA. CONCLUSIONS: Our experiments on simulated and real Pacific Biosciences long reads and genetic linkage maps show that HGGA produces a more contiguous assembly with less contigs and from 1.2 to 9.8 times higher NGA50 or N50 than a plain assembly of the reads and 1.03 to 6.5 times higher NGA50 or N50 than a previous approach integrating genetic linkage maps with contig assembly. Furthermore, also the correctness of the assembly remains similar or improves as compared to an assembly using only the read data.


Assuntos
Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos
3.
Bioinformatics ; 37(14): 1946-1952, 2021 08 04.
Artigo em Inglês | MEDLINE | ID: mdl-32462192

RESUMO

MOTIVATION: The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. RESULTS: In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods, e.g. FDBG cannot be constructed on large scale datasets, or cannot support both addition and deletion, e.g. BiFrost. AVAILABILITY AND IMPLEMENTATION: DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , Sequenciamento de Nucleotídeos em Larga Escala , Projetos de Pesquisa , Análise de Sequência de DNA
4.
Bioinformatics ; 37(4): 473-481, 2021 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-32926162

RESUMO

MOTIVATION: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. RESULTS: We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. AVAILABILITY AND IMPLEMENTATION: viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Quase-Espécies , Software , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA
5.
Bioinformatics ; 36(3): 682-689, 2020 02 01.
Artigo em Inglês | MEDLINE | ID: mdl-31504206

RESUMO

MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, make assembly and alignment of them challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large sized genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. RESULTS: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaption and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. AVAILABILITY AND IMPLEMENTATION: Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica , Software , Algoritmos , Genoma Humano , Humanos , Mapeamento por Restrição , Análise de Sequência de DNA
6.
BMC Bioinformatics ; 21(1): 285, 2020 Jul 06.
Artigo em Inglês | MEDLINE | ID: mdl-32631227

RESUMO

BACKGROUND: The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly but still, genome-wide assemblies solely based on read data cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies but it has mostly been applied in a post-processing stage after contig assembly. RESULTS: We propose OPTICALKERMIT which directly integrates genome wide optical maps into contig assembly. We show how genome wide optical maps can be used to localize reads on the genome and then we adapt the Kermit method, which originally incorporated genetic linkage maps to the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome wide optical maps to the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OPTICALKERMIT produces an assembly with almost three times higher NGA50 with a lower number of misassemblies on real A. thaliana reads. CONCLUSIONS: OPTICALKERMIT successfully incorporates optical mapping data directly to contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies.


Assuntos
Genoma/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Imagens com Corantes Sensíveis à Voltagem/métodos , Humanos
7.
Bioinformatics ; 35(18): 3250-3256, 2019 09 15.
Artigo em Inglês | MEDLINE | ID: mdl-30698651

RESUMO

MOTIVATION: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps-called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data: optical maps and sequence data which will facilitate the usage of optical maps in the sequence assembly step itself. RESULTS: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. AVAILABILITY AND IMPLEMENTATION: The software for aligning optical maps to de Bruijn graph, omGraph is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , Genoma , Mapeamento por Restrição , Análise de Sequência de DNA
8.
Bioinformatics ; 33(6): 799-806, 2017 03 15.
Artigo em Inglês | MEDLINE | ID: mdl-27273673

RESUMO

Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. Results: We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k -mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. Availability and Implementation: LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/ . Contact: leena.salmela@cs.helsinki.fi.


Assuntos
Análise de Sequência de DNA/métodos , Software , Algoritmos , Escherichia coli/genética , Genoma , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Saccharomyces cerevisiae/genética
10.
Bioinformatics ; 30(24): 3506-14, 2014 Dec 15.
Artigo em Inglês | MEDLINE | ID: mdl-25165095

RESUMO

MOTIVATION: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. RESULTS: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementaion: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Animais , Escherichia coli/genética , Genômica/métodos , Papagaios/genética , Software , Leveduras/genética
11.
Mol Ecol ; 24(19): 4886-900, 2015 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-26331775

RESUMO

Insect flight is one of the most energetically demanding activities in the animal kingdom, yet for many insects flight is necessary for reproduction and foraging. Moreover, dispersal by flight is essential for the viability of species living in fragmented landscapes. Here, working on the Glanville fritillary butterfly (Melitaea cinxia), we use transcriptome sequencing to investigate gene expression changes caused by 15 min of flight in two contrasting populations and the two sexes. Male butterflies and individuals from a large metapopulation had significantly higher peak flight metabolic rate (FMR) than female butterflies and those from a small inbred population. In the pooled data, FMR was significantly positively correlated with genome-wide heterozygosity, a surrogate of individual inbreeding. The flight experiment changed the expression level of 1513 genes, including genes related to major energy metabolism pathways, ribosome biogenesis and RNA processing, and stress and immune responses. Males and butterflies from the population with high FMR had higher basal expression of genes related to energy metabolism, whereas females and butterflies from the small population with low FMR had higher expression of genes related to ribosome/RNA processing and immune response. Following the flight treatment, genes related to energy metabolism were generally down-regulated, while genes related to ribosome/RNA processing and immune response were up-regulated. These results suggest that common molecular mechanisms respond to flight and can influence differences in flight metabolic capacity between populations and sexes.


Assuntos
Borboletas/genética , Voo Animal , Expressão Gênica , Caracteres Sexuais , Transcriptoma , Animais , Borboletas/fisiologia , Metabolismo Energético/genética , Feminino , Finlândia , Masculino , Dados de Sequência Molecular , Polimorfismo de Nucleotídeo Único , Análise de Sequência de RNA
12.
Algorithms Mol Biol ; 19(1): 14, 2024 Apr 05.
Artigo em Inglês | MEDLINE | ID: mdl-38581000

RESUMO

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task but they are often optimised for small k as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O ( N k w ) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for long values of k. This space usage is an important limitation as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O ( N + u k w ) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k - 1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k - 1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O ( σ k ) time in the worst case, σ being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping a competitive performance to get the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.

13.
PLoS One ; 18(11): e0294415, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-38019768

RESUMO

K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Often the use of long k-mers is preferred because they can be uniquely associated with a specific genomic region. Unfortunately, it is not possible to reliably extract long k-mers in high error rate reads with standard exact k-mer counting methods. We propose SAKE, a method to extract long k-mers from high error rate reads by utilizing strobemers and consensus k-mer generation through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. Conversely, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK already drops to 50%. We show that SAKE can extract more k-mers from uncorrected high error rate reads compared to exact k-mer counting. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.


Assuntos
Algoritmos , Genômica , Análise de Sequência de DNA/métodos , Genoma , Biologia Computacional , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software
14.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1550-1562, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-35853050

RESUMO

During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbor multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses haplotype assembly and quantification. ViQUF obtains a first draft of the assembly graph from a de Bruijn graph. Then, solving a min-cost flow over a flow network built for each pair of adjacent vertices based on their paired-end information creates an approximate paired assembly graph with suggested frequency values as edge labels, which is the first frequency estimation. Then, original haplotypes are obtained through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs with their frequency estimations. Results on real and simulated data show that ViQUF is at least four times faster using at most half of the memory than previous methods, while maintaining, and in some cases outperforming, the high quality of assembly and frequency estimation of overlap graph-based methodologies, which are known to be more accurate but slower than the de Bruijn graph-based approaches.


Assuntos
Quase-Espécies , Software , Quase-Espécies/genética , Sequenciamento de Nucleotídeos em Larga Escala , Haplótipos/genética , Análise de Sequência de DNA/métodos , Algoritmos
15.
BMC Bioinformatics ; 13: 255, 2012 Oct 03.
Artigo em Inglês | MEDLINE | ID: mdl-23031320

RESUMO

BACKGROUND: For the development of genome assembly tools, some comprehensive and efficiently computable validation measures are required to assess the quality of the assembly. The mostly used N50 measure summarizes the assembly results by the length of the scaffold (or contig) overlapping the midpoint of the length-order concatenation of scaffolds (contigs). Especially for scaffold assemblies it is non-trivial to combine a correctness measure to the N50 values, and the current methods for doing this are rather involved. RESULTS: We propose a simple but rigorous normalized N50 assembly metric that combines N50 with such a correctness measure; assembly is split into as many parts as necessary to align each part to the reference. For scalability, we first compute maximal local approximate matches between scaffolds and reference in distributed manner, and then proceed with co-linear chaining to find a global alignment. Best alignment is removed from the scaffold and the process is iterated with the remaining scaffold content in order to split the scaffold into correctly aligning parts. The proposed normalized N50 metric is then the N50 value computed for the final correctly aligning parts. As a side result of independent interest, we show how to modify co-linear chaining to restrict gaps to produce a more sensible global alignment. CONCLUSIONS: We propose and implement a comprehensive and efficient approach to compute a metric that summarizes scaffold assembly correctness and length. Our implementation can be downloaded from http://www.cs.helsinki.fi/group/scaffold/normalizedN50/.


Assuntos
Mapeamento de Sequências Contíguas/métodos , Genoma/genética , Análise de Sequência de DNA/métodos
16.
Bioinformatics ; 27(11): 1455-61, 2011 Jun 01.
Artigo em Inglês | MEDLINE | ID: mdl-21471014

RESUMO

MOTIVATION: Current sequencing technologies produce a large number of erroneous reads. The sequencing errors present a major challenge in utilizing the data in de novo sequencing projects as assemblers have difficulties in dealing with errors. RESULTS: We present Coral which corrects sequencing errors by forming multiple alignments. Unlike previous tools for error correction, Coral can utilize also bases distant from the error in the correction process because the whole read is present in the alignment. Coral is easily adjustable to reads produced by different sequencing technologies like Illumina Genome Analyzer and Roche/454 Life Sciences sequencing platforms because the sequencing error model can be defined by the user. We show that our method is able to reduce the error rate of reads more than previous methods. AVAILABILITY: The source code of Coral is freely available at http://www.cs.helsinki.fi/u/lmsalmel/coral/.


Assuntos
Algoritmos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA
17.
Bioinformatics ; 27(23): 3259-65, 2011 Dec 01.
Artigo em Inglês | MEDLINE | ID: mdl-21998153

RESUMO

MOTIVATION: Assembling genomes from short read data has become increasingly popular, but the problem remains computationally challenging especially for larger genomes. We study the scaffolding phase of sequence assembly where preassembled contigs are ordered based on mate pair data. RESULTS: We present MIP Scaffolder that divides the scaffolding problem into smaller subproblems and solves these with mixed integer programming. The scaffolding problem can be represented as a graph and the biconnected components of this graph can be solved independently. We present a technique for restricting the size of these subproblems so that they can be solved accurately with mixed integer programming. We compare MIP Scaffolder to two state of the art methods, SOPRA and SSPACE. MIP Scaffolder is fast and produces better or as good scaffolds as its competitors on large genomes. AVAILABILITY: The source code of MIP Scaffolder is freely available at http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/. CONTACT: leena.salmela@cs.helsinki.fi.


Assuntos
Genoma , Análise de Sequência de DNA/métodos , Software , Algoritmos , Animais , Caenorhabditis elegans/genética , Escherichia coli/genética , Sequenciamento de Nucleotídeos em Larga Escala , Pseudomonas syringae/genética
18.
Bioinformatics ; 26(10): 1284-90, 2010 May 15.
Artigo em Inglês | MEDLINE | ID: mdl-20378555

RESUMO

MOTIVATION: High-throughput sequencing technologies produce large sets of short reads that may contain errors. These sequencing errors make de novo assembly challenging. Error correction aims to reduce the error rate prior assembly. Many de novo sequencing projects use reads from several sequencing technologies to get the benefits of all used technologies and to alleviate their shortcomings. However, combining such a mixed set of reads is problematic as many tools are specific to one sequencing platform. The SOLiD sequencing platform is especially problematic in this regard because of the two base color coding of the reads. Therefore, new tools for working with mixed read sets are needed. RESULTS: We present an error correction tool for correcting substitutions, insertions and deletions in a mixed set of reads produced by various sequencing platforms. We first develop a method for correcting reads from any sequencing technology producing base space reads such as the SOLEXA/Illumina and Roche/454 Life Sciences sequencing platforms. We then further refine the algorithm to correct the color space reads from the Applied Biosystems SOLiD sequencing platform together with normal base space reads. Our new tool is based on the SHREC program that is aimed at correcting SOLEXA/Illumina reads. Our experiments show that we can detect errors with 99% sensitivity and >98% specificity if the combined sequencing coverage of the sets is at least 12. We also show that the error rate of the reads is greatly reduced. AVAILABILITY: The JAVA source code is freely available at http://www.cs.helsinki.fi/u/lmsalmel/hybrid-shrec/ CONTACT: leena.salmela@cs.helsinki.fi


Assuntos
Análise de Sequência de DNA/métodos , Algoritmos , Sequência de Bases , Biologia Computacional , Alinhamento de Sequência
19.
iScience ; 24(1): 101956, 2021 Jan 22.
Artigo em Inglês | MEDLINE | ID: mdl-33437938

RESUMO

DNA and RNA sequencing is a core technology in biological and medical research. The high throughput of these technologies and the consistent development of new experimental assays and biotechnologies demand the continuous development of methods to analyze the resulting data. The RECOMB Satellite Workshop on Massively Parallel Sequencing brings together leading researchers in computational genomics to discuss emerging frontiers in algorithm development for massively parallel sequencing data. The 10th meeting in this series, RECOMB-Seq 2020, was scheduled to be held in Padua, Italy, but due to the ongoing COVID-19 pandemic, the meeting was carried out virtually instead. The online workshop featured keynote talks by Paola Bonizzoni and Zamin Iqbal, two highlight talks, ten regular talks, and three short talks. Seven of the works presented in the workshop are featured in this edition of iScience, and many of the talks are available online in the RECOMB-Seq 2020 YouTube channel.

20.
Artigo em Inglês | MEDLINE | ID: mdl-34529572

RESUMO

The extraction of k-mers from reads is an important task in many bioinformatics applications, such as all DNA sequence analysis methods based on de Bruijn graphs. These methods tend to be more accurate when the used k-mers are unique in the analyzed DNA, and thus the use of longer k-mers is preferred. When the read lengths of short read sequencing technologies increase, the error rate will become the determining factor for the largest possible value of k. Here we propose LoMeX which uses spaced seeds to extract long k-mers accurately even in the presence of sequencing errors. Our experiments show that LoMeX can extract long k-mers from current Illumina reads with a similar or higher recall than a standard k-mer counting tool. Furthermore, our experiments on simulated data show that when the read length further increases enabling even longer k-mers, the performance of standard k-mer counters declines, whereas LoMeX still extracts long k-mers successfully.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA