Search | BVS Search Portal

CoMSA: compression of protein multiple sequence alignment files.

Deorowicz, Sebastian; Walczyszyn, Joanna; Debudaj-Grabysz, Agnieszka.

Bioinformatics ; 35(2): 227-234, 2019 01 15.

Article in English | MEDLINE | ID: mdl-30010777

ABSTRACT

Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, occupies 40-230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge. Results: We propose a novel compression algorithm, CoMSA, designed especially for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles both FASTA and Stockholm files. It offers a compression ratio up to six times better than that of other commonly used compressors, such as gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio. Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa. Supplementary material: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Compression, Protein Databases, Genomics, Sequence Alignment, Algorithms, Computational Biology, DNA Sequence Analysis
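
The CoMSA abstract above names a positional Burrows-Wheeler transform generalized to non-binary alphabets. The following is a minimal sketch of that general idea only, not CoMSA's actual implementation: the function names are hypothetical, the input is assumed to be a list of equal-length aligned rows, and the later compression stages are omitted. Each column is emitted in an order determined by the preceding columns, which tends to group identical symbols into runs.

```python
def pbwt_columns(rows):
    """Positional BWT sketch over aligned rows of equal length.

    Returns one transformed string per alignment column.
    """
    if not rows:
        return []
    order = list(range(len(rows)))  # current permutation of row indices
    columns = []
    for j in range(len(rows[0])):
        # Emit column j in the order induced by columns 0..j-1.
        columns.append("".join(rows[i][j] for i in order))
        # Stable sort by the symbol in column j, so rows stay ordered
        # by their reversed prefix ending at column j.
        order = sorted(order, key=lambda i: rows[i][j])
    return columns


def inverse_pbwt_columns(columns):
    """Rebuild the original rows from the transformed columns."""
    if not columns:
        return []
    n_rows = len(columns[0])
    rows = [[] for _ in range(n_rows)]
    order = list(range(n_rows))
    for col in columns:
        # The symbol at rank r belongs to the row currently at rank r.
        for rank, i in enumerate(order):
            rows[i].append(col[rank])
        # Replay the same stable sort used by the forward transform.
        order = sorted(order, key=lambda i: rows[i][-1])
    return ["".join(r) for r in rows]
```

The inverse replays the same stable sort, so the transform is lossless: applying `inverse_pbwt_columns` to the output of `pbwt_columns` returns the original rows.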

Whisper: read sorting allows robust mapping of DNA sequencing data.

Deorowicz, Sebastian; Debudaj-Grabysz, Agnieszka; Gudys, Adam; Grabowski, Szymon.

Bioinformatics ; 35(12): 2043-2050, 2019 06 01.

Article in English | MEDLINE | ID: mdl-30407485

ABSTRACT

MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time. RESULTS: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism, together with storing temporary data on disk, results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution runs in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms, Software, Base Sequence, Genome, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis
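
The Whisper abstract above describes sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. The sketch below illustrates that idea on a toy scale and is not Whisper's implementation: all function names are hypothetical, only exact matches are found, the suffix array is built naively, and no parallelism or disk storage is involved.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def build_suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """Return sorted start positions of exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-symbol prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose m-symbol prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


def map_reads(reference, reads):
    """Map each distinct read to a list of (forward position, strand) hits."""
    rc_reference = reverse_complement(reference)
    sa_fwd = build_suffix_array(reference)
    sa_rev = build_suffix_array(rc_reference)
    n = len(reference)
    hits = {}
    # Sorting the reads means neighbouring queries share prefixes.
    for read in sorted(set(reads)):
        fwd = [(p, "+") for p in find_exact(reference, sa_fwd, read)]
        # Convert reverse-complement coordinates to forward coordinates.
        rev = sorted((n - p - len(read), "-")
                     for p in find_exact(rc_reference, sa_rev, read))
        hits[read] = fwd + rev
    return hits
```

A real mapper must also tolerate mismatches and indels, which this exact-match sketch does not attempt.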

KMC 2: fast and resource-frugal k-mer counting.

Deorowicz, Sebastian; Kokot, Marek; Grabowski, Szymon; Debudaj-Grabysz, Agnieszka.

Bioinformatics ; 31(10): 1569-76, 2015 May 15.

Article in English | MEDLINE | ID: mdl-25609798

ABSTRACT

MOTIVATION: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known as k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory. RESULTS: We present a novel method for k-mer counting, which on large datasets is about twice as fast as the strongest competitors (Jellyfish 2, KMC 1) and uses about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, but replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers significantly reduces the I/O, while a highly parallel overall architecture achieves unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min, on a 6-core Intel i7 PC with a solid-state disk.

Subject(s)

Algorithms, Computational Biology/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Software, Animals, Humans
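
The KMC 2 abstract above refers to minimizers, of which its signatures are a selected subset. As a rough illustration of plain minimizers only, the sketch below splits a read into runs of consecutive k-mers that share the same minimizer. It assumes the common definition of a minimizer as the lexicographically smallest m-symbol substring of a k-mer. The function names are hypothetical, and the signature selection rule and (k, x)-mers of KMC 2 are not modelled.

```python
def minimizer(kmer, m):
    """Lexicographically smallest m-symbol substring of a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def split_by_minimizer(read, k, m):
    """Split a read into (minimizer, substring) runs.

    Each substring covers a maximal run of consecutive k-mers that
    share the same minimizer, so a whole run can be written to one
    bin as a single string instead of many overlapping k-mers.
    """
    runs = []
    current, start = None, 0
    for i in range(len(read) - k + 1):
        mz = minimizer(read[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The previous run ends with the k-mer starting at i - 1.
            runs.append((current, read[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        runs.append((current, read[start:]))
    return runs
```

For instance, with k = 4 and m = 2 the read GATTACA splits into GATTA (minimizer AT) and TTACA (minimizer AC): nine symbols stored in place of four separate 4-mers.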

Disk-based k-mer counting on a PC.

Deorowicz, Sebastian; Debudaj-Grabysz, Agnieszka; Grabowski, Szymon.

BMC Bioinformatics ; 14: 160, 2013 May 16.

Article in English | MEDLINE | ID: mdl-23679007

ABSTRACT

BACKGROUND: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. These include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. RESULTS: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, supplied as a gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. For most tested settings of this problem and mammalian-size data, no other algorithm can accomplish this task in comparable time. Our solution is also memory-frugal; most competing algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data. CONCLUSIONS: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer.

Subject(s)

Algorithms, Genomics/methods, Microcomputers, Animals, Caenorhabditis elegans/genetics, Human Genome, Humans, Sequence Alignment, DNA Sequence Analysis, Software
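
The k-mer counting problem defined in the KMC abstract above can be stated in a few lines of code. This is a naive in-memory sketch of the problem itself, with a hypothetical function name. It is not the disk-based, parallel KMC algorithm, and it would not scale to the genome-size inputs the abstract discusses.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count occurrences of every k-symbol substring in the sequences."""
    counts = Counter()
    for seq in sequences:
        # A sequence shorter than k contributes no k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

Calling `count_kmers(["ACGTACGT"], 3)` counts ACG and CGT twice each and GTA and TAC once each. Holding every distinct k-mer in one in-memory table is what becomes infeasible at mammalian scale, and that is the motivation the abstract gives for a disk-based approach.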

FAMSA: Fast and accurate multiple sequence alignment of huge protein families.

Deorowicz, Sebastian; Debudaj-Grabysz, Agnieszka; Gudys, Adam.

Sci Rep ; 6: 33964, 2016 Sep 27.

Article in English | MEDLINE | ID: mdl-27670777

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the expense of time or memory requirements, which are an order of magnitude lower than those of the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
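
The FAMSA abstract above mentions the longest common subsequence (LCS) measure for pairwise similarity. The sketch below computes the LCS length with the textbook dynamic program, as an illustration of the measure only. The abstract does not say how FAMSA computes or normalizes it, so the `lcs_similarity` normalization and both function names are assumptions made here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous symbol of a
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Hypothetical normalization: LCS length over the shorter length."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0
```

This dynamic program takes time proportional to the product of the two sequence lengths, which would be costly across all pairs in a family of hundreds of thousands of sequences.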
