Búsqueda | Portal de Búsqueda de la BVS Colombia

SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores.

Meng, Jintao; Wang, Bingqiang; Wei, Yanjie; Feng, Shengzhong; Balaji, Pavan.

BMC Bioinformatics ; 15 Suppl 9: S2, 2014.

Artículo en Inglés | MEDLINE | ID: mdl-25253533

RESUMEN

BACKGROUND: There is a widening gap between the throughput of massive parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods requiring long execution time and large amount of memory on a single workstation limit their use on these massive data. RESULTS: This paper presents a highly scalable assembler named as SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically preform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate, especially it generated the longest N50 contig sizes for Fish and Yanhuang datasets. CONCLUSIONS: In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler.

Asunto(s)

Genómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Algoritmos , Animales , Genoma , Humanos

Counting Kmers for Biological Sequences at Large Scale.

Ge, Jianqiu; Meng, Jintao; Guo, Ning; Wei, Yanjie; Balaji, Pavan; Feng, Shengzhong.

Interdiscip Sci ; 12(1): 99-108, 2020 Mar.

Artículo en Inglés | MEDLINE | ID: mdl-31734873

RESUMEN

Counting the abundance of all the distinct kmers in biological sequence data is a fundamental step in bioinformatics. These applications include de novo genome assembly, error correction, etc. With the development of sequencing technology, the sequence data in a single project can reach Petabyte-scale or Terabyte-scale nucleotides. Counting demand for the abundance of these sequencing data is beyond the memory and computing capacity of single computing node, and how to process it efficiently is a challenge on a high-performance computing cluster. As such, we propose SWAPCounter, a highly scalable distributed approach for kmer counting. This approach is embedded with an MPI streaming I/O module for loading huge data set at high speed, and a counting bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter has competitive performance with two other tools on shared memory environment, KMC2, and MSPKmerCounter. Moreover, SWAPCounter also shows the highest scalability under strong scaling experiments. In our experiment on Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as baseline) when processing 4 TB sequence data of 1000 Genomes. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.

Asunto(s)

Biología Computacional/métodos , Genómica/métodos , Algoritmos , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN , Programas Informáticos

Meeting report: the terabase metagenomics workshop and the vision of an Earth microbiome project.

Gilbert, Jack A; Meyer, Folker; Antonopoulos, Dion; Balaji, Pavan; Brown, C Titus; Brown, Christopher T; Desai, Narayan; Eisen, Jonathan A; Evers, Dirk; Field, Dawn; Feng, Wu; Huson, Daniel; Jansson, Janet; Knight, Rob; Knight, James; Kolker, Eugene; Konstantindis, Kostas; Kostka, Joel; Kyrpides, Nikos; Mackelprang, Rachel; McHardy, Alice; Quince, Christopher; Raes, Jeroen; Sczyrba, Alexander; Shade, Ashley; Stevens, Rick.

Stand Genomic Sci ; 3(3): 243-8, 2010 Dec 25.

Artículo en Inglés | MEDLINE | ID: mdl-21304727

RESUMEN

Between July 18(th) and 24(th) 2010, 26 leading microbial ecology, computation, bioinformatics and statistics researchers came together in Snowbird, Utah (USA) to discuss the challenge of how to best characterize the microbial world using next-generation sequencing technologies. The meeting was entitled "Terabase Metagenomics" and was sponsored by the Institute for Computing in Science (ICiS) summer 2010 workshop program. The aim of the workshop was to explore the fundamental questions relating to microbial ecology that could be addressed using advances in sequencing potential. Technological advances in next-generation sequencing platforms such as the Illumina HiSeq 2000 can generate in excess of 250 billion base pairs of genetic information in 8 days. Thus, the generation of a trillion base pairs of genetic information is becoming a routine matter. The main outcome from this meeting was the birth of a concept and practical approach to exploring microbial life on earth, the Earth Microbiome Project (EMP). Here we briefly describe the highlights of this meeting and provide an overview of the EMP concept and how it can be applied to exploration of the microbiome of each ecosystem on this planet.

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA