BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Chen, Jinxiang; Li, Fuyi; Wang, Miao; Li, Junlong; Marquez-Lago, Tatiana T; Leier, André; Revote, Jerico; Li, Shuqin; Liu, Quanzhong; Song, Jiangning

Affiliation

Chen J; Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China.
Li F; Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia.
Wang M; Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia.
Li J; Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia.
Marquez-Lago TT; Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China.
Leier A; Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China.
Revote J; Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States.
Li S; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States.
Liu Q; Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States.
Song J; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States.

Front Big Data ; 4: 727216, 2021.

Article in English | MEDLINE | ID: mdl-35118375

ABSTRACT

BACKGROUND:

Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species lack a complete genome altogether. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools.
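
To make the notion of an SSR concrete, the following is a minimal sketch of detecting perfect tandem repeats in a single read with a regular expression. It is an illustration only and is not the algorithm used by PERF or BigPERF; the function name `find_ssrs`, the motif size limit of 1 to 6 bases, the minimum of three repeat units, and the 12-base minimum length are all assumptions chosen for the example.

```python
import re

# Assumed thresholds for illustration: motifs of 1-6 bases repeated at least
# three times in a row (one captured copy plus two or more further copies).
# The lazy quantifier makes the regex prefer the shortest repeating motif.
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def find_ssrs(seq, min_length=12):
    """Return (start, end, motif, repeat_count) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. Only repeats spanning at
    least `min_length` bases are reported.
    """
    hits = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        length = match.end() - match.start()
        if length >= min_length:
            hits.append((match.start(), match.end(), motif, length // len(motif)))
    return hits


if __name__ == "__main__":
    read = "TTG" + "AC" * 6 + "GGT"
    for start, end, motif, count in find_ssrs(read):
        print(f"{motif} x{count} at [{start}, {end})")
```

A production tool would also have to handle imperfect repeats, ambiguous bases, and motif equivalence across strands, none of which this sketch attempts.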

RESULTS:

In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a distributed, big data fashion. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read pair merging and SSR mining on very large-scale DNA sequence data.
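
The read pair merging step can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring scheme implemented in FLASH or BigFLASH: the function names, the 10-base minimum overlap, and the 10% mismatch tolerance are hypothetical, and base quality scores are ignored. The idea is to reverse-complement the second read and look for the suffix-prefix overlap with the lowest mismatch ratio, preferring longer overlaps on ties.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.1):
    """Merge a forward read with its mate; return None if no overlap qualifies.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    read1 = read1.upper()
    mate = reverse_complement(read2.upper())
    best = None  # (mismatch_ratio, overlap_length)
    # Scan from the longest candidate overlap down to the shortest, so that
    # among equally good overlaps the longest one is kept.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], mate[:olen]))
        ratio = mismatches / olen
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olen)
    if best is None:
        return None
    # Keep read1 in full and append the part of the mate beyond the overlap.
    return read1 + mate[best[1]:]


if __name__ == "__main__":
    fragment = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCC"
    r1 = fragment[:24]
    r2 = reverse_complement(fragment[12:])
    print(merge_pair(r1, r2) == fragment)
```

Because each read pair is merged independently of all others, a function of this shape can be applied to disjoint chunks of the input in parallel, which is the property a Hadoop-style map step relies on.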

CONCLUSIONS:

The excellent performance of BigFiRSt is mainly attributable to Hadoop big data technology, which merges read pairs and mines SSRs through parallel and distributed computing on clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.

Keywords

Hadoop; Simple Sequence Repeats (SSR); big data; next-generation sequencing; read pairs

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: English Journal: Front Big Data Year: 2021 Document type: Article Affiliation country: China