Search | Biblioteca Virtual em Saúde (Virtual Health Library)

RabbitKSSD: accelerating genome distance estimation on modern multi-core architectures.

Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Yi, Huiguang; Wang, Hua; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 39(11), 2023 11 01.

Article in English | MEDLINE | ID: mdl-37971961

ABSTRACT

SUMMARY: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.7× to 19.8× over Kssd for the time-consuming sketch generation and distance computation on commonly used workstations. In addition, it significantly outperforms Mash, BinDash, and Dashing2. Moreover, RabbitKSSD can efficiently perform all-vs-all distance computation for all RefSeq complete bacterial genomes (455 GB in FASTA format) in just 2 min on a 64-core workstation. AVAILABILITY AND IMPLEMENTATION: RabbitKSSD is available at https://github.com/RabbitBio/RabbitKSSD.

Subjects

Bacterial Genome , Software , Biological Evolution

RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data.

Yan, Lifeng; Yin, Zekun; Zhang, Hao; Zhao, Zhan; Wang, Mingkai; Müller, André; Kallenborn, Felix; Wichmann, Alexander; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Methods ; 216: 39-50, 2023 08.

Article in English | MEDLINE | ID: mdl-37330158

ABSTRACT

Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.

Subjects

Data Compression , Software , High-Throughput Nucleotide Sequencing , Quality Control , Algorithms , DNA Sequence Analysis

RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures.

Zhang, Hao; Chang, Qixin; Yin, Zekun; Xu, Xiaoming; Wei, Yanjie; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 38(10): 2932-2933, 2022 05 13.

Article in English | MEDLINE | ID: mdl-35561184

ABSTRACT

MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

COVID-19 , Viruses , Algorithms , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis , Software , Viruses/genetics

RabbitMash: accelerating hash-based genome analysis on modern multi-core architectures.

Yin, Zekun; Xu, Xiaoming; Zhang, Jinxiao; Wei, Yanjie; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 37(6): 873-875, 2021 05 05.

Article in English | MEDLINE | ID: mdl-32845281

ABSTRACT

MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient, highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100 321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
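
Note: the hash-based distance estimation mentioned in this abstract can be illustrated with a minimal, single-threaded sketch. It is not code from Mash or RabbitMash. It assumes a bottom-s MinHash sketch and the commonly cited Mash distance formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index. The function names (`kmers`, `sketch`, `jaccard_estimate`, `mash_distance`) and the use of BLAKE2b as the hash are illustrative assumptions, and canonical (reverse-complement) k-mers are ignored for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """64-bit hash of a k-mer (BLAKE2b is a stand-in chosen for this example)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, size):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size):
    """Estimate the Jaccard index from two bottom-s sketches."""
    set_a, set_b = set(a), set(b)
    # Keep the `size` smallest hashes of the union, then count those in both sketches.
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate j and k-mer length k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    s1 = sketch("ACGTTGCAAGGCTTAACCGGTTAACG", k=5, size=16)
    s2 = sketch("ACGTTGCAAGGCTTTACCGGTTAACG", k=5, size=16)
    j = jaccard_estimate(s1, s2, 16)
    print(j, mash_distance(j, 5))
```

Comparing sketches of fixed size instead of full k-mer sets is what keeps all-versus-all comparisons tractable; the speedups reported in the abstract come from engineering (multi-threading, vectorization, fast I/O) that this sketch does not attempt to reproduce.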

Subjects

Algorithms , Software , Computers , Genome , Genomics

RabbitQC: high-speed scalable quality control for sequencing data.

Yin, Zekun; Zhang, Hao; Liu, Meiyang; Zhang, Wen; Song, Honglei; Lan, Haidong; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 37(4): 573-574, 2021 05 01.

Article in English | MEDLINE | ID: mdl-32790850

ABSTRACT

MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Nanopores , Software , High-Throughput Nucleotide Sequencing , Quality Control , DNA Sequence Analysis

RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms.

Zhang, Hao; Song, Honglei; Xu, Xiaoming; Chang, Qixin; Wang, Mingkai; Wei, Yanjie; Yin, Zekun; Schmidt, Bertil; Liu, Weiguo.

IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 2341-2348, 2023.

Article in English | MEDLINE | ID: mdl-36327193

ABSTRACT

The continuous growth of generated sequencing data leads to the development of a variety of associated bioinformatics tools. However, many of them are not able to fully exploit the resources of modern multi-core systems since they are bottlenecked by parsing files, leading to slow execution times. This motivates the design of an efficient method for parsing sequencing data that can exploit the power of modern hardware, especially for modern CPUs with fast storage devices. We have developed RabbitFX, a fast, efficient, and easy-to-use framework for processing biological sequencing data on modern multi-core platforms. It can efficiently read FASTA and FASTQ files by combining a lightweight parsing method with an optimized formatting implementation. Furthermore, we provide user-friendly and modularized C++ APIs that can be easily integrated into applications in order to increase their file parsing speed. As proof-of-concept, we have integrated RabbitFX into three I/O-intensive applications: fastp, Ktrim, and Mash. Our evaluation shows that the inclusion of RabbitFX leads to speedups of at least 11.6 (6.6), 2.4 (2.4), and 3.7 (3.2) compared to the original versions on plain (gzip-compressed) files, respectively. These case studies demonstrate that RabbitFX can be easily integrated into a variety of NGS analysis tools to significantly reduce associated runtimes. It is open source software available at https://github.com/RabbitBio/RabbitFX.

Subjects

Computational Biology , Software , High-Throughput Nucleotide Sequencing

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches.

Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Genome Biol ; 24(1): 121, 2023 05 17.

Article in English | MEDLINE | ID: mdl-37198663

ABSTRACT

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.

Subjects

Genome , Software , Nucleic Acid Databases , Cluster Analysis , Bacteria , Algorithms , Bacterial Genome

Facile preparation of MnO₂-TiO₂ nanotube arrays composite electrode for electrochemical detection of hydrogen peroxide.

Yang, Mengyao; Wu, Zhigang; Wang, Xixin; Yin, Zekun; Tan, Xu; Zhao, Jianling.

Talanta ; 244: 123407, 2022 Jul 01.

Article in English | MEDLINE | ID: mdl-35366513

ABSTRACT

The MnO2-TNTA composite electrodes were obtained through depositing MnO2 into TiO2 nanotube arrays (TNTA) by successive ionic layer adsorption reaction (SILAR) and subsequent hydrothermal method. The MnO2-TNTA nanocomposites were used as electrochemical sensors for the detection of hydrogen peroxide (H2O2). The preparation conditions of MnO2-TNTA electrodes and test conditions affect the electrochemical detection performance significantly. The optimal conditions are listed as follows: the number of SILAR cycles, 6 times; KMnO4 solution temperature, 50 °C; supporting electrolyte, 0.5 M NaOH. Under these conditions, the MnO2-TNTA electrode exhibits the best performance for detecting H2O2. The optimized MnO2-TNTA electrode has a minimum detection limit of 0.6 µM (S/N = 3) and a linear range of 5 µM ∼ 13 mM, which is much superior to the previously-reported electrodes. Moreover, the optimized MnO2-TNTA electrode possesses high selectivity, excellent stability and good reproducibility in the detection of H2O2. When used in the determination of H2O2 content in actual samples including disinfectant and milk, it also shows good accuracy, ideal recovery (96.00% ∼ 102.67%) and high precision (RSD < 4.0%).

Subjects

Manganese Compounds , Nanotubes , Electrochemical Techniques/methods , Electrodes , Hydrogen Peroxide/chemistry , Manganese Compounds/chemistry , Oxides/chemistry , Reproducibility of Results , Titanium

Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

Yin, Zekun; Lan, Haidong; Tan, Guangming; Lu, Mian; Vasilakos, Athanasios V; Liu, Weiguo.

Comput Struct Biotechnol J ; 15: 403-411, 2017.

Article in English | MEDLINE | ID: mdl-28883909

ABSTRACT

The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the amount of biological data is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed, as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
