ABSTRACT
MOTIVATION: Nanopore sequencing current signal data can be 'basecalled' into sequence information or analysed directly, with the capacity to identify diverse molecular features, such as DNA/RNA base modifications and secondary structures. However, raw signal data is large and complex, and there is a need for improved visualization strategies to facilitate signal analysis, exploration and tool development. RESULTS: Squigualiser (Squiggle visualiser) is a toolkit for intuitive, interactive visualization of sequence-aligned signal data, which currently supports both DNA and RNA sequencing data from Oxford Nanopore Technologies instruments. Squigualiser is compatible with a wide range of alternative signal-alignment software packages and enables visualization of both signal-to-read and signal-to-reference aligned data at single-base resolution. Squigualiser generates an interactive signal browser view (HTML file), in which the user can navigate across a genome/transcriptome region and customize the display. Multiple independent reads are integrated into a 'signal pileup' format and different datasets can be displayed as parallel tracks. Although other methods exist, Squigualiser provides the community with a software package purpose-built for raw signal data visualization, incorporating a range of new and existing features into a unified platform. AVAILABILITY AND IMPLEMENTATION: Squigualiser is an open-source package under an MIT licence: https://github.com/hiruna72/squigualiser. The software was developed using Python 3.8 and can be installed with pip or bioconda or executed directly using prebuilt binaries provided with each release.
Subject(s)
Nanopore Sequencing, Software, Nanopore Sequencing/methods, DNA Sequence Analysis/methods, Sequence Alignment/methods, RNA Sequence Analysis/methods
ABSTRACT
minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. Integrating the FPGA kernel into minimap2 posed significant challenges that we solved by accurately predicting the processing time on hardware while considering data transfer overheads, mitigating hardware scheduling overheads in a multi-threaded environment, and optimizing memory management for processing large realistic datasets. We demonstrate speed-ups in end-to-end run-time for data from both Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). minimap2-fpga is up to 79% and 53% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively, when mapping without base-level alignment. When mapping with base-level alignment, minimap2-fpga is up to 62% and 10% faster than minimap2 for [Formula: see text] ONT and [Formula: see text] PacBio datasets respectively. The accuracy is near-identical to that of original minimap2 for both ONT and PacBio data, when mapping both with and without base-level alignment. minimap2-fpga is supported on Intel FPGA-based systems (evaluations performed on an on-premise system) and Xilinx FPGA-based systems (evaluations performed on a cloud system). We also provide a well-documented library for the FPGA-accelerated chaining kernel to be used by future researchers developing sequence alignment software with limited hardware background.
Subject(s)
Algorithms, Software, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing, Sequence Alignment
ABSTRACT
Nanopore sequencing is being rapidly adopted in genomics. We recently developed SLOW5, a new file format with advantages for storage and analysis of raw signal data from nanopore experiments. Here we introduce slow5tools, an intuitive toolkit for handling nanopore data in SLOW5 format. Slow5tools enables lossless data conversion and a range of tools for interacting with SLOW5 files. Slow5tools uses multi-threading, multi-processing, and other engineering strategies to achieve fast data conversion and manipulation, including live FAST5-to-SLOW5 conversion during sequencing. We provide examples and benchmarking experiments to illustrate slow5tools usage, and describe the engineering principles underpinning its performance.
Subject(s)
Nanopore Sequencing, Nanopores, DNA Sequence Analysis, Genomics, Software, High-Throughput Nucleotide Sequencing
ABSTRACT
Nanopore sequencing depends on the FAST5 file format, which does not allow efficient parallel analysis. Here we introduce SLOW5, an alternative format engineered for efficient parallelization and acceleration of nanopore data analysis. Using the example of DNA methylation profiling of a human genome, analysis runtime is reduced from more than two weeks to approximately 10.5 h on a typical high-performance computer. SLOW5 is approximately 25% smaller than FAST5 and delivers consistent improvements on different computer architectures.
Subject(s)
Nanopore Sequencing, Nanopores, Data Analysis, Human Genome/genetics, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
ABSTRACT
BACKGROUND: Third-generation nanopore sequencers offer selective sequencing, or "Read Until", which allows genomic reads to be analyzed in real time and abandoned partway through if they do not belong to a genomic region of interest. This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. For selective sequencing to be effective, the analysis latency should be as low as possible so that unnecessary reads can be rejected as early as possible. However, existing methods that employ a subsequence dynamic time warping (sDTW) algorithm for this problem are so computationally intensive that even a massive workstation with dozens of CPU cores struggles to keep up with the data rate of a mobile phone-sized MinION sequencer. RESULTS: In this article, we present Hardware Accelerated Read Until (HARU), a resource-efficient hardware-software codesign-based method that exploits a low-cost and portable heterogeneous multiprocessor system-on-chip platform with on-chip field-programmable gate arrays (FPGA) to accelerate the sDTW-based Read Until algorithm. Experimental results show that HARU on a Xilinx FPGA embedded with a 4-core ARM processor is around 2.5× faster than a highly optimized multithreaded software version (around 85× faster than the existing unoptimized multithreaded software) running on a sophisticated server with a 36-core Intel Xeon processor for a SARS-CoV-2 dataset. The energy consumption of HARU is two orders of magnitude lower than that of the same application executing on the 36-core server. CONCLUSIONS: HARU demonstrates that nanopore selective sequencing is possible on resource-constrained devices through rigorous hardware-software optimizations. The source code for the HARU sDTW module is available as open source at https://github.com/beebdev/HARU, and an example application that uses HARU is at https://github.com/beebdev/sigfish-haru.
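For readers unfamiliar with sDTW, the following is a minimal Python sketch of the general technique, not the HARU implementation. It assumes plain floating-point samples and an absolute-difference cost; the function name `subsequence_dtw` is hypothetical, and real Read Until pipelines typically add signal normalisation and fixed-point arithmetic that are omitted here.

```python
import math

def subsequence_dtw(query, reference):
    """Return (best_cost, end_index) of the best match of `query` inside `reference`.

    Illustrative sketch only: absolute-difference cost, no normalisation.
    """
    m = len(reference)
    # Row 0 is all zeros, so a match may start at any reference position for free.
    prev = [0.0] * (m + 1)
    for q in query:
        # Column 0 stays infinite: a non-empty query prefix cannot match nothing.
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            dist = abs(q - reference[j - 1])
            # Standard DTW recurrence: vertical, horizontal, or diagonal step.
            curr[j] = dist + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may also end anywhere: take the minimum over the last row.
    end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[end], end - 1

cost, end = subsequence_dtw([1, 2, 3], [5, 5, 1, 2, 3, 5, 5])
```

The quadratic inner loop is the part that makes the algorithm expensive at sequencing data rates, which is the motivation the abstract gives for hardware acceleration.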
Subject(s)
COVID-19, Humans, DNA Sequence Analysis/methods, SARS-CoV-2/genetics, Software, Chromosome Mapping, Algorithms
ABSTRACT
BACKGROUND: Nanopore sequencing enables portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these outcomes requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. However, comparing raw nanopore signals to a biological reference sequence is a computationally complex task. The dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) is a crucial step in polishing sequencing data and identifying non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. RESULTS: By optimising memory, computations and load balancing between CPU and GPU, we demonstrate how f5c can perform ∼3-5× faster than an optimised version of the original CPU-only implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. CONCLUSIONS: Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with GPU optimised ABEA is available at https://github.com/hasindu2008/f5c.
Subject(s)
Computer Graphics, Nanopores, Computer-Assisted Signal Processing, Algorithms, Computational Biology, Databases as Topic, Human Genome, Humans, Sequence Analysis
ABSTRACT
De novo genome assembly is a challenging computational problem for which several pipelines have been developed. The advent of long-read sequencing technology has resulted in a new set of algorithmic approaches for the assembly process. In this work, we identify that one of these new and fast long-read assembly techniques (using Minimap2 and Miniasm) can be modified for the short-read assembly process. This possibility motivated us to customize a long-read assembly approach for applications in a short-read assembly scenario. Here, we compare and contrast our proposed de novo assembly pipeline (MiniSR) with three other recently developed programs for the assembly of bacterial and small eukaryotic genomes. We have documented two trade-offs: one between speed and accuracy and the other between contiguity and base-calling errors. Our proposed assembly pipeline shows a good balance in these trade-offs. The resulting pipeline is 6 and 2.2 times faster than the short-read assemblers SPAdes and SGA, respectively. MiniSR generates assemblies with higher N50 and NGA50 than SGA, although its assemblies are less complete and less accurate than those from SPAdes. A third tool, SOAPdenovo2, is as fast as our proposed pipeline but produces assemblies of poorer quality.
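As background on the contiguity metric mentioned above, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal Python sketch follows; the function name `n50` is hypothetical, and NGA50, which additionally requires alignment to a reference genome, is not shown.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (0 if empty)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

example_n50 = n50([2, 3, 4, 5, 6])
```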
Subject(s)
Consensus Sequence/genetics, Genomics/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Algorithms, Bacterial Genome/genetics, High-Throughput Nucleotide Sequencing
ABSTRACT
A variant caller is used to identify variations in an individual genome (compared to the reference genome) in a genome processing pipeline. For the sake of accuracy, modern variant callers perform many local re-assemblies on small regions of the genome using a graph-based algorithm. However, such graph-based data structures are inefficiently stored in the linear memory of modern computers, which in turn reduces computing efficiency. Therefore, variant calling can take several CPU hours for a typical human genome. We have sped up the local re-assembly algorithm with no impact on its accuracy, by the effective use of the memory hierarchy. The proposed algorithm maximises data locality so that the fast internal processor memory (cache) is efficiently used. By the increased use of caches, accesses to main memory are minimised. The resulting algorithm is up to twice as fast as the original one when executed on a commodity computer and could gain even more speed up on computers with less complex memory subsystems.
Subject(s)
Algorithms, Genetic Variation/genetics, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods
ABSTRACT
BACKGROUND: Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. the Smith-Waterman algorithm) is widely used for this purpose. Despite the use of data-level parallelisation, pairwise alignment remains time-consuming. Faster alignment algorithms exist, but they suffer from a lack of accuracy. RESULTS: In this paper, we present MEM-Align, a fast semi-global alignment algorithm for short DNA sequences that allows for affine-gap scoring and exploits sequence similarity. In contrast to traditional alignment methods (such as Smith-Waterman), where individual symbols are aligned, MEM-Align extracts Maximal Exact Matches (MEMs) using a bit-level parallel method and then looks for a subset of MEMs that forms the alignment using a novel dynamic programming method. MEM-Align tries to mimic the alignment produced by Smith-Waterman. As a result, for 99.9% of input sequence pairs, the computed alignment score is identical to the alignment score computed by Smith-Waterman. Yet MEM-Align is up to 14.5 times faster than the Smith-Waterman algorithm. The fast run-time is achieved by: (a) using a bit-level parallel method to extract MEMs; (b) processing MEMs rather than individual symbols; and (c) applying heuristics. CONCLUSIONS: MEM-Align is a potential candidate to replace other pairwise alignment algorithms used in processes such as DNA read-mapping and variant calling.
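To make the notion of a maximal exact match concrete, the following Python sketch enumerates MEMs between two short sequences by scanning each diagonal of the comparison matrix. It is only an illustration under simplifying assumptions: it is a plain symbol-by-symbol scan, not the bit-level parallel extraction described in the abstract, and the function name `maximal_exact_matches` and the `min_len` parameter are hypothetical.

```python
def maximal_exact_matches(a, b, min_len=1):
    """Return (start_in_a, start_in_b, length) for every maximal exact match.

    Illustrative O(len(a) * len(b)) scan; not the bit-parallel method.
    """
    mems = []
    # On diagonal d = i - j, a MEM is a run of equal symbols that cannot be
    # extended in either direction.
    for d in range(-(len(b) - 1), len(a)):
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # Flush a run that reaches the end of either sequence.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return mems

mems = maximal_exact_matches("ACGTAC", "ACGAAC", min_len=3)
```

An aligner in this style would then select a compatible subset of these matches and score the gaps between them, rather than filling a full dynamic programming matrix over individual symbols.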
Subject(s)
Algorithms, Sequence Alignment/methods, DNA Sequence Analysis/methods, Nucleotides/chemistry
ABSTRACT
The advent of Nanopore sequencing has realised portable genomic research and applications. However, state of the art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimisation and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We incorporated multi-index merging into the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2 GB RAM with negligible impact on accuracy.
Subject(s)
Human Genome/genetics, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Algorithms, Computer Storage Devices, Humans, Nanopore Sequencing/methods, DNA Sequence Analysis/methods, Software
ABSTRACT
MOTIVATION: The Variant Call Format (VCF) is widely used to store data about genetic variation. Variant calling workflows detect potential variants in large numbers of short sequence reads generated by DNA sequencing and report them in VCF format. To evaluate the accuracy of variant callers, it is critical to correctly compare their output against a reference VCF file containing a gold standard set of variants. However, comparing VCF files is a complicated task as an individual genomic variant can be represented in several different ways and is therefore not necessarily reported in a unique way by different software. RESULTS: We introduce a VCF normalization method called Best Alignment Normalisation (BAN) that results in more accurate VCF file comparison. BAN applies all the variations in a VCF file to the reference genome to create a sample genome, and then recalls the variants by aligning this sample genome back with the reference genome. Since the purpose of BAN is to get an accurate result at the time of VCF comparison, we define a better normalization method as the one resulting in less disagreement between the outputs of different VCF comparators. AVAILABILITY AND IMPLEMENTATION: The BAN Linux bash script along with required software are publicly available on https://sites.google.com/site/banadf16. CONTACT: A.Bayat@unsw.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
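The first step of the normalization idea described above (applying variants to the reference to obtain a sample genome) can be sketched in Python as follows. This is an illustrative sketch, not the BAN script: it assumes non-overlapping variants given as hypothetical `(pos, ref, alt)` tuples with 0-based positions on a single sequence, and it does not show the second step of realigning the sample genome to the reference to recall variants.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to `reference`."""
    sample = reference
    # Apply from right to left so earlier coordinates are not shifted by indels.
    for pos, ref, alt in sorted(variants, reverse=True):
        if sample[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        sample = sample[:pos] + alt + sample[pos + len(ref):]
    return sample

# Two different representations of the same 2-base deletion in a CA repeat.
sample_a = apply_variants("GCACAT", [(1, "CAC", "C")])
sample_b = apply_variants("GCACAT", [(2, "ACA", "A")])
```

Both representations yield the same sample sequence, which is why comparing variants through the reconstructed genome sidesteps the ambiguity in how a variant is written.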