
1.

CAREx: context-aware read extension of paired-end sequencing data.

Kallenborn, Felix; Schmidt, Bertil.

BMC Bioinformatics ; 25(1): 186, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38730374

ABSTRACT

BACKGROUND: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. RESULTS: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99 % for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. CONCLUSION: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx .

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans , Sequence Alignment/methods

2.

From GPUs to AI and quantum: three waves of acceleration in bioinformatics.

Schmidt, Bertil; Hildebrandt, Andreas.

Drug Discov Today ; 29(6): 103990, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38663581

ABSTRACT

The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.

Subject(s)

Artificial Intelligence , Computational Biology , Computational Biology/methods , Computer Graphics , Quantum Theory , Humans , Neural Networks, Computer , Drug Discovery/methods

3.

RabbitKSSD: accelerating genome distance estimation on modern multi-core architectures.

Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Yi, Huiguang; Wang, Hua; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 39(11)2023 11 01.

Article in English | MEDLINE | ID: mdl-37971961

ABSTRACT

SUMMARY: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.7× to 19.8× over Kssd for the time-consuming sketch generation and distance computation on commonly used workstations. In addition, it significantly outperforms Mash, BinDash, and Dashing2. Moreover, RabbitKSSD can efficiently perform all-vs-all distance computation for all RefSeq complete bacterial genomes (455 GB in FASTA format) in just 2 min on a 64-core workstation. AVAILABILITY AND IMPLEMENTATION: RabbitKSSD is available at https://github.com/RabbitBio/RabbitKSSD.

Subject(s)

Genome, Bacterial , Software , Biological Evolution

4.

MetaTransformer: deep metagenomic sequencing read classification using self-attention models.

Wichmann, Alexander; Buschong, Etienne; Müller, André; Jünger, Daniel; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

NAR Genom Bioinform ; 5(3): lqad082, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37705831

ABSTRACT

Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.

5.

RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data.

Yan, Lifeng; Yin, Zekun; Zhang, Hao; Zhao, Zhan; Wang, Mingkai; Müller, André; Kallenborn, Felix; Wichmann, Alexander; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Methods ; 216: 39-50, 2023 08.

Article in English | MEDLINE | ID: mdl-37330158

ABSTRACT

Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.

Subject(s)

Data Compression , Software , High-Throughput Nucleotide Sequencing , Quality Control , Algorithms , Sequence Analysis, DNA

6.

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches.

Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Genome Biol ; 24(1): 121, 2023 05 17.

Article in English | MEDLINE | ID: mdl-37198663

ABSTRACT

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
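
The sketch-based distance estimation mentioned in this abstract can be illustrated with a generic bottom-s MinHash. This is a minimal sketch under stated assumptions, not RabbitTClust's implementation: the function names, the BLAKE2b-based hash, and the parameters `k` and `s` are all chosen here for illustration, and reverse complements are ignored.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stable 64-bit hash of a k-mer (illustrative choice of hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, s=16):
    """Bottom-s sketch: the s smallest hash values over the k-mer set."""
    return sorted(hash64(km) for km in kmers(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Identical sequences share every k-mer; these two share none.
same = jaccard_estimate(sketch("ACGTACGTTGCA"), sketch("ACGTACGTTGCA"))
disjoint = jaccard_estimate(sketch("AAAAAAAAAA"), sketch("CCCCCCCCCC"))
```

Only the fixed-size sketches need to be compared, which is what makes this kind of estimate cheap relative to comparing full genomes.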

Subject(s)

Genome , Software , Databases, Nucleic Acid , Cluster Analysis , Bacteria , Algorithms , Genome, Bacterial

7.

RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms.

Zhang, Hao; Song, Honglei; Xu, Xiaoming; Chang, Qixin; Wang, Mingkai; Wei, Yanjie; Yin, Zekun; Schmidt, Bertil; Liu, Weiguo.

IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 2341-2348, 2023.

Article in English | MEDLINE | ID: mdl-36327193

ABSTRACT

The continuous growth of generated sequencing data leads to the development of a variety of associated bioinformatics tools. However, many of them are not able to fully exploit the resources of modern multi-core systems since they are bottlenecked by parsing files leading to slow execution times. This motivates the design of an efficient method for parsing sequencing data that can exploit the power of modern hardware, especially for modern CPUs with fast storage devices. We have developed RabbitFX, a fast, efficient, and easy-to-use framework for processing biological sequencing data on modern multi-core platforms. It can efficiently read FASTA and FASTQ files by combining a lightweight parsing method by means of an optimized formatting implementation. Furthermore, we provide user-friendly and modularized C++ APIs that can be easily integrated into applications in order to increase their file parsing speed. As proof-of-concept, we have integrated RabbitFX into three I/O-intensive applications: fastp, Ktrim, and Mash. Our evaluation shows that the inclusion of RabbitFX leads to speedups of at least 11.6 (6.6), 2.4 (2.4), and 3.7 (3.2) compared to the original versions on plain (gzip-compressed) files, respectively. These case studies demonstrate that RabbitFX can be easily integrated into a variety of NGS analysis tools to significantly reduce associated runtimes. It is open source software available at https://github.com/RabbitBio/RabbitFX.

Subject(s)

Computational Biology , Software , High-Throughput Nucleotide Sequencing

8.

Locality-sensitive hashing enables efficient and scalable signal classification in high-throughput mass spectrometry raw data.

Bob, Konstantin; Teschner, David; Kemmer, Thomas; Gomez-Zepeda, David; Tenzer, Stefan; Schmidt, Bertil; Hildebrandt, Andreas.

BMC Bioinformatics ; 23(1): 287, 2022 Jul 20.

Article in English | MEDLINE | ID: mdl-35858828

ABSTRACT

BACKGROUND: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span along the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions. Furthermore, existing approaches for signal detection usually rely on strong assumptions concerning the signals' properties. RESULTS: In this study, it is shown that locality-sensitive hashing enables signal classification in mass spectrometry raw data at scale. Through appropriate choice of algorithm parameters it is possible to balance false-positive and false-negative rates. On synthetic data, a superior performance compared to an intensity thresholding approach was achieved. Real data could be strongly reduced without losing relevant information. Our implementation scaled up to 32 threads and supports acceleration by GPUs. CONCLUSIONS: Locality-sensitive hashing is a desirable approach for signal classification in mass spectrometry raw data. AVAILABILITY: Generated data and code are available at https://github.com/hildebrandtlab/mzBucket . Raw data are available at https://zenodo.org/record/5036526 .

Subject(s)

Algorithms , Software , Mass Spectrometry , Proteomics/methods

9.

CARE 2.0: reducing false-positive sequencing error corrections using machine learning.

Kallenborn, Felix; Cascitti, Julian; Schmidt, Bertil.

BMC Bioinformatics ; 23(1): 227, 2022 Jun 13.

Article in English | MEDLINE | ID: mdl-35698033

ABSTRACT

BACKGROUND: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced, which improve k-mer analysis and de-novo assembly in real-world datasets. This demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .

Subject(s)

Algorithms , Software , High-Throughput Nucleotide Sequencing/methods , Humans , Machine Learning , Sequence Alignment , Sequence Analysis, DNA/methods

10.

RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures.

Zhang, Hao; Chang, Qixin; Yin, Zekun; Xu, Xiaoming; Wei, Yanjie; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 38(10): 2932-2933, 2022 05 13.

Article in English | MEDLINE | ID: mdl-35561184

ABSTRACT

MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
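
The idea of detecting an organism through unique k-mers can be sketched as follows. This is a simplified illustration, not the RabbitUniq or RabbitV implementation: the function names and toy sequences are hypothetical, reverse complements are ignored, and real tools use far more compact data structures than Python sets.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(target, others, k):
    """k-mers that occur in the target genome but in none of the other genomes."""
    unique = kmers(target, k)
    for other in others:
        unique -= kmers(other, k)
    return unique


def count_hits(read, unique, k):
    """Number of positions in the read whose k-mer is unique to the target."""
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in unique)


# Toy data: the target differs from the background genome only near its end.
target_unique = unique_kmers("ACGTACGTTT", ["ACGTACGA"], k=4)
hits = count_hits("CCGTTTA", target_unique, k=4)
```

A read with many hits against the unique set is evidence for the target organism, while reads drawn from shared regions contribute no hits.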

Subject(s)

COVID-19 , Viruses , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software , Viruses/genetics

11.

A 3D Deep Neural Network for Liver Volumetry in 3T Contrast-Enhanced MRI.

Winther, Hinrich; Hundt, Christian; Ringe, Kristina Imeen; Wacker, Frank K; Schmidt, Bertil; Jürgens, Julian; Haimerl, Michael; Beyer, Lukas Philipp; Stroszczynski, Christian; Wiggermann, Philipp; Verloh, Niklas.

Rofo ; 193(3): 305-314, 2021 Mar.

Article in English | MEDLINE | ID: mdl-32882724

ABSTRACT

PURPOSE: To create a fully automated, reliable, and fast segmentation tool for Gd-EOB-DTPA-enhanced MRI scans using deep learning. MATERIALS AND METHODS: Datasets of Gd-EOB-DTPA-enhanced liver MR images of 100 patients were assembled. Ground truth segmentation of the hepatobiliary phase images was performed manually. Automatic image segmentation was achieved with a deep convolutional neural network. RESULTS: Our neural network achieves an intraclass correlation coefficient (ICC) of 0.987, a Sørensen-Dice coefficient of 96.7 ± 1.9 % (mean ± std), an overlap of 92 ± 3.5 %, and a Hausdorff distance of 24.9 ± 14.7 mm compared with two expert readers who corresponded to an ICC of 0.973, a Sørensen-Dice coefficient of 95.2 ± 2.8 %, and an overlap of 90.9 ± 4.9 %. A second human reader achieved a Sørensen-Dice coefficient of 95 % on a subset of the test set. CONCLUSION: Our study introduces a fully automated liver volumetry scheme for Gd-EOB-DTPA-enhanced MR imaging. The neural network achieves competitive concordance with the ground truth regarding ICC, Sørensen-Dice, and overlap compared with manual segmentation. The neural network performs the task in just 60 seconds. KEY POINTS: · The proposed neural network helps to segment the liver accurately, providing detailed information about patient-specific liver anatomy and volume. · With the help of a deep learning-based neural network, fully automatic segmentation of the liver on MRI scans can be performed in seconds. · A fully automatic segmentation scheme makes liver segmentation on MRI a valuable tool for treatment planning. CITATION FORMAT: · Winther H, Hundt C, Ringe KI et al. A 3D Deep Neural Network for Liver Volumetry in 3T Contrast-Enhanced MRI. Fortschr Röntgenstr 2021; 193: 305-314.
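
For readers unfamiliar with the Sørensen-Dice coefficient reported above, the following minimal sketch shows how it is computed for two binary segmentation masks. The flat-list mask representation and the function name are assumptions made for illustration; the study's own evaluation code is not described in the abstract.

```python
def dice_coefficient(mask_a, mask_b):
    """Sørensen-Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks.

    Masks are flat sequences of 0/1 voxel labels with equal length.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


# Toy example: each mask labels three voxels, two of which coincide.
predicted = [1, 1, 1, 0]
reference = [0, 1, 1, 1]
score = dice_coefficient(predicted, reference)  # 2 * 2 / (3 + 3)
```

A value of 1 means the automatic and manual segmentations label exactly the same voxels, and 0 means they do not overlap at all.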

Subject(s)

Image Processing, Computer-Assisted , Liver , Magnetic Resonance Imaging , Neural Networks, Computer , Humans , Image Processing, Computer-Assisted/methods , Liver/diagnostic imaging , Magnetic Resonance Imaging/methods

12.

RabbitMash: accelerating hash-based genome analysis on modern multi-core architectures.

Yin, Zekun; Xu, Xiaoming; Zhang, Jinxiao; Wei, Yanjie; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 37(6): 873-875, 2021 05 05.

Article in English | MEDLINE | ID: mdl-32845281

ABSTRACT

MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient, highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100,321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Computers , Genome , Genomics

13.

Deep learning in next-generation sequencing.

Schmidt, Bertil; Hildebrandt, Andreas.

Drug Discov Today ; 26(1): 173-180, 2021 01.

Article in English | MEDLINE | ID: mdl-33059075

ABSTRACT

Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in a NGS context.

Subject(s)

Deep Learning , High-Throughput Nucleotide Sequencing/methods , Metagenomics , Neural Networks, Computer , Biomedical Research/trends , Humans , Metagenomics/methods , Metagenomics/trends

14.

Fully automated detection of primary sclerosing cholangitis (PSC)-compatible bile duct changes based on 3D magnetic resonance cholangiopancreatography using machine learning.

Ringe, Kristina I; Vo Chieu, Van Dai; Wacker, Frank; Lenzen, Henrike; Manns, Michael P; Hundt, Christian; Schmidt, Bertil; Winther, Hinrich B.

Eur Radiol ; 31(4): 2482-2489, 2021 Apr.

Article in English | MEDLINE | ID: mdl-32974688

ABSTRACT

OBJECTIVES: To develop and evaluate a deep learning algorithm for fully automated detection of primary sclerosing cholangitis (PSC)-compatible cholangiographic changes on three-dimensional magnetic resonance cholangiopancreatography (3D-MRCP) images. METHODS: The datasets of 428 patients (n = 205 with confirmed diagnosis of PSC; n = 223 non-PSC patients) referred for MRI including MRCP were included in this retrospective IRB-approved study. Datasets were randomly assigned to a training (n = 386) and a validation group (n = 42). For each case, 20 uniformly distributed axial MRCP rotations and a subsequent maximum intensity projection (MIP) were calculated, resulting in a training database of 7720 images and a validation database of 840 images. Then, a pre-trained Inception ResNet was implemented which was conclusively fine-tuned (learning rate 10^-3). RESULTS: Applying an ensemble strategy (by binning of the 20 axial projections), the mean absolute error (MAE) of the developed deep learning algorithm for detection of PSC-compatible cholangiographic changes was lowered from 21 to 7.1%. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for detection of these changes were 95.0%, 90.9%, 90.5%, and 95.2%, respectively. CONCLUSIONS: The results of this study demonstrate the feasibility of transfer learning in combination with extensive image augmentation to detect PSC-compatible cholangiographic changes on 3D-MRCP images with a high sensitivity and a low MAE. Further validation with more and multicentric data is now desirable, as it is known that neural networks tend to overfit the characteristics of the dataset. KEY POINTS: • The described machine learning algorithm is able to detect PSC-compatible cholangiographic changes on 3D-MRCP images with high accuracy. • The generation of 2D projections from 3D datasets enabled the implementation of an ensemble strategy to boost inference performance.
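
The four diagnostic measures in the results follow directly from a 2×2 confusion matrix. The worked example below assumes counts of TP = 19, FP = 2, TN = 20, and FN = 1. These counts are not stated in the abstract; they are one set of values that sums to the validation group size (n = 42) and reproduces the reported percentages, so they should be read as an illustration rather than the study's actual table.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # detected among truly positive cases
        "specificity": tn / (tn + fp),  # rejected among truly negative cases
        "ppv": tp / (tp + fp),          # correct among positive predictions
        "npv": tn / (tn + fn),          # correct among negative predictions
    }


# Assumed counts (hypothetical, chosen to be consistent with the abstract).
counts = {"tp": 19, "fp": 2, "tn": 20, "fn": 1}
metrics = diagnostic_metrics(**counts)

for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
# sensitivity: 95.0%
# specificity: 90.9%
# ppv: 90.5%
# npv: 95.2%
```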

Subject(s)

Cholangiopancreatography, Magnetic Resonance , Cholangitis, Sclerosing , Bile Ducts/diagnostic imaging , Cholangiopancreatography, Endoscopic Retrograde , Cholangitis, Sclerosing/diagnostic imaging , Humans , Machine Learning , Retrospective Studies

15.

RabbitQC: high-speed scalable quality control for sequencing data.

Yin, Zekun; Zhang, Hao; Liu, Meiyang; Zhang, Wen; Song, Honglei; Lan, Haidong; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo.

Bioinformatics ; 37(4): 573-574, 2021 05 01.

Article in English | MEDLINE | ID: mdl-32790850

ABSTRACT

MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Nanopores , Software , High-Throughput Nucleotide Sequencing , Quality Control , Sequence Analysis, DNA

16.

CARE: context-aware sequencing read error correction.

Kallenborn, Felix; Hildebrandt, Andreas; Schmidt, Bertil.

Bioinformatics ; 37(7): 889-895, 2021 05 17.

Article in English | MEDLINE | ID: mdl-32818262

ABSTRACT

MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
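
How minhashing supports similarity search within a read collection can be sketched with a small hash-table index: reads whose minimum k-mer hash agrees under at least one hash function become alignment candidates. This is a generic illustration, not CARE's implementation. The function names, the seeded BLAKE2b hash, and the parameters `k` and `num_hashes` are assumptions, and every read is assumed to be at least `k` bases long.

```python
import hashlib
from collections import defaultdict


def seeded_hash(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one of several hash functions."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(read, k, num_hashes):
    """Minhash signature: the minimum k-mer hash under each hash function."""
    kms = [read[i:i + k] for i in range(len(read) - k + 1)]
    return [min(seeded_hash(km, seed) for km in kms) for seed in range(num_hashes)]


def build_index(reads, k=8, num_hashes=4):
    """One hash table per hash function, mapping minhash value to read ids."""
    tables = [defaultdict(list) for _ in range(num_hashes)]
    for read_id, read in enumerate(reads):
        for table, value in zip(tables, signature(read, k, num_hashes)):
            table[value].append(read_id)
    return tables


def candidates(read, tables, k=8):
    """Ids of indexed reads that share at least one minhash value with the read."""
    found = set()
    for table, value in zip(tables, signature(read, k, len(tables))):
        found.update(table.get(value, []))
    return found


reads = ["ACGTACGTAGCTAGCTAGGA", "TTTTTTTTTTTTTTTTTTTT"]
index = build_index(reads)
similar = candidates("ACGTACGTAGCTAGCTAGGA", index)
```

Only the returned candidates would then be aligned against the query read, which avoids comparing every read with every other read.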

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Algorithms , Humans , Sequence Alignment , Sequence Analysis, DNA

17.

Big Data in metagenomics: Apache Spark vs MPI.

Abuín, José M; Lopes, Nuno; Ferreira, Luís; Pena, Tomás F; Schmidt, Bertil.

PLoS One ; 15(10): e0239741, 2020.

Article in English | MEDLINE | ID: mdl-33022000

ABSTRACT

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while the need to deal with low-level details, such as data distribution schemes or communication patterns among processing nodes, can be ignored. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and allows for dealing with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters which is based on MPI instead of Apache Spark. In order to evaluate its performance, a comparison is performed between the original single CPU version of MetaCache, the Spark version and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM when compared with MetaCacheSpark, while keeping the accuracy of the original implementation.

Subject(s)

Big Data , Genome, Bacterial/genetics , Metagenome/genetics , Metagenomics , Algorithms , Computing Methodologies , DNA/genetics , Software

18.

RainDrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads.

Niebler, Stefan; Müller, André; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 274, 2020 Jul 01.

Article in English | MEDLINE | ID: mdl-32611394

ABSTRACT

BACKGROUND: Obtaining data from single-cell transcriptomic sequencing allows for the investigation of cell-specific gene expression patterns, which could not be addressed a few years ago. With the advancement of droplet-based protocols the number of studied cells continues to increase rapidly. This establishes the need for software tools for efficient processing of the produced large-scale datasets. We address this need by presenting RainDrop for fast gene-cell count matrix computation from single-cell RNA-seq data produced by 10x Genomics Chromium technology. RESULTS: RainDrop can process single-cell transcriptomic datasets consisting of 784 million reads sequenced from around 8,000 cells in less than 40 minutes on a standard workstation. It significantly outperforms the established Cell Ranger pipeline and the recently introduced Alevin tool in terms of runtime by a maximal (average) speedup of 30.4 (22.6) and 3.5 (2.4), respectively, while keeping high agreement of the generated results. CONCLUSIONS: RainDrop is a software tool, written in C++, for highly efficient processing of large-scale droplet-based single-cell RNA-seq datasets on standard workstations. It is available at https://gitlab.rlp.net/stnieble/raindrop .

Subject(s)

Sequence Analysis, RNA/methods , User-Computer Interface , Databases, Genetic , Humans , Information Storage and Retrieval , Single-Cell Analysis

19.

A big data approach to metagenomics for all-food-sequencing.

Kobus, Robin; Abuín, José M; Müller, André; Hellmann, Sören Lukas; Pichel, Juan C; Pena, Tomás F; Hildebrandt, Andreas; Hankeln, Thomas; Schmidt, Bertil.

BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.

Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing number of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matter. It is orders of magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the use of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).

Subject(s)

Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Genome, Bacterial , Metagenome , Microbiota/genetics , Software

20.

Deep semantic lung segmentation for tracking potential pulmonary perfusion biomarkers in chronic obstructive pulmonary disease (COPD): The multi-ethnic study of atherosclerosis COPD study.

Winther, Hinrich B; Gutberlet, Marcel; Hundt, Christian; Kaireit, Till F; Alsady, Tawfik Moher; Schmidt, Bertil; Wacker, Frank; Sun, Yanping; Dettmer, Sabine; Maschke, Sabine K; Hinrichs, Jan B; Jambawalikar, Sachin; Prince, Martin R; Barr, R Graham; Vogel-Claussen, Jens.

J Magn Reson Imaging ; 51(2): 571-579, 2020 Feb.

Article in English | MEDLINE | ID: mdl-31276264

ABSTRACT

BACKGROUND: Chronic obstructive pulmonary disease (COPD) is associated with high morbidity and mortality. Identification of imaging biomarkers for phenotyping is necessary for future treatment and therapy monitoring. However, translation of visual analytic pipelines into clinics or their use in large-scale studies is significantly slowed by time-consuming postprocessing steps. PURPOSE: To implement an automated tool chain for regional quantification of pulmonary microvascular blood flow in order to reduce analysis time and user variability. STUDY TYPE: Prospective. POPULATION: In all, 90 MRI scans of 63 patients, of which 31 had COPD with a mean Global Initiative for Chronic Obstructive Lung Disease status of 1.9 ± 0.64 (µ ± σ). FIELD STRENGTH/SEQUENCE: 1.5T dynamic gadolinium-enhanced MRI measurement using 4D dynamic contrast material-enhanced (DCE) time-resolved angiography acquired in a single breath-hold in inspiration. [Correction added on August 20, 2019, after first online publication: The field strength in the preceding sentence was corrected.] ASSESSMENT: We built a 3D convolutional neural network for semantic segmentation using 29 manually segmented perfusion maps. All five lobes of the lung are denoted, including the middle lobe. Evaluation was performed on 61 independent cases from two sites of the Multi-Ethnic Study of Atherosclerosis (MESA)-COPD study. We publish our implementation of a model-free deconvolution filter according to Sourbron et al. for 4D DCE MRI scans as open source. STATISTICAL TEST: Cross-validation 29/61 (# training / # testing), intraclass correlation coefficient (ICC), Spearman ρ, Pearson r, Sørensen-Dice coefficient, and overlap. RESULTS: Segmentations and derived clinical parameters were processed in ~90 seconds per case on a Xeon E5-2637v4 workstation with Tesla P40 GPUs. Clinical parameters and predicted segmentations exhibit high concordance with the ground truth regarding median perfusion for all lobes with an ICC of 0.99 and a Sørensen-Dice coefficient of 93.4 ± 2.8 (µ ± σ). DATA CONCLUSION: We present a robust end-to-end pipeline that allows for the extraction of perfusion-based biomarkers for all lung lobes in 4D DCE MRI scans by combining model-free deconvolution with deep learning. LEVEL OF EVIDENCE: 3 Technical Efficacy: Stage 2 J. Magn. Reson. Imaging 2020;51:571-579.

Subject(s)

Atherosclerosis , Pulmonary Disease, Chronic Obstructive , Biomarkers , Humans , Lung/diagnostic imaging , Magnetic Resonance Imaging , Perfusion , Prospective Studies , Pulmonary Disease, Chronic Obstructive/diagnostic imaging , Semantics

