1.
Stud Health Technol Inform ; 317: 261-269, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39234730

ABSTRACT

INTRODUCTION: Retrieving comprehensible rule-based knowledge from medical data by machine learning is a beneficial task, e.g., for automating the process of creating a decision support system. While this has recently been studied by means of exception-tolerant hierarchical knowledge bases (i.e., knowledge bases, where rule-based knowledge is represented on several levels of abstraction), privacy concerns have not been addressed extensively in this context yet. However, privacy plays an important role, especially for medical applications. METHODS: When parts of the original dataset can be restored from a learned knowledge base, there may be a practically and legally relevant risk of re-identification for individuals. In this paper, we study privacy issues of exception-tolerant hierarchical knowledge bases which are learned from data. We propose approaches for determining and eliminating privacy issues of the learned knowledge bases. RESULTS: We present results for synthetic as well as for real world datasets. CONCLUSION: The results show that our approach effectively prevents privacy breaches while only moderately decreasing the inference quality.


Subjects
Confidentiality , Knowledge Bases , Machine Learning , Humans , Computer Security , Privacy , Electronic Health Records
2.
BMC Bioinformatics ; 25(1): 186, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38730374

ABSTRACT

BACKGROUND: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. RESULTS: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99 % for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. CONCLUSION: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at ( https://github.com/fkallen/CAREx ).


Subjects
Algorithms , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Humans , Sequence Alignment/methods
3.
Drug Discov Today ; 29(6): 103990, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38663581

ABSTRACT

The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.


Subjects
Artificial Intelligence , Computational Biology , Computational Biology/methods , Computer Graphics , Quantum Theory , Humans , Computer Neural Networks , Drug Discovery/methods
4.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37971961

ABSTRACT

SUMMARY: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.7× to 19.8× over Kssd for the time-consuming sketch generation and distance computation on commonly used workstations. In addition, it significantly outperforms Mash, BinDash, and Dashing2. Moreover, RabbitKSSD can efficiently perform all-vs-all distance computation for all RefSeq complete bacterial genomes (455 GB in FASTA format) in just 2 min on a 64-core workstation. AVAILABILITY AND IMPLEMENTATION: RabbitKSSD is available at https://github.com/RabbitBio/RabbitKSSD.


Subjects
Bacterial Genome , Software , Biological Evolution
5.
NAR Genom Bioinform ; 5(3): lqad082, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37705831

ABSTRACT

Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.

6.
Methods ; 216: 39-50, 2023 08.
Article in English | MEDLINE | ID: mdl-37330158

ABSTRACT

Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.


Subjects
Data Compression , Software , High-Throughput Nucleotide Sequencing , Quality Control , Algorithms , DNA Sequence Analysis
7.
Genome Biol ; 24(1): 121, 2023 05 17.
Article in English | MEDLINE | ID: mdl-37198663

ABSTRACT

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.


Subjects
Genome , Software , Nucleic Acid Databases , Cluster Analysis , Bacteria , Algorithms , Bacterial Genome
8.
IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 2341-2348, 2023.
Article in English | MEDLINE | ID: mdl-36327193

ABSTRACT

The continuous growth of generated sequencing data leads to the development of a variety of associated bioinformatics tools. However, many of them are not able to fully exploit the resources of modern multi-core systems since they are bottlenecked by parsing files leading to slow execution times. This motivates the design of an efficient method for parsing sequencing data that can exploit the power of modern hardware, especially for modern CPUs with fast storage devices. We have developed RabbitFX, a fast, efficient, and easy-to-use framework for processing biological sequencing data on modern multi-core platforms. It can efficiently read FASTA and FASTQ files by combining a lightweight parsing method by means of an optimized formatting implementation. Furthermore, we provide user-friendly and modularized C++ APIs that can be easily integrated into applications in order to increase their file parsing speed. As proof-of-concept, we have integrated RabbitFX into three I/O-intensive applications: fastp, Ktrim, and Mash. Our evaluation shows that the inclusion of RabbitFX leads to speedups of at least 11.6 (6.6), 2.4 (2.4), and 3.7 (3.2) compared to the original versions on plain (gzip-compressed) files, respectively. These case studies demonstrate that RabbitFX can be easily integrated into a variety of NGS analysis tools to significantly reduce associated runtimes. It is open source software available at https://github.com/RabbitBio/RabbitFX.


Subjects
Computational Biology , Software , High-Throughput Nucleotide Sequencing
9.
BMC Bioinformatics ; 23(1): 287, 2022 Jul 20.
Article in English | MEDLINE | ID: mdl-35858828

ABSTRACT

BACKGROUND: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span along the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions. Furthermore, existing approaches for signal detection usually rely on strong assumptions concerning the signals' properties. RESULTS: In this study, it is shown that locality-sensitive hashing enables signal classification in mass spectrometry raw data at scale. Through appropriate choice of algorithm parameters it is possible to balance false-positive and false-negative rates. On synthetic data, a superior performance compared to an intensity thresholding approach was achieved. Real data could be strongly reduced without losing relevant information. Our implementation scaled out up to 32 threads and supports acceleration by GPUs. CONCLUSIONS: Locality-sensitive hashing is a desirable approach for signal classification in mass spectrometry raw data. AVAILABILITY: Generated data and code are available at https://github.com/hildebrandtlab/mzBucket . Raw data is available at https://zenodo.org/record/5036526 .


Subjects
Algorithms , Software , Mass Spectrometry , Proteomics/methods
10.
BMC Bioinformatics ; 23(1): 227, 2022 Jun 13.
Article in English | MEDLINE | ID: mdl-35698033

ABSTRACT

BACKGROUND: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets, which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .


Subjects
Algorithms , Software , High-Throughput Nucleotide Sequencing/methods , Humans , Machine Learning , Sequence Alignment , DNA Sequence Analysis/methods
11.
Bioinformatics ; 38(10): 2932-2933, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561184

ABSTRACT

MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
COVID-19 , Viruses , Algorithms , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis , Software , Viruses/genetics
12.
Bioinformatics ; 37(4): 573-574, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32790850

ABSTRACT

MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopores , Software , High-Throughput Nucleotide Sequencing , Quality Control , DNA Sequence Analysis
13.
Bioinformatics ; 37(6): 873-875, 2021 05 05.
Article in English | MEDLINE | ID: mdl-32845281

ABSTRACT

MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient, highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100 321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Computers , Genome , Genomics
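
Illustrative note on entry 13: the sketch and dist operations named in the abstract can be pictured with a minimal MinHash example. This is a rough sketch under stated assumptions, not RabbitMash's or Mash's actual code: the function names, the BLAKE2 hash, the tiny k-mer length, and the forward-strand-only k-mers are all choices made here for illustration. The distance formula D = -(1/k) ln(2j / (1 + j)) is the one published for Mash.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2 is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=64):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sk_a, sk_b, size=64):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    union_bottom = sorted(set(sk_a) | set(sk_b))[:size]
    if not union_bottom:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    hits = sum(1 for h in union_bottom if h in shared)
    return hits / len(union_bottom)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j and k-mer length k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    a = "ACGTTGCATGCCGATAGCTAGGCTTACG"
    b = "ACGTTGCATGCCGATAGCTAGGCTAACG"  # one substitution relative to a
    j = jaccard_estimate(sketch(a), sketch(b))
    print(f"Jaccard estimate: {j:.3f}, Mash distance: {mash_distance(j, 5):.3f}")
```

Because each sequence is reduced to a fixed-size sketch, an all-versus-all comparison only touches the sketches, which is what makes this family of methods amenable to the multi-threading and vectorization the abstract describes.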
14.
Rofo ; 193(3): 305-314, 2021 Mar.
Article in English | MEDLINE | ID: mdl-32882724

ABSTRACT

PURPOSE: To create a fully automated, reliable, and fast segmentation tool for Gd-EOB-DTPA-enhanced MRI scans using deep learning. MATERIALS AND METHODS: Datasets of Gd-EOB-DTPA-enhanced liver MR images of 100 patients were assembled. Ground truth segmentation of the hepatobiliary phase images was performed manually. Automatic image segmentation was achieved with a deep convolutional neural network. RESULTS: Our neural network achieves an intraclass correlation coefficient (ICC) of 0.987, a Sørensen-Dice coefficient of 96.7 ± 1.9 % (mean ± std), an overlap of 92 ± 3.5 %, and a Hausdorff distance of 24.9 ± 14.7 mm compared with two expert readers who corresponded to an ICC of 0.973, a Sørensen-Dice coefficient of 95.2 ± 2.8 %, and an overlap of 90.9 ± 4.9 %. A second human reader achieved a Sørensen-Dice coefficient of 95 % on a subset of the test set. CONCLUSION: Our study introduces a fully automated liver volumetry scheme for Gd-EOB-DTPA-enhanced MR imaging. The neural network achieves competitive concordance with the ground truth regarding ICC, Sørensen-Dice, and overlap compared with manual segmentation. The neural network performs the task in just 60 seconds. KEY POINTS: · The proposed neural network helps to segment the liver accurately, providing detailed information about patient-specific liver anatomy and volume. · With the help of a deep learning-based neural network, fully automatic segmentation of the liver on MRI scans can be performed in seconds. · A fully automatic segmentation scheme makes liver segmentation on MRI a valuable tool for treatment planning. CITATION FORMAT: · Winther H, Hundt C, Ringe KI et al. A 3D Deep Neural Network for Liver Volumetry in 3T Contrast-Enhanced MRI. Fortschr Röntgenstr 2021; 193: 305 - 314.


Subjects
Computer-Assisted Image Processing , Liver , Magnetic Resonance Imaging , Computer Neural Networks , Humans , Computer-Assisted Image Processing/methods , Liver/diagnostic imaging , Magnetic Resonance Imaging/methods
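
Illustrative note on entry 14: the Sørensen-Dice coefficient and the overlap reported in the abstract can be computed from two binary masks as sketched below. This is only a minimal example under assumptions: masks are given as flat sequences of 0/1 voxel labels, the function name is hypothetical, and "overlap" is taken here to mean the Jaccard index (intersection over union), which the abstract does not define explicitly.

```python
def dice_and_overlap(pred, truth):
    """Return (Dice, Jaccard) for two equally long binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    union = n_pred + n_truth - inter
    if union == 0:
        # Both masks are empty: treat them as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * inter / (n_pred + n_truth)
    jaccard = inter / union
    return dice, jaccard


if __name__ == "__main__":
    predicted = [1, 1, 1, 0]
    reference = [0, 1, 1, 1]
    dice, jaccard = dice_and_overlap(predicted, reference)
    print(f"Dice = {dice:.3f}, overlap (Jaccard) = {jaccard:.3f}")
```

For a single pair of masks the two measures are linked by J = D / (2 - D), so a Dice value always exceeds the corresponding Jaccard value unless the masks agree perfectly or not at all.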
15.
Bioinformatics ; 37(7): 889-895, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32818262

ABSTRACT

MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing , Software , Algorithms , Humans , Sequence Alignment , DNA Sequence Analysis
16.
Eur Radiol ; 31(4): 2482-2489, 2021 Apr.
Article in English | MEDLINE | ID: mdl-32974688

ABSTRACT

OBJECTIVES: To develop and evaluate a deep learning algorithm for fully automated detection of primary sclerosing cholangitis (PSC)-compatible cholangiographic changes on three-dimensional magnetic resonance cholangiopancreatography (3D-MRCP) images. METHODS: The datasets of 428 patients (n = 205 with confirmed diagnosis of PSC; n = 223 non-PSC patients) referred for MRI including MRCP were included in this retrospective IRB-approved study. Datasets were randomly assigned to a training (n = 386) and a validation group (n = 42). For each case, 20 uniformly distributed axial MRCP rotations and a subsequent maximum intensity projection (MIP) were calculated, resulting in a training database of 7720 images and a validation database of 840 images. Then, a pre-trained Inception ResNet was implemented which was conclusively fine-tuned (learning rate 10^-3). RESULTS: Applying an ensemble strategy (by binning of the 20 axial projections), the mean absolute error (MAE) of the developed deep learning algorithm for detection of PSC-compatible cholangiographic changes was lowered from 21 to 7.1%. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for detection of these changes were 95.0%, 90.9%, 90.5%, and 95.2% respectively. CONCLUSIONS: The results of this study demonstrate the feasibility of transfer learning in combination with extensive image augmentation to detect PSC-compatible cholangiographic changes on 3D-MRCP images with a high sensitivity and a low MAE. Further validation with more and multicentric data is now desirable, as it is known that neural networks tend to overfit the characteristics of the dataset. KEY POINTS: • The described machine learning algorithm is able to detect PSC-compatible cholangiographic changes on 3D-MRCP images with high accuracy. • The generation of 2D projections from 3D datasets enabled the implementation of an ensemble strategy to boost inference performance.


Subjects
Magnetic Resonance Cholangiopancreatography , Sclerosing Cholangitis , Bile Ducts/diagnostic imaging , Endoscopic Retrograde Cholangiopancreatography , Sclerosing Cholangitis/diagnostic imaging , Humans , Machine Learning , Retrospective Studies
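
Illustrative note on entry 16: the four diagnostic measures quoted in the abstract all derive from a 2×2 confusion matrix, as the short sketch below shows. The counts used (TP = 19, FN = 1, TN = 20, FP = 2) are an assumption made here: they sum to the 42 validation cases and reproduce the reported percentages, but the abstract itself does not state them. The function name is likewise hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, PPV, and NPV as percentages."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # detected among true positives
        "specificity": 100.0 * tn / (tn + fp),  # cleared among true negatives
        "ppv": 100.0 * tp / (tp + fp),          # correct among positive calls
        "npv": 100.0 * tn / (tn + fn),          # correct among negative calls
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to be consistent with the abstract.
    metrics = diagnostic_metrics(tp=19, fn=1, tn=20, fp=2)
    for name, value in metrics.items():
        print(f"{name}: {value:.1f}%")
```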
17.
Drug Discov Today ; 26(1): 173-180, 2021 01.
Article in English | MEDLINE | ID: mdl-33059075

ABSTRACT

Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in an NGS context.


Subjects
Deep Learning , High-Throughput Nucleotide Sequencing/methods , Metagenomics , Computer Neural Networks , Biomedical Research/trends , Humans , Metagenomics/methods , Metagenomics/trends
18.
PLoS One ; 15(10): e0239741, 2020.
Article in English | MEDLINE | ID: mdl-33022000

ABSTRACT

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while the need to deal with low level details, such as data distribution schemes or communication patterns among processing nodes, can be ignored. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes the usage of High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high throughput sequencing data sets of metagenomic DNA and allows for dealing with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory efficient solution for computing clusters which is based on MPI instead of Apache Spark. In order to evaluate its performance, a comparison is performed between the original single CPU version of MetaCache, the Spark version and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM, when compared with MetaCacheSpark, while keeping the accuracy of the original implementation.


Subjects
Big Data , Bacterial Genome/genetics , Metagenome/genetics , Metagenomics , Algorithms , Computing Methodologies , DNA/genetics , Software
19.
BMC Bioinformatics ; 21(1): 274, 2020 Jul 01.
Article in English | MEDLINE | ID: mdl-32611394

ABSTRACT

BACKGROUND: Obtaining data from single-cell transcriptomic sequencing allows for the investigation of cell-specific gene expression patterns, which could not be addressed a few years ago. With the advancement of droplet-based protocols the number of studied cells continues to increase rapidly. This establishes the need for software tools for efficient processing of the produced large-scale datasets. We address this need by presenting RainDrop for fast gene-cell count matrix computation from single-cell RNA-seq data produced by 10x Genomics Chromium technology. RESULTS: RainDrop can process single-cell transcriptomic datasets consisting of 784 million reads sequenced from around 8,000 cells in less than 40 minutes on a standard workstation. It significantly outperforms the established Cell Ranger pipeline and the recently introduced Alevin tool in terms of runtime by a maximal (average) speedup of 30.4 (22.6) and 3.5 (2.4), respectively, while keeping high agreements of the generated results. CONCLUSIONS: RainDrop is a software tool for highly efficient processing of large-scale droplet-based single-cell RNA-seq datasets on standard workstations written in C++. It is available at https://gitlab.rlp.net/stnieble/raindrop .


Subjects
RNA Sequence Analysis/methods , User-Computer Interface , Genetic Databases , Humans , Information Storage and Retrieval , Single-Cell Analysis
20.
BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.
Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).


Subjects
Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Bacterial Genome , Metagenome , Microbiota/genetics , Software