Results 1 - 20 of 72
1.
BMC Bioinformatics ; 25(1): 186, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38730374

ABSTRACT

BACKGROUND: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. RESULTS: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple sequence alignments to extend a read until its partner is found. Our performance evaluation on both simulated and real data shows that CAREx is able to connect significantly more read pairs (up to 99% for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. CONCLUSION: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short-read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx .


Subjects
Algorithms , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans , Sequence Alignment/methods
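The gap-filling idea above can be sketched in a few lines. This is a toy stand-in for CAREx's approach, assuming exact suffix/prefix overlaps in place of the repeated multiple sequence alignments the tool actually computes; all names and the overlap parameter are illustrative.

```python
# Toy sketch of pseudo-long-read construction: greedily extend read1
# with exactly-overlapping reads until the reverse complement of its
# mate (read2) appears, then cut the pseudo-long read at the mate's end.

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def extend_until_mate(read1, read2, reads, overlap=5, max_steps=100):
    """Return the pseudo-long read connecting read1 to its mate, or None."""
    mate = revcomp(read2)  # the mate as it appears on read1's strand
    contig = read1
    for _ in range(max_steps):
        if mate in contig:
            return contig[:contig.index(mate) + len(mate)]
        for r in reads:  # take the first read whose prefix matches our suffix
            if len(r) > overlap and contig[-overlap:] == r[:overlap]:
                contig += r[overlap:]
                break
        else:
            return None  # no extending read found
    return None
```

On a tiny example, extending `ACGTTGCA` through four overlapping reads reconstructs the 20-bp fragment between the two mates. CAREx additionally tolerates sequencing errors by aligning many candidate reads at once, which this exact-match sketch does not attempt.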
2.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37971961

ABSTRACT

SUMMARY: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.7× to 19.8× over Kssd for the time-consuming sketch generation and distance computation on commonly used workstations. In addition, it significantly outperforms Mash, BinDash, and Dashing2. Moreover, RabbitKSSD can efficiently perform all-vs-all distance computation for all RefSeq complete bacterial genomes (455 GB in FASTA format) in just 2 min on a 64-core workstation. AVAILABILITY AND IMPLEMENTATION: RabbitKSSD is available at https://github.com/RabbitBio/RabbitKSSD.


Subjects
Genome, Bacterial , Software , Biological Evolution
3.
Methods ; 216: 39-50, 2023 08.
Article in English | MEDLINE | ID: mdl-37330158

ABSTRACT

Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.


Subjects
Data Compression , Software , High-Throughput Nucleotide Sequencing , Quality Control , Algorithms , Sequence Analysis, DNA
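A minimal illustration of one basic quality-control operation mentioned above, mean-quality filtering of FASTQ reads; this is a sketch of the concept, not RabbitQCPlus code, and the cutoff value is illustrative.

```python
# Filter reads by mean Phred quality. FASTQ encodes per-base qualities
# as ASCII characters with an offset of 33 (Sanger/Illumina 1.8+).

def mean_phred(qual, offset=33):
    """Mean Phred score of a FASTQ quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(records, min_mean_q=20):
    """Keep (sequence, quality) pairs whose mean quality passes the cutoff."""
    return [(s, q) for s, q in records if mean_phred(q) >= min_mean_q]
```

Production QC tools apply many such per-read operations in one pass; the performance work in RabbitQCPlus lies in vectorizing and parallelizing exactly this kind of loop.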
4.
Bioinformatics ; 38(10): 2932-2933, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561184

ABSTRACT

MOTIVATION: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption. RESULTS: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization and fast data parsing. Experiments show that RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively. Furthermore, RabbitV is able to detect COVID-19 from 40 samples of sequencing data (255 GB in FASTQ format) in only 320 s. AVAILABILITY AND IMPLEMENTATION: RabbitUniq and RabbitV are available at https://github.com/RabbitBio/RabbitUniq and https://github.com/RabbitBio/RabbitV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
COVID-19 , Viruses , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software , Viruses/genetics
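The unique-k-mer idea behind RabbitUniq/RabbitV can be illustrated as follows. This is a hedged sketch under the assumption that a k-mer occurring in exactly one reference serves as a marker for that reference; all names and sizes are invented for the example, and the real tools use far more efficient data structures.

```python
# Sketch: k-mers occurring in exactly one reference genome act as
# markers; counting marker hits across a read set flags pathogens.

from collections import Counter

def unique_kmers(references, k):
    """Map each k-mer occurring in exactly one reference to its name."""
    owner = {}
    counts = Counter()
    for name, seq in references.items():
        for km in {seq[i:i+k] for i in range(len(seq) - k + 1)}:
            counts[km] += 1
            owner[km] = name
    return {km: owner[km] for km in counts if counts[km] == 1}

def detect(reads, markers, k):
    """Count marker k-mer hits per reference across all reads."""
    hits = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i+k]
            if km in markers:
                hits[markers[km]] += 1
    return hits
```
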
5.
BMC Bioinformatics ; 23(1): 227, 2022 Jun 13.
Article in English | MEDLINE | ID: mdl-35698033

ABSTRACT

BACKGROUND: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of such corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced, improving k-mer analysis and de novo assembly on real-world data and demonstrating the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .


Subjects
Algorithms , Software , High-Throughput Nucleotide Sequencing/methods , Humans , Machine Learning , Sequence Alignment , Sequence Analysis, DNA/methods
6.
BMC Bioinformatics ; 23(1): 287, 2022 Jul 20.
Article in English | MEDLINE | ID: mdl-35858828

ABSTRACT

BACKGROUND: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions. Furthermore, existing approaches for signal detection usually rely on strong assumptions concerning the signals' properties. RESULTS: In this study, we show that locality-sensitive hashing enables signal classification in mass spectrometry raw data at scale. Through appropriate choice of algorithm parameters it is possible to balance false-positive and false-negative rates. On synthetic data, a superior performance compared to an intensity thresholding approach was achieved. Real data could be strongly reduced without losing relevant information. Our implementation scales to 32 threads and supports acceleration by GPUs. CONCLUSIONS: Locality-sensitive hashing is a desirable approach for signal classification in mass spectrometry raw data. AVAILABILITY: Generated data and code are available at https://github.com/hildebrandtlab/mzBucket . Raw data is available at https://zenodo.org/record/5036526 .


Subjects
Algorithms , Software , Mass Spectrometry , Proteomics/methods
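A minimal sketch of locality-sensitive hashing, the core technique of the study above. Random-hyperplane (sign) hashing is used here as a generic stand-in for the paper's mass-spectrometry-specific scheme; all parameters are illustrative.

```python
# LSH with random hyperplanes: vectors pointing in similar directions
# get the same sign pattern across random projections, so they land in
# the same hash bucket with high probability.

import random

def hyperplane_hash(vec, planes):
    """Sign pattern of dot products with random hyperplanes -> bucket key."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0)
                 for plane in planes)

def bucket(vectors, num_planes=8, dim=4, seed=0):
    """Group vector indices by their hyperplane hash key."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(hyperplane_hash(v, planes), []).append(i)
    return buckets
```

Note that a positively scaled copy of a vector has identical dot-product signs and therefore always shares its bucket, while a negated copy flips every sign and lands elsewhere, illustrating the angle-based notion of similarity.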
7.
Bioinformatics ; 37(7): 889-895, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32818262

ABSTRACT

MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates, since they break reads into independent k-mers, or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing , Software , Algorithms , Humans , Sequence Alignment , Sequence Analysis, DNA
8.
Bioinformatics ; 37(6): 873-875, 2021 05 05.
Article in English | MEDLINE | ID: mdl-32845281

ABSTRACT

MOTIVATION: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. RESULTS: We present RabbitMash, an efficient, highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization, and fast I/O. We show that our approach achieves speedups of at least 1.3×, 9.8×, 8.5× and 4.4× compared to Mash for the operations sketch, dist, triangle, and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100,321 genomes in <5 min on a 40-core workstation, while Mash requires over 40 min. AVAILABILITY AND IMPLEMENTATION: RabbitMash is available at https://github.com/ZekunYin/RabbitMash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Computers , Genome , Genomics
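The MinHash sketching that Mash and RabbitMash build on can be summarized in a few lines: keep the s smallest k-mer hashes per genome, estimate the Jaccard index j from the merged sketch, and convert it to the Mash distance d = -ln(2j/(1+j))/k. Python's salted built-in `hash` stands in for the MurmurHash used by the real tools, and the parameters are illustrative.

```python
# Bottom-s MinHash sketching with the Mash distance formula.

import math

def sketch(seq, k=4, s=8):
    """Bottom-s MinHash sketch of a sequence's k-mer set."""
    hashes = {hash(seq[i:i+k]) for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:s])

def mash_distance(seq_a, seq_b, k=4, s=8):
    """Estimate Jaccard similarity from merged sketches, map to distance."""
    sa, sb = sketch(seq_a, k, s), sketch(seq_b, k, s)
    merged = set(sorted(sa | sb)[:s])     # bottom-s of the union
    j = len(merged & sa & sb) / len(merged)
    if j == 0:
        return 1.0                        # no shared k-mers observed
    return -math.log(2 * j / (1 + j)) / k
```

Because only the sketches (a few hundred hashes per genome) are compared rather than whole k-mer sets, all-versus-all distance matrices over large genome collections become tractable, which is exactly the workload RabbitMash parallelizes.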
9.
Bioinformatics ; 37(4): 573-574, 2021 05 01.
Article in English | MEDLINE | ID: mdl-32790850

ABSTRACT

MOTIVATION: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes. RESULTS: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups between one and two orders-of-magnitude compared to other state-of-the-art tools. AVAILABILITY AND IMPLEMENTATION: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopores , Software , High-Throughput Nucleotide Sequencing , Quality Control , Sequence Analysis, DNA
10.
Eur Radiol ; 31(4): 2482-2489, 2021 Apr.
Article in English | MEDLINE | ID: mdl-32974688

ABSTRACT

OBJECTIVES: To develop and evaluate a deep learning algorithm for fully automated detection of primary sclerosing cholangitis (PSC)-compatible cholangiographic changes on three-dimensional magnetic resonance cholangiopancreatography (3D-MRCP) images. METHODS: The datasets of 428 patients (n = 205 with confirmed diagnosis of PSC; n = 223 non-PSC patients) referred for MRI including MRCP were included in this retrospective IRB-approved study. Datasets were randomly assigned to a training (n = 386) and a validation group (n = 42). For each case, 20 uniformly distributed axial MRCP rotations and a subsequent maximum intensity projection (MIP) were calculated, resulting in a training database of 7720 images and a validation database of 840 images. Then, a pre-trained Inception ResNet was implemented and subsequently fine-tuned (learning rate 10⁻³). RESULTS: Applying an ensemble strategy (by binning of the 20 axial projections), the mean absolute error (MAE) of the developed deep learning algorithm for detection of PSC-compatible cholangiographic changes was lowered from 21% to 7.1%. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for detection of these changes were 95.0%, 90.9%, 90.5%, and 95.2%, respectively. CONCLUSIONS: The results of this study demonstrate the feasibility of transfer learning in combination with extensive image augmentation to detect PSC-compatible cholangiographic changes on 3D-MRCP images with a high sensitivity and a low MAE. Further validation with more and multicentric data is now desirable, as it is known that neural networks tend to overfit the characteristics of the dataset. KEY POINTS: • The described machine learning algorithm is able to detect PSC-compatible cholangiographic changes on 3D-MRCP images with high accuracy. • The generation of 2D projections from 3D datasets enabled the implementation of an ensemble strategy to boost inference performance.


Subjects
Cholangiopancreatography, Magnetic Resonance , Cholangitis, Sclerosing , Bile Ducts/diagnostic imaging , Cholangiopancreatography, Endoscopic Retrograde , Cholangitis, Sclerosing/diagnostic imaging , Humans , Machine Learning , Retrospective Studies
11.
BMC Bioinformatics ; 21(1): 274, 2020 Jul 01.
Article in English | MEDLINE | ID: mdl-32611394

ABSTRACT

BACKGROUND: Obtaining data from single-cell transcriptomic sequencing allows for the investigation of cell-specific gene expression patterns, which could not be addressed a few years ago. With the advancement of droplet-based protocols the number of studied cells continues to increase rapidly. This establishes the need for software tools for efficient processing of the produced large-scale datasets. We address this need by presenting RainDrop for fast gene-cell count matrix computation from single-cell RNA-seq data produced by 10x Genomics Chromium technology. RESULTS: RainDrop can process single-cell transcriptomic datasets consisting of 784 million reads sequenced from around 8000 cells in less than 40 minutes on a standard workstation. It significantly outperforms the established Cell Ranger pipeline and the recently introduced Alevin tool in terms of runtime by a maximal (average) speedup of 30.4 (22.6) and 3.5 (2.4), respectively, while maintaining high agreement of the generated results. CONCLUSIONS: RainDrop is a software tool, written in C++, for highly efficient processing of large-scale droplet-based single-cell RNA-seq datasets on standard workstations. It is available at https://gitlab.rlp.net/stnieble/raindrop .


Subjects
Sequence Analysis, RNA/methods , User-Computer Interface , Databases, Genetic , Humans , Information Storage and Retrieval , Single-Cell Analysis
12.
BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.
Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).


Subjects
Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Genome, Bacterial , Metagenome , Microbiota/genetics , Software
13.
Bioinformatics ; 35(13): 2306-2308, 2019 07 01.
Article in English | MEDLINE | ID: mdl-30445566

ABSTRACT

MOTIVATION: Modern bioinformatics tools for analyzing large-scale NGS datasets often need to include fast implementations of core sequence alignment algorithms in order to achieve reasonable execution times. We address this need by presenting the BGSA toolkit for optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors. RESULTS: BGSA outperforms Edlib, SeqAn, and BitPAl for pairwise edit distance computations, and Parasail, SeqAn, and BitPAl when using more general scoring schemes for pairwise alignments of a batch of sequence reads, on both standard multi-core CPUs and Xeon Phi many-core CPUs. Furthermore, for banded edit distance in the seed verification stage of a read mapper, BGSA on a Xeon Phi-7210 outperforms the highly optimized NVBio implementation on a Titan X GPU by a factor of 4.4. AVAILABILITY AND IMPLEMENTATION: BGSA is open-source and available at https://github.com/sdu-hpcl/BGSA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Sequence Alignment , Sequence Analysis, DNA
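A representative bit-parallel algorithm of the kind BGSA optimizes is Myers' bit-vector method, which computes the edit distance of a pattern against a text in O(n) word operations for patterns up to the machine word size (Python's arbitrary-precision integers remove that limit, at the cost of the constant factors the C implementations fight for):

```python
# Myers' bit-parallel edit distance. Each column of the dynamic
# programming matrix is encoded by two bit vectors of vertical deltas
# (pv: +1 positions, mv: -1 positions), updated with word operations.

def myers_edit_distance(pattern, text):
    """Edit (Levenshtein) distance between pattern and text."""
    m = len(pattern)
    peq = {}                      # per-character match bitmasks
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = (1 << m) - 1, 0, m
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | ~(xh | pv)      # horizontal +1 deltas
        mh = pv & xh              # horizontal -1 deltas
        if ph & (1 << (m - 1)):
            score += 1
        elif mh & (1 << (m - 1)):
            score -= 1
        ph = (ph << 1) | 1
        mh = mh << 1
        pv = mh | ~(xv | ph)
        mv = ph & xv
    return score
```
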
14.
Bioinformatics ; 35(23): 4871-4878, 2019 12 01.
Article in English | MEDLINE | ID: mdl-31038666

ABSTRACT

MOTIVATION: K-mers along with their frequencies have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters is itself large; very often, it is too large to fit into main memory, severely limiting its usability. RESULTS: We introduce a novel idea of encoding k-mers as well as their frequencies, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 compared with the raw input when using 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. AVAILABILITY AND IMPLEMENTATION: The source code of our algorithm is available at github.com/lzhLab/kmcEx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Sequence Alignment , Sequence Analysis, DNA
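The coupled-array idea can be caricatured with a Bloom-style presence array alongside a parallel counter array sharing the same hash positions. This toy stand-in illustrates only the membership/frequency split; the paper's actual encoding, hash functions, and sizing differ, and everything here is invented for illustration.

```python
# Bloom-filter-like store: bits[] answers "was this k-mer counted?"
# (no false negatives, rare false positives); freq[] holds counts at
# the same hash positions, queried with a min to limit overestimation.

def positions(kmer, size, num_hashes=3):
    """num_hashes pseudo-independent slots for a k-mer."""
    return [hash((kmer, i)) % size for i in range(num_hashes)]

class KmerStore:
    def __init__(self, size=1024):
        self.size = size
        self.bits = [0] * size
        self.freq = [0] * size

    def add(self, kmer, count):
        for p in positions(kmer, self.size):
            self.bits[p] = 1
            # keep the smaller value on collisions to limit overestimation
            self.freq[p] = count if self.freq[p] == 0 else min(self.freq[p], count)

    def get(self, kmer):
        pos = positions(kmer, self.size)
        if not all(self.bits[p] for p in pos):
            return None                         # definitely never counted
        return min(self.freq[p] for p in pos)   # may overestimate
```
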
15.
J Magn Reson Imaging ; 51(2): 571-579, 2020 02.
Article in English | MEDLINE | ID: mdl-31276264

ABSTRACT

BACKGROUND: Chronic obstructive pulmonary disease (COPD) is associated with high morbidity and mortality. Identification of imaging biomarkers for phenotyping is necessary for future treatment and therapy monitoring. However, translation of visual analytic pipelines into clinics or their use in large-scale studies is significantly slowed by time-consuming postprocessing steps. PURPOSE: To implement an automated tool chain for regional quantification of pulmonary microvascular blood flow in order to reduce analysis time and user variability. STUDY TYPE: Prospective. POPULATION: In all, 90 MRI scans of 63 patients, of which 31 had COPD with a mean Global Initiative for Chronic Obstructive Lung Disease status of 1.9 ± 0.64 (µ ± σ). FIELD STRENGTH/SEQUENCE: 1.5T dynamic gadolinium-enhanced MRI using 4D dynamic contrast-enhanced (DCE) time-resolved angiography acquired in a single breath-hold in inspiration. [Correction added on August 20, 2019, after first online publication: The field strength in the preceding sentence was corrected.] ASSESSMENT: We built a 3D convolutional neural network for semantic segmentation using 29 manually segmented perfusion maps. All five lobes of the lung are denoted, including the middle lobe. Evaluation was performed on 61 independent cases from two sites of the Multi-Ethnic Study of Atherosclerosis (MESA)-COPD study. We publish our implementation of a model-free deconvolution filter according to Sourbron et al. for 4D DCE MRI scans as open source. STATISTICAL TESTS: Cross-validation 29/61 (# training / # testing), intraclass correlation coefficient (ICC), Spearman ρ, Pearson r, Sørensen-Dice coefficient, and overlap. RESULTS: Segmentations and derived clinical parameters were processed in ~90 seconds per case on a Xeon E5-2637v4 workstation with Tesla P40 GPUs. Clinical parameters and predicted segmentations exhibit high concordance with the ground truth regarding median perfusion for all lobes, with an ICC of 0.99 and a Sørensen-Dice coefficient of 93.4 ± 2.8 (µ ± σ). DATA CONCLUSION: We present a robust end-to-end pipeline that allows for the extraction of perfusion-based biomarkers for all lung lobes in 4D DCE MRI scans by combining model-free deconvolution with deep learning. LEVEL OF EVIDENCE: 3. Technical Efficacy: Stage 2. J. Magn. Reson. Imaging 2020;51:571-579.


Subjects
Atherosclerosis , Pulmonary Disease, Chronic Obstructive , Biomarkers , Humans , Lung/diagnostic imaging , Magnetic Resonance Imaging , Perfusion , Prospective Studies , Pulmonary Disease, Chronic Obstructive/diagnostic imaging , Semantics
16.
BMC Bioinformatics ; 19(1): 92, 2018 03 09.
Article in English | MEDLINE | ID: mdl-29523083

ABSTRACT

BACKGROUND: Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time. RESULTS: We present the succinct hash index - a novel data structure for read mapping which is a variant of the classical q-gram index with a particularly small memory footprint occupying between 3.5 and 5.3 GB for a human reference genome for typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order-of-magnitude faster using a single thread and two orders-of-magnitude faster when using multiple threads. Furthermore, we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3. CONCLUSIONS: The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM .


Subjects
Algorithms , Sequence Analysis, DNA/methods , Base Pairing/genetics , Base Sequence , Computer Simulation , Databases, Genetic , Genome, Human , Humans , Software
17.
Bioinformatics ; 33(23): 3740-3748, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-28961782

ABSTRACT

MOTIVATION: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from either long runtimes, large memory requirements, or low accuracy. RESULTS: We introduce MetaCache, a novel software for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory than the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory, while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data. AVAILABILITY AND IMPLEMENTATION: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache. CONTACT: bertil.schmidt@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Metagenomics/methods , Software , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
18.
Bioinformatics ; 33(9): 1396-1398, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28453677

ABSTRACT

SUMMARY: DNA-based methods to detect and quantify taxon composition in biological materials are often based on species-specific polymerase chain reaction, which is limited to detecting species targeted by the assay. Next-generation sequencing overcomes this drawback by untargeted shotgun sequencing of whole metagenomes at affordable cost. Here we present AFS, a software pipeline for quantification of species composition in food. AFS uses metagenomic shotgun sequencing and sequence read counting to infer species proportions. Using Illumina data from a reference sausage comprising four species, we show that AFS is independent of the sequencing assay and library preparation protocol. Cost-saving short (50-bp) single-end reads and Nextera® library preparation yield reliable results. AVAILABILITY AND IMPLEMENTATION: Datasets, binaries and usage instructions are available at http://all-food-seq.sourceforge.net. Raw data are available at NCBI's SRA with accession number PRJNA271645. CONTACT: hankeln@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Food Microbiology/methods , Metagenomics/methods , Sequence Analysis, DNA/methods , Software , High-Throughput Nucleotide Sequencing/methods
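Read counting for species quantification, the core of the AFS pipeline, reduces to assigning each read to a reference and normalizing the counts. Exact substring search below is a toy stand-in for the real read mapping step, and the species names and sequences are invented for the example.

```python
# Infer species proportions from read counts: each read is credited to
# the first reference containing it, then counts are normalized.

from collections import Counter

def species_proportions(reads, references):
    """Fraction of assignable reads attributed to each reference."""
    counts = Counter()
    for read in reads:
        for name, genome in references.items():
            if read in genome:
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}
```

Real pipelines additionally correct for genome size and sequencing biases; the abstract's key claim is that such count-based proportions are stable across assays and library preparation protocols.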
19.
Nucleic Acids Res ; 44(2): e19, 2016 Jan 29.
Article in English | MEDLINE | ID: mdl-26365234

ABSTRACT

Alternative splicing is an important mechanism in eukaryotes that significantly expands the transcriptome and proteome. It plays an important role in a number of biological processes. Understanding its regulation is hence an important challenge. Recently, increasing evidence has been collected that supports an involvement of intragenic DNA methylation in the regulation of alternative splicing. The exact mechanisms of regulation, however, are largely unknown and speculated to be complex: different methylation profiles might exist, each of which could be associated with a different regulation mechanism. We present a computational technique that is able to determine such stable methylation patterns and allows correlating these patterns with the inclusion propensity of exons. Pattern detection is based on dynamic time warping (DTW) of methylation profiles, a sophisticated similarity measure for signals that can be non-trivially transformed. We design a flexible self-organizing map approach for pattern grouping. Exemplary application on available data sets indicates that stable patterns which correlate non-trivially with exon inclusion do indeed exist. To improve the reliability of these predictions, further studies on larger data sets will be required. We have thus taken great care that our software runs efficiently on modern hardware, so that it can support future studies on large-scale data sets.


Subjects
Alternative Splicing , DNA Methylation , Epigenesis, Genetic , Software , Exons , Humans , Introns , RNA/genetics , RNA/metabolism , Transcriptome
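Dynamic time warping, the similarity measure the pattern detection above is based on, is textbook material; a minimal O(nm) implementation (not the paper's optimized one) looks like this:

```python
# Classic dynamic time warping: align two sequences allowing local
# stretching/compression, with per-step cost |x - y|.

def dtw(a, b):
    """DTW distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                 d[i][j - 1],      # stretch b
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]
```

Because warping absorbs local time shifts, two methylation profiles with the same shape but slightly different positioning score as similar, which is exactly why DTW suits profiles "that can be non-trivially transformed".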
20.
BMC Bioinformatics ; 18(1): 346, 2017 Jul 20.
Article in English | MEDLINE | ID: mdl-28728542

ABSTRACT

BACKGROUND: A precise understanding of structural variants (SVs) in DNA is important in the study of cancer and population diversity. Many methods have been designed to identify SVs from DNA sequencing data. However, the problem remains challenging because existing approaches suffer from low sensitivity, precision, and positional accuracy. Furthermore, many existing tools only identify breakpoints and do not collect related breakpoints and classify them as a particular type of SV. Due to the rapidly increasing usage of high-throughput sequencing technologies in this area, there is an urgent need for algorithms that can accurately classify complex genomic rearrangements (involving more than one breakpoint or fusion). RESULTS: We present CLOVE, an algorithm for integrating the results of multiple breakpoint or SV callers and classifying the results as a particular SV. CLOVE is based on a graph data structure that is created from the breakpoint information. The algorithm looks for patterns in the graph that are characteristic of more complex rearrangement types. CLOVE is able to integrate the results of multiple callers, producing a consensus call. CONCLUSIONS: We demonstrate using simulated and real data that the re-classified SV calls produced by CLOVE improve on the raw call sets of existing SV algorithms, particularly in terms of accuracy. CLOVE is freely available from http://www.github.com/PapenfussLab .


Subjects
Genomics/methods , User-Computer Interface , Algorithms , Chromosomes/chemistry , Chromosomes/metabolism , DNA/chemistry , DNA/metabolism , Escherichia coli/genetics , Humans , Internet , Nucleic Acid Conformation