Results 1 - 20 of 97
1.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36869850

ABSTRACT

Alignment is the cornerstone of many long-read pipelines and plays an essential role in resolving structural variants (SVs). However, forced alignments of SVs embedded in long reads, inflexibility in integrating novel SV models, and computational inefficiency remain problems. Here, we investigate the feasibility of resolving long-read SVs with alignment-free algorithms. We ask: (1) Is it possible to resolve long-read SVs with alignment-free approaches? and (2) Does it provide an advantage over existing approaches? To this end, we implemented a framework named Linear, which can flexibly integrate alignment-free algorithms such as the generative model for long-read SV detection. Furthermore, Linear addresses the problem of compatibility of alignment-free approaches with existing software. It takes long reads as input and outputs standardized results that existing software can directly process. We conducted large-scale assessments in this work, and the results show that Linear outperforms alignment-based pipelines in sensitivity and flexibility. Moreover, it is orders of magnitude faster.


Subject(s)
Human Genome , Software , Humans , Algorithms , Sequence Analysis , Statistical Models , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing
2.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38269626

ABSTRACT

MOTIVATION: The minimizer concept is a data structure for sequence sketching. The standard canonical minimizer selects a subset of k-mers from the given DNA sequence by comparing the forward and reverse k-mers in a window simultaneously according to a predefined selection scheme. It is widely employed in sequence analysis tasks such as read mapping and assembly. k-mer density, k-mer repetitiveness (e.g. k-mer bias), and computational efficiency are three critical measurements for minimizer selection schemes. However, there are trade-offs between the different minimizer variants. Generality, effectiveness, and efficiency are always requirements for high-performance minimizer algorithms. RESULTS: We propose a simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute. Nevertheless, it can improve the k-mer repetitiveness, especially for the lexicographic order. It also applies to other selection schemes based on total orders (e.g. random orders). Moreover, it is computationally efficient and its density is close to that of the standard minimizer. The refined minimizer may benefit high-performance applications like binning and read mapping. AVAILABILITY AND IMPLEMENTATION: The source code of the benchmark in this work is available at the GitHub repository https://github.com/xp3i4/mini_benchmark.
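
The standard canonical minimizer described in this abstract can be illustrated with a minimal sketch. It assumes lexicographic order, a window of `w` consecutive k-mers, and leftmost tie-breaking; the function names are hypothetical and this is not the refined operator proposed in the paper, only the baseline it refines.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def canonical_minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Select one canonical k-mer per window of w consecutive k-mers.

    Assumption: the canonical form of a k-mer is the lexicographically
    smaller of the forward k-mer and its reverse complement.
    """
    canon = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        canon.append(min(kmer, reverse_complement(kmer)))

    selected = []
    for start in range(len(canon) - w + 1):
        window = canon[start:start + w]
        # min() returns the first smallest index, i.e. leftmost tie-breaking.
        offset = min(range(w), key=lambda j: window[j])
        pick = (start + offset, window[offset])
        # Consecutive windows often share a minimizer; record it only once.
        if not selected or selected[-1] != pick:
            selected.append(pick)
    return selected


def density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions that are selected as minimizers."""
    return len(canonical_minimizers(seq, k, w)) / (len(seq) - k + 1)


minimizers = canonical_minimizers("ACGTTGCA", k=3, w=2)
```

A lower density means a smaller sketch, which is one of the three measurements the abstract names for comparing selection schemes.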


Subject(s)
Algorithms , Software , DNA Sequence Analysis , High-Throughput Nucleotide Sequencing
3.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38485699

ABSTRACT

MOTIVATION: Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only a few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS: Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION: Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.


Subject(s)
Algorithms , Software , Sulfites , Sequence Alignment , Proteins
4.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37294786

ABSTRACT

MOTIVATION: Deep learning has moved to the forefront of tandem mass spectrometry-driven proteomics and authentic prediction for peptide fragmentation is more feasible than ever. Still, at this point spectral prediction is mainly used to validate database search results or for confined search spaces. Fully predicted spectral libraries have not yet been efficiently adapted to large search space problems that often occur in metaproteomics or proteogenomics. RESULTS: In this study, we showcase a workflow that uses Prosit for spectral library predictions on two common metaproteomes and implement an indexing and search algorithm, Mistle, to efficiently identify experimental mass spectra within the library. Hence, the workflow emulates a classic protein sequence database search with protein digestion but builds a searchable index from spectral predictions as an in-between step. We compare Mistle to popular search engines, both on a spectral and database search level, and provide evidence that this approach is more accurate than a database search using MSFragger. Mistle outperforms other spectral library search engines in terms of run time and proves to be extremely memory efficient with a 4- to 22-fold decrease in RAM usage. This makes Mistle universally applicable to large search spaces, e.g. covering comprehensive sequence databases of diverse microbiomes. AVAILABILITY AND IMPLEMENTATION: Mistle is freely available on GitHub at https://github.com/BAMeScience/Mistle.


Subject(s)
Peptides , Software , Peptides/metabolism , Search Engine/methods , Proteomics/methods , Algorithms , Tandem Mass Spectrometry/methods , Protein Databases , Peptide Library
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36622018

ABSTRACT

MOTIVATION: Single-cell multimodal assays allow us to simultaneously measure two different molecular features of the same cell, enabling new insights into cellular heterogeneity, cell development and diseases. However, most existing methods suffer from inaccurate dimensionality reduction for the joint-modality data, hindering their discovery of novel or rare cell subpopulations. RESULTS: Here, we present VIMCCA, a computational framework based on variational-assisted multi-view canonical correlation analysis to integrate paired multimodal single-cell data. Our statistical model uses a common latent variable to interpret the common source of variances in two different data modalities. Our approach jointly learns an inference model and two modality-specific non-linear models by leveraging variational inference and deep learning. We perform VIMCCA and compare it with 10 existing state-of-the-art algorithms on four paired multi-modal datasets sequenced by different protocols. Results demonstrate that VIMCCA facilitates integrating various types of joint-modality data, thus leading to more reliable and accurate downstream analysis. VIMCCA improves our ability to identify novel or rare cell subtypes compared to existing widely used methods. Besides, it can also facilitate inferring cell lineage based on joint-modality profiles. AVAILABILITY AND IMPLEMENTATION: The VIMCCA algorithm has been implemented in our toolkit package scbean (≥0.5.0), and its code has been archived at https://github.com/jhu99/scbean under MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Statistical Models , Cell Differentiation , Cell Lineage
6.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.


Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Viral Genome , Humans , Pandemics , SARS-CoV-2/genetics
7.
Bioinformatics ; 38(17): 4100-4108, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801930

ABSTRACT

MOTIVATION: The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. RESULTS: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. AVAILABILITY AND IMPLEMENTATION: https://github.com/seqan/needle. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , DNA Sequence Analysis
8.
BMC Bioinformatics ; 23(1): 18, 2022 Jan 06.
Article in English | MEDLINE | ID: mdl-34991448

ABSTRACT

BACKGROUND: The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson-Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software. RESULTS: We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. CONCLUSIONS: With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.


Subject(s)
RNA , Software , Algorithms , Base Sequence , Humans , Nucleic Acid Conformation , RNA/genetics , Sequence Alignment , RNA Sequence Analysis
9.
Bioinformatics ; 37(21): 3934-3935, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34601556

ABSTRACT

SUMMARY: Bisulfite sequencing data provide value beyond the straightforward methylation assessment by analyzing single-read patterns. Over the past years, various metrics have been established to explore this layer of information. However, limited compatibility with alignment tools, reference genomes or the measurements they provide presents a bottleneck for most groups to routinely perform read-level analysis. To address this, we developed RLM, a fast and scalable tool for the computation of several frequently used read-level methylation statistics. RLM supports standard alignment tools, works independently of the reference genome and handles most sequencing experiment designs. RLM can process large input files with a billion reads in just a few hours on common workstations. AVAILABILITY AND IMPLEMENTATION: https://github.com/sarahet/RLM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Benchmarking , Software , High-Throughput Nucleotide Sequencing , DNA Methylation
10.
Bioinformatics ; 37(3): 426-428, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32717040

ABSTRACT

SUMMARY: RNA-sequencing (RNA-Seq) is the current method of choice for studying bacterial transcriptomes. To date, many computational pipelines have been developed to predict differentially expressed genes from RNA-Seq data, but no gold-standard has been widely accepted. We present the Snakemake-based tool Smart Consensus Of RNA Expression (SCORE) which uses a consensus approach founded on a selection of well-established tools for differential gene expression analysis. This allows SCORE to increase the overall prediction accuracy and to merge varying results into a single, human-readable output. SCORE performs all steps for the analysis of bacterial RNA-Seq data, from read preprocessing to the overrepresentation analysis of significantly associated ontologies. Development of consensus approaches like SCORE will help to streamline future RNA-Seq workflows and will fundamentally contribute to the creation of new gold-standards for the analysis of these types of data. AVAILABILITY AND IMPLEMENTATION: https://github.com/SiWolf/SCORE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Bacteria/genetics , Software , Transcriptome , Consensus , Bacterial Gene Expression Regulation , RNA Sequence Analysis
11.
BMC Genomics ; 22(1): 822, 2021 Nov 14.
Article in English | MEDLINE | ID: mdl-34773979

ABSTRACT

BACKGROUND: We benchmarked sequencing technology and assembly strategies for short-read, long-read, and hybrid assemblers with respect to correctness, contiguity, and completeness of assemblies in genomes of Francisella tularensis. Benchmarking allowed in-depth analyses of genomic structures of the Francisella pathogenicity islands and insertion sequences. Five major high-throughput sequencing technologies were applied, including next-generation "short-read" and third-generation "long-read" sequencing methods. RESULTS: We focused on short-read assemblers, hybrid assemblers, and analysis of the genomic structure with particular emphasis on insertion sequences and the Francisella pathogenicity island. Among eight short-read assembly methods, the A5-miseq pipeline performed best for MiSeq data, Mira for Ion Torrent data, and ABySS for HiSeq data. Two approaches were applied to benchmark long-read and hybrid assembly strategies: long-read-first assembly followed by correction with short reads (Canu/Pilon, Flye/Pilon) and short-read-first assembly along with scaffolding based on long reads (Unicycler, SPAdes). Hybrid assembly can resolve large repetitive regions best with a "long-read first" approach. CONCLUSIONS: Genomic structures of the Francisella pathogenicity islands frequently showed misassembly. Insertion sequences (IS) could be used to perform an evolutionary conservation analysis. A phylogenetic structure of insertion sequences and the evolution within the clades elucidated the clade structure of the highly conservative F. tularensis.


Subject(s)
Francisella tularensis , Bacterial Genome , DNA Transposable Elements , Francisella tularensis/genetics , Genomics , High-Throughput Nucleotide Sequencing , Phylogeny , DNA Sequence Analysis
12.
Bioinformatics ; 36(12): 3687-3692, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32246826

ABSTRACT

MOTIVATION: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. RESULTS: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. AVAILABILITY AND IMPLEMENTATION: GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.
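
The (k, e)-mappability definition given in this abstract can be made concrete with a brute-force sketch. It assumes Hamming distance on the forward strand only and compares every k-mer against every other, which is quadratic and only suitable for toy inputs; it is not GenMap's algorithm, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mappability(genome: str, k: int, e: int) -> list[float]:
    """Brute-force (k, e)-mappability for every k-mer start position.

    The value at position i is 1 / (number of positions whose k-mer matches
    the k-mer at i with at most e mismatches). The k-mer always matches
    itself, so the count is at least 1 and the value lies in (0, 1].
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    values = []
    for query in kmers:
        occurrences = sum(1 for other in kmers if hamming(query, other) <= e)
        values.append(1.0 / occurrences)
    return values


exact = mappability("ACGTACGA", k=4, e=0)
approx = mappability("ACGTACGA", k=4, e=1)
```

In this toy genome every 4-mer is unique when no mismatches are allowed, while allowing one mismatch makes `ACGT` and `ACGA` match each other, so the mappability of both positions drops to 0.5.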


Subject(s)
Genome , Software , Algorithms , Genomics , DNA Sequence Analysis
13.
Bioinformatics ; 36(Suppl_1): i12-i20, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657362

ABSTRACT

MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. RESULTS: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. AVAILABILITY AND IMPLEMENTATION: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Metagenomics , Archaea , DNA Sequence Analysis , Software
14.
PLoS Comput Biol ; 16(5): e1007843, 2020 05.
Article in English | MEDLINE | ID: mdl-32469863

ABSTRACT

Reconstructing haplotypes from sequencing data is one of the major challenges in genetics. Haplotypes play a crucial role in many analyses, including genome-wide association studies and population genetics. Haplotype reconstruction becomes more difficult for higher numbers of homologous chromosomes, as is often the case for polyploid plants. This complexity is compounded further by higher heterozygosity, which denotes the frequent presence of variants between haplotypes. We have designed Ranbow, a new tool for haplotype reconstruction of polyploid genomes from short read sequencing data. Ranbow integrates all types of small variants in bi- and multi-allelic sites to reconstruct haplotypes. To evaluate Ranbow and currently available competing methods on real data, we have created and released a real gold standard dataset from sweet potato sequencing data. Our evaluations on real and simulated data clearly show Ranbow's superior performance in terms of accuracy, haplotype length, memory usage, and running time. Specifically, Ranbow is one order of magnitude faster than the next best method. The efficiency and accuracy of Ranbow make whole genome haplotype reconstruction of complex genomes with higher ploidy feasible.


Subject(s)
Haplotypes , Polyploidy , Algorithms , Datasets as Topic , Heterozygote , Humans
15.
J Proteome Res ; 19(3): 1060-1072, 2020 03 06.
Article in English | MEDLINE | ID: mdl-31975601

ABSTRACT

Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.


Subject(s)
Algorithms , Proteomics , Bayes Theorem , Protein Databases , Proteins , Software
16.
BMC Biotechnol ; 19(1): 40, 2019 06 27.
Article in English | MEDLINE | ID: mdl-31248401

ABSTRACT

BACKGROUND: Natural variations in a genome can drastically alter the CRISPR-Cas9 off-target landscape by creating or removing sites. Despite the resulting potential side-effects from such unaccounted-for sites, current off-target detection pipelines are not equipped to include variant information. To address this, we developed VARiant-aware detection and SCoring of Off-Targets (VARSCOT). RESULTS: VARSCOT identifies only 0.6% of off-targets to be common between 4 individual genomes and the reference, with an average of 82% of off-targets unique to an individual. VARSCOT is the most sensitive detection method for off-targets, finding 40 to 70% more experimentally verified off-targets compared to other popular software tools, and its machine learning model allows for CRISPR-Cas9 concentration-aware off-target activity scoring. CONCLUSIONS: VARSCOT allows researchers to take genomic variation into account when designing individual or population-wide targeting strategies. VARSCOT is available from https://github.com/BauerLab/VARSCOT.


Subject(s)
CRISPR-Cas Systems , Computational Biology/methods , Gene Editing/methods , Gene Targeting/methods , Genomics/methods , Software , Gene Editing/standards , Gene Targeting/standards , Genomics/standards , Internet , Reproducibility of Results
17.
Nat Methods ; 13(9): 741-8, 2016 08 30.
Article in English | MEDLINE | ID: mdl-27575624

ABSTRACT

High-resolution mass spectrometry (MS) has become an important tool in the life sciences, contributing to the diagnosis and understanding of human diseases, elucidating biomolecular structural information and characterizing cellular signaling networks. However, the rapid growth in the volume and complexity of MS data makes transparent, accurate and reproducible analysis difficult. We present OpenMS 2.0 (http://www.openms.de), a robust, open-source, cross-platform software specifically designed for the flexible and reproducible analysis of high-throughput MS data. The extensible OpenMS software implements common mass spectrometric data processing tasks through a well-defined application programming interface in C++ and Python and through standardized open data formats. OpenMS additionally provides a set of 185 tools and ready-made workflows for common mass spectrometric data processing tasks, which enable users to perform complex quantitative mass spectrometric analyses with ease.


Subject(s)
Computational Biology/methods , Electronic Data Processing , Mass Spectrometry/methods , Proteomics/methods , Software , Aging/blood , Blood Proteins/chemistry , Humans , Molecular Sequence Annotation , Proteogenomics/methods , Workflow
18.
Bioinformatics ; 34(20): 3437-3445, 2018 10 15.
Article in English | MEDLINE | ID: mdl-29726911

ABSTRACT

Motivation: Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (single instruction multiple data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we (a) distribute many independent alignments on multiple threads and (b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. Results: We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. Availability and implementation: The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. Supplementary information: Supplementary data are available at Bioinformatics online.
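
The wavefront idea mentioned in this abstract rests on a property of the dynamic programming matrix: every cell on one anti-diagonal depends only on cells of the two previous anti-diagonals. The sketch below is a plain sequential illustration of that traversal order for a global alignment score. The scoring values (match 1, mismatch -1, linear gap -1) and the function name are assumptions, and the code is neither vectorized nor threaded, so it is not the SeqAn module itself.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Global alignment score with linear gap costs, filled by anti-diagonals."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # d = i + j indexes one anti-diagonal. All cells with the same d are
    # mutually independent, which is what makes SIMD or thread-level
    # wavefront parallelism possible in an optimized implementation.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + substitution,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[n][m]


identical = global_alignment_score("ACGT", "ACGT")
one_gap = global_alignment_score("ACGT", "AGT")
```

The inner loop over `i` is the part an optimized implementation would distribute across SIMD lanes or threads, since its iterations never read each other's results.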


Subject(s)
Sequence Alignment , Software , Algorithms
19.
Bioinformatics ; 34(17): i766-i772, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423080

ABSTRACT

Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such an index takes about 1 day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times. Results: To solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of an approximate search distributor via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM framework. Availability and implementation: https://gitlab.com/pirovc/dream_yara/.
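
The interleaved Bloom filter described in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the DREAM implementation: each row is stored as a Python integer whose bit `j` belongs to bin `j`, hashing uses salted BLAKE2b from the standard library, and the class name, the helper `candidate_bins`, and the sizes and thresholds are all hypothetical.

```python
import hashlib


class InterleavedBloomFilter:
    """Several equally sized Bloom filters stored bit-interleaved by bin."""

    def __init__(self, num_bins: int, bits_per_bin: int, num_hashes: int = 2):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        # rows[p] packs bit p of every bin's Bloom filter into one integer.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer: str):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits_per_bin

    def insert(self, kmer: str, bin_index: int) -> None:
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer: str) -> list[int]:
        """Bins that may contain the k-mer (false positives are possible)."""
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]  # one AND answers the query for all bins
        return [b for b in range(self.num_bins) if (mask >> b) & 1]


def candidate_bins(ibf: InterleavedBloomFilter, read: str, k: int,
                   min_hits: int) -> list[int]:
    """Bins in which at least min_hits k-mers of the read may occur."""
    hits = [0] * ibf.num_bins
    for i in range(len(read) - k + 1):
        for b in ibf.query(read[i:i + k]):
            hits[b] += 1
    return [b for b, count in enumerate(hits) if count >= min_hits]


K = 4
references = ["ACGTACGTAC", "TTTTGGGGCC", "GATTACAGAT", "CCCCAAAATT"]
ibf = InterleavedBloomFilter(num_bins=len(references), bits_per_bin=1024)
for bin_index, reference in enumerate(references):
    for i in range(len(reference) - K + 1):
        ibf.insert(reference[i:i + K], bin_index)

bins_for_read = candidate_bins(ibf, "ACGTACG", k=K, min_hits=4)
```

Because a Bloom filter never yields false negatives, the bin that actually contains a read's k-mers is always reported; other bins may appear as false positives, so the filter can only exclude bins, and the remaining ones still need a real search.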


Subject(s)
Factual Databases , Software , Humans , Time Factors
20.
J Infect Dis ; 217(9): 1442-1452, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29099941

ABSTRACT

Spontaneous outbreaks of Clostridium difficile infection (CDI) occur in neonatal piglets, but the predisposing factors are largely not known. To study the conditions for C. difficile colonization and CDI development, 48 neonatal piglets were moved into isolators, fed bovine milk-based formula, and infected with C. difficile 078. Analyses included clinical scoring; measurement of the fecal C. difficile burden, toxin B level, and calprotectin level; and postmortem histopathological analysis of colon specimens. Controls were noninfected suckling piglets. Fecal specimens from suckling piglets, formula-fed piglets, and formula-fed, C. difficile-infected piglets were used for metagenomics analysis. High background levels of C. difficile and toxin were detected in formula-fed piglets prior to infection, while suckling piglets carried about 3-fold less C. difficile, and toxin was not detected. Toxin level in C. difficile-challenged animals correlated positively with C. difficile and calprotectin levels. Postmortem signs of CDI were absent in suckling piglets, whereas mesocolonic edema and gas-filled distal small intestines and ceca, cellular damage, and reduced expression of claudins were associated with animals from the challenge trials. Microbiota in formula-fed piglets was enriched with Escherichia, Shigella, Streptococcus, Enterococcus, and Ruminococcus species. Formula-fed piglets were predisposed to C. difficile colonization earlier as compared to suckling piglets. Infection with a hypervirulent C. difficile ribotype did not aggravate the symptoms of infection. Sow-offspring association and consumption of porcine milk during early life may be crucial for the control of C. difficile expansion in piglets.


Subject(s)
Newborn Animals , Clostridioides difficile/pathogenicity , Clostridium Infections/veterinary , Milk Substitutes , Swine Diseases/microbiology , Animal Feed , Animals , Suckling Animals , Intestinal Diseases/microbiology , Intestinal Diseases/pathology , Intestinal Diseases/veterinary , Intestines/pathology , Swine