Results 1 - 20 of 96
1.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38485699

ABSTRACT

MOTIVATION: Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS: Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION: Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.
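Bisulfite treatment converts unmethylated cytosines so that they are read as thymines, which is why ordinary nucleotide matching fails on such reads. A common strategy in bisulfite aligners is to reduce both read and reference to a three-letter alphabet before comparing them. The abstract does not describe how Lambda3 implements its bisulfite mode, so the following is only a generic sketch of that idea; all function names are hypothetical and exact substring matching stands in for a real local alignment.

```python
def reduce_c_to_t(seq: str) -> str:
    """Collapse C into T (reduction used for reads from the forward strand)."""
    return seq.upper().replace("C", "T")


def reduce_g_to_a(seq: str) -> str:
    """Collapse G into A (reduction used for reads from the reverse strand)."""
    return seq.upper().replace("G", "A")


def matches_after_reduction(read: str, reference: str) -> bool:
    """True if the read occurs in the reference once both are C->T reduced.

    Illustrative assumption: exact substring search replaces alignment.
    """
    return reduce_c_to_t(read) in reduce_c_to_t(reference)


reference = "ACGTCCGATCG"
converted_read = "TTGATTG"  # "CCGATCG" with every C read as T

# The converted read no longer occurs in the plain reference,
# but it is found again after both sequences are reduced.
print(converted_read in reference)
print(matches_after_reduction(converted_read, reference))
```

Because methylated cytosines stay as C in the read, a partially converted read such as "TCGATTG" reduces to the same three-letter string and is still found.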


Subject(s)
Algorithms , Software , Sulfites , Sequence Alignment , Proteins
2.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38269626

ABSTRACT

MOTIVATION: The minimizer concept is a data structure for sequence sketching. The standard canonical minimizer selects a subset of k-mers from the given DNA sequence by comparing the forward and reverse k-mers in a window simultaneously according to a predefined selection scheme. It is widely employed in sequence analysis tasks such as read mapping and assembly. k-mer density, k-mer repetitiveness (e.g. k-mer bias), and computational efficiency are three critical measurements for minimizer selection schemes. However, there are trade-offs between the different minimizer variants. High-performance minimizer algorithms are always required to be generic, effective, and efficient. RESULTS: We propose a simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute, yet it can improve the k-mer repetitiveness, especially for the lexicographic order. It also applies to other selection schemes based on total orders (e.g. random orders). Moreover, it is computationally efficient and the density is close to that of the standard minimizer. The refined minimizer may benefit high-performance applications like binning and read mapping. AVAILABILITY AND IMPLEMENTATION: The source code of the benchmark in this work is available at the GitHub repository https://github.com/xp3i4/mini_benchmark.
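For readers unfamiliar with the baseline, the standard canonical minimizer described above can be sketched as follows. This is the textbook scheme with lexicographic order, not the refined operator proposed in the paper (which the abstract does not specify); the function names and the leftmost tie-breaking rule are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_minimizers(seq: str, k: int, w: int):
    """Select, for every window of w consecutive k-mers, the smallest
    canonical k-mer (leftmost on ties). Returns sorted (position, k-mer)."""
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # leftmost minimum
        selected.add((start + offset, window[offset]))
    return sorted(selected)


print(canonical_minimizers("ACGTAC", k=3, w=2))
# [(0, 'ACG'), (1, 'ACG'), (2, 'GTA')]
```

Because each k-mer is replaced by its canonical form, a sequence and its reverse complement yield the same set of selected k-mer strings, which is the property that makes the scheme strand-independent.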


Subject(s)
Algorithms , Software , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing
3.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37294786

ABSTRACT

MOTIVATION: Deep learning has moved to the forefront of tandem mass spectrometry-driven proteomics and authentic prediction for peptide fragmentation is more feasible than ever. Still, at this point spectral prediction is mainly used to validate database search results or for confined search spaces. Fully predicted spectral libraries have not yet been efficiently adapted to large search space problems that often occur in metaproteomics or proteogenomics. RESULTS: In this study, we showcase a workflow that uses Prosit for spectral library predictions on two common metaproteomes and implement an indexing and search algorithm, Mistle, to efficiently identify experimental mass spectra within the library. Hence, the workflow emulates a classic protein sequence database search with protein digestion but builds a searchable index from spectral predictions as an in-between step. We compare Mistle to popular search engines, both on a spectral and database search level, and provide evidence that this approach is more accurate than a database search using MSFragger. Mistle outperforms other spectral library search engines in terms of run time and proves to be extremely memory efficient with a 4- to 22-fold decrease in RAM usage. This makes Mistle universally applicable to large search spaces, e.g. covering comprehensive sequence databases of diverse microbiomes. AVAILABILITY AND IMPLEMENTATION: Mistle is freely available on GitHub at https://github.com/BAMeScience/Mistle.


Subject(s)
Peptides , Software , Peptides/metabolism , Search Engine/methods , Proteomics/methods , Algorithms , Tandem Mass Spectrometry/methods , Databases, Protein , Peptide Library
4.
Genome Biol ; 24(1): 131, 2023 05 31.
Article in English | MEDLINE | ID: mdl-37259161

ABSTRACT

We present a novel data structure for searching sequences in large databases: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it could serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size, and search time while achieving a comparable or better accuracy compared to other state-of-the-art tools. The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129.


Subject(s)
Algorithms , Software
5.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36869850

ABSTRACT

Alignment is the cornerstone of many long-read pipelines and plays an essential role in resolving structural variants (SVs). However, forced alignments of SVs embedded in long reads, the inflexibility of integrating novel SV models, and computational inefficiency remain problems. Here, we investigate the feasibility of resolving long-read SVs with alignment-free algorithms. We ask: (1) Is it possible to resolve long-read SVs with alignment-free approaches? and (2) Does it provide an advantage over existing approaches? To this end, we implemented a framework named Linear, which can flexibly integrate alignment-free algorithms such as the generative model for long-read SV detection. Furthermore, Linear addresses the problem of compatibility of alignment-free approaches with existing software. It takes long reads as input and outputs standardized results that existing software can directly process. We conducted large-scale assessments in this work, and the results show that Linear outperforms alignment-based pipelines in sensitivity and flexibility. Moreover, it is orders of magnitude faster.


Subject(s)
Genome, Human , Software , Humans , Algorithms , Sequence Analysis , Models, Statistical , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing
6.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36622018

ABSTRACT

MOTIVATION: Single-cell multimodal assays allow us to simultaneously measure two different molecular features of the same cell, enabling new insights into cellular heterogeneity, cell development and diseases. However, most existing methods suffer from inaccurate dimensionality reduction for the joint-modality data, hindering their discovery of novel or rare cell subpopulations. RESULTS: Here, we present VIMCCA, a computational framework based on variational-assisted multi-view canonical correlation analysis to integrate paired multimodal single-cell data. Our statistical model uses a common latent variable to interpret the common source of variances in two different data modalities. Our approach jointly learns an inference model and two modality-specific non-linear models by leveraging variational inference and deep learning. We perform VIMCCA and compare it with 10 existing state-of-the-art algorithms on four paired multi-modal datasets sequenced by different protocols. Results demonstrate that VIMCCA facilitates integrating various types of joint-modality data, thus leading to more reliable and accurate downstream analysis. VIMCCA improves our ability to identify novel or rare cell subtypes compared to existing widely used methods. Besides, it can also facilitate inferring cell lineage based on joint-modality profiles. AVAILABILITY AND IMPLEMENTATION: The VIMCCA algorithm has been implemented in our toolkit package scbean (≥0.5.0), and its code has been archived at https://github.com/jhu99/scbean under MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Models, Statistical , Cell Differentiation , Cell Lineage
7.
Bioinformatics ; 38(17): 4100-4108, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801930

ABSTRACT

MOTIVATION: The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. RESULTS: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. AVAILABILITY AND IMPLEMENTATION: https://github.com/seqan/needle. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
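The interleaved Bloom filter mentioned above can be pictured as several Bloom filters (one per bin, e.g. per experiment) whose bits are interleaved, so that one hash lookup returns the membership bits of all bins at once. The sketch below shows only that basic idea under simplifying assumptions: Python integers serve as bit-vectors, `hashlib.blake2b` with a per-seed salt provides the hash functions, and the class and parameter names are hypothetical. It does not reproduce Needle's expression-level layering or its actual implementation.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one bit-vector row per hash slot,
    one bit per bin inside each row."""

    def __init__(self, bins: int, slots: int, hashes: int = 2):
        self.bins = bins
        self.slots = slots
        self.hashes = hashes
        self.rows = [0] * slots  # rows[s] holds one bit for every bin

    def _positions(self, kmer: str):
        # One slot index per hash function, derived from a salted digest.
        for seed in range(self.hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.slots

    def insert(self, kmer: str, bin_index: int) -> None:
        """Set the bit of bin_index in every row the k-mer hashes to."""
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer: str):
        """Bins that may contain the k-mer (false positives are possible)."""
        result = (1 << self.bins) - 1
        for pos in self._positions(kmer):
            result &= self.rows[pos]  # AND the rows of all hash functions
        return [b for b in range(self.bins) if (result >> b) & 1]


ibf = InterleavedBloomFilter(bins=4, slots=1024)
ibf.insert("ACGTA", 0)
ibf.insert("ACGTA", 3)
ibf.insert("TTGCA", 2)
print(ibf.query("ACGTA"))  # contains at least bins 0 and 3
```

As with any Bloom filter, a query never misses a bin into which the k-mer was inserted, but it may report additional bins when hash positions collide.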


Subject(s)
Algorithms , Software , Sequence Analysis, DNA
8.
BMC Bioinformatics ; 23(1): 18, 2022 Jan 06.
Article in English | MEDLINE | ID: mdl-34991448

ABSTRACT

BACKGROUND: The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson-Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software. RESULTS: We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. CONCLUSIONS: With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.


Subject(s)
RNA , Software , Algorithms , Base Sequence , Humans , Nucleic Acid Conformation , RNA/genetics , Sequence Alignment , Sequence Analysis, RNA
10.
BMC Genomics ; 22(1): 822, 2021 Nov 14.
Article in English | MEDLINE | ID: mdl-34773979

ABSTRACT

BACKGROUND: We benchmarked sequencing technology and assembly strategies for short-read, long-read, and hybrid assemblers with respect to correctness, contiguity, and completeness of assemblies in genomes of Francisella tularensis. Benchmarking allowed in-depth analyses of genomic structures of the Francisella pathogenicity islands and insertion sequences. Five major high-throughput sequencing technologies were applied, including next-generation "short-read" and third-generation "long-read" sequencing methods. RESULTS: We focused on short-read assemblers, hybrid assemblers, and analysis of the genomic structure with particular emphasis on insertion sequences and the Francisella pathogenicity island. Among eight short-read assembly methods, the A5-miseq pipeline performed best for MiSeq data, Mira for Ion Torrent data, and ABySS for HiSeq data. Two approaches were applied to benchmark long-read and hybrid assembly strategies: long-read-first assembly followed by correction with short reads (Canu/Pilon, Flye/Pilon) and short-read-first assembly along with scaffolding based on long reads (Unicycler, SPAdes). Hybrid assembly can resolve large repetitive regions best with a "long-read first" approach. CONCLUSIONS: Genomic structures of the Francisella pathogenicity islands frequently showed misassembly. Insertion sequences (IS) could be used to perform an evolutionary conservation analysis. A phylogenetic structure of insertion sequences and the evolution within the clades elucidated the clade structure of the highly conservative F. tularensis.


Subject(s)
Francisella tularensis , Genome, Bacterial , DNA Transposable Elements , Francisella tularensis/genetics , Genomics , High-Throughput Nucleotide Sequencing , Phylogeny , Sequence Analysis, DNA
11.
Datenbank Spektrum ; 21(3): 255-260, 2021.
Article in English | MEDLINE | ID: mdl-34786019

ABSTRACT

Today's scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.

12.
Bioinformatics ; 37(21): 3934-3935, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34601556

ABSTRACT

SUMMARY: Bisulfite sequencing data provide value beyond the straightforward methylation assessment by analyzing single-read patterns. Over the past years, various metrics have been established to explore this layer of information. However, limited compatibility with alignment tools, reference genomes or the measurements they provide present a bottleneck for most groups to routinely perform read-level analysis. To address this, we developed RLM, a fast and scalable tool for the computation of several frequently used read-level methylation statistics. RLM supports standard alignment tools, works independently of the reference genome and handles most sequencing experiment designs. RLM can process large input files with a billion reads in just a few hours on common workstations. AVAILABILITY AND IMPLEMENTATION: https://github.com/sarahet/RLM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
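To make "read-level" analysis concrete: instead of averaging methylation per CpG site across reads, each read is examined for its own pattern of methylated and unmethylated CpGs. The abstract does not list which statistics RLM computes, so the two measures below (per-read methylation fraction and the share of reads with a mixed pattern) are illustrative choices only. The input encoding, a string of "1" (methylated) and "0" (unmethylated) per read, and the function names are likewise assumptions.

```python
def read_methylation_fraction(states: str) -> float:
    """Fraction of methylated CpGs ('1') within a single read."""
    return states.count("1") / len(states)


def discordant_read_fraction(reads) -> float:
    """Share of reads carrying both methylated and unmethylated CpGs."""
    discordant = sum(1 for r in reads if 0 < r.count("1") < len(r))
    return discordant / len(reads)


# Four hypothetical reads covering the same four CpG sites.
reads = ["1111", "0000", "1100", "1010"]

print([read_methylation_fraction(r) for r in reads])  # [1.0, 0.0, 0.5, 0.5]
print(discordant_read_fraction(reads))                # 0.5
```

In this toy example the average methylation is 50%, yet half of the reads are fully concordant and half are mixed, which is exactly the kind of information a per-site average hides.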


Subject(s)
Benchmarking , Software , High-Throughput Nucleotide Sequencing , DNA Methylation
13.
iScience ; 24(7): 102782, 2021 Jul 23.
Article in English | MEDLINE | ID: mdl-34337360

ABSTRACT

We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.

14.
Nanomaterials (Basel) ; 11(4)2021 Mar 30.
Article in English | MEDLINE | ID: mdl-33808372

ABSTRACT

Engineered nanomaterials are potentially very useful for a variety of applications, but studies are needed to ascertain whether these materials pose a risk to human health. Here, we studied three benchmark nanomaterials (Ag nanoparticles, TiO2 nanoparticles, and multi-walled carbon nanotubes, MWCNTs) procured from the nanomaterial repository at the Joint Research Centre of the European Commission. Having established a sub-lethal concentration of these materials using two human cell lines representative of the immune system and the lungs, respectively, we performed RNA sequencing of the macrophage-like cell line after exposure for 6, 12, and 24 h. Downstream analysis of the transcriptomics data revealed significant effects on chemokine signaling pathways. CCR2 was identified as the most significantly upregulated gene in MWCNT-exposed cells. Using multiplex assays to evaluate cytokine and chemokine secretion, we could show significant effects of MWCNTs on several chemokines, including CCL2, a ligand of CCR2. The results demonstrate the importance of evaluating sub-lethal concentrations of nanomaterials in relevant target cells.

15.
Bioinformatics ; 37(3): 426-428, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32717040

ABSTRACT

SUMMARY: RNA-sequencing (RNA-Seq) is the current method of choice for studying bacterial transcriptomes. To date, many computational pipelines have been developed to predict differentially expressed genes from RNA-Seq data, but no gold-standard has been widely accepted. We present the Snakemake-based tool Smart Consensus Of RNA Expression (SCORE) which uses a consensus approach founded on a selection of well-established tools for differential gene expression analysis. This allows SCORE to increase the overall prediction accuracy and to merge varying results into a single, human-readable output. SCORE performs all steps for the analysis of bacterial RNA-Seq data, from read preprocessing to the overrepresentation analysis of significantly associated ontologies. Development of consensus approaches like SCORE will help to streamline future RNA-Seq workflows and will fundamentally contribute to the creation of new gold-standards for the analysis of these types of data. AVAILABILITY AND IMPLEMENTATION: https://github.com/SiWolf/SCORE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
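A consensus over several differential-expression tools can be as simple as a vote: a gene is reported only if enough tools call it. The abstract does not state SCORE's actual decision rule, so the sketch below shows a plain support threshold as one possible reading; the tool labels, gene names, and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def consensus_genes(calls_by_tool, min_support: int):
    """Genes called as differentially expressed by at least
    min_support of the tools, in sorted order."""
    votes = Counter(
        gene for genes in calls_by_tool.values() for gene in set(genes)
    )
    return sorted(gene for gene, n in votes.items() if n >= min_support)


# Hypothetical per-tool result sets.
calls = {
    "tool_a": {"geneA", "geneB"},
    "tool_b": {"geneB", "geneC"},
    "tool_c": {"geneA", "geneB"},
}

print(consensus_genes(calls, min_support=2))  # ['geneA', 'geneB']
```

Raising `min_support` trades sensitivity for precision: with `min_support=3` only `geneB`, the gene every tool agrees on, remains.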


Subject(s)
Bacteria/genetics , Software , Transcriptome , Consensus , Gene Expression Regulation, Bacterial , Sequence Analysis, RNA
16.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.


Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Genome, Viral , Humans , Pandemics , SARS-CoV-2/genetics
17.
ISME J ; 14(11): 2783-2793, 2020 11.
Article in English | MEDLINE | ID: mdl-32747713

ABSTRACT

Despite a well-documented effect of high dietary zinc oxide on the pig intestinal microbiota composition, less is known about changes in microbial functional properties or the effect of organic zinc sources. Forty weaning piglets in four groups were fed diets supplemented with 40 or 110 ppm zinc as zinc oxide, 110 ppm as Zn-Lysinate, or 2500 ppm as zinc oxide. Host zinc homeostasis, intestinal zinc fractions, and ileal nutrient digestibility were determined as main nutritional and physiological factors putatively driving colon microbial ecology. Metagenomic sequencing of colon microbiota revealed clear differences at genus level only for the group receiving 2500 ppm zinc oxide. However, a clear group differentiation according to dietary zinc concentration and source was observed at species level. Functional analysis revealed significant differences in genes related to stress response, mineral, and carbohydrate metabolism. Taxonomic and functional gene differences were accompanied by clear effects on microbial metabolite concentrations. Finally, a selection of certain antibiotic resistance genes by dietary zinc was observed. This study sheds further light onto the consequences of concentration and chemical form of dietary zinc on microbial ecology measures and the resistome in the porcine colon.


Subject(s)
Anti-Bacterial Agents , Zinc , Animal Feed/analysis , Animals , Colon , Diet , Drug Resistance, Microbial , Swine , Weaning
18.
Bioinformatics ; 36(Suppl_1): i12-i20, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657362

ABSTRACT

MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. RESULTS: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. AVAILABILITY AND IMPLEMENTATION: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Metagenomics , Archaea , Sequence Analysis, DNA , Software
19.
Front Microbiol ; 11: 636, 2020.
Article in English | MEDLINE | ID: mdl-32457701

ABSTRACT

Zoonotic pathogens that can be transmitted via food to humans have a high potential for large-scale emergencies, comprising severe effects on public health, critical infrastructures, and the economy. In this context, the development of laboratory methods to rapidly detect zoonotic bacteria in the food supply chain, including high-resolution mass spectrometry proteotyping, is needed. In this work, an optimized sample preparation method for liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteome profiling was established for Francisella isolates, and a cluster analysis, as well as a phylogenetic tree, was generated to shed light on evolutionary relationships. Furthermore, this method was applied to tissues of infected hare carcasses from Germany. Even though the non-informative data outnumbered the information on the zoonotic pathogen in the resulting proteome profiles many times over, the standardized evaluation of MS data within an established automated analysis pipeline identified Francisella (F.) tularensis and thus could, in principle, be an applicable method to monitor food supply chains.

20.
PLoS Comput Biol ; 16(5): e1007843, 2020 05.
Article in English | MEDLINE | ID: mdl-32469863

ABSTRACT

Reconstructing haplotypes from sequencing data is one of the major challenges in genetics. Haplotypes play a crucial role in many analyses, including genome-wide association studies and population genetics. Haplotype reconstruction becomes more difficult for higher numbers of homologous chromosomes, as is often the case for polyploid plants. This complexity is compounded further by higher heterozygosity, which denotes the frequent presence of variants between haplotypes. We have designed Ranbow, a new tool for haplotype reconstruction of polyploid genomes from short-read sequencing data. Ranbow integrates all types of small variants in bi- and multi-allelic sites to reconstruct haplotypes. To evaluate Ranbow and currently available competing methods on real data, we have created and released a real gold standard dataset from sweet potato sequencing data. Our evaluations on real and simulated data clearly show Ranbow's superior performance in terms of accuracy, haplotype length, memory usage, and running time. Specifically, Ranbow is one order of magnitude faster than the next best method. The efficiency and accuracy of Ranbow make whole-genome haplotype reconstruction of complex genomes with higher ploidy feasible.


Subject(s)
Haplotypes , Polyploidy , Algorithms , Datasets as Topic , Heterozygote , Humans