Search | BVS Regional Portal

Schulz, Tizian; Parmigiani, Luca; Rempel, Andreas; Stoye, Jens.

Methods Mol Biol ; 2802: 73-106, 2024.

Article in English | MEDLINE | ID: mdl-38819557

ABSTRACT

Computational pangenomics deals with the joint analysis of all genomic sequences of a species. It has already been successfully applied to various tasks in many research areas. Further advances in DNA sequencing technologies constantly let more and more genomic sequences become available for many species, leading to an increasing attractiveness of pangenomic studies. At the same time, larger datasets also pose new challenges for data structures and algorithms that are needed to handle the data. Efficient methods oftentimes make use of the concept of k-mers. Core detection is a common way of analyzing a pangenome. The pangenome's core is defined as the subset of genomic information shared among all individual members. Classically, it is not only determined on the abstract level of genes but can also be described on the sequence level. In this chapter, we provide an overview of k-mer-based methods in the context of pangenomics studies. We first revisit existing software solutions for k-mer counting and k-mer set representation. Afterward, we describe the usage of two k-mer-based approaches, Pangrowth and Corer, for pangenomic core detection.

Subjects

Algorithms; Computational Biology; Genomics; Software; Genomics/methods; Computational Biology/methods; Sequence Analysis, DNA/methods; Humans; High-Throughput Nucleotide Sequencing/methods
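The k-mer counting and sequence-level core definition mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the implementation of Pangrowth, Corer, or any k-mer counter covered in the chapter; the function names `canonical_kmers`, `count_kmers`, and `strict_core` are hypothetical, and treating a k-mer and its reverse complement as one canonical k-mer is an assumption made here because it is common practice.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form of each k-mer in seq.

    Canonical form (an assumption of this sketch): the lexicographically
    smaller of the k-mer and its reverse complement.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, reverse_complement)


def count_kmers(seq, k):
    """Count canonical k-mers of one sequence."""
    return Counter(canonical_kmers(seq, k))


def strict_core(genomes, k):
    """Return the k-mers shared among all genomes (strict core)."""
    kmer_sets = [set(canonical_kmers(genome, k)) for genome in genomes]
    return set.intersection(*kmer_sets) if kmer_sets else set()


if __name__ == "__main__":
    print(count_kmers("ACGT", 2))
    print(strict_core(["ACGTAC", "CGTACG", "ACGAAA"], 3))
```

Holding every k-mer set in memory only works for small inputs; the dedicated data structures the chapter surveys exist precisely because this naive approach does not scale.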

ESKEMAP: exact sketch-based read mapping.

Schulz, Tizian; Medvedev, Paul.

Algorithms Mol Biol ; 19(1): 19, 2024 May 04.

Article in English | MEDLINE | ID: mdl-38704605

ABSTRACT

BACKGROUND: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. RESULTS: In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + ℓ²) time and O(ℓ log ℓ) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read's sketch and ℓ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.
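The running time above depends on how often k-mers from the pattern sketch occur in the sketch of the text. The following Python sketch shows one way to collect those occurrences in time linear in the two sketch sizes. It does not reproduce the dynamic programming algorithm of the paper; the function name `shared_positions` is hypothetical, sketches are assumed to be plain lists of k-mer strings, and counting text positions (rather than pairs of pattern and text positions) is this sketch's reading of the abstract.

```python
def shared_positions(text_sketch, pattern_sketch):
    """Return positions in text_sketch whose k-mer also occurs in pattern_sketch.

    Building the set costs O(|p|) and the scan costs O(|t|), so the occurrence
    list is obtained in O(|t| + |p|) expected time.
    """
    pattern_kmers = set(pattern_sketch)
    return [i for i, kmer in enumerate(text_sketch) if kmer in pattern_kmers]


if __name__ == "__main__":
    text = ["AAC", "CGT", "AAC", "GGA", "TTT"]      # toy reference sketch
    pattern = ["AAC", "GGA", "CCC"]                  # toy read sketch
    hits = shared_positions(text, pattern)
    print(hits, "occurrence count:", len(hits))
```

Any exact mapping step would then operate on this occurrence list only, which is why the quadratic term in the stated bound involves the occurrence count and not the full sketch length.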

Exact Sketch-Based Read Mapping.

Schulz, Tizian; Medvedev, Paul.

Leibniz Int Proc Inform ; 273, 2023 Sep.

Article in English | MEDLINE | ID: mdl-38831964

ABSTRACT

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + ℓ²) time and Θ(ℓ²) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read's sketch and ℓ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.

Sequence-based pangenomic core detection.

Schulz, Tizian; Wittler, Roland; Stoye, Jens.

iScience ; 25(6): 104413, 2022 Jun 17.

Article in English | MEDLINE | ID: mdl-35663029

ABSTRACT

One of the most basic kinds of analysis to be performed on a pangenome is the detection of its core, i.e., the information shared among all members. Pangenomic core detection is classically done on the gene level and many tools focus exclusively on core detection in prokaryotes. Here, we present a new method for sequence-based pangenomic core detection. Our model generalizes from a strict core definition allowing us to flexibly determine suitable core properties depending on the research question and the dataset under consideration. We propose an algorithm based on a colored de Bruijn graph that runs in linear time with respect to the number of k-mers in the graph. An implementation of our method is called Corer. Because of the usage of a colored de Bruijn graph, it works alignment-free, is provided with a small memory footprint, and accepts as input assembled genomes as well as sequencing reads.
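To make the idea of relaxing a strict core concrete, here is a minimal Python sketch that records, for every k-mer, the set of genomes ("colors") containing it and then keeps k-mers seen in at least a chosen number of genomes. This is not Corer's algorithm and it builds no de Bruijn graph: the quorum parameter is only one possible relaxation, assumed here for illustration, and the names `color_kmers` and `quorum_core` are hypothetical.

```python
from collections import defaultdict


def color_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) in which it occurs.

    genomes: dict of genome name -> sequence string.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors


def quorum_core(colors, quorum):
    """Return the k-mers present in at least `quorum` genomes.

    One pass over the k-mers, i.e. linear in the number of distinct k-mers.
    """
    return {kmer for kmer, names in colors.items() if len(names) >= quorum}


if __name__ == "__main__":
    toy = {"a": "ACGTT", "b": "ACGTA", "c": "TCGTA"}
    colors = color_kmers(toy, 3)
    print(quorum_core(colors, 3))  # strict core: shared by all three genomes
    print(quorum_core(colors, 2))  # relaxed core: shared by at least two
```

Setting the quorum to the number of genomes recovers the strict core, so the strict definition is the special case of the relaxed one.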

Multiple Occurrences of a 168-Nucleotide Deletion in SARS-CoV-2 ORF8, Unnoticed by Standard Amplicon Sequencing and Variant Calling Pipelines.

Brandt, David; Simunovic, Marina; Busche, Tobias; Haak, Markus; Belmann, Peter; Jünemann, Sebastian; Schulz, Tizian; Klages, Levin Joe; Vinke, Svenja; Beckstette, Michael; Pohl, Ehmke; Scherer, Christiane; Sczyrba, Alexander; Kalinowski, Jörn.

Viruses ; 13(9), 2021 Sep 18.

Article in English | MEDLINE | ID: mdl-34578452

ABSTRACT

Genomic surveillance of the SARS-CoV-2 pandemic is crucial and mainly achieved by amplicon sequencing protocols. Overlapping tiled-amplicons are generated to establish contiguous SARS-CoV-2 genome sequences, which enable the precise resolution of infection chains and outbreaks. We investigated a SARS-CoV-2 outbreak in a local hospital and used nanopore sequencing with a modified ARTIC protocol employing 1200 bp long amplicons. We detected a long deletion of 168 nucleotides in the ORF8 gene in 76 samples from the hospital outbreak. This deletion is difficult to identify with the classical amplicon sequencing procedures since it removes two amplicon primer-binding sites. We analyzed public SARS-CoV-2 sequences and sequencing read data from ENA and identified the same deletion in over 100 genomes belonging to different lineages of SARS-CoV-2, pointing to a mutation hotspot or to positive selection. In almost all cases, the deletion was not represented in the virus genome sequence after consensus building. Additionally, further database searches point to other deletions in the ORF8 coding region that have never been reported by the standard data analysis pipelines. These findings and the fact that ORF8 is especially prone to deletions make a clear case for the urgent necessity of public availability of the raw data for this and other large deletions that might change the physiology of the virus towards endemism.

Subjects

COVID-19/virology; Genes, Viral; SARS-CoV-2/genetics; Sequence Deletion; Genetic Variation; Humans; Nanopore Sequencing; Open Reading Frames; Sequence Analysis, RNA; Whole Genome Sequencing

Detecting high-scoring local alignments in pangenome graphs.

Schulz, Tizian; Wittler, Roland; Rahmann, Sven; Hach, Faraz; Stoye, Jens.

Bioinformatics ; 37(16): 2266-2274, 2021 Aug 25.

Article in English | MEDLINE | ID: mdl-33532821

ABSTRACT

MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
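For readers unfamiliar with the BLAST score statistics the abstract refers to, the Python sketch below evaluates the standard Karlin-Altschul formulas used by BLAST: the expected number of chance alignments with at least raw score S is E = K · m · n · exp(−λS). Whether the method described above uses exactly this form, and how it sets the search-space size for a graph, is not stated in the abstract. The function names and the parameter values in the usage example are hypothetical.

```python
import math


def bit_score(raw_score, lam, k_param):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k_param)) / math.log(2)


def e_value(raw_score, query_len, search_space_len, lam, k_param):
    """Expected number of chance alignments scoring at least raw_score.

    Karlin-Altschul form: E = K * m * n * exp(-lambda * S), where m is the
    query length and n the size of the searched sequence space.
    """
    return k_param * query_len * search_space_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Hypothetical parameters; real values must be estimated for the
    # scoring scheme and data at hand.
    lam, k_param = 0.3, 0.1
    print(e_value(50, 100, 1_000_000, lam, k_param))
    print(bit_score(50, lam, k_param))
```

An alignment is usually reported as significant when its E-value falls below a chosen cutoff, which is how an exponential-tail model separates homology from spurious hits.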

GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.

Schulz, Tizian; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 19(Suppl 5): 308, 2018 May 08.

Article in English | MEDLINE | ID: mdl-29745835

ABSTRACT

BACKGROUND: Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. RESULTS: We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particularly suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. CONCLUSIONS: By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also a few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.

Subjects

Algorithms; Chromosomes/chemistry; Computer Graphics; High-Throughput Nucleotide Sequencing/methods; Multigene Family; Sequence Analysis, DNA/methods; Animals; Cluster Analysis; Genomics; Humans; Mice
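The team model named in the abstract above can be illustrated with a naive Python sketch. It assumes the usual gene-team definition, which the abstract itself does not spell out: a set of genes is a team if, in every genome, any two of its members are linked by a chain of members whose consecutive distances are at most the threshold, and no larger set has this property. Distances are passed in as functions, so the same code covers positions on a sequence and shortest-path distances in a graph. It ignores gene families and is not the GraphTeams algorithm; the names `delta_components` and `delta_teams` are hypothetical.

```python
def delta_components(vertices, dist, delta):
    """Partition vertices into groups linked by hops of distance <= delta."""
    remaining = set(vertices)
    components = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            near = {v for v in remaining if dist(u, v) <= delta}
            remaining -= near
            component |= near
            stack.extend(near)
        components.append(frozenset(component))
    return components


def delta_teams(vertices, dists, delta):
    """Refine candidate sets until each stays connected under every distance.

    dists: one distance function per genome. Singleton results are dropped.
    """
    teams, todo = set(), [frozenset(vertices)]
    while todo:
        candidate = todo.pop()
        for dist in dists:
            parts = delta_components(candidate, dist, delta)
            if len(parts) > 1:       # candidate falls apart in this genome
                todo.extend(parts)
                break
        else:                        # connected in all genomes
            if len(candidate) > 1:
                teams.add(candidate)
    return teams


if __name__ == "__main__":
    pos_a = {"a": 1, "b": 2, "c": 3, "d": 10, "e": 11}   # toy genome A
    pos_b = {"a": 5, "b": 6, "c": 20, "d": 21, "e": 22}  # toy genome B
    dists = [lambda u, v, p=p: abs(p[u] - p[v]) for p in (pos_a, pos_b)]
    print(delta_teams(pos_a.keys(), dists, delta=2))
```

In the toy data, gene c sits next to a and b in genome A but next to d and e in genome B, so it joins neither team.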
