Search | BVS Aleitamento Materno

1.

HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets.

Kirsch-Gerweck, Benedikt; Bohnenkämper, Leonard; Henrichs, Michel T; Alanko, Jarno N; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Burger, Joachim; Stoye, Jens; Diekmann, Yoan.

Mol Biol Evol ; 40(3)2023 03 04.

Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.
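The positional Burrows-Wheeler transform mentioned in the abstract can be illustrated with a minimal sketch. It assumes biallelic haplotypes encoded as lists of 0/1 and only builds the positional prefix order (haplotypes sorted by their reversed prefixes); the function name `pbwt_order` is hypothetical, and this is not the HaploBlocks implementation.

```python
from typing import List, Sequence


def pbwt_order(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Return haplotype indices sorted by reversed prefix after the last site.

    Assumption: every haplotype has the same length and contains only 0/1.
    """
    order = list(range(len(haplotypes)))
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for site in range(n_sites):
        # Stable partition: haplotypes carrying 0 at this site come first,
        # and the relative order inside each group is preserved.
        zeros = [h for h in order if haplotypes[h][site] == 0]
        ones = [h for h in order if haplotypes[h][site] == 1]
        order = zeros + ones
    return order


if __name__ == "__main__":
    haps = [[0, 1], [1, 0], [0, 0]]
    print(pbwt_order(haps))
```

In this ordering, haplotypes that share a long match ending at the current site sit next to each other, which is what makes haplotype pattern matching on large panels efficient.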

Subjects

Population Genetics , Metagenomics , Genomics , Genome , Haplotypes

2.

Detecting high-scoring local alignments in pangenome graphs.

Schulz, Tizian; Wittler, Roland; Rahmann, Sven; Hach, Faraz; Stoye, Jens.

Bioinformatics ; 37(16): 2266-2274, 2021 Aug 25.

Article in English | MEDLINE | ID: mdl-33532821

ABSTRACT

MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.

Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine.

Sundermann, Linda K; Wintersinger, Jeff; Rätsch, Gunnar; Stoye, Jens; Morris, Quaid.

PLoS Comput Biol ; 17(1): e1008400, 2021 01.

Article in English | MEDLINE | ID: mdl-33465079

ABSTRACT

Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. 
We show through extensive noise-free simulation, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one Gb of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine.

Subjects

Algorithms , Computational Biology/methods , Mutation/genetics , Neoplasms , Computer Simulation , Molecular Evolution , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/classification , Neoplasms/genetics , Whole Genome Sequencing

4.

Horizontal Gene Transfer Phylogenetics: A Random Walk Approach.

Sevillya, Gur; Doerr, Daniel; Lerner, Yael; Stoye, Jens; Steel, Mike; Snir, Sagi.

Mol Biol Evol ; 37(5): 1470-1479, 2020 05 01.

Article in English | MEDLINE | ID: mdl-31845962

ABSTRACT

The dramatic decrease in time and cost for generating genetic sequence data has opened up vast opportunities in molecular systematics, one of which is the ability to decipher the evolutionary history of strains of a species. Under this fine systematic resolution, the standard markers are too crude to provide a phylogenetic signal. Nevertheless, among prokaryotes, genome dynamics in the form of horizontal gene transfer (HGT) between organisms and gene loss seem to provide far richer information by affecting both gene order and gene content. The "synteny index" (SI) between a pair of genomes combines these latter two factors, allowing comparison of genomes with unequal gene content, together with order considerations of their common genes. Although this approach is useful for classifying close relatives, no rigorous statistical modeling for it has been suggested. Such modeling is valuable, as it allows observed measures to be transformed into estimates of time periods during evolution, yielding the "additivity" of the measure. To the best of our knowledge, there is no other additivity proof for other gene order/content measures under HGT. Here, we provide a first statistical model and analysis for the SI measure. We model the "gene neighborhood" as a "birth-death-immigration" process affected by the HGT activity over the genome, and analytically relate the HGT rate and time to the expected SI. This model is asymptotic and thus provides accurate results, assuming infinite size genomes. Therefore, we also developed a heuristic model following an "exponential decay" function, accounting for biologically realistic values, which performed well in simulations. Applying this model to 1,133 prokaryotes partitioned to 39 clusters by the rank of genus yields that the average number of genome dynamics events per gene in the phylogenetic depth of genus is around half with significant variability between genera. 
This result extends and confirms similar results obtained for individual genera in different manners.

Subjects

Horizontal Gene Transfer , Genetic Techniques , Genetic Models , Synteny , Microbial Genome , Phylogeny

5.

Whole-genome sequence of the bovine blood fluke Schistosoma bovis supports interspecific hybridization with S. haematobium.

Oey, Harald; Zakrzewski, Martha; Gravermann, Kerstin; Young, Neil D; Korhonen, Pasi K; Gobert, Geoffrey N; Nawaratna, Sujeevi; Hasan, Shihab; Martínez, David M; You, Hong; Lavin, Martin; Jones, Malcolm K; Ragan, Mark A; Stoye, Jens; Oleaga, Ana; Emery, Aidan M; Webster, Bonnie L; Rollinson, David; Gasser, Robin B; McManus, Donald P; Krause, Lutz.

PLoS Pathog ; 15(1): e1007513, 2019 01.

Article in English | MEDLINE | ID: mdl-30673782

ABSTRACT

Mesenteric infection by the parasitic blood fluke Schistosoma bovis is a common veterinary problem in Africa and the Middle East and occasionally in the Mediterranean Region. The species also has the ability to form interspecific hybrids with the human parasite S. haematobium with natural hybridisation observed in West Africa, presenting possible zoonotic transmission. Additionally, this exchange of alleles between species may dramatically influence disease dynamics and parasite evolution. We have generated a 374 Mb assembly of the S. bovis genome using Illumina and PacBio-based technologies. Despite infecting different hosts and organs, the genome sequences of S. bovis and S. haematobium appeared strikingly similar with 97% sequence identity. The two species share 98% of protein-coding genes, with an average sequence identity of 97.3% at the amino acid level. Genome comparison identified large continuous parts of the genome (up to several 100 kb) showing almost 100% sequence identity between S. bovis and S. haematobium. It is unlikely that this is a result of genome conservation and provides further evidence of natural interspecific hybridization between S. bovis and S. haematobium. Our results suggest that foreign DNA obtained by interspecific hybridization was maintained in the population through multiple meiosis cycles and that hybrids were sexually reproductive, producing viable offspring. The S. bovis genome assembly forms a highly valuable resource for studying schistosome evolution and exploring genetic regions that are associated with species-specific phenotypic traits.

Subjects

Genetic Hybridization/genetics , Schistosoma/genetics , Africa , Western Africa , Animals , Base Sequence/genetics , Cattle , Chromosome Mapping/methods , DNA/genetics , Genome/genetics , Mitochondrial Genome/genetics , Genetic Hybridization/physiology , Middle East , Phylogeny , Proteome/genetics , Species Specificity , Trematoda/genetics , Whole Genome Sequencing/methods

6.

Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants.

Rubert, Diego P; Martinez, Fábio V; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 21(Suppl 2): 273, 2020 Apr 16.

Article in English | MEDLINE | ID: mdl-32299356

ABSTRACT

BACKGROUND: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants. RESULTS: Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. CONCLUSIONS: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.

Subjects

Chromosome Mapping/methods , Genomics/methods , Magnoliopsida/genetics , Computer Simulation , Molecular Evolution , Gene Order , Plant Genome , Genetic Models , Phylogeny , Synteny/genetics

7.

Computing the family-free DCJ similarity.

Rubert, Diego P; Hoshino, Edna A; Braga, Marília D V; Stoye, Jens; Martinez, Fábio V.

BMC Bioinformatics ; 19(Suppl 6): 152, 2018 05 08.

Article in English | MEDLINE | ID: mdl-29745861

ABSTRACT

BACKGROUND: The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity. RESULTS: We propose an exact ILP algorithm to solve the family-free DCJ similarity problem, then we show its APX-hardness and present four combinatorial heuristics with computational experiments comparing their results to the ILP. CONCLUSIONS: We show that the family-free DCJ similarity can be computed in reasonable time, although for larger genomes it is necessary to resort to heuristics. This provides a basis for further studies on the applicability and model refinement of family-free whole genome similarity measures.

Subjects

Genetic Models , Phylogeny , Algorithms , Animals , Computer Simulation , Genetic Databases , Genome , Genomics , Heuristics , Humans , Mice , Rats

8.

GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.

Schulz, Tizian; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 19(Suppl 5): 308, 2018 May 08.

Article in English | MEDLINE | ID: mdl-29745835

ABSTRACT

BACKGROUND: Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. RESULTS: We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particularly suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. CONCLUSIONS: By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also a few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.

Subjects

Algorithms , Chromosomes/chemistry , Computer Graphics , High-Throughput Nucleotide Sequencing/methods , Multigene Family , DNA Sequence Analysis/methods , Animals , Cluster Analysis , Genomics , Humans , Mice

9.

Finding approximate gene clusters with Gecko 3.

Winter, Sascha; Jahn, Katharina; Wehner, Stefanie; Kuchenbecker, Leon; Marz, Manja; Stoye, Jens; Böcker, Sebastian.

Nucleic Acids Res ; 44(20): 9600-9610, 2016 Nov 16.

Article in English | MEDLINE | ID: mdl-27679480

ABSTRACT

Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min.

Subjects

Computational Biology/methods , Genomics/methods , Multigene Family , Software , Algorithms , Datasets as Topic , Bacterial Genes , Bacterial Genome , Statistical Models , Web Browser , Workflow

10.

Identification of the CIMP-like subtype and aberrant methylation of members of the chromosomal segregation and spindle assembly pathways in esophageal adenocarcinoma.

Krause, Lutz; Nones, Katia; Loffler, Kelly A; Nancarrow, Derek; Oey, Harald; Tang, Yue Hang; Wayte, Nicola J; Patch, Ann Marie; Patel, Kalpana; Brosda, Sandra; Manning, Suzanne; Lampe, Guy; Clouston, Andrew; Thomas, Janine; Stoye, Jens; Hussey, Damian J; Watson, David I; Lord, Reginald V; Phillips, Wayne A; Gotley, David; Smithers, B Mark; Whiteman, David C; Hayward, Nicholas K; Grimmond, Sean M; Waddell, Nicola; Barbour, Andrew P.

Carcinogenesis ; 37(4): 356-65, 2016 Apr.

Article in English | MEDLINE | ID: mdl-26905591

ABSTRACT

The incidence of esophageal adenocarcinoma (EAC) has risen significantly over recent decades. Although survival has improved, cure rates remain poor, with <20% of patients surviving 5 years. This is the first study to explore methylome, transcriptome and ENCODE data to characterize the role of methylation in EAC. We investigate the genome-wide methylation profile of 250 samples including 125 EAC, 19 Barrett's esophagus (BE), 85 squamous esophagus and 21 normal stomach. Transcriptome data of 70 samples (48 EAC, 4 BE and 18 squamous esophagus) were used to identify changes in methylation associated with gene expression. BE and EAC showed similar methylation profiles, which differed from squamous tissue. Hypermethylated sites in EAC and BE were mainly located in CpG-rich promoters. A total of 18575 CpG sites associated with 5538 genes were differentially methylated, 63% of these genes showed significant correlation between methylation and mRNA expression levels. Pathways involved in tumorigenesis including cell adhesion, TGF and WNT signaling showed enrichment for genes aberrantly methylated. Genes involved in chromosomal segregation and spindle formation were aberrantly methylated. Given the recent evidence that chromothripsis may be a driver mechanism in EAC, the role of epigenetic perturbation of these pathways should be further investigated. The methylation profiles revealed two EAC subtypes, one associated with widespread CpG island hypermethylation overlapping H3K27me3 marks and binding sites of the Polycomb proteins. These subtypes were supported by an independent set of 89 esophageal cancer samples. The most hypermethylated tumors showed worse patient survival.

Subjects

Adenocarcinoma/genetics , Chromosome Segregation , DNA Methylation , Esophageal Neoplasms/genetics , Spindle Apparatus , Adenocarcinoma/pathology , Esophageal Neoplasms/pathology , Humans

11.

BiPACE 2D--graph-based multiple alignment for comprehensive 2D gas chromatography-mass spectrometry.

Hoffmann, Nils; Wilhelm, Mathias; Doebbe, Anja; Niehaus, Karsten; Stoye, Jens.

Bioinformatics ; 30(7): 988-95, 2014 Apr 01.

Article in English | MEDLINE | ID: mdl-24363380

ABSTRACT

MOTIVATION: Comprehensive 2D gas chromatography-mass spectrometry is an established method for the analysis of complex mixtures in analytical chemistry and metabolomics. It produces large amounts of data that require semiautomatic, but preferably automatic handling. This involves the location of significant signals (peaks) and their matching and alignment across different measurements. To date, there exist only a few openly available algorithms for the retention time alignment of peaks originating from such experiments that scale well with increasing sample and peak numbers, while providing reliable alignment results. RESULTS: We describe BiPACE 2D, an automated algorithm for retention time alignment of peaks from 2D gas chromatography-mass spectrometry experiments and evaluate it on three previously published datasets against the mSPA, SWPA and Guineu algorithms. We also provide a fourth dataset from an experiment studying the H2 production of two different strains of Chlamydomonas reinhardtii that is available from the MetaboLights database together with the experimental protocol, peak-detection results and manually curated multiple peak alignment for future comparability with newly developed algorithms. AVAILABILITY AND IMPLEMENTATION: BiPACE 2D is contained in the freely available Maltcms framework, version 1.3, hosted at http://maltcms.sf.net, under the terms of the L-GPL v3 or Eclipse Open Source licenses. The software used for the evaluation along with the underlying datasets is available at the same location. The C.reinhardtii dataset is freely available at http://www.ebi.ac.uk/metabolights/MTBLS37.

Subjects

Gas Chromatography-Mass Spectrometry/methods , Metabolomics/methods , Algorithms , Animals , Laboratory Automation , Chlamydomonas reinhardtii/chemistry , Software

12.

ReadXplorer--visualization and analysis of mapped sequences.

Hilker, Rolf; Stadermann, Kai Bernd; Doppmeier, Daniel; Kalinowski, Jörn; Stoye, Jens; Straube, Jasmin; Winnebald, Jörn; Goesmann, Alexander.

Bioinformatics ; 30(16): 2247-54, 2014 Aug 15.

Article in English | MEDLINE | ID: mdl-24790157

ABSTRACT

MOTIVATION: Fast algorithms and well-arranged visualizations are required for the comprehensive analysis of the ever-growing size of genomic and transcriptomic next-generation sequencing data. RESULTS: ReadXplorer is a software offering straightforward visualization and extensive analysis functions for genomic and transcriptomic DNA sequences mapped on a reference. A unique specialty of ReadXplorer is the quality classification of the read mappings. It is incorporated in all analysis functions and displayed in ReadXplorer's various synchronized data viewers for (i) the reference sequence, its base coverage as (ii) normalizable plot and (iii) histogram, (iv) read alignments and (v) read pairs. ReadXplorer's analysis capability covers RNA secondary structure prediction, single nucleotide polymorphism and deletion-insertion polymorphism detection, genomic feature and general coverage analysis. Especially for RNA-Seq data, it offers differential gene expression analysis, transcription start site and operon detection as well as RPKM value and read count calculations. Furthermore, ReadXplorer can combine or superimpose coverage of different datasets. AVAILABILITY AND IMPLEMENTATION: ReadXplorer is available as open-source software at http://www.readxplorer.org along with a detailed manual.

Subjects

High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Computer Graphics , Gene Expression Profiling , Genomics/methods , Operon , Genetic Polymorphism , Single Nucleotide Polymorphism , RNA/chemistry , Transcription Initiation Site

13.

Identifying gene clusters by discovering common intervals in indeterminate strings.

Doerr, Daniel; Stoye, Jens; Böcker, Sebastian; Jahn, Katharina.

BMC Genomics ; 15 Suppl 6: S2, 2014.

Article in English | MEDLINE | ID: mdl-25571793

ABSTRACT

BACKGROUND: Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction. RESULTS: In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes. CONCLUSIONS: Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.

Subjects

Genetic Models , Multigene Family , Algorithms , Datasets as Topic , Bacterial Genome

14.

Methods for Pangenomic Core Detection.

Schulz, Tizian; Parmigiani, Luca; Rempel, Andreas; Stoye, Jens.

Methods Mol Biol ; 2802: 73-106, 2024.

Article in English | MEDLINE | ID: mdl-38819557

ABSTRACT

Computational pangenomics deals with the joint analysis of all genomic sequences of a species. It has already been successfully applied to various tasks in many research areas. Further advances in DNA sequencing technologies constantly let more and more genomic sequences become available for many species, leading to an increasing attractiveness of pangenomic studies. At the same time, larger datasets also pose new challenges for data structures and algorithms that are needed to handle the data. Efficient methods oftentimes make use of the concept of k-mers. Core detection is a common way of analyzing a pangenome. The pangenome's core is defined as the subset of genomic information shared among all individual members. Classically, it is not only determined on the abstract level of genes but can also be described on the sequence level. In this chapter, we provide an overview of k-mer-based methods in the context of pangenomics studies. We first revisit existing software solutions for k-mer counting and k-mer set representation. Afterward, we describe the usage of two k-mer-based approaches, Pangrowth and Corer, for pangenomic core detection.
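The definition of the core as the information shared among all members can be made concrete on the k-mer level with a small sketch. It assumes plain DNA strings, ignores reverse complements, and uses hypothetical helper names (`kmers`, `core_kmers`); it does not reproduce how Pangrowth or Corer work internally.

```python
from typing import List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def core_kmers(genomes: List[str], k: int) -> Set[str]:
    """Return the k-mers present in every genome (empty set for no genomes)."""
    if not genomes:
        return set()
    core = kmers(genomes[0], k)
    for genome in genomes[1:]:
        # Keep only k-mers that also occur in the next genome.
        core &= kmers(genome, k)
    return core


if __name__ == "__main__":
    toy_genomes = ["ACGTAC", "CGTACG", "TTACGT"]
    print(sorted(core_kmers(toy_genomes, 3)))
```

Real tools typically count canonical k-mers (a k-mer and its reverse complement are treated as one) and use compact set representations; both are left out here for brevity.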

Subjects

Algorithms , Computational Biology , Genomics , Software , Genomics/methods , Computational Biology/methods , DNA Sequence Analysis/methods , Humans , High-Throughput Nucleotide Sequencing/methods

15.

Investigating the complexity of the double distance problems.

Braga, Marília D V; Brockmann, Leonie R; Klerx, Katharina; Stoye, Jens.

Algorithms Mol Biol ; 19(1): 1, 2024 Jan 04.

Article in English | MEDLINE | ID: mdl-38178195

ABSTRACT

BACKGROUND: Two genomes [Formula: see text] and [Formula: see text] over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by [Formula: see text] the number of common families of [Formula: see text] and [Formula: see text]. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let [Formula: see text] and [Formula: see text] be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes [Formula: see text] and [Formula: see text]. Then, the breakpoint distance of [Formula: see text] and [Formula: see text] is equal to [Formula: see text]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of [Formula: see text] and [Formula: see text] is [Formula: see text], where c is the total number of cycles and [Formula: see text] is the total number of paths of even length. MOTIVATION: The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a [Formula: see text] distance, defined to be [Formula: see text], and increasingly investigate the complexities of median and double distance for the [Formula: see text] distance, then the [Formula: see text] distance, and so on. 
RESULTS: While for the median much effort was done in our and in other research groups but no progress was obtained even for the [Formula: see text] distance, for solving the double distance under [Formula: see text] and [Formula: see text] distances we could devise linear time algorithms, which we present here.
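The formulas elided in this record are not reconstructed here. As a rough illustration only, the sketch below computes the two extreme distances from cycle and path counts of a breakpoint graph, using the closed forms commonly cited in the genome rearrangement literature as an assumption: breakpoint distance n - (c_2 + p_0/2) and DCJ distance n - (c + p_even/2). The function names and the dictionary encoding (length mapped to count) are hypothetical.

```python
from typing import Dict


def breakpoint_distance(n: int, cycles: Dict[int, int], paths: Dict[int, int]) -> float:
    """Assumed form: n - (c_2 + p_0 / 2), with c_2 cycles of length 2 and p_0 paths of length 0."""
    return n - (cycles.get(2, 0) + paths.get(0, 0) / 2)


def dcj_distance(n: int, cycles: Dict[int, int], paths: Dict[int, int]) -> float:
    """Assumed form: n - (c + p_even / 2), with c all cycles and p_even all even-length paths."""
    total_cycles = sum(cycles.values())
    even_paths = sum(count for length, count in paths.items() if length % 2 == 0)
    return n - (total_cycles + even_paths / 2)


if __name__ == "__main__":
    # Illustrative counts only: one 2-cycle, one 4-cycle, two paths of length 0.
    cycles = {2: 1, 4: 1}
    paths = {0: 2}
    print(breakpoint_distance(4, cycles, paths), dcj_distance(4, cycles, paths))
```

The breakpoint distance rewards only the shortest components, while the DCJ distance rewards every cycle and every even path; the intermediate distances discussed in the abstract lie between these two.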

16.

Panacus: fast and exact pangenome growth and core size estimation.

Parmigiani, Luca; Garrison, Erik; Stoye, Jens; Marschall, Tobias; Doerr, Daniel.

bioRxiv ; 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38915671

ABSTRACT

Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth). Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour. Availability: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
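The two quantities named in the abstract, pangenome growth and core size, can be sketched for a toy input. The example assumes each genome is given as the set of graph node identifiers it traverses and follows one fixed genome order; the function name `growth_and_core` is hypothetical, and this is not how Panacus is implemented.

```python
from typing import List, Set, Tuple


def growth_and_core(node_sets: List[Set[int]]) -> Tuple[List[int], List[int]]:
    """Return (growth, core) curves for genomes added in the given order.

    growth[m-1] is the number of nodes seen in the first m genomes;
    core[m-1] is the number of nodes shared by all of the first m genomes.
    """
    growth: List[int] = []
    core: List[int] = []
    seen: Set[int] = set()
    shared: Set[int] = set()
    for index, nodes in enumerate(node_sets):
        seen |= nodes
        # The first genome initialises the shared set; later ones shrink it.
        shared = set(nodes) if index == 0 else shared & nodes
        growth.append(len(seen))
        core.append(len(shared))
    return growth, core


if __name__ == "__main__":
    toy = [{1, 2}, {2, 3}, {3, 4, 5}]
    print(growth_and_core(toy))
```

Because both curves depend on the order in which genomes are added, a summary over many or all orderings is usually preferable to a single ordering like the one used here.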

17.

Family-Free Genome Comparison.

Braga, Marilia D V; Doerr, Daniel; Rubert, Diego P; Stoye, Jens.

Methods Mol Biol ; 2802: 57-72, 2024.

Article in English | MEDLINE | ID: mdl-38819556

ABSTRACT

The comparison of large-scale genome structures across distinct species offers valuable insights into the species' phylogeny, genome organization, and gene associations. In this chapter, we review the family-free genome comparison tool FFGC that, relying on built-in interfaces with a sequence comparison tool (either BLAST+ or DIAMOND) and with an ILP solver (either CPLEX or Gurobi), provides several methods for analyses that do not require prior classification of genes across the studied genomes. Taking annotated genome sequences as input, FFGC is a complete workflow for genome comparison allowing not only the computation of measures of similarity and dissimilarity but also the inference of gene families, simultaneously based on sequence similarities and large-scale genomic features.

Subjects

Genomics , Phylogeny , Software , Genomics/methods , Genome , Computational Biology/methods , Humans

18.

Correction: GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers.

Jünemann, Sebastian; Prior, Karola; Albersmeier, Andreas; Albaum, Stefan; Kalinowski, Jörn; Goesmann, Alexander; Stoye, Jens; Harmsen, Dag.

PLoS One ; 19(2): e0299269, 2024.

Article in English | MEDLINE | ID: mdl-38359070

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0107014.].

19.

Statistics for approximate gene clusters.

Jahn, Katharina; Winter, Sascha; Stoye, Jens; Böcker, Sebastian.

BMC Bioinformatics ; 14 Suppl 15: S14, 2013.

Article in English | MEDLINE | ID: mdl-24564620

ABSTRACT

BACKGROUND: Genes that occur co-localized in multiple genomes can be strong indicators of either functional constraints on the genome organization or remnant ancestral gene order. The computational detection of these patterns, which are usually referred to as gene clusters, has become increasingly sensitive over the past decade. The most powerful approaches allow for various types of imperfect cluster conservation: Cluster locations may be internally rearranged. The individual cluster locations may contain only a subset of the cluster genes and may be disrupted by uninvolved genes. Moreover, cluster locations may not occur at all in some or even most of the studied genomes. The detection of such low-quality clusters increases the risk of mistaking faint patterns that occur merely by chance for genuine findings. Therefore, it is crucial to estimate the significance of computational gene cluster predictions and discriminate between true conservation and coincidental clustering. RESULTS: In this paper, we present an efficient and accurate approach to estimate the significance of gene cluster predictions under the approximate common intervals model. Given a single gene cluster prediction, we calculate the probability of observing it with the same or a higher degree of conservation under the null hypothesis of random gene order, and add a correction factor to account for multiple testing. Our approach considers all parameters that define the quality of gene cluster conservation: the number of genomes in which the cluster occurs, the number of involved genes, the degree of conservation in the different genomes, as well as the frequency of the clustered genes within each genome. We apply our approach to evaluate gene cluster predictions in a large set of well-annotated genomes.

Subjects

Biometry/methods , Multigene Family , Gene Order , Bacterial Genome , Probability

20.

metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA sequences.

Ander, Christina; Schulz-Trieglaff, Ole B; Stoye, Jens; Cox, Anthony J.

BMC Bioinformatics ; 14 Suppl 5: S2, 2013.

Article in English | MEDLINE | ID: mdl-23734710

ABSTRACT

Environmental shotgun sequencing (ESS) has potential to give greater insight into microbial communities than targeted sequencing of 16S regions, but requires much higher sequence coverage. The advent of next-generation sequencing has made it feasible for the Human Microbiome Project and other initiatives to generate ESS data on a large scale, but computationally efficient methods for analysing such data sets are needed. Here we present metaBEETL, a fast taxonomic classifier for environmental shotgun sequences. It uses a Burrows-Wheeler Transform (BWT) index of the sequencing reads and an indexed database of microbial reference sequences. Unlike other BWT-based tools, our method has no upper limit on the number or the total size of the reference sequences in its database. By capturing sequence relationships between strains, our reference index also allows us to classify reads which are not unique to an individual strain but are nevertheless specific to some higher phylogenetic order. Tested on datasets with known taxonomic composition, metaBEETL gave results that are competitive with existing similarity-based tools: due to normalization steps which other classifiers lack, the taxonomic profile computed by metaBEETL closely matched the true environmental profile. At the same time, its moderate running time and low memory footprint allow metaBEETL to scale well to large data sets. Code to construct the BWT indexed database and for the taxonomic classification is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.

Subjects

High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Microbiota , DNA Sequence Analysis/methods , Algorithms , Environmental Microbiology , Humans , Phylogeny
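The metaBEETL record above refers to a Burrows-Wheeler Transform index. As a reminder of what the transform itself computes, here is a textbook sketch that sorts all rotations of a single string. It is not the BEETL library's method, which builds the BWT of large read collections without materializing rotations; the function name `bwt` and the `$` sentinel are illustrative choices.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    A sentinel that sorts before every other character is appended so the
    transform is reversible. Runs in O(n^2 log n); for illustration only.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Build every cyclic rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
```

Because the transform groups characters that share the same following context, equal symbols tend to form runs, which is what makes BWT-based indexes compact and searchable.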
