Search | VHL Regional Portal

1.

Correction: GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers.

Jünemann, Sebastian; Prior, Karola; Albersmeier, Andreas; Albaum, Stefan; Kalinowski, Jörn; Goesmann, Alexander; Stoye, Jens; Harmsen, Dag.

PLoS One ; 19(2): e0299269, 2024.

Article in English | MEDLINE | ID: mdl-38359070

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0107014.].

2.

Investigating the complexity of the double distance problems.

Braga, Marília D V; Brockmann, Leonie R; Klerx, Katharina; Stoye, Jens.

Algorithms Mol Biol ; 19(1): 1, 2024 Jan 04.

Article in English | MEDLINE | ID: mdl-38178195

ABSTRACT

BACKGROUND: Two genomes [Formula: see text] and [Formula: see text] over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by [Formula: see text] the number of common families of [Formula: see text] and [Formula: see text]. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let [Formula: see text] and [Formula: see text] be respectively the numbers of cycles of length i and of paths of length j in the breakpoint graph of genomes [Formula: see text] and [Formula: see text]. Then, the breakpoint distance of [Formula: see text] and [Formula: see text] is equal to [Formula: see text]. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of [Formula: see text] and [Formula: see text] is [Formula: see text], where c is the total number of cycles and [Formula: see text] is the total number of paths of even length. MOTIVATION: The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a [Formula: see text] distance, defined to be [Formula: see text], and increasingly investigate the complexities of median and double distance for the [Formula: see text] distance, then the [Formula: see text] distance, and so on. 
RESULTS: While for the median much effort was done in our and in other research groups but no progress was obtained even for the [Formula: see text] distance, for solving the double distance under [Formula: see text] and [Formula: see text] distances we could devise linear time algorithms, which we present here.

3.

HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets.

Kirsch-Gerweck, Benedikt; Bohnenkämper, Leonard; Henrichs, Michel T; Alanko, Jarno N; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Burger, Joachim; Stoye, Jens; Diekmann, Yoan.

Mol Biol Evol ; 40(3), 2023 03 04.

Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.

Subjects

Genetics, Population; Metagenomics; Genomics; Genome; Haplotypes

4.

Sequence-based pangenomic core detection.

Schulz, Tizian; Wittler, Roland; Stoye, Jens.

iScience ; 25(6): 104413, 2022 Jun 17.

Article in English | MEDLINE | ID: mdl-35663029

ABSTRACT

One of the most basic kinds of analysis to be performed on a pangenome is the detection of its core, i.e., the information shared among all members. Pangenomic core detection is classically done on the gene level and many tools focus exclusively on core detection in prokaryotes. Here, we present a new method for sequence-based pangenomic core detection. Our model generalizes from a strict core definition allowing us to flexibly determine suitable core properties depending on the research question and the dataset under consideration. We propose an algorithm based on a colored de Bruijn graph that runs in linear time with respect to the number of k-mers in the graph. An implementation of our method is called Corer. Because of the usage of a colored de Bruijn graph, it works alignment-free, is provided with a small memory footprint, and accepts as input assembled genomes as well as sequencing reads.
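
As a rough illustration of the idea in the abstract above, the following sketch records, for every k-mer, the set of genomes ("colors") that contain it, and reports the k-mers shared by at least a given number of genomes. It is not the Corer implementation: the function names, the `quorum` parameter, and the use of a plain dictionary in place of a compacted colored de Bruijn graph are assumptions made for illustration, and reverse complements are ignored.

```python
from collections import defaultdict


def kmer_colors(genomes, k):
    """Map each k-mer to the set of genome names (colors) containing it.

    `genomes` is a dict {name: sequence}; this stands in for a colored
    de Bruijn graph and is only an illustrative assumption.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors


def core_kmers(genomes, k, quorum=None):
    """Return k-mers present in at least `quorum` genomes.

    With quorum=None every genome must contain the k-mer (strict core);
    a smaller quorum gives a relaxed, hypothetical core definition.
    """
    if quorum is None:
        quorum = len(genomes)
    return {kmer for kmer, cols in kmer_colors(genomes, k).items()
            if len(cols) >= quorum}


example = {"g1": "ACGTAC", "g2": "CGTACG", "g3": "TTCGTA"}
strict_core = core_kmers(example, 3)
relaxed_core = core_kmers(example, 3, quorum=2)
```

The single pass over all k-mers mirrors the linear dependence on the number of k-mers stated in the abstract, although the real method works on graph structures rather than on a flat k-mer table.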

5.

Detecting high-scoring local alignments in pangenome graphs.

Schulz, Tizian; Wittler, Roland; Rahmann, Sven; Hach, Faraz; Stoye, Jens.

Bioinformatics ; 37(16): 2266-2274, 2021 Aug 25.

Article in English | MEDLINE | ID: mdl-33532821

ABSTRACT

MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

6.

Computing the Rearrangement Distance of Natural Genomes.

Bohnenkämper, Leonard; Braga, Marília D V; Doerr, Daniel; Stoye, Jens.

J Comput Biol ; 28(4): 410-431, 2021 04.

Article in English | MEDLINE | ID: mdl-33393848

ABSTRACT

The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes that have equal numbers of duplicates of any marker. Therefore, it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few 10,000 markers, which we demonstrate on simulated and real data.

Subjects

Computational Biology; Gene Rearrangement/genetics; Genome/genetics; Genomics; Algorithms; Models, Genetic; Programming, Linear

7.

Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine.

Sundermann, Linda K; Wintersinger, Jeff; Rätsch, Gunnar; Stoye, Jens; Morris, Quaid.

PLoS Comput Biol ; 17(1): e1008400, 2021 01.

Article in English | MEDLINE | ID: mdl-33465079

ABSTRACT

Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. 
We show through extensive noise-free simulation, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one Gb of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine.

Subjects

Algorithms; Computational Biology/methods; Mutation/genetics; Neoplasms; Computer Simulation; Evolution, Molecular; High-Throughput Nucleotide Sequencing; Humans; Neoplasms/classification; Neoplasms/genetics; Whole Genome Sequencing

8.

Computing the Inversion-Indel Distance.

Willing, Eyla; Stoye, Jens; Braga, Marilia D V.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2314-2326, 2021.

Article in English | MEDLINE | ID: mdl-32324562

ABSTRACT

The inversion distance, that is the distance between two unichromosomal genomes with the same content allowing only inversions of DNA segments, can be exactly computed thanks to a pioneering approach of Hannenhalli and Pevzner from 1995. In 2000, El-Mabrouk extended the inversion model to perform the comparison of unichromosomal genomes with unequal contents, combining inversions with insertions and deletions (indels) of DNA segments, giving rise to the inversion-indel distance. However, only a heuristic was provided for its computation. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, that can represent several genome rearrangements (including inversions). In 2006, Bergeron, Mixtacki and Stoye showed that the DCJ distance can be computed in linear time with a very simple procedure. As a consequence, in 2010 we gave a linear-time algorithm to compute the DCJ-indel distance. This result allowed the inversion-indel model to be revisited from another angle. In 2013, we could show that, when the diagram that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. In the present work we complete the study of the inversion-indel distance by giving the first algorithm to compute it exactly even in the presence of bad components.
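
The DCJ distance computation referred to in the abstract above can be illustrated with a short sketch. It assumes the formula d = N - (C + I/2) of Bergeron, Mixtacki and Stoye, where N is the number of genes, C the number of cycles and I the number of odd paths in the adjacency graph of the two genomes. The genome encoding (a list of `(signed_genes, is_circular)` pairs) and all function names are hypothetical. Union-find is used for brevity, so the sketch is near-linear rather than strictly linear, and it handles neither indels nor duplicated genes.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def genome_vertices(chromosomes):
    """List adjacencies (2 extremities) and telomeres (1 extremity)."""
    vertices = []
    for genes, circular in chromosomes:
        ends = [extremities(g) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            vertices.append((right, left))
        if circular:
            vertices.append((ends[-1][1], ends[0][0]))
        else:
            vertices.append((ends[0][0],))
            vertices.append((ends[-1][1],))
    return vertices


def dcj_distance(genome_a, genome_b):
    """DCJ distance of two genomes with identical, duplicate-free gene sets."""
    vertices = ([("A", v) for v in genome_vertices(genome_a)]
                + [("B", v) for v in genome_vertices(genome_b)])
    where = {(side, e): i for i, (side, v) in enumerate(vertices) for e in v}
    exts_a = {e for side, e in where if side == "A"}
    exts_b = {e for side, e in where if side == "B"}
    if exts_a != exts_b:
        raise ValueError("genomes must contain the same genes")

    parent = list(range(len(vertices)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each extremity is an edge between its vertex in A and its vertex in B.
    for e in exts_a:
        parent[find(where[("A", e)])] = find(where[("B", e)])

    # Record, per component, which genome each of its telomeres belongs to.
    telomeres = {}
    for i, (side, v) in enumerate(vertices):
        sides = telomeres.setdefault(find(i), [])
        if len(v) == 1:
            sides.append(side)

    cycles = sum(1 for s in telomeres.values() if not s)
    # A path has an odd number of edges iff its ends lie in different genomes.
    odd_paths = sum(1 for s in telomeres.values() if sorted(s) == ["A", "B"])
    return len(exts_a) // 2 - cycles - odd_paths // 2


one_inversion = dcj_distance([([1, 2, 3], False)], [([1, -2, 3], False)])
one_fission = dcj_distance([([1, 2, 3, 4], False)],
                           [([1, 2], False), ([3, 4], False)])
```

The inversion-indel distance studied in the paper additionally has to account for unequal gene content and for bad components, neither of which this sketch attempts.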

Subjects

Genomics/methods; INDEL Mutation/genetics; Algorithms; Gene Rearrangement/genetics

9.

Editorial: Computational Methods for Microbiome Analysis.

Setubal, João C; Stoye, Jens; Dutilh, Bas E.

Front Genet ; 11: 623897, 2020.

Article in English | MEDLINE | ID: mdl-33362871

10.

HASLR: Fast Hybrid Assembly of Long Reads.

Haghshenas, Ehsan; Asghari, Hossein; Stoye, Jens; Chauve, Cedric; Hach, Faraz.

iScience ; 23(8): 101389, 2020 Aug 21.

Article in English | MEDLINE | ID: mdl-32781410

ABSTRACT

Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.

11.

Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants.

Rubert, Diego P; Martinez, Fábio V; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 21(Suppl 2): 273, 2020 Apr 16.

Article in English | MEDLINE | ID: mdl-32299356

ABSTRACT

BACKGROUND: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants. RESULTS: Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. CONCLUSIONS: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.

Subjects

Chromosome Mapping/methods; Genomics/methods; Magnoliopsida/genetics; Computer Simulation; Evolution, Molecular; Gene Order; Genome, Plant; Models, Genetic; Phylogeny; Synteny/genetics

12.

Finding all maximal perfect haplotype blocks in linear time.

Alanko, Jarno; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Stoye, Jens.

Algorithms Mol Biol ; 15: 2, 2020.

Article in English | MEDLINE | ID: mdl-32055252

ABSTRACT

Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
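
The second algorithm in the abstract above builds on the positional Burrows-Wheeler Transform. As an illustration only, and not the authors' block-finding algorithm, the sketch below computes the positional prefix arrays at the core of that transform: for each column k, the haplotypes ordered by their reversed prefixes of length k. It assumes haplotypes are equal-length strings over '0' and '1'; the function name is hypothetical.

```python
def positional_prefix_arrays(haplotypes):
    """Return the list of positional prefix arrays a_0, ..., a_n.

    a_k lists haplotype indices sorted by their reversed prefix of
    length k. Each step is a stable partition by the allele in the
    current column, so the total work is O(n * m) for m haplotypes
    of length n.
    """
    order = list(range(len(haplotypes)))
    arrays = [order[:]]
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for k in range(n_sites):
        zeros = [i for i in order if haplotypes[i][k] == "0"]
        ones = [i for i in order if haplotypes[i][k] == "1"]
        order = zeros + ones  # stable: earlier order kept within each allele
        arrays.append(order[:])
    return arrays


haps = ["0101", "1100", "0110", "0100"]
arrays = positional_prefix_arrays(haps)
```

In each array, haplotypes that agree on a long stretch ending at column k sit next to each other, which is the property that makes scanning for shared blocks efficient.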

13.

Horizontal Gene Transfer Phylogenetics: A Random Walk Approach.

Sevillya, Gur; Doerr, Daniel; Lerner, Yael; Stoye, Jens; Steel, Mike; Snir, Sagi.

Mol Biol Evol ; 37(5): 1470-1479, 2020 05 01.

Article in English | MEDLINE | ID: mdl-31845962

ABSTRACT

The dramatic decrease in time and cost for generating genetic sequence data has opened up vast opportunities in molecular systematics, one of which is the ability to decipher the evolutionary history of strains of a species. Under this fine systematic resolution, the standard markers are too crude to provide a phylogenetic signal. Nevertheless, among prokaryotes, genome dynamics in the form of horizontal gene transfer (HGT) between organisms and gene loss seem to provide far richer information by affecting both gene order and gene content. The "synteny index" (SI) between a pair of genomes combines these latter two factors, allowing comparison of genomes with unequal gene content, together with order considerations of their common genes. Although this approach is useful for classifying close relatives, no rigorous statistical modeling for it has been suggested. Such modeling is valuable, as it allows observed measures to be transformed into estimates of time periods during evolution, yielding the "additivity" of the measure. To the best of our knowledge, there is no other additivity proof for other gene order/content measures under HGT. Here, we provide a first statistical model and analysis for the SI measure. We model the "gene neighborhood" as a "birth-death-immigration" process affected by the HGT activity over the genome, and analytically relate the HGT rate and time to the expected SI. This model is asymptotic and thus provides accurate results, assuming infinite size genomes. Therefore, we also developed a heuristic model following an "exponential decay" function, accounting for biologically realistic values, which performed well in simulations. Applying this model to 1,133 prokaryotes partitioned to 39 clusters by the rank of genus yields that the average number of genome dynamics events per gene in the phylogenetic depth of genus is around half with significant variability between genera. 
This result extends and confirms similar results obtained for individual genera in different manners.

Subjects

Gene Transfer, Horizontal; Genetic Techniques; Models, Genetic; Synteny; Genome, Microbial; Phylogeny

14.

Whole-genome sequence of the bovine blood fluke Schistosoma bovis supports interspecific hybridization with S. haematobium.

Oey, Harald; Zakrzewski, Martha; Gravermann, Kerstin; Young, Neil D; Korhonen, Pasi K; Gobert, Geoffrey N; Nawaratna, Sujeevi; Hasan, Shihab; Martínez, David M; You, Hong; Lavin, Martin; Jones, Malcolm K; Ragan, Mark A; Stoye, Jens; Oleaga, Ana; Emery, Aidan M; Webster, Bonnie L; Rollinson, David; Gasser, Robin B; McManus, Donald P; Krause, Lutz.

PLoS Pathog ; 15(1): e1007513, 2019 01.

Article in English | MEDLINE | ID: mdl-30673782

ABSTRACT

Mesenteric infection by the parasitic blood fluke Schistosoma bovis is a common veterinary problem in Africa and the Middle East and occasionally in the Mediterranean Region. The species also has the ability to form interspecific hybrids with the human parasite S. haematobium with natural hybridisation observed in West Africa, presenting possible zoonotic transmission. Additionally, this exchange of alleles between species may dramatically influence disease dynamics and parasite evolution. We have generated a 374 Mb assembly of the S. bovis genome using Illumina and PacBio-based technologies. Despite infecting different hosts and organs, the genome sequences of S. bovis and S. haematobium appeared strikingly similar with 97% sequence identity. The two species share 98% of protein-coding genes, with an average sequence identity of 97.3% at the amino acid level. Genome comparison identified large continuous parts of the genome (up to several 100 kb) showing almost 100% sequence identity between S. bovis and S. haematobium. It is unlikely that this is a result of genome conservation and provides further evidence of natural interspecific hybridization between S. bovis and S. haematobium. Our results suggest that foreign DNA obtained by interspecific hybridization was maintained in the population through multiple meiosis cycles and that hybrids were sexually reproductive, producing viable offspring. The S. bovis genome assembly forms a highly valuable resource for studying schistosome evolution and exploring genetic regions that are associated with species-specific phenotypic traits.

Subjects

Hybridization, Genetic/genetics; Schistosoma/genetics; Africa; Africa, Western; Animals; Base Sequence/genetics; Cattle; Chromosome Mapping/methods; DNA/genetics; Genome/genetics; Genome, Mitochondrial/genetics; Hybridization, Genetic/physiology; Middle East; Phylogeny; Proteome/genetics; Species Specificity; Trematoda/genetics; Whole Genome Sequencing/methods

15.

Dynamic Alignment-Free and Reference-Free Read Compression.

Holley, Guillaume; Wittler, Roland; Stoye, Jens; Hach, Faraz.

J Comput Biol ; 25(7): 825-836, 2018 07.

Article in English | MEDLINE | ID: mdl-30011247

ABSTRACT

The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.

Subjects

Computational Biology/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/trends; Software; Algorithms; Data Compression; Genome/genetics

16.

Scaffolding of Ancient Contigs and Ancestral Reconstruction in a Phylogenetic Framework.

Luhmann, Nina; Chauve, Cedric; Stoye, Jens; Wittler, Roland.

IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 2094-2100, 2018.

Article in English | MEDLINE | ID: mdl-29993816

ABSTRACT

Ancestral genome reconstruction is an important task to analyze the evolution of genomes. Recent progress in sequencing ancient DNA led to the publication of so-called paleogenomes and allows the integration of this sequencing data in genome evolution analysis. However, the de novo assembly of ancient genomes is usually fragmented due to DNA degradation over time among others. Integrated phylogenetic assembly addresses the issue of genome fragmentation in the ancient DNA assembly while aiming to improve the reconstruction of all ancient genomes in the phylogeny simultaneously. The fragmented assembly of the ancient genome can be represented as an assembly graph, indicating contradicting ordering information of contigs. In this setting, our approach is to compare the ancient data with extant finished genomes. We generalize a reconstruction approach minimizing the Single-Cut-or-Join rearrangement distance towards multifurcating trees and include edge lengths to improve the reconstruction in practice. This results in a polynomial time algorithm that includes additional ancient DNA data at one node in the tree, resulting in consistent reconstructions of ancestral genomes.
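
The Single-Cut-or-Join (SCJ) distance mentioned in the abstract above has a simple pairwise form that can be sketched briefly. The sketch assumes the known characterization of the SCJ distance between two genomes over the same gene set as the size of the symmetric difference of their adjacency sets. The genome encoding (a list of `(signed_genes, is_circular)` pairs) and the function names are hypothetical, and the sketch covers only the pairwise distance, not the phylogenetic reconstruction described in the paper.

```python
def adjacencies(chromosomes):
    """Return the set of adjacencies of a genome.

    Each gene g has a tail (g, 't') and a head (g, 'h'); a positive gene
    is read tail to head, a negative gene head to tail. An adjacency is
    the unordered pair of extremities where two consecutive genes meet.
    """
    result = set()
    for genes, circular in chromosomes:
        ends = []
        for g in genes:
            tail, head = (abs(g), "t"), (abs(g), "h")
            ends.append((tail, head) if g > 0 else (head, tail))
        pairs = list(zip(ends, ends[1:]))
        if circular and ends:
            pairs.append((ends[-1], ends[0]))  # close the circle
        for left_gene, right_gene in pairs:
            result.add(frozenset((left_gene[1], right_gene[0])))
    return result


def scj_distance(genome_a, genome_b):
    """Number of single cuts and joins separating the two genomes."""
    return len(adjacencies(genome_a) ^ adjacencies(genome_b))


inversion_cost = scj_distance([([1, 2, 3], False)], [([1, -2, 3], False)])
fission_cost = scj_distance([([1, 2, 3, 4], False)],
                            [([1, 2], False), ([3, 4], False)])
```

An inversion costs four SCJ operations here (two cuts and two joins), whereas a fission is a single cut.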

Subjects

DNA, Ancient/analysis; DNA; Genomics/methods; Sequence Analysis, DNA/methods; Algorithms; Animals; DNA/analysis; DNA/classification; DNA/genetics; Evolution, Molecular; History, Ancient; History, Medieval; Humans; Models, Genetic; Paleontology; Phylogeny; Plague/history; Plague/microbiology; Rats; Sequence Alignment/methods; Yersinia pestis/classification; Yersinia pestis/genetics

17.

GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.

Schulz, Tizian; Stoye, Jens; Doerr, Daniel.

BMC Genomics ; 19(Suppl 5): 308, 2018 May 08.

Article in English | MEDLINE | ID: mdl-29745835

ABSTRACT

BACKGROUND: Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. RESULTS: We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particularly suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. CONCLUSIONS: By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also a few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.

Subjects

Algorithms; Chromosomes/chemistry; Computer Graphics; High-Throughput Nucleotide Sequencing/methods; Multigene Family; Sequence Analysis, DNA/methods; Animals; Cluster Analysis; Genomics; Humans; Mice

18.

Computing the family-free DCJ similarity.

Rubert, Diego P; Hoshino, Edna A; Braga, Marília D V; Stoye, Jens; Martinez, Fábio V.

BMC Bioinformatics ; 19(Suppl 6): 152, 2018 05 08.

Article in English | MEDLINE | ID: mdl-29745861

ABSTRACT

BACKGROUND: The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity. RESULTS: We propose an exact ILP algorithm to solve the family-free DCJ similarity problem, then we show its APX-hardness and present four combinatorial heuristics with computational experiments comparing their results to the ILP. CONCLUSIONS: We show that the family-free DCJ similarity can be computed in reasonable time, although for larger genomes it is necessary to resort to heuristics. This provides a basis for further studies on the applicability and model refinement of family-free whole genome similarity measures.

Subjects

Models, Genetic; Phylogeny; Algorithms; Animals; Computer Simulation; Databases, Genetic; Genome; Genomics; Heuristics; Humans; Mice; Rats

19.

Flexible metagenome analysis using the MGX framework.

Jaenicke, Sebastian; Albaum, Stefan P; Blumenkamp, Patrick; Linke, Burkhard; Stoye, Jens; Goesmann, Alexander.

Microbiome ; 6(1): 76, 2018 04 24.

Article in English | MEDLINE | ID: mdl-29690922

ABSTRACT

BACKGROUND: The characterization of microbial communities based on sequencing and analysis of their genetic information has become a popular approach also referred to as metagenomics; in particular, the recent advances in sequencing technologies have enabled researchers to study even the most complex communities. Metagenome analysis, the assignment of sequences to taxonomic and functional entities, however, remains a tedious task: large amounts of data need to be processed. There are a number of approaches addressing particular aspects, but scientific questions are often too specific to be answered by a general-purpose method. RESULTS: We present MGX, a flexible and extensible client/server-framework for the management and analysis of metagenomic datasets; MGX features a comprehensive set of adaptable workflows required for taxonomic and functional metagenome analysis, combined with an intuitive and easy-to-use graphical user interface offering customizable result visualizations. At the same time, MGX allows to include own data sources and devise custom analysis pipelines, thus enabling researchers to perform basic as well as highly specific analyses within a single application. CONCLUSIONS: With MGX, we provide a novel metagenome analysis platform giving researchers access to the most recent analysis tools. MGX covers taxonomic and functional metagenome analysis, statistical evaluation, and a wide range of visualizations easing data interpretation. Its default taxonomic classification pipeline provides equivalent or superior results in comparison to existing tools.

Subjects

Database Management Systems; Metagenome; Metagenomics/methods; Microbiota; Reproducibility of Results; User-Computer Interface; Workflow

20.

Pan-Genome Storage and Analysis Techniques.

Zekic, Tina; Holley, Guillaume; Stoye, Jens.

Methods Mol Biol ; 1704: 29-53, 2018.

Article in English | MEDLINE | ID: mdl-29277862

ABSTRACT

Computational pan-genome analysis has emerged from the rapid increase of available genome sequencing data. Starting from a microbial pan-genome, the concept has spread to a variety of species, such as plants or viruses. Characterizing a pan-genome provides insights into intra-species evolution, functions, and diversity. However, researchers face challenges such as processing and maintaining large datasets while providing accurate and efficient analysis approaches. Comparative genomics methods are required for detecting conserved and unique regions between a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences, advantages, and disadvantages between the tools, and provide information about the general workflow, methodology of pan-genome identification, covered functionalities, usability and availability of the tools.

Subjects

Algorithms; Genome, Microbial; Genomics/methods; Sequence Analysis, DNA/methods; Cluster Analysis; Computational Biology/methods; Databases, Genetic; Phylogeny; Software
