Results 1 - 20 of 35
1.
BMC Bioinformatics ; 24(1): 400, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37884897

ABSTRACT

BACKGROUND: Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index's backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS: We present a data structure for a pan-genome compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending its unidirectional counterpart with the ability to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS: We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
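
As an aside on what makes the approximate matching lossless, a minimal sketch of the underlying pigeonhole idea follows, in C++ (the language of Nexus). It only covers mismatches on a plain text, not the bidirectional FM-index search schemes on the pan-genome graph that Nexus actually implements; all names and the example data are illustrative.

```cpp
// Toy illustration of the lossless pigeonhole idea underlying search schemes:
// split the read into k+1 non-overlapping pieces; any occurrence with at most
// k mismatches must contain at least one piece exactly, so exact piece hits
// can seed verification without losing occurrences. This is NOT the Nexus
// implementation (which runs bidirectional search schemes on an FM-index of a
// pan-genome graph and supports full edit distance).
#include <iostream>
#include <set>
#include <string>

// Hamming distance between 'read' and the equal-length window at 'start'.
static int mismatches(const std::string& text, size_t start,
                      const std::string& read) {
    int d = 0;
    for (size_t i = 0; i < read.size(); ++i)
        d += (text[start + i] != read[i]);
    return d;
}

// All start positions where 'read' occurs in 'text' with <= k mismatches.
static std::set<size_t> approxMatch(const std::string& text,
                                    const std::string& read, int k) {
    std::set<size_t> hits;
    size_t parts = k + 1, len = read.size() / parts;
    for (size_t p = 0; p < parts; ++p) {
        size_t off = p * len;
        size_t plen = (p + 1 == parts) ? read.size() - off : len;
        std::string piece = read.substr(off, plen);
        // Each exact occurrence of a piece proposes one candidate alignment.
        for (size_t pos = text.find(piece); pos != std::string::npos;
             pos = text.find(piece, pos + 1)) {
            if (pos < off || pos - off + read.size() > text.size()) continue;
            size_t start = pos - off;
            if (mismatches(text, start, read) <= k) hits.insert(start);
        }
    }
    return hits;
}

int main() {
    std::string text = "ACGTACGTTAGGCTACGTACCT";
    std::string read = "TAGGCTACGAAC"; // one mismatch vs. position 8
    for (size_t h : approxMatch(text, read, 2))
        std::cout << "match at position " << h << '\n';
}
```

Search schemes generalize this partitioning by also bounding the errors allowed in each part and by extending parts in both directions, which is where a bidirectional index pays off.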


Subjects
Algorithms, Genome, DNA Sequence Analysis, Software, Computational Biology
2.
Bioinformatics ; 36(7): 2275-2277, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31830243

ABSTRACT

MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
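
For readers unfamiliar with the building blocks named above, the sketch below shows one classic binarization scheme, order-0 Exponential-Golomb, of the kind such codecs compose with context-adaptive binary arithmetic coding. This is an illustration only, not the GABAC library API.

```cpp
// Minimal sketch of one building block of a CABAC-style codec: binarization,
// i.e. mapping integer symbols to bit strings before the bins are handed to a
// context-adaptive binary arithmetic coder. Shown here is order-0
// Exponential-Golomb binarization; the symbols are made-up example values.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Order-0 Exp-Golomb: value n is coded as the binary form of (n + 1),
// preceded by as many '0' bits as that binary form has bits minus one.
static std::string expGolomb0(uint32_t n) {
    uint64_t v = static_cast<uint64_t>(n) + 1;
    int bits = 0;
    for (uint64_t t = v; t != 0; t >>= 1) ++bits;   // bit length of v
    std::string code(bits - 1, '0');                // leading zeros
    for (int i = bits - 1; i >= 0; --i)             // binary form of v
        code += ((v >> i) & 1) ? '1' : '0';
    return code;
}

int main() {
    // Binarize a small stream of transformed symbols (e.g. match lengths).
    std::vector<uint32_t> symbols = { 0, 1, 2, 3, 7, 12 };
    for (uint32_t s : symbols)
        std::cout << s << " -> " << expGolomb0(s) << '\n';
    // The resulting bins would then be coded bin by bin with adaptive
    // context models and a binary arithmetic coder.
}
```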


Subjects
Data Compression, High-Throughput Nucleotide Sequencing, Genome, Genomics, Software
3.
Theor Appl Genet ; 134(12): 3845-3861, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34387711

ABSTRACT

KEY MESSAGE: The deep scoping method incorporates the use of a gene bank together with different population layers to reintroduce genetic variation into the breeding population, thus maximizing the long-term genetic gain without reducing the short-term genetic gain or increasing the total financial cost. Genomic prediction is often combined with truncation selection to identify superior parental individuals that can pass on favorable quantitative trait locus (QTL) alleles to their offspring. However, truncation selection reduces genetic variation within the breeding population, causing a premature convergence to a sub-optimal genetic value. In order to also increase genetic gain in the long term, different methods have been proposed that better preserve genetic variation. However, when the genetic variation of the breeding population has already been reduced as a result of prior intensive selection, even those methods will not be able to avert such premature convergence. Pre-breeding provides a solution for this problem by reintroducing genetic variation into the breeding population. Unfortunately, as pre-breeding often relies on a separate breeding population to increase the genetic value of wild specimens before introducing them in the elite population, it comes with an increased financial cost. In this paper, on the basis of a simulation study, we propose a new method that reintroduces genetic variation in the breeding population on a continuous basis without the need for a separate pre-breeding program or a larger population size. This way, we are able to introduce favorable QTL alleles into an elite population and maximize the genetic gain in the short as well as in the long term without increasing the financial cost.
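
To make the baseline concrete, the following sketch shows plain truncation selection on genomic estimated breeding values (GEBVs), the strategy whose loss of genetic variation motivates the deep scoping method. The line names and GEBVs are hypothetical, and the method itself (population layers plus a gene bank) is not reproduced here.

```cpp
// Minimal sketch of truncation selection: rank candidates by GEBV and keep
// the top fraction as parents of the next cycle. Repeated over cycles this
// maximizes short-term gain but erodes genetic variation, which is the
// behaviour the deep scoping method is designed to counteract.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Return the names of the best candidates according to their GEBV.
static std::vector<std::string>
truncationSelect(std::vector<std::pair<std::string, double>> candidates,
                 std::size_t nParents) {
    std::sort(candidates.begin(), candidates.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    candidates.resize(std::min(nParents, candidates.size()));
    std::vector<std::string> parents;
    for (const auto& c : candidates) parents.push_back(c.first);
    return parents;
}

int main() {
    std::vector<std::pair<std::string, double>> candidates = {
        {"line1", 0.82}, {"line2", 1.31}, {"line3", 0.12},
        {"line4", 1.05}, {"line5", 0.97}, {"line6", 1.22}};
    // Keep the top 3 as parents of the next breeding cycle.
    for (const auto& p : truncationSelect(candidates, 3))
        std::cout << p << '\n';
}
```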


Subjects
Genetic Variation, Plant Breeding, Quantitative Trait Loci, Alleles, Haploidy, Hordeum/genetics, Genetic Models, Plant Breeding/methods
4.
BMC Bioinformatics ; 21(Suppl 2): 81, 2020 Mar 11.
Article in English | MEDLINE | ID: mdl-32164557

ABSTRACT

BACKGROUND: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS: We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
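
The matrix-product formulation mentioned above can be illustrated compactly: one-hot encode every sequence window into a column of a matrix S, flatten the PWM into a row of P, and the window scores are the entries of P·S. The sketch below uses a naive loop where BLAMM would call an optimized BLAS GEMM (and would handle many PWMs at once); the PWM values are made up for illustration.

```cpp
// Minimal sketch of PWM scanning expressed as a matrix product. Not the BLAMM
// code: a plain triple loop stands in for the BLAS GEMM call.
#include <iostream>
#include <string>
#include <vector>

static int baseIndex(char c) {
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; } // 'T'
}

int main() {
    std::string seq = "ACGTAGCTAGCATG";
    // Toy log-odds PWM of length m = 3; pwm[i][b] scores base b at position i.
    std::vector<std::vector<double>> pwm = {
        { 1.0, -1.0, -1.0, -0.5 },   // position 0 favours A
        { -1.0, -1.0, 1.2, -0.5 },   // position 1 favours G
        { -1.0, 1.0, -1.0, -0.5 } }; // position 2 favours C
    const size_t m = pwm.size(), W = seq.size() - m + 1;

    // P: 1 x 4m row (flattened PWM); S: 4m x W one-hot window matrix.
    std::vector<double> P(4 * m);
    for (size_t i = 0; i < m; ++i)
        for (int b = 0; b < 4; ++b) P[4 * i + b] = pwm[i][b];
    std::vector<double> S(4 * m * W, 0.0);
    for (size_t j = 0; j < W; ++j)
        for (size_t i = 0; i < m; ++i)
            S[(4 * i + baseIndex(seq[j + i])) * W + j] = 1.0;

    // scores = P * S  (naive GEMM; an optimized BLAS call would go here).
    std::vector<double> scores(W, 0.0);
    for (size_t j = 0; j < W; ++j)
        for (size_t k = 0; k < 4 * m; ++k)
            scores[j] += P[k] * S[k * W + j];

    for (size_t j = 0; j < W; ++j)
        std::cout << "window " << j << ": " << scores[j] << '\n';
}
```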


Subjects
Algorithms, User-Computer Interface, Computing Methodologies, Humans, Position-Specific Scoring Matrices
5.
BMC Bioinformatics ; 21(1): 402, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32928110

ABSTRACT

BACKGROUND: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. RESULTS: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. CONCLUSIONS: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
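
To see why coverage alone leaves the assignment ambiguous, consider the node-local decision in isolation: pick the multiplicity whose expected coverage best explains the observation. The sketch below does exactly that under a simplifying Poisson assumption; the actual Detox CRF additionally couples neighbouring nodes and arcs and fits its parameters with expectation-maximisation, none of which is shown.

```cpp
// Minimal sketch of the node-local part of multiplicity inference: given the
// average coverage lambda of single-copy k-mers, pick the multiplicity m whose
// expected coverage m*lambda best explains a node's observed coverage. The
// Poisson model and the m = 0 error component are simplifying assumptions for
// illustration; they are not the Detox mixture model.
#include <cmath>
#include <iostream>

// Poisson log-likelihood of observing 'cov' given mean 'mu'.
static double logPoisson(double cov, double mu) {
    return cov * std::log(mu) - mu - std::lgamma(cov + 1.0);
}

static int bestMultiplicity(double cov, double lambda, int maxMult = 10) {
    int best = 0;
    double bestLL = logPoisson(cov, 0.5); // m = 0: sequencing-error component
    for (int m = 1; m <= maxMult; ++m) {
        double ll = logPoisson(cov, m * lambda);
        if (ll > bestLL) { bestLL = ll; best = m; }
    }
    return best;
}

int main() {
    double lambda = 30.0; // average coverage of a multiplicity-1 node
    double covs[] = { 2.0, 28.0, 61.0, 95.0 };
    for (double cov : covs)
        std::cout << "coverage " << cov << " -> multiplicity "
                  << bestMultiplicity(cov, lambda) << '\n';
}
```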


Subjects
Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Algorithms, Humans
6.
BMC Bioinformatics ; 20(1): 298, 2019 Jun 03.
Article in English | MEDLINE | ID: mdl-31159722

ABSTRACT

BACKGROUND: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. RESULTS: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. CONCLUSIONS: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector .
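
As a rough illustration of the clustering step, the sketch below groups reads by running a simple label-propagation pass over a toy read-overlap graph. It merely stands in for BrownieCorrector's community detection algorithm, which also weighs in paired-end information and is not reproduced here.

```cpp
// Minimal sketch of grouping reads by clustering a read-similarity graph.
// Nodes are reads; edges connect reads whose sequences overlap sufficiently.
// A basic label-propagation pass is used as a stand-in for community
// detection; the graph below is a toy example.
#include <iostream>
#include <vector>

int main() {
    // Toy read-overlap graph: reads 0-2 overlap each other, reads 3-4 overlap.
    const int n = 5;
    std::vector<std::vector<int>> adj = { {1, 2}, {0, 2}, {0, 1}, {4}, {3} };

    std::vector<int> label(n);
    for (int i = 0; i < n; ++i) label[i] = i;   // start: every read alone

    // Each read repeatedly adopts the most frequent label among its
    // neighbours until labels stop changing (capped for safety).
    bool changed = true;
    int iter = 0;
    while (changed && iter++ < 100) {
        changed = false;
        for (int i = 0; i < n; ++i) {
            std::vector<int> count(n, 0);
            for (int j : adj[i]) ++count[label[j]];
            int best = label[i];
            for (int l = 0; l < n; ++l)
                if (count[l] > count[best]) best = l;
            if (best != label[i]) { label[i] = best; changed = true; }
        }
    }
    for (int i = 0; i < n; ++i)
        std::cout << "read " << i << " -> cluster " << label[i] << '\n';
}
```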


Subjects
Algorithms, DNA/genetics, Genome, Repetitive Nucleic Acid Sequences/genetics, DNA Sequence Analysis/methods, Animals, Nucleic Acid Databases, Humans, Sequence Alignment, Time Factors
7.
PLoS Comput Biol ; 14(12): e1006637, 2018 12.
Article in English | MEDLINE | ID: mdl-30571689

ABSTRACT

Rotors are functional reentry sources identified in clinically relevant cardiac arrhythmias, such as ventricular and atrial fibrillation. Ablation targeting rotor sites has resulted in arrhythmia termination. Recent clinical, experimental and modelling studies demonstrate that rotors are often anchored around fibrotic scars or regions with increased fibrosis. However, the mechanisms leading to the abundance of rotors at these locations are not clear. The current study explores whether fibrotic scars merely serve as anchoring sites for the rotors or whether other active processes drive the rotors to these fibrotic regions. Rotors were induced at different distances from fibrotic scars of various sizes and degrees of fibrosis. Simulations were performed in a 2D model of human ventricular tissue and in a patient-specific model of the left ventricle of a patient with remote myocardial infarction. In both the 2D and the patient-specific model we found that without fibrotic scars, the rotors were stable at the site of their initiation. However, in the presence of a scar, rotors were eventually dynamically anchored from large distances by the fibrotic scar via a process of dynamical reorganization of the excitation pattern. This process coincides with a change from polymorphic to monomorphic ventricular tachycardia.


Subjects
Cardiac Arrhythmias/pathology, Cardiac Arrhythmias/physiopathology, Cardiovascular Models, Action Potentials, Cardiac Arrhythmias/surgery, Catheter Ablation, Computational Biology, Computer Simulation, Electrocardiography, Electrophysiological Phenomena, Fibrosis, Heart Conduction System/pathology, Heart Conduction System/physiopathology, Heart Conduction System/surgery, Heart Ventricles/pathology, Heart Ventricles/physiopathology, Humans, Magnetic Resonance Imaging, Myocardial Infarction/pathology, Myocardial Infarction/physiopathology
8.
BMC Bioinformatics ; 19(1): 311, 2018 Sep 04.
Article in English | MEDLINE | ID: mdl-30180801

ABSTRACT

BACKGROUND: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string. RESULTS: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure. CONCLUSIONS: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at: https://github.com/biointec/browniealigner.
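
The pruning idea can be shown in a few lines: keep the best score found so far and abandon a branch whenever an optimistic bound (current score plus the best the remaining read characters could still add) cannot beat it. The sketch below uses a toy trie and unit match/mismatch scores, not BrownieAligner's data structures, scoring scheme or Markov-model filtering.

```cpp
// Minimal sketch of branch-and-bound extension of a seed: explore every
// branch, but prune a branch as soon as its optimistic upper bound cannot
// improve on the best complete alignment found so far.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Node { std::map<char, int> children; };   // child label -> node index
static std::vector<Node> nodes(1);               // node 0 is the root
static int best;                                 // best complete score so far
static long expanded = 0;                        // branch extensions explored

static void addPath(const std::string& path) {   // insert a reference path
    int cur = 0;
    for (char c : path) {
        auto it = nodes[cur].children.find(c);
        if (it == nodes[cur].children.end()) {
            nodes.push_back(Node());
            it = nodes[cur].children.emplace(c, (int)nodes.size() - 1).first;
        }
        cur = it->second;
    }
}

// Align read[i..] against all paths below node 'v'; 'score' is the score so far.
static void extend(int v, const std::string& read, size_t i, int score) {
    if (i == read.size()) { best = std::max(best, score); return; }
    // Optimistic upper bound: every remaining character could match (+1 each).
    if (score + (int)(read.size() - i) <= best) return;   // prune this branch
    for (const auto& kv : nodes[v].children) {
        ++expanded;
        extend(kv.second, read, i + 1, score + (kv.first == read[i] ? 1 : -1));
    }
}

int main() {
    addPath("ACGT"); addPath("ACCT"); addPath("AGGT"); // tiny branching reference
    std::string read = "ACGT";
    best = -(int)read.size();                          // worst case: all mismatches
    extend(0, read, 0, 0);
    std::cout << "best score " << best << " after exploring " << expanded
              << " extensions\n";
}
```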


Subjects
Algorithms, Computational Biology/methods, Computer Graphics, Human Genome, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Humans, Programming Languages
9.
Bioinformatics ; 33(17): 2740-2742, 2017 Sep 01.
Article in English | MEDLINE | ID: mdl-28472230

ABSTRACT

MOTIVATION: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, it can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. RESULTS: We have developed a simulator, OMSim, that produces synthetic optical map data that mimics real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements. AVAILABILITY AND IMPLEMENTATION: The Python simulation tool and a cross-platform graphical user interface are available as open source software under the GNU GPL v2 license ( http://www.bioinformatics.intec.ugent.be/omsim ). CONTACT: jan.fostier@ugent.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Human Genome, DNA Sequence Analysis/methods, Software, Genomics/methods, Humans
10.
BMC Bioinformatics ; 18(1): 374, 2017 Aug 18.
Article in English | MEDLINE | ID: mdl-28821237

ABSTRACT

BACKGROUND: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.


Subjects
Genome, High-Throughput Nucleotide Sequencing, Algorithms, Animals, Bacteria/genetics, Caenorhabditis elegans/genetics, DNA/chemistry, DNA/metabolism, Drosophila/genetics, Humans, Sequence Alignment, DNA Sequence Analysis
11.
Nucleic Acids Res ; 43(16): e105, 2015 Sep 18.
Article in English | MEDLINE | ID: mdl-25990729

ABSTRACT

Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially not in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus obviates the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.


Subjects
Bacterial Genome, Haplotypes, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Bacterial Infections/microbiology, Coinfection/microbiology, Escherichia coli/genetics, Molecular Evolution, Humans, Genetic Polymorphism
12.
Bioinformatics ; 31(15): 2482-8, 2015 Aug 01.
Article in English | MEDLINE | ID: mdl-25819078

ABSTRACT

MOTIVATION: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. RESULTS: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.


Subjects
DNA Sequence Analysis/methods, Software, Human Genome, Humans
13.
Bioinformatics ; 31(23): 3758-66, 2015 Dec 01.
Article in English | MEDLINE | ID: mdl-26254488

ABSTRACT

MOTIVATION: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. RESULTS: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays. AVAILABILITY AND IMPLEMENTATION: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller. CONTACT: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
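
A small sketch of the IUPAC matching step may help: each IUPAC letter denotes a set of bases, so a motif matches a window when the base sets intersect at every position. Only this matching step is shown (in C++ rather than BLSSpeller's Java, purely for consistency with the other sketches in this listing); the exhaustive enumeration and branch length scoring are not reproduced, and the sequences are toy examples.

```cpp
// Minimal sketch of matching a degenerate motif written over the IUPAC
// alphabet (e.g. R = A/G, N = any base) against a promoter sequence.
#include <cstdint>
#include <iostream>
#include <string>

// Bitmask per IUPAC code: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
static uint8_t iupacMask(char c) {
    switch (c) {
        case 'A': return 1; case 'C': return 2; case 'G': return 4;
        case 'T': return 8; case 'R': return 1 | 4; case 'Y': return 2 | 8;
        case 'S': return 2 | 4; case 'W': return 1 | 8; case 'K': return 4 | 8;
        case 'M': return 1 | 2; case 'B': return 2 | 4 | 8;
        case 'D': return 1 | 4 | 8; case 'H': return 1 | 2 | 8;
        case 'V': return 1 | 2 | 4; default:  return 15; // 'N'
    }
}

// The motif matches at 'pos' if the base sets intersect at every position.
static bool matchesAt(const std::string& seq, size_t pos,
                      const std::string& motif) {
    if (pos + motif.size() > seq.size()) return false;
    for (size_t i = 0; i < motif.size(); ++i)
        if (!(iupacMask(motif[i]) & iupacMask(seq[pos + i]))) return false;
    return true;
}

int main() {
    std::string promoter = "TTGACCAATGGCCAAT";
    std::string motif = "CCAAT";      // the classic CCAAT box
    std::string degenerate = "GRCCA"; // R matches A or G
    for (size_t p = 0; p < promoter.size(); ++p) {
        if (matchesAt(promoter, p, motif))
            std::cout << motif << " at " << p << '\n';
        if (matchesAt(promoter, p, degenerate))
            std::cout << degenerate << " at " << p << '\n';
    }
}
```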


Subjects
Algorithms, Plant Genome, Genetic Promoter Regions, DNA Sequence Analysis/methods, Base Sequence, Binding Sites, Conserved Sequence, Plant DNA/chemistry, Nucleotide Motifs, Sequence Alignment, Software, Transcription Factors/metabolism
14.
Bioinformatics ; 29(10): 1308-16, 2013 May 15.
Article in English | MEDLINE | ID: mdl-23595663

ABSTRACT

MOTIVATION: When genomic data are associated with gene expression data, the resulting expression quantitative trait loci (eQTL) will likely span multiple genes. eQTL prioritization techniques can be used to select the most likely causal gene affecting the expression of a target gene from a list of candidates. As an input, these techniques use physical interaction networks that often contain highly connected genes and unreliable or irrelevant interactions that can interfere with the prioritization process. We present EPSILON, an extendable framework for eQTL prioritization, which mitigates the effect of highly connected genes and unreliable interactions by constructing a local network before a network-based similarity measure is applied to select the true causal gene. RESULTS: We tested the new method on three eQTL datasets derived from yeast data using three different association techniques. A physical interaction network was constructed, and each eQTL in each dataset was prioritized using the EPSILON approach: first, a local network was constructed using a k-trials shortest path algorithm, followed by the calculation of a network-based similarity measure. Three similarity measures were evaluated: random walks, the Laplacian Exponential Diffusion kernel and the Regularized Commute-Time kernel. The aim was to predict knockout interactions from a yeast knockout compendium. EPSILON outperformed two reference prioritization methods, random assignment and shortest path prioritization. Next, we found that using a local network significantly increased prioritization performance in terms of predicted knockout pairs when compared with using exactly the same network similarity measures on the global network, with an average increase in prioritization performance of 8 percentage points (P < 10^-5). AVAILABILITY: The physical interaction network and the source code (Matlab/C++) of our implementation can be downloaded from http://bioinformatics.intec.ugent.be/epsilon. CONTACT: lieven.verbeke@intec.ugent.be, kamar@psb.ugent.be, jan.fostier@intec.ugent.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
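
Of the three similarity measures, random walks are the easiest to sketch: starting from the target gene, a walker follows interactions and restarts with some probability, and the resulting stationary probabilities rank candidate genes by network proximity. The adjacency matrix below is a toy example rather than a yeast network, and the local-network construction and kernel-based measures are omitted.

```cpp
// Minimal sketch of a random-walk-with-restart proximity measure on a small
// interaction network. Illustrative only; not the EPSILON implementation.
#include <iostream>
#include <vector>

int main() {
    // Toy undirected interaction network over 5 genes (adjacency matrix).
    const int n = 5;
    std::vector<std::vector<double>> A = {
        {0, 1, 1, 0, 0},
        {1, 0, 1, 1, 0},
        {1, 1, 0, 0, 0},
        {0, 1, 0, 0, 1},
        {0, 0, 0, 1, 0}};

    // Column-normalize to a transition matrix W.
    std::vector<std::vector<double>> W(n, std::vector<double>(n, 0.0));
    for (int j = 0; j < n; ++j) {
        double deg = 0;
        for (int i = 0; i < n; ++i) deg += A[i][j];
        for (int i = 0; i < n; ++i) W[i][j] = A[i][j] / deg;
    }

    const double c = 0.3;                // restart probability
    std::vector<double> e(n, 0.0), p(n, 1.0 / n);
    e[0] = 1.0;                          // restart at the target gene (gene 0)

    // Power iteration: p <- (1 - c) * W * p + c * e
    for (int it = 0; it < 100; ++it) {
        std::vector<double> q(n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) q[i] += W[i][j] * p[j];
        for (int i = 0; i < n; ++i) p[i] = (1 - c) * q[i] + c * e[i];
    }
    for (int i = 0; i < n; ++i)
        std::cout << "gene " << i << " proximity " << p[i] << '\n';
}
```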


Subjects
Quantitative Trait Loci, Saccharomyces cerevisiae/genetics, Software, Algorithms, Gene Expression, Gene Knockout Techniques, Mutation
15.
Nucleic Acids Res ; 40(2): e11, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22102584

ABSTRACT

Comparative genomics is a powerful means to gain insight into the evolutionary processes that shape the genomes of related species. As the number of sequenced genomes increases, the development of software to perform accurate cross-species analyses becomes indispensable. However, many implementations that have the ability to compare multiple genomes exhibit unfavorable computational and memory requirements, limiting the number of genomes that can be analyzed in one run. Here, we present a software package to unveil genomic homology based on the identification of conservation of gene content and gene order (collinearity), i-ADHoRe 3.0, and its application to eukaryotic genomes. The use of efficient algorithms and support for parallel computing enable the analysis of large-scale data sets. Unlike other tools, i-ADHoRe can process the Ensembl data set, containing 49 species, in 1 h. Furthermore, the profile search is more sensitive to detect degenerate genomic homology than chaining pairwise collinearity information based on transitive homology. From ultra-conserved collinear regions between mammals and birds, by integrating coexpression information and protein-protein interactions, we identified more than 400 regions in the human genome showing significant functional coherence. The different algorithmic improvements ensure that i-ADHoRe 3.0 will remain a powerful tool to study genome evolution.


Subjects
Genomics/methods, Software, Synteny, Algorithms, Animals, Gene Order, Genes, Human Genome, Humans, Sequence Alignment/methods
16.
Proc Data Compress Conf ; 2024: 123-132, 2024 Mar.
Article in English | MEDLINE | ID: mdl-39157794

ABSTRACT

MONI (Rossi et al., JCB 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency.
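
For context, the LF-step referred to above maps a position in the Burrows-Wheeler transform to the position of the preceding text character, via LF(i) = C[c] + rank_c(BWT, i). The sketch below performs it on a tiny, uncompressed BWT; MONI does the same in run-length compressed space and interleaves it with the LCE queries that the paper makes lazy, neither of which is shown.

```cpp
// Minimal sketch of the LF-step used to walk backwards through a BWT:
// LF(i) = C[c] + rank_c(BWT, i), where c = BWT[i]. The BWT is built naively
// from all rotations and stored uncompressed, purely for illustration.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

static std::string buildBWT(const std::string& text) {
    std::string t = text + '$';
    std::vector<std::string> rot;
    for (size_t i = 0; i < t.size(); ++i)
        rot.push_back(t.substr(i) + t.substr(0, i));
    std::sort(rot.begin(), rot.end());
    std::string bwt;
    for (const auto& r : rot) bwt += r.back();
    return bwt;
}

int main() {
    std::string bwt = buildBWT("GATTACA");

    // C[c]: number of characters in the text smaller than c.
    std::map<char, size_t> C, cnt;
    for (char c : bwt) ++cnt[c];
    size_t acc = 0;
    for (auto& kv : cnt) { C[kv.first] = acc; acc += kv.second; }

    // Follow LF-steps from row 0; each step moves one character backwards in
    // the text, so the printed characters spell the text reversed (plus '$').
    size_t i = 0;
    for (size_t step = 0; step < bwt.size(); ++step) {
        char c = bwt[i];
        size_t rank = std::count(bwt.begin(), bwt.begin() + i, c);
        std::cout << c;
        i = C[c] + rank;              // LF(i)
    }
    std::cout << '\n';
}
```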

17.
bioRxiv ; 2024 Jun 02.
Article in English | MEDLINE | ID: mdl-38854079

ABSTRACT

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
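
The move structure that b-move builds on replaces rank queries by table lookups over the BWT runs: each run stores where its first position maps under LF, and a step becomes a lookup plus a short fast-forward across run boundaries. The sketch below shows this for a unidirectional LF-step on a toy string; the bidirectional extension and the cache-efficiency engineering that b-move adds are not reproduced, and plain rank queries are used once, only to fill in the toy table.

```cpp
// Minimal sketch of a (unidirectional) move table in the spirit of Nishimoto
// and Tabei's move structure: split the BWT into runs, record for each run
// where LF maps its first position (run index + offset), and evaluate LF by
// table lookup plus fast-forward, without rank queries at query time.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

struct Run { char c; size_t start, len, dstRun, dstOff; };

int main() {
    // BWT of "GATTACA$" built naively from sorted rotations.
    std::string t = "GATTACA$";
    std::vector<std::string> rot;
    for (size_t i = 0; i < t.size(); ++i)
        rot.push_back(t.substr(i) + t.substr(0, i));
    std::sort(rot.begin(), rot.end());
    std::string bwt;
    for (const auto& r : rot) bwt += r.back();

    // Plain LF (C-array + rank), used only to fill in the move table.
    std::map<char, size_t> C, cnt;
    for (char c : bwt) ++cnt[c];
    size_t acc = 0;
    for (auto& kv : cnt) { C[kv.first] = acc; acc += kv.second; }
    auto LF = [&](size_t i) {
        return C[bwt[i]] + std::count(bwt.begin(), bwt.begin() + i, bwt[i]);
    };

    // Split the BWT into runs.
    std::vector<Run> runs;
    for (size_t i = 0; i < bwt.size();) {
        size_t j = i;
        while (j < bwt.size() && bwt[j] == bwt[i]) ++j;
        runs.push_back({ bwt[i], i, j - i, 0, 0 });
        i = j;
    }
    // For each run, record the run and offset that LF(run start) falls into.
    for (Run& r : runs) {
        size_t dst = LF(r.start);
        for (size_t k = 0; k < runs.size(); ++k)
            if (dst >= runs[k].start && dst < runs[k].start + runs[k].len) {
                r.dstRun = k; r.dstOff = dst - runs[k].start; break;
            }
    }

    // LF-step in (run, offset) coordinates: table lookup, then fast-forward.
    auto moveLF = [&](size_t run, size_t off) {
        size_t r = runs[run].dstRun, o = runs[run].dstOff + off;
        while (o >= runs[r].len) { o -= runs[r].len; ++r; } // fast-forward
        return std::make_pair(r, o);
    };

    // Walk backwards through the text from row 0, as an FM/r-index would.
    size_t run = 0, off = 0;
    for (size_t step = 0; step < bwt.size(); ++step) {
        std::cout << runs[run].c;
        std::tie(run, off) = moveLF(run, off);
    }
    std::cout << '\n';
}
```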

18.
IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 1995-2006, 2023.
Article in English | MEDLINE | ID: mdl-37015543

ABSTRACT

In de novo genome assembly using short Illumina reads, the accurate determination of node and arc multiplicities in a de Bruijn graph has a large impact on the quality and contiguity of the assembly. The multiplicity estimates of nodes and arcs guide the cleaning of the de Bruijn graph by identifying spurious nodes and arcs that correspond to sequencing errors. Additionally, they can be used to guide repeat resolution. Here, we model the entire de Bruijn graph and the accompanying read coverage information with a single Conditional Random Field (CRF) model. We show that approximate inference using Loopy Belief Propagation (LBP) on our model improves multiplicity assignment accuracy within feasible runtimes. The order in which messages are passed has a large influence on the speed of LBP convergence. Few theoretical guarantees exist, and the conditions for convergence are not easily checked, as our CRF model contains higher-order interactions. Therefore, we also present an empirical evaluation of several message passing schemes that may guide future users of LBP on CRFs with higher-order interactions in their choice of message passing scheme.
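
To make the message passing concrete, the sketch below runs sum-product loopy belief propagation with a synchronous (flooding) schedule on a toy cycle of three binary variables. The actual CRF has higher-order factors tying each node's multiplicity to those of its incident arcs and far larger graphs, which is exactly why the choice of schedule matters there; all potentials below are made up for illustration.

```cpp
// Minimal sketch of sum-product loopy belief propagation on a tiny pairwise
// model: three binary variables in a cycle, unary potentials phi, a common
// pairwise potential psi, messages updated with a flooding schedule.
#include <algorithm>
#include <iostream>

int main() {
    const int n = 3;
    // Unary potentials phi[i][x]; node 0 strongly prefers state 1.
    double phi[n][2] = { {0.2, 0.8}, {0.5, 0.5}, {0.5, 0.5} };
    // Pairwise potential favouring equal states (a smoothing interaction).
    double psi[2][2] = { {0.9, 0.1}, {0.1, 0.9} };
    bool edge[n][n] = {};
    edge[0][1] = edge[1][0] = edge[1][2] = edge[2][1] = edge[0][2] = edge[2][0] = true;

    // msg[i][j][x] = message from i to j about state x of j (init uniform).
    double msg[n][n][2];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) msg[i][j][0] = msg[i][j][1] = 1.0;

    for (int it = 0; it < 50; ++it) {                 // flooding schedule
        double nxt[n][n][2] = {};
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                if (!edge[i][j]) continue;
                for (int xj = 0; xj < 2; ++xj) {
                    double s = 0.0;
                    for (int xi = 0; xi < 2; ++xi) {
                        double prod = phi[i][xi];
                        for (int k = 0; k < n; ++k)
                            if (edge[k][i] && k != j) prod *= msg[k][i][xi];
                        s += prod * psi[xi][xj];
                    }
                    nxt[i][j][xj] = s;
                }
                double z = nxt[i][j][0] + nxt[i][j][1]; // normalize
                nxt[i][j][0] /= z; nxt[i][j][1] /= z;
            }
        std::copy(&nxt[0][0][0], &nxt[0][0][0] + n * n * 2, &msg[0][0][0]);
    }

    // Approximate marginals (beliefs).
    for (int i = 0; i < n; ++i) {
        double b0 = phi[i][0], b1 = phi[i][1];
        for (int k = 0; k < n; ++k)
            if (edge[k][i]) { b0 *= msg[k][i][0]; b1 *= msg[k][i][1]; }
        std::cout << "P(x" << i << "=1) ~ " << b1 / (b0 + b1) << '\n';
    }
}
```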


Subjects
Algorithms, Fatigue, Humans, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing, Software
19.
Front Plant Sci ; 14: 1218665, 2023.
Article in English | MEDLINE | ID: mdl-37546253

ABSTRACT

Since the introduction of genomic selection in plant breeding, high genetic gains have been realized in different plant breeding programs. Various methods based on genomic estimated breeding values (GEBVs) for selecting parental lines that maximize the genetic gain as well as methods for improving the predictive performance of genomic selection have been proposed. Unfortunately, it remains difficult to measure to what extent these methods really maximize long-term genetic values. In this study, we propose oracle selection, a hypothetical frame of mind that uses the ground truth to optimally select parents or optimize the training population in order to maximize the genetic gain in each breeding cycle. Clearly, oracle selection cannot be applied in a true breeding program, but allows for the assessment of existing parental selection and training population update methods and the evaluation of how far these methods are from the optimal utopian solution.

20.
Bioinformatics ; 27(6): 749-56, 2011 Mar 15.
Article in English | MEDLINE | ID: mdl-21216775

ABSTRACT

MOTIVATION: Many comparative genomics studies rely on the correct identification of homologous genomic regions using accurate alignment tools. In this case, the alphabet of the input sequences consists of complete genes, rather than nucleotides or amino acids. As optimal multiple sequence alignment is computationally impractical, a progressive alignment strategy is often employed. However, such an approach is susceptible to the propagation of alignment errors in early pairwise alignment steps, especially when dealing with strongly diverged genomic regions. In this article, we present a novel, accurate and efficient greedy, graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists. RESULTS: Based on provable properties of the graph structure, several heuristics are developed to resolve local alignment conflicts that occur due to gene duplication and/or rearrangement events on the different genomic segments. The performance of the algorithm is assessed by comparing the alignment results of homologous genomic segments in Arabidopsis thaliana to those obtained by using both a progressive alignment method and an earlier graph-based implementation. Especially for datasets that contain strongly diverged segments, the proposed method achieves a substantially higher alignment accuracy, and proves to be sufficiently fast for large datasets including a few dozen eukaryotic genomes. AVAILABILITY: http://bioinformatics.psb.ugent.be/software. The algorithm is implemented as a part of the i-ADHoRe 3.0 package.


Subjects
Algorithms, Genomics/methods, Sequence Alignment/methods, Arabidopsis/genetics, Computational Biology/methods, Genome, Software