Search | BVS Bolivia

1.

b-move: faster bidirectional character extensions in a run-length compressed index.

Depuydt, Lore; Renders, Luca; de Vyver, Simon Van; Veys, Lennart; Gagie, Travis; Fostier, Jan.

bioRxiv ; 2024 Jun 02.

Article in English | MEDLINE | ID: mdl-38854079

ABSTRACT

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
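
Illustrative note: the memory savings of run-length compressed indices come from the fact that the Burrows-Wheeler transform (BWT) of a repetitive text consists of few long runs of equal characters. The sketch below is not taken from b-move; it is a naive Python illustration, and the function names `bwt` and `run_length_encode` are assumptions made for this example. It builds the BWT by sorting rotations (only feasible for tiny inputs) and counts the runs.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into a list of (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, length) for ch, length in runs]


if __name__ == "__main__":
    repetitive = "ACGT" * 50
    transformed = bwt(repetitive)
    runs = run_length_encode(transformed)
    print(len(transformed), "characters stored as", len(runs), "runs")
```

For a highly repetitive input such as the one above, the number of runs stays small while the text length grows, which is the property that indices like the r-index and the move structure exploit.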

2.

Pan-genome de Bruijn graph using the bidirectional FM-index.

Depuydt, Lore; Renders, Luca; Abeel, Thomas; Fostier, Jan.

BMC Bioinformatics ; 24(1): 400, 2023 Oct 26.

Article in English | MEDLINE | ID: mdl-37884897

ABSTRACT

BACKGROUND: Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index's backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS: We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS: We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under the AGPL-3.0 license.
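
Illustrative note: the backward search that the abstract refers to can be sketched in a few lines. The code below is a textbook, unidirectional FM-index with uncompressed tables, not the Nexus data structure; the names `build_fm_index` and `backward_search` are assumptions made for this example. A bidirectional index would additionally maintain a synchronized range over the reversed text so that the pattern can also be extended to the right.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and full occurrence table.

    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of all exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix array range [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    index = build_fm_index("banana")
    print(backward_search("ana", *index))
```

Each pattern character is processed with two table lookups, which is why the search time depends on the pattern length and not on the text length.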

Subject(s)

Algorithms, Genome, DNA Sequence Analysis, Software, Computational Biology

3.

Oracle selection provides insight into how far off practice is from Utopia in plant breeding.

Vanavermaete, David; Maenhout, Steven; Fostier, Jan; De Baets, Bernard.

Front Plant Sci ; 14: 1218665, 2023.

Article in English | MEDLINE | ID: mdl-37546253

ABSTRACT

Since the introduction of genomic selection in plant breeding, high genetic gains have been realized in different plant breeding programs. Various methods based on genomic estimated breeding values (GEBVs) for selecting parental lines that maximize the genetic gain as well as methods for improving the predictive performance of genomic selection have been proposed. Unfortunately, it remains difficult to measure to what extent these methods really maximize long-term genetic values. In this study, we propose oracle selection, a hypothetical frame of mind that uses the ground truth to optimally select parents or optimize the training population in order to maximize the genetic gain in each breeding cycle. Clearly, oracle selection cannot be applied in a true breeding program, but allows for the assessment of existing parental selection and training population update methods and the evaluation of how far these methods are from the optimal utopian solution.

4.

Improved Node and Arc Multiplicity Estimation in De Bruijn Graphs Using Approximate Inference in Conditional Random Fields.

Steyaert, Aranka; Audenaert, Pieter; Fostier, Jan.

IEEE/ACM Trans Comput Biol Bioinform ; 20(3): 1995-2006, 2023.

Article in English | MEDLINE | ID: mdl-37015543

ABSTRACT

In de novo genome assembly using short Illumina reads, the accurate determination of node and arc multiplicities in a de Bruijn graph has a large impact on the quality and contiguity of the assembly. The multiplicity estimates of nodes and arcs guide the cleaning of the de Bruijn graph by identifying spurious nodes and arcs that correspond to sequencing errors. Additionally, they can be used to guide repeat resolution. Here, we model the entire de Bruijn graph and the accompanying read coverage information with a single Conditional Random Field (CRF) model. We show that approximate inference using Loopy Belief Propagation (LBP) on our model improves multiplicity assignment accuracy within feasible runtimes. The order in which messages are passed has a large influence on the speed of LBP convergence. Few theoretical guarantees exist and the conditions for convergence are not easily checked as our CRF model contains higher-order interactions. Therefore, we also present an empirical evaluation of several message passing schemes that may guide future users of LBP on CRFs with higher-order interactions in their choice of message passing scheme.

Subject(s)

Algorithms, Fatigue, Humans, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing, Software

5.

BLSSpeller to discover novel regulatory motifs in maize.

Rahmani, Razgar Seyed; Decap, Dries; Fostier, Jan; Marchal, Kathleen.

DNA Res ; 29(4), 2022 Jun 25.

Article in English | MEDLINE | ID: mdl-35904558

ABSTRACT

With the decreasing cost of sequencing and availability of larger numbers of sequenced genomes, comparative genomics is becoming increasingly attractive to complement experimental techniques for the task of transcription factor (TF) binding site identification. In this study, we redesigned BLSSpeller, a motif discovery algorithm, to cope with larger sequence datasets. BLSSpeller was used to identify novel motifs in Zea mays in a comparative genomics setting with 16 monocot lineages. We discovered 61 motifs of which 20 matched previously described motif models in Arabidopsis. In addition, novel, yet uncharacterized motifs were detected, several of which are supported by available sequence-based and/or functional data. Instances of the predicted motifs were enriched around transcription start sites and contained signatures of selection. Moreover, the enrichment of the predicted motif instances in open chromatin and TF binding sites indicates their functionality, supported by the fact that genes carrying instances of these motifs were often found to be co-expressed and/or enriched in similar GO functions. Overall, our study unveiled several novel candidate motifs that might help our understanding of the genotype to phenotype association in crops.

Subject(s)

Arabidopsis, Zea mays, Algorithms, Arabidopsis/genetics, Binding Sites, Genomics/methods, Nucleotide Motifs, Protein Binding, Zea mays/genetics

6.

Halvade somatic: Somatic variant calling with Apache Spark.

Decap, Dries; de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Costanza, Pascal; Herzeel, Charlotte; Wuyts, Roel; Marchal, Kathleen; Fostier, Jan.

Gigascience ; 11(1), 2022 Jan 12.

Article in English | MEDLINE | ID: mdl-35022699

ABSTRACT

BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. FINDINGS: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. 
CONCLUSIONS: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
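
Illustrative note: the speedup figures in the abstract above can be recomputed from the reported runtimes. The short Python sketch below does this and additionally derives a parallel efficiency for the 16-node run; the efficiency figure and the helper names `speedup` and `parallel_efficiency` are not from the abstract. Because the published runtimes are rounded, the recomputed multi-node speedup comes out slightly below the reported 14.4, presumably because the authors used unrounded timings.

```python
def speedup(baseline_hours, improved_hours):
    """Ratio of the baseline runtime to the improved runtime."""
    return baseline_hours / improved_hours


def parallel_efficiency(speedup_factor, workers):
    """Fraction of ideal linear scaling achieved with the given worker count."""
    return speedup_factor / workers


# Runtimes as reported in the abstract (hours).
single_node = speedup(84.5, 19.5)   # original pipeline vs. one 36-core node
multi_node = speedup(19.5, 1.36)    # one node vs. 16 nodes
efficiency = parallel_efficiency(multi_node, 16)

if __name__ == "__main__":
    print(f"single-node speedup: {single_node:.1f}x")
    print(f"multi-node speedup:  {multi_node:.1f}x")
    print(f"16-node efficiency:  {efficiency:.0%}")
```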

Subject(s)

High-Throughput Nucleotide Sequencing, Software, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism, DNA Sequence Analysis/methods, Exome Sequencing, Whole Genome Sequencing

7.

Deep scoping: a breeding strategy to preserve, reintroduce and exploit genetic variation.

Vanavermaete, David; Fostier, Jan; Maenhout, Steven; De Baets, Bernard.

Theor Appl Genet ; 134(12): 3845-3861, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34387711

ABSTRACT

KEY MESSAGE: The deep scoping method incorporates the use of a gene bank together with different population layers to reintroduce genetic variation into the breeding population, thus maximizing the long-term genetic gain without reducing the short-term genetic gain or increasing the total financial cost. Genomic prediction is often combined with truncation selection to identify superior parental individuals that can pass on favorable quantitative trait locus (QTL) alleles to their offspring. However, truncation selection reduces genetic variation within the breeding population, causing a premature convergence to a sub-optimal genetic value. In order to also increase genetic gain in the long term, different methods have been proposed that better preserve genetic variation. However, when the genetic variation of the breeding population has already been reduced as a result of prior intensive selection, even those methods will not be able to avert such premature convergence. Pre-breeding provides a solution for this problem by reintroducing genetic variation into the breeding population. Unfortunately, as pre-breeding often relies on a separate breeding population to increase the genetic value of wild specimens before introducing them in the elite population, it comes with an increased financial cost. In this paper, on the basis of a simulation study, we propose a new method that reintroduces genetic variation in the breeding population on a continuous basis without the need for a separate pre-breeding program or a larger population size. This way, we are able to introduce favorable QTL alleles into an elite population and maximize the genetic gain in the short as well as in the long term without increasing the financial cost.

Subject(s)

Genetic Variation, Plant Breeding, Quantitative Trait Loci, Alleles, Haploidy, Hordeum/genetics, Genetic Models, Plant Breeding/methods

8.

Dynamic partitioning of search patterns for approximate pattern matching using search schemes.

Renders, Luca; Marchal, Kathleen; Fostier, Jan.

iScience ; 24(7): 102687, 2021 Jul 23.

Article in English | MEDLINE | ID: mdl-34235407

ABSTRACT

Search schemes constitute a flexible and generic framework to describe how all approximate occurrences of a search pattern in a text can be found efficiently. We propose an algorithm for the dynamic partitioning of search patterns which can be universally applied to any kind of search scheme and demonstrate that this technique significantly reduces the search space. We present Columba, a software tool written in C++, in which a multitude of search schemes are implemented. We discuss implementation aspects such as memory interleaving of Burrows-Wheeler transform representations and the reduction of redundancy that is inherently associated with the edit distance metric. Ultimately, we demonstrate that Columba has superior performance to the state of the art. Using a single CPU core, Columba is able to retrieve all occurrences of 100,000 Illumina reads and their reverse complements within a maximum edit distance of four in the human genome in less than 3 min.
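
Illustrative note: "all occurrences within a maximum edit distance" has a simple reference definition that can be checked with dynamic programming. The sketch below is not how Columba works (Columba uses search schemes on a bidirectional index); it is the classical quadratic-time semi-global alignment, useful as a correctness baseline on small inputs. The function name `approx_end_positions` is an assumption made for this example.

```python
def approx_end_positions(pattern, text, k):
    """Return 1-based end positions in text where some substring matches
    pattern with edit distance at most k (semi-global dynamic programming)."""
    # prev[i] = edit distance between pattern[:i] and the best substring
    # ending at the previous text position; the initial column is the empty text.
    prev = list(range(len(pattern) + 1))
    ends = []
    for j, text_char in enumerate(text, start=1):
        cur = [0]  # an occurrence may start at any text position at no cost
        for i, pattern_char in enumerate(pattern, start=1):
            cur.append(min(
                prev[i] + 1,                                # skip a text character
                cur[i - 1] + 1,                             # skip a pattern character
                prev[i - 1] + (pattern_char != text_char),  # match or substitution
            ))
        if cur[-1] <= k:
            ends.append(j)
        prev = cur
    return ends


if __name__ == "__main__":
    print(approx_end_positions("ACGT", "TTACGTTT", 0))
    print(approx_end_positions("ACGT", "TTACGTTT", 1))
```

With k = 1 the exact occurrence is reported together with neighboring end positions that correspond to the same locus with one insertion or deletion; this is an instance of the redundancy inherent to the edit distance metric that the abstract mentions.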

9.

Multithreaded variant calling in elPrep 5.

Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Wuyts, Roel; Verachtert, Wilfried.

PLoS One ; 16(2): e0244471, 2021.

Article in English | MEDLINE | ID: mdl-33539352

ABSTRACT

We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces the same BAM and VCF output as GATK4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the variant calling pipeline by a factor of 8 to 16 on both whole-exome and whole-genome data while using the same hardware resources as GATK4. This makes elPrep 5 a suitable drop-in replacement for GATK4 when faster execution times are needed.

Subject(s)

Exome, Human Genome, DNA Sequence Analysis/methods, Software, Algorithms, Humans, Exome Sequencing

10.

Computational assessment of the feasibility of protonation-based protein sequencing.

Miclotte, Giles; Martens, Koen; Fostier, Jan.

PLoS One ; 15(9): e0238625, 2020.

Article in English | MEDLINE | ID: mdl-32915813

ABSTRACT

Recent advances in DNA sequencing methods revolutionized biology by providing highly accurate reads, with high throughput or high read length. These read data are being used in many biological and medical applications. Modern DNA sequencing methods have no equivalent in protein sequencing, severely limiting the widespread application of protein data. Recently, several optical protein sequencing methods have been proposed that rely on the fluorescent labeling of amino acids. Here, we introduce the reprotonation-deprotonation protein sequencing method. Unlike other methods, this proposed technique relies on the measurement of an electrical signal and requires no fluorescent labeling. In reprotonation-deprotonation protein sequencing, the terminal amino acid is identified through its unique protonation signal, and by repeatedly cleaving the terminal amino acids one-by-one, each amino acid in the peptide is measured. By means of simulations, we show that, given a reference database of known proteins, reprotonation-deprotonation sequencing has the potential to correctly identify proteins in a sample. Our simulations provide target values for the signal-to-noise ratios that sensor devices need to attain in order to detect reprotonation-deprotonation events, as well as suitable pH values and required measurement times per amino acid. For instance, an SNR of 10 is required for a 61.71% proteome recovery rate with 100 ms measurement time per amino acid.

Subject(s)

Amino Acids/chemistry, Proteins/chemistry, Proteome/genetics, Protein Sequence Analysis/methods, Amino Acids/genetics, Fluorescent Dyes/chemistry, Peptides/chemistry, Peptides/genetics, Proteins/genetics, Proteome/chemistry, Protons, DNA Sequence Analysis/methods, Signal-To-Noise Ratio

11.

Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields.

Steyaert, Aranka; Audenaert, Pieter; Fostier, Jan.

BMC Bioinformatics ; 21(1): 402, 2020 Sep 14.

Article in English | MEDLINE | ID: mdl-32928110

ABSTRACT

BACKGROUND: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. RESULTS: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. CONCLUSIONS: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
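
Illustrative note: the definition of multiplicity given in the abstract (the number of times a k-mer, or a (k+1)-mer for arcs, occurs in the genomic sequence) can be made concrete when the genome is known. The Python sketch below computes these ground-truth multiplicities for a toy sequence; it is not the CRF model from the paper, which has to infer the same quantities from noisy read coverage. As a simplifying assumption it ignores reverse complements, and the function name `multiplicities` is hypothetical.

```python
from collections import Counter


def multiplicities(genome, k):
    """Return (node, arc) multiplicities for a known genome sequence.

    Nodes are k-mers and arcs are (k+1)-mers; reverse complements are
    deliberately ignored in this sketch.
    """
    nodes = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    arcs = Counter(genome[i:i + k + 1] for i in range(len(genome) - k))
    return nodes, arcs


if __name__ == "__main__":
    nodes, arcs = multiplicities("ACGACGT", k=3)
    print(dict(nodes))
    print(dict(arcs))
```

In the toy sequence the 3-mer ACG occurs twice, so its node has multiplicity 2 and marks a repeat, while a k-mer created by a sequencing error would have true multiplicity 0.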

Subject(s)

Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Algorithms, Humans

12.

Comparative analysis of somatic variant calling on matched FF and FFPE WGS samples.

de Schaetzen van Brienen, Louise; Larmuseau, Maarten; Van der Eecken, Kim; De Ryck, Frederic; Robbe, Pauline; Schuh, Anna; Fostier, Jan; Ost, Piet; Marchal, Kathleen.

BMC Med Genomics ; 13(1): 94, 2020 Jul 06.

Article in English | MEDLINE | ID: mdl-32631411

ABSTRACT

BACKGROUND: Research grade Fresh Frozen (FF) DNA material is not yet routinely collected in clinical practice. Many hospitals, however, collect and store Formalin Fixed Paraffin Embedded (FFPE) tumor samples. Consequently, the sample size of whole genome cancer cohort studies could be increased tremendously by including FFPE samples, although the presence of artefacts might obfuscate the variant calling. To assess whether FFPE material can be used for cohort studies, we performed an in-depth comparison of somatic SNVs called on matching FF and FFPE Whole Genome Sequence (WGS) samples extracted from the same tumor. METHODS: Four variant callers (i.e. Strelka2, Mutect2, VarScan2 and Shimmer) were used to call somatic variants on matching FF and FFPE WGS samples from a metastatic prostate tumor. Using the variants identified by these callers, we developed a heuristic to maximize the overlap between the FF and its FFPE counterpart in terms of sensitivity and precision. The proposed variant calling approach was then validated on nine matched primary samples. Finally, we assessed what fraction of the discrepancy could be attributed to intra-tumor heterogeneity (ITH), by comparing the overlap in clonal and subclonal somatic variants. RESULTS: We first compared variants between an FF and an FFPE sample from a metastatic prostate tumor, showing that on average 50% of the calls in the FF are recovered in the FFPE sample, with notable differences between callers. Combining the variants of the different callers using a simple heuristic increases both the precision and the sensitivity of the variant calling. Validating the heuristic on nine additional matched FF-FFPE samples resulted in an average F1-score of 0.58 and outperformed each of the individual callers. In addition, we could show that part of the discrepancy between the FF and the FFPE samples can be attributed to ITH.
CONCLUSION: This study illustrates that when using the correct variant calling strategy, the majority of clonal SNVs can be recovered in an FFPE sample with high precision and sensitivity. These results suggest that somatic variants derived from WGS of FFPE material can be used in cohort studies.
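
Illustrative note: precision, sensitivity and F1-score for a pair of call sets can be computed as below. This is a generic sketch, not the heuristic from the paper; it assumes that the FF calls are treated as the reference set, that variants are compared as exact (chromosome, position, ref, alt) tuples, and the variant tuples shown are invented purely for demonstration. The function name `overlap_metrics` is hypothetical.

```python
def overlap_metrics(reference_calls, test_calls):
    """Return (precision, sensitivity, F1) of test_calls against reference_calls."""
    reference_calls, test_calls = set(reference_calls), set(test_calls)
    shared = len(reference_calls & test_calls)
    precision = shared / len(test_calls) if test_calls else 0.0
    sensitivity = shared / len(reference_calls) if reference_calls else 0.0
    if precision + sensitivity == 0:
        return precision, sensitivity, 0.0
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1


if __name__ == "__main__":
    # Invented example variants: (chromosome, position, ref, alt).
    ff = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
          ("chr2", 75, "C", "T"), ("chr3", 10, "T", "G")}
    ffpe = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
            ("chr5", 500, "C", "T"), ("chr7", 42, "G", "A")}
    print(overlap_metrics(ff, ffpe))
```

Because F1 is the harmonic mean of precision and sensitivity, a call set only scores well when it both recovers most reference variants and adds few extra calls.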

Subject(s)

Tumor Biomarkers/genetics, Neoplasm DNA/genetics, Lung Neoplasms/genetics, Mutation, Local Neoplasm Recurrence/genetics, Prostatic Neoplasms/genetics, Tissue Fixation/methods, Neoplasm DNA/analysis, Feasibility Studies, Formaldehyde/chemistry, Gene Expression Profiling, Neoplastic Gene Expression Regulation, Humans, Lung Neoplasms/secondary, Male, Local Neoplasm Recurrence/pathology, Paraffin Embedding/methods, Prognosis, Prostatic Neoplasms/pathology, Specimen Handling, Whole Genome Sequencing

13.

Preservation of Genetic Variation in a Breeding Population for Long-Term Genetic Gain.

Vanavermaete, David; Fostier, Jan; Maenhout, Steven; De Baets, Bernard.

G3 (Bethesda) ; 10(8): 2753-2762, 2020 Aug 05.

Article in English | MEDLINE | ID: mdl-32513654

ABSTRACT

Genomic selection has been successfully implemented in plant and animal breeding. The transition of parental selection based on phenotypic characteristics to genomic selection (GS) has reduced breeding time and cost while accelerating the rate of genetic progression. Although breeding methods have been adapted to include genomic selection, parental selection often involves truncation selection, selecting the individuals with the highest genomic estimated breeding values (GEBVs) in the hope that favorable properties will be passed to their offspring. This ensures genetic progression and delivers offspring with high genetic values. However, several favorable quantitative trait loci (QTL) alleles risk being eliminated from the breeding population during breeding. We show that this could reduce the mean genetic value that the breeding population could reach in the long term by up to 40%. In this paper, by means of a simulation study, we propose a new method for parental mating that is able to preserve the genetic variation in the breeding population, preventing premature convergence of the genetic values to a local optimum, thus maximizing the genetic values in the long term. We not only prevent the fixation of several unfavorable QTL alleles, but also demonstrate that the genetic values can be increased by up to 15 percentage points compared with truncation selection.

Subject(s)

Genetic Models, Genetic Selection, Animals, Genetic Variation, Plant Breeding, Quantitative Trait Loci

14.

BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs.

Fostier, Jan.

BMC Bioinformatics ; 21(Suppl 2): 81, 2020 Mar 11.

Article in English | MEDLINE | ID: mdl-32164557

ABSTRACT

BACKGROUND: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS: We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
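
Illustrative note: the idea of expressing PWM scanning as a matrix-matrix product can be sketched as follows. Each sequence window is one-hot encoded into a row, each PWM is flattened into a column, and the product of the two matrices contains every window-versus-PWM score. This is a pure-Python toy, not BLAMM's implementation (which delegates the product to an optimized BLAS library); it assumes all PWMs have the same length, and the helper names are hypothetical.

```python
ALPHABET = "ACGT"


def one_hot_windows(seq, length):
    """One row per window of the given length; 4 indicator entries per position."""
    rows = []
    for start in range(len(seq) - length + 1):
        row = [0.0] * (4 * length)
        for offset, base in enumerate(seq[start:start + length]):
            row[4 * offset + ALPHABET.index(base)] = 1.0
        rows.append(row)
    return rows


def flatten_pwm(pwm):
    """Flatten a PWM (list of per-position dicts base -> score) into one vector."""
    return [position[base] for position in pwm for base in ALPHABET]


def matmul(a, b):
    """Plain matrix product; a BLAS gemm call would replace this in practice."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scan(seq, pwms):
    """Return scores[w][p]: score of PWM p at window w (equal-length PWMs assumed)."""
    windows = one_hot_windows(seq, len(pwms[0]))
    pwm_columns = [flatten_pwm(pwm) for pwm in pwms]
    pwm_matrix = [list(row) for row in zip(*pwm_columns)]  # transpose to 4L x P
    return matmul(windows, pwm_matrix)


if __name__ == "__main__":
    toy_pwm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
               {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
    print(scan("GACA", [toy_pwm]))
```

Every score is computed regardless of the threshold applied afterwards, which is consistent with the abstract's remark that the runtime is independent of the selected p-value.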

Subject(s)

Algorithms, User-Computer Interface, Computing Methodologies, Humans, Position-Specific Scoring Matrices

15.

GABAC: an arithmetic coding solution for genomic data.

Voges, Jan; Paridaens, Tom; Müntefering, Fabian; Mainzer, Liudmila S; Bliss, Brian; Yang, Mingyu; Ochoa, Idoia; Fostier, Jan; Ostermann, Jörn; Hernaez, Mikel.

Bioinformatics ; 36(7): 2275-2277, 2020 Apr 01.

Article in English | MEDLINE | ID: mdl-31830243

ABSTRACT

MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Compression, High-Throughput Nucleotide Sequencing, Genome, Genomics, Software

16.

Illumina error correction near highly repetitive DNA regions improves de novo genome assembly.

Heydari, Mahdi; Miclotte, Giles; Van de Peer, Yves; Fostier, Jan.

BMC Bioinformatics ; 20(1): 298, 2019 Jun 03.

Article in English | MEDLINE | ID: mdl-31159722

ABSTRACT

BACKGROUND: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. RESULTS: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. CONCLUSIONS: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. 
BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector.

Subject(s)

Algorithms, DNA/genetics, Genome, Nucleic Acid Repetitive Sequences/genetics, DNA Sequence Analysis/methods, Animals, Nucleic Acid Databases, Humans, Sequence Alignment, Time Factors

17.

elPrep 4: A multithreaded framework for sequence analysis.

Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Verachtert, Wilfried.

PLoS One ; 14(2): e0209523, 2019.

Article in English | MEDLINE | ID: mdl-30759172

ABSTRACT

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

Subject(s)

Sequence Analysis/methods , Algorithms , Computational Biology/economics , Computational Biology/methods , Costs and Cost Analysis , Exome , Sequence Analysis/economics , Software

18.

Dynamical anchoring of distant arrhythmia sources by fibrotic regions via restructuring of the activation pattern.

Vandersickel, Nele; Watanabe, Masaya; Tao, Qian; Fostier, Jan; Zeppenfeld, Katja; Panfilov, Alexander V.

PLoS Comput Biol ; 14(12): e1006637, 2018 Dec.

Article in English | MEDLINE | ID: mdl-30571689

ABSTRACT

Rotors are functional reentry sources identified in clinically relevant cardiac arrhythmias, such as ventricular and atrial fibrillation. Ablation targeting rotor sites has resulted in arrhythmia termination. Recent clinical, experimental and modelling studies demonstrate that rotors are often anchored around fibrotic scars or regions with increased fibrosis. However, the mechanisms leading to the abundance of rotors at these locations are not clear. The current study explores whether fibrotic scars merely serve as anchoring sites for the rotors or whether other active processes drive the rotors to these fibrotic regions. Rotors were induced at different distances from fibrotic scars of various sizes and degrees of fibrosis. Simulations were performed in a 2D model of human ventricular tissue and in a patient-specific model of the left ventricle of a patient with remote myocardial infarction. In both the 2D and the patient-specific model we found that without fibrotic scars, the rotors were stable at the site of their initiation. However, in the presence of a scar, rotors were eventually dynamically anchored from large distances by the fibrotic scar via a process of dynamical reorganization of the excitation pattern. This process coalesces with a change from polymorphic to monomorphic ventricular tachycardia.

Subject(s)

Cardiac Arrhythmias/pathology , Cardiac Arrhythmias/physiopathology , Cardiovascular Models , Action Potentials , Cardiac Arrhythmias/surgery , Catheter Ablation , Computational Biology , Computer Simulation , Electrocardiography , Electrophysiological Phenomena , Fibrosis , Heart Conduction System/pathology , Heart Conduction System/physiopathology , Heart Conduction System/surgery , Heart Ventricles/pathology , Heart Ventricles/physiopathology , Humans , Magnetic Resonance Imaging , Myocardial Infarction/pathology , Myocardial Infarction/physiopathology

19.

BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs.

Heydari, Mahdi; Miclotte, Giles; Van de Peer, Yves; Fostier, Jan.

BMC Bioinformatics ; 19(1): 311, 2018 Sep 04.

Article in English | MEDLINE | ID: mdl-30180801

ABSTRACT

BACKGROUND: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string. RESULTS: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure. CONCLUSIONS: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at: https://github.com/biointec/browniealigner.

Subject(s)

Algorithms , Computational Biology/methods , Computer Graphics , Human Genome , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Humans , Programming Languages

20.

Evaluation of the impact of Illumina error correction tools on de novo genome assembly.

Heydari, Mahdi; Miclotte, Giles; Demeester, Piet; Van de Peer, Yves; Fostier, Jan.

BMC Bioinformatics ; 18(1): 374, 2017 Aug 18.

Article in English | MEDLINE | ID: mdl-28821237

ABSTRACT

BACKGROUND: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.

Subject(s)

Genome , High-Throughput Nucleotide Sequencing , Algorithms , Animals , Bacteria/genetics , Caenorhabditis elegans/genetics , DNA/chemistry , DNA/metabolism , Drosophila/genetics , Humans , Sequence Alignment , DNA Sequence Analysis
