Results 1 - 8 of 8
1.
BMC Bioinformatics ; 21(Suppl 8): 260, 2020 Sep 16.
Article in English | MEDLINE | ID: mdl-32938358

ABSTRACT

BACKGROUND: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was proposed and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph based tools by overcoming several of their limitations, namely: (i) the need to fix a value, usually small, for the order k; (ii) the loss of important information such as k-mer coverage and the adjacency of k-mers within the same read; and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, could identify only SNPs, and it was too slow and memory-consuming due to its use of additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT. RESULTS: In this paper, we introduce a new algorithm, and the corresponding tool ebwt2InDel, that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and enabling the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel . CONCLUSIONS: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool finds up to 83% of the SNPs and 72% of the existing INDELs, considerably improving on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments; in these cases too, our tool exhibits a much higher sensitivity than the state-of-the-art tool.
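As background for readers unfamiliar with the transform at the heart of this framework, here is a minimal sketch of computing a BWT via sorted cyclic rotations (illustrative code, not part of ebwt2InDel; the tool operates on far more space-efficient representations):

```python
def bwt(text):
    """Compute the Burrows-Wheeler Transform of `text` via cyclic rotations.

    A terminator '$' (lexicographically smallest symbol) is appended so
    the transform is invertible. Characters that precede equal contexts
    in the text end up adjacent in the output, which is the property
    variant callers like ebwt2InDel exploit.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # prints "annb$aa"
```

Note how the three 'a's preceding the sorted 'n...' contexts cluster together; at genome scale, bases covering the same locus cluster analogously.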


Subjects
Genomics/methods , DNA Sequence Analysis/methods , Algorithms , Humans , Single Nucleotide Polymorphism
2.
Arch Toxicol ; 91(5): 2067-2078, 2017 May.
Article in English | MEDLINE | ID: mdl-27838757

ABSTRACT

Arsenic, a carcinogen with immunotoxic effects, is a common contaminant of drinking water and certain foods worldwide. We hypothesized that chronic arsenic exposure alters gene expression, potentially by altering DNA methylation of genes encoding central components of the immune system. We therefore analyzed the transcriptomes (by RNA sequencing) and methylomes (by target-enrichment next-generation sequencing) of primary CD4-positive T cells from matched groups of four women each in the Argentinean Andes, with a fivefold difference in urinary arsenic concentrations (median urinary arsenic in the lower- and high-arsenic groups: 65 and 276 µg/l, respectively). Arsenic exposure was associated with genome-wide alterations of gene expression; principal component analysis indicated that the exposure explained 53% of the variance in gene expression among the top variable genes, and 19% of 28,351 genes were differentially expressed (false discovery rate <0.05) between the exposure groups. Key genes regulating the immune system, such as tumor necrosis factor alpha and interferon gamma, as well as genes related to the NF-kappa-B complex, were significantly downregulated in the high-arsenic group. Arsenic exposure was also associated with genome-wide alterations of DNA methylation; the high-arsenic group had 3 percentage points more genome-wide full methylation (>80% methylation) than the lower-arsenic group. Differentially methylated regions that were hypermethylated in the high-arsenic group showed enrichment for immune-related gene ontologies that constitute the basic functions of CD4-positive T cells, such as isotype switching and lymphocyte activation and differentiation. In conclusion, chronic arsenic exposure from drinking water was related to changes in the transcriptome and methylome of CD4-positive T cells, both genome-wide and in specific genes, supporting the hypothesis that arsenic causes immunotoxicity by interfering with gene expression and regulation.


Subjects
Arsenic/toxicity , CD4-Positive T-Lymphocytes/drug effects , DNA Methylation/drug effects , Environmental Exposure/adverse effects , Gene Expression Regulation/drug effects , Adult , Argentina , CD4-Positive T-Lymphocytes/physiology , CpG Islands , Female , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Humans , Middle Aged , Genetic Promoter Regions
3.
BMC Bioinformatics ; 17 Suppl 4: 69, 2016 Mar 02.
Article in English | MEDLINE | ID: mdl-26961371

ABSTRACT

BACKGROUND: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be compressed, due to its significant size. METHODS: As far as the data structures employed to align bisulfite-treated reads are concerned, solutions proposed in the literature can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast but memory-consuming; the latter are much slower but lightweight. In this paper, we try to close this gap by proposing a data structure for aligning bisulfite-treated reads that is at the same time fast, lightweight, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.
RESULTS: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter. CONCLUSIONS: The new functionalities of our bisulfite aligner and caller make them fast and memory-efficient tools, useful for analyzing big datasets with limited computational resources, easily processing target enrichment data, and producing statistics such as protocol efficiency and coverage as a function of the distance from target regions.
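One way to realize a "bisulfite-aware" hash function is to collapse C onto T before hashing, so a bisulfite-converted read and its genomic origin fall into the same bucket. The sketch below illustrates this idea for forward-strand C-to-T conversion only; the function names and the polynomial hash are illustrative assumptions, not ERNE-BS5's actual implementation:

```python
def bs_collapse(seq):
    """Map C to T so a bisulfite-converted read and its genomic origin
    reduce to the same string (forward-strand C->T conversion only)."""
    return seq.replace("C", "T")

def bs_hash(kmer, mod=2**61 - 1, base=4):
    """Polynomial hash over the C/T-collapsed k-mer (illustrative)."""
    code = {"A": 0, "T": 1, "G": 2}  # C has been collapsed onto T
    h = 0
    for c in bs_collapse(kmer):
        h = (h * base + code[c]) % mod
    return h

# A read whose unmethylated Cs were sequenced as Ts hashes like its origin:
assert bs_hash("ACGCT") == bs_hash("ATGTT")
```

The price of collapsing is a loss of specificity (more spurious candidate positions), which the aligner must resolve by verifying candidates against the uncollapsed reference.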


Subjects
DNA Methylation , DNA/chemistry , Epigenomics , DNA Sequence Analysis/methods , Software , Sulfites/chemistry , CpG Islands , Data Compression , Human Genome , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans
4.
BMC Bioinformatics ; 16 Suppl 9: S4, 2015.
Article in English | MEDLINE | ID: mdl-26051265

ABSTRACT

BACKGROUND: The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever greater algorithmic challenges for aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used should be lightweight. Available modern solutions are usually a compromise between these constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. METHODS: In this work we describe a technique that attains the advantages of both classes of indexes. This is achieved using Hamming-aware hash functions (hash functions designed to search the entire Hamming sphere in reduced time) which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space, introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to incompressible. RESULTS: The new data structure introduced in this work is called dB-hash, and we show how its implementation, BW-ERNE, maintains the high sensitivity and speed of its (hash-based) predecessor ERNE while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data show that BW-ERNE attains both the positive features of succinct data structures (that is, small space) and of hash indexes (that is, sensitivity).
CONCLUSIONS: In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, by combining hashing and succinct indexing techniques, we can attain good performance and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
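The key property of a Hamming-aware hash function can be illustrated with a XOR-linear hash: because h(x ^ e) == h(x) ^ h(e), a pattern at small Hamming distance from a read hashes into a small, enumerable set of buckets around the pattern's own hash. The block size and encoding below are illustrative assumptions, not the exact function used by dB-hash:

```python
def encode(seq):
    """Pack a DNA string 2 bits per base into an int (A=00, C=01, G=10, T=11)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    v = 0
    for c in seq:
        v = (v << 2) | code[c]
    return v

def hamming_aware_hash(seq, w=8):
    """XOR the 2-bit encoding together in blocks of w bits.

    XOR-folding is linear over GF(2), so for equal-length strings x, y
    with bit-difference e: h(x) ^ h(y) == h_bits(e). A single base
    mismatch perturbs at most 2 bits of the hash, so the Hamming ball
    of a pattern maps into a small set of hash values.
    """
    v = encode(seq)
    h = 0
    while v:
        h ^= v & ((1 << w) - 1)
        v >>= w
    return h

x, y = "ACGTACGT", "ACGTACGA"  # one mismatch in the last base (T vs A)
d = hamming_aware_hash(x) ^ hamming_aware_hash(y)
print(d)  # prints 3: exactly the 2 flipped bits of the encoding
```

A de Bruijn homomorphism additionally lets the hash of the next k-mer along a read be updated from the previous one in constant time, which is what makes indexing all k-mers feasible.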


Subjects
Algorithms , Human Genome , Plant Genome , DNA Sequence Analysis/methods , Vitis/genetics , Computer Simulation , Humans
5.
Int Symp String Process Inf Retr ; 14240: 143-156, 2023 Sep.
Article in English | MEDLINE | ID: mdl-39108943

ABSTRACT

Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires O(n log n) bits, where n is the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the alphabet size is constant. In this paper, we propose a sampling technique that allows accessing an entry of the LCP array in logarithmic time while storing only a linear number of bits. We use our technique to provide a space-time tradeoff for computing matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a k-th order de Bruijn graph with a linear number of bits, we can navigate the underlying variable-order de Bruijn graph in time logarithmic in k, thus improving a previous bound by Boucher et al., which was linear in k [DCC 2015].
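As a refresher on the string case being generalized here, the suffix array and LCP array of a string can be built naively as follows (a sketch for illustration only; the paper's contribution is a sampled representation of the LCP array, not its construction):

```python
def suffix_array(s):
    """Naive suffix array: starting positions of all suffixes of s,
    sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s):
    """LCP[i] = length of the longest common prefix of the suffixes of
    rank i-1 and i in the suffix array (LCP[0] = 0 by convention)."""
    sa = suffix_array(s)
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = s[sa[i - 1]:], s[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

print(suffix_array("banana$"))  # prints [6, 5, 3, 1, 0, 4, 2]
print(lcp_array("banana$"))     # prints [0, 0, 1, 3, 0, 0, 2]
```

Sampling replaces the full integer array with a linear-size bit structure plus a subset of stored entries, recovering any missing entry in logarithmic time.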

6.
Proc Data Compress Conf ; 2023: 150-159, 2023 Mar.
Article in English | MEDLINE | ID: mdl-38832320

ABSTRACT

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time- and space-efficient algorithm for computing matching statistics that relies on some components of a compressed suffix tree, notably the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
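For concreteness, here is what matching statistics compute, via a naive reference implementation (illustrative only; the algorithms discussed in the paper achieve this with compressed indexes rather than substring scans):

```python
def matching_statistics(pattern, text):
    """MS[i] = length of the longest prefix of pattern[i:] that occurs
    somewhere in text. Naive reference: extend each prefix until it no
    longer occurs as a substring of text."""
    ms = []
    for i in range(len(pattern)):
        best = 0
        for j in range(i + 1, len(pattern) + 1):
            if pattern[i:j] in text:
                best = j - i
            else:
                break
        ms.append(best)
    return ms

print(matching_statistics("banaba", "banana"))  # prints [4, 3, 2, 1, 2, 1]
```

Index-based algorithms compute the same array in time proportional to the pattern length, which is what makes matching statistics usable as an approximate-matching subroutine at scale.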

7.
Algorithms Mol Biol ; 15: 18, 2020.
Article in English | MEDLINE | ID: mdl-32973918

ABSTRACT

BACKGROUND: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed alongside the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms that construct suffix arrays for string collections. RESULTS: In this paper we introduce gsufsort, open-source software for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq, and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. CONCLUSIONS: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
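A generalized suffix array over a collection can be pictured with the following naive sketch (illustrative only: real tools like gsufsort build this in O(N) time over an integer alphabet, not by sorting suffix strings):

```python
def gsa(strings):
    """Generalized suffix array: all suffixes of all strings in the
    collection, sorted; returned as (string_index, offset) pairs.

    A per-string terminator (here chr(0), lexicographically smallest)
    keeps suffixes of different strings distinct; ties between equal
    suffixes are broken by string index. Naive O(N^2 log N) reference.
    """
    suffixes = []
    for d, s in enumerate(strings):
        t = s + "\0"
        for i in range(len(t)):
            suffixes.append((t[i:], d, i))
    suffixes.sort()
    return [(d, i) for _, d, i in suffixes]

print(gsa(["ab", "ba"]))
```

The LCP, BWT, and document arrays mentioned above can all be derived from this sorted order, which is why a single O(N)-time construction pass is so useful.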

8.
Algorithms Mol Biol ; 14: 3, 2019.
Article in English | MEDLINE | ID: mdl-30839919

ABSTRACT

BACKGROUND: Sequencing technologies keep getting cheaper and faster, putting growing pressure on data structures designed to efficiently store raw data and possibly perform analysis therein. In this view, there is growing interest in alignment-free and reference-free variant calling methods that only make use of (suitably indexed) raw read data. RESULTS: We develop the positional clustering theory that (i) describes how the extended Burrows-Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP calling method, and we devised a corresponding SNP calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays since, in accordance with our theoretical framework, they lie within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP. CONCLUSIONS: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data. AVAILABILITY: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.
