Pesquisa | Portal Regional da BVS

1.

Computing matching statistics on Wheeler DFAs.

Conte, Alessio; Cotumaccio, Nicola; Gagie, Travis; Manzini, Giovanni; Prezza, Nicola; Sciortino, Marinella.

Proc Data Compress Conf ; 2023: 150-159, 2023 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-38832320

RESUMO

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

2.

PHONI: Streamed Matching Statistics with Multi-Genome References.

Boucher, Christina; Gagie, Travis; Tomohiro, I; Köppl, Dominik; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo; Pacheco, Alejandro; Rossi, Massimiliano.

Proc Data Compress Conf ; 2021: 193-202, 2021 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-34778549

RESUMO

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.

3.

EZcount: An all-in-one software for microRNA expression quantification from NGS sequencing data.

Geraci, Filippo; Manzini, Giovanni.

Comput Biol Med ; 133: 104352, 2021 06.

Artigo em Inglês | MEDLINE | ID: mdl-33852974

RESUMO

MicroRNAs (miRNAs) are short endogenous molecules of RNA that influence cell regulation by suppressing genes. Their ubiquity throughout all branches of the tree of life has suggested their central role in many cellular functions. Nowadays, several personalized medicine applications rely on miRNAs as biomarkers for diagnoses, prognoses, and prediction of drug response. The increasing ease of sequencing miRNAs contrasts with the difficulty of accurately quantifying their concentration. The use of general purpose aligners is only a partial solution as they have limited possibilities to accurately solve ambiguous mapping due to the short length of these sequences. We developed EZcount, an all-in-one software that, with a single command, performs the entire quantification process: from raw fastq files to read counts. Experiments show that EZcount is more sensitive and accurate than methods based on sequence alignment, independently of the library preparation protocol and sequencing machine. The parallel architecture of EZcount makes it fast enough to process a sample in minutes using a standard workstation. EZcount runs on all of the most common operating systems (Linux, Windows and MacOS) and is freely available for download at https://gitlab.com/BioAlgo/miR-pipe. A detailed description of the datasets, the raw experimental results, and all the scripts used for testing are available as supplementary material.

Assuntos

MicroRNAs , Software , Sequenciamento de Nucleotídeos em Larga Escala , MicroRNAs/genética , Alinhamento de Sequência

4.

PFP Compressed Suffix Trees.

Boucher, Christina; Cvacho, Ondrej; Gagie, Travis; Holub, Jan; Manzini, Giovanni; Navarro, Gonzalo; Rossi, Massimiliano.

Proc Worksh Algorithm Eng Exp ; 2021: 60-72, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-35355938

RESUMO

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

5.

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 500-513, 2020 04.

Artigo em Inglês | MEDLINE | ID: mdl-32181684

RESUMO

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that-when used with the rank data structure-allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.

Assuntos

Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Alinhamento de Sequência/métodos , Software , Algoritmos , Genoma Humano/genética , Humanos , Análise de Sequência de DNA/métodos

6.

Matching Reads to Many Genomes with the r-Index.

Mun, Taher; Kuhnle, Alan; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 514-518, 2020 04.

Artigo em Inglês | MEDLINE | ID: mdl-32181686

RESUMO

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on an FASTA file to build an r-index for that file; and how to query that index with ri-align.

Assuntos

Genoma/genética , Genômica , Análise de Sequência de DNA/métodos , Bases de Dados Genéticas , Humanos , Alinhamento de Sequência/métodos , Software

7.

Prefix-free parsing for building big BWTs.

Boucher, Christina; Gagie, Travis; Kuhnle, Alan; Langmead, Ben; Manzini, Giovanni; Mun, Taher.

Algorithms Mol Biol ; 14: 13, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31149025

RESUMO

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive-a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.

8.

External memory BWT and LCP computation for sequence collections with applications.

Egidi, Lavinia; Louza, Felipe A; Manzini, Giovanni; Telles, Guilherme P.

Algorithms Mol Biol ; 14: 6, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-30899322

RESUMO

BACKGROUND: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. RESULTS: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs. CONCLUSIONS: We prove that our algorithm performs O ( n maxlcp ) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

9.

STAble: a novel approach to de novo assembly of RNA-seq data and its application in a metabolic model network based metatranscriptomic workflow.

Saggese, Igor; Bona, Elisa; Conway, Max; Favero, Francesco; Ladetto, Marco; Liò, Pietro; Manzini, Giovanni; Mignone, Flavio.

BMC Bioinformatics ; 19(Suppl 7): 184, 2018 07 09.

Artigo em Inglês | MEDLINE | ID: mdl-30066630

RESUMO

BACKGROUND: De novo assembly of RNA-seq data allows the study of transcriptome in absence of a reference genome either if data is obtained from a single organism or from a mixed sample as in metatranscriptomics studies. Given the high number of sequences obtained from NGS approaches, a critical step in any analysis workflow is the assembly of reads to reconstruct transcripts thus reducing the complexity of the analysis. Despite many available tools show a good sensitivity, there is a high percentage of false positives due to the high number of assemblies considered and it is likely that the high frequency of false positive is underestimated by currently used benchmarks. The reconstruction of not existing transcripts may false the biological interpretation of results as - for example - may overestimate the identification of "novel" transcripts. Moreover, benchmarks performed are usually based on RNA-seq data from annotated genomes and assembled transcripts are compared to annotations and genomes to identify putative good and wrong reconstructions, but these tests alone may lead to accept a particular type of false positive as true, as better described below. RESULTS: Here we present a novel methodology of de novo assembly, implemented in a software named STAble (Short-reads Transcriptome Assembler). The novel concept of this assembler is that the whole reads are used to determine possible alignments instead of using smaller k-mers, with the aim of reducing the number of chimeras produced. Furthermore, we applied a new set of benchmarks based on simulated data to better define the performance of assembly method and carefully identifying true reconstructions. STAble was also used to build a prototype workflow to analyse metatranscriptomics data in connection to a steady state metabolic modelling algorithm. This algorithm was used to produce high quality metabolic interpretations of small gene expression sets obtained from already published RNA-seq data that we assembled with STAble. CONCLUSIONS: The presented results, albeit preliminary, clearly suggest that with this approach is possible to identify informative reactions not directly revealed by raw transcriptomic data.

Assuntos

Redes e Vias Metabólicas/genética , Modelos Genéticos , Análise de Sequência de RNA/métodos , Software , Transcriptoma/genética , Fluxo de Trabalho , Algoritmos , Animais , Humanos , Metano/metabolismo , RNA Mensageiro/genética , RNA Mensageiro/metabolismo , Ruminantes

10.

A Census of Tandemly Repeated Polymorphic Loci in Genic Regions Through the Comparative Integration of Human Genome Assemblies.

Genovese, Loredana M; Geraci, Filippo; Corrado, Lucia; Mangano, Eleonora; D'Aurizio, Romina; Bordoni, Roberta; Severgnini, Marco; Manzini, Giovanni; De Bellis, Gianluca; D'Alfonso, Sandra; Pellegrini, Marco.

Front Genet ; 9: 155, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-29770143

RESUMO

Polymorphic Tandem Repeat (PTR) is a common form of polymorphism in the human genome. A PTR consists in a variation found in an individual (or in a population) of the number of repeating units of a Tandem Repeat (TR) locus of the genome with respect to the reference genome. Several phenotypic traits and diseases have been discovered to be strongly associated with or caused by specific PTR loci. PTR are further distinguished in two main classes: Short Tandem Repeats (STR) when the repeating unit has size up to 6 base pairs, and Variable Number Tandem Repeats (VNTR) for repeating units of size above 6 base pairs. As larger and larger populations are screened via high throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a panoply of such traits and conditions. In order to facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). We applied four different TR discovery software tools to uncover in the first phase 55,223,485 TR (after duplicate removal) in GRCh38, of which 373,173 were determined to be PTR in the second phase by comparison with five assembled human genomes. Of these, 263,266 are not included by state-of-the-art PTR catalogs. The new methodology is mainly based on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. While previous catalogs focus on the class of STR of small total size, we remove any size restrictions, aiming at the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Similarly to other previous catalogs of human polymorphic loci, we focus our catalog toward applications in the discovery of disease-associated loci. Validation by cross-referencing with existing catalogs on common clinically-relevant loci shows good concordance. Overall, this proposed census of human PTR in genic regions is a shared resource (web accessible), complementary to existing catalogs, facilitating future genome-wide studies involving PTR.

11.

Wheeler graphs: A framework for BWT-based data structures.

Gagie, Travis; Manzini, Giovanni; Sirén, Jouni.

Theor Comput Sci ; 698: 67-78, 2017 Oct 25.

Artigo em Inglês | MEDLINE | ID: mdl-29276331

RESUMO

The famous Burrows-Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.

12.

Multiple seeds sensitivity using a single seed with threshold.

Egidi, Lavinia; Manzini, Giovanni.

J Bioinform Comput Biol ; 13(4): 1550011, 2015 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-25747382

RESUMO

Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage.

Assuntos

Biologia Computacional/métodos , Homologia de Sequência , Algoritmos , Software

13.

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment.

Ferragina, Paolo; Giancarlo, Raffaele; Greco, Valentina; Manzini, Giovanni; Valiente, Gabriel.

BMC Bioinformatics ; 8: 252, 2007 Jul 13.

Artigo em Inglês | MEDLINE | ID: mdl-17629909

RESUMO

BACKGROUND: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available. RESULTS: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next. CONCLUSION: UCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, accross experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.

Assuntos

Algoritmos , Proteínas/química , Curva ROC , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Animais , Área Sob a Curva , Benchmarking , Bases de Dados de Proteínas , Humanos , Estrutura Secundária de Proteína , Estrutura Terciária de Proteína , Software

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA