Your browser doesn't support javascript.
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 26
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
J Comput Biol ; 2019 Dec 04.
Artigo em Inglês | MEDLINE | ID: mdl-31800307

RESUMO

Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in range of [3.5 × -7 × ], outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.

2.
J Bioinform Comput Biol ; 17(5): 1940011, 2019 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-31856669

RESUMO

Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropies of sequencing data lie in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality value. Availability: The software is freely available at https://github.com/yhhshb/yalff.

3.
BMC Bioinformatics ; 20(Suppl 9): 367, 2019 Nov 22.
Artigo em Inglês | MEDLINE | ID: mdl-31757198

RESUMO

MOTIVATION: Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Because assembly typically produces only genome fragments, also known as contigs, it is crucial to group them into putative species for further taxonomic profiling and down-streaming functional analysis. Taxonomic analysis of microbial communities requires contig clustering, a process referred to as binning, that is still one of the most challenging tasks when analyzing metagenomic data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, sequencing errors, and the limitations due to binning contig of different lengths. RESULTS: In this context we present MetaCon a novel tool for unsupervised metagenomic contig binning based on probabilistic k-mers statistics and coverage. MetaCon uses a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species, also contigs of different length are clustered in two separate phases. The effectiveness of MetaCon is demonstrated in both simulated and real datasets in comparison with state-of-art binning approaches such as CONCOCT, MaxBin and MetaBAT.


Assuntos
Algoritmos , Mapeamento de Sequências Contíguas , Metagenoma , Metagenômica , Probabilidade , Estatística como Assunto , Análise por Conglomerados , Bases de Dados Genéticas , Microbiota/genética
4.
BMC Bioinformatics ; 20(Suppl 9): 302, 2019 Nov 22.
Artigo em Inglês | MEDLINE | ID: mdl-31757199

RESUMO

MOTIVATION: Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated to each read. Those values are often more diversified than necessary. Because of that, many tools such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analysis like SNP calling. RESULTS: We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers and an effective smoothing algorithm to maintain high precision for SNP calling pipelines, while reducing quality scores entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality scores compression by smoothing leading to improved compressibility of FASTQ files. The succinct k-mers dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF smoothing algorithm can improve genotyping accuracy while using less resources. AVAILABILITY: https://github.com/yhhshb/yalff.


Assuntos
Compressão de Dados/normas , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Sequência de Bases , Humanos , Polimorfismo de Nucleotídeo Único/genética , Controle de Qualidade , Curva ROC , Software
5.
Genome Biol ; 20(1): 144, 2019 07 25.
Artigo em Inglês | MEDLINE | ID: mdl-31345254

RESUMO

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.


Assuntos
Análise de Sequência , Benchmarking , Transferência Genética Horizontal , Internet , Filogenia , Sequências Reguladoras de Ácido Nucleico , Alinhamento de Sequência , Análise de Sequência de Proteína , Software
6.
BMC Bioinformatics ; 19(Suppl 15): 441, 2018 Nov 30.
Artigo em Inglês | MEDLINE | ID: mdl-30497364

RESUMO

BACKGROUND: Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to k-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation. RESULTS: In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced-seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1 in the spaced seeds, that basically correspond to k-mer of the length of the run. CONCLUSIONS: We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x to 6.03x, depending on the structure of the spaced seeds.


Assuntos
Algoritmos , Biologia Computacional/métodos , Sequência de Bases , Metagenômica , Fatores de Tempo
7.
Algorithms Mol Biol ; 13: 8, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-29588651

RESUMO

Background: Patterns with wildcards in specified positions, namely spaced seeds, are increasingly used instead of k-mers in many bioinformatics applications that require indexing, querying and rapid similarity search, as they can provide better sensitivity. Many of these applications require to compute the hashing of each position in the input sequences with respect to the given spaced seed, or to multiple spaced seeds. While the hashing of k-mers can be rapidly computed by exploiting the large overlap between consecutive k-mers, spaced seeds hashing is usually computed from scratch for each position in the input sequence, thus resulting in slower processing. Results: The method proposed in this paper, fast spaced-seed hashing (FSH), exploits the similarity of the hash values of spaced seeds computed at adjacent positions in the input sequence. In our experiments we compute the hash for each positions of metagenomics reads from several datasets, with respect to different spaced seeds. We also propose a generalized version of the algorithm for the simultaneous computation of multiple spaced seeds hashing. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.6[Formula: see text] to 5.3[Formula: see text], depending on the structure of the spaced seed. Conclusions: Spaced seed hashing is a routine task for several bioinformatics application. FSH allows to perform this task efficiently and raise the question of whether other hashing can be exploited to further improve the speed up. This has the potential of major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. Availability: The software FSH is freely available for academic use at: https://bitbucket.org/samu661/fsh/overview.

8.
BMC Genomics ; 18(Suppl 10): 917, 2017 Dec 06.
Artigo em Inglês | MEDLINE | ID: mdl-29244002

RESUMO

BACKGROUND: In recent years several different fields, such as ecology, medicine and microbiology, have experienced an unprecedented development due to the possibility of direct sequencing of microbioimic samples. Among problems that researchers in the field have to deal with, taxonomic classification of metagenomic reads is one of the most challenging. State of the art methods classify single reads with almost 100% precision. However, very often, the performance in terms of recall falls at about 50%. As a consequence, state-of-the-art methods are indeed capable of correctly classify only half of the reads in the sample. How to achieve better performances in terms of overall quality of classification remains a largely unsolved problem. RESULTS: In this paper we propose a method for metagenomics CLassification Improvement with Overlapping Reads (CLIOR), that exploits the information carried by the overlapping reads graph of the input read dataset to improve recall, f-measure, and the estimated abundance of species. In this work, we applied CLIOR on top of the classification produced by the classifier Clark-l. Experiments on simulated and synthetic metagenomes show that CLIOR can lead to substantial improvement of the recall rate, sometimes doubling it. On average, on simulated datasets, the increase of recall is paired with an higher precision too, while on synthetic datasets it comes at expenses of a small loss of precision. On experiments on real metagenomes CLIOR is able to assign many more reads while keeping the abundance ratios in line with previous studies. CONCLUSIONS: Our results showed that with CLIOR is possible to boost the recall of a state-of-the-art metagenomic classifier by inferring and/or correcting the assignment of reads with missing or erroneous labeling. CLIOR is not restricted to the reads classification algorithm used in our experiments, but it may be applied to other methods too. Finally, CLIOR does not need large computational resources, and it can be run on a laptop.


Assuntos
Metagenômica , Estatística como Assunto/métodos , Humanos
9.
Bioinformatics ; 32(17): i567-i575, 2016 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-27587676

RESUMO

MOTIVATION: Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Taxonomic analysis of microbial communities, a process referred to as binning, is one of the most challenging tasks when analyzing metagenomic reads data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species and the limitations due to short read lengths and sequencing errors. RESULTS: MetaProb is a novel assembly-assisted tool for unsupervised metagenomic binning. The novelty of MetaProb derives from solving a few important problems: how to divide reads into groups of independent reads, so that k-mer frequencies are not overestimated; how to convert k-mer counts into probabilistic sequence signatures, that will correct for variable distribution of k-mers, and for unbalanced groups of reads, in order to produce better estimates of the underlying genome statistic; how to estimate the number of species in a dataset. We show that MetaProb is more accurate and efficient than other state-of-the-art tools in binning both short reads datasets (F-measure 0.87) and long reads datasets (F-measure 0.97) for various abundance ratios. Also, the estimation of the number of species is more accurate than MetaCluster. On a real human stool dataset MetaProb identifies the most predominant species, in line with previous human gut studies. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/samu661/metaprob CONTACTS: cinzia.pizzi@dei.unipd.it or comin@dei.unipd.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Metagenômica , Modelos Estatísticos , Algoritmos , Análise por Conglomerados , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA , Software
10.
BMC Med Genomics ; 9 Suppl 1: 36, 2016 08 12.
Artigo em Inglês | MEDLINE | ID: mdl-27535823

RESUMO

BACKGROUND: Sequencing technologies are generating enormous amounts of read data, however assembly of genomes and metagenomes remain among the most challenging tasks. In this paper we study the comparison of genomes and metagenomes only based on read data, using word counts statistics called alignment-free thus not requiring reference genomes or assemblies. Quality scores produced by sequencing platforms are fundamental for various analyses, moreover future-generation sequencing platforms, will produce longer reads but with error rate around 15 %. In this context it will be fundamental to exploit quality values information within the framework of alignment-free measures. RESULTS: In this paper we present a family of alignment-free measures, called d (q) -type, that are based on k-mer counts and quality values. These statistics can be used to compare genomes and metagenomes based on their read sets. Results show that the evolutionary relationship of genomes can be reconstructed based on the direct comparison of theirs reads sets. CONCLUSION: The use of quality values on average improves the classification accuracy, and its contribution increases when the reads are more noisy. Also the comparison of metagenomic microbial communities can be performed efficiently. Similar metagenomes are quickly detected, just by processing their read data, without the need of costly alignments.


Assuntos
Metagenômica/métodos , Análise de Sequência/métodos , Evolução Molecular , Software
11.
BMC Bioinformatics ; 17: 130, 2016 Mar 18.
Artigo em Inglês | MEDLINE | ID: mdl-26987840

RESUMO

BACKGROUND: Enhancers are stretches of DNA (100-1000 bp) that play a major role in development gene expression, evolution and disease. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters of cis-regulatory modules (CRMs). Although the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult and it cannot be carried out with traditional alignment-based techniques. RESULTS: The use of fast similarity measures, like alignment-free measures, to detect related regulatory sequences is crucial to understand functional correlation between two enhancers. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k. Here we propose an alignment-free statistic, called [Formula: see text], that is based on multiple resolution patterns derived from the Entropic Profiles (EPs). The Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. As a byproduct we provide a formula to compute the exact variance of variable length word counts, a result that can be of general interest also in other applications. CONCLUSIONS: We evaluate several alignment-free statistics on simulated data and real mouse ChIP-seq sequences. The new statistic, [Formula: see text], is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. We implemented the new alignment-free measures, as well as traditional ones, in a software called EP-sim that is freely available: http://www.dei.unipd.it/~ciompin/main/EP-sim.html .


Assuntos
Elementos Facilitadores Genéticos , Entropia , Análise de Sequência de DNA/métodos , Software , Animais , Drosophila/genética , Genômica/métodos , Camundongos
12.
Algorithms Mol Biol ; 10: 4, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25691913

RESUMO

BACKGROUND: The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads in a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mers counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analysis of NGS data like reads mapping and error detection. Moreover future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15 %). RESULTS: In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge this is the first study that incorporates quality value information and k-mers counts, in the context of alignment-free measures, for the comparison of reads data. Based on this principles, in this paper we present a family of alignment-free measures called D (q) -type. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Also results on de novo assembly and metagenomic reads classification show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in a software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).

13.
BMC Bioinformatics ; 15 Suppl 9: S1, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-25252700

RESUMO

BACKGROUND: With the advent of Next-Generation Sequencing technologies (NGS), a large amount of short read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a reference genome alignment-based methods are not applicable. However it is still possible to study the evolutionary relationship of unassembled genomes based on NGS data. RESULTS: We present a parameter-free alignment-free method, called Under2, based on variable-length patterns, for the direct comparison of sets of NGS reads. We define a similarity measure using variable-length patterns, as well as reverses and reverse-complements, along with their statistical and syntactical properties. We evaluate several alignment-free statistics on the comparison of NGS reads coming from simulated and real genomes. In almost all simulations our method Under2 outperforms all other statistics. The performance gain becomes more evident when real genomes are used. CONCLUSION: The new alignment-free statistic is highly successful in discriminating related genomes based on NGS reads data. In almost all experiments, it outperforms traditional alignment-free statistics that are based on fixed length patterns.


Assuntos
Genoma , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Animais , Archaea/genética , Bactérias/genética , Drosophila/genética , Modelos Estatísticos , Reconhecimento Automatizado de Padrão , Filogenia
14.
J Comput Biol ; 21(4): 330-44, 2014 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-24597675

RESUMO

The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.


Assuntos
Genoma Humano , Software , Algoritmos , Simulação por Computador , Humanos , Modelos Genéticos , Análise de Sequência de DNA/métodos
15.
Artigo em Inglês | MEDLINE | ID: mdl-26356018

RESUMO

Information theory has been used for quite some time in the area of computational biology. In this paper we present a pattern discovery method, named Fast Entropic Profiler, that is based on a local entropy function that captures the importance of a region with respect to the whole genome. The local entropy function has been introduced by Vinga and Almeida in , here we discuss and improve the original formulation. We provide a linear time and linear space algorithm called Fast Entropic Profiler ( FastEP), as opposed to the original quadratic implementation. Moreover we propose an alternative normalization that can be also efficiently implemented. We show that FastEP is suitable for large genomes and for the discovery of patterns with unbounded length. FastEP is available at http://www.dei.unipd.it/~ciompin/main/FastEP.html.


Assuntos
Biologia Computacional/métodos , Genoma/genética , Teoria da Informação , Reconhecimento Automatizado de Padrão/métodos , Entropia , Modelos Genéticos
16.
Artigo em Inglês | MEDLINE | ID: mdl-26356333

RESUMO

UNLABELLED: The cell-type diversity is to a large degree driven by transcription regulation, i.e., enhancers. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters of cis-regulatory modules (CRMs). Even if the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult. A similarity measure to detect related regulatory sequences is crucial to understand functional correlation between two enhancers. This will allow large-scale analyses, clustering and genome-wide classifications. In this paper we present Under2, a parameter-free alignment-free statistic based on variable-length words. As opposed to traditional alignment-free methods, which are based on fixed-length patterns or, in other words, tied to a fixed resolution, our statistic is built upon variable-length words, and thus multiple resolutions are allowed. This will capture the great variability of lengths of CRMs. We evaluate several alignment-free statistics on simulated data and real ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. Finally, experiments on mouse enhancers show that Under2 can separate enhancers active in different tissues. AVAILABILITY: http://www.dei.unipd.it/~ciompin/main/UnderIICRMS.html.


Assuntos
Elementos Facilitadores Genéticos/genética , Genômica/métodos , Mamíferos/genética , Análise de Sequência de DNA/métodos , Animais , Humanos , Camundongos
17.
Algorithms Mol Biol ; 8(1): 7, 2013 Mar 05.
Artigo em Inglês | MEDLINE | ID: mdl-23497437

RESUMO

BACKGROUND: The large majority of optimization problems related to the inference of distance-based trees used in phylogenetic analysis and classification is known to be intractable. One noted exception is found within the realm of ultrametric distances. The introduction of ultrametric trees in phylogeny was inspired by a model of evolution driven by the postulate of a molecular clock, now dismissed, whereby phylogeny could be represented by a weighted tree in which the sum of the weights of the edges separating any given leaf from the root is the same for all leaves. Both, molecular clocks and rooted ultrametric trees, fell out of fashion as credible representations of evolutionary change. At the same time, ultrametric dendrograms have shown good potential for purposes of classification in so far as they have proven to provide good approximations for additive trees. Most of these approximations are still intractable, but the problem of finding the nearest ultrametric distance matrix to a given distance matrix with respect to the L∞ distance has been long known to be solvable in polynomial time, the solution being incarnated in any minimum spanning tree for the weighted graph subtending to the matrix. RESULTS: This paper expands this subdominant ultrametric perspective by studying ultrametric networks, consisting of the collection of all edges involved in some minimum spanning tree. It is shown that, for a graph with n vertices, the construction of such a network can be carried out by a simple algorithm in optimal time O(n2) which is faster by a factor of n than the direct adaptation of the classical O(n3) paradigm by Warshall for computing the transitive closure of a graph. This algorithm, called UltraNet, will be shown to be easily adapted to compute relaxed networks and to support the introduction of artificial points to reduce the maximum distance between vertices in a pair. Finally, a few experiments will be discussed to demonstrate the applicability of subdominant ultrametric networks. AVAILABILITY: http://www.dei.unipd.it/~ciompin/main/Ultranet/Ultranet.html.

18.
Algorithms Mol Biol ; 7(1): 34, 2012 Dec 06.
Artigo em Inglês | MEDLINE | ID: mdl-23216990

RESUMO

BACKGROUND: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. METHODS: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of "independent" subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. RESULTS: The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. AVAILABILITY: http://www.dei.unipd.it/∼ciompin/main/underlying.html.

19.
J Comput Biol ; 18(12): 1819-29, 2011 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-21548811

RESUMO

The automatic classification of protein sequences into families is of great help for the functional prediction and annotation of new proteins. In this article, we present a method called Irredundant Class that address the remote homology detection problem. The best performing methods that solve this problem are string kernels, that compute a similarity function between pairs of proteins based on their subsequence composition. We provide evidence that almost all string kernels are based on patterns that are not independent, and therefore the associated similarity scores are obtained using a set of redundant features, overestimating the similarity between the proteins. To specifically address this issue, we introduce the class of irredundant common patterns. Loosely speaking, the set of irredundant common patterns is the smallest class of independent patterns that can describe all common patterns in a pair of sequences. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that the Irredundant Class outperforms most of the string kernels previously proposed, and it achieves results as good as the current state-of-the-art method Local Alignment, but using the same pairwise information only once.


Assuntos
Biologia Computacional/métodos , Proteínas/química , Homologia de Sequência de Aminoácidos , Sequência de Aminoácidos , Bases de Dados de Proteínas , Reconhecimento Automatizado de Padrão , Proteínas/classificação , Curva ROC , Alinhamento de Sequência
20.
Artigo em Inglês | MEDLINE | ID: mdl-21030741

RESUMO

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on one hand and the abundance of candidates on the other hand. In particular, motifs that include wild cards or "don't cares" escalate exponentially with their number, and this gets only worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.


Assuntos
Motivos de Aminoácidos , Mineração de Dados/métodos , Análise de Sequência de Proteína/métodos , Software , Proteínas/química
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA