Búsqueda | Portal de Búsqueda de la BVS

1.

Finding 3D motifs in ribosomal RNA structures.

Apostolico, Alberto; Ciriello, Giovanni; Guerra, Concettina; Heitsch, Christine E; Hsiao, Chiaolong; Williams, Loren Dean.

Nucleic Acids Res ; 37(4): e29, 2009 Mar.

Artículo en Inglés | MEDLINE | ID: mdl-19158187

RESUMEN

The identification of small structural motifs and their organization into larger subassemblies is of fundamental interest in the analysis, prediction and design of 3D structures of large RNAs. This problem has been studied only sparsely, as most of the existing work is limited to the characterization and discovery of motifs in RNA secondary structures. We present a novel geometric method for the characterization and identification of structural motifs in 3D rRNA molecules. This method enables the efficient recognition of known 3D motifs, such as tetraloops, E-loops, kink-turns and others. Furthermore, it provides a new way of characterizing complex 3D motifs, notably junctions, that have been defined and identified in the secondary structure but have not been analyzed and classified in three dimensions. We demonstrate the relevance and utility of our approach by applying it to the Haloarcula marismortui large ribosomal unit. Pending the implementation of a dedicated web server, the code accompanying this article, written in JAVA, is available upon request from the contact author.

Asunto(s)

ARN Ribosómico/química , Biología Computacional/métodos , Haloarcula marismortui/genética , Modelos Moleculares , Conformación de Ácido Nucleico , ARN Ribosómico/clasificación , Análisis de Secuencia de ARN

2.

A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.

Thankachan, Sharma V; Apostolico, Alberto; Aluru, Srinivas.

J Comput Biol ; 23(6): 472-82, 2016 06.

Artículo en Inglés | MEDLINE | ID: mdl-27058840

RESUMEN

Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n(2)) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common string (ACSk) problem that takes O(n) space and O(n log(k) n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.

Asunto(s)

Alineación de Secuencia/métodos , Algoritmos , Biología Computacional/métodos , Filogenia

3.

ALFRED: A Practical Method for Alignment-Free Distance Computation.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Apostolico, Alberto; Aluru, Srinivas.

J Comput Biol ; 23(6): 452-60, 2016 06.

Artículo en Inglés | MEDLINE | ID: mdl-27138275

RESUMEN

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online.

Asunto(s)

Biología Computacional/métodos , Primates/genética , Alineación de Secuencia/métodos , Algoritmos , Animales , Genoma Mitocondrial , Metagenómica , Filogenia

4.

Incremental paradigms of motif discovery.

Apostolico, Alberto; Parida, Laxmi.

J Comput Biol ; 11(1): 15-25, 2004.

Artículo en Inglés | MEDLINE | ID: mdl-15072686

RESUMEN

We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to develop an O(n(3)) time incremental discovery algorithm.

Asunto(s)

Algoritmos , Reconocimiento de Normas Patrones Automatizadas , Biología Computacional/métodos , Interpretación Estadística de Datos

5.

Monotony of surprise and large-scale quest for unusual words.

Apostolico, Alberto; Bock, Mary Ellen; Lonardi, Stefano.

J Comput Biol ; 10(3-4): 283-311, 2003.

Artículo en Inglés | MEDLINE | ID: mdl-12935329

RESUMEN

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subword of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.

Asunto(s)

Biología Computacional/métodos , Interpretación Estadística de Datos , Análisis de Secuencia de ADN/métodos , Algoritmos , Distribución Binomial , Cadenas de Markov

6.

Ultrametric networks: a new tool for phylogenetic analysis.

Apostolico, Alberto; Comin, Matteo; Dress, Andres; Parida, Laxmi.

Algorithms Mol Biol ; 8(1): 7, 2013 Mar 05.

Artículo en Inglés | MEDLINE | ID: mdl-23497437

RESUMEN

BACKGROUND: The large majority of optimization problems related to the inference of distance-based trees used in phylogenetic analysis and classification is known to be intractable. One noted exception is found within the realm of ultrametric distances. The introduction of ultrametric trees in phylogeny was inspired by a model of evolution driven by the postulate of a molecular clock, now dismissed, whereby phylogeny could be represented by a weighted tree in which the sum of the weights of the edges separating any given leaf from the root is the same for all leaves. Both, molecular clocks and rooted ultrametric trees, fell out of fashion as credible representations of evolutionary change. At the same time, ultrametric dendrograms have shown good potential for purposes of classification in so far as they have proven to provide good approximations for additive trees. Most of these approximations are still intractable, but the problem of finding the nearest ultrametric distance matrix to a given distance matrix with respect to the L∞ distance has been long known to be solvable in polynomial time, the solution being incarnated in any minimum spanning tree for the weighted graph subtending to the matrix. RESULTS: This paper expands this subdominant ultrametric perspective by studying ultrametric networks, consisting of the collection of all edges involved in some minimum spanning tree. It is shown that, for a graph with n vertices, the construction of such a network can be carried out by a simple algorithm in optimal time O(n2) which is faster by a factor of n than the direct adaptation of the classical O(n3) paradigm by Warshall for computing the transitive closure of a graph. This algorithm, called UltraNet, will be shown to be easily adapted to compute relaxed networks and to support the introduction of artificial points to reduce the maximum distance between vertices in a pair. Finally, a few experiments will be discussed to demonstrate the applicability of subdominant ultrametric networks. AVAILABILITY: http://www.dei.unipd.it/~ciompin/main/Ultranet/Ultranet.html.

7.

Phylogeny construction with rigid gapped motifs.

Cunial, Fabio; Apostolico, Alberto.

J Comput Biol ; 19(7): 911-27, 2012 Jul.

Artículo en Inglés | MEDLINE | ID: mdl-22731623

RESUMEN

Patterns with gaps have traditionally been used as signatures of protein families or as features in binary classification. Current alignment-free algorithms construct phylogenies by comparing the repertoire and frequency of ungapped blocks in genomes and proteomes. In this article, we measure the quality of phylogenies reconstructed by comparing suitably defined sets of gapped motifs that occur in mitochondrial proteomes. We study the dependence between the quality of reconstructed phylogenies and the density, number of solid characters, and statistical significance of gapped motifs. We consider maximal motifs, as well as some of their compact generators. The average performance of suitably defined sets of gapped motifs is comparable to that of popular string-based alignment-free methods. Extremely long and sparse motifs produce phylogenies of the same or better quality than those produced by short and dense motifs. The best phylogenies are produced by motifs with 3 or 4 solid characters, while increasing the number of solid characters degrades phylogenies. Discarding motifs with low statistical significance degrades performance as well. In maximal motifs, moving from the smallest basis to bases with higher redundancy leads to better phylogenies.

Asunto(s)

Algoritmos , Proteínas Mitocondriales , Filogenia , Secuencias de Aminoácidos/genética , Bases de Datos de Proteínas , Proteínas Mitocondriales/clasificación , Proteínas Mitocondriales/genética , Alineación de Secuencia

8.

Efficient algorithms for the discovery of gapped factors.

Apostolico, Alberto; Pizzi, Cinzia; Ukkonen, Esko.

Algorithms Mol Biol ; 6: 5, 2011 Mar 23.

Artículo en Inglés | MEDLINE | ID: mdl-21429203

RESUMEN

BACKGROUND: The discovery of surprisingly frequent patterns is of paramount interest in bioinformatics and computational biology. Among the patterns considered, those consisting of pairs of solid words that co-occur within a prescribed maximum distance -or gapped factors- emerge in a variety of contexts of DNA and protein sequence analysis. A few algorithms and tools have been developed in connection with specific formulations of the problem, however, none can handle comprehensively each of the multiple ways in which the distance between the two terms in a pair may be defined. RESULTS: This paper presents efficient algorithms and tools for the extraction of all pairs of words up to an arbitrarily large length that co-occur surprisingly often in close proximity within a sequence. Whereas the number of such pairs in a sequence of n characters can be Θ(n4), it is shown that an exhaustive discovery process can be carried out in O(n2) or O(n3), depending on the way distance is measured. This is made possible by a prudent combination of properties of pattern maximality and monotonicity of scores, which lead to reduce the number of word pairs to be weighed explicitly, while still producing also the scores attained by any of the pairs not explicitly considered. We applied our approach to the discovery of spaced dyads in DNA sequences. CONCLUSIONS: Experiments on biological datasets prove that the method is effective and much faster than exhaustive enumeration of candidate patterns. Software is available freely by academic users via the web interface at http://bcb.dei.unipd.it:8080/dyweb.

9.

The subsequence composition of polypeptides.

Apostolico, Alberto; Cunial, Fabio.

J Comput Biol ; 17(8): 1011-49, 2010 Aug.

Artículo en Inglés | MEDLINE | ID: mdl-20666621

RESUMEN

The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessement of the structure and randomness of polypeptides in terms on newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.

Asunto(s)

Aminoácidos/química , Péptidos/química , Secuencia de Aminoácidos , Biología Computacional , Datos de Secuencia Molecular , Conformación Proteica

10.

Efficient tools for comparative substring analysis.

Apostolico, Alberto; Denas, Olgert; Dress, Andreas.

J Biotechnol ; 149(3): 120-6, 2010 Sep 01.

Artículo en Inglés | MEDLINE | ID: mdl-20682467

RESUMEN

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K=infinity). Remarkably, although there may be Theta(L(2)) distinct strings that occur in a given sequence of length L (and Theta(KL) of length k< or =K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper.

Asunto(s)

Genómica , Algoritmos , Modelos Teóricos

11.

VARUN: discovering extensible motifs under saturation constraints.

Apostolico, Alberto; Comin, Matteo; Parida, Laxmi.

IEEE/ACM Trans Comput Biol Bioinform ; 7(4): 752-26, 2010.

Artículo en Inglés | MEDLINE | ID: mdl-21030741

RESUMEN

The discovery of motifs in biosequences is frequently torn between the rigidity of the model on one hand and the abundance of candidates on the other hand. In particular, motifs that include wild cards or "don't cares" escalate exponentially with their number, and this gets only worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.

Asunto(s)

Secuencias de Aminoácidos , Minería de Datos/métodos , Análisis de Secuencia de Proteína/métodos , Programas Informáticos , Proteínas/química

12.

Alignment of biological sequences with quality scores.

Na, Joong Chae; Roh, Kangho; Apostolico, Alberto; Park, Kunsoo.

Int J Bioinform Res Appl ; 5(1): 97-113, 2009.

Artículo en Inglés | MEDLINE | ID: mdl-19136367

RESUMEN

In this paper we consider the problem of sequence alignment with quality scores. DNA sequences produced by a base-calling program (as part of sequencing) have quality scores which represent the confidence level for individual bases. However, previous sequence alignment algorithms do not consider such quality scores. To solve sequence alignment with quality scores, we first consider a more general problem where the input is weighted sequences which are sequences with probabilities that characters occur in each position. We propose a meaningful measure of an alignment of two weighted sequences and show that an optimal alignment in this measure can be found by dynamic programming. Sequence alignment with quality scores can be solved as a special case of the weighted sequence alignment problem.

Asunto(s)

Biología Computacional/métodos , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Algoritmos

13.

Fast algorithms for computing sequence distances by exhaustive substring composition.

Apostolico, Alberto; Denas, Olgert.

Algorithms Mol Biol ; 3: 13, 2008 Oct 28.

Artículo en Inglés | MEDLINE | ID: mdl-18957094

RESUMEN

The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.

14.

Mining, compressing and classifying with extensible motifs.

Apostolico, Alberto; Comin, Matteo; Parida, Laxmi.

Algorithms Mol Biol ; 1(1): 4, 2006 Mar 23.

Artículo en Inglés | MEDLINE | ID: mdl-16722593

RESUMEN

BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing, however, special classes of "rigid" motifs have been identified of which the discovery is affordable in low polynomial time. RESULTS: In the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction. CONCLUSION: Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.

15.

Conservative extraction of over-represented extensible motifs.

Apostolico, Alberto; Comin, Matteo; Parida, Laxmi.

Bioinformatics ; 21 Suppl 1: i9-18, 2005 Jun.

Artículo en Inglés | MEDLINE | ID: mdl-15961503

RESUMEN

MOTIVATION: The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the variety of motifs described by strings that include 'don't care' (dot) patterns escalates exponentially with the length of the motif, and this gets only worse if a dot is allowed to stretch up to some prescribed maximum length. This circumstance tends to generate daunting computational burdens, and often gives rise to tables that are impossible to visualize and digest. This is unfortunate, as it seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of massive genomic and protein data. Although a part of the problem is endemic, another part of it seems rooted in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model. RESULTS: We introduce and study a notion of extensible motif in a sequence which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of dots compatible with a given list of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs. The merits of the method are documented by the results obtained in implementation, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as the most important in terms of functional/structural relevance emerges among the top 30 extensible motifs returned by our algorithm, often right at the top. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constrains. AVAILABILITY: This software will be available for use with the suite of tools at www.research.ibm.com/bioinformatics.

Asunto(s)

Biología Computacional/métodos , Algoritmos , Secuencias de Aminoácidos , Proteínas Bacterianas/química , Internet , Modelos Estadísticos , Probabilidad , Lenguajes de Programación , Programas Informáticos

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA