Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 31
Filtrar
Más filtros

Banco de datos
Tipo del documento
Intervalo de año de publicación
1.
BMC Genomics ; 22(1): 644, 2021 Sep 06.
Artículo en Inglés | MEDLINE | ID: mdl-34488632

RESUMEN

BACKGROUND: Inversion Symmetry is a generalization of the second Chargaff rule, stating that the count of a string of k nucleotides on a single chromosomal strand equals the count of its inverse (reverse-complement) k-mer. It holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2Mbp to 200 Mbp. Building on this formalism we introduce the concept of k-mer distances between chromosomes. We formulate two k-mer distance measures, D1 and D2, which depend on k. D1 takes into account all k-mers (for a single k) appearing on single strands of the two compared chromosomes, whereas D2 takes into account both strands of each chromosome. Both measures reflect dissimilarities in global chromosomal structures. RESULTS: After defining the various distance measures and summarizing their properties, we also define proximities that rely on the existence of synteny blocks between chromosomes of different bacterial strains. Comparing pairs of strains of bacteria, we find negative correlations between synteny proximities and k-mer distances, thus establishing the meaning of the latter as measures of evolutionary distances among bacterial strains. The synteny measures we use are appropriate for closely related bacterial strains, where considerable sections of chromosomes demonstrate high direct or reversed equality. These measures are not appropriate for comparing different bacteria or eukaryotes. K-mer structural distances can be defined for all species. Because of the arbitrariness of strand choices, we employ only the D2 measure when comparing chromosomes of different species. The results for comparisons of various eukaryotes display interesting behavior which is partially consistent with conventional understanding of evolutionary genomics. In particular, we define ratios of minimal k-mer distances (KDR) between unmasked and masked chromosomes of two species, which correlate with both short and long evolutionary scales. CONCLUSIONS: k-mer distances reflect dissimilarities among global chromosomal structures. They carry information which aggregates all mutations. As such they can complement traditional evolution studies , which mainly concentrate on coding regions.


Asunto(s)
Cromosomas , Genómica , Inversión Cromosómica , Cromosomas/genética , Eucariontes , Evolución Molecular , Humanos , Sintenía
2.
Bioinformatics ; 34(17): i638-i646, 2018 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-30423078

RESUMEN

Motivation: The complexes formed by binding of proteins to RNAs play key roles in many biological processes, such as splicing, gene expression regulation, translation and viral replication. Understanding protein-RNA binding may thus provide important insights to the functionality and dynamics of many cellular processes. This has sparked substantial interest in exploring protein-RNA binding experimentally, and predicting it computationally. The key computational challenge is to efficiently and accurately infer protein-RNA binding models that will enable prediction of novel protein-RNA interactions to additional transcripts of interest. Results: We developed DLPRB (Deep Learning for Protein-RNA Binding), a new deep neural network (DNN) approach for learning intrinsic protein-RNA binding preferences and predicting novel interactions. We present two different network architectures: a convolutional neural network (CNN), and a recurrent neural network (RNN). The novelty of our network hinges upon two key aspects: (i) the joint analysis of both RNA sequence and structure, which is represented as a probability vector of different RNA structural contexts; (ii) novel features in the architecture of the networks, such as the application of RNNs to RNA-binding prediction, and the combination of hundreds of variable-length filters in the CNN. Our results in inferring accurate RNA-binding models from high-throughput in vitro data exhibit substantial improvements, compared to all previous approaches for protein-RNA binding prediction (both DNN and non-DNN based). A more modest, yet statistically significant, improvement is achieved for in vivo binding prediction. When incorporating experimentally-measured RNA structure, compared to predicted one, the improvement on in vivo data increases. By visualizing the binding specificities, we can gain biological insights underlying the mechanism of protein RNA-binding. Availability and implementation: The source code is publicly available at https://github.com/ilanbb/dlprb. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Aprendizaje Profundo , Redes Neurales de la Computación , Proteínas de Unión al ARN/metabolismo , ARN/metabolismo , Programas Informáticos
3.
Bioinformatics ; 32(17): i559-i566, 2016 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-27587675

RESUMEN

MOTIVATION: Complex interactions among alleles often drive differences in inherited properties including disease predisposition. Isolating the effects of these interactions requires phasing information that is difficult to measure or infer. Furthermore, prevalent sequencing technologies used in the essential first step of determining a haplotype limit the range of that step to the span of reads, namely hundreds of bases. With the advent of pseudo-long read technologies, observable partial haplotypes can span several orders of magnitude more. Yet, measuring whole-genome-single-individual haplotypes remains a challenge. A different view of whole genome measurement addresses the 3D structure of the genome-with great development of Hi-C techniques in recent years. A shortcoming of current Hi-C, however, is the difficulty in inferring information that is specific to each of a pair of homologous chromosomes. RESULTS: In this work, we develop a robust algorithmic framework that takes two measurement derived datasets: raw Hi-C and partial short-range haplotypes, and constructs the full-genome haplotype as well as phased diploid Hi-C maps. By analyzing both data sets together we thus bridge important gaps in both technologies-from short to long haplotypes and from un-phased to phased Hi-C. We demonstrate that our method can recover ground truth haplotypes with high accuracy, using measured biological data as well as simulated data. We analyze the impact of noise, Hi-C sequencing depth and measured haplotype lengths on performance. Finally, we use the inferred 3D structure of a human genome to point at transcription factor targets nuclear co-localization. AVAILABILITY AND IMPLEMENTATION: The implementation available at https://github.com/YakhiniGroup/SpectraPh CONTACT: zohar.yakhini@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Cromosomas , Genoma Humano , Haplotipos , Conformación Molecular , Algoritmos , Variación Genética , Estudio de Asociación del Genoma Completo , Humanos
4.
J Theor Biol ; 420: 318-323, 2017 05 07.
Artículo en Inglés | MEDLINE | ID: mdl-28263816

RESUMEN

Ancestral maximum likelihood (AML) is a phylogenetic tree reconstruction criteria that "lies between" maximum parsimony (MP) and maximum likelihood (ML). ML has long been known to be statistically consistent. On the other hand, Felsenstein (1978) showed that MP is statistically inconsistent, and even positively misleading: There are cases where the parsimony criteria, applied to data generated according to one tree topology, will be optimized on a different tree topology. The question of weather AML is statistically consistent or not has been open for a long time. Mossel et al. (2009) have shown that AML can "shrink" short tree edges, resulting in a star tree with no internal resolution, which yields a better AML score than the original (resolved) model. This result implies that AML is statistically inconsistent, but not that it is positively misleading, because the star tree is compatible with any other topology. We show that AML is confusingly misleading: For some simple, four taxa (resolved) tree, the ancestral likelihood optimization criteria is maximized on an incorrect (resolved) tree topology, as well as on a star tree (both with specific edge lengths), while the tree with the original, correct topology, has strictly lower ancestral likelihood. Interestingly, the two short edges in the incorrect, resolved tree topology are of length zero, and are not adjacent, so this resolved tree is in fact a simple path. While for MP, the underlying phenomenon can be described as long edge attraction, it turns out that here we have long edge repulsion.


Asunto(s)
Evolución Biológica , Biometría/métodos , Modelos Genéticos , Filogenia , Simulación por Computador , Funciones de Verosimilitud
5.
J Proteome Res ; 15(8): 2871-80, 2016 08 05.
Artículo en Inglés | MEDLINE | ID: mdl-27354160

RESUMEN

Modeling and simulation of biological networks is an effective and widely used research methodology. The Biological Network Simulator (BioNSi) is a tool for modeling biological networks and simulating their discrete-time dynamics, implemented as a Cytoscape App. BioNSi includes a visual representation of the network that enables researchers to construct, set the parameters, and observe network behavior under various conditions. To construct a network instance in BioNSi, only partial, qualitative biological data suffices. The tool is aimed for use by experimental biologists and requires no prior computational or mathematical expertise. BioNSi is freely available at http://bionsi.wix.com/bionsi , where a complete user guide and a step-by-step manual can also be found.


Asunto(s)
Modelos Biológicos , Programas Informáticos , Simulación por Computador , Internet
6.
BMC Genomics ; 17: 696, 2016 08 31.
Artículo en Inglés | MEDLINE | ID: mdl-27580854

RESUMEN

BACKGROUND: The generalization of the second Chargaff rule states that counts of any string of nucleotides of length k on a single chromosomal strand equal the counts of its inverse (reverse-complement) k-mer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pair-wise candidate symmetries (e.g. reverse or complement) have been ruled out. RESULTS: Studying IS in the human genome, we find that IS holds up to k = 10. It holds for complete chromosomes, also after applying the low complexity mask. We introduce a numerical IS criterion, and define the k-limit, KL, as the highest k for which this criterion is valid. We demonstrate that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic dependence of KL ~ 0.7 ln(L), where L is the length of the chromosome. We introduce a statistical IS-Poisson model that allows us to apply confidence measures to our numerical findings. We find good agreement for large k, where the variance of the Poisson distribution determines the outcome of the analysis. This model predicts the observed logarithmic increase of KL with length. The model allows us to conclude that for low k, e.g. k = 1 where IS becomes the 2(nd) Chargaff rule, IS violation, although extremely small, is significant. Studying this violation we come up with an unexpected observation for human chromosomes, finding a meaningful correlation with the excess of genes on particular strands. CONCLUSIONS: Our IS-Poisson model agrees well with genomic data, and accounts for the universal behavior of k-limits. For low k we point out minute, yet significant, deviations from the model, including excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. Interestingly, this correlates with a significant (but small) excess of genes on the same positive strands.


Asunto(s)
Cromosomas Humanos/genética , ADN/genética , Humanos , Modelos Genéticos , Distribución de Poisson
7.
Bioinformatics ; 30(24): 3515-23, 2014 Dec 15.
Artículo en Inglés | MEDLINE | ID: mdl-25183486

RESUMEN

MOTIVATION: New sequencing technologies generate larger amount of short reads data at decreasing cost. De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works use the paradigm of string graph, using a variety of methods for storing and processing suffixes and prefixes, like suffix arrays, the Burrows-Wheeler transform or the FM index. Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts. RESULTS: We introduce a novel hash-based method for constructing the string graph. We use incremental hashing, and specifically a modification of the Karp-Rabin fingerprint, and Bloom filters. Using these probabilistic methods might create false-positive and false-negative edges during the algorithm's execution, but these are all detected and corrected. The advantages of the proposed approach over existing methods are its simplicity and the incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation is favorably comparable with the first string graph construction of Simpson and Durbin (2010) (but not with subsequent improvements). Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state-of-the-art string graph-based assemblers.


Asunto(s)
Algoritmos , Análisis de Secuencia de ADN/métodos , Genómica/métodos
8.
J Theor Biol ; 374: 54-9, 2015 Jun 07.
Artículo en Inglés | MEDLINE | ID: mdl-25843219

RESUMEN

The evolution of aligned DNA sequence sites is generally modeled by a Markov process operating along the edges of a phylogenetic tree. It is well known that the probability distribution on the site patterns at the tips of the tree determines the tree topology, and its branch lengths. However, the number of patterns is typically much larger than the number of edges, suggesting considerable redundancy in the branch length estimation. In this paper we ask whether the probabilities of just the 'edge-specific' patterns (the ones that correspond to a change of state on a single edge) suffice to recover the branch lengths of the tree, under a symmetric 2-state Markov process. We first show that this holds provided the branch lengths are sufficiently short, by applying the inverse function theorem. We then consider whether this restriction to short branch lengths is necessary. We show that for trees with up to four leaves it can be lifted. This leaves open the interesting question of whether this holds in general. Our results also extend to certain Markov processes on more than 2-states, such as the Jukes-Cantor model.


Asunto(s)
Evolución Molecular , Modelos Biológicos , Filogenia , Algoritmos , Cadenas de Markov , Probabilidad , Programas Informáticos
9.
PLoS Comput Biol ; 10(11): e1003897, 2014 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-25411839

RESUMEN

We join the increasing call to take computational education of life science students a step further, beyond teaching mere programming and employing existing software tools. We describe a new course, focusing on enriching the curriculum of life science students with abstract, algorithmic, and logical thinking, and exposing them to the computational "culture." The design, structure, and content of our course are influenced by recent efforts in this area, collaborations with life scientists, and our own instructional experience. Specifically, we suggest that an effective course of this nature should: (1) devote time to explicitly reflect upon computational thinking processes, resisting the temptation to drift to purely practical instruction, (2) focus on discrete notions, rather than on continuous ones, and (3) have basic programming as a prerequisite, so students need not be preoccupied with elementary programming issues. We strongly recommend that the mere use of existing bioinformatics tools and packages should not replace hands-on programming. Yet, we suggest that programming will mostly serve as a means to practice computational thinking processes. This paper deals with the challenges and considerations of such computational education for life science students. It also describes a concrete implementation of the course and encourages its use by others.


Asunto(s)
Disciplinas de las Ciencias Biológicas/educación , Biología Computacional/educación , Ciencia de la Información/educación , Algoritmos , Humanos , Programas Informáticos
10.
Sci Rep ; 9(1): 12734, 2019 09 04.
Artículo en Inglés | MEDLINE | ID: mdl-31484964

RESUMEN

Genome conformation capture techniques permit a systematic investigation into the functional spatial organization of genomes, including functional aspects like assessing the co-localization of sets of genomic elements. For example, the co-localization of genes targeted by a transcription factor (TF) within a transcription factory. We quantify spatial co-localization using a rigorous statistical model that measures the enrichment of a subset of elements in neighbourhoods inferred from Hi-C data. We also control for co-localization that can be attributed to genomic order. We systematically apply our open-sourced framework, spatial-mHG, to search for spatial co-localization phenomena in multiple unicellular Hi-C datasets with corresponding genomic annotations. Our biological findings shed new light on the functional spatial organization of genomes, including: In C. crescentus, DNA replication genes reside in two genomic clusters that are spatially co-localized. Furthermore, these clusters contain similar gene copies and lay in genomic vicinity to the ori and ter sequences. In S. cerevisae, Ty5 retrotransposon family element spatially co-localize at a spatially adjacent subset of telomeres. In N. crassa, both Proteasome lid subcomplex genes and protein refolding genes jointly spatially co-localize at a shared location. An implementation of our algorithms is available online.


Asunto(s)
Bacillus subtilis/genética , Caulobacter crescentus/genética , Genoma Bacteriano , Genoma Fúngico , Neurospora crassa/genética , Saccharomyces cerevisiae/genética , Schizosaccharomyces/genética , Genómica , Modelos Genéticos
11.
Bioinformatics ; 23(2): e57-63, 2007 Jan 15.
Artículo en Inglés | MEDLINE | ID: mdl-17237106

RESUMEN

MOTIVATION: Cloning of long DNA sequences (40-60 bases) into phage display libraries using polymerase chain reaction (PCR) is a low efficiency process, in which PCR is used to incorporate a DNA insert, coding for a certain peptide, into the amplified sequence. The PCR efficiency in this process is strongly affected by the distribution of G-C bases in the amplified sequence. As any DNA insert coding for the target peptide may be attempted, there is a flexibility in choosing part of the amplified sequence. Since the number of inserts coding for the same peptide is exponential in the peptide length, a computational problem naturally arises--that of efficiently finding an insert, whose parameters are optimal for PCR cloning. RESULTS: The GC distribution requirements are formulated as a search problem. We developed an efficient, linear time 'one pass' algorithm for this problem. Interestingly, our algorithm strongly relies on an interesting symmetry, which we observed in the standard genetic code. Most non-standard genetic codes examined possess this symmetry as well, yet some do not. We generalize the search problem and consider the case of a non-standard, or arbitrary, genetic code where this symmetry does not necessary hold. We solve the generalized problem in polynomial, but nonlinear, time. AVAILABILITY: An implementation of the proposed algorithm is available upon request from the authors.


Asunto(s)
Mapeo Cromosómico/métodos , Clonación Molecular/métodos , Sondas de ADN/genética , Secuencia Rica en GC/genética , Código Genético/genética , Reacción en Cadena de la Polimerasa/métodos , Análisis de Secuencia de ADN/métodos , Algoritmos , Secuencia de Bases , Ingeniería Genética/métodos , Datos de Secuencia Molecular
12.
Mol Syst Biol ; 3: 108, 2007.
Artículo en Inglés | MEDLINE | ID: mdl-17486136

RESUMEN

The COP9 signalosome (CSN), an eight-subunit protein complex, is conserved in all higher eukaryotes. CSN intersects the ubiquitin-proteasome pathway, modulating signaling pathways controlling various aspects of development. We are using Drosophila as a model system to elucidate the function of this important complex. Transcriptome data were generated for four csn mutants, sampled at three developmental time points. Our results are highly reproducible, being confirmed using two different experimental setups that entail different microarrays and different controls. Our results indicate that the CSN acts as a transcriptional repressor during development of Drosophila, resulting in achronic gene expression in the csn mutants. 'Time shift' analysis with the publicly available Drosophila transcriptome data indicates that genes repressed by the CSN are normally induced primarily during late embryogenesis or during metamorphosis. These temporal shifts are likely due to the roles of the CSN in regulating transcription factors. A null mutation in CSN subunit 4 and hypomorphic mutations in csn5 lead to more severe defects than seen in the csn5-null mutants strain, suggesting that CSN5 carries only some of the CSN function.


Asunto(s)
Proteínas de Drosophila/fisiología , Drosophila melanogaster/genética , Regulación de la Expresión Génica , Péptidos y Proteínas de Señalización Intracelular/fisiología , Complejos Multiproteicos/fisiología , Péptido Hidrolasas/fisiología , Proteínas Adaptadoras Transductoras de Señales , Animales , Complejo del Señalosoma COP9 , Proteínas de Drosophila/biosíntesis , Proteínas de Drosophila/genética , Drosophila melanogaster/crecimiento & desarrollo , Drosophila melanogaster/metabolismo , Ecdisona/farmacología , Femenino , Perfilación de la Expresión Génica , Regulación de la Expresión Génica/efectos de los fármacos , Genes de Insecto , Péptidos y Proteínas de Señalización Intracelular/genética , Larva , Masculino , Complejos Multiproteicos/genética , Mutación , Análisis de Secuencia por Matrices de Oligonucleótidos , Péptido Hidrolasas/genética , Subunidades de Proteína , Reacción en Cadena de la Polimerasa de Transcriptasa Inversa , Técnica de Sustracción , Factores de Tiempo , Transcripción Genética
13.
Protein Sci ; 16(10): 2251-9, 2007 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-17893362

RESUMEN

There are 3,200,000 amino acid sequences of length 5 (penta-peptides). Statistically, we expect to see a distribution of penta-peptides that is determined by the frequency of the participating amino acids. We show, however, that not only are there thousands of such penta-peptides that are absent from all known proteomes, but many of them are coded for multiple times in the non-coding genomic regions. This suggests a strong selection process that prevents these peptides from being expressed. We also show that the characteristics of these forbidden penta-peptides vary among different phylogenetic groups (e.g., eukaryotes, prokaryotes, and archaea). Our analysis provides the first steps toward understanding the "grammar" of the forbidden penta-peptides.


Asunto(s)
Oligopéptidos/química , Animales , Archaea/genética , Bacterias/genética , ADN Intergénico/química , Oligopéptidos/genética , Filogenia , Proteómica , Análisis de Secuencia de Proteína
14.
J Comput Biol ; 14(6): 817-38, 2007.
Artículo en Inglés | MEDLINE | ID: mdl-17691896

RESUMEN

We describe a new approach for comparing cellular-biological networks and finding conserved regions in two or more such networks. Our distance measure is based on the description length of one network, given the description of the other one, and it is efficiently computable. We employ these distances as inputs for generating phylogenetic trees. Using KEGG's metabolic networks as our starting point, we obtained trees that are not perfect, but are very good (considering the characteristics of the inputs). Our approach enables us to identify conserved regions among more than a dozen metabolic networks, and among two protein interaction networks. These conserved regions appear to be biologically relevant, proving the viability of our approach.


Asunto(s)
Biología Computacional/métodos , Secuencia Conservada , Evolución Molecular , Filogenia , Proteínas/química , Proteínas/metabolismo , Análisis por Conglomerados , Modelos Biológicos , Modelos Genéticos , Proteómica/métodos , Transducción de Señal
15.
Math Biosci ; 208(2): 347-58, 2007 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-17664091

RESUMEN

This work deals with symbolic mathematical solutions to maximum likelihood on small phylogenetic trees. Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees, but finding the global optimum is a hard computational task. In this work, we give general analytic solutions for a family of trees with four taxa, two state characters, under a molecular clock. Previously, analytical solutions were known only for three taxa trees. The change from three to four taxa incurs a major increase in the complexity of the underlying algebraic system, and requires novel techniques and approaches. Despite the simplicity of our model, solving ML analytically in it is close to the limit of today's tractability. Four taxa rooted trees have two topologies--the fork (two subtrees with two leaves each) and the comb (one subtree with three leaves, the other with a single leaf). Combining the properties of molecular clock fork trees with the Hadamard conjugation, and employing the symbolic algebra software Maple, we derive a number of topology dependent identities. Using these identities, we substantially simplify the system of polynomial equations for the fork. We finally employ the symbolic algebra software to obtain closed form analytic solutions (expressed parametrically in the input data).


Asunto(s)
Funciones de Verosimilitud , Filogenia , Evolución Biológica , Matemática , Modelos Genéticos , Programas Informáticos
16.
J Comput Biol ; 13(3): 819-37, 2006 Apr.
Artículo en Inglés | MEDLINE | ID: mdl-16706728

RESUMEN

Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees, but finding the global optimum is a hard computational task. Because no general analytic solution is known, numeric techniques such as hill climbing or expectation maximization (EM), are used in order to find optimal parameters for a given tree. So far, analytic solutions were derived only for the simplest model--three taxa, two state characters, under a molecular clock. Four taxa rooted trees have two topologies--the fork (two subtrees with two leaves each) and the comb (one subtree with three leaves, the other with a single leaf). In a previous work, we devised a closed form analytic solution for the ML molecular clock fork. In this work, we extend the state of the art in the area of analytic solutions ML trees to the family of all four taxa trees under the molecular clock assumption. The change from the fork topology to the comb incurs a major increase in the complexity of the underlying algebraic system and requires novel techniques and approaches. We combine the ultrametric properties of molecular clock trees with the Hadamard conjugation to derive a number of topology dependent identities. Employing these identities, we substantially simplify the system of polynomial equations. We finally use tools from algebraic geometry (e.g., Gröbner bases, ideal saturation, resultants) and employ symbolic algebra software to obtain analytic solutions for the comb. We show that in contrast to the fork, the comb has no closed form solutions (expressed by radicals in the input data). In general, four taxa trees can have multiple ML points. In contrast, we can now prove that under the molecular clock assumption, the comb has a unique (local and global) ML point. (Such uniqueness was previously shown for the fork.).


Asunto(s)
Evolución Molecular , Modelos Genéticos , Filogenia , Programas Informáticos
17.
J Comput Biol ; 13(2): 336-50, 2006 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-16597244

RESUMEN

We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes, whose lengths may greatly vary. The core of our method is a new measure of pairwise distances between sequences. This measure is based on computing the average lengths of maximum common substrings, which is intrinsically related to information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. In principle, the distance of two l long sequences can be calculated in O(l) time. We implemented the algorithm using suffix arrays our implementation is fast enough to enable the construction of the proteome phylogenomic tree for hundreds of species and the genome phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with "acceptable phylogenetic and taxonomic truth." To assess our approach, our results were compared to the traditional (single-gene or protein-based) maximum likelihood method. The obtained trees were compared to implementations of a number of alternative approaches, including two that were previously published in the literature, and to the published results of a third approach. Comparing their outcome and running time to ours, using a "traditional" trees and a standard tree comparison method, our algorithm improved upon the "competition" by a substantial margin. The simplicity and speed of our method allows for a whole genome analysis with the greatest scope attempted so far. We describe here five different applications of the method, which not only show the validity of the method, but also suggest a number of novel phylogenetic insights.


Asunto(s)
Algoritmos , ADN Mitocondrial/genética , Evolución Molecular , Filogenia , Proteínas/genética , Alineación de Secuencia/métodos , Virus/genética , ADN Mitocondrial/química , ADN Mitocondrial/clasificación , Proteínas/química , Proteínas/clasificación , Análisis de Secuencia de ADN , Análisis de Secuencia de Proteína , Virus/química , Virus/clasificación
18.
J Comput Biol ; 23(6): 461-71, 2016 06.
Artículo en Inglés | MEDLINE | ID: mdl-27058690

RESUMEN

Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. CRISPRs are important for many microbial studies and are playing an essential role in current gene editing techniques. As such, they attract substantial research interest. The exponential growth in the amount of bacterial sequence data in recent years enables the exploration of CRISPR loci in more and more species. Most of the automated tools that detect CRISPR loci rely on fully assembled genomes. However, many assemblers do not handle repetitive regions successfully. The first tool to work directly on raw sequence data is Crass, which requires reads that are long enough to contain two copies of the same repeat. We present a method to identify CRISPR repeats from raw sequence data of short reads. The algorithm is based on an observation differentiating CRISPR repeats from other types of repeats, and it involves a series of partial constructions of the overlap graph. This enables us to avoid many of the difficulties that assemblers face, as we merely aim to identify the repeats that belong to CRISPR loci. A preliminary implementation of the algorithm shows good results and detects CRISPR repeats in cases where other existing tools fail to do so.


Asunto(s)
Repeticiones Palindrómicas Cortas Agrupadas y Regularmente Espaciadas , Biología Computacional/métodos , Análisis de Secuencia de ADN/métodos , Algoritmos , Archaea/genética , Bacterias/genética
19.
J Comput Biol ; 10(3-4): 373-84, 2003.
Artículo en Inglés | MEDLINE | ID: mdl-12935334

RESUMEN

This paper concerns the discovery of patterns in gene expression matrices, in which each element gives the expression level of a given gene in a given experiment. Most existing methods for pattern discovery in such matrices are based on clustering genes by comparing their expression levels in all experiments, or clustering experiments by comparing their expression levels for all genes. Our work goes beyond such global approaches by looking for local patterns that manifest themselves when we focus simultaneously on a subset G of the genes and a subset T of the experiments. Specifically, we look for order-preserving submatrices (OPSMs), in which the expression levels of all genes induce the same linear ordering of the experiments (we show that the OPSM search problem is NP-hard in the worst case). Such a pattern might arise, for example, if the experiments in T represent distinct stages in the progress of a disease or in a cellular process and the expression levels of all genes in G vary across the stages in the same way. We define a probabilistic model in which an OPSM is hidden within an otherwise random matrix. Guided by this model, we develop an efficient algorithm for finding the hidden OPSM in the random matrix. In data generated according to the model, the algorithm recovers the hidden OPSM with a very high success rate. Application of the methods to breast cancer data seem to reveal significant local patterns.


Asunto(s)
Biología Computacional/métodos , Interpretación Estadística de Datos , Perfilación de la Expresión Génica , Algoritmos , Análisis de Secuencia por Matrices de Oligonucleótidos , Procesos Estocásticos
20.
J Bioinform Comput Biol ; 2(2): 257-71, 2004 Jun.
Artículo en Inglés | MEDLINE | ID: mdl-15297981

RESUMEN

Maximum likelihood (ML) (Neyman, 1971) is an increasingly popular optimality criterion for selecting evolutionary trees. Finding optimal ML trees appears to be a very hard computational task--in particular, algorithms and heuristics for ML take longer to run than algorithms and heuristics for maximum parsimony (MP). However, while MP has been known to be NP-complete for over 20 years, no such hardness result has been obtained so far for ML. In this work we make a first step in this direction by proving that ancestral maximum likelihood (AML) is NP-complete. The input to this problem is a set of aligned sequences of equal length and the goal is to find a tree and an assignment of ancestral sequences for all of that tree's internal vertices such that the likelihood of generating both the ancestral and contemporary sequences is maximized. Our NP-hardness proof follows that for MP given in (Day, Johnson and Sankoff, 1986) in that we use the same reduction from Vertex Cover; however, the proof of correctness for this reduction relative to AML is different and substantially more involved.


Asunto(s)
Algoritmos , Evolución Molecular , Perfilación de la Expresión Génica/métodos , Filogenia , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Secuencia de Bases , Funciones de Verosimilitud , Datos de Secuencia Molecular
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA