Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 34
Filtrar
1.
BMC Bioinformatics ; 23(1): 253, 2022 Jun 24.
Artigo em Inglês | MEDLINE | ID: mdl-35751023

RESUMO

BACKGROUND: The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria have the ability to invade their hosts and cause a disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classification of bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads provided a better interpretability of the resulting model. However, the accuracy of protein-families-based classifiers can still be improved. RESULTS: We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.


Assuntos
Genoma Bacteriano , Genômica , Bactérias/genética , Genômica/métodos , Humanos , Aprendizado de Máquina , Filogenia , Virulência/genética
2.
Bioinformatics ; 36(Suppl_1): i21-i29, 2020 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-32657415

RESUMO

MOTIVATION: An important task in comparative genomics is to detect functional units by analyzing gene-context patterns. Colinear syntenic blocks (CSBs) are groups of genes that are consistently encoded in the same neighborhood and in the same order across a wide range of taxa. Such CSBs are likely essential for the regulation of gene expression in prokaryotes. Recent results indicate that colinearity can be conserved across multiple operons, thus motivating the discovery of multi-operon CSBs. This computational task raises scalability challenges in large datasets. RESULTS: We propose an efficient algorithm for the discovery of cross-strand multi-operon CSBs in large genomic datasets. The proposed algorithm uses match-point arithmetic, which is scalable for large datasets of microbial genomes in terms of running time and space requirements. The algorithm is implemented and incorporated into a tool with a graphical user interface, called CSBFinder-S. We applied CSBFinder-S to data mine 1485 prokaryotic genomes and analyzed the identified cross-strand CSBs. Our results indicate that most of the syntenic blocks are exclusively colinear. Additional results indicate that transcriptional regulation by overlapping transcriptional genes is abundant in bacteria. We demonstrate the utility of CSBFinder-S to identify common function of the gene-pair PulEF in multiple contexts, including Type 2 Secretion System, Type 4 Pilus System and DNA uptake machinery. AVAILABILITY AND IMPLEMENTATION: CSBFinder-S software and code are publicly available at https://github.com/dinasv/CSBFinder. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genoma Microbiano , Genômica , Algoritmos , Óperon , Software , Sintenia
3.
Bioinformatics ; 35(10): 1634-1643, 2019 05 15.
Artigo em Inglês | MEDLINE | ID: mdl-30321308

RESUMO

MOTIVATION: Identification of conserved syntenic blocks across microbial genomes is important for several problems in comparative genomics such as gene annotation, study of genome organization and evolution and prediction of gene interactions. Current tools for syntenic block discovery do not scale up to the large quantity of prokaryotic genomes available today. RESULTS: We present a novel methodology for the discovery, ranking and taxonomic distribution analysis of colinear syntenic blocks (CSBs)-groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. We present an efficient algorithm that identifies CSBs in large genomic datasets. The algorithm is implemented and incorporated in a novel tool with a graphical user interface, denoted CSBFinder, that ranks the discovered CSBs according to a probabilistic score and clusters them to families according to their gene content similarity. We apply CSBFinder to data mine 1487 prokaryotic genomes including chromosomes and plasmids. For post-processing analysis, we generate heatmaps for visualizing the distribution of CSB family members across various taxa. We exemplify the utility of CSBFinder in operon prediction, in deciphering unknown gene function and in taxonomic analysis of colinear syntenic blocks. AVAILABILITY AND IMPLEMENTATION: CSBFinder software and code are publicly available at https://github.com/dinasv/CSBFinder. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica , Software , Algoritmos , Genoma Microbiano , Sintenia
4.
Bioinformatics ; 35(12): 2001-2008, 2019 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-30407484

RESUMO

MOTIVATION: Bacterial infections are a major cause of illness worldwide. However, most bacterial strains pose no threat to human health and may even be beneficial. Thus, developing powerful diagnostic bioinformatic tools that differentiate pathogenic from commensal bacteria are critical for effective treatment of bacterial infections. RESULTS: We propose a machine-learning approach for classifying human-hosted bacteria as pathogenic or non-pathogenic based on their genome-derived proteomes. Our approach is based on sparse Support Vector Machines (SVM), which autonomously selects a small set of genes that are related to bacterial pathogenicity. We implement our approach as a tool-'Bacterial Pathogenicity Classification via sparse-SVM' (BacPaCS)-which is fully automated and handles datasets significantly larger than those previously used. BacPaCS shows high accuracy in distinguishing pathogenic from non-pathogenic bacteria, in a clinically relevant dataset, comprising only human-hosted bacteria. Among the genes that received the highest positive weight in the resulting classifier, we found genes that are known to be related to bacterial pathogenicity, in addition to novel candidates, whose involvement in bacterial virulence was never reported. AVAILABILITY AND IMPLEMENTATION: The code and the resulting model are available at: https://github.com/barashe/bacpacs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Aprendizado de Máquina , Máquina de Vetores de Suporte , Bactérias , Humanos , Proteoma , Virulência
5.
Bioinformatics ; 33(12): 1907-1909, 2017 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-28165111

RESUMO

SUMMARY: Network motifs are small topological patterns that recur in a network significantly more often than expected by chance. Their identification emerged as a powerful approach for uncovering the design principles underlying complex networks. However, available tools for network motif analysis typically require download and execution of computationally intensive software on a local computer. We present MotifNet, the first open-access web-server for network motif analysis. MotifNet allows researchers to analyze integrated networks, where nodes and edges may be labeled, and to search for motifs of up to eight nodes. The output motifs are presented graphically and the user can interactively filter them by their significance, number of instances, node and edge labels, and node identities, and view their instances. MotifNet also allows the user to distinguish between motifs that are centered on specific nodes and motifs that recur in distinct parts of the network. AVAILABILITY AND IMPLEMENTATION: MotifNet is freely available at http://netbio.bgu.ac.il/motifnet . The website was implemented using ReactJs and supports all major browsers. The server interface was implemented in Python with data stored on a MySQL database. CONTACT: estiyl@bgu.ac.il or michaluz@cs.bgu.ac.il. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional/métodos , Software , Bases de Dados Factuais , Internet
6.
PLoS Comput Biol ; 13(1): e1005221, 2017 01.
Artigo em Inglês | MEDLINE | ID: mdl-28135269

RESUMO

Protein phosphorylation underlies cellular response pathways across eukaryotes and is governed by the opposing actions of phosphorylating kinases and de-phosphorylating phosphatases. While kinases and phosphatases have been extensively studied, their organization and the mechanisms by which they balance each other are not well understood. To address these questions we performed quantitative analyses of large-scale 'omics' datasets from yeast, fly, plant, mouse and human. We uncovered an asymmetric balance of a previously-hidden scale: Each organism contained many different kinase genes, and these were balanced by a small set of highly abundant phosphatase proteins. Kinases were much more responsive to perturbations at the gene and protein levels. In addition, kinases had diverse scales of phenotypic impact when manipulated. Phosphatases, in contrast, were stable, highly robust and flatly organized, with rather uniform impact downstream. We validated aspects of this organization experimentally in nematode, and supported additional aspects by theoretic analysis of the dynamics of protein phosphorylation. Our analyses explain the empirical bias in the protein phosphorylation field toward characterization and therapeutic targeting of kinases at the expense of phosphatases. We show quantitatively and broadly that this is not only a historical bias, but stems from wide-ranging differences in their organization and impact. The asymmetric balance between these opposing regulators of protein phosphorylation is also common to opposing regulators of two other post-translational modification systems, suggesting its fundamental value.


Assuntos
Evolução Molecular , Regulação Enzimológica da Expressão Gênica/fisiologia , Monoéster Fosfórico Hidrolases/genética , Monoéster Fosfórico Hidrolases/metabolismo , Fosfotransferases/genética , Fosfotransferases/metabolismo , Animais , Arabidopsis/genética , Arabidopsis/metabolismo , Drosophila melanogaster/genética , Drosophila melanogaster/metabolismo , Ativação Enzimática/genética , Variação Genética/genética , Camundongos , Monoéster Fosfórico Hidrolases/classificação , Fosforilação , Fosfotransferases/classificação , Saccharomyces cerevisiae/genética , Saccharomyces cerevisiae/metabolismo , Especificidade da Espécie , Leveduras
7.
Methods ; 69(3): 326-34, 2014 Oct 01.
Artigo em Inglês | MEDLINE | ID: mdl-25009129

RESUMO

The discovery and functional analysis of noncoding RNA (ncRNA) systems in different organisms motivates the development of tools for aiding ncRNA research. Several tools exist that search for occurrences of a given RNA structural profile in genomic sequences. Yet, there is a need for an "RNA BLAST" tool, i.e., a tool that takes a putative functional RNA sequence as input, and efficiently searches for similar sequences in genomic databases, taking into consideration potential secondary structure features of the input query sequence. This work aims at providing such a tool. Our tool, denoted StemSearch, is based on a structural representation of an RNA sequence by its potential stems. Potential stems in genomic sequences are identified in a preprocessing stage, and indexed. A user-provided query sequence is likewise processed, and stems from the target genomes that are similar to the query stems are retrieved from the index. Then, relevant genomic regions are identified and ranked according to their similarity to the query stem-set while enforcing conservation of cross-stem topology. Experiments using RFAM families show significantly improved recall for StemSearch over BLAST, with small loss of precision. We further demonstrate our system's capability to handle eukaryotic genomes by successfully searching for members of the 7SK family in chromosome 2 of the human genome. StemSearch is freely available on the web at: http://www.cs.bgu.ac.il/∼negevcb/StemSearch.


Assuntos
RNA não Traduzido/genética , Ferramenta de Busca/métodos , Análise de Sequência de RNA/métodos , Software , Algoritmos , Humanos , Conformação de Ácido Nucleico , Alinhamento de Sequência/métodos
8.
Bioinformatics ; 29(13): i210-6, 2013 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-23812986

RESUMO

MOTIVATION: A major challenge in systems biology is to reveal the cellular pathways that give rise to specific phenotypes and behaviours. Current techniques often rely on a network representation of molecular interactions, where each node represents a protein or a gene and each interaction is assigned a single static score. However, the use of single interaction scores fails to capture the tendency of proteins to favour different partners under distinct cellular conditions. RESULTS: Here, we propose a novel context-sensitive network model, in which genes and protein nodes are assigned multiple contexts based on their gene ontology annotations, and their interactions are associated with multiple context-sensitive scores. Using this model, we developed a new approach and a corresponding tool, ContextNet, based on a dynamic programming algorithm for identifying signalling paths linking proteins to their downstream target genes. ContextNet finds high-ranking context-sensitive paths in the interactome, thereby revealing the intermediate proteins in the path and their path-specific contexts. We validated the model using 18 348 manually curated cellular paths derived from the SPIKE database. We next applied our framework to elucidate the responses of human primary lung cells to influenza infection. Top-ranking paths were much more likely to contain infection-related proteins, and this likelihood was highly correlated with path score. Moreover, the contexts assigned by the algorithm pointed to putative, as well as previously known responses to viral infection. Thus, context sensitivity is an important extension to current network biology models and can be efficiently used to elucidate cellular response mechanisms. AVAILABILITY: ContextNet is publicly available at http://netbio.bgu.ac.il/ContextNet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Redes Reguladoras de Genes , Mapeamento de Interação de Proteínas/métodos , Transdução de Sinais , Algoritmos , Ontologia Genética , Humanos , Pulmão/metabolismo , Pulmão/virologia , Modelos Biológicos , Orthomyxoviridae/metabolismo , Proteínas Virais/metabolismo
9.
Algorithms Mol Biol ; 18(1): 17, 2023 Dec 01.
Artigo em Inglês | MEDLINE | ID: mdl-38037088

RESUMO

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.

10.
BMC Bioinformatics ; 13: 322, 2012 Dec 03.
Artigo em Inglês | MEDLINE | ID: mdl-23206407

RESUMO

BACKGROUND: MicroRNAs (miRNAs) are important regulators of gene expression encoded by a variety of organisms, including viruses. Although the function of most of the viral miRNAs is currently unknown, there is evidence that both viral and host miRNAs contribute to the interactions between viruses and their hosts. miRNAs constitute a complex combinatorial network, where one miRNA may target many genes and one gene may be targeted by multiple miRNAs. In particular, viral and host miRNAs may also have mutual target genes. Based on published evidence linking viral and host miRNAs there are three modes of mutual regulation: competing, cooperating, and compensating modes. RESULTS: In this paper we explore the compensating mode of mutual regulation upon Human Cytomegalovirus (HCMV) infection, when host miRNAs are down regulated and viral miRNAs compensate by mimicking their function. To achieve this, we develop a new algorithm which finds groups, called quasi-modules, of viral and host miRNAs and their mutual target genes, and use a new host miRNA expression data for HCMV-infected and uninfected cells. For two of the reported quasi-modules, supporting evidence from biological and medical literature is provided. CONCLUSIONS: The modules found by our method may advance the understanding of the role of miRNAs in host-viral interactions, and the genes in these modules may serve as candidates for further experimental validation.


Assuntos
Citomegalovirus/genética , Regulação da Expressão Gênica/fisiologia , MicroRNAs/fisiologia , RNA Viral/fisiologia , Algoritmos , Citomegalovirus/fisiologia , Regulação para Baixo , Humanos
11.
Algorithms Mol Biol ; 16(1): 16, 2021 Jul 09.
Artigo em Inglês | MEDLINE | ID: mdl-34243815

RESUMO

Gene clusters are groups of genes that are co-locally conserved across various genomes, not necessarily in the same order. Their discovery and analysis is valuable in tasks such as gene annotation and prediction of gene interactions, and in the study of genome organization and evolution. The discovery of conserved gene clusters in a given set of genomes is a well studied problem, but with the rapid sequencing of prokaryotic genomes a new problem is inspired. Namely, given an already known gene cluster that was discovered and studied in one genomic dataset, to identify all the instances of the gene cluster in a given new genomic sequence. Thus, we define a new problem in comparative genomics, denoted PQ-TREE SEARCH that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer arguments [Formula: see text] and [Formula: see text], and a new sequence of genes S. The objective is to identify in S approximate new instances of the gene cluster; These instances could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by [Formula: see text] and [Formula: see text], respectively. We prove that PQ-TREE SEARCH is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-TREE SEARCH in [Formula: see text] time, where [Formula: see text] is the maximum degree of a node in T and [Formula: see text] is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-trees. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.

12.
BMC Bioinformatics ; 11: 249, 2010 May 13.
Artigo em Inglês | MEDLINE | ID: mdl-20465802

RESUMO

BACKGROUND: MicroRNAs (miRNAs) are an abundant class of small noncoding RNAs (20-24 nts) that can affect gene expression by post-transcriptional regulation of mRNAs. They play important roles in several biological processes (e.g., development and cell cycle regulation). Numerous bioinformatics methods have been developed to identify the function of miRNAs by predicting their target mRNAs. Some viral organisms also encode miRNAs, a fact that contributes to the complex interactions between viruses and their hosts. A need arises to understand the functional relationship between viral and host miRNAs and their effect on viral and host genes. Our approach to meet this challenge is to identify modules where viral and host miRNAs cooperatively regulate host gene expression. RESULTS: We present a method to identify groups of viral and host miRNAs that cooperate in post-transcriptional gene regulation, and their target genes that are involved in similar biological processes. We call these groups (genes and miRNAs of human and viral origin) - modules. The modules are found in a new two-stage procedure, which we call bi-targeting, and is presented in this paper. The stages are (i) a new and efficient target prediction, and (ii) a new method for clustering objects of three different data types. In this work we integrate multiple information sources, including miRNA-target binding information, miRNA expression profiles, and GO annotations. Our hypotheses and the methods have been tested on human and Epstein Barr virus (EBV) miRNAs and human genes, for which we found 34 modules. We provide supporting evidence from biological and medical literature for two of our modules. Our code and data are available at http://www.cs.bgu.ac.il/~vaksler/BiTargeting.htm CONCLUSIONS: The presented algorithm, which makes use of diverse biological data, is demonstrated to be an efficient approach for finding bi-targeting modules of viral and human miRNAs. These modules can contribute to a better understanding of viral-host interactions and the role that miRNAs play in them.


Assuntos
Algoritmos , MicroRNAs/genética , RNA Mensageiro/genética , RNA Viral/genética , Redes Reguladoras de Genes , Herpesvirus Humano 4/genética , Humanos , Processamento Pós-Transcricional do RNA
13.
RNA ; 14(7): 1352-65, 2008 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-18492794

RESUMO

Cotranslational synthesis of proteins into the endoplasmic reticulum is preceded by targeting of the translating mRNA once a signal peptide emerges from the ribosome exit tunnel. Many mRNAs, however, are unlikely to be targeted by this process because they encode proteins that do not contain a signal peptide or because they are too short to be recognized by the signal recognition particle. Herein we tested the possible involvement of the 3'-UTR in the localization of an mRNA that encodes a very short Saccharomyces cerevisiae protein (Pmp1). We found by ribosome density mapping, sedimentation analysis, differential centrifugation, and fluorescent in situ hybridization that the 3'-UTR is essential for the association of the transcript with membrane compartments. Fusion of the 3'-UTR to heterologous open reading frames conferred on them a sedimentation and cellular localization pattern resembling that of PMP1. Mutation analysis revealed that a repeating UG-rich sequence within the 3'-UTR is important for membrane association. Taken together, our results reveal an essential role for elements within the 3'-UTR in the localization of an mRNA that is likely to be ignored by the standard signal-dependant mechanism.


Assuntos
Regiões 3' não Traduzidas/metabolismo , Proteínas de Membrana/genética , Proteínas de Membrana/metabolismo , Proteínas do Tecido Nervoso/genética , Proteínas do Tecido Nervoso/metabolismo , Proteolipídeos/genética , Proteolipídeos/metabolismo , RNA Fúngico/metabolismo , Proteínas de Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/metabolismo , Saccharomyces cerevisiae/metabolismo , Regiões 3' não Traduzidas/análise , Regiões 3' não Traduzidas/genética , Códon de Terminação , Retículo Endoplasmático/metabolismo , Proteínas de Membrana/análise , Mutagênese , Proteínas do Tecido Nervoso/análise , Polirribossomos , Transporte Proteico , Proteolipídeos/análise , ATPases Translocadoras de Prótons , RNA Fúngico/análise , RNA Fúngico/genética , Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/análise
14.
J Comput Biol ; 27(11): 1561-1580, 2020 11.
Artigo em Inglês | MEDLINE | ID: mdl-32250165

RESUMO

An important goal in microbial computational genomics is to identify crucial events in the evolution of a gene that severely alter the duplication, loss, and mobilization patterns of the gene within the genomes in which it disseminates. In this article, we formalize this microbiological goal as a new pattern-matching problem in the domain of gene tree and species tree reconciliation, denoted "Reconciliation-Scenario Altering Mutation (RSAM) Discovery." We propose an [Formula: see text] time algorithm to solve this new problem, wheremandnare the number of vertices of the input gene tree and species tree, respectively, andkis a user-specified parameter that bounds from above the number of optimal solutions of interest. The algorithm first constructs a hypergraph representing thekhighest scoring reconciliation scenarios between the given gene tree and species tree, and then interrogates this hypergraph for subtrees matching a prespecified RSAM pattern. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by [Formula: see text]. We implement the new algorithm as a tool, called RSAM-finder, and demonstrate its application to the identification of RSAMs in toxins and drug resistance elements across a data set spanning hundreds of species.


Assuntos
Bactérias/genética , Genômica , Modelos Genéticos , Mutação , Algoritmos , Evolução Molecular , Filogenia
15.
BMC Bioinformatics ; 10: 76, 2009 Mar 04.
Artigo em Inglês | MEDLINE | ID: mdl-19257906

RESUMO

BACKGROUND: Scanning large genomes with a sliding window in search of locally stable RNA structures is a well motivated problem in bioinformatics. Given a predefined window size L and an RNA sequence S of size N (L < N), the consecutive windows folding problem is to compute the minimal free energy (MFE) for the folding of each of the L-sized substrings of S. The consecutive windows folding problem can be naively solved in O(NL3) by applying any of the classical cubic-time RNA folding algorithms to each of the N-L windows of size L. Recently an O(NL2) solution for this problem has been described. RESULTS: Here, we describe and implement an O(NLpsi(L)) engine for the consecutive windows folding problem, where psi(L) is shown to converge to O(1) under the assumption of a standard probabilistic polymer folding model, yielding an O(L) speedup which is experimentally confirmed. Using this tool, we note an intriguing directionality (5'-3' vs. 3'-5') folding bias, i.e. that the minimal free energy (MFE) of folding is higher in the native direction of the DNA than in the reverse direction of various genomic regions in several organisms including regions of the genomes that do not encode proteins or ncRNA. This bias largely emerges from the genomic dinucleotide bias which affects the MFE, however we see some variations in the folding bias in the different genomic regions when normalized to the dinucleotide bias. We also present results from calculating the MFE landscape of a mouse chromosome 1, characterizing the MFE of the long ncRNA molecules that reside in this chromosome. CONCLUSION: The efficient consecutive windows folding engine described in this paper allows for genome wide scans for ncRNA molecules as well as large-scale statistics. This is implemented here as a software tool, called RNAslider, and applied to the scanning of long chromosomes, leading to the observation of features that are visible only on a large scale.


Assuntos
Algoritmos , Genoma , RNA/química , Software , Conformação de Ácido Nucleico , RNA não Traduzido/química
16.
J Comput Biol ; 26(7): 745-766, 2019 07.
Artigo em Inglês | MEDLINE | ID: mdl-31140838

RESUMO

Recent advances in Next Generation Sequencing techniques, combined with global efforts to study infectious diseases, yield huge and rapidly-growing databases of microbial genomes. These big new data statistically empower genomic-context based approaches to functional analysis: the idea is that groups of genes that are clustered locally together across many genomes usually express protein products that interact in the same biological pathway (e.g., operons). The problem of finding such conserved "gene blocks" in a given genomic data has been studied extensively. In this work, we propose a new gene block discovery problem variant: find conserved gene blocks abiding by a user specification of biological functional constraints. We take advantage of the biological constraints to efficiently prune the search space. This is achieved by modeling the new problem as a special constrained variant of the well-studied "Closed Frequent Itemset Mining" problem, generalized here to handle item duplications. We exemplify the application of the tool we developed for this problem with two different case studies related to microbial ATP (adenosine triphosphate)-binding cassette (ABC) transporters.


Assuntos
Estudos de Associação Genética , Genoma , Família Multigênica , Células Procarióticas/metabolismo , Transportadores de Cassetes de Ligação de ATP/genética , Fatores de Tempo
17.
J Comput Biol ; 25(2): 214-235, 2018 02.
Artigo em Inglês | MEDLINE | ID: mdl-29028176

RESUMO

We formalize a new problem variant in gene-block discovery, denoted Reference-Anchored Gene Blocks (RAGB), given a query sequence Q of length n, representing the gene array of a DNA element, a window size bound d on the length of a substring of interest in Q, and a set of target gene sequences [Formula: see text]. Our objective is to identify gene blocks in [Formula: see text] that are centered in a subset q of co-localized genes from Q, and contain genomes from [Formula: see text] in which the corresponding orthologs of the genes from q are also co-localized. We cast RAGB as a variant of a (colored) biclique problem in bipartite graphs, and analyze its parameterized complexity, as well as the parameterized complexity of other related problems. We give an [Formula: see text] time algorithm for the uncolored variant of our biclique problem, where m is the number of areas of interest that are parsed from the target sequences, and n and d are as defined earlier. Our algorithm can be adapted to compute all maximal bicliques in the graph within the same time complexity, and to handle edge weights with a slight [Formula: see text] increase to its time complexity. For the colored version of the problem, our algorithm has a time complexity of [Formula: see text]. We implement the algorithm and exemplify its application to the data mining of proteobacterial gene blocks that are centered in predicted proteobacterial genomic islands, leading to the identification of putatively mobilized clusters of virulence, pathogenicity, and resistance genes.


Assuntos
Genoma Bacteriano , Família Multigênica , Análise de Sequência de DNA/métodos , Algoritmos , Mapeamento de Sequências Contíguas/métodos , Mapeamento de Sequências Contíguas/normas , Padrões de Referência , Análise de Sequência de DNA/normas
18.
J Comput Biol ; 14(6): 856-72, 2007.
Artigo em Inglês | MEDLINE | ID: mdl-17691898

RESUMO

mRNA molecules are folded in the cells and therefore many of their substrings may actually be inaccessible to protein and microRNA binding. The need to apply an accessibility criterion to the task of genome-wide mRNA motif discovery raises the challenge of overcoming the core O(n(3)) factor imposed by the time complexity of the currently best known algorithms for RNA secondary structure prediction. We speed up the dynamic programming algorithms that are standard for RNA folding prediction. Our new approach significantly reduces the computations without sacrificing the optimality of the results, yielding an expected time complexity of O(n(2) psi(n)), where psi(n) is shown to be constant on average under standard polymer folding models. A benchmark analysis confirms that in practice the runtime ratio between the previous approach and the new algorithm indeed grows linearly with increasing sequence size. The fast new RNA folding algorithm is utilized for genome-wide discovery of accessible cis-regulatory motifs in data sets of ribosomal densities and decay rates of S. cerevisiae genes and to the mining of exposed binding sites of tissue-specific microRNAs in A. thaliana.


Assuntos
Algoritmos , RNA/química , RNA/genética , Sequências Reguladoras de Ácido Ribonucleico , Arabidopsis/genética , Biologia Computacional , MicroRNAs/genética , Modelos Moleculares , Conformação de Ácido Nucleico , Ribossomos/fisiologia , Proteínas de Saccharomyces cerevisiae/química , Proteínas de Saccharomyces cerevisiae/genética , Software
19.
J Comput Biol ; 14(7): 908-26, 2007 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-17803370

RESUMO

The discovery of non-coding RNA (ncRNA) motifs and their role in regulating gene expression has recently attracted considerable attention. The goal is to discover these motifs in a sequence database. Current RNA motif search methods start from the primary sequence and only then take into account secondary structure considerations. One can think of developing a flexible structure-based motif search method that will filter datasets based on secondary structure first, while allowing extensive primary sequence factors and additional factors such as potential pseudoknots as constraints. Since different motifs vary in structure rigidity and in local sequence constraints, there is a need for algorithms and tools that can be fine-tuned according to the searched RNA motif, but differ in their approach from the RNAMotif descriptor language. We present an RNA motif search tool called STRMS (Structural RNA Motif Search), which takes as input the secondary structure of the query, including local sequence and structure constraints, and a target sequence database. It reports all occurrences of the query in the target, ranked by their similarity to the query, and produces an html file that displays graphical images of the predicted structures for both the query and the candidate hits. Our tool is flexible and takes into account a large number of sequence options and existence of potential pseudoknots as dictated by specific queries. Our approach combines pre-folding and an O(m n) RNA pattern matching algorithm based on subtree homeomorphism for ordered, rooted trees. An O(n(2) log n) extension is described that allows the search engine to take into account the pseudoknots typical to riboswitches. We employed STRMS in search for both new and known RNA motifs (riboswitches and tRNAs) in large target databases. Our results point to a number of additional purine bacterial riboswitch candidates in newly sequenced bacteria, and demonstrate high sensitivity on known riboswitches and tRNAs. Code and data are available at www.cs.bgu.ac.il/vaksler/STRMS.


Assuntos
Algoritmos , Sequência de Bases , RNA/genética , Análise de Sequência de RNA , Conformação de Ácido Nucleico , RNA/química , RNA Bacteriano/genética , Reprodutibilidade dos Testes
20.
J Comput Biol ; 13(8): 1397-418, 2006 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-17061918

RESUMO

A new problem in phylogenetic inference is presented, based on recent biological findings indicating a strong association between reversals (i.e., inversions) and repeats. These biological findings are formalized here in a new mathematical model, called repeat-annotated phylogenetic trees (RAPT). We show that, under RAPT, the evolutionary process--including both the tree-topology as well as internal node genome orders--is uniquely determined, a property that is of major significance both in theory and in practice. Furthermore, the repeats are employed to provide linear-time algorithms for reconstructing both the genomic orders and the phylogeny, which are NP-hard problems under the classical model of sorting by reversals (SBR).


Assuntos
Biologia Computacional/métodos , Genoma Bacteriano/genética , Filogenia , Sequências Repetitivas de Ácido Nucleico/genética , Algoritmos , Modelos Biológicos , Recombinação Genética/genética
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA