1.
Nucleic Acids Res ; 47(13): 6642-6655, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31334812

ABSTRACT

Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.


Subjects
Computational Biology/methods, Neoplasm Genes, Mutation, Neoplasm Proteins/genetics, Neoplasms/genetics, Datasets as Topic, Humans, Genetic Models, Missense Mutation, Neoplasm Proteins/chemistry, Neoplasm Proteins/physiology
2.
PLoS Comput Biol ; 15(12): e1007204, 2019 12.
Article in English | MEDLINE | ID: mdl-31790387

ABSTRACT

Mature microRNAs (miRNAs) regulate most human genes through direct base-pairing with mRNAs. We investigate the underlying principles of miRNA regulation in living cells. To this end, we overexpressed miRNAs in different cell types and measured the mRNA decay rate under a paradigm of a transcriptional arrest. Based on an exhaustive matrix of mRNA-miRNA binding probabilities, and parameters extracted from our experiments, we developed a computational framework that captures the cooperative action of miRNAs in living cells. The framework, called COMICS, simulates the stochastic binding events between miRNAs and mRNAs in cells. The input of COMICS is cell-specific profiles of mRNAs and miRNAs, and the outcome is the retention level of each mRNA at the end of 100,000 iterations. The results of COMICS from thousands of miRNA manipulations reveal gene sets that exhibit coordinated behavior with respect to all miRNAs (total of 248 families). We identified a small set of genes that are highly responsive to changes in the expression of almost any of the miRNAs. In contrast, about 20% of the tested genes remain insensitive to a broad range of miRNA manipulations. The set of insensitive genes is strongly enriched with genes that belong to the translation machinery. These trends are shared by different cell types. We conclude that the stochastic nature of miRNAs reveals unexpected robustness of gene expression in living cells. By applying a systematic probabilistic approach some key design principles of cell states are revealed, emphasizing in particular, the immunity of the translational machinery vis-a-vis miRNA manipulations across cell types. We propose COMICS as a valuable platform for assessing the outcome of miRNA regulation of cells in health and disease.


Subjects
MicroRNAs/genetics, MicroRNAs/metabolism, Genetic Models, Messenger RNA/genetics, Messenger RNA/metabolism, Computational Biology, Computer Simulation, Gene Expression Profiling, Gene Expression Regulation, HEK293 Cells, HeLa Cells, Humans, MCF-7 Cells, RNA Stability/genetics, Stochastic Processes
3.
Nucleic Acids Res ; 45(9): 5048-5060, 2017 May 19.
Article in English | MEDLINE | ID: mdl-28379430

ABSTRACT

The primary function of microRNAs (miRNAs) is to maintain cell homeostasis. In cancerous tissues, miRNA expression undergoes drastic alterations. In this study, we use miRNA expression profiles from The Cancer Genome Atlas of 24 cancer types and 3 healthy tissues, collected from >8500 samples. We seek to classify the cancer's origin and tissue identification using the expression from 1046 reported miRNAs. Despite an apparent uniform appearance of miRNAs among cancerous samples, we recover indispensable information from lowly expressed miRNAs regarding the cancer/tissue types. Multiclass support vector machine classification yields an average recall of 58% in identifying the correct tissue and tumor types. Data discretization led to a substantial improvement, reaching an average recall of 91% (95% median). We propose a straightforward protocol as a crucial step in classifying tumors of unknown primary origin. Our counter-intuitive conclusion is that in almost all cancer types, highly expressed miRNAs mask the significant signal that lowly expressed miRNAs provide.


Subjects
Tumor Biomarkers/analysis, MicroRNAs/analysis, Neoplasms/diagnosis, Tumor Biomarkers/genetics, Gene Expression Profiling, Humans, MicroRNAs/genetics, Neoplasms/classification, Neoplasms/genetics
4.
Bioinformatics ; 30(17): i624-30, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161256

ABSTRACT

MOTIVATION: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields within the ProtoNet hierarchy an optimal partition of clusters, relative to manual annotation schemes. The method's principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2,478,328 proteins from UniRef50. Of 4,929,553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26,891 clusters. The high quality of the partition is validated by the close correspondence with the set of clusters that best describe thousands of keywords of Pfam. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF's clusters. We present the entropy-based method's benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods. AVAILABILITY AND IMPLEMENTATION: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/.


Subjects
Proteins/classification, Algorithms, Cluster Analysis, Molecular Sequence Annotation, Protein Sequence Analysis
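
The abstract above speaks of an entropy-derived distance between an annotation-based partition and a hierarchical partition, without giving the formula. As a rough illustration only, the sketch below uses the variation of information, VI = H(A|B) + H(B|A), which is one standard entropy-based distance between two partitions of the same items. Choosing VI is an assumption made here for illustration and may not match the measure used in the paper; the function name is hypothetical.

```python
from collections import Counter
from math import log2


def variation_of_information(labels_a, labels_b):
    """Entropy-based distance between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i under
    each partition. Returns H(A|B) + H(B|A) in bits; 0 means the two
    partitions are identical up to renaming of the clusters.
    NOTE: VI is an illustrative stand-in, not necessarily the paper's distance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # n_ab / count_a[a] is P(b | a); n_ab / count_b[b] is P(a | b).
        vi -= p_ab * (log2(n_ab / count_a[a]) + log2(n_ab / count_b[b]))
    return vi
```

Under this assumption, selecting a partition from the tree amounts to evaluating such a distance for each candidate partition against the annotation labels and keeping the candidate with the smallest value.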
5.
Nucleic Acids Res ; 40(Database issue): D313-20, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22121228

ABSTRACT

ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom-up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162,088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves, as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters.


Subjects
Protein Databases, Proteins/classification, Protein Sequence Analysis, Algorithms, Cluster Analysis, Internet, Metagenome, Molecular Sequence Annotation
6.
Bioinformatics ; 27(13): i142-8, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685063

ABSTRACT

MOTIVATION: Much of the large-scale molecular data from living cells can be represented in terms of networks. Such networks occupy a central position in cellular systems biology. In the protein-protein interaction (PPI) network, nodes represent proteins and edges represent connections between them, based on experimental evidence. As PPI networks are rich and complex, a mathematical model is sought to capture their properties and shed light on PPI evolution. The mathematical literature contains various generative models of random graphs. It is a major, still largely open question, which of these models (if any) can properly reproduce various biologically interesting networks. Here, we consider this problem where the graph at hand is the PPI network of Saccharomyces cerevisiae. We try to distinguish between a model family that performs a process of copying neighbors, represented by the duplication-divergence (DD) model, and models that do not copy neighbors, with the Barabási-Albert (BA) preferential attachment model as a leading example. RESULTS: The observed property of the network is the distribution of maximal bicliques in the graph. This is a novel criterion to distinguish between models in this area. It is particularly appropriate for this purpose, since it reflects the graph's growth pattern under either model. This test clearly favors the DD model. In particular, for the BA model, the vast majority (92.9%) of the bicliques with both sides ≥4 must be already embedded in the model's seed graph, whereas the corresponding figure for the DD model is only 5.1%. Our results, based on the biclique perspective, conclusively show that a naïve unmodified DD model can capture a key aspect of PPI networks.


Subjects
Statistical Models, Proteins/metabolism, Saccharomyces cerevisiae/metabolism, Protein Interaction Mapping, Systems Biology
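
To make the "copying neighbors" idea in the abstract above concrete, here is a minimal sketch of one growth step of a duplication-divergence style model. The details are assumptions for illustration (the function name, the adjacency-dict representation, a single edge-retention probability, and no edge between parent and child); published DD variants differ in these choices, and this is not presented as the model used in the paper.

```python
import random


def duplication_divergence_step(adj, retain_prob, rng):
    """Grow the graph by one node using a copy-then-prune rule.

    adj maps each integer node to the set of its neighbours. A random
    existing node (the parent) is duplicated, and each copied edge is
    kept independently with probability retain_prob.
    Returns the pair (parent, child).
    """
    parent = rng.choice(sorted(adj))
    child = max(adj) + 1
    # sorted() keeps the sequence of random draws reproducible for a given seed.
    kept = {nbr for nbr in sorted(adj[parent]) if rng.random() < retain_prob}
    adj[child] = kept
    for nbr in kept:
        adj[nbr].add(child)
    return parent, child


# Usage: start from a triangle and add one duplicated node.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
parent, child = duplication_divergence_step(graph, 0.7, random.Random(1))
```

Because the child inherits a subset of the parent's neighbours, the pair {parent, child} together with their shared neighbours forms a complete bipartite subgraph. This is the intuition for why the biclique distribution is sensitive to whether a growth model copies neighbors.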
7.
Bioinformatics ; 27(5): 655-61, 2011 Mar 01.
Article in English | MEDLINE | ID: mdl-21258061

ABSTRACT

MOTIVATION: Large-scale RNA expression measurements are generating enormous quantities of data. During the last two decades, many methods were developed for extracting insights regarding the interrelationships between genes from such data. The mathematical and computational perspectives that underlie these methods are usually algebraic or probabilistic. RESULTS: Here, we introduce an unexplored geometric viewpoint where expression levels of genes in multiple experiments are interpreted as vectors in a high-dimensional space. Specifically, we find, for the expression profile of each particular gene, its approximation as a linear combination of profiles of a few other genes. This method is inspired by recent developments in the realm of compressed sensing in the machine learning domain. To demonstrate the power of our approach in extracting valuable information from the expression data, we independently applied it to large-scale experiments carried out on the yeast and malaria parasite whole transcriptomes. The parameters extracted from the sparse reconstruction of the expression profiles, when fed to a supervised learning platform, were used to successfully predict the relationships between genes throughout the Gene Ontology hierarchy and protein-protein interaction map. Extensive assessment of the biological results shows high accuracy in both recovering known predictions and in yielding accurate predictions missing from the current databases. We suggest that the geometrical approach presented here is suitable for a broad range of high-dimensional experimental data.


Subjects
Computational Biology/methods, Gene Expression Profiling/methods, Artificial Intelligence, Plasmodium falciparum/genetics, Fungal RNA/genetics, Protozoan RNA/genetics, Saccharomyces cerevisiae/genetics
8.
Sci Rep ; 11(1): 14901, 2021 07 21.
Article in English | MEDLINE | ID: mdl-34290314

ABSTRACT

The characterization of germline genetic variation affecting cancer risk, known as cancer predisposition, is fundamental to preventive and personalized medicine. Studies of genetic cancer predisposition typically identify significant genomic regions based on family-based cohorts or genome-wide association studies (GWAS). However, the results of such studies rarely provide biological insight or functional interpretation. In this study, we conducted a comprehensive analysis of cancer predisposition in the UK Biobank cohort using a new gene-based method for detecting protein-coding genes that are functionally interpretable. Specifically, we conducted proteome-wide association studies (PWAS) to identify genetic associations mediated by alterations to protein function. With PWAS, we identified 110 significant gene-cancer associations in 70 unique genomic regions across nine cancer types and pan-cancer. In 48 of the 110 PWAS associations (44%), estimated gene damage is associated with reduced rather than elevated cancer risk, suggesting a protective effect. Together with standard GWAS, we implicated 145 unique genomic loci with cancer risk. While most of these genomic regions are supported by external evidence, our results also highlight many novel loci. Based on the capacity of PWAS to detect non-additive genetic effects, we found that 46% of the PWAS-significant cancer regions exhibited exclusive recessive inheritance. These results highlight the importance of recessive genetic effects, without relying on familial studies. Finally, we show that many of the detected genes exert substantial cancer risk in the studied cohort determined by a quantitative functional description, suggesting their relevance for diagnosis and genetic consulting.


Subjects
Recessive Genes/genetics, Genetic Predisposition to Disease/genetics, Genome-Wide Association Study/methods, Neoplasm Proteins/genetics, Neoplasm Proteins/physiology, Neoplasms/genetics, Proteome/genetics, Cohort Studies, Female, Genetic Counseling, Genetic Loci/genetics, Germ-Line Mutation, Humans, Male, Neoplasms/diagnosis, Risk, United Kingdom
9.
Front Mol Biosci ; 8: 772852, 2021.
Article in English | MEDLINE | ID: mdl-34993232

ABSTRACT

A hallmark of cancer evolution is that the tumor may change its cell identity and improve its survival and fitness. Drastic changes in microRNA (miRNA) composition and quantities accompany such dynamic processes. Cancer samples are composed of mixtures of cells at varying stages of cancerous progression. Therefore, cell-specific molecular profiling represents cellular averaging. In this study, we consider the degree to which altering miRNA composition shifts cell behavior. We used COMICS, an iterative framework that simulates the stochastic events of miRNA-mRNA pairing, using a probabilistic approach. COMICS simulates the likelihood that cells change their transcriptome following many iterations (100 k). Results of COMICS from the human cell line (HeLa) confirmed that most genes are resistant to miRNA regulation. However, COMICS results suggest that the composition of the abundant miRNAs dictates the nature of the cells (across three cell lines) regardless of its actual mRNA steady-state. In silico perturbations of cell lines (i.e., by overexpressing miRNAs) allowed us to classify genes according to their sensitivity and resilience to any combination of miRNA perturbations. Our results expose an overlooked quantitative dimension for a set of genes and miRNA regulation in living cells. The immediate implication is that even relatively modest overexpression of specific miRNAs may shift cell identity and impact cancer evolution.

10.
Sci Rep ; 10(1): 13462, 2020 08 10.
Article in English | MEDLINE | ID: mdl-32778766

ABSTRACT

It is estimated that up to 10% of cancer incidents are attributed to inherited genetic alterations. Despite extensive research, there are still gaps in our understanding of genetic predisposition to cancer. It was theorized that ultra-rare variants partially account for the missing heritable component. We harness the UK BioBank dataset of ~500,000 individuals, 14% of whom were diagnosed with cancer, to detect ultra-rare, possibly high-penetrance cancer predisposition variants. We report on 115 cancer-exclusive ultra-rare variations and nominate 26 variants with additional independent evidence as cancer predisposition variants. We conclude that population cohorts are a valuable source for expanding the collection of novel cancer predisposition genes.


Subjects
Genetic Predisposition to Disease/genetics, Genetic Variation/genetics, Neoplasms/genetics, Genetic Databases, Genome-Wide Association Study/methods, Genotype, Humans, Mutation/genetics, Penetrance, Single Nucleotide Polymorphism/genetics
11.
Genome Biol ; 21(1): 173, 2020 07 14.
Article in English | MEDLINE | ID: mdl-32665031

ABSTRACT

We introduce Proteome-Wide Association Study (PWAS), a new method for detecting gene-phenotype associations mediated by protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein's function using machine learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. PWAS can capture complex modes of heritability, including recessive inheritance. A comparison with GWAS and other existing methods proves its capacity to recover causal protein-coding genes and highlight new associations. PWAS is available as a command-line tool.


Subjects
Phenotype, Proteome, Proteomics/methods, Software, Colorectal Neoplasms/genetics, Genome-Wide Association Study, Humans
12.
BMC Evol Biol ; 9: 285, 2009 Dec 08.
Article in English | MEDLINE | ID: mdl-19995431

ABSTRACT

BACKGROUND: Codon usage may vary significantly between different organisms and between genes within the same organism. Several evolutionary processes have been postulated to be the predominant determinants of codon usage: selection, mutation, and genetic drift. However, the relative contribution of each of these factors in different species remains debatable. The availability of complete genomes for tens of multicellular organisms provides an opportunity to inspect the relationship between codon usage and the evolutionary age of genes. RESULTS: We assign an evolutionary age to a gene based on the relative positions of its identified homologues in a standard phylogenetic tree. This yields a classification of all genes in a genome to several evolutionary age classes. The present study starts from the observation that each age class of genes has a unique codon usage and proceeds to provide a quantitative analysis of the codon usage in these classes. This observation is made for the genomes of Homo sapiens, Mus musculus, and Drosophila melanogaster. It is even more remarkable that the differences between codon usages in different age groups exhibit similar and consistent behavior in various organisms. While we find that GC content and gene length are also associated with the evolutionary age of genes, they can provide only a partial explanation for the observed codon usage. CONCLUSION: While factors such as GC content, mutational bias, and selection shape the codon usage in a genome, the evolutionary history of an organism over hundreds of millions of years is an overlooked property that is strongly linked to GC content, protein length, and, even more significantly, to the codon usage of metazoan genomes.


Subjects
Codon, Molecular Evolution, Genome, Animals, Base Composition, Humans
13.
Nucleic Acids Res ; 35(Database issue): D241-6, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17099230

ABSTRACT

Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20,029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (total over 3,000,000 sequences) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at http://www.everest.cs.huji.ac.il/. The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows for browsing through the families of each of those sources, graphically visualizing the domain organization of the proteins in the family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families matching criteria regarding novelty, phylogenetic composition and more.


Subjects
Protein Databases, Tertiary Protein Structure, Amino Acid Sequence, Conserved Sequence, Molecular Evolution, Internet, Proteins/classification, Protein Sequence Analysis, User-Computer Interface
14.
Nucleic Acids Res ; 33(Database issue): D216-8, 2005 Jan 01.
Article in English | MEDLINE | ID: mdl-15608180

ABSTRACT

ProtoNet is an automatic hierarchical classification of the protein sequence space. In 2004, ProtoNet (version 4.0) presents the analysis of over one million proteins merged from the SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools to navigate the clustering hierarchy, we incorporated several improvements that allow a simplified view of the scaffold of the proteins. An unsupervised, biologically valid method that was developed resulted in a condensation of the ProtoNet hierarchy to only 12% of the clusters. A large portion of these clusters was automatically assigned high-confidence biological names according to their correspondence with functional annotations. ProtoNet is available at: http://www.protonet.cs.huji.ac.il.


Subjects
Protein Databases, Proteins/classification, Protein Sequence Analysis, Animals, Cluster Analysis, Humans, Internet, Mice, Proteins/chemistry
15.
BMC Bioinformatics ; 7: 277, 2006 Jun 02.
Article in English | MEDLINE | ID: mdl-16749920

ABSTRACT

BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.


Subjects
Protein Databases, Automated Pattern Recognition/methods, Tertiary Protein Structure, Proteins/classification, Software, Cluster Analysis, Computational Biology/methods, Statistical Models, Proteins/chemistry, Reproducibility of Results, Protein Sequence Analysis, Amino Acid Sequence Homology
16.
J Comput Biol ; 13(2): 215-28, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16597236

ABSTRACT

DNA amplifications and deletions characterize the cancer genome and are often related to disease evolution. Microarray-based techniques for measuring these DNA copy-number changes use fluorescence ratios at arrayed DNA elements (BACs, cDNA, or oligonucleotides) to provide signals at high resolution, in terms of genomic locations. These data are then further analyzed to map aberrations and boundaries and identify biologically significant structures. We develop a statistical framework that enables the casting of several DNA copy number data analysis questions as optimization problems over real-valued vectors of signals. The simplest form of the optimization problem seeks to maximize $\varphi(I) = \sum_{i \in I} \nu_i / \sqrt{|I|}$ over all subintervals $I$ in the input vector. We present and prove a linear time approximation scheme for this problem, namely, a process with time complexity $O(n\varepsilon^{-2})$ that outputs an interval for which $\varphi(I)$ is at least $\mathrm{Opt}/\alpha(\varepsilon)$, where Opt is the actual optimum and $\alpha(\varepsilon) \to 1$ as $\varepsilon \to 0$. We further develop practical implementations that improve the performance of the naive quadratic approach by orders of magnitude. We discuss properties of optimal intervals and how they apply to the algorithm performance. We benchmark our algorithms on synthetic as well as publicly available DNA copy number data. We demonstrate the use of these methods for identifying aberrations in single samples as well as common alterations in fixed sets and subsets of breast cancer samples.


Subjects
Algorithms, Breast Neoplasms/genetics, DNA/chemistry, Gene Dosage, Nucleic Acid Hybridization/methods, Chromosome Aberrations, Female, Genetic Markers, Humans, Fluorescence In Situ Hybridization, Cultured Tumor Cells
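
The interval score in the abstract above (the sum of the signals in an interval divided by the square root of the interval's length) can be illustrated with the naive quadratic search that the abstract mentions as the baseline. The sketch below is that baseline only, not the linear-time approximation scheme the paper presents; the function name and the half-open index convention are assumptions made for illustration.

```python
from math import sqrt


def best_interval(signals):
    """Exhaustively find the interval maximizing sum / sqrt(length).

    Returns (score, start, end) for the half-open interval
    signals[start:end]. Runs in O(n^2) time, using prefix sums so that
    each interval sum costs O(1).
    """
    prefix = [0.0]
    for value in signals:
        prefix.append(prefix[-1] + value)
    best = (float("-inf"), 0, 0)
    n = len(signals)
    for start in range(n):
        for end in range(start + 1, n + 1):
            score = (prefix[end] - prefix[start]) / sqrt(end - start)
            if score > best[0]:
                best = (score, start, end)
    return best


# Usage: a short synthetic signal with an amplified stretch in the middle.
score, start, end = best_interval([0.1, -0.2, 1.0, 1.2, 0.9, -0.1])
```

Dividing by the square root of the length, rather than by the length itself, rewards long runs of moderately elevated signal over a single high value, which suits the detection of amplified or deleted segments.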
17.
Nucleic Acids Res ; 31(1): 348-52, 2003 Jan 01.
Article in English | MEDLINE | ID: mdl-12520020

ABSTRACT

The ProtoNet site provides an automatic hierarchical clustering of the SWISS-PROT protein database. The clustering is based on an all-against-all BLAST similarity search. The similarities' E-score is used to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. ProtoNet (version 1.3) is accessible in the form of an interactive web site at http://www.protonet.cs.huji.ac.il. ProtoNet provides navigation tools for monitoring the clustering process with a vertical and horizontal view. Each cluster at any level of the hierarchy is assigned a statistical index, indicating the level of purity based on biological keywords such as those provided by SWISS-PROT and InterPro. ProtoNet can be used for function prediction, for defining superfamilies and subfamilies and for large-scale protein annotation purposes.


Subjects
Protein Databases, Proteins/classification, Animals, Cluster Analysis, Information Storage and Retrieval, Internet, Proteins/chemistry, Proteins/physiology
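
The bottom-up clustering with alternative merging rules described in the abstract above can be sketched as a generic agglomerative loop with a pluggable averaging rule. Everything specific here is an assumption for illustration: the function name, the pairwise score dictionary (lower meaning more similar, in the spirit of an E-score), and the use of an arithmetic mean by default. The abstract does not state which merging rules ProtoNet applies.

```python
from itertools import combinations
from statistics import mean


def agglomerate(items, score, merge_rule=mean):
    """Bottom-up clustering over pairwise scores (lower = more similar).

    score maps frozenset({x, y}) to the pairwise score of items x and y.
    merge_rule reduces all cross-cluster pairwise scores to one linkage
    value, e.g. statistics.mean or statistics.harmonic_mean.
    Returns the merges in order as (cluster_a, cluster_b, linkage).
    """
    clusters = [frozenset([item]) for item in items]
    merges = []

    def linkage(pair):
        a, b = pair
        return merge_rule([score[frozenset((x, y))] for x in a for y in b])

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest linkage value.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Usage with three hypothetical proteins and made-up scores.
pair_scores = {
    frozenset(("p1", "p2")): 1.0,
    frozenset(("p1", "p3")): 8.0,
    frozenset(("p2", "p3")): 2.0,
}
tree = agglomerate(["p1", "p2", "p3"], pair_scores)
```

Passing a different `merge_rule` changes how strongly a single close pair pulls two clusters together, so running the same data under several rules and comparing the resulting trees is one simple way to explore how sensitive the hierarchy is to the rule.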
19.
Proc Natl Acad Sci U S A ; 101(33): 12201-6, 2004 Aug 17.
Article in English | MEDLINE | ID: mdl-15304646

ABSTRACT

Alignment of protein structures is a fundamental task in computational molecular biology. Good structural alignments can help detect distant evolutionary relationships that are hard or impossible to discern from protein sequences alone. Here, we study the structural alignment problem as a family of optimization problems and develop an approximate polynomial-time algorithm to solve them. For a commonly used scoring function, the algorithm runs in $O(n^{10}/\varepsilon^{6})$ time, for globular proteins of length $n$, and it detects alignments that score within an additive error of $\varepsilon$ from all optima. Thus, we prove that this task is computationally feasible, although the method that we introduce is too slow to be a useful everyday tool. We argue that such approximate solutions are, in fact, of greater interest than exact ones because of the noisy nature of experimentally determined protein coordinates. The measurement of similarity between a pair of protein structures used by our algorithm involves the Euclidean distance between the structures (appropriately rigidly transformed). We show that an alternative approach, which relies on internal distance matrices, must incorporate sophisticated geometric ingredients if it is to guarantee optimality and run in polynomial time. We use these observations to visualize the scoring function for several real instances of the problem. Our investigations yield insights on the computational complexity of protein alignment under various scoring functions. These insights can be used in the design of scoring functions for which the optimum can be approximated efficiently and perhaps in the development of efficient algorithms for the multiple structural alignment problem.


Subjects
Proteins/chemistry, Algorithms, Biophysical Phenomena, Biophysics, Molecular Models, Statistical Models, Molecular Structure
20.
Bioinformatics ; 18 Suppl 1: S14-21, 2002.
Article in English | MEDLINE | ID: mdl-12169526

ABSTRACT

MOTIVATION: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation. RESULTS: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro. Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.


Subjects
Algorithms, Cluster Analysis, Protein Databases, Proteins/chemistry, Proteins/classification, Sequence Alignment/methods, Protein Sequence Analysis/methods, Amino Acid Sequence, Molecular Sequence Data, Automated Pattern Recognition, Amino Acid Sequence Homology