Results 1 - 20 of 32
1.
Entropy (Basel) ; 26(10), 2024 Sep 25.
Article in English | MEDLINE | ID: mdl-39451890

ABSTRACT

A better understanding of protein-protein interaction (PPI) networks representing physical interactions between proteins could be beneficial for evolutionary insights as well as for practical applications such as drug development. As a statistical model for PPI networks, duplication-divergence models have been proposed, but they suffer from resulting in either very sparse networks in which most of the proteins are isolated, or in networks which are much denser than what is usually observed, having almost no isolated proteins. Moreover, in real networks, where a gene codes a protein, gene loss may occur. The loss of nodes has not been captured in duplication-divergence models to date. Here, we introduce a new duplication-divergence model which includes node loss. This mechanism results in networks in which the proportion of isolated proteins can take on values which are strictly between 0 and 1. To understand this new model, we apply strong and weak attacks to networks from duplication-divergence models with and without node loss, and compare the results to those obtained when carrying out similar attacks on two real PPI networks of E. coli and of S. cerevisiae. We find that the new model more closely reflects the damage caused by strong and weak attacks found in the PPI networks.
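
The abstract does not spell out the mechanics of the model, so the following is only a minimal sketch of how a duplication-divergence process with node loss could be simulated. The edge-retention probability `p_retain`, the loss probability `q_loss`, the uniform choice of the deleted node, and all function names are assumptions made for illustration, not the authors' definitions.

```python
import random


def dd_step(adj, p_retain, q_loss, rng):
    """One growth step on an undirected graph stored as {node: set(neighbours)}.

    Assumed mechanics (not taken from the paper): duplicate a random node,
    keep each inherited edge with probability p_retain, then, with
    probability q_loss, delete one uniformly chosen node.
    """
    parent = rng.choice(sorted(adj))
    child = max(adj) + 1  # fresh integer label
    adj[child] = set()
    for nb in sorted(adj[parent]):
        if rng.random() < p_retain:  # divergence: some inherited edges are lost
            adj[child].add(nb)
            adj[nb].add(child)
    if rng.random() < q_loss and len(adj) > 2:  # node loss (assumed uniform)
        victim = rng.choice(sorted(adj))
        for nb in adj.pop(victim):
            adj[nb].discard(victim)
    return adj


def isolated_fraction(adj):
    """Proportion of nodes with no neighbours."""
    return sum(1 for nbs in adj.values() if not nbs) / len(adj)


def simulate(steps, p_retain, q_loss, seed=0):
    """Grow a network from a single edge for the given number of steps."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    for _ in range(steps):
        dd_step(adj, p_retain, q_loss, rng)
    return adj
```

Calling `isolated_fraction(simulate(500, 0.4, 0.2))` gives one draw of the proportion of isolated nodes under these assumed parameters; varying `q_loss` is a simple way to explore how node loss changes that proportion.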

2.
Bioinformatics ; 37(13): 1928-1929, 2021 07 27.
Article in English | MEDLINE | ID: mdl-32931579

ABSTRACT

SUMMARY: Gene co-expression networks can be constructed in multiple different ways, both in the use of different measures of co-expression, and in the thresholds applied to the calculated co-expression values, from any given dataset. It is often not clear which co-expression network construction method should be preferred. COGENT provides a set of tools designed to aid the choice of network construction method without the need for any external validation data. AVAILABILITY AND IMPLEMENTATION: https://github.com/lbozhilova/COGENT. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.


Subject(s)
Gene Regulatory Networks , Software , Routine Diagnostic Tests , Gene Expression
3.
Bioinformatics ; 2021 Feb 01.
Article in English | MEDLINE | ID: mdl-33523234

ABSTRACT

MOTIVATION: Even within well studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information. RESULTS: We introduce the concept of signed distance correlation as a measure of dependency between two variables, and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate how networks generated with our method are more stable and capture more biological information compared to networks obtained from Pearson correlation or mutual information. SUPPLEMENTARY INFORMATION: Supplementary Information and code are available at Bioinformatics and https://github.com/javier-pardodiaz/sdcorGCN online.
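
As a rough illustration of the statistic named above, the sketch below computes the sample distance correlation of two one-dimensional samples in plain Python and attaches a sign to it. Taking the sign from the Pearson covariance is an assumption made here for illustration, and the function names are hypothetical; the authors' implementation is the one in the linked repository.

```python
import math


def _centered_distances(x):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row = [sum(r) / n for r in d]  # row means equal column means (symmetry)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def _mean_product(a, b):
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D samples."""
    a, b = _centered_distances(x), _centered_distances(y)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(_mean_product(a, b), 0.0) / denom)


def signed_distance_correlation(x, y):
    """Distance correlation carrying the sign of the Pearson covariance.

    Assumption: the sign convention is chosen for illustration; a zero
    covariance yields 0 here.
    """
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sign = (cov > 0) - (cov < 0)
    return sign * distance_correlation(x, y)
```

For an exactly linear relationship the magnitude is 1, so `signed_distance_correlation(x, [-2 * v for v in x])` evaluates to approximately -1 for any non-constant `x`.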

4.
BMC Genomics ; 21(1): 756, 2020 Nov 02.
Article in English | MEDLINE | ID: mdl-33138772

ABSTRACT

BACKGROUND: Recent advances in single-cell RNA sequencing have allowed researchers to explore transcriptional function at a cellular level. In particular, single-cell RNA sequencing reveals that there exist clusters of cells with similar gene expression profiles, representing different transcriptional states. RESULTS: In this study, we present SCPPIN, a method for integrating single-cell RNA sequencing data with protein-protein interaction networks that detects active modules in cells of different transcriptional states. We achieve this by clustering RNA-sequencing data, identifying differentially expressed genes, constructing node-weighted protein-protein interaction networks, and finding the maximum-weight connected subgraphs with an exact Steiner-tree approach. As case studies, we investigate two RNA-sequencing data sets from human liver spheroids and human adipose tissue, respectively. With SCPPIN we expand the output of differential expressed genes analysis with information from protein interactions. We find that different transcriptional states have different subnetworks of the protein-protein interaction networks significantly enriched which represent biological pathways. In these pathways, SCPPIN identifies proteins that are not differentially expressed but have a crucial biological function (e.g., as receptors) and therefore reveals biology beyond a standard differential expressed gene analysis. CONCLUSIONS: The introduced SCPPIN method can be used to systematically analyse differentially expressed genes in single-cell RNA sequencing data by integrating it with protein interaction data. The detected modules that characterise each cluster help to identify and hypothesise a biological function associated to those cells. Our analysis suggests the participation of unexpected proteins in these pathways that are undetectable from the single-cell RNA sequencing data alone. The techniques described here are applicable to other organisms and tissues.


Subject(s)
Protein Interaction Maps , RNA , Cluster Analysis , Gene Expression Profiling , Gene Regulatory Networks , Humans , RNA/genetics , RNA Sequence Analysis
5.
BMC Bioinformatics ; 20(1): 446, 2019 Aug 28.
Article in English | MEDLINE | ID: mdl-31462221

ABSTRACT

BACKGROUND: Protein interaction databases often provide confidence scores for each recorded interaction based on the available experimental evidence. Protein interaction networks (PINs) are then built by thresholding on these scores, so that only interactions of sufficiently high quality are included. These networks are used to identify biologically relevant motifs or nodes using metrics such as degree or betweenness centrality. This type of analysis can be sensitive to the choice of threshold. If a node metric is to be useful for extracting biological signal, it should induce similar node rankings across PINs obtained at different reasonable confidence score thresholds. RESULTS: We propose three measures (rank continuity, identifiability, and instability) to evaluate how robust a node metric is to changes in the score threshold. We apply our measures to twenty-five metrics and identify four as the most robust: the number of edges in the step-1 ego network, as well as the leave-one-out differences in average redundancy, average number of edges in the step-1 ego network, and natural connectivity. Our measures show good agreement across PINs from different species and data sources. Analysis of synthetically generated scored networks shows that robustness results are context-specific, and depend both on network topology and on how scores are placed across network edges. CONCLUSION: Due to the uncertainty associated with protein interaction detection, and therefore network structure, for PIN analysis to be reproducible, it should yield similar results across different confidence score thresholds. We demonstrate that while certain node metrics are robust with respect to threshold choice, this is not always the case. Promisingly, our results suggest that there are some metrics that are robust across networks constructed from different databases, and different scoring procedures.
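
To make the threshold sensitivity described above concrete, the sketch below builds a network from scored edges at two thresholds and compares the top-ranked nodes by degree. The Jaccard overlap of the top-k sets is a simple stand-in chosen for illustration; it is not one of the three measures proposed in the paper, and the function names and toy scores are hypothetical.

```python
def degrees_at_threshold(scored_edges, threshold):
    """Node degrees in the network that keeps edges with score >= threshold."""
    deg = {}
    for u, v, score in scored_edges:
        deg.setdefault(u, 0)
        deg.setdefault(v, 0)
        if score >= threshold:
            deg[u] += 1
            deg[v] += 1
    return deg


def top_k(deg, k):
    """The k highest-degree nodes; ties are broken by node name."""
    return set(sorted(deg, key=lambda n: (-deg[n], n))[:k])


def top_k_overlap(scored_edges, t1, t2, k):
    """Jaccard overlap of the top-k degree sets at two score thresholds."""
    a = top_k(degrees_at_threshold(scored_edges, t1), k)
    b = top_k(degrees_at_threshold(scored_edges, t2), k)
    return len(a & b) / len(a | b)


# Toy scored interactions (hypothetical values).
edges = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.5),
         ("B", "C", 0.4), ("C", "D", 0.95)]
print(top_k_overlap(edges, 0.4, 0.7, 2))   # rankings agree
print(top_k_overlap(edges, 0.4, 0.85, 2))  # rankings diverge
```

An overlap near 1 across a range of reasonable thresholds suggests the ranking is stable for that metric; a low overlap signals that conclusions depend on the threshold chosen.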


Subject(s)
Computational Biology/methods , Protein Databases , Protein Interaction Maps , Proteins/metabolism , Algorithms , Humans
6.
Bioinformatics ; 34(1): 64-71, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29036452

ABSTRACT

Motivation: Our work is motivated by an interest in constructing a protein-protein interaction network that captures key features associated with Parkinson's disease. While there is an abundance of subnetwork construction methods available, it is often far from obvious which subnetwork is the most suitable starting point for further investigation. Results: We provide a method to assess whether a subnetwork constructed from a seed list (a list of nodes known to be important in the area of interest) differs significantly from a randomly generated subnetwork. The proposed method uses a Monte Carlo approach. As different seed lists can give rise to the same subnetwork, we control for redundancy by constructing a minimal seed list as the starting point for the significance test. The null model is based on random seed lists of the same length as a minimum seed list that generates the subnetwork; in this random seed list the nodes have (approximately) the same degree distribution as the nodes in the minimum seed list. We use this null model to select subnetworks which deviate significantly from random on an appropriate set of statistics and might capture useful information for a real world protein-protein interaction network. Availability and implementation: The software used in this paper is available for download at https://sites.google.com/site/elliottande/. The software is written in Python and uses the NetworkX library. Contact: ande.elliott@gmail.com or felix.reed-tsochas@sbs.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
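
The general shape of such a Monte Carlo test can be sketched as follows. This is a hedged illustration only: the exact degree matching, the function names, and the standard add-one empirical p-value are assumptions, and the subnetwork construction and test statistics used by the authors are not reproduced here.

```python
import random


def degree_matched_seeds(degree, seeds, rng):
    """Draw a random seed list whose nodes match the degrees of `seeds`.

    Assumptions: `degree` maps node -> degree, the seeds are distinct,
    and matching is exact (the abstract only requires approximate matching).
    """
    by_degree = {}
    for node, d in degree.items():
        by_degree.setdefault(d, []).append(node)
    chosen = []
    for s in seeds:
        # Same-degree nodes not yet drawn; never empty for distinct seeds.
        pool = [n for n in sorted(by_degree[degree[s]]) if n not in chosen]
        chosen.append(rng.choice(pool))
    return chosen


def monte_carlo_p_value(observed, null_values):
    """One-sided empirical p-value with the usual add-one correction."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (1 + exceed) / (1 + len(null_values))
```

In use, one would compute a statistic on the subnetwork grown from the real seed list, recompute it for many lists returned by `degree_matched_seeds`, and pass both to `monte_carlo_p_value`.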


Subject(s)
Monte Carlo Method , Parkinson Disease/metabolism , Protein Interaction Mapping/methods , Software , Computational Biology/methods , Humans
7.
Bioinformatics ; 32(7): 993-1000, 2016 04 01.
Article in English | MEDLINE | ID: mdl-26130573

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. RESULTS: Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. AVAILABILITY AND IMPLEMENTATION: Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html. CONTACT: fsun@usc.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods , High-Throughput Nucleotide Sequencing , Markov Chains , Algorithms , Animals , Cluster Analysis , Computational Biology/methods , Genome , Statistical Models , Vertebrates
8.
Brief Bioinform ; 15(3): 343-53, 2014 May.
Article in English | MEDLINE | ID: mdl-24064230

ABSTRACT

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.


Subject(s)
Computational Biology/methods , Sequence Analysis/methods , Algorithms , Computational Biology/trends , Genomics/methods , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing , Markov Chains , Statistical Models , Sequence Alignment , Sequence Analysis/statistics & numerical data
9.
Phys Rev Lett ; 117(7): 078301, 2016 Aug 12.
Article in English | MEDLINE | ID: mdl-27564002

ABSTRACT

Community detection, the division of a network into dense subnetworks with only sparse connections between them, has been a topic of vigorous study in recent years. However, while there exist a range of effective methods for dividing a network into a specified number of communities, it is an open question how to determine exactly how many communities one should use. Here we describe a mathematically principled approach for finding the number of communities in a network by maximizing the integrated likelihood of the observed network structure under an appropriate generative model. We demonstrate the approach on a range of benchmark networks, both real and computer generated.

10.
Bioinformatics ; 30(17): i430-7, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161230

ABSTRACT

MOTIVATION: Biological network comparison software largely relies on the concept of alignment where close matches between the nodes of two or more networks are sought. These node matches are based on sequence similarity and/or interaction patterns. However, because of the incomplete and error-prone datasets currently available, such methods have had limited success. Moreover, the results of network alignment are in general not amenable for distance-based evolutionary analysis of sets of networks. In this article, we describe Netdis, a topology-based distance measure between networks, which offers the possibility of network phylogeny reconstruction. RESULTS: We first demonstrate that Netdis is able to correctly separate different random graph model types independent of network size and density. The biological applicability of the method is then shown by its ability to build the correct phylogenetic tree of species based solely on the topology of current protein interaction networks. Our results provide new evidence that the topology of protein interaction networks contains information about evolutionary processes, despite the lack of conservation of individual interactions. As Netdis is applicable to all networks because of its speed and simplicity, we apply it to a large collection of biological and non-biological networks where it clusters diverse networks by type. AVAILABILITY AND IMPLEMENTATION: The source code of the program is freely available at http://www.stats.ox.ac.uk/research/proteins/resources. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Protein Interaction Mapping/methods , Algorithms , Animals , Biological Evolution , Humans , Phylogeny
11.
Bioinformatics ; 29(21): 2690-8, 2013 Nov 01.
Article in English | MEDLINE | ID: mdl-23990418

ABSTRACT

MOTIVATION: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, C(*)1 and C(S)1, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, C(*)2, C(S)2 and C(geo)2, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences. RESULTS: Our investigation uses both simulated data as well as cis-regulatory module data where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. AVAILABILITY: Our implementation of the five statistics is available as R package named 'multiAlignFree' at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html. CONTACT: reinert@stats.ox.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA Sequence Analysis/methods , Animals , Binding Sites , Statistical Data Interpretation , Mice , Transcriptional Regulatory Elements , Sequence Alignment , Transcription Factors/metabolism
12.
Netw Sci (Camb Univ Press) ; 10(2): 131-145, 2022 Jun.
Article in English | MEDLINE | ID: mdl-36217370

ABSTRACT

Even within well-studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes or proteins, using a network of gene coexpression data that includes functional annotations. Signed distance correlation has proved useful for the construction of unweighted gene coexpression networks. However, transforming correlation values into unweighted networks may lead to a loss of important biological information related to the intensity of the correlation. Here we introduce a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold. We analyse data from different organisms and find that networks generated with our method based on signed distance correlation are more stable and capture more biological information compared to networks obtained from Pearson correlation. Moreover, we show that signed distance correlation networks capture more biological information than unweighted networks based on the same metric. While we use biological data sets to illustrate the method, the approach is general and can be used to construct networks in other domains. Code and data are available on https://github.com/javier-pardodiaz/sdcorGCN.

13.
J Comput Biol ; 29(7): 752-768, 2022 07.
Article in English | MEDLINE | ID: mdl-35588362

ABSTRACT

Nitrogen uptake in legumes is facilitated by bacteria such as Rhizobium leguminosarum. For this bacterium, gene expression data are available, but functional gene annotation is less well developed than for other model organisms. More annotations could lead to a better understanding of the pathways for growth, plant colonization, and nitrogen fixation in R. leguminosarum. In this study, we present a pipeline that combines novel scores from gene coexpression network analysis in a principled way to identify the genes that are associated with certain growth conditions or highly coexpressed with a predefined set of genes of interest. This association may lead to putative functional annotation or to a prioritized list of genes for further study.


Subject(s)
Rhizobium leguminosarum , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Nitrogen Fixation/genetics , Rhizobium leguminosarum/genetics , Rhizobium leguminosarum/metabolism
14.
Appl Netw Sci ; 7(1): 15, 2022.
Article in English | MEDLINE | ID: mdl-35308059

ABSTRACT

As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occur spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models. This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions. Supplementary Information: The online version contains supplementary material available at 10.1007/s41109-022-00454-2.

15.
Bioinformatics ; 26(18): i611-7, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20823329

ABSTRACT

MOTIVATION: A wealth of protein-protein interaction (PPI) data has recently become available. These data are organized as PPI networks and an efficient and biologically meaningful method to compare such PPI networks is needed. As a first step, we would like to compare observed networks to established network models, under the aspect of small subgraph counts, as these are conjectured to relate to functional modules in the PPI network. We employ the software tool GraphCrunch with the Graphlet Degree Distribution Agreement (GDDA) score to examine the use of such counts for network comparison. RESULTS: Our results show that the GDDA score has a pronounced dependency on the number of edges and vertices of the networks being considered. This should be taken into account when testing the fit of models. We provide a method for assessing the statistical significance of the fit between random graph models and biological networks based on non-parametric tests. Using this method we examine the fit of Erdös-Rényi (ER), ER with fixed degree distribution and geometric (3D) models to PPI networks. Under these rigorous tests none of these models fit to the PPI networks. The GDDA score is not stable in the region of graph density relevant to current PPI networks. We hypothesize that this score instability is due to the networks under consideration having a graph density in the threshold region for the appearance of small subgraphs. This is true for both geometric (3D) and ER random graph models. Such threshold behaviour may be linked to the robustness and efficiency properties of the PPI networks.


Subject(s)
Statistical Data Interpretation , Biological Models , Protein Interaction Mapping/methods , Software , Computer Simulation , Humans , Saccharomyces cerevisiae/genetics , Signal Transduction
16.
J Theor Biol ; 284(1): 106-16, 2011 Sep 07.
Article in English | MEDLINE | ID: mdl-21723298

ABSTRACT

Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D*2 and D(s)2 showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
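
For orientation, the basic D2 statistic is commonly defined as the inner product of the k-word count vectors of two sequences. The sketch below computes only that basic form; the variants D*2 and D(s)2 need a background model, and the new local statistics described in the abstract are not implemented here. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Counts of all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Basic D2: sum over words w of count_a(w) * count_b(w)."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(ca[w] * cb[w] for w in ca if w in cb)


print(d2("ACGT", "ACGT", 2))  # words AC, CG, GT each occur once in both: 3
print(d2("ACGT", "TTTT", 2))  # no shared 2-words: 0
```

Because D2 grows with sequence length and word frequencies, it is usually compared against its distribution under a null model rather than read as a raw similarity.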


Subject(s)
Nucleic Acid Regulatory Sequences/genetics , DNA Sequence Analysis/methods , Algorithms , Animals , Statistical Data Interpretation , Drosophila/genetics , Molecular Evolution , HIV-1/genetics , Statistical Models , Sequence Alignment , Nucleic Acid Sequence Homology
17.
Proteins ; 78(13): 2781-97, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20635422

ABSTRACT

Biological processes are commonly controlled by precise protein-protein interactions. These connections rely on specific amino acids at the binding interfaces. Here we predict the binding residues of such interprotein complexes. We have developed a suite of methods, i-Patch, which predict the interprotein contact sites by considering the two proteins as a network, with residues as nodes and contacts as edges. i-Patch starts with two proteins, A and B, which are assumed to interact, but for which the structure of the complex is not available. However, we assume that for each protein, we have a reference structure and a multiple sequence alignment of homologues. i-Patch then uses the propensities of patches of residues to interact, to predict interprotein contact sites. i-Patch outperforms several other tested algorithms for prediction of interprotein contact sites. It gives 59% precision with 20% recall on a blind test set of 31 protein pairs. Combining the i-Patch scores with an existing correlated mutation algorithm, McBASC, using a logistic model gave little improvement. Results from a case study, on bacterial chemotaxis protein complexes, demonstrate that our predictions can identify contact residues, as well as suggesting unknown interfaces in multiprotein complexes.


Subject(s)
Algorithms , Protein Interaction Mapping/methods , Proteins/chemistry , Amino Acid Sequence , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Binding Sites/genetics , Computational Biology/methods , Membrane Proteins/chemistry , Membrane Proteins/genetics , Membrane Proteins/metabolism , Methyl-Accepting Chemotaxis Proteins , Molecular Models , Molecular Sequence Data , Protein Binding , Tertiary Protein Structure , Proteins/genetics , Proteins/metabolism , Reproducibility of Results , Amino Acid Sequence Homology
18.
Proc Math Phys Eng Sci ; 476(2241): 20190783, 2020 Sep.
Article in English | MEDLINE | ID: mdl-33061788

ABSTRACT

Empirical networks often exhibit different meso-scale structures, such as community and core-periphery structures. Core-periphery structure typically consists of a well-connected core and a periphery that is well connected to the core but sparsely connected internally. Most core-periphery studies focus on undirected networks. We propose a generalization of core-periphery structure to directed networks. Our approach yields a family of core-periphery block model formulations in which, contrary to many existing approaches, core and periphery sets are edge-direction dependent. We focus on a particular structure consisting of two core sets and two periphery sets, which we motivate empirically. We propose two measures to assess the statistical significance and quality of our novel structure in empirical data, where one often has no ground truth. To detect core-periphery structure in directed networks, we propose three methods adapted from two approaches in the literature, each with a different trade-off between computational complexity and accuracy. We assess the methods on benchmark networks where our methods match or outperform standard methods from the literature, with a likelihood approach achieving the highest accuracy. Applying our methods to three empirical networks (faculty hiring, a world trade dataset and political blogs) illustrates that our proposed structure provides novel insights in empirical networks.

19.
PLoS Comput Biol ; 4(7): e1000118, 2008 Jul 25.
Article in English | MEDLINE | ID: mdl-18654616

ABSTRACT

Protein interactions play a vital part in the function of a cell. As experimental techniques for detection and validation of protein interactions are time consuming, there is a need for computational methods for this task. Protein interactions appear to form a network with a relatively high degree of local clustering. In this paper we exploit this clustering by suggesting a score based on triplets of observed protein interactions. The score utilises both protein characteristics and network properties. Our score based on triplets is shown to complement existing techniques for predicting protein interactions, outperforming them on data sets which display a high degree of clustering. The predicted interactions score highly against test measures for accuracy. Compared to a similar score derived from pairwise interactions only, the triplet score displays higher sensitivity and specificity. By looking at specific examples, we show how an experimental set of interactions can be enriched and validated. As part of this work we also examine the effect of different prior databases upon the accuracy of prediction and find that the interactions from the same kingdom give better results than from across kingdoms, suggesting that there may be fundamental differences between the networks. These results all emphasize that network structure is important and helps in the accurate prediction of protein interactions. The protein interaction data set and the program used in our analysis, and a list of predictions and validations, are available at http://www.stats.ox.ac.uk/bioinfo/resources/PredictingInteractions.


Subject(s)
Computational Biology/methods , Computer Neural Networks , Protein Interaction Mapping/methods , Amino Acid Sequence , Animals , Protein Databases , Humans , Predictive Value of Tests , Proteins/chemistry , Proteins/metabolism , Protein Structural Homology , Structure-Activity Relationship , Systems Integration
20.
Bioinformatics ; 23(17): 2314-21, 2007 Sep 01.
Article in English | MEDLINE | ID: mdl-17599931

ABSTRACT

MOTIVATION: The Majority Vote approach has demonstrated that protein-protein interactions can be used to predict the structure or function of a protein. In this article we propose a novel method for the prediction of such protein characteristics based on frequencies of pairwise interactions. In addition, we study a second new approach using the pattern frequencies of triplets of proteins, thus for the first time taking network structure explicitly into account. Both these methods are extended to jointly consider multiple organisms and multiple characteristics. RESULTS: Compared to the standard non-network-based method, namely the Majority Vote method, in large networks our predictions tend to be more accurate. For structure prediction, the Frequency-based method reaches up to 71% accuracy, and the Triplet-based method reaches up to 72% accuracy, whereas for function prediction, both the Triplet-based method and the Frequency-based method reach up to 90% accuracy. Function prediction on proteins without homologues showed slightly less but comparable accuracies. Including partially annotated proteins substantially increases the number of proteins for which our methods predict their characteristics with reasonable accuracy. We find that the enhanced Triplet-based method does not currently yield significantly better results than the enhanced Frequency-based method, suggesting that triplets of interactions do not contain substantially more information about protein characteristics than interaction pairs. Our methods offer two main improvements over current approaches--first, multiple protein characteristics are considered simultaneously, and second, data is integrated from multiple species. In addition, the Triplet-based method includes network structure more explicitly than the Majority Vote and the Frequency-based method. AVAILABILITY: The program is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
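
The Majority Vote baseline mentioned above is usually described as assigning a protein the annotation that is most frequent among its interaction partners. The sketch below follows that description; the alphabetical tie-break, the function name, and the toy data are assumptions for illustration, and the Frequency-based and Triplet-based methods of the paper are not shown.

```python
from collections import Counter


def majority_vote(interactions, annotations, protein):
    """Predict an annotation for `protein` from its interaction partners.

    interactions: iterable of (protein_a, protein_b) pairs (undirected)
    annotations:  dict mapping annotated proteins to a single label
    Returns None when no partner is annotated. Ties are broken
    alphabetically (an assumption made here for determinism).
    """
    partners = set()
    for a, b in interactions:
        if a == protein and b != protein:
            partners.add(b)
        elif b == protein and a != protein:
            partners.add(a)
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Toy example (hypothetical proteins and labels).
ppi = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
labels = {"P2": "kinase", "P3": "kinase", "P4": "transporter"}
print(majority_vote(ppi, labels, "P1"))  # kinase
```

Each partner casts one vote regardless of the rest of the network, which is the sense in which the abstract contrasts this baseline with methods that use pairwise or triplet interaction frequencies.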


Subject(s)
Algorithms , Biological Models , Protein Interaction Mapping/methods , Proteins/metabolism , Signal Transduction/physiology , Computer Simulation , Statistical Data Interpretation , Statistical Models