Results 1 - 20 of 20
1.
Stat Appl Genet Mol Biol ; 16(5-6): 313-331, 2017 11 27.
Article in English | MEDLINE | ID: mdl-29166289

ABSTRACT

We introduce a new method to test efficiently for cospeciation in tritrophic systems. Our method utilises an analogy with electrical circuit theory to reduce higher order systems into bitrophic data sets that retain the information of the original system. We use a sophisticated permutation scheme that weights interactions between two trophic layers based on their connection to the third layer in the system. Our method has several advantages compared to the method of Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.]. We do not require triangular interactions to connect the three phylogenetic trees and an easily interpreted p-value is obtained in one step. Another advantage of our method is the scope for generalisation to higher order systems and phylogenetic networks. The performance of our method is compared to the methods of Hommola et al. [Hommola, K., J. E. Smith, Y. Qiu and W. R. Gilks (2009): "A permutation test of host-parasite cospeciation," Mol. Biol. Evol., 26, 1457-1468.] and Mramba et al. [Mramba, L. K., S. Barber, K. Hommola, L. A. Dyer, J. S. Wilson, M. L. Forister and W. R. Gilks (2013): "Permutation tests for analyzing cospeciation in multiple phylogenies: applications in tri-trophic ecology," Stat. Appl. Genet. Mol. Biol., 12, 679-701.] at the bitrophic and tritrophic level, respectively. This was achieved by evaluating type I error and statistical power. The results show that our method produces unbiased p-values and has comparable power overall at both trophic levels. Our method was successfully applied to a dataset of leaf-mining moths, parasitoid wasps and host plants [Lopez-Vaamonde, C., H. Godfray, S. West, C. Hansson and J. Cook (2005): "The evolution of host use and unusual reproductive strategies in Achrysocharoides parasitoid wasps," J. Evol. Biol., 18, 1029-1041.], at both the bitrophic and tritrophic levels.


Subject(s)
Host-Parasite Interactions , Models, Statistical , Models, Theoretical , Computer Simulation , Host-Parasite Interactions/genetics
2.
Stat Appl Genet Mol Biol ; 12(6): 679-701, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24114867

ABSTRACT

There is a need for a reliable statistical test which is appropriate for assessing cospeciation of more than two phylogenies. We have developed an algorithm using a permutation method that can be used to test for and infer tri-trophic evolutionary relationships of organisms given both their phylogenies and pairwise interactions. An overall statistic has been developed based on the dominant eigenvalue of a covariance matrix, and compared to values of the statistic computed when tree labels are permuted. The resulting overall p-value is used to test for the presence or absence of cospeciation in a tri-trophic system. If cospeciation is detected, we propose new test statistics based on partial correlations to uncover more details about the relationships between multiple phylogenies. One of the strengths of our method is that it allows more parasites than hosts or more hosts than parasites, with multiple associations and more than one parasite attached to a host (or one parasite attached to multiple hosts). The new method does not require any parametric assumptions of the distribution of the data, and unlike the old methods, which utilize several pairwise steps, the overall statistic used is obtained in one step. We have applied our method to two published datasets where we obtained detailed information about the strength of associations among species with calculated partial p-values and one overall p-value from the dominant eigenvalue test statistic. Our permutation method produces reliable results with a clear procedure and statistics applied in an intuitive manner. Our algorithm is useful in testing evidence for three-way cospeciation in multiple phylogenies with tri-trophic associations and determining which phylogenies are involved in cospeciation.
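
The abstract names two generic ingredients: the dominant eigenvalue of a covariance matrix, and a p-value obtained by comparing the observed statistic with values from permuted tree labels. The following is a minimal sketch of those two ingredients only. How the covariance matrix is built from the phylogenies and interactions is not described in the abstract and is not shown; the function names and the use of power iteration are assumptions made for illustration.

```python
def dominant_eigenvalue(matrix, n_iter=200, tol=1e-12):
    """Approximate the largest eigenvalue of a symmetric, positive
    semi-definite matrix (list of lists) by power iteration.

    Assumption: the start vector is not orthogonal to the dominant
    eigenvector; otherwise the iteration converges to a smaller eigenvalue.
    """
    n = len(matrix)
    vec = [1.0 / n ** 0.5] * n
    value = 0.0
    for _ in range(n_iter):
        # Multiply by the matrix and renormalise.
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0.0:
            return 0.0
        new = [x / norm for x in new]
        # Rayleigh quotient of the normalised vector.
        new_value = sum(
            new[i] * sum(matrix[i][j] * new[j] for j in range(n))
            for i in range(n)
        )
        if abs(new_value - value) < tol:
            return new_value
        vec, value = new, new_value
    return value


def permutation_p_value(observed, null_values):
    """One-sided permutation p-value; the observed value counts as one draw."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (1 + exceed) / (1 + len(null_values))
```

In use, `null_values` would hold the statistic recomputed after each random relabelling of the trees, and a small p-value would indicate a statistic larger than expected under independent evolution.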


Subject(s)
Models, Genetic , Algorithms , Animals , Bacteria/genetics , Biological Evolution , Computer Simulation , Data Interpretation, Statistical , Genetic Speciation , Host-Parasite Interactions/genetics , Isoptera/genetics , Isoptera/microbiology , Phylogeny , Symbiosis
3.
Stat Appl Genet Mol Biol ; 10: Article 8, 2011.
Article in English | MEDLINE | ID: mdl-21291418

ABSTRACT

It has long been known that the amino-acid sequence of a protein determines its 3-dimensional structure, but accurate ab initio prediction of structure from sequence remains elusive. We gain insight into local protein structure conformation by studying the relationship of dihedral angles in pairs of residues in protein sequences (dipeptides). We adopt a contingency table approach, exploring a targeted set of hypotheses through log-linear modelling to detect patterns of association of these dihedral angles in all dipeptides considered. Our models indicate a substantial association of the side-chain conformation of the first residue with the backbone conformation of the second residue (side-to-back interaction) as well as a weaker but still significant association of the backbone conformation of the first residue with the side-chain conformation of the second residue (back-to-side interaction). To compare these interactions across different dipeptides, we cluster the parameter estimates for the corresponding association terms. This reveals a striking pattern. For the side-to-back term, all dipeptides which have the same first residue jointly appear in distinct clusters whereas for the back-to-side term, we observe a much weaker pattern. This suggests that the conformation of the first residue affects the conformation of the second.


Subject(s)
Dipeptides/chemistry , Models, Chemical , Proteins/chemistry , Sequence Analysis, Protein/statistics & numerical data , Amino Acid Sequence , Cluster Analysis , Computer Simulation , Protein Conformation
4.
Mol Biol Evol ; 26(7): 1457-68, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19329652

ABSTRACT

We introduce a statistical method that explores host-parasite coevolution by testing the null hypothesis that hosts and their associated parasites evolved independently. This test is simple and intuitive and involves only suitable randomization of the observed data. It is not even necessary to construct host and parasite phylogenetic trees, as the test can be performed directly on distance matrices. Statistical power of the test was evaluated using simulated data consistent with the alternative hypothesis of cospeciation. Results were compared with the method of Mantel (1967) and the ParaFit method of Legendre et al. (2002). We observed that our method has greater power overall and thus a higher ability to detect cospeciation in closely related host-parasite systems. Our test was also successful when applied to the pocket gopher and chewing lice system.
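
A randomization test that works directly on distance matrices can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact procedure: the statistic (Pearson correlation between host distances and parasite distances over all pairs of host-parasite links), the label-shuffling scheme, and all function names are choices made here for the example.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def link_correlation(host_dist, para_dist, links):
    """Correlate host and parasite distances over all pairs of links.

    links is a list of (host_index, parasite_index) associations.
    """
    xs, ys = [], []
    for (h1, p1), (h2, p2) in combinations(links, 2):
        xs.append(host_dist[h1][h2])
        ys.append(para_dist[p1][p2])
    return pearson(xs, ys)


def permutation_test(host_dist, para_dist, links, n_perm=999, seed=0):
    """Return (observed statistic, one-sided p-value) by shuffling taxon labels."""
    rng = random.Random(seed)
    observed = link_correlation(host_dist, para_dist, links)
    hosts = list(range(len(host_dist)))
    paras = list(range(len(para_dist)))
    exceed = 0
    for _ in range(n_perm):
        hperm, pperm = hosts[:], paras[:]
        rng.shuffle(hperm)
        rng.shuffle(pperm)
        shuffled = [(hperm[h], pperm[p]) for h, p in links]
        if link_correlation(host_dist, para_dist, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because only the two distance matrices and the list of links are needed, no tree has to be constructed, which matches the point made in the abstract.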


Subject(s)
Biological Evolution , Genetic Speciation , Host-Parasite Interactions , Models, Genetic , Animals , Phylogeny
5.
Genome Biol ; 8(2): R15, 2007.
Article in English | MEDLINE | ID: mdl-17274809

ABSTRACT

BACKGROUND: The human genome contains thousands of non-coding sequences that are often more conserved between vertebrate species than protein-coding exons. These highly conserved non-coding elements (CNEs) are associated with genes that coordinate development, and have been proposed to act as transcriptional enhancers. Despite their extreme sequence conservation in vertebrates, sequences homologous to CNEs have not been identified in invertebrates. RESULTS: Here we report that nematode genomes contain an alternative set of CNEs that share sequence characteristics, but not identity, with their vertebrate counterparts. CNEs thus represent a very unusual class of sequences that are extremely conserved within specific animal lineages yet are highly divergent between lineages. Nematode CNEs are also associated with developmental regulatory genes, and include well-characterized enhancers and transcription factor binding sites, supporting the proposed function of CNEs as cis-regulatory elements. Most remarkably, 40 of 156 human CNE-associated genes with invertebrate orthologs are also associated with CNEs in both worms and flies. CONCLUSION: A core set of genes that regulate development is associated with CNEs across three animal groups (worms, flies and vertebrates). We propose that these CNEs reflect the parallel evolution of alternative enhancers for a common set of developmental regulatory genes in different animal groups. This 're-wiring' of gene regulatory networks containing key developmental coordinators was probably a driving force during the evolution of animal body plans. CNEs may, therefore, represent the genomic traces of these 'hard-wired' core gene regulatory networks that specify the development of each alternative animal body plan.


Subject(s)
Caenorhabditis elegans/genetics , Conserved Sequence/genetics , DNA, Intergenic/genetics , Evolution, Molecular , Gene Regulatory Networks/genetics , Growth and Development/genetics , Animals , Base Composition , Base Sequence , Humans , Molecular Sequence Data , Sequence Analysis, DNA , Sequence Homology , Species Specificity
6.
J Bioinform Comput Biol ; 4(2): 425-41, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16819793

ABSTRACT

One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present a computational non-supervised algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for the location of putative transcription factor binding sites in a single stretch of DNA. Hereto we developed a procedure to approximate the order of Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does not require any prior knowledge in the form of description or cross-genomic comparison; it is context sensitive and takes DNA heterogeneity into account.
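
The idea of approximating DNA with a Markov chain can be sketched as below. This is a simplified illustration: the order is fixed by the caller (the paper describes a procedure for estimating it, which is not reproduced here), and the pseudocount smoothing and function names are assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov(seq, order=1, pseudocount=1.0):
    """Estimate transition probabilities P(base | previous `order` bases).

    Returns a function prob(context, base) with additive smoothing so that
    unseen transitions keep a small non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=1):
    """Log-probability of seq under the fitted chain (first `order` bases skipped)."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )
```

A window whose log-likelihood under the fitted chain is unusually low is poorly explained by the background model, which is one hedged way to flag candidate sites for closer inspection.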


Subject(s)
Chromosome Mapping/methods , DNA/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Transcription Factors/genetics , Artificial Intelligence , Binding Sites , Computer Simulation , DNA/chemistry , Markov Chains , Models, Genetic , Models, Statistical , Pattern Recognition, Automated , Protein Binding , Transcription Factors/chemistry
7.
Brief Bioinform ; 7(1): 48-54, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16761364

ABSTRACT

There are no well-known properties in regulatory DNA analogous to those in coding sequences; their spatial location is not regular, the consensus regulatory elements are often degenerate and there are no understandable rules governing their evolution. This makes it difficult to recognize regulatory regions within a genome. We review developments in the statistical characterization of regulatory regions and methods of their recognition in eukaryotic genomes.


Subject(s)
Computational Biology , DNA/physiology , Genome , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/statistics & numerical data , Predictive Value of Tests
8.
Stat Appl Genet Mol Biol ; 5: Article 5, 2006.
Article in English | MEDLINE | ID: mdl-16646869

ABSTRACT

Experiments to determine the complete 3-dimensional structures of protein complexes are difficult to perform and only a limited range of such structures are available. In contrast, large-scale screening experiments have identified thousands of pairwise interactions between proteins, but such experiments do not produce explicit structural information. In addition, the data produced by these high-throughput experiments contain large numbers of false positive results, and can be biased against detection of certain types of interaction. Several methods exist that analyse such pairwise interaction data in terms of the constituent domains within proteins, scoring pairs of domain superfamilies according to their propensity to interact. These scores can be used to predict the strongest domain-domain contact (the contact with the largest surface area) between interacting proteins for which the domain-level structures of the individual proteins are known. We test this predictive approach on a set of pairwise protein interactions taken from the Protein Quaternary Structure (PQS) database for which the true domain-domain contacts are known. While the overall prediction success rate across the whole test data set is poor, we show how interactions in the test data set for which the training data are not informative can be automatically excluded from the prediction process, giving improved prediction success rates at the expense of restricted coverage of the test data.


Subject(s)
Protein Interaction Mapping/methods , Protein Structure, Tertiary , Binding Sites , Data Interpretation, Statistical , Databases, Protein , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Two-Hybrid System Techniques
9.
Trends Genet ; 22(1): 5-10, 2006 Jan.
Article in English | MEDLINE | ID: mdl-16290136

ABSTRACT

Many conserved non-coding elements (CNEs) in vertebrate genomes have been shown to function as tissue-specific enhancers. However, the target genes of most CNEs are unknown. Here we show that the target genes of duplicated CNEs can be predicted by considering their neighbouring paralogous genes. This enables us to provide the first systematic estimate of the genomic range for distal cis-regulatory interactions in the human genome: half of CNEs are >250 kb away from their associated gene.


Subject(s)
Enhancer Elements, Genetic , RNA, Untranslated/genetics , Animals , Gene Duplication , Genome, Human , Humans , Takifugu/genetics , Transcription Factors/genetics
10.
Bioinformatics ; 22(1): 117-9, 2006 Jan 01.
Article in English | MEDLINE | ID: mdl-16234319

ABSTRACT

SUMMARY: We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length. AVAILABILITY: The software is implemented as a Java applet at http://www.mrc-bsu.cam.ac.uk/personal/thomas/phylo_comparison/comparison_page.html. It is also available on request from the authors.


Subject(s)
Computational Biology/methods , Algorithms , Computer Graphics , HIV/genetics , Internet , Models, Genetic , Models, Statistical , Phylogeny , Programming Languages , Sequence Alignment , Software , User-Computer Interface
11.
Article in English | MEDLINE | ID: mdl-20483234

ABSTRACT

We recently identified approximately 1400 conserved non-coding elements (CNEs) shared by the genomes of fugu (Takifugu rubripes) and human that appear to be associated with developmental regulation in vertebrates [Woolfe, A., Goodson, M., Goode, D.K., Snell, P., McEwen, G.K., Vavouri, T., Smith, S.F., North, P., Callaway, H., Kelly, K., Walter, K., Abnizova, I., Gilks, W., Edwards, Y.J.K., Cooke, J.E., Elgar, G., 2005. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3 (1), e7]. This study encompassed a multi-disciplinary approach using bioinformatics, statistical methods and functional assays to identify and characterise the CNEs. Using an in vivo enhancer assay, over 90% of tested CNEs up-regulate tissue-specific GFP expression. Here we review our group's research in the field of characterising non-coding sequences conserved in vertebrates. We take this opportunity to discuss our research in progress and present some results of new and additional analyses. These include a phylogenomics analysis of CNEs, sequence conservation patterns in vertebrate CNEs and the distribution of human SNPs in the CNEs. We highlight the usefulness of the CNE dataset to help correlate genetic variation in health and disease. We also discuss the functional analysis using the enhancer assay and the enrichment of predicted transcription factor binding sites for two CNEs. Public access to the CNEs plus annotation is now possible and is described. The content of this review was presented by Dr. Y.J.K. Edwards at the TODAI International Symposium on Functional Genomics of the Pufferfish, Tokyo, Japan, 3-6 November 2004.

12.
BMC Bioinformatics ; 6: 302, 2005 Dec 14.
Article in English | MEDLINE | ID: mdl-16354297

ABSTRACT

BACKGROUND: One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics. RESULTS: Here, we invert the logic of this process, by considering the mapping of sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of labelled proteins according to a functional classification scheme, and the subsequent use of sequence similarity allows defining the membership of new proteins to these functional classes. In this framework, we define the Correspondence Indicators as measures of relationship between sequence and function and further formulate two Bayesian approaches to estimate the probability for a sequence of unknown function to belong to a functional class. This approach allows the parametrisation of different sequence search strategies and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases. CONCLUSION: The performance of this method is significantly higher than the simple strategy of transferring the annotation from the highest scoring BLAST match and is expected to find applications in automated functional annotation pipelines.


Subject(s)
Algorithms , Enzymes/chemistry , Enzymes/metabolism , Models, Biological , Models, Chemical , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Conserved Sequence , Documentation/methods , Evolution, Molecular , Models, Statistical , Molecular Sequence Data , Sequence Homology, Amino Acid
13.
Bioinformatics ; 21 Suppl 2: ii137-43, 2005 Sep 01.
Article in English | MEDLINE | ID: mdl-16204093

ABSTRACT

MOTIVATION: It is widely acknowledged that microarray data are subject to high noise levels and results are often platform dependent. Therefore, microarray experiments should be replicated several times and in several laboratories before the results can be relied upon. To make the best use of such extensive datasets, methods for microarray data fusion are required. Ideally, the fused data should distil important aspects of the data while suppressing unwanted sources of variation and be amenable to further informal and formal methods of analysis. Also, the variability in the quality of experimentation should be taken into account. RESULTS: We present such an approach to data fusion, based on multivariate regression. We apply our methodology to data from a previous study on cell-cycle control in Schizosaccharomyces pombe. AVAILABILITY: The algorithm implemented in R is freely available from the authors on request.


Subject(s)
Algorithms , Databases, Protein , Gene Expression Profiling/methods , Information Storage and Retrieval/methods , Oligonucleotide Array Sequence Analysis/methods , Schizosaccharomyces/cytology , Schizosaccharomyces/metabolism , Cell Cycle/physiology , Cell Cycle Proteins/metabolism , Data Interpretation, Statistical , Database Management Systems , Multivariate Analysis , Regression Analysis , Systems Integration
14.
BMC Bioinformatics ; 6: 234, 2005 Sep 26.
Article in English | MEDLINE | ID: mdl-16185360

ABSTRACT

BACKGROUND: A common feature of microarray experiments is the occurrence of missing gene expression data. These missing values occur for a variety of reasons, in particular, because of the filtering of poor quality spots and the removal of undefined values when a logarithmic transformation is applied to negative background-corrected intensities. The efficiency and power of an analysis performed can be substantially reduced by having an incomplete matrix of gene intensities. Additionally, most statistical methods require a complete intensity matrix. Furthermore, biases may be introduced into analyses through missing information on some genes. Thus methods for appropriately replacing (imputing) missing data and/or weighting poor quality spots are required. RESULTS: We present a likelihood-based method for imputing missing data or weighting poor quality spots that requires a number of biological or technical replicates. This likelihood-based approach assumes that the data for a given spot arising from each channel of a two-dye (two-channel) cDNA microarray comparison experiment independently come from a three-component mixture distribution--the parameters of which are estimated through use of a constrained E-M algorithm. Posterior probabilities of belonging to each component of the mixture distributions are calculated and used to decide whether imputation is required. These posterior probabilities may also be used to construct quality weights that can down-weight poor quality spots in any analysis performed afterwards. The approach is illustrated using data obtained from an experiment to observe gene expression changes with 24 hr paclitaxel (Taxol) treatment on a human cervical cancer derived cell line (HeLa). CONCLUSION: As the quality of microarray experiments affects downstream processes, it is important to have a reliable and automatic method of identifying poor quality spots and arrays. We propose a method of identifying poor quality spots, and suggest a method of repairing the arrays by either imputation or assigning quality weights to the spots. This repaired data set would be less biased and can be analysed using any of the appropriate statistical methods found in the microarray literature.


Subject(s)
Algorithms , Likelihood Functions , Oligonucleotide Array Sequence Analysis/methods , Uterine Cervical Neoplasms/genetics , Female , HeLa Cells , Humans , Models, Statistical , Quality Control
15.
Trends Genet ; 21(8): 436-40, 2005 Aug.
Article in English | MEDLINE | ID: mdl-15979195

ABSTRACT

In a recent study, 1373 highly conserved non-coding elements (CNEs) were detected by aligning the human and Takifugu rubripes (Fugu) genomes. The remarkable degree of sequence conservation in CNEs compared with their surroundings suggested comparing the base composition within CNEs with their 5' and 3' flanking regions. The analysis reveals a novel, sharp and distinct signal of nucleotide frequency bias precisely at the border between CNEs and flanking regions.
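
Comparing base composition across the CNE border amounts to tabulating nucleotide frequencies position by position in windows aligned on that border. A minimal sketch follows; the window representation (equal-length strings centred on the border) and the function name are assumptions for illustration, and no data from the study are reproduced.

```python
from collections import Counter


def position_frequencies(windows):
    """Per-position nucleotide frequencies for border-aligned windows.

    windows: equal-length DNA strings, each aligned so that the same index
    corresponds to the same offset from the CNE border.
    Returns a list with one {base: frequency} dict per position.
    """
    if not windows:
        raise ValueError("at least one window is required")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    freqs = []
    for i in range(length):
        column = Counter(w[i] for w in windows)
        freqs.append({base: column[base] / len(windows) for base in "ACGT"})
    return freqs
```

Plotting, for example, the A+T frequency against the offset from the border would then show whether composition shifts sharply at the boundary.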


Subject(s)
DNA/genetics , Vertebrates/genetics , Animals , Base Composition , Conserved Sequence , DNA/chemistry , Humans , Takifugu/genetics
16.
Blood ; 106(3): 1003-7, 2005 Aug 01.
Article in English | MEDLINE | ID: mdl-15827132

ABSTRACT

Cluster of differentiation (CD) antigens are expressed on cells of myeloid and lymphoid lineages. As most disease processes involve immune system activation or suppression, these antigens offer unique opportunities for monitoring host responses. Immunophenotyping using limited numbers of CD antigens enables differentiation states of immune system cells to be determined. Extended phenotyping involving parallel measurement of multiple CD antigens may help identify expression pattern signatures associated with specific disease states. To explore this possibility we have made a CD monoclonal antibody array and scanner, enabling the parallel immunophenotyping of leukocyte cell suspensions in a single and rapid analysis. To demonstrate this approach, we used the specific example of patients infected with human immunodeficiency virus type-1 (HIV-1). An invariant HIV-induced CD antigen signature has been defined that is both robust and independent of clinical outcome, composed of a unique profile of CD antigen expression levels that are both increased and decreased relative to internal controls. The results indicate that HIV-induced changes in CD antigen expression are disease specific and independent of outcome. Their invariant nature indicates an irreversible component to retroviral infection and suggests the utility of CD antigen expression patterns in other disease settings.


Subject(s)
Antigens, CD/analysis , HIV Infections/pathology , Immunophenotyping/methods , Antibodies, Monoclonal , Gene Expression Regulation , HIV Infections/immunology , Humans , Leukocytes/chemistry , Leukocytes/pathology , Leukocytes/virology , Protein Array Analysis/methods
17.
BMC Bioinformatics ; 6: 109, 2005 Apr 27.
Article in English | MEDLINE | ID: mdl-15857505

ABSTRACT

BACKGROUND: This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. RESULTS: We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se. Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA. CONCLUSION: We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.


Subject(s)
Computational Biology/methods , DNA/chemistry , Drosophila melanogaster/genetics , Genome , Sequence Analysis, DNA , Algorithms , Amino Acid Motifs , Animals , Base Sequence , Binding Sites , Cell Nucleus/metabolism , Chromatin/metabolism , Cluster Analysis , Databases, Genetic , Genes, Insect , Genes, Regulator , Genomics , Models, Statistical , Molecular Sequence Data , Transcription, Genetic
18.
Math Biosci ; 193(2): 223-34, 2005 Feb.
Article in English | MEDLINE | ID: mdl-15748731

ABSTRACT

Databases of protein sequences have grown rapidly in recent years as a result of genome sequencing projects. Annotating protein sequences with descriptions of their biological function ideally requires careful experimentation, but this work lags far behind. Instead, biological function is often imputed by copying annotations from similar protein sequences. This gives rise to annotation errors, and more seriously, to chains of misannotation. [Percolation of annotation errors in a database of protein sequences (2002)] developed a probabilistic framework for exploring the consequences of this percolation of errors through protein databases, and applied their theory to a simple database model. Here we apply the theory to hierarchically structured protein sequence databases, and draw conclusions about database quality at different levels of the hierarchy.


Subject(s)
Amino Acid Sequence , Databases, Protein/standards , Sequence Homology, Amino Acid
19.
Bioinformatics ; 21(7): 993-1001, 2005 Apr 01.
Article in English | MEDLINE | ID: mdl-15509600

ABSTRACT

MOTIVATION: Several methods have recently been developed to analyse large-scale sets of physical interactions between proteins in terms of physical contacts between the constituent domains, often with a view to predicting new pairwise interactions. Our aim is to combine genomic interaction data, in which domain-domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs. Our approach is driven by the need to assess the evidence for physical contacts between domains in a statistically rigorous way. RESULTS: We develop a statistical approach that assigns p-values to pairs of domain superfamilies, measuring the strength of evidence within a set of protein interactions that domains from these superfamilies form contacts. A set of p-values is calculated for SCOP superfamily pairs, based on a pooled data set of interactions from yeast. These p-values can be used to predict which domains come into contact in an interacting protein pair. This predictive scheme is tested against protein complexes in the Protein Quaternary Structure (PQS) database, and is used to predict domain-domain contacts within 705 interacting protein pairs taken from our pooled data set.


Subject(s)
Algorithms , Databases, Protein , Models, Chemical , Protein Interaction Mapping/methods , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Binding Sites , Computer Simulation , Models, Statistical , Protein Binding , Protein Structure, Tertiary , Saccharomyces cerevisiae Proteins/analysis , Saccharomyces cerevisiae Proteins/classification , Structure-Activity Relationship
20.
Bioinformatics ; 18(12): 1641-9, 2002 Dec.
Article in English | MEDLINE | ID: mdl-12490449

ABSTRACT

Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified, information on protein function lags a long way behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. Now, the functional annotation in these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term 'error percolation'. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. By exploring the consequences of the model for annotation quality it is evident that this iterative approach leads to a systematic deterioration of database quality.
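
The deterioration described here can be illustrated with a toy recursion. This is not the paper's dynamical probabilistic model; it is a deliberately simple sketch under assumptions made for the example: each new entry copies its annotation from a uniformly chosen existing entry, the copy is wrong with a fixed probability `copy_error` even when the source is correct, and an incorrect source always yields an incorrect copy.

```python
def expected_correct_fraction(n_seed, n_final, copy_error):
    """Expected fraction of correct annotations as a database grows.

    Starts with n_seed entries, all correct, and adds one entry at a time
    until the database holds n_final entries. Returns the fraction after
    each step, beginning with the seed database.
    """
    fraction = 1.0
    history = [fraction]
    for n in range(n_seed, n_final):
        # The new entry is correct only if its source is correct AND the
        # transfer by similarity does not itself introduce an error.
        new_entry_correct = (1.0 - copy_error) * fraction
        fraction = (n * fraction + new_entry_correct) / (n + 1)
        history.append(fraction)
    return history
```

Under these assumptions the update simplifies to multiplying the fraction by (n + 1 - copy_error) / (n + 1) at each step, so the expected quality decreases monotonically whenever copy_error is positive, which is the qualitative effect the abstract calls error percolation.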


Subject(s)
Databases, Protein , Documentation , Models, Statistical , Proteins/chemistry , Sequence Analysis, Protein/methods , Animals , False Positive Reactions , Humans , Information Storage and Retrieval/methods , Models, Genetic , Models, Molecular , Proteins/classification , Proteins/genetics , Quality Control , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods