Results 1 - 18 of 18
1.
Bioinformatics ; 39(39 Suppl 1): i357-i367, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387189

ABSTRACT

The tendency of an amino acid to adopt certain configurations in folded proteins is treated here as a statistical estimation problem. We model the joint distribution of the observed mainchain and sidechain dihedral angles (〈ϕ,ψ,χ1,χ2,…〉) of any amino acid by a mixture of products of von Mises probability distributions. This mixture model maps any vector of dihedral angles to a point on a multi-dimensional torus. The continuous space it uses to specify the dihedral angles provides an alternative to the commonly used rotamer libraries. These rotamer libraries discretize the space of dihedral angles into coarse angular bins, and cluster combinations of sidechain dihedral angles (〈χ1,χ2,…〉) as a function of backbone 〈ϕ,ψ〉 conformations. A 'good' model is one that is both concise and explains (compresses) observed data. Competing models can be compared directly, and in particular our model is shown to outperform the Dunbrack rotamer library in terms of model complexity (by three orders of magnitude) and fidelity (on average 20% more compression) when losslessly explaining the observed dihedral angle data across experimental resolutions of structures. Our method is unsupervised (with parameters estimated automatically) and uses information theory to determine the optimal complexity of the statistical model, thus avoiding under/over-fitting, a common pitfall in model selection problems. Our models are computationally inexpensive to sample from and are geared to support a number of downstream studies, including experimental structure refinement, de novo protein design, and protein structure prediction. We call our collection of mixture models PhiSiCal (ϕψχal). AVAILABILITY AND IMPLEMENTATION: PhiSiCal mixture models and programs to sample from them are available for download at http://lcb.infotech.monash.edu.au/phisical.


Subjects
Data Compression, Libraries, Amino Acids, Gene Library, Information Theory
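
The mixture density described in the abstract above can be illustrated with a short Python sketch. This is not the PhiSiCal implementation: the function names, the component layout (weight, means, concentrations), the series-based Bessel function, and the way a stated angular precision is turned into a code length are all assumptions made for illustration.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function I0 via its power series (adequate for moderate kappa)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
    return total

def von_mises_pdf(theta, mu, kappa):
    """Density of one von Mises distribution on the circle (angles in radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(angles, components):
    """Mixture of products of von Mises densities.

    `components` is a hypothetical layout: a list of
    (weight, means, concentrations) with one mean and one
    concentration per dihedral angle, and weights summing to 1.
    """
    total = 0.0
    for weight, means, kappas in components:
        density = weight
        for theta, mu, kappa in zip(angles, means, kappas):
            density *= von_mises_pdf(theta, mu, kappa)
        total += density
    return total

def code_length_bits(angles, components, precision):
    """Bits needed to state `angles` to the given precision (radians) under the
    mixture; an assumed, simplified stand-in for a lossless encoding cost."""
    cell_probability = mixture_pdf(angles, components) * precision ** len(angles)
    return -math.log2(cell_probability)
```

With made-up parameters, `mixture_pdf((-1.1, 2.4, 1.0), [(0.7, (-1.0, 2.5, 1.1), (8.0, 6.0, 4.0)), (0.3, (-2.0, 2.0, -1.0), (3.0, 3.0, 2.0))])` evaluates a two-component model over 〈ϕ,ψ,χ1〉. A model that assigns higher density to the observed angles yields a shorter code length, which is the sense in which a model "compresses" the data.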
2.
Bioinformatics ; 38(Suppl 1): i229-i237, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758809

ABSTRACT

SUMMARY: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify them and model them together systematically. To overcome this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content using each model to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM results in a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyze the statistical properties of the MMLSUM model and contrast it with others. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Amino Acids, Statistical Models, Algorithms, Amino Acid Sequence, Benchmarking
3.
Bioinformatics ; 38(Suppl 1): i255-i263, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758808

ABSTRACT

MOTIVATION: Alignments are correspondences between sequences. How reliable are alignments of amino acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, by weighting every possible sequence alignment by its posterior probability we derive a formal mathematical expectation, and develop an efficient algorithm for computation of the distance between alternative alignments allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments. RESULTS: By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the 'daylight', 'twilight' and 'midnight' zones for interpreting residue-residue correspondences from sequence information alone. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Amino Acids, Proteins, Algorithms, Amino Acid Sequence, Proteins/chemistry, Reproducibility of Results, Sequence Alignment, Amino Acid Sequence Homology
4.
Proteins ; 88(12): 1557-1558, 2020 12.
Article in English | MEDLINE | ID: mdl-32662915

ABSTRACT

We have modeled modifications of a known ligand to the SARS-CoV-2 (COVID-19) protease, that can form a covalent adduct, plus additional ligand-protein hydrogen bonds.


Subjects
Antiviral Agents, Aphids, Coronavirus Infections, Insecticides, Pandemics, Viral Pneumonia, Acetylcholinesterase, Animals, Betacoronavirus, COVID-19, Cysteine Endopeptidases, Humans, Molecular Docking Simulation, Protease Inhibitors, SARS-CoV-2, Viral Nonstructural Proteins
5.
Bioinformatics ; 35(14): i360-i369, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510703

ABSTRACT

The information criterion of minimum message length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field, by systematically addressing the problem of arbitrariness in alignment parameters, and the disconnect between substitution scores and gap costs. Furthermore, our framework enables the generation of marginal probability landscapes over all possible alignment hypotheses, with the potential to help users simultaneously rationalize and assess competing alignment relationships between protein sequences, beyond simply reporting a single (best) alignment. We demonstrate the performance of our program on benchmarks containing distantly related protein sequences. AVAILABILITY AND IMPLEMENTATION: The open-source program supporting this work is available from: http://lcb.infotech.monash.edu.au/seqmmligner. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression, Amino Acid Sequence, Statistical Models, Probability, Proteins, Sequence Alignment, Software
6.
Bioinformatics ; 33(7): 1005-1013, 2017 04 01.
Article in English | MEDLINE | ID: mdl-28065899

ABSTRACT

Motivation: Structural molecular biology depends crucially on computational techniques that compare protein three-dimensional structures and generate structural alignments (the assignment of one-to-one correspondences between subsets of amino acids based on atomic coordinates). Despite its importance, the structural alignment problem has not been formulated, much less solved, in a consistent and reliable way. To overcome these difficulties, we present here a statistical framework for the precise inference of structural alignments, built on the Bayesian and information-theoretic principle of Minimum Message Length (MML). The quality of any alignment is measured by its explanatory power, that is, the amount of lossless compression achieved to explain the protein coordinates using that alignment. Results: We have implemented this approach in MMLigner, the first program able to infer statistically significant structural alignments. We also demonstrate the reliability of MMLigner's alignment results when compared with the state of the art. Importantly, MMLigner can also discover different structural alignments of comparable quality, a challenging problem for oligomers and protein complexes. Availability and Implementation: Source code, binaries and an interactive web version are available at http://lcb.infotech.monash.edu.au/mmligner. Contact: arun.konagurthu@monash.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression, Statistical Models, Proteins/chemistry, Sequence Alignment, Algorithms, Bayes Theorem, Reproducibility of Results, Software
7.
Mol Biol Evol ; 33(5): 1349-57, 2016 05.
Article in English | MEDLINE | ID: mdl-26912811

ABSTRACT

Methods for measuring genetic distances in phylogenetics are known to be sensitive to the evolutionary model assumed. However, there is a lack of established methodology to accommodate the trade-off between incorporating sufficient biological reality and avoiding model overfitting. In addition, as traditional methods measure distances based on the observed number of substitutions, they tend to underestimate distances between diverged sequences due to backward and parallel substitutions. Various techniques have been proposed to correct this, but they lack robustness against sequences that are distantly related and of unequal base frequencies. In this article, we present a novel genetic distance estimate based on information theory that overcomes the above two hurdles. Instead of examining the observed number of substitutions, this method estimates genetic distances using Shannon's mutual information. This naturally provides an effective framework for balancing model complexity and goodness of fit. Our distance estimate is shown to be approximately linear in elapsed time and hence is less sensitive to the divergence of sequence data and to compositionally biased sequences. Using extensive simulation data, we show that our method 1) consistently reconstructs more accurate phylogeny topologies than existing methods, 2) is robust in extreme conditions such as diverged phylogenies, data with unequal base frequencies, and heterogeneous mutation patterns, and 3) scales well with large phylogenies.


Subjects
Biological Evolution, Genetic Models, Sequence Analysis/methods, Algorithms, Base Composition, Computer Simulation, Molecular Evolution, Genetic Variation, Information Theory, Phylogeny
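
As a rough illustration of the quantity named in the abstract above, the following Python sketch computes a plug-in estimate of Shannon's mutual information between the columns of two aligned sequences. It is an assumption-laden simplification, not the estimator or distance of the paper: the function name is hypothetical, the sequences are assumed to be pre-aligned and of equal length, and no model of evolution or model-complexity term is included.

```python
import math
from collections import Counter

def mutual_information_bits(seq_x, seq_y):
    """Plug-in mutual information (bits per aligned position) between two
    equal-length sequences, using observed symbol-pair frequencies."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_x)
    joint = Counter(zip(seq_x, seq_y))  # counts of aligned symbol pairs
    count_x = Counter(seq_x)            # marginal counts for the first sequence
    count_y = Counter(seq_y)            # marginal counts for the second sequence
    total = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts
        total += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return total
```

Two identical sequences with uniform base composition share the full 2 bits per position, while sequences whose aligned bases are statistically independent share none; intuitively, more diverged sequences share less information.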
8.
Bioinformatics ; 30(17): i512-8, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161241

ABSTRACT

MOTIVATION: Progress in protein biology depends on the reliability of results from a handful of computational techniques, structural alignments being one. Recent reviews have highlighted substantial inconsistencies and differences between alignment results generated by the ever-growing stock of structural alignment programs. The lack of consensus on how the quality of structural alignments must be assessed has been identified as the main cause for the observed differences. Current methods assess structural alignment quality by constructing a scoring function that attempts to balance conflicting criteria, mainly alignment coverage and fidelity of structures under superposition. This traditional approach to measuring alignment quality, the subject of considerable literature, has failed to solve the problem. Further development along the same lines is unlikely to rectify the current deficiencies in the field. RESULTS: This paper proposes a new statistical framework to assess structural alignment quality and significance based on lossless information compression. This is a radical departure from the traditional approach of formulating scoring functions. It links the structural alignment problem to the general class of statistical inductive inference problems, solved using the information-theoretic criterion of minimum message length. Based on this, we developed an efficient and reliable measure of structural alignment quality, I-value. The performance of I-value is demonstrated in comparison with a number of popular scoring functions, on a large collection of competing alignments. Our analysis shows that I-value provides a rigorous and reliable quantification of structural alignment quality, addressing a major gap in the field. AVAILABILITY: http://lcb.infotech.monash.edu.au/I-value. SUPPLEMENTARY INFORMATION: Online supplementary data are available at http://lcb.infotech.monash.edu.au/I-value/suppl.html.


Subjects
Protein Structural Homology, Algorithms, Data Compression, Statistical Data Interpretation
9.
Acta Crystallogr D Biol Crystallogr ; 70(Pt 3): 904-6, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24598758

ABSTRACT

Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each Cα coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.1 Å.


Subjects
Protein Databases/standards, User-Computer Interface, X-Ray Crystallography/standards, Chemical Dictionaries as Topic, Magnetic Resonance Spectroscopy/standards, Electron Microscopy/standards, Predictive Value of Tests, Random Allocation
10.
Bioinformatics ; 28(12): i97-105, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22689785

ABSTRACT

MOTIVATION: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data. RESULTS: We report a new method to infer secondary structure based on the Bayesian method of minimum message length inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximize the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program SST based on this approach and compared it with popular programs, such as DSSP and STRIDE among others. Our evaluation suggests that SST gives reliable assignments even on low-resolution structures. AVAILABILITY: http://www.csse.monash.edu.au/~karun/sst.


Subjects
Computational Biology/methods, Protein Secondary Structure, Proteins/analysis, Algorithms, Bayes Theorem, Theoretical Models
11.
Bioinformatics ; 27(13): i43-51, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685100

ABSTRACT

UNLABELLED: Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features (that is, helices and strands of sheet) by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist to identify secondary structure automatically, the imprecision of definitions, along with errors and inconsistencies in experimental structure data, drastically limits their applicability to generate reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure. AVAILABILITY: http://www.csse.monash.edu.au/~karun/pmml.


Subjects
Algorithms, Proteins/chemistry, Clostridium/chemistry, Hydrogen Bonding, Molecular Models, Protein Folding, Protein Secondary Structure, Proteins/metabolism
12.
Adv Exp Med Biol ; 696: 657-66, 2011.
Article in English | MEDLINE | ID: mdl-21431607

ABSTRACT

A biological compression model, expert model, is presented which is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, the model provides a framework for knowledge discovery from biological data. It can be used for repeat element discovery, sequence alignment and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences where conventional knowledge discovery tools often fail.


Subjects
Algorithms, Data Compression/statistics & numerical data, Computational Biology, Expert Systems, Human Genome, Genomics/statistics & numerical data, Humans, Information Theory, Knowledge Bases, Genetic Models, Statistical Models, Phylogeny, Repetitive Nucleic Acid Sequences, Sequence Alignment/statistics & numerical data
13.
BMC Bioinformatics ; 11: 599, 2010 Dec 16.
Article in English | MEDLINE | ID: mdl-21159205

ABSTRACT

BACKGROUND: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique. RESULTS: Experiments on both simulated data and real data show that XMAligner is superior to conventional methods, especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing. CONCLUSIONS: The information-theoretic approach for sequence alignment is shown to overcome the above-mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment. AVAILABILITY: Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.


Subjects
Algorithms, Data Compression, Sequence Alignment/methods, Base Sequence, Genomics/methods, Theoretical Models, Software
14.
Front Mol Biosci ; 7: 612920, 2020.
Article in English | MEDLINE | ID: mdl-33996891

ABSTRACT

What is the architectural "basis set" of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a dictionary of 1,493 substructures, called concepts, typically at a subdomain level, based on an unbiased subset of known protein structures. Each concept represents a topologically conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the Protein Data Bank and completely inventoried all the concept instances. This yields many insights, including correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence-structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies. An interactive site, Proçodic, at http://lcb.infotech.monash.edu.au/prosodic (click), provides access to and navigation of the entire dictionary of concepts and their usages, and all associated information. This report is part of a continuing programme with the goal of elucidating fundamental principles of protein architecture, in the spirit of the work of Cyrus Chothia.

15.
Methods Mol Biol ; 1958: 123-131, 2019.
Article in English | MEDLINE | ID: mdl-30945216

ABSTRACT

We recently developed an unsupervised Bayesian inference methodology to automatically infer a dictionary of protein supersecondary structures (Subramanian et al., IEEE data compression conference proceedings (DCC), 340-349, 2017). Specifically, this methodology uses the information-theoretic framework of the minimum message length (MML) criterion for hypothesis selection (Wallace, Statistical and inductive inference by minimum message length, Springer Science & Business Media, New York, 2005). The best dictionary of supersecondary structures is the one that yields the most (lossless) compression on the source collection of folding patterns represented as tableaux (matrix representations that capture the essence of protein folding patterns; Lesk, J Mol Graph. 13:159-164, 1995). This book chapter outlines our MML methodology for inferring the supersecondary structure dictionary. The inferred dictionary is available at http://lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.html.


Subjects
Amino Acid Motifs, Computational Biology/methods, Proteins/chemistry, Algorithms, Bayes Theorem, Data Compression, Humans, Molecular Models, Protein Folding
16.
BMC Bioinformatics ; 8 Suppl 2: S10, 2007 May 03.
Article in English | MEDLINE | ID: mdl-17493248

ABSTRACT

BACKGROUND: Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X can find what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features exposed by such models. RESULTS: We apply such a model to find features across chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2. CONCLUSION: The methodology finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast and the saved results are self-documenting.


Subjects
Algorithms, DNA/chemistry, DNA/genetics, Sequence Alignment/methods, DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Base Sequence, Information Storage and Retrieval/methods, Information Theory, Molecular Sequence Data
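
The idea of turning a DNA sequence into an "information sequence" can be sketched in Python as follows. This is a deliberately simple stand-in, assuming an adaptive fixed-order context model with add-one smoothing; the models described in the abstract above also account for repeats and differences between repeats, which this sketch does not. The function name and its parameters are hypothetical.

```python
import math
from collections import defaultdict

def information_sequence(seq, order=2, alphabet="ACGT"):
    """Per-base information content (bits) under an adaptive order-`order`
    context model. Every symbol of `seq` is assumed to be in `alphabet`."""
    # Each context starts with a count of 1 per symbol (add-one smoothing).
    counts = defaultdict(lambda: {symbol: 1 for symbol in alphabet})
    info = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        probability = table[base] / sum(table.values())
        info.append(-math.log2(probability))  # cost of this base given the past
        table[base] += 1                      # learn only after encoding the base
    return info
```

Low values in the returned list mark positions that the model predicts well (for example inside a low-complexity run), and the sum of the list is the total code length of the sequence under this model. Memory and time grow linearly with sequence length for a fixed order.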
17.
J Comput Biol ; 22(6): 487-97, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25695500

ABSTRACT

The problem of superposition of two corresponding vector sets by minimizing their sum-of-squares error under orthogonal transformation is a fundamental task in many areas of science, notably structural molecular biology. This problem can be solved exactly using an algorithm whose time complexity grows linearly with the number of correspondences. This efficient solution has facilitated the widespread use of the superposition task, particularly in studies involving macromolecular structures. This article formally derives a set of sufficient statistics for the least-squares superposition problem. These statistics are additive. This permits a highly efficient (constant time) computation of superpositions (and sufficient statistics) of vector sets that are composed from its constituent vector sets under addition or deletion operation, where the sufficient statistics of the constituent sets are already known (that is, the constituent vector sets have been previously superposed). This results in a drastic improvement in the run time of the methods that commonly superpose vector sets under addition or deletion operations, where previously these operations were carried out ab initio (ignoring the sufficient statistics). We experimentally demonstrate the improvement our work offers in the context of protein structural alignment programs that assemble a reliable structural alignment from well-fitting (substructural) fragment pairs. A C++ library for this task is available online under an open-source license.


Subjects
Proteins/chemistry, Algorithms, Least-Squares Analysis, Molecular Models, Protein Conformation, Sequence Alignment/methods
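
The additivity property described in the abstract above can be illustrated with a small Python sketch. The particular accumulators used here (count, coordinate sums, sum of squared norms, and the 3x3 sum of outer products) are an assumption about one natural choice; the paper's exact statistics and its C++ library interface may differ. The sketch stops at the centred cross-covariance matrix and does not solve for the optimal rotation.

```python
def stats_from_pairs(xs, ys):
    """Accumulators for corresponding 3D vector sets xs[i] <-> ys[i]."""
    n = len(xs)
    sum_x = [sum(p[k] for p in xs) for k in range(3)]
    sum_y = [sum(p[k] for p in ys) for k in range(3)]
    sum_sq = sum(c * c for p in xs for c in p) + sum(c * c for p in ys for c in p)
    cross = [[sum(x[a] * y[b] for x, y in zip(xs, ys)) for b in range(3)]
             for a in range(3)]
    return (n, sum_x, sum_y, sum_sq, cross)

def combine(s, t, sign=1):
    """Add (sign=+1) or remove (sign=-1) the correspondences summarised by t.
    The cost is independent of how many vectors s and t summarise."""
    n = s[0] + sign * t[0]
    sum_x = [a + sign * b for a, b in zip(s[1], t[1])]
    sum_y = [a + sign * b for a, b in zip(s[2], t[2])]
    sum_sq = s[3] + sign * t[3]
    cross = [[s[4][a][b] + sign * t[4][a][b] for b in range(3)] for a in range(3)]
    return (n, sum_x, sum_y, sum_sq, cross)

def centred_cross_covariance(s):
    """Cross-covariance about the two centroids, recovered from the sums alone."""
    n, sum_x, sum_y, _, cross = s
    return [[cross[a][b] - sum_x[a] * sum_y[b] / n for b in range(3)]
            for a in range(3)]
```

Because every accumulator is a plain sum over correspondences, the statistics of a merged fragment pair equal the element-wise sum of the statistics of its parts, so no pass over the original coordinates is needed when fragments are joined or trimmed.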
18.
Bioinformatics ; 19(10): 1294-5, 2003 Jul 01.
Article in English | MEDLINE | ID: mdl-12835276

ABSTRACT

UNLABELLED: Described is an algorithm to find the longest interval having at least a specified minimum bias in a sequence of characters (bases, amino acids), e.g. 'at least 0.95 (A+T)-rich'. It is based on an algorithm to find the longest interval having a non-negative sum in a sequence of positive and negative numbers. In practice, it runs in linear time; this can be guaranteed if the bias is rational. AVAILABILITY: Java code of the algorithm can be found at http://www.csse.monash.edu.au/~lloyd/tildeProgLang/Java2/Biased/. SUPPLEMENTARY INFORMATION: Examples of applications to Plasmodium falciparum genomic DNA can be found at the above URL.


Subjects
Algorithms, Statistical Models, Automated Pattern Recognition, Sequence Alignment/methods, Sequence Analysis/methods
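
The reduction described in the abstract above, from "longest interval with at least a given bias" to "longest interval with a non-negative sum", can be sketched in Python. This is not the published Java code: it uses a standard monotonic-stack technique that may differ from the paper's algorithm, the function names are hypothetical, and the bias is assumed to be given as a rational number `num/den` so that all scores are integers.

```python
def longest_nonnegative_interval(values):
    """Return (start, end) of the longest half-open interval values[start:end]
    whose sum is >= 0, in time linear in len(values)."""
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    # Candidate starts: positions where the prefix sum reaches a new strict minimum.
    stack = []
    for i, p in enumerate(prefix):
        if not stack or p < prefix[stack[-1]]:
            stack.append(i)
    best = (0, 0)
    # Scan ends from the right; each candidate start is popped at most once.
    for j in range(len(prefix) - 1, -1, -1):
        while stack and prefix[stack[-1]] <= prefix[j]:
            i = stack.pop()
            if j - i > best[1] - best[0]:
                best = (i, j)
    return best

def longest_biased_interval(seq, favoured, num, den):
    """Longest interval of seq in which at least num/den of the characters
    belong to `favoured` (for example favoured="AT", num=95, den=100)."""
    # A favoured character scores den - num, any other scores -num, so an
    # interval meets the bias exactly when its total score is >= 0.
    scores = [den - num if ch in favoured else -num for ch in seq]
    return longest_nonnegative_interval(scores)
```

For example, `longest_biased_interval(sequence, "AT", 95, 100)` looks for the longest stretch that is at least 0.95 (A+T)-rich; slicing `sequence[start:end]` with the returned pair gives the interval itself, and `(0, 0)` means no non-empty interval qualifies.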