Results 1 - 20 of 26
1.
Bioinformatics ; 37(20): 3456-3463, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33983436

ABSTRACT

MOTIVATION: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. RESULTS: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. AVAILABILITY AND IMPLEMENTATION: The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
PLoS Comput Biol ; 14(12): e1006237, 2018 12.
Article in English | MEDLINE | ID: mdl-30596639

ABSTRACT

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.


Subjects
Binding Sites/physiology , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Models, Molecular , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
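
Entry 2 above estimates, as a p-value, how strongly high direct coupling scores are associated with 3D contacts. The abstract does not spell out the statistic, so the following is only a simplified stand-in under stated assumptions: a hypergeometric tail test asking how surprising it is that a given number of the top-scoring residue pairs are structural contacts. The function and parameter names are hypothetical and this is not the authors' published procedure.

```python
from math import comb


def contact_enrichment_pvalue(n_pairs, n_contacts, n_top, n_top_contacts):
    """Hypergeometric upper-tail probability (illustrative assumption).

    n_pairs        -- total residue pairs considered
    n_contacts     -- how many of those pairs are 3D contacts
    n_top          -- number of highest-scoring pairs examined
    n_top_contacts -- contacts observed among those top pairs

    Returns P(X >= n_top_contacts) if the top pairs were a random draw,
    without replacement, from all pairs.
    """
    upper = min(n_top, n_contacts)
    total = comb(n_pairs, n_top)
    tail = sum(
        comb(n_contacts, x) * comb(n_pairs - n_contacts, n_top - x)
        for x in range(n_top_contacts, upper + 1)
    )
    return tail / total


# Toy usage: 10 pairs, 3 of them contacts, and all 3 top-scoring pairs are contacts.
p = contact_enrichment_pvalue(10, 3, 3, 3)
```

A small p-value here would only say that the overlap is unlikely under random ranking; it makes no attempt to separate direct from indirect couplings, which the abstract names as a separate question.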
3.
PLoS Comput Biol ; 12(12): e1005294, 2016 12.
Article in English | MEDLINE | ID: mdl-28002465

ABSTRACT

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).


Subjects
Acetyltransferases/chemistry , Models, Molecular , Sequence Analysis, Protein/methods , Acetyltransferases/genetics , Acetyltransferases/metabolism , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology , Humans , Markov Chains , Monte Carlo Method , Sequence Alignment/methods
4.
PLoS Comput Biol ; 12(5): e1004936, 2016 05.
Article in English | MEDLINE | ID: mdl-27192614

ABSTRACT

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.


Subjects
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Databases, Protein , Markov Chains , Monte Carlo Method , Sequence Alignment/standards , Software
5.
Bioinformatics ; 31(3): 324-31, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25294922

ABSTRACT

MOTIVATION: DNA and protein patterns are usefully represented by sequence logos. However, the methods for logo generation in common use lack a proper statistical basis, and are non-optimal for recognizing functionally relevant alignment columns. RESULTS: We redefine the information at a logo position as a per-observation multiple alignment log-odds score. Such scores are positive or negative, depending on whether a column's observations are better explained as arising from relatedness or chance. Within this framework, we propose distinct normalized maximum likelihood and Bayesian measures of column information. We illustrate these measures on High Mobility Group B (HMGB) box proteins and a dataset of enzyme alignments. Particularly in the context of protein alignments, our measures improve the discrimination of biologically relevant positions. AVAILABILITY AND IMPLEMENTATION: Our new measures are implemented in an open-source Web-based logo generation program, which is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/index.html. A stand-alone version of the program is also available from this site. CONTACT: altschul@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Bayes Theorem , Position-Specific Scoring Matrices , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Humans , Molecular Sequence Annotation , Molecular Sequence Data , Sequence Homology, Amino Acid
6.
Bioinformatics ; 27(24): 3356-63, 2011 Dec 15.
Article in English | MEDLINE | ID: mdl-21998158

ABSTRACT

MOTIVATION: Pairwise protein sequence alignments are generally evaluated using scores defined as the sum of substitution scores for aligning amino acids to one another, and gap scores for aligning runs of amino acids in one sequence to null characters inserted into the other. Protein profiles may be abstracted from multiple alignments of protein sequences, and substitution and gap scores have been generalized to the alignment of such profiles either to single sequences or to other profiles. Although there is widespread agreement on the general form substitution scores should take for profile-sequence alignment, little consensus has been reached on how best to construct profile-profile substitution scores, and a large number of these scoring systems have been proposed. Here, we assess a variety of such substitution scores. For this evaluation, given a gold standard set of multiple alignments, we calculate the probability that a profile column yields a higher substitution score when aligned to a related than to an unrelated column. We also generalize this measure to sets of two or three adjacent columns. This simple approach has the advantages that it does not depend primarily upon the gold-standard alignment columns with the weakest empirical support, and that it does not need to fit gap and offset costs for use with each substitution score studied. RESULTS: A simple symmetrization of mean profile-sequence scores usually performed the best. These were followed closely by several specific scoring systems constructed using a variety of rationales. CONTACT: altschul@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computational Biology , Eukaryota/chemistry , Probability
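
The evaluation measure in entry 6 above is the probability that a profile column scores higher against a related column than against an unrelated one. Given two lists of substitution scores, that probability can be estimated by comparing every related score with every unrelated score, as in the minimal sketch below. The function name is hypothetical, and counting a tie as half a win is an assumption not stated in the abstract.

```python
def prob_related_scores_higher(related_scores, unrelated_scores):
    """Estimate P(score to a related column > score to an unrelated column).

    Every related score is compared with every unrelated score.
    Ties count as half a win (an assumed convention).
    """
    wins = 0.0
    for r in related_scores:
        for u in unrelated_scores:
            if r > u:
                wins += 1.0
            elif r == u:
                wins += 0.5
    return wins / (len(related_scores) * len(unrelated_scores))


# Toy usage with made-up scores.
estimate = prob_related_scores_higher([3, 2], [1, 2])
```

A value near 1 means the scoring system separates related from unrelated columns well, and a value near 0.5 means it does no better than chance.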
7.
PLoS Comput Biol ; 6(7): e1000852, 2010 Jul 15.
Article in English | MEDLINE | ID: mdl-20657661

ABSTRACT

Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct "BILD" ("Bayesian Integral Log-odds") substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.


Subjects
Computational Biology/methods , Models, Statistical , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Base Sequence , Bayes Theorem , Consensus Sequence , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , Databases, Genetic , Plasmodium , Protein Structure, Tertiary , Toxoplasma
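
Entry 7 above describes log-odds scores for an alignment column, built from a prior over columns of related letters. One way to make that concrete is the sketch below, which assumes a single Dirichlet prior: the probability of the observed letters under the prior (integrated over all letter frequencies) is compared with their probability under independent background frequencies. The prior parameters, background frequencies, and function name are illustrative assumptions, not values from the paper.

```python
from math import lgamma, log


def bild_score(counts, alpha, background):
    """Log-odds score, in bits, for one alignment column.

    counts     -- observed count of each letter in the column
    alpha      -- Dirichlet prior parameters, one per letter (assumed values)
    background -- background letter frequencies, one per letter

    Numerator: probability of the column's letters, integrating the
    multinomial likelihood over the Dirichlet prior.
    Denominator: probability of the same letters drawn independently
    from the background frequencies.
    """
    n = sum(counts)
    a = sum(alpha)
    log_p_related = lgamma(a) - lgamma(a + n)
    for c, al in zip(counts, alpha):
        log_p_related += lgamma(al + c) - lgamma(al)
    log_p_chance = sum(c * log(q) for c, q in zip(counts, background) if c > 0)
    return (log_p_related - log_p_chance) / log(2)


# Toy usage on a 4-letter alphabet with a flat prior.
conserved = bild_score([4, 0, 0, 0], [1, 1, 1, 1], [0.25] * 4)  # positive
mixed = bild_score([1, 1, 1, 1], [1, 1, 1, 1], [0.25] * 4)      # negative
```

With this flat prior a fully conserved column scores above zero and a maximally mixed column scores below zero, which matches the idea of asking whether a column is better explained by relatedness or by chance.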
8.
Nucleic Acids Res ; 37(3): 815-24, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19088134

ABSTRACT

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.


Subjects
Sequence Alignment/methods , Sequence Analysis, Protein/methods , Databases, Protein
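
Entry 8 above turns on how pseudocounts, added to observed counts, shape the column scores of a PSSM. The sketch below shows the general mechanism in its simplest form, assuming pseudocounts are distributed in proportion to background frequencies. This is a deliberately simplified scheme and not PSI-BLAST's actual pseudocount calculation; the function and parameter names are hypothetical.

```python
from math import log2


def column_scores(counts, background, pseudocounts):
    """Log-odds scores (bits) for one alignment column.

    counts       -- observed count of each letter in the column
    background   -- background letter frequencies
    pseudocounts -- total number of pseudocounts added (must be > 0),
                    split among letters in proportion to background
                    (a simplifying assumption)
    """
    if pseudocounts <= 0:
        raise ValueError("pseudocounts must be positive")
    n = sum(counts)
    scores = []
    for c, q in zip(counts, background):
        estimated_freq = (c + pseudocounts * q) / (n + pseudocounts)
        scores.append(log2(estimated_freq / q))
    return scores


# Toy usage: a conserved column on a 4-letter alphabet.
many = column_scores([8, 0, 0, 0], [0.25] * 4, 4.0)
few = column_scores([8, 0, 0, 0], [0.25] * 4, 1.0)
```

In this toy case, using fewer pseudocounts raises the score of the conserved letter and lowers the scores of the unobserved ones, which is the increase in contrast the abstract refers to.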
9.
Sci Rep ; 10(1): 1691, 2020 02 03.
Article in English | MEDLINE | ID: mdl-32015389

ABSTRACT

Protein functional constraints are manifest as superfamily and functional-subgroup conserved residues, and as pairwise correlations. Deep Analysis of Residue Constraints (DARC) aids the visualization of these constraints, characterizes how they correlate with each other and with structure, and estimates statistical significance. This can identify determinants of protein functional specificity, as we illustrate for bacterial DNA clamp loader ATPases. These load ring-shaped sliding clamps onto DNA to keep polymerase attached during replication and contain one δ, three γ, and one δ' AAA+ subunits semi-circularly arranged in the order δ-γ1-γ2-γ3-δ'. Only γ is active, though both γ and δ' functionally influence an adjacent γ subunit. DARC identifies, as functionally-congruent features linking allosterically the ATP, DNA, and clamp binding sites: residues distinctive of γ and of γ/δ' that mutually interact in trans, centered on the catalytic base; several γ/δ'-residues and six γ/δ'-covariant residue pairs within the DNA binding N-termini of helices α2 and α3; and γ/δ'-residues associated with the α2 C-terminus and the clamp-binding loop. Most notable is a trans-acting γ/δ' hydroxyl group that 99% of other AAA+ proteins lack. Mutation of this hydroxyl to a methyl group impedes clamp binding and opening, DNA binding, and ATP hydrolysis, implying a remarkably clamp-loader-specific function.


Subjects
DNA-Binding Proteins/metabolism , Protein Subunits/metabolism , Adenosine Triphosphatases/metabolism , Adenosine Triphosphate/metabolism , Binding Sites/physiology , DNA Polymerase III/metabolism , DNA, Bacterial/metabolism , Escherichia coli/metabolism , Hydrolysis , Protein Structure, Secondary , Sensitivity and Specificity
10.
Bioinformatics ; 24(13): i15-23, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586708

ABSTRACT

MOTIVATION: The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. RESULTS: We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program. AVAILABILITY: The scripts for performing evaluations are available upon request from the authors.


Subjects
Algorithms , Artificial Intelligence , Pattern Recognition, Automated/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Markov Chains , Molecular Sequence Data , Sensitivity and Specificity
11.
Nucleic Acids Res ; 34(20): 5966-73, 2006.
Article in English | MEDLINE | ID: mdl-17068079

ABSTRACT

Protein sequence database search programs may be evaluated both for their retrieval accuracy--the ability to separate meaningful from chance similarities--and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.


Subjects
Databases, Protein , Sequence Alignment , Sequence Analysis, Protein , Data Interpretation, Statistical , Reproducibility of Results , Software
12.
J Comput Biol ; 25(2): 121-129, 2018 02.
Article in English | MEDLINE | ID: mdl-28771374

ABSTRACT

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1[Formula: see text], these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.


Subjects
Computational Biology/methods , GTP Phosphohydrolase-Linked Elongation Factors/chemistry , Saccharomyces cerevisiae Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis
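
Entry 12 above poses an abstract problem: in a string of length L containing D ones, are the ones clustered near the start, and how significantly? The abstract names the minimum description length principle but gives no formulas, so the sketch below is only a simplified two-part code written under that general idea, not the authors' derivation. It compares the bits needed to encode the positions of the ones with and without a cut point separating a dense prefix from the rest. All names and the specific encoding costs are assumptions.

```python
from math import comb, log2


def best_prefix_cluster(bits):
    """Find the prefix length that most shortens the description of `bits`.

    Null code: choose which D of the L positions hold a 1.
    Two-part code (assumed): state a cut point (log2 L bits), state how
    many 1s fall before it (log2(D + 1) bits), then encode the 1s within
    the prefix and within the remainder separately.

    Returns (cut, bits_saved); cut == 0 means no cut beats the null code.
    """
    length = len(bits)
    total_ones = sum(bits)
    null_bits = log2(comb(length, total_ones))
    best_cut, best_saving = 0, 0.0
    ones_in_prefix = 0
    for cut in range(1, length):
        ones_in_prefix += bits[cut - 1]
        model_bits = (
            log2(length)
            + log2(total_ones + 1)
            + log2(comb(cut, ones_in_prefix))
            + log2(comb(length - cut, total_ones - ones_in_prefix))
        )
        saving = null_bits - model_bits
        if saving > best_saving:
            best_cut, best_saving = cut, saving
    return best_cut, best_saving


# Toy usage: five 1s packed at the start of a length-20 string.
cut, saved = best_prefix_cluster([1] * 5 + [0] * 15)
```

Under this encoding, each bit saved roughly halves the probability of the clustering arising by chance, so the saving serves as a rough significance measure. The abstract's question of which subset of the ones is most significantly clustered is richer than this single-cut sketch.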
13.
Algorithms Mol Biol ; 13: 7, 2018.
Article in English | MEDLINE | ID: mdl-29588650

ABSTRACT

BACKGROUND: An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods, which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. Similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive. RESULTS: We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets. CONCLUSION: Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before using phylogenetic methods and provides a way to interpret BLAST results using more information than provided by E-values and bit-scores alone.

14.
Elife ; 7, 2018 01 16.
Article in English | MEDLINE | ID: mdl-29336305

ABSTRACT

Residues responsible for allostery, cooperativity, and other subtle but functionally important interactions remain difficult to detect. To aid such detection, we employ statistical inference based on the assumption that residues distinguishing a protein subgroup from evolutionarily divergent subgroups often constitute an interacting functional network. We identify such networks with the aid of two measures of statistical significance. One measure aids identification of divergent subgroups based on distinguishing residue patterns. For each subgroup, a second measure identifies structural interactions involving pattern residues. Such interactions are derived either from atomic coordinates or from Direct Coupling Analysis scores, used as surrogates for structural distances. Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. These and similar analyses can aid the design of drugs targeting allosteric sites.


Subjects
Computational Biology/methods , Enzymes/chemistry , Enzymes/metabolism , Protein Conformation
15.
BMC Biol ; 4: 41, 2006 Dec 07.
Article in English | MEDLINE | ID: mdl-17156431

ABSTRACT

BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.


Subjects
Computational Biology/statistics & numerical data , Databases, Nucleic Acid/statistics & numerical data , Databases, Protein/statistics & numerical data , Algorithms , Animals , Humans , Protein Biosynthesis , Sequence Alignment , Software
16.
FEBS J ; 272(20): 5101-9, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16218944

ABSTRACT

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of blast.


Subjects
Computational Biology/methods , Databases, Protein , Sequence Alignment/methods , Algorithms , Internet , Proteins/chemistry , Proteins/genetics , ROC Curve , Sequence Alignment/statistics & numerical data , Sequence Homology, Amino Acid , Software
17.
Mol Biochem Parasitol ; 126(2): 231-8, 2003 Feb.
Article in English | MEDLINE | ID: mdl-12615322

ABSTRACT

Plasmodium falciparum iron regulatory-like protein (PfIRPa, accession AJ012289) has homology to a family of iron-responsive element (IRE)-binding proteins (IRPs) found in different species. We have previously demonstrated that erythrocyte P. falciparum PfIRPa binds a mammalian consensus IRE and that the binding activity is regulated by iron status. In the work we now report, we have cloned a C-terminus histidine-tagged PfIRPa and overexpressed it in a bacterial expression system in soluble form capable of binding IREs. To overexpress PfIRPa, we used the T7 promoter-driven vector, pET28a(+), in conjunction with the Rosetta(DE3)pLysS strain of E. coli, which carries extra copies of tRNA genes usually found in organisms such as P. falciparum whose genome is (A+T)-rich. The histidine-tagged recombinant protein (rPfIRPa) in soluble form was partially purified using His-bind resin. We searched the plasmodial database, plasmoDB, to identify sequences capable of forming IRE loops using a specially developed algorithm, and found three plasmodial sequences matching the search criteria. In gel retardation assays, rPfIRPa bound three 32P-labeled putative plasmodial IREs with affinity exceeding the affinity for the mammalian consensus IRE. The binding was concentration-dependent and was not inhibited by heparin, an inhibitor of non-specific binding. Immunodepletion of rPfIRPa resulted in substantial inhibition of the signal intensity in the gel retardation assays and in Western blot-determinations of rPfIRPa protein levels. Endogenous PfIRPa retained all three putative 32P-IREs at the same position on the gel as the recombinant PfIRPa.


Subjects
Iron-Regulatory Proteins/metabolism , Plasmodium falciparum/metabolism , Proto-Oncogene Proteins/metabolism , Protozoan Proteins/metabolism , Zebrafish Proteins , Animals , Base Sequence , Binding Sites , DNA Primers , Humans , Iron-Regulatory Proteins/biosynthesis , Iron-Regulatory Proteins/genetics , Jurkat Cells , Molecular Sequence Data , Nucleic Acid Conformation , Protein Binding , Protein-Tyrosine Kinases/metabolism , Protozoan Proteins/biosynthesis , Protozoan Proteins/genetics , RNA, Protozoan/chemistry , RNA, Protozoan/genetics , Recombinant Proteins/metabolism , Wnt Proteins
18.
J Comput Biol ; 20(1): 1-18, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23294268

ABSTRACT

The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we use the Dirichlet process to derive such mixtures with an unbounded number of components. This application of the method requires several technical innovations to sample an unbounded number of Dirichlet-mixture components. The resulting Dirichlet mixtures model multiple-alignment data substantially better than do previously derived ones. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on the structure of proteins. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino acid multinomial space.


Subjects
Proteins/chemistry , Proteins/genetics , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Likelihood Functions , Markov Chains , Mathematical Concepts , Models, Statistical , Monte Carlo Method , Probability Theory , Statistics, Nonparametric
19.
Biol Direct ; 7: 12, 2012 Apr 17.
Article in English | MEDLINE | ID: mdl-22510480

ABSTRACT

BACKGROUND: BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch. RESULTS: We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI's Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST. CONCLUSIONS: DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the "Protein BLAST" link at http://blast.ncbi.nlm.nih.gov.


Subjects
Databases, Protein , Protein Structure, Tertiary , Search Engine/methods , Software , Algorithms , Computational Biology/methods , Internet , ROC Curve , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Time Factors
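
Entry 19 above reports retrieval performance as a ROC5000 score, that is, a ROC_n score with n = 5000. A ROC_n score is commonly computed from a ranked hit list by counting, for each of the first n false positives, how many true positives rank above it, then normalizing by n times the number of true positives. The sketch below follows that common definition. The function name is hypothetical, and the handling of lists that contain fewer than n false positives is an assumption.

```python
def roc_n(ranked_labels, n, total_true=None):
    """ROC_n score for a ranked list of hits, best hit first.

    ranked_labels -- booleans, True for a true positive, False for a false one
    n             -- number of false positives to integrate over
    total_true    -- total true positives in the benchmark; defaults to
                     the number present in ranked_labels

    If the list holds fewer than n false positives, the missing ones are
    treated as ranking below every listed hit (an assumed convention).
    """
    if total_true is None:
        total_true = sum(1 for is_true in ranked_labels if is_true)
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


# Toy usage: two true hits, a false hit, another true hit, another false hit.
score = roc_n([True, True, False, True, False], 2)
```

A score of 1.0 means every true positive outranks the first n false positives. Because the benchmark's full procedure is not given in the abstract, values from this sketch are not directly comparable with those reported there.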
20.
J Comput Biol ; 18(8): 925-39, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21702692

ABSTRACT

A model is a set of possible theories for describing a set of data. When the data are used to select a maximum-likelihood theory, an important question is how many effectively independent theories the model contains; the log of this number is called the model's complexity. The Dirichlet model is the set of all Dirichlet distributions, which are probability densities over the space of multinomials. A Dirichlet distribution may be used to describe multiple-alignment data, consisting of n columns of letters, with c letters in each column. We here derive, in the limit of large n and c, a closed-form expression for the complexity of the Dirichlet model applied to such data. For small c, we derive as well a minor correction to this formula, which is easily calculated by Monte Carlo simulation. Although our results are confined to the Dirichlet model, they may cast light as well on the complexity of Dirichlet mixture models, which have been applied fruitfully to the study of protein multiple sequence alignments.


Subjects
Computational Biology , Proteins/analysis , Sequence Alignment , Algorithms , Computational Biology/methods , Computational Biology/statistics & numerical data , Models, Statistical , Monte Carlo Method , Probability , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data