Search | Biblioteca Virtual em Saúde (Virtual Health Library)

1.

Improved selection of canonical proteins for reference proteomes.

Insana, Giuseppe; Martin, Maria J; Pearson, William R.

NAR Genom Bioinform ; 6(2): lqae066, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38863529

ABSTRACT

The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.

2.

Comparison of detection methods and genome quality when quantifying nuclear mitochondrial insertions in vertebrate genomes.

Triant, Deborah A; Pearson, William R.

Front Genet ; 13: 984513, 2022.

Article in English | MEDLINE | ID: mdl-36482890

ABSTRACT

The integration of mitochondrial genome fragments into the nuclear genome is well documented, and the transfer of these mitochondrial nuclear pseudogenes (numts) is thought to be an ongoing evolutionary process. With the increasing number of eukaryotic genomes available, genome-wide distributions of numts are often surveyed. However, inconsistencies in genome quality can reduce the accuracy of numt estimates, and methods used for identification can be complicated by the diverse sizes and ages of numts. Numts have been previously characterized in rodent genomes and it was postulated that they might be more prevalent in a group of voles with rapidly evolving karyotypes. Here, we examine 37 rodent genomes, and an additional 26 vertebrate genomes, while also considering numt detection methods. We identify numts using DNA:DNA and protein:translated-DNA similarity searches and compare numt distributions among rodent and vertebrate taxa to assess whether some groups are more susceptible to transfer. A combination of protein sequence comparisons (protein:translated-DNA) and BLASTN genomic DNA searches detects 50% more numts than genomic DNA:DNA searches alone. In addition, higher-quality RefSeq genomes produce lower estimates of numts than GenBank genomes, suggesting that lower quality genome assemblies can overestimate numt abundance. Phylogenetic analysis shows that mitochondrial transfers are not associated with karyotypic diversity among rodents. Surprisingly, we did not find a strong correlation between numt counts and genome size. Estimates using DNA:DNA analyses can underestimate the amount of mitochondrial DNA that is transferred to the nucleus.

3.

Barriers to integration of bioinformatics into undergraduate life sciences education: A national study of US life sciences faculty uncover significant barriers to integrating bioinformatics into undergraduate instruction.

Williams, Jason J; Drew, Jennifer C; Galindo-Gonzalez, Sebastian; Robic, Srebrenka; Dinsdale, Elizabeth; Morgan, William R; Triplett, Eric W; Burnette, James M; Donovan, Samuel S; Fowlks, Edison R; Goodman, Anya L; Grandgenett, Nealy F; Goller, Carlos C; Hauser, Charles; Jungck, John R; Newman, Jeffrey D; Pearson, William R; Ryder, Elizabeth F; Sierk, Michael; Smith, Todd M; Tosado-Acevedo, Rafael; Tapprich, William; Tobin, Tammy C; Toro-Martínez, Arlín; Welch, Lonnie R; Wilson, Melissa A; Ebenbach, David; McWilliams, Mindy; Rosenwald, Anne G; Pauley, Mark A.

PLoS One ; 14(11): e0224288, 2019.

Article in English | MEDLINE | ID: mdl-31738797

ABSTRACT

Bioinformatics, a discipline that combines aspects of biology, statistics, mathematics, and computer science, is becoming increasingly important for biological research. However, bioinformatics instruction is not yet generally integrated into undergraduate life sciences curricula. To understand why, we studied how bioinformatics is being included in biology education in the US by conducting a nationwide survey of faculty at two- and four-year institutions. The survey asked several open-ended questions that probed barriers to integration, the answers to which were analyzed using a mixed-methods approach. The barrier most frequently reported by the 1,260 respondents was lack of faculty expertise/training, but other deterrents (lack of student interest, overly full curricula, and lack of student preparation) were also common. Interestingly, the barriers faculty face depended strongly on whether they are members of an underrepresented group and on the Carnegie Classification of their home institution. We were surprised to discover that the cohort of faculty who were awarded their terminal degree most recently reported the most preparation in bioinformatics but teach it at the lowest rate.

Subjects

Biology/education; Computational Biology/education; Curriculum; Faculty/statistics & numerical data; Female; Humans; Male; Motivation; Students/psychology; Surveys and Questionnaires/statistics & numerical data; United States

4.

Using SQL Databases for Sequence Similarity Searching and Analysis.

Pearson, William R; Mackey, Aaron J.

Curr Protoc Bioinformatics ; 59: 9.4.1-9.4.22, 2017 Sep 13.

Article in English | MEDLINE | ID: mdl-28902397

ABSTRACT

Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.

Subjects

Computational Biology/methods; Databases, Protein; Sequence Analysis, Protein/methods; Software; Escherichia coli/genetics; Evolution, Molecular; Proteins/chemistry; Proteins/genetics; Sequence Alignment; Sequence Homology, Amino Acid
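
The idea in entry 4 of loading similarity search results into a relational database and then summarizing them can be illustrated with a minimal sketch. It uses Python's built-in SQLite module rather than the database servers discussed in the unit, and the table `search_hit`, its columns, and the sample accessions are hypothetical; they are not the schema of seqdb_demo or search_demo.

```python
import sqlite3


def homolog_counts(rows, max_evalue=1e-6):
    """Count distinct query proteins with a significant hit in each kingdom.

    rows: iterable of (query_acc, subject_acc, kingdom, evalue) tuples.
    The table layout is a hypothetical illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE search_hit ("
        " query_acc TEXT, subject_acc TEXT, kingdom TEXT, evalue REAL)"
    )
    con.executemany("INSERT INTO search_hit VALUES (?, ?, ?, ?)", rows)
    cur = con.execute(
        "SELECT kingdom, COUNT(DISTINCT query_acc) FROM search_hit"
        " WHERE evalue <= ? GROUP BY kingdom ORDER BY kingdom",
        (max_evalue,),
    )
    result = dict(cur.fetchall())
    con.close()
    return result


# Made-up hits for three query proteins (illustrative values only).
EXAMPLE_HITS = [
    ("ecoli_p1", "hs_a", "Eukaryota", 1e-40),
    ("ecoli_p1", "mj_b", "Archaea", 1e-12),
    ("ecoli_p2", "hs_c", "Eukaryota", 0.5),  # above the E-value threshold
    ("ecoli_p2", "mj_d", "Archaea", 1e-8),
    ("ecoli_p3", "hs_e", "Eukaryota", 1e-7),
]

if __name__ == "__main__":
    print(homolog_counts(EXAMPLE_HITS))
```

Keeping the E-value threshold as a query parameter means the same stored results can be summarized at different levels of stringency without rerunning the searches.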

5.

Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold.

Pearson, William R; Li, Weizhong; Lopez, Rodrigo.

Nucleic Acids Res ; 45(7): e46, 2017 Apr 20.

Article in English | MEDLINE | ID: mdl-27923999

ABSTRACT

Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position-specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.

Subjects

Sequence Alignment/methods; Sequence Analysis, Protein/methods; Protein Domains; Software
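
The query-seeding step described in the abstract of entry 5 can be sketched on a single pairwise alignment. This is a minimal illustration, not the psisearch implementation: the function name is hypothetical, and dropping columns where the query itself is gapped is an assumption about how a query-based MSA row would be formed.

```python
def seed_subject_with_query(aligned_query, aligned_subject, gap="-"):
    """Build one query-based MSA row from a gapped pairwise alignment.

    Wherever the subject has a gap, the query residue is inserted instead.
    Columns where the query has a gap are dropped (an assumption made for
    this sketch, so that each output position maps to one query residue).
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    seeded = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            continue  # insertion relative to the query: no query column
        seeded.append(q if s == gap else s)
    return "".join(seeded)


if __name__ == "__main__":
    # Made-up alignment: the subject has one insertion and a two-residue gap.
    print(seed_subject_with_query("MKT-AYLG", "MRTWA--G"))
```

Because gapped positions are filled with query residues, every row contributes some query identity to the model, which is consistent with the anchoring effect the abstract describes.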

6.

Finding Protein and Nucleotide Similarities with FASTA.

Pearson, William R.

Curr Protoc Bioinformatics ; 53: 3.9.1-3.9.25, 2016 Mar 24.

Article in English | MEDLINE | ID: mdl-27010337

ABSTRACT

The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including MySQL and PostgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.

Subjects

Nucleotides/chemistry; Proteins/chemistry; Sequence Alignment; Sequence Homology, Amino Acid; Sequence Homology, Nucleic Acid; Databases, Nucleic Acid; Databases, Protein
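
Entry 6 mentions "BLAST-like" tabular output for integration into analysis pipelines. A minimal sketch of reading such output follows. It assumes the conventional 12-column, tab-separated BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); whether a particular program and option set emits exactly these columns should be checked against its documentation. The `Hit` type and `parse_tabular` function are hypothetical names.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bit_score: float


def parse_tabular(text: str, max_evalue: float = 1e-3) -> List[Hit]:
    """Parse 12-column BLAST-style tabular text, keeping significant hits.

    Assumes columns 1, 2, 3, 11 and 12 hold the query id, subject id,
    percent identity, E-value and bit score respectively.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard tabular record
        hit = Hit(
            query=fields[0],
            subject=fields[1],
            percent_identity=float(fields[2]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits
```

Filtering on the E-value column rather than on percent identity follows the general advice in these abstracts that statistical significance, not identity, is the basis for inferring homology.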

7.

Protein Function Prediction: Problems and Pitfalls.

Pearson, William R.

Curr Protoc Bioinformatics ; 51: 4.12.1-4.12.8, 2015 Sep 03.

Article in English | MEDLINE | ID: mdl-26334923

ABSTRACT

The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood.

Subjects

Databases, Protein; Proteins/chemistry; Proteins/metabolism; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Amino Acid Sequence; Data Mining/methods; Molecular Sequence Data; Structure-Activity Relationship

8.

Most partial domains in proteins are alignment and annotation artifacts.

Triant, Deborah A; Pearson, William R.

Genome Biol ; 16: 99, 2015 May 15.

Article in English | MEDLINE | ID: mdl-25976240

ABSTRACT

BACKGROUND: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). RESULTS: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. CONCLUSIONS: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.

Subjects

Molecular Sequence Annotation; Protein Structure, Tertiary; Proteins/chemistry; Sequence Alignment; Animals; Databases, Genetic; Databases, Protein; Drosophila/genetics; Humans; Mice; Models, Molecular; Software
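
The 50% criterion used in entry 8 to flag candidate partial domains can be written as a short check. This is a sketch under stated assumptions: coordinates are taken to be 1-based and inclusive, the annotated region length on the protein is compared directly with the family model length, and the function names are hypothetical. It does not reproduce the split, bounded, or unbounded classification described in the abstract.

```python
def domain_fraction(seq_start: int, seq_end: int, model_length: int) -> float:
    """Length of the annotated domain region relative to the model length.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    if seq_end < seq_start:
        raise ValueError("seq_end must not be smaller than seq_start")
    return (seq_end - seq_start + 1) / model_length


def is_candidate_partial(seq_start: int, seq_end: int, model_length: int,
                         threshold: float = 0.5) -> bool:
    """True when the region is shorter than `threshold` of the model length."""
    return domain_fraction(seq_start, seq_end, model_length) < threshold


if __name__ == "__main__":
    # Made-up annotation: a 90-residue region for a 200-position model.
    print(is_candidate_partial(11, 100, 200))
```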

9.

The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes.

Furnham, Nicholas; Holliday, Gemma L; de Beer, Tjaart A P; Jacobsen, Julius O B; Pearson, William R; Thornton, Janet M.

Nucleic Acids Res ; 42(Database issue): D485-9, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24319146

ABSTRACT

Understanding which are the catalytic residues in an enzyme and what function they perform is crucial to many biology studies, particularly those leading to new therapeutics and enzyme design. The original version of the Catalytic Site Atlas (CSA) (http://www.ebi.ac.uk/thornton-srv/databases/CSA) published in 2004, which catalogs the residues involved in enzyme catalysis in experimentally determined protein structures, had only 177 curated entries and employed a simplistic approach to expanding these annotations to homologous enzyme structures. Here we present a new version of the CSA (CSA 2.0), which greatly expands the number of both curated (968) and automatically annotated catalytic sites in enzyme structures, utilizing a new method for annotation transfer. The curated entries are used, along with the variation in residue type from the sequence comparison, to generate 3D templates of the catalytic sites, which in turn can be used to find catalytic sites in new structures. To ease the transfer of CSA annotations to other resources a new ontology has been developed: the Enzyme Mechanism Ontology, which has permitted the transfer of annotations to Mechanism, Annotation and Classification in Enzymes (MACiE) and UniProt Knowledge Base (UniProtKB) resources. The CSA database schema has been re-designed and both the CSA data and search capabilities are presented in a new modern web interface.

Subjects

Catalytic Domain; Databases, Protein; Enzymes/chemistry; Biological Ontologies; Internet; Sequence Analysis, Protein

10.

BLAST and FASTA similarity searching for multiple sequence alignment.

Pearson, William R.

Methods Mol Biol ; 1079: 75-101, 2014.

Article in English | MEDLINE | ID: mdl-24170396

ABSTRACT

BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry (homology). The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.

Subjects

Computational Biology/methods; Sequence Alignment/methods; Software; Amino Acid Sequence; Data Mining; Databases, Protein; Humans; Molecular Sequence Data; Sequence Homology, Amino Acid
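
Entry 10 recommends expectation values over percent identity and notes that smaller databases can improve sensitivity. The sketch below shows why, using the standard relation between a bit score S' and the expectation value, E = m * n * 2^(-S'), where m is the query length and n the database length in residues. It ignores edge-effect and length corrections, and it is not the empirical estimation strategy that the FASTA programs use; the function name is hypothetical.

```python
def expect_value(bit_score: float, query_length: int,
                 database_length: float) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S') with no edge-effect correction, so the
    result is an approximation for illustration only.
    """
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The same 40-bit alignment searched against a large and a small library
    # (made-up sizes, in residues).
    for library_size in (1e9, 1e7):
        print(library_size, expect_value(40.0, 300, library_size))
```

Since E scales linearly with database size, the same alignment score is 100-fold more significant in a library that is 100-fold smaller, which is the reasoning behind searching a smaller comprehensive protein set.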

11.

Adjusting scoring matrices to correct overextended alignments.

Mills, Lauren J; Pearson, William R.

Bioinformatics ; 29(23): 3007-13, 2013 Dec 01.

Article in English | MEDLINE | ID: mdl-23995390

ABSTRACT

MOTIVATION: Sequence similarity searches performed with BLAST, SSEARCH and FASTA achieve high sensitivity by using scoring matrices (e.g. BLOSUM62) that target low identity (<33%) alignments. Although such scoring matrices can effectively identify distant homologs, they can also produce local alignments that extend beyond the homologous regions. RESULTS: We measured local alignment start/stop boundary accuracy using a set of queries where the correct alignment boundaries were known, and found that 7% of BLASTP and 8% of SSEARCH alignment boundaries were overextended. Overextended alignments include non-homologous sequences; they occur most frequently between sequences that are more closely related (>33% identity). Adjusting the scoring matrix to reflect the identity of the homologous sequence can correct higher identity overextended alignment boundaries. In addition, the scoring matrix that produced a correct alignment could be reliably predicted based on the sequence identity seen in the original BLOSUM62 alignment. Realigning with the predicted scoring matrix corrected 37% of all overextended alignments, resulting in more correct alignments than using BLOSUM62 alone.

Subjects

Computational Biology/methods; Position-Specific Scoring Matrices; Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Algorithms; Amino Acid Sequence; Databases, Protein; Molecular Sequence Data; Sequence Homology, Amino Acid

12.

An introduction to sequence similarity ("homology") searching.

Pearson, William R.

Curr Protoc Bioinformatics ; Chapter 3: 3.1.1-3.1.8, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23749753

ABSTRACT

Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity: statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.

Subjects

Proteins/chemistry; Sequence Alignment/methods; Databases, Protein; Proteins/genetics; Sequence Analysis; Sequence Homology

13.

Selecting the Right Similarity-Scoring Matrix.

Pearson, William R.

Curr Protoc Bioinformatics ; 43: 3.5.1-3.5.9, 2013.

Article in English | MEDLINE | ID: mdl-24509512

ABSTRACT

Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20-30% identity, while "shallow" scoring matrices (e.g. VTML10-VTML80) target alignments that share 90-50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.

Subjects

Position-Specific Scoring Matrices; Amino Acid Sequence; Amino Acid Substitution; DNA; Molecular Sequence Data; Sequence Alignment; Sequence Homology, Amino Acid

14.

PSI-Search: iterative HOE-reduced profile SSEARCH searching.

Li, Weizhong; McWilliam, Hamish; Goujon, Mickael; Cowley, Andrew; Lopez, Rodrigo; Pearson, William R.

Bioinformatics ; 28(12): 1650-1, 2012 Jun 15.

Article in English | MEDLINE | ID: mdl-22539666

ABSTRACT

UNLABELLED: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains, an error known as homologous over-extension (HOE). PSI-Search combines an optimal Smith-Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search also produces 2- to 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive. AVAILABILITY AND IMPLEMENTATION: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).

Subjects

Amino Acid Motifs; Sequence Alignment/methods; Software; Computational Biology/methods; Databases, Protein; Internet; Programming Languages

15.

MACiE: exploring the diversity of biochemical reactions.

Holliday, Gemma L; Andreini, Claudia; Fischer, Julia D; Rahman, Syed Asad; Almonacid, Daniel E; Williams, Sophie T; Pearson, William R.

Nucleic Acids Res ; 40(Database issue): D783-9, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22058127

ABSTRACT

MACiE (which stands for Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/. This article presents the release of Version 3 of MACiE, which not only extends the dataset to 335 entries, covering 182 of the EC sub-subclasses with a crystal structure available (~90%), but also incorporates greater chemical and structural detail. This version of MACiE represents a shift in emphasis for new entries, from non-homologous representatives covering EC reaction space to enzymes with mechanisms of interest to our users and collaborators with a view to exploring the chemical diversity of life. We present new tools for exploring the data in MACiE and comparing entries as well as new analyses of the data and new searches, many of which can now be accessed via dedicated Perl scripts.

Subjects

Databases, Protein; Enzymes/chemistry; Biocatalysis; Biochemical Phenomena; Catalytic Domain; Coenzymes/chemistry; Enzymes/classification; Internet; Molecular Sequence Annotation

16.

RefProtDom: a protein database with improved domain boundaries and homology relationships.

Gonzalez, Mileidy W; Pearson, William R.

Bioinformatics ; 26(18): 2361-2, 2010 Sep 15.

Article in English | MEDLINE | ID: mdl-20693322

ABSTRACT

UNLABELLED: RefProtDom provides a set of divergent query domains, originally selected from Pfam, and full-length proteins containing their homologous domains, with diverse architectures, for evaluating pair-wise and iterative sequence similarity searches. Pfam homology and domain boundary annotations in the target library were supplemented using local and semi-global searches, PSI-BLAST searches, and SCOP and CATH classifications. AVAILABILITY: RefProtDom is available from http://faculty.virginia.edu/wrpearson/fasta/PUBS/gonzalez09a.

Subjects

Databases, Protein; Protein Structure, Tertiary; Proteins; Software

17.

Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments.

Sierk, Michael L; Smoot, Michael E; Bass, Ellen J; Pearson, William R.

BMC Bioinformatics ; 11: 146, 2010 Mar 22.

Article in English | MEDLINE | ID: mdl-20307279

ABSTRACT

BACKGROUND: While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. RESULTS: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement.
The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance. CONCLUSIONS: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.

Subjects

Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein

18.

Homologous over-extension: a challenge for iterative similarity searches.

Gonzalez, Mileidy W; Pearson, William R.

Nucleic Acids Res ; 38(7): 2177-89, 2010 Apr.

Article in English | MEDLINE | ID: mdl-20064877

ABSTRACT

We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded Pfam domain queries on searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions and HOE alignments that begin in a homologous region, but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position-specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5-9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; 35-5% for sampled queries) and no-error searches increase from 2 to 58% weighted coverage (hard; 16-78% sampled). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4-8-fold, with little loss in sensitivity.

Subjects

Sequence Alignment/methods; Sequence Homology, Amino Acid; Phylogeny; Position-Specific Scoring Matrices; Protein Structure, Tertiary; Proteins/chemistry; Proteins/classification; Proteins/genetics

19.

Globally, unrelated protein sequences appear random.

Lavelle, Daniel T; Pearson, William R.

Bioinformatics ; 26(3): 310-8, 2010 Feb 01.

Article in English | MEDLINE | ID: mdl-19948773

ABSTRACT

MOTIVATION: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. RESULTS: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in alpha-helical secondary structures (but not beta-strands). Five-residue consensus exceptional words are enriched for alpha-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for alpha-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Amino Acid Sequence; Proteins/chemistry; Sequence Analysis, Protein/methods; Databases, Protein; Protein Folding; Protein Structure, Secondary

20.

The limits of protein sequence comparison?

Pearson, William R; Sierk, Michael L.

Curr Opin Struct Biol ; 15(3): 254-60, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15919194

ABSTRACT

Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.

Subjects

Algorithms; Databases, Protein; Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Amino Acid Sequence; Molecular Sequence Data; Proteins/analysis; Proteins/classification; Sequence Alignment/trends; Sequence Analysis, Protein/trends; Sequence Homology, Amino Acid; Structure-Activity Relationship
