ALP & FALP: C++ libraries for pairwise local alignment E-values.

Sheetlin, Sergey; Park, Yonil; Frith, Martin C; Spouge, John L.

Bioinformatics ; 32(2): 304-5, 2016 Jan 15.

Article in English | MEDLINE | ID: mdl-26428291

ABSTRACT

MOTIVATION: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for an arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT: spouge@nih.gov. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Computational Biology/methods , DNA/chemistry , Proteins/chemistry , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Software , DNA/metabolism , Factual Databases , Humans , Proteins/metabolism , Sequence Alignment

Frameshift alignment: statistics and post-genomic applications.

Sheetlin, Sergey L; Park, Yonil; Frith, Martin C; Spouge, John L.

Bioinformatics ; 30(24): 3575-82, 2014 Dec 15.

Article in English | MEDLINE | ID: mdl-25172925

ABSTRACT

MOTIVATION: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. RESULTS: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.

Subjects

Frameshift Mutation , Sequence Alignment/methods , Algorithms , Statistical Data Interpretation , Human Genome , Genomics , Humans , Metagenomics , Pseudogenes , DNA Sequence Analysis , Protein Sequence Analysis , RNA Sequence Analysis , Software
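
To make concrete why DNA-protein alignment has to allow for frameshifts, the following minimal sketch translates a short DNA string in different reading frames. It is only an illustration and not the method of the paper: the names `translate`, `CODON_TABLE`, `gene` and `mutated` are hypothetical, the example sequence is made up, and the table is the standard genetic code.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1 or 2).

    Incomplete trailing codons are dropped; unknown codons become 'X'.
    """
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Made-up example: one inserted base (the extra C after position 6)
# changes every downstream codon in the original frame.
gene = "ATGGCCAAAGGGTTT"
mutated = "ATGGCC" + "C" + "AAAGGGTTT"

print(translate(gene))              # MAKGF
print(translate(mutated))           # MAQRV  (downstream residues changed)
print(translate(mutated, frame=1))  # WPKGF  (downstream KGF reappears)
```

After the insertion, the tail of the protein is only recoverable by switching frames partway through, which is the situation a frameshift-aware aligner is designed to handle.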

The whole alignment and nothing but the alignment: the problem of spurious alignment flanks.

Frith, Martin C; Park, Yonil; Sheetlin, Sergey L; Spouge, John L.

Nucleic Acids Res ; 36(18): 5863-71, 2008 Oct.

Article in English | MEDLINE | ID: mdl-18796526

ABSTRACT

Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. For example, in the UCSC genome database, spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.

Subjects

Sequence Alignment/methods , Animals , Computational Biology , Statistical Data Interpretation , Genomics , Humans , Probability

Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times.

Park, Yonil; Sheetlin, Sergey; Spouge, John L.

Ann Stat ; 37(6A): 3697, 2009 Dec 01.

Article in English | MEDLINE | ID: mdl-20148197

ABSTRACT

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
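
The idea of importance sampling with a stopping time can be illustrated on a far simpler problem than alignment. The sketch below is a classical toy example and not the algorithm of the paper: it estimates the probability that a random walk with steps of +1 (probability `p` < 1/2) and -1 ever reaches a level `y`. Under the target distribution the walk drifts downward, so a direct simulation may never stop. Under the trial distribution, which swaps the two step probabilities, the walk reaches `y` at a finite stopping time, and each run is weighted by its likelihood ratio. The function name and parameters are hypothetical.

```python
import random


def hit_probability_is(p, y, trials=1000, seed=0):
    """Importance-sampling estimate of P(walk with up-probability p ever hits y).

    Assumes 0 < p < 0.5 and integer y >= 1. Trial distribution: up-probability
    q = 1 - p, so the walk drifts upward and the stopping time is finite.
    """
    rng = random.Random(seed)
    q = 1.0 - p
    total = 0.0
    for _ in range(trials):
        position, weight = 0, 1.0
        while position < y:  # stopping time: first visit to level y
            if rng.random() < q:  # up-step, sampled with trial probability q
                position += 1
                weight *= p / q  # target probability / trial probability
            else:  # down-step, sampled with trial probability p
                position -= 1
                weight *= q / p
        total += weight
    return total / trials


estimate = hit_probability_is(p=0.3, y=5)
exact = (0.3 / 0.7) ** 5  # known closed form for this walk
print(estimate, exact)
```

In this toy case every run ends exactly at level `y`, so every weight equals (p/q)^y up to rounding and the estimator has essentially zero variance. The alignment setting described in the abstract is considerably more involved, since the trial and target distributions live on different sample spaces.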

The identification of complete domains within protein sequences using accurate E-values for semi-global alignment.

Kann, Maricel G; Sheetlin, Sergey L; Park, Yonil; Bryant, Stephen H; Spouge, John L.

Nucleic Acids Res ; 35(14): 4678-85, 2007.

Article in English | MEDLINE | ID: mdl-17596268

ABSTRACT

The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a 'semi-global alignment'. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.

Subjects

Protein Tertiary Structure , Sequence Alignment , Protein Sequence Analysis/methods , Algorithms , Amino Acid Sequence , Computational Biology/methods , Conserved Sequence , Protein Databases , Markov Chains , Reproducibility of Results , Software

The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment.

Sheetlin, Sergey; Park, Yonil; Spouge, John L.

Nucleic Acids Res ; 33(15): 4987-94, 2005.

Article in English | MEDLINE | ID: mdl-16147981

ABSTRACT

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter lambda and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter lambda can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243-260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters lambda and k within the errors required (lambda, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.

Subjects

Computational Biology/methods , Sequence Alignment/methods , Computer Simulation , Statistical Data Interpretation , Software
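
Once the two Gumbel parameters are known, turning an alignment score into a significance estimate is a short calculation. The sketch below uses the usual large-sequence approximation, E = k·m·n·exp(-lambda·S) and P(S' ≥ S) ≈ 1 - exp(-E), without any finite-size correction. The function names are hypothetical, and the numerical values of `lam`, `k`, the lengths and the score are illustrative assumptions rather than figures taken from the abstract.

```python
import math


def evalue(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`
    when comparing random sequences of lengths m and n."""
    return k * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, k):
    """Gumbel tail probability: P(best score >= `score`) ~ 1 - exp(-E)."""
    return -math.expm1(-evalue(score, m, n, lam, k))


# Illustrative values only (assumptions, not results from the paper).
lam, k = 0.267, 0.041
m, n = 140, 140
for score in (30, 50, 70):
    print(score, evalue(score, m, n, lam, k), pvalue(score, m, n, lam, k))
```

Because the score enters through exp(-lambda·S), a small relative error in lambda changes the E-value far more than the same relative error in k, which is consistent with the much tighter tolerance the abstract quotes for lambda (0.8%) than for k (10%).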

New finite-size correction for local alignment score distributions.

Park, Yonil; Sheetlin, Sergey; Ma, Ning; Madden, Thomas L; Spouge, John L.

BMC Res Notes ; 5: 286, 2012 Jun 12.

Article in English | MEDLINE | ID: mdl-22691307

ABSTRACT

BACKGROUND: Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a "finite-size" correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score. FINDINGS: We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences. CONCLUSIONS: The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site (http://blast.ncbi.nlm.nih.gov).

Subjects

Amino Acid Sequence , Sequence Alignment/methods , Software , Protein Databases , Internet , Molecular Sequence Data , Probability , Research Design , Sequence Alignment/statistics & numerical data
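
For contrast with the correction described in the abstract, the sketch below shows the simpler, mean-based style of finite-size correction: each sequence length is shortened by an expected alignment length, here taken as ln(k·m·n)/H, before the E-value is computed. This is an assumption-laden illustration of the older approach and not the new correction from the paper. The function names, the entropy parameter `h` (in nats per aligned pair), the floor of 1 on the effective lengths and all numerical values are hypothetical.

```python
import math


def expected_alignment_length(m, n, k, h):
    """Mean-based estimate of the length consumed by a chance alignment."""
    return max(math.log(k * m * n) / h, 0.0)


def corrected_evalue(score, m, n, lam, k, h):
    """E-value computed from effective lengths m - ell and n - ell.

    The floor of 1.0 is an ad hoc guard for short sequences; it is an
    assumption of this sketch, not a rule taken from the paper.
    """
    ell = expected_alignment_length(m, n, k, h)
    m_eff = max(m - ell, 1.0)
    n_eff = max(n - ell, 1.0)
    return k * m_eff * n_eff * math.exp(-lam * score)


# Illustrative values only (assumptions).
lam, k, h = 0.267, 0.041, 0.14
uncorrected = k * 1000 * 1000 * math.exp(-lam * 60)
print(uncorrected, corrected_evalue(60, 1000, 1000, lam, k, h))

# A short query: the expected alignment length exceeds the query length,
# so the effective length falls back to the ad hoc floor.
print(expected_alignment_length(20, 1000, k, h))
```

The second print shows the weakness of a mean-based correction for short sequences: the subtraction would leave a non-positive length, so some arbitrary substitute length has to be chosen.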

Objective method for estimating asymptotic parameters, with an application to sequence alignment.

Sheetlin, Sergey; Park, Yonil; Spouge, John L.

Phys Rev E Stat Nonlin Soft Matter Phys ; 84(3 Pt 1): 031914, 2011 Sep.

Article in English | MEDLINE | ID: mdl-22060410

ABSTRACT

Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program ARRP replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.

Subjects

Algorithms , Sequence Alignment/methods , Sequence Analysis/methods , Software

The correlation error and finite-size correction in an ungapped sequence alignment.

Park, Yonil; Spouge, John L.

Bioinformatics ; 18(9): 1236-42, 2002 Sep.

Article in English | MEDLINE | ID: mdl-12217915

ABSTRACT

MOTIVATION: The BLAST program for comparing two sequences assumes independent sequences in its random model. The resulting random alignment matrices have correlations across their diagonals. Analytic formulas for the BLAST p-value essentially neglect these correlations and are equivalent to a random model with independent diagonals. Progress on the independent diagonals model has been surprisingly rapid, but the practical magnitude of the correlations it neglects remains unknown. In addition, BLAST uses a finite-size correction that is particularly important when either of the sequences being compared is short. Several formulas for the finite-size correction have now been given, but the corresponding errors in the BLAST p-values have not been quantified. As the lengths of compared sequences tend to infinity, it is also theoretically unknown whether the neglected correlations vanish faster than the finite-size correction. RESULTS: Because we required certain analytic formulas, our study restricted its computer experiments to ungapped sequence alignment. We expect some of our conclusions to extend qualitatively to gapped sequence alignment, however. With this caveat, the finite-size correction appeared to vanish faster than the neglected correlations. Although the finite-size correction underestimated the BLAST p-value, it improved the approximation substantially for all but very short sequences. In practice, the Altschul-Gish finite-size correction was superior to Spouge's. The independent diagonals model was always within a factor of 2 of the true BLAST p-value, although fitting p-value parameters from it probably is unwise. CONTACT: spouge@ncbi.nlm.nih.gov

Subjects

Database Management Systems , Genetic Databases , Information Storage and Retrieval/methods , Statistical Models , Sequence Alignment/methods , Sequence Analysis/methods , False Positive Reactions , Genetic Models , Monte Carlo Method , National Library of Medicine (U.S.) , Reproducibility of Results , Sensitivity and Specificity , Sequence Homology , Statistics as Topic , United States
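
The role of diagonals in ungapped alignment can be seen in a short sketch. Without gaps, an alignment never leaves its diagonal of the comparison matrix, so the best local score is the maximum, over all diagonals, of the best-scoring segment on each one. The code below is a minimal illustration with a hypothetical function name and a simple match/mismatch scoring scheme chosen for the example; it is not code from BLAST or from the paper.

```python
def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best ungapped local alignment score of strings a and b.

    Each diagonal d = j - i is scanned independently with a running score
    that is reset to zero whenever it would become negative.
    """
    best = 0
    for d in range(-(len(a) - 1), len(b)):
        running = 0
        i = max(0, -d)  # first row on this diagonal
        j = i + d       # matching column
        while i < len(a) and j < len(b):
            running = max(0, running + (match if a[i] == b[j] else mismatch))
            best = max(best, running)
            i += 1
            j += 1
    return best


print(best_ungapped_score("GGACGTCC", "TTACGTAA"))  # 4, from the shared ACGT
```

Although the scan treats diagonals one at a time, different diagonals reuse the same letters of `a` and `b`, so their scores are not statistically independent. That shared dependence is the correlation which, according to the abstract, the independent diagonals model neglects.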
