1.
PLoS Comput Biol ; 12(5): e1004936, 2016 05.
Article in English | MEDLINE | ID: mdl-27192614

ABSTRACT

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
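
For readers unfamiliar with the sum-of-the-pairs scoring that the abstract contrasts with GISMO's Bayesian measure, the following is a minimal sketch of that score. It is not GISMO's method, and the `MATCH`, `MISMATCH` and `GAP` values are hypothetical placeholders (real tools typically use a substitution matrix and affine gap costs). The sketch shows that the score is purely additive over columns and row pairs and contains no test of statistical significance, so any set of sequences, related or not, receives some score.

```python
from itertools import combinations

# Hypothetical scoring parameters, chosen for illustration only.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs are conventionally ignored
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Sum pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


msa = ["MKV-LA", "MKVALA", "MRV-LG"]
print(sum_of_pairs(msa))
```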


Subjects
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Databases, Protein , Markov Chains , Monte Carlo Method , Sequence Alignment/standards , Software
2.
Biomed Res Int ; 2013: 865181, 2013.
Article in English | MEDLINE | ID: mdl-24319692

ABSTRACT

BACKGROUND: Next generation sequencing (NGS) is being widely used to identify genetic variants associated with human disease. Although the approach is cost effective, the underlying data are susceptible to many types of error. Because NGS technologies and protocols are rapidly evolving, with constantly changing steps ranging from sample preparation to data processing software updates, it is important to enable researchers to routinely assess the quality of sequencing and alignment data prior to downstream analyses. RESULTS: Here we describe QPLOT, an automated tool that can facilitate the quality assessment of sequencing run performance. Taking standard sequence alignments as input, QPLOT generates a series of diagnostic metrics summarizing run quality and produces convenient graphical summaries for these metrics. QPLOT is computationally efficient, generates webpages for interactive exploration of detailed results, and can handle the joint output of many sequencing runs. CONCLUSION: QPLOT is an automated tool that facilitates assessment of sequence run quality. We routinely apply QPLOT to ensure quick detection and diagnosis of sequencing run problems. We hope that QPLOT will be useful to the community as well.


Subjects
High-Throughput Nucleotide Sequencing/standards , Software , Data Interpretation, Statistical , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Quality Control , Sequence Alignment/standards , Sequence Alignment/statistics & numerical data , Sequence Analysis, RNA/standards , Sequence Analysis, RNA/statistics & numerical data
3.
J Comput Biol ; 18(11): 1449-64, 2011 Nov.
Article in English | MEDLINE | ID: mdl-21951055

ABSTRACT

Probabilistic approaches for sequence alignment are usually based on pair Hidden Markov Models (HMMs) or Stochastic Context Free Grammars (SCFGs). Recent studies have shown a significant correlation between the content of short indels and their flanking regions, which by definition cannot be modelled by the above two approaches. In this work, we present a context-sensitive indel model based on a pair Tree-Adjoining Grammar (TAG), along with accompanying algorithms for efficient alignment and parameter estimation. The increased precision and statistical power of this model are shown on simulated and real genomic data. As the cost of sequencing plummets, the usefulness of comparative analysis is becoming limited by alignment accuracy rather than data availability. Our results will therefore have an impact on any type of downstream comparative genomics analysis that relies on alignments. Fine-grained studies of small functional regions or disease markers, for example, could be significantly improved by our method. The implementation is available at www.mcb.mcgill.ca/~blanchem/software.html.


Subjects
INDEL Mutation , Models, Statistical , Sequence Alignment/methods , Algorithms , Bayes Theorem , Computer Simulation , Genome, Human , Genome-Wide Association Study , Humans , Likelihood Functions , Markov Chains , Models, Genetic , Reference Standards , Sequence Alignment/standards , Sequence Analysis, DNA/methods
4.
Bioinformatics ; 27(8): 1157-8, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21320865

ABSTRACT

I propose a new application of profile Hidden Markov Models in the area of SNP discovery from resequencing data, to greatly reduce false SNP calls caused by misalignments around insertions and deletions (indels). The central concept is per-Base Alignment Quality (BAQ), which accurately measures the probability of a read base being wrongly aligned. The effectiveness of BAQ has been confirmed on large datasets by the 1000 Genomes Project analysis subgroup. AVAILABILITY: http://samtools.sourceforge.net CONTACT: hengli@broadinstitute.org.
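
As a rough illustration of how a per-base misalignment probability can be combined with a sequencer-reported base quality, the sketch below converts the probability to the Phred scale and keeps the lower of the two values. The abstract does not spell out this combination rule, so taking the minimum is an assumption made here, and `phred` and `effective_quality` are hypothetical helper names rather than samtools functions. The profile HMM that would actually produce the misalignment probability is not shown.

```python
import math


def phred(p_error: float, cap: int = 60) -> int:
    """Convert an error probability to a Phred-scaled integer quality."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_error == 0.0:
        return cap  # avoid log10(0); the cap value is an arbitrary choice
    return min(cap, int(round(-10.0 * math.log10(p_error))))


def effective_quality(base_qual: int, p_misaligned: float) -> int:
    """Assumed rule: a base is no more reliable than its alignment."""
    return min(base_qual, phred(p_misaligned))


# A confidently called base (Q40) next to an indel, where the
# misalignment probability is assumed to be 0.1 for illustration.
print(effective_quality(40, 0.1))
```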


Subjects
Polymorphism, Single Nucleotide , Sequence Alignment/methods , Algorithms , Base Sequence , Genomics , INDEL Mutation , Markov Chains , Sequence Alignment/standards
5.
Proteomics ; 11(6): 1114-24, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21298787

ABSTRACT

As high-resolution instruments are becoming standard in proteomics laboratories, label-free quantification using precursor measurements is becoming a viable option and is consequently rapidly gaining popularity. Several software solutions have been presented for label-free analysis, but to our knowledge no conclusive studies regarding the sensitivity and reliability of each step of the analysis procedure have been described. Here, we use real complex samples to assess the reliability of label-free quantification using four different software solutions. A generic approach to quality-testing quantitative label-free LC-MS is introduced. Measures for evaluation are defined for feature detection, alignment and quantification. All steps of the analysis could be considered adequately performed by the software solutions used, although differences and possibilities for improvement could be identified. The described method provides an effective testing procedure, which can help the user to quickly pinpoint where in the workflow changes are needed.


Subjects
Proteomics/statistics & numerical data , Proteomics/standards , Software , Tandem Mass Spectrometry/statistics & numerical data , Tandem Mass Spectrometry/standards , Algorithms , Chromatography, Liquid/standards , Chromatography, Liquid/statistics & numerical data , Computational Biology , Data Interpretation, Statistical , Databases, Protein/statistics & numerical data , Humans , Proteins/isolation & purification , Quality Control , Reproducibility of Results , Sequence Alignment/standards , Sequence Alignment/statistics & numerical data , Workflow
6.
Nucleic Acids Res ; 34(16): 4364-74, 2006.
Article in English | MEDLINE | ID: mdl-16936316

ABSTRACT

We have developed MUMMALS, a program to construct multiple protein sequence alignments using probabilistic consistency. MUMMALS improves alignment quality by using pairwise alignment hidden Markov models (HMMs) with multiple match states that describe local structural information without exploiting explicit structure predictions. Parameters for such models have been estimated from a large library of structure-based alignments. We show that (i) on remote homologs, MUMMALS achieves statistically best accuracy among several leading aligners, such as ProbCons, MAFFT and MUSCLE, although the average improvement is small, on the order of several percent; (ii) a large collection (>10 000) of automatically computed pairwise structure alignments of divergent protein domains is superior to smaller but carefully curated datasets for estimation of alignment parameters and performance tests; (iii) reference-independent evaluation of alignment quality using sequence alignment-dependent structure superpositions correlates well with reference-dependent evaluation that compares sequence-based alignments to structure-based reference alignments.


Subjects
Markov Chains , Sequence Alignment/methods , Sequence Analysis, Protein , Software , Protein Structure, Secondary , Protein Structure, Tertiary , Reference Standards , Sequence Alignment/standards
7.
Nucleic Acids Res ; 33(22): 7120-8, 2005.
Article in English | MEDLINE | ID: mdl-16361270

ABSTRACT

Multiple sequence alignments play a central role in the annotation of novel genomes. Given the biological and computational complexity of this task, the automatic generation of high-quality alignments remains challenging. Since multiple alignments are usually employed at the very start of data analysis pipelines, it is crucial to ensure high alignment quality. We describe a simple, yet elegant, solution to assess the biological accuracy of alignments automatically. Our approach is based on the comparison of several alignments of the same sequences. We introduce two functions to compare alignments: the average overlap score and the multiple overlap score. The former identifies difficult alignment cases by expressing the similarity among several alignments, while the latter estimates the biological correctness of individual alignments. We implemented both functions in the MUMSA program and demonstrate the overall robustness and accuracy of both functions on three large benchmark sets.
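
The abstract names an average overlap score but does not give its formula. The sketch below therefore assumes one common pair-based formulation: each alignment is reduced to the set of residue pairs it places in the same column, the overlap between two alignments is twice the number of shared pairs divided by the total number of pairs, and the average is taken over all pairs of alignments. The function names are hypothetical and this is not the MUMSA implementation; the multiple overlap score is not attempted.

```python
from itertools import combinations


def aligned_pairs(alignment: list[str]) -> set[tuple[int, int, int, int]]:
    """Return every pair of residues placed in the same column.

    Each pair is (row_a, residue_index_a, row_b, residue_index_b), where
    residue indices count the non-gap characters within a row.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        for (ra, ia), (rb, ib) in combinations(present, 2):
            pairs.add((ra, ia, rb, ib))
    return pairs


def overlap(a: list[str], b: list[str]) -> float:
    """Assumed overlap: shared aligned pairs relative to all pairs."""
    pa, pb = aligned_pairs(a), aligned_pairs(b)
    if not pa and not pb:
        return 1.0
    return 2 * len(pa & pb) / (len(pa) + len(pb))


def average_overlap(alignments: list[list[str]]) -> float:
    """Mean overlap over all pairs of alignments of the same sequences."""
    if len(alignments) < 2:
        raise ValueError("need at least two alignments to compare")
    scores = [overlap(a, b) for a, b in combinations(alignments, 2)]
    return sum(scores) / len(scores)


# Two alternative alignments of the same two sequences (ACGT and ACAGT).
aln1 = ["AC-GT", "ACAGT"]
aln2 = ["ACG-T", "ACAGT"]
print(average_overlap([aln1, aln2]))
```

Under this assumed definition, a low average suggests that the aligners disagree and the case is difficult, which matches the role the abstract assigns to the average overlap score.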


Subjects
Sequence Alignment/standards , Algorithms , ROC Curve , Reproducibility of Results , Sequence Alignment/methods , Sequence Analysis, Protein
8.
Mol Phylogenet Evol ; 36(3): 641-53, 2005 Sep.
Article in English | MEDLINE | ID: mdl-15935703

ABSTRACT

The outcome of a phylogenetic analysis based on DNA sequence data is highly dependent on the homology-assignment step and may vary with alignment parameter costs. Robustness to changes in parameter costs is therefore a desired quality of a data set because the final conclusions will be less dependent on selecting a precise optimal cost set. Here, node stability is explored in relation to separate versus combined analysis in three different data sets, all including several data partitions. Robustness to changes in cost sets is measured as the number of successive changes that can be made in a given cost set before a specific clade is lost. The changes are in all cases base change cost, gap penalties, and adding/removing/changing affine gap costs. When data partitions were combined, the number of clades that appear in the entire parameter space was not remarkably increased; in some cases this number even decreased. However, when data partitions were combined, the trees from cost sets including affine gap costs were always more similar to one another than the trees from cost sets without affine gap costs. This was not the case when the data partitions were analyzed independently. When data sets were combined, approximately 80% of the clades found under cost sets including affine gap costs resisted at least one change to the cost set.


Subjects
Phylogeny , Sequence Alignment/methods , Animals , Arthropods/genetics , Hordeum/genetics , Models, Genetic , Sequence Alignment/standards
9.
BMC Bioinformatics ; 5: 149, 2004 Oct 14.
Article in English | MEDLINE | ID: mdl-15485572

ABSTRACT

BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion, which combines the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds, which extend spaced seeds with the ability to distinguish transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: The proposed algorithmic ideas yield a significant gain in the sensitivity of similarity search without an increase in execution time. The method has been implemented in the YASS software, available at http://www.loria.fr/projects/YASS/.
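
To make the idea of a transition-constrained seed concrete, the sketch below tests whether a seed pattern hits at a given pair of positions in two DNA strings. The three-symbol notation ('#' for a required match, '@' for a match or a transition, '-' for a position that is ignored) and the function name `seed_hits` are assumptions made for illustration. The group criterion and the indexing that a real search tool would use to find candidate positions are not shown.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hits(seed: str, s: str, t: str, i: int, j: int) -> bool:
    """Check whether `seed` matches s starting at i against t starting at j.

    Seed symbols (notation assumed here):
      '#' requires identical nucleotides,
      '@' accepts identical nucleotides or a transition (A<->G, C<->T),
      '-' accepts any pair of nucleotides.
    """
    if i < 0 or j < 0 or i + len(seed) > len(s) or j + len(seed) > len(t):
        return False
    for k, symbol in enumerate(seed):
        a, b = s[i + k], t[j + k]
        if symbol == "#":
            if a != b:
                return False
        elif symbol == "@":
            if a != b and (a, b) not in TRANSITIONS:
                return False
        elif symbol != "-":
            raise ValueError(f"unknown seed symbol: {symbol!r}")
    return True


# C/T at the second position is a transition, so '@' tolerates it
# while '#' does not.
print(seed_hits("#@-#", "ACGT", "ATAT", 0, 0))
print(seed_hits("##-#", "ACGT", "ATAT", 0, 0))
```

A plain spaced seed corresponds to using only '#' and '-'; the '@' symbol adds a middle level of strictness between the two.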


Subjects
DNA/genetics , Sequence Alignment/methods , Sequence Alignment/standards , Algorithms , Animals , Chromosomes, Human, X/genetics , DNA, Bacterial/genetics , DNA, Fungal/genetics , Drosophila/genetics , Humans , Markov Chains , Models, Statistical , Neisseria meningitidis/genetics , Saccharomyces cerevisiae/genetics
10.
Bioinformatics ; 18(9): 1243-9, 2002 Sep.
Article in English | MEDLINE | ID: mdl-12217916

ABSTRACT

MOTIVATION: The best quality multiple sequence alignments are generally considered to derive from structural superposition. However, no previous work has studied the relative performance of profile hidden Markov models (HMMs) derived from such alignments. Therefore several alignment methods have been used to generate multiple sequence alignments from 348 structurally aligned families in the HOMSTRAD database. The performance of profile HMMs derived from the structural and sequence-based alignments has been assessed for homologue detection. RESULTS: The best alignment methods studied here correctly align nearly 80% of residues with respect to structure alignments. Alignment quality and model sensitivity are found to be dependent on average number, length, and identity of sequences in the alignment. The striking conclusion is that, although structural data may improve the quality of multiple sequence alignments, this does not add to the ability of the derived profile HMMs to find sequence homologues. SUPPLEMENTARY INFORMATION: A list of HOMSTRAD families used in this study and the corresponding Pfam families is available at http://www.sanger.ac.uk/Users/sgj/alignments/map.html CONTACT: sgj@sanger.ac.uk


Subjects
Database Management Systems , Databases, Genetic , Gene Expression Profiling/methods , Information Storage and Retrieval/methods , Sequence Alignment/methods , Sequence Homology , Amino Acid Sequence , Evaluation Studies as Topic , Internet , Markov Chains , Models, Genetic , Models, Statistical , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/standards
11.
Bioinformatics ; 18(3): 496-7, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934755

ABSTRACT

SUMMARY: A public server for evaluating the accuracy of protein sequence alignment methods is presented. CASA is an implementation of the alignment accuracy benchmark presented by Sauder et al. (Proteins, 40, 6-22, 2000). The benchmark currently contains 39321 pairwise protein structure alignments produced with the CE program from SCOP domain definitions. The server produces graphical and tabular comparisons of the accuracy of a user's input sequence alignments with other commonly used programs, such as BLAST, PSI-BLAST, Clustal W, and SAM-T99. AVAILABILITY: The server is located at http://capb.dbi.udel.edu/casa.


Subjects
Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Sequence Alignment/standards , Software , Algorithms , Calibration , Computing Methodologies , Evaluation Studies as Topic , Internet , National Library of Medicine (U.S.) , Sequence Analysis, Protein/methods , United States
12.
Bioinformatics ; 17(8): 713-20, 2001 Aug.
Article in English | MEDLINE | ID: mdl-11524372

ABSTRACT

MOTIVATION: SAM-T99 is an iterative hidden Markov model-based method for finding proteins similar to a single target sequence and aligning them. One of its main uses is to produce multiple alignments of homologs of the target sequence. Previous tests of SAM-T99 and its predecessors have concentrated on the quality of the searches performed, not on the quality of the multiple alignment. In this paper we report on tests of multiple alignment quality, comparing SAM-T99 to the standard multiple aligner, CLUSTALW. RESULTS: The paper evaluates the multiple-alignment aspect of the SAM-T99 protocol, using the BAliBASE benchmark alignment database. On these benchmarks, SAM-T99 is comparable in accuracy with ClustalW. AVAILABILITY: The SAM-T99 protocol can be run on the web at http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html and the alignment tune-up option described here can be run at http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-tuneup.html. The protocol is also part of the standard SAM suite of tools. http://www.cse.ucsc.edu/research/compbio/sam/


Subjects
Databases, Protein , Proteins/chemistry , Proteins/genetics , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Markov Chains , Sequence Alignment/standards
13.
Protein Sci ; 7(2): 445-56, 1998 Feb.
Article in English | MEDLINE | ID: mdl-9521122

ABSTRACT

We apply a simple method for aligning protein sequences on the basis of a 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment minimizing coordinate difference. Because of its simplicity, our method can be readily modified to take into account additional features of protein structure such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g., depending on whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and detailed comparison highlights how particular protein structural features (such as certain strands) are problematic to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.


Subjects
Proteins/chemistry , Reference Standards , Sequence Alignment/standards , Amino Acid Sequence , Automation , Molecular Sequence Data , Protein Conformation , Proteins/classification , Sequence Alignment/methods