Results 1 - 17 of 17
1.
Methods Mol Biol ; 2231: 3-16, 2021.
Article in English | MEDLINE | ID: mdl-33289883

ABSTRACT

Clustal Omega is a version, completely rewritten and revised in 2011, of the widely used Clustal series of programs for multiple sequence alignment. It can deal with very large numbers (many tens of thousands) of DNA/RNA or protein sequences due to its use of the mBed algorithm for calculating guide-trees. This algorithm allows very large alignment problems to be tackled very quickly, even on personal computers. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models. The program currently is used from the command-line or can be run online.


Subject(s)
Sequence Alignment/methods; Sequence Analysis, Protein/methods; Algorithms; Amino Acid Sequence; Base Sequence; Software
2.
Bioinformatics ; 36(1): 90-95, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31292629

ABSTRACT

MOTIVATION: Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs (SP) score if the results are averaged over many alignments, but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. RESULTS: We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. AVAILABILITY AND IMPLEMENTATION: QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar and comprises reference and non-reference sequence sets and a scoring script. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms; Benchmarking; Sequence Alignment; Benchmarking/methods; Protein Structure, Secondary; Sequence Alignment/methods; Software
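The sum-of-pairs score referred to in the abstract above can be illustrated as the fraction of residue pairs aligned in a reference alignment that are also aligned in a test alignment. The following is a minimal sketch under stated assumptions: alignments are lists of equal-length gapped strings with `-` as the gap character, both alignments list the same sequences in the same order, and the function names are hypothetical. It is not the QuanTest2 scoring script.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Collect every pair of residues that share a column.

    A residue is identified as (sequence_index, residue_index), where
    residue_index counts non-gap characters from the left.
    """
    pairs = set()
    counters = [0] * len(alignment)  # residues seen so far per sequence
    for col in range(len(alignment[0])):
        residues = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # residues is ordered by sequence index, so pairs are canonical
        pairs.update(combinations(residues, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    test = ["ACG-", "ACTG"]
    print(sum_of_pairs_score(test, reference))
```

In this toy case the test alignment reproduces two of the three reference pairs, so the score is 2/3.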
3.
Protein Sci ; 27(1): 135-145, 2018 01.
Article in English | MEDLINE | ID: mdl-28884485

ABSTRACT

Clustal Omega is a widely used package for carrying out multiple sequence alignment. Here, we describe some recent additions to the package and benchmark some alternative ways of making alignments. These benchmarks are based on protein structure comparisons or predictions and include a recently described method based on secondary structure prediction. In general, Clustal Omega is fast enough to make very large alignments and the accuracy of protein alignments is high when compared to alternative packages. The package is freely available as executables or source code from www.clustal.org or can be run on-line from a variety of sites, especially the EBI www.ebi.ac.uk.


Subject(s)
Proteins/genetics; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Proteins/chemistry
4.
Bioinformatics ; 33(9): 1331-1337, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28093407

ABSTRACT

Motivation: Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA. Results: In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA measures the quality of an entire alignment, however, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA alignment options and by including different levels of mis-alignment into MSA, and examining the effects on the scores. Availability and Implementation: QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz. Contact: quan.le@ucd.ie. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Protein Structure, Secondary; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Algorithms; Amino Acid Sequence; Benchmarking; Computational Biology/methods; Data Accuracy; Sequence Alignment/standards; Sequence Analysis, Protein/standards
5.
Bioinformatics ; 32(6): 814-20, 2016 03 15.
Article in English | MEDLINE | ID: mdl-26568625

ABSTRACT

MOTIVATION: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. RESULTS: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. AVAILABILITY AND IMPLEMENTATION: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. CONTACT: des.higgins@ucd.ie. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Alignment; Algorithms; Proteins; Software
6.
Algorithms Mol Biol ; 10: 26, 2015.
Article in English | MEDLINE | ID: mdl-26457114

ABSTRACT

BACKGROUND: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time. RESULTS: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced. CONCLUSIONS: This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.

7.
BMC Bioinformatics ; 16: 269, 2015 Aug 25.
Article in English | MEDLINE | ID: mdl-26303676

ABSTRACT

BACKGROUND: Multiple sequence alignments (MSA) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous. RESULTS: The software can take an MSA, distance matrix or set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity. CONCLUSION: OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium sized alignments, of a few thousand sequences, it can detect outliers in a few seconds. The software is available at http://www.bioinf.ucd.ie/download/od-seq.tar.gz.


Subject(s)
ATP-Binding Cassette Transporters; Algorithms; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Amino Acid Sequence; Humans; Molecular Sequence Data; Sequence Homology, Amino Acid
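The interquartile-range step described in the abstract above can be sketched as follows. This is an illustration only, not the OD-seq code: it assumes a full symmetric distance matrix is already in memory (whereas the abstract says OD-seq avoids this cost via mBed), it flags only unusually large average distances, and the multiplier `k=1.5` and both function names are assumptions chosen for the example.

```python
import statistics

def average_distances(dist):
    """Mean distance from each sequence to all the others.

    dist is a square, symmetric matrix (list of lists) with zeros on
    the diagonal.
    """
    n = len(dist)
    return [
        sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def iqr_outliers(values, k=1.5):
    """Indices whose value exceeds Q3 + k * IQR (hypothetical threshold)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > upper]

if __name__ == "__main__":
    # Toy average distances: the last sequence sits far from the rest.
    avg = [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.10, 0.90]
    print(iqr_outliers(avg))
```

A bootstrap-based threshold, which the abstract mentions as an alternative, would replace `iqr_outliers` while keeping `average_distances` unchanged.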
9.
Curr Protoc Bioinformatics ; 48: 3.13.1-3.13.16, 2014 Dec 12.
Article in English | MEDLINE | ID: mdl-25501942

ABSTRACT

Clustal Omega is a package for making multiple sequence alignments of amino acid or nucleotide sequences, quickly and accurately. It is a complete upgrade and rewrite of earlier Clustal programs. This unit describes how to run Clustal Omega interactively from a command line, although it can also be run online from several sites. The unit describes a basic protocol for taking a set of unaligned sequences and producing a full alignment. There are also protocols for using an external HMM or iteration to help improve an alignment.


Subject(s)
Sequence Alignment; Software; Amino Acid Sequence; Base Sequence; Sequence Homology, Amino Acid
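The three protocols named in the abstract above (a basic alignment, an external HMM, and iteration) might be driven from a script along the following lines. This is a sketch, not the published protocol: it assumes a `clustalo` executable on the PATH and uses the options `-i`, `-o`, `--outfmt`, `--iter` and `--hmm-in` as documented in the Clustal Omega README; check them against `clustalo --help` for the installed version. The helper names and file names are hypothetical.

```python
import subprocess

def clustalo_command(infile, outfile, outfmt="clu", iterations=0, hmm_in=None):
    """Build an argument list for a Clustal Omega run (nothing is executed)."""
    cmd = ["clustalo", "-i", infile, "-o", outfile, f"--outfmt={outfmt}"]
    if iterations > 0:
        # Iteration protocol: re-align using the previous result
        cmd.append(f"--iter={iterations}")
    if hmm_in is not None:
        # External HMM protocol: supply a profile HMM to guide the alignment
        cmd.append(f"--hmm-in={hmm_in}")
    return cmd

def run_clustalo(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # File names are placeholders for illustration.
    print(" ".join(clustalo_command("seqs.fa", "seqs.aln", iterations=2)))
```

Building the argument list separately from running it keeps the command inspectable before anything is executed.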
10.
BMC Bioinformatics ; 15: 338, 2014 Oct 04.
Article in English | MEDLINE | ID: mdl-25282640

ABSTRACT

BACKGROUND: Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development but there has been little effort at determining systematically, which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods. RESULTS: We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall way short of the optimum attainable scores. On average chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments. CONCLUSIONS: Alignment methods that use Consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods, that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.


Subject(s)
Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Cluster Analysis; Phylogeny; Proteins/genetics
11.
Proc Natl Acad Sci U S A ; 111(29): 10556-61, 2014 Jul 22.
Article in English | MEDLINE | ID: mdl-25002495

ABSTRACT

Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random.


Subject(s)
Proteins/chemistry; Sequence Alignment/methods; Sequence Analysis, Protein; Software; Algorithms; Cytochrome P-450 Enzyme System/chemistry; Databases, Protein; Reference Standards
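To make the contrast between chained and balanced guide trees in the abstract above concrete, the sketch below writes both topologies in Newick format for a list of sequence names. It assumes topology only (no branch lengths) and at least one name; the function names are hypothetical and the code is not taken from any alignment package.

```python
def chained_guide_tree(names):
    """Fully unbalanced (caterpillar) tree: sequences are added one at a time."""
    tree = names[0]
    for name in names[1:]:
        tree = f"({tree},{name})"
    return tree + ";"

def balanced_guide_tree(names):
    """Tree built by recursively halving the list of names."""
    def build(items):
        if len(items) == 1:
            return items[0]
        mid = len(items) // 2
        return f"({build(items[:mid])},{build(items[mid:])})"
    return build(names) + ";"

if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    print(chained_guide_tree(names))   # (((A,B),C),D);
    print(balanced_guide_tree(names))  # ((A,B),(C,D));
```

Building the chained tree takes a single pass over the names and needs no distance calculations, which fits the abstract's remark that these are the fastest and simplest guide trees to construct. Shuffling `names` first gives the random-order variant the abstract mentions.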
12.
Methods Mol Biol ; 1079: 105-16, 2014.
Article in English | MEDLINE | ID: mdl-24170397

ABSTRACT

Clustal Omega is a completely rewritten and revised version of the widely used Clustal series of programs for multiple sequence alignment. It can deal with very large numbers (many tens of thousands) of DNA/RNA or protein sequences due to its use of the mBED algorithm for calculating guide trees. This algorithm allows very large alignment problems to be tackled very quickly, even on personal computers. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models. The program currently is used from the command line or can be run on line.


Subject(s)
Computational Biology/methods; Sequence Alignment/methods; Software; Internet; User-Computer Interface
13.
Bioinformatics ; 29(8): 989-95, 2013 Apr 15.
Article in English | MEDLINE | ID: mdl-23428640

ABSTRACT

MOTIVATION: Recent developments in sequence alignment software have made possible multiple sequence alignments (MSAs) of >100 000 sequences in reasonable times. At present, there are no systematic analyses concerning the scalability of the alignment quality as the number of aligned sequences is increased. RESULTS: We benchmarked a wide range of widely used MSA packages using a selection of protein families with some known structures and found that the accuracy of such alignments decreases markedly as the number of sequences grows. This is more or less true of all packages and protein families. The phenomenon is mostly due to the accumulation of alignment errors, rather than problems in guide-tree construction. This is partly alleviated by using iterative refinement or selectively adding sequences. The average accuracy of progressive methods by comparison with structure-based benchmarks can be improved by incorporating information derived from high-quality structural alignments of sequences with solved structures. This suggests that the availability of high quality curated alignments will have to complement algorithmic and/or software developments in the long-term. AVAILABILITY AND IMPLEMENTATION: Benchmark data used in this study are available at http://www.clustal.org/omega/homfam-20110613-25.tar.gz and http://www.clustal.org/omega/bali3fam-26.tar.gz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Alignment/methods; Sequence Analysis, Protein/methods; Algorithms; Software
14.
Mol Syst Biol ; 7: 539, 2011 Oct 11.
Article in English | MEDLINE | ID: mdl-21988835

ABSTRACT

Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.


Subject(s)
Data Mining/methods; Proteins/analysis; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Systems Biology; Algorithms; Amino Acid Sequence; Base Sequence; Databases, Factual; Molecular Sequence Data; Proteins/chemistry; Software; Systems Biology/instrumentation; Systems Biology/methods
15.
Algorithms Mol Biol ; 5: 21, 2010 May 14.
Article in English | MEDLINE | ID: mdl-20470396

ABSTRACT

BACKGROUND: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N^2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. RESULTS: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. CONCLUSIONS: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
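
The embedding idea in the abstract above can be sketched as follows: each sequence is represented by its distances to a small set of seed sequences, so only N x (number of seeds) distances are computed instead of all N^2 pairs. This is a simplified illustration, not the mBed source code. The Jaccard distance on k-mer sets is a stand-in for a real sequence distance, seed selection is left to the caller, and all names are hypothetical.

```python
def kmer_set(seq, k=2):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=2):
    """Jaccard distance between k-mer sets (stand-in distance measure)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def embed(sequences, seeds, k=2):
    """Map each sequence to a vector of distances to the seed sequences."""
    return [[kmer_distance(s, seed, k) for seed in seeds] for s in sequences]

if __name__ == "__main__":
    sequences = ["ACGT", "ACGA", "TTTT", "TTTA"]
    seeds = ["ACGT", "TTTT"]  # assumed to be chosen beforehand
    for seq, vector in zip(sequences, embed(sequences, seeds)):
        print(seq, vector)
```

The resulting vectors can be passed to any standard vector clustering method to derive a guide tree without computing the full pairwise matrix.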

16.
PLoS One ; 5(12): e14454, 2010 Dec 29.
Article in English | MEDLINE | ID: mdl-21209922

ABSTRACT

BACKGROUND: More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza. METHODOLOGY/PRINCIPAL FINDINGS: This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods. CONCLUSIONS/SIGNIFICANCE: We analyzed 18,975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11,362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. More than 85% of type A influenza HA and NA sequences in GenBank are categorized in one unambiguous and unique group. Therefore, our results are a kind of genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts, indicating cross-species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from humans, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.


Subject(s)
Hemagglutinins/genetics; Influenza A virus/genetics; Influenza in Birds/virology; Influenza, Human/virology; Neuraminidase/genetics; Animals; Birds; Databases, Genetic; Evolution, Molecular; Genetic Variation; Geography; Humans; Multivariate Analysis; Phylogeny
17.
BMC Bioinformatics ; 5: 188, 2004 Dec 01.
Article in English | MEDLINE | ID: mdl-15574202

ABSTRACT

BACKGROUND: Increasingly researchers are turning to the use of haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Maximisation (EM) algorithm, are frequently used for estimating the phase and frequency of the underlying haplotypes. These methods have proved very successful, predicting the phase-known frequencies from data for which the phase is unknown with a high degree of accuracy. Recently there has been much speculation as to the effect of unknown, or missing allelic data - a common phenomenon even with modern automated DNA analysis techniques - on the performance of EM-based methods. To this end an EM-based program, modified to accommodate missing data, has been developed, incorporating non-parametric bootstrapping for the calculation of accurate confidence intervals. RESULTS: Here we present the results of the analyses of various data sets in which randomly selected known alleles have been relabelled as missing. Remarkably, we find that the absence of up to 30% of the data in both biallelic and multiallelic data sets with moderate to strong levels of linkage disequilibrium can be tolerated. Additionally, the frequencies of haplotypes which predominate in the complete data analysis remain essentially the same after the addition of the random noise caused by missing data. CONCLUSIONS: These findings have important implications for the area of data gathering. It may be concluded that small levels of drop out in the data do not affect the overall accuracy of haplotype analysis perceptibly, and that, given recent findings on the effect of inaccurate data, ambiguous data points are best treated as unknown.


Subject(s)
Bias; Gene Frequency; Haplotypes/genetics; Alleles; Cystic Fibrosis/genetics; Cystic Fibrosis Transmembrane Conductance Regulator/genetics; Genetic Markers/genetics; Genetics, Population/statistics & numerical data; Genotype; Humans; Polymorphism, Single Nucleotide/genetics; Sample Size; Software/statistics & numerical data
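The EM approach to haplotype frequency estimation with missing alleles, as described in the abstract above, can be illustrated with a toy version for two biallelic loci. This is a sketch under stated assumptions and not the program from the paper: genotypes are coded per locus as the count of allele 1 (0, 1 or 2), `None` marks a locus whose genotype is missing, bootstrapped confidence intervals are omitted, and all names are hypothetical.

```python
from itertools import product

# The four possible haplotypes over two biallelic loci.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def compatible_pairs(genotype):
    """Ordered haplotype pairs consistent with an unphased genotype.

    A missing locus (None) is compatible with any alleles at that locus.
    """
    pairs = []
    for h1, h2 in product(HAPLOTYPES, repeat=2):
        if all(g is None or h1[locus] + h2[locus] == g
               for locus, g in enumerate(genotype)):
            pairs.append((h1, h2))
    return pairs

def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies by expectation-maximisation."""
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(iterations):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for genotype in genotypes:
            # E-step: weight each compatible phase by its current probability
            pairs = compatible_pairs(genotype)
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: each individual contributes two haplotypes
        n = 2 * len(genotypes)
        freqs = {h: c / n for h, c in counts.items()}
    return freqs

if __name__ == "__main__":
    data = [(0, 0), (2, 2), (1, 1), (1, None), (None, 2)]
    for hap, freq in em_haplotype_frequencies(data).items():
        print(hap, round(freq, 3))
```

Treating a missing locus as compatible with every allele lets the expectation step spread that individual's contribution over all consistent phases instead of discarding the individual.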