Search | BVS Colombia Search Portal

1.

A survey of error-correction methods for next-generation sequencing.

Yang, Xiao; Chockalingam, Sriram P; Aluru, Srinivas.

Brief Bioinform ; 14(1): 56-66, 2013 Jan.

Article in English | MEDLINE | ID: mdl-22492192

ABSTRACT

Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher-quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of standardized evaluation procedures for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. AVAILABILITY: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.

Subject(s)

Sequence Analysis, DNA/trends , Software , Algorithms , Animals , Chromosome Mapping/statistics & numerical data , Chromosome Mapping/trends , Computational Biology , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , Forecasting , Humans , Sequence Alignment/statistics & numerical data , Sequence Alignment/trends , Sequence Analysis, DNA/statistics & numerical data

2.

Data deposition: Missing data mean holes in tree of life.

Drew, Bryan T.

Nature ; 493(7432): 305, 2013 Jan 17.

Article in English | MEDLINE | ID: mdl-23325204

Subject(s)

Data Collection/standards , Databases, Genetic/standards , Phylogeny , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , Databases, Nucleic Acid , Sequence Alignment/statistics & numerical data , Sequence Alignment/trends

3.

Status quo of annotation of human disease variants.

Venselaar, Hanka; Camilli, Franscesca; Gholizadeh, Shima; Snelleman, Marlou; Brunner, Han G; Vriend, Gert.

BMC Bioinformatics ; 14: 352, 2013 Dec 04.

Article in English | MEDLINE | ID: mdl-24305467

ABSTRACT

BACKGROUND: The ever on-going technical developments in Next Generation Sequencing have led to an increase in detected disease related mutations. Many bioinformatics approaches exist to analyse these variants, and of those the methods that use 3D structure information generally outperform those that do not use this information. 3D structure information today is available for about twenty percent of the human exome, and homology modelling can double that fraction. This percentage is rapidly increasing so that we can expect to analyse the majority of all human exome variants in the near future using protein structure information. RESULTS: We collected a test dataset of well-described mutations in proteins for which 3D-structure information is available. This test dataset was used to analyse the possibilities and the limitations of methods based on sequence information alone, hybrid methods, machine learning based methods, and structure based methods. CONCLUSIONS: Our analysis shows that the use of structural features improves the classification of mutations. This study suggests strategies for future analyses of disease causing mutations, and it suggests which bioinformatics approaches should be developed to make progress in this field.

Subject(s)

Computational Biology/methods , Genetic Variation , Molecular Sequence Annotation/methods , Proteins/genetics , Artificial Intelligence , Cluster Analysis , Conserved Sequence/genetics , Databases, Genetic , Exome/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/trends , Humans , Mutation/genetics , Polymorphism, Single Nucleotide/genetics , Proteins/chemistry , Sequence Alignment/trends , Sequence Homology, Amino Acid

4.

Next-generation genome.

Nat Methods ; 5(12): 989, 2008 Dec.

Article in English | MEDLINE | ID: mdl-19054852

ABSTRACT

Sequencing technology is now advanced enough to decode individual human genomes. Will it prove to be better than existing methods for discovering the genetic basis of human phenotypic variation?

Subject(s)

Chromosome Mapping/trends , Genetic Variation/genetics , Linkage Disequilibrium/genetics , Polymorphism, Single Nucleotide/genetics , Sequence Alignment/trends , Sequence Analysis, DNA/trends

5.

The RNA structure alignment ontology.

Brown, James W; Birmingham, Amanda; Griffiths, Paul E; Jossinet, Fabrice; Kachouri-Lafond, Rym; Knight, Rob; Lang, B Franz; Leontis, Neocles; Steger, Gerhard; Stombaugh, Jesse; Westhof, Eric.

RNA ; 15(9): 1623-31, 2009 Sep.

Article in English | MEDLINE | ID: mdl-19622678

ABSTRACT

Multiple sequence alignments are powerful tools for understanding the structures, functions, and evolutionary histories of linear biological macromolecules (DNA, RNA, and proteins), and for finding homologs in sequence databases. We address several ontological issues related to RNA sequence alignments that are informed by structure. Multiple sequence alignments are usually shown as two-dimensional (2D) matrices, with rows representing individual sequences, and columns identifying nucleotides from different sequences that correspond structurally, functionally, and/or evolutionarily. However, the requirement that sequences and structures correspond nucleotide-by-nucleotide is unrealistic and hinders representation of important biological relationships. High-throughput sequencing efforts are also rapidly making 2D alignments unmanageable because of vertical and horizontal expansion as more sequences are added. Solving the shortcomings of traditional RNA sequence alignments requires explicit annotation of the meaning of each relationship within the alignment. We introduce the notion of "correspondence," which is an equivalence relation between RNA elements in sets of sequences as the basis of an RNA alignment ontology. The purpose of this ontology is twofold: first, to enable the development of new representations of RNA data and of software tools that resolve the expansion problems with current RNA sequence alignments, and second, to facilitate the integration of sequence data with secondary and three-dimensional structural information, as well as other experimental information, to create simultaneously more accurate and more exploitable RNA alignments.

Subject(s)

RNA/analysis , Sequence Alignment/methods , Software , Animals , Base Sequence , Humans , Models, Biological , Molecular Sequence Data , Nucleic Acid Conformation , Phylogeny , RNA/chemistry , Sequence Alignment/trends , Sequence Analysis, RNA/methods , Sequence Homology, Nucleic Acid

6.

Recent trends in molecular phylogenetic analysis: where to next?

Blair, Christopher; Murphy, Robert W.

J Hered ; 102(1): 130-8, 2011.

Article in English | MEDLINE | ID: mdl-20696667

ABSTRACT

The acquisition of large multilocus sequence data is providing researchers with an unprecedented amount of information to resolve difficult phylogenetic problems. With these large quantities of data comes the increasing challenge of choosing the best methods of analysis. We review the current trends in molecular phylogenetic analysis, focusing specifically on the topics of multiple sequence alignment and methods of tree reconstruction. We suggest that traditional methods are inadequate for these highly heterogeneous data sets and that researchers employ newer, more sophisticated search algorithms in their analyses. If we are to best extract the information present in these data sets, a sound understanding of basic phylogenetic principles combined with modern methodological techniques is necessary.

Subject(s)

Databases, Genetic , Phylogeny , Sequence Alignment/trends , Sequence Analysis, DNA/trends , Algorithms , Evolution, Molecular , Genetic Loci , Models, Biological , Species Specificity

7.

Recent developments in the MAFFT multiple sequence alignment program.

Katoh, Kazutaka; Toh, Hiroyuki.

Brief Bioinform ; 9(4): 286-98, 2008 Jul.

Article in English | MEDLINE | ID: mdl-18372315

ABSTRACT

The accuracy and scalability of multiple sequence alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive alignment and the latter improved the accuracy of ncRNA alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

Subject(s)

Algorithms , Artificial Intelligence , Pattern Recognition, Automated/trends , Sequence Alignment/trends , Sequence Analysis/trends , Software/trends , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis/methods

8.

Pfam 10 years on: 10,000 families and still growing.

Sammut, Stephen John; Finn, Robert D; Bateman, Alex.

Brief Bioinform ; 9(3): 210-9, 2008 May.

Article in English | MEDLINE | ID: mdl-18344544

ABSTRACT

Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.

Subject(s)

Database Management Systems/trends , Databases, Protein/trends , Information Storage and Retrieval/trends , Proteins/chemistry , Proteins/classification , Sequence Alignment/trends , Sequence Analysis, Protein/trends

9.

Standardizing the next generation of bioinformatics software development with BioHDF (HDF5).

Mason, Christopher E; Zumbo, Paul; Sanders, Stephan; Folk, Mike; Robinson, Dana; Aydt, Ruth; Gollery, Martin; Welsh, Mark; Olson, N Eric; Smith, Todd M.

Adv Exp Med Biol ; 680: 693-700, 2010.

Article in English | MEDLINE | ID: mdl-20865556

ABSTRACT

Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better scalability, can support multiple indexes, store multiple data types, and are self-describing. For these reasons, HDF5 and its BioHDF extension are well suited for implementing data models to support the next generation of bioinformatics applications.

Subject(s)

Sequence Alignment/statistics & numerical data , Sequence Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Database Management Systems , Databases, Genetic , Sequence Alignment/standards , Sequence Alignment/trends , Sequence Analysis/standards , Sequence Analysis/trends , Software/standards , Software/trends , Software Design , User-Computer Interface

10.

ProfileGrids as a new visual representation of large multiple sequence alignments: a case study of the RecA protein family.

Roca, Alberto I; Almada, Albert E; Abajian, Aaron C.

BMC Bioinformatics ; 9: 554, 2008 Dec 22.

Article in English | MEDLINE | ID: mdl-19102758

ABSTRACT

BACKGROUND: Multiple sequence alignments are a fundamental tool for the comparative analysis of proteins and nucleic acids. However, large data sets are no longer manageable for visualization and investigation using the traditional stacked sequence alignment representation. RESULTS: We introduce ProfileGrids that represent a multiple sequence alignment as a matrix color-coded according to the residue frequency occurring at each column position. JProfileGrid is a Java application for computing and analyzing ProfileGrids. A dynamic interaction with the alignment information is achieved by changing the ProfileGrid color scheme, by extracting sequence subsets at selected residues of interest, and by relating alignment information to residue physical properties. Conserved family motifs can be identified by the overlay of similarity plot calculations on a ProfileGrid. Figures suitable for publication can be generated from the saved spreadsheet output of the colored matrices as well as by the export of conservation information for use in the PyMOL molecular visualization program. We demonstrate the utility of ProfileGrids on 300 bacterial homologs of the RecA family - a universally conserved protein involved in DNA recombination and repair. Careful attention was paid to curating the collected RecA sequences since ProfileGrids allow the easy identification of rare residues in an alignment. We relate the RecA alignment sequence conservation to the following three topics: the recently identified DNA binding residues, the unexplored MAW motif, and a unique Bacillus subtilis RecA homolog sequence feature. CONCLUSION: ProfileGrids allow large protein families to be visualized more effectively than the traditional stacked sequence alignment form. This new graphical representation facilitates the determination of the sequence conservation at residue positions of interest, enables the examination of structural patterns by using residue physical properties, and permits the display of rare sequence features within the context of an entire alignment. JProfileGrid is free for non-commercial use and is available from http://www.profilegrid.org. Furthermore, we present a curated RecA protein collection that is more diverse than previous data sets; and, therefore, this RecA ProfileGrid is a rich source of information for nanoanatomy analysis.

Subject(s)

Bacterial Proteins/chemistry , Multigene Family , Rec A Recombinases/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Software/trends
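
Illustrative note on record 10: the abstract describes a ProfileGrid as a matrix built from the residue frequency at each alignment column. The Python sketch below shows one plausible way to compute such per-column frequencies. It is not the JProfileGrid implementation; the function name `profile_grid` and the toy alignment are assumptions made for illustration, and the color-coding step is omitted.

```python
from collections import Counter


def profile_grid(alignment):
    """Return, for each alignment column, a dict of residue -> frequency.

    `alignment` is a list of equal-length aligned sequences (gaps as '-').
    Hypothetical helper for illustration; not part of JProfileGrid.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment)
    grid = []
    for col in range(width):
        # Count how often each residue (or gap) occurs in this column.
        counts = Counter(seq[col] for seq in alignment)
        grid.append({residue: n / total for residue, n in counts.items()})
    return grid


# Toy four-sequence alignment (made-up data, not RecA sequences).
toy_alignment = ["MAKD", "MARD", "MGKD", "M-KE"]
grid = profile_grid(toy_alignment)
```

In this sketch a fully conserved column yields a single entry with frequency 1.0, while rare residues show up as low-frequency entries, which is the property the abstract highlights for spotting unusual residues.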

11.

Multiple sequence alignments.

Wallace, Iain M; Blackshields, Gordon; Higgins, Desmond G.

Curr Opin Struct Biol ; 15(3): 261-6, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15963889

ABSTRACT

Multiple sequence alignments are very widely used in all areas of DNA and protein sequence analysis. The main methods that are still in use are based on 'progressive alignment' and date from the mid to late 1980s. Recently, some dramatic improvements have been made to the methodology with respect either to speed and capacity to deal with large numbers of sequences or to accuracy. There have also been some practical advances concerning how to combine three-dimensional structural information with primary sequences to give more accurate alignments, when structures are available.

Subject(s)

Algorithms , DNA/chemistry , Models, Molecular , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis/methods , Amino Acid Sequence , Base Sequence , Computer Simulation , DNA/analysis , DNA/classification , Models, Chemical , Molecular Conformation , Molecular Sequence Data , Proteins/analysis , Proteins/classification , Sequence Alignment/trends , Sequence Analysis/trends , Sequence Homology , Software , Structure-Activity Relationship

12.

The limits of protein sequence comparison?

Pearson, William R; Sierk, Michael L.

Curr Opin Struct Biol ; 15(3): 254-60, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15919194

ABSTRACT

Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.

Subject(s)

Algorithms , Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis , Proteins/classification , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Sequence Homology, Amino Acid , Structure-Activity Relationship

13.

Predicting active site residue annotations in the Pfam database.

Mistry, Jaina; Bateman, Alex; Finn, Robert D.

BMC Bioinformatics ; 8: 298, 2007 Aug 09.

Article in English | MEDLINE | ID: mdl-17688688

ABSTRACT

BACKGROUND: Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. DESCRIPTION: We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. CONCLUSION: We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200 fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.

Subject(s)

Databases, Protein , Amino Acid Sequence , Binding Sites , Databases, Protein/trends , Molecular Sequence Data , Predictive Value of Tests , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Homology, Amino Acid , Software Design

14.

A parameterized algorithm for protein structure alignment.

Xu, Jinbo; Jiao, Feng; Berger, Bonnie.

J Comput Biol ; 14(5): 564-77, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17683261

ABSTRACT

This paper proposes a parameterized polynomial time approximation scheme (PTAS) for aligning two protein structures, in the case where one protein structure is represented by a contact map graph and the other by a contact map graph or a distance matrix. If the sequential order of alignment is not required, the time complexity is polynomial in the protein size and exponential with respect to two parameters D(u)/D(l) and D(c)/D(l), which usually can be treated as constants. In particular, D(u) is the distance threshold determining if two residues are in contact or not, D(c) is the maximally allowed distance between two matched residues after two proteins are superimposed, and D(l) is the minimum inter-residue distance in a typical protein. This result clearly demonstrates that the computational hardness of the contact map based protein structure alignment problem is related not to protein size but to several parameters modeling the problem. The result is achieved by decomposing the protein structure using tree decomposition and discretizing the rigid-body transformation space. Preliminary experimental results indicate that on a Linux PC, it takes from ten minutes to one hour to align two proteins with approximately 100 residues.

Subject(s)

Algorithms , Computational Biology/methods , Sequence Alignment/methods , Structural Homology, Protein , Animals , Computational Biology/trends , Flavodoxin/chemistry , Humans , Protein Folding , Sequence Alignment/trends

15.

Clustered sequence representation for fast homology search.

Cameron, Michael; Bernstein, Yaniv; Williams, Hugh E.

J Comput Biol ; 14(5): 594-614, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17683263

ABSTRACT

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.

Subject(s)

Databases, Protein , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Amino Acid Sequence , Animals , Databases, Protein/trends , Humans , Molecular Sequence Data , Sequence Alignment/trends , Sequence Analysis, Protein/trends

16.

Effects of long-range correlations in DNA on sequence alignment score statistics.

Messer, Philipp W; Bundschuh, Ralf; Vingron, Martin; Arndt, Peter F.

J Comput Biol ; 14(5): 655-68, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17683266

ABSTRACT

Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter lambda of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.

Subject(s)

Computer Simulation , Models, Genetic , Models, Statistical , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Sequence Homology, Nucleic Acid , Animals , Humans , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/trends

17.

A reduction-based exact algorithm for the contact map overlap problem.

Xie, Wei; Sahinidis, Nikolaos V.

J Comput Biol ; 14(5): 637-54, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17683265

ABSTRACT

Aligning proteins based on their structural similarity is a fundamental problem in molecular biology with applications in many settings, including structure classification, database search, function prediction, and assessment of folding prediction methods. Structural alignment can be done via several methods, including contact map overlap (CMO) maximization that aligns proteins in a way that maximizes the number of common residue contacts. In this paper, we develop a reduction-based exact algorithm for the CMO problem. Our approach solves CMO directly rather than after transformation to other combinatorial optimization problems. We exploit the mathematical structure of the problem in order to develop a number of efficient lower bounding, upper bounding, and reduction schemes. Computational experiments demonstrate that our algorithm runs significantly faster than existing exact algorithms and solves some hard CMO instances that were not solved in the past. In addition, the algorithm produces protein clusters that are in excellent agreement with the SCOP classification. An implementation of our algorithm is accessible as an on-line server at http://eudoxus.scs.uiuc.edu/cmos/cmos.html.

Subject(s)

Algorithms , Sequence Alignment , Sequence Analysis, Protein , Structural Homology, Protein , Animals , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Computational Biology/trends , Models, Chemical , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/trends
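
Illustrative note on record 17: the abstract defines contact map overlap (CMO) as aligning two proteins so that the number of common residue contacts is maximized. The Python sketch below makes that objective concrete for toy inputs. It is a brute-force enumeration over order-preserving alignments, not the reduction-based exact algorithm of the paper, and is only feasible for a handful of residues. The function names and the toy contact maps are assumptions made for illustration.

```python
from itertools import combinations


def contact_overlap(contacts_a, contacts_b, mapping):
    """Count contacts (i, j) of protein A whose mapped pair is a contact of B.

    `mapping` sends residue indices of A to residue indices of B.
    """
    undirected_b = {frozenset(c) for c in contacts_b}
    shared = 0
    for i, j in contacts_a:
        if i in mapping and j in mapping:
            if frozenset((mapping[i], mapping[j])) in undirected_b:
                shared += 1
    return shared


def max_contact_overlap(n_a, n_b, contacts_a, contacts_b):
    """Exhaustively search order-preserving alignments (toy sizes only)."""
    best = 0
    for k in range(1, min(n_a, n_b) + 1):
        for sub_a in combinations(range(n_a), k):
            for sub_b in combinations(range(n_b), k):
                # Pairing sorted index tuples keeps the alignment sequential.
                mapping = dict(zip(sub_a, sub_b))
                score = contact_overlap(contacts_a, contacts_b, mapping)
                best = max(best, score)
    return best


# Toy contact maps (made-up data): A has 3 residues, B has 4.
contacts_a = {(0, 1), (1, 2)}
contacts_b = {(0, 2), (2, 3), (0, 1)}
best_overlap = max_contact_overlap(3, 4, contacts_a, contacts_b)
```

The exhaustive search grows combinatorially with protein length, which is why exact CMO solvers rely on bounding and reduction schemes such as those the abstract mentions.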

18.

Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps.

Ohlson, Tomas; Aggarwal, Varun; Elofsson, Arne; MacCallum, Robert M.

BMC Bioinformatics ; 7: 357, 2006 Jul 25.

Article in English | MEDLINE | ID: mdl-16869963

ABSTRACT

BACKGROUND: Protein sequence alignment is one of the basic tools in bioinformatics. Correct alignments are required for a range of tasks including the derivation of phylogenetic trees and protein structure prediction. Numerous studies have shown that the incorporation of predicted secondary structure information into alignment algorithms improves their performance. Secondary structure predictors have to be trained on a set of somewhat arbitrarily defined states (e.g. helix, strand, coil), and it has been shown that the choice of these states has some effect on alignment quality. However, it is not unlikely that prediction of other structural features also could provide an improvement. In this study we use an unsupervised clustering method, the self-organizing map, to assign sequence profile windows to "structural states" and assess their use in sequence alignment. RESULTS: The addition of self-organizing map locations as inputs to a profile-profile scoring function improves the alignment quality of distantly related proteins slightly. The improvement is slightly smaller than that gained from the inclusion of predicted secondary structure. However, the information seems to be complementary as the two prediction schemes can be combined to improve the alignment quality by a further small but significant amount. CONCLUSION: It has been observed in many studies that predicted secondary structure significantly improves the alignments. Here we have shown that the addition of self-organizing map locations can further improve the alignments as the self-organizing map locations seem to contain some information that is not captured by the predicted secondary structure.

Subject(s)

Evolution, Molecular , Protein Structure, Secondary , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Computational Biology/methods , Computational Biology/trends , Databases, Protein , Forecasting , Neural Networks, Computer , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Sequence Homology, Amino Acid

19.

Discovering new genes with advanced homology detection.

Li, Weizhong; Godzik, Adam.

Trends Biotechnol ; 20(8): 315-6, 2002 Aug.

Article in English | MEDLINE | ID: mdl-12127268

ABSTRACT

Most genome annotation protocols combine ab initio predictions with transcription and homology analyses to produce reliable gene predictions but they often fail to detect many actual genes. Alternative approaches involving more sensitive homology recognition methods are playing an increasingly important role in the next stage of gene discovery. The hunt for new genes is far from over.

Subject(s)

Database Management Systems , Databases, Genetic , Gene Expression Profiling/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Genome, Human , Humans , Markov Chains , Sequence Alignment/trends , Software , beta-Defensins/genetics

20.

DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors.

Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard.

BMC Bioinformatics ; 5: 128, 2004 Sep 09.

Article in English | MEDLINE | ID: mdl-15357879

ABSTRACT

BACKGROUND: Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. RESULTS: Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. CONCLUSIONS: By distributing sub-routines to multiple processors, the running time of DIALIGN can be substantially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.

Subject(s)

Sequence Alignment/methods , Sequence Alignment/trends , Software , Computational Biology/methods , Genome
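
Illustrative note on record 20: strategy (a) in the abstract relies on the fact that alignments of different sequence pairs are independent, so they can be handed to separate workers without changing the result. The Python sketch below demonstrates that idea only. The scoring function is a plain Needleman-Wunsch global score used as a stand-in (DIALIGN itself uses a segment-based method), and the names `global_alignment_score` and `all_pairs_scores`, the scoring parameters, and the toy sequences are all assumptions. A thread pool is used for portability; in CPython a process pool would be needed for an actual speed-up on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (stand-in for DIALIGN scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairs_scores(seqs, workers=4):
    """Score every sequence pair; each pair is an independent task."""
    pairs = list(combinations(range(len(seqs)), 2))

    def score_pair(pair):
        i, j = pair
        return global_alignment_score(seqs[i], seqs[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `pairs`.
        return dict(zip(pairs, pool.map(score_pair, pairs)))


# Toy sequences (made-up data).
toy_seqs = ["ACGT", "ACGT", "AGT"]
pair_scores = all_pairs_scores(toy_seqs)
```

Because no pair depends on another, the result is the same for any number of workers, which mirrors the abstract's point that distribution has no effect on the output alignments.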
