Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 38
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
BMC Genomics ; 18(Suppl 2): 111, 2017 03 14.
Artigo em Inglês | MEDLINE | ID: mdl-28361712

RESUMO

BACKGROUND: Over the past two decades, phylogenetic networks have been studied to model reticulate evolutionary events. The relationships among phylogenetic networks, phylogenetic trees and clusters serve as the basis for reconstruction and comparison of phylogenetic networks. To understand these relationships, two problems are raised: the tree containment problem, which asks whether a phylogenetic tree is displayed in a phylogenetic network, and the cluster containment problem, which asks whether a cluster is represented at a node in a phylogenetic network. Both the problems are NP-complete. RESULTS: A fast exponential-time algorithm for the cluster containment problem on arbitrary networks is developed and implemented in C. The resulting program is further extended into a computer program for fast computation of the Soft Robinson-Foulds distance between phylogenetic networks. CONCLUSIONS: Two computer programs are developed for facilitating reconstruction and validation of phylogenetic network models in evolutionary and comparative genomics. Our simulation tests indicated that they are fast enough for use in practice. Additionally, the distribution of the Soft Robinson-Foulds distance between phylogenetic networks is demonstrated to be unlikely normal by our simulation data.


Assuntos
Algoritmos , Biologia Computacional/estatística & dados numéricos , Modelos Genéticos , Filogenia , Software , Animais , Evolução Biológica , Culicidae/classificação , Culicidae/genética , Proteínas de Plantas/genética , Poaceae/classificação , Poaceae/genética , RNA de Cadeia Dupla/genética , RNA Fúngico/genética , Rhizoctonia/classificação , Rhizoctonia/genética
2.
BMC Bioinformatics ; 14 Suppl 16: S8, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24564762

RESUMO

BACKGROUND: Protein complexes conserved across species indicate processes that are core to cellular machinery (e.g. cell-cycle or DNA damage-repair complexes conserved across human and yeast). While numerous computational methods have been devised to identify complexes from the protein interaction (PPI) networks of individual species, these are severely limited by noise and errors (false positives) in currently available datasets. Our analysis using human and yeast PPI networks revealed that these methods missed several important complexes including those conserved between the two species (e.g. the MLH1-MSH2-PMS2-PCNA mismatch-repair complex). Here, we note that much of the functionalities of yeast complexes have been conserved in human complexes not only through sequence conservation of proteins but also of critical functional domains. Therefore, integrating information of domain conservation might throw further light on conservation patterns between yeast and human complexes. RESULTS: We identify conserved complexes by constructing an interolog network (IN) leveraging on the functional conservation of proteins between species through domain conservation (from Ensembl) in addition to sequence similarity. We employ 'state-of-the-art' methods to cluster the interolog network, and map these clusters back to the original PPI networks to identify complexes conserved between the species. Evaluation of our IN-based approach (called COCIN) on human and yeast interaction data identifies several additional complexes (76% recall) compared to direct complex detection from the original PINs (54% recall). Our analysis revealed that the IN-construction removes several non-conserved interactions many of which are false positives, thereby improving complex prediction. In fact removing non-conserved interactions from the original PINs also resulted in higher number of conserved complexes, thereby validating our IN-based approach. These complexes included the mismatch repair complex, MLH1-MSH2-PMS2-PCNA, and other important ones namely, RNA polymerase-II, EIF3 and MCM complexes, all of which constitute core cellular processes known to be conserved across the two species. CONCLUSIONS: Our method based on integrating domain conservation and sequence similarity to construct interolog networks helps to identify considerably more conserved complexes between the PPI networks from two species compared to direct complex prediction from the PPI networks. We observe from our experiments that protein complexes are not conserved from yeast to human in a straightforward way, that is, it is not the case that a yeast complex is a (proper) sub-set of a human complex with a few additional proteins present in the human complex. Instead complexes have evolved multifold with considerable re-organization of proteins and re-distribution of their functions across complexes. This finding can have significant implications on attempts to extrapolate other kinds of relationships such as synthetic lethality from yeast to human, for example in the identification of novel cancer targets. AVAILABILITY: http://www.comp.nus.edu.sg/~leonghw/COCIN/.


Assuntos
Biologia Computacional/métodos , Mapeamento de Interação de Proteínas/métodos , Sequência Conservada , Humanos , Proteínas/metabolismo , Saccharomyces cerevisiae/metabolismo
3.
BMC Bioinformatics ; 13 Suppl 17: S16, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-23282200

RESUMO

Complexes of physically interacting proteins are one of the fundamental functional units responsible for driving key biological mechanisms within the cell. With the advent of high-throughput techniques, significant amount of protein interaction (PPI) data has been catalogued for organisms such as yeast, which has in turn fueled computational methods for systematic identification and study of protein complexes. However, many complexes are dynamic entities - their subunits are known to assemble at a particular cellular space and time to perform a particular function and disassemble after that - and while current computational analyses have concentrated on studying the dynamics of individual or pairs of proteins in PPI networks, a crucial aspect overlooked is the dynamics of whole complex formations. In this work, using yeast as our model, we incorporate 'time' in the form of cell-cycle phases into the prediction of complexes from PPI networks and study the temporal phenomena of complex assembly and disassembly across phases. We hypothesize that 'staticness' (constitutive expression) of proteins might be related to their temporal "reusability" across complexes, and test this hypothesis using complexes predicted from large-scale PPI networks across the yeast cell cycle phases. Our results hint towards a biological design principle underlying cellular mechanisms - cells maintain generic proteins as 'static' to enable their "reusability" across multiple temporal complexes. We also demonstrate that these findings provide additional support and alternative explanations to findings from existing works on the dynamics in PPI networks.


Assuntos
Ciclo Celular , Modelos Biológicos , Complexos Multiproteicos/metabolismo , Mapeamento de Interação de Proteínas , Proteínas/metabolismo , Algoritmos , Saccharomyces cerevisiae/citologia , Saccharomyces cerevisiae/metabolismo
4.
Eur Phys J E Soft Matter ; 35(4): 9704, 2012 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-22526978

RESUMO

By introducing an additional hydrogen bond to hydrogen bond interaction in the force field of the CSAW (Conditioned Self-Avoiding Walk) model, we investigate into the mechanism of antiparallel ß-sheet formation based on the folding of a short polyalanine in gas phase. Through our numerical simulation, we detect the possible presence of a transient helix during ß-sheet formation, whose presence is shown to have slowed the formation of ß-sheets by an order of magnitude. While we observe the mechanisms of nucleation, zipping and induction that drives the formation of a ß-sheet, we uncover a new mechanism that involves transient ß-turns and short ß-sheets during the formation of long ß-sheets. Our results have enabled us to provide an overview on the mechanisms of ß-sheet formation via two main folding pathways: slow folding through the intermediate state of transient helix, and fast folding from the nucleation of ß-turn.


Assuntos
Modelos Químicos , Peptídeos/química , Estrutura Secundária de Proteína , Simulação por Computador , Transferência de Energia , Gases/química , Ligação de Hidrogênio , Interações Hidrofóbicas e Hidrofílicas , Modelos Moleculares , Dobramento de Proteína
5.
BMC Bioinformatics ; 11 Suppl 1: S63, 2010 Jan 18.
Artigo em Inglês | MEDLINE | ID: mdl-20122239

RESUMO

BACKGROUND: Conserved gene clusters are groups of genes that are located close to one another in the genomes of several species. They tend to code for proteins that have a functional interaction. The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting gene function. RESULTS: In this paper, we propose a novel pairwise gene cluster model that combines the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removes the need to specify the minimum number of shared genes in the r-window model and improves the relevance of the results. We design a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently. CONCLUSION: We apply our cluster model to the comparative analysis of E. coli K-12 and B. subtilis and perform an extensive comparison between our new model and the gene teams model developed by Bergeron et al. As compared to the gene teams model, our new cluster model has a slightly lower recall but a higher precision at all levels of recall when the results were ranked using statistical tests. An analysis of the most significant BBH r-window gene cluster show that they correspond to known operons.


Assuntos
Genoma Bacteriano , Genômica/métodos , Família Multigênica , Bacillus subtilis/genética , Escherichia coli/genética
6.
BMC Bioinformatics ; 11: 504, 2010 Oct 12.
Artigo em Inglês | MEDLINE | ID: mdl-20939868

RESUMO

BACKGROUND: The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show lack of correlation with each other and also contain substantial number of false positives (noise). Over these years, several affinity scoring schemes have also been devised to improve the qualities of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt towards tackling this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracies of correctly predicted complexes. RESULTS: Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives larger number of yeast complexes and with better accuracies than MCL, particularly in the presence of natural noise; (ii) Affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions for complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and their ability to form complexes. CONCLUSIONS: We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.


Assuntos
Cadeias de Markov , Mapeamento de Interação de Proteínas/métodos , Proteínas de Saccharomyces cerevisiae/metabolismo , Saccharomyces cerevisiae/metabolismo , Software , Análise por Conglomerados , Bases de Dados de Proteínas , Proteínas/química , Proteínas/metabolismo , Proteínas de Saccharomyces cerevisiae/química
7.
BMC Bioinformatics ; 11: 505, 2010 Oct 12.
Artigo em Inglês | MEDLINE | ID: mdl-20939873

RESUMO

BACKGROUND: In many protein-protein interaction (PPI) networks, densely connected hub proteins are more likely to be essential proteins. This is referred to as the "centrality-lethality rule", which indicates that the topological placement of a protein in PPI network is connected with its biological essentiality. Though such connections are observed in many PPI networks, the underlying topological properties for these connections are not yet clearly understood. Some suggested putative connections are the involvement of essential proteins in the maintenance of overall network connections, or that they play a role in essential protein clusters. In this work, we have attempted to examine the placement of essential proteins and the network topology from a different perspective by determining the correlation of protein essentiality and reverse nearest neighbor topology (RNN). RESULTS: The RNN topology is a weighted directed graph derived from PPI network, and it is a natural representation of the topological dependences between proteins within the PPI network. Similar to the original PPI network, we have observed that essential proteins tend to be hub proteins in RNN topology. Additionally, essential genes are enriched in clusters containing many hub proteins in RNN topology (RNN protein clusters). Based on these two properties of essential genes in RNN topology, we have proposed a new measure; the RNN cluster centrality. Results from a variety of PPI networks demonstrate that RNN cluster centrality outperforms other centrality measures with regard to the proportion of selected proteins that are essential proteins. We also investigated the biological importance of RNN clusters. CONCLUSIONS: This study reveals that RNN cluster centrality provides the best correlation of protein essentiality and placement of proteins in PPI network. Additionally, merged RNN clusters were found to be topologically important in that essential proteins are significantly enriched in RNN clusters, and biologically important because they play an important role in many Gene Ontology (GO) processes.


Assuntos
Mapeamento de Interação de Proteínas/métodos , Proteínas/química , Proteínas/genética , Análise por Conglomerados , Biologia Computacional/métodos , Bases de Dados de Proteínas , Genes Essenciais
8.
Genome Inform ; 23(1): 159-68, 2009 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-20180271

RESUMO

Protein complexes are responsible for most of vital biological processes within the cell. Understanding the machinery behind these biological processes requires detection and analysis of complexes and their constituent proteins. A wealth of computational approaches towards detection of complexes deal with clustering of protein-protein interaction (PPI) networks. Among these clustering approaches, the Markov Clustering (MCL) algorithm has proved to be reasonably successful, mainly due to its scalability and robustness. However, MCL produces many noisy clusters, which either do not represent any known complexes or have additional proteins (noise) that reduce the accuracies of correctly predicted complexes. Consequently, the accuracies of these clusters when matched with known complexes are quite low. Refinement of these clusters to improve the accuracy requires deeper understanding of the organization of complexes. Recently, experiments on yeast by Gavin et al. (2006) revealed that proteins within a complex are organized in two parts: core and attachment. Based on these insights, we propose our method (MCL-CA), which couples core-attachment based refinement steps to refine the clusters produced by MCL. We evaluated the effectiveness of our approach on two different datasets and compared the quality of our predicted complexes with that produced by MCL. The results show that our approach significantly improves the accuracies of predicted complexes when matched with known complexes. A direct result of this is that MCL-CA is able to cover larger number of known complexes than MCL. Further, we also compare our method with two very recently proposed methods CORE and COACH, which also capitalize on the core-attachment structure. We also discuss several instances to show that our predicted complexes clearly adhere to the core-attachment structure as revealed by Gavin et al.


Assuntos
Análise por Conglomerados , Cadeias de Markov , Proteínas/química , Algoritmos
9.
IEEE/ACM Trans Comput Biol Bioinform ; 16(5): 1702-1711, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-28678711

RESUMO

We consider the problem of sorting signed permutations by reversals, transpositions, transreversals, and block-interchanges and give a 2-approximation scheme, called the GSB (Genome Sorting by Bridges) scheme. Our result extends 2-approximation algorithm of He and Chen [12] that allowed only reversals and block-interchanges, and also the 1.5 approximation algorithm of Hartman and Sharan [11] that allowed only transreversals and transpositions. We prove this result by introducing three bridge structures in the breakpoint graph, namely, the L-bridge, T-bridge, and X-bridge and show that they model "proper" reversals, transpositions, transreversals, and block-interchanges, respectively. We show that we can always find at least one of these three bridges in any breakpoint graph, thus giving an upper bound on the number of operations needed. We prove a lower bound on the distance and use it to show that GSB has a 2-approximation ratio. An ${\text{O(n}}^{3})$O(n3) algorithm called GSB-I that is based on the GSB approximation scheme presented in this paper has recently been published by Yu, Hao, and Leong in [17] . We note that our 2-approximation scheme admits many possible implementations by varying the order we search for proper rearrangement operations.


Assuntos
Rearranjo Gênico/genética , Genoma/genética , Genômica/métodos , Algoritmos , Modelos Genéticos
10.
J Bioinform Comput Biol ; 6(3): 467-92, 2008 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-18574859

RESUMO

Peptide sequencing plays a fundamental role in proteomics. Tandem mass spectrometry, being sensitive and efficient, is one of the most commonly used techniques in peptide sequencing. Many computational models and algorithms have been developed for peptide sequencing using tandem mass spectrometry. In this paper, we investigate general issues in de novo sequencing, and present results that can be used to improve current de novo sequencing algorithms. We propose a general preprocessing scheme that performs binning, pseudo-peak introduction, and noise removal, and present theoretical and experimental analyses on each of the components. Then, we study the antisymmetry problem and current assumptions related to it, and propose a more realistic way to handle the antisymmetry problem based on analysis of some datasets. We integrate our findings on preprocessing and the antisymmetry problem with some current models for peptide sequencing. Experimental results show that our findings help to improve accuracies for de novo sequencing.


Assuntos
Sequência de Aminoácidos , Peptídeos/análise , Proteômica/métodos , Análise de Sequência de Proteína/métodos , Biologia Computacional , Eficiência Organizacional , Dados de Sequência Molecular
11.
J Bioinform Comput Biol ; 6(3): 435-66, 2008 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-18574858

RESUMO

Protein complexes are fundamental for understanding principles of cellular organizations. As the sizes of protein-protein interaction (PPI) networks are increasing, accurate and fast protein complex prediction from these PPI networks can serve as a guide for biological experiments to discover novel protein complexes. However, it is not easy to predict protein complexes from PPI networks, especially in situations where the PPI network is noisy and still incomplete. Here, we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. We know from previous work that proteins which do not interact but share interaction partners (level-2 neighbors) often share biological functions. We have proposed a method in which all direct and indirect interactions are first weighted using topological weight (FS-Weight), which estimates the strength of functional association. Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We have also proposed a novel algorithm that searches for cliques in the modified network, and merge cliques to form clusters using a "partial clique merging" method. Experiments show that (1) the use of indirect interactions and topological weight to augment protein-protein interactions can be used to improve the precision of clusters predicted by various existing clustering algorithms; and (2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no other information except the original PPI network is used, our approach would be very useful for protein complex prediction, especially for prediction of novel protein complexes.


Assuntos
Biologia Computacional , Simulação por Computador , Mapeamento de Interação de Proteínas , Algoritmos
12.
J Bioinform Comput Biol ; 16(3): 1840010, 2018 06.
Artigo em Inglês | MEDLINE | ID: mdl-29566638

RESUMO

The accurate detection of genomic islands (GIs) in microbial genomes is important for both evolutionary study and medical research, because GIs may promote genome evolution and contain genes involved in pathogenesis. Various computational methods have been developed to predict GIs over the years. However, most of them cannot make full use of GI-associated features to achieve desirable performance. Additionally, many methods cannot be directly applied to newly sequenced genomes. We develop a new method called GI-Cluster, which provides an effective way to integrate multiple GI-related features via consensus clustering. GI-Cluster does not require training datasets or existing genome annotations, but it can still achieve comparable or better performance than supervised learning methods in comprehensive evaluations. Moreover, GI-Cluster is widely applicable, either to complete and incomplete genomes or to initial GI predictions from other programs. GI-Cluster also provides plots to visualize the distribution of predicted GIs and related features. GI-Cluster is available at https://github.com/icelu/GI_Cluster.


Assuntos
Análise por Conglomerados , Biologia Computacional/métodos , Ilhas Genômicas , Genômica/métodos , Bases de Dados Genéticas , Transferência Genética Horizontal , Genoma , Salmonella typhi/genética , Vibrio cholerae/genética
13.
Genome Inform ; 19: 119-30, 2007.
Artigo em Inglês | MEDLINE | ID: mdl-18546510

RESUMO

Peptide identification by tandem mass spectrometry (MS/MS) is one of the most important problems in proteomics. Recent advances in high throughput MS/MS experiments result in huge amount of spectra. Unfortunately, identification of these spectra is relatively slow, and the accuracies of current algorithms are not high with the presence of noises and post-translational modifications (PTMs). In this paper, we strive to achieve high accuracy and efficiency for peptide identification problem, with special concern on identification of peptides with PTMs. This paper expands our previous work on PepSOM with the introduction of two accurate modified scoring functions: Slambda for peptide identification and Slambda* for identification of peptides with PTMs. Experiments showed that our algorithm is both fast and accurate for peptide identification. Experiments on spectra with simulated and real PTMs confirmed that our algorithm is accurate for identifying PTMs.


Assuntos
Biologia Computacional/métodos , Peptídeos/química , Processamento de Proteína Pós-Traducional , Espectrometria de Massas em Tandem/métodos , Algoritmos , Linhagem Celular Tumoral , Computadores , Humanos , Modelos Biológicos , Linguagens de Programação , Proteômica/métodos , Reprodutibilidade dos Testes , Software
14.
Nucleic Acids Res ; 33(17): e144, 2005 Sep 28.
Artigo em Inglês | MEDLINE | ID: mdl-16192568

RESUMO

The broad applicability of gene expression profiling to genomic analyses has generated huge demand for mass production of microarrays and hence for improving the cost effectiveness of microarray fabrication. We developed a post-processing method for deriving a good synthesis strategy. In this paper, we assessed all the known efficient methods and our post-processing method for reducing the number of synthesis cycles for manufacturing a DNA-chip of a given set of oligos. Our experimental results on both simulated and 52 real datasets show that no single method consistently gives the best synthesis strategy, and post-processing an existing strategy is necessary as it often reduces the number of synthesis cycles further.


Assuntos
Perfilação da Expressão Gênica , Análise de Sequência com Séries de Oligonucleotídeos , Sondas de Oligonucleotídeos/síntese química , Algoritmos , Composição de Bases , Biologia Computacional/métodos , Sondas de Oligonucleotídeos/química
15.
Gene ; 626: 132-139, 2017 Aug 30.
Artigo em Inglês | MEDLINE | ID: mdl-28512059

RESUMO

The first genome-scale metabolic network of Cordyceps militaris (iWV1170) was constructed representing its whole metabolisms, which consisted of 894 metabolites and 1,267 metabolic reactions across five compartments, including the plasma membrane, cytoplasm, mitochondria, peroxisome and extracellular space. The iWV1170 could be exploited to explain its phenotypes of growth ability, cordycepin and other metabolites production on various substrates. A high number of genes encoding extracellular enzymes for degradation of complex carbohydrates, lipids and proteins were existed in C. militaris genome. By comparative genome-scale analysis, the adenine metabolic pathway towards putative cordycepin biosynthesis was reconstructed, indicating their evolutionary relationships across eleven species of entomopathogenic fungi. The overall metabolic routes involved in the putative cordycepin biosynthesis were also identified in C. militaris, including central carbon metabolism, amino acid metabolism (glycine, l-glutamine and l-aspartate) and nucleotide metabolism (adenosine and adenine). Interestingly, a lack of the sequence coding for ribonucleotide reductase inhibitor was observed in C. militaris that might contribute to its over-production of cordycepin.


Assuntos
Cordyceps/genética , Genoma Fúngico , Redes e Vias Metabólicas , Cordyceps/metabolismo , Cordyceps/patogenicidade , Desoxiadenosinas/biossíntese , Desoxiadenosinas/genética , Proteínas Fúngicas/genética , Proteínas Fúngicas/metabolismo , Hidrolases/genética , Hidrolases/metabolismo , Ribonucleotídeo Redutases/genética , Ribonucleotídeo Redutases/metabolismo , Virulência/genética
16.
BMC Bioinformatics ; 7 Suppl 4: S12, 2006 Dec 12.
Artigo em Inglês | MEDLINE | ID: mdl-17217504

RESUMO

BACKGROUND: The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequences analysis. The SCS problem is well-known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences. RESULTS: In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: deposition process, and reduction process. The deposition process is responsible for generating a small set of common supersequences; and the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences show that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, and especially on large instances. CONCLUSION: Our DR algorithm provides a partial answer to the open problem of designing efficient heuristic algorithm for SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient, both in running time and space complexity and our evaluation shows that it is practical even for SCS problems on many long sequences.


Assuntos
Algoritmos , Sequência Conservada , Alinhamento de Sequência/métodos , Análise de Sequência/métodos , Homologia de Sequência
17.
J Bioinform Comput Biol ; 4(6): 1329-52, 2006 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-17245817

RESUMO

Peptide sequencing using tandem mass spectrometry data is an important and challenging problem in proteomics. We address the problem of peptide sequencing for multi-charge spectra. Most peptide sequencing algorithms currently consider only charge one or two ions even for higher-charge spectra. We give a characterization of multi-charge spectra by generalizing existing models. Using our models, we analyzed spectra from Global Proteome Machine (GPM) [Craig R, Cortens JP, Beavis RC, J Proteome Res 3:1234-1242, 2004.] (with charges 1-5), Institute for Systems Biology (ISB) [Keller A, Purvine S, Nesvizhskii AI, Stolyar S, Goodlett DR, Kolker E, OMICS 6:207-212, 2002.] and Orbitrap (both with charges 1-3). Our analysis for the GPM dataset shows that higher charge peaks contribute significantly to prediction of the complete peptide. They also help to explain why existing algorithms do not perform well on multi-charge spectra. Based on these analyses, we claim that peptide sequencing algorithms can achieve higher sensitivity results if they also consider higher charge ions. We verify this claim by proposing a de novo sequencing algorithm called the greedy best strong tag (GBST) algorithm that is simple but considers higher charge ions based on our new model. Evaluation on multi-charge spectra shows that our simple GBST algorithm outperforms Lutefisk and PepNovo, especially for the GPM spectra of charge three or more.


Assuntos
Algoritmos , Espectrometria de Massas/métodos , Modelos Químicos , Mapeamento de Peptídeos/métodos , Peptídeos/química , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Simulação por Computador , Dados de Sequência Molecular
18.
Genome Inform ; 17(2): 89-99, 2006.
Artigo em Inglês | MEDLINE | ID: mdl-17503382

RESUMO

As the scale of the microarray experiments increases, a single oligo nucleotide array is no longer large enough. Therefore, the use of multiple oligo arrays for one experiment becomes more important. The design and synthesis of multiple arrays to minimize the overall synthesis cost is an interesting and important problem. We formulate the multiple array synthesis problem (MASP) that deals with the distribution of the probes (or oligos) to different arrays, and then deposition of the probes onto each array. We propose a cost function to capture the synthesis cost and a performance ratio for analysis of the quality of multiple arrays produced by different algorithms. We propose a Distribution and Deposition Algorithm (DDA) for the solving the MASP. In this algorithm, the probes are first distributed onto multiple arrays according to their characteristics such as GC contents. Then the probes on each arrays are deposited using a good deposition algorithm. Two other algorithms were also proposed and used for comparison. Experiments show that our algorithm can effectively output short synthesis sequences for multiple arrays, and the algorithm is efficient.


Assuntos
Algoritmos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Sequência de Bases , Biologia Computacional , Simulação por Computador , Sondas de DNA/química , Etiquetas de Sequências Expressas , Análise de Sequência com Séries de Oligonucleotídeos/economia , Oligonucleotídeos/síntese química , Oligonucleotídeos/química , Análise de Sequência de DNA , Software
19.
Genome Inform ; 17(2): 194-205, 2006.
Artigo em Inglês | MEDLINE | ID: mdl-17503392

RESUMO

Peptide identification by tandem mass spectrometry is both an important and challenging problem in proteomics. At present, huge amount of spectrum data are generated by high throughput mass spectrometers at a very fast pace, but algorithms to analyze these spectra are either too slow, not accurate enough, or only gives partial sequences or sequence tags. In this paper, we emphasize on the balance between identification completeness and efficiency with reasonable accuracy for peptide identification by tandem mass spectrum. Our method works by converting spectra to vectors in high-dimensional space, and subsequently use self-organizing map (SOM) and multi-point range query (MPRQ) algorithm as a coarse filter reduce the number of candidates to achieve efficient and accurate database search. Experiments show that our algorithm is both fast and accurate in peptide identification.


Assuntos
Algoritmos , Mapeamento de Peptídeos , Peptídeos/análise , Análise de Sequência de Proteína , Espectrometria de Massas em Tandem/métodos , Biologia Computacional , Bases de Dados de Proteínas , Microcomputadores , Modelos Biológicos , Peptídeos/química , Reprodutibilidade dos Testes , Sensibilidade e Especificidade , Software , Fatores de Tempo
20.
Comput Struct Biotechnol J ; 14: 200-6, 2016.
Artigo em Inglês | MEDLINE | ID: mdl-27293536

RESUMO

Clusters of genes acquired by lateral gene transfer in microbial genomes, are broadly referred to as genomic islands (GIs). GIs often carry genes important for genome evolution and adaptation to niches, such as genes involved in pathogenesis and antibiotic resistance. Therefore, GI prediction has gradually become an important part of microbial genome analysis. Despite inherent difficulties in identifying GIs, many computational methods have been developed and show good performance. In this mini-review, we first summarize the general challenges in predicting GIs. Then we group existing GI detection methods by their input, briefly describe representative methods in each group, and discuss their advantages as well as limitations. Finally, we look into the potential improvements for better GI prediction.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA