Results 1 - 20 of 331
1.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940160

ABSTRACT

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Subjects
Algorithms , Markov Chains , Sequence Alignment , Software , Sequence Alignment/methods , Computational Biology/methods , Sequence Analysis, Protein/methods , Phylogeny , Proteins/chemistry
2.
IEEE Trans Nanobioscience ; 19(3): 506-517, 2020 07.
Article in English | MEDLINE | ID: mdl-32396096

ABSTRACT

Statistical resampling methods are widely used for confidence interval placement and as a data perturbation technique for statistical inference and learning. An important assumption of popular resampling methods such as the standard bootstrap is that input observations are identically and independently distributed (i.i.d.). However, within the area of computational biology and bioinformatics, many different factors can contribute to intra-sequence dependence, such as recombination and other evolutionary processes governing sequence evolution. The SEquential RESampling ("SERES") framework was previously proposed to relax the simplifying assumption of i.i.d. input observations. SERES resampling takes the form of random walks on an input of either aligned or unaligned biomolecular sequences. This study introduces the first application of SERES random walks on aligned sequence inputs and is also the first to demonstrate the utility of SERES as a data perturbation technique to yield improved statistical estimates. We focus on the classical problem of recombination-aware local genealogical inference. We show in a simulation study that coupling SERES resampling and re-estimation with recHMM, a hidden Markov model-based method, produces local genealogical inferences with consistent and often large improvements in terms of topological accuracy. We further evaluate method performance using empirical HIV genome sequence datasets.


Subjects
Computational Biology/methods , Machine Learning , Markov Chains , Phylogeny , Sequence Alignment/methods , Computer Simulation
3.
Methods Mol Biol ; 2112: 175-186, 2020.
Article in English | MEDLINE | ID: mdl-32006286

ABSTRACT

The VAST+ algorithm is an efficient, simple, and elegant solution to the problem of comparing the atomic structures of biological assemblies. Given two protein assemblies, it takes as input all the pairwise structural alignments of the component proteins. It then clusters the rotation matrices from the pairwise superpositions, with the clusters corresponding to subsets of the two assemblies that may be aligned and well superposed. It uses the Vector Alignment Search Tool (VAST) protein-protein comparison method for the input structural alignments, but other methods could be used, as well. From a chosen cluster, an "original" alignment for the assembly may be defined by simply combining the relevant input alignments. However, it is often useful to reduce/trim the original alignment, using a Monte Carlo refinement algorithm, which allows biologically relevant conformational differences to be more readily detected and observed. The method is easily extended to include RNA or DNA molecules. VAST+ results may be accessed via the URL https://www.ncbi.nlm.nih.gov/Structure , then entering a PDB accession or terms in the search box, and using the link [VAST+] in the upper right corner of the Structure Summary page.


Subjects
Proteins/chemistry , Sequence Alignment/methods , Algorithms , Databases, Protein , Monte Carlo Method , Protein Conformation , Search Engine/methods , Software
4.
Genes (Basel) ; 11(1)2020 01 07.
Article in English | MEDLINE | ID: mdl-31936127

ABSTRACT

The thioester-containing protein (TEP) superfamily is known to play important innate immune roles in a wide range of animal phyla. TEPs are involved in recognition, and in the direct or mediated killing of several invading organisms or pathogens. While several TEPs have been identified in many invertebrates, only one TEP (named BgTEP) has been previously characterized in the freshwater snail, Biomphalaria glabrata. As the presence of a single member of that family is particularly intriguing, transcriptomic data and the recently published genome were used to explore the presence of other BgTEP related genes in B. glabrata. Ten other TEP members have been reported and classified into different subfamilies: three complement-like factors (BgC3-1 to BgC3-3), one α-2-macroglobulin (BgA2M), two macroglobulin complement-related proteins (BgMCR1, BgMCR2), one CD109 (BgCD109), and three insect TEPs (BgTEP2 to BgTEP4), in addition to the previously characterized BgTEP that we renamed BgTEP1. This is the first report on such a level of TEP diversity and of the presence of macroglobulin complement-related proteins (MCR) in mollusks. Gene structure analysis revealed alternative splicing in the highly variable region of three members (BgA2M, BgCD109, and BgTEP2) with a particularly unexpected diversity for BgTEP2. Finally, different gene expression profiles tend to indicate specific functions for such novel family members.


Subjects
Biomphalaria/genetics , Immunity, Innate/genetics , Amino Acid Sequence/genetics , Animals , Fresh Water , Gene Expression Profiling/methods , Phylogeny , Schistosoma mansoni , Sequence Alignment/methods , Transcription Factors/genetics , Transcriptome/genetics
5.
BMC Bioinformatics ; 20(Suppl 18): 573, 2019 Nov 25.
Article in English | MEDLINE | ID: mdl-31760933

ABSTRACT

BACKGROUND: When conducting multiple sequence alignment (MSA), it is essential to use the substitution scores of pairwise alignments. To compute adaptive scores for alignment, researchers usually use hidden Markov models (HMMs) or probabilistic consistency methods such as the partition function. Recent studies show that optimizing the parameters of the HMM, as well as integrating the HMM with the partition function, can raise the accuracy of alignment. However, the combination of the partition function and an optimized HMM, which could further improve alignment accuracy, was ignored by these studies. RESULTS: A novel algorithm for MSA called ProbPFP is presented in this paper. It integrates an HMM optimized by particle swarm optimization (PSO) with the partition function. PSO was applied to optimize the HMM's parameters. After that, the posterior probability obtained by the HMM was combined with the one obtained by the partition function to calculate an integrated substitution score for alignment. To evaluate the effectiveness of ProbPFP, we compared it with 13 outstanding or classic MSA methods. The results demonstrate that the alignments obtained by ProbPFP achieved the highest mean TC scores and mean SP scores on two benchmark datasets, SABmark and OXBench, and the second highest mean TC scores and mean SP scores on the benchmark dataset BAliBASE. ProbPFP was also compared with 4 other outstanding methods by reconstructing the phylogenetic trees for six protein families extracted from the TreeFam database, based on the alignments obtained by these 5 methods. The results indicate that the phylogenetic trees reconstructed from the alignments obtained by ProbPFP are closer to the reference trees than those reconstructed from the other methods' alignments. CONCLUSIONS: We propose a new multiple sequence alignment method combining an optimized HMM and the partition function. The results confirm that this method can substantially improve alignment accuracy.


Subjects
Computational Biology/methods , Proteins/genetics , Sequence Alignment/methods , Algorithms , Animals , Humans , Markov Chains , Multigene Family , Phylogeny , Proteins/chemistry , Software
6.
BMC Bioinformatics ; 20(1): 473, 2019 Sep 14.
Article in English | MEDLINE | ID: mdl-31521110

ABSTRACT

BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite . CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.


Subjects
Molecular Sequence Annotation/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Markov Chains
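
For orientation, the Viterbi recursion named in the abstract above can be sketched for a plain (non-profile) HMM. This is only a minimal scalar illustration in log space, not the SIMD-vectorized profile-HMM implementation of HH-suite; the function name `viterbi` and its dictionary-based arguments are assumptions made for this example.

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a plain HMM, working in log space.

    Hypothetical interface for illustration:
      log_start[s]      log P(first state is s)
      log_trans[p][s]   log P(next state is s | current state is p)
      log_emit[s][x]    log P(symbol x | state s)
    """
    # score[s] = best log-probability of any state path that ends in s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict of best predecessors per step after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # choose the predecessor that maximizes the path score into s
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][symbol]
            pointers[s] = best_prev
        back.append(pointers)
    # trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The inner loop over states is the part that a vectorized implementation would process several cells at a time; the sketch keeps it scalar for readability.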
7.
Sci Rep ; 9(1): 1630, 2019 02 07.
Article in English | MEDLINE | ID: mdl-30733500

ABSTRACT

The ongoing evolution of microbial pathogens represents a significant issue in diagnostic PCR/qPCR. Many assays are burdened with false negativity due to mispriming and/or probe-binding failures. Therefore, PCR/qPCR assays used in the laboratory should be periodically re-assessed in silico on public sequences to evaluate the ability to detect actually circulating strains and to infer potentially escaping variants. In the work presented we re-assessed a RT-qPCR assay for the universal detection of influenza A (IA) viruses currently recommended by the European Union Reference Laboratory for Avian Influenza. To this end, the primers and probe sequences were challenged against more than 99,000 M-segment sequences in five data pools. To streamline this process, we developed a simple algorithm called the SequenceTracer designed for alignment stratification, compression, and personal sequence subset selection and also demonstrated its utility. The re-assessment confirmed the high inclusivity of the assay for the detection of avian, swine and human pandemic H1N1 IA viruses. On the other hand, the analysis identified human H3N2 strains with a critical probe-interfering mutation circulating since 2010, albeit with a significantly fluctuating proportion. Minor variations located in the forward and reverse primers identified in the avian and swine data were also considered.


Subjects
Influenza A virus/genetics , Reverse Transcriptase Polymerase Chain Reaction/methods , Algorithms , Computer Simulation , DNA Primers , Mutation , Sequence Alignment/methods
8.
Bioinformatics ; 34(4): 576-584, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29040374

ABSTRACT

Motivation: Pair Hidden Markov Models (PHMMs) are probabilistic models used for pairwise sequence alignment, a quintessential problem in bioinformatics. PHMMs include three types of hidden states: match, insertion and deletion. Most previous studies have used one or two hidden states for each PHMM state type. However, few studies have examined the number of states suitable for representing sequence data or improving alignment accuracy. Results: We developed a novel method to select superior models (including the number of hidden states) for PHMM. Our method selects models with the highest posterior probability using Factorized Information Criterion, which is widely utilized in model selection for probabilistic models with hidden variables. Our simulations indicated that this method has excellent model selection capabilities with slightly improved alignment accuracy. We applied our method to DNA datasets from 5 and 28 species, ultimately selecting more complex models than those used in previous studies. Availability and implementation: The software is available at https://github.com/bigsea-t/fab-phmm. Contact: mhamada@waseda.jp. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Animals , Bayes Theorem , Humans , Models, Statistical , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Sequence Analysis, RNA/methods
9.
Genomics ; 109(5-6): 419-431, 2017 10.
Article in English | MEDLINE | ID: mdl-28669847

ABSTRACT

Sequence alignment is an active research area in the field of bioinformatics. It is also a crucial task, as it guides many other tasks such as phylogenetic analysis and the prediction of the function and/or structure of biological macromolecules like DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, the available methods still produce different alignment results for the same alignment problem. Multiple sequence alignment is a problem of very high computational complexity. Many stochastic methods have therefore been considered for improving the accuracy of alignment. Among them, the genetic algorithm is frequently used. In this study, we present the different types of methods applied in alignment and the recent trends in multiobjective genetic algorithms for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in alignment accuracy.


Subjects
Computational Biology/methods , Proteins/genetics , Sequence Alignment/methods , Algorithms , Markov Chains , Phylogeny , Sequence Analysis, Protein
10.
Bioinformatics ; 33(24): 3902-3908, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-28666322

ABSTRACT

MOTIVATION: Methods able to provide reliable protein alignments are crucial for many bioinformatics applications. In recent years many different algorithms have been developed, and various kinds of information, from sequence conservation to secondary structure, have been used to improve alignment performance. This is especially relevant for proteins with highly divergent sequences. However, recent work suggests that different features may have different importance in diverse protein classes, and it would be an advantage to have more customizable approaches capable of dealing with different alignment definitions. RESULTS: Here we present Rigapollo, a highly flexible pairwise alignment method based on a pairwise HMM-SVM that can use any type of information to build alignments. Rigapollo lets users decide the optimal features to align their protein class of interest. It outperforms current state-of-the-art methods on two well-known benchmark datasets when aligning highly divergent sequences. AVAILABILITY AND IMPLEMENTATION: A Python implementation of the algorithm is available at http://ibsquare.be/rigapollo. CONTACT: wim.vranken@vub.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Sequence Alignment/methods , Sequence Analysis, Protein/methods , Support Vector Machine , Algorithms , Markov Chains , Protein Structure, Secondary , Proteins/chemistry , Software
11.
BMC Bioinformatics ; 18(1): 299, 2017 Jun 12.
Article in English | MEDLINE | ID: mdl-28606054

ABSTRACT

BACKGROUND: Genome sequencing provides a powerful tool for pathogen detection and can help resolve outbreaks that pose public safety and health risks. Mapping of DNA reads to genomes plays a fundamental role in this approach, where accurate alignment and classification of sequencing data is crucial. Standard mapping methods crudely treat bases as independent from their neighbors. Accuracy might be improved by using higher order paired hidden Markov models (HMMs), which model neighbor effects, but introduce design and implementation issues that have typically made them impractical for read mapping applications. We present a variable-order paired HMM that we term VarHMM, which addresses central issues involved with higher order modeling for sequence alignment. RESULTS: Compared with existing alignment methods, VarHMM is able to model higher order distributions and quantify alignment probabilities with greater detail and accuracy. In a series of comparison tests, in which Ion Torrent sequenced DNA was mapped to similar bacterial strains, VarHMM consistently provided better strain discrimination than any of the other alignment methods that we compared with. CONCLUSIONS: Our results demonstrate the advantages of higher ordered probability distribution modeling and also suggest that further development of such models would benefit read mapping in a range of other applications as well.


Subjects
DNA, Bacterial , Genome, Bacterial/genetics , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , DNA, Bacterial/analysis , DNA, Bacterial/classification , DNA, Bacterial/genetics , Markov Chains
12.
BMC Genomics ; 18(Suppl 4): 362, 2017 05 24.
Article in English | MEDLINE | ID: mdl-28589863

ABSTRACT

BACKGROUND: Recent advances in whole genome alignment software have made it possible to align two genomes very efficiently and with only a small sacrifice in sensitivity. Yet such software becomes very slow if the extra sensitivity is needed. This paper proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without much extra running time. RESULTS AND CONCLUSIONS: We applied our method to a popular whole genome alignment tool, LAST, and called the resulting tool LASTM. Experimental results showed that LASTM could find more high quality alignments with a little extra running time. For example, when comparing human and mouse genomes, to produce a similar number of alignments with similar average length and similarity, LASTM was about three times faster than LAST. We conclude that our method can be used to improve sensitivity at a small cost in extra time, and thus it is worth implementing in existing tools.


Subjects
Sequence Alignment/methods , Whole Genome Sequencing/methods , Animals , Humans , Time Factors
13.
PLoS Comput Biol ; 12(12): e1005294, 2016 12.
Article in English | MEDLINE | ID: mdl-28002465

ABSTRACT

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).


Subjects
Acetyltransferases/chemistry , Models, Molecular , Sequence Analysis, Protein/methods , Acetyltransferases/genetics , Acetyltransferases/metabolism , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology , Humans , Markov Chains , Monte Carlo Method , Sequence Alignment/methods
14.
PLoS One ; 11(12): e0167430, 2016.
Article in English | MEDLINE | ID: mdl-27918587

ABSTRACT

In this paper, we propose a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440 dimensional feature vector consisting of a 400 dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20 dimensional content ratio vector, and a 20 dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset consisting of the ND5 protein sequences of 9 species, and to the F10 and G11 datasets representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW on the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector.


Subjects
Amino Acids/chemistry , Proteins/chemistry , Algorithms , Amino Acid Sequence , Glycoside Hydrolases/chemistry , Probability , Sequence Alignment/methods , Sequence Analysis, Protein/methods
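
The 400 + 20 + 20 layout of the feature vector described in the abstract above can be illustrated with a short sketch. The abstract does not give the exact definitions of the three blocks, so the normalizations used here (adjacent-pair frequencies, residue frequencies, and each residue's share of the summed 1-based positions) are assumptions chosen for illustration and may differ from the paper's; `feature_vector` and `distance` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def feature_vector(seq):
    """Encode a protein sequence as a 440-dimensional vector (assumed definitions)."""
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two residues")
    pairs = [0.0] * 400     # counts of adjacent residue pairs (a followed by b)
    content = [0.0] * 20    # residue counts
    position = [0.0] * 20   # sums of 1-based positions per residue
    for pos, aa in enumerate(seq, start=1):
        content[INDEX[aa]] += 1
        position[INDEX[aa]] += pos
    for a, b in zip(seq, seq[1:]):
        pairs[20 * INDEX[a] + INDEX[b]] += 1
    # Assumed normalizations: each block is scaled to sum to 1.
    pairs = [c / (n - 1) for c in pairs]
    content = [c / n for c in content]
    position = [p / (n * (n + 1) / 2) for p in position]
    return pairs + content + position

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.dist(u, v)
```

Under these assumptions, two sequences are compared with `distance(feature_vector(a), feature_vector(b))`; residues outside the 20 standard letters raise a `KeyError` and would need explicit handling.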
15.
Bioinformatics ; 32(24): 3826-3828, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27638400

ABSTRACT

MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. AVAILABILITY AND IMPLEMENTATION: Source code in C++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net . CONTACT: jgonzalezd@udc.es. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Proteins , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Markov Chains , Software
16.
Mol Biol Evol ; 33(11): 2976-2989, 2016 11.
Article in English | MEDLINE | ID: mdl-27486222

ABSTRACT

To detect positive selection at individual amino acid sites, most methods use an empirical Bayes approach. After parameters of a Markov process of codon evolution are estimated via maximum likelihood, they are passed to Bayes formula to compute the posterior probability that a site evolved under positive selection. A difficulty with this approach is that parameter estimates with large errors can negatively impact Bayesian classification. By assigning priors to some parameters, Bayes Empirical Bayes (BEB) mitigates this problem. However, as implemented, it imposes uniform priors, which causes it to be overly conservative in some cases. When standard regularity conditions are not met and parameter estimates are unstable, inference, even under BEB, can be negatively impacted. We present an alternative to BEB called smoothed bootstrap aggregation (SBA), which bootstraps site patterns from an alignment of protein coding DNA sequences to accommodate the uncertainty in the parameter estimates. We show that deriving the correction for parameter uncertainty from the data in hand, in combination with kernel smoothing techniques, improves site specific inference of positive selection. We compare BEB to SBA by simulation and real data analysis. Simulation results show that SBA balances accuracy and power at least as well as BEB, and when parameter estimates are unstable, the performance gap between BEB and SBA can widen in favor of SBA. SBA is applicable to a wide variety of other inference problems in molecular evolution.


Subjects
Amino Acids/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Bayes Theorem , Biological Evolution , Codon/genetics , Computer Simulation , Evolution, Molecular , Likelihood Functions , Markov Chains , Models, Genetic , Models, Statistical , Probability , Selection, Genetic , Uncertainty
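
The bootstrapping of site patterns mentioned in the abstract above can be sketched as resampling codon columns of an alignment with replacement. This shows only the plain resampling step; the kernel smoothing and the re-estimation of model parameters that SBA combines with it are not shown, and the function name `bootstrap_codon_sites` is an assumption for this example.

```python
import random

def bootstrap_codon_sites(alignment, rng):
    """Return one bootstrap replicate of a codon alignment.

    `alignment` is a list of equal-length, in-frame DNA strings; `rng` is a
    random.Random instance. Codon columns are drawn with replacement, and the
    same draw is applied to every sequence so that site patterns stay intact.
    """
    length = len(alignment[0])
    if length % 3 or any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    n_codons = length // 3
    picks = [rng.randrange(n_codons) for _ in range(n_codons)]
    return ["".join(seq[3 * k:3 * k + 3] for k in picks) for seq in alignment]
```

A call such as `bootstrap_codon_sites(aln, random.Random(1))` yields one replicate; repeating it with different seeds gives the set of replicates over which parameter estimates would be recomputed.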
17.
J Theor Biol ; 393: 67-74, 2016 Mar 21.
Article in English | MEDLINE | ID: mdl-26801876

ABSTRACT

Detecting the three-dimensional structures of protein sequences is a challenging task in the biological sciences. For this purpose, protein fold recognition has been utilized as an intermediate step which helps in classifying a novel protein sequence into one of its folds. The process of protein fold recognition encompasses feature extraction from protein sequences and feature identification through suitable classifiers. Several feature extractors have been developed to retrieve useful information from protein sequences. These features are generally derived from a protein's sequential, physicochemical and evolutionary properties. Performance in terms of recognition accuracy has gradually improved over the last decade. However, it has yet to reach a reasonable and accepted level. In this work, we first applied HMM-HMM alignment of protein sequences from HHblits to extract the profile HMM (PHMM) matrix. Then we computed the distance between respective PHMM matrices using kernelized dynamic programming. We recorded a significant improvement in fold recognition over the state-of-the-art feature extractors. The improvement in recognition accuracy is in the range of 2.7-11.6% on three benchmark datasets from Structural Classification of Proteins.


Subjects
Markov Chains , Proteins/chemistry , Sequence Alignment/methods , Databases, Protein , Protein Structure, Secondary , Reproducibility of Results , Support Vector Machine
18.
Article in English | MEDLINE | ID: mdl-26357074

ABSTRACT

We introduce MRFy, a tool for protein remote homology detection that captures beta-strand dependencies in the Markov random field. Over a set of 11 SCOP beta-structural superfamilies, MRFy shows a 14 percent improvement in mean Area Under the Curve for the motif recognition problem as compared to HMMER, a 25 percent improvement as compared to RAPTOR, a 14 percent improvement as compared to HHPred, and an 18 percent improvement as compared to CNFPred and RaptorX. MRFy was implemented in the Haskell functional programming language, and parallelizes well on multi-core systems. MRFy is available, as source code as well as an executable, from http://mrfy.cs.tufts.edu/.


Subjects
Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Algorithms , Amino Acid Motifs , Markov Chains , Models, Statistical , Stochastic Processes
19.
Article in English | MEDLINE | ID: mdl-26357079

ABSTRACT

This paper introduces a simple and effective approach to improve the accuracy of multiple sequence alignment. We use a natural measure to estimate the similarity of the input sequences, and based on this measure, we align the input sequences differently. For example, for inputs with high similarity, we consider the whole sequences and align them globally, while for those with moderately low similarity, we may ignore the flank regions and align them locally. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with about one dozen leading alignment tools on three benchmark alignment databases; GLProbs's alignments have the best scores in almost all tests. We have also evaluated the practicability of the alignments of GLProbs by applying the tool to three biological applications, namely phylogenetic tree construction, protein secondary structure prediction and the detection of high risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Amino Acid Sequence , Markov Chains , Molecular Sequence Data , Phylogeny , Protein Structure, Secondary , Proteins/chemistry , Proteins/classification
20.
Bioinformatics ; 31(23): 3850-2, 2015 Dec 01.
Article in English | MEDLINE | ID: mdl-26231431

ABSTRACT

MOTIVATION: The HHsearch algorithm, implementing a hidden Markov model (HMM)-HMM alignment method, has shown excellent alignment performance in the so-called twilight zone (target-template sequence identity of ∼20%). However, an optimal alignment by HHsearch may contain small to large errors, leading to poor structure prediction if these errors are located in important structural elements. RESULTS: The HHalign-Kbest server runs a full pipeline, from the generation of suboptimal HMM-HMM alignments to the evaluation of the best structural models. In the HHsearch framework, it implements a novel algorithm capable of generating k-best HMM-HMM suboptimal alignments rather than only the optimal one. For large proteins, a directed acyclic graph-based implementation drastically reduces memory usage. Improved alignments were systematically generated among the top k suboptimal alignments. To recognize them, corresponding structural models were systematically generated and evaluated with the Qmean score. The method was benchmarked over 420 targets from the SCOP30 database. In the range of HHsearch probability of 20-99%, the average quality of the models (TM-score) rose by 4.1-16.3% and 8.0-21.0% considering the top 1 and top 10 best models, respectively. AVAILABILITY AND IMPLEMENTATION: http://bioserv.rpbs.univ-paris-diderot.fr/services/HHalign-Kbest/ (source code and server). CONTACT: guerois@cea.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Sequence Alignment/methods , Software , Structural Homology, Protein , Algorithms , Markov Chains , Models, Molecular , Sequence Analysis, Protein