Results 1 - 20 of 222
1.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940160

ABSTRACT

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve alignment accuracy similar to that of UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Subjects
Algorithms , Markov Chains , Sequence Alignment , Software , Sequence Alignment/methods , Computational Biology/methods , Sequence Analysis, Protein/methods , Phylogeny , Proteins/chemistry
2.
Proteins ; 89(12): 1940-1948, 2021 12.
Article in English | MEDLINE | ID: mdl-34324227

ABSTRACT

In CASP, blind testing of model accuracy estimation methods has been conducted on models submitted by tertiary structure prediction servers. In CASP14, model accuracy estimation results were evaluated in terms of both global and local structure accuracy, as in the previous CASPs. Unlike the previous CASPs that did not show pronounced improvements in performance, the best single-model method (from the Baker group) showed an improved performance in CASP14, particularly in evaluating global structure accuracy when compared to both the best single-model methods in previous CASPs and the best multi-model methods in the current CASP. Although the CASP14 experiment on model accuracy estimation did not deal with the structures generated by AlphaFold2, new challenges that have arisen due to the success of AlphaFold2 are discussed.


Subjects
Models, Molecular , Protein Conformation , Proteins , Software , Computational Biology , Proteins/chemistry , Proteins/metabolism , Reproducibility of Results , Sequence Analysis, Protein/methods
3.
J Comput Biol ; 28(6): 570-586, 2021 06.
Article in English | MEDLINE | ID: mdl-33960831

ABSTRACT

A profile mixture (PM) model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend, in part, on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here, using algebraic methods, we show that a PM model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.


Subjects
Computational Biology/methods , Evolution, Molecular , Sequence Analysis, Protein/methods , Animals , Humans , Markov Chains , Proteins/chemistry , Proteins/genetics
4.
Nucleic Acids Res ; 49(W1): W60-W66, 2021 07 02.
Article in English | MEDLINE | ID: mdl-33963861

ABSTRACT

The Bologna ENZyme Web Server (BENZ WS) annotates four-level Enzyme Commission numbers (EC numbers) as defined by the International Union of Biochemistry and Molecular Biology (IUBMB). BENZ WS filters a target sequence with a combined system of Hidden Markov Models, modelling protein sequences annotated with the same molecular function, and Pfams, carrying along conserved protein domains. BENZ returns, when successful, for any enzyme target sequence an associated four-level EC number. Our system can annotate both monofunctional and polyfunctional enzymes, and it can be a valuable resource for sequence functional annotation.


Subjects
Enzymes/chemistry , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Internet , Markov Chains , Protein Domains , Sequence Alignment
5.
Anal Biochem ; 612: 113954, 2021 01 01.
Article in English | MEDLINE | ID: mdl-32946833

ABSTRACT

BACKGROUND: DNA-binding proteins perform important roles in cellular processes and are involved in many biological activities. These proteins include crucial protein-DNA binding domains and can interact with single-stranded or double-stranded DNA, and are accordingly classified as single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Computational prediction of SSBs and DSBs helps in annotating protein functions and understanding protein-binding domains. RESULTS: Performance is reported using the DNA-binding protein dataset recently introduced by Wang et al. [1]. The proposed method achieved a sensitivity of 0.600, specificity of 0.792, AUC of 0.758, MCC of 0.369, accuracy of 0.744, and F-measure of 0.536 on the independent test set. CONCLUSION: The proposed method, which uses hidden Markov model (HMM) profiles for feature extraction, outperformed the benchmark method in the literature and achieved an overall improvement of approximately 3%. The source code and supplementary information for the proposed method are available at https://github.com/roneshsharma/Predict-DNA-binding-proteins/wiki.
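For readers unfamiliar with the reported metrics, the sketch below shows how sensitivity, specificity, accuracy, F-measure, and MCC follow from a binary confusion matrix. The counts in the usage line are invented for illustration and are not the study's data; AUC is omitted because it needs ranked prediction scores rather than counts, and the function assumes every denominator is nonzero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Illustrative counts only (not taken from the paper).
example = binary_metrics(tp=3, fp=1, tn=4, fn=2)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for imbalanced test sets.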


Subjects
Computational Biology/methods , DNA, Single-Stranded/chemistry , DNA, Single-Stranded/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , DNA/chemistry , DNA/metabolism , Amino Acid Sequence , Databases, Protein , Markov Chains , Models, Statistical , Protein Binding , Protein Domains , Sequence Analysis, Protein/methods , Software , Support Vector Machine
6.
PLoS One ; 15(9): e0238625, 2020.
Article in English | MEDLINE | ID: mdl-32915813

ABSTRACT

Recent advances in DNA sequencing methods revolutionized biology by providing highly accurate reads, with high throughput or high read length. These read data are being used in many biological and medical applications. Modern DNA sequencing methods have no equivalent in protein sequencing, severely limiting the widespread application of protein data. Recently, several optical protein sequencing methods have been proposed that rely on the fluorescent labeling of amino acids. Here, we introduce the reprotonation-deprotonation protein sequencing method. Unlike other methods, this proposed technique relies on the measurement of an electrical signal and requires no fluorescent labeling. In reprotonation-deprotonation protein sequencing, the terminal amino acid is identified through its unique protonation signal, and by repeatedly cleaving the terminal amino acids one-by-one, each amino acid in the peptide is measured. By means of simulations, we show that, given a reference database of known proteins, reprotonation-deprotonation sequencing has the potential to correctly identify proteins in a sample. Our simulations provide target values for the signal-to-noise ratios that sensor devices need to attain in order to detect reprotonation-deprotonation events, as well as suitable pH values and required measurement times per amino acid. For instance, an SNR of 10 is required for a 61.71% proteome recovery rate with 100 ms measurement time per amino acid.


Subjects
Amino Acids/chemistry , Proteins/chemistry , Proteome/genetics , Sequence Analysis, Protein/methods , Amino Acids/genetics , Fluorescent Dyes/chemistry , Peptides/chemistry , Peptides/genetics , Proteins/genetics , Proteome/chemistry , Protons , Sequence Analysis, DNA/methods , Signal-To-Noise Ratio
7.
Methods Mol Biol ; 2165: 83-101, 2020.
Article in English | MEDLINE | ID: mdl-32621220

ABSTRACT

Intrinsically disordered regions (IDRs) are estimated to be highly abundant in nature. While only several thousand proteins are annotated with experimentally derived IDRs, computational methods can be used to predict IDRs for the millions of currently uncharacterized protein chains. Several dozen disorder predictors were developed over the last few decades. While some of these methods provide accurate predictions, unavoidably they also make some mistakes. Consequently, one of the challenges facing users of these methods is how to decide which predictions can be trusted and which are likely incorrect. This practical problem can be solved using quality assessment (QA) scores that predict correctness of the underlying (disorder) predictions at a residue level. We motivate and describe a first-of-its-kind toolbox of QA methods, QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions), which provides the scores for a diverse set of ten disorder predictors. QUARTER is available to the end users as a free and convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTER/ . We briefly describe the predictive architecture of QUARTER and provide detailed instructions on how to use the webserver. We also explain how to interpret results produced by QUARTER with the help of a case study.


Subjects
Intrinsically Disordered Proteins/chemistry , Protein Conformation , Sequence Analysis, Protein/methods , Software , Sequence Analysis, Protein/standards
8.
Forensic Sci Int Genet ; 47: 102295, 2020 07.
Article in English | MEDLINE | ID: mdl-32289731

ABSTRACT

For the past three decades, forensic genetic investigations have focused on elucidating DNA signatures. While DNA has a number of desirable properties (e.g., presence in most biological materials, an amenable chemistry for analysis and well-developed statistics), DNA also has limitations. DNA may be in low quantity in some tissues, such as hair, and in some tissues it may degrade more readily than its protein counterparts. Recent research efforts have shown the feasibility of performing protein-based human identification in cases in which recovery of DNA is challenged; however, the methods involved in assessing the rarity of a given protein profile have not been addressed adequately. In this paper an algorithm is proposed that describes the computation of a random match probability (RMP) resulting from a genetically variable peptide signature. The approach described herein explicitly models proteomic error and genetic linkage, makes no assumptions as to allelic drop-out, and maps the observed proteomic alleles to their expected protein products from DNA which, in turn, permits standard corrections for population structure and finite database sizes. To assess the feasibility of this approach, RMPs were estimated from peptide profiles of skin samples from 25 individuals of European ancestry. 126 common peptide alleles were used in this approach, yielding a mean RMP of approximately 10⁻².


Subjects
Algorithms , Peptides , Sequence Analysis, Protein/methods , Alleles , Chromatography, Liquid , Gene Frequency , Humans , Mass Spectrometry , Monte Carlo Method , Probability , Proteomics
9.
IEEE/ACM Trans Comput Biol Bioinform ; 17(6): 1918-1931, 2020.
Article in English | MEDLINE | ID: mdl-30998480

ABSTRACT

As the first step of machine-learning based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Different from protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and propose a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies, including binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks by using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and the structure-based and machine-learning encoding methods also show some potential for further application; in particular, the neural-network-based distributed representation of amino acids may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
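Of the five categories listed, binary encoding is the simplest to show concretely. The sketch below is a plain one-hot encoding over the 20 standard amino acids; the alphabet ordering and the choice to map unknown residues (such as X) to an all-zero vector are assumptions made for this illustration, not details taken from the review.

```python
# Alphabet order is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional binary vectors."""
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        position = INDEX.get(residue)
        if position is not None:
            vector[position] = 1  # exactly one bit set per known residue
        # Unknown residues keep the all-zero vector (assumption of this sketch).
        encoded.append(vector)
    return encoded


encoded = one_hot_encode("MKV")  # three vectors, one per residue
```

Because every residue gets its own vector, the same encoding can feed residue-level predictors directly or be pooled for sequence-level tasks.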


Subjects
Amino Acid Sequence/genetics , Amino Acids/chemistry , Computational Biology/methods , Proteins , Sequence Analysis, Protein/methods , Algorithms , Protein Folding , Protein Structure, Secondary/genetics , Proteins/chemistry , Proteins/genetics , Proteins/physiology
10.
Biomolecules ; 9(12)2019 12 04.
Article in English | MEDLINE | ID: mdl-31817166

ABSTRACT

Superoxide dismutase (SOD) is the primary enzyme of the cellular antioxidant defense cascade. Misfolding, concomitant oligomerization, and higher order aggregation of human cytosolic SOD are linked to amyotrophic lateral sclerosis (ALS). Although SOD1 is extremely robust with its two metal ion cofactors, the de-metallated apo form is intrinsically disordered. Since the rise of oxygen-based metabolism and antioxidant defense systems are evolutionarily coupled, SOD is an interesting protein with a deep evolutionary history. We deployed statistical analysis of sequence space to decode evolutionarily co-varying residues in this protein. These were validated by applying graph theoretical modelling to understand the impact of the presence of metal ion cofactors in dictating the disordered (apo) to hidden disordered (wild-type SOD1) transition. Contact maps were generated for different variants, and the selected significant residues were mapped on separate structure networks. Sequence space analysis coupled with structure networks helped us to map the evolutionarily coupled co-varying patches in SOD1 and its metal-depleted variants. In addition, using structure network analysis, the residues with a major impact on the internal dynamics of the protein structure were investigated. Our results reveal that the bulk of these evolutionarily co-varying residues are localized in the loop regions and positioned differentially depending upon the metal residence and concomitant steric restrictions of the loops.


Subjects
Sequence Analysis, Protein/methods , Superoxide Dismutase-1/chemistry , Superoxide Dismutase-1/genetics , Evolution, Molecular , Humans , Markov Chains , Models, Molecular , Mutation , Protein Conformation , Protein Folding
11.
BMC Bioinformatics ; 20(1): 473, 2019 Sep 14.
Article in English | MEDLINE | ID: mdl-31521110

ABSTRACT

BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite . CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
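As background for the Viterbi algorithm mentioned above, here is a plain, scalar, log-space Viterbi decoder for a generic discrete HMM. It is a textbook sketch, not HH-suite's SIMD code and not a profile-HMM pairwise aligner; the two-state toy model in the usage lines is invented for illustration, and the function assumes a non-empty observation sequence.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log probability)."""
    # best[t][s]: best log score of any path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy two-state model (illustrative values only).
lg = math.log
toy_states = ["A", "B"]
toy_start = {"A": lg(0.5), "B": lg(0.5)}
toy_trans = {"A": {"A": lg(0.9), "B": lg(0.1)},
             "B": {"A": lg(0.1), "B": lg(0.9)}}
toy_emit = {"A": {"x": lg(0.9), "y": lg(0.1)},
            "B": {"x": lg(0.1), "y": lg(0.9)}}
path, score = viterbi("xxyy", toy_states, toy_start, toy_trans, toy_emit)
# path is ["A", "A", "B", "B"]
```

The inner loop over states is a natural target for data-parallel execution, which is the kind of speed-up the abstract describes.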


Subjects
Molecular Sequence Annotation/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Markov Chains
12.
PDA J Pharm Sci Technol ; 73(6): 622-634, 2019.
Article in English | MEDLINE | ID: mdl-31209169

ABSTRACT

The application of advanced methodologies such as next-generation sequencing (NGS) and mass spectrometry (MS) to the characterization of cell lines and recombinant proteins has enabled the highly sensitive detection of sequence variants (SVs). However, although these approaches can be leveraged to provide deep insight into product microheterogeneity caused by SVs, they are not used in a standardized manner across the industry. Currently, there is little clarity and consensus on the utilization, timing, and significance of SV findings. This white paper addresses the current practices, logistics, and strategies for the analysis of SVs using a benchmarking survey coordinated by the International Consortium for Innovation & Quality in Pharmaceutical Development (IQ) as well as a series of deliberations among a panel of experts assembled from across the biopharmaceutical industry. Discussion includes current industry experiences including approaches for detection and quantitation of SVs during cell-line and process development, risk assessments, and regulatory feedback. Although SVs are a potential issue for all recombinant protein therapeutics, the scope of this discussion will be limited to SVs produced in mammalian cells. Ultimately, it is our hope that the findings from the survey and deliberations of the committee are useful to decision makers in industry and positions them to respond to findings of SVs in recombinant proteins that are destined for clinical or commercial use in a strategic manner. LAY ABSTRACT: This white paper addresses the current practices, logistics, and strategies for the analysis of amino acid sequence variants using a benchmarking survey coordinated by the International Consortium for Innovation & Quality in Pharmaceutical Development (IQ) as well as a series of deliberations among a panel of experts assembled from across the biopharmaceutical industry. Discussion includes current industry experiences regarding detection and quantitation of SVs during cell-line and process development, risk assessments, and regulatory feedback.


Subjects
Drug Industry/methods , High-Throughput Nucleotide Sequencing/methods , Recombinant Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Sequence , Animals , Benchmarking , Humans , Mammals , Mass Spectrometry/methods , Risk Assessment/methods
13.
Int J Mol Sci ; 19(12)2018 Nov 22.
Article in English | MEDLINE | ID: mdl-30469512

ABSTRACT

Signal peptides are N-terminal presequences responsible for targeting proteins to the endomembrane system, and subsequent subcellular or extracellular compartments, and consequently condition their proper function. The significance of signal peptides stimulates development of new computational methods for their detection. These methods employ learning systems trained on datasets comprising signal peptides from different types of proteins and taxonomic groups. As a result, the accuracy of predictions is high in the case of signal peptides that are well-represented in databases, but might be low in other, atypical cases. Such atypical signal peptides are present in proteins found in apicomplexan parasites, causative agents of malaria and toxoplasmosis. Apicomplexan proteins have a unique amino acid composition due to their AT-biased genomes. Therefore, we designed a new, more flexible and universal probabilistic model for recognition of atypical eukaryotic signal peptides. Our approach, called signalHsmm, includes knowledge about the structure of signal peptides and physicochemical properties of amino acids. It is able to recognize signal peptides from the malaria parasites and related species more accurately than popular programs. Moreover, it is still universal enough to provide prediction of other signal peptides on par with the best-performing predictors.


Subjects
Plasmodium/chemistry , Protein Sorting Signals , Protozoan Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acids/chemistry , Markov Chains , Sequence Analysis, Protein/standards
14.
Int J Mol Sci ; 19(10)2018 Oct 07.
Article in English | MEDLINE | ID: mdl-30301243

ABSTRACT

Using computational techniques to identify intrinsically disordered residues is practical and effective in biological studies. Therefore, designing novel high-accuracy strategies is always preferable when existing strategies have a lot of room for improvement. Among many possibilities, a meta-strategy that integrates the results of multiple individual predictors has been broadly used to improve the overall performance of predictors. Nonetheless, a simple and direct integration of individual predictors may not effectively improve the performance. In this project, dual-threshold two-step significance voting and neural networks were used to integrate the predictive results of four individual predictors, including: DisEMBL, IUPred, VSL2, and ESpritz. The new meta-strategy has improved the prediction performance of intrinsically disordered residues significantly, compared to all four individual predictors and another four recently-designed predictors. The improvement was validated using five-fold cross-validation and in independent test datasets.


Subjects
Intrinsically Disordered Proteins/chemistry , Neural Networks, Computer , Sequence Analysis, Protein/methods , Humans , Intrinsically Disordered Proteins/metabolism , Sequence Analysis, Protein/standards , Software
15.
BMC Bioinformatics ; 19(1): 229, 2018 06 18.
Article in English | MEDLINE | ID: mdl-29914376

ABSTRACT

BACKGROUND: In order to capture the vital structural information of the original protein, the symbol sequence was transformed into the Markov frequency matrix according to the consecutive three residues throughout the chain. A three-dimensional sparse matrix sized 20 × 20 × 20 was obtained and expanded to a one-dimensional vector. Then, an appropriate measurement matrix was selected for the vector to obtain a compressed feature set by random projection. Consequently, the new compressive sensing feature extraction technology was proposed. RESULTS: Several indexes were analyzed on the cell membrane, cytoplasm, and nucleus dataset to assess the discriminative power of the features. In comparison with the traditional methods of scale wavelet energy and amino acid components, the experimental results suggested the advantage and accuracy of the features obtained by this new method. CONCLUSIONS: The new features extracted from this model could preserve the maximum information contained in the sequence and reflect the essential properties of the protein. Thus, it is an adequate and promising method for collecting and processing protein sequences from a large sample size and high dimension.
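The pipeline described in the BACKGROUND section (triplet frequencies, flattening to 8,000 dimensions, random projection) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the relative-frequency normalization, the Gaussian measurement matrix, the output dimension of 64, and the function names are all assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def triplet_frequencies(sequence):
    """Sparse {flat_index: relative frequency} view of the 20 x 20 x 20 tensor.

    The flat index is (i * 20 + j) * 20 + k for consecutive residues i, j, k.
    """
    counts, total = {}, 0
    seq = sequence.upper()
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        if a in INDEX and b in INDEX and c in INDEX:
            flat = (INDEX[a] * 20 + INDEX[b]) * 20 + INDEX[c]
            counts[flat] = counts.get(flat, 0) + 1
            total += 1
    return {k: v / total for k, v in counts.items()} if total else {}


def random_projection(sparse_vec, out_dim=64, seed=0):
    """Project the 8000-dimensional sparse vector down to out_dim features.

    Each column of the (assumed Gaussian) measurement matrix is regenerated
    deterministically from the seed, so the full matrix is never stored.
    """
    features = [0.0] * out_dim
    scale = 1.0 / math.sqrt(out_dim)
    for flat, value in sparse_vec.items():
        rng = random.Random(seed * 8000 + flat)
        for j in range(out_dim):
            features[j] += rng.gauss(0.0, 1.0) * value * scale
    return features


features = random_projection(triplet_frequencies("MKVLAAGIVGLLLA"))
```

Only the nonzero triplet entries are visited, so the cost scales with sequence length rather than with the 8,000 possible triplets.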


Subjects
Algorithms , Data Compression/methods , Markov Chains , Peptide Fragments/metabolism , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein/methods , Cell Membrane/metabolism , Cell Nucleus/metabolism , Cytoplasm/metabolism , Humans , Peptide Fragments/chemistry , Protein Interaction Maps
16.
Acta Biotheor ; 66(2): 135-148, 2018 Jun.
Article in English | MEDLINE | ID: mdl-29700659

ABSTRACT

The accurate annotation of an unknown protein sequence depends on extant data of template sequences. This could be empirical or sets of reference sequences, and provides an exhaustive pool of probable functions. Individual methods of predicting dominant function possess shortcomings such as varying degrees of inter-sequence redundancy, arbitrary domain inclusion thresholds, heterogeneous parameterization protocols, and ill-conditioned input channels. Here, I present a rigorous theoretical derivation of various steps of a generic algorithm that integrates and utilizes several statistical methods to predict the dominant function in unknown protein sequences. The accompanying mathematical proofs, interval definitions, analysis, and numerical computations presented are meant to offer insights not only into the specificity and accuracy of predictions, but also provide details of the operatic mechanisms involved in the integration and its ensuing rigor. The algorithm uses numerically modified raw hidden Markov model scores of well-defined sets of training sequences and clusters them on the basis of known function. The results are then fed into an artificial neural network, the predictions of which can be refined using the available data. This pipeline is trained recursively and can be used to discern the dominant principal function, and thereby, annotate an unknown protein sequence. Whilst the approach is complex, the specificity of the final predictions can help laboratory workers design their experiments with greater confidence.


Subjects
Algorithms , Genes, Dominant , Mathematics , Models, Statistical , Proteins/chemistry , Proteins/classification , Sequence Analysis, Protein/methods , Artificial Intelligence , Humans , Markov Chains , Neural Networks, Computer , Proteins/genetics
17.
Brief Bioinform ; 19(2): 219-230, 2018 03 01.
Article in English | MEDLINE | ID: mdl-27802931

ABSTRACT

Sequence-based prediction of residue-residue contact in proteins is becoming increasingly important for improving protein structure prediction in the big data era. In this study, we performed a large-scale comparative assessment of 15 locally installed contact predictors. To assess these methods, we collected a big data set consisting of 680 nonredundant proteins covering different structural classes and target difficulties. We investigated a wide range of factors that may influence the precision of contact prediction, including target difficulty, structural class, the alignment depth and distribution of contact pairs in a protein structure. We found that: (1) the machine learning-based methods outperform the direct-coupling-based methods for short-range contact prediction, while the latter are significantly better for long-range contact prediction. The consensus-based methods, which combine machine learning and direct-coupling methods, perform the best. (2) The target difficulty does not have a clear influence on the machine learning-based methods, while it does affect the direct-coupling and consensus-based methods significantly. (3) The alignment depth has a relatively weak effect on the machine learning-based methods. However, for the direct-coupling-based methods and consensus-based methods, the predicted contacts for targets with deeper alignment tend to be more accurate. (4) All methods perform relatively better on β and α + β proteins than on α proteins. (5) Residues buried in the core of protein structure are more prone to be in contact than residues on the surface (22 versus 6%). We believe these are useful results for guiding future development of new approaches to contact prediction.


Subjects
Algorithms , Protein Interaction Domains and Motifs , Proteins/metabolism , Sequence Analysis, Protein/methods , Computational Biology/methods , Humans , Models, Molecular , Protein Conformation , Protein Folding , Proteins/chemistry
18.
Bioinformatics ; 34(3): 445-452, 2018 02 01.
Article in English | MEDLINE | ID: mdl-28968848

ABSTRACT

Motivation: Intrinsic disorder (ID), i.e. the lack of a unique folded conformation at physiological conditions, is a common feature for many proteins, which requires specialized biochemical experiments that are not high-throughput. Missing X-ray residues from the PDB have been widely used as a proxy for ID when developing computational methods. This may lead to a systematic bias, where predictors deviate from biologically relevant ID. Large benchmarking sets on experimentally validated ID are scarce. Recently, the DisProt database has been renewed and expanded to include manually curated ID annotations for several hundred new proteins. This provides a large benchmark set which has not yet been used for training ID predictors. Results: Here, we describe the first systematic benchmarking of ID predictors on the new DisProt dataset. In contrast to previous assessments based on missing X-ray data, this dataset contains mostly long ID regions and a significant amount of fully ID proteins. The benchmarking shows that ID predictors work quite well on the new dataset, especially for long ID segments. However, a large fraction of ID still goes virtually undetected and the ranking of methods is different than for PDB data. In particular, many predictors appear to confound ID and regions outside X-ray structures. This suggests that the ID prediction methods capture different flavors of disorder and can benefit from highly accurate curated examples. Availability and implementation: The raw data used for the evaluation are available from URL: http://www.disprot.org/assessment/. Contact: silvio.tosatto@unipd.it. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Databases, Protein , Protein Conformation , Sequence Analysis, Protein/methods
19.
Bioinformatics ; 34(4): 576-584, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29040374

ABSTRACT

Motivation: Pair Hidden Markov Models (PHMMs) are probabilistic models used for pairwise sequence alignment, a quintessential problem in bioinformatics. PHMMs include three types of hidden states: match, insertion and deletion. Most previous studies have used one or two hidden states for each PHMM state type. However, few studies have examined the number of states suitable for representing sequence data or improving alignment accuracy. Results: We developed a novel method to select superior models (including the number of hidden states) for PHMM. Our method selects models with the highest posterior probability using Factorized Information Criterion, which is widely utilized in model selection for probabilistic models with hidden variables. Our simulations indicated that this method has excellent model selection capabilities with slightly improved alignment accuracy. We applied our method to DNA datasets from 5 and 28 species, ultimately selecting more complex models than those used in previous studies. Availability and implementation: The software is available at https://github.com/bigsea-t/fab-phmm. Contact: mhamada@waseda.jp. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Animals , Bayes Theorem , Humans , Models, Statistical , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Sequence Analysis, RNA/methods
20.
Proteins ; 86 Suppl 1: 387-398, 2018 03.
Article in English | MEDLINE | ID: mdl-29178137

ABSTRACT

Every second year, the community experiment "Critical Assessment of Techniques for Structure Prediction" (CASP) conducts an independent blind assessment of structure prediction methods, providing a framework for comparing the performance of different approaches and discussing the latest developments in the field. Yet, developers of automated computational modeling methods clearly benefit from more frequent evaluations based on larger sets of data. The "Continuous Automated Model EvaluatiOn (CAMEO)" platform complements the CASP experiment by conducting fully automated blind prediction assessments based on the weekly pre-release of sequences of those structures that are going to be published in the next release of the Protein Data Bank (PDB). CAMEO publishes weekly benchmarking results based on models collected during a 4-day prediction window, on average assessing ca. 100 targets during a time frame of 5 weeks. CAMEO benchmarking data is generated consistently for all participating methods at the same point in time, enabling developers to benchmark and cross-validate their method's performance, and directly refer to the benchmarking results in publications. In order to facilitate server development and promote shorter release cycles, CAMEO sends weekly emails with submission statistics and low performance warnings. Many participants of CASP have successfully employed CAMEO when preparing their methods for upcoming community experiments. CAMEO offers a variety of scores to allow benchmarking diverse aspects of structure prediction methods. By introducing new scoring schemes, CAMEO facilitates new development in areas of active research, for example, modeling quaternary structure, complexes, or ligand binding sites.


Subjects
Computational Biology/methods , Models, Molecular , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein/methods , Binding Sites , Databases, Protein , Humans , Ligands , Protein Binding