Results 1 - 20 of 14,299
1.
Bioinformatics ; 40(Supplement_2): ii79-ii86, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39230690

ABSTRACT

MOTIVATION: The predominant tools for aligning large numbers of protein sequences decide whether to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and when increasingly large numbers of sequences are aligned. Recently, transformer-based deep-learning models have started to harness the vast amount of protein sequence data, resulting in powerful pretrained language models whose main purpose is to generate high-dimensional numerical representations (embeddings) for individual sites that agglomerate evolutionary, structural, and biophysical information. RESULTS: We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model aligns on average almost 6 percentage points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence on the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/Gaius-Augustus/learnMSA, PyPI and Bioconda; evaluation: https://github.com/felbecker/snakeMSA.


Subjects
Markov Chains , Proteins , Sequence Alignment , Sequence Analysis, Protein , Sequence Alignment/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Deep Learning , Algorithms , Computational Biology/methods , Amino Acid Sequence
2.
J Mol Biol ; 436(17): 168605, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39237195

ABSTRACT

Prediction of intrinsic disorder in protein sequences is an active research area, with well over 100 predictors released to date. These efforts are motivated by the functional importance and high abundance of intrinsic disorder, combined with relatively low amounts of experimental annotations. The disorder predictors are periodically evaluated by independent assessors in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiments. The recently completed CAID2 experiment assessed close to 40 state-of-the-art methods, demonstrating that some of them produce accurate results. In particular, the flDPnn2 method, the successor of flDPnn, which performed well in the CAID1 experiment, secured the overall most accurate results on the Disorder-NOX dataset in CAID2. flDPnn2 implements a number of improvements over its predecessor, including changes to the inputs, a larger deep network model that we retrained on a larger training set, and the addition of an alignment module. Using results from CAID2, we show that flDPnn2 produces accurate predictions very quickly, modestly improving over the accuracy of flDPnn and reducing the runtime by half, to about 27 s per protein. flDPnn2 is freely available as a convenient web server at http://biomine.cs.vcu.edu/servers/flDPnn2/.


Subjects
Computational Biology , Intrinsically Disordered Proteins , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Computational Biology/methods , Databases, Protein , Protein Conformation , Software , Amino Acid Sequence , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein/methods
3.
J Mol Biol ; 436(17): 168494, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39237207

ABSTRACT

Knowledge of the solvent accessibility of residues in a protein is essential for different applications, including the identification of interacting surfaces in protein-protein interactions and the characterization of variations. We describe E-pRSA, a novel web server to estimate Relative Solvent Accessibility values (RSAs) of residues directly from a protein sequence. The method exploits two complementary Protein Language Models to provide fast and accurate predictions. When benchmarked on different blind test sets, E-pRSA scores at the state-of-the-art, and outperforms a previous method we developed, DeepREx, which was based on sequence profiles after Multiple Sequence Alignments. The E-pRSA web server is freely available at https://e-prsa.biocomp.unibo.it/main/ where users can submit single-sequence and batch jobs.


Subjects
Proteins , Software , Solvents , Solvents/chemistry , Proteins/chemistry , Proteins/genetics , Computational Biology/methods , Amino Acid Sequence , Sequence Analysis, Protein/methods , Internet , Protein Conformation , Models, Molecular , Sequence Alignment
4.
Sci Rep ; 14(1): 20692, 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-39237735

ABSTRACT

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based methods outperformed MSA-based methods, and combining the two even decreased performance for some tasks (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.


Subjects
Evolution, Molecular , Proteins , Sequence Alignment , Proteins/metabolism , Proteins/genetics , Proteins/chemistry , Sequence Alignment/methods , Computational Biology/methods , Algorithms , Software , Sequence Analysis, Protein/methods
5.
BMC Bioinformatics ; 25(1): 287, 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39223474

ABSTRACT

BACKGROUND: Recently, advances in the processing of evolutionary information and in deep learning networks have improved protein contact prediction methods. Nevertheless, some bottlenecks remain: (1) the prediction of orphan proteins and other proteins with little evolutionary information; and (2) single-sequence-based prediction methods mainly focus on selecting protein sequence features and tuning the neural network architecture, and while deeper neural networks improve prediction accuracy, they also increase the computational burden. Compared with other neural networks used in protein prediction, graph neural networks have several advantages. They can reveal topological structure, and their hierarchical structure and local connectivity help capture features at different levels of abstraction in protein molecules. When protein sequence and structure information are used for joint training, the dependencies between the two kinds of information can be captured better. Graph neural networks can also process protein molecular structures of different lengths and shapes, whereas traditional neural networks need to convert proteins into fixed-size vectors or matrices for processing. RESULTS: Here, we propose PCP-GC-LM, a single-sequence-based protein contact map predictor with dual-level graph neural networks and convolution networks. Our method performs better than other single-sequence-based predictors in different independent tests. In addition, to verify the validity of our method on complex protein structures, we also compare it with other methods on two homodimer protein test sets (the DeepHomo test dataset and the CASP-CAPRI target dataset). Furthermore, we perform ablation experiments to demonstrate the necessity of a dual graph network. In all, our framework presents new modules to accurately predict inter-chain contact maps in proteins, and it is also useful for analyzing interactions in other types of protein complexes.


Subjects
Neural Networks, Computer , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Sequence Analysis, Protein/methods , Databases, Protein , Deep Learning , Protein Conformation , Algorithms
6.
Yi Chuan ; 46(8): 661-669, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39140146

ABSTRACT

The identification of enzyme functions plays a crucial role in understanding the mechanisms of biological activities and advancing the development of life sciences. However, existing enzyme EC number prediction methods did not fully utilize protein sequence information and still had shortcomings in identification accuracy. To address this issue, we proposed an EC number prediction network using hierarchical features and global features (ECPN-HFGF). This method first utilized residual networks to extract generic features from protein sequences, and then employed hierarchical feature extraction modules and global feature extraction modules to further extract hierarchical and global features of protein sequences. Subsequently, the prediction results of both feature types were combined, and a multitask learning framework was utilized to achieve accurate prediction of enzyme EC numbers. Experimental results indicated that the ECPN-HFGF method performed best in the task of predicting EC numbers for protein sequences, achieving macro F1 and micro F1 scores of 95.5% and 99.0%, respectively. The ECPN-HFGF method effectively combined hierarchical and global features of protein sequences, allowing for rapid and accurate EC number prediction. Compared to current commonly used methods, this method offers significantly higher prediction accuracy, providing an efficient approach for the advancement of enzymology research and enzyme engineering applications.


Subjects
Computational Biology , Computational Biology/methods , Amino Acid Sequence , Proteins/chemistry , Algorithms , Sequence Analysis, Protein/methods , Enzymes/chemistry , Enzymes/metabolism
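
The abstract of entry 6 reports both macro F1 and micro F1 scores. As a rough illustration of why the two can differ, the sketch below computes both for a single-label, multiclass toy problem. It is not the ECPN-HFGF evaluation code: the function name `f1_scores` and the toy labels are assumptions made for this example, and the paper's own evaluation setup may differ.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: F1 over pooled counts, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data (hypothetical): a frequent class and a rare class, labelled with EC numbers.
y_true = ["1.1.1.1"] * 8 + ["2.7.11.1"] * 2
y_pred = ["1.1.1.1"] * 8 + ["1.1.1.1", "2.7.11.1"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

In this toy case the single error falls on the rare class, so macro F1 drops well below micro F1. A gap in that direction is therefore consistent with harder, rarer classes, although the abstract itself does not break the scores down by class.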
7.
BMC Med Inform Decis Mak ; 24(1): 236, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39192227

ABSTRACT

Efforts to enhance the accuracy of protein sequence classification are of utmost importance in driving forward biological analyses and facilitating significant medical advancements. This study presents a cutting-edge model called ProtICNN-BiLSTM, which combines attention-based Improved Convolutional Neural Networks (ICNN) and Bidirectional Long Short-Term Memory (BiLSTM) units. Our main goal is to improve the accuracy of protein sequence classification by carefully optimizing performance through Bayesian Optimization. ProtICNN-BiLSTM combines the power of CNN and BiLSTM architectures to effectively capture local and global protein sequence dependencies. In the proposed model, the ICNN component uses convolutional operations to identify local patterns, and the BiLSTM component captures long-range associations by analyzing sequence data forwards and backwards. Bayesian Optimization tunes the model hyperparameters for efficiency and robustness. The model was extensively validated with PDB-14,189 and other protein data. We found that ProtICNN-BiLSTM outperforms traditional classification models. Fine-tuning by Bayesian Optimization and the integration of local and global sequence information make it effective. The precision of ProtICNN-BiLSTM improves comparative protein sequence classification. The study improves computational bioinformatics for complex biological analysis. Good results from the ProtICNN-BiLSTM model improve protein sequence classification. This powerful tool could improve medical and biological research. In summary, ProtICNN-BiLSTM is a protein sequence classification model in which Bayesian Optimization, ICNN, and BiLSTM together analyze biological data accurately.


Subjects
Bayes Theorem , Deep Learning , Sequence Analysis, Protein/methods , Humans , Computational Biology/methods , Proteins
8.
Comput Biol Med ; 181: 109048, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39182368

ABSTRACT

Neuropeptides are the most ubiquitous neurotransmitters in the immune system, regulating various biological processes. Neuropeptides play a significant role in the discovery of new drugs and targets for nervous system disorders. Traditional experimental methods for identifying neuropeptides are time-consuming and costly. Although several computational methods have been developed to predict neuropeptides, their accuracy is still unsatisfactory because of the limited representability of the extracted features. In this work, we propose an efficient and interpretable model, NeuroPpred-SHE, for predicting neuropeptides by selecting the optimal feature subset from both hand-crafted features and embeddings of a protein language model. Specifically, we first employed a pre-trained T5 protein language model to extract embedding features and twelve other encoding methods to extract hand-crafted features from peptide sequences. Secondly, we fused both embedding features and hand-crafted features to enhance the feature representability. Thirdly, we utilized random forest (RF), Max-Relevance and Min-Redundancy (mRMR) and eXtreme Gradient Boosting (XGBoost) methods to select the optimal feature subset from the fused features. Finally, we employed five machine learning methods (GBDT, XGBoost, SVM, MLP, and LightGBM) to build the models. Our results show that the model based on GBDT achieves the best performance. Furthermore, we compared our final model with other state-of-the-art methods on an independent test set; the results indicate that our model achieves an AUROC of 97.8%, which is higher than that of all the other state-of-the-art predictors. Our model is available at: https://github.com/wenjean/NeuroPpred-SHE.


Subjects
Neuropeptides , Computational Biology/methods , Humans , Databases, Protein , Machine Learning , Sequence Analysis, Protein/methods
9.
Nat Commun ; 15(1): 7400, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39191788

ABSTRACT

Significant research progress has been made in the field of protein structure and fitness prediction. Particularly, single-sequence-based structure prediction methods like ESMFold and OmegaFold achieve a balance between inference speed and prediction accuracy, showing promise for many downstream prediction tasks. Here, we propose SPIRED, a single-sequence-based structure prediction model that exhibits comparable performance to the state-of-the-art methods but with approximately 5-fold acceleration in inference and at least one order of magnitude reduction in training consumption. By integrating SPIRED with downstream neural networks, we compose an end-to-end framework named SPIRED-Fitness for the rapid prediction of both protein structure and fitness from single sequence with satisfactory accuracy. Moreover, SPIRED-Stab, the derivative of SPIRED-Fitness, achieves state-of-the-art performance in predicting the mutational effects on protein stability.


Subjects
Neural Networks, Computer , Protein Conformation , Proteins , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Computational Biology/methods , Algorithms , Protein Stability , Models, Molecular , Sequence Analysis, Protein/methods , Mutation
10.
Genome Biol ; 25(1): 230, 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39187866

ABSTRACT

Seqrutinator is an objective, flexible pipeline that removes sequences with sequencing and/or gene model errors and sequences from pseudogenes from complex, eukaryotic protein superfamilies. Testing Seqrutinator on major superfamilies BAHD, CYP, and UGT removes only 1.94% of SwissProt entries, 14% of entries from the model plant Arabidopsis thaliana, but 80% of entries from Pinus taeda's recent complete proteome. Application of Seqrutinator on crude BAHDomes, CYPomes, and UGTomes obtained from 16 plant proteomes shows convergence of the numbers of paralogues. MSAs, phylogenies, and particularly functional clustering improve drastically upon Seqrutinator application, indicating good performance.


Subjects
Plant Proteins , Plant Proteins/genetics , Plant Proteins/metabolism , Phylogeny , Software , Arabidopsis/genetics , Arabidopsis/metabolism , Proteome , Multigene Family , Sequence Analysis, Protein , Databases, Protein
11.
Chem Commun (Camb) ; 60(70): 9436-9439, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39139063

ABSTRACT

Oxidation of a thioether linkage in thioether-closed macrocyclic peptides led to collision-induced site-selective linearization of the peptides. This method has allowed for de novo sequencing of thioether macrocyclic peptides. The utility of the sequencing method was demonstrated by identifying the correct peptide sequences from a virtually randomized thioether macrocyclic peptide library.


Subjects
Oxidation-Reduction , Sulfides , Sulfides/chemistry , Peptides, Cyclic/chemistry , Macrocyclic Compounds/chemistry , Peptides/chemistry , Sequence Analysis, Protein , Peptide Library , Amino Acid Sequence
12.
Nat Commun ; 15(1): 6427, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39080256

ABSTRACT

A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information (de novo peptide sequencing) is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge, in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo's superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.


Subjects
Peptides , Proteomics , Tandem Mass Spectrometry , Peptides/chemistry , Peptides/metabolism , Tandem Mass Spectrometry/methods , Proteomics/methods , Neural Networks, Computer , Machine Learning , Humans , Amino Acid Sequence , Sequence Analysis, Protein/methods , Databases, Protein , Algorithms
13.
Sheng Wu Gong Cheng Xue Bao ; 40(7): 2087-2099, 2024 Jul 25.
Article in Chinese | MEDLINE | ID: mdl-39044577

ABSTRACT

With increasing computing power and the rapid expansion of biological data, the application of bioinformatics tools has become the mainstream approach to addressing biological problems. The accurate identification of protein function by bioinformatics tools is crucial for both biomedical research and drug discovery, making it a hot topic of research. In this paper, we categorize bioinformatics-based protein function prediction methods into three categories: protein sequence-based methods, protein structure-based methods, and protein interaction network-based methods. We further analyze the specific algorithms in each category, highlighting the latest research advancements and providing valuable references for the application of bioinformatics-based protein function prediction in biomedical research and drug discovery.


Subjects
Algorithms , Computational Biology , Proteins , Computational Biology/methods , Proteins/genetics , Proteins/metabolism , Proteins/chemistry , Protein Conformation , Protein Interaction Maps , Sequence Analysis, Protein , Amino Acid Sequence , Drug Discovery
14.
PLoS Comput Biol ; 20(7): e1011953, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38991035

ABSTRACT

With recent methodological advances in the field of computational protein design, in particular those based on deep learning, there is an increasing need for frameworks that allow for coherent, direct integration of different models and objective functions into the generative design process. Here we demonstrate how evolutionary multiobjective optimization techniques can be adapted to provide such an approach. With the established Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, we use AlphaFold2 and ProteinMPNN confidence metrics to define the objective space, and a mutation operator composed of ESM-1v and ProteinMPNN to rank and then redesign the least favorable positions. Using the two-state design problem of the fold-switching protein RfaH as an in-depth case study, and PapD and calmodulin as examples of higher-dimensional design problems, we show that the evolutionary multiobjective optimization approach leads to a significant reduction in the bias and variance in RfaH native sequence recovery, compared to a direct application of ProteinMPNN. We suggest that this improvement is due to three factors: (i) the use of an informative mutation operator that accelerates the sequence space exploration, (ii) the parallel, iterative design process inherent to the genetic algorithm that improves upon the ProteinMPNN autoregressive sequence decoding scheme, and (iii) the explicit approximation of the Pareto front that leads to optimal design candidates representing diverse tradeoff conditions. We anticipate this approach to be readily adaptable to different models and broadly relevant for protein design tasks with complex specifications.


Subjects
Algorithms , Computational Biology , Proteins , Computational Biology/methods , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Protein Engineering/methods , Sequence Analysis, Protein/methods
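
The abstract of entry 14 relies on the idea of a Pareto front: the set of candidates that no other candidate beats on every objective at once. The sketch below extracts that first non-dominated front from a list of objective vectors, assuming every objective is to be minimized. The function names and the toy scores are assumptions for illustration only; NSGA-II as used in the paper goes further by sorting the whole population into successive fronts and applying crowding-distance selection, none of which is shown here.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical two-objective scores for five design candidates (lower is better).
candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
print(pareto_front(candidates))
```

Here (3, 3) and (4, 4) are both dominated by (2, 2), while the remaining three candidates represent different tradeoffs between the two objectives, which is the sense in which the abstract speaks of design candidates "representing diverse tradeoff conditions".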
15.
Anal Chem ; 96(29): 12057-12064, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-38979842

ABSTRACT

De novo sequencing of any novel peptide/protein is a difficult task. The need for full sequence coverage, together with isomeric amino acid residues, inter- and intramolecular S-S bonds, and numerous other post-translational modifications, forces investigators to employ various chemical modifications that provide a variety of specific MSn fragmentation patterns. The chemical processes are time-consuming, and their yields never reach 100%, while the subsequent purification often leads to the loss of minor components of the initial peptide mixture. Here, we present the advantages of the EThcD method, which enables establishing the full sequence of natural intact peptides of ranid frogs in de novo top-down mode without any chemical modifications. The method provides complete sequence coverage, including the cyclic disulfide section, and reliable identification of isomeric leucine/isoleucine residues. The proposed approach demonstrated its efficiency in the analysis of peptidomes of ranid frogs from several populations of Rana arvalis, Rana temporaria, and Pelophylax esculentus complexes.


Subjects
Peptides , Ranidae , Animals , Peptides/chemistry , Peptides/analysis , Peptides/metabolism , Amino Acid Sequence , Sequence Analysis, Protein/methods , Amphibian Proteins/chemistry , Amphibian Proteins/metabolism
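
The abstract of entry 15 singles out isomeric leucine/isoleucine residues as a difficulty. A small calculation shows why: the two residues have the same elemental composition, so swapping one for the other leaves the peptide mass unchanged. The sketch below uses standard monoisotopic residue masses for a handful of residues; the function name `peptide_mass` and the example peptides are assumptions for illustration, and the code says nothing about how EThcD itself tells the two residues apart.

```python
# Monoisotopic residue masses in daltons (standard values; only a few residues listed).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "L": 113.08406,  # leucine
    "I": 113.08406,  # isoleucine: same elemental composition as leucine
    "K": 128.09496,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


# Two hypothetical peptides that differ only by a Leu/Ile swap.
print(peptide_mass("GLK"), peptide_mass("GIK"))
print(peptide_mass("GLK") == peptide_mass("GIK"))
```

Because the backbone fragment masses are likewise identical, mass alone cannot separate the two sequences, which is why the abstract treats reliable Leu/Ile identification as a notable result.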
16.
Comput Biol Med ; 179: 108815, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38986287

ABSTRACT

Predicting protein structure is both fascinating and formidable, playing a crucial role in structure-based drug discovery and in unraveling diseases with elusive origins. The Critical Assessment of Protein Structure Prediction (CASP) serves as a biennial battleground where scientists from around the world converge to untangle the intricate relationships within amino acid chains. Two primary methods, Template-Based Modeling (TBM) and Template-Free (TF) strategies, dominate protein structure prediction. The trend has shifted towards Template-Free predictions due to their broader sequence coverage with fewer templates. The predictive process can be broadly classified into contact map, binned-distance, and real-valued distance predictions, each with distinctive strengths and limitations manifested through tailored loss functions. We also introduce the revolutionary end-to-end and all-atom diffusion-based techniques that have transformed protein structure prediction. Recent advancements in deep learning techniques have significantly improved prediction accuracy, although their effectiveness is contingent upon the quality of input features derived from natural bio-physiochemical attributes and Multiple Sequence Alignments (MSAs). Hence, the generation of high-quality MSA data is of paramount importance in obtaining informative input features for enhanced prediction outcomes. Remarkable gains in prediction accuracy have been achieved, but they do not yet meet the purposes for which structural knowledge is intended, which implies a need for development in other aspects of prediction. In this regard, scientists have opened other frontiers for protein structure prediction. The use of subsampling in MSAs and of protein language modeling appears particularly promising for enhancing the accuracy and efficiency of predictions, ultimately aiding drug discovery efforts. The exploration of protein complex structure prediction also opens up exciting opportunities to deepen our knowledge of molecular interactions and to design more effective therapeutics. In this article, we discuss the vicissitudes that scientists have gone through to improve prediction accuracy, and examine effective strategies for prediction from different aspects, including the construction of high-quality MSAs, the provision of informative input features, and progress in deep learning approaches. We also briefly touch upon the transition from predicting single-chain protein structures to predicting protein complex structures. Our findings point towards promoting open research environments to support the objectives of protein structure prediction.


Subjects
Protein Conformation , Proteins , Proteins/chemistry , Models, Molecular , Computational Biology/methods , Humans , Sequence Analysis, Protein/methods , Deep Learning , Databases, Protein
17.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-39003530

ABSTRACT

Protein function prediction is critical for understanding the cellular physiological and biochemical processes, and it opens up new possibilities for advancements in fields such as disease research and drug discovery. During the past decades, with the exponential growth of protein sequence data, many computational methods for predicting protein function have been proposed. Therefore, a systematic review and comparison of these methods are necessary. In this study, we divide these methods into four different categories, including sequence-based methods, 3D structure-based methods, PPI network-based methods and hybrid information-based methods. Furthermore, their advantages and disadvantages are discussed, and then their performance is comprehensively evaluated and compared. Finally, we discuss the challenges and opportunities present in this field.


Subjects
Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Humans , Sequence Analysis, Protein/methods , Algorithms
18.
PLoS Comput Biol ; 20(7): e1012258, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38968291

ABSTRACT

The practical application of new single molecule protein sequencing (SMPS) technologies requires accurate estimates of their associated sequencing error rates. Here, we describe the development and application of two distinct parameter estimation methods for analyzing SMPS reads produced by fluorosequencing. A Hidden Markov Model (HMM) based approach extends whatprot, where we previously used HMMs for SMPS peptide-read matching. This extension offers a principled approach for estimating key parameters for fluorosequencing experiments, including missed amino acid cleavages, dye loss, and peptide detachment. Specifically, we adapted the Baum-Welch algorithm, a standard technique to estimate transition probabilities for an HMM using expectation maximization, but modified here to estimate a small number of parameter values directly rather than estimating every transition probability independently. We demonstrate a high degree of accuracy on simulated data, but on experimental datasets, we observed that the model needed to be augmented with an additional error type, N-terminal blocking. This, in combination with data pre-processing, results in reasonable parameterizations of experimental datasets that agree with controlled experimental perturbations. A second independent implementation using a hybrid of DIRECT and Powell's method to reduce the root mean squared error (RMSE) between simulations and the real dataset was also developed. We compare these methods on both simulated and real data, finding that our Baum-Welch based approach outperforms DIRECT and Powell's method by most, but not all, criteria. Although some discrepancies between the results exist, we also find that both approaches provide similar error rate estimates from experimental single molecule fluorosequencing datasets.


Subjects
Algorithms , Markov Chains , Sequence Analysis, Protein , Sequence Analysis, Protein/methods , Proteins/chemistry , Computational Biology/methods , Single Molecule Imaging/methods , Computer Simulation
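
The abstract of entry 18 builds on the Baum-Welch algorithm, whose expectation step depends on the forward pass of an HMM. The sketch below shows only that forward pass for a small discrete HMM, computing the likelihood of an observation sequence. It is a textbook illustration rather than the whatprot implementation: the function name, the two-state model and all probabilities are assumptions chosen for this example, and the parameter-tying modification described in the abstract is not reproduced.

```python
def forward_likelihood(init, trans, emit, obs):
    """Likelihood of an observation sequence under a discrete HMM.

    init[s]      probability of starting in state s
    trans[r][s]  probability of moving from state r to state s
    emit[s][o]   probability that state s emits symbol o
    obs          non-empty list of observed symbol indices
    """
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(init, trans, emit, [0, 1]))
```

Baum-Welch combines these forward quantities with a matching backward pass to obtain expected state and transition counts, and then re-estimates the parameters from those counts. For long sequences a practical implementation would rescale alpha at each step or work in log space to avoid numerical underflow.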
19.
Sci Rep ; 14(1): 15000, 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38951578

ABSTRACT

The primary objective of analyzing data from a mass spectrometry-based proteomic experiment is peptide and protein identification, that is, the correct assignment of each tandem mass spectrum to one amino acid sequence. Comparing empirical fragment spectra with theoretically predicted ones, or matching them against a library of collected spectra, are commonly accepted strategies for identifying proteins and determining their amino acid sequences. Although these approaches are widely used and are appreciably efficient for well-characterized model organisms or measured proteins, they cannot detect novel peptide sequences that are rare or have not been previously annotated. This study presents PowerNovo, a tool for de novo sequencing of proteins using tandem mass spectra acquired with a variety of mass analyzers and fragmentation techniques. PowerNovo involves an ensemble of models for peptide sequencing: a model for detecting regularities in tandem mass spectra, precursors, and fragment ions, and a natural language processing model that assesses peptide sequence quality and helps reconstruct noisy sequences. Testing showed that the performance of PowerNovo is comparable to or even better than that of the widely used PointNovo, DeepNovo, Casanovo, and Novor packages. PowerNovo also provides a complete processing pipeline for mass spectrometry data: along with predicting the peptide sequence, it includes peptide assembly and protein inference blocks.


Subjects
Peptides , Sequence Analysis, Protein , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides/chemistry , Peptides/analysis , Amino Acid Sequence , Software , Proteomics/methods , Algorithms
20.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-39038936

ABSTRACT

Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component in most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND (one of the most popular tools for function prediction), under default search parameters. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function to derive GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes in their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should make an important contribution to the development of future protein function prediction algorithms.


Subjects
Databases, Protein , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Gene Ontology , Algorithms , Sequence Analysis, Protein/methods , Software , Machine Learning