Results 1 - 20 of 4,910
1.
Sci Rep ; 14(1): 15000, 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38951578

ABSTRACT

The primary objective of analyzing data from a mass spectrometry-based proteomic experiment is peptide and protein identification, that is, the correct assignment of each tandem mass spectrum to one amino acid sequence. Comparing empirical fragment spectra with theoretically predicted ones, or matching them against a library of collected spectra, are the commonly accepted strategies for identifying proteins and determining their amino acid sequences. Although these approaches are widely used and are appreciably efficient for well-characterized model organisms or previously measured proteins, they cannot detect novel peptide sequences that are rare or have not been previously annotated. This study presents PowerNovo, a tool for de novo sequencing of proteins using tandem mass spectra acquired on a variety of mass analyzer types and with different fragmentation techniques. PowerNovo uses an ensemble of models for peptide sequencing: a model for detecting regularities in tandem mass spectra, precursors, and fragment ions, and a natural language processing model that assesses peptide sequence quality and helps reconstruct noisy sequences. Testing showed that the performance of PowerNovo is comparable to, or even better than, that of the widely used PointNovo, DeepNovo, Casanovo, and Novor packages. PowerNovo also provides a complete processing pipeline for mass spectrometry data: along with predicting the peptide sequence, it includes peptide assembly and protein inference blocks.


Subject(s)
Peptides , Protein Sequence Analysis , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Protein Sequence Analysis/methods , Peptides/chemistry , Peptides/analysis , Amino Acid Sequence , Software , Proteomics/methods , Algorithms
2.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-39003530

ABSTRACT

Protein function prediction is critical for understanding the cellular physiological and biochemical processes, and it opens up new possibilities for advancements in fields such as disease research and drug discovery. During the past decades, with the exponential growth of protein sequence data, many computational methods for predicting protein function have been proposed. Therefore, a systematic review and comparison of these methods are necessary. In this study, we divide these methods into four different categories, including sequence-based methods, 3D structure-based methods, PPI network-based methods and hybrid information-based methods. Furthermore, their advantages and disadvantages are discussed, and then their performance is comprehensively evaluated and compared. Finally, we discuss the challenges and opportunities present in this field.


Subject(s)
Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Humans , Protein Sequence Analysis/methods , Algorithms
3.
PLoS Comput Biol ; 20(7): e1012258, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38968291

ABSTRACT

The practical application of new single molecule protein sequencing (SMPS) technologies requires accurate estimates of their associated sequencing error rates. Here, we describe the development and application of two distinct parameter estimation methods for analyzing SMPS reads produced by fluorosequencing. The first, a Hidden Markov Model (HMM) based approach, extends whatprot, in which we previously used HMMs for SMPS peptide-read matching. This extension offers a principled approach for estimating key parameters for fluorosequencing experiments, including missed amino acid cleavages, dye loss, and peptide detachment. Specifically, we adapted the Baum-Welch algorithm, a standard technique for estimating HMM transition probabilities by expectation maximization, modifying it to estimate a small number of parameter values directly rather than estimating every transition probability independently. We demonstrate a high degree of accuracy on simulated data, but on experimental datasets, we observed that the model needed to be augmented with an additional error type, N-terminal blocking. This, in combination with data pre-processing, results in reasonable parameterizations of experimental datasets that agree with controlled experimental perturbations. A second, independent implementation was also developed, using a hybrid of DIRECT and Powell's method to reduce the root mean squared error (RMSE) between simulations and the real dataset. We compare these methods on both simulated and real data, finding that our Baum-Welch based approach outperforms DIRECT and Powell's method by most, but not all, criteria. Although some discrepancies between the results exist, we also find that both approaches provide similar error rate estimates from experimental single molecule fluorosequencing datasets.


Subject(s)
Algorithms , Markov Chains , Protein Sequence Analysis , Protein Sequence Analysis/methods , Proteins/chemistry , Computational Biology/methods , Single Molecule Imaging/methods , Computer Simulation
4.
Anal Chem ; 96(29): 12057-12064, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-38979842

ABSTRACT

De novo sequencing of any novel peptide/protein is a difficult task. Full sequence coverage, isomeric amino acid residues, inter- and intramolecular S-S bonds, and numerous other post-translational modifications make the investigators employ various chemical modifications, providing a variety of specific fragmentation MSn patterns. The chemical processes are time-consuming, and their yields never reach 100%, while the subsequent purification often leads to the loss of minor components of the initial peptide mixture. Here, we present the advantages of the EThcD method that enables establishing the full sequence of natural intact peptides of ranid frogs in de novo top-down mode without any chemical modifications. The method provides complete sequence coverage, including the cyclic disulfide section, and reliable identification of isomeric leucine/isoleucine residues. The proposed approach demonstrated its efficiency in the analysis of peptidomes of ranid frogs from several populations of Rana arvalis, Rana temporaria, and Pelophylax esculentus complexes.


Subject(s)
Peptides , Ranidae , Animals , Peptides/chemistry , Peptides/analysis , Peptides/metabolism , Amino Acid Sequence , Protein Sequence Analysis/methods , Amphibian Proteins/chemistry , Amphibian Proteins/metabolism
5.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-39038936

ABSTRACT

Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component of most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that, under default search parameters, BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND, one of the most popular tools for function prediction. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function to derive GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes in their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should make an important contribution to the development of future protein function prediction algorithms.


Subject(s)
Protein Databases , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Gene Ontology , Algorithms , Protein Sequence Analysis/methods , Software , Machine Learning
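
Illustrative note on entry 5: the abstract describes deriving GO predictions from homologous hits with a scoring function. As a rough sketch of what homology-based function transfer can look like, the Python below scores each GO term by the share of total bit score contributed by hits that carry the term. This is a generic baseline chosen for illustration, not the new scoring function proposed in the paper; the function name `go_scores`, the hit identifiers, and the bit scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def go_scores(
    hits: List[Tuple[str, float]],
    annotations: Dict[str, Iterable[str]],
) -> Dict[str, float]:
    """Score GO terms for a query from its sequence-search hits.

    hits: (subject_id, bit_score) pairs from a search tool.
    annotations: subject_id -> GO terms annotated on that subject.
    A term's score is the summed bit score of hits carrying the term,
    divided by the summed bit score of all hits (a simple baseline,
    assumed here for illustration).
    """
    total = sum(bits for _, bits in hits)
    if total <= 0:
        return {}
    summed: Dict[str, float] = defaultdict(float)
    for subject_id, bits in hits:
        # Use a set so a term listed twice on one subject counts once.
        terms: Set[str] = set(annotations.get(subject_id, ()))
        for term in terms:
            summed[term] += bits
    return {term: value / total for term, value in summed.items()}


# Hypothetical hits and annotations, for illustration only.
example_hits = [("P1", 100.0), ("P2", 50.0), ("P3", 50.0)]
example_annotations = {
    "P1": ["GO:0003677"],
    "P2": ["GO:0003677", "GO:0005515"],
    "P3": ["GO:0005515"],
}
example_scores = go_scores(example_hits, example_annotations)
```

With these toy inputs, a term carried by the strongest hits receives the highest score. Hits without any annotation still count toward the denominator, which lowers every term's score; whether that is desirable is a design choice.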
6.
PLoS Comput Biol ; 20(7): e1011953, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38991035

ABSTRACT

With recent methodological advances in the field of computational protein design, in particular those based on deep learning, there is an increasing need for frameworks that allow for coherent, direct integration of different models and objective functions into the generative design process. Here we demonstrate how evolutionary multiobjective optimization techniques can be adapted to provide such an approach. With the established Non-dominated Sorting Genetic Algorithm II (NSGA-II) as the optimization framework, we use AlphaFold2 and ProteinMPNN confidence metrics to define the objective space, and a mutation operator composed of ESM-1v and ProteinMPNN to rank and then redesign the least favorable positions. Using the two-state design problem of the foldswitching protein RfaH as an in-depth case study, and PapD and calmodulin as examples of higher-dimensional design problems, we show that the evolutionary multiobjective optimization approach leads to significant reduction in the bias and variance in RfaH native sequence recovery, compared to a direct application of ProteinMPNN. We suggest that this improvement is due to three factors: (i) the use of an informative mutation operator that accelerates the sequence space exploration, (ii) the parallel, iterative design process inherent to the genetic algorithm that improves upon the ProteinMPNN autoregressive sequence decoding scheme, and (iii) the explicit approximation of the Pareto front that leads to optimal design candidates representing diverse tradeoff conditions. We anticipate this approach to be readily adaptable to different models and broadly relevant for protein design tasks with complex specifications.


Subject(s)
Algorithms , Computational Biology , Proteins , Computational Biology/methods , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Protein Engineering/methods , Protein Sequence Analysis/methods
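
Illustrative note on entry 6: the abstract relies on NSGA-II and an explicit approximation of the Pareto front. The Python sketch below shows only the core idea of non-domination, which NSGA-II uses to rank candidates; it omits crowding distance, selection, and mutation. It assumes every objective is to be maximized (as confidence metrics would be), and the function names and objective values are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b under maximization.

    a dominates b when a is at least as good in every objective
    and strictly better in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the points that no other point dominates (the first front)."""
    front = []
    for i, candidate in enumerate(points):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(candidate)
    return front


# Hypothetical two-objective scores for five design candidates.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
first_front = pareto_front(candidates)
```

In this toy example the first three candidates represent different tradeoffs between the two objectives and none dominates another, while the last two are dominated by (0.5, 0.5).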
7.
Proc Natl Acad Sci U S A ; 121(27): e2311887121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913900

ABSTRACT

Predicting which proteins interact from their amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.


Subject(s)
Proteins , Sequence Alignment , Sequence Alignment/methods , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Algorithms , Protein Sequence Analysis/methods , Computational Biology/methods , Protein Databases
8.
Bioinformatics ; 40(Supplement_1): i410-i417, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940129

ABSTRACT

MOTIVATION: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools. RESULTS: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.


Subject(s)
Protein Databases , Peptides , Peptides/chemistry , Machine Learning , Mass Spectrometry/methods , Algorithms , Protein Sequence Analysis/methods , Tandem Mass Spectrometry/methods
9.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940160

ABSTRACT

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Subject(s)
Algorithms , Markov Chains , Sequence Alignment , Software , Sequence Alignment/methods , Computational Biology/methods , Protein Sequence Analysis/methods , Phylogeny , Proteins/chemistry
10.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38851299

ABSTRACT

Protein-protein interactions (PPIs) are the basis of many important biological processes, with protein complexes being the key forms implementing these interactions. Understanding protein complexes and their functions is critical for elucidating mechanisms of life processes, disease diagnosis and treatment and drug development. However, experimental methods for identifying protein complexes have many limitations. Therefore, it is necessary to use computational methods to predict protein complexes. Protein sequences can indicate the structure and biological functions of proteins, while also determining their binding abilities with other proteins, influencing the formation of protein complexes. Integrating these characteristics to predict protein complexes is very promising, but currently there is no effective framework that can utilize both protein sequence and PPI network topology for complex prediction. To address this challenge, we have developed HyperGraphComplex, a method based on a hypergraph variational autoencoder that can capture expressive features from protein sequences without feature engineering, while also considering topological properties in PPI networks, to predict protein complexes. Experimental results demonstrated that HyperGraphComplex achieves satisfactory predictive performance when compared with state-of-the-art methods. Further bioinformatics analysis shows that the predicted protein complexes have similar attributes to known ones. Moreover, case studies corroborated the remarkable predictive capability of our model in identifying protein complexes, including 3 that were not only experimentally validated by recent studies but also exhibited high-confidence structural predictions from AlphaFold-Multimer. We believe that the HyperGraphComplex algorithm and our provided proteome-wide high-confidence protein complex prediction dataset will help elucidate how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development. Source code is available at https://github.com/LiDlab/HyperGraphComplex.


Subject(s)
Computational Biology , Protein Interaction Mapping , Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/metabolism , Proteins/chemistry , Algorithms , Protein Interaction Maps , Protein Databases , Humans , Protein Sequence Analysis/methods , Amino Acid Sequence
11.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein's three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model, DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that its prediction performance surpasses that of state-of-the-art algorithms. It is able to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Subject(s)
Algorithms , Computational Biology , Computer Neural Networks , Protein Secondary Structure , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Protein Databases , Gene Ontology , Protein Sequence Analysis/methods , Software
12.
J Am Soc Mass Spectrom ; 35(7): 1556-1566, 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38806410

ABSTRACT

Protein phosphorylation, a common post-translational modification (PTM), is fundamental in a plethora of biological processes, most importantly in modulating cell signaling pathways. Matrix-assisted laser desorption/ionization (MALDI) coupled to tandem mass spectrometry (MS/MS) is an attractive method for phosphopeptide characterization due to its high speed, low limit of detection, and surface sampling capabilities. However, MALDI analysis of phosphopeptides is constrained by relatively low abundances in biological samples and poor relative ionization efficiencies in positive ion mode. Additionally, MALDI tends to produce singly charged ions, generally limiting the accessible MS/MS techniques that can be used for peptide sequencing. For example, collision induced dissociation (CID) is readily amenable to the analysis of singly charged ions, but results in facile loss of phosphoric acid, precluding the localization of the PTM. Electron-based dissociation methods (e.g., electron capture dissociation, ECD) are well suited for PTM localization, but require multiply charged peptide cations to avoid neutralization during ECD. Conversely, phosphopeptides are readily ionized using MALDI in negative ion mode. If the precursor ions are first formed in negative ion mode, a gas-phase charge inversion ion/ion reaction could then be used to transform the phosphopeptide anions produced via MALDI into multiply charged cations that are well suited for ECD. Herein we demonstrate a multistep workflow combining a charge inversion ion/ion reaction that first transforms MALDI-generated phosphopeptide monoanions into multiply charged cations, and then subjects these multiply charged phosphopeptide cations to ECD for sequence determination and phosphate bond localization.


Subject(s)
Phosphopeptides , Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry , Tandem Mass Spectrometry , Phosphopeptides/chemistry , Phosphopeptides/analysis , Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry/methods , Tandem Mass Spectrometry/methods , Protein Sequence Analysis/methods , Ions/chemistry , Amino Acid Sequence , Humans
13.
Nucleic Acids Res ; 52(W1): W287-W293, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38747351

ABSTRACT

The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine learning-based analyses for characterizing protein structure and function. In this paper we provide an update on recent additions and developments to the web server, with a focus on new deep learning-based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and we give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.


Subject(s)
Deep Learning , Proteins , Software , Proteins/chemistry , Proteins/genetics , Internet , Protein Conformation , Computational Biology/methods , Protein Sequence Analysis/methods
14.
Int J Biol Macromol ; 270(Pt 2): 132469, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38761901

ABSTRACT

Thermophilic proteins are important for academic research and industrial processes, and various computational methods have been developed to identify and screen them. However, their performance has been limited by the lack of high-quality labeled data and of efficient models for representing proteins. Here, we propose a novel sequence-based framework for predicting thermophilic proteins, called ThermoFinder. The results demonstrate that ThermoFinder outperforms previous state-of-the-art tools on two benchmark datasets, and feature ablation experiments confirm the effectiveness of our approach. Additionally, ThermoFinder exhibits exceptional performance and consistency across two newly constructed datasets, one of which was specifically constructed for the regression-based prediction of optimum temperature values directly from protein sequences. Feature importance analysis using Shapley additive explanations further validates the advantages of ThermoFinder. We believe that ThermoFinder will be a valuable and comprehensive framework for predicting thermophilic proteins, and we have made our model open source and available on GitHub at https://github.com/Luo-SynBioLab/ThermoFinder.


Subject(s)
Computational Biology , Software , Computational Biology/methods , Proteins/chemistry , Protein Databases , Protein Sequence Analysis/methods , Algorithms , Temperature
15.
BMC Bioinformatics ; 25(1): 176, 2024 May 04.
Article in English | MEDLINE | ID: mdl-38704533

ABSTRACT

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULTS: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful design choices. First, the model is designed around the locality of protein structures and triangle inequality distance constraints to improve prediction accuracy. Second, inference speed is improved by using additive attention and a lightweight design. In addition, generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which benefits many protein research tasks. It is worth mentioning that FreeProtMap could be used to scan all proteins discovered each year to find structurally similar proteins in a short time, because structure similarity calculation based on distance maps is much less time-consuming than algorithms based on 3D structures.


Subject(s)
Proteins , Proteins/chemistry , Computational Biology/methods , Protein Databases , Protein Conformation , Algorithms , Protein Sequence Analysis/methods , Computer Neural Networks
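
Illustrative note on entry 15: the abstract mentions distance constraints based on the triangle inequality. For any three residues i, j, k in a physically realizable structure, the distances must satisfy d(i, j) <= d(i, k) + d(k, j). The Python sketch below only checks a given distance map against that constraint; it says nothing about how FreeProtMap uses the constraint inside its network. The function name and the toy matrices are hypothetical.

```python
from typing import List, Sequence, Tuple


def triangle_violations(
    dist: Sequence[Sequence[float]], tol: float = 1e-9
) -> List[Tuple[int, int, int]]:
    """List triples (i, j, k) with dist[i][j] > dist[i][k] + dist[k][j].

    dist is assumed to be a square, symmetric matrix of pairwise
    residue distances; only pairs with i < j are examined.
    """
    n = len(dist)
    violations = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                # A direct distance longer than a detour via k is impossible.
                if dist[i][j] > dist[i][k] + dist[k][j] + tol:
                    violations.append((i, j, k))
    return violations


# Toy 3-residue distance maps (arbitrary units), for illustration only.
consistent_map = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
inconsistent_map = [[0.0, 1.0, 10.0], [1.0, 0.0, 1.0], [10.0, 1.0, 0.0]]

no_violations = triangle_violations(consistent_map)
found_violations = triangle_violations(inconsistent_map)
```

The check is cubic in the number of residues, so for long proteins a vectorized or sampled variant would be preferable; that tradeoff is outside the scope of this sketch.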
16.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context-dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning to leverage enormous amounts of unlabelled data and generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Protein Sequence Analysis/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Protein Databases
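
Illustrative note on entry 16: the abstract defines the E-score between two residues as the cosine similarity between their embedding vectors. The Python sketch below shows only that definition. The function names and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings would come from a protein language model such as ProtT5 and have many more dimensions, and this is not the authors' implementation (their code is linked in the abstract).

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def e_score_matrix(
    emb_a: Sequence[Sequence[float]], emb_b: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Pairwise scores: entry [i][j] compares residue i of sequence A
    with residue j of sequence B, given one embedding per residue."""
    return [[cosine_similarity(a, b) for b in emb_b] for a in emb_a]


# Toy per-residue embeddings (hypothetical values, 3 dimensions).
seq_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
seq_b = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 2.0]]
scores = e_score_matrix(seq_a, seq_b)
```

Because the embedding of a residue depends on its surrounding sequence, the same pair of amino acids can receive different scores at different positions. This is the context dependence that fixed matrices such as BLOSUM lack. A matrix like `scores` could stand in for substitution-matrix lookups in a standard dynamic-programming aligner.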
17.
Comput Biol Med ; 176: 108538, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38759585

ABSTRACT

The key properties of anticancer peptides (ACPs), including bioactivity, high efficacy, low toxicity, and lack of drug resistance, make them ideal candidates for cancer therapies. To explore the potential of ACPs and accelerate the development of cancer therapies, 53 artificial intelligence-supported computational predictors have been developed for classifying ACPs and non-ACPs, but only one predictor has been developed for annotating ACP functional types. These predictors extract amino acid distribution patterns to transform peptide sequences into statistical vectors, which are then fed to classifiers for discriminating peptide sequences and annotating peptide functional classes. Overall, these predictors fail to extract diverse types of amino acid distribution patterns from peptide sequences. This paper presents a unique CARE encoder that transforms peptide sequences into statistical vectors by extracting 4 different types of distribution patterns: correlation, distribution, composition, and transition. On a public benchmark dataset, the potential of the proposed encoder is explored under two evaluation settings, intrinsic and extrinsic. Extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder compared with 55 existing encoders. Furthermore, intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for the ACP and non-ACP classes. Across 8 public benchmark ACP and non-ACP classification datasets, the CAPTURE predictor, based on the proposed encoder and an AdaBoost classifier, outperforms existing predictors by an average of 1%, 4%, and 2% in accuracy, recall, and MCC, respectively. In a generalizability case study across 7 benchmark antimicrobial peptide classification datasets, CAPTURE surpasses existing predictors by 2% in average AU-ROC. The CAPTURE predictive pipeline, combined with the label powerset method, outperforms the state-of-the-art ACP functional type predictor by 5%, 5%, 5%, 6%, and 3% in average accuracy, subset accuracy, precision, recall, and F1, respectively. The CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.


Subject(s)
Antineoplastic Agents , Peptides , Humans , Antineoplastic Agents/therapeutic use , Antineoplastic Agents/chemistry , Peptides/chemistry , Machine Learning , Amino Acid Sequence , Computational Biology/methods , Neoplasms/drug therapy , Sequence Analysis, Protein/methods , Databases, Protein
18.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38648741

ABSTRACT

SUMMARY: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements. AVAILABILITY AND IMPLEMENTATION: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.


Subject(s)
Proteins , Sequence Alignment , Sequence Analysis, Protein , Software , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology/methods , Databases, Protein
19.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38652603

ABSTRACT

MOTIVATION: Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity. RESULTS: In this work, we fit a simple generative model, SAM, to sixty million human heavy and seventy million human light chains. We show that the probability of a sequence calculated by the model distinguishes human sequences from other species with the same or better accuracy on a variety of benchmark datasets containing >400 million sequences than any other model in the literature, outperforming large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences which is orders of magnitude faster than existing tools in the literature. AVAILABILITY AND IMPLEMENTATION: All tools developed in this study are available at https://github.com/Wang-lab-UCSD/AntPack.


Subject(s)
Antibodies , Humans , Antibodies/chemistry , Software , Sequence Analysis, Protein/methods , Computational Biology/methods , Immunoglobulin Heavy Chains/chemistry , Immunoglobulin Heavy Chains/immunology , Immunoglobulin Light Chains/chemistry , Immunoglobulin Light Chains/immunology , Algorithms
20.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers.
Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
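
The contrast between a per-character tokenizer and a learned one can be illustrated with a minimal byte-pair encoding (BPE) sketch. BPE is used here as one widely known subword algorithm; the abstract does not say which algorithms the authors evaluated, so treating it as representative is an assumption. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence is already a single token
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize seq by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy protein-like sequences (made up for illustration).
training = ["MKVLAAGMKVL", "GGMKVLAA", "AAGMKV"]
merges = learn_bpe(training, num_merges=5)
query = "MKVLAAG"
print(len(query), len(tokenize(query, merges)))  # characters vs. tokens
```

Because each merge only joins adjacent tokens, concatenating the output always reproduces the input sequence, and the token count can never exceed the character count.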


Subject(s)
Algorithms , Computational Biology , Deep Learning , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods