Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 4.893
Filtrar
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-38695119

RESUMO

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicate that the alignments produced using the new $E$-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various $E$-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Assuntos
Algoritmos , Biologia Computacional , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Biologia Computacional/métodos , Software , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Proteínas/química , Proteínas/genética , Aprendizado Profundo , Bases de Dados de Proteínas
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-38701416

RESUMO

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein in three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Assuntos
Algoritmos , Biologia Computacional , Redes Neurais de Computação , Estrutura Secundária de Proteína , Proteínas , Proteínas/química , Proteínas/metabolismo , Proteínas/genética , Biologia Computacional/métodos , Bases de Dados de Proteínas , Ontologia Genética , Análise de Sequência de Proteína/métodos , Software
3.
BMC Bioinformatics ; 25(1): 176, 2024 May 04.
Artigo em Inglês | MEDLINE | ID: mdl-38704533

RESUMO

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful designs. Firstly, it is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and lightweight design. Besides, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for lots of protein research works. It is worth mentioning that we could scan all proteins discovered each year based on FreeProtMap to find structurally similar proteins in a short time because the fact that the structure similarity calculation method based on distance maps is much less time-consuming than algorithms based on 3D structures.


Assuntos
Proteínas , Proteínas/química , Biologia Computacional/métodos , Bases de Dados de Proteínas , Conformação Proteica , Algoritmos , Análise de Sequência de Proteína/métodos , Redes Neurais de Computação
4.
Int J Biol Macromol ; 267(Pt 1): 131311, 2024 May.
Artigo em Inglês | MEDLINE | ID: mdl-38599417

RESUMO

In the rapidly evolving field of computational biology, accurate prediction of protein secondary structures is crucial for understanding protein functions, facilitating drug discovery, and advancing disease diagnostics. In this paper, we propose MFTrans, a deep learning-based multi-feature fusion network aimed at enhancing the precision and efficiency of Protein Secondary Structure Prediction (PSSP). This model employs a Multiple Sequence Alignment (MSA) Transformer in combination with a multi-view deep learning architecture to effectively capture both global and local features of protein sequences. MFTrans integrates diverse features generated by protein sequences, including MSA, sequence information, evolutionary information, and hidden state information, using a multi-feature fusion strategy. The MSA Transformer is utilized to interleave row and column attention across the input MSA, while a Transformer encoder and decoder are introduced to enhance the extracted high-level features. A hybrid network architecture, combining a convolutional neural network with a bidirectional Gated Recurrent Unit (BiGRU) network, is used to further extract high-level features after feature fusion. In independent tests, our experimental results show that MFTrans has superior generalization ability, outperforming other state-of-the-art PSSP models by 3 % on average on public benchmarks including CASP12, CASP13, CASP14, TEST2016, TEST2018, and CB513. Case studies further highlight its advanced performance in predicting mutation sites. MFTrans contributes significantly to the protein science field, opening new avenues for drug discovery, disease diagnosis, and protein.


Assuntos
Biologia Computacional , Estrutura Secundária de Proteína , Proteínas , Proteínas/química , Biologia Computacional/métodos , Aprendizado Profundo , Redes Neurais de Computação , Algoritmos , Alinhamento de Sequência , Análise de Sequência de Proteína/métodos
5.
Bioinformatics ; 40(5)2024 May 02.
Artigo em Inglês | MEDLINE | ID: mdl-38652603

RESUMO

MOTIVATION: Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity. RESULTS: In this work, we fit a simple generative model, SAM, to sixty million human heavy and seventy million human light chains. We show that the probability of a sequence calculated by the model distinguishes human sequences from other species with the same or better accuracy on a variety of benchmark datasets containing >400 million sequences than any other model in the literature, outperforming large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences which is orders of magnitude faster than existing tools in the literature. AVAILABILITY AND IMPLEMENTATION: All tools developed in this study are available at https://github.com/Wang-lab-UCSD/AntPack.


Assuntos
Anticorpos , Humanos , Anticorpos/química , Software , Análise de Sequência de Proteína/métodos , Biologia Computacional/métodos , Cadeias Pesadas de Imunoglobulinas/química , Cadeias Pesadas de Imunoglobulinas/imunologia , Cadeias Leves de Imunoglobulina/química , Cadeias Leves de Imunoglobulina/imunologia , Algoritmos
6.
Bioinformatics ; 40(4)2024 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-38608190

RESUMO

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.


Assuntos
Algoritmos , Biologia Computacional , Aprendizado Profundo , Processamento de Linguagem Natural , Biologia Computacional/métodos , Proteínas/química , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos
7.
Comput Biol Med ; 174: 108408, 2024 May.
Artigo em Inglês | MEDLINE | ID: mdl-38636332

RESUMO

Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs derived from tumor cells, are presented to immune cells (T cells) through major histocompatibility complex (MHC), via the recognition of specific portions of their structure known as epitopes. More specifically, MHC class I introduces TTCAs to T-cell receptors (TCR) which are located on the surface of CD8+ T cells. However, TTCA sequences are varied and lead to struggles in vaccine design. Recently, Machine learning (ML) models have been developed to predict TTCA sequences which could aid in fast and correct TTCA identification. During the construction of the TTCA predictor, the peptide encoding strategy is an important step. Previous studies have used biological descriptors for encoding TTCA sequences. However, there have been no studies that use natural language processing (NLP), a potential approach for this purpose. As sentences have their own words with diverse properties, biological sequences also hold unique characteristics that reflect evolutionary information, physicochemical values, and structural information. We hypothesized that NLP methods would benefit the prediction of TTCA. To develop a new identifying TTCA model, we first constructed a based model with widely used ML algorithms and extracted features from biological descriptors. Then, to improve our model performance, we added extracted features from biological language models (BLMs) based on NLP methods. Besides, we conducted feature selection by using Chi-square and Pearson Correlation Coefficient techniques. Then, SMOTE, Up-sampling, and Near-Miss were used to treat unbalanced data. Finally, we optimized Sa-TTCA by the SVM algorithm to the four most effective feature groups. The best performance of Sa-TTCA showed a competitive balanced accuracy of 87.5% on a training set, and 72.0% on an independent testing set. Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.


Assuntos
Antígenos de Neoplasias , Processamento de Linguagem Natural , Máquina de Vetores de Suporte , Humanos , Antígenos de Neoplasias/imunologia , Antígenos de Neoplasias/química , Antígenos de Neoplasias/genética , Neoplasias/imunologia , Análise de Sequência de Proteína/métodos , Biologia Computacional/métodos
8.
Brief Bioinform ; 25(3)2024 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-38600663

RESUMO

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in the protein structure prediction, we explored whether the representation of structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, they are further fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms the state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on CATH 4.2 benchmark, respectively. Encouraging results also have been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can well reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.


Assuntos
Redes Neurais de Computação , Proteínas , Alinhamento de Sequência , Sequência de Aminoácidos , Proteínas/química , Análise de Sequência de Proteína/métodos
9.
Bioinformatics ; 40(5)2024 May 02.
Artigo em Inglês | MEDLINE | ID: mdl-38662570

RESUMO

MOTIVATION: Proteins, the molecular workhorses of biological systems, execute a multitude of critical functions dictated by their precise three-dimensional structures. In a complex and dynamic cellular environment, proteins can undergo misfolding, leading to the formation of aggregates that take up various forms, including amorphous and ordered aggregation in the shape of amyloid fibrils. This phenomenon is closely linked to a spectrum of widespread debilitating pathologies, such as Alzheimer's disease, Parkinson's disease, type-II diabetes, and several other proteinopathies, but also hampers the engineering of soluble agents, as in the case of antibody development. As such, the accurate prediction of aggregation propensity within protein sequences has become pivotal due to profound implications in understanding disease mechanisms, as well as in improving biotechnological and therapeutic applications. RESULTS: We previously developed Cordax, a structure-based predictor that utilizes logistic regression to detect aggregation motifs in protein sequences based on their structural complementarity to the amyloid cross-beta architecture. Here, we present a dedicated web server interface for Cordax. This online platform combines several features including detailed scoring of sequence aggregation propensity, as well as 3D visualization with several customization options for topology models of the structural cores formed by predicted aggregation motifs. In addition, information is provided on experimentally determined aggregation-prone regions that exhibit sequence similarity to predicted motifs, scores, and links to other predictor outputs, as well as simultaneous predictions of relevant sequence propensities, such as solubility, hydrophobicity, and secondary structure propensity. AVAILABILITY AND IMPLEMENTATION: The Cordax webserver is freely accessible at https://cordax.switchlab.org/.


Assuntos
Software , Agregados Proteicos , Internet , Amiloide/química , Proteínas/química , Motivos de Aminoácidos , Humanos , Conformação Proteica , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos
10.
Methods Mol Biol ; 2758: 61-75, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-38549008

RESUMO

Natural peptides secreted under stress conditions by many organisms are bioactive molecules with a broad spectrum of activities. These molecules could become potential models for novel pharmaceuticals, to which bacteria, according to modern scientific concepts, do not have and cannot develop resistance. Taking this into consideration, it is necessary to clarify the amino acid sequences of such peptides. Here we describe our approach to de novo sequencing of amphibians' skin secretion peptides.


Assuntos
Análise de Sequência de Proteína , Espectrometria de Massas em Tandem , Espectrometria de Massas em Tandem/métodos , Análise de Sequência de Proteína/métodos , Peptídeos/química , Sequência de Aminoácidos
11.
PLoS Comput Biol ; 20(2): e1011892, 2024 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-38416757

RESUMO

In proteomics, a crucial aspect is to identify peptide sequences. De novo sequencing methods have been widely employed to identify peptide sequences, and numerous tools have been proposed over the past two decades. Recently, deep learning approaches have been introduced for de novo sequencing. Previous methods focused on encoding tandem mass spectra and predicting peptide sequences from the first amino acid onwards. However, when predicting peptides using tandem mass spectra, the peptide sequence can be predicted not only from the first amino acid but also from the last amino acid due to the coexistence of b-ion (or a- or c-ion) and y-ion (or x- or z-ion) fragments in the tandem mass spectra. Therefore, it is essential to predict peptide sequences bidirectionally. Our approach, called NovoB, utilizes a Transformer model to predict peptide sequences bidirectionally, starting with both the first and last amino acids. In comparison to Casanovo, our method achieved an improvement of the average peptide-level accuracy rate of approximately 9.8% across all species.


Assuntos
Algoritmos , Análise de Sequência de Proteína , Análise de Sequência de Proteína/métodos , Peptídeos/química , Sequência de Aminoácidos , Aminoácidos
12.
Brief Bioinform ; 25(2)2024 Jan 22.
Artigo em Inglês | MEDLINE | ID: mdl-38340092

RESUMO

De novo peptide sequencing is a promising approach for novel peptide discovery, highlighting the performance improvements for the state-of-the-art models. The quality of mass spectra often varies due to unexpected missing of certain ions, presenting a significant challenge in de novo peptide sequencing. Here, we use a novel concept of complementary spectra to enhance ion information of the experimental spectrum and demonstrate it through conceptual and practical analyses. Afterward, we design suitable encoders to encode the experimental spectrum and the corresponding complementary spectrum and propose a de novo sequencing model $\pi$-HelixNovo based on the Transformer architecture. We first demonstrated that $\pi$-HelixNovo outperforms other state-of-the-art models using a series of comparative experiments. Then, we utilized $\pi$-HelixNovo to de novo gut metaproteome peptides for the first time. The results show $\pi$-HelixNovo increases the identification coverage and accuracy of gut metaproteome and enhances the taxonomic resolution of gut metaproteome. We finally trained a powerful $\pi$-HelixNovo utilizing a larger training dataset, and as expected, $\pi$-HelixNovo achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences. We also use the powerful $\pi$-HelixNovo to identify antibody peptides and multi-enzyme cleavage peptides, and $\pi$-HelixNovo is highly robust in these applications. Our results demonstrate the effectivity of the complementary spectrum and take a significant step forward in de novo peptide sequencing.


Assuntos
Análise de Sequência de Proteína , Espectrometria de Massas em Tandem , Espectrometria de Massas em Tandem/métodos , Análise de Sequência de Proteína/métodos , Peptídeos , Sequência de Aminoácidos , Anticorpos , Algoritmos
13.
Artigo em Inglês | MEDLINE | ID: mdl-38316555

RESUMO

The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.


Assuntos
Biologia Computacional , Proteínas , Biologia Computacional/métodos , Proteínas/química , Alinhamento de Sequência , Conformação Proteica , Software , Algoritmos , Análise de Sequência de Proteína/métodos
14.
Comput Biol Med ; 170: 107956, 2024 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38217977

RESUMO

The classification and prediction of T-cell receptors (TCRs) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.


Assuntos
Biologia Computacional , Proteínas , Biologia Computacional/métodos , Proteínas/química , Sequência de Aminoácidos , Aminoácidos/química , Aminoácidos/metabolismo , Algoritmos , Máquina de Vetores de Suporte , Análise de Sequência de Proteína/métodos , Bases de Dados de Proteínas
15.
Nat Commun ; 15(1): 151, 2024 Jan 02.
Artigo em Inglês | MEDLINE | ID: mdl-38167372

RESUMO

Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.


Assuntos
Aprendizado Profundo , Algoritmos , Análise de Sequência de Proteína/métodos , Peptídeos/química , Sequência de Aminoácidos
16.
J Mol Biol ; 436(2): 168393, 2024 01 15.
Artigo em Inglês | MEDLINE | ID: mdl-38065275

RESUMO

Many proteins contain cleavable signal or transit peptides that direct them to their final subcellular locations. Such peptides are usually predicted from sequence alone using methods such as TargetP 2.0 and SignalP 6.0. While these methods are usually very accurate, we show here that an analysis of a protein's AlphaFold2-predicted structure can often be used to identify false positive predictions. We start by showing that when given a protein's full-length sequence, AlphaFold2 builds experimentally annotated signal and transit peptides in orientations that point away from the main body of the protein. This indicates that AlphaFold2 correctly identifies that a signal is not destined to be part of the mature protein's structure and suggests, as a corollary, that predicted signals that AlphaFold2 folds with high confidence into the main body of the protein are likely to be false positives. To explore this idea, we analyzed predicted signal peptides in 48 proteomes made available in DeepMind's AlphaFold2 database (https://alphafold.ebi.ac.uk). Applying TargetP 2.0 and SignalP 6.0 to the 561,562 proteins in the database results in 95,236 being predicted to contain a cleavable signal or transit peptide. In 95.1% of these cases, the AlphaFold2 structure of the full-length protein is fully consistent with the prediction of TargetP 2.0 or SignalP 6.0. In the remaining 4.9% of cases where the AlphaFold2 structure does not appear consistent with the prediction, the signal is often only predicted with low confidence. The potential false positives identified here may be useful for training even more accurate signal prediction methods.


Assuntos
Sinais Direcionadores de Proteínas , Análise de Sequência de Proteína , Algoritmos , Sequência de Aminoácidos , Proteoma/metabolismo , Análise de Sequência de Proteína/métodos
17.
Proteomics ; 24(6): e2300231, 2024 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-37525341

RESUMO

Non-invasive diagnostics and therapies are crucial to prevent patients from undergoing painful procedures. Exosomal proteins can serve as important biomarkers for such advancements. In this study, we attempted to build a model to predict exosomal proteins. All models are trained, tested, and evaluated on a non-redundant dataset comprising 2831 exosomal and 2831 non-exosomal proteins, where no two proteins have more than 40% similarity. Initially, the standard similarity-based method Basic Local Alignment Search Tool (BLAST) was used to predict exosomal proteins, which failed due to low-level similarity in the dataset. To overcome this challenge, machine learning (ML) based models were developed using compositional and evolutionary features of proteins achieving an area under the receiver operating characteristics (AUROC) of 0.73. Our analysis also indicated that exosomal proteins have a variety of sequence-based motifs which can be used to predict exosomal proteins. Hence, we developed a hybrid method combining motif-based and ML-based approaches for predicting exosomal proteins, achieving a maximum AUROC of 0.85 and MCC of 0.56 on an independent dataset. This hybrid model performs better than presently available methods when assessed on an independent dataset. A web server and a standalone software ExoProPred (https://webs.iiitd.edu.in/raghava/exopropred/) have been created to help scientists predict and discover exosomal proteins and find functional motifs present in them.


Assuntos
Algoritmo Florestas Aleatórias , Análise de Sequência de Proteína , Humanos , Sequência de Aminoácidos , Análise de Sequência de Proteína/métodos , Proteínas/metabolismo , Software
18.
Biochim Biophys Acta Proteins Proteom ; 1872(2): 140985, 2024 02 01.
Artigo em Inglês | MEDLINE | ID: mdl-38122964

RESUMO

MOTIVATION: The growth of unannotated proteins in UniProt increases at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY: We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families that is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS: In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION: We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS: FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.


Assuntos
Proteínas , Análise de Sequência de Proteína , Bases de Dados de Proteínas , Proteínas/metabolismo , Anotação de Sequência Molecular , Análise de Sequência de Proteína/métodos , Ontologia Genética
19.
Nat Commun ; 14(1): 7974, 2023 Dec 02.
Artigo em Inglês | MEDLINE | ID: mdl-38042873

RESUMO

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.


Assuntos
Análise de Sequência de Proteína , Espectrometria de Massas em Tandem , Humanos , Espectrometria de Massas em Tandem/métodos , Análise de Sequência de Proteína/métodos , Peptídeos , Sequência de Aminoácidos , Redes Neurais de Computação , Algoritmos , Biblioteca de Peptídeos
20.
Brief Bioinform ; 24(6)2023 09 22.
Artigo em Inglês | MEDLINE | ID: mdl-37833837

RESUMO

Protein remote homology detection is essential for structure prediction, function prediction, disease mechanism understanding, etc. The remote homology relationship depends on multiple protein properties, such as structural information and local sequence patterns. Previous studies have shown the challenges for predicting remote homology relationship by protein features at sequence level (e.g. position-specific score matrix). Protein motifs have been used in structure and function analysis due to their unique sequence patterns and implied structural information. Therefore, designing a usable architecture to fuse multiple protein properties based on motifs is urgently needed to improve protein remote homology detection performance. To make full use of the characteristics of motifs, we employed the language model called the protein cubic language model (PCLM). It combines multiple properties by constructing a motif-based neural network. Based on the PCLM, we proposed a predictor called PreHom-PCLM by extracting and fusing multiple motif features for protein remote homology detection. PreHom-PCLM outperforms the other state-of-the-art methods on the test set and independent test set. Experimental results further prove the effectiveness of multiple features fused by PreHom-PCLM for remote homology detection. Furthermore, the protein features derived from the PreHom-PCLM show strong discriminative power for proteins from different structural classes in the high-dimensional space. Availability and Implementation: http://bliulab.net/PreHom-PCLM.


Assuntos
Algoritmos , Proteínas , Proteínas/química , Redes Neurais de Computação , Motivos de Aminoácidos , Idioma , Análise de Sequência de Proteína/métodos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA