Results 1 - 20 of 346
1.
Cell ; 177(3): 737-750.e15, 2019 04 18.
Article in English | MEDLINE | ID: mdl-31002798

ABSTRACT

The proteasome mediates selective protein degradation and is dynamically regulated in response to proteotoxic challenges. SKN-1A/Nrf1, an endoplasmic reticulum (ER)-associated transcription factor that undergoes N-linked glycosylation, serves as a sensor of proteasome dysfunction and triggers compensatory upregulation of proteasome subunit genes. Here, we show that the PNG-1/NGLY1 peptide:N-glycanase edits the sequence of SKN-1A protein by converting particular N-glycosylated asparagine residues to aspartic acid. Genetically introducing aspartates at these N-glycosylation sites bypasses the requirement for PNG-1/NGLY1, showing that protein sequence editing rather than deglycosylation is key to SKN-1A function. This pathway is required to maintain sufficient proteasome expression and activity, and SKN-1A hyperactivation confers resistance to the proteotoxicity of human amyloid beta peptide. Deglycosylation-dependent protein sequence editing explains how ER-associated and cytosolic isoforms of SKN-1 perform distinct cytoprotective functions corresponding to those of mammalian Nrf1 and Nrf2. Thus, we uncover an unexpected mechanism by which N-linked glycosylation regulates protein function and proteostasis.


Subjects
Caenorhabditis elegans Proteins/metabolism , DNA-Binding Proteins/metabolism , Proteasome Endopeptidase Complex/metabolism , Transcription Factors/metabolism , Amino Acid Sequence , Animals , Asparagine/metabolism , Bortezomib/pharmacology , CRISPR-Cas Systems/genetics , Caenorhabditis elegans/metabolism , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , Endoplasmic Reticulum/metabolism , Gene Editing , Gene Expression Regulation/drug effects , Oxidative Stress , Proteasome Endopeptidase Complex/genetics , Protein Subunits/chemistry , Protein Subunits/genetics , Protein Subunits/metabolism , Sequence Alignment , Transcription Factors/chemistry , Transcription Factors/genetics
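
The sequence edit described in the abstract above (glycosylated asparagine converted to aspartic acid) can be illustrated with a toy string operation. This is only a sketch: the abstract does not say which SKN-1A residues are edited, so the function name, the default site selection, and the example sequence are hypothetical. The default uses the general N-X-S/T sequon motif (X not proline) as a stand-in for "N-glycosylated asparagine".

```python
import re

# General N-glycosylation sequon: N, then any residue except P, then S or T.
# The lookahead lets overlapping sequons be found. This motif is an assumption
# used for illustration; it is not taken from the abstract.
SEQUON = re.compile(r"N(?=[^P][ST])")


def edit_glycosylated_asparagines(seq, sites=None):
    """Return seq with N->D at the given 0-based sites.

    If sites is None, every asparagine that starts a sequon is edited.
    """
    if sites is None:
        sites = [m.start() for m in SEQUON.finditer(seq)]
    residues = list(seq)
    for i in sites:
        if residues[i] != "N":
            raise ValueError(f"position {i} is {residues[i]}, not N")
        residues[i] = "D"
    return "".join(residues)
```

Passing an explicit `sites` list models the genetic experiment in the abstract, where aspartates are introduced directly at chosen positions.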
2.
Immunity ; 56(7): 1681-1698.e13, 2023 07 11.
Article in English | MEDLINE | ID: mdl-37301199

ABSTRACT

CD4+ T cell responses are exquisitely antigen specific and directed toward peptide epitopes displayed by human leukocyte antigen class II (HLA-II) on antigen-presenting cells. Underrepresentation of diverse alleles in ligand databases and an incomplete understanding of factors affecting antigen presentation in vivo have limited progress in defining principles of peptide immunogenicity. Here, we employed monoallelic immunopeptidomics to identify 358,024 HLA-II binders, with a particular focus on HLA-DQ and HLA-DP. We uncovered peptide-binding patterns across a spectrum of binding affinities and enrichment of structural antigen features. These aspects underpinned the development of the context-aware predictor of T cell antigens (CAPTAn), a deep learning model that predicts peptide antigens based on their affinity to HLA-II and the full sequence of their source proteins. CAPTAn was instrumental in discovering prevalent T cell epitopes from bacteria in the human microbiome and a pan-variant epitope from SARS-CoV-2. Together, CAPTAn and associated datasets present a resource for antigen discovery and for unraveling the genetic associations of HLA alleles with immunopathologies.


Subjects
COVID-19 , Deep Learning , Humans , Captan , SARS-CoV-2 , HLA Antigens , Epitopes, T-Lymphocyte , Peptides
3.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38557677

ABSTRACT

Protein design is central to nearly all protein engineering problems, as it can enable the creation of proteins with new biological functions, such as improving the catalytic efficiency of enzymes. One key facet of protein design, fixed-backbone protein sequence design, seeks to design new sequences that will conform to a prescribed protein backbone structure. Nonetheless, existing sequence design methods present limitations, such as low sequence diversity and shortcomings in experimental validation of the designed functional proteins. These inadequacies obstruct the goal of functional protein design. To address these limitations, we first developed the Graphormer-based Protein Design (GPD) model. This model applies the Transformer to a graph-based representation of three-dimensional protein structures and adds Gaussian noise and random sequence masks to node features, thereby enhancing sequence recovery and diversity. The performance of the GPD model was significantly better than that of the state-of-the-art ProteinMPNN model on multiple independent tests, especially for sequence diversity. We employed GPD to design CalB hydrolase and generated nine artificially designed CalB proteins. The results show a 1.7-fold increase in catalytic activity compared to that of the wild-type CalB and strong substrate selectivity on p-nitrophenyl acetate with different carbon chain lengths (C2-C16). Thus, the GPD method could be used for the de novo design of industrial enzymes and protein drugs. The code was released at https://github.com/decodermu/GPD.


Subjects
Protein Engineering , Proteins , Proteins/chemistry , Amino Acid Sequence , Protein Engineering/methods
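
The abstract above mentions adding Gaussian noise and random masks to node features. A minimal sketch of that kind of perturbation follows. It is not taken from the GPD repository: the function name, the per-node masking scheme, and the default `noise_std` and `mask_rate` values are all assumptions made for illustration.

```python
import random


def perturb_node_features(features, noise_std=0.1, mask_rate=0.15, seed=0):
    """Return (perturbed_features, mask_flags) for a list of per-node vectors.

    Each node is either masked (its vector replaced by zeros) with
    probability mask_rate, or kept with Gaussian noise added to every entry.
    Both hyperparameters are hypothetical defaults.
    """
    rng = random.Random(seed)
    perturbed, masked = [], []
    for row in features:
        if rng.random() < mask_rate:
            perturbed.append([0.0] * len(row))  # hide this node's features
            masked.append(True)
        else:
            perturbed.append([x + rng.gauss(0.0, noise_std) for x in row])
            masked.append(False)
    return perturbed, masked
```

The input list is left unmodified, so the same clean features can be perturbed differently on each training pass by changing `seed`.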
4.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38851299

ABSTRACT

Protein-protein interactions (PPIs) are the basis of many important biological processes, with protein complexes being the key forms implementing these interactions. Understanding protein complexes and their functions is critical for elucidating mechanisms of life processes, disease diagnosis and treatment and drug development. However, experimental methods for identifying protein complexes have many limitations. Therefore, it is necessary to use computational methods to predict protein complexes. Protein sequences can indicate the structure and biological functions of proteins, while also determining their binding abilities with other proteins, influencing the formation of protein complexes. Integrating these characteristics to predict protein complexes is very promising, but currently there is no effective framework that can utilize both protein sequence and PPI network topology for complex prediction. To address this challenge, we have developed HyperGraphComplex, a method based on hypergraph variational autoencoder that can capture expressive features from protein sequences without feature engineering, while also considering topological properties in PPI networks, to predict protein complexes. Experimental results demonstrated that HyperGraphComplex achieves satisfactory predictive performance when compared with state-of-the-art methods. Further bioinformatics analysis shows that the predicted protein complexes have similar attributes to known ones. Moreover, case studies corroborated the remarkable predictive capability of our model in identifying protein complexes, including 3 that were not only experimentally validated by recent studies but also exhibited high-confidence structural predictions from AlphaFold-Multimer. We believe that the HyperGraphComplex algorithm and our provided proteome-wide high-confidence protein complex prediction dataset will help elucidate how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development. Source codes are available at https://github.com/LiDlab/HyperGraphComplex.


Subjects
Computational Biology , Protein Interaction Mapping , Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/metabolism , Proteins/chemistry , Algorithms , Protein Interaction Maps , Databases, Protein , Humans , Sequence Analysis, Protein/methods , Amino Acid Sequence
5.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600663

ABSTRACT

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether the representation of structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, they are further fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms the state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on the CATH 4.2 benchmark, respectively. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can accurately reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.


Subjects
Neural Networks, Computer , Proteins , Sequence Alignment , Amino Acid Sequence , Proteins/chemistry , Sequence Analysis, Protein/methods
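
The ultrafast shape recognition (USR) vectors mentioned in the abstract above can be sketched as follows. This follows the commonly described general USR scheme (distance distributions to four reference points, three moments each, 12 numbers in total) and is an assumption about the general technique, not SPDesign's actual code. The choice of moments and the function names are illustrative.

```python
import math


def _moments(values):
    """Mean, variance and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, var, math.copysign(abs(third) ** (1.0 / 3.0), third)]


def usr_descriptor(coords):
    """12-number shape descriptor for a list of (x, y, z) points."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))

    def dists(ref):
        return [math.dist(ref, c) for c in coords]

    d_ctd = dists(centroid)
    closest = coords[d_ctd.index(min(d_ctd))]    # point closest to centroid
    farthest = coords[d_ctd.index(max(d_ctd))]   # point farthest from centroid
    d_fct = dists(farthest)
    far_from_far = coords[d_fct.index(max(d_fct))]  # farthest from 'farthest'

    desc = []
    for ref in (centroid, closest, farthest, far_from_far):
        desc.extend(_moments(dists(ref)))
    return desc


def usr_similarity(a, b):
    """Score in (0, 1]; 1.0 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))
```

Because the descriptor depends only on internal distances, it is unchanged by translation and rotation, which is what makes comparing two structures a cheap vector operation rather than a superposition.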
6.
J Biol Chem ; 300(11): 107861, 2024 Oct 05.
Article in English | MEDLINE | ID: mdl-39374782

ABSTRACT

Loops in the axial channels of ClpAP and other AAA+ proteases bind a short peptide degron connected by a linker to the N- or C-terminal residue of a native protein to initiate degradation. ATP hydrolysis then powers pore-loop movements that translocate these segments through the channel until a native domain is pulled against the narrow channel entrance, creating an unfolding force. Substrate unfolding is thought to depend on strong contacts between pore loops and a subset of amino acids in the unstructured sequence directly preceding the folded domain. Here, we identify such contact sequences that promote grip for ClpAP and use ClpA structures to place these sequences within ClpA's two AAA+ rings. The positions and chemical nature of certain residues within an unstructured segment that are positioned to interact with the D2 ring have major positive effects on substrate unfolding, whereas segments located within the D1 ring have little consequence. Within the D2-bound segment, two short elements are critical for accelerating degradation; one is at the "top" of D2 and consists of at least two properly positioned nonslippery residues. In contrast, the second D2 element, which can be as short as one residue, is positioned to contact pore loops near the "bottom" of this ring. Comparison with similar studies for ClpXP reveals that positioning a well-gripped substrate sequence within the major unfoldase motor is more important than its proximity to the folded domain and that charged, polar, and hydrophobic residues all contribute favorable contacts to substrate grip.

7.
Mol Biol Evol ; 41(9)2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39213383

ABSTRACT

Determining the origins of novel genes and the mechanisms driving the emergence of new functions is challenging yet crucial for understanding evolutionary innovations. Recently evolved fish antifreeze proteins (AFPs) offer a unique opportunity to explore these processes, particularly the near-identical type I AFP (AFPI) found in four phylogenetically divergent fish taxa. This study tested the hypothesis of protein sequence convergence beyond functional convergence in three unrelated AFPI-bearing fish lineages. Through comprehensive comparative analyses of newly sequenced genomes of winter flounder and grubby sculpin, along with available high-quality genomes of cunner and 14 other related species, the study revealed that near-identical AFPI proteins originated from distinct genetic precursors in each lineage. Each lineage independently evolved a de novo coding region for the novel ice-binding protein while repurposing fragments from their respective ancestors into potential regulatory regions, representing partial de novo origination, a process that bridges de novo gene formation and the neofunctionalization of duplicated genes. The study supports existing models of new gene origination and introduces new ones: the innovation-amplification-divergence model, where novel changes precede gene duplication; the newly proposed duplication-degeneration-divergence model, which describes new functions arising from degenerated pseudogenes; and the duplication-degeneration-divergence gene fission model, where each new sibling gene differentially degenerates and renovates distinct functional domains from their parental gene. These findings highlight the diverse evolutionary pathways through which a novel functional gene with convergent sequences at the protein level can evolve across divergent species, advancing our understanding of the mechanistic intricacies in new gene formation.


Subjects
Antifreeze Proteins , Evolution, Molecular , Animals , Antifreeze Proteins/genetics , Fish Proteins/genetics , Phylogeny , Fishes/genetics , Flounder/genetics
8.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37429578

ABSTRACT

In the last few years, computational protein design has proven to be the most powerful tool for protein design and repacking tasks. In practice, these two tasks are strongly related but often treated separately. Besides, state-of-the-art deep-learning-based methods cannot provide interpretability from an energy perspective, affecting the accuracy of the design. Here we propose a new systematic approach, including both a posterior probability part and a joint probability part, to solve the two essential questions once and for all. This approach takes the physicochemical property of amino acids into consideration and uses the joint probability model to ensure the convergence between structure and amino acid type. Our results demonstrated that this method could generate feasible, high-confidence sequences with low-energy side-chain conformations. The designed sequences can fold into target structures with high confidence and maintain relatively stable biochemical properties. The side chain conformation has a significantly lower energy landscape without delegating to a rotamer library or performing the expensive conformational searches. Overall, we propose an end-to-end method that combines the advantages of both deep learning and energy-based methods. The design results of this model demonstrate high efficiency and precision, as well as a low energy state and good interpretability.


Subjects
Deep Learning , Models, Molecular , Proteins/chemistry , Amino Acid Sequence , Amino Acids/chemistry , Protein Conformation
9.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36403092

ABSTRACT

MOTIVATION: Biological experimental approaches to protein-protein interaction (PPI) site prediction are critical for understanding the mechanisms of biochemical processes but are time-consuming and laborious. With the development of Deep Learning (DL) techniques, the most popular Convolutional Neural Networks (CNN)-based methods have been proposed to address these problems. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in protein sequences. Current methods cannot efficiently explore the nature of Position Specific Scoring Matrix (PSSM), secondary structure and raw protein sequences by processing them all together. For PPI site prediction, how to effectively model the PPI context with attention to prediction remains an open problem. In addition, the long-distance dependencies of PPI features are important, which is very challenging for many CNN-based methods because CNNs inherently struggle to match auto-regressive models like Transformers in this respect. RESULTS: To effectively mine the properties of PPI features, a novel hybrid neural network named HN-PPISP is proposed, which integrates a Multi-layer Perceptron Mixer (MLP-Mixer) module for local feature extraction and a two-stage multi-branch module for global feature capture. The model combines the merits of Transformer, TextCNN and Bi-LSTM, offering a powerful alternative for PPI site prediction. On the one hand, this is the first application of an advanced Transformer (i.e. MLP-Mixer) with a hybrid network for sequence-based PPI prediction. On the other hand, unlike existing methods that treat global features altogether, the proposed two-stage multi-branch hybrid module first assigns different attention scores to the input features and then encodes the features through different branch modules. In the first stage, different improved attention modules are hybridized to extract features from the raw protein sequences, secondary structure and PSSM, respectively. In the second stage, a multi-branch network is designed to aggregate information from both branches in parallel. The two branches encode the features and extract dependencies through several operations such as TextCNN, Bi-LSTM and different activation functions. Experimental results on real-world public datasets show that our model consistently achieves state-of-the-art performance over seven remarkable baselines. AVAILABILITY: The source code of the HN-PPISP model is available at https://github.com/ylxu05/HN-PPISP.


Subjects
Neural Networks, Computer , Software , Amino Acid Sequence , Amino Acids , Protein Structure, Secondary
10.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37020337

ABSTRACT

Identification of potent peptides through model prediction can reduce benchwork in wet experiments. However, the conventional process of model building can be complex and time consuming due to challenges such as peptide representation, feature selection, model selection and hyperparameter tuning. Recently, advanced pretrained deep learning-based language models (LMs) have been released for protein sequence embedding and applied to structure and function prediction. Based on these developments, we have developed UniDL4BioPep, a universal deep-learning model architecture for transfer learning in bioactive peptide binary classification modeling. It can directly assist users in training a high-performance deep-learning model with a fixed architecture and achieve cutting-edge performance to meet the demand for efficient discovery of novel bioactive peptides. To the best of our knowledge, this is the first time that a pretrained biological language model is utilized for peptide embeddings and successfully predicts peptide bioactivities through large-scale evaluations of those peptide embeddings. The model was also validated through uniform manifold approximation and projection analysis. By combining the LM with a convolutional neural network, UniDL4BioPep achieved better performance than the respective state-of-the-art models for 15 out of 20 different bioactivity dataset prediction tasks. The accuracy, Matthews correlation coefficient and area under the curve were 0.7-7, 1.23-26.7 and 0.3-25.6% higher, respectively. A user-friendly web server of UniDL4BioPep for the tested bioactivities is established and freely accessible at https://nepc2pvmzy.us-east-1.awsapprunner.com. The source codes, datasets and templates of UniDL4BioPep for other bioactivity fitting and prediction tasks are available at https://github.com/dzjxzyd/UniDL4BioPep.


Subjects
Deep Learning , Neural Networks, Computer , Peptides/chemistry , Software , Amino Acid Sequence
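
Two of the metrics reported in the abstract above, accuracy and the Matthews correlation coefficient (MCC), can be computed from a binary confusion matrix as sketched below. The function names and the 0/1 label encoding are illustrative assumptions, not part of UniDL4BioPep.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when a row or column of the matrix is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC uses all four cells of the confusion matrix, so it stays informative on imbalanced bioactivity datasets where accuracy alone can look deceptively high.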
11.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37649385

ABSTRACT

Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. To this end, this study created a new pipeline that uses protein sequence to predict protein crystallization propensity in the protein material production stage, the purification stage and the crystal production stage. The pipeline proposes a new feature selection method that combines Chi-square ($\chi^{2}$) and recursive feature elimination, together with the 12 selected features, followed by linear discriminant analysis for dimensionality reduction. Finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. This new pipeline has been tested on three different datasets, and the accuracy rates are higher than those of the existing pipelines. In conclusion, our model provides a new solution for predicting multistage protein crystallization propensity, which is a big challenge in computational biology.


Subjects
Algorithms , Machine Learning , Crystallization , Amino Acid Sequence , Computational Biology
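
The Chi-square step of the feature selection described in the abstract above can be illustrated on toy data. This sketch covers only the $\chi^{2}$ ranking, restricted to binary features and binary labels for simplicity. The recursive feature elimination, discriminant analysis and support vector machine stages are not shown, and the function names and data layout are assumptions rather than the authors' pipeline.

```python
def chi2_binary(feature, labels):
    """Chi-square statistic of a 2x2 contingency table (feature vs. label)."""
    a = sum(1 for f, y in zip(feature, labels) if f and y)
    b = sum(1 for f, y in zip(feature, labels) if f and not y)
    c = sum(1 for f, y in zip(feature, labels) if not f and y)
    d = sum(1 for f, y in zip(feature, labels) if not f and not y)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A feature or label that never varies carries no information.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rank_features(columns, labels, top_k):
    """Names of the top_k features by chi-square score, best first."""
    scores = {name: chi2_binary(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A filter like this is cheap and model-free, which is why it is often run before a wrapper method such as recursive feature elimination that needs repeated model fits.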
12.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-36946414

ABSTRACT

In the era of constantly increasing amounts of available protein data, relevant and interpretable visualization becomes crucial, especially for tasks requiring human expertise. Poincaré disk projection has previously demonstrated its efficiency for visualization of biological data such as single-cell RNAseq data. Here, we develop a new method, PoincaréMSA, for visual representation of complex relationships between protein sequences based on Poincaré maps embedding. We demonstrate its efficiency and potential for visualization of protein family topology as well as evolutionary and functional annotation of uncharacterized sequences. PoincaréMSA is implemented in open source Python code with available interactive Google Colab notebooks as described at https://www.dsimb.inserm.fr/POINCARE_MSA.


Subjects
Proteins , Software , Humans , Amino Acid Sequence , Biological Evolution
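
The geometry underlying a Poincaré disk projection, as used in the abstract above, can be made concrete with the standard hyperbolic distance between two points inside the unit disk. The sketch below implements only that textbook formula. It is not code from PoincaréMSA, and the function name is illustrative.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    nu = sum(x * x for x in u)
    nv = sum(x * x for x in v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit disk")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))
```

Distances grow without bound as points approach the disk boundary, which is what lets tree-like hierarchies such as protein families spread out near the edge while their roots stay near the center.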
13.
Mol Cell Proteomics ; 22(8): 100591, 2023 08.
Article in English | MEDLINE | ID: mdl-37301379

ABSTRACT

The human proteome comprises all of the proteins produced by the sequences translated from the human genome with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.


Subjects
Proteome , Proteomics , Humans , Proteome/genetics , Databases, Protein , Amino Acid Sequence , Peptides
14.
Proc Natl Acad Sci U S A ; 119(24): e2203176119, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35648808

ABSTRACT

Bacterial signal transduction systems sense changes in the environment and transmit these signals to control cellular responses. The simplest one-component signal transduction systems include an input sensor domain and an output response domain encoded in a single protein chain. Alternatively, two-component signal transduction systems transmit signals by phosphorelay between input and output domains from separate proteins. The membrane-tethered periplasmic bile acid sensor that activates the Vibrio parahaemolyticus type III secretion system adopts an obligate heterodimer of two proteins encoded by partially overlapping VtrA and VtrC genes. This co-component signal transduction system binds bile acid using a lipocalin-like domain in VtrC and transmits the signal through the membrane to a cytoplasmic DNA-binding transcription factor in VtrA. Using the domain and operon organization of VtrA/VtrC, we identify a fast-evolving superfamily of co-component systems in enteric bacteria. Accurate machine learning-based fold predictions for the candidate co-components support their homology in the twilight zone of rapidly evolving sequences and provide mechanistic hypotheses about previously unrecognized lipid-sensing functions.


Subjects
Bacterial Proteins , Gene Expression Regulation, Bacterial , Genomic Islands , Membrane Proteins , Type III Secretion Systems , Vibrio parahaemolyticus , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Bile Acids and Salts/metabolism , DNA-Binding Proteins/metabolism , Membrane Proteins/genetics , Membrane Proteins/metabolism , Protein Multimerization , Signal Transduction , Transcription Factors/metabolism , Type III Secretion Systems/genetics , Vibrio parahaemolyticus/genetics , Vibrio parahaemolyticus/pathogenicity , Virulence/genetics
15.
Proteomics ; : e2400044, 2024 Jun 02.
Article in French | MEDLINE | ID: mdl-38824664

ABSTRACT

RNA-dependent liquid-liquid phase separation (LLPS) proteins play critical roles in cellular processes such as stress granule formation, DNA repair, RNA metabolism, germ cell development, and protein translation regulation. The abnormal behavior of these proteins is associated with various diseases, particularly neurodegenerative disorders like amyotrophic lateral sclerosis and frontotemporal dementia, making their identification crucial. However, conventional biochemistry-based methods for identifying these proteins are time-consuming and costly. Addressing this challenge, our study developed a robust computational model for their identification. We constructed a comprehensive dataset containing 137 RNA-dependent and 606 non-RNA-dependent LLPS protein sequences, which were then encoded using amino acid composition, composition of K-spaced amino acid pairs, Geary autocorrelation, and conjoined triad methods. Through a combination of correlation analysis, mutual information scoring, and incremental feature selection, we identified an optimal feature subset. This subset was used to train a random forest model, which achieved an accuracy of 90% when tested against an independent dataset. This study demonstrates the potential of computational methods as efficient alternatives for the identification of RNA-dependent LLPS proteins. To enhance the accessibility of the model, a user-centric web server has been established and can be accessed via the link: http://rpp.lin-group.cn.
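
Two of the sequence encodings named in the abstract above, amino acid composition and the composition of k-spaced amino acid pairs, can be sketched in a few lines. This follows the usual definitions of these descriptors. The function names are illustrative, and the abstract does not state which values of k the study used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in seq (20 features)."""
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def cksaap(seq, k):
    """Frequencies of residue pairs separated by exactly k residues (400 features).

    k = 0 gives adjacent pairs (dipeptide composition).
    """
    n_windows = len(seq) - k - 1
    if n_windows <= 0:
        raise ValueError("sequence too short for this gap size")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(n_windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: c / n_windows for pair, c in counts.items()}
```

Concatenating the vectors for several values of k gives a fixed-length feature vector regardless of sequence length, which is what a random forest or similar classifier requires.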

16.
BMC Bioinformatics ; 25(1): 85, 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-38413857

ABSTRACT

PURPOSE: Despite much progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well to score alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model. METHODS: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances. RESULTS: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvement in alignment quality for pairs of sequences with different levels of similarity (over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods. CONCLUSION: Our results suggest that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.


Subjects
Boronic Acids , Proteins , Proteins/chemistry , Amino Acid Sequence , Sequence Alignment , Algorithms
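
The idea in the abstract above, Smith-Waterman dynamic programming with the substitution score replaced by embedding similarity, can be sketched as follows. This is a minimal illustration and not PEbA's implementation: cosine similarity as the match score, a linear gap penalty, and the default `gap_penalty` value are all assumptions, and the embeddings are passed in as plain lists of floats rather than coming from a protein language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_local_align(emb_a, emb_b, gap_penalty=0.5):
    """Local alignment of two per-residue embedding lists.

    Returns (best_score, pairs), where pairs are aligned 0-based
    (index_in_a, index_in_b) positions.
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = H[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            H[i][j] = max(0.0, match,
                          H[i - 1][j] - gap_penalty,
                          H[i][j - 1] - gap_penalty)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        match = H[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
        if H[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap_penalty:
            i -= 1  # gap in sequence b
        else:
            j -= 1  # gap in sequence a
    pairs.reverse()
    return best, pairs
```

The recurrence is the classic one. Only the scoring function changes, so any existing optimization of Smith-Waterman applies once the pairwise similarity matrix has been computed.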
17.
BMC Bioinformatics ; 25(1): 342, 2024 Nov 02.
Article in English | MEDLINE | ID: mdl-39488701

ABSTRACT

BACKGROUND: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations. RESULTS: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide efficient schemes for both matrix tiling and sequence database partitioning, and exploit next generation floating point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves performance improvements of over an order of magnitude compared with previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt. CONCLUSION: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4.


Subjects
Algorithms , Databases, Protein , Software , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Computer Graphics , Proteins/chemistry , Computational Biology/methods
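
The throughput unit in the abstract above, cell updates per second (CUPS), counts how many dynamic-programming cells are filled per second: one cell per (query residue, database residue) pair. The arithmetic can be sketched as below. The helper names are illustrative, and the database size in the usage note is a made-up round number, not a figure from the paper.

```python
def cell_updates(query_len, db_residues):
    """Dynamic-programming cells filled when one query scans a whole database."""
    return query_len * db_residues


def gcups(query_len, db_residues, seconds):
    """Measured throughput in giga cell updates per second."""
    return cell_updates(query_len, db_residues) / seconds / 1e9


def seconds_at(query_len, db_residues, tcups):
    """Estimated scan time at a given throughput in tera cell updates per second."""
    return cell_updates(query_len, db_residues) / (tcups * 1e12)
```

For example, under the hypothetical assumption of a database holding 2 × 10^11 residues, a 1,000-residue query needs 2 × 10^14 cell updates, which at the reported 5.71 TCUPS corresponds to roughly 35 seconds.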
18.
J Proteome Res ; 2024 Oct 25.
Article in English | MEDLINE | ID: mdl-39449618

ABSTRACT

In metaproteomics studies, constructing a reference protein sequence database that is both comprehensive and not overly large is critical for the peptide identification step. Therefore, the availability of well-curated reference databases and tools for custom database construction is essential to enhance the performance of metaproteomics analyses. In this review, we first provide an overview of metaproteomics by presenting a concise historical background, outlining a typical experimental and bioinformatics workflow, and emphasizing the crucial step of constructing a protein sequence database for metaproteomics. We then delve into the current tools available for building such databases, highlighting their individual approaches, utility, advantages, and limitations. Next, we examine existing protein sequence databases, detailing their scope and relevance in metaproteomics research. Then, we provide practical recommendations for constructing protein sequence databases for metaproteomics, along with an overview of the current challenges in this area. We conclude with a discussion of anticipated advancements, emerging trends, and future directions in the construction of protein sequence databases for metaproteomics.

19.
J Biol Chem ; 299(1): 102801, 2023 01.
Article in English | MEDLINE | ID: mdl-36528065

ABSTRACT

Protein phase separation is thought to be a primary driving force for the formation of membrane-less organelles, which control a wide range of biological functions from stress response to ribosome biogenesis. Among phase-separating (PS) proteins, many have intrinsically disordered regions (IDRs) that are needed for phase separation to occur. Accurate identification of IDRs that drive phase separation is important for testing the underlying mechanisms of phase separation, identifying biological processes that rely on phase separation, and designing sequences that modulate phase separation. To identify IDRs that drive phase separation, we first curated datasets of folded, intrinsically disordered (ID), and PS ID sequences. We then used these sequence sets to examine how broadly existing amino acid property scales can be used to distinguish between the three classes of protein regions. We found that there are robust property differences between the classes and, consequently, that numerous combinations of amino acid property scales can be used to make robust predictions of protein phase separation. This result indicates that multiple, redundant mechanisms contribute to the formation of phase-separated droplets from IDRs. The top-performing scales were used to further optimize our previously developed predictor of PS IDRs, ParSe. We then modified ParSe to account for interactions between amino acids and obtained reasonable predictive power for mutations that have been designed to test the role of amino acid interactions in driving protein phase separation. Collectively, our findings provide further insight into the classification of IDRs and the elements involved in protein phase separation.
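
As a rough illustration of what applying an amino acid property scale to a sequence involves, the sketch below averages a per-residue scale over a sliding window. It is not ParSe and does not reproduce the authors' scales or classifier: the choice of the Kyte-Doolittle hydropathy scale, the function name `window_means`, and the window length are all assumptions made for the example.

```python
# Kyte-Doolittle hydropathy values, used here only as an example of a property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(sequence, scale, window=5):
    """Mean scale value for every full window along the sequence.

    Returns an empty list when the sequence is shorter than the window.
    Raises KeyError for residues missing from the scale (e.g. 'X').
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [scale[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = window_means("MKVLAAGIVGLLLAQSE", KYTE_DOOLITTLE, window=5)
print(len(profile))
```

A classifier of the kind described in the abstract would combine several such profiles; how they are combined and thresholded is specific to the published method and is not shown here.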


Subjects
Intrinsically Disordered Proteins , Intrinsically Disordered Proteins/chemistry , Protein Domains , Amino Acids
20.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35914952

ABSTRACT

Low complexity regions are fragments of protein sequences composed of only a few types of amino acids. These regions frequently occur in proteins and can play an important role in their functions. However, scientists are mainly focused on regions characterized by high diversity of amino acid composition. Similarity between regions of protein sequences frequently reflects functional similarity between them. In this article, we discuss strengths and weaknesses of the similarity analysis of low complexity regions using BLAST, HHblits and CD-HIT. These methods are considered to be the gold standard in protein similarity analysis and were designed for comparison of high complexity regions. However, we lack specialized methods that could be used to compare the similarity of low complexity regions. Therefore, we investigated the existing methods in order to understand how they can be applied to compare such regions. Our results are supported by an exploratory study and a discussion of the amino acid composition and biological roles of selected examples. We show that existing methods need improvements to efficiently search for similar low complexity regions. We suggest features that have to be re-designed specifically for comparing low complexity regions: scoring matrix, multiple sequence alignment, e-value, local alignment and clustering based on a set of representative sequences. Results of this analysis can either be used to improve existing methods or to create new methods for the similarity analysis of low complexity regions.
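
The definition in the first sentence (regions built from only a few types of amino acids) can be made concrete with a compositional measure. The sketch below flags windows whose Shannon entropy falls at or below a cutoff. It is a generic illustration and not a method evaluated in the article: the function names, the window length of 12, and the 2.2-bit threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Start positions (0-based) of windows with entropy at or below the threshold."""
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) <= threshold
    ]


print(low_complexity_windows("MKTAYIAKQRQQQQQQQQQQQQSGDE"))
```

A homopolymer window has an entropy of 0 bits, while a window of 12 distinct residues reaches log2(12), roughly 3.58 bits, so lower values correspond to less diverse composition.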


Subjects
Amino Acids , Proteins , Algorithms , Amino Acid Sequence , Amino Acids/genetics , Cluster Analysis , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Protein Sequence Analysis/methods