Results 1 - 20 of 14,187
1.
Cell ; 172(1-2): 358-372.e23, 2018 01 11.
Article in English | MEDLINE | ID: mdl-29307493

ABSTRACT

Metabolite-protein interactions control a variety of cellular processes, thereby playing a major role in maintaining cellular homeostasis. Metabolites comprise the largest fraction of molecules in cells, but our knowledge of the metabolite-protein interactome lags behind our understanding of protein-protein or protein-DNA interactomes. Here, we present a chemoproteomic workflow for the systematic identification of metabolite-protein interactions directly in their native environment. The approach identified a network of known and novel interactions and binding sites in Escherichia coli, and we demonstrated the functional relevance of a number of newly identified interactions. Our data enabled identification of new enzyme-substrate relationships and cases of metabolite-induced remodeling of protein complexes. Our metabolite-protein interactome consists of 1,678 interactions and 7,345 putative binding sites. Our data reveal functional and structural principles of chemical communication, shed light on the prevalence and mechanisms of enzyme promiscuity, and enable extraction of quantitative parameters of metabolite binding on a proteome-wide scale.


Subject(s)
Metabolome , Proteome/metabolism , Proteomics/methods , Signal Transduction , Software , Allosteric Regulation , Binding Sites , Escherichia coli , Metabolomics/methods , Protein Binding , Protein Interaction Maps , Proteome/chemistry , Saccharomyces cerevisiae , Sequence Analysis, Protein/methods
2.
Cell ; 168(4): 600-612, 2017 02 09.
Article in English | MEDLINE | ID: mdl-28187283

ABSTRACT

Cancer immunogenomics originally was framed by research supporting the hypothesis that cancer mutations generated novel peptides seen as "non-self" by the immune system. The search for these "neoantigens" has been facilitated by the combination of new sequencing technologies, specialized computational analyses, and HLA binding predictions that evaluate somatic alterations in a cancer genome and interpret their ability to produce an immune-stimulatory peptide. The resulting information can characterize a tumor's neoantigen load, its cadre of infiltrating immune cell types, the T or B cell receptor repertoire, and direct the design of a personalized therapeutic.


Subject(s)
Antigens, Neoplasm/immunology , Neoplasms/genetics , Neoplasms/immunology , Animals , Cancer Vaccines/immunology , Genome, Human , HLA Antigens/immunology , Humans , Immunogenetics , Lymphocytes, Tumor-Infiltrating/immunology , Mutation , Sequence Analysis, Protein
3.
Nat Rev Mol Cell Biol ; 20(11): 681-697, 2019 11.
Article in English | MEDLINE | ID: mdl-31417196

ABSTRACT

The prediction of protein three-dimensional structure from amino acid sequence has been a grand challenge problem in computational biophysics for decades, owing to its intrinsic scientific interest and also to the many potential applications for robust protein structure prediction algorithms, from genome interpretation to protein function prediction. More recently, the inverse problem - designing an amino acid sequence that will fold into a specified three-dimensional structure - has attracted growing attention as a potential route to the rational engineering of proteins with functions useful in biotechnology and medicine. Methods for the prediction and design of protein structures have advanced dramatically in the past decade. Increases in computing power and the rapid growth in protein sequence and structure databases have fuelled the development of new data-intensive and computationally demanding approaches for structure prediction. New algorithms for designing protein folds and protein-protein interfaces have been used to engineer novel high-order assemblies and to design from scratch fluorescent proteins with novel or enhanced properties, as well as signalling proteins with therapeutic potential. In this Review, we describe current approaches for protein structure prediction and design and highlight a selection of the successful applications they have enabled.


Subject(s)
Algorithms , Databases, Protein , Models, Molecular , Proteins/chemistry , Sequence Analysis, Protein , Animals , Humans , Protein Conformation , Proteins/genetics , Proteins/metabolism
4.
Proc Natl Acad Sci U S A ; 121(27): e2311887121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913900

ABSTRACT

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves competitive performance with using orthology-based pairing.


Subject(s)
Proteins , Sequence Alignment , Sequence Alignment/methods , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Algorithms , Sequence Analysis, Protein/methods , Computational Biology/methods , Databases, Protein
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38340092

ABSTRACT

De novo peptide sequencing is a promising approach for novel peptide discovery, which motivates improving the performance of state-of-the-art models. The quality of mass spectra often varies because certain ions are unexpectedly missing, presenting a significant challenge in de novo peptide sequencing. Here, we use a novel concept of complementary spectra to enhance ion information of the experimental spectrum and demonstrate it through conceptual and practical analyses. Afterward, we design suitable encoders to encode the experimental spectrum and the corresponding complementary spectrum and propose a de novo sequencing model $\pi$-HelixNovo based on the Transformer architecture. We first demonstrated that $\pi$-HelixNovo outperforms other state-of-the-art models using a series of comparative experiments. Then, we utilized $\pi$-HelixNovo to de novo sequence gut metaproteome peptides for the first time. The results show $\pi$-HelixNovo increases the identification coverage and accuracy of gut metaproteome and enhances the taxonomic resolution of gut metaproteome. We finally trained a powerful $\pi$-HelixNovo utilizing a larger training dataset, and as expected, $\pi$-HelixNovo achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences. We also use the powerful $\pi$-HelixNovo to identify antibody peptides and multi-enzyme cleavage peptides, and $\pi$-HelixNovo is highly robust in these applications. Our results demonstrate the effectiveness of the complementary spectrum and take a significant step forward in de novo peptide sequencing.
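The abstract does not define how a complementary spectrum is built. As a rough illustration only, the sketch below assumes the standard mass relation for singly charged b and y fragment ions of the same cleavage site, whose m/z values sum to the neutral peptide mass plus two proton masses. The function name and the peak representation are hypothetical and are not taken from the $\pi$-HelixNovo code.

```python
PROTON = 1.007276  # approximate proton mass in Da


def complementary_spectrum(peaks, neutral_mass):
    """Mirror each peak around the precursor mass.

    peaks: list of (mz, intensity) tuples, assumed to be singly charged fragments.
    neutral_mass: neutral (uncharged) peptide mass in Da.
    Assumption: for singly charged b/y pairs, mz_b + mz_y = neutral_mass + 2 * PROTON.
    """
    total = neutral_mass + 2 * PROTON
    mirrored = [(total - mz, intensity) for mz, intensity in peaks if 0 < mz < total]
    return sorted(mirrored)


# Toy example: the b1 ion of the dipeptide GA maps onto the position of its y1 ion.
toy_peaks = [(58.028736, 10.0)]
toy_complement = complementary_spectrum(toy_peaks, neutral_mass=146.069135)
```

Under this assumption, a missing y ion can still leave evidence in the mirrored spectrum when its partner b ion was observed.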


Subject(s)
Sequence Analysis, Protein , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides , Amino Acid Sequence , Antibodies , Algorithms
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600663

ABSTRACT

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in the protein structure prediction, we explored whether the representation of structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, they are further fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms the state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on CATH 4.2 benchmark, respectively. Encouraging results also have been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can well reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.
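The sequence recovery rate quoted above is commonly understood as the fraction of positions at which the designed sequence reproduces the native residue. A minimal sketch under that assumption follows; the function name and the toy sequences are illustrative and do not come from SPDesign.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


# Toy example: 6 of 8 positions are recovered.
toy_recovery = sequence_recovery("MKTAYIAK", "MKSAYLAK")
```

Benchmarks typically average this value over many backbones, so a single pair like the one above says little on its own.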


Subject(s)
Neural Networks, Computer , Proteins , Sequence Alignment , Amino Acid Sequence , Proteins/chemistry , Sequence Analysis, Protein/methods
7.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38851299

ABSTRACT

Protein-protein interactions (PPIs) are the basis of many important biological processes, with protein complexes being the key forms implementing these interactions. Understanding protein complexes and their functions is critical for elucidating mechanisms of life processes, disease diagnosis and treatment and drug development. However, experimental methods for identifying protein complexes have many limitations. Therefore, it is necessary to use computational methods to predict protein complexes. Protein sequences can indicate the structure and biological functions of proteins, while also determining their binding abilities with other proteins, influencing the formation of protein complexes. Integrating these characteristics to predict protein complexes is very promising, but currently there is no effective framework that can utilize both protein sequence and PPI network topology for complex prediction. To address this challenge, we have developed HyperGraphComplex, a method based on hypergraph variational autoencoder that can capture expressive features from protein sequences without feature engineering, while also considering topological properties in PPI networks, to predict protein complexes. Experimental results demonstrated that HyperGraphComplex achieves satisfactory predictive performance when compared with state-of-the-art methods. Further bioinformatics analysis shows that the predicted protein complexes have similar attributes to known ones. Moreover, case studies corroborated the remarkable predictive capability of our model in identifying protein complexes, including 3 that were not only experimentally validated by recent studies but also exhibited high-confidence structural predictions from AlphaFold-Multimer.
We believe that the HyperGraphComplex algorithm and our provided proteome-wide high-confidence protein complex prediction dataset will help elucidate how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development. Source codes are available at https://github.com/LiDlab/HyperGraphComplex.


Subject(s)
Computational Biology , Protein Interaction Mapping , Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/metabolism , Proteins/chemistry , Algorithms , Protein Interaction Maps , Databases, Protein , Humans , Sequence Analysis, Protein/methods , Amino Acid Sequence
8.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
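The E-score is defined above as the cosine similarity between two residue embedding vectors. The following sketch computes that quantity for plain Python lists. The toy three-dimensional vectors are hypothetical stand-ins; in practice the vectors would come from a protein language model such as ProtT5 and have far more dimensions.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for three residues.
residue_a = [1.0, 2.0, 0.0]
residue_b = [2.0, 4.0, 0.0]  # same direction as residue_a
residue_c = [0.0, 0.0, 3.0]  # orthogonal to residue_a
```

Because each embedding depends on the residue's sequence context, the same pair of amino acids can receive different scores at different positions, which a fixed matrix such as BLOSUM cannot express.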


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
9.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein in three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Subject(s)
Algorithms , Computational Biology , Neural Networks, Computer , Protein Structure, Secondary , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Databases, Protein , Gene Ontology , Sequence Analysis, Protein/methods , Software
10.
Nature ; 577(7790): 399-404, 2020 01.
Article in English | MEDLINE | ID: mdl-31915375

ABSTRACT

Alzheimer's disease is an incurable neurodegenerative disorder in which neuroinflammation has a critical function. However, little is known about the contribution of the adaptive immune response in Alzheimer's disease. Here, using integrated analyses of multiple cohorts, we identify peripheral and central adaptive immune changes in Alzheimer's disease. First, we performed mass cytometry of peripheral blood mononuclear cells and discovered an immune signature of Alzheimer's disease that consists of increased numbers of CD8+ T effector memory CD45RA+ (TEMRA) cells. In a second cohort, we found that CD8+ TEMRA cells were negatively associated with cognition. Furthermore, single-cell RNA sequencing revealed that T cell receptor (TCR) signalling was enhanced in these cells. Notably, by using several strategies of single-cell TCR sequencing in a third cohort, we discovered clonally expanded CD8+ TEMRA cells in the cerebrospinal fluid of patients with Alzheimer's disease. Finally, we used machine learning, cloning and peptide screens to demonstrate the specificity of clonally expanded TCRs in the cerebrospinal fluid of patients with Alzheimer's disease to two separate Epstein-Barr virus antigens. These results reveal an adaptive immune response in the blood and cerebrospinal fluid in Alzheimer's disease and provide evidence of clonal, antigen-experienced T cells patrolling the intrathecal space of brains affected by age-related neurodegeneration.


Subject(s)
Alzheimer Disease/immunology , CD8-Positive T-Lymphocytes/immunology , Cerebrospinal Fluid/immunology , Aged , Amino Acid Sequence , Cohort Studies , Humans , Immunologic Memory , Middle Aged , Receptors, Antigen, T-Cell/chemistry , Receptors, Antigen, T-Cell/immunology , Sequence Analysis, Protein
11.
Nucleic Acids Res ; 52(10): 5624-5642, 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38554111

ABSTRACT

Gametocyte development of the Plasmodium parasite is a key step for transmission of the parasite. Male and female gametocytes are produced from a subpopulation of asexual blood-stage parasites, but the mechanisms that regulate the differentiation of sexual stages are still under investigation. In this study, we investigated the role of PbARID, a putative subunit of a SWI/SNF chromatin remodeling complex, in transcriptional regulation during the gametocyte development of P. berghei. PbARID expression starts in early gametocytes before the manifestation of male and female-specific features, and disruption of its gene results in the complete loss of gametocytes with detectable male features and the production of abnormal female gametocytes. ChIP-seq analysis of PbARID showed that it forms a complex with gSNF2, an ATPase subunit of the SWI/SNF chromatin remodeling complex, associating with the male cis-regulatory element, TGTCT. Further ChIP-seq of PbARID in gsnf2-knockout parasites revealed an association of PbARID with another cis-regulatory element, TGCACA. RIME and DNA-binding assays suggested that HDP1 is the transcription factor that recruits PbARID to the TGCACA motif. Our results indicated that PbARID could function in two chromatin remodeling events and play essential roles in both male and female gametocyte development.


Subject(s)
Chromatin Assembly and Disassembly , Plasmodium berghei , Protozoan Proteins , Transcription Factors , Animals , Female , Male , Mice , Chromatin Assembly and Disassembly/genetics , Plasmodium berghei/genetics , Plasmodium berghei/growth & development , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Genotype , Sequence Analysis, RNA , Chromatin/genetics , Chromatin/metabolism , Amino Acid Sequence , Sequence Analysis, Protein , Phylogeny , Transcriptome , Genome, Protozoan
12.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37258453

ABSTRACT

Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physiochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of multi-view deep neural model. Experimental results show that the MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.


Subject(s)
Neural Networks, Computer , Proteins , Amino Acid Sequence , Software , Sequence Analysis, Protein
13.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36545804

ABSTRACT

Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody-antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69-99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.
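The sequence coverage figures above can be read as the fraction of reference-chain positions that fall inside at least one matched peptide or contig. A minimal sketch of that calculation follows, assuming exact substring matching. The function name and the toy data are hypothetical, and real evaluations usually also have to handle isobaric residues such as I and L, which this sketch does not.

```python
def sequence_coverage(reference, peptides):
    """Fraction of reference positions covered by at least one exactly matching peptide."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    covered = [False] * len(reference)
    for peptide in peptides:
        if not peptide:
            continue
        start = reference.find(peptide)
        while start != -1:  # mark every occurrence, including overlapping ones
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = reference.find(peptide, start + 1)
    return sum(covered) / len(reference)


# Toy example: two peptides cover 8 of 10 positions; the third does not match.
toy_coverage = sequence_coverage("DIQMTQSPSS", ["DIQM", "QSPS", "WWWW"])
```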


Subject(s)
Antibodies, Monoclonal , Peptides , Amino Acid Sequence , Antibodies, Monoclonal/genetics , Peptides/genetics , Peptides/chemistry , Algorithms , Sequence Analysis, Protein/methods
14.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37833837

ABSTRACT

Protein remote homology detection is essential for structure prediction, function prediction, disease mechanism understanding, etc. The remote homology relationship depends on multiple protein properties, such as structural information and local sequence patterns. Previous studies have shown the challenges for predicting remote homology relationship by protein features at sequence level (e.g. position-specific score matrix). Protein motifs have been used in structure and function analysis due to their unique sequence patterns and implied structural information. Therefore, designing a usable architecture to fuse multiple protein properties based on motifs is urgently needed to improve protein remote homology detection performance. To make full use of the characteristics of motifs, we employed the language model called the protein cubic language model (PCLM). It combines multiple properties by constructing a motif-based neural network. Based on the PCLM, we proposed a predictor called PreHom-PCLM by extracting and fusing multiple motif features for protein remote homology detection. PreHom-PCLM outperforms the other state-of-the-art methods on the test set and independent test set. Experimental results further prove the effectiveness of multiple features fused by PreHom-PCLM for remote homology detection. Furthermore, the protein features derived from the PreHom-PCLM show strong discriminative power for proteins from different structural classes in the high-dimensional space. Availability and Implementation: http://bliulab.net/PreHom-PCLM.


Subject(s)
Algorithms , Proteins , Proteins/chemistry , Neural Networks, Computer , Amino Acid Motifs , Language , Sequence Analysis, Protein/methods
15.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38648741

ABSTRACT

SUMMARY: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements. AVAILABILITY AND IMPLEMENTATION: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.


Subject(s)
Proteins , Sequence Alignment , Sequence Analysis, Protein , Software , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology/methods , Databases, Protein
16.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. 
Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
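To make the contrast between the trivial tokenizer and a multi-character tokenizer concrete, the sketch below segments a protein sequence by greedy longest match against a small fixed vocabulary. This is an illustration only: the vocabulary is hypothetical and hand-written, whereas the tokenizers described in the abstract are trained on data, and the abstract does not say which algorithms were used.

```python
def char_tokenize(sequence):
    """Trivial tokenizer: every character is its own token."""
    return list(sequence)


def greedy_tokenize(sequence, vocab):
    """Greedy longest-match segmentation that falls back to single characters."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens = []
    i = 0
    while i < len(sequence):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary standing in for one learned from a sequence database.
TOY_VOCAB = {"MK", "LV", "GGS"}
toy_sequence = "MKLVGGSA"
toy_char_tokens = char_tokenize(toy_sequence)
toy_greedy_tokens = greedy_tokenize(toy_sequence, TOY_VOCAB)
```

In this toy case the input shrinks from eight tokens to four, which mirrors the kind of length reduction the abstract reports, although the actual ratio depends on the trained vocabulary.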


Subject(s)
Algorithms , Computational Biology , Deep Learning , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods
17.
Bioinformatics ; 40(Supplement_1): i410-i417, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940129

ABSTRACT

MOTIVATION: One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools. RESULTS: To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.


Subject(s)
Databases, Protein; Peptides; Peptides/chemistry; Machine Learning; Mass Spectrometry/methods; Algorithms; Sequence Analysis, Protein/methods; Tandem Mass Spectrometry/methods
18.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940160

ABSTRACT

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Subject(s)
Algorithms; Markov Chains; Sequence Alignment; Software; Sequence Alignment/methods; Computational Biology/methods; Sequence Analysis, Protein/methods; Phylogeny; Proteins/chemistry
19.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38662570

ABSTRACT

MOTIVATION: Proteins, the molecular workhorses of biological systems, execute a multitude of critical functions dictated by their precise three-dimensional structures. In a complex and dynamic cellular environment, proteins can undergo misfolding, leading to the formation of aggregates that take up various forms, including amorphous and ordered aggregation in the shape of amyloid fibrils. This phenomenon is closely linked to a spectrum of widespread debilitating pathologies, such as Alzheimer's disease, Parkinson's disease, type-II diabetes, and several other proteinopathies, but also hampers the engineering of soluble agents, as in the case of antibody development. As such, the accurate prediction of aggregation propensity within protein sequences has become pivotal due to profound implications in understanding disease mechanisms, as well as in improving biotechnological and therapeutic applications. RESULTS: We previously developed Cordax, a structure-based predictor that utilizes logistic regression to detect aggregation motifs in protein sequences based on their structural complementarity to the amyloid cross-beta architecture. Here, we present a dedicated web server interface for Cordax. This online platform combines several features including detailed scoring of sequence aggregation propensity, as well as 3D visualization with several customization options for topology models of the structural cores formed by predicted aggregation motifs. In addition, information is provided on experimentally determined aggregation-prone regions that exhibit sequence similarity to predicted motifs, scores, and links to other predictor outputs, as well as simultaneous predictions of relevant sequence propensities, such as solubility, hydrophobicity, and secondary structure propensity. AVAILABILITY AND IMPLEMENTATION: The Cordax webserver is freely accessible at https://cordax.switchlab.org/.


Subject(s)
Software; Protein Aggregates; Internet; Amyloid/chemistry; Proteins/chemistry; Amino Acid Motifs; Humans; Protein Conformation; Sequence Analysis, Protein/methods; Amino Acid Sequence
20.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38652603

ABSTRACT

MOTIVATION: Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity. RESULTS: In this work, we fit a simple generative model, SAM, to sixty million human heavy and seventy million human light chains. We show that the probability of a sequence calculated by the model distinguishes human sequences from other species with the same or better accuracy on a variety of benchmark datasets containing >400 million sequences than any other model in the literature, outperforming large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences which is orders of magnitude faster than existing tools in the literature. AVAILABILITY AND IMPLEMENTATION: All tools developed in this study are available at https://github.com/Wang-lab-UCSD/AntPack.


Subject(s)
Antibodies; Humans; Antibodies/chemistry; Software; Sequence Analysis, Protein/methods; Computational Biology/methods; Immunoglobulin Heavy Chains/chemistry; Immunoglobulin Heavy Chains/immunology; Immunoglobulin Light Chains/chemistry; Immunoglobulin Light Chains/immunology; Algorithms
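
Entry 20 scores sequences by their probability under a simple generative model. The abstract does not describe the form of SAM, so the sketch below uses a stand-in: an independent per-position categorical model fit to pre-aligned, equal-length sequences with add-one smoothing. All function names and the toy sequences are hypothetical, and this is not the AntPack API.

```python
import math
from collections import Counter

# 20 standard amino acids plus a gap symbol for aligned positions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def fit_profile(aligned_seqs, pseudocount=1.0):
    """Fit per-position log-probabilities from equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    for seq in aligned_seqs:
        if len(seq) != length:
            raise ValueError("all sequences must have the same aligned length")
        if any(ch not in ALPHABET for ch in seq):
            raise ValueError("unexpected symbol in sequence: " + seq)
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        # Smoothed log-probability of each symbol at this position.
        profile.append({aa: math.log((counts[aa] + pseudocount) / total)
                        for aa in ALPHABET})
    return profile


def log_likelihood(profile, seq):
    """Sum of per-position log-probabilities; higher means more typical."""
    if len(seq) != len(profile):
        raise ValueError("sequence length does not match the profile")
    return sum(column[aa] for column, aa in zip(profile, seq))


# Toy training set (made-up fragments, for illustration only).
training = ["EVQLV", "EVQLL", "QVQLV"]
profile = fit_profile(training)
typical_score = log_likelihood(profile, "EVQLV")
atypical_score = log_likelihood(profile, "WWWWW")
```

A sequence resembling the training set receives a higher log-likelihood than an unrelated one, which is the basic mechanism behind using a generative model's probability as a similarity score. Every per-position term can also be inspected directly, which keeps a model of this kind interpretable.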