Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 8 de 8
Filtrar
1.
Nucleic Acids Res ; 52(W1): W215-W220, 2024 Jul 05.
Artigo em Inglês | MEDLINE | ID: mdl-38587188

RESUMO

DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.


Assuntos
Proteínas de Membrana , Software , Proteínas de Membrana/química , Proteínas de Membrana/metabolismo , Internet , Sinais Direcionadores de Proteínas , Análise de Sequência de Proteína
2.
Nucleic Acids Res ; 50(W1): W228-W234, 2022 07 05.
Artigo em Inglês | MEDLINE | ID: mdl-35489069

RESUMO

The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.


Assuntos
Sinais Direcionadores de Proteínas , Proteínas , Humanos , Proteínas/metabolismo , Eucariotos/metabolismo , Transporte Proteico , Idioma , Bases de Dados de Proteínas , Biologia Computacional , Frações Subcelulares/metabolismo
3.
Bioinformatics ; 38(4): 941-946, 2022 01 27.
Artigo em Inglês | MEDLINE | ID: mdl-35088833

RESUMO

MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Escherichia coli , Idioma , Proteínas , Software , Solubilidade
4.
Bioinformatics ; 33(22): 3685-3690, 2017 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-28961695

RESUMO

MOTIVATION: Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular as machine learning tools during the recent years. The availability of greater computational resources, more data, new algorithms for training deep models and easy to use libraries for implementation and training of neural networks are the drivers of this development. The use of deep learning has been especially successful in image recognition; and the development of tools, applications and code examples are in most cases centered within this field rather than within biology. RESULTS: Here, we aim to further the development of deep learning methods within biology by providing application examples and ready to apply and adapt code templates. Given such examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules. AVAILABILITY AND IMPLEMENTATION: All implementations and datasets are available online to the scientific community at https://github.com/vanessajurtz/lasagne4bio. CONTACT: skaaesonderby@gmail.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Aprendizado de Máquina , Estrutura Secundária de Proteína , Transporte Proteico , Análise de Sequência de Proteína/métodos , Biologia Computacional/métodos , Redes Neurais de Computação , Peptídeos/metabolismo , Ligação Proteica
5.
NAR Genom Bioinform ; 5(4): lqad088, 2023 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-37850036

RESUMO

When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.

6.
Nat Biotechnol ; 40(7): 1023-1025, 2022 07.
Artigo em Inglês | MEDLINE | ID: mdl-34980915

RESUMO

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.


Assuntos
Idioma , Sinais Direcionadores de Proteínas , Algoritmos , Sequência de Aminoácidos , Sinais Direcionadores de Proteínas/genética , Proteínas
7.
Comput Biol Chem ; 95: 107596, 2021 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-34775287

RESUMO

A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabelled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.


Assuntos
Bacillus subtilis/genética , Proteínas de Bactérias/genética , Aprendizado de Máquina , Regulação da Expressão Gênica , Proteínas Recombinantes/genética
8.
Artigo em Inglês | MEDLINE | ID: mdl-29527131

RESUMO

The EEG of epileptic patients often contains sharp waveforms called "spikes", occurring between seizures. Detecting such spikes is crucial for diagnosing epilepsy. In this paper, we develop a convolutional neural network (CNN) for detecting spikes in EEG of epileptic patients in an automated fashion. The CNN has a convolutional architecture with filters of various sizes applied to the input layer, leaky ReLUs as activation functions, and a sigmoid output layer. Balanced mini-batches were applied to handle the imbalance in the data set. Leave-one-patient-out cross-validation was carried out to test the CNN and benchmark models on EEG data of five epilepsy patients. We achieved 0.947 AUC for the CNN, while the best performing benchmark model, Support Vector Machines with Gaussian kernel, achieved an AUC of 0.912.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA