Results 1 - 20 of 263
1.
J Bioinform Comput Biol ; 22(2): 2450006, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38812466

ABSTRACT

Molecular recognition features (MoRFs) are functional segments of disordered proteins that play crucial roles in regulating the phase transition of membrane-less organelles and frequently serve as central sites in cellular interaction networks. As more associations between disordered proteins and severe diseases are discovered, identifying MoRFs has become increasingly important. Because few MoRFs have been experimentally validated, the performance of existing MoRF prediction algorithms still needs to be improved. In this research, we present a model named MoRF_ESM, which uses deep-learning protein representations to predict MoRFs in disordered proteins. The approach employs a pretrained ESM-2 protein language model to generate embedding representations of residues in the form of attention map matrices. These representations are combined with a self-learned TextCNN model for feature extraction and prediction. In addition, an averaging step is incorporated at the end of the MoRF_ESM model to refine the output and generate the final prediction results. Compared with other methods on benchmark datasets, MoRF_ESM demonstrates state-of-the-art performance, achieving [Formula: see text] higher AUC than other methods on TEST1 and [Formula: see text] higher AUC than other methods on TEST2. These results imply that the combination of ESM-2 and TextCNN can effectively extract deep evolutionary features related to protein structure and function, while also capturing shallow pattern features in protein sequences, and is well suited to the MoRF prediction task. Given that ESM-2 is a highly versatile protein language model, the methodology proposed in this study can be readily applied to other tasks involving the classification of protein sequences.
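
The abstract does not say how the final averaging step is carried out. As a rough illustration only, the sketch below assumes it is a sliding-window mean over per-residue scores; the function name `smooth_scores` and the `window` width are hypothetical and are not taken from MoRF_ESM.

```python
def smooth_scores(scores, window=5):
    """Replace each per-residue score with the mean of a centered window.

    Assumption: `window` is a hypothetical odd width. The abstract does not
    state the window size or the exact averaging scheme used by MoRF_ESM.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        # Clip the window at the sequence ends instead of padding.
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed
```

Smoothing of this kind damps isolated single-residue spikes, which is plausible for MoRFs because they are contiguous segments rather than individual residues.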


Subjects
Algorithms; Computational Biology; Deep Learning; Intrinsically Disordered Proteins; Computational Biology/methods; Intrinsically Disordered Proteins/chemistry; Intrinsically Disordered Proteins/metabolism; Databases, Protein/statistics & numerical data
2.
Comput Math Methods Med ; 2022: 7191684, 2022.
Article in English | MEDLINE | ID: mdl-35242211

ABSTRACT

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes; the identification of PPIs is therefore of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs by using these sequence data; hence, the prediction of PPIs has become a research hotspot in proteomics. However, since traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems based on machine learning have been applied to PPI prediction, thereby improving the overall recognition rate. In this paper, a novel and efficient computational method is presented to implement a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information of the PSSM for protein sequences and obtained five training and five testing datasets by adopting a five-fold cross-validation method. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved good prediction performance, with 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming other existing computational methods. The experimental results indicate that the proposed method is capable of yielding better PPI prediction results, providing an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
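
The "five training and five testing datasets" come from five-fold cross-validation: the samples are split into five folds, and each fold serves once as the test set while the other four form the training set. A generic sketch follows; the shuffling, the seed, and the function name `k_fold_indices` are assumptions made for illustration, not details from the paper.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Assumption: samples are shuffled once with a fixed seed; the paper's
    actual splitting procedure is not described in the abstract.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

The reported average accuracy would then be the mean of the per-fold test accuracies.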


Subjects
Protein Interaction Mapping/methods; Protein Interaction Maps/genetics; Amino Acid Sequence; Bacterial Proteins/genetics; Computational Biology; Databases, Protein/statistics & numerical data; Discriminant Analysis; Evolution, Molecular; Helicobacter pylori/genetics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Machine Learning; Position-Specific Scoring Matrices; Protein Interaction Mapping/statistics & numerical data; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data; Support Vector Machine
4.
Comput Math Methods Med ; 2022: 7493834, 2022.
Article in English | MEDLINE | ID: mdl-35069791

ABSTRACT

Helicobacter pylori (H. pylori) is the most common risk factor for gastric cancer worldwide. The membrane proteins of H. pylori are involved in bacterial adherence and play a vital role in drug discovery. Thus, an accurate and cost-effective computational model is needed to predict the uncharacterized membrane proteins of H. pylori. In this study, a reliable benchmark dataset consisting of 114 membrane and 219 nonmembrane proteins was constructed based on UniProt. A support vector machine (SVM)-based model was developed for discriminating H. pylori membrane proteins from nonmembrane proteins by using sequence information. Cross-validation showed that our method achieved good performance, with an accuracy of 91.29%. It is anticipated that the proposed model will be useful for the annotation of H. pylori membrane proteins and the development of new anti-H. pylori agents.


Subjects
Bacterial Proteins/genetics; Helicobacter pylori/genetics; Membrane Proteins/genetics; Amino Acid Sequence; Amino Acids/analysis; Bacterial Proteins/chemistry; Computational Biology; Databases, Protein/statistics & numerical data; Helicobacter pylori/chemistry; Helicobacter pylori/pathogenicity; Host Microbial Interactions; Humans; Membrane Proteins/chemistry; Support Vector Machine
5.
PLoS Comput Biol ; 17(11): e1009550, 2021 11.
Article in English | MEDLINE | ID: mdl-34748537

ABSTRACT

Metabolic network models are increasingly being used in health care and industry. As a consequence, many tools have been released to automate their de novo reconstruction. To enable gene deletion simulations and the integration of gene expression data, these networks must include gene-protein-reaction (GPR) rules, which use Boolean logic to describe the relationships between the gene products (e.g., enzyme isoforms or subunits) associated with the catalysis of a given reaction. Nevertheless, the reconstruction of GPRs remains a largely manual and time-consuming process. Aiming to fully automate the reconstruction of GPRs for any organism, we propose the open-source Python-based framework GPRuler. By mining text and data from 9 different biological databases, GPRuler can reconstruct GPRs starting either from just the name of the target organism or from an existing metabolic model. The performance of the tool is evaluated at small scale for a manually curated metabolic model, and at genome scale for three metabolic models of Homo sapiens and Saccharomyces cerevisiae. Using these models as benchmarks, the proposed tool showed its ability to reproduce the original GPR rules with a high level of accuracy. In all the tested scenarios, manual investigation of the mismatches between the rules proposed by GPRuler and the original ones showed that the proposed approach was in many cases more accurate than the original models. By complementing existing tools for metabolic network reconstruction with the ability to reconstruct GPRs quickly and with few resources, GPRuler paves the way to the study of context-specific metabolic networks, which represent the active portion of the complete network in given conditions, for organisms of industrial or biomedical interest that have not yet been characterized metabolically.
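
To make the role of GPR rules in gene deletion simulations concrete, the sketch below evaluates a Boolean rule against a set of deleted genes: subunits of a complex are joined by AND, isoforms by OR. The nested-tuple rule format, the function name `reaction_is_active`, and the gene names are hypothetical; they do not reflect GPRuler's actual data structures or API.

```python
def reaction_is_active(rule, deleted_genes):
    """Evaluate a GPR rule after deleting a set of genes.

    Assumed (hypothetical) rule format: a gene identifier string, or a tuple
    ("and" | "or", operand, operand, ...) whose operands are rules themselves.
    """
    if isinstance(rule, str):
        # A single gene product is available unless its gene was deleted.
        return rule not in deleted_genes
    operator, *operands = rule
    results = (reaction_is_active(op, deleted_genes) for op in operands)
    if operator == "and":  # all subunits of a complex are required
        return all(results)
    if operator == "or":   # any one isoform is sufficient
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical rule: a two-subunit complex (geneA AND geneB) OR an isozyme geneC.
example_rule = ("or", ("and", "geneA", "geneB"), "geneC")
```

With this rule, deleting `geneA` alone leaves the reaction active through `geneC`, whereas deleting both `geneA` and `geneC` disables it.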


Subjects
Metabolic Networks and Pathways/genetics; Models, Biological; Software; Computational Biology; Computer Simulation; Databases, Genetic/statistics & numerical data; Databases, Protein/statistics & numerical data; Humans; Models, Genetic; Molecular Sequence Annotation; Protein Interaction Maps/genetics; Protein Structure, Quaternary; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae/metabolism; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism
6.
Int J Mol Sci ; 22(22)2021 Nov 12.
Article in English | MEDLINE | ID: mdl-34830151

ABSTRACT

Transmembrane proteins (TMPs) play important roles in cells, ranging from transport processes and cell adhesion to communication. Many of these functions are mediated by intrinsically disordered regions (IDRs), flexible protein segments without a well-defined structure. Although a variety of methods are available for predicting IDRs, their accuracy is very limited on TMPs because of the special physico-chemical properties of these proteins. We prepared a dataset containing membrane proteins exclusively, using X-ray crystallography data. MemDis is a novel prediction method that uses convolutional neural networks and long short-term memory networks to predict disordered regions in TMPs. In addition to attributes commonly used in IDR predictors, we defined several TMP-specific features to further enhance the accuracy of our method. MemDis achieved the highest prediction accuracy on the TMP-specific dataset among popular IDR prediction methods.


Subjects
Computational Biology/methods; Intrinsically Disordered Proteins/chemistry; Membrane Proteins/chemistry; Neural Networks, Computer; Amino Acid Sequence; Data Mining/methods; Databases, Protein/statistics & numerical data; Internet; Models, Molecular; Protein Conformation; Reproducibility of Results
7.
Comput Math Methods Med ; 2021: 9997669, 2021.
Article in English | MEDLINE | ID: mdl-34697557

ABSTRACT

Modeling antigenic variation in influenza (flu) virus A H3N2 using amino acid sequences is a promising approach for improving the prediction accuracy of immune efficacy of vaccines and increasing the efficiency of vaccine screening. Antigenic drift and antigenic jump/shift, which arise from the accumulation of mutations with small or moderate effects and from a major, abrupt change with large effects on the surface antigen hemagglutinin (HA), respectively, are two types of antigenic variation that facilitate immune evasion of flu virus A and make it challenging to predict the antigenic properties of new viral strains. Despite considerable progress in modeling antigenic variation based on the amino acid sequences, few studies focus on the deep learning framework which could be most suitable to be applied to this task. Here, we propose a novel deep learning approach that incorporates a convolutional neural network (CNN) and bidirectional long-short-term memory (BLSTM) neural network to predict antigenic variation. In this approach, CNN extracts the complex local contexts of amino acids while the BLSTM neural network captures the long-distance sequence information. When compared to the existing methods, our deep learning approach achieves the overall highest prediction performance on the validation dataset, and more encouragingly, it achieves prediction agreements of 99.20% and 96.46% for the strains in the forthcoming year and in the next two years included in an existing set of chronological amino acid sequences, respectively. These results indicate that our deep learning approach is promising to be applied to antigenic variation prediction of flu virus A H3N2.


Subjects
Antigenic Variation; Deep Learning; Influenza A Virus, H3N2 Subtype/genetics; Influenza A Virus, H3N2 Subtype/immunology; Influenza, Human/virology; Amino Acid Sequence; Antigens, Viral/genetics; Computational Biology; Databases, Protein/statistics & numerical data; Hemagglutinin Glycoproteins, Influenza Virus/genetics; Hemagglutinin Glycoproteins, Influenza Virus/immunology; Humans; Neural Networks, Computer
8.
Comput Math Methods Med ; 2021: 7681497, 2021.
Article in English | MEDLINE | ID: mdl-34671418

ABSTRACT

Membrane proteins are an important class of proteins that play essential roles in several cellular processes. Based on their intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. The types of a membrane protein are reported to be highly related to its functions. Determination of membrane protein types has been a hot topic in recent years, and many computational methods have been proposed so far. Some of them used functional domain information to encode proteins; however, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. The scheme treats domains as words and proteins, represented by their domains, as sentences. The natural language processing approach word2vector was applied to obtain the features of domains, which were further refined into protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of this classifier. Furthermore, the classifier was superior to other classifiers based on features derived from functional domains via a one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of the protein features generated by the proposed scheme.


Subjects
Membrane Proteins/chemistry; Membrane Proteins/classification; Natural Language Processing; Algorithms; Computational Biology; Databases, Protein/statistics & numerical data; Humans; Protein Domains; Support Vector Machine
9.
PLoS Comput Biol ; 17(9): e1009446, 2021 09.
Article in English | MEDLINE | ID: mdl-34555022

ABSTRACT

Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.


Subjects
Alcohol Oxidoreductases/classification; Databases, Protein/statistics & numerical data; Molecular Sequence Annotation/statistics & numerical data; Alcohol Oxidoreductases/chemistry; Alcohol Oxidoreductases/genetics; Animals; Computational Biology; Enzymes/chemistry; Enzymes/classification; Enzymes/genetics; Humans; Models, Molecular; Protein Domains; Sequence Homology, Amino Acid
10.
Int J Mol Sci ; 22(17)2021 Sep 06.
Article in English | MEDLINE | ID: mdl-34502531

ABSTRACT

Interactions between proteins are essential to any cellular process and constitute the basis for molecular networks that determine the functional state of a cell. With the technical advances in recent years, an astonishingly high number of protein-protein interactions has been revealed. However, the interactome of O-linked N-acetylglucosamine transferase (OGT), the sole enzyme adding the O-linked β-N-acetylglucosamine (O-GlcNAc) onto its target proteins, has been largely undefined. To that end, we collated OGT-interacting proteins experimentally identified over the past several decades. Rigorous curation of datasets from public repositories and O-GlcNAc-focused publications led to the identification of up to 929 high-stringency OGT interactors from the multiple species studied (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, and others). Among them, 784 human proteins were found to be interactors of human OGT. Moreover, these proteins spanned a very diverse range of functional classes (e.g., DNA repair, RNA metabolism, translational regulation, and cell cycle), with significant enrichment in regulating transcription and (co)translation. Our dataset demonstrates that OGT is likely a hub protein in cells. A webserver, OGT-Protein Interaction Network (OGT-PIN), has also been created and is freely accessible.


Subjects
Acetylglucosamine/metabolism; Data Curation/methods; Databases, Protein/statistics & numerical data; N-Acetylglucosaminyltransferases/metabolism; Protein Interaction Maps; Protein Processing, Post-Translational; Animals; Arabidopsis Proteins/metabolism; Drosophila Proteins/metabolism; Humans; Mice; Rats
11.
Int J Mol Sci ; 22(17)2021 Sep 06.
Article in English | MEDLINE | ID: mdl-34502557

ABSTRACT

Analysis of differential abundance in proteomics data sets requires careful application of missing value imputation. Missing abundance values vary widely when comparisons are made across different sample treatments. For example, one would expect a consistent rate of "missing at random" (MAR) across batches of samples and varying rates of "missing not at random" (MNAR) depending on the inherent difference in sample treatments within the study. The imputation strategy selected must therefore be the one that best accounts for both MAR and MNAR simultaneously. Several important issues must be considered when deciding on the appropriate missing value imputation strategy: (1) when it is appropriate to impute data; (2) how to choose a method that reflects the combination of MAR and MNAR that occurs in an experiment. This paper provides an evaluation of missing value imputation strategies used in proteomics and presents a case for the use of hybrid left-censored missing value imputation approaches that can handle the MNAR problem common to proteomics data.
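
The idea of a hybrid strategy can be sketched as follows: within one sample group, a protein that is mostly unobserved is treated as MNAR and filled with a low, left-censored value, while sporadic gaps are treated as MAR and filled from the observed replicates. The decision rule, the `mnar_fraction` threshold, the `floor` value, and the function name `impute_group` are all assumptions for illustration; the abstract does not specify the paper's actual procedure.

```python
def impute_group(values, floor, mnar_fraction=0.5):
    """Impute missing abundances (None) for one protein in one sample group.

    Assumptions (hypothetical, not from the paper):
    - `floor` is a left-censored fill value, e.g. near the detection limit;
    - a group missing more than `mnar_fraction` of its values is treated as
      MNAR, otherwise the gaps are treated as MAR.
    """
    if not values:
        return []
    observed = [v for v in values if v is not None]
    missing_fraction = 1 - len(observed) / len(values)
    if not observed or missing_fraction > mnar_fraction:
        fill = floor  # MNAR: assume the signal fell below detection
    else:
        fill = sum(observed) / len(observed)  # MAR: use the group mean
    return [fill if v is None else v for v in values]
```

For example, `impute_group([10.0, None, 12.0], floor=1.0)` fills the gap with the group mean, whereas `impute_group([None, None, 9.0], floor=1.0)` fills both gaps with the floor value.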


Subjects
Data Accuracy; Databases, Protein/statistics & numerical data; Mass Spectrometry/methods; Proteomics/statistics & numerical data; Breast Neoplasms/metabolism; Breast Neoplasms/pathology; Cell Line, Tumor; Glucose/metabolism; Humans; Proteomics/methods; Proteomics/standards
12.
PLoS Comput Biol ; 17(8): e1008844, 2021 08.
Article in English | MEDLINE | ID: mdl-34370723

ABSTRACT

Many biological processes are mediated by protein-protein interactions (PPIs). Because protein domains are the building blocks of proteins, PPIs likely rely on domain-domain interactions (DDIs). Several attempts exist to infer DDIs from PPI networks but the produced datasets are heterogeneous and sometimes not accessible, while the PPI interactome data keeps growing. We describe a new computational approach called "PPIDM" (Protein-Protein Interactions Domain Miner) for inferring DDIs using multiple sources of PPIs. The approach is an extension of our previously described "CODAC" (Computational Discovery of Direct Associations using Common neighbors) method for inferring new edges in a tripartite graph. The PPIDM method has been applied to seven widely used PPI resources, using as "Gold-Standard" a set of DDIs extracted from 3D structural databases. Overall, PPIDM has produced a dataset of 84,552 non-redundant DDIs. Statistical significance (p-value) is calculated for each source of PPI and used to classify the PPIDM DDIs in Gold (9,175 DDIs), Silver (24,934 DDIs) and Bronze (50,443 DDIs) categories. Dataset comparison reveals that PPIDM has inferred from the 2017 releases of PPI sources about 46% of the DDIs present in the 2020 release of the 3did database, not counting the DDIs present in the Gold-Standard. The PPIDM dataset contains 10,229 DDIs that are consistent with more than 13,300 PPIs extracted from the IMEx database, and nearly 23,300 DDIs (27.5%) that are consistent with more than 214,000 human PPIs extracted from the STRING database. Examples of newly inferred DDIs covering more than 10 PPIs in the IMEx database are provided. Further exploitation of the PPIDM DDI reservoir includes the inventory of possible partners of a protein of interest and characterization of protein interactions at the domain level in combination with other methods. The result is publicly available at http://ppidm.loria.fr/.


Subjects
Protein Interaction Domains and Motifs; Protein Interaction Mapping/statistics & numerical data; Protein Interaction Maps; Algorithms; Computational Biology; Data Mining/statistics & numerical data; Databases, Protein/statistics & numerical data; Humans; Software
13.
Comput Math Methods Med ; 2021: 5770981, 2021.
Article in English | MEDLINE | ID: mdl-34413898

ABSTRACT

Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.
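
The auto cross-covariance (ACC) step turns a profile with one row per residue into a vector whose length does not depend on sequence length. The sketch below uses the commonly cited ACC definition (covariance between column values at positions separated by a lag, averaged over the sequence); whether AOP-HMM uses exactly this form, and the `max_lag` value, are assumptions, and `acc_features` is a hypothetical name.

```python
def acc_features(profile, max_lag=2):
    """Transform a profile (rows = residues, columns = scores) into ACC features.

    For each lag and each ordered column pair (c1, c2), the feature is the
    mean over positions j of (P[j][c1] - mean_c1) * (P[j + lag][c2] - mean_c2).
    Pairs with c1 == c2 are auto-covariances; the rest are cross-covariances.
    Assumption: `max_lag` is a hypothetical setting, not the paper's value.
    """
    length = len(profile)
    if length == 0 or max_lag >= length:
        raise ValueError("profile must have more rows than max_lag")
    n_cols = len(profile[0])
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for c1 in range(n_cols):
            for c2 in range(n_cols):
                total = sum(
                    (profile[j][c1] - means[c1]) * (profile[j + lag][c2] - means[c2])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features
```

The output always has `n_cols * n_cols * max_lag` entries, so proteins of different lengths map to vectors of the same size, which is what a downstream SVM requires.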


Subjects
Antioxidants/chemistry; Machine Learning; Peroxiredoxins/chemistry; Proteins/chemistry; Algorithms; Amino Acids/analysis; Antioxidants/classification; Computational Biology; Databases, Protein/statistics & numerical data; Evolution, Molecular; Humans; Markov Chains; Peroxiredoxins/classification; Proteins/classification
14.
Sci Rep ; 11(1): 12439, 2021 06 14.
Article in English | MEDLINE | ID: mdl-34127723

ABSTRACT

Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools against the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from a 30-fold difference between the minimum and maximum number of coiled coils predicted, the tools vary strongly in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and of missed true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models and the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggest that the tested tools' performance is close to random. This implies that the tools' predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
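
The Matthews correlation coefficient referred to in the abstract is computed from the four confusion-matrix counts. A minimal sketch follows; the function name and the convention of returning 0 when the denominator vanishes are choices made here, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    ranging from -1 (total disagreement) through 0 (random) to +1 (perfect).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

On an imbalanced set with 5 positive and 95 negative residues, a predictor that never predicts a coiled coil reaches 95% accuracy, yet `matthews_corrcoef(0, 95, 0, 5)` returns 0, which is why the metric suits this kind of evaluation.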


Subjects
Amino Acid Motifs; Models, Molecular; Protein Structure, Secondary; Amino Acid Sequence; Databases, Protein/statistics & numerical data; Software
15.
Comput Math Methods Med ; 2021: 5529389, 2021.
Article in English | MEDLINE | ID: mdl-34055035

ABSTRACT

Many combinations of protein features are used to improve protein structural class prediction, but information redundancy is often ignored. To select the important features with strong classification ability, we propose recursive feature selection with random forest to improve protein structural class prediction. We evaluated the proposed method in four experiments and compared it with the available competing prediction methods. The results indicate that the proposed feature selection method effectively improves the efficiency of protein structural class prediction: fewer than 5% of the features are used, yet the prediction accuracy is improved by 4.6-13.3%. We further compared different protein features and found that the predicted secondary structural features achieve the best performance. This understanding can be used to design more powerful prediction methods for the protein structural class.


Subjects
Proteins/chemistry; Proteins/classification; Algorithms; Amino Acid Sequence; Amino Acids/chemistry; Computational Biology; Databases, Protein/statistics & numerical data; Hydrophobic and Hydrophilic Interactions; Protein Conformation; Protein Structural Elements; Protein Structure, Secondary; Sequence Homology, Amino Acid; Support Vector Machine
16.
Nat Commun ; 12(1): 3168, 2021 05 26.
Article in English | MEDLINE | ID: mdl-34039967

ABSTRACT

The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .


Subjects
Computational Biology/methods; Deep Learning; Models, Biological; Protein Structure, Tertiary; Proteins/physiology; Amino Acid Sequence; Databases, Protein/statistics & numerical data; Datasets as Topic; Models, Molecular; Proteins/ultrastructure; Structure-Activity Relationship
17.
Nat Commun ; 12(1): 1983, 2021 03 31.
Article in English | MEDLINE | ID: mdl-33790270

ABSTRACT

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.


Subjects
Algorithms; Evolution, Molecular; Machine Learning; Phylogeny; Animals; Databases, Genetic/statistics & numerical data; Databases, Protein/statistics & numerical data; Humans; Models, Genetic
18.
IEEE/ACM Trans Comput Biol Bioinform ; 18(4): 1299-1304, 2021.
Article in English | MEDLINE | ID: mdl-33687847

ABSTRACT

The novel coronavirus (COVID-19) infection has become a global pandemic, demanding urgent vaccine design. This work reports an anti-coronavirus peptide scanner tool for identifying anti-coronavirus targets in the form of peptides. The proposed CoronaPep tool provides fast fingerprinting of anti-coronavirus targets, a task of great importance in current bioinformatics research. The anti-coronavirus target protein sequences reported from the current outbreak are scanned against the anti-coronavirus target datasets via CoronaPep, which provides precision-based anti-coronavirus peptides. The tool is designed specifically for coronavirus data and can predict peptides from a whole genome or from a list of genes or proteins. In addition, it is relatively fast, accurate, and user-friendly, and can generate maximum output from limited information. The availability of tools like CoronaPep will greatly benefit researchers in the disciplines of oncology and structure-based drug design.


Subjects
COVID-19 Drug Treatment; COVID-19/virology; SARS-CoV-2/chemistry; SARS-CoV-2/drug effects; Software; Viral Proteins/chemistry; Viral Proteins/drug effects; Antiviral Agents/pharmacology; COVID-19/prevention & control; COVID-19 Vaccines/chemistry; COVID-19 Vaccines/genetics; Computational Biology; Databases, Protein/statistics & numerical data; Drug Design; Genome, Viral; Host Microbial Interactions/drug effects; Humans; Pandemics; Peptides/chemistry; Peptides/drug effects; Peptides/genetics; SARS-CoV-2/genetics; Viral Proteins/genetics
19.
J Proteome Res ; 20(3): 1464-1475, 2021 03 05.
Article in English | MEDLINE | ID: mdl-33605735

ABSTRACT

The SARS-CoV-2 virus is the causative agent of the 2020 pandemic leading to the COVID-19 respiratory disease. With many scientific and humanitarian efforts ongoing to develop diagnostic tests, vaccines, and treatments for COVID-19, and to prevent the spread of SARS-CoV-2, mass spectrometry research, including proteomics, is playing a role in determining the biology of this viral infection. Proteomics studies are starting to lead to an understanding of the roles of viral and host proteins during SARS-CoV-2 infection, their protein-protein interactions, and post-translational modifications. This is beginning to provide insights into potential therapeutic targets or diagnostic strategies that can be used to reduce the long-term burden of the pandemic. However, the extraordinary situation caused by the global pandemic is also highlighting the need to improve mass spectrometry data and workflow sharing. We therefore describe freely available data and computational resources that can facilitate and assist the mass spectrometry-based analysis of SARS-CoV-2. We exemplify this by reanalyzing a virus-host interactome data set to detect protein-protein interactions and identify host proteins that could potentially be used as targets for drug repurposing.


Subjects
COVID-19/virology , Information Dissemination/methods , Mass Spectrometry/methods , SARS-CoV-2/chemistry , COVID-19/epidemiology , COVID-19 Testing/methods , COVID-19 Testing/statistics & numerical data , Computational Biology , Databases, Protein/statistics & numerical data , Drug Repositioning , Host Microbial Interactions/physiology , Humans , Mass Spectrometry/statistics & numerical data , Pandemics , Protein Interaction Domains and Motifs , Protein Interaction Maps , Protein Processing, Post-Translational , Proteomics/methods , Proteomics/statistics & numerical data , SARS-CoV-2/pathogenicity , SARS-CoV-2/physiology , Viral Proteins/chemistry , Viral Proteins/physiology , COVID-19 Drug Treatment
20.
Nucleic Acids Res ; 49(D1): D266-D273, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33237325

ABSTRACT

CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully-classified domains structures (+15%), providing 500 238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web-pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH including SCOP, InterPro, Aquaria and 2DProt.


Subjects
Computational Biology/statistics & numerical data , Databases, Protein/statistics & numerical data , Protein Domains , Proteins/chemistry , Amino Acid Sequence , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Epidemics , Humans , Internet , Molecular Sequence Annotation , Proteins/genetics , Proteins/metabolism , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/physiology , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Viral Proteins/chemistry , Viral Proteins/genetics , Viral Proteins/metabolism