Results 1 - 20 of 56
1.
Genes (Basel) ; 12(11)2021 11 18.
Article in English | MEDLINE | ID: mdl-34828415

ABSTRACT

Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.


Subject(s)
Automation, Laboratory/methods , Genomics/methods , Sequence Alignment/methods , Algorithms , Animals , Automation, Laboratory/standards , Genomics/standards , Humans , Phylogeny , Sensitivity and Specificity , Sequence Alignment/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/standards
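The position weight matrix (PWM) named in the abstract above can be illustrated with a minimal sketch. This is not the author's refinement procedure: the function names, the pseudocount, and the toy alignment are assumptions made for illustration. It only shows how a PWM pools information from every sequence in an alignment and then scores a candidate segment against it.

```python
import math
from collections import Counter


def position_weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return one {symbol: log2-odds} dict per alignment column.

    Gaps and symbols outside `alphabet` are ignored. The pseudocount is an
    assumed smoothing choice, not a value taken from the abstract.
    """
    # Background frequencies pooled over every non-gap symbol in the alignment.
    pooled = Counter(ch for seq in aligned for ch in seq if ch in alphabet)
    total = sum(pooled.values())
    k = len(alphabet)
    background = {a: (pooled[a] + pseudocount) / (total + pseudocount * k)
                  for a in alphabet}

    pwm = []
    for column in zip(*aligned):
        counts = Counter(ch for ch in column if ch in alphabet)
        n = sum(counts.values())
        pwm.append({
            a: math.log2(((counts[a] + pseudocount) / (n + pseudocount * k))
                         / background[a])
            for a in alphabet
        })
    return pwm


def pwm_score(segment, pwm, start=0):
    """Sum the log-odds of `segment` placed at columns start, start+1, ..."""
    return sum(pwm[start + i][ch]
               for i, ch in enumerate(segment)
               if ch in pwm[start + i])


# Hypothetical toy alignment: four sequences, five columns.
toy_alignment = ["ACGTA", "ACGTA", "ACGTA", "ACCTA"]
toy_pwm = position_weight_matrix(toy_alignment)
```

A segment that agrees with the column consensus scores higher than one that does not, which is the signal a refinement step could use to choose between alternative placements of residues around a gap.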
2.
Genes (Basel) ; 12(5)2021 05 11.
Article in English | MEDLINE | ID: mdl-34064731

ABSTRACT

Protein ubiquitylation is an essential post-translational modification that plays a critical role in a wide range of biological functions, and even a degenerative role in certain diseases, and is consequently a promising target for the treatment of various diseases. Ubiquitylation sites can be identified by enzymatic approaches, mass spectrometry analysis, and combinations of multidimensional liquid chromatography and tandem mass spectrometry. However, these large-scale experimental screening techniques are time-consuming, expensive, and laborious. To overcome these drawbacks, machine learning and deep learning-based predictors have been considered for timely and cost-effective prediction. Several computational predictors have been published across species; however, predictors are species-specific because of the unclear patterns in different species. In this study, we proposed a novel approach for predicting plant ubiquitylation sites using a hybrid deep learning model that combines a convolutional neural network and long short-term memory. The proposed method uses the actual protein sequence and physicochemical properties as inputs to the model and provides more robust predictions. The proposed predictor achieved the best result with accuracy values of 80% and 81% and F-scores of 79% and 82% on the 10-fold cross-validation and an independent dataset, respectively. Moreover, we also compared our model with popular ubiquitylation predictors on the independent dataset; the results demonstrate that our model significantly outperforms the other methods in prediction classification results.


Subject(s)
Plant Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Ubiquitination , Amino Acid Motifs , Deep Learning , Plant Proteins/metabolism , Sensitivity and Specificity , Sequence Analysis, Protein/standards
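The accuracy and F-score figures reported in the abstract above are both derived from a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
def accuracy_and_f_score(tp, fp, tn, fn):
    """Return (accuracy, F1) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_accuracy, example_f_score = accuracy_and_f_score(tp=40, fp=10, tn=40, fn=10)
```

Accuracy counts correct calls in both classes, whereas the F-score ignores true negatives, so the two can diverge when modified and unmodified sites are unevenly represented.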
3.
PLoS One ; 16(5): e0248841, 2021.
Article in English | MEDLINE | ID: mdl-33939703

ABSTRACT

Linear motifs are short protein subsequences that mediate protein interactions. Hundreds of motif classes including thousands of motif instances are known. Our theory estimates how many motif classes remain undiscovered. As commonly done, we describe motif classes as regular expressions specifying motif length and the allowed amino acids at each motif position. We measure motif specificity for a pair of motif classes by quantifying how many motif-discriminating positions prevent a protein subsequence from matching the two classes at once. We derive theorems for the maximal number of motif classes that can simultaneously maintain a certain number of motif-discriminating positions between all pairs of classes in the motif universe, for a given amino acid alphabet. We also calculate the fraction of all protein subsequences that would belong to a motif class if all potential motif classes came into existence. Naturally occurring pairs of motif classes present most often a single motif-discriminating position. This mild specificity maximizes the potential number of coexisting motif classes, the expansion of the motif universe due to amino acid modifications and the fraction of amino acid sequences that code for a motif instance. As a result, thousands of linear motif classes may remain undiscovered.


Subject(s)
Amino Acid Motifs , Sequence Analysis, Protein/methods , Humans , Sensitivity and Specificity , Sequence Analysis, Protein/standards
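The abstract above describes motif classes as regular expressions and measures specificity by counting motif-discriminating positions. A minimal sketch of that idea follows, under simplifying assumptions that are not from the paper: both motif classes have the same length, they are compared position by position, and a position counts as discriminating when the two sets of allowed amino acids share no residue. The motif definitions are hypothetical.

```python
import re

ANY = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def to_regex(motif):
    """Turn a list of allowed-residue sets into a regular expression."""
    return "".join("[" + "".join(sorted(allowed)) + "]" for allowed in motif)


def discriminating_positions(motif_a, motif_b):
    """Count positions where no residue can satisfy both motif classes."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch only compares motifs of equal length")
    return sum(1 for a, b in zip(motif_a, motif_b) if not (a & b))


# Two hypothetical three-residue motif classes that differ at the last position.
motif_a = [set("RK"), ANY, set("L")]
motif_b = [set("RK"), ANY, set("FY")]

matches_a = re.fullmatch(to_regex(motif_a), "RAL") is not None
matches_b = re.fullmatch(to_regex(motif_b), "RAL") is not None
```

With one discriminating position, a subsequence such as `RAL` can belong to the first class but not to the second, which is the "mild specificity" case the abstract reports as most common.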
4.
Open Biol ; 11(2): 200173, 2021 02.
Article in English | MEDLINE | ID: mdl-33529550

ABSTRACT

It has become customary in engineering to require a modelling component in research endeavours. In addition, as the code for these models becomes more byzantine in complexity, it is difficult for reviewers and readers to discern their value and understand the underlying code. This opinion piece summarizes the negative experience of the author with the IPRO and OptMAVEn computational protein engineering models as well as problems with the optStoic metabolic pathway model. In our hands, these models often fail to predict reliable ways to engineer proteins and metabolic pathways.


Subject(s)
Computational Biology/methods , Protein Engineering/methods , Sequence Analysis, Protein/methods , Software/standards , Animals , Computational Biology/standards , Humans , Metabolic Networks and Pathways , Protein Engineering/standards , Sequence Analysis, Protein/standards
5.
Methods Mol Biol ; 2165: 69-81, 2020.
Article in English | MEDLINE | ID: mdl-32621219

ABSTRACT

Assessing the accuracy of 3D models has become a keystone in the protein structure prediction field. ModFOLD7 is our leading resource for Estimates of Model Accuracy (EMA), which has been upgraded by integrating a number of the pioneering pure-single- and quasi-single-model approaches. Such an integration has given our latest version the strengths to accurately score and rank predicted models, with higher consistency compared to older EMA methods. Additionally, the server provides three options for producing global score estimates, depending on the requirements of the user: (1) ModFOLD7_rank, which is optimized for ranking/selection, (2) ModFOLD7_cor, which is optimized for correlations of predicted and observed scores, and (3) ModFOLD7 global for balanced performance. ModFOLD7 has been ranked among the top few EMA methods according to independent blind testing by the CASP13 assessors. Another evaluation resource for ModFOLD7 is the CAMEO project, where the method is continuously automatically evaluated, showing a significant improvement compared to our previous versions. The ModFOLD7 server is freely available at http://www.reading.ac.uk/bioinf/ModFOLD/ .


Subject(s)
Molecular Dynamics Simulation/standards , Protein Conformation , Sequence Analysis, Protein/standards , Software/standards
6.
Methods Mol Biol ; 2165: 83-101, 2020.
Article in English | MEDLINE | ID: mdl-32621220

ABSTRACT

Intrinsically disordered regions (IDRs) are estimated to be highly abundant in nature. While only several thousand proteins are annotated with experimentally derived IDRs, computational methods can be used to predict IDRs for the millions of currently uncharacterized protein chains. Several dozen disorder predictors were developed over the last few decades. While some of these methods provide accurate predictions, unavoidably they also make some mistakes. Consequently, one of the challenges facing users of these methods is how to decide which predictions can be trusted and which are likely incorrect. This practical problem can be solved using quality assessment (QA) scores that predict correctness of the underlying (disorder) predictions at a residue level. We motivate and describe a first-of-its-kind toolbox of QA methods, QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions), which provides the scores for a diverse set of ten disorder predictors. QUARTER is available to the end users as a free and convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTER/ . We briefly describe the predictive architecture of QUARTER and provide detailed instructions on how to use the webserver. We also explain how to interpret results produced by QUARTER with the help of a case study.


Subject(s)
Intrinsically Disordered Proteins/chemistry , Protein Conformation , Sequence Analysis, Protein/methods , Software , Sequence Analysis, Protein/standards
7.
Genes (Basel) ; 11(3)2020 03 20.
Article in English | MEDLINE | ID: mdl-32245073

ABSTRACT

Although there are a number of bioinformatic tools to identify plant nucleotide-binding leucine-rich repeat (NLR) disease resistance genes based on conserved protein sequences, only a few of these tools have attempted to identify disease resistance genes that have not been annotated in the genome. The overall goal of the NLGenomeSweeper pipeline is to annotate NLR disease resistance genes, including RPW8, in the genome assembly with high specificity and a focus on complete functional genes. This is based on the identification of the complete NB-ARC domain, the most conserved domain of NLR genes, using the BLAST suite. In this way, the tool has a high specificity for complete genes and relatively intact pseudogenes. The tool returns all candidate NLR gene locations as well as InterProScan ORF and domain annotations for manual curation of the gene structure.


Subject(s)
Genomics/methods , NLR Proteins/genetics , Plant Proteins/genetics , Sequence Analysis, Protein/methods , Software/standards , Arabidopsis , Conserved Sequence , Disease Resistance , Genomics/standards , Helianthus , NLR Proteins/chemistry , Plant Proteins/chemistry , Protein Binding , Protein Domains , Sequence Analysis, Protein/standards
8.
Mol Genet Genomic Med ; 8(4): e1166, 2020 04.
Article in English | MEDLINE | ID: mdl-32096919

ABSTRACT

BACKGROUND: Different types of in silico approaches can be used to predict the phenotypic consequence of missense variants. Such algorithms are often categorized as sequence based or structure based, when they necessitate 3D structural information. In addition, many other in silico tools, not dedicated to the analysis of variants, can be used to gain additional insights about the possible mechanisms at play. METHODS: Here we applied different computational approaches to a set of 20 known missense variants present on different proteins (CYP, complement factor B, antithrombin and blood coagulation factor VIII). The tools that were used include fast computational approaches and web servers such as PolyPhen-2, PopMusic, DUET, MaestroWeb, SAAFEC, Missense3D, VarSite, FlexPred, PredyFlexy, Clustal Omega, meta-PPISP, FTMap, ClusPro, pyDock, PPM, RING, Cytoscape, and ChannelsDB. RESULTS: We observe some conflicting results among the methods but, most of the time, the combination of several engines helped to clarify the potential impacts of the amino acid substitutions. CONCLUSION: Combining different computational approaches, including some that were not developed to investigate missense variants, helps to predict the possible impact of the amino acid substitutions. Yet, when the modified residues are involved in a salt-bridge, the tools tend to fail, even when the analysis is performed in 3D. Thus, interactive structural analysis with molecular graphics packages such as Chimera, PyMol, or others is still needed to clarify automatic prediction.


Subject(s)
Molecular Dynamics Simulation/standards , Mutation, Missense , Sequence Analysis, Protein/methods , Software/standards , Blood Coagulation Factors/chemistry , Blood Coagulation Factors/genetics , Cytochrome P-450 Enzyme System/chemistry , Cytochrome P-450 Enzyme System/genetics , Humans , Sequence Analysis, Protein/standards
9.
Gigascience ; 9(2)2020 02 01.
Article in English | MEDLINE | ID: mdl-32034905

ABSTRACT

BACKGROUND: Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation. RESULTS: Here we demonstrate one approach to generate portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept uses our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, and the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax, and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach supports the generation of annotation that is identical to that generated by our own in-house pipeline, using standard, off-the-shelf solutions, and is applicable to any genome or proteome annotation pipeline. CONCLUSIONS: HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.


Subject(s)
Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Software/standards , Animals , Genomics/standards , Humans , Molecular Sequence Annotation/standards , Sequence Analysis, DNA/standards , Sequence Analysis, Protein/standards
10.
Genomics ; 112(2): 1941-1946, 2020 03.
Article in English | MEDLINE | ID: mdl-31740293

ABSTRACT

In this paper, a step-by-step classification algorithm based on a double-layer SVM model is constructed to predict the secondary structure of proteins. The most important feature of this algorithm is that it improves the prediction accuracy of the α+β and α/β classes by transforming the prediction of these two classes of proteins, which had low accuracy in the past, into the prediction of all-α and all-β classes with high accuracy. A widely used dataset, the 25PDB dataset with sequence similarity lower than 40%, is used to evaluate this method. The results show that this method has good performance and that, while maintaining the accuracy of the other three structural classes of proteins, the accuracy of α+β class proteins is improved significantly.


Subject(s)
Sequence Analysis, Protein/methods , Support Vector Machine , Animals , Humans , Protein Conformation, alpha-Helical , Protein Conformation, beta-Strand , Sequence Analysis, Protein/standards
11.
Cell Syst ; 8(4): 292-301.e3, 2019 04 24.
Article in English | MEDLINE | ID: mdl-31005579

ABSTRACT

Predicting protein structure from sequence is a central challenge of biochemistry. Co-evolution methods show promise, but an explicit sequence-to-structure map remains elusive. Advances in deep learning that replace complex, human-designed pipelines with differentiable models optimized end to end suggest the potential benefits of similarly reformulating structure prediction. Here, we introduce an end-to-end differentiable model for protein structure learning. The model couples local and global protein structure via geometric units that optimize global geometry without violating local covalent chemistry. We test our model using two challenging tasks: predicting novel folds without co-evolutionary data and predicting known folds without structural templates. In the first task, the model achieves state-of-the-art accuracy, and in the second, it comes within 1-2 Å; competing methods using co-evolution and experimental templates have been refined over many years, and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.


Subject(s)
Machine Learning , Sequence Analysis, Protein/methods , Software , Evolution, Molecular , Protein Folding , Sequence Analysis, Protein/standards
12.
Int J Mol Sci ; 19(12)2018 Nov 22.
Article in English | MEDLINE | ID: mdl-30469512

ABSTRACT

Signal peptides are N-terminal presequences responsible for targeting proteins to the endomembrane system, and subsequent subcellular or extracellular compartments, and consequently condition their proper function. The significance of signal peptides stimulates development of new computational methods for their detection. These methods employ learning systems trained on datasets comprising signal peptides from different types of proteins and taxonomic groups. As a result, the accuracy of predictions is high in the case of signal peptides that are well-represented in databases, but might be low in other, atypical cases. Such atypical signal peptides are present in proteins found in apicomplexan parasites, causative agents of malaria and toxoplasmosis. Apicomplexan proteins have a unique amino acid composition due to their AT-biased genomes. Therefore, we designed a new, more flexible and universal probabilistic model for recognition of atypical eukaryotic signal peptides. Our approach, called signalHsmm, includes knowledge about the structure of signal peptides and physicochemical properties of amino acids. It is able to recognize signal peptides from the malaria parasites and related species more accurately than popular programs. Moreover, it is still universal enough to provide prediction of other signal peptides on par with the best-performing predictors.


Subject(s)
Plasmodium/chemistry , Protein Sorting Signals , Protozoan Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acids/chemistry , Markov Chains , Sequence Analysis, Protein/standards
13.
Int J Mol Sci ; 19(10)2018 Oct 07.
Article in English | MEDLINE | ID: mdl-30301243

ABSTRACT

Using computational techniques to identify intrinsically disordered residues is practical and effective in biological studies. Therefore, designing novel high-accuracy strategies is always preferable when existing strategies have a lot of room for improvement. Among many possibilities, a meta-strategy that integrates the results of multiple individual predictors has been broadly used to improve the overall performance of predictors. Nonetheless, a simple and direct integration of individual predictors may not effectively improve the performance. In this project, dual-threshold two-step significance voting and neural networks were used to integrate the predictive results of four individual predictors, including: DisEMBL, IUPred, VSL2, and ESpritz. The new meta-strategy has improved the prediction performance of intrinsically disordered residues significantly, compared to all four individual predictors and another four recently-designed predictors. The improvement was validated using five-fold cross-validation and in independent test datasets.


Subject(s)
Intrinsically Disordered Proteins/chemistry , Neural Networks, Computer , Sequence Analysis, Protein/methods , Humans , Intrinsically Disordered Proteins/metabolism , Sequence Analysis, Protein/standards , Software
14.
Int J Mol Sci ; 19(6)2018 Jun 15.
Article in English | MEDLINE | ID: mdl-29914091

ABSTRACT

Bacteriophages, which are tremendously important to the ecology and evolution of bacteria, play a key role in the development of genetic engineering. Bacteriophage virion proteins are essential materials of the infectious viral particles and are in charge of several biological functions. The correct identification of bacteriophage virion proteins is of great importance for understanding both life at the molecular level and genetic evolution. However, few computational methods are available for identifying bacteriophage virion proteins. In this paper, we proposed a new method to predict bacteriophage virion proteins using a Multinomial Naïve Bayes classification model based on discrete features generated from the g-gap feature tree. The accuracy of the proposed model reaches 98.37% with MCC of 96.27% in 10-fold cross-validation. This result suggests that the proposed method can be a useful approach in identifying bacteriophage virion proteins from sequence information. For the convenience of experimental scientists, a web server (PhagePred) that implements the proposed predictor is available, which can be freely accessed on the Internet.


Subject(s)
Bacteriophages/chemistry , Sequence Analysis, Protein/methods , Viral Structural Proteins/chemistry , Bayes Theorem , Sequence Analysis, Protein/standards , Software
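The abstract above builds its features from "g-gap" information but does not define it. The sketch below assumes the usual meaning of the term, namely the frequency of residue pairs separated by exactly g intervening positions. It does not reproduce the paper's feature tree, discretization, or Multinomial Naïve Bayes model, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_frequencies(sequence, g):
    """Return the frequency of each residue pair separated by g positions.

    g = 0 gives adjacent dipeptides. Pairs that contain a non-standard
    residue are skipped. The result always has 400 entries.
    """
    pairs = Counter()
    for i in range(len(sequence) - g - 1):
        first, second = sequence[i], sequence[i + g + 1]
        if first in AMINO_ACIDS and second in AMINO_ACIDS:
            pairs[first + second] += 1
    total = sum(pairs.values())
    return {a + b: (pairs[a + b] / total if total else 0.0)
            for a, b in product(AMINO_ACIDS, repeat=2)}


# Toy sequence, for illustration only.
features_g1 = g_gap_dipeptide_frequencies("ACAC", g=1)
```

Each value of g yields a fixed-length vector of 400 frequencies, so sequences of different lengths become directly comparable inputs for a classifier.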
15.
BMC Res Notes ; 11(1): 328, 2018 May 21.
Article in English | MEDLINE | ID: mdl-29784028

ABSTRACT

OBJECTIVE: BLOSUM matrices serve as standard matrices for many protein sequence alignment programs. BLOSUM matrices have been constructed using BLOCKS version 5.0 with 27,102 BLOCKS, whereas the latest updated version 14.3 has 6,739,916 BLOCKS. We read with interest the research article by Hess et al. (BMC Bioinform 17:189, 2016) on CorBLOSUM, wherein it is argued that an inaccuracy in the BLOSUM code affects the cluster memberships of sequences. They show that replacing the integer-based clustering threshold with a floating-point threshold arguably improves the performance of CorBLOSUM over BLOSUM and RBLOSUM matrices. They compare BLOSUM62 (14.3) against RBLOSUM69, with relative entropies of 0.2685 and 0.2662 respectively. The present work attempts to repeat the computation to verify the respective analog matrices. RESULTS: In our attempt to repeat the computation, we observed that the relative entropy of BLOSUM62 (14.3) is 0.2360 and that of BLOSUM50 (14.3) is 0.1198. As only matrices of similar entropies can be compared, BLOSUM62 can be compared only with RBLOSUM66 and BLOSUM50 can be compared only with RBLOSUM56. We conducted experiments with Astral data sets, and demonstrated the improved accuracy in the coverage. Our results imply that RBLOSUM performs statistically better than CorBLOSUM and BLOSUM matrices.


Subject(s)
Algorithms , Computational Biology/methods , Databases, Protein , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/standards
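The abstract above compares matrices by relative entropy without giving the formula. The sketch below uses the standard definition for log-odds substitution matrices: the sum over unordered residue pairs of the observed pair frequency times the log-odds of observed to expected frequency. Logarithm base 2 (bits) is an assumption here, as are the function name and the two-letter toy alphabet. It is not the code used in the study.

```python
import math


def relative_entropy(pair_freqs):
    """Relative entropy H = sum of q_ab * log2(q_ab / e_ab).

    `pair_freqs` maps unordered pairs (a, b) to observed frequencies q_ab
    that sum to 1. Expected frequencies come from the marginals:
    e_aa = p_a ** 2 and e_ab = 2 * p_a * p_b for a != b.
    """
    marginal = {}
    for (a, b), q in pair_freqs.items():
        if a == b:
            marginal[a] = marginal.get(a, 0.0) + q
        else:
            # A mixed pair contributes half its frequency to each residue.
            marginal[a] = marginal.get(a, 0.0) + q / 2
            marginal[b] = marginal.get(b, 0.0) + q / 2

    entropy = 0.0
    for (a, b), q in pair_freqs.items():
        if q == 0:
            continue
        expected = marginal[a] ** 2 if a == b else 2 * marginal[a] * marginal[b]
        entropy += q * math.log2(q / expected)
    return entropy


# Toy two-letter alphabets: pairs at chance level versus fully conserved pairs.
chance_level = relative_entropy({("A", "A"): 0.25, ("A", "B"): 0.5, ("B", "B"): 0.25})
fully_conserved = relative_entropy({("A", "A"): 0.5, ("B", "B"): 0.5})
```

When observed pairs match chance expectation the entropy is zero, and it grows as the observed pairs become more conserved. This is why the abstract treats only matrices of similar entropy as comparable.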
16.
Biomolecules ; 8(1)2018 03 14.
Article in English | MEDLINE | ID: mdl-29538331

ABSTRACT

It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.


Subject(s)
Machine Learning , Molecular Docking Simulation/standards , Protein Interaction Mapping/methods , Sequence Analysis, Protein/standards , Protein Interaction Mapping/standards
17.
Int J Mol Sci ; 19(1)2018 Jan 08.
Article in English | MEDLINE | ID: mdl-29316706

ABSTRACT

The function of a protein is of great interest in the cutting-edge research of biological mechanisms, disease development and drug/target discovery. Besides experimental explorations, a variety of computational methods have been designed to predict protein function. Among these in silico methods, the prediction of BLAST is based on protein sequence similarity, while that of machine learning is also based on the sequence, but without the consideration of their similarity. This unique characteristic of machine learning makes it a good complement to BLAST and many other approaches in predicting the function of remotely relevant proteins and the homologous proteins of distinct function. However, the identification accuracies of these in silico methods and their false discovery rates have not yet been assessed, which greatly limits the usage of these algorithms. Herein, a comprehensive comparison of the performances among four popular prediction algorithms (BLAST, SVM, PNN and KNN) was conducted. In particular, the performance of these methods was systematically assessed by four standard statistical indexes based on the independent test datasets of 93 functional protein families defined by UniProtKB keywords. Moreover, the false discovery rates of these algorithms were evaluated by scanning the genomes of four representative model organisms (Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae and Mycobacterium tuberculosis). As a result, the substantially higher sensitivity of SVM and BLAST was observed compared with that of PNN and KNN. However, the machine learning algorithms (PNN, KNN and SVM) were found capable of substantially reducing the false discovery rate (SVM < PNN < KNN). In sum, this study comprehensively assessed the performance of four popular algorithms applied to protein function prediction, which could facilitate the selection of the most appropriate method in the related biomedical research.


Subject(s)
Sequence Analysis, Protein/standards , Software , Machine Learning , Proteomics/methods , Proteomics/standards , Reproducibility of Results , Sequence Analysis, Protein/methods
18.
J Comput Biol ; 25(3): 361-373, 2018 03.
Article in English | MEDLINE | ID: mdl-28891684

ABSTRACT

Identifying the interaction between drugs and target proteins is an important area of drug research, which provides a broad prospect for low-risk and faster drug development. However, due to the limitations of traditional experiments when revealing drug-target interactions (DTIs), the screening of targets not only takes a lot of time and money but also has high false-positive and false-negative rates. Therefore, it is imperative to develop effective automatic computational methods to accurately predict DTIs in the postgenome era. In this article, we propose a new computational method for predicting DTIs from drug molecular structure and protein sequence by using the stacked autoencoder of deep learning, which can adequately extract the raw data information. The proposed method has the advantage that it can automatically mine the hidden information from protein sequences and generate highly representative features through iterations of multiple layers. The feature descriptors are then constructed by combining the molecular substructure fingerprint information, and fed into the rotation forest for accurate prediction. The experimental results of fivefold cross-validation indicate that the proposed method achieves superior performance on gold standard data sets (enzymes, ion channels, GPCRs [G-protein-coupled receptors], and nuclear receptors) with accuracy of 0.9414, 0.9116, 0.8669, and 0.8056, respectively. We further comprehensively explore the performance of the proposed method by comparing it with other feature extraction algorithms, state-of-the-art classifiers, and other excellent methods on the same data set. The excellent comparison results demonstrate that the proposed method is highly competitive when predicting drug-target interactions.


Subject(s)
Deep Learning , Molecular Docking Simulation/methods , Sequence Analysis, Protein/methods , Databases, Chemical , Molecular Docking Simulation/standards , Protein Binding , Reproducibility of Results , Sequence Analysis, Protein/standards
19.
Genome Med ; 9(1): 113, 2017 Dec 18.
Article in English | MEDLINE | ID: mdl-29254494

ABSTRACT

The translation of personal genomics to precision medicine depends on the accurate interpretation of the multitude of genetic variants observed for each individual. However, even when genetic variants are predicted to modify a protein, their functional implications may be unclear. Many diseases are caused by genetic variants affecting important protein features, such as enzyme active sites or interaction interfaces. The scientific community has catalogued millions of genetic variants in genomic databases and thousands of protein structures in the Protein Data Bank. Mapping mutations onto three-dimensional (3D) structures enables atomic-level analyses of protein positions that may be important for the stability or formation of interactions; these may explain the effect of mutations and in some cases even open a path for targeted drug development. To accelerate progress in the integration of these data types, we held a two-day Gene Variation to 3D (GVto3D) workshop to report on the latest advances and to discuss unmet needs. The overarching goal of the workshop was to address the question: what can be done together as a community to advance the integration of genetic variants and 3D protein structures that could not be done by a single investigator or laboratory? Here we describe the workshop outcomes, review the state of the field, and propose the development of a framework with which to promote progress in this arena. The framework will include a set of standard formats, common ontologies, a common application programming interface to enable interoperation of the resources, and a Tool Registry to make it easy to find and apply the tools to specific analysis problems. Interoperability will enable integration of diverse data sources and tools and collaborative development of variant effect prediction methods.


Subject(s)
Genome-Wide Association Study/methods , Polymorphism, Genetic , Protein Conformation , Sequence Analysis, Protein/methods , Algorithms , Congresses as Topic , Genome-Wide Association Study/standards , Humans , Sequence Analysis, Protein/standards
20.
Structure ; 25(12): 1916-1927, 2017 12 05.
Article in English | MEDLINE | ID: mdl-29174494

ABSTRACT

The Worldwide PDB recently launched a deposition, biocuration, and validation tool: OneDep. At various stages of OneDep data processing, validation reports for three-dimensional structures of biological macromolecules are produced. These reports are based on recommendations of expert task forces representing crystallography, nuclear magnetic resonance, and cryoelectron microscopy communities. The reports provide useful metrics with which depositors can evaluate the quality of the experimental data, the structural model, and the fit between them. The validation module is also available as a stand-alone web server and as a programmatically accessible web service. A growing number of journals require the official wwPDB validation reports (produced at biocuration) to accompany manuscripts describing macromolecular structures. Upon public release of the structure, the validation report becomes part of the public PDB archive. Geometric quality scores for proteins in the PDB archive have improved over the past decade.


Subject(s)
Databases, Protein/standards , Validation Studies as Topic , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/standards