Results 1 - 20 of 25
1.
J Med Internet Res ; 17(6): e154, 2015 Jun 19.
Article in English | MEDLINE | ID: mdl-26091775

ABSTRACT

BACKGROUND: User content posted through Twitter has been used for biosurveillance, to characterize public perception of health-related topics, and as a means of distributing information to the general public. Most of the existing work surrounding Twitter and health care has shown Twitter to be an effective medium for these problems but more could be done to provide finer and more efficient access to all pertinent data. Given the diversity of user-generated content, small samples or summary presentations of the data arguably omit a large part of the virtual discussion taking place in the Twittersphere. Still, managing, processing, and querying large amounts of Twitter data is not a trivial task. This work describes tools and techniques capable of handling larger sets of Twitter data and demonstrates their use with the issue of antibiotics. OBJECTIVE: This work has two principal objectives: (1) to provide an open-source means to efficiently explore all collected tweets and query health-related topics on Twitter, specifically, questions such as what users are saying and how messages are spread, and (2) to characterize the larger discourse taking place on Twitter with respect to antibiotics. METHODS: Open-source software suites Hadoop, Flume, and Hive were used to collect and query a large number of Twitter posts. To classify tweets by topic, a deep network classifier was trained using a limited number of manually classified tweets. The particular machine learning approach used also allowed the use of a large number of unclassified tweets to increase performance. RESULTS: Query-based analysis of the collected tweets revealed that a large number of users contributed to the online discussion and that a frequent topic mentioned was resistance. A number of prominent events related to antibiotics led to a number of spikes in activity but these were short in duration. 
The category-based classifier developed was able to correctly classify 70% of manually labeled tweets (using a 10-fold cross validation procedure and 9 classes). The classifier also performed well when evaluated on a per category basis. CONCLUSIONS: Using existing tools such as Hive, Flume, Hadoop, and machine learning techniques, it is possible to construct tools and workflows to collect and query large amounts of Twitter data to characterize the larger discussion taking place on Twitter with respect to a particular health-related topic. Furthermore, using newer machine learning techniques and a limited number of manually labeled tweets, an entire body of collected tweets can be classified to indicate what topics are driving the virtual, online discussion. The resulting classifier can also be used to efficiently explore collected tweets by category and search for messages of interest or exemplary content.
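
The 10-fold cross-validation quoted in the results can be sketched as below. This is an illustrative outline only, not the authors' pipeline: the function name `k_fold_indices` and the round-robin assignment of items to folds are assumptions made for the example. Each labeled item is held out exactly once, and the reported accuracy would be computed on the held-out items.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds round-robin (hypothetical choice);
    a real study would usually shuffle or stratify by class first.
    """
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        # Training set = every fold except the held-out one.
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), "training items, held out:", test)
```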


Subjects
Anti-Bacterial Agents , Drug Resistance, Microbial , Internet , Public Opinion , Social Media , Attitude to Health , Humans , Information Dissemination , Machine Learning , Software
2.
Int J Mol Sci ; 16(7): 15384-404, 2015 Jul 07.
Article in English | MEDLINE | ID: mdl-26198229

ABSTRACT

Protein disordered regions are segments of a protein chain that do not adopt a stable structure. Thus far, a variety of protein disorder prediction methods have been developed and have been widely used, not only in traditional bioinformatics domains, including protein structure prediction, protein structure determination and function annotation, but also in many other biomedical fields. The relationship between intrinsically-disordered proteins and some human diseases has played a significant role in disorder prediction in disease identification and epidemiological investigations. Disordered proteins can also serve as potential targets for drug discovery with an emphasis on the disordered-to-ordered transition in the disordered binding regions, and this has led to substantial research in drug discovery or design based on protein disordered region prediction. Furthermore, protein disorder prediction has also been applied to healthcare by predicting the disease risk of mutations in patients and studying the mechanistic basis of diseases. As the applications of disorder prediction increase, so too does the need to make quick and accurate predictions. To fill this need, we also present a new approach to predict protein residue disorder using wide sequence windows that is applicable on the genomic scale.


Subjects
Computational Biology/methods , Proteins/chemistry , Area Under Curve , Databases, Protein , Drug Design , Drug Discovery , Intrinsically Disordered Proteins/chemistry , Neural Networks, Computer
3.
BMC Bioinformatics ; 14: 88, 2013 Mar 06.
Article in English | MEDLINE | ID: mdl-23497251

ABSTRACT

BACKGROUND: A number of proteins contain regions which do not adopt a stable tertiary structure in their native state. Such regions known as disordered regions have been shown to participate in many vital cell functions and are increasingly being examined as drug targets. RESULTS: This work presents a new sequence based approach for the prediction of protein disorder. The method uses boosted ensembles of deep networks to make predictions and participated in the CASP10 experiment. In a 10 fold cross validation procedure on a dataset of 723 proteins, the method achieved an average balanced accuracy of 0.82 and an area under the ROC curve of 0.90. These results are achieved in part by a boosting procedure which is able to steadily increase balanced accuracy and the area under the ROC curve over several rounds. The method also compared competitively when evaluated against a number of state-of-the-art disorder predictors on CASP9 and CASP10 benchmark datasets. CONCLUSIONS: DNdisorder is available as a web service at http://iris.rnet.missouri.edu/dndisorder/.
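
The two figures of merit quoted above can be computed per residue roughly as follows. This is a minimal sketch, not the DNdisorder evaluation code: it assumes binary labels (1 = disordered, 0 = ordered), takes balanced accuracy as the mean of sensitivity and specificity, and computes the area under the ROC curve through the rank-based (Mann-Whitney) formulation. The function names and toy data are illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = disordered)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0]                 # toy per-residue annotation
    scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]     # toy predicted disorder scores
    calls = [1 if s >= 0.5 else 0 for s in scores]
    print(balanced_accuracy(labels, calls), roc_auc(labels, scores))
```

Balanced accuracy is commonly preferred over plain accuracy for disorder prediction because ordered residues typically far outnumber disordered ones.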


Subjects
Protein Conformation , Sequence Analysis, Protein/methods , Proteins/chemistry , Software
4.
BMC Bioinformatics ; 14 Suppl 14: S12, 2013.
Article in English | MEDLINE | ID: mdl-24267585

ABSTRACT

BACKGROUND: In recent years, the use and importance of predicted protein residue-residue contacts has grown considerably with demonstrated applications such as drug design, protein tertiary structure prediction and model quality assessment. Nevertheless, reported accuracies in the range of 25-35% stubbornly remain the norm for sequence based, long range contact predictions on hard targets. This is in spite of a prolonged effort on behalf of the community to improve the performance of residue-residue contact prediction. A thorough study of the quality of current residue-residue contact predictions and the evaluation metrics used as well as an analysis of current methods is needed to stimulate further advancement in contact prediction and its application. Such a study will better explain the quality and nature of residue-residue contact predictions generated by current methods and as a result lead to better use of this contact information. RESULTS: We evaluated several sequence based residue-residue contact predictors that participated in the tenth Critical Assessment of protein Structure Prediction (CASP) experiment. The evaluation was performed using standard assessment techniques such as those used by the official CASP assessors as well as two novel evaluation metrics (i.e., cluster accuracy and cluster count). An in-depth analysis revealed that while most residue-residue contact predictions generated are not accurate at the residue level, there is quite a strong contact signal present when allowing for less than residue level precision. Our residue-residue contact predictor, DNcon, performed particularly well achieving an accuracy of 66% for the top L/10 long range contacts when evaluated in a neighbourhood of size 2. The coverage of residue-residue contact areas was also greater with DNcon when compared to other methods. We also provide an analysis of DNcon with respect to its underlying architecture and features used for classification. 
CONCLUSIONS: Our novel evaluation metrics demonstrate that current residue-residue contact predictions do contain a strong contact signal and are of better quality than standard evaluation metrics indicate. Our method, DNcon, is a robust, state-of-the-art residue-residue sequence based contact predictor and excelled under a number of evaluation schemes. It is available as a web service at http://iris.rnet.missouri.edu/dncon/.
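
The idea of scoring the top L/10 long range contacts "in a neighbourhood" can be illustrated with the sketch below. It is not the DNcon evaluation code. Two assumptions are made for the example: long range means a sequence separation of at least 24 residues (the usual CASP convention), and a prediction (i, j) counts as correct under neighbourhood size d if some true contact (i', j') satisfies |i - i'| <= d and |j - j'| <= d. The abstract does not spell out its exact definition, so the second rule is a plausible reading rather than the published one.

```python
def top_contact_accuracy(predictions, true_contacts, length,
                         divisor=10, min_separation=24, neighbourhood=0):
    """Accuracy of the top L/divisor long-range predictions.

    predictions   : list of (i, j, score) tuples
    true_contacts : set of (i, j) pairs with i < j
    length        : sequence (or domain) length L
    neighbourhood : tolerance d; 0 means exact residue-level matching
    """
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_separation]
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[:max(1, length // divisor)]
    if not top:
        return 0.0

    def is_hit(i, j):
        a, b = min(i, j), max(i, j)
        for di in range(-neighbourhood, neighbourhood + 1):
            for dj in range(-neighbourhood, neighbourhood + 1):
                if (a + di, b + dj) in true_contacts:
                    return True
        return False

    hits = sum(1 for i, j, _ in top if is_hit(i, j))
    return hits / len(top)


if __name__ == "__main__":
    preds = [(1, 30, 0.9), (5, 35, 0.8), (2, 10, 0.95),
             (8, 38, 0.7), (3, 40, 0.6), (10, 36, 0.5)]
    truth = {(1, 30), (6, 36), (20, 30)}
    print(top_contact_accuracy(preds, truth, length=40))
    print(top_contact_accuracy(preds, truth, length=40, neighbourhood=2))
```

On this toy input the exact-match accuracy is lower than the neighbourhood-tolerant accuracy, which mirrors the abstract's point that predictions carry more signal than residue-level scoring suggests.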


Subjects
Computational Biology/methods , Proteins/chemistry , Models, Molecular , Protein Structure, Tertiary , Sequence Analysis, Protein
5.
BMC Struct Biol ; 13: 2, 2013 Feb 27.
Article in English | MEDLINE | ID: mdl-23442819

ABSTRACT

BACKGROUND: Predicting protein structure from sequence is one of the most significant and challenging problems in bioinformatics. Numerous bioinformatics techniques and tools have been developed to tackle almost every aspect of protein structure prediction ranging from structural feature prediction, template identification and query-template alignment to structure sampling, model quality assessment, and model refinement. How to synergistically select, integrate and improve the strengths of the complementary techniques at each prediction stage and build a high-performance system is becoming a critical issue for constructing a successful, competitive protein structure predictor. RESULTS: Over the past several years, we have constructed a standalone protein structure prediction system MULTICOM that combines multiple sources of information and complementary methods at all five stages of the protein structure prediction process including template identification, template combination, model generation, model assessment, and model refinement. The system was blindly tested during the ninth Critical Assessment of Techniques for Protein Structure Prediction (CASP9) in 2010 and yielded very good performance. In addition to studying the overall performance on the CASP9 benchmark, we thoroughly investigated the performance and contributions of each component at each stage of prediction. CONCLUSIONS: Our comprehensive and comparative study not only provides useful and practical insights about how to select, improve, and integrate complementary methods to build a cutting-edge protein structure prediction system but also identifies a few new sources of information that may help improve the design of a protein structure prediction system. Several components used in the MULTICOM system are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.


Subjects
Proteins/chemistry , Software , Benchmarking , Computational Biology , Databases, Protein , Internet , Protein Structure, Tertiary , User-Computer Interface
6.
Bioinformatics ; 28(23): 3066-72, 2012 Dec 01.
Article in English | MEDLINE | ID: mdl-23047561

ABSTRACT

MOTIVATION: Protein residue-residue contacts continue to play a larger and larger role in protein tertiary structure modeling and evaluation. Yet, while the importance of contact information increases, the performance of sequence-based contact predictors has improved slowly. New approaches and methods are needed to spur further development and progress in the field. RESULTS: Here we present DNCON, a new sequence-based residue-residue contact predictor using deep networks and boosting techniques. Making use of graphical processing units and CUDA parallel computing technology, we are able to train large boosted ensembles of residue-residue contact predictors achieving state-of-the-art performance. AVAILABILITY: The web server of the prediction method (DNCON) is available at http://iris.rnet.missouri.edu/dncon/. CONTACT: chengji@missouri.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Protein Structure, Tertiary , Proteins/chemistry , Artificial Intelligence , Internet , Models, Statistical
7.
BMC Bioinformatics ; 13: 65, 2012 Apr 30.
Article in English | MEDLINE | ID: mdl-22545707

ABSTRACT

BACKGROUND: As genome sequencing is becoming routine in biomedical research, the total number of protein sequences is increasing exponentially, recently reaching over 108 million. However, only a tiny portion of these proteins (i.e. ~75,000 or < 0.07%) have solved tertiary structures determined by experimental techniques. The gap between protein sequence and structure continues to enlarge rapidly as the throughput of genome sequencing techniques is much higher than that of protein structure determination techniques. Computational software tools for predicting protein structure and structural features from protein sequences are crucial to make use of this vast repository of protein resources. RESULTS: To meet the need, we have developed a comprehensive MULTICOM toolbox consisting of a set of protein structure and structural feature prediction tools. These tools include secondary structure prediction, solvent accessibility prediction, disorder region prediction, domain boundary prediction, contact map prediction, disulfide bond prediction, beta-sheet topology prediction, fold recognition, multiple template combination and alignment, template-based tertiary structure modeling, protein model quality assessment, and mutation stability prediction. CONCLUSIONS: These tools have been rigorously tested by many users in the last several years and/or during the last three rounds of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7-9) from 2006 to 2010, achieving state-of-the-art or near performance. In order to facilitate bioinformatics research and technological development in the field, we have made the MULTICOM toolbox freely available as web services and/or software packages for academic use and scientific research. It is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.


Subjects
Protein Structure, Tertiary , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Computational Biology/methods , Markov Chains , Models, Molecular , Protein Folding
8.
Bioinformatics ; 27(12): 1715-6, 2011 Jun 15.
Article in English | MEDLINE | ID: mdl-21546397

ABSTRACT

SUMMARY: We built a web server named APOLLO, which can evaluate the absolute global and local qualities of a single protein model using machine learning methods or the global and local qualities of a pool of models using a pair-wise comparison approach. Based on our evaluations on 107 CASP9 (Critical Assessment of Techniques for Protein Structure Prediction) targets, the predicted quality scores generated from our machine learning and pair-wise methods have an average per-target correlation of 0.671 and 0.917, respectively, with the true model quality scores. Based on our test on 92 CASP9 targets, our predicted absolute local qualities have an average difference of 2.60 Å with the actual distances to native structure. AVAILABILITY: http://sysbio.rnet.missouri.edu/apollo/. Single and pair-wise global quality assessment software is also available at the site.


Subjects
Models, Molecular , Protein Conformation , Software , Artificial Intelligence , Proteins/chemistry , Quality Control
9.
BMC Bioinformatics ; 12: 43, 2011 Feb 01.
Article in English | MEDLINE | ID: mdl-21284866

ABSTRACT

BACKGROUND: Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved. RESULTS: We developed a new method to integrate the classification power of machine learning with evolutionary signals embedded in protein families in order to improve protein domain boundary prediction. The method first extracts putative domain boundary signals from a multiple sequence alignment between a query sequence and its homologs. The putative sites are then classified and scored by support vector machines in conjunction with input features such as sequence profiles, secondary structures, solvent accessibilities around the sites and their positions. The method was evaluated on a domain benchmark by 10-fold cross-validation and 60% of true domain boundaries can be recalled at a precision of 60%. The trade-off between the precision and recall can be adjusted according to specific needs by using different decision thresholds on the domain boundary scores assigned by the support vector machines. CONCLUSIONS: The good prediction accuracy and the flexibility of selecting domain boundary sites at different precision and recall values make our method a useful tool for protein structure determination and modelling. The method is available at http://sysbio.rnet.missouri.edu/dobo/.
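
The precision/recall trade-off described above can be made concrete with a small sketch. It is illustrative only and not the published method: it assumes each putative boundary site already carries a classifier score and a true/false label, and it ignores any positional tolerance window the authors may have used when matching predicted to true boundaries. The function name and toy values are hypothetical.

```python
def precision_recall_at(scores, is_boundary, threshold):
    """Precision and recall when sites scoring >= threshold are called boundaries."""
    called = [s >= threshold for s in scores]
    tp = sum(1 for c, t in zip(called, is_boundary) if c and t)
    fp = sum(1 for c, t in zip(called, is_boundary) if c and not t)
    fn = sum(1 for c, t in zip(called, is_boundary) if not c and t)
    # With no calls at all, precision is set to 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    site_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]          # toy classifier scores
    site_labels = [True, False, True, True, False, False]  # toy ground truth
    for cutoff in (0.85, 0.5, 0.2):
        print(cutoff, precision_recall_at(site_scores, site_labels, cutoff))
```

Lowering the cutoff recalls more true boundaries at the cost of more false calls, which is the adjustment the abstract refers to.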


Subjects
Artificial Intelligence , Evolution, Molecular , Protein Structure, Tertiary , Protein Structure, Secondary , Proteins/chemistry , Sequence Alignment , Sequence Analysis, Protein
10.
BMC Struct Biol ; 11: 38, 2011 Oct 12.
Article in English | MEDLINE | ID: mdl-21989082

ABSTRACT

BACKGROUND: Protein residue-residue contact prediction is important for protein model generation and model evaluation. Here we develop a conformation ensemble approach to improve residue-residue contact prediction. We collect a number of structural models stemming from a variety of methods and implementations. The various models capture slightly different conformations and contain complementary information which can be pooled together to capture recurrent, and therefore more likely, residue-residue contacts. RESULTS: We applied our conformation ensemble approach to free modeling targets from both CASP8 and CASP9. Given a diverse ensemble of models, the method is able to achieve accuracies of 0.48 for the top L/5 medium range contacts and 0.36 for the top L/5 long range contacts for CASP8 targets (L being the target domain length). When applied to targets from CASP9, the accuracies of the top L/5 medium and long range contact predictions were 0.34 and 0.30 respectively. CONCLUSIONS: When operating on a moderately diverse ensemble of models, the conformation ensemble approach is an effective means to identify medium and long range residue-residue contacts. An immediate benefit of the method is that when tied with a scoring scheme, it can be used to successfully rank models.
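
Pooling contacts across an ensemble of models might look like the sketch below. It is a hedged illustration rather than the authors' implementation: it assumes one representative atom per residue, an 8 Å distance cutoff (the usual CASP contact definition), a minimum sequence separation of 12 for medium range and beyond, and a score equal to the fraction of models in which a contact occurs. All names and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def contacts_in_model(coords, cutoff=8.0, min_separation=12):
    """Return the set of (i, j) residue pairs closer than `cutoff` in one model."""
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def consensus_contacts(models, cutoff=8.0, min_separation=12):
    """Rank residue pairs by the fraction of models in which they are in contact."""
    counts = Counter()
    for coords in models:
        counts.update(contacts_in_model(coords, cutoff, min_separation))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pair, count / len(models)) for pair, count in ranked]


if __name__ == "__main__":
    # Three toy four-residue "models"; separation lowered to 2 so pairs exist.
    model_a = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (0, 5, 0)]
    model_b = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (12, 5, 0)]
    model_c = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (1, 5, 0)]
    print(consensus_contacts([model_a, model_b, model_c], min_separation=2))
```

Taking the top L/5 entries of the ranked list would give predictions comparable in form to those evaluated in the abstract.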


Subjects
Caspase 8/chemistry , Caspase 9/chemistry , Caspase 8/metabolism , Caspase 9/metabolism , Models, Molecular , Protein Binding , Protein Structure, Tertiary , Software
11.
Bioinformatics ; 26(7): 882-8, 2010 Apr 01.
Article in English | MEDLINE | ID: mdl-20150411

ABSTRACT

MOTIVATION: Protein structure prediction is one of the most important problems in structural bioinformatics. Here we describe MULTICOM, a multi-level combination approach to improve the various steps in protein structure prediction. In contrast to those methods which look for the best templates, alignments and models, our approach tries to combine complementary and alternative templates, alignments and models to achieve on average better accuracy. RESULTS: The multi-level combination approach was implemented via five automated protein structure prediction servers and one human predictor which participated in the eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. The MULTICOM servers and human predictor were consistently ranked among the top predictors on the CASP8 benchmark. The methods can predict moderate- to high-resolution models for most template-based targets and low-resolution models for some template-free targets. The results show that the multi-level combination of complementary templates, alternative alignments and similar models aided by model quality assessment can systematically improve both template-based and template-free protein modeling. AVAILABILITY: The MULTICOM server is freely available at http://casp.rnet.missouri.edu/multicom_3d.html .


Subjects
Protein Conformation , Proteins/chemistry , Software , Computational Biology , Databases, Protein
12.
Nucleic Acids Res ; 37(Web Server issue): W515-8, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19420062

ABSTRACT

Protein contact map prediction is useful for protein folding rate prediction, model selection and 3D structure prediction. Here we describe NNcon, a fast and reliable contact map prediction server and software. NNcon was ranked among the most accurate residue contact predictors in the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. Both NNcon server and software are available at http://casp.rnet.missouri.edu/nncon.html.


Subjects
Protein Conformation , Software , Internet , Models, Molecular , Neural Networks, Computer , Protein Folding , Reproducibility of Results
13.
ACS Appl Mater Interfaces ; 13(45): 53355-53362, 2021 Nov 17.
Article in English | MEDLINE | ID: mdl-34160211

ABSTRACT

Rechargeable batteries provide crucial energy storage systems for renewable energy sources, as well as consumer electronics and electrical vehicles. There are a number of important parameters that determine the suitability of electrode materials for battery applications, such as the average voltage and the maximum specific capacity which contribute to the overall energy density. Another important performance criterion for battery electrode materials is their volume change upon charging and discharging, which contributes to determine the cyclability, Coulombic efficiency, and safety of a battery. In this work, we present deep neural network regression machine learning models (ML), trained on data obtained from the Materials Project database, for predicting average voltages and volume change upon charging and discharging of electrode materials for metal-ion batteries. Our models exhibit good performance as measured by the average mean absolute error obtained from a 10-fold cross-validation, as well as on independent test sets. We further assess the robustness of our ML models by investigating their screening potential beyond the training database. We produce Na-ion electrodes by systematically replacing Li-ions in the original database by Na-ions and, then, selecting a set of 22 electrodes that exhibit a good performance in energy density, as well as small volume variations upon charging and discharging, as predicted by the machine learning model. The ML predictions for these materials are then compared to quantum-mechanics based calculations. Our results reaffirm the significant role of machine learning techniques in the exploration of materials for battery applications.

14.
Ecol Evol ; 10(17): 9313-9325, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32953063

ABSTRACT

Simple biometric data of fish aid fishery management tasks such as monitoring the structure of fish populations and regulating recreational harvest. While these data are foundational to fishery research and management, the collection of length and weight data through physical handling of the fish is challenging as it is time consuming for personnel and can be stressful for the fish. Recent advances in imaging technology and machine learning now offer alternatives for capturing biometric data. To investigate the potential of deep convolutional neural networks to predict biometric data, several regressors were trained and evaluated on data stemming from the FishL™ Recognition System and manual measurements of length, girth, and weight. The dataset consisted of 694 fish from 22 different species common to Laurentian Great Lakes. Even with such a diverse dataset and variety of presentations by the fish, the regressors proved to be robust and achieved competitive mean percent errors in the range of 5.5 to 7.6% for length and girth on an evaluation dataset. Potential applications of this work could increase the efficiency and accuracy of routine survey work by fishery professionals and provide a means for longer-term automated collection of fish biometric data.

15.
BMC Bioinformatics ; 10: 436, 2009 Dec 21.
Article in English | MEDLINE | ID: mdl-20025768

ABSTRACT

BACKGROUND: Disordered regions are segments of the protein chain which do not adopt stable structures. Such segments are often of interest because they have a close relationship with protein expression and functionality. As such, protein disorder prediction is important for protein structure prediction, structure determination and function annotation. RESULTS: This paper presents our protein disorder prediction server, PreDisorder. It is based on our ab initio prediction method (MULTICOM-CMFR) which, along with our meta (or consensus) prediction method (MULTICOM), was recently ranked among the top disorder predictors in the eighth edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP8). We systematically benchmarked PreDisorder along with 26 other protein disorder predictors on the CASP8 data set and assessed its accuracy using a number of measures. The results show that it compared favourably with other ab initio methods and its performance is comparable to that of the best meta and clustering methods. CONCLUSION: PreDisorder is a fast and reliable server which can be used to predict protein disordered regions on genomic scale. It is available at http://casp.rnet.missouri.edu/predisorder.html.


Subjects
Protein Structure, Tertiary , Proteins/chemistry , Proteomics/methods , Software , Databases, Protein , Protein Conformation , Protein Folding , Sequence Analysis, Protein
16.
Proteins ; 77 Suppl 9: 181-4, 2009.
Article in English | MEDLINE | ID: mdl-19544564

ABSTRACT

Evaluating the quality of protein structure models is important for selecting and using models. Here, we describe the MULTICOM series of model quality predictors which contains three predictors tested in the CASP8 experiments. We evaluated these predictors on 120 CASP8 targets. The average correlations between predicted and real GDT-TS scores of the two semi-clustering methods (MULTICOM and MULTICOM-CLUSTER) and the one single-model ab initio method (MULTICOM-CMFR) are 0.90, 0.89, and 0.74, respectively; and their average difference (or GDT-TS loss) between the global GDT-TS scores of the top-ranked models and the best models are 0.05, 0.06, and 0.07, respectively. The average correlation between predicted and real local quality scores of the semi-clustering methods is above 0.64. Our results show that the novel semi-clustering approach that compares a model with top ranked reference models can improve initial quality scores generated by the ab initio method and a simple meta approach.


Subjects
Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Software , Models, Molecular , Protein Conformation
17.
ACS Appl Mater Interfaces ; 11(20): 18494-18503, 2019 May 22.
Article in English | MEDLINE | ID: mdl-31034195

ABSTRACT

Machine-learning (ML) techniques have rapidly found applications in many domains of materials chemistry and physics where large data sets are available. Aiming to accelerate the discovery of materials for battery applications, in this work, we develop a tool ( http://se.cmich.edu/batteries ) based on ML models to predict voltages of electrode materials for metal-ion batteries. To this end, we use deep neural network, support vector machine, and kernel ridge regression as ML algorithms in combination with data taken from the Materials Project database, as well as feature vectors from properties of chemical compounds and elemental properties of their constituents. We show that our ML models have predictive capabilities for different reference test sets and, as an example, we utilize them to generate a voltage profile diagram and compare it to density functional theory calculations. In addition, using our models, we propose nearly 5000 candidate electrode materials for Na- and K-ion batteries. We also make available a web-accessible tool that, within a minute, can be used to estimate the voltage of any bulk electrode material for a number of metal ions. These results show that ML is a promising alternative for computationally demanding calculations as a first screening tool of novel materials for battery applications.

18.
Sci Data ; 5: 180190, 2018 Oct 09.
Article in English | MEDLINE | ID: mdl-30299439

ABSTRACT

Using Dual-Frequency Identification Sonar (DIDSON), fishery acoustic observation data was collected from the Ocqueoc River, a tributary of Lake Huron in northern Michigan, USA. Data were collected March through July 2013 and 2016 and included the identification, via technology or expert analysis, of eight fish species as they passed through the DIDSON's field of view. A set of short DIDSON clips containing identified fish was curated. Additionally, two other datasets were created that include visualizations of the acoustic data and longer DIDSON clips. These datasets could complement future research characterizing the abundance and behavior of valued fishes such as walleye (Sander vitreus) or white sucker (Catostomus commersonii) or invasive fishes such as sea lamprey (Petromyzon marinus) or European carp (Cyprinus carpio). Given the abundance of DIDSON data and the fact that a portion of it is labeled, these data could aid in the creation of machine learning tools from DIDSON data, particularly for invasive sea lamprey which are amply represented and a destructive invader of the Laurentian Great Lakes.


Subjects
Fisheries , Fishes , Animals , Fishes/classification , Lakes , Rivers , United States
19.
Sci Rep ; 6: 19301, 2016 Jan 14.
Article in English | MEDLINE | ID: mdl-26763289

ABSTRACT

Quality assessment of a protein model is to predict the absolute or relative quality of a protein model using computational methods before the native structure is available. Single-model methods only need one model as input and can predict the absolute residue-specific quality of an individual model. Here, we have developed four novel single-model methods (Wang_deep_1, Wang_deep_2, Wang_deep_3, and Wang_SVM) based on stacked denoising autoencoders (SdAs) and support vector machines (SVMs). We evaluated these four methods along with six other methods participating in CASP11 at the global and local levels using Pearson's correlation coefficients and ROC analysis. As for residue-specific quality assessment, our four methods achieved better performance than most of the six other CASP11 methods in distinguishing the reliably modeled residues from the unreliable measured by ROC analysis; and our SdA-based method Wang_deep_1 has achieved the highest accuracy, 0.77, compared to SVM-based methods and our ensemble of an SVM and SdAs. However, we found that Wang_deep_2 and Wang_deep_3, both based on an ensemble of multiple SdAs and an SVM, performed slightly better than Wang_deep_1 in terms of ROC analysis, indicating that integrating an SVM with deep networks works well in terms of certain measurements.


Subjects
Amino Acids/chemistry , Models, Molecular , Protein Conformation , Proteins/chemistry , Support Vector Machine , Algorithms , ROC Curve
20.
Article in English | MEDLINE | ID: mdl-25750595

ABSTRACT

Ab initio protein secondary structure (SS) predictions are utilized to generate tertiary structure predictions, which are increasingly demanded due to the rapid discovery of proteins. Although recent developments have slightly exceeded previous methods of SS prediction, accuracy has stagnated around 80 percent and many wonder if prediction cannot be advanced beyond this ceiling. Disciplines that have traditionally employed neural networks are experimenting with novel deep learning techniques in attempts to stimulate progress. Since neural networks have historically played an important role in SS prediction, we wanted to determine whether deep learning could contribute to the advancement of this field as well. We developed an SS predictor that makes use of the position-specific scoring matrix generated by PSI-BLAST and deep learning network architectures, which we call DNSS. Graphical processing units and CUDA software optimize the deep network architecture and efficiently train the deep networks. Optimal parameters for the training process were determined, and a workflow comprising three separately trained deep networks was constructed in order to make refined predictions. This deep learning network approach was used to predict SS for a fully independent test dataset of 198 proteins, achieving a Q3 accuracy of 80.7 percent and a Sov accuracy of 74.2 percent.
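
For readers unfamiliar with the Q3 measure quoted above, the sketch below shows how it is usually computed: the percentage of residues whose predicted three-state label (H = helix, E = strand, C = coil) matches the observed one. This is a generic illustration with hypothetical names and toy strings, not the DNSS evaluation code, and it does not cover the segment-based Sov score.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label (H/E/C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    observed = "HHHHEEECCC"    # toy observed secondary structure
    predicted = "HHHEEEECCH"   # toy prediction
    print(q3_accuracy(predicted, observed))
```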


Subjects
Machine Learning , Neural Networks, Computer , Protein Structure, Secondary , Proteins/chemistry , Databases, Protein