ABSTRACT
ApoB-100 is a member of a large lipid transfer protein superfamily and is one of the main apolipoproteins found on low-density lipoprotein (LDL) and very low-density lipoprotein (VLDL) particles. Despite its clinical significance for the development of cardiovascular disease, there is limited information on apoB-100 structure. We have developed a novel method based on the "divide and conquer" algorithm, using PSIPRED software, by dividing apoB-100 into five subunits and 11 domains. Models of each domain were prepared using I-TASSER, DEMO, RoseTTAFold, Phyre2, and MODELLER. Subsequently, we used disuccinimidyl sulfoxide (DSSO), a new mass spectrometry cleavable cross-linker, and the known position of disulfide bonds to experimentally validate each model. We obtained 65 unique DSSO cross-links, of which 87.5% were within a 26 Å threshold in the final model. We also evaluated the positions of cysteine residues involved in the eight known disulfide bonds in apoB-100, and each pair was measured within the expected 5.6 Å constraint. Finally, multiple domains were combined by applying constraints based on detected long-range DSSO cross-links to generate five subunits, which were subsequently merged to achieve an uninterrupted architecture for apoB-100 around a lipoprotein particle. Moreover, the dynamics of apoB-100 during particle size transitions was examined by comparing VLDL and LDL computational models and using experimental cross-linking data. In addition, the proposed model of receptor ligand binding of apoB-100 provides new insights into some of its functions.
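The cross-link validation step described above can be sketched as a simple distance check. This is a minimal illustration, not the authors' pipeline: the function name, the use of C-alpha coordinates, and the toy coordinates are all assumptions; only the 26 Å threshold comes from the abstract.

```python
import math

def fraction_within_threshold(ca_coords, cross_links, threshold=26.0):
    """Fraction of cross-linked residue pairs whose C-alpha distance is <= threshold (in Å).

    ca_coords:   dict mapping residue number -> (x, y, z) in Å (assumed input format)
    cross_links: list of (residue_i, residue_j) pairs
    """
    if not cross_links:
        raise ValueError("no cross-links supplied")
    satisfied = 0
    for res_i, res_j in cross_links:
        if math.dist(ca_coords[res_i], ca_coords[res_j]) <= threshold:
            satisfied += 1
    return satisfied / len(cross_links)

# Toy coordinates (hypothetical, not taken from any apoB-100 model).
toy_coords = {
    10: (0.0, 0.0, 0.0),
    25: (10.0, 0.0, 0.0),
    80: (0.0, 30.0, 0.0),
    120: (3.0, 4.0, 0.0),
}
toy_links = [(10, 25), (10, 80), (10, 120), (25, 80)]
toy_fraction = fraction_within_threshold(toy_coords, toy_links)
print(f"{toy_fraction:.1%} of toy cross-links satisfy the 26 Å threshold")
```

The same function could be reused for the disulfide check by passing the cysteine pairs and a 5.6 Å threshold, assuming the relevant atom coordinates are supplied.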
Subject(s)
Apolipoproteins B , Cysteine , Apolipoprotein B-100 , Apolipoproteins B/metabolism , Computer Simulation , Disulfides , Ligands , LDL Lipoproteins/chemistry , VLDL Lipoproteins , Structural Models , Sulfoxides
ABSTRACT
Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio, sequence-only prediction method, based on reduced amino acid alphabets and convolutional neural networks (CNNs), that tries to overcome the challenge of accurate prediction posed by IDRs. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
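To illustrate what a 3-letter reduced alphabet does to the input of a convolutional network, the sketch below maps a protein sequence onto three symbols and one-hot encodes it. The grouping used here (hydrophobic, polar, charged) is a hypothetical choice for illustration; the abstract does not list the six alphabets that were actually tested.

```python
# Hypothetical 3-letter grouping; NOT one of the alphabets named in the article.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "GSTNQYP",   # polar / small
    "C": "DEKRH",     # charged
}
REDUCE = {aa: letter for letter, members in GROUPS.items() for aa in members}
LETTERS = sorted(GROUPS)  # fixed channel order: ['C', 'H', 'P']

def reduce_sequence(seq):
    """Map a 20-letter protein sequence onto the 3-letter alphabet (KeyError on unknown residues)."""
    return "".join(REDUCE[aa] for aa in seq.upper())

def one_hot(reduced):
    """One row per residue, one column per reduced letter: a 3-channel input instead of 20."""
    return [[1 if symbol == letter else 0 for letter in LETTERS] for symbol in reduced]

example = reduce_sequence("MKDG")
print(example, one_hot(example))
```

With this encoding each residue contributes 3 input channels rather than 20, which is the dimensional reduction the abstract argues helps the convolutional step.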
Subject(s)
Computational Biology/methods , Data Mining/statistics & numerical data , Intrinsically Disordered Proteins/chemistry , Machine Learning , Computer Neural Networks , Amino Acid Sequence , Area Under Curve , Benchmarking , Datasets as Topic , Humans , Multifactor Dimensionality Reduction , ROC Curve , Protein Sequence Analysis
ABSTRACT
When designing live-attenuated respiratory syncytial virus (RSV) vaccine candidates, attenuating mutations can be developed through biologic selection or reverse-genetic manipulation and may include point mutations, codon and gene deletions, and genome rearrangements. Attenuation typically involves the reduction in virus replication, due to direct effects on viral structural and replicative machinery or viral factors that antagonize host defense or cause disease. However, attenuation must balance reduced replication and immunogenic antigen expression. In the present study, we explored a new approach in order to discover attenuating mutations. Specifically, we used protein structure modeling and computational methods to identify amino acid substitutions in the RSV nonstructural protein 1 (NS1) predicted to cause various levels of structural perturbation. Twelve different mutations predicted to alter the NS1 protein structure were introduced into infectious virus and analyzed in cell culture for effects on viral mRNA and protein expression, interferon and cytokine expression, and caspase activation. We found the use of structure-based machine learning to predict amino acid substitutions that reduce the thermodynamic stability of NS1 resulted in various levels of loss of NS1 function, exemplified by effects including reduced multi-cycle viral replication in cells competent for type I interferon, reduced expression of viral mRNAs and proteins, and increased interferon and apoptosis responses.
Subject(s)
Machine Learning , Respiratory Syncytial Virus Vaccines , Human Respiratory Syncytial Virus , Viral Nonstructural Proteins , Virus Replication , Humans , Viral Nonstructural Proteins/genetics , Viral Nonstructural Proteins/immunology , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/metabolism , Respiratory Syncytial Virus Vaccines/immunology , Respiratory Syncytial Virus Vaccines/genetics , Human Respiratory Syncytial Virus/genetics , Human Respiratory Syncytial Virus/immunology , Attenuated Vaccines/immunology , Attenuated Vaccines/genetics , Respiratory Syncytial Virus Infections/prevention & control , Respiratory Syncytial Virus Infections/virology , Respiratory Syncytial Virus Infections/immunology , Amino Acid Substitution , Mutation , Cell Line
ABSTRACT
Introduction: Antimicrobial peptides (AMPs) are promising alternatives to traditional antibiotics for combating plant pathogenic bacteria in agriculture and the environment. However, identifying potent AMPs through laborious experimental assays is resource-intensive and time-consuming. To address these limitations, this study presents a bioinformatics approach utilizing machine learning models for predicting and selecting AMPs active against plant pathogenic bacteria. Methods: N-gram representations of peptide sequences with 3-letter and 9-letter reduced amino acid alphabets were used to capture the sequence patterns and motifs that contribute to the antimicrobial activity of AMPs. A 5-fold cross-validation technique was used to train the machine learning models and to evaluate their predictive accuracy and robustness. Results: The models were applied to predict putative AMPs encoded by intergenic regions and small open reading frames (ORFs) of the citrus genome. Approximately 7% of the 10,000-peptide dataset from the intergenic region and 7% of the 685,924-peptide dataset from the whole genome were predicted as probable AMPs. The prediction accuracy of the reported models ranges from 0.72 to 0.91. A subset of the predicted AMPs was selected for experimental testing against Spiroplasma citri, the causative agent of citrus stubborn disease. The experimental results confirm the antimicrobial activity of the selected AMPs against the target bacterium, demonstrating the predictive capability of the machine learning models. Discussion: Hydrophobic amino acid residues and positively charged amino acid residues are among the key features in predicting AMPs by the Random Forest algorithm. Aggregation propensity appears to be correlated with the effectiveness of the AMPs. The described models would contribute to the development of effective AMP-based strategies for plant disease management in agricultural and environmental settings.
To facilitate broader accessibility, our model is publicly available on the AGRAMP (Agricultural Ngrams Antimicrobial Peptides) server.
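The n-gram representation mentioned in the Methods can be sketched as a fixed-length count vector over every possible n-gram of a reduced alphabet. This is a minimal illustration under stated assumptions: the function name and the example alphabet "CHP" are hypothetical, and the input is taken to be a sequence that has already been mapped onto the reduced alphabet.

```python
from itertools import product

def ngram_counts(reduced_seq, alphabet, n):
    """Count every overlapping n-gram of reduced_seq; vector length is len(alphabet) ** n."""
    vocab = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = dict.fromkeys(vocab, 0)
    for i in range(len(reduced_seq) - n + 1):
        counts[reduced_seq[i:i + n]] += 1  # KeyError if a symbol is outside the alphabet
    return [counts[gram] for gram in vocab]

# A 3-letter alphabet gives 3**2 = 9 bigram features; a 9-letter one would give 81.
vector = ngram_counts("HHCP", "CHP", 2)
print(vector)
```

Because the vector length depends only on the alphabet size and n, peptides of different lengths map to feature vectors of the same dimension, which is what a classifier such as a random forest requires.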
ABSTRACT
BACKGROUND: Successful management of chronic human immunodeficiency virus type 1 (HIV-1) infection with a cocktail of antiretroviral medications can be negatively affected by the presence of drug resistant mutations in the viral targets. These targets include the HIV-1 protease (PR) and reverse transcriptase (RT) proteins, for which a number of inhibitors are available on the market and routinely prescribed. Protein mutational patterns are associated with varying degrees of resistance to their respective inhibitors, with extremes that can range from continued susceptibility to cross-resistance across all drugs. RESULTS: Here we implement statistical learning algorithms to develop structure- and sequence-based models for systematically predicting the effects of mutations in the PR and RT proteins on resistance to each of eight and eleven inhibitors, respectively. Employing a four-body statistical potential, mutant proteins are represented as feature vectors whose components quantify relative environmental perturbations at amino acid residue positions in the respective target structures upon mutation. Two approaches are implemented in developing sequence-based models, based on use of either relative frequencies or counts of n-grams, to generate vectors for representing mutant proteins. To the best of our knowledge, this is the first reported study on structure- and sequence-based predictive models of HIV-1 PR and RT drug resistance developed by implementing a four-body statistical potential and n-grams, respectively, to generate mutant attribute vectors. Performance of the learning methods is evaluated on the basis of tenfold cross-validation, using previously assayed and publicly available in vitro data relating mutational patterns in the targets to quantified inhibitor susceptibility changes. 
CONCLUSION: Overall performance results are competitive with those of a previously published study utilizing a sequence-based strategy, while our structure- and sequence-based models provide orthogonal and complementary prediction methodologies, respectively. In a novel application, we describe a technique for identifying every possible pair of RT inhibitors as either potentially effective together as part of a cocktail, or a combination that is to be avoided.
Subject(s)
Viral Drug Resistance , HIV Protease Inhibitors/pharmacology , HIV Protease/genetics , HIV Reverse Transcriptase/genetics , HIV-1/drug effects , HIV-1/enzymology , Reverse Transcriptase Inhibitors/pharmacology , Algorithms , Catalytic Domain/genetics , Computational Biology , HIV Infections/drug therapy , HIV Infections/genetics , HIV Protease/chemistry , HIV Protease/metabolism , HIV Protease Inhibitors/metabolism , HIV Reverse Transcriptase/antagonists & inhibitors , HIV Reverse Transcriptase/chemistry , HIV-1/genetics , HIV-1/metabolism , Humans , Molecular Models , Mutant Proteins/chemistry , Mutant Proteins/metabolism , Mutation , Phenotype , Protein Conformation , Reverse Transcriptase Inhibitors/metabolism
ABSTRACT
Introduction: The African Goat Improvement Network Image Collection Protocol (AGIN-ICP) is an accessible, easy to use, low-cost procedure to collect phenotypic data via digital images. The AGIN-ICP collects images to extract several phenotype measures including health status indicators (anemia status, age, and weight), body measurements, shapes, and coat color and pattern, from digital images taken with standard digital cameras or mobile devices. This strategy is to quickly survey, record, assess, analyze, and store these data for use in a wide variety of production and sampling conditions. Methods: The work was accomplished as part of the multinational African Goat Improvement Network (AGIN) collaborative and is presented here as a case study in the AGIN collaboration model and working directly with community-based breeding programs (CBBP). It was iteratively developed and tested over 3 years, in 12 countries with over 12,000 images taken. Results and discussion: The AGIN-ICP development is described, and field implementation and the quality of the resulting images for use in image analysis and phenotypic data extraction are iteratively assessed. Digital body measures were validated using the PreciseEdge Image Segmentation Algorithm (PE-ISA) and software showing strong manual to digital body measure Pearson correlation coefficients of height, length, and girth measures (0.931, 0.943, 0.893) respectively. It is critical to note that while none of the very detailed tasks in the AGIN-ICP described here is difficult, every single one of them is even easier to accidentally omit, and the impact of such a mistake could render a sample image, a sampling day's images, or even an entire sampling trip's images difficult or unusable for extracting digital phenotypes. 
Coupled with tissue sampling and genomic testing, it may be useful in the effort to identify and conserve important animal genetic resources and in CBBP genetic improvement programs by providing reliably measured phenotypes with modest cost. Potential users include farmers, animal husbandry officials, veterinarians, regional government or other public health officials, researchers, and others. Based on these results, a final AGIN-ICP is presented, optimizing the costs, ease, and speed of field implementation of the collection method without compromising the quality of the image data collection.
ABSTRACT
Antibiotic resistance constitutes a global threat and could lead to a future pandemic. One strategy is to develop a new generation of antimicrobials. Naturally occurring antimicrobial peptides (AMPs) are recognized templates and some are already in clinical use. To accelerate the discovery of new antibiotics, it is useful to predict novel AMPs from the sequenced genomes of various organisms. The antimicrobial peptide database (APD) provided the first empirical peptide prediction program. It also facilitated the testing of the first machine-learning algorithms. This chapter provides an overview of machine-learning predictions of AMPs. Most of the predictors, such as AntiBP, CAMP, and iAMPpred, involve a single-label prediction of antimicrobial activity. This type of prediction has been expanded to antifungal, antiviral, antibiofilm, anti-TB, hemolytic, and anti-inflammatory peptides. The multiple functional roles of AMPs annotated in the APD also enabled multi-label predictions (iAMP-2L, MLAMP, and AMAP), which include antibacterial, antiviral, antifungal, antiparasitic, antibiofilm, anticancer, anti-HIV, antimalarial, insecticidal, antioxidant, chemotactic, spermicidal activities, and protease inhibiting activities. Also considered in predictions are peptide posttranslational modification, 3D structure, and microbial species-specific information. We compare important amino acids of AMPs implied from machine learning with the frequently occurring residues of the major classes of natural peptides. Finally, we discuss advances, limitations, and future directions of machine-learning predictions of antimicrobial peptides. Ultimately, we may assemble a pipeline of such predictions beyond antimicrobial activity to accelerate the discovery of novel AMP-based antimicrobials.
Subject(s)
Anti-Infective Agents , Antimicrobial Peptides , Machine Learning , Amino Acids/chemistry , Anti-Infective Agents/chemistry , Anti-Infective Agents/pharmacology , Antimicrobial Peptides/chemistry , Antimicrobial Peptides/pharmacology , Peptides/chemistry
ABSTRACT
Computer vision is a tool that could provide livestock producers with digital body measures and records that are important for animal health and production, namely body height and length, and chest girth. However, to build these tools, the scarcity of labeled training data sets with uniform images (pose, lighting) that also represent real-world livestock can be a challenge. Collecting images in a standard way, with manual image labeling is the gold standard to create such training data, but the time and cost can be prohibitive. We introduce the PreciseEdge image segmentation algorithm to address these issues by employing a standard image collection protocol with a semi-automated image labeling method, and a highly precise image segmentation for automated body measurement extraction directly from each image. These elements, from image collection to extraction are designed to work together to yield values highly correlated to real-world body measurements. PreciseEdge adds a brief preprocessing step inspired by chromakey to a modified GrabCut procedure to generate image masks for data extraction (body measurements) directly from the images. Three hundred RGB (red, green, blue) image samples were collected uniformly per the African Goat Improvement Network Image Collection Protocol (AGIN-ICP), which prescribes camera distance, poses, a blue backdrop, and a custom AGIN-ICP calibration sign. Images were taken in natural settings outdoors and in barns under high and low light, using a Ricoh digital camera producing JPG images (converted to PNG prior to processing). The rear and side AGIN-ICP poses were used for this study. PreciseEdge and GrabCut image segmentation methods were compared for differences in user input required to segment the images. The initial bounding box image output was captured for visual comparison. Automated digital body measurements extracted were compared to manual measures for each method. 
Both methods allow additional optional refinement (mouse strokes) to aid the segmentation algorithm. These optional mouse strokes were captured automatically and compared. Stroke count distributions for both methods were not normally distributed per Kolmogorov-Smirnov tests. Non-parametric Wilcoxon tests showed the distributions were different (p < 0.001) and the GrabCut stroke count was significantly higher (p = 5.115e-49), with a mean of 577.08 (std 248.45) versus 221.57 (std 149.45) with PreciseEdge. Digital body measures were highly correlated with manual height, length, and girth measures, with Pearson correlation coefficients of (0.931, 0.943, 0.893) for PreciseEdge and (0.936, 0.944, 0.869) for GrabCut. PreciseEdge image segmentation allowed for masks yielding accurate digital body measurements highly correlated to manual, real-world measurements with over 38% less user input for an efficient, reliable, non-invasive alternative to livestock hand-held direct measuring tools.
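The manual-to-digital agreement reported above is a Pearson correlation coefficient, which can be computed as in the sketch below. The function name and the toy measurements are hypothetical and are not data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy example: manual tape measurements vs. values extracted from image masks (cm, made up).
manual_height = [61.0, 64.5, 58.2, 70.1, 66.3]
digital_height = [60.4, 65.1, 57.9, 69.0, 67.2]
print(round(pearson_r(manual_height, digital_height), 3))
```

A coefficient near 1 indicates that the digital measure tracks the manual one linearly; it does not by itself show that the two agree in absolute value.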
Subject(s)
Livestock , Sexually Transmitted Diseases , Algorithms , Animals , Computer-Assisted Image Processing/methods , Mice
ABSTRACT
BACKGROUND: HIV-1 targets human cells expressing both the CD4 receptor, which binds the viral envelope glycoprotein gp120, and either the CCR5 (R5) or CXCR4 (X4) co-receptor, which interacts primarily with the third hypervariable loop (V3 loop) of gp120. Determination of HIV-1 affinity for either the R5 or X4 co-receptor on host cells facilitates the inclusion of co-receptor antagonists as a part of patient treatment strategies. A dataset of 1193 distinct gp120 V3 loop peptide sequences (989 R5-utilizing, 204 X4-capable) is utilized to train predictive classifiers based on implementations of random forest, support vector machine, boosted decision tree, and neural network machine learning algorithms. An in silico mutagenesis procedure employing multibody statistical potentials, computational geometry, and threading of variant V3 sequences onto an experimental structure is used to generate a feature vector representation for each variant whose components measure environmental perturbations at corresponding structural positions. RESULTS: Classifier performance is evaluated based on stratified 10-fold cross-validation, stratified dataset splits (2/3 training, 1/3 validation), and leave-one-out cross-validation. Best reported values of sensitivity (85%), specificity (100%), and precision (98%) for predicting X4-capable HIV-1 virus, overall accuracy (97%), Matthews correlation coefficient (89%), balanced error rate (0.08), and ROC area (0.97) all reach critical thresholds, suggesting that the models outperform six other state-of-the-art methods and come closer to competing with phenotype assays. CONCLUSIONS: The trained classifiers provide instantaneous and reliable predictions regarding HIV-1 co-receptor usage, requiring only translated V3 loop genotypes as input.
Furthermore, the novelty of these computational mutagenesis based predictor attributes distinguishes the models as orthogonal and complementary to previous methods that utilize sequence, structure, and/or evolutionary information. The classifiers are available online at http://proteins.gmu.edu/automute.
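The performance measures listed in the Results all derive from a binary confusion matrix. The sketch below shows the standard definitions, with the positive class taken to be X4-capable virus; the function name and the counts used are hypothetical and do not reproduce the reported values.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    ber = 1.0 - (sensitivity + specificity) / 2.0   # balanced error rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "ber": ber,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced dataset such as the one described (989 R5-utilizing versus 204 X4-capable sequences), accuracy alone can be misleading, which is why the balanced error rate and the Matthews correlation coefficient are reported alongside it.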
Subject(s)
HIV Envelope Protein gp120/chemistry , HIV-1/metabolism , Molecular Models , Algorithms , Computer Simulation , Genetic Databases , HIV Envelope Protein gp120/metabolism , HIV-1/chemistry , HIV-1/genetics , CCR5 Receptors/genetics , CCR5 Receptors/metabolism , CXCR4 Receptors/genetics , CXCR4 Receptors/metabolism
ABSTRACT
BACKGROUND: There is a considerable literature on the source of the thermostability of proteins from thermophilic organisms. Understanding the mechanisms for this thermostability would provide insights into proteins generally and permit the design of synthetic hyperstable biocatalysts. RESULTS: We have systematically tested a large number of sequence and structure derived quantities for their ability to discriminate thermostable proteins from their non-thermostable orthologs using sets of mesophile-thermophile ortholog pairs. Most of the quantities tested correspond to properties previously reported to be associated with thermostability. Many of the structure related properties were derived from the Delaunay tessellation of protein structures. CONCLUSIONS: Carefully selected sequence based indices discriminate better than purely structure based indices. Combined sequence and structure based indices improve performance somewhat further. Based on our analysis, the strongest contributors to thermostability are an increase in ion pairs on the protein surface and a more strongly hydrophobic interior.
Subject(s)
Proteins/chemistry , Amino Acid Sequence , Bacterial Proteins/chemistry , Molecular Models , Phosphoglycerate Kinase/chemistry , Protein Conformation , Protein Stability , Pyrococcus/chemistry , TATA-Box Binding Protein/chemistry , Temperature , Trypanosoma brucei brucei/chemistry
ABSTRACT
Certain genetic variations in the human population are associated with heritable diseases, and single nucleotide polymorphisms (SNPs) represent the most common form of such differences in DNA sequence. In particular, substantial interest exists in determining whether a non-synonymous SNP (nsSNP), leading to a single residue replacement in the translated protein product, is neutral or disease-related. The nature of protein structure-function relationships suggests that nsSNP effects, either benign or leading to aberrant protein function possibly associated with disease, are dependent on relative structural changes introduced upon mutation. In this study, we characterize a representative sampling of 1790 documented neutral and disease-related human nsSNPs mapped to 243 diverse human protein structures, by quantifying environmental perturbations in the associated proteins with the use of a computational mutagenesis methodology that relies on a four-body, knowledge-based, statistical contact potential. These structural change data are used as attributes to generate a vector representation for each nsSNP, in combination with additional features reflecting sequence and structure of the corresponding protein. A trained model based on the random forest supervised classification algorithm achieves 76% cross-validation accuracy. Our classifier performs at least as well as other methods that use significantly larger datasets of nsSNPs for model training, and the novelty of our attributes differentiates the model as an orthogonal approach that can be utilized in conjunction with other techniques. A dedicated server for obtaining predictions, as well as supporting datasets and documentation, is available at http://proteins.gmu.edu/automute.
Subject(s)
Computational Biology/methods , Disease/genetics , Knowledge Bases , Mutagenesis/genetics , Single Nucleotide Polymorphism/genetics , Algorithms , Aspartylglucosylaminase/chemistry , Genetic Databases , Humans , Learning , Molecular Models , Secondary Protein Structure , ROC Curve , Structure-Activity Relationship
ABSTRACT
MOTIVATION: Accurate predictive models for the impact of single amino acid substitutions on protein stability provide insight into protein structure and function. Such models are also valuable for the design and engineering of new proteins. Previously described methods have utilized properties of protein sequence or structure to predict the free energy change of mutants due to thermal (ΔΔG) and denaturant (ΔΔG(H2O)) denaturations, as well as mutant thermal stability (ΔTm), through the application of either computational energy-based approaches or machine learning techniques. However, accuracy associated with applying these methods separately is frequently far from optimal. RESULTS: We detail a computational mutagenesis technique based on a four-body, knowledge-based, statistical contact potential. For any mutation due to a single amino acid replacement in a protein, the method provides an empirical normalized measure of the ensuing environmental perturbation occurring at every residue position. A feature vector is generated for the mutant by considering perturbations at the mutated position and its ordered six nearest neighbors in the 3-dimensional (3D) protein structure. These predictors of stability change are evaluated by applying machine learning tools to large training sets of mutants derived from diverse proteins that have been experimentally studied and described. Predictive models based on our combined approach are either comparable to, or in many cases significantly outperform, previously published results. AVAILABILITY: A web server with supporting documentation is available at http://proteins.gmu.edu/automute.
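The feature-vector construction in the Results can be sketched as follows. This is an illustration under assumptions, not the published implementation: the per-residue perturbation values are taken as already computed (the four-body potential itself is not reproduced here), neighbors are assumed to be ranked by C-alpha distance from nearest to farthest, and the function name is hypothetical.

```python
import math

def neighbor_feature_vector(perturbations, ca_coords, mutated_index, k=6):
    """Perturbation at the mutated position followed by those of its k nearest residues.

    perturbations: per-residue perturbation scores (assumed precomputed)
    ca_coords:     per-residue (x, y, z) C-alpha coordinates in Å, same order
    """
    origin = ca_coords[mutated_index]
    others = [i for i in range(len(ca_coords)) if i != mutated_index]
    # Rank the remaining residues by spatial distance to the mutated residue.
    others.sort(key=lambda i: math.dist(origin, ca_coords[i]))
    return [perturbations[mutated_index]] + [perturbations[i] for i in others[:k]]

# Toy chain laid out on the x-axis (hypothetical coordinates and scores).
toy_coords = [(x, 0.0, 0.0) for x in (0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0)]
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print(neighbor_feature_vector(toy_scores, toy_coords, mutated_index=3))
```

Ordering the neighbors by distance gives every mutant a vector of the same length (seven components here) with a consistent meaning per component, regardless of protein size.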
Subject(s)
Artificial Intelligence , Computational Biology , Mutagenesis , Proteins/chemistry , Proteins/genetics , Algorithms , Computer Simulation , Protein Databases , Molecular Models , Protein Folding , Tertiary Protein Structure , Sequence Alignment , Protein Sequence Analysis , Structure-Activity Relationship , Thermodynamics
ABSTRACT
Ras proteins play a pivotal role as oncogenes by participating in diverse signaling events, including those linked to cell growth, differentiation, and proliferation. Using experimental fitness data and implementing artificial intelligence and a computational mutagenesis technique, we developed models that reliably predict fitness for all single residue mutants of H-ras proto-oncogene protein p21. The computational mutagenesis generated a feature vector of protein structural changes for each variant, and these data correlated well with fitness. Random forest classification and tree regression machine learning algorithms were implemented for training predictive models. Cross-validations were used to evaluate model performance, and control experiments were performed to assess statistical significance. Classification models revealed a balanced accuracy rate as high as 82%, with a Matthews correlation coefficient of 0.63 and an area under the ROC curve of 0.90. Similarly, regression models displayed Pearson's correlation reaching 0.79. On the other hand, control data sets led to performance values consistent with random guessing. Comparisons with several related state-of-the-art methods reflected favorably on our trained models. This H-Ras proof-of-principle study suggests a complementary approach for understanding mechanisms with which other proteins are involved in oncogenesis, including related Ras isoforms, and for providing useful insights into designing future diagnostic and treatment modalities.
ABSTRACT
There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes. The utilization of classifiers that incorporate fuzzy logic provides a natural extension in order to account for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked.
On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
Subject(s)
Decision Trees , Fuzzy Logic , Single Nucleotide Polymorphism/genetics , Proteins/chemistry , Proteins/genetics , Algorithms , Amino Acid Sequence , Amino Acid Substitution , Area Under Curve , Artificial Intelligence , Chi-Square Distribution , Computational Biology/methods , Factual Databases , Humans , Hydrophobic and Hydrophilic Interactions , Statistical Models , Molecular Sequence Data , Computer Neural Networks , Phylogeny , Predictive Value of Tests , Protein Conformation , Secondary Protein Structure , Tertiary Protein Structure , ROC Curve , Reproducibility of Results , Amino Acid Sequence Homology
ABSTRACT
MOTIVATION: An important area of research in biochemistry and molecular biology focuses on characterization of enzyme mutants. However, synthesis and analysis of experimental mutants is time-consuming and expensive. We describe a machine-learning approach for inferring the activity levels of all unexplored single point mutants of an enzyme, based on a training set of such mutants with experimentally measured activity. RESULTS: Based on a Delaunay tessellation-derived four-body statistical potential function, a perturbation vector measuring environmental changes relative to wild type (wt) at every residue position uniquely characterizes each enzyme mutant for model development and prediction. First, a measure of model performance utilizing the area under the receiver operating characteristic (ROC) curve (AUC) surpasses 0.83 and 0.77 for data sets of experimental HIV-1 protease and T4 lysozyme mutants, respectively. Second, a novel method is introduced for evaluating statistical significance associated with the number of correct test set predictions obtained from a trained model. Third, 100 stratified random splits of the protease and T4 lysozyme mutant data sets into training and test sets achieve 77.0% and 80.8% mean accuracy, respectively. Next, protease and T4 lysozyme models trained with experimental mutants are used to predict activity levels for all remaining mutants; a subsequent search for publications reporting on dozens of these test mutants reveals that experimental results are matched by 79% and 86% of predictions, respectively. Finally, learning curves for each mutant enzyme system indicate the influence of training set size on model performance. AVAILABILITY: Prediction databases are available at http://proteins.gmu.edu/automute/
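The AUC figure used as the first performance measure can be computed directly from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is a generic illustration with hypothetical labels and scores; it is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical example: 1 = active mutant, 0 = inactive mutant.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which gives context for the 0.83 and 0.77 values quoted above. The pairwise loop is quadratic in the number of examples, which is acceptable for data sets of a few thousand mutants.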
Subject(s)
Artificial Intelligence; Enzymes/chemistry; Enzymes/genetics; Models, Chemical; Mutagenesis, Site-Directed/methods; Sequence Analysis, Protein/methods; Algorithms; Amino Acid Sequence; Amino Acid Substitution/genetics; Computer Simulation; Data Interpretation, Statistical; Enzyme Activation; Models, Molecular; Molecular Sequence Data; Mutation; Pattern Recognition, Automated/methods; Sequence Alignment/methods; Structure-Activity Relationship
ABSTRACT
Topological scores, measures of sequence-structure compatibility, are calculated for all 1,881 single point mutants of the human immunodeficiency virus (HIV)-1 protease using a four-body statistical potential function based on Delaunay tessellation of protein structure. Comparison of the mutant topological score data with experimental data from alanine scan studies specifically on the dimer interface residues supports previous findings that 1) L97 and F99 contribute greatly to the Gibbs energy of HIV-1 protease dimerization, 2) Q2 and T4 contribute the least toward the Gibbs energy, and 3) C-terminal residues are more sensitive to mutations than those at the N-terminus. For a more comprehensive treatment of the relationship between protease structure and function, mutant topological scores are compared with the activity levels for a set of 536 experimentally synthesized protease mutants, and a significant correlation is observed. Finally, this structure-function correlation is similarly identified by examining model systems consisting of 2,015 single point mutants of bacteriophage T4 lysozyme as well as 366 single point mutants of HIV-1 reverse transcriptase and is hypothesized to be a property generally applicable to all proteins.
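The count of 1,881 mutants follows from the 99 residues of the HIV-1 protease monomer, each of which can be replaced by any of the 19 other standard amino acids (99 × 19 = 1,881). A minimal sketch of that enumeration is shown below; the function name `single_point_mutants` and the placeholder sequence are assumptions for illustration, not the protease sequence or the authors' code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def single_point_mutants(sequence):
    """List every single point mutant of `sequence` as strings like 'L97A'.

    Assumes `sequence` contains only standard one-letter residue codes.
    """
    mutants = []
    for position, wild_type in enumerate(sequence, start=1):
        for replacement in AMINO_ACIDS:
            if replacement != wild_type:
                mutants.append(f"{wild_type}{position}{replacement}")
    return mutants


# Placeholder 99-residue sequence (NOT the real protease sequence).
placeholder = "A" * 99
print(len(single_point_mutants(placeholder)))
```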
Subject(s)
Mutagenesis; Proteins/chemistry; Proteins/metabolism; Recombinant Proteins/chemistry; Recombinant Proteins/metabolism; Computational Biology/methods; HIV Protease/chemistry; HIV Protease/metabolism; Models, Molecular; Protein Conformation; Proteins/genetics; Structure-Activity Relationship
ABSTRACT
The Delaunay tessellation of several sets of real and simplified model protein structures has been used to explore graph theoretic properties of residue contact networks. The system of contacts defined by residues joined by edges in the Delaunay simplices can be thought of as a graph or network and analyzed using techniques from elementary graph theory and the theory of complex networks. Such analysis indicates that protein contact networks have small world character, but technically are not small world networks. This approach also indicates that networks formed by native structures and by most misfolded decoys can be differentiated by their respective graph properties. The characteristic features of residue contact networks can be used for the detection of structural elements in proteins, such as the ubiquitous closed loops consisting of 22-32 consecutive residues, where terminal residues are Delaunay neighbors.
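Two graph properties commonly used to test for small-world character are the average clustering coefficient and the mean shortest path length. The sketch below computes both for an undirected contact graph given as an edge list. The edge list here is a hypothetical toy example standing in for the edges of a Delaunay tessellation, and the function names are assumptions for illustration only.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (residue_i, residue_j) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def average_clustering(graph):
    """Mean over all nodes of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for neighbors in graph.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for u in neighbors for v in neighbors
                    if u < v and v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def mean_path_length(graph):
    """Mean shortest path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("graph has no connected pairs")
    return total / pairs


# Hypothetical contacts among five residues (not derived from a real structure).
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3), (2, 4)]
toy_graph = build_graph(toy_edges)
print(average_clustering(toy_graph), mean_path_length(toy_graph))
```

A network is usually called small-world when its clustering is much higher than that of a random graph of the same size and density while its mean path length remains comparable; comparing against such a random reference is left out of this sketch.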
Subject(s)
Models, Chemical; Models, Molecular; Multiprotein Complexes/chemistry; Multiprotein Complexes/ultrastructure; Proteins/chemistry; Proteins/ultrastructure; Binding Sites; Computer Simulation; Protein Binding; Protein Conformation
ABSTRACT
The ability to predict the effect of nonsynonymous SNPs (nsSNPs) on protein function is important for the success of genetic disease association studies. Here we present a statistical geometry approach to nsSNP classification based on Delaunay tessellation, whereby the impact of nsSNPs on protein function is correlated with the change in the four-body statistical potential (ΔQ) of the protein caused by the amino acid substitution. We observed that the ΔQ of polymorphic proteins with disease-associated nsSNPs (daSNPs) was on average significantly lower than the ΔQ of the proteins with neutral SNPs (ntSNPs). Clustering amino acid substitutions into conservative and nonconservative groups, and using a three-letter alphabet based on side-chain polarity, showed significantly lower ΔQ in nonconservative changes to daSNPs and when hydrophobic residues were substituted by charged or by polar residues. We also found that the daSNPs in the protein core caused much lower ΔQ than surface daSNPs. This approach demonstrates a strong correlation between the computed ΔQ and SNP classification. Integration of our approach with the existing models will help achieve a more precise recognition of nsSNPs that underlie polygenic diseases. All of the programs were written in Java and are available from the authors upon request.
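The grouping of substitutions by a three-letter polarity alphabet can be sketched as follows. The assignment of residues to the hydrophobic, polar, and charged classes below is one common convention and is an assumption here, since the abstract does not list the authors' grouping. The sign convention for the potential change (mutant minus wild type) and all function names are likewise assumptions for illustration.

```python
# Assumed three-letter alphabet based on side-chain polarity (not from the source).
POLARITY_CLASSES = {
    "hydrophobic": "ACFGILMPVW",
    "polar": "HNQSTY",
    "charged": "DEKR",
}

# Reverse lookup: one-letter residue code -> class name.
RESIDUE_CLASS = {
    residue: name
    for name, residues in POLARITY_CLASSES.items()
    for residue in residues
}


def substitution_classes(wild_type, mutant):
    """Return the (wild-type class, mutant class) pair for a substitution."""
    return RESIDUE_CLASS[wild_type], RESIDUE_CLASS[mutant]


def is_conservative(wild_type, mutant):
    """A substitution is treated as conservative if both residues share a class."""
    before, after = substitution_classes(wild_type, mutant)
    return before == after


def delta_q(q_wild_type, q_mutant):
    """Change in total potential, assumed here to be mutant minus wild type."""
    return q_mutant - q_wild_type


# Hypothetical example: a leucine-to-aspartate substitution with made-up scores.
print(substitution_classes("L", "D"), is_conservative("L", "D"))
print(delta_q(10.0, 7.5))
```

Under this convention a negative value means the substitution lowers the total potential, which matches the direction the abstract reports for disease-associated substitutions.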
Subject(s)
Computational Biology/methods; Genetic Predisposition to Disease; Polymorphism, Single Nucleotide/physiology; Amino Acid Substitution; Data Interpretation, Statistical; Humans; Mathematical Computing; Mutation; Protein Conformation; Proteins/chemistry; Proteins/genetics
ABSTRACT
A simple, five-element descriptor, derived from the Delaunay tessellation of a protein structure in a single point per residue representation, can be assigned to each residue in the protein. The descriptor characterizes main-chain topology and connectivity in the neighborhood of the residue and does not explicitly depend on putative hydrogen bonds or any geometric parameter, including bond length, angles, and areas. Rules based on this descriptor can be used for accurate, robust, and computationally efficient secondary structure assignment that correlates well with the existing methods.
Subject(s)
Computational Biology/methods; Proteomics/methods; Algorithms; Amino Acid Sequence; Biophysical Phenomena; Biophysics; Computer Simulation; Hydrogen Bonding; Models, Chemical; Models, Molecular; Molecular Sequence Data; Protein Conformation; Protein Folding; Protein Structure, Secondary; Protein Structure, Tertiary; Proteins/chemistry; Sensitivity and Specificity; Software; Structural Homology, Protein
ABSTRACT
A topological representation of proteins is developed that makes use of two metrics: the Euclidean metric for identifying natural nearest neighboring residues via the Delaunay tessellation in Cartesian space and the distance between residues in sequence space. Using this representation, we introduce a quantitative and computationally inexpensive method for the comparison of protein structural topology. The method ultimately results in a numerical score quantifying the distance between proteins in a heuristically defined topological space. The properties of this scoring scheme are investigated and correlated with the standard Cα distance root-mean-square deviation measure of protein similarity calculated by rigid body structural alignment. The topological comparison method is shown to have a characteristic dependence on protein conformational differences and secondary structure. This distinctive behavior is also observed in the comparison of proteins within families of structural relatives. The ability of the comparison method to successfully classify proteins into classes, superfamilies, folds, and families that are consistent with standard classification methods, both automated and human-driven, is demonstrated. Furthermore, it is shown that the scoring method allows for a fine-grained classification on the family, protein, and species level that agrees very well with currently established phylogenetic hierarchies. This fine classification is achieved without requiring visual inspection of proteins, sequence analysis, or the use of structural superimposition methods. Implications of the method for a fast, automated, topological hierarchical classification of proteins are discussed.