Results 1 - 20 of 29
1.
Front Microbiol ; 15: 1304044, 2024.
Article in English | MEDLINE | ID: mdl-38516021

ABSTRACT

Introduction: Antimicrobial peptides (AMPs) are promising alternatives to traditional antibiotics for combating plant pathogenic bacteria in agriculture and the environment. However, identifying potent AMPs through laborious experimental assays is resource-intensive and time-consuming. To address these limitations, this study presents a bioinformatics approach utilizing machine learning models for predicting and selecting AMPs active against plant pathogenic bacteria. Methods: N-gram representations of peptide sequences with 3-letter and 9-letter reduced amino acid alphabets were used to capture the sequence patterns and motifs that contribute to the antimicrobial activity of AMPs. A 5-fold cross-validation technique was used to train the machine learning models and to evaluate their predictive accuracy and robustness. Results: The models were applied to predict putative AMPs encoded by intergenic regions and small open reading frames (ORFs) of the citrus genome. Approximately 7% of the 10,000-peptide dataset from the intergenic region and 7% of the 685,924-peptide dataset from the whole genome were predicted as probable AMPs. The prediction accuracy of the reported models ranges from 0.72 to 0.91. A subset of the predicted AMPs was selected for experimental testing against Spiroplasma citri, the causative agent of citrus stubborn disease. The experimental results confirm the antimicrobial activity of the selected AMPs against the target bacterium, demonstrating the predictive capability of the machine learning models. Discussion: Hydrophobic amino acid residues and positively charged amino acid residues are among the key features in predicting AMPs by the Random Forest algorithm. Aggregation propensity appears to be correlated with the effectiveness of the AMPs. The described models would contribute to the development of effective AMP-based strategies for plant disease management in agricultural and environmental settings.
To facilitate broader accessibility, our model is publicly available on the AGRAMP (Agricultural Ngrams Antimicrobial Peptides) server.
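
As a rough illustration of the n-gram representation described in this abstract, the sketch below maps a peptide onto a reduced alphabet and computes relative n-gram frequencies. The three residue groups (`h`, `p`, `c`), the choice of bigrams, and the function names are assumptions made for illustration only; the abstract does not state which groupings or n-gram lengths the authors used.

```python
from collections import Counter
from itertools import product

# Hypothetical 3-letter reduced alphabet (an assumption, not the paper's):
# h = hydrophobic, p = polar/small, c = charged.
REDUCED_3 = {
    **dict.fromkeys("AVLIMFWC", "h"),
    **dict.fromkeys("STNQYGP", "p"),
    **dict.fromkeys("DEKRH", "c"),
}


def reduce_sequence(seq, mapping=REDUCED_3):
    """Rewrite a peptide in the reduced alphabet (KeyError on unknown residues)."""
    return "".join(mapping[aa] for aa in seq.upper())


def ngram_features(seq, n=2, alphabet="hpc"):
    """Return relative frequencies of every possible n-gram over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero for short peptides
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    return {key: counts[key] / total for key in keys}


# Toy peptide, invented for illustration.
features = ngram_features("KLAKLAK", n=2)
print(len(features))  # one feature per possible bigram
```

A fixed-length vector like this could then be passed to any standard classifier; with a 3-letter alphabet and bigrams it has 9 components regardless of peptide length.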

2.
Front Genet ; 14: 1200770, 2023.
Article in English | MEDLINE | ID: mdl-37745840

ABSTRACT

Introduction: The African Goat Improvement Network Image Collection Protocol (AGIN-ICP) is an accessible, easy-to-use, low-cost procedure to collect phenotypic data via digital images. The AGIN-ICP collects images to extract several phenotype measures, including health status indicators (anemia status, age, and weight), body measurements, shapes, and coat color and pattern, from digital images taken with standard digital cameras or mobile devices. The strategy is to quickly survey, record, assess, analyze, and store these data for use in a wide variety of production and sampling conditions. Methods: The work was accomplished as part of the multinational African Goat Improvement Network (AGIN) collaborative and is presented here as a case study in the AGIN collaboration model and in working directly with community-based breeding programs (CBBP). It was iteratively developed and tested over 3 years, in 12 countries, with over 12,000 images taken. Results and discussion: The AGIN-ICP development is described, and field implementation and the quality of the resulting images for use in image analysis and phenotypic data extraction are iteratively assessed. Digital body measures were validated using the PreciseEdge Image Segmentation Algorithm (PE-ISA) and software, showing strong Pearson correlation between manual and digital measures of height, length, and girth (0.931, 0.943, and 0.893, respectively). It is critical to note that while none of the very detailed tasks in the AGIN-ICP described here is difficult, every single one of them is even easier to accidentally omit, and the impact of such a mistake could render a sample image, a sampling day's images, or even an entire sampling trip's images difficult to use or unusable for extracting digital phenotypes.
Coupled with tissue sampling and genomic testing, it may be useful in the effort to identify and conserve important animal genetic resources and in CBBP genetic improvement programs by providing reliably measured phenotypes with modest cost. Potential users include farmers, animal husbandry officials, veterinarians, regional government or other public health officials, researchers, and others. Based on these results, a final AGIN-ICP is presented, optimizing the costs, ease, and speed of field implementation of the collection method without compromising the quality of the image data collection.

3.
Int J Mol Sci ; 23(19)2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36232786

ABSTRACT

ApoB-100 is a member of a large lipid transfer protein superfamily and is one of the main apolipoproteins found on low-density lipoprotein (LDL) and very low-density lipoprotein (VLDL) particles. Despite its clinical significance for the development of cardiovascular disease, there is limited information on apoB-100 structure. We have developed a novel method based on the "divide and conquer" algorithm, using PSIPRED software, by dividing apoB-100 into five subunits and 11 domains. Models of each domain were prepared using I-TASSER, DEMO, RoseTTAFold, Phyre2, and MODELLER. Subsequently, we used disuccinimidyl sulfoxide (DSSO), a new mass spectrometry cleavable cross-linker, and the known position of disulfide bonds to experimentally validate each model. We obtained 65 unique DSSO cross-links, of which 87.5% were within a 26 Å threshold in the final model. We also evaluated the positions of cysteine residues involved in the eight known disulfide bonds in apoB-100, and each pair was measured within the expected 5.6 Å constraint. Finally, multiple domains were combined by applying constraints based on detected long-range DSSO cross-links to generate five subunits, which were subsequently merged to achieve an uninterrupted architecture for apoB-100 around a lipoprotein particle. Moreover, the dynamics of apoB-100 during particle size transitions was examined by comparing VLDL and LDL computational models and using experimental cross-linking data. In addition, the proposed model of receptor ligand binding of apoB-100 provides new insights into some of its functions.
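
The cross-link validation step described above can be pictured as a simple distance check. The sketch below assumes one representative coordinate per residue (for example the Cα atom) and reports the fraction of cross-linked residue pairs whose separation in a model falls within a threshold such as the 26 Å quoted in the abstract. The function name, data layout, and toy coordinates are hypothetical, and the authors' actual distance definition is not given here.

```python
import math


def fraction_within(coords, crosslinks, threshold=26.0):
    """Fraction of cross-linked residue pairs separated by at most `threshold` Å.

    coords: dict mapping residue number -> (x, y, z) of a representative atom.
    crosslinks: list of (residue_i, residue_j) pairs.
    """
    if not crosslinks:
        raise ValueError("no cross-links supplied")
    satisfied = sum(
        1 for i, j in crosslinks
        if math.dist(coords[i], coords[j]) <= threshold
    )
    return satisfied / len(crosslinks)


# Toy coordinates in Å, invented for illustration only.
toy_coords = {10: (0.0, 0.0, 0.0), 55: (20.0, 0.0, 0.0), 90: (0.0, 30.0, 0.0)}
toy_links = [(10, 55), (10, 90)]
print(fraction_within(toy_coords, toy_links))
```

The same function could be reused for the disulfide check by passing the cysteine pairs and a tighter threshold.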


Subjects
Apolipoproteins B; Cysteine; Apolipoprotein B-100; Apolipoproteins B/metabolism; Computer Simulation; Disulfides; Ligands; Lipoproteins, LDL/chemistry; Lipoproteins, VLDL; Models, Structural; Sulfoxides
4.
PLoS One ; 17(10): e0275821, 2022.
Article in English | MEDLINE | ID: mdl-36227957

ABSTRACT

Computer vision is a tool that could provide livestock producers with digital body measures and records that are important for animal health and production, namely body height and length, and chest girth. However, to build these tools, the scarcity of labeled training data sets with uniform images (pose, lighting) that also represent real-world livestock can be a challenge. Collecting images in a standard way, with manual image labeling is the gold standard to create such training data, but the time and cost can be prohibitive. We introduce the PreciseEdge image segmentation algorithm to address these issues by employing a standard image collection protocol with a semi-automated image labeling method, and a highly precise image segmentation for automated body measurement extraction directly from each image. These elements, from image collection to extraction are designed to work together to yield values highly correlated to real-world body measurements. PreciseEdge adds a brief preprocessing step inspired by chromakey to a modified GrabCut procedure to generate image masks for data extraction (body measurements) directly from the images. Three hundred RGB (red, green, blue) image samples were collected uniformly per the African Goat Improvement Network Image Collection Protocol (AGIN-ICP), which prescribes camera distance, poses, a blue backdrop, and a custom AGIN-ICP calibration sign. Images were taken in natural settings outdoors and in barns under high and low light, using a Ricoh digital camera producing JPG images (converted to PNG prior to processing). The rear and side AGIN-ICP poses were used for this study. PreciseEdge and GrabCut image segmentation methods were compared for differences in user input required to segment the images. The initial bounding box image output was captured for visual comparison. Automated digital body measurements extracted were compared to manual measures for each method. 
Both methods allow additional optional refinement (mouse strokes) to aid the segmentation algorithm. These optional mouse strokes were captured automatically and compared. Stroke count distributions for both methods were not normally distributed per Kolmogorov-Smirnov tests. Non-parametric Wilcoxon tests showed the distributions were different (p < 0.001) and the GrabCut stroke count was significantly higher (p = 5.115e-49), with a mean of 577.08 (std 248.45) versus 221.57 (std 149.45) with PreciseEdge. Digital body measures were highly correlated to manual height, length, and girth measures: 0.931, 0.943, and 0.893 for PreciseEdge and 0.936, 0.944, and 0.869 for GrabCut (Pearson correlation coefficient). PreciseEdge image segmentation allowed for masks yielding accurate digital body measurements highly correlated to manual, real-world measurements with over 38% less user input for an efficient, reliable, non-invasive alternative to livestock hand-held direct measuring tools.
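
The agreement between manual and digital measures reported above is a Pearson correlation coefficient. A minimal sketch of that calculation follows; the measurement values are invented for illustration and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either sample is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical height measurements (cm): tape measure vs. image-derived.
manual_cm = [60.0, 65.5, 70.2, 58.1]
digital_cm = [61.0, 64.9, 71.0, 57.5]
print(round(pearson(manual_cm, digital_cm), 3))
```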


Subjects
Livestock; Sexually Transmitted Diseases; Algorithms; Animals; Image Processing, Computer-Assisted/methods; Mice
5.
Methods Mol Biol ; 2405: 1-37, 2022.
Article in English | MEDLINE | ID: mdl-35298806

ABSTRACT

Antibiotic resistance constitutes a global threat and could lead to a future pandemic. One strategy is to develop a new generation of antimicrobials. Naturally occurring antimicrobial peptides (AMPs) are recognized templates and some are already in clinical use. To accelerate the discovery of new antibiotics, it is useful to predict novel AMPs from the sequenced genomes of various organisms. The antimicrobial peptide database (APD) provided the first empirical peptide prediction program. It also facilitated the testing of the first machine-learning algorithms. This chapter provides an overview of machine-learning predictions of AMPs. Most of the predictors, such as AntiBP, CAMP, and iAMPpred, involve a single-label prediction of antimicrobial activity. This type of prediction has been expanded to antifungal, antiviral, antibiofilm, anti-TB, hemolytic, and anti-inflammatory peptides. The multiple functional roles of AMPs annotated in the APD also enabled multi-label predictions (iAMP-2L, MLAMP, and AMAP), which include antibacterial, antiviral, antifungal, antiparasitic, antibiofilm, anticancer, anti-HIV, antimalarial, insecticidal, antioxidant, chemotactic, spermicidal activities, and protease inhibiting activities. Also considered in predictions are peptide posttranslational modification, 3D structure, and microbial species-specific information. We compare important amino acids of AMPs implied from machine learning with the frequently occurring residues of the major classes of natural peptides. Finally, we discuss advances, limitations, and future directions of machine-learning predictions of antimicrobial peptides. Ultimately, we may assemble a pipeline of such predictions beyond antimicrobial activity to accelerate the discovery of novel AMP-based antimicrobials.


Subjects
Anti-Infective Agents; Antimicrobial Peptides; Machine Learning; Amino Acids/chemistry; Anti-Infective Agents/chemistry; Anti-Infective Agents/pharmacology; Antimicrobial Peptides/chemistry; Antimicrobial Peptides/pharmacology; Peptides/chemistry
6.
Proteins ; 88(11): 1472-1481, 2020 11.
Article in English | MEDLINE | ID: mdl-32535960

ABSTRACT

Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio, sequence-only prediction method, based on reduced amino acid alphabets and convolutional neural networks (CNNs), that tries to overcome the challenge of accurate prediction posed by IDRs. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
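
To make the dimensional-reduction argument concrete, the sketch below one-hot encodes a sequence over a 3-letter reduced alphabet (3 input channels instead of 20) and slides a single convolutional filter along it. The residue grouping, the filter weights, and the function names are assumptions for illustration; the six alphabets and the network architecture used in the article are not described in the abstract.

```python
# Hypothetical grouping into a 3-letter alphabet (an assumption):
# h = hydrophobic, p = polar/small, c = charged.
GROUPS = {
    **dict.fromkeys("AVLIMFWC", "h"),
    **dict.fromkeys("STNQYGP", "p"),
    **dict.fromkeys("DEKRH", "c"),
}
ALPHABET = "hpc"


def one_hot(seq, mapping=GROUPS, alphabet=ALPHABET):
    """Encode each residue as a length-3 indicator row over the reduced alphabet."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    rows = []
    for aa in seq.upper():
        row = [0.0] * len(alphabet)
        row[index[mapping[aa]]] = 1.0
        rows.append(row)
    return rows


def conv1d(rows, kernel):
    """Apply one filter (width x channels) along the sequence, no padding."""
    width = len(kernel)
    out = []
    for start in range(len(rows) - width + 1):
        total = 0.0
        for w in range(width):
            for c, weight in enumerate(kernel[w]):
                total += rows[start + w][c] * weight
        out.append(total)
    return out


# A toy filter that responds to "hydrophobic followed by charged".
toy_kernel = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(conv1d(one_hot("AKAK"), toy_kernel))
```

With 3 channels a width-2 filter has 6 weights, whereas the same filter over the full 20-letter alphabet would have 40, which is one way to read the claim that a smaller input alphabet eases pattern detection.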


Subjects
Computational Biology/methods; Data Mining/statistics & numerical data; Intrinsically Disordered Proteins/chemistry; Machine Learning; Neural Networks, Computer; Amino Acid Sequence; Area Under Curve; Benchmarking; Datasets as Topic; Humans; Multifactor Dimensionality Reduction; ROC Curve; Sequence Analysis, Protein
7.
Heliyon ; 5(6): e01884, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31211262

ABSTRACT

Ras proteins play a pivotal role as oncogenes by participating in diverse signaling events, including those linked to cell growth, differentiation, and proliferation. Using experimental fitness data and implementing artificial intelligence and a computational mutagenesis technique, we developed models that reliably predict fitness for all single residue mutants of H-ras proto-oncogene protein p21. The computational mutagenesis generated a feature vector of protein structural changes for each variant, and these data correlated well with fitness. Random forest classification and tree regression machine learning algorithms were implemented for training predictive models. Cross-validations were used to evaluate model performance, and control experiments were performed to assess statistical significance. Classification models revealed a balanced accuracy rate as high as 82%, with a Matthews correlation coefficient of 0.63 and an area under the ROC curve of 0.90. Similarly, regression models displayed Pearson's correlation reaching 0.79. On the other hand, control data sets led to performance values consistent with random guessing. Comparisons with several related state-of-the-art methods reflected favorably on our trained models. This H-Ras proof-of-principle study suggests a complementary approach for understanding mechanisms with which other proteins are involved in oncogenesis, including related Ras isoforms, and for providing useful insights into designing future diagnostic and treatment modalities.
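
Balanced accuracy and the Matthews correlation coefficient, both quoted above, can be computed directly from a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and do not come from the study.

```python
import math


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def matthews_cc(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion-matrix counts.
print(balanced_accuracy(tp=40, fn=10, tn=45, fp=5))
print(round(matthews_cc(tp=40, fn=10, tn=45, fp=5), 3))
```

Both measures are less sensitive to class imbalance than plain accuracy, which is presumably why they are reported alongside the ROC area.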

8.
Adv Bioinformatics ; 2014: 278385, 2014.
Article in English | MEDLINE | ID: mdl-25197272

ABSTRACT

The AUTO-MUTE 2.0 stand-alone software package includes a collection of programs for predicting functional changes to proteins upon single residue substitutions, developed by combining structure-based features with trained statistical learning models. Three of the predictors evaluate changes to protein stability upon mutation, each complementing a distinct experimental approach. Two additional classifiers are available, one for predicting activity changes due to residue replacements and the other for determining the disease potential of mutations associated with nonsynonymous single nucleotide polymorphisms (nsSNPs) in human proteins. These five command-line driven tools, as well as all the supporting programs, complement those that run our AUTO-MUTE web-based server. Nevertheless, all the codes have been rewritten and substantially altered for the new portable software, and they incorporate several new features based on user feedback. Included among these upgrades is the ability to perform three highly requested tasks: to run "big data" batch jobs; to generate predictions using modified protein data bank (PDB) structures, and unpublished personal models prepared using standard PDB file formatting; and to utilize NMR structure files that contain multiple models.

9.
Antiviral Res ; 106: 5-12, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24681122

ABSTRACT

The enzyme integrase (IN) of human immunodeficiency virus type 1 (HIV-1) mediates integration of reverse transcribed viral DNA into the human genome, an essential step in the HIV-1 replication cycle. Elvitegravir (EVG) is an HIV-1 strand transfer inhibitor that binds IN and is the second drug in its class to be approved for clinical use in combination with other anti-HIV-1 medications. However, certain IN sequence mutational patterns have an effect on inhibitor binding, thereby altering the degree of IN mutant susceptibility to EVG. Employing a dataset of 115 translated IN sequences, each having a known EVG susceptibility value and consisting of a distinct set of amino acid replacements relative to the native IN, here we develop and evaluate statistical learning models for predicting the phenotypes (i.e., quantified EVG susceptibilities) of new IN mutants based solely on their genotypes (i.e., translated IN sequences). Each IN mutant is represented as a feature vector of structure-based attributes obtained via an in silico mutagenesis procedure that quantifies all anticipated IN residue-specific environmental perturbations from wild type upon mutation. Cross-validated performance based on four classification models shows that balanced accuracy reaches 87%, while two regression models yield a Pearson's correlation coefficient as high as r = 0.78. At the present time, the models may potentially be useful for diagnostic purposes, but only in conjunction with other tools and techniques, including experimental phenotype assays. However, as published experimental phenotypes for new IN variants become available, a larger and more diverse training set will likely lead to significantly more accurate models.


Subjects
Anti-HIV Agents/pharmacology; Drug Resistance, Viral; HIV Integrase/genetics; HIV Integrase/metabolism; HIV-1/enzymology; Quinolones/pharmacology; Artificial Intelligence; Computer Simulation; HIV-1/genetics; Mutant Proteins/genetics; Mutant Proteins/metabolism; Protein Binding
10.
BMC Genomics ; 14 Suppl 4: S3, 2013.
Article in English | MEDLINE | ID: mdl-24268064

ABSTRACT

BACKGROUND: Successful management of chronic human immunodeficiency virus type 1 (HIV-1) infection with a cocktail of antiretroviral medications can be negatively affected by the presence of drug resistant mutations in the viral targets. These targets include the HIV-1 protease (PR) and reverse transcriptase (RT) proteins, for which a number of inhibitors are available on the market and routinely prescribed. Protein mutational patterns are associated with varying degrees of resistance to their respective inhibitors, with extremes that can range from continued susceptibility to cross-resistance across all drugs. RESULTS: Here we implement statistical learning algorithms to develop structure- and sequence-based models for systematically predicting the effects of mutations in the PR and RT proteins on resistance to each of eight and eleven inhibitors, respectively. Employing a four-body statistical potential, mutant proteins are represented as feature vectors whose components quantify relative environmental perturbations at amino acid residue positions in the respective target structures upon mutation. Two approaches are implemented in developing sequence-based models, based on use of either relative frequencies or counts of n-grams, to generate vectors for representing mutant proteins. To the best of our knowledge, this is the first reported study on structure- and sequence-based predictive models of HIV-1 PR and RT drug resistance developed by implementing a four-body statistical potential and n-grams, respectively, to generate mutant attribute vectors. Performance of the learning methods is evaluated on the basis of tenfold cross-validation, using previously assayed and publicly available in vitro data relating mutational patterns in the targets to quantified inhibitor susceptibility changes. 
CONCLUSION: Overall performance results are competitive with those of a previously published study utilizing a sequence-based strategy, while our structure- and sequence-based models provide orthogonal and complementary prediction methodologies, respectively. In a novel application, we describe a technique for identifying every possible pair of RT inhibitors as either potentially effective together as part of a cocktail, or a combination that is to be avoided.


Subjects
Drug Resistance, Viral; HIV Protease Inhibitors/pharmacology; HIV Protease/genetics; HIV Reverse Transcriptase/genetics; HIV-1/drug effects; HIV-1/enzymology; Reverse Transcriptase Inhibitors/pharmacology; Algorithms; Catalytic Domain/genetics; Computational Biology; HIV Infections/drug therapy; HIV Infections/genetics; HIV Protease/chemistry; HIV Protease/metabolism; HIV Protease Inhibitors/metabolism; HIV Reverse Transcriptase/antagonists & inhibitors; HIV Reverse Transcriptase/chemistry; HIV-1/genetics; HIV-1/metabolism; Humans; Models, Molecular; Mutant Proteins/chemistry; Mutant Proteins/metabolism; Mutation; Phenotype; Protein Conformation; Reverse Transcriptase Inhibitors/metabolism
11.
Biophys Chem ; 153(2-3): 168-72, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21146283

ABSTRACT

The development of drug resistance to antiretroviral medications used to treat infection with HIV-1 is a major concern. Given the cost and time constraints associated with phenotypic resistance testing, computational approaches leading to accurate predictive models of resistance based on a patient's mutational patterns in the target protein would provide a welcome alternative. A combined sequence-structure computational mutagenesis procedure is used to generate attribute vectors for each of 222 mutational patterns of HIV-1 reverse transcriptase that were isolated and sequenced from patients. Phenotypic fold-levels of resistance to the non-nucleoside inhibitor Nevirapine are known for over 25% of these mutants, whose values are used to assign each assayed mutant to a drug susceptibility class, either sensitive or resistant. Support vector machine and random forest supervised learning algorithms applied to this subset respectively classify mutants based on drug susceptibility with 85% and 92% cross-validation accuracy. The trained models are used to predict susceptibility to Nevirapine for all remaining mutant isolates, and predictions are in agreement for 90% of the test cases.


Subjects
Drug Resistance, Viral/genetics; HIV Reverse Transcriptase/antagonists & inhibitors; HIV Reverse Transcriptase/drug effects; HIV-1/drug effects; Mutation/drug effects; Nevirapine/pharmacology; Reverse Transcriptase Inhibitors/pharmacology; Genetic Testing/methods; Genetic Vectors; HIV Infections/drug therapy; HIV Infections/genetics; HIV Reverse Transcriptase/genetics; HIV-1/genetics; Humans; Models, Genetic; Mutation/genetics
12.
Article in English | MEDLINE | ID: mdl-22255025

ABSTRACT

A computational mutagenesis methodology founded upon a structure-dependent and knowledge-based four-body statistical potential is utilized in generating feature vectors that characterize over 8500 individual amino acid substitutions occurring in seven proteins, each mutant having been experimentally ascertained for its relative effect on native protein activity. The proteins are diverse with respect to host organism (viral, bacterial, human) and function (enzymatic, nucleic acid binding, signaling), the structures span all four major SCOP classifications, and the mutations occur at positions well distributed throughout the seven structures. Implementation of the random forest algorithm, for classifying mutant activity as either unaffected or affected relative to the native protein, yields 84% accuracy based on tenfold cross-validation. A freely available online server for obtaining predictions with the trained model, which also displays 84% accuracy on an independent test set of mutants, is available at http://proteins.gmu.edu/automute/AUTO-MUTE_Activity.html.


Subjects
Proteins/chemistry; Algorithms; Amino Acid Substitution; Mutagenesis; Protein Conformation; Proteins/genetics
13.
Article in English | MEDLINE | ID: mdl-22255026

ABSTRACT

Protein engineering experiments involving single amino acid substitutions are routinely implemented for the analysis of protein structure, stability, and function. The resulting change in just one of these characteristics relative to the native protein constitutes the focus of any single study, as is the case with predictive computational models developed for the same purpose. Other than investigations into stability-activity trade-offs specifically resulting from active site residue replacements in a few enzymes, a literature survey fails to reveal a comprehensive analysis of stability-activity relationships in proteins upon mutation. Here, we employ a computational mutagenesis for quantifying overall protein structural change upon mutation, which is applied to a dataset of 938 single residue replacements distributed at positions throughout twenty diverse proteins. These mutants are selected based on the availability of both experimental stability and activity change data, and their structural change data are used to characterize the full range of stability-activity relationships.


Subjects
Mutagenesis; Proteins/chemistry
14.
BMC Bioinformatics ; 11: 494, 2010 Oct 05.
Article in English | MEDLINE | ID: mdl-20923564

ABSTRACT

BACKGROUND: HIV-1 targets human cells expressing both the CD4 receptor, which binds the viral envelope glycoprotein gp120, as well as either the CCR5 (R5) or CXCR4 (X4) co-receptors, which interact primarily with the third hypervariable loop (V3 loop) of gp120. Determination of HIV-1 affinity for either the R5 or X4 co-receptor on host cells facilitates the inclusion of co-receptor antagonists as a part of patient treatment strategies. A dataset of 1193 distinct gp120 V3 loop peptide sequences (989 R5-utilizing, 204 X4-capable) is utilized to train predictive classifiers based on implementations of random forest, support vector machine, boosted decision tree, and neural network machine learning algorithms. An in silico mutagenesis procedure employing multibody statistical potentials, computational geometry, and threading of variant V3 sequences onto an experimental structure is used to generate a feature vector representation for each variant whose components measure environmental perturbations at corresponding structural positions. RESULTS: Classifier performance is evaluated based on stratified 10-fold cross-validation, stratified dataset splits (2/3 training, 1/3 validation), and leave-one-out cross-validation. Best reported values of sensitivity (85%), specificity (100%), and precision (98%) for predicting X4-capable HIV-1, overall accuracy (97%), Matthews correlation coefficient (89%), balanced error rate (0.08), and ROC area (0.97) all reach critical thresholds, suggesting that the models outperform six other state-of-the-art methods and come closer to competing with phenotype assays. CONCLUSIONS: The trained classifiers provide instantaneous and reliable predictions regarding HIV-1 co-receptor usage, requiring only translated V3 loop genotypes as input.
Furthermore, the novelty of these computational mutagenesis based predictor attributes distinguishes the models as orthogonal and complementary to previous methods that utilize sequence, structure, and/or evolutionary information. The classifiers are available online at http://proteins.gmu.edu/automute.
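
Stratified cross-validation, as used in this study, keeps the class ratio roughly constant in every fold, which matters when one class (here X4-capable) is much smaller than the other. The sketch below is a generic pure-Python illustration of the idea rather than the authors' procedure; the function name and seed are hypothetical, while the class sizes come from the abstract.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class sizes quoted in the abstract: 989 R5-utilizing, 204 X4-capable.
labels = ["R5"] * 989 + ["X4"] * 204
folds = stratified_folds(labels, k=10)
for fold in folds[:3]:
    x4 = sum(1 for i in fold if labels[i] == "X4")
    print(len(fold), x4)
```

Each fold serves once as the held-out set while the remaining nine are used for training; every fold here contains 20 or 21 X4-capable sequences.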


Subjects
HIV Envelope Protein gp120/chemistry; HIV-1/metabolism; Models, Molecular; Algorithms; Computer Simulation; Databases, Genetic; HIV Envelope Protein gp120/metabolism; HIV-1/chemistry; HIV-1/genetics; Receptors, CCR5/genetics; Receptors, CCR5/metabolism; Receptors, CXCR4/genetics; Receptors, CXCR4/metabolism
15.
J Theor Biol ; 266(4): 560-8, 2010 Oct 21.
Article in English | MEDLINE | ID: mdl-20655929

ABSTRACT

Certain genetic variations in the human population are associated with heritable diseases, and single nucleotide polymorphisms (SNPs) represent the most common form of such differences in DNA sequence. In particular, substantial interest exists in determining whether a non-synonymous SNP (nsSNP), leading to a single residue replacement in the translated protein product, is neutral or disease-related. The nature of protein structure-function relationships suggests that nsSNP effects, either benign or leading to aberrant protein function possibly associated with disease, are dependent on relative structural changes introduced upon mutation. In this study, we characterize a representative sampling of 1790 documented neutral and disease-related human nsSNPs mapped to 243 diverse human protein structures, by quantifying environmental perturbations in the associated proteins with the use of a computational mutagenesis methodology that relies on a four-body, knowledge-based, statistical contact potential. These structural change data are used as attributes to generate a vector representation for each nsSNP, in combination with additional features reflecting sequence and structure of the corresponding protein. A trained model based on the random forest supervised classification algorithm achieves 76% cross-validation accuracy. Our classifier performs at least as well as other methods that use significantly larger datasets of nsSNPs for model training, and the novelty of our attributes differentiates the model as an orthogonal approach that can be utilized in conjunction with other techniques. A dedicated server for obtaining predictions, as well as supporting datasets and documentation, is available at http://proteins.gmu.edu/automute.


Subjects
Computational Biology/methods; Disease/genetics; Knowledge Bases; Mutagenesis/genetics; Polymorphism, Single Nucleotide/genetics; Algorithms; Aspartylglucosylaminase/chemistry; Databases, Genetic; Humans; Learning; Models, Molecular; Protein Structure, Secondary; ROC Curve; Structure-Activity Relationship
16.
Protein Eng Des Sel ; 23(8): 683-7, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20573719

ABSTRACT

Utilizing cutting-edge supervised classification and regression algorithms, three web-based tools have been developed for predicting stability changes upon single residue substitutions in proteins with known native structures. Trained models classify independent mutant test sets with accuracies ranging from 87 to 94%. Attributes representing each mutant protein are based on a computational mutagenesis methodology relying on a four-body statistical potential, illustrating a novel integration of both energy-based and machine learning approaches. The servers are written in PHP and hosted on a Linux platform, and they can be freely accessed online along with detailed data sets, documentation and performance results at http://proteins.gmu.edu/automute.


Subjects
Amino Acid Substitution , Computational Biology/methods , Internet , Proteins/chemistry , Software , Algorithms , Artificial Intelligence , Chemical Phenomena , Protein Stability , Proteins/metabolism , Regression Analysis , Thermodynamics
17.
BMC Struct Biol ; 10 Suppl 1: S5, 2010 May 17.
Article in English | MEDLINE | ID: mdl-20487512

ABSTRACT

BACKGROUND: There is a considerable literature on the source of the thermostability of proteins from thermophilic organisms. Understanding the mechanisms for this thermostability would provide insights into proteins generally and permit the design of synthetic hyperstable biocatalysts. RESULTS: We have systematically tested a large number of sequence and structure derived quantities for their ability to discriminate thermostable proteins from their non-thermostable orthologs using sets of mesophile-thermophile ortholog pairs. Most of the quantities tested correspond to properties previously reported to be associated with thermostability. Many of the structure related properties were derived from the Delaunay tessellation of protein structures. CONCLUSIONS: Carefully selected sequence based indices discriminate better than purely structure based indices. Combined sequence and structure based indices improve performance somewhat further. Based on our analysis, the strongest contributors to thermostability are an increase in ion pairs on the protein surface and a more strongly hydrophobic interior.


Subjects
Proteins/chemistry , Amino Acid Sequence , Bacterial Proteins/chemistry , Models, Molecular , Phosphoglycerate Kinase/chemistry , Protein Conformation , Protein Stability , Pyrococcus/chemistry , TATA-Box Binding Protein/chemistry , Temperature , Trypanosoma brucei brucei/chemistry
18.
Protein Eng Des Sel ; 22(11): 665-71, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19690089

ABSTRACT

A computational mutagenesis methodology utilizing a four-body, knowledge-based, statistical contact potential is applied toward globally quantifying relative environmental perturbations (residual scores) in bacteriophage f1 gene V protein (GVP) due to single amino acid substitutions. We show that residual scores correlate well with experimentally measured relative changes in protein function upon mutation. Residual scores also distinguish between GVP amino acid positions grouped according to protein structural or functional roles or based on similarities in physicochemical characteristics. For each mutant, the in silico mutagenesis additionally yields local measures of environmental change (EC scores) occurring at every residue position (residual profile) relative to the native protein. Implementation of the random forest (RF) algorithm, utilizing experimental GVP mutants whose feature vector components include EC scores at the mutated position and at six structurally nearest neighbors, correctly classifies mutants based on function with up to 77% cross-validation accuracy while achieving 0.82 area under the receiver operating characteristic curve. A control experiment highlights the effectiveness of mutant feature vector signals, and a variety of learning curves are generated to analyze the impact of GVP mutant data set size on performance measures. An optimally trained RF model is subsequently used for inferring function for all the remaining unexplored GVP mutants.


Subjects
Bacteriophages , Models, Biological , Viral Proteins/genetics , Viral Proteins/metabolism , Amino Acid Sequence , Amino Acid Substitution , Bacteriophages/physiology , Escherichia coli/growth & development , Escherichia coli/virology , Models, Molecular , Molecular Sequence Data , Protein Conformation , Structure-Activity Relationship , Viral Proteins/chemistry
19.
Bioinformatics ; 24(18): 2002-9, 2008 Sep 15.
Article in English | MEDLINE | ID: mdl-18632749

ABSTRACT

MOTIVATION: Accurate predictive models for the impact of single amino acid substitutions on protein stability provide insight into protein structure and function. Such models are also valuable for the design and engineering of new proteins. Previously described methods have utilized properties of protein sequence or structure to predict the free energy change of mutants due to thermal (ΔΔG) and denaturant (ΔΔG(H2O)) denaturations, as well as mutant thermal stability (ΔTm), through the application of either computational energy-based approaches or machine learning techniques. However, accuracy associated with applying these methods separately is frequently far from optimal. RESULTS: We detail a computational mutagenesis technique based on a four-body, knowledge-based, statistical contact potential. For any mutation due to a single amino acid replacement in a protein, the method provides an empirical normalized measure of the ensuing environmental perturbation occurring at every residue position. A feature vector is generated for the mutant by considering perturbations at the mutated position and its six ordered nearest neighbors in the 3-dimensional (3D) protein structure. These predictors of stability change are evaluated by applying machine learning tools to large training sets of mutants derived from diverse proteins that have been experimentally studied and described. Predictive models based on our combined approach are either comparable to, or in many cases significantly outperform, previously published results. AVAILABILITY: A web server with supporting documentation is available at http://proteins.gmu.edu/automute.
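The feature-vector construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each residue is reduced to one representative 3D point, that "nearest" means Euclidean distance between those points, and that the per-residue perturbation scores have already been computed. The function name `mutant_feature_vector` and all inputs are hypothetical.

```python
import math

def mutant_feature_vector(coords, profile, pos, k=6):
    """Return [score at mutated position, scores at its k nearest neighbors].

    coords  : list of (x, y, z) points, one per residue (assumed representation)
    profile : per-residue perturbation scores for this mutant (assumed given)
    pos     : index of the mutated residue
    k       : number of structural neighbors to include (six in the abstract)
    """
    center = coords[pos]
    # Every residue except the mutated one is a candidate neighbor.
    others = [i for i in range(len(coords)) if i != pos]
    # Order candidates from nearest to farthest in 3D space.
    others.sort(key=lambda i: math.dist(center, coords[i]))
    neighbors = others[:k]
    return [profile[pos]] + [profile[i] for i in neighbors]

# Toy example: eight residues placed along a line.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0),
          (4, 0, 0), (5, 0, 0), (6, 0, 0), (10, 0, 0)]
profile = [float(i) for i in range(8)]
vector = mutant_feature_vector(coords, profile, pos=0)
```

Each mutant then becomes a fixed-length vector of seven values, which is the form a standard supervised learner expects.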


Subjects
Artificial Intelligence , Computational Biology , Mutagenesis , Proteins/chemistry , Proteins/genetics , Algorithms , Computer Simulation , Databases, Protein , Models, Molecular , Protein Folding , Protein Structure, Tertiary , Sequence Alignment , Sequence Analysis, Protein , Structure-Activity Relationship , Thermodynamics
20.
Proteins ; 71(4): 1930-9, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18186470

ABSTRACT

There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes. The utilization of classifiers that incorporate fuzzy logic provides a natural extension in order to account for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked. On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
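For readers unfamiliar with two of the measures named in this abstract, the sketch below computes the balanced error rate and the Matthews correlation coefficient from binary confusion-matrix counts using their standard definitions. The counts are invented for illustration and are not taken from the study. The function names are hypothetical.

```python
import math

def balanced_error_rate(tp, fp, tn, fn):
    """Average of the error rates on the positive and negative classes."""
    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (tn + fp)
    return 0.5 * (false_negative_rate + false_positive_rate)

def matthews_corrcoef(tp, fp, tn, fn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Illustrative counts only (assumed, not from the paper).
ber = balanced_error_rate(tp=40, fp=20, tn=30, fn=10)
mcc = matthews_corrcoef(tp=40, fp=20, tn=30, fn=10)
```

Both measures weight the two classes more evenly than plain accuracy does, which matters when disease-associated and neutral variants are not equally represented.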


Subjects
Decision Trees , Fuzzy Logic , Polymorphism, Single Nucleotide/genetics , Proteins/chemistry , Proteins/genetics , Algorithms , Amino Acid Sequence , Amino Acid Substitution , Area Under Curve , Artificial Intelligence , Chi-Square Distribution , Computational Biology/methods , Databases, Factual , Humans , Hydrophobic and Hydrophilic Interactions , Models, Statistical , Molecular Sequence Data , Neural Networks, Computer , Phylogeny , Predictive Value of Tests , Protein Conformation , Protein Structure, Secondary , Protein Structure, Tertiary , ROC Curve , Reproducibility of Results , Sequence Homology, Amino Acid