Results 1 - 13 of 13
1.
Hum Mutat ; 40(9): 1530-1545, 2019 09.
Article in English | MEDLINE | ID: mdl-31301157

ABSTRACT

Accurate prediction of the impact of genomic variation on phenotype is a major goal of computational biology and an important contributor to personalized medicine. Computational predictions can lead to a better understanding of the mechanisms underlying genetic diseases, including cancer, but their adoption requires thorough and unbiased assessment. Cystathionine-beta-synthase (CBS) is an enzyme that catalyzes the first step of the transsulfuration pathway, from homocysteine to cystathionine, and in which variations are associated with human hyperhomocysteinemia and homocystinuria. We have created a computational challenge under the CAGI framework to evaluate how well different methods can predict the phenotypic effect(s) of CBS single amino acid substitutions using a blinded experimental data set. CAGI participants were asked to predict yeast growth based on the identity of the mutations. The performance of the methods was evaluated using several metrics. The CBS challenge highlighted the difficulty of predicting the phenotype of an ex vivo system in a model organism when classification models were trained on human disease data. We also discuss the variations in difficulty of prediction for known benign and deleterious variants, as well as identify methodological and experimental constraints with lessons to be learned for future challenges.


Subject(s)
Amino Acid Substitution , Computational Biology/methods , Cystathionine beta-Synthase/genetics , Cystathionine/metabolism , Cystathionine beta-Synthase/metabolism , Homocysteine/metabolism , Humans , Phenotype , Precision Medicine
2.
Hum Mutat ; 38(9): 1193-1200, 2017 09.
Article in English | MEDLINE | ID: mdl-28087895

ABSTRACT

The Critical Assessment of Genome Interpretation (CAGI) experiment is the first attempt to evaluate the state of the art in genetic data interpretation. Among the proposed challenges, Crohn disease (CD) risk prediction has become the most classic problem, spanning three editions. The scientific question is very hard: can anybody assess the risk of developing CD given the exome data alone? This is one of the ultimate goals of genetic analysis, which motivated most CAGI participants to look for powerful new methods. In the 2016 CD challenge, we implemented all the best methods proposed in the past editions. This resulted in 10 algorithms, which were evaluated fairly by CAGI organizers. We also used all the data available from CAGI 11 and 13 to maximize the amount of training samples. The most effective algorithms used known genes associated with CD from the literature. No method could effectively evaluate the importance of unannotated variants by using heuristics. As a downside, all CD datasets were strongly affected by sample stratification. This affected the performance reported by assessors. Therefore, we expect that future datasets will be normalized in order to remove population effects. This will improve method comparison and promote algorithms focused on causal variant discovery.


Subject(s)
Computational Biology/methods , Crohn Disease/genetics , Algorithms , Genetic Predisposition to Disease , High-Throughput Nucleotide Sequencing , Humans , Practice Guidelines as Topic , Exome Sequencing
3.
Hum Mutat ; 38(9): 1042-1050, 2017 09.
Article in English | MEDLINE | ID: mdl-28440912

ABSTRACT

Correct phenotypic interpretation of variants of unknown significance for cancer-associated genes is a diagnostic challenge as genetic screenings gain in popularity in the next-generation sequencing era. The Critical Assessment of Genome Interpretation (CAGI) experiment aims to test and define the state of the art of genotype-phenotype interpretation. Here, we present the assessment of the CAGI p16INK4a challenge. Participants were asked to predict the effect on cellular proliferation of 10 variants for the p16INK4a tumor suppressor, a cyclin-dependent kinase inhibitor encoded by the CDKN2A gene. Twenty-two pathogenicity predictors were assessed with a variety of accuracy measures for reliability in a medical context. Different assessment measures were combined in an overall ranking to provide more robust results. The R scripts used for assessment are publicly available from a GitHub repository for future use in similar assessment exercises. Despite a limited test-set size, our findings show a variety of results, with some methods performing significantly better. Methods combining different strategies frequently outperform simpler approaches. The best predictor, Yang&Zhou lab, uses a machine learning method combining an empirical energy function measuring protein stability with an evolutionary conservation term. The p16INK4a challenge highlights how subtle structural effects can neutralize otherwise deleterious variants.


Subject(s)
Computational Biology/methods , Cyclin-Dependent Kinase Inhibitor p18/genetics , Genetic Variation , Cell Line, Tumor , Cell Proliferation , Computer Simulation , Cyclin-Dependent Kinase Inhibitor p16 , Cyclin-Dependent Kinase Inhibitor p18/chemistry , Databases, Genetic , Genetic Predisposition to Disease , Humans , Machine Learning , Protein Stability
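The abstract of record 3 mentions that different assessment measures were combined into an overall ranking. One simple way to do this is to average each predictor's rank across the measures. The sketch below is illustrative only: the abstract does not specify the combination scheme, so the averaging rule, the function name `average_rank`, and the input layout are assumptions, not the published R scripts.

```python
def average_rank(scores_by_measure):
    """Combine several accuracy measures into one ranking by mean rank.

    scores_by_measure: {measure_name: {predictor_name: score}}, where a
    higher score is better and every predictor appears in every measure.
    Ties within a measure are broken by input order (an assumption made
    here for simplicity).
    Returns a list of (mean_rank, predictor_name), best first.
    """
    totals = {}
    for scores in scores_by_measure.values():
        # Rank 1 goes to the highest score for this measure.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    n_measures = len(scores_by_measure)
    return sorted((total / n_measures, name) for name, total in totals.items())


# Hypothetical predictors and scores, for illustration only.
example = {
    "auc": {"A": 0.9, "B": 0.8, "C": 0.7},
    "corr": {"A": 0.5, "B": 0.6, "C": 0.1},
}
ranking = average_rank(example)
```

Averaging ranks rather than raw scores keeps measures on different scales from dominating one another, which is one reason a rank-based combination is a common choice for small test sets.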
4.
Hum Mutat ; 38(9): 1266-1276, 2017 09.
Article in English | MEDLINE | ID: mdl-28544481

ABSTRACT

The advent of next-generation sequencing has dramatically decreased the cost for whole-genome sequencing and increased the viability for its application in research and clinical care. The Personal Genome Project (PGP) provides unrestricted access to genomes of individuals and their associated phenotypes. This resource enabled the Critical Assessment of Genome Interpretation (CAGI) to create a community challenge to assess the bioinformatics community's ability to predict traits from whole genomes. In the CAGI PGP challenge, researchers were asked to predict whether an individual had a particular trait or profile based on their whole genome. Several approaches were used to assess submissions, including ROC AUC (area under receiver operating characteristic curve), probability rankings, the number of correct predictions, and statistical significance simulations. Overall, we found that prediction of individual traits is difficult, relying on a strong knowledge of trait frequency within the general population, whereas matching genomes to trait profiles relies heavily upon a small number of common traits including ancestry, blood type, and eye color. When a rare genetic disorder is present, profiles can be matched when one or more pathogenic variants are identified. Prediction accuracy has improved substantially over the last 6 years due to improved methodology and a better understanding of features.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Area Under Curve , Genetic Predisposition to Disease , Human Genome Project , Humans , Phenotype , Quantitative Trait Loci
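The abstract of record 4 lists ROC AUC among the measures used to assess submissions. As a minimal sketch of what that measure computes, the function below uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive case receives the higher score. The function name `roc_auc` and the example data are assumptions for illustration, not the assessors' code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: sequence of 0/1 values (1 = individual has the trait).
    scores: predicted probabilities; higher means more likely positive.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairs ranked correctly; a tie counts as half a correct pair.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four individuals.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect separation of positives from negatives.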
5.
Hum Mutat ; 38(9): 1182-1192, 2017 09.
Article in English | MEDLINE | ID: mdl-28634997

ABSTRACT

Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.


Subject(s)
Bipolar Disorder/genetics , Crohn Disease/genetics , Exome Sequencing/methods , Precision Medicine/methods , Warfarin/therapeutic use , Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Humans , Information Dissemination , Pharmacogenomic Variants , Phenotype , Warfarin/pharmacology
6.
Nucleic Acids Res ; 43(W1): W134-40, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-26019177

ABSTRACT

Identifying protein functions can be useful for numerous applications in biology. The prediction of gene ontology (GO) functional terms from sequence remains, however, a challenging task, as shown by the recent CAFA experiments. Here we present INGA, a web server developed to predict protein function from a combination of three orthogonal approaches. Sequence similarity and domain architecture searches are combined with protein-protein interaction network data to derive consensus predictions for GO terms using functional enrichment. The INGA server can be queried both programmatically through RESTful services and through a web interface designed for usability. The latter provides output supporting the GO term predictions with the annotating sequences. INGA is validated on the CAFA-1 data set and was recently shown to perform consistently well in the CAFA-2 blind test. The INGA web server is available from URL: http://protein.bio.unipd.it/inga.


Subject(s)
Protein Interaction Mapping , Protein Structure, Tertiary , Sequence Homology, Amino Acid , Software , Gene Ontology , Humans , Internet , Molecular Sequence Annotation , Proteins/genetics , Proteins/physiology
7.
Bioinformatics ; 31(7): 1138-40, 2015 Apr 01.
Article in English | MEDLINE | ID: mdl-25414364

ABSTRACT

MOTIVATION: Protein sequence and structure representation and manipulation require dedicated software libraries to support methods of increasing complexity. Here, we describe the VIrtual Construction TOol for pRoteins (Victor) C++ library, an open source platform dedicated to enabling inexperienced users to develop advanced tools and gathering contributions from the community. The provided application examples cover statistical energy potentials, profile-profile sequence alignments and ab initio loop modeling. Victor was used over the last 15 years in several publications and optimized for efficiency. It is provided as a GitHub repository with source files and unit tests, plus extensive online documentation, including a Wiki with help files and tutorials, examples and Doxygen documentation. AVAILABILITY AND IMPLEMENTATION: The C++ library and online documentation, distributed under a GPL license, are available from URL: http://protein.bio.unipd.it/victor/.


Subject(s)
Databases, Protein , Libraries, Digital , Proteins/chemistry , Sequence Alignment/methods , Software , Computational Biology/methods , Humans , Structural Homology, Protein
8.
Bioinformatics ; 31(2): 201-8, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25246432

ABSTRACT

MOTIVATION: Intrinsically disordered regions are key for the function of numerous proteins. Due to the difficulties in experimental disorder characterization, many computational predictors have been developed with various disorder flavors. Their performance is generally measured on small sets mainly from experimentally solved structures, e.g. Protein Data Bank (PDB) chains. MobiDB has only recently started to collect disorder annotations from multiple experimental structures. RESULTS: MobiDB annotates disorder for UniProt sequences, allowing us to conduct the first large-scale assessment of fast disorder predictors on 25 833 different sequences with X-ray crystallographic structures. In addition to a comprehensive ranking of predictors, this analysis produced the following interesting observations. (i) The predictors cluster according to their disorder definition, with a consensus giving more confidence. (ii) Previous assessments appear over-reliant on data annotated at the PDB chain level and performance is lower on entire UniProt sequences. (iii) Long disordered regions are harder to predict. (iv) Depending on the structural and functional types of the proteins, differences in prediction performance of up to 10% are observed. AVAILABILITY: The datasets are available from Web site at URL: http://mobidb.bio.unipd.it/lsd. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins/chemistry , Sequence Analysis, Protein/methods , Tumor Suppressor Protein p53/chemistry , Crystallography, X-Ray , Databases, Protein , Humans , Molecular Sequence Annotation , Protein Structure, Tertiary
9.
Nucleic Acids Res ; 42(Database issue): D352-7, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24311564

ABSTRACT

RepeatsDB (http://repeatsdb.bio.unipd.it/) is a database of annotated tandem repeat protein structures. Tandem repeats pose a difficult problem for the analysis of protein structures, as the underlying sequence can be highly degenerate. Several repeat types have been studied over the years, but their annotation was done on a case-by-case basis, thus making large-scale analysis difficult. We developed RepeatsDB to fill this gap. Using state-of-the-art repeat detection methods and manual curation, we systematically annotated the Protein Data Bank, predicting 10,745 repeat structures. In all, 2797 structures were classified according to a recently proposed classification schema, which was expanded to accommodate new findings. In addition, detailed annotations were performed in a subset of 321 proteins. These annotations feature information on start and end positions for the repeat regions and units. RepeatsDB is an ongoing effort to systematically classify and annotate structural protein repeats in a consistent way. It provides users with the possibility to access and download high-quality datasets either interactively or programmatically through web services.


Subject(s)
Databases, Protein , Repetitive Sequences, Amino Acid , Internet , Molecular Sequence Annotation , Protein Conformation
10.
Amino Acids ; 47(12): 2583-92, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26215734

ABSTRACT

Protein function prediction from sequence using the Gene Ontology (GO) classification is useful in many biological problems. It has recently attracted increasing interest, thanks in part to the Critical Assessment of Function Annotation (CAFA) challenge. In this paper, we introduce Guilty by Association on STRING (GAS), a tool to predict protein function exploiting protein-protein interaction networks without sequence similarity. The assumption is that whenever a protein interacts with other proteins, it is part of the same biological process and located in the same cellular compartment. GAS retrieves interaction partners of a query protein from the STRING database and measures enrichment of the associated functional annotations to generate a sorted list of putative functions. A performance evaluation based on CAFA metrics and a fair comparison with optimized BLAST similarity searches is provided. The consensus of GAS and BLAST is shown to improve overall performance. The PPI approach is shown to outperform similarity searches for biological process and cellular compartment GO predictions. Moreover, an analysis of the best practices to exploit protein-protein interaction networks is also provided.


Subject(s)
Protein Interaction Mapping , Protein Interaction Maps , Proteins/chemistry , Algorithms , Automation , Computational Biology , Data Mining , Databases, Protein , Genome, Fungal , Pattern Recognition, Automated , Reproducibility of Results , Software
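The abstract of record 10 describes retrieving the interaction partners of a query protein and measuring enrichment of their functional annotations to produce a sorted list of putative functions. The sketch below illustrates that idea with a hypergeometric test, which is a common choice for enrichment but is an assumption here: the abstract does not name the statistic GAS uses. The function names, the placeholder term identifiers, and the background counts are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, K of which carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    # comb() returns 0 when a draw is impossible, so no extra bounds checks.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def rank_terms(partner_terms, background_counts, background_size):
    """Sort annotation terms by enrichment among interaction partners.

    partner_terms: list of sets, one set of terms per interaction partner.
    background_counts: {term: number of background proteins with that term}.
    background_size: total number of proteins in the background.
    Returns terms ordered from most to least enriched (smallest p-value first).
    """
    n = len(partner_terms)
    counts = {}
    for terms in partner_terms:
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    scored = [
        (hypergeom_pvalue(k, n, background_counts[term], background_size), term)
        for term, k in counts.items()
    ]
    return [term for _, term in sorted(scored)]


# Hypothetical partners and background frequencies, for illustration only.
partners = [{"GO:A", "GO:B"}, {"GO:A"}, {"GO:A"}]
ranked = rank_terms(partners, {"GO:A": 5, "GO:B": 50}, 100)
```

In this toy example the rare term shared by all three partners ranks above the common term seen once, which is the behavior an enrichment score is meant to capture.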
11.
BMC Genomics ; 15 Suppl 4: S7, 2014.
Article in English | MEDLINE | ID: mdl-25057121

ABSTRACT

BACKGROUND: The rapid growth of un-annotated missense variants poses challenges requiring novel strategies for their interpretation. From the thermodynamic point of view, amino acid changes can lead to a change in the internal energy of a protein and induce structural rearrangements. This is of great relevance for the study of diseases and protein design, justifying the development of prediction methods for variant-induced stability changes. RESULTS: Here we propose NeEMO, a tool for the evaluation of stability changes using an effective representation of proteins based on residue interaction networks (RINs). RINs are used to extract useful features describing interactions of the mutant amino acid with its structural environment. Benchmarking shows NeEMO to be very effective, allowing reliable predictions in different parts of the protein such as β-strands and buried residues. Validation on a previously published independent dataset shows that NeEMO has a Pearson correlation coefficient of 0.77 and a standard error of 1 kcal/mol, outperforming nine recent methods. The NeEMO web server can be freely accessed from URL: http://protein.bio.unipd.it/neemo/. CONCLUSIONS: NeEMO offers an innovative and reliable tool for the annotation of amino acid changes. RINs are a key contribution and can be used for modeling proteins and their interactions effectively. Interestingly, the approach is very general, and can motivate the development of a new family of RIN-based protein structure analyzers. NeEMO may suggest innovative strategies for bioinformatics tools beyond protein stability prediction.


Subject(s)
Computational Biology , Proteins/metabolism , Amino Acids/chemistry , Amino Acids/metabolism , Internet , Mutation , Protein Interaction Maps , Protein Stability , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Thermodynamics , User-Computer Interface
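The abstract of record 11 reports validation in terms of a Pearson correlation coefficient between predicted and measured stability changes. As a minimal sketch of how that coefficient is computed, the function below implements the standard formula in plain Python. The function name `pearson` and the example values are assumptions for illustration and are not taken from the NeEMO benchmark.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products and squares; the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical measured vs. predicted stability changes (kcal/mol).
measured = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
r = pearson(measured, predicted)
```

The coefficient measures linear agreement only, so it is usually reported together with an error measure in physical units, as the abstract does.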
12.
PLoS One ; 10(4): e0124579, 2015.
Article in English | MEDLINE | ID: mdl-25893845

ABSTRACT

Over the last decade, we have witnessed an incredible growth in the amount of available genotype data due to high-throughput sequencing (HTS) techniques. This information may be used to predict phenotypes of medical relevance, and pave the way towards personalized medicine. Blood phenotypes (e.g. ABO and Rh) are purely genetic traits that have been extensively studied for decades, with currently over thirty known blood groups. Given the public availability of blood group data, it is of interest to predict these phenotypes from HTS data, which may translate into more accurate blood typing in clinical practice. Here we propose BOOGIE, a fast predictor for the inference of blood groups from single nucleotide variant (SNV) databases. We focus on the prediction of thirty blood groups ranging from the well-known ABO and Rh to the less studied Junior or Diego. BOOGIE correctly predicted the blood group with 94% accuracy for the Personal Genome Project whole genome profiles where good quality SNV annotation was available. Additionally, our tool produces a high quality haplotype phase, which is of interest in the context of ethnicity-specific polymorphisms or traits. The versatility and simplicity of the analysis make it easily interpretable and allow easy extension of the protocol towards other phenotypes. BOOGIE can be downloaded from URL http://protein.bio.unipd.it/download/.


Subject(s)
Blood Group Antigens/genetics , High-Throughput Nucleotide Sequencing/methods , Software , ABO Blood-Group System/genetics , Exons/genetics , Genome, Human , Haplotypes/genetics , Humans , Molecular Sequence Annotation , Mutation/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics
13.
PLoS One ; 9(6): e96986, 2014.
Article in English | MEDLINE | ID: mdl-24886840

ABSTRACT

Von Hippel-Lindau (VHL) syndrome is a hereditary condition predisposing to the development of different cancer forms, related to germline inactivation of the homonymous tumor suppressor pVHL. The best characterized function of pVHL is the ubiquitination-dependent degradation of Hypoxia Inducible Factor (HIF) via the proteasome. It is also involved in several cellular pathways, acting as a molecular hub and interacting with more than 200 different proteins. Molecular details of pVHL plasticity remain in large part unknown. Here, we present a novel manually curated Petri Net (PN) model of the main pVHL functional pathways. The model was built using functional information derived from the literature. It includes all major pVHL functions and is able to credibly reproduce VHL syndrome at the molecular level. The reliability of the PN model also allowed in silico knockout experiments, driven by previous model analysis. Interestingly, PN analysis suggests that the variability of different VHL manifestations is correlated with the concomitant inactivation of different metabolic pathways.


Subject(s)
Algorithms , Models, Biological , Protein Interaction Maps , Von Hippel-Lindau Tumor Suppressor Protein/metabolism , Cluster Analysis , Computer Simulation , Gene Knockout Techniques , Humans , Hypoxia-Inducible Factor 1, alpha Subunit/genetics , Hypoxia-Inducible Factor 1, alpha Subunit/metabolism , Transcription, Genetic