ABSTRACT
BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams with identifying causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants.
Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria are needed.
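The maximum F-measure named in the Methods can be illustrated with a minimal sketch. It assumes predictions are pooled as (variant, EPCR) pairs and that every distinct EPCR value is tried as a threshold; the function name `max_f_measure`, the variant identifiers, and the toy scores are hypothetical, and the challenge's actual scoring (for example, any per-family handling) may differ.

```python
from typing import Iterable, Set, Tuple


def max_f_measure(predictions: Iterable[Tuple[str, float]], causal: Set[str]) -> float:
    """Return the best F-measure over all EPCR thresholds.

    predictions: (variant_id, epcr) pairs; causal: set of true causal variant ids.
    Assumption: a variant is "called" when its EPCR is >= the threshold.
    """
    preds = list(predictions)
    if not causal:
        raise ValueError("at least one causal variant is required")
    best = 0.0
    # Each distinct EPCR value is a candidate threshold.
    for threshold in sorted({epcr for _, epcr in preds}, reverse=True):
        called = {variant for variant, epcr in preds if epcr >= threshold}
        true_positives = len(called & causal)
        if true_positives == 0:
            continue  # F-measure is 0 when nothing correct is called
        precision = true_positives / len(called)
        recall = true_positives / len(causal)
        f_measure = 2 * precision * recall / (precision + recall)
        best = max(best, f_measure)
    return best


if __name__ == "__main__":
    # Toy, made-up data for illustration only.
    toy = [("v1", 0.9), ("v2", 0.8), ("v3", 0.4), ("v4", 0.1)]
    print(round(max_f_measure(toy, {"v1", "v3"}), 3))  # 0.8
```

In the toy input the best trade-off occurs at a threshold of 0.4, where both causal variants are recovered at the cost of one false positive.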
Subjects
Rare Diseases; Humans; Rare Diseases/genetics; Rare Diseases/diagnosis; Genome, Human/genetics; Genetic Variation/genetics; Computational Biology/methods; Phenotype
ABSTRACT
In the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID challenge, offering an opportunity to develop computational methods for predicting patients' phenotypes and causal variants. Eight research teams, with 30 models, had access to the phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by neurodevelopmental disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions with onset in infancy. In this study we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes, based on the clinical features described in each patient, and causal variants. Finally, we asked participants to develop a method to find new possible genetic causes for patients without a genetic diagnosis. As in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (causative, putative pathogenic, and contributing factors) were provided. Considering the overall clinical manifestation of our cohort, we provided the variant data and phenotypic traits of the 150 patients from the CAGI5 ID challenge for training and validation during development of the prediction methods.
ABSTRACT
Background: A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting. Methods: Predictors were provided a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on rank position of true positive causal variants and maximum F-measure, based on precision and recall of causal variants across EPCR thresholds. Results: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. 
In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency. Conclusions: By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.
ABSTRACT
Testing for variation in BRCA1 and BRCA2 (commonly referred to as BRCA1/2) has emerged as a standard clinical practice and is helping countless women better understand and manage their heritable risk of breast and ovarian cancer. Yet the increased rate of BRCA1/2 testing has led to an increasing number of Variants of Uncertain Significance (VUS), and the rate of VUS discovery currently outpaces the rate of clinical variant interpretation. Computational prediction is a key component of the variant interpretation pipeline. In the CAGI5 ENIGMA Challenge, six prediction teams submitted predictions on 326 newly interpreted variants from the ENIGMA Consortium. By evaluating these predictions against the new interpretations, we have gained a number of insights into the state of the art of variant prediction and into specific steps to advance it further.
Subjects
BRCA1 Protein/genetics; BRCA2 Protein/genetics; Breast Neoplasms/diagnosis; Computational Biology/methods; Ovarian Neoplasms/diagnosis; Breast Neoplasms/genetics; Early Detection of Cancer; Female; Genetic Predisposition to Disease; Genetic Testing; Genetic Variation; Humans; Models, Genetic; Ovarian Neoplasms/genetics
ABSTRACT
Whole-genome sequencing (WGS) holds great potential as a diagnostic test. However, the majority of patients currently undergoing WGS lack a molecular diagnosis, largely due to the vast number of undiscovered disease genes and our inability to assess the pathogenicity of most genomic variants. The CAGI SickKids challenges attempted to address this knowledge gap by assessing state-of-the-art methods for clinical phenotype prediction from genomes. CAGI4 and CAGI5 participants were provided with WGS data and clinical descriptions of 25 and 24 undiagnosed patients from the SickKids Genome Clinic Project, respectively. Predictors were asked to identify primary and secondary causal variants. In addition, for CAGI5, groups had to match each genome to one of three disorder categories (neurologic, ophthalmologic, and connective), and separately to each patient. The performance of matching genomes to categories was no better than random but two groups performed significantly better than chance in matching genomes to patients. Two of the ten variants proposed by two groups in CAGI4 were deemed to be diagnostic, and several proposed pathogenic variants in CAGI5 are good candidates for phenotype expansion. We discuss implications for improving in silico assessment of genomic variants and identifying new disease genes.
Subjects
Computational Biology/methods; Genetic Variation; Undiagnosed Diseases/diagnosis; Adolescent; Child; Child, Preschool; Computer Simulation; Databases, Genetic; Female; Genetic Predisposition to Disease; Humans; Male; Phenotype; Undiagnosed Diseases/genetics; Whole Genome Sequencing
ABSTRACT
The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016 invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to 0.61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population-scale analysis of disease epidemiology and rare variant association analysis.
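The comparison described above (each method against measured activity, and methods against one another) rests on Pearson's correlation coefficient. The following is a minimal sketch in plain Python; the function name `pearson_r` and the toy activity values are hypothetical and are not taken from the challenge data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up relative activities (fraction of wild type) for five variants.
    observed = [0.05, 0.40, 0.90, 0.20, 0.75]
    method_a = [0.10, 0.35, 0.80, 0.45, 0.60]
    method_b = [0.15, 0.30, 0.85, 0.50, 0.55]
    print("A vs observed:", pearson_r(method_a, observed))
    print("B vs observed:", pearson_r(method_b, observed))
    print("A vs B:       ", pearson_r(method_a, method_b))
```

Computing the coefficient for every pair of methods, alongside each method against the assay values, is one simple way to check whether predictors agree with each other more than with experiment.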
Subjects
Acetylglucosaminidase/metabolism; Computational Biology/methods; Mutation, Missense; Acetylglucosaminidase/genetics; Humans; Models, Genetic; Regression Analysis
ABSTRACT
Thermodynamic stability is a fundamental property shared by all proteins. Changes in stability due to mutation are a widespread molecular mechanism in genetic diseases. Methods for the prediction of mutation-induced stability change have typically been developed and evaluated on incomplete and/or biased data sets. As part of the Critical Assessment of Genome Interpretation, we explored the utility of high-throughput variant stability profiling (VSP) assay data as an alternative for the assessment of computational methods and evaluated state-of-the-art predictors against over 7,000 nonsynonymous variants from two proteins. We found that predictions were modestly correlated with actual experimental values. Predictors fared better when evaluated as classifiers of extreme stability effects. With different methods emerging as top performers depending on the metric, it is nontrivial to draw conclusions on their adoption or improvement. Our analyses revealed that only 16% of all variants in VSP assays could be confidently defined as stability-affecting. Furthermore, it is unclear to what extent VSP abundance scores were reasonable proxies for the stability-related quantities that participating methods were designed to predict. Overall, our observations underscore the need for clearly defined objectives when developing and using both computational and experimental methods in the context of measuring variant impact.
Subjects
Computational Biology/methods; Methyltransferases/chemistry; Mutation; PTEN Phosphohydrolase/chemistry; High-Throughput Nucleotide Sequencing; Humans; Methyltransferases/genetics; PTEN Phosphohydrolase/genetics; Protein Stability
ABSTRACT
The Critical Assessment of Genome Interpretation-5 intellectual disability challenge asked participants to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e., ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) were made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype was predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Also, causative or potentially pathogenic variants were predicted correctly by at least one group. However, predicting the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction was supported by rare or common variants in genes different from the causative one.
Subjects
Autism Spectrum Disorder/genetics; Computational Biology/methods; Intellectual Disability/genetics; Sequence Analysis, DNA/methods; Female; Genetic Predisposition to Disease; Humans; Male; Phenotype; Quantitative Trait Loci
ABSTRACT
A major challenge in genome interpretation is to estimate the fitness effect of coding variants of unknown significance (VUS). Labor, limited understanding of protein functions, and lack of assays generally limit direct experimental assessment of VUS, and make robust and accurate computational approaches a necessity. Often, however, algorithms that predict mutational effect disagree among themselves and with experimental data, slowing their adoption for clinical diagnostics. To objectively assess such methods, the Critical Assessment of Genome Interpretation (CAGI) community organizes contests to predict unpublished experimental data, available only to CAGI assessors. We review here the CAGI performance of evolutionary action (EA) predictions of mutational impact. EA models the fitness effect of coding mutations analytically, as a product of the gradient of the fitness landscape times the perturbation size. In practice, these terms are computed from phylogenetic considerations as the functional sensitivity of the mutated site and as the magnitude of amino acid substitution, respectively, and yield the percentage loss of wild-type activity. In five CAGI challenges, EA consistently performed on par with or better than sophisticated machine learning approaches. This objective assessment suggests that a simple differential model of evolution can interpret the fitness effect of coding variations, opening diverse clinical applications.
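The "gradient times perturbation size" description above can be written as a first-order expansion. The notation below is an assumption chosen for illustration: f is the fitness landscape, r_i the residue at sequence position i, X→Y the amino acid substitution, and Δφ the resulting fitness effect.

```latex
% Sketch of the evolutionary action relation, under assumed notation:
%   \partial f / \partial r_i : functional sensitivity of site i (gradient term)
%   \Delta r_{i, X \to Y}      : magnitude of the substitution X -> Y at site i
%   \Delta \varphi             : predicted fitness effect of the mutation
\[
  \Delta\varphi \;\approx\; \frac{\partial f}{\partial r_i}\,\Delta r_{i,\,X\to Y}
\]
```

Read this way, a mutation is predicted to matter most when a large substitution falls on a highly sensitive site, and least when either factor is small.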
Subjects
Computational Biology/methods; Mutation, Missense; Algorithms; Area Under Curve; Evolution, Molecular; Genetic Fitness; Humans; Models, Genetic; Phylogeny; Selection, Genetic
ABSTRACT
Dense liquid phases, metastable with respect to a solid phase, but stable with respect to the solution, have been known to form in solutions of proteins and small-molecule substances. Here, with the protein lumazine synthase as a test system, using dynamic and static light scattering and atomic force microscopy, we demonstrate submicron size clusters of dense liquid. In contrast to the macroscopic dense liquid, these clusters are metastable not only with respect to the crystals, but also with respect to the low-concentration solution: the characteristic cluster lifetime is limited to approximately 10 s, after which they decay. The cluster population is detectable only if they occupy >10⁻⁶ of the solution volume and have a number density >10⁵ cm⁻³ for 3 to 11% of the monitored time. The cluster volume fraction varies within wide limits and reaches up to 10⁻³. Increasing protein concentration increases the frequency of cluster detection but does not affect the ranges of the cluster sizes, suggesting that a preferred cluster size exists. A simple Monte Carlo model with protein-like potentials reproduces the metastable clusters of dense liquid with limited lifetimes and variable sizes and suggests that the mean cluster size is determined by the kinetics of growth and decay and not by thermodynamics.
Subjects
Proteins/chemistry; Algorithms; Bacillus subtilis/enzymology; Chemical Phenomena; Chemistry, Physical; Crystallization; Light; Models, Chemical; Monte Carlo Method; Multienzyme Complexes/chemistry; Scattering, Radiation; Solutions
ABSTRACT
The solvent around protein molecules in solutions is structured and this structuring introduces a repulsion in the intermolecular interaction potential at intermediate separations. We use Monte Carlo simulations with isotropic, pair-additive systems interacting with such potentials. We test if the liquid-liquid and liquid-solid phase lines in model protein solutions can be predicted from universal curves and a pair of experimentally determined parameters, as done for atomic and colloid materials using several laws of corresponding states. As predictors, we test three properties at the critical point for liquid-liquid separation: temperature, as in the original van der Waals law, the second virial coefficient, and a modified second virial coefficient, all paired with the critical volume fraction. We find that the van der Waals law is best obeyed and appears more general than its original formulation: A single universal curve describes all tested nonconformal isotropic pair-additive systems. Published experimental data for the liquid-liquid equilibrium for several proteins at various conditions follow a single van der Waals curve. For the solid-liquid equilibrium, we find that no single system property serves as its predictor. We go beyond corresponding-states correlations and put forth semiempirical laws, which allow prediction of the critical temperature and volume fraction solely based on the range of attraction of the intermolecular interaction potential.
Subjects
Biophysics/methods; Chemistry, Physical/methods; Proteins/chemistry; Computer Simulation; Hydrogen Bonding; Ions; Models, Statistical; Molecular Conformation; Monte Carlo Method; Solvents; Temperature; Thermodynamics
ABSTRACT
Recent experiments have revealed several surprising features of the phase equilibria in protein solutions: liquid-liquid phase separation which is, in some cases, metastable with respect to the liquid-solid equilibrium, and in others, unobservable; widely varying crystallization enthalpies, including completely athermal crystallization; the co-existence of several crystalline polymorphs; and others. Other studies have shown that the solvent molecules at the hydrophobic and polar patches on the protein molecular surfaces are structured, introducing repulsive forces at surface separations equal to several water molecule sizes. In search of a causal link between the latter and former findings, we apply Monte Carlo simulation techniques in the investigation of phase diagrams associated with globular biological molecules in solution. We account for the solvent structuring via short-range isotropic two-body intermolecular potentials exhibiting multiple extrema. We show that the introduction of a repulsive maximum or a secondary attractive minimum at separations longer than the primary attractive minimum has dramatic effects on the phase diagram: liquid-liquid separation curves are driven to lower or higher temperatures; the sensitivity of the solubility curve (liquidus) to temperature, i.e., the enthalpy of crystallization, is significantly reduced or enhanced; metastable liquid-liquid separation may become stable and vice versa; and both low- and high-density crystalline phases are observed. The similarity of these features of the simulated phase behavior to those observed experimentally suggests that at least some of the mysteries of the protein phase equilibria may be due to the structuring of the solvent around the protein molecular surfaces. Another conclusion is that at least some of the dense liquids seen in protein solutions may be stable and not metastable with respect to a solid phase.