ABSTRACT
Recommendations from the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) for interpreting sequence variants specify the use of computational predictors as "supporting" level of evidence for pathogenicity or benignity using criteria PP3 and BP4, respectively. However, score intervals defined by tool developers, and ACMG/AMP recommendations that require the consensus of multiple predictors, lack quantitative support. Previously, we described a probabilistic framework that quantified the strengths of evidence (supporting, moderate, strong, very strong) within ACMG/AMP recommendations. We have extended this framework to computational predictors and introduce a new standard that converts a tool's scores to PP3 and BP4 evidence strengths. Our approach is based on estimating the local positive predictive value and can calibrate any computational tool or other continuous-scale evidence on any variant type. We estimate thresholds (score intervals) corresponding to each strength of evidence for pathogenicity and benignity for thirteen missense variant interpretation tools, using carefully assembled independent data sets. Most tools achieved supporting evidence level for both pathogenic and benign classification using newly established thresholds. Multiple tools reached score thresholds justifying moderate and several reached strong evidence levels. One tool reached very strong evidence level for benign classification on some variants. Based on these findings, we provide recommendations for evidence-based revisions of the PP3 and BP4 ACMG/AMP criteria using individual tools and future assessment of computational methods for clinical interpretation.
Subjects
Calibration , Humans , Consensus , Educational Status , Virulence
ABSTRACT
BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams with identifying causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants. Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria is needed.
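The maximum F-measure named in the abstract above can be illustrated with a minimal sketch. It assumes a flat list of EPCR scores with true/false causal labels and sweeps every observed score as a threshold; the function name `max_f_measure` and the toy data are hypothetical, and the challenge's actual scoring (for example, any per-family handling) may differ.

```python
from typing import Sequence


def max_f_measure(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Maximum F1 over all thresholds t, calling a variant causal when score >= t."""
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        return 0.0
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is an observed score
        recall = tp / total_pos
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data (hypothetical): five candidate variants, two of them causal.
epcr = [0.9, 0.8, 0.4, 0.3, 0.1]
is_causal = [True, False, True, False, False]
print(round(max_f_measure(epcr, is_causal), 3))
```

In this toy case the best trade-off is at the threshold 0.4, where both causal variants are recovered with one false positive.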
Subjects
Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Genome, Human/genetics , Genetic Variation/genetics , Computational Biology/methods , Phenotype
ABSTRACT
PURPOSE: To investigate the number of rare missense variants observed in human genome sequences by ACMG/AMP PP3/BP4 evidence strength, following the ClinGen-calibrated PP3/BP4 computational recommendations. METHODS: Missense variants from the genome sequences of 300 probands from the Rare Genomes Project with suspected rare disease were analyzed using computational prediction tools that were able to reach PP3_Strong and BP4_Moderate evidence strengths (BayesDel, MutPred2, REVEL, and VEST4). The numbers of variants at each evidence strength were analyzed across disease-associated genes and genome-wide. RESULTS: From a median of 75.5 rare (≤1% allele frequency) missense variants in disease-associated genes per proband, a median of one reached PP3_Strong, 3-5 PP3_Moderate, and 3-5 PP3_Supporting. Most were allocated BP4 evidence (median 41-49 per proband) or were indeterminate (median 17.5-19 per proband). Extending the analysis to all protein-coding genes genome-wide, the number of variants reaching PP3_Strong score thresholds increased approximately 2.6-fold compared with disease-associated genes, with a median per proband of 1-3 PP3_Strong, 8-16 PP3_Moderate, and 10-17 PP3_Supporting. CONCLUSION: A small number of variants per proband reached PP3_Strong and PP3_Moderate in 3424 disease-associated genes. Although not the intended use of the recommendations, this was also observed genome-wide. Use of PP3/BP4 evidence as recommended from calibrated computational prediction tools in the clinical diagnostic laboratory is unlikely to inappropriately contribute to the classification of an excessive number of variants as pathogenic or likely pathogenic by ACMG/AMP rules.
ABSTRACT
BACKGROUND & AIMS: Genetic variants affecting liver disease risk vary among racial and ethnic groups. Hispanics/Latinos in the United States have a high prevalence of PNPLA3 I148M, which increases liver disease risk, and a low prevalence of HSD17B13 predicted loss-of-function (pLoF) variants, which reduce risk. Less is known about the prevalence of liver disease-associated variants among Hispanic/Latino subpopulations defined by country of origin and genetic ancestry. We evaluated the prevalence of HSD17B13 pLoF variants and PNPLA3 I148M, and their associations with quantitative liver phenotypes in Hispanic/Latino participants from an electronic health record-linked biobank in New York City. METHODS: This study included 8739 adult Hispanic/Latino participants of the BioMe biobank with genotyping and exome sequencing data. We estimated the prevalence of Hispanic/Latino individuals harboring HSD17B13 and PNPLA3 variants, stratified by genetic ancestry, and performed association analyses between variants and liver enzymes and Fibrosis-4 (FIB-4) scores. RESULTS: Individuals with ancestry from Ecuador and Mexico had the lowest frequency of HSD17B13 pLoF variants (10%/7%) and the highest frequency of PNPLA3 I148M (54%/65%). These ancestry groups had the highest outpatient alanine aminotransferase (ALT) and aspartate aminotransferase (AST) levels, and the largest proportion of individuals with a FIB-4 score greater than 2.67. HSD17B13 pLoF variants were associated with reduced ALT level (P = .002), AST level (P < .001), and FIB-4 score (P = .045). PNPLA3 I148M was associated with increased ALT level, AST level, and FIB-4 score (P < .001 for all). HSD17B13 pLoF variants mitigated the increase in ALT conferred by PNPLA3 I148M (P = .006). CONCLUSIONS: Variation in HSD17B13 and PNPLA3 variants across genetic ancestry groups may contribute to differential risk for liver fibrosis among Hispanic/Latino individuals.
Subjects
Liver Cirrhosis , Non-alcoholic Fatty Liver Disease , Humans , Genetic Predisposition to Disease , Hispanic or Latino/genetics , Liver Cirrhosis/enzymology , Liver Cirrhosis/genetics , Non-alcoholic Fatty Liver Disease/enzymology , Non-alcoholic Fatty Liver Disease/genetics , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Low back pain (LBP) is a common condition made up of a variety of anatomic and clinical subtypes. Lumbar disc herniation (LDH) and lumbar spinal stenosis (LSS) are two subtypes highly associated with LBP. Patients with LDH/LSS are often started on non-surgical treatments and, if those are not effective, go on to have decompression surgery. However, recommendation of surgery is complicated as the outcome may depend on the patient's health characteristics. We developed a deep learning (DL) model to predict decompression surgery for patients with LDH/LSS. MATERIALS AND METHOD: We used datasets of 8387 and 8620 patients from a prospective study that collected data from four healthcare systems to predict early (within 2 months) and late surgery (within 12 months after a 2-month gap), respectively. We developed a DL model to use patients' demographics, diagnosis and procedure codes, drug names, and diagnostic imaging reports to predict surgery. For each prediction task, we evaluated the model's performance using classical and generalizability evaluation. For classical evaluation, we split the data into training (80%) and testing (20%). For generalizability evaluation, we split the data based on the healthcare system. We used the area under the curve (AUC) to assess performance for each evaluation. We compared results to a benchmark model (i.e. LASSO logistic regression). RESULTS: For classical performance, the DL model outperformed the benchmark model for early surgery with an AUC of 0.725 compared to 0.597. For late surgery, the DL model outperformed the benchmark model with an AUC of 0.655 compared to 0.635. For generalizability performance, the DL model outperformed the benchmark model for early surgery. For late surgery, the benchmark model outperformed the DL model. CONCLUSIONS: For early surgery, the DL model was preferred for classical and generalizability evaluation. However, for late surgery, the benchmark and DL model had comparable performance. Depending on the prediction task, the balance of performance may shift between DL and a conventional ML method. As a result, thorough assessment is needed to quantify the value of DL, a relatively computationally expensive, time-consuming and less interpretable method.
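The AUC used to compare the two models above can be sketched as follows, assuming "AUC" refers to the area under the ROC curve. The sketch uses the rank interpretation of that quantity: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical and are not the study's data or code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): predicted surgery risk for five patients.
risk = [0.9, 0.7, 0.6, 0.4, 0.2]
had_surgery = [True, False, True, False, False]
print(round(roc_auc(risk, had_surgery), 3))
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; a sort-based computation would be preferable at the scale of thousands of patients.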
Subjects
Deep Learning , Intervertebral Disc Displacement , Low Back Pain , Spinal Stenosis , Humans , Decompression, Surgical/adverse effects , Decompression, Surgical/methods , Prospective Studies , Lumbar Vertebrae/surgery , Low Back Pain/diagnosis , Low Back Pain/surgery , Low Back Pain/complications , Intervertebral Disc Displacement/surgery , Spinal Stenosis/surgery , Treatment Outcome , Retrospective Studies
ABSTRACT
BACKGROUND: The increasing adoption of electronic health record (EHR) systems enables automated, large-scale, and meaningful analysis of regional population health. We explored how EHR systems could inform surveillance of trauma-related emergency department visits arising from seasonal, holiday-related, and rare environmental events. METHODS: We analyzed temporal variation in diagnosis codes over 24 years of trauma visit data at the three hospitals in the University of Washington Medicine system in Seattle, Washington, USA. We identified seasons and days in which specific codes and categories of codes were statistically enriched, meaning that a significantly greater than average proportion of trauma visits included a given diagnosis code during that time period. RESULTS: We confirmed known seasonal patterns in emergency department visits for trauma. As expected, cold weather-related incidents (e.g. frostbite, snowboarding injury) were enriched in the winter, whereas fair weather-related incidents (e.g. bug bites, boating accidents, bicycle accidents) were enriched in the spring and summer. Our analysis of specific days of the year found that holidays were enriched for alcohol poisoning, assaults, and firework accidents. We also detected one-time regional events such as the 2001 Nisqually earthquake and the 2006 Hanukkah Eve Windstorm. CONCLUSIONS: Though EHR systems were developed to serve operational rather than analytic priorities and have consequent limitations for surveillance, our EHR enrichment analysis nonetheless re-identified expected temporal population health patterns. EHRs are potentially a valuable source of information to inform public health policy, both in retrospective analysis and in a surveillance capacity.
Subjects
Electronic Health Records , Emergency Service, Hospital/statistics & numerical data , Poisoning/epidemiology , Population Surveillance/methods , Wounds and Injuries/epidemiology , Holidays , Humans , Poisoning/therapy , Seasons , Washington/epidemiology , Weather , Wounds and Injuries/therapy
ABSTRACT
Thermodynamic stability is a fundamental property shared by all proteins. Changes in stability due to mutation are a widespread molecular mechanism in genetic diseases. Methods for the prediction of mutation-induced stability change have typically been developed and evaluated on incomplete and/or biased data sets. As part of the Critical Assessment of Genome Interpretation, we explored the utility of high-throughput variant stability profiling (VSP) assay data as an alternative for the assessment of computational methods and evaluated state-of-the-art predictors against over 7,000 nonsynonymous variants from two proteins. We found that predictions were modestly correlated with actual experimental values. Predictors fared better when evaluated as classifiers of extreme stability effects. Because different methods emerged as top performers depending on the metric, it is nontrivial to draw conclusions on their adoption or improvement. Our analyses revealed that only 16% of all variants in VSP assays could be confidently defined as stability-affecting. Furthermore, it is unclear to what extent VSP abundance scores were reasonable proxies for the stability-related quantities that participating methods were designed to predict. Overall, our observations underscore the need for clearly defined objectives when developing and using both computational and experimental methods in the context of measuring variant impact.
Subjects
Computational Biology/methods , Methyltransferases/chemistry , Mutation , PTEN Phosphohydrolase/chemistry , High-Throughput Nucleotide Sequencing , Humans , Methyltransferases/genetics , PTEN Phosphohydrolase/genetics , Protein Stability
ABSTRACT
The availability of disease-specific genomic data is critical for developing new computational methods that predict the pathogenicity of human variants and advance the field of precision medicine. However, the lack of gold standards to properly train and benchmark such methods is one of the greatest challenges in the field. In response to this challenge, the scientific community is invited to participate in the Critical Assessment of Genome Interpretation (CAGI), where unpublished disease variants are available for classification by in silico methods. As part of the CAGI-5 challenge, we evaluated the performance of 18 submissions and three additional methods in predicting the pathogenicity of single nucleotide variants (SNVs) in checkpoint kinase 2 (CHEK2) for cases of breast cancer in Hispanic females. As part of the assessment, the efficacy of the analysis method and the setup of the challenge were also considered. The results indicated that though the challenge could benefit from additional participant data, the combined generalized linear model analysis and odds of pathogenicity analysis provided a framework to evaluate the methods submitted for SNV pathogenicity identification and for comparison to other available methods. The outcome of this challenge and the approaches used can help guide further advancements in identifying SNV-disease relationships.
Subjects
Breast Neoplasms/genetics , Checkpoint Kinase 2/genetics , Computational Biology/methods , Hispanic or Latino/genetics , Polymorphism, Single Nucleotide , Adult , Aged , Breast Neoplasms/ethnology , Case-Control Studies , Computer Simulation , Female , Genetic Predisposition to Disease , Humans , Linear Models , Middle Aged , United States/ethnology , Exome Sequencing
ABSTRACT
Testing for variation in BRCA1 and BRCA2 (commonly referred to as BRCA1/2), has emerged as a standard clinical practice and is helping countless women better understand and manage their heritable risk of breast and ovarian cancer. Yet the increased rate of BRCA1/2 testing has led to an increasing number of Variants of Uncertain Significance (VUS), and the rate of VUS discovery currently outpaces the rate of clinical variant interpretation. Computational prediction is a key component of the variant interpretation pipeline. In the CAGI5 ENIGMA Challenge, six prediction teams submitted predictions on 326 newly-interpreted variants from the ENIGMA Consortium. By evaluating these predictions against the new interpretations, we have gained a number of insights on the state of the art of variant prediction and specific steps to further advance this state of the art.
Subjects
BRCA1 Protein/genetics , BRCA2 Protein/genetics , Breast Neoplasms/diagnosis , Computational Biology/methods , Ovarian Neoplasms/diagnosis , Breast Neoplasms/genetics , Early Detection of Cancer , Female , Genetic Predisposition to Disease , Genetic Testing , Genetic Variation , Humans , Models, Genetic , Ovarian Neoplasms/genetics
ABSTRACT
The NAGLU challenge of the fourth edition of the Critical Assessment of Genome Interpretation experiment (CAGI4) in 2016 invited participants to predict the impact of variants of unknown significance (VUS) on the enzymatic activity of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). Deficiencies in NAGLU activity lead to a rare, monogenic, recessive lysosomal storage disorder, Sanfilippo syndrome type B (MPS type IIIB). This challenge attracted 17 submissions from 10 groups. We observed that top models were able to predict the impact of missense mutations on enzymatic activity with Pearson's correlation coefficients of up to .61. We also observed that top methods were significantly more correlated with each other than they were with observed enzymatic activity values, which we believe speaks to the importance of sequence conservation across the different methods. Improved functional predictions on the VUS will help population-scale analysis of disease epidemiology and rare variant association analysis.
Subjects
Acetylglucosaminidase/metabolism , Computational Biology/methods , Mutation, Missense , Acetylglucosaminidase/genetics , Humans , Models, Genetic , Regression Analysis
ABSTRACT
The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.
Subjects
Disease/genetics , Mutation, Missense/genetics , Software , Area Under Curve , DNA Mutational Analysis , Exome/genetics , Gene Frequency , Humans , ROC Curve
ABSTRACT
The digital world is generating data at a staggering and still increasing rate. While these "big data" have unlocked novel opportunities to understand public health, they hold still greater potential for research and practice. This review explores several key issues that have arisen around big data. First, we propose a taxonomy of sources of big data to clarify terminology and identify threads common across some subtypes of big data. Next, we consider common public health research and practice uses for big data, including surveillance, hypothesis-generating research, and causal inference, while exploring the role that machine learning may play in each use. We then consider the ethical implications of the big data revolution with particular emphasis on maintaining appropriate care for privacy in a world in which technology is rapidly changing social norms regarding the need for (and even the meaning of) privacy. Finally, we make suggestions regarding structuring teams and training to succeed in working with big data in research and practice.
Subjects
Big Data , Machine Learning , Public Health , Research/organization & administration , Causality , Confidentiality/standards , Humans , Population Surveillance/methods , Research/standards , Terminology as Topic
ABSTRACT
MOTIVATION: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease. RESULTS: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1142 de novo variants from neurodevelopmental disorders and find enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants. AVAILABILITY AND IMPLEMENTATION: http://mutpred.mutdb.org. CONTACT: predrag@indiana.edu.
Subjects
Loss of Function Mutation , Machine Learning , Proteins/genetics , Sequence Analysis, Protein/methods , Software , Computational Biology/methods , Humans , Protein Conformation , Proteins/metabolism , Proteins/physiology
ABSTRACT
The steady advances in machine learning and accumulation of biomedical data have contributed to the development of numerous computational models that assess the impact of missense variants. Different methods, however, operationalize impact differently. Two common tasks in this context are the prediction of the pathogenicity of variants and the prediction of their effects on a protein's function. These are related but distinct problems, and it is unclear whether methods developed for one are optimized for the other. The Critical Assessment of Genome Interpretation (CAGI) experiment provides a means to address this question empirically. To this end, we participated in various protein-specific challenges in CAGI with two objectives in mind. First, to compare the performance of methods in the MutPred family with the state-of-the-art. Second and more importantly, to investigate the applicability of general-purpose pathogenicity predictors to the classification of specific function-altering variants without additional training or calibration. We find that our pathogenicity predictors performed competitively with other methods, outputting score distributions in agreement with experimental outcomes. Overall, we conclude that binary classifiers learned from disease-causing mutations are capable of modeling important aspects of the underlying biology and the alteration of protein function resulting from mutations.
Subjects
Computational Biology/methods , Mutation, Missense , Proteins/genetics , Databases, Genetic , Genetic Predisposition to Disease , Humans , Machine Learning
ABSTRACT
Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome-sequencing data: Crohn's disease, bipolar disorder, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction as well as the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.
Subjects
Bipolar Disorder/genetics , Crohn Disease/genetics , Exome Sequencing/methods , Precision Medicine/methods , Warfarin/therapeutic use , Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Humans , Information Dissemination , Pharmacogenomic Variants , Phenotype , Warfarin/pharmacology
ABSTRACT
Elucidating the precise molecular events altered by disease-causing genetic variants represents a major challenge in translational bioinformatics. To this end, many studies have investigated the structural and functional impact of amino acid substitutions. Most of these studies were however limited in scope to either individual molecular functions or were concerned with functional effects (e.g. deleterious vs. neutral) without specifically considering possible molecular alterations. The recent growth of structural, molecular and genetic data presents an opportunity for more comprehensive studies to consider the structural environment of a residue of interest, to hypothesize specific molecular effects of sequence variants and to statistically associate these effects with genetic disease. In this study, we analyzed data sets of disease-causing and putatively neutral human variants mapped to protein 3D structures as part of a systematic study of the loss and gain of various types of functional attribute potentially underlying pathogenic molecular alterations. We first propose a formal model to assess probabilistically function-impacting variants. We then develop an array of structure-based functional residue predictors, evaluate their performance, and use them to quantify the impact of disease-causing amino acid substitutions on catalytic activity, metal binding, macromolecular binding, ligand binding, allosteric regulation and post-translational modifications. We show that our methodology generates actionable biological hypotheses for up to 41% of disease-causing genetic variants mapped to protein structures suggesting that it can be reliably used to guide experimental validation. Our results suggest that a significant fraction of disease-causing human variants mapping to protein structures are function-altering both in the presence and absence of stability disruption.
Subjects
Amino Acid Sequence/genetics , Disease/genetics , Models, Statistical , Mutation/genetics , Amino Acid Substitution/genetics , Computational Biology , Computer Simulation , Humans , Models, Molecular , Protein Binding
ABSTRACT
Cross sections for 61 palmitoylated peptides and 73 cysteine-unmodified peptides are determined and used together with a previously obtained tryptic peptide library to derive a set of intrinsic size parameters (ISPs) for the palmitoyl (Pal) group (1.26 ± 0.04), carboxyamidomethyl (Am) group (0.92 ± 0.04), and the 20 amino acid residues to assess the influence of Pal- and Am-modification on cysteine and other amino acid residues. These values highlight the influence of the intrinsic hydrophobic and hydrophilic nature of these modifications on the overall cross sections. As a part of this analysis, we find that ISPs derived from a database of a modifier on one amino acid residue (CysPal) can be applied on the same modification group on different amino acid residues (SerPal and TyrPal). Using these ISP values, we are able to calculate peptide cross sections to within ± 2% of experimental values for 83% of Pal-modified peptide ions and 63% of Am-modified peptide ions. We propose that modification groups should be treated as individual contribution factors, instead of treating the combination of the particular group and the amino acid residue they are on as a whole when considering their effects on the peptide ion mobility features.
ABSTRACT
Purpose: To investigate the number of rare missense variants observed in human genome sequences by ACMG/AMP PP3/BP4 evidence strength, following the calibrated PP3/BP4 computational recommendations. Methods: Missense variants from the genome sequences of 300 probands from the Rare Genomes Project with suspected rare disease were analyzed using computational prediction tools able to reach PP3_Strong and BP4_Moderate evidence strengths (BayesDel, MutPred2, REVEL, and VEST4). The numbers of variants at each evidence strength were analyzed across disease-associated genes and genome-wide. Results: From a median of 75.5 rare (≤1% allele frequency) missense variants in disease-associated genes per proband, a median of one reached PP3_Strong, 3-5 PP3_Moderate, and 3-5 PP3_Supporting. Most were allocated BP4 evidence (median 41-49 per proband) or were indeterminate (median 17.5-19 per proband). Extending the analysis to all protein-coding genes genome-wide, the number of PP3_Strong variants increased approximately 2.6-fold compared to disease-associated genes, with a median per proband of 1-3 PP3_Strong, 8-16 PP3_Moderate, and 10-17 PP3_Supporting. Conclusion: A small number of variants per proband reached PP3_Strong and PP3_Moderate in 3,424 disease-associated genes, and though not the intended use of the recommendations, also genome-wide. Use of PP3/BP4 evidence as recommended from calibrated computational prediction tools in the clinical diagnostic laboratory is unlikely to inappropriately contribute to the classification of an excessive number of variants as Pathogenic or Likely Pathogenic by ACMG/AMP rules.
ABSTRACT
Purpose: We previously developed an approach to calibrate computational tools for clinical variant classification, updating recommendations for the reliable use of variant impact predictors to provide evidence strength up to Strong. A new generation of tools using distinctive approaches has since been released, and these methods must be independently calibrated for clinical application. Method: Using our local posterior probability-based calibration and our established data set of ClinVar pathogenic and benign variants, we determined the strength of evidence provided by three new tools (AlphaMissense, ESM1b, VARITY) and calibrated scores meeting each evidence strength. Results: All three tools reached the Strong level of evidence for variant pathogenicity and Moderate for benignity, though sometimes for few variants. Compared to previously recommended tools, these yielded at best only modest improvements in the tradeoffs of evidence strength and false positive predictions. Conclusion: At calibrated thresholds, three new computational predictors provided evidence for variant pathogenicity at similar strength to the four previously recommended predictors (and comparable with functional assays for some variants). This calibration broadens the scope of computational tools for application in clinical variant classification. Their new approaches offer promise for future advancement of the field.
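A minimal sketch of how a local posterior probability might be translated into an evidence strength for pathogenicity. It assumes a prior probability of pathogenicity of 0.10 and odds of pathogenicity of 350 for very strong evidence, with each lower level taken as a successive square root of those odds, as in the earlier Bayesian formulation of the ACMG/AMP framework. The calibration described in the abstract may use different constants, and the function names here are hypothetical.

```python
# ASSUMPTIONS: the prior and the very-strong odds are illustrative constants,
# not values stated in the abstract.
PRIOR = 0.10
ODDS_VERY_STRONG = 350.0

# Each evidence level scales the very-strong odds by a power of one half.
EXPONENTS = {
    "Supporting": 1 / 8,
    "Moderate": 1 / 4,
    "Strong": 1 / 2,
    "Very Strong": 1.0,
}


def posterior_threshold(strength, prior=PRIOR, odds_vs=ODDS_VERY_STRONG):
    """Posterior probability implied by the odds for one evidence strength."""
    odds = odds_vs ** EXPONENTS[strength]
    # Posterior from prior and odds of pathogenicity: O*p / ((O - 1)*p + 1).
    return odds * prior / ((odds - 1.0) * prior + 1.0)


def pathogenic_strength(local_posterior, prior=PRIOR, odds_vs=ODDS_VERY_STRONG):
    """Strongest evidence level whose threshold the local posterior meets."""
    for strength in ("Very Strong", "Strong", "Moderate", "Supporting"):
        if local_posterior >= posterior_threshold(strength, prior, odds_vs):
            return strength
    return None  # posterior too low to count as evidence for pathogenicity


if __name__ == "__main__":
    for level in EXPONENTS:
        print(level, round(posterior_threshold(level), 3))
    print(pathogenic_strength(0.70))
```

Under these assumed constants, the score interval for an evidence strength would be the set of tool scores whose estimated local posterior meets that threshold; the benign direction would need an analogous check against the reciprocal odds.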
ABSTRACT
Critical evaluation of computational tools for predicting variant effects is important given their increasing use in disease diagnosis and in driving molecular discoveries. In the sixth edition of the Critical Assessment of Genome Interpretation (CAGI) challenge, a dataset of 28 STK11 rare variants (27 missense, 1 single amino acid deletion), identified in primary non-small cell lung cancer biopsies, was experimentally assayed to characterize computational methods from four participating teams and five publicly available tools. Predictors demonstrated a high level of performance on key evaluation metrics, measuring correlation with the assay outputs and separating loss-of-function (LoF) variants from wildtype-like (WT-like) variants. The best participant model, 3Cnet, performed competitively with well-known tools. Unique to this challenge was that the functional data was generated with both biological and technical replicates, thus allowing the assessors to realistically establish maximum predictive performance based on experimental variability. Three out of the five publicly available tools and 3Cnet approached the performance of the assay replicates in separating LoF variants from WT-like variants. Surprisingly, REVEL, an often-used model, achieved a correlation with the real-valued assay output comparable to that seen for the experimental replicates. Performing variant interpretation by combining the new functional evidence with computational and population data evidence led to 16 new variants receiving a clinically actionable classification of likely pathogenic (LP) or likely benign (LB). Overall, the STK11 challenge highlights the utility of variant effect predictors in biomedical sciences and provides encouraging results for driving research in the field of computational genome interpretation.