LEAP: Using machine learning to support variant classification in a clinical setting.

Lai, Carmen; Zimmer, Anjali D; O'Connor, Robert; Kim, Serra; Chan, Ray; van den Akker, Jeroen; Zhou, Alicia Y; Topper, Scott; Mishne, Gilad.

Hum Mutat ; 41(6): 1079-1090, 2020 06.

Article in English | MEDLINE | ID: mdl-32176384

ABSTRACT

Advances in genome sequencing have led to a tremendous increase in the discovery of novel missense variants, but evidence for determining clinical significance can be limited or conflicting. Here, we present Learning from Evidence to Assess Pathogenicity (LEAP), a machine learning model that utilizes a variety of feature categories to classify variants, and achieves high performance in multiple genes and different health conditions. Feature categories include functional predictions, splice predictions, population frequencies, conservation scores, protein domain data, and clinical observation data such as personal and family history and covariant information. L2-regularized logistic regression and random forest classification models were trained on missense variants detected and classified during the course of routine clinical testing at Color Genomics (14,226 variants from 24 cancer-related genes and 5,398 variants from 30 cardiovascular-related genes). Using 10-fold cross-validated predictions, the logistic regression model achieved an area under the receiver operating characteristic curve (AUROC) of 97.8% (cancer) and 98.8% (cardiovascular), while the random forest model achieved 98.3% (cancer) and 98.6% (cardiovascular). We demonstrate generalizability to different genes by validating predictions on genes withheld from training (96.8% AUROC). High accuracy and broad applicability make LEAP effective in the clinical setting as a high-throughput quality control layer.
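The AUROC figures quoted in this abstract can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The sketch below is not the LEAP code. It is a minimal pure-Python illustration of that metric and of a k-fold split in which every item is held out exactly once. The function names `auroc` and `kfold_indices`, and the toy labels and scores, are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, held_out) index pairs; item j is held out in fold j % k."""
    splits = []
    for fold in range(k):
        held_out = [j for j in range(n_items) if j % k == fold]
        train = [j for j in range(n_items) if j % k != fold]
        splits.append((train, held_out))
    return splits


# Hypothetical toy data: 1 = pathogenic, 0 = benign.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

In a cross-validated evaluation, a model would be fit on each `train` split and scored on the matching `held_out` split, and the pooled held-out scores would then be passed to `auroc`.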

Subject(s)

Genomics/methods , Machine Learning , Models, Genetic , Mutation, Missense , Area Under Curve , Cardiovascular Diseases/genetics , Humans , Logistic Models , Models, Statistical , Neoplasms/genetics , ROC Curve

Assessment of blind predictions of the clinical significance of BRCA1 and BRCA2 variants.

Cline, Melissa S; Babbi, Giulia; Bonache, Sandra; Cao, Yue; Casadio, Rita; de la Cruz, Xavier; Díez, Orland; Gutiérrez-Enríquez, Sara; Katsonis, Panagiotis; Lai, Carmen; Lichtarge, Olivier; Martelli, Pier L; Mishne, Gilad; Moles-Fernández, Alejandro; Montalban, Gemma; Mooney, Sean D; O'Conner, Robert; Ootes, Lars; Özkan, Selen; Padilla, Natalia; Pagel, Kymberleigh A; Pejaver, Vikas; Radivojac, Predrag; Riera, Casandra; Savojardo, Castrense; Shen, Yang; Sun, Yuanfei; Topper, Scott; Parsons, Michael T; Spurdle, Amanda B; Goldgar, David E.

Hum Mutat ; 40(9): 1546-1556, 2019 09.

Article in English | MEDLINE | ID: mdl-31294896

ABSTRACT

Testing for variation in BRCA1 and BRCA2 (commonly referred to as BRCA1/2) has emerged as a standard clinical practice and is helping countless women better understand and manage their heritable risk of breast and ovarian cancer. Yet the increased rate of BRCA1/2 testing has led to an increasing number of Variants of Uncertain Significance (VUS), and the rate of VUS discovery currently outpaces the rate of clinical variant interpretation. Computational prediction is a key component of the variant interpretation pipeline. In the CAGI5 ENIGMA Challenge, six prediction teams submitted predictions on 326 newly interpreted variants from the ENIGMA Consortium. By evaluating these predictions against the new interpretations, we have gained a number of insights into the state of the art of variant prediction and into specific steps to advance it further.

Subject(s)

BRCA1 Protein/genetics , BRCA2 Protein/genetics , Breast Neoplasms/diagnosis , Computational Biology/methods , Ovarian Neoplasms/diagnosis , Breast Neoplasms/genetics , Early Detection of Cancer , Female , Genetic Predisposition to Disease , Genetic Testing , Genetic Variation , Humans , Models, Genetic , Ovarian Neoplasms/genetics

A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing.

van den Akker, Jeroen; Mishne, Gilad; Zimmer, Anjali D; Zhou, Alicia Y.

BMC Genomics ; 19(1): 263, 2018 Apr 17.

Article in English | MEDLINE | ID: mdl-29665779

ABSTRACT

BACKGROUND: Next generation sequencing (NGS) has become a common technology for clinical genetic tests. The quality of NGS calls varies widely and is influenced by features like reference sequence characteristics, read depth, and mapping accuracy. With recent advances in NGS technology and software tools, the majority of variants called using NGS alone are in fact accurate and reliable. However, a small subset of difficult-to-call variants still requires orthogonal confirmation. For this reason, many clinical laboratories confirm NGS results using orthogonal technologies such as Sanger sequencing. Here, we report the development of a deterministic machine-learning-based model to differentiate between these two types of variant calls: those that do not require confirmation using an orthogonal technology (high confidence), and those that require additional quality testing (low confidence). This approach allows reliable NGS-based calling in a clinical setting by identifying the few important variant calls that require orthogonal confirmation. RESULTS: We developed and tested the model using a set of 7179 variants identified by a targeted NGS panel and re-tested by Sanger sequencing. The model incorporated several signals of sequence characteristics and call quality to determine if a variant was identified at high or low confidence. The model was tuned to eliminate false positives, defined as variants that were called by NGS but not confirmed by Sanger sequencing. The model achieved very high accuracy: 99.4% (95% confidence interval: +/- 0.03%). It categorized 92.2% (6622/7179) of the variants as high confidence, and 100% of these were confirmed to be present by Sanger sequencing. Among the variants that were categorized as low confidence, defined as NGS calls of low quality that are likely to be artifacts, 92.1% (513/557) were found to be not present by Sanger sequencing.
CONCLUSIONS: This work shows that NGS data contains sufficient characteristics for a machine-learning-based model to differentiate low from high confidence variants. Additionally, it reveals the importance of incorporating site-specific features as well as variant call features in such a model.
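The counts reported in the RESULTS can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the abstract. It assumes, as an interpretation not stated in the abstract, that the overall accuracy counts a call as correct when it is either high confidence and confirmed by Sanger sequencing, or low confidence and not confirmed. The variable names are illustrative.

```python
# Counts taken from the abstract.
total_variants = 7179
high_confidence = 6622          # all reported as confirmed by Sanger sequencing
low_confidence = 557
low_confidence_absent = 513     # low-confidence calls not confirmed by Sanger

# The two categories should partition the full variant set.
partition_ok = high_confidence + low_confidence == total_variants

# Percentages as reported, rounded to one decimal place.
pct_high_confidence = round(100 * high_confidence / total_variants, 1)
pct_low_absent = round(100 * low_confidence_absent / low_confidence, 1)

# Assumed definition of accuracy: correctly categorized calls over all calls.
pct_accuracy = round(
    100 * (high_confidence + low_confidence_absent) / total_variants, 1
)

print(partition_ok, pct_high_confidence, pct_low_absent, pct_accuracy)
```

Under that assumed definition the computed values agree with the 92.2%, 92.1%, and 99.4% stated in the abstract.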

Subject(s)

High-Throughput Nucleotide Sequencing , Machine Learning , Models, Statistical , Base Sequence , Genetic Variation

Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores.

Homburger, Julian R; Neben, Cynthia L; Mishne, Gilad; Zhou, Alicia Y; Kathiresan, Sekar; Khera, Amit V.

Genome Med ; 11(1): 74, 2019 11 26.

Article in English | MEDLINE | ID: mdl-31771638

ABSTRACT

BACKGROUND: Inherited susceptibility to common, complex diseases may be caused by rare, pathogenic variants ("monogenic") or by the cumulative effect of numerous common variants ("polygenic"). Comprehensive genome interpretation should enable assessment for both monogenic and polygenic components of inherited risk. The traditional approach requires two distinct genetic testing technologies: high coverage sequencing of known genes to detect monogenic variants, and a genome-wide genotyping array followed by imputation to calculate genome-wide polygenic scores (GPSs). We assessed the feasibility and accuracy of using low coverage whole genome sequencing (lcWGS) as an alternative to genotyping arrays to calculate GPSs. METHODS: First, we performed downsampling and imputation of WGS data from ten individuals to assess concordance with known genotypes. Second, we assessed the correlation between GPSs for 3 common diseases, namely coronary artery disease (CAD), breast cancer (BC), and atrial fibrillation (AF), calculated using lcWGS and genotyping array in 184 samples. Third, we assessed concordance of lcWGS-based genotype calls and GPS calculation in 120 individuals with known genotypes, selected to reflect diverse ancestral backgrounds. Fourth, we assessed the relationship between GPSs calculated using lcWGS and disease phenotypes in a cohort of 11,502 individuals of European ancestry. RESULTS: We found imputation accuracy r2 values of greater than 0.90 for all ten samples, including those of African and Ashkenazi Jewish ancestry, with lcWGS data at 0.5×. GPSs calculated using lcWGS and genotyping array followed by imputation in 184 individuals were highly correlated for each of the 3 common diseases (r2 = 0.93-0.97) with similar score distributions. Using lcWGS data from 120 individuals of diverse ancestral backgrounds, we found similar results with respect to imputation accuracy and GPS correlations. Finally, we calculated GPSs for CAD, BC, and AF using lcWGS in 11,502 individuals of European ancestry, confirming odds ratios per standard deviation increment ranging from 1.28 to 1.59, consistent with previous studies.
CONCLUSIONS: lcWGS is an alternative technology to genotyping arrays for common genetic variant assessment and GPS calculation. lcWGS provides comparable imputation accuracy while also overcoming the ascertainment bias inherent to variant selection in genotyping array design.
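The abstract does not give the scoring formula. A polygenic score is commonly computed as a weighted sum of allele dosages, and the agreement between two scoring methods can be summarized by the squared Pearson correlation (r2). The sketch below illustrates both under that assumption. It is not the authors' pipeline, and the function names `polygenic_score` and `r_squared` and all numeric inputs are hypothetical.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-variant allele dosages (0 to 2) and effect weights."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must not be constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical example: three variants scored for one individual.
example_score = polygenic_score([0.0, 1.0, 2.0], [0.1, 0.2, 0.3])

# Hypothetical scores for four individuals from two genotyping methods.
scores_method_a = [0.10, 0.40, 0.35, 0.80]
scores_method_b = [0.12, 0.38, 0.40, 0.75]
agreement = r_squared(scores_method_a, scores_method_b)
```

With imputed data, dosages are typically fractional expected allele counts rather than integers, which is why the inputs are typed as floats here.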

Subject(s)

Genetic Variation , Genome, Human , Genome-Wide Association Study , Genomics , Genetic Predisposition to Disease , Genetics, Population , Genomics/methods , Genotype , Humans , Reproducibility of Results , Whole Genome Sequencing

A scalable, aggregated genotypic-phenotypic database for human disease variation.

Barrett, Ryan; Neben, Cynthia L; Zimmer, Anjali D; Mishne, Gilad; McKennon, Wendy; Zhou, Alicia Y; Ginsberg, Jeremy.

Database (Oxford) ; 2019, 2019 01 01.

Article in English | MEDLINE | ID: mdl-30759220

ABSTRACT

Next generation sequencing multi-gene panels have greatly improved the diagnostic yield and cost effectiveness of genetic testing and are rapidly being integrated into the clinic for hereditary cancer risk. With this technology comes a dramatic increase in the volume, type and complexity of data. This invaluable data, though, is too often buried or inaccessible to researchers, especially to those without strong analytical or programming skills. To effectively share comprehensive, integrated genotypic-phenotypic data, we built Color Data, a publicly available, cloud-based database that supports broad access and data literacy. The database contains data from 50 000 individuals who were sequenced for 30 genes associated with hereditary cancer risk and provides useful information on allele frequency and variant classification, as well as associated phenotypic information such as demographics and personal and family history. Our user-friendly interface allows researchers to easily execute their own queries with filtering, and the results of queries can be shared and/or downloaded. The rapid and broad dissemination of these research results will help increase the value of, and reduce the waste in, scientific resources and data. Furthermore, the database is able to quickly scale and support integration of additional genes and human hereditary conditions. We hope that this database will help researchers and scientists explore genotype-phenotype correlations in hereditary cancer, identify novel variants for functional analysis and enable data-driven drug discovery and development.

Subject(s)

Databases, Genetic , Genetic Variation , Adult , Alleles , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Colorectal Neoplasms, Hereditary Nonpolyposis/genetics , Female , Founder Effect , Genotype , Humans , Jews/genetics , Male , Middle Aged , Phenotype , Search Engine , User-Computer Interface
