Search | VHL Regional Portal

1.

Human Genetics and Genomics for Drug Target Identification and Prioritization: Open Targets' Perspective.

McDonagh, Ellen M; Trynka, Gosia; McCarthy, Mark; Holzinger, Emily Rose; Khader, Shameer; Nakic, Nikolina; Hu, Xinli; Cornu, Helena; Dunham, Ian; Hulcoop, David.

Annu Rev Biomed Data Sci ; 2024 Apr 12.

Article in English | MEDLINE | ID: mdl-38608311

ABSTRACT

Open Targets, a consortium among academic and industry partners, focuses on using human genetics and genomics to provide insights into key questions that build therapeutic hypotheses. Large-scale experiments generate foundational data, and open-source informatic platforms systematically integrate evidence for target-disease relationships and provide dynamic tooling for target prioritization. A locus-to-gene machine learning model uses evidence from genome-wide association studies (GWAS Catalog, UK Biobank, and FinnGen), functional genomic studies, epigenetic studies, and variant effect prediction to predict potential drug targets for complex diseases. These predictions are combined with genetic evidence from gene burden analyses, rare disease genetics, somatic mutations, perturbation assays, pathway analyses, scientific literature, differential expression, and mouse models to systematically build target-disease associations (https://platform.opentargets.org). Scored target attributes such as clinical precedence, tractability, and safety guide target prioritization. Here we provide our perspective on the value and impact of human genetics and genomics for generating therapeutic hypotheses.

2.

A novel de novo TP63 mutation in whole-exome sequencing of a Syrian family with Oral cleft and ectrodactyly.

Simpson, Claire L; Kimble, Danielle C; Chandrasekharappa, Settara C; Alqosayer, Khalid; Holzinger, Emily; Carrington, Blake; McElderry, John; Sood, Raman; Al-Souqi, Ghiath; Albacha-Hejazi, Hasan; Bailey-Wilson, Joan E.

Mol Genet Genomic Med ; 11(8): e2179, 2023 08.

Article in English | MEDLINE | ID: mdl-37070724

ABSTRACT

BACKGROUND: Oral clefts and ectrodactyly are common, heterogeneous birth defects. We performed whole-exome sequencing (WES) analysis in a Syrian family. The proband presented with both orofacial clefting and ectrodactyly but not ectodermal dysplasia as typically seen in ectrodactyly, ectodermal dysplasia, and cleft lip/palate syndrome-3. A paternal uncle with only an oral cleft was deceased and unavailable for analysis. METHODS: Variant annotation, Mendelian inconsistencies, and novel variants in known cleft genes were examined. Candidate variants were validated using Sanger sequencing, and pathogenicity was assessed by knocking out the tp63 gene in zebrafish to evaluate its role during zebrafish development. RESULTS: Twenty-eight candidate de novo events were identified, one of which is in a known oral cleft and ectrodactyly gene, TP63 (c.956G > T, p.Arg319Leu), and was confirmed by Sanger sequencing. CONCLUSION: TP63 mutations are associated with multiple autosomal dominant orofacial clefting and limb malformation disorders. The p.Arg319Leu mutation seen in this patient is both de novo and novel. Two known mutations in the same codon (c.956G > A, p.Arg319His; rs121908839, c.955C > T, p.Arg319Cys) cause ectrodactyly, providing evidence that mutating this codon is deleterious. While this TP63 mutation is the best candidate for the patient's clinical presentation, whether it is responsible for the entire phenotype is unclear. Generation and characterization of tp63 knockout zebrafish showed necrosis and rupture of the head at 3 days post-fertilization (dpf). The embryonic phenotype could not be rescued by injection of zebrafish or human messenger RNA (mRNA). Further functional analysis is needed to determine what proportion of the phenotype is due to this mutation.

Subject(s)

Cleft Lip , Cleft Palate , Humans , Animals , Cleft Lip/genetics , Cleft Palate/genetics , Zebrafish/genetics , Exome Sequencing , Syria , Mutation , Transcription Factors/genetics , Tumor Suppressor Proteins/genetics

3.

What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics.

Musolf, Anthony M; Holzinger, Emily R; Malley, James D; Bailey-Wilson, Joan E.

Hum Genet ; 141(9): 1515-1528, 2022 Sep.

Article in English | MEDLINE | ID: mdl-34862561

ABSTRACT

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k-nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.

Subject(s)

Machine Learning , Support Vector Machine , Algorithms , Humans , Neural Networks, Computer

4.

Genetic Association Reveals Protection against Recurrence of Clostridium difficile Infection with Bezlotoxumab Treatment.

Shen, Judong; Mehrotra, Devan V; Dorr, Mary Beth; Zeng, Zhen; Li, Junhua; Xu, Xun; Nickle, David; Holzinger, Emily R; Chhibber, Aparna; Wilcox, Mark H; Blanchard, Rebecca L; Shaw, Peter M.

mSphere ; 5(3)2020 05 06.

Article in English | MEDLINE | ID: mdl-32376702

ABSTRACT

Bezlotoxumab is a human monoclonal antibody against Clostridium difficile toxin B, indicated to prevent recurrence of C. difficile infection (rCDI) in high-risk adults receiving antibacterial treatment for CDI. An exploratory genome-wide association study investigated whether human genetic variation influences bezlotoxumab response. DNA from 704 participants who achieved initial clinical cure in the phase 3 MODIFY I/II trials was genotyped. Single nucleotide polymorphisms (SNPs) and human leukocyte antigen (HLA) imputation were performed using IMPUTE2 and HIBAG, respectively. A joint test of genotype and genotype-by-treatment interaction in a logistic regression model was used to screen genetic variants associated with response to bezlotoxumab. The SNP rs2516513 and the HLA alleles HLA-DRB1*07:01 and HLA-DQA1*02:01, located in the extended major histocompatibility complex on chromosome 6, were associated with the reduction of rCDI in bezlotoxumab-treated participants. Carriage of a minor allele (homozygous or heterozygous) at any of the identified loci was related to a larger difference in the proportion of participants experiencing rCDI versus placebo; the effect was most prominent in the subgroup at high baseline risk for rCDI. Genotypes associated with an improved bezlotoxumab response showed no association with rCDI in the placebo cohort. These data suggest that a host-driven, immunological mechanism may impact bezlotoxumab response. Trial registration numbers are as follows: NCT01241552 (MODIFY I) and NCT01513239 (MODIFY II). IMPORTANCE: Clostridium difficile infection is associated with significant clinical morbidity and mortality; antibacterial treatments are effective, but recurrence of C. difficile infection is common. In this genome-wide association study, we explored whether host genetic variability affected treatment responses to bezlotoxumab, a human monoclonal antibody that binds C. difficile toxin B and is indicated for the prevention of recurrent C. difficile infection. Using data from the MODIFY I/II phase 3 clinical trials, we identified three genetic variants associated with reduced rates of C. difficile infection recurrence in bezlotoxumab-treated participants. The effects were most pronounced in participants at high risk of C. difficile infection recurrence. All three variants are located in the extended major histocompatibility complex on chromosome 6, suggesting the involvement of a host-driven immunological mechanism in the prevention of C. difficile infection recurrence.
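The joint test named in this abstract can be written out as follows. This is a hedged sketch only: the symbols ($G$ for minor-allele carriage or dosage, $T$ for bezlotoxumab versus placebo, and the coefficients $\beta$) are notation assumed here, and any covariates the authors may have adjusted for are omitted.

```latex
\operatorname{logit} \Pr(\mathrm{rCDI} = 1)
  = \beta_0 + \beta_T\, T + \beta_G\, G + \beta_{GT}\,(G \times T),
\qquad
H_0:\ \beta_G = \beta_{GT} = 0
```

Under this formulation the null hypothesis constrains two parameters at once, so it could be assessed, for example, with a 2-degree-of-freedom likelihood-ratio test against the model containing only $\beta_0$ and $\beta_T$.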

Subject(s)

Antibodies, Monoclonal/therapeutic use , Broadly Neutralizing Antibodies/therapeutic use , Clostridioides difficile/drug effects , Clostridium Infections/drug therapy , Clostridium Infections/genetics , Adolescent , Adult , Aged , Aged, 80 and over , Alleles , Antibodies, Neutralizing/blood , Female , Genome-Wide Association Study , Genotype , HLA-D Antigens/genetics , Humans , Male , Middle Aged , Polymorphism, Single Nucleotide , Recurrence , Young Adult

5.

Exome genotyping and linkage analysis identifies two novel linked regions and replicates two others for myopia in Ashkenazi Jewish families.

Simpson, Claire L; Musolf, Anthony M; Li, Qing; Portas, Laura; Murgia, Federico; Cordero, Roberto Y; Cordero, Jennifer B; Moiz, Bilal A; Holzinger, Emily R; Middlebrooks, Candace D; Lewis, Deyana D; Bailey-Wilson, Joan E; Stambolian, Dwight.

BMC Med Genet ; 20(1): 27, 2019 01 31.

Article in English | MEDLINE | ID: mdl-30704416

ABSTRACT

BACKGROUND: Myopia is one of the most common eye diseases in the world and affects 1 in 4 Americans. It is a complex disease caused by both environmental and genetic effects; the genetic effects are still not well understood. In this study, we performed genetic linkage analyses on Ashkenazi Jewish families with a strong familial history of myopia to elucidate any potential causal genes. METHODS: Sixty-four extended Ashkenazi Jewish families were previously collected from New Jersey. Genotypes from the Illumina ExomePlus array were merged with prior microsatellite linkage data from these families. Additional custom markers were added for candidate regions reported in the literature for myopia or refractive error. Myopia was defined as a mean spherical equivalent (MSE) of -1D or worse. Parametric two-point linkage analyses (using TwoPointLods) and multi-point linkage analyses (using SimWalk2) were performed, as well as collapsed haplotype pattern (CHP) analysis in SEQLinkage and association analyses with FBAT and rv-TDT. RESULTS: The strongest evidence of linkage was on 1p36 (two-point LOD = 4.47), a region previously linked to refractive error (MYP14) but not myopia. Another genome-wide significant locus was found on 8q24.22 with a maximum two-point LOD score of 3.75. CHP analysis also detected the signal on 1p36, localized to the LINC00339 gene with a maximum HLOD of 3.47, as well as genome-wide significant signals on 7q36.1 and 11p15, which overlaps with the MYP7 locus. CONCLUSIONS: We identified 2 novel linkage peaks for myopia on chromosomes 7 and 8 in these Ashkenazi Jewish families and replicated 2 more loci on chromosomes 1 and 11, one previously reported in refractive error but not myopia in these families and the other locus previously reported in the literature. Strong candidate genes have been identified within these linkage peaks in our families. Targeted sequencing in these regions will be necessary to definitively identify causal variants under these linkage peaks.
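For readers unfamiliar with the two-point LOD scores quoted above, the following minimal sketch shows the textbook definition for the simplest case of fully informative, phase-known meioses. It is not the TwoPointLods or SimWalk2 computation used in the study; the function names and the example counts are hypothetical.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """LOD = log10( L(theta) / L(theta = 0.5) ) for phase-known meioses.

    L(theta) = theta**recombinants * (1 - theta)**(meioses - recombinants).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie in [0, meioses]")
    nonrecombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant rules out complete linkage.
        return float("-inf")
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if nonrecombinants:
        log_likelihood += nonrecombinants * math.log10(1.0 - theta)
    # Subtract the log-likelihood under free recombination (theta = 0.5).
    return log_likelihood - meioses * math.log10(0.5)


def max_lod(recombinants, meioses, steps=50):
    """Grid search over theta in [0, 0.5]; returns (best_theta, best_lod)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(
        ((t, two_point_lod(recombinants, meioses, t)) for t in grid),
        key=lambda pair: pair[1],
    )
```

With these hypothetical inputs, `max_lod(1, 10)` searches recombination fractions from 0 to 0.5 in steps of 0.01 and returns the fraction with the highest LOD together with that LOD.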

Subject(s)

Chromosomes, Human/genetics , Genotyping Techniques/methods , Jews/genetics , Myopia/genetics , Chromosomes, Human, Pair 1/genetics , Chromosomes, Human, Pair 11/genetics , Chromosomes, Human, Pair 7/genetics , Chromosomes, Human, Pair 8/genetics , Exome , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Lod Score , Male , Myopia/ethnology , Pedigree , RNA, Long Noncoding/genetics

6.

Analysis of sequence data to identify potential risk variants for oral clefts in multiplex families.

Holzinger, Emily R; Li, Qing; Parker, Margaret M; Hetmanski, Jacqueline B; Marazita, Mary L; Mangold, Elisabeth; Ludwig, Kerstin U; Taub, Margaret A; Begum, Ferdouse; Murray, Jeffrey C; Albacha-Hejazi, Hasan; Alqosayer, Khalid; Al-Souki, Giath; Albasha Hejazi, Abdullatiff; Scott, Alan F; Beaty, Terri H; Bailey-Wilson, Joan E.

Mol Genet Genomic Med ; 5(5): 570-579, 2017 Sep.

Article in English | MEDLINE | ID: mdl-28944239

ABSTRACT

BACKGROUND: Nonsyndromic oral clefts are craniofacial malformations, which include cleft lip with or without cleft palate. The etiology for oral clefts is complex with both genetic and environmental factors contributing to risk. Previous genome-wide association (GWAS) studies have identified multiple loci with small effects; however, many causal variants remain elusive. METHODS: In this study, we address this by specifically looking for rare, potentially damaging variants in family-based data. We analyzed both whole exome sequence (WES) data and whole genome sequence (WGS) data in multiplex cleft families to identify variants shared by affected individuals. RESULTS: Here we present the results from these analyses. Our most interesting finding was from a single Syrian family, which showed enrichment of nonsynonymous and potentially damaging rare variants in two genes: CASP9 and FAT4. CONCLUSION: Neither of these candidate genes has previously been associated with oral clefts and, if confirmed as contributing to disease risk, may indicate novel biological pathways in the genetic etiology for oral clefts.

7.

Discovery and replication of SNP-SNP interactions for quantitative lipid traits in over 60,000 individuals.

Holzinger, Emily R; Verma, Shefali S; Moore, Carrie B; Hall, Molly; De, Rishika; Gilbert-Diamond, Diane; Lanktree, Matthew B; Pankratz, Nathan; Amuzu, Antoinette; Burt, Amber; Dale, Caroline; Dudek, Scott; Furlong, Clement E; Gaunt, Tom R; Kim, Daniel Seung; Riess, Helene; Sivapalaratnam, Suthesh; Tragante, Vinicius; van Iperen, Erik P A; Brautbar, Ariel; Carrell, David S; Crosslin, David R; Jarvik, Gail P; Kuivaniemi, Helena; Kullo, Iftikhar J; Larson, Eric B; Rasmussen-Torvik, Laura J; Tromp, Gerard; Baumert, Jens; Cruickshanks, Karen J; Farrall, Martin; Hingorani, Aroon D; Hovingh, G K; Kleber, Marcus E; Klein, Barbara E; Klein, Ronald; Koenig, Wolfgang; Lange, Leslie A; März, Winfried; North, Kari E; Onland-Moret, N Charlotte; Reiner, Alex P; Talmud, Philippa J; van der Schouw, Yvonne T; Wilson, James G; Kivimaki, Mika; Kumari, Meena; Moore, Jason H; Drenos, Fotios; Asselbergs, Folkert W.

BioData Min ; 10: 25, 2017.

Article in English | MEDLINE | ID: mdl-28770004

ABSTRACT

BACKGROUND: The genetic etiology of human lipid quantitative traits is not fully elucidated, and interactions between variants may play a role. We performed a gene-centric interaction study for four different lipid traits: low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), total cholesterol (TC), and triglycerides (TG). RESULTS: Our analysis consisted of a discovery phase using a merged dataset of five different cohorts (n = 12,853 to n = 16,849 depending on lipid phenotype) and a replication phase with ten independent cohorts totaling up to 36,938 additional samples. Filters are often applied before interaction testing to correct for the burden of testing all pairwise interactions. We used two different filters: 1. A filter that tested only single nucleotide polymorphisms (SNPs) with a main effect of p < 0.001 in a previous association study. 2. A filter that only tested interactions identified by Biofilter 2.0. Pairwise models that reached an interaction significance level of p < 0.001 in the discovery dataset were tested for replication. We identified thirteen SNP-SNP models that were significant in more than one replication cohort after accounting for multiple testing. CONCLUSIONS: These results may reveal novel insights into the genetic etiology of lipid levels. Furthermore, we developed a pipeline to perform a computationally efficient interaction analysis with multi-cohort replication.
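The main-effect filter described above can be illustrated with a short sketch. It is not the authors' pipeline: the SNP names and p-values are made up, and the Bonferroni correction is used purely as an example of how the testing burden shrinks, since the abstract does not state which multiple-testing adjustment was applied.

```python
from itertools import combinations
from math import comb


def filter_pairs(main_effect_p, p_cutoff=0.001):
    """Keep SNPs whose main-effect p-value is below the cutoff and
    return every unordered pair of the survivors."""
    kept = sorted(snp for snp, p in main_effect_p.items() if p < p_cutoff)
    return list(combinations(kept, 2))


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level under a Bonferroni correction
    (an assumption made for illustration)."""
    return alpha / n_tests


# Hypothetical main-effect p-values from a previous association study.
example_p = {"snpA": 1e-5, "snpB": 0.2, "snpC": 5e-4, "snpD": 9e-4}

all_pairs = comb(len(example_p), 2)       # exhaustive pairwise search
filtered_pairs = filter_pairs(example_p)  # pairs left after filtering
print(all_pairs, len(filtered_pairs))
print(bonferroni_threshold(len(filtered_pairs)))
```

Because the number of pairwise models grows quadratically with the number of SNPs, even a modest filter removes most of the tests and relaxes the per-test threshold accordingly.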

8.

Identifying gene-gene interactions that are highly associated with four quantitative lipid traits across multiple cohorts.

De, Rishika; Verma, Shefali S; Holzinger, Emily; Hall, Molly; Burt, Amber; Carrell, David S; Crosslin, David R; Jarvik, Gail P; Kuivaniemi, Helena; Kullo, Iftikhar J; Lange, Leslie A; Lanktree, Matthew B; Larson, Eric B; North, Kari E; Reiner, Alex P; Tragante, Vinicius; Tromp, Gerard; Wilson, James G; Asselbergs, Folkert W; Drenos, Fotios; Moore, Jason H; Ritchie, Marylyn D; Keating, Brendan; Gilbert-Diamond, Diane.

Hum Genet ; 136(2): 165-178, 2017 02.

Article in English | MEDLINE | ID: mdl-27848076

ABSTRACT

Genetic loci explain only 25-30% of the heritability observed in plasma lipid traits. Epistasis, or gene-gene interactions, may contribute to a portion of this missing heritability. Using the genetic data from five NHLBI cohorts of 24,837 individuals, we combined the use of the quantitative multifactor dimensionality reduction (QMDR) algorithm with two SNP-filtering methods to exhaustively search for SNP-SNP interactions that are associated with HDL cholesterol (HDL-C), LDL cholesterol (LDL-C), total cholesterol (TC) and triglycerides (TG). SNPs were filtered either on the strength of their independent effects (main effect filter) or the prior knowledge supporting a given interaction (Biofilter). After the main effect filter, QMDR identified 20 SNP-SNP models associated with HDL-C, 6 associated with LDL-C, 3 associated with TC, and 10 associated with TG (permutation P value <0.05). With the use of Biofilter, we identified 2 SNP-SNP models associated with HDL-C, 3 associated with LDL-C, 1 associated with TC and 8 associated with TG (permutation P value <0.05). In an independent dataset of 7502 individuals from the eMERGE network, we replicated 14 of the interactions identified after main effect filtering: 11 for HDL-C, 1 for LDL-C and 2 for TG. We also replicated 23 of the interactions found to be associated with TG after applying Biofilter. Prior knowledge supports the possible role of these interactions in the genetic etiology of lipid traits. This study also presents a computationally efficient pipeline for analyzing data from large genotyping arrays and detecting SNP-SNP interactions that are not primarily driven by strong main effects.

Subject(s)

Cardiovascular Diseases/genetics , Cholesterol, HDL/blood , Cholesterol, LDL/blood , Epistasis, Genetic , Phenotype , Triglycerides/blood , Body Mass Index , Cardiovascular Diseases/blood , Cohort Studies , Female , Genetic Loci , Genetic Markers , Genome, Human , Genotyping Techniques , Humans , Linear Models , Linkage Disequilibrium , Male , Multifactor Dimensionality Reduction , Polymorphism, Single Nucleotide

9.

Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data.

Holzinger, Emily R; Szymczak, Silke; Malley, James; Pugh, Elizabeth W; Ling, Hua; Griffith, Sean; Zhang, Peng; Li, Qing; Cropp, Cheryl D; Bailey-Wilson, Joan E.

BMC Proc ; 10(Suppl 7): 147-152, 2016.

Article in English | MEDLINE | ID: mdl-27980627

ABSTRACT

Current findings from genetic studies of complex human traits often do not explain a large proportion of the estimated variation of these traits due to genetic factors. This could be, in part, due to overly stringent significance thresholds in traditional statistical methods, such as linear and logistic regression. Machine learning methods, such as Random Forests (RF), are an alternative approach to identify potentially interesting variants. One major issue with these methods is that there is no clear way to distinguish between probable true hits and noise variables based on the importance metric calculated. To this end, we are developing a method called the Relative Recurrency Variable Importance Metric (r2VIM), a RF-based variable selection method. Here, we apply r2VIM to the unrelated Genetic Analysis Workshop 19 data with simulated systolic blood pressure as the phenotype. We compare the number of "true" functional variants identified by r2VIM with those identified by linear regression analyses that use a Bonferroni correction to calculate a significance threshold. Our results show that r2VIM performed comparably to linear regression. Our findings are proof-of-concept for r2VIM, as it identifies a similar number of functional and nonfunctional variants as a more commonly used technique when the optimal importance score threshold is used.

10.

REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.

Ioannidis, Nilah M; Rothstein, Joseph H; Pejaver, Vikas; Middha, Sumit; McDonnell, Shannon K; Baheti, Saurabh; Musolf, Anthony; Li, Qing; Holzinger, Emily; Karyadi, Danielle; Cannon-Albright, Lisa A; Teerlink, Craig C; Stanford, Janet L; Isaacs, William B; Xu, Jianfeng; Cooney, Kathleen A; Lange, Ethan M; Schleutker, Johanna; Carpten, John D; Powell, Isaac J; Cussenot, Olivier; Cancel-Tassin, Geraldine; Giles, Graham G; MacInnis, Robert J; Maier, Christiane; Hsieh, Chih-Lin; Wiklund, Fredrik; Catalona, William J; Foulkes, William D; Mandal, Diptasri; Eeles, Rosalind A; Kote-Jarai, Zsofia; Bustamante, Carlos D; Schaid, Daniel J; Hastie, Trevor; Ostrander, Elaine A; Bailey-Wilson, Joan E; Radivojac, Predrag; Thibodeau, Stephen N; Whittemore, Alice S; Sieh, Weiva.

Am J Hum Genet ; 99(4): 877-885, 2016 Oct 06.

Article in English | MEDLINE | ID: mdl-27666373

ABSTRACT

The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10⁻¹²) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.
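The AUC comparisons reported above rest on a simple quantity: the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral one. The sketch below computes it directly from two score lists; it is a generic illustration with made-up scores, not REVEL's evaluation code, and the function name is hypothetical.

```python
def pairwise_auc(pathogenic_scores, neutral_scores):
    """AUC as the fraction of (pathogenic, neutral) pairs ranked correctly,
    counting ties as half a correct ranking."""
    if not pathogenic_scores or not neutral_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pathogenic_scores:
        for n in neutral_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(neutral_scores))


# Hypothetical pathogenicity scores in [0, 1] for two small variant sets.
print(pairwise_auc([0.9, 0.8, 0.3], [0.7, 0.2]))
```

The double loop is quadratic in the number of variants, which is fine for illustration; for test sets the size of those in the abstract, a rank-based formulation would be the more practical choice.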

Subject(s)

Disease/genetics , Mutation, Missense/genetics , Software , Area Under Curve , DNA Mutational Analysis , Exome/genetics , Gene Frequency , Humans , ROC Curve

11.

r2VIM: A new variable selection method for random forests in genome-wide association studies.

Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit; Malley, James D; Molloy, Anne M; Mills, James L; Brody, Lawrence C; Stambolian, Dwight; Bailey-Wilson, Joan E.

BioData Min ; 9: 7, 2016.

Article in English | MEDLINE | ID: mdl-26839594

ABSTRACT

BACKGROUND: Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses. RESULTS: We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS. CONCLUSIONS: The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.
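The selection rule summarized in this abstract can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes the per-run importance scores are already available (for example, permutation importances, whose minimum is typically negative), it divides each score by the absolute value of the run's minimum, and the `factor` threshold and example numbers are hypothetical.

```python
def r2vim_select(importances, factor=3.0):
    """Return indices of variables whose relative importance is at least
    `factor` in every random-forest run.

    importances: list of runs, each a list of per-variable importance scores.
    """
    if not importances:
        raise ValueError("at least one run is required")
    selected = set(range(len(importances[0])))
    for run in importances:
        minimal = min(run)
        if minimal >= 0:
            # Assumption of this sketch: the observed minimum is negative,
            # so its absolute value gives a usable noise scale.
            raise ValueError("expected a negative minimal importance")
        scale = abs(minimal)
        passing = {j for j, score in enumerate(run) if score / scale >= factor}
        selected &= passing  # must pass in all runs
    return sorted(selected)


# Two hypothetical runs over four variables.
runs = [
    [0.9, -0.1, 0.05, 0.5],
    [0.8, -0.2, 0.10, 0.3],
]
print(r2vim_select(runs))
```

Requiring a variable to pass in every run is what makes the rule conservative: a variable that looks important in one forest only by chance is unlikely to clear the threshold in all of them.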

12.

Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19.

König, Inke R; Auerbach, Jonathan; Gola, Damian; Held, Elizabeth; Holzinger, Emily R; Legault, Marc-André; Sun, Rui; Tintle, Nathan; Yang, Hsin-Chou.

BMC Genet ; 17 Suppl 2: 1, 2016 Feb 03.

Article in English | MEDLINE | ID: mdl-26866367

ABSTRACT

In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.

Subject(s)

Data Mining/methods , Genomics/methods , Computational Biology/methods , Genetic Testing , Humans , Machine Learning

13.

Identifying gene-gene interactions that are highly associated with Body Mass Index using Quantitative Multifactor Dimensionality Reduction (QMDR).

De, Rishika; Verma, Shefali S; Drenos, Fotios; Holzinger, Emily R; Holmes, Michael V; Hall, Molly A; Crosslin, David R; Carrell, David S; Hakonarson, Hakon; Jarvik, Gail; Larson, Eric; Pacheco, Jennifer A; Rasmussen-Torvik, Laura J; Moore, Carrie B; Asselbergs, Folkert W; Moore, Jason H; Ritchie, Marylyn D; Keating, Brendan J; Gilbert-Diamond, Diane.

BioData Min ; 8: 41, 2015.

Article in English | MEDLINE | ID: mdl-26674805

ABSTRACT

BACKGROUND: Despite heritability estimates of 40-70% for obesity, less than 2% of its variation is explained by Body Mass Index (BMI) associated loci that have been identified so far. Epistasis, or gene-gene interactions, are a plausible source to explain portions of the missing heritability of BMI. METHODS: Using genotypic data from 18,686 individuals across five study cohorts (ARIC, CARDIA, FHS, CHS, MESA), we filtered SNPs (Single Nucleotide Polymorphisms) using two parallel approaches. SNPs were filtered either on the strength of their main effects of association with BMI, or on the number of knowledge sources supporting a specific SNP-SNP interaction in the context of BMI. Filtered SNPs were specifically analyzed for interactions that are highly associated with BMI using QMDR (Quantitative Multifactor Dimensionality Reduction). QMDR is a nonparametric, genetic model-free method that detects non-linear interactions associated with a quantitative trait. RESULTS: We identified seven novel, epistatic models with a Bonferroni corrected p-value of association < 0.1. Prior experimental evidence helps explain the plausible biological interactions highlighted within our results and their relationship with obesity. We identified interactions between genes involved in mitochondrial dysfunction (POLG2), cholesterol metabolism (SOAT2), lipid metabolism (CYP11B2), cell adhesion (EZR), cell proliferation (MAP2K5), and insulin resistance (IGF1R). Moreover, we found an 8.8% increase in the variance in BMI explained by these seven SNP-SNP interactions, beyond what is explained by the main effects of an index FTO SNP and the SNPs within these interactions. We also replicated one of these interactions and 58 proxy SNP-SNP models representing it in an independent dataset from the eMERGE study. CONCLUSION: This study highlights a novel approach for discovering gene-gene interactions by combining methods such as QMDR with traditional statistics.

14.

Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the Phenx Toolkit*.

Pendergrass, Sarah A; Verma, Shefali S; Hall, Molly A; Holzinger, Emily R; Moore, Carrie B; Wallace, John R; Dudek, Scott M; Huggins, Wayne; Kitchner, Terrie; Waudby, Carol; Berg, Richard; McCarty, Catherine A; Ritchie, Marylyn D.

Pac Symp Biocomput ; : 495-505, 2015.

Article in English | MEDLINE | ID: mdl-25741542

ABSTRACT

Investigating the association between biobank derived genomic data and the information of linked electronic health records (EHRs) is an emerging area of research for dissecting the architecture of complex human traits, where cases and controls for study are defined through the use of electronic phenotyping algorithms deployed in large EHR systems. For our study, cataract cases and controls were identified within the Marshfield Personalized Medicine Research Project (PMRP) biobank and linked EHR, which is a member of the NHGRI-funded electronic Medical Records and Genomics (eMERGE) Network. Our goal was to explore potential gene-gene and gene-environment interactions within these data for 527,953 and 527,936 single nucleotide polymorphisms (SNPs) for gene-gene and gene-environment analyses, respectively, with minor allele frequency > 1%, in order to explore higher level associations with cataract risk beyond investigations of single SNP-phenotype associations. To build our SNP-SNP interaction models we utilized a prior-knowledge driven filtering method called Biofilter to minimize the multiple testing burden of exploring the vast array of interaction models possible from our extensive number of SNPs. Using Biofilter, we developed 57,376 prior-knowledge directed SNP-SNP models to test for association with cataract status. We selected models that required 6 sources of external domain knowledge. We identified 13 statistically significant SNP-SNP models with an interaction with p-value < 1 × 10⁻⁴, as well as an overall model with p-value < 0.01 associated with cataract status. We also conducted gene-environment interaction analyses for all GWAS SNPs and a set of environmental factors from the PhenX Toolkit: smoking, UV exposure, and alcohol use; these environmental factors have been previously associated with the formation of cataracts. We found a total of 782 gene-environment models that exhibit an interaction with a p-value < 1 × 10⁻⁴ associated with cataract status. Our results show these approaches enable advanced searches for epistasis and gene-environment interactions beyond GWAS, and that the EHR based approach provides an additional source of data for seeking these advanced explanatory models of the etiology of complex disease/outcome such as cataracts.

Subject(s)

Cataract/genetics , Algorithms , Biological Specimen Banks , Case-Control Studies , Computational Biology , Databases, Genetic , Electronic Health Records , Epistasis, Genetic , Gene-Environment Interaction , Genome-Wide Association Study , Humans , Phenotype , Polymorphism, Single Nucleotide , Software

15.

Methods of integrating data to uncover genotype-phenotype interactions.

Ritchie, Marylyn D; Holzinger, Emily R; Li, Ruowang; Pendergrass, Sarah A; Kim, Dokyoon.

Nat Rev Genet ; 16(2): 85-97, 2015 Feb.

Article in English | MEDLINE | ID: mdl-25582081

ABSTRACT

Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data, to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating important insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration - including meta-dimensional and multi-staged analyses - which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be revealed.

Subject(s)

Data Interpretation, Statistical , Genetic Variation , Genotype , Inheritance Patterns/physiology , Models, Biological , Phenotype , Systems Biology/methods , Humans , Meta-Analysis as Topic

16.

Variable selection method for the identification of epistatic models.

Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit; Malley, James; Li, Qing; Bailey-Wilson, Joan E.

Pac Symp Biocomput ; : 195-206, 2015.

Article in English | MEDLINE | ID: mdl-25592581

ABSTRACT

Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).

Subject(s)

Epistasis, Genetic , Models, Genetic , Algorithms , Computational Biology , Computer Simulation , Databases, Genetic , Genome-Wide Association Study/statistics & numerical data , Humans , Linkage Disequilibrium , Logistic Models , Machine Learning , Polymorphism, Single Nucleotide , Signal-To-Noise Ratio

17.

Genetic variation in iron metabolism is associated with neuropathic pain and pain severity in HIV-infected patients on antiretroviral therapy.

Kallianpur, Asha R; Jia, Peilin; Ellis, Ronald J; Zhao, Zhongming; Bloss, Cinnamon; Wen, Wanqing; Marra, Christina M; Hulgan, Todd; Simpson, David M; Morgello, Susan; McArthur, Justin C; Clifford, David B; Collier, Ann C; Gelman, Benjamin B; McCutchan, J Allen; Franklin, Donald; Samuels, David C; Rosario, Debralee; Holzinger, Emily; Murdock, Deborah G; Letendre, Scott; Grant, Igor.

PLoS One ; 9(8): e103123, 2014.

Article in English | MEDLINE | ID: mdl-25144566

ABSTRACT

HIV sensory neuropathy and distal neuropathic pain (DNP) are common, disabling complications associated with combination antiretroviral therapy (cART). We previously associated iron-regulatory genetic polymorphisms with a reduced risk of HIV sensory neuropathy during more neurotoxic types of cART. We here evaluated the impact of polymorphisms in 19 iron-regulatory genes on DNP in 560 HIV-infected subjects from a prospective, observational study, who underwent neurological examinations to ascertain peripheral neuropathy and structured interviews to ascertain DNP. Genotype-DNP associations were explored by logistic regression and permutation-based analytical methods. Among 559 evaluable subjects, 331 (59%) developed HIV-SN, and 168 (30%) reported DNP. Fifteen polymorphisms in 8 genes (p<0.05) and 5 variants in 4 genes (p<0.01) were nominally associated with DNP: polymorphisms in TF, TFRC, BMP6, ACO1, SLC11A2, and FXN conferred reduced risk (adjusted odds ratios [ORs] ranging from 0.2 to 0.7, all p<0.05); other variants in TF, CP, ACO1, BMP6, and B2M conferred increased risk (ORs ranging from 1.3 to 3.1, all p<0.05). Risks associated with some variants were statistically significant either in black or white subgroups but were consistent in direction. ACO1 rs2026739 remained significantly associated with DNP in whites (permutation p<0.0001) after correction for multiple tests. Several of the same iron-regulatory-gene polymorphisms, including ACO1 rs2026739, were also associated with severity of DNP (all p<0.05). Common polymorphisms in iron-management genes are associated with DNP and with DNP severity in HIV-infected persons receiving cART. Consistent risk estimates across population subgroups and persistence of the ACO1 rs2026739 association after adjustment for multiple testing suggest that genetic variation in iron-regulation and transport modulates susceptibility to DNP.

Subject(s)

Genetic Variation/genetics , HIV Infections/genetics , HIV Infections/physiopathology , Iron/metabolism , Neuralgia/physiopathology , Adult , Aged , Anti-Retroviral Agents/therapeutic use , Female , Genotype , HIV Infections/drug therapy , HIV Infections/metabolism , Humans , Iron Regulatory Protein 1/genetics , Linkage Disequilibrium/genetics , Male , Middle Aged , Multivariate Analysis , Neuralgia/genetics , Neuralgia/metabolism , Young Adult

18.

ATHENA: the analysis tool for heritable and environmental network associations.

Holzinger, Emily R; Dudek, Scott M; Frase, Alex T; Pendergrass, Sarah A; Ritchie, Marylyn D.

Bioinformatics ; 30(5): 698-705, 2014 Mar 01.

Article in English | MEDLINE | ID: mdl-24149050

ABSTRACT

MOTIVATION: Advancements in high-throughput technology have allowed researchers to examine the genetic etiology of complex human traits in a robust fashion. Although genome-wide association studies have identified many novel variants associated with hundreds of traits, a large proportion of the estimated trait heritability remains unexplained. One hypothesis is that the commonly used statistical techniques and study designs are not robust to the complex etiology that may underlie these human traits. This etiology could include non-linear gene × gene or gene × environment interactions. Additionally, other levels of biological regulation may play a large role in trait variability. RESULTS: To address the need for computational tools that can explore enormous datasets to detect complex susceptibility models, we have developed a software package called the Analysis Tool for Heritable and Environmental Network Associations (ATHENA). ATHENA combines various variable filtering methods with machine learning techniques to analyze high-throughput categorical (i.e. single nucleotide polymorphisms) and quantitative (i.e. gene expression levels) predictor variables to generate multivariable models that predict either a categorical (i.e. disease status) or quantitative (i.e. cholesterol levels) outcomes. The goal of this article is to demonstrate the utility of ATHENA using simulated and biological datasets that consist of both single nucleotide polymorphisms and gene expression variables to identify complex prediction models. Importantly, this method is flexible and can be expanded to include other types of high-throughput data (i.e. RNA-seq data and biomarker measurements). AVAILABILITY: ATHENA is freely available for download. The software, user manual and tutorial can be downloaded from http://ritchielab.psu.edu/ritchielab/software.

Subject(s)

Gene-Environment Interaction , Genome-Wide Association Study , Software , Humans , Phenotype , Polymorphism, Single Nucleotide

19.

Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using Biofilter, and gene-environment interactions using the PhenX Toolkit.

Pendergrass, Sarah A; Verma, Shefali S; Holzinger, Emily R; Moore, Carrie B; Wallace, John; Dudek, Scott M; Huggins, Wayne; Kitchner, Terrie; Waudby, Carol; Berg, Richard; McCarty, Catherine A; Ritchie, Marylyn D.

Pac Symp Biocomput ; : 147-58, 2013.

Article in English | MEDLINE | ID: mdl-23424120

ABSTRACT

Investigating the association between biobank derived genomic data and the information of linked electronic health records (EHRs) is an emerging area of research for dissecting the architecture of complex human traits, where cases and controls for study are defined through the use of electronic phenotyping algorithms deployed in large EHR systems. For our study, 2580 cataract cases and 1367 controls were identified within the Marshfield Personalized Medicine Research Project (PMRP) Biobank and linked EHR, which is a member of the NHGRI-funded electronic Medical Records and Genomics (eMERGE) Network. Our goal was to explore potential gene-gene and gene-environment interactions within these data for 529,431 single nucleotide polymorphisms (SNPs) with minor allele frequency > 1%, in order to explore higher level associations with cataract risk beyond investigations of single SNP-phenotype associations. To build our SNP-SNP interaction models we utilized a prior-knowledge driven filtering method called Biofilter to minimize the multiple testing burden of exploring the vast array of interaction models possible from our extensive number of SNPs. Using the Biofilter, we developed 57,376 prior-knowledge directed SNP-SNP models to test for association with cataract status. We selected models that required 6 sources of external domain knowledge. We identified 5 statistically significant models with an interaction term with p-value < 0.05, as well as an overall model with p-value < 0.05 associated with cataract status. We also conducted gene-environment interaction analyses for all GWAS SNPs and a set of environmental factors from the PhenX Toolkit: smoking, UV exposure, and alcohol use; these environmental factors have been previously associated with the formation of cataracts. We found a total of 288 models that exhibit an interaction term with a p-value ≤ 1 × 10⁻⁴ associated with cataract status. Our results show these approaches enable advanced searches for epistasis and gene-environment interactions beyond GWAS, and that the EHR based approach provides an additional source of data for seeking these advanced explanatory models of the etiology of complex disease/outcome such as cataracts.

Subject(s)

Cataract/etiology , Cataract/genetics , Epistasis, Genetic , Gene-Environment Interaction , Aged , Case-Control Studies , Computational Biology , Databases, Genetic/statistics & numerical data , Electronic Health Records/statistics & numerical data , Female , Genome-Wide Association Study/statistics & numerical data , Humans , Male , Middle Aged , Models, Genetic , Models, Statistical , Polymorphism, Single Nucleotide , Software

20.

ATHENA: a tool for meta-dimensional analysis applied to genotypes and gene expression data to predict HDL cholesterol levels.

Holzinger, Emily R; Dudek, Scott M; Frase, Alex T; Krauss, Ronald M; Medina, Marisa W; Ritchie, Marylyn D.

Pac Symp Biocomput ; : 385-96, 2013.

Article in English | MEDLINE | ID: mdl-23424143

ABSTRACT

Technology is driving the field of human genetics research with advances in techniques to generate high-throughput data that interrogate various levels of biological regulation. With this massive amount of data comes the important task of using powerful bioinformatics techniques to sift through the noise to find true signals that predict various human traits. A popular analytical method thus far has been the genome-wide association study (GWAS), which assesses the association of single nucleotide polymorphisms (SNPs) with the trait of interest. Unfortunately, GWAS has not been able to explain a substantial proportion of the estimated heritability for most complex traits. Due to the inherently complex nature of biology, this phenomenon could be a factor of the simplistic study design. A more powerful analysis may be a systems biology approach that integrates different types of data, or a meta-dimensional analysis. For this study we used the Analysis Tool for Heritable and Environmental Network Associations (ATHENA) to integrate high-throughput SNPs and gene expression variables (EVs) to predict high-density lipoprotein cholesterol (HDL-C) levels. We generated multivariable models that consisted of SNPs only, EVs only, and SNPs + EVs with testing r-squared values of 0.16, 0.11, and 0.18, respectively. Additionally, using just the SNPs and EVs from the best models, we generated a model with a testing r-squared of 0.32. A linear regression model with the same variables resulted in an adjusted r-squared of 0.23. With this systems biology approach, we were able to integrate different types of high-throughput data to generate meta-dimensional models that are predictive for the HDL-C in our data set. Additionally, our modeling method was able to capture more of the HDL-C variation than a linear regression model that included the same variables.

Subject(s)

Cholesterol, HDL/blood , Cholesterol, HDL/genetics , Software , Algorithms , Computational Biology , Databases, Genetic/statistics & numerical data , Gene Expression , Gene-Environment Interaction , Genome-Wide Association Study/statistics & numerical data , Genotype , HapMap Project , High-Throughput Screening Assays/statistics & numerical data , Humans , Meta-Analysis as Topic , Models, Genetic , Neural Networks, Computer , Polymorphism, Single Nucleotide , Systems Biology

