Search | VHL Search Portal

1.

AttentionPert: accurately modeling multiplexed genetic perturbations with multi-scale effects.

Bai, Ding; Ellington, Caleb N; Mo, Shentong; Song, Le; Xing, Eric P.

Bioinformatics ; 40(Suppl 1): i453-i461, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940174

ABSTRACT

MOTIVATION: Genetic perturbations (e.g. knockouts, variants) have laid the foundation for our understanding of many diseases, implicating pathogenic mechanisms and indicating therapeutic targets. However, experimental assays are fundamentally limited by the number of measurable perturbations. Computational methods can fill this gap by predicting perturbation effects under novel conditions, but accurately predicting the transcriptional responses of cells to unseen perturbations remains a significant challenge. RESULTS: We address this by developing a novel attention-based neural network, AttentionPert, which accurately predicts gene expression under multiplexed perturbations and generalizes to unseen conditions. AttentionPert integrates global and local effects in a multi-scale model, representing both the nonuniform system-wide impact of the genetic perturbation and the localized disturbance in a network of gene-gene similarities, enhancing its ability to predict nuanced transcriptional responses to both single and multi-gene perturbations. In comprehensive experiments, AttentionPert demonstrates superior performance across multiple datasets outperforming the state-of-the-art method in predicting differential gene expressions and revealing novel gene regulations. AttentionPert marks a significant improvement over current methods, particularly in handling the diversity of gene perturbations and in predicting out-of-distribution scenarios. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/BaiDing1234/AttentionPert.

Subject(s)

Computational Biology , Computational Biology/methods , Humans , Gene Regulatory Networks , Neural Networks, Computer , Gene Expression Profiling/methods

2.

US Residents' Preferences for Sharing of Electronic Health Record and Genetic Information: A Discrete Choice Experiment.

Wagner, Abram L; Zhang, Felicia; Ryan, Kerry A; Xing, Eric; Nong, Paige; Kardia, Sharon L R; Platt, Jodyn.

Value Health ; 26(9): 1301-1307, 2023 09.

Article in English | MEDLINE | ID: mdl-36736697

ABSTRACT

OBJECTIVES: The aim of this study was to assess preferences for sharing of electronic health record (EHR) and genetic information separately and to examine whether there are different preferences for sharing these 2 types of information. METHODS: Using a population-based, nationally representative survey of the United States, we conducted a discrete choice experiment in which half of the subjects (N = 790) responded to questions about sharing of genetic information and the other half (N = 751) to questions about sharing of EHR information. Conditional logistic regression models assessed relative preferences across attribute levels of where patients learn about health information sharing, whether shared data are deidentified, whether data are commercialized, how long biospecimens are kept, and what the purpose of sharing the information is. RESULTS: Individuals had strong preferences to share deidentified (vs identified) data (odds ratio [OR] 3.26, 95% confidence interval 2.68-3.96) and to be able to opt out of sharing information with commercial companies (OR 4.26, 95% confidence interval 3.42-5.30). There were no significant differences regarding how long biospecimens are kept or why the data are being shared. Individuals had a stronger preference for opting out of sharing genetic (OR 4.26) versus EHR information (OR 2.64) (P = .002). CONCLUSIONS: Hospital systems and regulatory bodies should consider patient preferences for sharing of personal medical records or genetic information. For both genetic and EHR information, patients strongly prefer their data to be deidentified and to have the choice to opt out of sharing information with commercial companies.

Subject(s)

Confidentiality , Electronic Health Records , Humans , United States , Information Dissemination , Logistic Models , Data Collection
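
The odds ratios in the abstract above come from logistic-regression coefficients, and the link between a coefficient and its reported interval can be sketched in a few lines. The sketch assumes a Wald interval on the log-odds scale (OR = exp(beta), bounds = exp(beta ± 1.96·SE)); the abstract does not say how its intervals were computed, and the function names are illustrative only.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumed Wald interval)


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) for a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_or(or_value, lower, upper, z=Z_95):
    """Back out (beta, se) from a reported OR and its confidence bounds."""
    beta = math.log(or_value)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Values reported in the abstract for deidentified vs identified data.
beta, se = coefficient_from_or(3.26, 2.68, 3.96)
or_value, lower, upper = odds_ratio_with_ci(beta, se)
print(f"beta={beta:.3f}, se={se:.3f}, OR={or_value:.2f}, CI=({lower:.2f}, {upper:.2f})")
```

Round-tripping the reported figures this way recovers the published bounds to within rounding, which is consistent with (but does not prove) a symmetric interval on the log scale.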

3.

Active learning to classify macromolecular structures in situ for less supervision in cryo-electron tomography.

Du, Xuefeng; Wang, Haohan; Zhu, Zhenxi; Zeng, Xiangrui; Chang, Yi-Wei; Zhang, Jing; Xing, Eric; Xu, Min.

Bioinformatics ; 37(16): 2340-2346, 2021 Aug 25.

Article in English | MEDLINE | ID: mdl-33620460

ABSTRACT

MOTIVATION: Cryo-Electron Tomography (cryo-ET) is a 3D bioimaging tool that visualizes the structural and spatial organization of macromolecules at a near-native state in single cells, which has broad applications in life science. However, the systematic structural recognition and recovery of macromolecules captured by cryo-ET are difficult due to high structural complexity and imaging limits. Deep learning-based subtomogram classification has played critical roles for such tasks. As supervised approaches, however, their performance relies on sufficient and laborious annotation on a large training dataset. RESULTS: To alleviate this major labeling burden, we proposed a Hybrid Active Learning (HAL) framework for querying subtomograms for labeling from a large unlabeled subtomogram pool. Firstly, HAL adopts uncertainty sampling to select the subtomograms that have the most uncertain predictions. This strategy forces the model to be aware of the inductive bias during classification and subtomogram selection, which satisfies the discriminativeness principle in AL literature. Moreover, to mitigate the sampling bias caused by such a strategy, a discriminator is introduced to judge if a certain subtomogram is labeled or unlabeled and subsequently the model queries the subtomograms that have higher probabilities of being unlabeled. Such a query strategy encourages matching of the data distribution between the labeled and unlabeled subtomogram samples, which essentially encodes the representativeness criterion into the subtomogram selection process. Additionally, HAL introduces a subset sampling strategy to improve the diversity of the query set, so that the information overlap is decreased between the queried batches and the algorithmic efficiency is improved.
Our experiments on subtomogram classification tasks using both simulated and real data demonstrate that we can achieve comparable testing performance (on average only 3% accuracy drop) by using less than 30% of the labeled subtomograms, which shows a very promising result for subtomogram classification task with limited labeling resources. AVAILABILITY AND IMPLEMENTATION: https://github.com/xulabs/aitom. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
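
The uncertainty-sampling step named in the abstract above can be illustrated with a minimal sketch. It assumes predictive entropy as the uncertainty measure, which the abstract does not specify, and it leaves out HAL's discriminator and subset-sampling components entirely. The pool contents and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool, k):
    """Return the k item names whose predictions have the highest entropy."""
    ranked = sorted(pool, key=lambda name: entropy(pool[name]), reverse=True)
    return ranked[:k]


# Hypothetical softmax outputs for three unlabeled subtomograms.
pool = {
    "subtomo_a": [0.98, 0.01, 0.01],  # confident prediction
    "subtomo_b": [0.40, 0.35, 0.25],  # near-uniform, most uncertain
    "subtomo_c": [0.70, 0.20, 0.10],
}
queried = select_most_uncertain(pool, k=2)
print(queried)
```

In an active-learning loop, the queried items would be sent for annotation, added to the labeled set, and the classifier retrained before the next round.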

4.

Coupled mixed model for joint genetic analysis of complex disorders with two independently collected data sets.

Wang, Haohan; Pei, Fen; Vanyukov, Michael M; Bahar, Ivet; Wu, Wei; Xing, Eric P.

BMC Bioinformatics ; 22(1): 50, 2021 Feb 05.

Article in English | MEDLINE | ID: mdl-33546598

ABSTRACT

BACKGROUND: In the last decade, Genome-wide Association studies (GWASs) have contributed to decoding the human genome by uncovering many genetic variations associated with various diseases. Many follow-up investigations involve joint analysis of multiple independently generated GWAS data sets. While most of the computational approaches developed for joint analysis are based on summary statistics, the joint analysis based on individual-level data with consideration of confounding factors remains a challenge. RESULTS: In this study, we propose a method, called Coupled Mixed Model (CMM), that enables a joint GWAS analysis on two independently collected sets of GWAS data with different phenotypes. The CMM method does not require the data sets to have the same phenotypes as it aims to infer the unknown phenotypes using a set of multivariate sparse mixed models. Moreover, CMM addresses the confounding variables due to population stratification, family structures, and cryptic relatedness, as well as those arising during data collection such as batch effects that frequently appear in joint genetic studies. We evaluate the performance of CMM using simulation experiments. In real data analysis, we illustrate the utility of CMM by an application to evaluating common genetic associations for Alzheimer's disease and substance use disorder using datasets independently collected for the two complex human disorders. Comparison of the results with those from previous experiments and analyses supports the utility of our method and provides new insights into the diseases. The software is available at https://github.com/HaohanWang/CMM.

Subject(s)

Genome-Wide Association Study , Phenotype , Software , Algorithms , Humans , Models, Genetic , Polymorphism, Single Nucleotide

5.

Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species.

Zheng, Yumin; Wang, Haohan; Zhang, Yang; Gao, Xin; Xing, Eric P; Xu, Min.

PLoS Comput Biol ; 16(11): e1008297, 2020 11.

Article in English | MEDLINE | ID: mdl-33151940

ABSTRACT

In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of poly(A) signal (PAS) on the DNA sequence is the key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods were widely used in computationally identifying PAS, the need for tremendous amounts of annotation data hinders applications of existing methods in species without experimental data on PAS. Therefore, cross-species PAS identification, which enables the possibility to predict PAS from untrained species, naturally becomes a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolution Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species and build cross-species training sets with two of them and evaluate the performance on the remaining ones. Moreover, we test our method against insufficient data and imbalanced data issues and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when it comes to a smaller or imbalanced training set.

Subject(s)

Deep Learning , Deoxyguanosine/metabolism , Poly A/metabolism , Signal Transduction , Animals , Humans , Neural Networks, Computer , Species Specificity

6.

Computational catalyst discovery: Active classification through myopic multiscale sampling.

Tran, Kevin; Neiswanger, Willie; Broderick, Kirby; Xing, Eric; Schneider, Jeff; Ulissi, Zachary W.

J Chem Phys ; 154(12): 124118, 2021 Mar 28.

Article in English | MEDLINE | ID: mdl-33810693

ABSTRACT

The recent boom in computational chemistry has enabled several projects aimed at discovering useful materials or catalysts. We acknowledge and address two recurring issues in the field of computational catalyst discovery. First, calculating macro-scale catalyst properties is not straightforward when using ensembles of atomic-scale calculations [e.g., density functional theory (DFT)]. We attempt to address this issue by creating a multi-scale model that estimates bulk catalyst activity using adsorption energy predictions from both DFT and machine learning models. The second issue is that many catalyst discovery efforts seek to optimize catalyst properties, but optimization is an inherently exploitative objective that is in tension with the explorative nature of early-stage discovery projects. In other words, why invest so much time finding a "best" catalyst when it is likely to fail for some other, unforeseen problem? We address this issue by relaxing the catalyst discovery goal into a classification problem: "What is the set of catalysts that is worth testing experimentally?" Here, we present a catalyst discovery method called myopic multiscale sampling, which combines multiscale modeling with automated selection of DFT calculations. It is an active classification strategy that seeks to classify catalysts as "worth investigating" or "not worth investigating" experimentally. Our results show a ~7-16 times speedup in catalyst classification relative to random sampling. These results were based on offline simulations of our algorithm on two different datasets: a larger, synthesized dataset and a smaller, real dataset.

7.

Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data.

Wang, Haohan; Lengerich, Benjamin J; Aragam, Bryon; Xing, Eric P.

Bioinformatics ; 35(7): 1181-1187, 2019 04 01.

Article in English | MEDLINE | ID: mdl-30184048

ABSTRACT

MOTIVATION: Association studies to discover links between genetic markers and phenotypes are central to bioinformatics. Methods of regularized regression, such as variants of the Lasso, are popular for this task. Despite the good predictive performance of these methods in the average case, they suffer from unstable selections of correlated variables and inconsistent selections of linearly dependent variables. Unfortunately, as we demonstrate empirically, such problematic situations of correlated and linearly dependent variables often exist in genomic datasets and lead to under-performance of classical methods of variable selection. RESULTS: To address these challenges, we propose the Precision Lasso. Precision Lasso is a Lasso variant that promotes sparse variable selection by regularization governed by the covariance and inverse covariance matrices of explanatory variables. We illustrate its capacity for stable and consistent variable selection in simulated data with highly correlated and linearly dependent variables. We then demonstrate the effectiveness of the Precision Lasso to select meaningful variables from transcriptomic profiles of breast cancer patients. Our results indicate that in settings with correlated and linearly dependent variables, the Precision Lasso outperforms popular methods of variable selection such as the Lasso, the Elastic Net and Minimax Concave Penalty (MCP) regression. AVAILABILITY AND IMPLEMENTATION: Software is available at https://github.com/HaohanWang/thePrecisionLasso. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Software , Humans , Phenotype

8.

Deep mixed model for marginal epistasis detection and population stratification correction in genome-wide association studies.

Wang, Haohan; Yue, Tianwei; Yang, Jingkang; Wu, Wei; Xing, Eric P.

BMC Bioinformatics ; 20(Suppl 23): 656, 2019 Dec 27.

Article in English | MEDLINE | ID: mdl-31881907

ABSTRACT

BACKGROUND: Genome-wide Association Studies (GWAS) have contributed to unraveling associations between genetic variants in the human genome and complex traits for more than a decade. While many follow-up methods have been developed to detect interactions between SNPs, epistasis is yet to be modeled and discovered more thoroughly. RESULTS: In this paper, following the previous study of detecting marginal epistasis signals, and motivated by the universal approximation power of deep learning, we propose a neural network method that can potentially model arbitrary interactions between SNPs in genetic association studies as an extension to the mixed models in correcting confounding factors. Our method, namely Deep Mixed Model, consists of two components: 1) a confounding factor correction component, which is a large-kernel convolution neural network that focuses on calibrating the residual phenotypes by removing factors such as population stratification, and 2) a fixed-effect estimation component, which mainly consists of a Long Short-Term Memory (LSTM) model that estimates the association effect size of SNPs with the residual phenotype. CONCLUSIONS: After validating the performance of our method using simulation experiments, we further apply it to Alzheimer's disease data sets. Our results help gain some explorative understandings of the genetic architecture of Alzheimer's disease.

Subject(s)

Epistasis, Genetic , Genome-Wide Association Study , Models, Genetic , Algorithms , Alzheimer Disease/genetics , Area Under Curve , Base Sequence , Computer Simulation , Humans , Polymorphism, Single Nucleotide/genetics , ROC Curve

9.

Personalized regression enables sample-specific pan-cancer analysis.

Lengerich, Benjamin J; Aragam, Bryon; Xing, Eric P.

Bioinformatics ; 34(13): i178-i186, 2018 07 01.

Article in English | MEDLINE | ID: mdl-29949997

ABSTRACT

Motivation: In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. In order to better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models. Results: To uncover this inter-sample heterogeneity, we propose a novel regularizer for achieving patient-specific personalized estimation. This regularizer operates by learning two latent distance metrics-one between personalized parameters and one between clinical covariates-and attempting to match the induced distances as closely as possible. Crucially, we do not assume these distance metrics are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer. Availability and implementation: Software for personalized linear and personalized logistic regression, along with code to reproduce experimental results, is freely available at github.com/blengerich/personalized_regression.

Subject(s)

Genomics/methods , Models, Genetic , Mutation , Neoplasms/genetics , Software , Female , Genetic Predisposition to Disease , Humans , Male , Models, Statistical , Polymorphism, Single Nucleotide , Precision Medicine/methods , Sequence Analysis, DNA/methods

10.

Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies.

Wang, Haohan; Aragam, Bryon; Xing, Eric P.

Methods ; 145: 2-9, 2018 08 01.

Article in English | MEDLINE | ID: mdl-29705212

ABSTRACT

A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and human, and discuss the knowledge we discover with our method.

Subject(s)

Genome-Wide Association Study/methods , Models, Statistical , Polymorphism, Single Nucleotide , Animals , Humans , Plants/genetics

11.

Multiplex confounding factor correction for genomic association mapping with squared sparse linear mixed model.

Wang, Haohan; Liu, Xiang; Xiao, Yunpeng; Xu, Ming; Xing, Eric P.

Methods ; 145: 33-40, 2018 08 01.

Article in English | MEDLINE | ID: mdl-29705210

ABSTRACT

Genome-wide Association Study has presented a promising way to understand the association between human genomes and complex traits. Many simple polymorphic loci have been shown to explain a significant fraction of phenotypic variability. However, challenges remain in the non-triviality of explaining complex traits associated with multifactorial genetic loci, especially considering the confounding factors caused by population structure, family structure, and cryptic relatedness. In this paper, we propose a Squared-LMM (LMM2) model, aiming to jointly correct population and genetic confounding factors. We offer two strategies of utilizing LMM2 for association mapping: 1) It serves as an extension of univariate LMM, which could effectively correct population structure, but considers each SNP in isolation. 2) It is integrated with the multivariate regression model to discover association relationship between complex traits and multifactorial genetic loci. We refer to this second model as sparse Squared-LMM (sLMM2). Further, we extend LMM2/sLMM2 by raising the power of our squared model to the LMMn/sLMMn model. We demonstrate the practical use of our model with synthetic phenotypic variants generated from genetic loci of Arabidopsis thaliana. The experiment shows that our method achieves a more accurate and significant prediction on the association relationship between traits and loci. We also evaluate our models on collected phenotypes and genotypes with the number of candidate genes that the models could discover. The results suggest the potential and promising usage of our method in genome-wide association studies.

Subject(s)

Genetic Loci , Genome-Wide Association Study/methods , Models, Statistical , Polymorphism, Genetic , Arabidopsis/genetics , Evolution, Molecular , Genes, Plant , Genetics, Population , Models, Genetic , Multigene Family

12.

Deep learning-based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms.

Xu, Min; Chai, Xiaoqi; Muthakana, Hariank; Liang, Xiaodan; Yang, Ge; Zeev-Ben-Mordehai, Tzviya; Xing, Eric P.

Bioinformatics ; 33(14): i13-i22, 2017 Jul 15.

Article in English | MEDLINE | ID: mdl-28881965

ABSTRACT

MOTIVATION: Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization at near-native state and in sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organizations inside single cells. However, high degree of structural complexity together with practical imaging limitations makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such amount of subtomograms due to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amount of data. RESULTS: To complement existing approaches, in this article we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared with existing unsupervised rotation invariant feature and pose-normalization based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in training data. AVAILABILITY AND IMPLEMENTATION: Source code freely available at http://www.cs.cmu.edu/~mxu1/software . CONTACT: mxu1@cs.cmu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Electron Microscope Tomography/methods , Machine Learning , Molecular Structure , Cluster Analysis , Image Processing, Computer-Assisted/methods

13.

Backward genotype-transcript-phenotype association mapping.

Lee, Seunghak; Wang, Haohan; Xing, Eric P.

Methods ; 129: 18-23, 2017 10 01.

Article in English | MEDLINE | ID: mdl-28917724

ABSTRACT

Genome-wide association studies have discovered a large number of genetic variants associated with complex diseases such as Alzheimer's disease. However, the genetic background of such diseases is largely unknown due to the complex mechanisms underlying genetic effects on traits, as well as a small sample size (e.g., 1000) and a large number of genetic variants (e.g., 1 million). Fortunately, datasets that contain genotypes, transcripts, and phenotypes are becoming more readily available, creating new opportunities for detecting disease-associated genetic variants. In this paper, we present a novel approach called "Backward Three-way Association Mapping" (BTAM) for detecting three-way associations among genotypes, transcripts, and phenotypes. Assuming that genotypes affect transcript levels, which in turn affect phenotypes, we first find transcripts associated with the phenotypes, and then find genotypes associated with the chosen transcripts. The backward ordering of association mappings allows us to avoid a large number of association testings between all genotypes and all transcripts, making it possible to identify three-way associations with a small computational cost. In our simulation study, we demonstrate that BTAM significantly improves the statistical power over "forward" three-way association mapping that finds genotypes associated with both transcripts and phenotypes and genotype-phenotype association mapping. Furthermore, we apply BTAM on an Alzheimer's disease dataset and report top 10 genotype-transcript-phenotype associations.

Subject(s)

Chromosome Mapping/methods , Genetic Association Studies/methods , Genetic Variation/genetics , Genome-Wide Association Study/methods , Algorithms , Genotype , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Software
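
The backward ordering described in the abstract above (phenotype to transcripts first, then transcripts to genotypes) can be sketched on toy data. This is only an illustration of the ordering: plain Pearson correlation with a fixed threshold stands in for whatever association test BTAM actually uses, and the data, names, and threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def backward_mapping(genotypes, transcripts, phenotype, threshold=0.8):
    """Stage 1: keep transcripts associated with the phenotype.
    Stage 2: test genotypes only against those kept transcripts."""
    chosen = [t for t, values in transcripts.items()
              if abs(pearson(values, phenotype)) >= threshold]
    return [(g, t) for t in chosen for g, values in genotypes.items()
            if abs(pearson(values, transcripts[t])) >= threshold]


# Hypothetical toy data for six samples.
genotypes = {"snp_0": [0, 1, 2, 0, 1, 2], "snp_1": [1, 0, 1, 0, 1, 0]}
transcripts = {
    "transcript_0": [0.1, 1.0, 2.1, 0.0, 0.9, 2.0],  # tracks snp_0
    "transcript_1": [0.5, 0.6, 0.4, 0.6, 0.4, 0.5],  # unrelated noise
}
phenotype = [1.0, 2.1, 2.9, 1.1, 2.0, 3.1]            # tracks transcript_0

associations = backward_mapping(genotypes, transcripts, phenotype)
print(associations)
```

Because stage 2 runs only over the transcripts kept in stage 1, the number of genotype-transcript tests shrinks from every pair to the pairs involving phenotype-associated transcripts, which is the cost saving the abstract describes.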

14.

Kernel methods for large-scale genomic data analysis.

Wang, Xuefeng; Xing, Eric P; Schaid, Daniel J.

Brief Bioinform ; 16(2): 183-92, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25053743

ABSTRACT

Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today's explosive data growth in genomics. They provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, to help reveal the complexity in the relationship between the genetic markers and the outcome of interest. In this review, we highlight the potential key role it will have in modern genomic data processing, especially with regard to integration with classical methods for gene prioritizing, prediction and data fusion.

Subject(s)

Genomics/statistics & numerical data , Machine Learning , Computational Biology , Data Interpretation, Statistical , Genome-Wide Association Study/statistics & numerical data , Humans , Logistic Models , Models, Statistical , Polymorphism, Single Nucleotide , Support Vector Machine

15.

A network-driven approach for genome-wide association mapping.

Lee, Seunghak; Kong, Soonho; Xing, Eric P.

Bioinformatics ; 32(12): i164-i173, 2016 06 15.

Article in English | MEDLINE | ID: mdl-27307613

ABSTRACT

MOTIVATION: It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and complex underlying mechanisms involved in associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms. RESULTS: In this article, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control, taking advantage of gene expression data. Furthermore, we applied NETAM on late-onset Alzheimer's disease data and identified 477 significant path associations, among which we analyzed paths related to beta-amyloid, estrogen, and nicotine pathways. We also provide hypothetical biological pathways to explain our findings. AVAILABILITY AND IMPLEMENTATION: Software is available at http://www.sailing.cs.cmu.edu/ CONTACT: epxing@cs.cmu.edu.

Subject(s)

Genome-Wide Association Study , Chromosome Mapping , Genetic Association Studies , Genotype , Phenotype , Polymorphism, Single Nucleotide

16.

A time-varying group sparse additive model for genome-wide association studies of dynamic complex traits.

Marchetti-Bowick, Micol; Yin, Junming; Howrylak, Judie A; Xing, Eric P.

Bioinformatics ; 32(19): 2903-10, 2016 10 01.

Article in English | MEDLINE | ID: mdl-27296983

ABSTRACT

MOTIVATION: Despite the widespread popularity of genome-wide association studies (GWAS) for genetic mapping of complex traits, most existing GWAS methodologies are still limited to the use of static phenotypes measured at a single time point. In this work, we propose a new method for association mapping that considers dynamic phenotypes measured at a sequence of time points. Our approach relies on the use of Time-Varying Group Sparse Additive Models (TV-GroupSpAM) for high-dimensional, functional regression. RESULTS: This new model detects a sparse set of genomic loci that are associated with trait dynamics, and demonstrates increased statistical power over existing methods. We evaluate our method via experiments on synthetic data and perform a proof-of-concept analysis for detecting single nucleotide polymorphisms associated with two phenotypes used to assess asthma severity: forced vital capacity, a sensitive measure of airway obstruction, and bronchodilator response, which measures lung response to bronchodilator drugs. AVAILABILITY AND IMPLEMENTATION: Source code for TV-GroupSpAM freely available for download at http://www.cs.cmu.edu/~mmarchet/projects/tv_group_spam, implemented in MATLAB. CONTACT: epxing@cs.cmu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Models, Genetic , Polymorphism, Single Nucleotide , Chromosome Mapping , Genome , Genome-Wide Association Study , Humans , Phenotype

17.

Gene expression profiling of asthma phenotypes demonstrates molecular signatures of atopy and asthma control.

Howrylak, Judie A; Moll, Matthew; Weiss, Scott T; Raby, Benjamin A; Wu, Wei; Xing, Eric P.

J Allergy Clin Immunol ; 137(5): 1390-1397.e6, 2016 05.

Article in English | MEDLINE | ID: mdl-26792209

ABSTRACT

BACKGROUND: Recent studies have used cluster analysis to identify phenotypic clusters of asthma with differences in clinical traits, as well as differences in response to therapy with anti-inflammatory medications. However, the correspondence between different phenotypic clusters and differences in the underlying molecular mechanisms of asthma pathogenesis remains unclear. OBJECTIVE: We sought to determine whether clinical differences among children with asthma in different phenotypic clusters corresponded to differences in levels of gene expression. METHODS: We explored differences in gene expression profiles of CD4(+) lymphocytes isolated from the peripheral blood of 299 young adult participants in the Childhood Asthma Management Program study. We obtained gene expression profiles from study subjects between 9 and 14 years of age after they participated in a randomized, controlled longitudinal study examining the effects of inhaled anti-inflammatory medications over a 48-month study period, and we evaluated the correspondence between our earlier phenotypic cluster analysis and subsequent follow-up clinical and molecular profiles. RESULTS: We found that differences in clinical characteristics observed between subjects assigned to different phenotypic clusters persisted into young adulthood and that these clinical differences were associated with differences in gene expression patterns between subjects in different clusters. We identified a subset of genes associated with atopic status, validated the presence of an atopic signature among these genes in an independent cohort of asthmatic subjects, and identified the presence of common transcription factor binding sites corresponding to glucocorticoid receptor binding. CONCLUSION: These findings suggest that phenotypic clusters are associated with differences in the underlying pathobiology of asthma. Further experiments are necessary to confirm these findings.

Subject(s)

Asthma/genetics , Hypersensitivity, Immediate/genetics , Adolescent , Asthma/blood , Asthma/immunology , Asthma/physiopathology , CD4-Positive T-Lymphocytes/metabolism , Child , Eosinophils/immunology , Female , Gene Expression Profiling , Humans , Immunoglobulin E/blood , Male , Phenotype , Randomized Controlled Trials as Topic , Spirometry , Transcriptome

18.

Network analysis of breast cancer progression and reversal using a tree-evolving network algorithm.

Parikh, Ankur P; Curtis, Ross E; Kuhn, Irene; Becker-Weimann, Sabine; Bissell, Mina; Xing, Eric P; Wu, Wei.

PLoS Comput Biol ; 10(7): e1003713, 2014 Jul.

Article in English | MEDLINE | ID: mdl-25057922

ABSTRACT

The HMT3522 progression series of human breast cells have been used to discover how tissue architecture, microenvironment and signaling molecules affect breast cell growth and behaviors. However, much remains to be elucidated about malignant and phenotypic reversion behaviors of the HMT3522-T4-2 cells of this series. We employed a "pan-cell-state" strategy, and analyzed jointly microarray profiles obtained from different state-specific cell populations from this progression and reversion model of the breast cells using a tree-lineage multi-network inference algorithm, Treegl. We found that different breast cell states contain distinct gene networks. The network specific to non-malignant HMT3522-S1 cells is dominated by genes involved in normal processes, whereas the T4-2-specific network is enriched with cancer-related genes. The networks specific to various conditions of the reverted T4-2 cells are enriched with pathways suggestive of compensatory effects, consistent with clinical data showing patient resistance to anticancer drugs. We validated the findings using an external dataset, and showed that aberrant expression values of certain hubs in the identified networks are associated with poor clinical outcomes. Thus, analysis of various reversion conditions (including non-reverted) of HMT3522 cells using Treegl can be a good model system to study drug effects on breast cancer.

Subject(s)

Algorithms , Breast Neoplasms/genetics , Computational Biology/methods , Cell Line, Tumor , Computer Simulation , Databases, Factual , Disease Progression , Female , Gene Regulatory Networks , Humans , Kaplan-Meier Estimate , Markov Chains , Oligonucleotide Array Sequence Analysis

19.

High-dimensional feature selection by feature-wise kernelized Lasso.

Yamada, Makoto; Jitkrittum, Wittawat; Sigal, Leonid; Xing, Eric P; Sugiyama, Masashi.

Neural Comput ; 26(1): 185-207, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24102126

ABSTRACT

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.

Subject(s)

Algorithms , Artificial Intelligence , Nonlinear Dynamics , Pattern Recognition, Automated/methods , Animals , Oligonucleotide Array Sequence Analysis , Rats

20.

GINI: from ISH images to gene interaction networks.

Puniyani, Kriti; Xing, Eric P.

PLoS Comput Biol ; 9(10): e1003227, 2013.

Article in English | MEDLINE | ID: mdl-24130465

ABSTRACT

Accurate inference of molecular and functional interactions among genes, especially in multicellular organisms such as Drosophila, often requires statistical analysis of correlations not only between the magnitudes of gene expressions, but also between their temporal-spatial patterns. The ISH (in-situ-hybridization)-based gene expression micro-imaging technology offers an effective approach to perform large-scale spatial-temporal profiling of whole-body mRNA abundance. However, analytical tools for discovering gene interactions from such data remain an open challenge due to various reasons, including difficulties in extracting canonical representations of gene activities from images, and in inference of statistically meaningful networks from such representations. In this paper, we present GINI, a machine learning system for inferring gene interaction networks from Drosophila embryonic ISH images. GINI builds on a computer-vision-inspired vector-space representation of the spatial pattern of gene expression in ISH images, enabled by our recently developed [Formula: see text] system; and a new multi-instance-kernel algorithm that learns a sparse Markov network model, in which, every gene (i.e., node) in the network is represented by a vector-valued spatial pattern rather than a scalar-valued gene intensity as in conventional approaches such as a Gaussian graphical model. By capturing the notion of spatial similarity of gene expression, and at the same time properly taking into account the presence of multiple images per gene via multi-instance kernels, GINI is well-positioned to infer statistically sound, and biologically meaningful gene interaction networks from image data. Using both synthetic data and a small manually curated data set, we demonstrate the effectiveness of our approach in network building. Furthermore, we report results on a large publicly available collection of Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project, where GINI makes novel and interesting predictions of gene interactions. Software for GINI is available at http://sailing.cs.cmu.edu/Drosophila_ISH_images/

Subject(s)

Computational Biology/methods , Gene Expression Profiling/methods , Gene Regulatory Networks/genetics , Gene Regulatory Networks/physiology , Image Processing, Computer-Assisted/methods , In Situ Hybridization/methods , Animals , Drosophila/genetics , Drosophila/metabolism , Markov Chains
