1.
Bioinformatics ; 37(18): 2971-2980, 2021 09 29.
Article in English | MEDLINE | ID: mdl-33760022

ABSTRACT

MOTIVATION: Knowledge manipulation of Gene Ontology (GO) and Gene Ontology Annotation (GOA) can be done primarily by using vector representation of GO terms and genes. Previous studies have represented GO terms and genes or gene products in Euclidean space to measure their semantic similarity using an embedding method such as the Word2Vec-based method to represent entities as numeric vectors. However, this method has the limitation that embedding large graph-structured data in Euclidean space cannot prevent a loss of information about latent hierarchies, thus precluding the semantics of GO and GOA from being captured optimally. On the other hand, hyperbolic spaces such as the Poincaré ball are more suitable for modeling hierarchies, as they have a geometric property in which distances increase exponentially near the boundary because of negative curvature. RESULTS: In this article, we propose hierarchical representations of GO and genes (HiG2Vec) by applying Poincaré embedding specialized in the representation of hierarchy through a two-step procedure: GO embedding and gene embedding. Through experiments, we show that our model represents the hierarchical structure better than other approaches and predicts the interaction of genes or gene products as well as or better than previous studies. The results indicate that HiG2Vec is superior to other methods in capturing the GO and gene semantics and in data utilization as well. It can be robustly applied to manipulate various biological knowledge. AVAILABILITY AND IMPLEMENTATION: https://github.com/JaesikKim/HiG2Vec. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Proteins , Gene Ontology , Computational Biology/methods , Proteins/genetics , Semantics , Molecular Sequence Annotation , RNA
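
The geometric property mentioned in the abstract above (distances growing rapidly near the boundary of the Poincaré ball) can be illustrated with the standard Poincaré-ball distance formula. This is a minimal sketch, not the HiG2Vec implementation; the function name and the example points are hypothetical.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Two pairs with the same Euclidean gap (0.09) at different radii (illustrative values).
near_origin = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.99, 0.0])
print(near_origin, near_boundary)
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary, which is why leaf-level terms of a hierarchy can be spread out there without crowding.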
2.
Bioinformatics ; 37(16): 2405-2413, 2021 Aug 25.
Article in English | MEDLINE | ID: mdl-33543748

ABSTRACT

MOTIVATION: To better understand the molecular features of cancers, a comprehensive analysis using multi-omics data has been conducted. In addition, a pathway activity inference method has been developed to facilitate the integrative effects of multiple genes. In this respect, we have recently proposed a novel integrative pathway activity inference approach, iDRW, and demonstrated the effectiveness of the method with respect to dichotomizing two survival groups. However, there were several limitations, such as a lack of generality. In this study, we designed a directed gene-gene graph using pathway information by assigning interactions between genes in multiple layers of networks. RESULTS: As a proof-of-concept study, it was evaluated using three genomic profiles of urologic cancer patients. The proposed integrative approach achieved improved outcome prediction performance compared with a single genomic profile alone and other existing pathway activity inference methods. The integrative approach also identified common/cancer-specific candidate driver pathways as predictive prognostic features in urologic cancers. Furthermore, it provides better biological insights into the prioritized pathways and genes in an integrated view using a multi-layered gene-gene network. Our framework is not specifically designed for urologic cancers and is generally applicable to various datasets. AVAILABILITY AND IMPLEMENTATION: iDRW is implemented as an R software package. The source code is available at https://github.com/sykim122/iDRW. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
Nutr Metab Cardiovasc Dis ; 32(5): 1218-1226, 2022 05.
Article in English | MEDLINE | ID: mdl-35197214

ABSTRACT

BACKGROUND AND AIMS: We aimed to develop and evaluate a non-invasive deep learning algorithm for screening type 2 diabetes in UK Biobank participants using retinal images. METHODS AND RESULTS: The deep learning model for prediction of type 2 diabetes was trained on retinal images from 50,077 UK Biobank participants and tested on 12,185 participants. We evaluated its performance in terms of predicting traditional risk factors (TRFs) and genetic risk for diabetes. Next, we compared the performance of three models in predicting type 2 diabetes using 1) an image-only deep learning algorithm, 2) TRFs, 3) the combination of the algorithm and TRFs. Assessing net reclassification improvement (NRI) allowed quantification of the improvement afforded by adding the algorithm to the TRF model. When predicting TRFs with the deep learning algorithm, the areas under the curve (AUCs) obtained with the validation set for age, sex, and HbA1c status were 0.931 (0.928-0.934), 0.933 (0.929-0.936), and 0.734 (0.715-0.752), respectively. When predicting type 2 diabetes, the AUC of the composite logistic model using non-invasive TRFs was 0.810 (0.790-0.830), and that for the deep learning model using only fundus images was 0.731 (0.707-0.756). Upon addition of TRFs to the deep learning algorithm, discriminative performance was improved to 0.844 (0.826-0.861). The addition of the algorithm to the TRFs model improved risk stratification with an overall NRI of 50.8%. CONCLUSION: Our results demonstrate that this deep learning algorithm can be a useful tool for stratifying individuals at high risk of type 2 diabetes in the general population.


Subject(s)
Deep Learning , Diabetes Mellitus, Type 2 , Algorithms , Area Under Curve , Diabetes Mellitus, Type 2/diagnosis , Diabetes Mellitus, Type 2/epidemiology , Fundus Oculi , Humans
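
The net reclassification improvement (NRI) reported in the abstract above can be sketched as follows. The abstract does not say which NRI variant was used; this example assumes the category-free (continuous) form, and the function name and toy risk values are hypothetical.

```python
def category_free_nri(old_risk, new_risk, outcome):
    """Category-free NRI: net upward movement among events minus net upward movement among non-events."""
    events = [(o, n) for o, n, y in zip(old_risk, new_risk, outcome) if y == 1]
    nonevents = [(o, n) for o, n, y in zip(old_risk, new_risk, outcome) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")

    def net_up(pairs):
        # Fraction whose predicted risk rose minus fraction whose predicted risk fell.
        up = sum(1 for o, n in pairs if n > o)
        down = sum(1 for o, n in pairs if n < o)
        return (up - down) / len(pairs)

    return net_up(events) - net_up(nonevents)

# Toy data (not from the study): risks from a baseline model and an extended model.
old = [0.2, 0.4, 0.6, 0.3, 0.5, 0.1]
new = [0.3, 0.5, 0.5, 0.2, 0.4, 0.2]
y = [1, 1, 1, 0, 0, 0]
nri = category_free_nri(old, new, y)
print(nri)
```

A positive value means the extended model moves predicted risks in the right direction more often than the wrong one; in this form the statistic ranges from -2 to 2.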
4.
Nature ; 512(7515): 449-52, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164756

ABSTRACT

Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal 'arms', and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.


Subject(s)
Caenorhabditis elegans/cytology , Caenorhabditis elegans/genetics , Chromatin/genetics , Chromatin/metabolism , Drosophila melanogaster/cytology , Drosophila melanogaster/genetics , Animals , Cell Line , Centromere/genetics , Centromere/metabolism , Chromatin/chemistry , Chromatin Assembly and Disassembly/genetics , DNA Replication/genetics , Enhancer Elements, Genetic/genetics , Epigenesis, Genetic , Heterochromatin/chemistry , Heterochromatin/genetics , Heterochromatin/metabolism , Histones/chemistry , Histones/metabolism , Humans , Molecular Sequence Annotation , Nuclear Lamina/metabolism , Nucleosomes/chemistry , Nucleosomes/genetics , Nucleosomes/metabolism , Promoter Regions, Genetic/genetics , Species Specificity
5.
Sensors (Basel) ; 20(24)2020 Dec 16.
Article in English | MEDLINE | ID: mdl-33339334

ABSTRACT

As the number of patients with Alzheimer's disease (AD) increases, the effort needed to care for these patients increases as well. At the same time, advances in information and sensor technologies have reduced caring costs, providing a potential pathway for developing healthcare services for AD patients. For instance, if a virtual reality (VR) system can provide emotion-adaptive content, the time that AD patients spend interacting with VR content is expected to be extended, allowing caregivers to focus on other tasks. As the first step towards this goal, in this study, we develop a classification model that detects AD patients' emotions (e.g., happy, peaceful, or bored). We first collected electroencephalography (EEG) data from 30 Korean female AD patients who watched emotion-evoking videos at a medical rehabilitation center. We applied conventional machine learning algorithms, such as a multilayer perceptron (MLP) and support vector machine, along with deep learning models of recurrent neural network (RNN) architectures. The best performance was obtained from MLP, which achieved an average accuracy of 70.97%; the RNN model's accuracy reached only 48.18%. Our study results open a new stream of research in the field of EEG-based emotion detection for patients with neurological disorders.


Subject(s)
Alzheimer Disease , Electroencephalography , Emotions/classification , Machine Learning , Neural Networks, Computer , Alzheimer Disease/diagnosis , Female , Humans
6.
Sensors (Basel) ; 19(20)2019 Oct 20.
Article in English | MEDLINE | ID: mdl-31635194

ABSTRACT

In recent years, affective computing has been actively researched to provide a higher level of emotion-awareness. Numerous studies have been conducted to detect the user's emotions from physiological data. Among a myriad of target emotions, boredom, in particular, has been suggested to cause not only medical issues but also challenges in various facets of daily life. However, to the best of our knowledge, no previous studies have used electroencephalography (EEG) and galvanic skin response (GSR) together for boredom classification, although these data have potential features for emotion classification. To investigate the combined effect of these features on boredom classification, we collected EEG and GSR data from 28 participants using off-the-shelf sensors. During data acquisition, we used a set of stimuli comprising a video clip designed to elicit boredom and two other video clips of entertaining content. The collected samples were labeled based on the participants' questionnaire-based testimonies on experienced boredom levels. Using the collected data, we initially trained 30 models with 19 machine learning algorithms and selected the top three candidate classifiers. After tuning the hyperparameters, we validated the final models through 1000 iterations of 10-fold cross validation to increase the robustness of the test results. Our results indicated that a Multilayer Perceptron model performed the best with a mean accuracy of 79.98% (AUC: 0.781). It also revealed the correlation between boredom and the combined features of EEG and GSR. These results can be useful for building accurate affective computing systems and understanding the physiological properties of boredom.


Subject(s)
Boredom , Electroencephalography/methods , Machine Learning , Adult , Area Under Curve , Discriminant Analysis , Female , Galvanic Skin Response , Humans , Male , ROC Curve , Surveys and Questionnaires , Young Adult
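
The validation scheme described in the abstract above (repeated 10-fold cross validation) can be sketched with a small index generator. This is an illustration only: the abstract does not state whether the folds were stratified or how samples were grouped, and the function name and parameters are hypothetical.

```python
import random

def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=1000, seed=0):
    """Yield (train_idx, test_idx) pairs for n_repeats independent shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx

# Small demonstration: 3 repeats of 10-fold splitting over 28 samples.
splits = list(repeated_kfold_indices(28, n_folds=10, n_repeats=3, seed=1))
print(len(splits))
```

Averaging a metric over all repeats reduces the dependence of the estimate on any single random partition of the data.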
7.
Bioinformatics ; 31(13): 2066-74, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25725496

ABSTRACT

MOTIVATION: Genome-wide mapping of chromatin states is essential for defining regulatory elements and inferring their activities in eukaryotic genomes. A number of hidden Markov model (HMM)-based methods have been developed to infer chromatin state maps from genome-wide histone modification data for an individual genome. To perform a principled comparison of evolutionarily distant epigenomes, we must consider species-specific biases such as differences in genome size, strength of signal enrichment and co-occurrence patterns of histone modifications. RESULTS: Here, we present a new Bayesian non-parametric method called hierarchically linked infinite HMM (hiHMM) to jointly infer chromatin state maps in multiple genomes (different species, cell types and developmental stages) using genome-wide histone modification data. This flexible framework provides a new way to learn a consistent definition of chromatin states across multiple genomes, thus facilitating a direct comparison among them. We demonstrate the utility of this method using synthetic data as well as multiple modENCODE ChIP-seq datasets. CONCLUSION: The hierarchical and Bayesian non-parametric formulation in our approach is an important extension to the current set of methodologies for comparative chromatin landscape analysis. AVAILABILITY AND IMPLEMENTATION: Source codes are available at https://github.com/kasohn/hiHMM. Chromatin data are available at http://encode-x.med.harvard.edu/data_sets/chromatin/.


Subject(s)
Bayes Theorem , Chromatin/genetics , Regulatory Sequences, Nucleic Acid/genetics , Software , Statistics, Nonparametric , Animals , Chromatin Immunoprecipitation , Computational Biology/methods , Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental , Histones/metabolism , Humans , Promoter Regions, Genetic
8.
Methods ; 67(3): 344-53, 2014 Jun 01.
Article in English | MEDLINE | ID: mdl-24561168

ABSTRACT

In order to improve our understanding of cancer and develop multi-layered theoretical models for the underlying mechanism, it is essential to have enhanced understanding of the interactions between multiple levels of genomic data that contribute to tumor formation and progression. Although there exist recent approaches such as a graph-based framework that integrates multi-omics data including copy number alteration, methylation, gene expression, and miRNA data for cancer clinical outcome prediction, most previous methods treat each type of genomic data as independent, and the possible interplay between them is not explicitly incorporated into the model. However, cancer involves dysregulation at multiple levels of the biological system: genomic, epigenomic, transcriptomic, and proteomic. Thus, genomic features are likely to interact with other genomic features at different genomic levels. In order to deepen our knowledge, it would be desirable to incorporate such inter-relationship information when integrating multi-omics data for cancer clinical outcome prediction. In this study, we propose a new graph-based framework that integrates not only multi-omics data but also the inter-relationship between them for better elucidating cancer clinical outcomes. In order to highlight the validity of the proposed framework, serous cystadenocarcinoma data from TCGA was adopted as a pilot task. The proposed model incorporating the inter-relationship between different genomic features showed significantly improved performance compared to the model that does not consider the inter-relationship when integrating multi-omics data. For example, for the pair of miRNA and gene expression data, the model integrating miRNA, gene expression, and the inter-relationship between them, with an AUC of 0.8476 (REI), outperformed the model combining miRNA and gene expression data, with an AUC of 0.8404. Similar results were also obtained for other pairs between different levels of genomic data.
Integration of different levels of data and inter-relationship between them can aid in extracting new biological knowledge by drawing an integrative conclusion from many pieces of information collected from diverse types of genomic data, eventually leading to more effective screening strategies and alternative therapies that may improve outcomes.


Subject(s)
Cystadenocarcinoma/genetics , Genomics/methods , Ovarian Neoplasms/genetics , Cystadenocarcinoma/diagnosis , Cystadenocarcinoma/therapy , Female , Gene Expression Profiling , Humans , Ovarian Neoplasms/diagnosis , Ovarian Neoplasms/therapy , Precision Medicine , Prognosis , Treatment Outcome
9.
PLoS One ; 19(5): e0300530, 2024.
Article in English | MEDLINE | ID: mdl-38709721

ABSTRACT

BACKGROUND: Despite several years of recent efforts to make sense of and detect online hate speech, we still know relatively little about how hateful expressions enter online platforms and whether there are patterns and features characterizing the corpus of hateful speech. OBJECTIVE: In this research, we introduce a new conceptual framework suitable for better capturing the overall scope and dynamics of the current forms of online hateful speech. METHODS: We adopt several Python-based crawlers to collect a comprehensive data set covering a variety of subjects from a multiplicity of online communities in South Korea. We apply the notions of marginalization and polarization in identifying patterns and dynamics of online hateful speech. RESULTS: Our analyses suggest that polarization driven by political orientation and age difference predominates in the hateful speech in most communities, while marginalization of social minority groups is also salient in other communities. Furthermore, we identify a temporal shift in the trends of online hate from gender-based to age-based, reflecting the changing sociopolitical conditions within the polarization dynamics in South Korea. CONCLUSION: By expanding our understanding of how hatred shifts and evolves in online communities, our study provides theoretical and practical implications for both researchers and policy-makers.


Subject(s)
Internet , Republic of Korea , Humans , Male , Female , Adult , Politics , Young Adult , Middle Aged
10.
Article in English | MEDLINE | ID: mdl-38768003

ABSTRACT

BACKGROUND: Intraoperative hypotension can lead to postoperative organ dysfunction. Previous studies primarily used invasive arterial pressure as the key biosignal for the detection of hypotension. However, these studies had limitations in incorporating different biosignal modalities and utilizing the periodic nature of biosignals. To address these limitations, we utilized frequency-domain information, which provides key insights that time-domain analysis cannot provide, as revealed by recent advances in deep learning. With the frequency-domain information, we propose a deep-learning approach that integrates multiple biosignal modalities. METHODS: We used the discrete Fourier transform technique to extract frequency information from biosignal data, which we then combined with the original time-domain data as input for our deep learning model. To improve the interpretability of our results, we incorporated recent interpretable modules for deep-learning models into our analysis. RESULTS: We constructed 75,994 segments from the data of 3,226 patients to predict hypotension during surgery. Our proposed frequency-domain deep-learning model outperformed conventional approaches that rely solely on time-domain information. Notably, our model achieved a greater increase in AUROC performance than the time-domain deep learning models when trained on non-invasive biosignal data only (AUROC 0.898 [95% CI: 0.885-0.91] vs. 0.853 [95% CI: 0.839-0.867]). Further analysis revealed that the 1.5-3.0 Hz frequency band played an important role in predicting hypotension events. CONCLUSION: Utilizing the frequency domain not only demonstrated high performance on invasive data but also showed significant performance improvement when applied to non-invasive data alone. Our proposed framework offers clinicians a novel perspective for predicting intraoperative hypotension.
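
The frequency-domain step described in the abstract above can be sketched with a plain discrete Fourier transform. This is only an illustration: the sampling rate, segment length, synthetic signal, and the simple concatenation of time-domain and frequency-domain values are assumptions, not details taken from the study.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of the one-sided DFT (bins 0 .. n//2) of a real-valued signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

def band_energy_fraction(signal, fs, low_hz, high_hz):
    """Share of non-DC spectral energy that falls inside [low_hz, high_hz]."""
    n = len(signal)
    power = [m * m for m in dft_magnitudes(signal)]
    total = sum(power[1:])  # skip the DC bin
    band = sum(p for k, p in enumerate(power) if k >= 1 and low_hz <= k * fs / n <= high_hz)
    return band / total if total > 0 else 0.0

# Synthetic 2-second segment sampled at 50 Hz (assumed values): 2 Hz and 8 Hz components.
fs = 50
segment = [math.sin(2 * math.pi * 2 * t / fs) + 0.5 * math.sin(2 * math.pi * 8 * t / fs)
           for t in range(2 * fs)]
fraction = band_energy_fraction(segment, fs, 1.5, 3.0)

# One simple way to feed both views to a model: concatenate them (an assumption).
combined_input = segment + dft_magnitudes(segment)
print(fraction, len(combined_input))
```

In practice a fast Fourier transform routine would replace the quadratic-time loop used here for clarity.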

11.
Bioengineering (Basel) ; 10(7)2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37508851

ABSTRACT

Feature selection methods are essential for accurate disease classification and identifying informative biomarkers. While information-theoretic methods have been widely used, they often exhibit limitations such as high computational costs. Our previously proposed method, ClearF, addresses these issues by using reconstruction error from low-dimensional embeddings as a proxy for the entropy term in the mutual information. However, ClearF still has limitations, including a nontransparent bottleneck layer selection process, which can result in unstable feature selection. To address these limitations, we propose ClearF++, which simplifies the bottleneck layer selection and incorporates feature-wise clustering to enhance biomarker detection. We compare its performance with other commonly used methods such as MultiSURF and IFS, as well as ClearF, across multiple benchmark datasets. Our results demonstrate that ClearF++ consistently outperforms these methods in terms of prediction accuracy and stability, even with limited samples. We also observe that employing the Deep Embedded Clustering (DEC) algorithm for feature-wise clustering improves performance, indicating its suitability for handling complex data structures with limited samples. ClearF++ offers an improved biomarker prioritization approach with enhanced prediction performance and faster execution. Its stability and effectiveness with limited samples make it particularly valuable for biomedical data analysis.

12.
BMC Genomics ; 13 Suppl 7: S17, 2012.
Article in English | MEDLINE | ID: mdl-23281707

ABSTRACT

BACKGROUND: Functional annotations are available only for a very small fraction of microRNAs (miRNAs) and very few miRNA target genes are experimentally validated. Therefore, functional analysis of miRNA clusters has typically relied on computational target gene prediction followed by Gene Ontology and/or pathway analysis. These previous methods share the limitation that they do not consider the many-to-many-to-many tri-partite network topology between miRNAs, target genes, and functional annotations. Moreover, the highly false-positive nature of sequence-based target prediction algorithms causes propagation of annotation errors throughout the tri-partite network. RESULTS: A new conceptual framework is proposed for functional analysis of miRNA clusters, which extends the conventional target gene-centric approaches to a more generalized tri-partite space. Under this framework, we construct miRNA-, target link-, and target gene-centric computational measures incorporating the whole tri-partite network topology. Each of these methods and all their possible combinations are evaluated on publicly available miRNA clusters and with a wide range of variations for miRNA-target gene relations. We find that the miRNA-centric measures outperform others in terms of the average specificity and functional homogeneity of the GO terms significantly enriched for each miRNA cluster. CONCLUSIONS: We propose novel miRNA-centric functional enrichment measures in a conceptual framework that connects the spaces of miRNAs, genes, and GO terms in a unified way. Our comprehensive evaluation result demonstrates that functional enrichment analysis of co-expressed and differentially expressed miRNA clusters can substantially benefit from the proposed miRNA-centric approaches.


Subject(s)
Algorithms , MicroRNAs/metabolism , Cluster Analysis , Computational Biology , Databases, Factual , Humans , Multigene Family , RNA, Messenger/metabolism
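
For context, the conventional target gene-centric enrichment step that the abstract above sets out to extend is often computed with a hypergeometric test. The sketch below shows that baseline only, not the miRNA-centric measures proposed in the paper; the function name and the counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_background, n_annotated, n_targets, n_overlap):
    """P(X >= n_overlap) when n_targets genes are drawn from n_background genes,
    of which n_annotated carry the GO term of interest."""
    upper = min(n_annotated, n_targets)
    favorable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_targets - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_background, n_targets)

# Toy counts: 20 background genes, 5 annotated with the term,
# 4 predicted targets of a miRNA cluster, 2 of them annotated.
p_value = hypergeom_enrichment_pvalue(20, 5, 4, 2)
print(p_value)
```

A small p-value suggests the term is over-represented among the predicted targets, but false-positive target predictions inflate or dilute the overlap, which is the weakness the abstract points out.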
13.
J Clin Med ; 11(1)2021 Dec 24.
Article in English | MEDLINE | ID: mdl-35011837

ABSTRACT

Prurigo nodularis (PN) is a chronic dermatosis typified by extraordinarily itchy nodules. However, little is known of the nature and extent of PN in Asian people. This study aimed to describe the epidemiology, comorbidities, and prescription pattern of PN in Koreans based on a large dermatology outpatient cohort. Patients with PN were identified from the Catholic Medical Center (CMC) clinical data warehouse. Anonymized data on age, sex, diagnostic codes, prescriptions, visitation dates, and other relevant parameters were collected. Pearson correlation analysis was used to calculate the correlation between PN prevalence and patient age. Conditional logistic regression modeling was adopted to measure the comorbidity risk of PN. A total of 3591 patients with PN were identified at the Catholic Medical Center Health System dermatology outpatient clinic in the period 2007-2020. A comparison of the study patients with age- and sex-matched controls (dermatology outpatients without PN) indicated that PN was associated with various comorbidities including chronic kidney disease (adjusted odds ratio (aOR), 1.48; 95% confidence interval (CI), 1.29-1.70), dyslipidemia (aOR, 1.88; 95% CI, 1.56-2.27), type 2 diabetes mellitus (aOR, 1.37; 95% CI, 1.22-1.54), arterial hypertension (aOR, 1.50; 95% CI, 1.30-1.73), autoimmune thyroiditis (aOR, 2.43; 95% CI, 1.42-4.16), non-Hodgkin's lymphoma (aOR, 1.95; 95% CI, 1.23-3.07), and atopic dermatitis (aOR, 2.16, 95% CI, 1.91-2.45). Regarding prescription patterns, topical steroids were most favored, followed by topical calcineurin inhibitors; oral antihistamines were the most preferred systemic agent for PN. PN is a relatively rare but significant disease among Korean dermatology outpatients with a high comorbidity burden compared to dermatology outpatients without PN. There is great need for breakthroughs in PN treatment.

14.
Article in English | MEDLINE | ID: mdl-35299717

ABSTRACT

Alzheimer's disease (AD) is a progressive neurodegenerative brain disorder characterized by memory loss and cognitive decline. Early detection and accurate prognosis of AD is an important research topic, and numerous machine learning methods have been proposed to solve this problem. However, traditional machine learning models are facing challenges in effectively integrating longitudinal neuroimaging data and biologically meaningful structure and knowledge to build accurate and interpretable prognostic predictors. To bridge this gap, we propose an interpretable graph neural network (GNN) model for AD prognostic prediction based on longitudinal neuroimaging data while embracing the valuable knowledge of structural brain connectivity. In our empirical study, we demonstrate that 1) the proposed model outperforms several competing models (i.e., DNN, SVM) in terms of prognostic prediction accuracy, and 2) our model can capture neuroanatomical contribution to the prognostic predictor and yield biologically meaningful interpretation to facilitate better mechanistic understanding of the Alzheimer's disease. Source code is available at https://github.com/JaesikKim/temporal-GNN.

15.
Front Oncol ; 11: 790894, 2021.
Article in English | MEDLINE | ID: mdl-34912724

ABSTRACT

BACKGROUND: Preoperative chemoradiotherapy (CRT) is a standard treatment for locally advanced rectal cancer (LARC). However, individual responses to preoperative CRT vary from patient to patient. The aim of this study is to develop a scoring system for the response of preoperative CRT in LARC using blood features derived from machine learning. METHODS: Patients who underwent total mesorectal excision after preoperative CRT were included in this study. The performance of machine learning models using blood features before CRT (pre-CRT) and from 1 to 2 weeks after CRT (early-CRT) was evaluated. Based on the best model, important features were selected. The scoring system was developed from the selected model and features. The performance of the new scoring system was compared with those of systemic inflammatory indicators: neutrophil-to-lymphocyte ratio, platelet-to-lymphocyte ratio, lymphocyte-to-monocyte ratio, and the prognostic nutritional index. RESULTS: The models using early-CRT blood features had better performances than those using pre-CRT blood features. Based on the ridge regression model, which showed the best performance among the machine learning models (AUROC 0.6322 and AUPRC 0.5965), a novel scoring system for the response of preoperative CRT, named Response Prediction Score (RPS), was developed. The RPS system showed higher predictive power (AUROC 0.6747) than single blood features and systemic inflammatory indicators and stratified the tumor regression grade and overall downstaging clearly. CONCLUSION: We discovered that we can more accurately predict CRT response by using early-treatment blood data. With larger data, we can develop a more accurate and reliable indicator that can be used in real daily practices. In the future, we urge the collection of early-treatment blood data and pre-treatment blood data.

16.
Bioinformatics ; 25(12): i204-12, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19477989

ABSTRACT

MOTIVATION: Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. Although a causal genetic variation may influence a group of highly correlated traits jointly, most of the previous association analyses considered each phenotype separately, or combined results from a set of single-phenotype analyses. RESULTS: We propose a new statistical framework called graph-guided fused lasso to address this issue in a principled way. Our approach represents the dependency structure among the quantitative traits explicitly as a network, and leverages this trait network to encode structured regularizations in a multivariate regression model over the genotypes and traits, so that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most of the traditional methods examined each phenotype independently, our approach analyzes all of the traits jointly in a single statistical method to discover the genetic markers that perturb a subset of correlated traits jointly rather than a single trait. Using simulated datasets based on the HapMap consortium data and an asthma dataset, we compare the performance of our method with the single-marker analysis, and other sparse regression methods that do not use any structural information in the traits. Our results show that there is a significant advantage in detecting the true causal single nucleotide polymorphisms when we incorporate the correlation pattern in traits using our proposed methods. AVAILABILITY: Software for GFlasso is available at http://www.sailing.cs.cmu.edu/gflasso.html.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks/genetics , Genome-Wide Association Study , Quantitative Trait Loci , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide , Regression Analysis
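
The structured regularization described in the abstract above can be sketched as a penalty with two parts: a lasso term on all coefficients and a fusion term that pulls together the coefficients of traits linked in the trait network. The edge weighting by the absolute correlation used here is one variant and is an assumption, as are the function name and toy values; this is not the GFlasso software.

```python
def gflasso_penalty(B, edges, lam, gamma):
    """Penalty = lam * sum |B[j][k]| + gamma * sum over edges (m, l, r) of
    |r| * sum_j |B[j][m] - sign(r) * B[j][l]|.

    B[j][k] is the coefficient of genetic marker j on trait k;
    each edge (m, l, r) links traits m and l with correlation r.
    """
    lasso = sum(abs(b) for row in B for b in row)
    fusion = 0.0
    for m, l, r in edges:
        sign = 1.0 if r >= 0 else -1.0
        fusion += abs(r) * sum(abs(row[m] - sign * row[l]) for row in B)
    return lam * lasso + gamma * fusion

# Two markers, three traits; traits 0 and 1 are positively correlated (toy values).
edges = [(0, 1, 0.9)]
shared_effect = [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]     # marker 0 affects both traits alike
opposite_effect = [[1.0, -1.0, 0.0], [0.0, 0.0, 0.0]]  # same sparsity, conflicting signs
p_shared = gflasso_penalty(shared_effect, edges, lam=1.0, gamma=1.0)
p_opposite = gflasso_penalty(opposite_effect, edges, lam=1.0, gamma=1.0)
print(p_shared, p_opposite)
```

Both coefficient matrices are equally sparse, yet the one whose effects agree with the trait correlation is penalized less, which is how the trait network steers the regression toward markers that jointly influence correlated traits.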
17.
BMC Med Genomics ; 12(Suppl 5): 95, 2019 07 11.
Article in English | MEDLINE | ID: mdl-31296201

ABSTRACT

BACKGROUND: Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS: In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS: The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.


Subject(s)
Biomarkers/metabolism , Computational Biology/methods , Supervised Machine Learning , Benchmarking
18.
BMC Med Genomics ; 12(Suppl 5): 94, 2019 07 11.
Article in English | MEDLINE | ID: mdl-31296204

ABSTRACT

BACKGROUND: The analysis of integrated multi-omics data enables the identification of disease-related biomarkers that cannot be identified from a single omics profile. Although protein-level data reflects the cellular status of cancer tissue more directly than gene-level data, past studies have mainly focused on multi-omics integration using gene-level data as opposed to protein-level data. However, the use of protein-level data (such as mass spectrometry) in multi-omics integration has some limitations. For example, the correlation between the characteristics of gene-level data (such as mRNA) and protein-level data is weak, and it is difficult to detect low-abundance signaling proteins that are used to target cancer. The reverse phase protein array (RPPA) is a highly sensitive antibody-based quantification method for signaling proteins. However, the number of protein features in RPPA data is extremely low compared to the number of gene features in gene-level data. In this study, we present a new method for integrating RPPA profiles with RNA-Seq and DNA methylation profiles for survival prediction based on the integrative directed random walk (iDRW) framework proposed in our previous study. In the iDRW framework, each omics profile is merged into a single pathway profile that reflects the topological information of the pathway. In order to address the sparsity of RPPA profiles, we employ the random walk with restart (RWR) approach on the pathway network. RESULTS: Our model was validated using survival prediction analysis for a breast cancer dataset from The Cancer Genome Atlas. Our proposed model exhibited improved performance compared with other methods that utilize pathway information and also outperformed models that did not include the RPPA data utilized in our study. The risk pathways identified for breast cancer in this study were closely related to well-known breast cancer risk pathways. CONCLUSIONS: Our results indicated that RPPA data is useful for survival prediction for breast cancer patients under our framework. We also observed that iDRW effectively integrates RNA-Seq, DNA methylation, and RPPA profiles, while variation in the composition of the omics data can affect both prediction performance and risk pathway identification. These results suggest that omics data composition is a critical parameter for iDRW.
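The random walk with restart (RWR) mentioned in the abstract above can be illustrated with a minimal sketch. This is the generic textbook form of RWR on a small directed graph, not the authors' implementation: the function name, the dictionary-based graph format, the restart probability of 0.3, and the node names are all assumptions made for illustration. At each step, a node keeps a fraction of the seed weight and passes the rest of its current score evenly to the nodes it points to, so nodes with no measured signal still receive a score from their neighbors.

```python
def random_walk_with_restart(neighbors, seeds, restart_prob=0.3,
                             tol=1e-10, max_iter=1000):
    """Generic RWR sketch (hypothetical helper, not from the cited paper).

    neighbors: dict mapping each node to the list of nodes it points to;
               every node that appears in a list must also be a key.
    seeds:     dict mapping seed nodes to non-negative starting weights.
    Returns a dict mapping each node to its converged score.
    """
    nodes = list(neighbors)
    total = sum(seeds.values())
    # Normalize the seed weights into the restart distribution p0.
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: a fixed fraction of the mass returns to the seeds.
        nxt = {n: restart_prob * p0[n] for n in nodes}
        # Walk term: each node spreads the rest evenly over its out-edges.
        for n in nodes:
            out = neighbors[n]
            if not out:
                continue  # a node with no out-edges propagates nothing
            share = (1.0 - restart_prob) * p[n] / len(out)
            for m in out:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy chain A - B - C - D with all the starting signal on A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(graph, {"A": 1.0})
```

In this toy chain the score decreases with distance from the seed node, which is the smoothing behavior that makes RWR a plausible way to spread a sparse set of measurements over a larger network. How the cited study sets the seed weights, edge directions, and restart probability is not stated in the abstract.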


Subject(s)
Breast Neoplasms/metabolism , Protein Array Analysis , Proteomics , Breast Neoplasms/genetics , DNA Methylation , Humans , Survival Analysis
19.
Front Genet ; 10: 617, 2019.
Article in English | MEDLINE | ID: mdl-31316553

ABSTRACT

As large amounts of heterogeneous biomedical data become available, numerous methods for integrating such datasets have been developed to extract complementary knowledge from multiple domains of sources. Recently, deep learning approaches have shown promising results in a variety of research areas. However, applying a deep learning approach requires expertise in constructing a deep architecture that can take multimodal longitudinal data. Thus, in this paper, a deep learning-based Python package for data integration is developed. The Python package, a deep learning-based multimodal longitudinal data integration framework (MildInt), provides a preconstructed deep learning architecture for a classification task. MildInt contains two learning phases: learning feature representation from each modality of data and training a classifier for the final decision. Adopting a deep architecture in the first phase leads to learning more task-relevant feature representations than a linear model. In the second phase, a linear regression classifier is used for detecting and investigating biomarkers from multimodal data. Thus, by combining the linear model and the deep learning model, higher accuracy and better interpretability can be achieved. We validated the performance of our package using simulation data and real data. For the real data, as a pilot study, we used clinical and multimodal neuroimaging datasets in Alzheimer's disease to predict the disease progression. MildInt is capable of integrating multiple forms of numerical data, including time series and non-time series data, for extracting complementary features from the multimodal dataset.

20.
Biol Direct ; 14(1): 8, 2019 04 29.
Article in English | MEDLINE | ID: mdl-31036036

ABSTRACT

BACKGROUND: Integrating the rich information from multi-omics data has been a popular approach to survival prediction and biomarker identification for several cancer studies. To facilitate the integrative analysis of multiple genomic profiles, several studies have suggested utilizing pathway information rather than using individual genomic profiles. METHODS: We have recently proposed an integrative directed random walk-based method utilizing pathway information (iDRW) for more robust and effective genomic feature extraction. In this study, we applied iDRW to multiple genomic profiles for two different cancers, and designed a directed gene-gene graph which reflects the interaction between gene expression and copy number data. In the experiments, the performances of the iDRW method and four state-of-the-art pathway-based methods were compared using a survival prediction model which classifies samples into two survival groups. RESULTS: The results show that the integrative analysis guided by pathway information not only improves prediction performance, but also provides better biological insights into the top pathways and genes prioritized by the model in both the neuroblastoma and the breast cancer datasets. The pathways and genes selected by the iDRW method were shown to be related to the corresponding cancers. CONCLUSIONS: In this study, we demonstrated the effectiveness of a directed random walk-based multi-omics data integration method applied to gene expression and copy number data for both breast cancer and neuroblastoma datasets. We revamped a directed gene-gene graph considering the impact of copy number variation on gene expression and redefined the weight initialization and gene-scoring method. The benchmark comparison of iDRW with four pathway-based methods demonstrated that the iDRW method improved survival prediction performance and jointly identified cancer-related pathways and genes for two different cancer datasets. REVIEWERS: This article was reviewed by Helena Molina-Abril and Marta Hidalgo.


Subject(s)
Breast Neoplasms/epidemiology , DNA Copy Number Variations , Gene Expression Regulation, Neoplastic , Genome, Human , Neuroblastoma/epidemiology , Breast Neoplasms/genetics , Computational Biology/methods , Humans , Models, Genetic , Neuroblastoma/genetics , Survival Analysis