ABSTRACT
Most causal discovery tools assume the local causal Markov condition. However, the theoretical assumptions that underlie this condition are often not met in practice. This is especially marked in genomics, where measurement errors, averaging effects, and feedback loops significantly undermine its legitimacy. Furthermore, these causal discovery algorithms require very large samples, orders of magnitude above what is often available. In this paper, by relaxing the local causal Markov condition and using Reichenbach's common cause principle instead, we present a more flexible approach to causal discovery, the directed topological overlap matrix (DTOM). DTOM is robust to measurement errors, averaging effects, and feedback loops, and is significantly more sample efficient. We study the utility of DTOM for discovering causal relations in biological data using three real gene expression datasets. We first examine whether DTOM can help identify the Myostatin mutation in Piedmontese cattle by contrasting the muscle transcriptomes of Piedmontese and Wagyu crosses; the Myostatin mutation is the cause of the double muscling for which Piedmontese cattle are famous. We then consider a large-scale gene deletion study in yeast and show that DTOM allows us to identify the deleted gene in a sample knowing only the set of differentially expressed genes in that sample. Finally, we examine the progression of Alzheimer's disease (AD) through the lens of DTOM. The genes implicated by our DTOM analysis as having a causal role in the progression of AD were significantly enriched in cellular components that have been repeatedly implicated in the progression of AD.
Subject(s)
Genomics , Myostatin , Cattle , Animals , Myostatin/genetics , Mutation , Transcriptome
ABSTRACT
Multimodal fusion of different types of neural image data provides an irreplaceable opportunity to take advantage of complementary cross-modal information that may be only partially contained in a single modality. Deep neural networks can be especially useful for jointly analyzing multimodal data, because many studies have suggested that deep learning is very effective at revealing complex, non-linear relations buried in the data. However, most deep models, e.g., convolutional neural networks and their numerous extensions, can only operate on regular Euclidean data such as voxels in 3D MRI. Interrelated and hidden structures that lie beyond the grid neighbors, such as brain connectivity, may be overlooked. Moreover, how to effectively incorporate neuroscience knowledge into multimodal data fusion within a single deep framework is understudied. In this work, we developed a graph-based deep neural network to simultaneously model brain structure and function in Mild Cognitive Impairment (MCI): the topology of the graph is initialized using the structural network (from diffusion MRI) and iteratively updated by incorporating functional information (from functional MRI) to maximize the capability of differentiating MCI patients from elderly normal controls. This resulted in a new connectome, obtained by exploring "deep relations" between brain structure and function in MCI patients, which we named the Deep Brain Connectome. Although the Deep Brain Connectome is learned individually, it shows consistent patterns of alteration compared to the structural network at the group level. With the Deep Brain Connectome, our deep model achieves 92.7% classification accuracy on the ADNI dataset.
Subject(s)
Cognitive Dysfunction , Connectome , Aged , Brain/diagnostic imaging , Cognitive Dysfunction/diagnostic imaging , Humans , Magnetic Resonance Imaging , Neural Networks, Computer
ABSTRACT
Long non-coding RNA (lncRNA), microRNA, and messenger RNA regulate various biological processes through a variety of interaction mechanisms. Identifying the interactions and cross-talk between these heterogeneous RNA classes is essential in order to uncover the functional role of individual RNA transcripts, especially for unannotated and sparsely studied RNA sequences with no known interactions. Recently, sequence-based deep learning and network embedding methods have been gaining traction as high-performing and flexible approaches that can either predict RNA-RNA interactions from sequence or infer missing interactions from patterns that may exist in the network topology. However, most current methods have several limitations, e.g., the inability to perform inductive predictions, to distinguish the directionality of interactions, or to integrate various sequence, interaction, expression, and genomic annotation datasets. We propose a novel deep learning framework, rna2rna, which learns from RNA sequences to produce a low-dimensional embedding that preserves proximities in both the interaction topology and the functional affinity topology. In this embedding space, the two-part "source and target contexts" capture the receptive fields of each RNA transcript to encapsulate heterogeneous cross-talk interactions between lncRNAs and microRNAs. The proximity between RNAs in this embedding space also uncovers second-order relationships that allow for accurate inference of novel directed interactions or functional similarities between any two RNA sequences. In a prospective evaluation, our method exhibits superior performance compared to state-of-the-art approaches at predicting missing interactions from several RNA-RNA interaction databases. Additional results suggest that our proposed framework can capture a manifold for heterogeneous RNA sequences to discover novel functional annotations.
Subject(s)
MicroRNAs , RNA, Long Noncoding , Base Sequence , Computational Biology , Humans , MicroRNAs/genetics , Prospective Studies
ABSTRACT
Expression quantitative trait loci (eQTL) mapping studies identify genetic loci that regulate gene expression. eQTL mapping studies can capture gene regulatory interactions and provide insight into the genetic mechanism of biological systems. Recently, the integration of multi-omics data, such as single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), DNA methylation, and gene expression, plays an important role in elucidating complex biological systems, since biological systems involve a sequence of complex interactions between various biological processes. This chapter introduces multi-omics data that have been used in many eQTL studies and integrative methodologies that incorporate multi-omics data for eQTL studies. Furthermore, we describe a statistical approach that can detect nonlinear causal relationships between eQTLs, called eQTL epistasis, and its importance.
Subject(s)
Chromosome Mapping , Epistasis, Genetic , Gene Expression , Genome-Wide Association Study , Genomics , Quantitative Trait Loci , Algorithms , Computational Biology/methods , Genome-Wide Association Study/methods , Genomics/methods , Humans , Polymorphism, Genetic , Polymorphism, Single Nucleotide
ABSTRACT
Over the past few years, it has been established that a number of long intergenic non-coding RNAs (lincRNAs) are linked to a wide variety of human diseases. The disease associations of many other lincRNAs remain a puzzle. Validating such links between the two entities through biological experiments is expensive. However, large amounts of information about both are becoming available thanks to High Throughput Sequencing (HTS) platforms, Genome Wide Association Studies (GWAS), etc., thereby opening an opportunity for cutting-edge machine learning and data mining approaches. Nevertheless, only a few in silico lincRNA-disease association inference tools are available to date, and none of them utilizes side information about both entities. The recently developed Inductive Matrix Completion (IMC) technique provides a recommendation platform between two entities that takes their respective side information into account. However, the formulation of IMC cannot handle noise and outliers that may be present in the dataset, and data sparsity is another issue with the standard IMC method. Thus, a robust version of IMC is needed that can solve these two issues. As a remedy, in this paper we propose Robust Inductive Matrix Completion (RIMC), which uses an l2,1 norm loss function as well as l2,1 norm based regularization. We applied RIMC to the available association data between human lincRNAs and OMIM disease phenotypes as well as a diverse set of side information about the lincRNAs and the diseases. Our method performs better than the state-of-the-art methods in terms of precision@k and recall@k for top-k disease prioritization for the subject lincRNAs. We also demonstrate that RIMC is equally effective for querying novel lincRNAs, as well as for predicting the rank of a newly known disease for a set of well-characterized lincRNAs. Availability: All the supporting datasets are available at the publicly accessible URL http://biomecis.uta.edu/~ashis/res/RIMC/.
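The precision@k and recall@k metrics named in the abstract above can be illustrated with a minimal sketch. The function name `precision_recall_at_k` and the disease identifiers are hypothetical and do not come from the paper; the definitions used are the standard ones (hits in the top k divided by k, and hits in the top k divided by the number of known associations).

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision@k, recall@k) for one ranked candidate list."""
    top_k = ranked[:k]
    # Count how many of the k highest-ranked candidates are truly associated.
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: diseases ranked for one lincRNA, best candidate first.
ranked_diseases = ["D3", "D1", "D7", "D2", "D9"]
known_diseases = {"D1", "D2", "D4"}

p3, r3 = precision_recall_at_k(ranked_diseases, known_diseases, k=3)
p5, r5 = precision_recall_at_k(ranked_diseases, known_diseases, k=5)
print(p3, r3, p5, r5)
```

In a study of this kind the two values would typically be averaged over all query lincRNAs for each k; that averaging step is omitted here.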
Subject(s)
Computational Biology/methods , Genome-Wide Association Study , High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide , RNA, Long Noncoding , Algorithms , Area Under Curve , Data Mining , Databases, Factual , Humans , Machine Learning , Models, Statistical , Phenotype
ABSTRACT
BACKGROUND: The majority of cancer-related deaths are due to lung cancer, and there is a need for reliable diagnostic biomarkers to predict stages in non-small cell lung cancer cases. Recently, microRNAs were found to have potential as both biomarkers and therapeutic targets for lung cancer. However, the functions of some microRNAs are unknown, and their roles in cancer stage progression remain mostly undiscovered in this clinically and genetically heterogeneous disease. As evidence suggests that microRNA dysregulation is implicated in many diseases, it is essential to consider the changes in microRNA-target regulation across different lung cancer subtypes. RESULTS: We proposed a pipeline to identify microRNA synergistic modules with similar dysregulation patterns across multiple subtypes by constructing the MicroRNA Dysregulational Synergistic Network. From the network, we extracted microRNA modules and incorporated them as prior knowledge into the Sparse Group Lasso classifier. This leads to a more relevant selection of microRNA biomarkers, thereby improving cancer stage classification accuracy. We applied our method to the TCGA Lung Adenocarcinoma and Lung Squamous Cell Carcinoma datasets. In cross-validation tests, the area under the ROC curve for cancer stage prediction increased considerably when the learned microRNA dysregulation modules were incorporated. The modules extracted from differential analyses of multiple independent subtypes were found to agree closely with microRNA family annotations, and they can also be used to identify biomarkers shared between different subtypes. Among the top-ranked candidate microRNAs selected by the model, 87% were reported to be related to Lung Adenocarcinoma. Overall, the results demonstrate that clustering microRNAs by the dysregulation pattern between microRNAs and their targets leads to biomarkers with high precision and recall with respect to known differentially expressed disease-associated microRNAs.
CONCLUSIONS: The results indicated that our method improves microRNA biomarker selection by detecting similar microRNA dysregulational synergistic patterns across the multiple subtypes. Since microRNA-target dysregulations are implicated in many cancers, we believe this tool can have broad applications for discovery of novel microRNA biomarkers in heterogeneous cancer diseases.
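How microRNA modules can enter a Sparse Group Lasso classifier as groups is easiest to see in the penalty term. The sketch below evaluates one common parameterization of the sparse group lasso penalty, (1 - alpha) * lam * sum_g sqrt(|g|) * ||beta_g||_2 + alpha * lam * ||beta||_1. This parameterization, the function name, and the toy coefficients are assumptions made for illustration; the abstract does not state the exact form the authors used.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Evaluate a sparse group lasso penalty for coefficients `beta`.

    `groups` is a list of index lists, one per (hypothetical) microRNA module.
    """
    group_term = 0.0
    for idx in groups:
        # Euclidean norm of the coefficients in this group, scaled by sqrt(group size).
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        group_term += math.sqrt(len(idx)) * norm
    l1_term = sum(abs(b) for b in beta)
    return (1 - alpha) * lam * group_term + alpha * lam * l1_term


# Toy coefficients for five microRNAs split into three hypothetical modules.
beta = [3.0, 4.0, 0.0, 0.0, 1.0]
modules = [[0, 1], [2, 3], [4]]
penalty = sparse_group_lasso_penalty(beta, modules, lam=1.0, alpha=0.5)
print(penalty)
```

The group term is zero for a module whose coefficients are all zero, which is what lets the classifier discard whole modules, while the l1 term still encourages sparsity inside the modules that are kept.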
Subject(s)
Carcinoma, Non-Small-Cell Lung/genetics , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Lung Neoplasms/genetics , MicroRNAs/genetics , Adenocarcinoma of Lung/genetics , Adenocarcinoma of Lung/pathology , Biomarkers, Tumor/genetics , Carcinoma, Non-Small-Cell Lung/pathology , Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/pathology , Cohort Studies , Gene Expression Profiling , Humans , Lung Neoplasms/pathology , MicroRNAs/metabolism , Neoplasm Staging , ROC Curve , Sample Size
ABSTRACT
Most cancer-related deaths are due to lung cancer, and there is a need for reliable prognostic biomarkers to predict stages in lung adenocarcinoma cases. Recently, microRNAs have been found to have potential as both biomarkers and therapeutic targets for lung cancer. As evidence suggests that microRNA dysregulation is implicated in many cancer malignancies, it is important to consider the changes in miRNA-target associations among different lung cancer biological states. We proposed a novel clustering strategy to identify groups of miRNAs with similar dysregulated targets. We then incorporated the learned miRNA clusters as prior knowledge into a Sparse Group Lasso classifier to improve classification results, thereby leading to a more relevant selection of microRNA biomarkers. We applied the method to the TCGA Lung Adenocarcinoma dataset. In cross-validation tests, the AUC is 1.0, 0.71, 0.68, 0.64, and 0.90 for normal, Stage I, Stage II, Stage III, and Stage IV, respectively. Among the candidate miRNAs selected by the model, 87% are reported to be related to Lung Adenocarcinoma. Further results demonstrate that clustering miRNAs by considering the dysregulation between miRNAs and mRNA targets leads to biomarkers with higher precision and recall with respect to known lung adenocarcinoma miRNAs.
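A per-stage AUC such as the ones reported above is usually obtained one class at a time, treating that stage as positive and all other samples as negative. The sketch below computes it as the fraction of positive/negative pairs that the classifier orders correctly, with ties counted as one half. The function name, labels, and scores are hypothetical and are not taken from the paper.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class: P(score of a positive sample > score of a negative sample)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for the class "Stage I" on six samples.
labels = ["normal", "I", "I", "II", "normal", "II"]
stage_i_scores = [0.1, 0.9, 0.4, 0.5, 0.2, 0.3]
auc_stage_i = auc_one_vs_rest(stage_i_scores, labels, positive="I")
print(auc_stage_i)
```

The pairwise loop is quadratic in the number of samples, which is fine for an illustration; rank-based formulas give the same value more efficiently on large datasets.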
Subject(s)
Adenocarcinoma , Lung Neoplasms , Adenocarcinoma of Lung , Biomarkers, Tumor , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , MicroRNAs
ABSTRACT
Ionizing radiation (IR) damages deoxyribonucleic acid (DNA) in a broad range of ways, including base damage and double-strand breaks, and thereby induces relevant signaling pathways such as DNA repair, cell cycle control, and apoptosis. The goal of this paper is to study how exposure to low-dose radiation affects the human body by observing the signaling pathway associated with Ataxia Telangiectasia mutated (ATM), using Reverse-Phase Protein Array (RPPA) data and isogenic human Ataxia Telangiectasia (A-T) cells under different amounts and durations of IR exposure. To identify which proteins could be involved in a DNA damage-induced pathway, only proteins that interact strongly with each other under IR are selected, using the correlation coefficient. Pathway inference is performed by learning Bayesian networks in combination with prior knowledge such as Protein-Protein Interactions (PPIs) and signaling pathways from well-known databases. Bayesian network learning is based on a score-and-search scheme that returns the highest-scoring network structure for a given score function, and the prior knowledge is included in the score function as a prior probability using Dempster-Shafer theory (DST). In this way, the inferred network is more likely to resemble already discovered pathways and to be consistent with confirmed PPIs, giving more reliable inference. The experimental results show which proteins are involved in signaling pathways under IR, how the inferred pathways differ under low and high doses of IR, and how the selected proteins regulate each other in the inferred pathways. As our main contribution, the overall results confirm that low-dose IR could cause DNA damage and thereby induce and affect related signaling pathways such as apoptosis, cell cycle, and DNA repair.
Subject(s)
Bayes Theorem , DNA Damage/physiology , DNA Damage/radiation effects , Protein Array Analysis/methods , Signal Transduction/radiation effects , Ataxia Telangiectasia Mutated Proteins/metabolism , Checkpoint Kinase 1/metabolism , Checkpoint Kinase 2/metabolism , Humans , PTEN Phosphohydrolase/metabolism , Radiation Dosage , Radiation, Ionizing , Tumor Suppressor Protein p53/metabolism
ABSTRACT
Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylation. Recent rapid advances in high-throughput omics technologies make it possible to measure multiple types of omics data, called 'multi-omics data', that represent these biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expression, copy number variations and DNA methylation were considered as multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's capability to infer the integrative gene regulatory network. In these experiments, iGRN showed better performance in model representation and interpretation than other integrative methods for gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the resulting biological network of psychiatric disorders was analysed.
ABSTRACT
Detecting arrhythmia from ECG data is now feasible on mobile devices, but in this environment it is necessary to trade computational efficiency against accuracy. We propose an adaptive strategy for feature extraction that considers only normalized beat morphology features when running in a resource-constrained environment, but takes account of a wider range of ECG features in a high-performance environment. This process is augmented by a cascaded random forest classifier. Experiments on data from the MIT-BIH Arrhythmia Database showed classification accuracies from 96.59% to 98.51%, which are comparable to state-of-the-art methods.
Subject(s)
Arrhythmias, Cardiac/diagnosis , Electrocardiography/instrumentation , Mobile Applications , Signal Processing, Computer-Assisted/instrumentation , Algorithms , Electrocardiography/standards , Heart Rate , Humans
ABSTRACT
Human diseases involve a sequence of complex interactions between multiple biological processes. In particular, multiple genomic data such as Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), DNA Methylation (DM), and their interactions simultaneously play an important role in human diseases. However, despite the widely known complex multi-layer biological processes and increased availability of the heterogeneous genomic data, most research has considered only a single type of genomic data. Furthermore, recent integrative genomic studies for the multiple genomic data have also been facing difficulties due to the high-dimensionality and complexity, especially when considering their intra- and inter-block interactions. In this paper, we introduce a novel multi-block bipartite graph and its inference methods, MB2I and sMB2I, for the integrative genomic study. The proposed methods not only integrate multiple genomic data but also incorporate intra/inter-block interactions by using a multi-block bipartite graph. In addition, the methods can be used to predict quantitative traits (e.g., gene expression, survival time) from the multi-block genomic data. The performance was assessed by simulation experiments that implement practical situations. We also applied the method to the human brain data of psychiatric disorders. The experimental results were analyzed by maximum edge biclique and biclustering, and biological findings were discussed.
Subject(s)
Genomics/methods , Algorithms , Computer Simulation , DNA Copy Number Variations/genetics , DNA Methylation , Databases, Genetic , Humans , Mental Disorders/genetics , Polymorphism, Single Nucleotide/genetics
ABSTRACT
BACKGROUND: A large number of long intergenic non-coding RNAs (lincRNAs) are linked to a broad spectrum of human diseases. The disease associations of many other lincRNAs remain a puzzle. Validating such links between the two entities through biological experiments is expensive. However, a plethora of lincRNA data is now available thanks to High Throughput Sequencing (HTS) platforms, Genome Wide Association Studies (GWAS), etc., which opens an opportunity for cutting-edge machine learning and data mining approaches to extract meaningful relationships between lincRNAs and diseases. Nevertheless, only a few in silico lincRNA-disease association inference tools are available to date, and none of them utilizes side information about both entities simultaneously in a single framework. METHODS: The recently developed Inductive Matrix Completion (IMC) technique provides a recommendation platform between two entities that takes their respective side information into account. However, the formulation of IMC cannot handle noise and outliers that may be present in the datasets, and data sparsity is another issue with the standard IMC method. Thus, a robust version of IMC is needed that can solve these two issues. As a remedy, in this paper we propose Stable Robust Inductive Matrix Completion (SRIMC), which utilizes l2,1 norm based regularization to optimize the objective function with a unique 2-step stable solution approach. RESULTS: We applied SRIMC to the available association data between human lincRNAs and OMIM disease phenotypes as well as a diverse set of side information about the lincRNAs and the diseases. The method performs better than the state-of-the-art methods in terms of precision@k and recall@k for top-k disease prioritization for the subject lincRNAs. We also demonstrate that SRIMC is equally effective for querying novel lincRNAs, as well as for predicting the rank of a newly known disease for a set of well-characterized lincRNAs. CONCLUSIONS: With the experimental results and computational evaluation, we show that SRIMC is robust in handling datasets with noise and outliers as well as in dealing with novel lincRNAs and disease phenotypes.
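The l2,1 norm mentioned in the abstract above is often motivated by its behaviour on outlying rows. The sketch below computes it under the common convention of summing the Euclidean norms of the rows; some papers sum over columns instead, and the abstract does not say which convention SRIMC uses, so the row-wise choice, the function names, and the toy matrix are assumptions.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of `matrix`."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def squared_frobenius_norm(matrix):
    """Sum of squared entries, shown only for comparison."""
    return sum(v * v for row in matrix for v in row)


# Toy residual matrix whose last row plays the role of an outlier.
residual = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
l21 = l21_norm(residual)
fro2 = squared_frobenius_norm(residual)
print(l21, fro2)
```

The outlying row contributes 10 of the 15 units of the l2,1 norm but 100 of the 125 units of the squared Frobenius norm, which is the usual argument for why an l2,1 loss is less dominated by a few badly fitted rows.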
Subject(s)
Computational Biology/methods , Disease/genetics , RNA, Long Noncoding/genetics , Algorithms , Genome-Wide Association Study , Humans
ABSTRACT
BACKGROUND: Inferring gene regulatory networks is one of the most interesting research areas in systems biology. Many inference methods have been developed using a variety of computational models and approaches. However, two issues remain to be solved. First, depending on the structural or computational model of the inference method, the results tend to be inconsistent because the methods have innately different advantages and limitations. A combination of dissimilar approaches is therefore needed to overcome the limitations of standalone methods through complementary integration. Second, sparse linear regression penalized by a regularization parameter (lasso) and bootstrapping-based sparse linear regression have been suggested among state-of-the-art methods for network inference, but they are not effective for data with a small sample size, and a true regulator can be missed if the target gene is strongly affected by an indirect regulator with high correlation or by another true regulator. RESULTS: We present two novel network inference methods based on the integration of three different criteria: (i) a z-score to measure the variation of gene expression in knockout data, (ii) mutual information for the dependency between two genes, and (iii) linear regression-based feature selection. Based on these criteria, we propose a lasso-based random feature selection algorithm (LARF) that achieves better performance by overcoming the limitations of bootstrapping mentioned above. CONCLUSIONS: This work makes three main contributions. First, our z-score-based method for measuring gene expression variation in knockout data is more effective than similar criteria in related works. Second, we confirmed that true regulator selection can be effectively improved by LARF. Lastly, we verified that an integrative approach can clearly outperform a single method when two different methods are effectively combined. In the experiments, our methods were validated by outperforming the state-of-the-art methods on DREAM challenge data, and LARF was then applied to the inference of a gene regulatory network associated with psychiatric disorders.
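Criterion (i) above, a z-score on knockout data, can be sketched as follows. The abstract does not give the authors' exact definition, so this uses one common form as an assumption: for each target gene, the expression observed in each knockout experiment is standardized against the mean and standard deviation of that gene across all experiments. The function name and the toy expression matrix are hypothetical.

```python
import statistics


def knockout_z_scores(expr):
    """expr[i][j] is the expression of gene j when gene i is knocked out.

    Returns z[i][j], the deviation of gene j under knockout i in units of
    the standard deviation of gene j across all knockout experiments.
    """
    n_experiments = len(expr)
    n_genes = len(expr[0])
    z = [[0.0] * n_genes for _ in range(n_experiments)]
    for j in range(n_genes):
        column = [expr[i][j] for i in range(n_experiments)]
        mu = statistics.mean(column)
        sigma = statistics.pstdev(column)
        for i in range(n_experiments):
            # A gene that never changes carries no evidence, so its z-score is 0.
            z[i][j] = 0.0 if sigma == 0 else (expr[i][j] - mu) / sigma
    return z


# Toy data for three genes: row i is the steady state after knocking out gene i.
expression = [
    [0.0, 0.2, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
z = knockout_z_scores(expression)
print(z[0][1], z[2][2])
```

A large absolute value of z[i][j] for i different from j suggests that gene i regulates gene j; the diagonal entries only reflect each gene's own knockout and would normally be ignored when ranking candidate edges.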
Subject(s)
Algorithms , Gene Regulatory Networks , Mental Disorders/genetics , Computer Simulation , Gene Expression , Gene Knockout Techniques , Humans , Yeasts/genetics
ABSTRACT
RNA-seq, the next-generation sequencing platform, enables researchers to explore the transcriptome of organisms in depth, for example by identifying functional non-coding RNAs (ncRNAs) and quantifying their expression in tissues. The functions of ncRNAs are mostly related to their secondary structures. Clustering the corresponding read segments by their structural profiles is therefore essential, and this is the motivation behind this research. In this manuscript we propose PR2S2Clust (Patched RNA-seq Read Segments' Structure-oriented Clustering), an analysis platform that extracts features to prepare the secondary structure profiles of RNA-seq read segments. It provides a strategy for using these profiles to annotate the segments into ncRNA classes with several clustering strategies. While clustering the segments, the system considers seven pairwise structural distance metrics that take into account the short-read mappings onto each structure, which we term the "patched structure". In this regard, we show applications of both classical and ensemble clustering in partitional and hierarchical variants. Extensive real-world experiments on three publicly available RNA-seq datasets and a comparative analysis against four competing systems confirm the effectiveness and superiority of the proposed system. The source code and dataset of PR2S2Clust are available at http://biomecis.uta.edu/~ashis/res/PR2S2Clust-suppl/.
Subject(s)
Algorithms , RNA, Untranslated/chemistry , Sequence Analysis, RNA/methods , Cluster Analysis , Databases, Genetic , Humans , Nucleic Acid Conformation , RNA, Untranslated/genetics
ABSTRACT
BACKGROUND: Computational modeling and simulation play an important role in analyzing the behavior of complex biological systems in response to the implantation of biomedical devices. Quantitative computational modeling discloses the nature of foreign body responses. Such understanding will shed light on the cause of foreign body responses, which will lead to improved biomaterial design and reduced foreign body reactions. One of the major obstacles in computational modeling is building a mathematical model that represents the biological system and quantitatively defining the model parameters. RESULTS: In this paper, we considered quantitative interconnections and logical relationships among diverse proteins and cells that have been reported in biological experiments and the literature. Based on these established biological findings, we built a mathematical model while unveiling the key components that contribute to biomaterial-mediated inflammatory responses. For parameter estimation in the mathematical model, we proposed a global optimization algorithm called Discrete Selection Levenberg-Marquardt (DSLM). It is an extension of the Levenberg-Marquardt (LM) algorithm, which is a gradient-based local optimization algorithm. The proposed DSLM offers a new approach to selecting optimal parameters in the discrete space with fast computational convergence. CONCLUSIONS: The computational modeling not only provides critical clues for assessing current knowledge of fibrosis development but also enables the prediction of yet-to-be-observed biological phenomena.
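For readers unfamiliar with the Levenberg-Marquardt (LM) algorithm that DSLM extends, the following is a minimal sketch of plain LM, not of DSLM itself; the discrete selection step is not described in the abstract and is not reproduced here. The two-parameter exponential decay model, the starting point, and the damping schedule (multiply or divide by 10) are illustrative assumptions.

```python
import math


def cost(xs, ys, a, b):
    """Sum of squared residuals for the model y = a * exp(-b * x)."""
    try:
        return sum((y - a * math.exp(-b * x)) ** 2 for x, y in zip(xs, ys))
    except OverflowError:
        return float("inf")  # treat numerically wild trial points as rejected


def levenberg_marquardt(xs, ys, a, b, lam=1e-3, iterations=100):
    current = cost(xs, ys, a, b)
    for _ in range(iterations):
        # Accumulate J^T J (2x2) and J^T r (2x1), with residuals r = y - model.
        jtj_aa = jtj_ab = jtj_bb = jtr_a = jtr_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(-b * x)
            da, db = e, -a * x * e  # partial derivatives of the model
            r = y - a * e
            jtj_aa += da * da
            jtj_ab += da * db
            jtj_bb += db * db
            jtr_a += da * r
            jtr_b += db * r
        # Solve the damped normal equations (J^T J + lam * I) step = J^T r.
        m_aa, m_bb = jtj_aa + lam, jtj_bb + lam
        det = m_aa * m_bb - jtj_ab * jtj_ab
        step_a = (m_bb * jtr_a - jtj_ab * jtr_b) / det
        step_b = (m_aa * jtr_b - jtj_ab * jtr_a) / det
        trial = cost(xs, ys, a + step_a, b + step_b)
        if trial < current:
            # Accept the step and move toward Gauss-Newton behaviour.
            a, b, current = a + step_a, b + step_b, trial
            lam /= 10.0
        else:
            # Reject the step and move toward small gradient-descent steps.
            lam *= 10.0
    return a, b


# Noise-free synthetic data generated with a = 2.0 and b = 0.5.
xs = [0.5 * i for i in range(9)]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]
a_hat, b_hat = levenberg_marquardt(xs, ys, a=1.0, b=1.0)
print(a_hat, b_hat)
```

Because LM only follows the local shape of the cost surface, the result depends on the starting point, which is the limitation that a global scheme such as the DSLM described in the abstract is intended to address.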
Subject(s)
Algorithms , Biocompatible Materials/administration & dosage , Cell Movement , Computer Simulation , Foreign Bodies/immunology , Foreign-Body Reaction/immunology , Phagocytes/physiology , Animals , Foreign Bodies/metabolism , Implants, Experimental , Mice , Subcutaneous Tissue/immunology
ABSTRACT
With the latest developments in the Surface-Enhanced Raman Scattering (SERS) technique, quantitative analysis of Raman spectra has shown potential and a promising trend of development for in vivo molecular imaging. Partial Least Squares Regression (PLSR) is the state-of-the-art method, but it relies only on training samples, which makes it difficult to incorporate complex domain knowledge. Based on probabilistic Principal Component Analysis (PCA) and the idea of probabilistic curve fitting, we propose a probabilistic PLSR (PPLSR) model and an Expectation-Maximization (EM) algorithm for estimating its parameters. This model explains PLSR from a probabilistic viewpoint, describes its essential meaning, and provides a foundation for developing future Bayesian nonparametric models. Two real Raman spectra datasets were used to evaluate this model, and the experimental results show its effectiveness.
Subject(s)
Algorithms , Complex Mixtures/analysis , Complex Mixtures/chemistry , Models, Statistical , Regression Analysis , Spectrum Analysis, Raman/methods , Computer Simulation , Data Interpretation, Statistical , Least-Squares Analysis , Pattern Recognition, Automated/methods , Reproducibility of Results , Sensitivity and Specificity
Subject(s)
Biomedical Engineering , Medical Informatics , Alcoholism , Humans , Interdisciplinary Studies
ABSTRACT
Human diseases are abnormal medical conditions in which multiple biological components are involved in complicated ways. Nevertheless, most research contributions have been made with a single type of genetic data, such as Single Nucleotide Polymorphism (SNP) or Copy Number Variation (CNV). Furthermore, epigenetic modifications and transcriptional regulation have to be considered, as well as genomic variants, to fully exploit knowledge of complex human diseases. We call the collection of these multiple heterogeneous data "multiblock data." In this paper, we propose a novel Multiblock Discriminant Analysis (MultiDA) method that provides a new integrative genomic model for multiblock analysis and an efficient algorithm for discriminant analysis. The integrative genomic model is built from representative genomic data including SNP, CNV, DNA methylation, and gene expression. The efficient algorithm for discriminant analysis identifies discriminative factors in the multiblock data. Discriminant analysis is essential for discovering biomarkers in computational biology. The performance of the proposed MultiDA was assessed by intensive simulation experiments, in which it showed outstanding performance compared with related methods. As a target application, we applied MultiDA to human brain data of psychiatric disorders. The findings and the gene regulatory network derived from the experiment are discussed.
Subject(s)
Algorithms , DNA Methylation , Epigenesis, Genetic , Genomics/methods , Models, Genetic , Polymorphism, Single Nucleotide , Transcription, Genetic , Humans
ABSTRACT
MOTIVATION: Epistasis refers to interactions among multiple genetic variants. It has emerged as an explanation for the 'missing heritability' that the marginal genetic effects found by genome-wide association studies do not account for, and as a way to understand the hierarchical relationships between genes in genetic pathways. Fisher's geometric model is commonly used to detect epistatic effects. However, despite the substantial successes of many studies with the model, it often fails to discover the functional dependence between genes in an epistasis study, which plays an important role in inferring the hierarchical relationships of genes in a biological pathway. RESULTS: We demonstrate the shortcomings of Fisher's model in a simulation study and in its application to biological data. We then propose a novel generic epistasis model that provides a flexible solution for the various putative biological epistatic models encountered in practice. The proposed method enables one to efficiently characterize the functional dependence between genes. Moreover, we suggest a statistical strategy for determining a recessive or dominant link among epistatic expression quantitative trait loci, which makes it possible to infer the hierarchical relationships. The proposed method is assessed by simulation experiments under various settings and is applied to human brain data regarding schizophrenia. AVAILABILITY AND IMPLEMENTATION: The MATLAB source codes are publicly available at: http://biomecis.uta.edu/epistasis.