ABSTRACT
BACKGROUND: Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, no single approach is free of limitations. Previously, we developed a genomics-based approach that integrated microarray technology and regression modeling to predict the bioavailability (tissue residue) of explosive compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation. RESULTS: This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R² = 0.71-0.82 was achievable at relatively low model complexity, with as few as 3-10 predictor genes per model, a marked improvement over our previous results. CONCLUSION: This study demonstrates that our approach is a promising way to measure bioavailability, warranting further studies of mixed-contamination scenarios in field settings.
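As an illustration of regularized regression with built-in variable selection of the kind described above, the following minimal sketch fits a cross-validated LASSO model that predicts a tissue-residue-like outcome from an expression matrix and reports how many predictor genes survive. The synthetic data and the choice of LASSO (rather than the 18 specific models compared in the study) are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (not the authors' pipeline): LASSO regression with built-in
# variable selection, predicting a tissue-residue-like response from expression.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_worms, n_genes = 60, 500                      # hypothetical dimensions
X = rng.normal(size=(n_worms, n_genes))         # stand-in expression matrix
true_genes = rng.choice(n_genes, 5, replace=False)
y = X[:, true_genes] @ rng.normal(size=5) + rng.normal(scale=0.3, size=n_worms)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LassoCV(cv=5).fit(X_tr, y_tr)           # cross-validated sparsity level

selected = np.flatnonzero(model.coef_)          # predictor genes kept by the model
print(f"selected {selected.size} genes, test R^2 = {model.score(X_te, y_te):.2f}")
```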
Subject(s)
Explosive Agents/pharmacokinetics, Gene Expression Profiling/methods, Oligochaeta/genetics, Soil Pollutants/pharmacokinetics, Animals, Azocines/pharmacokinetics, Biological Availability, Oligochaeta/metabolism, Oligonucleotide Array Sequence Analysis, Regression Analysis, Triazines/pharmacokinetics, Trinitrotoluene/pharmacokinetics
ABSTRACT
Cancer is a disease characterized largely by the accumulation of uncontrolled somatic mutations during the lifetime of a patient. Distinguishing driver mutations from passenger mutations remains a challenge in modern cancer research. With the rapid growth of microarray experiments and clinical studies, a large number of candidate cancer genes have been identified, and singling out the informative ones among them is essential. Here we propose a framework for finding informative cancer genes using mutation data from ovarian cancers. In our model we combine patient gene mutation profiles, gene expression data and a gene-gene interaction network to construct a graphical representation of genes and patients. Separate Markov processes are run over the genes and the patients, and cancer genes are then prioritized automatically by their scores in the stationary distribution, i.e., the dominant eigenvector of the transition matrix. Extensive experiments demonstrate that integrating these heterogeneous sources of information is essential for finding important cancer genes.
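The abstract ranks genes by their scores in the stationary distribution of a Markov process over a gene-patient graph. A minimal sketch of that general idea follows: a damped random walk on a toy gene adjacency matrix, ranked by its stationary distribution. The toy matrix and the restart probability are assumptions, not the authors' construction from mutation, expression and interaction data.

```python
# Minimal sketch of stationary-distribution ranking on a toy gene graph
# (adjacency matrix and damping factor are illustrative assumptions).
import numpy as np

A = np.array([[0, 1, 1, 0],        # toy gene-gene adjacency (4 genes)
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
alpha = 0.85                            # damping / restart, PageRank-style
n = A.shape[0]
pi = np.full(n, 1.0 / n)

for _ in range(200):                    # power iteration to the stationary distribution
    pi_new = alpha * pi @ P + (1 - alpha) / n
    if np.abs(pi_new - pi).sum() < 1e-12:
        break
    pi = pi_new

ranking = np.argsort(-pi)               # genes ordered by stationary score
print("gene ranking:", ranking, "scores:", np.round(pi, 3))
```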
Subject(s)
Computational Biology/methods, Ovarian Neoplasms/genetics, Transcriptome, Female, Gene Regulatory Networks, Humans, Markov Chains, Mutation
ABSTRACT
BACKGROUND: In drug discovery and development, it is crucial to determine which conformers (instances) of a given molecule are responsible for its observed biological activity and, at the same time, to recognize the most representative subset of features (molecular descriptors). Because it is experimentally difficult to obtain bioactive conformers, computational approaches such as machine learning are much needed. Multiple Instance Learning (MIL) is a machine learning method capable of tackling this type of problem. In the MIL framework, each instance is represented as a feature vector, which usually resides in a high-dimensional feature space. The high dimensionality may provide significant information for learning tasks, but it may also include a large number of irrelevant or redundant features that can degrade learning performance. Reducing the dimensionality of the data therefore facilitates the classification task and improves the interpretability of the model. RESULTS: In this work we propose a novel approach, multiple instance learning via joint instance and feature selection. The iterative joint instance and feature selection is achieved using an instance-based feature mapping and 1-norm regularized optimization. The proposed approach was tested on four biological activity datasets. CONCLUSIONS: The empirical results demonstrate that the selected instances (prototype conformers) and features (pharmacophore fingerprints) have competitive discriminative power, and the selection process converges quickly.
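To make the instance-based feature mapping and 1-norm regularization concrete, the sketch below embeds each bag as a vector of minimum distances to candidate prototype instances and prunes that embedding with an L1-penalized logistic regression. The toy bags, the min-distance embedding, and the use of logistic regression in place of the paper's exact optimization are all assumptions.

```python
# Minimal sketch of MIL via instance-based embedding + 1-norm regularization
# (illustrative simplification, not the authors' exact formulation).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# toy data: 40 bags, each with 3-8 instances of 10 features
bags = [rng.normal(size=(rng.integers(3, 9), 10)) for _ in range(40)]
labels = rng.integers(0, 2, size=40)

prototypes = np.vstack(bags)                     # all instances as candidate prototypes

def embed(bag, prototypes):
    # bag -> vector of minimum Euclidean distances to each prototype instance
    d = np.linalg.norm(bag[:, None, :] - prototypes[None, :, :], axis=2)
    return d.min(axis=0)

X = np.array([embed(b, prototypes) for b in bags])
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, labels)

kept = np.flatnonzero(clf.coef_[0])              # surviving prototype dimensions
print(f"{kept.size} of {prototypes.shape[0]} embedding dimensions kept")
```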
Subject(s)
Drug Discovery, Algorithms, Artificial Intelligence, Humans, Three-Dimensional Imaging, Ligands, Molecular Models, Molecular Conformation
ABSTRACT
BACKGROUND: In drug discovery and development, much effort has been devoted to determining which conformers of a given molecule are responsible for its observed biological activity. In this work we aimed to predict bioactive conformers using a variant of supervised learning called multiple-instance learning. A single molecule, treated as a bag of conformers, is biologically active if and only if at least one of its conformers, treated as an instance, is responsible for the observed bioactivity; a molecule is inactive if none of its conformers is responsible for the observed bioactivity. The implementation requires instance-based embedding and joint feature selection and classification. The goal of the present project was to apply multiple-instance learning to drug activity prediction and subsequently to identify the bioactive conformers of each molecule. METHODS: We encoded the 3-dimensional structures using pharmacophore fingerprints, which are binary strings, and accomplished instance-based embedding using calculated dissimilarity distances. Four dissimilarity measures were employed and their performances compared. A 1-norm SVM was used for joint feature selection and classification. The approach was applied to four data sets, and the best model for each data set was chosen as the one based on the dissimilarity measure yielding the smallest number of selected features. RESULTS: The predictive ability of the proposed approach was compared with that of three classical predictive models without instance-based embedding. Based on external validation, the proposed approach produced the best predictive models for one data set and the second-best models for the remaining data sets. To validate its ability to find bioactive conformers, 12 small molecules with co-crystallized structures were seeded into one data set; 10 of the 12 co-crystallized structures were indeed identified as significant conformers by the proposed approach. CONCLUSIONS: The proposed approach did not suffer from overfitting and was highly competitive with classical predictive models, making it a powerful tool for drug activity prediction. It was also validated as a useful method for pursuing bioactive conformers.
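As a small illustration of dissimilarity-based embedding of binary pharmacophore fingerprints, the sketch below uses the Tanimoto (Jaccard) dissimilarity, one common choice for bit strings; the random fingerprints, the reference set, and the specific measure are assumptions, since the study compared four different dissimilarity measures.

```python
# Minimal sketch of a dissimilarity-based embedding of binary fingerprints
# (random bit strings and the Tanimoto/Jaccard measure are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(2)
fingerprints = rng.integers(0, 2, size=(6, 32))   # 6 conformers, 32-bit fingerprints
references = rng.integers(0, 2, size=(4, 32))     # 4 reference conformers

def tanimoto_dissimilarity(a, b):
    both = np.sum((a == 1) & (b == 1))
    either = np.sum((a == 1) | (b == 1))
    return 1.0 - (both / either if either else 1.0)

# embed each conformer as its vector of dissimilarities to the references
embedding = np.array([[tanimoto_dissimilarity(f, r) for r in references]
                      for f in fingerprints])
print(embedding.round(2))
```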
Subject(s)
Artificial Intelligence, Computational Biology/methods, Drug Discovery, Theoretical Models, Molecular Conformation, Quantitative Structure-Activity Relationship
ABSTRACT
Treatment of pediatric acute lymphoblastic leukemia (ALL) is based on the concept of tailoring the intensity of therapy to a patient's risk of relapse. To determine whether gene expression profiling could enhance risk assignment, we used oligonucleotide microarrays to analyze the pattern of genes expressed in leukemic blasts from 360 pediatric ALL patients. Distinct expression profiles identified each of the prognostically important leukemia subtypes, including T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL rearrangement, and hyperdiploidy with >50 chromosomes. In addition, another ALL subgroup was identified on the basis of its unique expression profile. Examination of the genes comprising the expression signatures provided important insights into the biology of these leukemia subgroups. Moreover, within some genetic subgroups, expression profiles identified the patients who would eventually fail therapy. Thus, the single platform of expression profiling should enhance the accurate risk stratification of pediatric ALL patients.
Subject(s)
Gene Expression Profiling, Precursor Cell Lymphoblastic Leukemia-Lymphoma/diagnosis, Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics, Algorithms, Child, Computational Biology, Humans, Immunophenotyping, Acute Myeloid Leukemia/classification, Acute Myeloid Leukemia/diagnosis, Acute Myeloid Leukemia/genetics, Acute Myeloid Leukemia/pathology, Oligonucleotide Array Sequence Analysis/methods, Precursor Cell Lymphoblastic Leukemia-Lymphoma/classification, Precursor Cell Lymphoblastic Leukemia-Lymphoma/pathology, Prognosis, Recurrence, Risk Factors, Treatment Failure
ABSTRACT
A predominant source of complications in SARS-CoV-2 patients is severe systemic inflammation that can lead to tissue damage and organ failure. The high inflammatory burden of this viral infection often results in cardiovascular comorbidities. A better understanding of the interaction between immune pathways and cardiovascular proteins might inform medical decisions and therapeutic approaches. In this study we hypothesized that helper T-cell inflammatory pathways (Th1, Th2 and Th17) synergistically correlate with cardiometabolic proteins in the serum of COVID-19 patients. We found that Th1, Th2 and Th17 cytokines and chemokines can predict the expression of 186 cardiometabolic proteins profiled by Olink proteomics.
Subject(s)
COVID-19, Cardiovascular Diseases, Cardiovascular Diseases/metabolism, Humans, Proteomics, SARS-CoV-2, Th1 Cells/metabolism, Th17 Cells/metabolism, Th2 Cells/metabolism
ABSTRACT
BACKGROUND: It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task. RESULTS: We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all possible restructured problems, which is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric that ranks attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as the conditional entropy achieved by a set of optimal classifiers, each built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. The approach was tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem: the performance of the classifier built for the restructured problem consistently beats that of the original problem. CONCLUSIONS: The proposed conditional-entropy-based metric is effective in identifying good partitions of a classification problem, thereby enhancing prediction performance.
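To illustrate the flavor of ranking categorical attributes by how much their partitions reduce label uncertainty, the sketch below uses the plain conditional entropy H(label | attribute) as a drastically simplified proxy; the paper's metric is instead the conditional entropy achieved by optimal per-sub-problem classifiers, approximated with random projections. The toy attributes and labels are assumptions.

```python
# Minimal sketch: rank categorical attributes by plain conditional entropy
# H(label | attribute), a simplified stand-in for the classifier-based metric.
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(attribute, labels):
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    h = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        h += mask.mean() * entropy(labels[mask])   # weight by partition size
    return h

y = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
attrs = {
    "tissue": ["a", "a", "b", "b", "a", "b", "a", "b"],   # perfectly splits the label
    "batch":  ["x", "y", "x", "y", "x", "y", "x", "y"],   # uninformative split
}
for name in sorted(attrs, key=lambda k: conditional_entropy(attrs[k], y)):
    print(name, round(conditional_entropy(attrs[name], y), 3))
```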
Subject(s)
Artificial Intelligence, Biological Models, Algorithms, Entropy, Glycogen Synthase Kinase 3/antagonists & inhibitors, Glycogen Synthase Kinase 3 beta, Humans, Precursor Cell Lymphoblastic Leukemia-Lymphoma/drug therapy, Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics, Prognosis, Cannabinoid Receptor CB1/metabolism, Cannabinoid Receptor CB2/metabolism
ABSTRACT
Identifying statistical dependence between features and a label is a fundamental problem in supervised learning. This paper presents a framework for estimating dependence between numerical features and a categorical label using the generalized Gini distance, an energy distance in reproducing kernel Hilbert spaces (RKHS). Two Gini distance based dependence measures are explored: Gini distance covariance and Gini distance correlation. Unlike Pearson covariance and correlation, which do not characterize independence, these Gini distance based measures capture both dependence and independence of random variables. The test statistics are simple to calculate and do not require probability density estimation. Uniform convergence bounds and asymptotic bounds are derived for the test statistics, and comparisons with distance covariance statistics are provided. It is shown that Gini distance statistics converge faster than distance covariance statistics in the uniform convergence bounds, yielding tighter upper bounds on both Type I and Type II errors. Moreover, the probability that the Gini distance covariance statistic under-performs the distance covariance statistic in Type II error decreases to 0 exponentially as the sample size increases. Extensive experimental results are presented to demonstrate the performance of the proposed method.
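A minimal sketch of an empirical Gini distance covariance and correlation between numerical features and a categorical label follows, using the overall mean pairwise Euclidean distance minus the class-weighted within-class mean pairwise distances. This Euclidean form is an assumption for illustration; the paper works with kernel-induced distances in an RKHS, and the toy data are synthetic.

```python
# Minimal sketch of empirical Gini distance covariance / correlation with
# Euclidean distance (the RKHS version would swap in a kernel-induced distance).
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(Z):
    return pdist(Z).mean() if len(Z) > 1 else 0.0

def gini_distance_stats(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    delta = mean_pairwise_distance(X)                      # overall energy term
    within = 0.0
    for cls in np.unique(y):
        mask = y == cls
        within += mask.mean() * mean_pairwise_distance(X[mask])
    g_cov = delta - within                                 # Gini distance covariance
    return g_cov, g_cov / delta                            # and correlation

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 2)) + 2.0 * y[:, None]           # classes separated in mean
print(gini_distance_stats(X, y))
```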
ABSTRACT
BACKGROUND: Microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously in a single experiment. However, the large number of genes greatly increases the challenge of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is therefore unavoidable. Gene selection has been investigated extensively over the last decade, but most selection procedures are not sufficient for accurate inference of the underlying biology, because biological significance does not necessarily coincide with statistical significance. Additional biological knowledge needs to be integrated into the gene selection procedure. RESULTS: We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. For a given species and experimental condition, the edge weights of the graph are set to gene expression levels. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of genes and molecular functions can be ranked simultaneously by a real-valued measure, the KSD, which incorporates both the global and local structure of the graph. Over-expressed and under-expressed genes can also be ranked separately. CONCLUSION: The gene-function bigraph integrates molecular function annotations into gene expression data, and the relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.
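The sketch below computes a sample kernelized spatial depth for a query point relative to a data set, evaluating the norm of the average feature-space spatial sign entirely through kernel values. The Gaussian kernel, its bandwidth, and the use of plain vectors are illustrative assumptions; the paper applies KSD to a weighted gene-function bigraph, and its exact normalization may differ.

```python
# Minimal sketch of a sample kernelized spatial depth (KSD) with an RBF kernel
# (kernel choice and normalization are illustrative assumptions).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernelized_spatial_depth(x, X, gamma=0.5):
    x = np.atleast_2d(x)
    k_xx = rbf_kernel(x, x, gamma=gamma)[0, 0]
    k_xX = rbf_kernel(x, X, gamma=gamma)[0]              # k(x, x_i)
    K = rbf_kernel(X, X, gamma=gamma)                    # k(x_i, x_j)
    d = np.sqrt(np.maximum(k_xx - 2 * k_xX + np.diag(K), 0))
    keep = d > 1e-12                                     # drop points identical to x
    d, k_xX, K = d[keep], k_xX[keep], K[np.ix_(keep, keep)]
    # squared norm of the summed spatial signs in feature space, via kernels only
    num = k_xx - k_xX[:, None] - k_xX[None, :] + K
    sq_norm = (num / np.outer(d, d)).sum()
    return 1.0 - np.sqrt(max(sq_norm, 0.0)) / len(X)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
print(kernelized_spatial_depth(X.mean(axis=0), X))       # central point: higher depth
print(kernelized_spatial_depth(X.mean(axis=0) + 5, X))   # outlying point: lower depth
```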
Subject(s)
Computational Biology/methods, Gene Expression Profiling/methods, Algorithms, Oligonucleotide Array Sequence Analysis
ABSTRACT
Background: Breast cancer is intrinsically heterogeneous and is commonly classified into four main subtypes associated with distinct biological features and clinical outcomes. However, currently available data resources and methods for molecular subtyping focus largely on protein-coding genes, and little is known about the roles of long non-coding RNAs (lncRNAs), which are transcribed from the non-coding portion that makes up roughly 98% of the genome. lncRNAs may also play important roles in subgrouping cancer patients and are associated with clinical phenotypes. Methods: The purpose of this project was to identify lncRNA gene signatures associated with breast cancer subtypes and clinical outcomes. We identified lncRNA gene signatures associated with breast cancer subtypes from The Cancer Genome Atlas (TCGA) RNA-seq data using an optimized 1-norm SVM feature selection algorithm, and we evaluated the prognostic performance of these gene signatures with a semi-supervised principal component (superPC) method. Results: Although lncRNAs can independently predict breast cancer subtypes with satisfactory accuracy, a combined gene signature including both coding and non-coding genes gives the best clinically relevant prediction performance. We highlight eight potential biomarkers (three coding and five non-coding genes) that are significantly associated with survival outcomes. Conclusion: Our proposed methods offer a novel means of identifying subtype-specific coding and non-coding biomarkers that are both clinically relevant and biologically significant.
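A minimal sketch of 1-norm (L1-penalized) linear SVM feature selection of the kind described above is shown below using scikit-learn's LinearSVC; the synthetic "expression" matrix and subtype labels are illustrative assumptions rather than TCGA data, and the penalty strength is arbitrary.

```python
# Minimal sketch of 1-norm linear SVM feature selection (synthetic data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n_samples, n_genes = 120, 300
X = rng.normal(size=(n_samples, n_genes))             # stand-in for lncRNA expression
subtype = (X[:, :5].sum(axis=1) > 0).astype(int)      # labels driven by 5 genes

X = StandardScaler().fit_transform(X)
svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, subtype)

signature = np.flatnonzero(svm.coef_[0])              # genes with non-zero weights
print(f"signature size: {signature.size}", signature[:10])
```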
ABSTRACT
BACKGROUND: Mean-based clustering algorithms such as bisecting k-means generally lack robustness. Although the componentwise median is a more robust alternative, it can be a poor center representative for high-dimensional data. A new algorithm is needed that is robust and works well on high-dimensional data sets, e.g., gene expression data. RESULTS: Here we propose a new robust divisive clustering algorithm, bisecting k-spatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwise-median-based bisecting k-median algorithm for high-dimension, low-sample-size (HDLSS) data through applications to two real HDLSS gene expression data sets. When further applied to noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm. CONCLUSION: Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.
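As a small illustration of the spatial (geometric) median used as a robust cluster center, the sketch below computes it with Weiszfeld iterations on high-dimensional data containing a few gross outliers. The data, the starting point, and the stopping rule are assumptions; this is not the full bisecting k-spatialMedian algorithm or its Relative Average Depth rule.

```python
# Minimal sketch of the spatial (geometric) median via Weiszfeld iterations.
import numpy as np

def spatial_median(X, n_iter=200, tol=1e-8):
    m = X.mean(axis=0)                                  # start from the mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        d = np.where(d < 1e-12, 1e-12, d)               # guard against zero distances
        w = 1.0 / d
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(size=(50, 100)),              # high-dimensional cluster
               rng.normal(size=(3, 100)) + 20.0])       # a few gross outliers
print("norm of mean (pulled by outliers):", np.linalg.norm(X.mean(axis=0)))
print("norm of spatial median (stays near origin):", np.linalg.norm(spatial_median(X)))
```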
Subject(s)
Algorithms, Statistical Data Interpretation, Gene Expression Profiling/methods, Biological Models, Statistical Models, Oligonucleotide Array Sequence Analysis/methods, Computer Simulation
ABSTRACT
BACKGROUND: Recursive Feature Elimination (RFE) is a common and well-studied method for reducing the number of attributes used for further analysis or for building prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle to using it is the amount of computation required. RESULTS: Here we introduce a variant of RFE that borrows ideas from simulated annealing. The goal of the algorithm is to improve the computational performance of recursive feature elimination by eliminating chunks of features at a time, with as little effect on the quality of the reduced feature set as possible. The algorithm has been tested on several large gene expression data sets. The RFE algorithm is implemented using a Support Vector Machine to identify the least useful gene(s) to eliminate. CONCLUSION: The algorithm is simple and efficient, and it generates a set of attributes very similar to the set produced by standard RFE.
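To illustrate SVM-based recursive feature elimination that removes a chunk of the lowest-ranked features per iteration, the sketch below uses scikit-learn's RFE with a step larger than one; the annealing-style schedule of the paper's variant is not reproduced here, and the synthetic expression matrix is an assumption.

```python
# Minimal sketch of chunked SVM-RFE (fixed chunk size, synthetic data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 1000))                        # stand-in expression matrix
y = (X[:, :3].sum(axis=1) > 0).astype(int)              # labels driven by 3 genes

estimator = LinearSVC(dual=False, max_iter=5000)
rfe = RFE(estimator, n_features_to_select=10, step=100) # drop 100 features per round
rfe.fit(X, y)

print("selected feature indices:", np.flatnonzero(rfe.support_))
```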
Subject(s)
Computational Biology/methods, Oligonucleotide Array Sequence Analysis/methods, Algorithms, Computer Simulation
ABSTRACT
This paper describes several common normalization methods used in microarray analysis and compares their effects on microarray data; the importance of background subtraction is also addressed. The research has three parts. The first uses three statistical methods (the t-test, the Wilcoxon signed-rank test, and the sign test) to measure the difference between background-subtracted and non-background-subtracted data. The second part uses the same three tests to assess whether data normalized by different methods yield similar results. The third part examines whether these differently normalized data influence the result of gene selection (dimension reduction). The comparisons are carried out on several data sets to help identify similarity patterns. The conclusion of this study is that background subtraction can make a difference, especially for data sets of poorer quality. The choice of normalization method, for the most part, makes little difference, in the sense that the methods produce similarly normalized data. However, based on the third part of the analysis, we found that when gene selection is performed on these differently normalized data, somewhat different gene sets are obtained. Thus, the choice of normalization method is likely to have some effect on the final analysis.
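As a small illustration of the kind of paired comparison described above, the sketch below applies the Wilcoxon signed-rank test (one of the three tests mentioned) to background-subtracted versus non-background-subtracted log intensities of the same toy probes; the intensities, the background model, and the floor at 1.0 are assumptions.

```python
# Minimal sketch: paired Wilcoxon signed-rank test of background subtraction
# on toy probe intensities (all values here are illustrative assumptions).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(8)
raw = rng.lognormal(mean=8, sigma=1, size=500)          # toy probe intensities
background = rng.uniform(20, 80, size=500)              # toy local background

no_bg = np.log2(raw)                                    # no background subtraction
with_bg = np.log2(np.maximum(raw - background, 1.0))    # background-subtracted

stat, p = wilcoxon(no_bg, with_bg)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3g}")
```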
Subject(s)
Oligonucleotide Array Sequence Analysis, Genetic Selection
ABSTRACT
BACKGROUND: Many biology-related studies combine data from multiple sources in an effort to understand the underlying problems. It is important to extract and interpret the most important information from these sources, so an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation, while preserving prediction performance, would be valuable. METHODS: In this study, we focus on regression problems for biological data where the target outcomes are continuous. In general, models constructed by linear regression approaches are relatively easy to interpret. However, many practical biological applications are inherently nonlinear, and a direct linear relationship between input and output can rarely be found. Nonlinear regression techniques can reveal nonlinear relationships in data but are generally hard for humans to interpret. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from the generated random forests and eliminates unimportant features. RESULTS: We tested the approach on several biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forest regression. CONCLUSION: It shows high potential for aiding the prediction and interpretation of nonlinear relationships in the subject being studied.
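In the spirit of rule-based regression with 1-norm regularization, the sketch below turns leaf memberships of a small random forest into binary "rule" features and prunes them with a Lasso. This is a simplification of the paper's rule extraction rather than its actual algorithm, and the synthetic data are an assumption.

```python
# Minimal sketch: random-forest leaf indicators as rule features, pruned by Lasso.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 20))
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 0) + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0).fit(X, y)
leaves = forest.apply(X)                                   # (n_samples, n_trees) leaf ids
rules = OneHotEncoder().fit_transform(leaves).toarray()    # binary "rule" features

lasso = LassoCV(cv=5).fit(rules, y)
print(f"{np.count_nonzero(lasso.coef_)} of {rules.shape[1]} rules kept, "
      f"R^2 = {lasso.score(rules, y):.2f}")
```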
Subject(s)
Artificial Intelligence, Computational Biology/methods, Algorithms, Linear Models
ABSTRACT
Glycogen synthase kinase-3 (GSK-3) is a multifunctional serine/threonine protein kinase that regulates a wide range of cellular processes through various signalling pathways. GSK-3β has emerged as an important therapeutic target for diabetes and Alzheimer's disease. To identify structurally novel GSK-3β inhibitors, we performed virtual screening using a combined ligand-based/structure-based approach that included quantitative structure-activity relationship (QSAR) analysis and docking prediction. To integrate and analyze complex data sets from multiple experimental sources, we designed and validated a hierarchical QSAR method, which adopts a two-level structure to take data heterogeneity into account. A collection of 728 GSK-3 inhibitors with diverse structural scaffolds was compiled from published papers that used different experimental assay protocols. Support vector machines and random forests were implemented with wrapper-based feature selection algorithms to construct predictive learning models. The best models for each single group of compounds were then used to build the final hierarchical QSAR model, with an overall R² of 0.752 for the 141 compounds in the test set. The compounds obtained from the virtual screening experiment were tested for GSK-3β inhibition. The bioassay results confirmed that two hit compounds are indeed GSK-3β inhibitors exhibiting sub-micromolar inhibitory activity, thereby validating our combined ligand-based/structure-based approach as effective for virtual screening.
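The two-level idea behind a hierarchical QSAR model can be sketched roughly as one model per assay group followed by a top-level combiner over the group models' predictions. Everything below (the synthetic descriptors, the random-forest base learners, the linear combiner, and the in-sample evaluation) is an illustrative assumption, not the paper's exact hierarchical QSAR or its wrapper-based feature selection.

```python
# Minimal sketch of a two-level ("hierarchical") QSAR scheme on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(10)
n, d = 300, 50
X = rng.normal(size=(n, d))                                    # toy molecular descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=n)    # toy activity values
group = rng.integers(0, 3, size=n)                             # assay protocol per compound

# level 1: one model per assay group
level1 = {g: RandomForestRegressor(n_estimators=50, random_state=0)
             .fit(X[group == g], y[group == g]) for g in np.unique(group)}

# level 2: combine the group-level predictions
Z = np.column_stack([m.predict(X) for m in level1.values()])
level2 = LinearRegression().fit(Z, y)
print(f"overall R^2 = {level2.score(Z, y):.2f}")
```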
ABSTRACT
Incorporating diverse sources of biological information is important for biological discovery. Genes, for example, have a multi-view representation: they can be described by features such as sequence length and pairwise similarities, so the feature types range from numerical to categorical. We propose a large-margin Random Forest (RF) classification approach based on RF proximity kernels. Random Forests accommodate mixed data types naturally. The performance on four biological datasets is promising compared with other state-of-the-art methods, including Support Vector Machines (SVMs) and RF classifiers, demonstrating high potential for discovering the functional roles of biomolecules.
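A minimal sketch of the general idea of a Random Forest proximity matrix used as a precomputed kernel for a large-margin (SVM) classifier follows. The mixed-type toy data, the kernel construction from shared leaves, and the train/test handling are illustrative assumptions, not one of the four biological datasets or the authors' exact formulation.

```python
# Minimal sketch: RF proximity (fraction of trees where two samples share a leaf)
# used as a precomputed kernel for an SVM (synthetic mixed-type data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
num = rng.normal(size=(200, 5))                          # numerical features
cat = rng.integers(0, 3, size=(200, 2))                  # categorical features (integer-coded)
X = np.hstack([num, cat])
y = ((num[:, 0] > 0) ^ (cat[:, 0] == 1)).astype(int)

idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[idx_tr], y[idx_tr])

leaves = forest.apply(X)                                 # (n_samples, n_trees) leaf ids
# proximity[i, j] = fraction of trees in which samples i and j fall in the same leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

svm = SVC(kernel="precomputed").fit(proximity[np.ix_(idx_tr, idx_tr)], y[idx_tr])
acc = svm.score(proximity[np.ix_(idx_te, idx_tr)], y[idx_te])
print(f"test accuracy with RF-proximity kernel: {acc:.2f}")
```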