1.
Mol Omics ; 18(5): 408-416, 2022 Jun 13.
Article in English | MEDLINE | ID: mdl-35284913

ABSTRACT

A predominant source of complication in SARS-CoV-2 patients arises from a severe systemic inflammation that can lead to tissue damage and organ failure. The high inflammatory burden of this viral infection often results in cardiovascular comorbidities. A better understanding of the interaction between immune pathways and cardiovascular proteins might inform medical decisions and therapeutic approaches. In this study we hypothesized that helper T-cell inflammatory pathways (Th1, Th2 and Th17) synergistically correlate with cardiometabolic proteins in serum of COVID-19 patients. We found that Th1, Th2 and Th17 cytokines and chemokines are able to predict expression of 186 cardiometabolic proteins profiled by Olink proteomics.


Subject(s)
COVID-19 , Cardiovascular Diseases , Cardiovascular Diseases/metabolism , Humans , Proteomics , SARS-CoV-2 , Th1 Cells/metabolism , Th17 Cells/metabolism , Th2 Cells/metabolism
2.
IEEE Trans Pattern Anal Mach Intell ; 43(6): 1947-1963, 2021 Jun.
Article in English | MEDLINE | ID: mdl-31869782

ABSTRACT

Identifying statistical dependence between the features and the label is a fundamental problem in supervised learning. This paper presents a framework for estimating dependence between numerical features and a categorical label using generalized Gini distance, an energy distance in reproducing kernel Hilbert spaces (RKHS). Two Gini distance based dependence measures are explored: Gini distance covariance and Gini distance correlation. Unlike Pearson covariance and correlation, which do not characterize independence, the above Gini distance based measures define dependence as well as independence of random variables. The test statistics are simple to calculate and do not require probability density estimation. Uniform convergence bounds and asymptotic bounds are derived for the test statistics. Comparisons with distance covariance statistics are provided. It is shown that Gini distance statistics converge faster than distance covariance statistics in the uniform convergence bounds, hence tighter upper bounds on both Type I and Type II errors. Moreover, the probability of Gini distance covariance statistic under-performing the distance covariance statistic in Type II error decreases to 0 exponentially with the increase of the sample size. Extensive experimental results are presented to demonstrate the performance of the proposed method.
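
The sample statistics described in this abstract can be illustrated with a minimal sketch. It assumes the plain Euclidean form of the Gini distance covariance, gCov = Δ − Σ_k p_k Δ_k, where Δ is the mean pairwise distance over all samples and Δ_k is the same quantity within class k; the generalized RKHS version in the paper would replace the Euclidean distance with a kernel-induced one. The function names are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Average Euclidean distance over all distinct pairs (Gini mean difference)."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(math.dist(a, b) for a, b in combinations(points, 2))
    return total / (n * (n - 1) / 2)


def gini_distance_stats(X, y):
    """Return (gCov, gCor) for numerical feature vectors X and categorical labels y.

    Assumed form: gCov = Delta - sum_k p_k * Delta_k, gCor = gCov / Delta.
    """
    n = len(X)
    delta = mean_pairwise_distance(X)

    # Group the feature vectors by class label.
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)

    # Weighted average of the within-class mean pairwise distances.
    within = sum(len(g) / n * mean_pairwise_distance(g) for g in groups.values())

    gcov = delta - within
    gcor = gcov / delta if delta > 0 else 0.0
    return gcov, gcor


X = [(0.0,), (0.1,), (5.0,), (5.1,)]
y = ["a", "a", "b", "b"]
gcov, gcor = gini_distance_stats(X, y)
print(round(gcov, 4), round(gcor, 4))
```

No density estimation is needed, only pairwise distances. On small samples this pair-averaged estimate can fall slightly below zero even though the population quantity is non-negative.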

3.
Genes (Basel) ; 9(2), 2018 Jan 26.
Article in English | MEDLINE | ID: mdl-29373522

ABSTRACT

Background: Breast cancer is intrinsically heterogeneous and is commonly classified into four main subtypes associated with distinct biological features and clinical outcomes. However, currently available data resources and methods are limited in identifying molecular subtyping on protein-coding genes, and little is known about the roles of long non-coding RNAs (lncRNAs), which occupy 98% of the whole genome. lncRNAs may also play important roles in subgrouping cancer patients and are associated with clinical phenotypes. Methods: The purpose of this project was to identify lncRNA gene signatures that are associated with breast cancer subtypes and clinical outcomes. We identified lncRNA gene signatures from The Cancer Genome Atlas (TCGA) RNAseq data that are associated with breast cancer subtypes by an optimized 1-Norm SVM feature selection algorithm. We evaluated the prognostic performance of these gene signatures with a semi-supervised principal component (superPC) method. Results: Although lncRNAs can independently predict breast cancer subtypes with satisfactory accuracy, a combined gene signature including both coding and non-coding genes will give the best clinically relevant prediction performance. We highlighted eight potential biomarkers (three from coding genes and five from non-coding genes) that are significantly associated with survival outcomes. Conclusion: Our proposed methods are a novel means of identifying subtype-specific coding and non-coding potential biomarkers that are both clinically relevant and biologically significant.

4.
BMC Genomics ; 17: 205, 2016 Mar 08.
Article in English | MEDLINE | ID: mdl-26956490

ABSTRACT

BACKGROUND: Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, not a single approach is free from limitations. Previously, we developed a new genomics-based approach that integrated microarray technology and regression modeling for predicting bioavailability (tissue residue) of explosive compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation. RESULTS: This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R² = 0.71-0.82 was achievable at a relatively low model complexity with as few as 3-10 predictor genes per model. These results are much more encouraging than our previous ones. CONCLUSION: This study has demonstrated that our approach is promising for bioavailability measurement, which warrants further studies of mixed contamination scenarios in field settings.


Subject(s)
Explosive Agents/pharmacokinetics , Gene Expression Profiling/methods , Oligochaeta/genetics , Soil Pollutants/pharmacokinetics , Animals , Azocines/pharmacokinetics , Biological Availability , Oligochaeta/metabolism , Oligonucleotide Array Sequence Analysis , Regression Analysis , Triazines/pharmacokinetics , Trinitrotoluene/pharmacokinetics
5.
BMC Genomics ; 16 Suppl 9: S3, 2015.
Article in English | MEDLINE | ID: mdl-26328548

ABSTRACT

Cancer is a disease characterized largely by the accumulation of out-of-control somatic mutations during the lifetime of a patient. Distinguishing driver mutations from passenger mutations has posed a challenge in modern cancer research. With the advanced development of microarray experiments and clinical studies, a large number of candidate cancer genes have been extracted, and distinguishing the informative genes among them is essential. We propose a framework that finds informative cancer genes using mutation data from ovarian cancers. In our model we utilized the patient gene mutation profile, gene expression data and gene-gene interaction network to construct a graphical representation of genes and patients. Markov processes for mutations and patients are triggered separately. After this process, cancer genes are prioritized automatically by examining their scores at their stationary distributions in the eigenvector. Extensive experiments demonstrate that the integration of heterogeneous sources of information is essential in finding important cancer genes.
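
The idea of ranking genes by their scores in a stationary distribution can be illustrated with a generic random walk. This is a minimal sketch under stated assumptions, not the authors' model: the three-gene weight matrix is made up for illustration, and the paper's separate gene and patient processes are collapsed into a single chain. The function names are hypothetical.

```python
def row_normalize(weights):
    """Turn a non-negative weight matrix into a row-stochastic transition matrix."""
    return [[w / sum(row) for w in row] for row in weights]


def stationary_distribution(transition, tol=1e-12, max_iter=10_000):
    """Power iteration: apply the transition matrix until the distribution settles."""
    n = len(transition)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: new_j = sum_i dist_i * P[i][j]
        nxt = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical symmetric interaction weights among three genes.
genes = ["geneA", "geneB", "geneC"]
weights = [
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 3.0],
    [1.0, 3.0, 1.0],
]
scores = stationary_distribution(row_normalize(weights))
ranking = sorted(zip(genes, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

For a symmetric weight matrix the stationary score of each node is proportional to its total weight, so the most strongly connected gene ranks first. The stationary vector is the left eigenvector of the transition matrix for eigenvalue 1, which matches the abstract's reference to scores "in the eigenvector".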


Subject(s)
Computational Biology/methods , Ovarian Neoplasms/genetics , Transcriptome , Female , Gene Regulatory Networks , Humans , Markov Chains , Mutation
6.
BMC Syst Biol ; 8 Suppl 3: S5, 2014.
Article in English | MEDLINE | ID: mdl-25350120

ABSTRACT

BACKGROUND: Many biology-related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. METHODS: In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence where we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationships in data, but are generally hard for humans to interpret. We propose a rule based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. RESULTS: We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. CONCLUSION: It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied.


Subject(s)
Artificial Intelligence , Computational Biology/methods , Algorithms , Linear Models
7.
Mol Inform ; 33(9): 627-40, 2014 Sep.
Article in English | MEDLINE | ID: mdl-27486081

ABSTRACT

Glycogen synthase kinase-3 (GSK-3) is a multifunctional serine/threonine protein kinase which regulates a wide range of cellular processes, involving various signalling pathways. GSK-3β has emerged as an important therapeutic target for diabetes and Alzheimer's disease. To identify structurally novel GSK-3β inhibitors, we performed virtual screening by implementing a combined ligand-based/structure-based approach, which included quantitative structure-activity relationship (QSAR) analysis and docking prediction. To integrate and analyze complex data sets from multiple experimental sources, we drafted and validated a hierarchical QSAR method, which adopts a two-level structure to take data heterogeneity into account. A collection of 728 GSK-3 inhibitors with diverse structural scaffolds was obtained from published papers that used different experimental assay protocols. Support vector machines and random forests were implemented with wrapper-based feature selection algorithms to construct predictive learning models. The best models for each single group of compounds were then used to build the final hierarchical QSAR model, with an overall R² of 0.752 for the 141 compounds in the test set. The compounds obtained from the virtual screening experiment were tested for GSK-3β inhibition. The bioassay results confirmed that 2 hit compounds are indeed GSK-3β inhibitors exhibiting sub-micromolar inhibitory activity, and therefore validated our combined ligand-based/structure-based approach as effective for virtual screening experiments.

8.
BMC Bioinformatics ; 14 Suppl 14: S16, 2013.
Article in English | MEDLINE | ID: mdl-24267824

ABSTRACT

BACKGROUND: In drug discovery and development, it is crucial to determine which conformers (instances) of a given molecule are responsible for its observed biological activity and at the same time to recognize the most representative subset of features (molecular descriptors). Due to experimental difficulty in obtaining the bioactive conformers, computational approaches such as machine learning techniques are much needed. Multiple Instance Learning (MIL) is a machine learning method capable of tackling this type of problem. In the MIL framework, each instance is represented as a feature vector, which usually resides in a high-dimensional feature space. The high dimensionality may provide significant information for learning tasks, but at the same time it may also include a large number of irrelevant or redundant features that might negatively affect learning performance. Reducing the dimensionality of data will hence facilitate the classification task and improve the interpretability of the model. RESULTS: In this work we propose a novel approach, named multiple instance learning via joint instance and feature selection. The iterative joint instance and feature selection is achieved using an instance-based feature mapping and 1-norm regularized optimization. The proposed approach was tested on four biological activity datasets. CONCLUSIONS: The empirical results demonstrate that the selected instances (prototype conformers) and features (pharmacophore fingerprints) have competitive discriminative power and the convergence of the selection process is also fast.


Subject(s)
Drug Discovery , Algorithms , Artificial Intelligence , Humans , Imaging, Three-Dimensional , Ligands , Models, Molecular , Molecular Conformation
9.
BMC Bioinformatics ; 13 Suppl 15: S3, 2012.
Article in English | MEDLINE | ID: mdl-23046442

ABSTRACT

BACKGROUND: In the context of drug discovery and development, much effort has been exerted to determine which conformers of a given molecule are responsible for the observed biological activity. In this work we aimed to predict bioactive conformers using a variant of supervised learning, named multiple-instance learning. A single molecule, treated as a bag of conformers, is biologically active if and only if at least one of its conformers, treated as an instance, is responsible for the observed bioactivity; and a molecule is inactive if none of its conformers is responsible for the observed bioactivity. The implementation requires instance-based embedding, and joint feature selection and classification. The goal of the present project is to implement multiple-instance learning in drug activity prediction, and subsequently to identify the bioactive conformers for each molecule. METHODS: We encoded the 3-dimensional structures using pharmacophore fingerprints which are binary strings, and accomplished instance-based embedding using calculated dissimilarity distances. Four dissimilarity measures were employed and their performances were compared. 1-norm SVM was used for joint feature selection and classification. The approach was applied to four data sets, and the best proposed model for each data set was determined by using the dissimilarity measure yielding the smallest number of selected features. RESULTS: The predictive abilities of the proposed approach were compared with three classical predictive models without instance-based embedding. The proposed approach produced the best predictive models for one data set and second best predictive models for the rest of the data sets, based on the external validations. To validate the ability of the proposed approach to find bioactive conformers, 12 small molecules with co-crystallized structures were seeded in one data set. 10 out of 12 co-crystallized structures were indeed identified as significant conformers using the proposed approach. CONCLUSIONS: The proposed approach was proven not to suffer from overfitting and to be highly competitive with classical predictive models, so it is very powerful for drug activity prediction. The approach was also validated as a useful method for pursuit of bioactive conformers.
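
The bag rule and the instance-based embedding described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are represented as sets of on-bit indices, Tanimoto dissimilarity stands in for the four (unnamed) dissimilarity measures, and each embedding coordinate is the distance from a prototype conformer to the closest conformer in the bag. All function names are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """1 - |a & b| / |a | b| for two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def embed_bag(bag, prototypes):
    """Map a molecule (bag of conformers) to one feature per prototype conformer.

    Each feature is the dissimilarity between the prototype and the closest
    conformer in the bag, so the bag becomes a fixed-length vector.
    """
    return [min(tanimoto_dissimilarity(c, p) for c in bag) for p in prototypes]


def bag_label(instance_labels):
    """A molecule is active if and only if at least one conformer is active."""
    return any(instance_labels)


# Hypothetical molecule with two conformers, embedded against three prototypes.
molecule = [{1, 2, 3}, {4, 5}]
prototypes = [{1, 2, 3}, {4, 5, 6}, {7}]
print(embed_bag(molecule, prototypes))
print(bag_label([False, True]), bag_label([False, False]))
```

After this embedding, each coordinate corresponds to one candidate conformer, so a sparse (1-norm) classifier that zeroes out coordinates is selecting conformers and classifying molecules in the same step.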


Subject(s)
Artificial Intelligence , Computational Biology/methods , Drug Discovery , Models, Theoretical , Molecular Conformation , Quantitative Structure-Activity Relationship
10.
IEEE Trans Nanobioscience ; 11(3): 228-36, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22987128

ABSTRACT

There are a vast number of biology related research problems involving a combination of multiple sources of data to achieve a better understanding of the underlying problems. It is important to select and interpret the most important information from these sources. Thus it will be beneficial to have a good algorithm to simultaneously extract rules and select features for better interpretation of the predictive model. We propose an efficient algorithm, Combined Rule Extraction and Feature Elimination (CRF), based on 1-norm regularized random forests. CRF simultaneously extracts a small number of rules generated by random forests and selects important features. We applied CRF to several drug activity prediction and microarray data sets. CRF is capable of producing performance comparable with state-of-the-art prediction algorithms using a small number of decision rules. Some of the decision rules are biologically significant.


Subject(s)
Algorithms , Artificial Intelligence , Computational Biology/methods , Decision Trees , ATP Binding Cassette Transporter, Subfamily B, Member 1/genetics , Databases, Factual , Humans , Models, Theoretical , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis , Receptors, Cannabinoid/genetics , Reproducibility of Results
11.
Int J Bioinform Res Appl ; 8(1-2): 38-53, 2012.
Article in English | MEDLINE | ID: mdl-22450269

ABSTRACT

Incorporating various sources of biological information is important for biological discovery. For example, genes have a multiview representation. They can be represented by features such as sequence length and pairwise similarities. Hence, the types vary from numerical features to categorical features. We propose a large margin Random Forests (RF) classification approach based on RF proximity kernels. Random Forests accommodate mixed data types naturally. The performance on four biological datasets is promising compared with other state-of-the-art methods including Support Vector Machines (SVMs) and RF classifiers. It demonstrates high potential in the discovery of functional roles of biomolecules.


Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Pattern Recognition, Automated , Support Vector Machine
12.
BMC Bioinformatics ; 12 Suppl 10: S22, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22166097

ABSTRACT

BACKGROUND: It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task. RESULTS: We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all the possible restructured problems. It is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem. CONCLUSIONS: The proposed conditional entropy based metric is effective in identifying good partitions of a classification problem, hence enhancing the prediction performance.
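
The ranking idea can be illustrated with a simplified stand-in. The sketch below ranks categorical attributes by the plain empirical conditional entropy H(Y | A) of the class label within each partition; it does not reproduce the paper's metric, which is described as a conditional entropy achieved by optimal classifiers and approximated with random projections. The attribute names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(attribute, labels):
    """H(Y | A): label entropy averaged over the partition induced by the attribute."""
    parts = defaultdict(list)
    for a, y in zip(attribute, labels):
        parts[a].append(y)
    n = len(labels)
    return sum(len(p) / n * entropy(p) for p in parts.values())


def rank_attributes(table, labels):
    """Return attribute names sorted by conditional entropy, lowest (best) first."""
    return sorted(table, key=lambda name: conditional_entropy(table[name], labels))


# Hypothetical data: "assay" separates the classes, "batch" does not.
labels = [1, 1, 0, 0]
table = {
    "batch": ["x", "y", "x", "y"],
    "assay": ["a", "a", "b", "b"],
}
print(rank_attributes(table, labels))
```

An attribute with low conditional entropy splits the data into sub-problems in which the label is nearly determined, which is the sense in which a partition "simplifies the learning task".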


Subject(s)
Artificial Intelligence , Models, Biological , Algorithms , Entropy , Glycogen Synthase Kinase 3/antagonists & inhibitors , Glycogen Synthase Kinase 3 beta , Humans , Precursor Cell Lymphoblastic Leukemia-Lymphoma/drug therapy , Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics , Prognosis , Receptor, Cannabinoid, CB1/metabolism , Receptor, Cannabinoid, CB2/metabolism
14.
BMC Bioinformatics ; 10 Suppl 11: S19, 2009 Oct 08.
Article in English | MEDLINE | ID: mdl-19811684

ABSTRACT

BACKGROUND: Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is inevitable to address the challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of underlying biology, because biological significance does not necessarily have to be statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure. RESULTS: We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, edge weights of the graph are assigned to be gene expression level. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of gene and molecular function can be simultaneously ranked by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-regulated genes also can be separately ranked. CONCLUSION: The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.


Subject(s)
Computational Biology/methods , Gene Expression Profiling/methods , Algorithms , Oligonucleotide Array Sequence Analysis
17.
BMC Bioinformatics ; 8 Suppl 7: S8, 2007 Nov 01.
Article in English | MEDLINE | ID: mdl-18047731

ABSTRACT

BACKGROUND: Mean-based clustering algorithms such as bisecting k-means generally lack robustness. Although componentwise median is a more robust alternative, it can be a poor center representative for high dimensional data. We need a new algorithm that is robust and works well in high dimensional data sets, e.g., gene expression data. RESULTS: Here we propose a new robust divisive clustering algorithm, the bisecting k-spatialMedian, based on the statistical spatial depth. A new subcluster selection rule, Relative Average Depth, is also introduced. We demonstrate that the proposed clustering algorithm outperforms the componentwise-median-based bisecting k-median algorithm for high dimension and low sample size (HDLSS) data via applications of the algorithms on two real HDLSS gene expression data sets. When further applied on noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm. CONCLUSION: Statistical data depths provide an alternative way to find the "center" of multivariate data sets and are useful and robust for clustering.
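
The robust "center" this abstract refers to, the spatial median, is the point that minimizes the sum of Euclidean distances to the data. A minimal sketch of one standard way to compute it, Weiszfeld's iteration, is given below. Whether the paper uses this particular solver is not stated in the abstract, so treat the choice as an assumption; the clustering procedure itself is not reproduced.

```python
import math


def spatial_median(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances (Weiszfeld)."""
    dim = len(points[0])
    # Start from the ordinary mean.
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(max_iter):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            dist = math.dist(p, center)
            if dist < 1e-12:
                continue  # simplification: skip a point sitting on the current estimate
            weight = 1.0 / dist
            denominator += weight
            for d in range(dim):
                numerator[d] += weight * p[d]
        if denominator == 0.0:
            return center  # every point coincides with the estimate
        new_center = [v / denominator for v in numerator]
        if math.dist(new_center, center) < tol:
            return new_center
        center = new_center
    return center


# Four points on a unit square plus one far outlier.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean = [sum(p[d] for p in data) / len(data) for d in range(2)]
print("mean:", mean)
print("spatial median:", spatial_median(data))
```

In this example the mean is dragged far outside the square by the single outlier, while the spatial median stays inside it, which illustrates the robustness argument made in the abstract.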


Subject(s)
Algorithms , Data Interpretation, Statistical , Gene Expression Profiling/methods , Models, Biological , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Computer Simulation
18.
BMC Bioinformatics ; 7 Suppl 2: S12, 2006 Sep 06.
Article in English | MEDLINE | ID: mdl-17118133

ABSTRACT

BACKGROUND: Recursive Feature Elimination is a common and well-studied method for reducing the number of attributes used for further analysis or development of prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle in using it is the amount of computational power required. RESULTS: Here we introduce a variant of RFE which employs ideas from simulated annealing. The goal of the algorithm is to improve the computational performance of recursive feature elimination by eliminating chunks of features at a time with as little effect on the quality of the reduced feature set as possible. The algorithm has been tested on several large gene expression data sets. The RFE algorithm is implemented using a Support Vector Machine to assist in identifying the least useful gene(s) to eliminate. CONCLUSION: The algorithm is simple and efficient and generates a set of attributes that is very similar to the set produced by RFE.


Subject(s)
Computational Biology/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Computer Simulation
20.
DNA Cell Biol ; 23(10): 635-42, 2004 Oct.
Article in English | MEDLINE | ID: mdl-15585121

ABSTRACT

This paper contains a description of several common normalization methods used in microarray analysis, and compares the effect of these methods on microarray data. The importance of background subtraction is also addressed. The research focuses on three parts. The first uses three statistical methods: t-test, Wilcoxon signed rank test, and sign test to measure the difference between background-subtracted data and non-background-subtracted data. The second part of the study uses the same three statistical methods to compare whether data normalized with different normalization methods yield similar results. The third part of the study focuses on whether these differently normalized data will influence the result of gene selection (dimension reduction). The comparisons are done for several data sets to help identify similarity patterns. The conclusion of this study is that background subtraction can make a difference, especially for some data sets with poorer quality data. The choice of normalization method, for the most part, makes little difference in the sense that the methods produce similarly normalized data. However, based on the third part of the analysis, we found that when gene selection is performed on these differently normalized data, somewhat different gene sets are obtained. Thus, the choice of normalization method will likely have some effect on the final analysis.
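
Of the three tests named in this abstract, the sign test is the simplest to write out. The sketch below is a minimal exact two-sided sign test for paired measurements, for example the same spots with and without background subtraction. The data are made up and the function name is hypothetical; the study's actual analysis pipeline is not described in enough detail to reproduce.

```python
import math


def sign_test_p_value(before, after):
    """Exact two-sided sign test for paired samples.

    Ties are dropped; under the null hypothesis the number of positive
    differences follows Binomial(n, 0.5).
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical intensities for eight spots, raw vs. background-subtracted.
raw = [120, 135, 150, 160, 170, 185, 190, 210]
subtracted = [100, 118, 131, 139, 152, 160, 171, 188]
print(sign_test_p_value(raw, subtracted))
```

Because the test uses only the signs of the paired differences, it makes no assumption about their distribution, at the cost of less power than the t-test or the Wilcoxon signed rank test.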


Subject(s)
Oligonucleotide Array Sequence Analysis , Selection, Genetic