ABSTRACT
Classifying a population by a specific trait is a major task in medicine, for example in diagnostics, when groups of patients with specific diseases are identified, and in predictive medicine, when patients are assigned to disease severity classes that might benefit from different treatments. When those subgroups are small, as in rare diseases, imbalance between the classes is the rule rather than the exception. It makes statistical classification problematic because many observations are assigned to the majority class, so the error rate of the minority class is high while that of the majority class stays low. This case study investigates class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods, cost-sensitive learning approaches). We evaluate all approaches with a scoring system that takes the classification results into consideration. The case study is based on one high-dimensional multiplex autoimmune assay dataset describing immune response to antigens and consisting of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings by investigating other datasets or large-scale simulation studies, we claim that this work has the potential to raise practitioners' awareness of the problem of class imbalance, and it stresses the importance of considering methods that compensate for it.
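The imbalance effect described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the class sizes, the helper names `reduce_class` and `per_class_error`, and the trivial majority-class "classifier" are assumptions made purely for illustration.

```python
import random

def reduce_class(labels, target, keep, seed=0):
    """Return sorted indices of a dataset in which class `target` is cut to `keep` samples."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    other_idx = [i for i, y in enumerate(labels) if y != target]
    return sorted(other_idx + rng.sample(target_idx, keep))

def per_class_error(y_true, y_pred):
    """Error rate computed separately for each class present in y_true."""
    errors = {}
    for cls in sorted(set(y_true)):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == cls]
        errors[cls] = sum(t != p for t, p in pairs) / len(pairs)
    return errors

# Hypothetical class sizes; the real dataset sizes are not given in the abstract.
labels = ["RA"] * 50 + ["SLE"] * 50
for keep in (50, 25, 5):
    idx = reduce_class(labels, "RA", keep)
    y_true = [labels[i] for i in idx]
    y_pred = ["SLE"] * len(y_true)  # stand-in classifier that always predicts the majority class
    overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(keep, round(overall, 3), per_class_error(y_true, y_pred))
```

With this stand-in classifier the overall error falls as the RA class shrinks, while the RA error rate stays at 1.0, which is why per-class error rates rather than overall error are the quantity to watch under imbalance.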
Subjects
Biometry/methods, Algorithms, Rheumatoid Arthritis/diagnosis, Biological Assay/standards, Computer Simulation, Discriminant Analysis, Humans, Systemic Lupus Erythematosus/diagnosis
ABSTRACT
BACKGROUND: For heterogeneous tissues, such as blood, measurements of gene expression are confounded by the relative proportions of the cell types involved. Conclusions have to rely on estimates of gene expression signals for homogeneous cell populations, obtained e.g. by micro-dissection, fluorescence-activated cell sorting, or in-silico deconfounding. We studied the feasibility and validity of a non-negative matrix decomposition algorithm using experimental gene expression data for blood and sorted cells from the same donor samples. Our objective was to optimize the algorithm for the detection of differentially expressed genes and to enable its use for classification in the difficult scenario of reversely regulated genes. This would be important for the identification of candidate biomarkers in heterogeneous tissues. RESULTS: Experimental data and simulation studies involving noise parameters estimated from these data revealed that quantile normalization and the use of non-log data are optimal for valid detection of differential gene expression. We demonstrate the feasibility of predicting the proportions of constituent cell types from gene expression data of single samples, as a prerequisite for a deconfounding-based classification approach. Classification cross-validation errors with and without deconfounding results are reported, as well as sample-size dependencies. An implementation of the algorithm and the simulation and analysis scripts are available. CONCLUSIONS: The deconfounding algorithm without decorrelation, using quantile normalization on non-log data, is proposed for biomarkers that are difficult to detect and for cases where confounding by varying proportions of cell types is the suspected reason. In this case, a deconfounding ranking approach can be used as a powerful alternative to, or complement of, other statistical learning approaches to define candidate biomarkers for molecular diagnosis and prediction in biomedicine, under realistically noisy conditions and with moderate sample sizes.
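The quantile normalization step mentioned in the RESULTS can be sketched as follows. This is the standard textbook procedure applied to a genes x samples matrix on the raw (non-log) scale, not the authors' published scripts; the function name `quantile_normalize` and the simplified tie handling (ties broken by row order) are assumptions.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    `matrix` is a list of rows; every column is mapped onto the same
    reference distribution, the mean of the k-th smallest values across samples.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / n_samples for k in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Rank genes within the sample; ties are broken by row order (a simplification).
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized

# Hypothetical raw (non-log) intensities: 3 genes x 2 samples.
raw = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
print(quantile_normalize(raw))
```

After normalization every sample shares the same set of values, and only the within-sample ranking of the genes differs.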
Subjects
Algorithms, Biomarkers/chemistry, Computational Biology/methods, Gene Expression Profiling, Oligonucleotide Array Sequence Analysis
ABSTRACT
Classification studies are widely applied, e.g. in biomedical research, to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (i.e., the lowest classification error). Especially in gene expression experiments, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower-dimensional space. The resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is chosen to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and with ordinary PLS-DA using simulated and experimental datasets. For the investigated datasets with weak linear dependency between features/variables, PPLS-DA and the extensions show no improvement over PLS-DA. For simulated data with a very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA does not improve on PLS-DA, but our extension shows a lower prediction error. In contrast, for the dataset with strong between-feature collinearity, a low proportion of differentially expressed genes and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover, we compare these prediction results with those of support vector machines with linear kernel and of linear discriminant analysis.
Subjects
Discriminant Analysis, Least-Squares Analysis, Gene Expression Profiling
ABSTRACT
Although tuberculosis (TB) causes more deaths than any other pathogen, most infected individuals harbor the pathogen without signs of disease. We explored the metabolome of >400 small molecules in serum of uninfected individuals, latently infected healthy individuals and patients with active TB. We identified changes in amino acid, lipid and nucleotide metabolism pathways, providing evidence for anti-inflammatory metabolomic changes in TB. Metabolic profiles indicate increased activity of indoleamine 2,3-dioxygenase 1 (IDO1), decreased phospholipase activity, increased abundance of adenosine metabolism products, as well as indicators of fibrotic lesions in active disease as compared to latent infection. Consistent with our predictions, we experimentally demonstrate TB-induced IDO1 activity. Furthermore, we demonstrate a link between metabolic profiles and cytokine signaling. Finally, we show that 20 metabolites are sufficient for robust discrimination of TB patients from healthy individuals. Our results provide specific insights into the biology of TB and pave the way for the rational development of metabolic biomarkers for TB.
Subjects
Immune Tolerance, Metabolomics, Physiological Stress, Pulmonary Tuberculosis/immunology, Pulmonary Tuberculosis/metabolism, Biomarkers/metabolism, Case-Control Studies, Cluster Analysis, Female, Humans, Indoleamine-Pyrrole 2,3,-Dioxygenase/metabolism, Inflammation/metabolism, Kynurenine/biosynthesis, Male, Pulmonary Tuberculosis/enzymology, Pulmonary Tuberculosis/physiopathology
ABSTRACT
Detection of discriminating patterns in gene expression data can be accomplished with various methods of statistical learning. It has been proposed that sample pooling in this context would have negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method in order to find out which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes, with bivariate linear patterns, and with the combination of both. Our results show a clear increase in prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel in two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.
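The pooling step of such a simulation could be sketched as below. This is an illustration only, assuming that a pool behaves like the arithmetic mean of its individual samples; the helper `pool_samples`, the data dimensions, and the normal noise model are hypothetical and are not taken from the authors' framework.

```python
import random
import statistics

def pool_samples(samples, pool_size):
    """Average consecutive groups of `pool_size` samples gene by gene.

    Leftover samples that do not fill a complete pool are dropped.
    """
    pools = []
    for start in range(0, len(samples) - pool_size + 1, pool_size):
        group = samples[start:start + pool_size]
        pools.append([statistics.mean(values) for values in zip(*group)])
    return pools

rng = random.Random(1)
# Hypothetical data: 12 samples x 3 genes drawn from a standard normal distribution.
samples = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(12)]
for pool_size in (1, 2, 4):
    pools = pool_samples(samples, pool_size)
    spread = statistics.pstdev([p[0] for p in pools])  # spread of gene 0 across pools
    print(pool_size, len(pools), round(spread, 3))
```

The sketch makes the trade-off visible: larger pools leave fewer training observations for the classifier, which is one plausible route to the higher prediction error reported above.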