Search | VHL Regional Portal

Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes.

Källberg, David; Vidman, Linda; Rydén, Patrik.

Front Genet ; 12: 632620, 2021.

Article in English | MEDLINE | ID: mdl-33719342

ABSTRACT

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
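The evaluation described in this abstract rests on two simple building blocks: ranking genes by standard deviation (the commonly used baseline) and scoring a clustering against the gold standard with the ARI. The following is a minimal sketch of both in plain Python, not the authors' code. The function names `select_top_sd_genes` and `adjusted_rand_index` and the gene-to-values dictionary layout are assumptions made for illustration, and the use of the population SD is likewise an assumption.

```python
from collections import Counter
from math import comb
from statistics import pstdev


def select_top_sd_genes(expression, k):
    """Return the k genes with the highest SD across samples.

    `expression` maps a gene name to its expression values, one per sample
    (hypothetical layout chosen for this sketch).
    """
    ranked = sorted(expression, key=lambda g: pstdev(expression[g]), reverse=True)
    return ranked[:k]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to disagree on
    # Number of sample pairs that are together in both partitions.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Number of sample pairs that are together in each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)
```

An ARI of 1 means the clustering reproduces the gold standard up to a relabeling of the clusters, while values near 0 (or below) indicate agreement no better than chance, which is how the negative-control figure of -0.01 quoted above should be read.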

Combining epigenetic and clinicopathological variables improves specificity in prognostic prediction in clear cell renal cell carcinoma.

Andersson-Evelönn, Emma; Vidman, Linda; Källberg, David; Landfors, Mattias; Liu, Xijia; Ljungberg, Börje; Hultdin, Magnus; Rydén, Patrik; Degerman, Sofie.

J Transl Med ; 18(1): 435, 2020 11 13.

Article in English | MEDLINE | ID: mdl-33187526

ABSTRACT

BACKGROUND: Metastasized clear cell renal cell carcinoma (ccRCC) is associated with a poor prognosis. Almost one-third of patients with non-metastatic tumors at diagnosis will later progress to metastatic disease. These patients need to be identified at diagnosis so that closer follow-up and/or adjuvant treatment can be undertaken. Today, clinicopathological variables are used to classify patients by risk, but molecular biomarkers are needed to improve risk classification and identify the high-risk patients who will benefit most from modern adjuvant therapies. DNA methylation profiling has emerged as a promising prognostic biomarker in ccRCC. This study aimed to derive a model for prediction of tumor progression after nephrectomy in non-metastatic ccRCC by combining DNA methylation profiling with clinicopathological variables. METHODS: A novel cluster analysis approach (Directed Cluster Analysis) was used to identify molecular biomarkers from genome-wide methylation array data. These novel DNA methylation biomarkers, together with previously identified CpG-site biomarkers and clinicopathological variables, were used to derive predictive classifiers for tumor progression. RESULTS: The "triple classifier", which included both novel and previously identified DNA methylation biomarkers together with clinicopathological variables, predicted tumor progression more accurately than the currently used Mayo scoring system, increasing the specificity from 50% with Mayo to 64% with the triple classifier at a fixed sensitivity of 85%. The 5-year cumulative incidence of progression (pCIP5yr) was 7.5% in low-risk vs 44.7% in high-risk M0 patients as classified by the triple classifier at diagnosis. CONCLUSIONS: The triple classifier panel that combines clinicopathological variables with genome-wide methylation data has the potential to improve specificity in prognosis prediction for patients with non-metastatic ccRCC.

Subject(s)

Carcinoma, Renal Cell; Kidney Neoplasms; Biomarkers, Tumor/genetics; Carcinoma, Renal Cell/diagnosis; Carcinoma, Renal Cell/genetics; DNA Methylation/genetics; Epigenesis, Genetic; Humans; Kidney Neoplasms/diagnosis; Kidney Neoplasms/genetics; Prognosis
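The ccRCC record above compares classifiers by their specificity at a fixed sensitivity of 85%. As a rough illustration of what that comparison involves, the sketch below picks the highest risk-score threshold that still flags the required share of progressors and then reports the specificity at that threshold. It is not the authors' procedure: the function name `specificity_at_sensitivity`, the convention that higher scores mean higher risk, and the toy scores in the usage lines are all assumptions.

```python
from math import ceil


def specificity_at_sensitivity(scores, progressed, target_sensitivity=0.85):
    """Return (threshold, specificity) at the given minimum sensitivity.

    `scores` are risk scores (higher = higher risk, an assumption of this
    sketch); `progressed` holds 1 for patients who progressed and 0 otherwise.
    A patient is called high-risk when score >= threshold.
    """
    pos = sorted((s for s, y in zip(scores, progressed) if y), reverse=True)
    neg = [s for s, y in zip(scores, progressed) if not y]
    if not pos or not neg:
        raise ValueError("need at least one progressor and one non-progressor")
    # Smallest number of progressors that must be flagged to reach the target
    # (the small epsilon guards against floating-point round-up).
    needed = max(1, ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[needed - 1]
    # Specificity: share of non-progressors correctly called low-risk.
    true_negatives = sum(1 for s in neg if s < threshold)
    return threshold, true_negatives / len(neg)


# Toy data, invented purely for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_progressed = [1, 1, 1, 1, 0, 0, 0, 0]
toy_threshold, toy_specificity = specificity_at_sensitivity(toy_scores, toy_progressed)
```

Fixing the sensitivity first puts two classifiers on equal footing with respect to missed progressors, so that any difference in specificity reflects how many non-progressing patients each one would flag unnecessarily.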

Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study.

Vidman, Linda; Källberg, David; Rydén, Patrik.

PLoS One ; 14(12): e0219102, 2019.

Article in English | MEDLINE | ID: mdl-31805048

ABSTRACT

BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Many clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance. RESULTS: In general, increasing the sample size had a limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males. CONCLUSIONS: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the sexes separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.

Subject(s)

Cluster Analysis; Neoplasms/genetics; RNA-Seq; Adolescent; Adult; Aged; Aged, 80 and over; Algorithms; Data Analysis; Datasets as Topic; Female; Genetic Research; Humans; Male; Middle Aged; RNA, Neoplasm; Sample Size; Sex Factors; Young Adult
