1.
Bioinform Biol Insights ; 18: 11779322241249563, 2024.
Article in English | MEDLINE | ID: mdl-38812741

ABSTRACT

Glioma is currently one of the most prevalent types of primary brain cancer. Given its high level of heterogeneity along with the complex biological molecular markers, many efforts have been made to accurately classify the type of glioma in each patient, which, in turn, is critical to improve early diagnosis and increase survival. Nonetheless, as a result of the fast-growing technological advances in high-throughput sequencing and evolving molecular understanding of glioma biology, its classification has been recently subject to significant alterations. In this study, we integrate multiple glioma omics modalities (including mRNA, DNA methylation, and miRNA) from The Cancer Genome Atlas (TCGA), while using the revised glioma reclassified labels, with a supervised method based on sparse canonical correlation analysis (DIABLO) to discriminate between glioma types. We were able to find a set of highly correlated features distinguishing glioblastoma from lower-grade gliomas (LGGs) that were mainly associated with the disruption of receptor tyrosine kinases signaling pathways and extracellular matrix organization and remodeling. Concurrently, the discrimination of the LGG types was characterized primarily by features involved in ubiquitination and DNA transcription processes. Furthermore, we could identify several novel glioma biomarkers likely helpful in both diagnosis and prognosis of the patients, including the genes PPP1R8, GPBP1L1, KIAA1614, C14orf23, CCDC77, BVES, EXD3, CD300A, and HEPN1. Collectively, this comprehensive approach not only allowed a highly accurate discrimination of the different TCGA glioma patients but also presented a step forward in advancing our comprehension of the underlying molecular mechanisms driving glioma heterogeneity. 
Ultimately, our study also revealed novel candidate biomarkers that might constitute potential therapeutic targets, marking a significant stride toward personalized and more effective treatment strategies for patients with glioma.

2.
Front Microbiol ; 14: 1261889, 2023.
Article in English | MEDLINE | ID: mdl-37808286

ABSTRACT

Microbiome data predictive analysis within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. It is demonstrated that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, the multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.

3.
BioData Min ; 16(1): 26, 2023 Sep 26.
Article in English | MEDLINE | ID: mdl-37752578

ABSTRACT

Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and disclose new patient groups and the relevant genes behind groups' separation. The results obtained suggest that astrocytoma and oligodendroglioma are more similar to each other than to glioblastoma, highlighting the molecular differences between glioblastoma and the other glioma subtypes. After a comprehensive literature search on the relevant genes pointed out by our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.

4.
BMC Bioinformatics ; 24(1): 17, 2023 Jan 16.
Article in English | MEDLINE | ID: mdl-36647008

ABSTRACT

Colorectal cancer (CRC) is the third most common cancer and the second most deadly worldwide. It is a very heterogeneous disease that can develop via distinct pathways where metastasis is the primary cause of death. Therefore, it is crucial to understand the molecular mechanisms underlying metastasis. RNA-sequencing is an essential tool used for studying the transcriptional landscape. However, the high dimensionality of gene expression data makes selecting novel metastatic biomarkers problematic. To distinguish early-stage CRC patients at risk of developing metastasis from those that are not, three types of binary classification approaches were used: (1) classification methods (decision trees, linear and radial kernel support vector machines, logistic regression, and random forest) using differentially expressed genes (DEGs) as input features; (2) regularized logistic regression based on the Elastic Net penalty and the proposed iTwiner, a network-based regularizer accounting for gene correlation information; and (3) classification methods based on the genes pre-selected using regularized logistic regression. Classifiers using the DEGs as features showed similar results, with random forest showing the highest accuracy. Using regularized logistic regression on the full dataset yielded no improvement in the methods' accuracy. Further classification using the pre-selected genes found by different penalty factors, instead of the DEGs, significantly improved the accuracy of the binary classifiers. Moreover, the use of network-based correlation information (iTwiner) for gene selection produced the best classification results and the identification of more stable and robust gene sets. Some of the selected genes are known to be tumor suppressor genes (OPCML-IT2), to be related to resistance to cancer therapies (RAC1P3), or to be involved in several cancer processes such as genome stability (XRCC6P2), tumor growth and metastasis (MIR602) and regulation of gene transcription (NME2P2).
We show that the classification of CRC patients based on pre-selected features by regularized logistic regression is a valuable alternative to using DEGs, significantly increasing the models' predictive performance. Moreover, the use of correlation-based penalization for biomarker selection stands as a promising strategy for predicting patients' groups based on RNA-seq data.


Subjects
Colorectal Neoplasms; Humans; Biomarkers; Logistic Models; Colorectal Neoplasms/genetics; Colorectal Neoplasms/pathology; Biomarkers, Tumor/genetics; Biomarkers, Tumor/metabolism; Cell Adhesion Molecules; GPI-Linked Proteins
5.
Stat Methods Med Res ; 31(5): 947-958, 2022 05.
Article in English | MEDLINE | ID: mdl-35072570

ABSTRACT

The extraction of novel information from omics data is a challenging task, in particular, since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are selected as features commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of 16,600 genes and more than 1,000 samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.


Subjects
Triple Negative Breast Neoplasms; Humans; Triple Negative Breast Neoplasms/genetics
6.
Cancers (Basel) ; 13(5)2021 Mar 02.
Article in English | MEDLINE | ID: mdl-33801334

ABSTRACT

Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on newly coming datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing a molecularly-informed-based personalized medicine.

7.
Biomedicines ; 8(11)2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33182598

ABSTRACT

Colorectal cancer (CRC) is one of the leading causes of mortality and morbidity in the world. Being a heterogeneous disease, cancer therapy and prognosis represent a significant challenge to medical care. The molecular information improves the accuracy with which patients are classified and treated since similar pathologies may show different clinical outcomes and other responses to treatment. However, the high dimensionality of gene expression data makes the selection of novel genes a problematic task. We propose TCox, a novel penalization function for Cox models, which promotes the selection of genes that have distinct correlation patterns in normal vs. tumor tissues. We compare TCox to other regularized survival models, Elastic Net, HubCox, and OrphanCox. Gene expression and clinical data of CRC and normal (TCGA) patients are used for model evaluation. Each model is tested 100 times. Within a specific run, eighteen of the features selected by TCox are also selected by the other survival regression models tested, therefore undoubtedly being crucial players in the survival of colorectal cancer patients. Moreover, the TCox model exclusively selects genes able to categorize patients into significant risk groups. Our work demonstrates the ability of the proposed weighted regularizer TCox to disclose novel molecular drivers in CRC survival by accounting for correlation-based network information from both tumor and normal tissue. The results presented support the relevance of network information for biomarker identification in high-dimensional gene expression data and foster new directions for the development of network-based feature selection methods in precision oncology.

8.
BMC Bioinformatics ; 21(1): 59, 2020 Feb 18.
Article in English | MEDLINE | ID: mdl-32070274

ABSTRACT

BACKGROUND: Understanding cellular and molecular heterogeneity in glioblastoma (GBM), the most common and aggressive primary brain malignancy, is a crucial step towards the development of effective therapies. Besides the inter-patient variability, the presence of multiple cell populations within tumors calls for the need to develop modeling strategies able to extract the molecular signatures driving tumor evolution and treatment failure. With the advances in single-cell RNA Sequencing (scRNA-Seq), tumors can now be dissected at the cell level, unveiling information from their life history to their clinical implications. RESULTS: We propose a classification setting based on GBM scRNA-Seq data, through sparse logistic regression, where different cell populations (neoplastic and normal cells) are taken as classes. The goal is to identify gene features discriminating between the classes, but also those shared by different neoplastic clones. The latter will be approached via the network-based twiner regularizer to identify gene signatures shared by neoplastic cells from the tumor core and infiltrating neoplastic cells originated from the tumor periphery, as putative disease biomarkers to target multiple neoplastic clones. Our analysis is supported by the literature through the identification of several known molecular players in GBM. Moreover, the relevance of the selected genes was confirmed by their significance in the survival outcomes in bulk GBM RNA-Seq data, as well as their association with several Gene Ontology (GO) biological process terms. CONCLUSIONS: We presented a methodology intended to identify genes discriminating between GBM clones, but also those playing a similar role in different GBM neoplastic clones (including migrating cells), therefore potential targets for therapy research. Our results contribute to a deeper understanding on the genetic features behind GBM, by disclosing novel therapeutic directions accounting for GBM heterogeneity.


Subjects
Brain Neoplasms/genetics; Glioblastoma/genetics; RNA-Seq; Brain Neoplasms/metabolism; Classification/methods; Gene Ontology; Glioblastoma/metabolism; Humans; Single-Cell Analysis
9.
BMC Bioinformatics ; 20(1): 356, 2019 Jun 25.
Article in English | MEDLINE | ID: mdl-31238876

ABSTRACT

BACKGROUND: Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the level of hormone-related signaling pathways and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research by targeting disease-specific cell functions not accounted for so far, identifying a gene signature common to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed to genes regulating both cancer types, with a consequent positive impact on cancer management and health economics. RESULTS: We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between BRCA and PRAD correlation matrices is investigated, through the proposed twin networks recovery (twiner) penalty, as a strategy to ensure that gene features similarly correlated in the two diseases are less penalized during the feature selection procedure. CONCLUSIONS: Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and are also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.
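
The weighting idea behind a correlation-distance penalty of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pearson`, `correlation_matrix`, `twin_weights`), the use of an angular distance between each gene's two correlation vectors, and the normalization by the largest distance are all hypothetical choices made for the example. The resulting weights could then be supplied as per-gene penalty factors to a sparse logistic regression solver.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(genes):
    """genes: list of expression vectors, one per gene (hypothetical layout)."""
    return [[pearson(a, b) for b in genes] for a in genes]

def twin_weights(genes_disease_a, genes_disease_b):
    """Per-gene penalty weights in [0, 1] (illustrative assumption).

    Each gene is described by its row of correlations with all genes,
    computed separately in each disease. The weight is the angle between
    the two rows, divided by the largest angle, so genes whose correlation
    pattern is similar in both diseases receive a smaller weight.
    """
    corr_a = correlation_matrix(genes_disease_a)
    corr_b = correlation_matrix(genes_disease_b)
    distances = []
    for row_a, row_b in zip(corr_a, corr_b):
        dot = sum(p * q for p, q in zip(row_a, row_b))
        norm_a = math.sqrt(sum(p * p for p in row_a))
        norm_b = math.sqrt(sum(q * q for q in row_b))
        cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
        distances.append(math.acos(cosine))
    largest = max(distances)
    return [d / largest if largest > 0 else 0.0 for d in distances]

# Toy data: three genes measured on five samples per disease (made up).
disease_a = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5], [5, 3, 4, 1, 2]]
disease_b = [[1, 2, 3, 4, 5], [9, 7, 8, 3, 1], [4, 4.5, 3, 2, 1]]
weights = twin_weights(disease_a, disease_b)
```

Genes with a weight near 0 would be penalized least, matching the stated goal of favoring features that are similarly correlated in both diseases.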


Subjects
Gene Expression Profiling/methods; Prostatic Neoplasms/genetics; Transcriptome; Triple Negative Breast Neoplasms/genetics; Estrogens/metabolism; Female; Gene Regulatory Networks; Humans; Logistic Models; Male; Principal Component Analysis; Prostatic Neoplasms/mortality; Prostatic Neoplasms/pathology; Receptors, Androgen/metabolism; Survival Analysis; Triple Negative Breast Neoplasms/mortality; Triple Negative Breast Neoplasms/pathology
10.
Stat Methods Med Res ; 28(10-11): 3042-3056, 2019.
Article in English | MEDLINE | ID: mdl-30146936

ABSTRACT

Correct classification of breast cancer subtypes is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer, which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma transcriptomic data publicly available from The Cancer Genome Atlas data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail to identify outliers due to their heavy influence, prompting the need for robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to triple-negative breast cancer, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for triple-negative breast cancer. Out of these, JAM3, SFT2D2, and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between triple-negative breast cancer and non-triple-negative breast cancer data. The individual role of FOXA1 in triple-negative breast cancer and non-triple-negative breast cancer, and the strong FOXA1-AGR2 connection in triple-negative breast cancer stand out. The goal of our paper is to contribute to the understanding and management of breast cancer and triple-negative breast cancer. At the same time, it demonstrates that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.


Subjects
Models, Genetic; Triple Negative Breast Neoplasms/genetics; Biomarkers, Tumor/genetics; Databases, Genetic; Female; Gene Expression Regulation, Neoplastic; Gene Regulatory Networks; Humans; Prognosis
11.
BMC Bioinformatics ; 19(1): 168, 2018 05 04.
Article in English | MEDLINE | ID: mdl-29728051

ABSTRACT

BACKGROUND: Learning accurate models from 'omics data is bringing many challenges due to their inherent high-dimensionality, e.g. the number of gene expression variables, and comparatively lower sample sizes, which leads to ill-posed inverse problems. Furthermore, the presence of outliers, either experimental errors or interesting abnormal clinical cases, may severely hamper a correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of the individuals' outlierness based on the Cook's distance. The consensus is achieved with the Rank Product statistics corrected for multiple testing, which gives a final list of sorted observations by their outlierness level. RESULTS: We applied this strategy for the classification of Triple-Negative Breast Cancer (TNBC) RNA-Seq and clinical data from the Cancer Genome Atlas (TCGA). The detected 24 outliers were identified as putative mislabeled samples, corresponding to individuals with discrepant clinical labels for the HER2 receptor, but also individuals with abnormal expression values of ER, PR and HER2, contradictory with the corresponding clinical labels, which may invalidate the initial TNBC label. Moreover, the model consensus approach leads to the selection of a set of genes that may be linked to the disease. These results are robust to a resampling approach, either by selecting a subset of patients or a subset of genes, with a significant overlap of the outlier patients identified. 
CONCLUSIONS: The proposed ensemble outlier detection approach constitutes a robust procedure to identify abnormal cases and consensus covariates, which may improve biomarker selection for precision medicine applications. The method can also be easily extended to other regression models and datasets.


Subjects
Triple Negative Breast Neoplasms/genetics; Whole Genome Sequencing/methods; Female; Humans; Sample Size; Triple Negative Breast Neoplasms/pathology
12.
BioData Min ; 11: 1, 2018.
Article in English | MEDLINE | ID: mdl-29456628

ABSTRACT

BACKGROUND: Survival analysis is a statistical technique widely used in many fields of science, in particular in the medical area, which studies the time until an event of interest occurs. Outlier detection in this context has gained great importance because the identification of long- or short-term survivors may lead to the detection of new prognostic factors. However, the results obtained using different outlier detection methods and residuals are seldom the same and are strongly dependent on the specific Cox proportional hazards model selected. In particular, when the inherent data have a high number of covariates, dimensionality reduction becomes a key challenge, usually addressed through regularized optimization, e.g. using Lasso, Ridge or Elastic Net regression. In the case of transcriptomics studies, this is a ubiquitous problem, since each observation has a very high number of associated covariates (genes). RESULTS: In order to solve this issue, we propose to use the Rank Product test, a non-parametric technique, as a method to identify discrepant observations independently of the selection method and deviance considered. An example based on The Cancer Genome Atlas (TCGA) ovarian cancer dataset is presented, where the covariates are patients' gene expressions. Three sub-models were considered, and, for each one, different outliers were obtained. Additionally, a resampling strategy was conducted to demonstrate the methods' consistency and robustness. The Rank Product worked as a consensus method to identify observations that can be influential under survival models, thus potential outliers in the high-dimensional space. CONCLUSIONS: The proposed technique allows us to combine the different results obtained by each sub-model and find which observations are systematically ranked as putative outliers to be explored further from a clinical point of view.
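
The consensus step described in this abstract can be illustrated with a small sketch. It assumes each sub-model has already produced one outlyingness score per observation (for example an absolute residual; the scores below are made up) and combines the per-model ranks through their geometric mean. The function names `ranks` and `rank_product` are hypothetical, and the significance test and multiple-testing correction that accompany the Rank Product statistic are not reproduced here.

```python
import math

def ranks(scores):
    """Rank observations so that the most outlying score gets rank 1.

    Ties are broken by original order (an arbitrary choice for the sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def rank_product(score_lists):
    """Geometric mean of each observation's ranks across all models."""
    all_ranks = [ranks(scores) for scores in score_lists]
    k = len(all_ranks)
    n = len(score_lists[0])
    return [math.prod(r[i] for r in all_ranks) ** (1.0 / k) for i in range(n)]

# Hypothetical outlyingness scores for four observations under three sub-models.
model_scores = [
    [0.1, 2.5, 0.3, 0.2],
    [0.2, 1.9, 0.1, 0.4],
    [0.5, 3.0, 0.1, 0.4],
]
rp = rank_product(model_scores)
most_outlying = rp.index(min(rp))  # observation consistently ranked first
```

An observation with a small rank product is ranked as outlying by all sub-models at once, which is what makes it a candidate for closer clinical inspection.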

13.
Biotechnol Prog ; 32(2): 447-55, 2016 03.
Article in English | MEDLINE | ID: mdl-26701677

ABSTRACT

Human mesenchymal stem/stromal cells (MSCs) have received considerable attention in the field of cell-based therapies due to their high differentiation potential and ability to modulate immune responses. However, since these cells can only be isolated in very low quantities, successful realization of these therapies requires ex vivo expansion of MSCs to achieve relevant cell doses. Metabolic activity is one of the parameters often monitored during MSC cultivation, using expensive multi-analytical methods, some of them time-consuming. The present work evaluates the use of mid-infrared (MIR) spectroscopy, through rapid and economic high-throughput analyses associated with multivariate data analysis, to monitor three different MSC cultivation runs conducted in spinner flasks, under xeno-free culture conditions, which differ in the type of microcarriers used and the culture feeding strategy applied. After evaluating diverse spectral preprocessing techniques, the optimized partial least squares (PLS) regression models based on the MIR spectra to estimate the glucose, lactate and ammonia concentrations yielded high coefficients of determination (R² ≥ 0.98, ≥ 0.98, and ≥ 0.94, respectively) and low prediction errors (RMSECV ≤ 4.7%, ≤ 4.4% and ≤ 5.7%, respectively). Besides PLS models valid for specific expansion protocols, a robust model simultaneously valid for the three processes was also built for predicting glucose, lactate and ammonia, yielding an R² of 0.95, 0.97 and 0.86, and an RMSECV of 0.33, 0.57, and 0.09 mM, respectively. Therefore, MIR spectroscopy combined with multivariate data analysis represents a promising tool for both optimization and control of MSC expansion processes. © 2016 American Institute of Chemical Engineers Biotechnol. Prog., 32:447-455, 2016.
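
For readers unfamiliar with the two figures of merit quoted above, the sketch below shows how they are commonly defined: the coefficient of determination and the root mean squared error of cross-validation (RMSECV), both computed from reference concentrations and cross-validated predictions. The function names and the numbers are hypothetical, and the abstract does not state which exact variant the authors used (for instance, whether the percentage RMSECV is scaled by the concentration range), so this is only one plausible reading.

```python
import math

def rmsecv(observed, cv_predicted):
    """Root mean squared error between reference values and the predictions
    obtained for each sample while it was held out during cross-validation.
    The result is in the same unit as the inputs (e.g. mM)."""
    n = len(observed)
    squared = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    return math.sqrt(squared / n)

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Made-up glucose concentrations (mM) and cross-validated predictions.
reference = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
error = rmsecv(reference, predicted)
fit = r_squared(reference, predicted)
```

Reporting RMSECV in concentration units, as in the combined model above, makes the error directly comparable with the measurement range of each metabolite.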


Subjects
Bioreactors; Cell Culture Techniques; Mesenchymal Stem Cells/cytology; High-Throughput Screening Assays; Humans; Multivariate Analysis; Spectroscopy, Near-Infrared