Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 29
Filtrar
1.
BMC Bioinformatics ; 24(1): 17, 2023 Jan 16.
Artigo em Inglês | MEDLINE | ID: mdl-36647008

RESUMO

Colorectal cancer (CRC) is the third most common cancer and the second most deathly worldwide. It is a very heterogeneous disease that can develop via distinct pathways where metastasis is the primary cause of death. Therefore, it is crucial to understand the molecular mechanisms underlying metastasis. RNA-sequencing is an essential tool used for studying the transcriptional landscape. However, the high-dimensionality of gene expression data makes selecting novel metastatic biomarkers problematic. To distinguish early-stage CRC patients at risk of developing metastasis from those that are not, three types of binary classification approaches were used: (1) classification methods (decision trees, linear and radial kernel support vector machines, logistic regression, and random forest) using differentially expressed genes (DEGs) as input features; (2) regularized logistic regression based on the Elastic Net penalty and the proposed iTwiner-a network-based regularizer accounting for gene correlation information; and (3) classification methods based on the genes pre-selected using regularized logistic regression. Classifiers using the DEGs as features showed similar results, with random forest showing the highest accuracy. Using regularized logistic regression on the full dataset yielded no improvement in the methods' accuracy. Further classification using the pre-selected genes found by different penalty factors, instead of the DEGs, significantly improved the accuracy of the binary classifiers. Moreover, the use of network-based correlation information (iTwiner) for gene selection produced the best classification results and the identification of more stable and robust gene sets. Some are known to be tumor suppressor genes (OPCML-IT2), to be related to resistance to cancer therapies (RAC1P3), or to be involved in several cancer processes such as genome stability (XRCC6P2), tumor growth and metastasis (MIR602) and regulation of gene transcription (NME2P2). We show that the classification of CRC patients based on pre-selected features by regularized logistic regression is a valuable alternative to using DEGs, significantly increasing the models' predictive performance. Moreover, the use of correlation-based penalization for biomarker selection stands as a promising strategy for predicting patients' groups based on RNA-seq data.


Assuntos
Neoplasias Colorretais , Humanos , Biomarcadores , Modelos Logísticos , Neoplasias Colorretais/genética , Neoplasias Colorretais/patologia , Biomarcadores Tumorais/genética , Biomarcadores Tumorais/metabolismo , Moléculas de Adesão Celular , Proteínas Ligadas por GPI
2.
BMC Bioinformatics ; 21(1): 59, 2020 Feb 18.
Artigo em Inglês | MEDLINE | ID: mdl-32070274

RESUMO

BACKGROUND: Understanding cellular and molecular heterogeneity in glioblastoma (GBM), the most common and aggressive primary brain malignancy, is a crucial step towards the development of effective therapies. Besides the inter-patient variability, the presence of multiple cell populations within tumors calls for the need to develop modeling strategies able to extract the molecular signatures driving tumor evolution and treatment failure. With the advances in single-cell RNA Sequencing (scRNA-Seq), tumors can now be dissected at the cell level, unveiling information from their life history to their clinical implications. RESULTS: We propose a classification setting based on GBM scRNA-Seq data, through sparse logistic regression, where different cell populations (neoplastic and normal cells) are taken as classes. The goal is to identify gene features discriminating between the classes, but also those shared by different neoplastic clones. The latter will be approached via the network-based twiner regularizer to identify gene signatures shared by neoplastic cells from the tumor core and infiltrating neoplastic cells originated from the tumor periphery, as putative disease biomarkers to target multiple neoplastic clones. Our analysis is supported by the literature through the identification of several known molecular players in GBM. Moreover, the relevance of the selected genes was confirmed by their significance in the survival outcomes in bulk GBM RNA-Seq data, as well as their association with several Gene Ontology (GO) biological process terms. CONCLUSIONS: We presented a methodology intended to identify genes discriminating between GBM clones, but also those playing a similar role in different GBM neoplastic clones (including migrating cells), therefore potential targets for therapy research. Our results contribute to a deeper understanding on the genetic features behind GBM, by disclosing novel therapeutic directions accounting for GBM heterogeneity.


Assuntos
Neoplasias Encefálicas/genética , Glioblastoma/genética , RNA-Seq , Neoplasias Encefálicas/metabolismo , Classificação/métodos , Ontologia Genética , Glioblastoma/metabolismo , Humanos , Análise de Célula Única
3.
BMC Bioinformatics ; 20(1): 356, 2019 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-31238876

RESUMO

BACKGROUND: Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the hormone-related signaling pathways level, and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research by targeting disease-specific cell functions not accounted so far, identifying a common gene signature to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed to genes regulating both cancer types, with a consequent positive impact in cancer management and health economics. RESULTS: We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between BRCA and PRAD correlation matrices is investigated, through the proposed twin networks recovery (twiner) penalty, as a strategy to ensure similarly correlated gene features in two diseases to be less penalized during the feature selection procedure. CONCLUSIONS: Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, and are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.


Assuntos
Perfilação da Expressão Gênica/métodos , Neoplasias da Próstata/genética , Transcriptoma , Neoplasias de Mama Triplo Negativas/genética , Estrogênios/metabolismo , Feminino , Redes Reguladoras de Genes , Humanos , Modelos Logísticos , Masculino , Análise de Componente Principal , Neoplasias da Próstata/mortalidade , Neoplasias da Próstata/patologia , Receptores Androgênicos/metabolismo , Análise de Sobrevida , Neoplasias de Mama Triplo Negativas/mortalidade , Neoplasias de Mama Triplo Negativas/patologia
4.
BMC Bioinformatics ; 19(1): 168, 2018 05 04.
Artigo em Inglês | MEDLINE | ID: mdl-29728051

RESUMO

BACKGROUND: Learning accurate models from 'omics data is bringing many challenges due to their inherent high-dimensionality, e.g. the number of gene expression variables, and comparatively lower sample sizes, which leads to ill-posed inverse problems. Furthermore, the presence of outliers, either experimental errors or interesting abnormal clinical cases, may severely hamper a correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of the individuals' outlierness based on the Cook's distance. The consensus is achieved with the Rank Product statistics corrected for multiple testing, which gives a final list of sorted observations by their outlierness level. RESULTS: We applied this strategy for the classification of Triple-Negative Breast Cancer (TNBC) RNA-Seq and clinical data from the Cancer Genome Atlas (TCGA). The detected 24 outliers were identified as putative mislabeled samples, corresponding to individuals with discrepant clinical labels for the HER2 receptor, but also individuals with abnormal expression values of ER, PR and HER2, contradictory with the corresponding clinical labels, which may invalidate the initial TNBC label. Moreover, the model consensus approach leads to the selection of a set of genes that may be linked to the disease. These results are robust to a resampling approach, either by selecting a subset of patients or a subset of genes, with a significant overlap of the outlier patients identified. CONCLUSIONS: The proposed ensemble outlier detection approach constitutes a robust procedure to identify abnormal cases and consensus covariates, which may improve biomarker selection for precision medicine applications. The method can also be easily extended to other regression models and datasets.


Assuntos
Neoplasias de Mama Triplo Negativas/genética , Sequenciamento Completo do Genoma/métodos , Feminino , Humanos , Tamanho da Amostra , Neoplasias de Mama Triplo Negativas/patologia
5.
J Ind Microbiol Biotechnol ; 44(1): 49-61, 2017 01.
Artigo em Inglês | MEDLINE | ID: mdl-27830421

RESUMO

To increase the knowledge of the recombinant cyprosin production process in Saccharomyces cerevisiae cultures, it is relevant to implement efficient bioprocess monitoring techniques. The present work focuses on the implementation of a mid-infrared (MIR) spectroscopy-based tool for monitoring the recombinant culture in a rapid, economic, and high-throughput (using a microplate system) mode. Multivariate data analysis on the MIR spectra of culture samples was conducted. Principal component analysis (PCA) enabled capturing the general metabolic status of the yeast cells, as replicated samples appear grouped together in the score plot and groups of culture samples according to the main growth phase can be clearly distinguished. The PCA-loading vectors also revealed spectral regions, and the corresponding chemical functional groups and biomolecules that mostly contributed for the cell biomolecular fingerprint associated with the culture growth phase. These data were corroborated by the analysis of the samples' second derivative spectra. Partial least square (PLS) regression models built based on the MIR spectra showed high predictive ability for estimating the bioprocess critical variables: biomass (R 2 = 0.99, RMSEP 2.8%); cyprosin activity (R 2 = 0.98, RMSEP 3.9%); glucose (R 2 = 0.93, RMSECV 7.2%); galactose (R 2 = 0.97, RMSEP 4.6%); ethanol (R 2 = 0.97, RMSEP 5.3%); and acetate (R 2 = 0.95, RMSEP 7.0%). In conclusion, high-throughput MIR spectroscopy and multivariate data analysis were effective in identifying the main growth phases and specific cyprosin production phases along the yeast culture as well as in quantifying the critical variables of the process. This knowledge will promote future process optimization and control the recombinant cyprosin bioprocess according to Quality by Design framework.


Assuntos
Ácido Aspártico Endopeptidases/biossíntese , Proteínas Recombinantes/biossíntese , Espectroscopia de Infravermelho com Transformada de Fourier/métodos , Biomassa , Biotecnologia/métodos , Etanol/metabolismo , Galactose , Glucose/análise , Análise dos Mínimos Quadrados , Análise de Componente Principal , Análise de Regressão , Saccharomyces cerevisiae , Espectrofotometria Infravermelho , Temperatura
6.
Anal Bioanal Chem ; 407(26): 8097-108, 2015 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-26329279

RESUMO

Reporter genes are routinely used in every laboratory for molecular and cellular biology for studying heterologous gene expression and general cellular biological mechanisms, such as transfection processes. Although well characterized and broadly implemented, reporter genes present serious limitations, either by involving time-consuming procedures or by presenting possible side effects on the expression of the heterologous gene or even in the general cellular metabolism. Fourier transform mid-infrared (FT-MIR) spectroscopy was evaluated to simultaneously analyze in a rapid (minutes) and high-throughput mode (using 96-wells microplates), the transfection efficiency, and the effect of the transfection process on the host cell biochemical composition and metabolism. Semi-adherent HEK and adherent AGS cell lines, transfected with the plasmid pVAX-GFP using Lipofectamine, were used as model systems. Good partial least squares (PLS) models were built to estimate the transfection efficiency, either considering each cell line independently (R (2) ≥ 0.92; RMSECV ≤ 2 %) or simultaneously considering both cell lines (R (2) = 0.90; RMSECV = 2 %). Additionally, the effect of the transfection process on the HEK cell biochemical and metabolic features could be evaluated directly from the FT-IR spectra. Due to the high sensitivity of the technique, it was also possible to discriminate the effect of the transfection process from the transfection reagent on KEK cells, e.g., by the analysis of spectral biomarkers and biochemical and metabolic features. The present results are far beyond what any reporter gene assay or other specific probe can offer for these purposes.


Assuntos
Espectroscopia de Infravermelho com Transformada de Fourier/métodos , Transfecção , Genes Reporter , Células HEK293 , Ensaios de Triagem em Larga Escala/métodos , Humanos , Análise dos Mínimos Quadrados , Metaboloma
7.
Bioinform Biol Insights ; 18: 11779322241249563, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-38812741

RESUMO

Glioma is currently one of the most prevalent types of primary brain cancer. Given its high level of heterogeneity along with the complex biological molecular markers, many efforts have been made to accurately classify the type of glioma in each patient, which, in turn, is critical to improve early diagnosis and increase survival. Nonetheless, as a result of the fast-growing technological advances in high-throughput sequencing and evolving molecular understanding of glioma biology, its classification has been recently subject to significant alterations. In this study, we integrate multiple glioma omics modalities (including mRNA, DNA methylation, and miRNA) from The Cancer Genome Atlas (TCGA), while using the revised glioma reclassified labels, with a supervised method based on sparse canonical correlation analysis (DIABLO) to discriminate between glioma types. We were able to find a set of highly correlated features distinguishing glioblastoma from lower-grade gliomas (LGGs) that were mainly associated with the disruption of receptor tyrosine kinases signaling pathways and extracellular matrix organization and remodeling. Concurrently, the discrimination of the LGG types was characterized primarily by features involved in ubiquitination and DNA transcription processes. Furthermore, we could identify several novel glioma biomarkers likely helpful in both diagnosis and prognosis of the patients, including the genes PPP1R8, GPBP1L1, KIAA1614, C14orf23, CCDC77, BVES, EXD3, CD300A, and HEPN1. Collectively, this comprehensive approach not only allowed a highly accurate discrimination of the different TCGA glioma patients but also presented a step forward in advancing our comprehension of the underlying molecular mechanisms driving glioma heterogeneity. Ultimately, our study also revealed novel candidate biomarkers that might constitute potential therapeutic targets, marking a significant stride toward personalized and more effective treatment strategies for patients with glioma.

8.
BioData Min ; 16(1): 26, 2023 Sep 26.
Artigo em Inglês | MEDLINE | ID: mdl-37752578

RESUMO

Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and disclose new patient groups and the relevant genes behind groups' separation. The results obtained suggest that astrocytoma and oligodendroglioma have more similarities compared with glioblastoma, highlighting the molecular differences between glioblastoma and the others glioma subtypes. After a comprehensive literature search on the relevant genes pointed our from our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.

9.
Front Microbiol ; 14: 1250909, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37869650

RESUMO

Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.

10.
Front Microbiol ; 14: 1261889, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37808286

RESUMO

Microbiome data predictive analysis within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. It is demonstrated that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, the multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.

11.
Front Microbiol ; 14: 1250806, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-38075858

RESUMO

The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of this data have proven to be challenging due to its complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been a rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and generate predictive models. This rapid development with a constant need for new developments and integration of new features require efforts into compile, catalog and classify these tools to create infrastructures and services with easy, transparent, and trustable standards. Here we review the state-of-the-art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to support microbiologists and biomedical scientists to go deeper into specialized resources that integrate ML techniques and facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software with examples of usage is provided including comments about pitfalls and lacks in the usage of software based on ML methods in relation to microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.

12.
Front Microbiol ; 14: 1257002, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37808321

RESUMO

The rapid development of machine learning (ML) techniques has opened up the data-dense field of microbiome research for novel therapeutic, diagnostic, and prognostic applications targeting a wide range of disorders, which could substantially improve healthcare practices in the era of precision medicine. However, several challenges must be addressed to exploit the benefits of ML in this field fully. In particular, there is a need to establish "gold standard" protocols for conducting ML analysis experiments and improve interactions between microbiome researchers and ML experts. The Machine Learning Techniques in Human Microbiome Studies (ML4Microbiome) COST Action CA18131 is a European network established in 2019 to promote collaboration between discovery-oriented microbiome researchers and data-driven ML experts to optimize and standardize ML approaches for microbiome analysis. This perspective paper presents the key achievements of ML4Microbiome, which include identifying predictive and discriminatory 'omics' features, improving repeatability and comparability, developing automation procedures, and defining priority areas for the novel development of ML methods targeting the microbiome. The insights gained from ML4Microbiome will help to maximize the potential of ML in microbiome research and pave the way for new and improved healthcare practices.

13.
Stat Methods Med Res ; 31(5): 947-958, 2022 05.
Artigo em Inglês | MEDLINE | ID: mdl-35072570

RESUMO

The extraction of novel information from omics data is a challenging task, in particular, since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy.Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are selected as features commonly selected by all methods.We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of 16,600 genes and more than 1,000 samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to the targeted in therapy research and personalized medicine frameworks.


Assuntos
Neoplasias de Mama Triplo Negativas , Humanos , Neoplasias de Mama Triplo Negativas/genética
14.
Toxins (Basel) ; 14(10)2022 09 30.
Artigo em Inglês | MEDLINE | ID: mdl-36287948

RESUMO

Diarrhetic Shellfish Poisoning (DSP) is an acute intoxication caused by the consumption of contaminated shellfish, which is common in many regions of the world. To safeguard human health, most countries implement programs focused on the surveillance of toxic phytoplankton abundance and shellfish toxicity levels, an effort that can be complemented by a deeper understanding of the underlying phenomena. In this work, we identify patterns of seasonality in shellfish toxicity across the Portuguese coast and analyse time-lagged correlations between this toxicity and various potential risk factors. We extend the understanding of these relations through the introduction of temporal lags, allowing the analysis of time series at different points in time and the study of the predictive power of the tested variables. This study confirms previous findings about toxicity seasonality patterns on the Portuguese coast and provides further quantitative data about the relations between shellfish toxicity and geographical location, shellfish species, toxic phytoplankton abundances, and environmental conditions. Furthermore, multiple pairs of areas and shellfish species are identified as having correlations high enough to allow for a predictive analysis. These results represent the first step towards understanding the dynamics of DSP toxicity in Portuguese shellfish producing areas, such as temporal and spatial variability, and towards the development of a shellfish safety forecasting system.


Assuntos
Intoxicação por Frutos do Mar , Humanos , Toxinas Marinhas/toxicidade , Toxinas Marinhas/análise , Frutos do Mar/análise , Fitoplâncton
15.
Cancers (Basel) ; 13(5)2021 Mar 02.
Artigo em Inglês | MEDLINE | ID: mdl-33801334

RESUMO

Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on newly coming datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing a molecularly-informed-based personalized medicine.

16.
Front Microbiol ; 12: 634511, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-33737920

RESUMO

The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.

17.
Front Microbiol ; 12: 635781, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-33692771

RESUMO

The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to other high-throughput studies, the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistics and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.

18.
Anal Chem ; 82(4): 1462-9, 2010 Feb 15.
Artigo em Inglês | MEDLINE | ID: mdl-20095581

RESUMO

A rapid detection of the nonauthenticity of suspect tablets is a key first step in the fight against pharmaceutical counterfeiting. The chemical characterization of these tablets is the logical next step to evaluate their impact on patient health and help authorities in tracking their source. Hyperspectral unmixing of near-infrared (NIR) image data is an emerging effective technology to infer the number of compounds, their spectral signatures, and the mixing fractions in a given tablet, with a resolution of a few tens of micrometers. In a linear mixing scenario, hyperspectral vectors belong to a simplex whose vertices correspond to the spectra of the compounds present in the sample. SISAL (simplex identification via split augmented Lagrangian), MVSA (minimum volume simplex analysis), and MVES (minimum-volume enclosing simplex) are recent algorithms designed to identify the vertices of the minimum volume simplex containing the spectral vectors and the mixing fractions at each pixel (vector). This work demonstrates the usefulness of these techniques, based on minimum volume criteria, for unmixing NIR hyperspectral data of tablets. The experiments herein reported show that SISAL/MVSA and MVES largely outperform MCR-ALS (multivariate curve resolution-alternating least-squares), which is considered the state-of-the-art in spectral unmixing for analytical chemistry. These experiments are based on synthetic data (studying the effect of noise and the presence/absence of pure pixels) and on a real data set composed of NIR images of counterfeit tablets.


Assuntos
Fraude , Preparações Farmacêuticas/análise , Preparações Farmacêuticas/química , Espectrofotometria Infravermelho , Comprimidos , Fatores de Tempo
19.
Biomedicines ; 8(11)2020 Nov 10.
Artigo em Inglês | MEDLINE | ID: mdl-33182598

RESUMO

Colorectal cancer (CRC) is one of the leading causes of mortality and morbidity in the world. Being a heterogeneous disease, cancer therapy and prognosis represent a significant challenge to medical care. The molecular information improves the accuracy with which patients are classified and treated since similar pathologies may show different clinical outcomes and other responses to treatment. However, the high dimensionality of gene expression data makes the selection of novel genes a problematic task. We propose TCox, a novel penalization function for Cox models, which promotes the selection of genes that have distinct correlation patterns in normal vs. tumor tissues. We compare TCox to other regularized survival models, Elastic Net, HubCox, and OrphanCox. Gene expression and clinical data of CRC and normal (TCGA) patients are used for model evaluation. Each model is tested 100 times. Within a specific run, eighteen of the features selected by TCox are also selected by the other survival regression models tested, therefore undoubtedly being crucial players in the survival of colorectal cancer patients. Moreover, the TCox model exclusively selects genes able to categorize patients into significant risk groups. Our work demonstrates the ability of the proposed weighted regularizer TCox to disclose novel molecular drivers in CRC survival by accounting for correlation-based network information from both tumor and normal tissue. The results presented support the relevance of network information for biomarker identification in high-dimensional gene expression data and foster new directions for the development of network-based feature selection methods in precision oncology.

20.
Stat Methods Med Res ; 28(10-11): 3042-3056, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-30146936

RESUMO

Correct classification of breast cancer subtypes is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer which has the worst prognosis among breast cancer types. Using cutting edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma transcriptomic data publicly available from The Cancer Genome Atlas data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail to identify outliers due to their heavy influence, prompting the need for robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to triple-negative breast cancer, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for triple-negative breast cancer. Out of these, JAM3, SFT2D2, and PAPSS1 were previously associated to breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between triple-negative breast cancer and non-triple-negative breast cancer data. The individual role of FOXA1 in triple-negative breast cancer and non-triple-negative breast cancer, and the strong FOXA1-AGR2 connection in triple-negative breast cancer stand out. The goal of our paper is to contribute to the breast cancer/triple-negative breast cancer understanding and management. At the same time it demonstrates that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.


Assuntos
Modelos Genéticos , Neoplasias de Mama Triplo Negativas/genética , Biomarcadores Tumorais/genética , Bases de Dados Genéticas , Feminino , Regulação Neoplásica da Expressão Gênica , Redes Reguladoras de Genes , Humanos , Prognóstico
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA