ABSTRACT
Unsupervised learning, particularly clustering, plays a pivotal role in disease subtyping and patient stratification, especially with the abundance of large-scale multi-omics data. Deep learning models, such as variational autoencoders (VAEs), can enhance clustering algorithms by leveraging inter-individual heterogeneity. However, the impact of confounders (external factors unrelated to the condition, e.g. batch effect or age) on clustering is often overlooked, introducing bias and spurious biological conclusions. In this work, we introduce four novel VAE-based deconfounding frameworks tailored for clustering multi-omics data. These frameworks effectively mitigate confounding effects while preserving genuine biological patterns. The deconfounding strategies employed include (i) removal of latent features correlated with confounders, (ii) a conditional VAE, (iii) adversarial training, and (iv) adding a regularization term to the loss function. Using real-life multi-omics data from The Cancer Genome Atlas, we simulated various confounding effects (linear, nonlinear, categorical, mixed) and assessed model performance across 50 repetitions based on reconstruction error, clustering stability, and deconfounding efficacy. Our results demonstrate that our novel models, particularly the conditional multi-omics VAE (cXVAE), successfully handle simulated confounding effects and recover biologically driven clustering structures. cXVAE accurately identifies patient labels and unveils meaningful pathological associations among cancer types, validating deconfounded representations. Furthermore, our study suggests that some of the proposed strategies, such as adversarial training, prove insufficient in confounder removal. In summary, our study contributes by proposing innovative frameworks for simultaneous multi-omics data integration, dimensionality reduction, and deconfounding in clustering.
Benchmarking on open-access data offers guidance to end-users, facilitating meaningful patient stratification for optimized precision medicine.
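As a rough illustration of deconfounding strategy (i) above, removal of latent features correlated with confounders, the following minimal Python sketch drops latent dimensions whose absolute Pearson correlation with a single numeric confounder reaches a cutoff. The function names, the toy data, and the 0.5 threshold are assumptions made for illustration only; the abstract does not state how the authors measure correlation or which cutoff they use.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def drop_confounded_dims(latent, confounder, threshold=0.5):
    """Keep only latent columns with |r| below the (hypothetical) threshold.

    latent: list of rows, one row of latent features per patient.
    confounder: one numeric confounder value per patient (e.g. age).
    Returns (filtered rows, indices of the columns that were kept).
    """
    n_dims = len(latent[0])
    keep = [
        j for j in range(n_dims)
        if abs(pearson([row[j] for row in latent], confounder)) < threshold
    ]
    return [[row[j] for j in keep] for row in latent], keep

# Toy latent space for four patients; column 0 tracks age almost exactly.
latent = [
    [0.1, 1.0, 2.0],
    [0.2, -1.0, 0.5],
    [0.3, -1.0, 1.5],
    [0.4, 1.0, 1.0],
]
age = [30, 40, 50, 60]
filtered, kept = drop_confounded_dims(latent, age)
```

In this toy case only the first latent column is removed; clustering would then run on the remaining columns.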
Subjects
Algorithms, Humans, Cluster Analysis, Neoplasms/genetics, Neoplasms/classification, Deep Learning, Genomics/methods, Computational Biology/methods, Unsupervised Machine Learning, Multiomics
ABSTRACT
The heterogeneity of tumor clones drives the selection and evolution of distinct tumor cell populations, resulting in an intricate and dynamic tumor evolution process. While tumor bulk DNA sequencing helps elucidate intratumor heterogeneity, challenges such as the misidentification of mutation multiplicity due to copy number variations and uncertainties in the reconstruction process hinder the accurate inference of tumor evolution. In this study, we introduce a novel approach, REconstructing Tumor Clonal Heterogeneity and Evolutionary Relationships (RETCHER), which characterizes more realistic cancer cell fractions by accurately identifying mutation multiplicity while considering uncertainty during the reconstruction process and the credibility and reasonableness of subclone clustering. This method comprehensively and accurately infers multiple forms of tumor clonal heterogeneity and phylogenetic relationships. RETCHER outperforms existing methods on simulated data and infers clearer subclone structures and evolutionary relationships in real multisample sequencing data from five tumor types. By precisely analysing the complex clonal heterogeneity within tumors, RETCHER provides a new approach to tumor evolution research and offers scientific evidence for developing precise and personalized treatment strategies. This approach is expected to play a significant role in tumor evolution research, clinical diagnosis, and treatment. RETCHER is available for free at https://github.com/zlsys3/RETCHER.
Subjects
Genetic Heterogeneity, Neoplasms, Humans, Neoplasms/genetics, Neoplasms/pathology, Neoplasms/classification, DNA Sequence Analysis/methods, Mutation, DNA Copy Number Variations, Phylogeny, Clonal Evolution, Algorithms, Computational Biology/methods
ABSTRACT
The identification of relevant biomarkers from high-dimensional cancer data remains a significant challenge due to the complexity and heterogeneity inherent in various cancer types. Conventional feature selection methods often struggle to effectively navigate the vast solution space while maintaining high predictive accuracy. In response to these challenges, we introduce a novel feature selection approach that integrates Random Drift Optimization (RDO) with XGBoost, specifically designed to enhance the performance of cancer classification tasks. Our proposed framework not only improves classification accuracy but also offers valuable insights into the underlying biological mechanisms driving cancer progression. Through comprehensive experiments conducted on real-world cancer datasets, including Central Nervous System (CNS), Leukemia, Breast, and Ovarian cancers, we demonstrate the efficacy of our method in identifying a smaller subset of unique and relevant genes. This selection results in significantly improved classification efficiency and accuracy. When compared with popular classifiers such as Support Vector Machine, K-Nearest Neighbor, and Naive Bayes, our approach consistently outperforms these models in terms of both accuracy and F-measure metrics. For instance, our framework achieved an accuracy of 97.24% in the CNS dataset, 99.14% in Leukemia, 95.21% in Ovarian, and 87.62% in Breast cancer, showcasing its robustness and effectiveness across different types of cancer data. These results underline the potential of our RDO-XGBoost framework as a promising solution for feature selection in cancer data analysis, offering enhanced predictive performance and valuable biological insights.
Subjects
Neoplasms, Humans, Neoplasms/classification, Algorithms, Support Vector Machine, Tumor Biomarkers/genetics, Bayes Theorem, Computational Biology/methods, Female
ABSTRACT
BACKGROUND: Manfredini et al. demonstrate that the new rating protocol, EMOnco, can safely triage cancer patients in acute care settings, considering their cancer type, stage, treatment history, and oncological emergencies, and enabling appropriate classification from high-risk to non-urgent patients. EMOnco considers variables related to the cancer history and treatment. It triages patients in emergency care in less than three minutes. Cancer patients need priority care regarding infection, and this protocol takes that into account. EMOnco has been shown to be a valid and reliable scale for the triage of oncological patients in the emergency room or acute care clinics. OBJECTIVE: To validate a risk rating scale for triaging cancer patients in emergency rooms that can identify individuals needing urgent care or at risk of imminent worsening of their clinical condition. METHODS: This is a health instrument validation study developed in the emergency care ward of a Brazilian hospital, a referral center for cancer and hematological diseases. We built the Emergency Oncology Scale (EMOnco) based on a literature review and a Delphi survey with 20 experienced oncologists (physicians and nurses). We validated the scale by assessing its construct validity, interobserver agreement, and reliability after applying it to a convenience sample of all consecutive patients with cancer who visited the ward between August 2017 and January 2018. We compared the EMOnco scores with those from other scales, used by six trained nurses: the Emergency Severity Index, the Manchester Triage System, and the Karnofsky Performance Status. We also recorded socio-demographic and clinical features and the Sequential Organ Failure Assessment (SOFA) results in the intensive care unit. RESULTS: We included 250 patients with locally advanced or recurrent disease and undergoing chemotherapy. EMOnco screening took 2.24 (± 2.9) minutes on average.
The interobserver correlation coefficient was 0.9. EMOnco was highly correlated with the Emergency Severity Index (r=0.617) and also correlated with the Karnofsky Performance Status (r=0.420) and the Manchester Triage System (r=0.491; p<0.001 for all). CONCLUSION: EMOnco in Portuguese considers variables related to the cancer history and treatment and has proven to be a valid and reliable scale for the risk classification of oncological patients in emergency care services.
Subjects
Hospital Emergency Service, Neoplasms, Triage, Humans, Triage/methods, Triage/standards, Neoplasms/therapy, Neoplasms/classification, Reproducibility of Results, Hospital Emergency Service/standards, Male, Female, Risk Assessment/methods, Middle Aged, Brazil, Adult, Aged, Delphi Technique
ABSTRACT
Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present BB-TEN: Big Bird - TNM staging Extracted from Notes, a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term-extraction, inferring TNM stage from context when it is not included in the report text explicitly. As external validation, we test our model on almost 8000 pathology reports from Columbia University Medical Center, finding that our trained model achieved an AU-ROC of 0.815-0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.
Subjects
Electronic Health Records, Neoplasm Staging, Neoplasms, Humans, Neoplasm Staging/methods, Neoplasms/pathology, Neoplasms/classification, Algorithms
ABSTRACT
Computer-assisted diagnosis (CAD) plays a key role in cancer diagnosis and screening. However, current CAD performs poorly on whole slide image (WSI) analysis and thus fails to generalize well. This research aims to develop an automatic classification system to distinguish between different types of carcinomas. Obtaining rich deep features in multi-class classification while achieving high accuracy is still a challenging problem. The detection and classification of cancerous cells in WSI are challenging because normal lumps and cancerous cells are easily confused, owing to cluttering, occlusion, and irregular cell distribution. Earlier work mostly relied on hand-crafted features and neglected these challenges, which reduced classification accuracy. To mitigate this problem, we propose an efficient dual attention-based network (CytoNet). The proposed network is composed of two main modules: (i) Efficient-Net and (ii) a Dual Attention Module (DAM). Efficient-Net is capable of obtaining higher accuracy and better efficiency than existing Convolutional Neural Networks (CNNs). It is also useful for obtaining generic features, as it has been trained on ImageNet. DAM, in turn, is robust in obtaining attention-weighted, targeted features while suppressing the background. In this way, the combination of an efficient module and an attention module is useful for obtaining robust and intrinsic features. Further, we evaluated the proposed network on two datasets, (i) our own generated thyroid dataset and (ii) the Mendeley Cervical dataset (Hussain in Data Brief, 2019), with enhanced performance compared to its counterparts. CytoNet achieved 99% accuracy on the thyroid dataset, exceeding its counterparts. The precision, recall, and F1-score values achieved on the Mendeley Cervical dataset are 0.992, 0.985, and 0.977, respectively.
The code implementation is available on GitHub. https://github.com/naveedilyas/CytoNet-An-Efficient-Dual-Attention-based-Automatic-Prediction-of-Cancer-Sub-types-in-Cytol.
Subjects
Computer Neural Networks, Humans, Neoplasms/pathology, Neoplasms/classification, Neoplasms/diagnosis, Computer-Assisted Diagnosis/methods, Female, Algorithms, Cytodiagnosis/methods, Computer-Assisted Image Processing/methods
ABSTRACT
Cancer-associated fibroblasts (CAFs) are heterogeneous and ubiquitous stromal cells within the tumor microenvironment (TME). Numerous CAF types have been described, typically using single-cell technologies such as single-cell RNA sequencing. There is no general classification system for CAFs, hampering their study and therapeutic targeting. We propose a simple CAF classification system based on single-cell phenotypes and spatial locations of CAFs in multiple cancer types, assess how our scheme fits within current knowledge, and invite the CAF research community to further refine it.
Subjects
Cancer-Associated Fibroblasts, Neoplasms, Single-Cell Analysis, Tumor Microenvironment, Cancer-Associated Fibroblasts/pathology, Cancer-Associated Fibroblasts/metabolism, Humans, Neoplasms/classification, Neoplasms/pathology, Neoplasms/genetics, Single-Cell Analysis/methods, Phenotype, Animals
ABSTRACT
BACKGROUND: Accurate identification of cancer subtypes is crucial for disease prognosis evaluation and personalized patient management. Recent advances in computational methods have demonstrated that multi-omics data provide valuable insights into tumor molecular subtyping. However, the high dimensionality and small sample size of the data may result in ambiguous and overlapping cancer subtypes during clustering. In this study, we propose a novel contrastive-learning-based approach to address this issue. The proposed end-to-end deep learning method can extract crucial information from the multi-omics features by self-supervised learning for patient clustering. RESULTS: By applying our method to nine public cancer datasets, we have demonstrated superior performance compared to existing methods in separating patients with different survival outcomes (p < 0.05). To further evaluate the impact of various omics data on cancer survival, we developed an XGBoost classification model and found that mRNA had the highest importance score, followed by DNA methylation and miRNA. In the presented case study, our method successfully clustered subtypes and identified 14 cancer-related genes, of which 12 (85.7%) were validated through literature review. CONCLUSIONS: Our findings demonstrate that our method is capable of identifying cancer subtypes that are both statistically and biologically significant. The code for COLCS is available at: https://github.com/Mercuriiio/COLCS .
Subjects
Deep Learning, Neoplasms, Humans, Neoplasms/genetics, Neoplasms/classification, DNA Methylation, Computer Neural Networks, Computational Biology/methods, MicroRNAs/genetics, Cluster Analysis, Multiomics
ABSTRACT
Cancer classification is crucial for effective patient treatment, and recent years have seen various methods emerge based on protein expression levels. However, existing methods oversimplify by assuming uniform interaction strengths and neglecting intermediate influences among proteins. Addressing these limitations, GATDE employs a graph attention network enhanced with diffusion on protein-protein interactions. By constructing a weighted protein-protein interaction network, GATDE captures the diversity of these interactions and uses a diffusion process to assess multi-hop influences between proteins. This information is subsequently incorporated into the graph attention network, resulting in precise cancer classification. Experimental results on breast cancer and pan-cancer datasets demonstrate that GATDE surpasses current leading methods. Additionally, in-depth case studies further validate the effectiveness of the diffusion process and the attention mechanism, highlighting GATDE's robustness and potential for real-world applications.
Subjects
Neoplasms, Protein Interaction Mapping, Protein Interaction Maps, Humans, Neoplasms/metabolism, Neoplasms/classification, Protein Interaction Mapping/methods, Breast Neoplasms/metabolism, Breast Neoplasms/classification, Breast Neoplasms/genetics, Computational Biology/methods, Female, Algorithms
ABSTRACT
In recent years, multi-omics clustering has become a powerful tool in cancer research, offering a comprehensive perspective on the diverse molecular characteristics inherent to various cancer subtypes. However, most existing multi-omics clustering methods directly integrate heterogeneous features from different omics, which may struggle to deal with the noise or redundancy of multi-omics data and lead to poor clustering results. Therefore, we propose a novel multi-omics clustering method to extract interpretable and discriminative features from various omics before data integration. The clinical information is used to supervise the process of feature extraction based on SHAP (SHapley Additive exPlanation) values. Singular value decomposition (SVD) is then applied to integrate the extracted features of different omics by constructing a latent subspace. Finally, we utilize shared nearest neighbor-based spectral clustering on the latent representation to obtain the clustering result. The proposed method is evaluated on several cancer datasets across three levels of omics, in comparison to several state-of-the-art multi-omics clustering methods. The comparison results demonstrate the superior performance of the proposed method in multi-omics data analysis for cancer subtyping. Additionally, experiments reveal the efficacy of utilizing clinical information based on SHAP values for feature extraction, enhancing the performance of clustering analyses. Moreover, enrichment analysis of the identified gene signatures in different subtypes is also performed to further demonstrate the effectiveness of the proposed method. Availability: The proposed method is freely accessible at https://github.com/Tianyi-Shi-Tsukuba/Multi-omics-clustering-based-on-SHAP. Data will be made available on request.
Subjects
Neoplasms, Humans, Cluster Analysis, Neoplasms/genetics, Neoplasms/classification, Neoplasms/metabolism, Algorithms, Genomics/methods, Computational Biology/methods, Machine Learning, Multiomics
ABSTRACT
BACKGROUND: Applying graph convolutional networks (GCN) to the classification of free-form natural language texts leveraged by graph-of-words features (TextGCN) was studied and confirmed to be an effective means of describing complex natural language texts. However, text classification models based on TextGCN have weaknesses in terms of memory consumption and model dissemination and distribution. In this paper, we present a fast message passing network (FastMPN), implementing a GCN with a message passing architecture that provides versatility and flexibility by allowing trainable node embeddings and edge weights, helping the GCN model find a better solution. We applied the FastMPN model to the task of clinical information extraction from cancer pathology reports, extracting the following six properties: main site, subsite, laterality, histology, behavior, and grade. RESULTS: We evaluated the clinical task performance of the FastMPN models in terms of micro- and macro-averaged F1 scores. A comparison was performed with the multi-task convolutional neural network (MT-CNN) model. Results show that the FastMPN model is equivalent to or better than the MT-CNN. CONCLUSIONS: Our implementation revealed that our FastMPN model, which is based on the PyTorch platform, can train on a large corpus (667,290 training samples) with 202,373 unique words in less than 3 minutes per epoch using one NVIDIA V100 hardware accelerator. Our experiments demonstrated that using this implementation, the clinical task performance scores of information extraction related to tumors from cancer pathology reports were highly competitive.
Subjects
Natural Language Processing, Neoplasms, Computer Neural Networks, Humans, Neoplasms/classification, Data Mining
ABSTRACT
BACKGROUND: Advances in precision oncology led to approval of tumour-agnostic molecularly guided treatment options (MGTOs). The minimum requirements for claiming tumour-agnostic potential remain elusive. METHODS: The European Society for Medical Oncology (ESMO) Precision Medicine Working Group (PMWG) coordinated a project to optimise tumour-agnostic drug development. International experts examined and summarised the publicly available data used for regulatory assessment of the tumour-agnostic indications approved by the US Food and Drug Administration and/or the European Medicines Agency as of December 2023. Different scenarios of minimum objective response rate (ORR), number of tumour types investigated, and number of evaluable patients per tumour type were assessed for developing a screening tool for tumour-agnostic potential. This tool was tested using the tumour-agnostic indications approved during the first half of 2024. A taxonomy for MGTOs and a framework for tumour-agnostic drug development were conceptualised. RESULTS: Each tumour-agnostic indication had data establishing objective response in at least one out of five patients (ORR ≥ 20%) in two-thirds (≥4) of the investigated tumour types, with at least five evaluable patients in each tumour type. These minimum requirements were met by tested indications and may serve as a screening tool for tumour-agnostic potential, requiring further validation. We propose a conceptual taxonomy classifying MGTOs based on the therapeutic effect obtained by targeting a driver molecular aberration across tumours and its modulation by tumour-specific biology: tumour-agnostic, tumour-modulated, or tumour-restricted. The presence of biology-informed mechanistic rationale, early regulatory advice, and adequate trial design demonstrating signs of biology-driven tumour-agnostic activity, followed by confirmatory evidence, should be the principles for tumour-agnostic drug development. 
CONCLUSION: The ESMO Tumour-Agnostic Classifier (ETAC) focuses on the interplay of targeted driver molecular aberration and tumour-specific biology modulating the therapeutic effect of MGTOs. We propose minimum requirements to screen for tumour-agnostic potential (ETAC-S) as part of tumour-agnostic drug development. Definition of ETAC cut-offs is warranted.
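The minimum requirements reported in the RESULTS above can be read as a simple screening check, sketched below in Python. This is only one possible reading of the abstract: the function name, the input layout, and in particular the way the three conditions are combined (a tumour type counts only if it has at least five evaluable patients and an ORR of at least 20%, and at least four such types covering at least two-thirds of those investigated are needed) are assumptions, not the published ETAC-S definition.

```python
def meets_screening_minimum(tumour_types, min_orr_percent=20,
                            min_evaluable=5, min_types=4):
    """Hypothetical screen for tumour-agnostic potential.

    tumour_types: list of (responders, evaluable_patients), one per tumour type.
    Integer arithmetic is used so that borderline values (exactly 20%,
    exactly two-thirds) are not affected by floating-point rounding.
    """
    qualifying = sum(
        1 for responders, evaluable in tumour_types
        if evaluable >= min_evaluable
        and responders * 100 >= min_orr_percent * evaluable
    )
    total = len(tumour_types)
    # At least `min_types` qualifying types AND at least two-thirds of all types.
    return qualifying >= min_types and 3 * qualifying >= 2 * total

# Six investigated tumour types; the first four reach ORR >= 20% with n >= 5.
example = [(2, 5), (3, 10), (1, 5), (4, 8), (0, 6), (1, 12)]
passes = meets_screening_minimum(example)

# Too few qualifying tumour types (one has fewer than five evaluable patients).
too_small = meets_screening_minimum([(2, 5), (3, 10), (1, 4)])
```

The abstract notes that these minimum requirements still need further validation, so any such check would be a first filter rather than a decision rule.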
Subjects
Drug Development, Molecular Targeted Therapy, Neoplasms, Precision Medicine, Humans, Neoplasms/drug therapy, Neoplasms/pathology, Neoplasms/classification, Drug Development/methods, Precision Medicine/methods, Precision Medicine/standards, Molecular Targeted Therapy/methods, Medical Oncology/methods, Medical Oncology/standards, Antineoplastic Agents/therapeutic use, Europe, Tumor Biomarkers/genetics
ABSTRACT
Anticancer peptides (ACPs) are a class of molecules that have gained significant attention in the field of cancer research and therapy. ACPs are short chains of amino acids, the building blocks of proteins, and they possess the ability to selectively target and kill cancer cells. One of the key advantages of ACPs is their ability to selectively target cancer cells while sparing healthy cells to a greater extent. This selectivity is often attributed to differences in the surface properties of cancer cells compared to normal cells. That is why ACPs are being investigated as potential candidates for cancer therapy. ACPs may be used alone or in combination with other treatment modalities like chemotherapy and radiation therapy. While ACPs hold promise as a novel approach to cancer treatment, there are challenges to overcome, including optimizing their stability, improving selectivity, enhancing their delivery to cancer cells, keeping up with the continuously increasing number of peptide sequences, and developing a reliable and precise prediction model. In this work, we propose an efficient transformer-based framework to identify ACPs accurately and reliably. For this purpose, four different transformer models, namely ESM, ProtBERT, BioBERT, and SciBERT, are employed to detect ACPs from amino acid sequences. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on datasets widely used in the literature: two versions of AntiCp2, cACP-DeepGram, and ACP-740. Experimental results show that the proposed model improves classification accuracy compared with previous studies. The proposed framework, ESM, achieves 96.45% accuracy on the AntiCp2 dataset, 97.66% on the cACP-DeepGram dataset, and 88.51% on the ACP-740 dataset, thereby setting a new state of the art. The code of the proposed framework is publicly available on GitHub (https://github.com/mstf-yalcin/acp-esm).
Subjects
Antineoplastic Agents, Peptides, Peptides/therapeutic use, Antineoplastic Agents/therapeutic use, Humans, Neoplasms/drug therapy, Neoplasms/classification
ABSTRACT
With hundreds of copies of rDNA, it is unknown whether they possess sequence variations that form different types of ribosomes. Here, we developed an algorithm for long-read variant calling, termed RGA, which revealed that variations in human rDNA loci are predominantly insertion-deletion (indel) variants. We developed full-length rRNA sequencing (RIBO-RT) and in situ sequencing (SWITCH-seq), which showed that translating ribosomes possess variation in rRNA. Over 1,000 variants are lowly expressed. However, tens of variants are abundant and form distinct rRNA subtypes with different structures near indels as revealed by long-read rRNA structure probing coupled to dimethyl sulfate sequencing. rRNA subtypes show differential expression in endoderm/ectoderm-derived tissues, and in cancer, low-abundance rRNA variants can become highly expressed. Together, this study identifies the diversity of ribosomes at the level of rRNA variants, their chromosomal location, and unique structure as well as the association of ribosome variation with tissue-specific biology and cancer.
Subjects
Ribosomal RNA, Ribosomes, Humans, Ribosomes/metabolism, Ribosomes/genetics, Ribosomal RNA/genetics, Neoplasms/genetics, Neoplasms/classification, Genetic Variation, INDEL Mutation, Algorithms, Ribosomal DNA/genetics
ABSTRACT
In high-dimensional gene expression data, selecting an optimal subset of genes is crucial for achieving high classification accuracy and reliable diagnosis of diseases. This paper proposes a two-stage hybrid model for gene selection based on clustering and a swarm intelligence algorithm to identify the most informative genes with high accuracy. First, a clustering-based multivariate filter approach is performed to explore the interactions between the features and eliminate any redundant or irrelevant ones. Then, by controlling for the problem of premature convergence in the binary Bat algorithm, the optimal gene subset is determined using different classifiers with the Monte Carlo cross-validation data partitioning model. The effectiveness of our proposed framework is evaluated using eight gene expression datasets, by comparison with other recently published algorithms in the literature. Experiments confirm that in seven out of eight datasets, the proposed method can achieve superior results in terms of classification accuracy and gene subset size. In particular, it achieves a classification accuracy of 100% in Lymphoma and Ovarian datasets and above 97.4% in the rest with a minimum number of genes. The results demonstrate that our proposed algorithm has the potential to solve the feature selection problem in different applications with high-dimensional datasets.
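The Monte Carlo cross-validation partitioning mentioned above is, in its generic form, repeated random splitting of the samples into training and test sets, with scores averaged over the repeats. The Python sketch below shows only that generic scheme; the function names, the 30% test fraction, the number of repeats, and the seed are illustrative assumptions, and the classifier and bat-algorithm stages of the described method are not modelled.

```python
import random

def monte_carlo_splits(n_samples, n_repeats=10, test_fraction=0.3, seed=0):
    """Yield (train_indices, test_indices) for repeated random partitions."""
    rng = random.Random(seed)  # local generator so results are reproducible
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])

def monte_carlo_cv(n_samples, score_fn, **split_kwargs):
    """Average a user-supplied score over all random partitions.

    score_fn(train_indices, test_indices) stands in for training a classifier
    on a candidate gene subset and returning its test accuracy.
    """
    scores = [score_fn(train, test)
              for train, test in monte_carlo_splits(n_samples, **split_kwargs)]
    return sum(scores) / len(scores)

splits = list(monte_carlo_splits(10, n_repeats=5, test_fraction=0.3, seed=1))

# Placeholder score: fraction of samples held out, just to exercise the loop.
mean_score = monte_carlo_cv(10, lambda train, test: len(test) / 10, n_repeats=5)
```

Unlike k-fold cross-validation, the test sets of different repeats may overlap, which is why the number of repeats is chosen independently of the split size.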
Subjects
Algorithms, Neoplasms, Humans, Neoplasms/genetics, Neoplasms/classification, Cluster Analysis, Gene Expression Profiling/methods, Genetic Databases, Computational Biology/methods, Female
ABSTRACT
Cancer subtyping refers to categorizing a particular cancer type into distinct subtypes or subgroups based on a range of molecular characteristics, clinical manifestations, histological features, and other relevant factors. The identification of cancer subtypes can significantly enhance precision in clinical practice and facilitate personalized diagnosis and treatment strategies. Recent advancements in the field have witnessed the emergence of numerous network fusion methods aimed at identifying cancer subtypes. The majority of these fusion algorithms, however, solely rely on the fusion network of a single core matrix for the identification of cancer subtypes and fail to comprehensively capture similarity. To tackle this issue, in this study, we propose a novel cancer subtype recognition method, referred to as PCA-constrained multi-core matrix fusion network (PCA-MM-FN). The PCA-MM-FN algorithm initially employs three distinct methods to obtain three core matrices. Subsequently, the obtained core matrices are projected into a shared subspace using principal component analysis, followed by a weighted network fusion. Lastly, spectral clustering is conducted on the fused network. The results obtained from conducting experiments on the mRNA expression, DNA methylation, and miRNA expression of five TCGA datasets and three multi-omics benchmark datasets demonstrate that the proposed PCA-MM-FN approach exhibits superior accuracy in identifying cancer subtypes compared to the existing methods.
Subjects
Algorithms, Computational Biology, DNA Methylation, MicroRNAs, Neoplasms, Principal Component Analysis, Humans, Neoplasms/genetics, Neoplasms/classification, MicroRNAs/genetics, Computational Biology/methods, Cluster Analysis, Messenger RNA/genetics, Messenger RNA/metabolism, Gene Expression Profiling/methods, Gene Expression Profiling/statistics & numerical data, Genetic Databases
ABSTRACT
Quantitative measurement of RNA expression levels through RNA-Seq is an ideal replacement for conventional cancer diagnosis via microscope examination. Currently, cancer-related RNA-Seq studies focus on two aspects: classifying the status and tissue of origin of a sample and discovering marker genes. Existing studies typically identify marker genes by statistically comparing healthy and cancer samples. However, this approach overlooks marker genes with low expression level differences and may be influenced by experimental results. This paper introduces "GENESO," a novel framework for pan-cancer classification and marker gene discovery using the occlusion method in conjunction with deep learning. We first train a baseline deep LSTM neural network capable of distinguishing the origins and statuses of samples using RNA-Seq data. Then, we propose a novel marker gene discovery method called "Symmetrical Occlusion (SO)". It works with the baseline LSTM network, mimicking the "gain of function" and "loss of function" of genes to quantitatively evaluate their importance in pan-cancer classification. After identifying the most important genes, we isolate them to train new neural networks, resulting in higher-performance LSTM models that use only a reduced set of highly relevant genes. The baseline neural network achieves a validation accuracy of 96.59% in pan-cancer classification. With the help of SO, the accuracy of the second network reaches 98.30%, while using 67% fewer genes. Notably, our method excels in identifying marker genes that are not differentially expressed. Moreover, we assessed the feasibility of our method using single-cell RNA-Seq data, employing known marker genes as a validation test.
Subjects
Deep Learning, Neoplasms, Humans, Neoplasms/genetics, Neoplasms/classification, Computer Neural Networks, Tumor Biomarkers/genetics, RNA-Seq/methods
ABSTRACT
Traditional cancer classification based on organ of origin and histology is increasingly at odds with precision oncology. Tumors in different organs can share molecular features, while those in the same organ can be heterogeneous. This disconnect impacts clinical trials, drug development, and patient care. Recent advances in artificial intelligence (AI), particularly machine learning and deep learning, offer promising avenues for reclassifying cancers through comprehensive integration of molecular, histopathological, imaging, and clinical characteristics. AI-driven approaches have the potential to reveal novel cancer subtypes, identify new prognostic variables, and guide more precise treatment strategies for improving patient outcomes.
Subjects
Artificial Intelligence, Neoplasms, Humans, Neoplasms/classification, Neoplasms/pathology, Neoplasms/diagnosis, Neoplasms/diagnostic imaging, Precision Medicine/methods, Deep Learning, Algorithms, Machine Learning, Prognosis
ABSTRACT
Recent research on multi-view clustering algorithms for complex disease subtyping often overlooks aspects like clustering stability and critical assessment of prognostic relevance. Furthermore, current frameworks do not allow for a comparison between data-driven and pathway-driven clustering, highlighting a significant gap in the methodology. We present the COPS R-package, tailored for robust evaluation of single and multi-omics clustering results. COPS features advanced methods, including similarity networks, kernel-based approaches, dimensionality reduction, and pathway knowledge integration. Some of these methods are not accessible through R, and some correspond to new approaches proposed with COPS. Our framework was rigorously applied to multi-omics data across seven cancer types, including breast, prostate, and lung, utilizing mRNA, CNV, miRNA, and DNA methylation data. Unlike previous studies, our approach contrasts data- and knowledge-driven multi-view clustering methods and incorporates cross-fold validation for robustness. Clustering outcomes were assessed using the ARI score, survival analysis via Cox regression models including relevant covariates, and the stability of the results. While survival analysis and gold-standard agreement are standard metrics, they vary considerably across methods and datasets. Therefore, it is essential to assess multi-view clustering methods using multiple criteria, from cluster stability to prognostic relevance, and to provide ways of comparing these metrics simultaneously to select the optimal approach for disease subtype discovery in novel datasets. Emphasizing multi-objective evaluation, we applied the Pareto efficiency concept to gauge the equilibrium of evaluation metrics in each cancer case-study. 
Affinity Network Fusion, Integrative Non-negative Matrix Factorization, and Multiple Kernel K-Means with linear or Pathway Induced Kernels were the most stable and effective in discerning groups with significantly different survival outcomes in several case studies.
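The Pareto efficiency idea mentioned above can be sketched as follows. This is an illustrative example only and does not reproduce the COPS implementation; the function `pareto_front`, the method names, and the metric values are hypothetical, and every metric is assumed to be oriented so that higher is better.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the methods that no other method dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    ]


# Hypothetical (stability, ARI, survival-relevance) scores, higher is better.
scores = {
    "method_a": (0.8, 0.6, 0.9),
    "method_b": (0.7, 0.7, 0.8),
    "method_c": (0.6, 0.5, 0.7),
}
front = pareto_front(scores)
```

Here `method_c` is worse than `method_a` on every metric and is dropped, while `method_a` and `method_b` each win on at least one metric and both remain on the front.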
Subjects
Algorithms, Computational Biology, Neoplasms, Humans, Cluster Analysis, Neoplasms/genetics, Neoplasms/classification, Computational Biology/methods, DNA Methylation/genetics, MicroRNAs/genetics, Genomics/methods, Software, Survival Analysis, Prognosis, Male, Female, Gene Expression Profiling/methods, DNA Copy Number Variations/genetics, Multiomics
ABSTRACT
The volume of data in the cancer field is growing rapidly, and fine-grained classification is in high demand, especially for interdisciplinary and collaborative research. There is thus a need for a higher-resolution multi-label classifier to reduce the burden of screening articles for clinical relevance. This research trains a scalable multi-label classifier that classifies cancer research literature directly at the publication level. First, a corpus was divided into a training set and a testing set at a ratio of 7:3. Second, we compared the performance of classifiers built with "PubMedBERT + TextRNN" and "BioBERT + TextRNN" using ICRP CT. Finally, the classifier was obtained from the optimal combination, "PubMedBERT + TextRNN", with P = 0.952014, R = 0.936696, and F1 = 0.931664. The quantitative comparisons demonstrate that the resulting classifier is suitable for high-resolution classification of cancer literature at the publication level, supporting accurate retrieval and academic statistics.