ABSTRACT
Lung adenocarcinoma (LUAD) is the most common histologic subtype of lung cancer. Early-stage patients have a 30-50% probability of metastatic recurrence after surgical treatment. Here, we propose a new computational framework, Interpretable Biological Pathway Graph Neural Networks (IBPGNET), based on pathway hierarchy relationships to predict LUAD recurrence and explore the internal regulatory mechanisms of LUAD. IBPGNET can integrate different omics data efficiently and provide global interpretability. In addition, our experimental results show that IBPGNET outperforms other classification methods in 5-fold cross-validation. IBPGNET identified PSMC1 and PSMD11 as genes associated with LUAD recurrence, and their expression levels were significantly higher in LUAD cells than in normal cells. The knockdown of PSMC1 and PSMD11 in LUAD cells increased their sensitivity to afatinib and decreased cell migration, invasion and proliferation. In addition, the cells showed significantly lower EGFR expression, indicating that PSMC1 and PSMD11 may mediate therapeutic sensitivity through EGFR expression.
Subjects
Adenocarcinoma of Lung , Lung Neoplasms , Humans , Adenocarcinoma of Lung/genetics , Adenocarcinoma of Lung/metabolism , Lung Neoplasms/metabolism , Tumor Cell Line , Tumor Biomarkers/genetics , Tumor Biomarkers/metabolism , Neoplastic Gene Expression Regulation , ErbB Receptors/genetics , Cell Proliferation
ABSTRACT
Cancer is a complex, high-mortality disease regulated by multiple factors. Accurate cancer subtyping is crucial for formulating personalized treatment plans and improving patient survival rates. Analyzing multi-omics data can provide a comprehensive understanding of the mechanisms that drive cancer progression. However, the high noise levels in omics data make it difficult to capture consistent representations and to integrate their information adequately. This paper proposes a novel variational autoencoder-based deep learning model, named Deeply Integrating Latent Consistent Representations (DILCR). First, multiple independent variational autoencoders and contrastive loss functions are designed to separate noise from omics data and capture latent consistent representations. Subsequently, an Attention Deep Integration Network is proposed to integrate consistent representations across different omics levels effectively. Additionally, we introduce the Improved Deep Embedded Clustering algorithm to make the integrated variables clustering-friendly. The effectiveness of DILCR was evaluated on 10 typical cancer datasets from The Cancer Genome Atlas and compared with 14 state-of-the-art integration methods. The results demonstrate that DILCR effectively captures the consistent representations in omics data and outperforms other integration methods in cancer subtyping. In the Kidney Renal Clear Cell Carcinoma case study, DILCR identified cancer subtypes with marked biological significance and interpretability.
Subjects
Renal Cell Carcinoma , Kidney Neoplasms , Neoplasms , Humans , Multiomics , Neoplasms/genetics , Renal Cell Carcinoma/genetics , Algorithms , Cluster Analysis , Kidney Neoplasms/genetics
ABSTRACT
High-throughput genomic and proteomic approaches allow investigators to quantify genes (or gene products) genome-wide under particular disease conditions, which plays an essential role in the discovery of disease mechanisms. These approaches often generate a large list of genes of interest (GOIs), such as differentially expressed genes/proteins. However, researchers must then perform manual triage and validation to find the most promising, biologically plausible linkages between known disease genes and GOIs (disease signals) for further study. To address this challenge, we propose DDK-Linker, a network-based strategy that facilitates the exploration of disease signals hidden in omics data by linking GOIs to known disease genes. Specifically, it reconstructs gene distances in the protein-protein interaction (PPI) network through six network methods (random walk with restart, DeepWalk, Node2Vec, LINE, HOPE, Laplacian) to discover disease signals in omics data that have shorter distances to disease genes. Furthermore, drawing on a knowledge base that we established, abundant bioinformatics annotations are provided for each candidate disease signal. To assist in omics data interpretation and ease of use, we have developed this strategy into an application that users can access through a website or download as an R package. We believe DDK-Linker will accelerate the exploration of disease genes and drug targets in a variety of omics data, such as genomics, transcriptomics and proteomics data, and provide clues for research on complex disease mechanisms and pharmacology. DDK-Linker is freely accessible at http://ddklinker.ncpsb.org.cn/.
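As an illustration of the first of the six network methods named above, the following is a minimal sketch of a generic random walk with restart on a small undirected graph. It is not DDK-Linker's implementation: the function name, the restart probability of 0.3, and the toy gene labels are assumptions made for this example only. Nodes that receive higher steady-state probability are "closer" to the seed (known disease) genes.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adjacency: dict mapping each node to a list of its neighbours (undirected graph).
    seeds:     set of seed nodes (e.g. known disease genes); all must be in adjacency.
    restart:   probability of jumping back to a seed at each step (assumed value).
    """
    nodes = list(adjacency)
    # Initial distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: a fraction `restart` of the mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated node: keep its walking mass in place.
                nxt[n] += (1.0 - restart) * p[n]
                continue
            # Walking term: spread the remaining mass evenly over neighbours.
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph G1 - G2 - G3 - G4 - G5 with G1 as the only seed (hypothetical labels).
toy_ppi = {
    "G1": ["G2"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "G4"],
    "G4": ["G3", "G5"],
    "G5": ["G4"],
}
scores = random_walk_with_restart(toy_ppi, seeds={"G1"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the ranking follows the path order from the seed outward, which is the sense in which candidate genes with shorter network distance to known disease genes are prioritized.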
Subjects
Proteomics , Software , Proteomics/methods , Genomics/methods , Computational Biology/methods , Protein Interaction Maps
ABSTRACT
Enrichment analysis contextualizes biological features in pathways to facilitate a systematic understanding of high-dimensional data and is widely used in biomedical research. The emerging reporter score-based analysis (RSA) method shows more promising sensitivity, as it relies on P-values instead of raw values of features. However, RSA cannot be directly applied to multi-group and longitudinal experimental designs and is often misused due to the lack of a proper tool. Here, we propose the Generalized Reporter Score-based Analysis (GRSA) method for multi-group and longitudinal omics data. A comparison with other popular enrichment analysis methods demonstrated that GRSA had increased sensitivity across multiple benchmark datasets. We applied GRSA to microbiome, transcriptome and metabolome data and discovered new biological insights in omics studies. Finally, we demonstrated the application of GRSA beyond functional enrichment using a taxonomy database. We implemented GRSA in an R package, ReporterScore, integrating with a powerful visualization module and updatable pathway databases, which is available on the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/ReporterScore). We believe that the ReporterScore package will be a valuable asset for broad biomedical research fields.
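To make concrete how a reporter score relies on P-values rather than raw feature values, here is a minimal sketch of the classic two-group reporter score calculation. It is not the GRSA method or the ReporterScore package API: the function names, the number of random background draws, and the clipping constant are assumptions for illustration. Each P-value is converted to a Z-score with the inverse normal distribution, the Z-scores of a pathway's members are aggregated, and the result is standardized against random feature sets of the same size.

```python
import random
from statistics import NormalDist, mean, stdev

_STD_NORMAL = NormalDist()


def p_to_z(p, eps=1e-12):
    """Convert a P-value to a Z-score; small P-values give large positive Z."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return _STD_NORMAL.inv_cdf(1.0 - p)


def reporter_score(pvalues, pathway, n_background=1000, seed=0):
    """Background-corrected reporter score of one pathway.

    pvalues: dict mapping feature name -> P-value from a differential test.
    pathway: iterable of feature names belonging to the pathway.
    """
    z = {feature: p_to_z(p) for feature, p in pvalues.items()}
    members = [f for f in pathway if f in z]
    k = len(members)
    if k == 0:
        raise ValueError("pathway has no measured features")
    # Aggregate Z-score of the pathway.
    z_pathway = sum(z[f] for f in members) / k ** 0.5
    # Background: aggregate Z-scores of random feature sets of the same size.
    rng = random.Random(seed)
    all_z = list(z.values())
    background = [sum(rng.sample(all_z, k)) / k ** 0.5 for _ in range(n_background)]
    return (z_pathway - mean(background)) / stdev(background)
```

A pathway whose members have consistently small P-values receives a large positive score, while a pathway of unremarkable features scores near or below zero.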
Subjects
Biomedical Research , Microbiota , Benchmarking , Factual Databases , Metabolome
ABSTRACT
RNA Polymerase II (Pol II) transcriptional elongation pausing is an integral part of the dynamic regulation of gene transcription in the genome of metazoans. It plays a pivotal role in many vital biological processes and disease progression. However, experimentally measuring genome-wide Pol II pausing is technically challenging and the precise governing mechanism underlying this process is not fully understood. Here, we develop RP3 (RNA Polymerase II Pausing Prediction), a network regularized logistic regression machine learning method, to predict Pol II pausing events by integrating genome sequence, histone modification, gene expression, chromatin accessibility, and protein-protein interaction data. RP3 can accurately predict Pol II pausing in diverse cellular contexts and unveil the transcription factors that are associated with the Pol II pausing machinery. Furthermore, we utilize a forward feature selection framework to systematically identify the combination of histone modification signals associated with Pol II pausing. RP3 is freely available at https://github.com/AMSSwanglab/RP3.
Subjects
Histone Code , RNA Polymerase II , RNA Polymerase II/metabolism , Humans , Genetic Transcription Elongation , Chromatin/metabolism , Chromatin/genetics , Histones/metabolism , Machine Learning , Animals
ABSTRACT
Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple 'omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was performed on concatenated and separated data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed as well as or better than their non-integrative counterparts. By contrast, DIABLO and the four random forest alternatives outperformed the others across the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail, together with guidelines for future applications.
Subjects
Computational Biology , Humans , Computational Biology/methods , Algorithms , Genomics/methods , Genomics/statistics & numerical data , Multiomics
ABSTRACT
Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.
Subjects
Tumor Biomarkers , Deep Learning , Local Neoplasm Recurrence , Humans , Tumor Biomarkers/metabolism , Tumor Biomarkers/genetics , Local Neoplasm Recurrence/metabolism , Local Neoplasm Recurrence/genetics , Computational Biology/methods , Neoplasms/genetics , Neoplasms/metabolism , Neoplasms/pathology , Genomics/methods , Multiomics
ABSTRACT
Midbrain dopaminergic neurons (mDANs) control voluntary movement, cognition, and reward behavior under physiological conditions and are implicated in human diseases such as Parkinson's disease (PD). Many transcription factors (TFs) controlling human mDAN differentiation during development have been described, but much of the regulatory landscape remains undefined. Using a tyrosine hydroxylase (TH) human iPSC reporter line, we here generate time series transcriptomic and epigenomic profiles of purified mDANs during differentiation. Integrative analysis predicts novel regulators of mDAN differentiation and super-enhancers are used to identify key TFs. We find LBX1, NHLH1 and NR2F1/2 to promote mDAN differentiation and show that overexpression of either LBX1 or NHLH1 can also improve mDAN specification. A more detailed investigation of TF targets reveals that NHLH1 promotes the induction of neuronal miR-124, LBX1 regulates cholesterol biosynthesis, and NR2F1/2 controls neuronal activity.
Subjects
Dopaminergic Neurons , Induced Pluripotent Stem Cells , Humans , Dopaminergic Neurons/metabolism , Multiomics , Mesencephalon , Transcription Factors/genetics , Transcription Factors/metabolism , Induced Pluripotent Stem Cells/metabolism , Cell Differentiation/genetics , Basic Helix-Loop-Helix Transcription Factors/genetics
ABSTRACT
Instrumental variable (IV) analysis has been widely applied in epidemiology to infer causal relationships using observational data. Genetic variants can also be viewed as valid IVs in Mendelian randomization and transcriptome-wide association studies. However, most multivariate IV approaches cannot scale to high-throughput experimental data. Here, we extend our previous work, a hierarchical model that jointly analyzes marginal summary statistics (hJAM), into a scalable framework (SHA-JAM) that can be applied to a large number of intermediates and a large number of correlated genetic variants, situations often encountered in modern experiments leveraging omic technologies. SHA-JAM aims to estimate the conditional effect of high-dimensional risk factors on an outcome by incorporating estimates from association analyses of single-nucleotide polymorphism (SNP)-intermediate or SNP-gene expression as prior information in a hierarchical model. Results from extensive simulation studies demonstrate that SHA-JAM yields a higher area under the receiver operating characteristic curve (AUC), a lower mean-squared error of the estimates, and much faster computation than an existing approach for similar analyses. In two applied examples for prostate cancer, we investigated metabolite and transcriptome associations, respectively, using summary statistics from a GWAS for prostate cancer with more than 140,000 men and high-dimensional publicly available summary data for metabolites and transcriptomes.
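For readers unfamiliar with how summary statistics are used as instruments, the sketch below shows the standard inverse-variance weighted (IVW) estimator for a single exposure. This is a far simpler technique than SHA-JAM or hJAM and is shown only as background: it assumes independent (uncorrelated) variants and negligible error in the SNP-exposure estimates, which are exactly the restrictions the hierarchical model above is designed to relax. The function name and the toy numbers are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted causal estimate.

    beta_exposure: per-SNP effect estimates on the exposure (intermediate).
    beta_outcome:  per-SNP effect estimates on the outcome.
    se_outcome:    standard errors of the SNP-outcome estimates.
    Assumes independent SNPs and no measurement error in beta_exposure.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / (se * se)       # inverse variance of the outcome estimate
        numerator += weight * bx * by
        denominator += weight * bx * bx
    estimate = numerator / denominator
    std_error = (1.0 / denominator) ** 0.5
    return estimate, std_error


# Toy summary statistics in which every SNP implies the same ratio of 0.5.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.1, 0.2, 0.3],
    beta_outcome=[0.05, 0.10, 0.15],
    se_outcome=[0.01, 0.02, 0.01],
)
```

Each SNP contributes a ratio of its outcome effect to its exposure effect, and the estimator is the precision-weighted average of those ratios.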
Subjects
Single Nucleotide Polymorphism , Prostatic Neoplasms , Humans , Prostatic Neoplasms/genetics , Male , Genome-Wide Association Study/methods , Statistical Models , Mendelian Randomization Analysis , ROC Curve , Computer Simulation
ABSTRACT
A wealth of single-cell protocols makes it possible to characterize different molecular layers at unprecedented resolution. Integrating the resulting multimodal single-cell data to find cell-to-cell correspondences remains a challenge. We argue that data integration needs to happen at a meaningful biological level of abstraction and that it is necessary to consider the inherent discrepancies between modalities to strike a balance between biological discovery and noise removal. A survey of current methods reveals that a distinction between technical and biological origins of presumed unwanted variation between datasets is not yet commonly considered. The increasing availability of paired multimodal data will aid the development of improved methods by providing a ground truth on cell-to-cell matches.
ABSTRACT
High-dimensional omics data often contain intricate and multifaceted information, resulting in the coexistence of multiple plausible sample partitions based on different subsets of selected features. Conventional clustering methods typically yield only one clustering solution, limiting their capacity to fully capture all facets of cluster structures in high-dimensional data. To address this challenge, we propose a model-based multifacet clustering (MFClust) method based on a mixture of Gaussian mixture models, where the former mixture achieves facet assignment for gene features and the latter mixture determines cluster assignment of samples. We demonstrate superior facet and cluster assignment accuracy of MFClust through simulation studies. The proposed method is applied to three transcriptomic applications from postmortem brain and lung disease studies. The result captures multifacet clustering structures associated with critical clinical variables and provides intriguing biological insights for further hypothesis generation and discovery.
ABSTRACT
Driven by multi-omics data, some multi-view clustering algorithms have been successfully applied to cancer subtype prediction, aiming to identify subtypes with biometric differences within the same cancer and thereby improve the clinical prognosis of patients and support the design of personalized treatment plans. Because the number of patients in omics data is much smaller than the number of genes, multi-view spectral clustering based on similarity learning has been widely developed. However, these algorithms still suffer from several problems: the clustering results rely heavily on the quality of pre-defined similarity matrices, noise and redundant information in high-dimensional omics data are not handled reasonably, and complementary information between omics data is ignored. This paper proposes a multi-view spectral clustering with latent representation learning (MSCLRL) method to alleviate these problems. First, MSCLRL generates a corresponding low-dimensional latent representation for each omics data type, which effectively retains the unique information of each omics and improves the robustness and accuracy of the similarity matrix. Second, MSCLRL assigns appropriate weights to the obtained latent representations and performs global similarity learning to generate an integrated similarity matrix. Third, the integrated similarity matrix is fed back to update the low-dimensional representation of each omics. Finally, the final integrated similarity matrix is used for clustering. On 10 benchmark multi-omics datasets and 2 separate cancer case studies, the experiments confirmed that the proposed method obtained statistically and biologically meaningful cancer subtypes.
Subjects
Multiomics , Neoplasms , Humans , Algorithms , Neoplasms/genetics , Cluster Analysis
ABSTRACT
Differentiating cancer subtypes is crucial to guide personalized treatment and improve the prognosis for patients. Integrating multi-omics data can offer a comprehensive landscape of cancer biological processes and provide promising avenues for cancer diagnosis and treatment. Taking the heterogeneity of different omics data types into account, we propose a hierarchical multi-kernel learning (hMKL) approach, a novel cancer molecular subtyping method that identifies cancer subtypes by adopting a two-stage kernel learning strategy. In stage 1, we obtain a composite kernel, borrowing the cancer integration via multi-kernel learning (CIMLR) idea, by optimizing the kernel parameters for each individual omics data type. In stage 2, we obtain a final fused kernel through a weighted linear combination of the individual kernels learned in stage 1, using an unsupervised multiple kernel learning method. Based on the final fused kernel, k-means clustering is applied to identify cancer subtypes. Simulation studies show that hMKL outperforms the one-stage CIMLR method when there is data heterogeneity. hMKL can also estimate the number of clusters correctly, which is the key challenge in subtyping. Application to two real data sets shows that hMKL identified meaningful subtypes and key cancer-associated biomarkers. The proposed method provides a novel toolkit for heterogeneous multi-omics data integration and cancer subtype identification.
Subjects
Deep Learning , Neoplasms , Humans , Multiomics , Neoplasms/genetics , Cluster Analysis , Computer Simulation , Tumor Biomarkers/genetics
ABSTRACT
Brain imaging genomics is an emerging interdisciplinary field in which integrated analysis of multimodal medical image-derived phenotypes (IDPs) and multi-omics data bridges the gap between macroscopic brain phenotypes and their cellular and molecular characteristics. This approach aims to better interpret the genetic architecture and molecular mechanisms associated with brain structure, function and clinical outcomes. More recently, the availability of large-scale imaging and multi-omics datasets from the human brain has made it possible to discover common genetic variants contributing to the structural and functional IDPs of the human brain. Through integrative analyses with functional multi-omics data from the human brain, a set of critical genes, functional genomic regions and neuronal cell types have been identified as significantly associated with brain IDPs. Here, we review the recent advances in the methods and applications of multi-omics integration in brain imaging analysis. We highlight the importance of functional genomic datasets in understanding the biological functions of the identified genes and cell types that are associated with brain IDPs. Moreover, we summarize well-known neuroimaging genetics datasets and discuss challenges and future directions in this field.
Subjects
Brain , Genomics , Humans , Genomics/methods , Brain/diagnostic imaging , Brain/metabolism , Phenotype , Neuroimaging/methods
ABSTRACT
Discovering cooperative driver pathways helps researchers study the pathogenesis of cancer. However, most discovery methods focus mainly on genomics data and neglect known pathway information and other related multi-omics data; thus they cannot faithfully decipher the carcinogenic process. We propose CDPMiner (Cooperative Driver Pathways Miner) to discover cooperative driver pathways by multiplex network embedding, which can jointly model relational and attribute information of multi-type molecules. CDPMiner first uses the pathway topology to quantify the weight of genes in different pathways and optimizes the relations between genes and pathways. It then constructs an attributed multiplex network consisting of microRNAs, long noncoding RNAs, genes and pathways, embeds the network through deep joint matrix factorization to mine more essential information for pathway-level analysis, and reconstructs the pathway interaction network. Finally, CDPMiner leverages the reconstructed network and mutation data to define the driver weight between pathways and discover cooperative driver pathways. Experimental results on Breast invasive carcinoma and Stomach adenocarcinoma datasets show that CDPMiner can effectively fuse multi-omics data to discover more driver pathways, which indeed cooperatively trigger cancers and are valuable for carcinogenesis analysis. An ablation study supports CDPMiner's fusion of multi-omics data for a more comprehensive analysis of cancer.
Subjects
Algorithms , Breast Neoplasms , Humans , Female , Genomics/methods , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Mutation , Carcinogenesis/genetics
ABSTRACT
The availability of high-throughput sequencing data creates opportunities to understand human diseases comprehensively, as well as challenges in training machine learning models on such high-dimensional data. Here, we propose a denoised multi-omics integration framework, which contains a distribution-based feature denoising algorithm, Feature Selection with Distribution (FSD), for dimension reduction and a multi-omics integration framework, Attention Multi-Omics Integration (AttentionMOI), to predict cancer prognosis and identify cancer subtypes. We demonstrated that FSD improved model performance with either single-omic or multi-omics data across 15 cancers from The Cancer Genome Atlas (TCGA) program for survival prediction and kidney cancer subtype identification. In addition, our integration framework AttentionMOI outperformed machine learning models and current multi-omics integration algorithms on high-dimensional features. Furthermore, FSD identified features that were associated with cancer prognosis and could be considered as biomarkers.
Subjects
Genomics , Neoplasms , Humans , Genomics/methods , Multiomics , Neoplasms/genetics , Algorithms
ABSTRACT
Omics data from clinical samples are the predominant source for target discovery and drug development. Typically, hundreds or thousands of differentially expressed genes or proteins can be identified from omics data. This scale of possibilities is overwhelming for target discovery and validation using biochemical or cellular experiments. Most of these proteins and genes have no corresponding drugs or even active compounds. Moreover, a proportion of them may have been previously reported as being relevant to the disease of interest. To facilitate translational drug discovery from omics data, we have developed a new classification tool named Omics and Text driven Translational Medicine (OTTM). This tool can markedly narrow the range of proteins or genes that merit further validation via drug availability assessment and literature mining. For the 4489 candidate proteins identified in our previous proteomics study, OTTM recommended 40 FDA-approved or clinical trial drugs. Of these, 15 are available commercially and were tested on hepatocellular carcinoma Hep-G2 cells. Two drugs, tafenoquine succinate (an FDA-approved antimalarial drug targeting CYC1) and branaplam (a Phase 3 clinical drug targeting SMN1 for the treatment of spinal muscular atrophy), showed potent inhibitory activity against Hep-G2 cell viability, suggesting that CYC1 and SMN1 may be potential therapeutic target proteins for hepatocellular carcinoma. In summary, OTTM is an efficient classification tool that can accelerate the discovery of effective drugs and targets using thousands of candidate proteins identified from omics data. The online and local versions of OTTM are available at http://otter-simm.com/ottm.html.
Subjects
Hepatocellular Carcinoma , Liver Neoplasms , Humans , Translational Biomedical Science , Proteomics , Drug Discovery
ABSTRACT
Disrupted protein phosphorylation due to genetic variation is a widespread phenomenon that triggers oncogenic transformation of healthy cells. However, few relevant phosphorylation disruption events have been verified due to limited biological experimental methods. Because of the lack of reliable benchmark datasets, current bioinformatics methods primarily use sequence-based traits to study variant impact on phosphorylation (VIP). Here, we increased the number of experimentally supported VIP events from less than 30 to 740 by manually curating and reanalyzing multi-omics data from 916 patients provided by the Clinical Proteomic Tumor Analysis Consortium. To predict VIP events in cancer cells, we developed VIPpred, a machine learning method characterized by multidimensional features that exhibits robust performance across different cancer types. Our method provided a pan-cancer landscape of VIP events, which are enriched in cancer-related pathways and cancer driver genes. We found that variant-induced increases in phosphorylation events tend to inhibit the protein degradation of oncogenes and promote tumor suppressor protein degradation. Our work provides new insights into phosphorylation-related cancer biology as well as novel avenues for precision therapy.
Subjects
Neoplasms , Proteomics , Humans , Phosphorylation , Oncogenes , Carcinogenesis/genetics , Neoplasms/metabolism
ABSTRACT
Microorganisms regularly encounter environmental perturbations and require metabolic adaptations to survive under the new conditions. Various experimental and computational approaches have been used to study the mechanisms of metabolic adaptation in such conditions. Genome-scale metabolic models (GEMs) are among the most powerful approaches for studying metabolism, providing a platform to study the systems-level adaptations of an organism to different environments, which could otherwise be infeasible experimentally. In this review, we describe the application of GEMs in understanding how microbes reprogram their metabolic system in response to environmental variation. In particular, we detail metabolic model reconstruction approaches, various algorithms and tools for model simulation, the consequences of genetic perturbations, the integration of '-omics' datasets for creating context-specific models, and the application of these models in studying metabolic adaptation to changes in environmental conditions.
Subjects
Algorithms , Computer Simulation
ABSTRACT
SUMMARY: One of the first steps in single-cell omics data analysis is visualization, which allows researchers to see how well separated cell types are from each other. When visualizing multiple datasets at once, data integration/batch correction methods are used to merge the datasets. While needed for downstream analyses, these methods modify the feature space (e.g. gene expression) or PCA space in order to mix cell types between batches as well as possible. This obscures sample-specific features and breaks down local embedding structures that can be seen when a sample is embedded alone. Therefore, to improve visual comparisons between large numbers of samples (e.g. multiple patients, omic modalities, different time points), we introduce Compound-SNE, which performs what we term a soft alignment of samples in embedding space. We show that Compound-SNE is able to align cell types in embedding space across samples while preserving local embedding structures from when samples are embedded independently. AVAILABILITY AND IMPLEMENTATION: Python code for Compound-SNE is available for download at https://github.com/HaghverdiLab/Compound-SNE. SUPPLEMENTARY INFORMATION: Available online. Provides algorithmic details and additional tests.