RESUMEN
MOTIVATION: The identification and understanding of drug-target interactions (DTIs) play a pivotal role in the drug discovery and development process. Sequence representations of drugs and proteins in computational model offer advantages such as their widespread availability, easier input quality control, and reduced computational resource requirements. These make them an efficient and accessible tools for various computational biology and drug discovery applications. Many sequence-based DTI prediction methods have been developed over the years. Despite the advancement in methodology, cold start DTI prediction involving unknown drug or protein remains a challenging task, particularly for sequence-based models. Introducing DTI-LM, a novel framework leveraging advanced pretrained language models, we harness their exceptional context-capturing abilities along with neighborhood information to predict DTIs. DTI-LM is specifically designed to rely solely on sequence representations for drugs and proteins, aiming to bridge the gap between warm start and cold start predictions. RESULTS: Large-scale experiments on four datasets show that DTI-LM can achieve state-of-the-art performance on DTI predictions. Notably, it excels in overcoming the common challenges faced by sequence-based models in cold start predictions for proteins, yielding impressive results. The incorporation of neighborhood information through a graph attention network further enhances prediction accuracy. Nevertheless, a disparity persists between cold start predictions for proteins and drugs. A detailed examination of DTI-LM reveals that language models exhibit contrasting capabilities in capturing similarities between drugs and proteins. AVAILABILITY AND IMPLEMENTATION: Source code is available at: https://github.com/compbiolabucf/DTI-LM.
Asunto(s)
Biología Computacional , Proteínas , Proteínas/química , Proteínas/metabolismo , Biología Computacional/métodos , Descubrimiento de Drogas/métodos , Preparaciones Farmacéuticas/química , Algoritmos , Programas InformáticosRESUMEN
The use of high-throughput omics technologies is becoming increasingly popular in all facets of biomedical science. The mRNA sequencing (RNA-seq) method reports quantitative measures of more than tens of thousands of biological features. It provides a more comprehensive molecular perspective of studied cancer mechanisms compared to traditional approaches. Graph-based learning models have been proposed to learn important hidden representations from gene expression data and network structure to improve cancer outcome prediction, patient stratification, and cell clustering. However, these graph-based methods cannot rank the importance of the different neighbors for a particular sample in the downstream cancer subtype analyses. In this study, we introduce omicsGAT, a graph attention network (GAT) model to integrate graph-based learning with an attention mechanism for RNA-seq data analysis. The multi-head attention mechanism in omicsGAT can more effectively secure information of a particular sample by assigning different attention coefficients to its neighbors. Comprehensive experiments on The Cancer Genome Atlas (TCGA) breast cancer and bladder cancer bulk RNA-seq data and two single-cell RNA-seq datasets validate that (1) the proposed model can effectively integrate neighborhood information of a sample and learn an embedding vector to improve disease phenotype prediction, cancer patient stratification, and cell clustering of the sample and (2) the attention matrix generated from the multi-head attention coefficients provides more useful information compared to the sample correlation-based adjacency matrix. From the results, we can conclude that some neighbors play a more important role than others in cancer subtype analyses of a particular sample based on the attention coefficient.
Asunto(s)
Neoplasias , Análisis por Conglomerados , Genoma , Humanos , Neoplasias/genética , ARN MensajeroRESUMEN
BACKGROUND: The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3'-untranslated region (3'-UTR) of mRNA produces transcripts with shorter or longer 3'-UTR. Often, 3'-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3'-UTR APA is known to modulate translation and provides a mean to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3'-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3'-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3'-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations. METHODS: APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3'-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3'-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events among two biological conditions; (iii) graphical representation of user specific event with 3'-UTR annotation and read coverage on the 3'-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user's manual are freely available at https://github.com/compbiolabucf/APA-Scan . RESULT: APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines DaPars and APAtrap. In simulation APA-Scan significantly improved the accuracy of 3'-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3'-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3'-UTR APA events and improve genome annotation. CONCLUSION: APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3'-UTR APA events. The pipeline integrates both RNA-seq and 3'-end-seq data information and can efficiently identify the significant events with a high-resolution short reads coverage plots.
Asunto(s)
MicroARNs , Poliadenilación , Regiones no Traducidas 3'/genética , Animales , Fibroblastos/metabolismo , Ratones , MicroARNs/metabolismo , Isoformas de Proteínas/genética , Precursores del ARN/metabolismo , ARN Mensajero/genética , ARN Mensajero/metabolismo , RNA-SeqRESUMEN
MOTIVATION: Time-lapse microscopy is a powerful technique that relies on images of live cells cultured ex vivo that are captured at regular intervals of time to describe and quantify their behavior under certain experimental conditions. This imaging method has great potential in advancing the field of precision oncology by quantifying the response of cancer cells to various therapies and identifying the most efficacious treatment for a given patient. Digital image processing algorithms developed so far require high-resolution images involving very few cells originating from homogeneous cell line populations. We propose a novel framework that tracks cancer cells to capture their behavior and quantify cell viability to inform clinical decisions in a high-throughput manner. RESULTS: The brightfield microscopy images a large number of patient-derived cells in an ex vivo reconstruction of the tumor microenvironment treated with 31 drugs for up to 6 days. We developed a robust and user-friendly pipeline CancerCellTracker that detects cells in co-culture, tracks these cells across time and identifies cell death events using changes in cell attributes. We validated our computational pipeline by comparing the timing of cell death estimates by CancerCellTracker from brightfield images and a fluorescent channel featuring ethidium homodimer. We benchmarked our results using a state-of-the-art algorithm implemented in ImageJ and previously published in the literature. We highlighted CancerCellTracker's efficiency in estimating the percentage of live cells in the presence of bone marrow stromal cells. AVAILABILITY AND IMPLEMENTATION: https://github.com/compbiolabucf/CancerCellTracker. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Antineoplásicos , Neoplasias , Humanos , Microscopía/métodos , Imagen de Lapso de Tiempo , Programas Informáticos , Neoplasias/diagnóstico por imagen , Neoplasias/tratamiento farmacológico , Medicina de Precisión , Algoritmos , Microambiente TumoralRESUMEN
MOTIVATION: Accurate disease phenotype prediction plays an important role in the treatment of heterogeneous diseases like cancer in the era of precision medicine. With the advent of high throughput technologies, more comprehensive multi-omics data is now available that can effectively link the genotype to phenotype. However, the interactive relation of multi-omics datasets makes it particularly challenging to incorporate different biological layers to discover the coherent biological signatures and predict phenotypic outcomes. In this study, we introduce omicsGAN, a generative adversarial network model to integrate two omics data and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuse them to generate synthetic data with better predictive signals. RESULTS: Large-scale experiments on The Cancer Genome Atlas breast cancer, lung cancer and ovarian cancer datasets validate that (i) the model can effectively integrate two omics data (e.g. mRNA and microRNA expression data) and their interaction network (e.g. microRNA-mRNA interaction network). The synthetic omics data generated by the proposed model has a better performance on cancer outcome classification and patients survival prediction compared to original omics datasets. (ii) The integrity of the interaction network plays a vital role in the generation of synthetic data with higher predictive quality. Using a random interaction network does not allow the framework to learn meaningful information from the omics datasets; therefore, results in synthetic data with weaker predictive signals. AVAILABILITY AND IMPLEMENTATION: Source code is available at: https://github.com/CompbioLabUCF/omicsGAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Neoplasias Pulmonares , MicroARNs , Humanos , Multiómica , Programas Informáticos , Genoma , MicroARNs/genéticaRESUMEN
Deregulation of gene expression is associated with the pathogenesis of numerous human diseases including cancer. Current data analyses on gene expression are mostly focused on differential gene/transcript expression in big data-driven studies. However, a poor connection to the proteome changes is a widespread problem in current data analyses. This is partly due to the complexity of gene regulatory pathways at the post-transcriptional level. In this study, we overcome these limitations and introduce a graph-based learning model, PTNet, which simulates the microRNAs (miRNAs) that regulate gene expression post-transcriptionally in silico. Our model does not require large-scale proteomics studies to measure the protein expression and can successfully predict the protein levels by considering the miRNA-mRNA interaction network, the mRNA expression, and the miRNA expression. Large-scale experiments on simulations and real cancer high-throughput datasets using PTNet validated that (i) the miRNA-mediated interaction network affects the abundance of corresponding proteins and (ii) the predicted protein expression has a higher correlation with the proteomics data (ground-truth) than the mRNA expression data. The classification performance also shows that the predicted protein expression has an improved prediction power on cancer outcomes compared to the prediction done by the mRNA expression data only or considering both mRNA and miRNA. Availability: PTNet toolbox is available at http://github.com/CompbioLabUCF/PTNet.
Asunto(s)
Redes Reguladoras de Genes , MicroARNs/genética , Neoplasias/genética , Algoritmos , Simulación por Computador , Conjuntos de Datos como Asunto , Humanos , ProteómicaRESUMEN
BACKGROUND: Drug sensitivity prediction and drug responsive biomarker selection on high-throughput genomic data is a critical step in drug discovery. Many computational methods have been developed to serve this purpose including several deep neural network models. However, the modular relations among genomic features have been largely ignored in these methods. To overcome this limitation, the role of the gene co-expression network on drug sensitivity prediction is investigated in this study. METHODS: In this paper, we first introduce a network-based method to identify representative features for drug response prediction by using the gene co-expression network. Then, two graph-based neural network models are proposed and both models integrate gene network information directly into neural network for outcome prediction. Next, we present a large-scale comparative study among the proposed network-based methods, canonical prediction algorithms (i.e., Elastic Net, Random Forest, Partial Least Squares Regression, and Support Vector Regression), and deep neural network models for drug sensitivity prediction. All the source code and processed datasets in this study are available at https://github.com/compbiolabucf/drug-sensitivity-prediction . RESULTS: In the comparison of different feature selection methods and prediction methods on a non-small cell lung cancer (NSCLC) cell line RNA-seq gene expression dataset with 50 different drug treatments, we found that (1) the network-based feature selection method improves the prediction performance compared to Pearson correlation coefficients; (2) Random Forest outperforms all the other canonical prediction algorithms and deep neural network models; (3) the proposed graph-based neural network models show better prediction performance compared to deep neural network model; (4) the prediction performance is drug dependent and it may relate to the drug's mechanism of action. CONCLUSIONS: Network-based feature selection method and prediction models improve the performance of the drug response prediction. The relations between the genomic features are more robust and stable compared to the correlation between each individual genomic feature and the drug response in high dimension and low sample size genomic datasets.