1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38605642

ABSTRACT

MicroRNAs (miRNAs) synergize with various biomolecules in human cells, resulting in diverse functions in regulating a wide range of biological processes. Predicting potential disease-associated miRNAs as valuable biomarkers contributes to the treatment of human diseases. However, few previous methods take a holistic perspective; most concentrate only on isolated miRNA and disease objects, thereby ignoring that human cells are responsible for multiple relationships. In this work, we first constructed a multi-view graph based on the relationships between miRNAs and various biomolecules, and then utilized a graph attention neural network to learn the graph topology features of miRNAs and diseases for each view. Next, we added a second attention mechanism and developed a multi-scale feature fusion module, aiming to determine the optimal fusion results for the multi-view topology features of miRNAs and diseases. In addition, prior attribute knowledge of miRNAs and diseases was added to achieve better prediction results and solve the cold start problem. Finally, the learned miRNA and disease representations were concatenated and fed into a multi-layer perceptron for end-to-end training and prediction of potential miRNA-disease associations. To assess the efficacy of our model (called MUSCLE), we performed 5- and 10-fold cross-validation (CV), which yielded average areas under the ROC curve of 0.966 ± 0.0102 and 0.973 ± 0.0135, respectively, outperforming most current state-of-the-art models. We then examined the impact of crucial parameters on prediction performance and performed ablation experiments on the feature combination and model architecture. Furthermore, case studies on colon cancer, lung cancer and breast cancer also demonstrate the good inductive capability of MUSCLE. Our data and code are freely available at a public GitHub repository: https://github.com/zht-code/MUSCLE.git.
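
The "mean ± spread" AUC figures reported for k-fold cross-validation can be reproduced in outline as follows. This is a minimal sketch, not the MUSCLE code: the function names `roc_auc` and `summarize_folds` are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say how the spread was computed.

```python
import statistics


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative; tied scores count as half a win (rank-statistic definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_results):
    """fold_results: one (labels, scores) pair per held-out fold.
    Returns (mean AUC, sample standard deviation of the fold AUCs)."""
    aucs = [roc_auc(labels, scores) for labels, scores in fold_results]
    return statistics.mean(aucs), statistics.stdev(aucs)
```

Each fold contributes one AUC computed on its held-out associations only; the summary is then taken across folds.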


Subject(s)
Colonic Neoplasms , Lung Neoplasms , MicroRNAs , Humans , Muscles , Learning , MicroRNAs/genetics , Algorithms , Computational Biology
2.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36460622

ABSTRACT

Drug response prediction in cancer cell lines is of great significance in personalized medicine. In this study, we propose GADRP, a cancer drug response prediction model based on graph convolutional networks (GCNs) and autoencoders (AEs). We first use a stacked deep AE to extract low-dimensional representations from cell line features, and then construct a sparse drug cell line pair (DCP) network incorporating drug, cell line, and DCP similarity information. Next, an initial residual and layer attention-based GCN (ILGCN), which can alleviate the over-smoothing problem, is used to learn DCP features. Finally, a fully connected network is employed to make predictions. Benchmarking results demonstrate that GADRP can significantly improve prediction performance on all metrics compared with baselines on five datasets. In particular, experiments on predicting unknown DCP responses, drug-cancer tissue associations, and drug-pathway associations illustrate the predictive power of GADRP. All results highlight the effectiveness of GADRP in predicting drug responses, and its potential value in guiding anti-cancer drug selection.


Subject(s)
Antineoplastic Agents , Neoplasms , Humans , Neoplasms/drug therapy , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , Benchmarking , Cell Line , Learning
3.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35180781

ABSTRACT

Although each individual carries a large number of structural variations in their chromosomes, accurate methods for identifying clinically pathogenic variants are lacking. Here, we propose SVPath, a machine learning-based method to predict the pathogenicity of deletion, insertion and duplication structural variations that occur in exons. We constructed three types of annotation features for each structural variation event in the ClinVar database. First, we treated complex structural variations as multiple consecutive single nucleotide polymorphism events, and annotated them with correlation scores based on single nucleotide substitutions, such as the impact on protein function. Second, we determined which genes the variation occurred in, and constructed gene-based annotation features for each structural variation. Third, we calculated related features based on the transcriptome, such as histone signal, the overlap ratio of variation and genomic element definitions, etc. Finally, we employed a gradient boosting decision tree machine learning method, and used the deletions, insertions and duplications in the ClinVar database to train a structural variation pathogenicity prediction model, SVPath. These structural variations are clearly indicated as pathogenic or benign. Experimental results show that SVPath achieves excellent predictive performance and outperforms existing state-of-the-art tools. SVPath is very promising for evaluating the clinical pathogenicity of structural variants. It can be used in clinical research to predict the clinical significance of new structural variations and of those with unknown pathogenicity, so as to explore the relationship between diseases and structural variations computationally.


Subject(s)
Machine Learning , Polymorphism, Single Nucleotide , Exons , Humans , Molecular Sequence Annotation , Virulence
4.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35443040

ABSTRACT

Target prediction and virtual screening are two powerful tools of computer-aided drug design. Target identification is of great significance for hit discovery, lead optimization, drug repurposing and elucidation of the mechanism. Virtual screening can improve the hit rate of drug screening to shorten the cycle of drug discovery and development. Therefore, target prediction and virtual screening are of great importance for developing highly effective drugs against COVID-19. Here we present D3AI-CoV, a platform for target prediction and virtual screening for the discovery of anti-COVID-19 drugs. The platform is composed of three newly developed deep learning-based models i.e., MultiDTI, MPNNs-CNN and MPNNs-CNN-R models. To compare the predictive performance of D3AI-CoV with other methods, an external test set, named Test-78, was prepared, which consists of 39 newly published independent active compounds and 39 inactive compounds from DrugBank. For target prediction, the areas under the receiver operating characteristic curves (AUCs) of MultiDTI and MPNNs-CNN models are 0.93 and 0.91, respectively, whereas the AUCs of the other reported approaches range from 0.51 to 0.74. For virtual screening, the hit rate of D3AI-CoV is also better than other methods. D3AI-CoV is available for free as a web application at http://www.d3pharma.com/D3Targets-2019-nCoV/D3AI-CoV/index.php, which can serve as a rapid online tool for predicting potential targets for active compounds and for identifying active molecules against a specific target protein for COVID-19 treatment.


Subject(s)
COVID-19 Drug Treatment , Deep Learning , Antiviral Agents/pharmacology , Antiviral Agents/therapeutic use , Drug Repositioning , Humans , Molecular Docking Simulation , SARS-CoV-2
5.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34117734

ABSTRACT

Recent studies have demonstrated that the excessive inflammatory response is an important factor of death in coronavirus disease 2019 (COVID-19) patients. In this study, we propose a deep representation on heterogeneous drug networks, termed DeepR2cov, to discover potential agents for treating the excessive inflammatory response in COVID-19 patients. This work explores the multi-hub characteristic of a heterogeneous drug network integrating eight unique networks. Inspired by the multi-hub characteristic, we design 3 billion special meta paths to train a deep representation model for learning low-dimensional vectors that integrate long-range structure dependency and complex semantic relation among network nodes. Based on the representation vectors and transcriptomics data, we predict 22 drugs that bind to tumor necrosis factor-α or interleukin-6, whose therapeutic associations with the inflammation storm in COVID-19 patients, and molecular binding model are further validated via data from PubMed publications, ongoing clinical trials and a docking program. In addition, the results on five biomedical applications suggest that DeepR2cov significantly outperforms five existing representation approaches. In summary, DeepR2cov is a powerful network representation approach and holds the potential to accelerate treatment of the inflammatory responses in COVID-19 patients. The source code and data can be downloaded from https://github.com/pengsl-lab/DeepR2cov.git.


Subject(s)
COVID-19 Drug Treatment , Drug Repositioning , Inflammation/drug therapy , SARS-CoV-2/drug effects , Anti-Inflammatory Agents/chemistry , Anti-Inflammatory Agents/therapeutic use , COVID-19/complications , COVID-19/genetics , COVID-19/virology , Computational Biology , Deep Learning , Humans , Inflammation/complications , Inflammation/genetics , Inflammation/virology , Neural Networks, Computer , SARS-CoV-2/pathogenicity , Software , Transcriptome/drug effects , Transcriptome/genetics
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34009265

ABSTRACT

Accurate identification of miRNA-disease associations (MDAs) helps to understand the etiology and mechanisms of various diseases. However, experimental methods are costly and time-consuming. Thus, computational methods for predicting MDAs are urgently needed. Based on graph theory, MDA prediction is regarded as a node classification task in the present study. To solve this task, we propose a novel method, MDA-GCNFTG, which predicts MDAs based on Graph Convolutional Networks (GCNs) via graph sampling through the Feature and Topology Graph to improve training efficiency and accuracy. This method models both the potential connections of the feature space and the structural relationships of MDA data. The nodes of the graphs are represented by the disease semantic similarity, miRNA functional similarity and Gaussian interaction profile kernel similarity. Moreover, we considered six tasks simultaneously on the MDA prediction problem for the first time, which ensures that under both balanced and unbalanced sample distributions, MDA-GCNFTG can predict not only new MDAs but also new diseases without known related miRNAs and new miRNAs without known related diseases. The results of 5-fold cross-validation show that MDA-GCNFTG achieves satisfactory performance on all six tasks and is significantly superior to classic machine learning methods and state-of-the-art MDA prediction methods. Moreover, the effectiveness of GCNs via the graph sampling strategy and the feature and topology graph in MDA-GCNFTG has also been demonstrated. More importantly, case studies for two diseases and three miRNAs were conducted and achieved satisfactory performance.
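
The Gaussian interaction profile (GIP) kernel similarity named above has a standard definition in the association-prediction literature, sketched below. The function name `gip_kernel` is hypothetical, and the bandwidth normalization (gamma' = 1 divided by the mean squared profile norm) is the commonly used choice, assumed here rather than taken from the MDA-GCNFTG code.

```python
import math


def gip_kernel(adjacency):
    """adjacency: list of rows; row i is the 0/1 interaction profile of
    entity i (e.g. one miRNA against all diseases).
    Returns K with K[i][j] = exp(-gamma * ||row_i - row_j||^2)."""
    n = len(adjacency)
    mean_sq_norm = sum(sum(v * v for v in row) for row in adjacency) / n
    if mean_sq_norm == 0:
        raise ValueError("no known associations, bandwidth is undefined")
    gamma = 1.0 / mean_sq_norm  # assumed normalization with gamma' = 1
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(adjacency[i], adjacency[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel
```

Passing the miRNA-by-disease association matrix gives miRNA similarities; passing its transpose gives disease similarities.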


Subject(s)
Biomarkers , Computational Biology/methods , Disease Susceptibility , Gene Expression Regulation , MicroRNAs/genetics , Software , Algorithms , Databases, Genetic , Humans , Reproducibility of Results , Workflow
7.
Methods ; 198: 11-18, 2022 02.
Article in English | MEDLINE | ID: mdl-34419588

ABSTRACT

Coronavirus Disease-19 (COVID-19) has led to a global epidemic with high morbidity and mortality. However, there are currently no proven effective drugs targeting COVID-19. Identifying drug-virus associations can not only provide insights into the mechanisms of drug-virus interaction, but also guide and facilitate the screening of compound candidates for antiviral drug discovery. Since conventional experimental methods are time-consuming, laborious and expensive, computational methods to identify potential drug candidates for viruses (e.g., COVID-19) provide an alternative strategy. In this work, we propose a novel framework of Heterogeneous Graph Attention Networks for Drug-Virus Association predictions, named HGATDVA. First, we incorporate multiple sources of biomedical data, e.g., drug chemical information, virus genome sequences and viral protein sequences, to construct abundant features for drugs and viruses. Second, we construct two drug-virus heterogeneous graphs. For each graph, we design a self-enhanced graph attention network (SGAT) to explicitly model the dependency between a node and its local neighbors and derive the graph-specific representations for nodes. Third, we develop a neural network architecture with a tri-aggregator to aggregate the graph-specific representations and generate the final node representations. Extensive experiments were conducted on two datasets, i.e., DrugVirus and MDAD, and the results demonstrated that our model outperformed 7 state-of-the-art methods. A case study on SARS-CoV-2 validated the effectiveness of our model in identifying potential drugs for viruses.


Subject(s)
COVID-19 , Pharmaceutical Preparations , Drug Interactions , Humans , Neural Networks, Computer , SARS-CoV-2
8.
Nucleic Acids Res ; 49(D1): D848-D854, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33010154

ABSTRACT

High-throughput genetic screening based on CRISPR/Cas9 or RNA-interference (RNAi) enables the exploration of genes associated with the phenotype of interest on a large scale. The rapid accumulation of publicly available genetic screening data provides a wealth of knowledge about genotype-to-phenotype relationships and a valuable resource for the systematic analysis of gene functions. Here we present CRISP-view, a comprehensive database of CRISPR/Cas9 and RNAi screening datasets that span multiple phenotypes, including in vitro and in vivo cell proliferation and viability, response to cancer immunotherapy, virus response, protein expression, etc. As of 22 September 2020, CRISP-view had collected 10 321 human samples and 825 mouse samples from 167 papers. All the datasets have been curated, annotated, and processed by a standard MAGeCK-VISPR analysis pipeline with quality control (QC) metrics. We also developed a user-friendly webserver to visualize, explore, and search these datasets. The webserver is freely available at http://crispview.weililab.org.


Subject(s)
CRISPR-Cas Systems/genetics , Databases, Genetic , Genetic Testing , Metadata , Molecular Sequence Annotation , Phenotype , User-Computer Interface
9.
Genome Res ; 29(2): 270-280, 2019 02.
Article in English | MEDLINE | ID: mdl-30670627

ABSTRACT

Aberrant DNA methylation is a distinguishing feature of cancer. Yet, how methylation affects immune surveillance and tumor metastasis remains ambiguous. We introduce a novel method, Guide Positioning Sequencing (GPS), for precisely detecting whole-genome DNA methylation with cytosine coverage as high as 96% and unbiased coverage of GC-rich and repetitive regions. Systematic comparisons of GPS with whole-genome bisulfite sequencing (WGBS) found that methylation difference between gene body and promoter is an effective predictor of gene expression with a correlation coefficient of 0.67 (GPS) versus 0.33 (WGBS). Moreover, Methylation Boundary Shift (MBS) in promoters or enhancers is capable of modulating expression of genes associated with immunity and tumor metabolism. Furthermore, aberrant DNA methylation results in tissue-specific enhancer switching, which is responsible for altering cell identity during liver cancer development. Altogether, we demonstrate that GPS is a powerful tool with improved accuracy and efficiency over WGBS in simultaneously detecting genome-wide DNA methylation and genomic variation. Using GPS, we show that aberrant DNA methylation is associated with altering cell identity and immune surveillance networks, which may contribute to tumorigenesis and metastasis.


Subject(s)
DNA Methylation , Gene Expression Regulation, Neoplastic , Sequence Analysis, DNA/methods , Carcinogenesis/genetics , Cell Line, Tumor , Enhancer Elements, Genetic , Genome, Human , Humans , Immunologic Surveillance/genetics , Liver/metabolism , Liver Neoplasms/genetics , Liver Neoplasms/metabolism , Liver Neoplasms/pathology , Neoplasm Metastasis , Promoter Regions, Genetic , Ribosomal Proteins/genetics
10.
BMC Med ; 20(1): 368, 2022 10 17.
Article in English | MEDLINE | ID: mdl-36244991

ABSTRACT

BACKGROUND: Considering the heterogeneity of tumors, it is a key issue in precision medicine to predict the drug response of each individual. The accumulation of various types of drug informatics and multi-omics data facilitates the development of efficient models for drug response prediction. However, the selection of high-quality data sources and the design of suitable methods remain a challenge. METHODS: In this paper, we design NeRD, a multidimensional data integration model based on the PRISM drug response database, to predict the cellular response of drugs. Four feature extractors, including drug structure extractor (DSE), molecular fingerprint extractor (MFE), miRNA expression extractor (mEE), and copy number extractor (CNE), are designed for different types and dimensions of data. A fully connected network is used to fuse all features and make predictions. RESULTS: Experimental results demonstrate the effective integration of the global and local structural features of drugs, as well as the features of cell lines from different omics data. For all metrics tested on the PRISM database, NeRD surpassed previous approaches. We also verified that NeRD has strong reliability in the prediction results of new samples. Moreover, unlike other algorithms, when the amount of training data was reduced, NeRD maintained stable performance. CONCLUSIONS: NeRD's feature fusion provides a new idea for drug response prediction, which is of great significance for precise cancer treatment.


Subject(s)
MicroRNAs , Neoplasms , Algorithms , Humans , Neoplasms/drug therapy , Neural Networks, Computer , Reproducibility of Results
11.
Bioinformatics ; 37(23): 4485-4492, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34180970

ABSTRACT

MOTIVATION: Predicting new drug-target interactions (DTIs) is an important step in new drug development, in understanding side effects and in drug repositioning. Heterogeneous data sources can provide comprehensive information and different perspectives for drug-target interaction prediction. Thus, many computational methods rely on heterogeneous networks. Most of them use graph-related algorithms to characterize nodes in heterogeneous networks for predicting new DTIs. However, these methods can only make predictions within known heterogeneous network datasets, and cannot support the prediction of new chemical entities outside the heterogeneous network, which hinders further drug discovery and development. RESULTS: To solve this problem, we proposed a multi-modal DTI prediction model named 'MultiDTI', which uses our proposed joint learning framework based on heterogeneous networks. It combines the interaction or association information of the heterogeneous network and the drug/target sequence information, and maps the drug, target, side effect and disease nodes in the heterogeneous network into a common space. In this way, 'MultiDTI' can map a new chemical entity to this learned common space based on its chemical structure, thereby bridging the gap between new chemical entities and the known heterogeneous network. Our model has strong predictive performance: with 10-fold cross validation, the area under the receiver operating characteristic curve is 0.961 and the area under the precision recall curve is 0.947. In addition, some predicted new DTIs have been confirmed by the ChEMBL database. Our results indicate that 'MultiDTI' is a powerful and practical tool for predicting new DTIs, which can promote drug discovery and drug repositioning. AVAILABILITY AND IMPLEMENTATION: Python codes and dataset are available at https://github.com/Deshan-Zhou/MultiDTI/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug Development , Drug-Related Side Effects and Adverse Reactions , Humans , Drug Repositioning , Algorithms , Drug Discovery
12.
Bioinformatics ; 37(24): 4793-4800, 2021 12 11.
Article in English | MEDLINE | ID: mdl-34329382

ABSTRACT

MOTIVATION: Predicting entity relationships can greatly benefit important biomedical problems. Recently, a large number of biomedical heterogeneous networks (BioHNs) have been generated, offering opportunities to develop network-based learning approaches for predicting relationships among entities. However, current research has barely explored BioHN-based self-supervised representation learning methods, and existing methods have difficulty capturing local- and global-level association information among entities simultaneously. RESULTS: In this study, we propose a BioHN-based self-supervised representation learning approach for entity relationship predictions, termed BioERP. A self-supervised meta path detection mechanism is proposed to train a deep Transformer encoder model that can capture the global structure and semantic features in BioHNs. Meanwhile, a biomedical entity mask learning strategy is designed to reflect local associations of vertices. Finally, the representations from different task models are concatenated to generate two-level representation vectors for predicting relationships among entities. The results on eight datasets show that BioERP outperforms 30 state-of-the-art methods. In particular, BioERP achieves AUC and AUPR values close to 1 on drug-target interaction predictions. In summary, BioERP is a promising bio-entity relationship prediction approach. AVAILABILITY AND IMPLEMENTATION: Source code and data can be downloaded from https://github.com/pengsl-lab/BioERP.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Software , Semantics
13.
BMC Biol ; 19(1): 115, 2021 06 03.
Article in English | MEDLINE | ID: mdl-34082735

ABSTRACT

BACKGROUND: Many genome features that could help unravel the often complex post-speciation evolution of closely related species are obscured because they lie in chromosomal regions that are difficult to characterize accurately using standard genome analysis methods, including centromeres and repeat regions. RESULTS: Here, we analyze the genome evolution and diversification of two recently diverged sister cotton species based on nanopore long-read sequence assemblies and Hi-C 3D genome data. Although D genomes are conserved in gene content, they have diversified in gene order, gene structure, gene family diversification, 3D chromatin structure, long-range regulation, and stress-related traits. Inversions predominate among D genome rearrangements. Our results support roles for 5mC and 6mA in gene activation, and 3D chromatin analysis showed that diversification in proximal-vs-distal regulatory-region interactions shapes the regulation of defense-related-gene expression. Using a newly developed method, we accurately positioned cotton centromeres and found that these regions have undergone clearly more rapid evolution relative to chromosome arms. We also discovered a cotton-specific LTR class that clarifies evolutionary trajectories among diverse cotton species and identified genetic networks underlying the Verticillium tolerance of Gossypium thurberi (e.g., SA signaling) and salt-stress tolerance of Gossypium davidsonii (e.g., ethylene biosynthesis). Finally, overexpression of G. thurberi genes in upland cotton demonstrated how wild cottons can be exploited for crop improvement. CONCLUSIONS: Our study substantially deepens understanding of how centromeres have developed and evolutionarily impacted the divergence among closely related cotton species, and reveals genes and 3D genome structures that can guide basic investigations and applied efforts to improve crops.


Subject(s)
Centromere , Gossypium , Centromere/genetics , Chromatin , Gene Expression Regulation, Plant , Genome, Plant/genetics , Gossypium/genetics , Phylogeny
14.
BMC Bioinformatics ; 22(1): 344, 2021 Jun 24.
Article in English | MEDLINE | ID: mdl-34167459

ABSTRACT

BACKGROUND: VISPR is an interactive visualization and analysis framework for CRISPR screening experiments. However, it only supports the output of MAGeCK, and requires installation and manual configuration. Furthermore, VISPR is designed to run on a single computer, and data sharing between collaborators is challenging. RESULTS: To make the tool easily accessible to the community, we present VISPR-online, a web-based general application allowing users to visualize, explore, and share CRISPR screening data online with a few simple steps. VISPR-online provides an exploration of screening results and visualization of read count changes. Apart from MAGeCK, VISPR-online supports two more popular CRISPR screening analysis tools: BAGEL and JACKS. It provides an interactive environment for exploring gene essentiality, viewing guide RNA (gRNA) locations, and allowing users to resume and share screening results. CONCLUSIONS: VISPR-online allows users to visualize, explore and share CRISPR screening data online. It is freely available at http://vispr-online.weililab.org, while the source code is available at https://github.com/lemoncyb/VISPR-online.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Software , Internet , RNA, Guide, Kinetoplastida , Research
15.
Entropy (Basel) ; 23(6)2021 Jun 16.
Article in English | MEDLINE | ID: mdl-34208626

ABSTRACT

Atmospheric continuous-variable quantum key distribution (ACVQKD) has been proven to be secure theoretically under the assumption that the signal source is well protected by the sender so that it cannot be compromised. However, this assumption is quite impractical in realistic quantum communication systems. In this work, we investigate a practical situation in which the signal source is no longer protected by the legitimate parties, but is exposed to the untrusted atmospheric channel. We show that the performance of ACVQKD is reduced by removing the assumption, especially when the untrusted source is placed in the middle of the channel. To improve the performance of ACVQKD with the untrusted source, a non-Gaussian operation, called photon subtraction, is subsequently introduced. Numerical analysis shows that the performance of ACVQKD with an untrusted source can be improved by properly adopting the photon subtraction operation. Moreover, a special situation where the untrusted source is located in the middle of the atmospheric channel is also considered. Under direct reconciliation, we find that its performance can be significantly improved when the photon subtraction operation is performed by the sender.

16.
BMC Genomics ; 21(Suppl 1): 872, 2020 Mar 05.
Article in English | MEDLINE | ID: mdl-32138651

ABSTRACT

BACKGROUND: The Type II clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein (Cas) system is a powerful genome editing technology that is increasingly popular in gene function analysis. In CRISPR/Cas, an RNA guides the Cas nuclease to the target site to perform DNA modification. RESULTS: The performance of CRISPR/Cas depends on a well-designed single guide RNA (sgRNA). However, the off-target effect of sgRNA leads to undesired mutations in the genome and limits the use of CRISPR/Cas. Here, we present OffScan, a universal and fast CRISPR off-target detection tool. CONCLUSIONS: OffScan is not limited by the number of mismatches and allows a custom protospacer-adjacent motif (PAM), which the Cas protein requires at the target site. In addition, OffScan adopts the FM-index, which efficiently improves query speed and reduces memory consumption.
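
To make the off-target search problem concrete, the sketch below does a naive linear scan for sites that lie next to a matching PAM and differ from the guide by at most a chosen number of mismatches. It is deliberately not OffScan's approach: OffScan uses an FM-index, whereas this brute-force scan is only an illustration. The function names, the forward-strand-only search and the 3' PAM placement are assumptions made for the example.

```python
# IUPAC nucleotide codes that may appear in a user-supplied PAM pattern.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pam_matches(site, pam):
    """True if every base of `site` is allowed by the IUPAC code in `pam`."""
    return len(site) == len(pam) and all(
        base in IUPAC[code] for base, code in zip(site, pam))


def find_off_targets(genome, guide, pam="NGG", max_mismatches=3):
    """Scan the forward strand only (an assumption of this sketch) and
    return (start, protospacer, pam_site, mismatches) for each hit."""
    genome, guide, pam = genome.upper(), guide.upper(), pam.upper()
    k = len(guide)
    hits = []
    for start in range(len(genome) - k - len(pam) + 1):
        pam_site = genome[start + k:start + k + len(pam)]
        if not pam_matches(pam_site, pam):
            continue  # no valid PAM immediately 3' of this window
        protospacer = genome[start:start + k]
        mismatches = sum(1 for a, b in zip(protospacer, guide) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, protospacer, pam_site, mismatches))
    return hits
```

The scan costs time proportional to genome length times guide length per query, which is the cost an index-based tool is designed to avoid.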


Subject(s)
CRISPR-Cas Systems , Computational Biology/methods , Gene Editing/methods , RNA, Guide, Kinetoplastida/genetics , Algorithms , Animals , Caenorhabditis elegans/genetics , Clustered Regularly Interspaced Short Palindromic Repeats , Endonucleases/metabolism , Humans , Mice , Mutation , Zebrafish/genetics
17.
Bioinformatics ; 35(3): 389-397, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30010784

ABSTRACT

Motivation: Functional somatic mutations within coding amino acid sequences confer a growth advantage in the pathogenic process. Most existing methods for identifying cancer-related mutations focus on the single amino acid or the entire gene level. However, gain-of-function mutations often cluster in specific protein regions instead of existing independently in the amino acid sequences. Some approaches for identifying mutation clusters from mutation density along the amino acid chain have been proposed recently, but their performance in identifying mutation clusters remains to be improved. Results: Here we present a Data-adaptive Mutation Clustering Method (DMCM), in which kernel density estimation (KDE) with a data-adaptive bandwidth is applied to estimate the mutation density, to find variable clusters with different lengths on amino acid sequences. We apply this approach to the mutation data of 571 genes in over twenty cancer types from The Cancer Genome Atlas (TCGA). We compare DMCM with M2C, OncodriveCLUST and Pfam Domain and find that DMCM tends to identify more significant clusters. The cross-validation analysis shows that DMCM is robust, and cluster cancer type enrichment analysis shows that specific cancer types are enriched for specific mutation clusters. Availability and implementation: DMCM is written in Python and analysis methods of DMCM are written in R. They are all released online, available through https://github.com/XinguoLu/DMCM. Supplementary information: Supplementary data are available at Bioinformatics online.
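
One way to picture a KDE with a data-adaptive bandwidth is the two-pass scheme below: a fixed-bandwidth pilot estimate, followed by per-point bandwidths that shrink in dense regions and widen in sparse ones. The abstract does not state DMCM's bandwidth rule, so the square-root law used here (Abramson's rule) is a stand-in assumption, and all function names are hypothetical.

```python
import math


def gaussian(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def fixed_kde(x, points, h):
    """Ordinary KDE at x with one global bandwidth h."""
    return sum(gaussian((x - p) / h) for p in points) / (len(points) * h)


def adaptive_kde(points, h):
    """points: mutated amino acid positions (repeats allowed).
    Returns (density function, list of per-point bandwidths)."""
    # Pass 1: pilot density at every data point (always > 0).
    pilot = [fixed_kde(p, points, h) for p in points]
    # Geometric mean of the pilot values normalizes the scaling factors.
    geo_mean = math.exp(sum(math.log(v) for v in pilot) / len(pilot))
    # Pass 2: assumed square-root law, dense region -> narrower kernel.
    local_h = [h * math.sqrt(geo_mean / v) for v in pilot]

    def density(x):
        return sum(gaussian((x - p) / hi) / hi
                   for p, hi in zip(points, local_h)) / len(points)

    return density, local_h
```

Candidate clusters could then be read off as contiguous position ranges where the estimated density exceeds a chosen threshold; that thresholding step is likewise an assumption and is left out of the sketch.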


Subject(s)
Cluster Analysis , Neoplasms/genetics , Humans , Mutation
18.
Am J Hum Genet ; 98(2): 256-74, 2016 Feb 04.
Article in English | MEDLINE | ID: mdl-26833333

ABSTRACT

Comprehensive identification of somatic structural variations (SVs) and understanding their mutational mechanisms in cancer might contribute to understanding biological differences and help to identify new therapeutic targets. Unfortunately, the characterization of complex SVs across the whole genome and the mutational mechanisms underlying esophageal squamous cell carcinoma (ESCC) remain largely unclear. To define a comprehensive catalog of somatic SVs, affected target genes, and their underlying mechanisms in ESCC, we re-analyzed whole-genome sequencing (WGS) data from 31 ESCCs using the Meerkat algorithm to predict somatic SVs and Patchwork to determine copy-number changes. We found deletions and translocations with NHEJ and alt-EJ signatures as the dominant SV types, and 16% of deletions were complex deletions. SVs frequently led to disruption of cancer-associated genes (e.g., CDKN2A and NOTCH1) with different mutational mechanisms. Moreover, chromothripsis, kataegis, and breakage-fusion-bridge (BFB) were identified as contributing to locally mis-arranged chromosomes that occurred in 55% of ESCCs. These genomic catastrophes led to amplification of oncogenes through chromothripsis-derived double-minute chromosome formation (e.g., FGFR1 and LETM2) or BFB-affected chromosomes (e.g., CCND1, EGFR, ERBB2, MMPs, and MYC), with approximately 30% of ESCCs harboring BFB-derived CCND1 amplification. Furthermore, analyses of copy-number alterations reveal a high frequency of whole-genome duplication (WGD) and recurrent focal amplification of CDCA7 that might act as a potential oncogene in ESCC. Our findings reveal molecular defects such as chromothripsis and BFB in malignant transformation of ESCCs and demonstrate diverse models of SVs-derived target genes in ESCCs. These genome-wide SV profiles and their underlying mechanisms provide preventive, diagnostic, and therapeutic implications for ESCCs.


Subject(s)
Carcinoma, Squamous Cell/genetics , Esophageal Neoplasms/genetics , Genetic Association Studies/methods , Genetic Variation , Cell Line , Cyclin D1/genetics , DNA Copy Number Variations , ErbB Receptors/genetics , Esophageal Squamous Cell Carcinoma , Gene Deletion , Gene Rearrangement , Genes, p16 , Genome, Human , Genomics , Humans , In Situ Hybridization, Fluorescence , Receptor, ErbB-2/genetics , Receptor, Fibroblast Growth Factor, Type 1/genetics , Receptor, Notch1/genetics , Reproducibility of Results , Sequence Analysis, RNA , Translocation, Genetic
19.
Nucleic Acids Res ; 45(17): e155, 2017 Sep 29.
Article in English | MEDLINE | ID: mdl-28973463

ABSTRACT

A growing number of studies use gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, because of the enormous computational overhead of its significance-estimation and multiple-hypothesis-testing steps, its scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, yielding a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. With further parallelization, paraGSEA achieves a near-linear speed-up on both workstations and clusters and scales well to large-scale datasets. The analysis time for the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA.
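To make the O(mn) versus O(m+n) distinction concrete, here is a minimal Python sketch of the unweighted (Kolmogorov-Smirnov style) GSEA running-sum statistic. It is not the paraGSEA implementation, and the abstract does not say how its optimization is achieved. The function name `enrichment_score` and the use of a hash set are assumptions made for illustration: testing each of the n ranked genes against a gene set held as a list of length m costs O(mn), whereas building a set once costs O(m) and the single pass over the profile costs O(n).

```python
from typing import Iterable, Sequence


def enrichment_score(ranked_genes: Sequence[str], gene_set: Iterable[str]) -> float:
    """Unweighted GSEA running-sum statistic (illustrative, hypothetical helper).

    ranked_genes: genes ordered by expression change (length n).
    gene_set:     genes in the set being tested (length m).
    """
    members = set(gene_set)  # O(m) to build, O(1) expected lookup afterwards
    n = len(ranked_genes)
    hits = sum(1 for gene in ranked_genes if gene in members)
    if hits == 0 or hits == n:
        return 0.0  # the statistic is undefined without both hits and misses

    hit_step = 1.0 / hits          # increment when a ranked gene is in the set
    miss_step = 1.0 / (n - hits)   # decrement when it is not

    running = 0.0
    best = 0.0
    for gene in ranked_genes:      # single O(n) pass over the profile
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running         # keep the maximum deviation from zero
    return best
```

A gene set concentrated at the top of the ranking gives a score near +1, and one concentrated at the bottom gives a score near -1. Significance estimation by permutation, the expensive step the abstract refers to, would repeat this calculation many times.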


Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/statistics & numerical data , Transcriptome , Benchmarking , Databases, Genetic , Datasets as Topic , Humans , Knowledge Bases , Oligonucleotide Array Sequence Analysis
20.
BMC Bioinformatics ; 19(Suppl 4): 98, 2018 May 08.
Article in English | MEDLINE | ID: mdl-29745832

ABSTRACT

BACKGROUND: Frequent subgraph mining is a significant problem in many practical domains. Solutions to this problem can be applied, in particular, to large-scale drug-molecule or biological libraries to find drugs or core biological structures rapidly and to predict the toxicity of unknown compounds. The main challenge is efficiency, as (i) it is computationally intensive to test for graph isomorphisms, and (ii) the graph collection to be mined and the mining results can be very large. Existing solutions often require days to derive mining results from biological networks, even with a relatively low support threshold. Moreover, the complete mining results cannot always be stored in the memory of a single node. RESULTS: In this paper, we implement cmFSM, a parallel acceleration tool for a classical frequent subgraph mining algorithm. The core idea is to parallelize extension tasks so as to reduce computation time. We also employ a multi-node strategy to address memory constraints. The parallel optimization of cmFSM is carried out on three levels: fine-grained OpenMP parallelization on a single node, multi-node multi-process parallel acceleration, and CPU-MIC collaborative parallel optimization. CONCLUSIONS: Evaluation results show that cmFSM clearly outperforms existing state-of-the-art miners even with only modest parallel computing resources. This means that cmFSM provides a practical solution to the frequent subgraph mining problem when the number of mining results is huge. Specifically, our solution is up to one order of magnitude faster than the best CPU-based approach on a single node and shows promising scalability for massive mining tasks in the multi-node scenario. Source code is available at: https://github.com/ysycloud/cmFSM.
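The notion of a support threshold can be illustrated with a minimal Python sketch that counts only the simplest patterns, single labeled edges, across a graph collection. This is not cmFSM: it involves no pattern extension, no isomorphism testing, and no parallelism. The function name `frequent_edges` and the edge representation (a pair of endpoint labels, treated as undirected) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # labels of the two endpoints; treated as undirected


def frequent_edges(graphs: List[List[Edge]], min_support: int) -> Dict[Edge, int]:
    """Return single-edge patterns that occur in at least min_support graphs."""
    support: Counter = Counter()
    for edges in graphs:
        # Normalize direction and deduplicate, so that each graph
        # contributes at most once to the support of a pattern.
        patterns = {tuple(sorted(edge)) for edge in edges}
        support.update(patterns)
    return {p: count for p, count in support.items() if count >= min_support}


# Three toy "molecules" described by the atom labels of their bonds.
graphs = [
    [("C", "O"), ("C", "C")],
    [("O", "C")],
    [("C", "N")],
]
print(frequent_edges(graphs, min_support=2))
```

In a full miner, each frequent pattern would then be extended by one edge at a time and its support recounted. That extension step is the work the abstract describes as being parallelized.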


Subject(s)
Algorithms , Data Mining , Drug Evaluation, Preclinical , Software , Databases as Topic