Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 67
Filtrar
Mais filtros

Bases de dados
Tipo de documento
País de afiliação
Intervalo de ano de publicação
1.
Nucleic Acids Res ; 52(3): e16, 2024 Feb 09.
Artigo em Inglês | MEDLINE | ID: mdl-38088228

RESUMO

Functional molecular module (i.e., gene-miRNA co-modules and gene-miRNA-lncRNA triple-layer modules) analysis can dissect complex regulations underlying etiology or phenotypes. However, current module detection methods lack an appropriate usage and effective model of multi-omics data and cross-layer regulations of heterogeneous molecules, causing the loss of critical genetic information and corrupting the detection performance. In this study, we propose a heterogeneous network co-clustering framework (HetFCM) to detect functional co-modules. HetFCM introduces an attributed heterogeneous network to jointly model interplays and multi-type attributes of different molecules, and applies multiple variational graph autoencoders on the network to generate cross-layer association matrices, then it performs adaptive weighted co-clustering on association matrices and attribute data to identify co-modules of heterogeneous molecules. Empirical study on Human and Maize datasets reveals that HetFCM can find out co-modules characterized with denser topology and more significant functions, which are associated with human breast cancer (subtypes) and maize phenotypes (i.e., lipid storage, drought tolerance and oil content). HetFCM is a useful tool to detect co-modules and can be applied to multi-layer functional modules, yielding novel insights for analyzing molecular mechanisms. We also developed a user-friendly module detection and analysis tool and shared it at http://www.sdu-idea.cn/FMDTool.


Assuntos
Neoplasias da Mama , Análise por Conglomerados , Redes Reguladoras de Genes , Zea mays , Feminino , Humanos , Neoplasias da Mama/genética , Perfilação da Expressão Gênica/métodos , MicroRNAs/genética , Fenótipo , Zea mays/genética
2.
Brief Bioinform ; 24(2)2023 03 19.
Artigo em Inglês | MEDLINE | ID: mdl-36733262

RESUMO

Single-cell RNA sequencing (scRNA-seq) data are typically with a large number of missing values, which often results in the loss of critical gene signaling information and seriously limit the downstream analysis. Deep learning-based imputation methods often can better handle scRNA-seq data than shallow ones, but most of them do not consider the inherent relations between genes, and the expression of a gene is often regulated by other genes. Therefore, it is essential to impute scRNA-seq data by considering the regional gene-to-gene relations. We propose a novel model (named scGGAN) to impute scRNA-seq data that learns the gene-to-gene relations by Graph Convolutional Networks (GCN) and global scRNA-seq data distribution by Generative Adversarial Networks (GAN). scGGAN first leverages single-cell and bulk genomics data to explore inherent relations between genes and builds a more compact gene relation network to jointly capture the homogeneous and heterogeneous information. Then, it constructs a GCN-based GAN model to integrate the scRNA-seq, gene sequencing data and gene relation network for generating scRNA-seq data, and trains the model through adversarial learning. Finally, it utilizes data generated by the trained GCN-based GAN model to impute scRNA-seq data. Experiments on simulated and real scRNA-seq datasets show that scGGAN can effectively identify dropout events, recover the biologically meaningful expressions, determine subcellular states and types, improve the differential expression analysis and temporal dynamics analysis. Ablation experiments confirm that both the gene relation network and gene sequence data help the imputation of scRNA-seq data.


Assuntos
Análise da Expressão Gênica de Célula Única , Software , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodos , Genômica , Perfilação da Expressão Gênica
3.
Brief Bioinform ; 24(3)2023 05 19.
Artigo em Inglês | MEDLINE | ID: mdl-37000166

RESUMO

Cooperative driver pathways discovery helps researchers to study the pathogenesis of cancer. However, most discovery methods mainly focus on genomics data, and neglect the known pathway information and other related multi-omics data; thus they cannot faithfully decipher the carcinogenic process. We propose CDPMiner (Cooperative Driver Pathways Miner) to discover cooperative driver pathways by multiplex network embedding, which can jointly model relational and attribute information of multi-type molecules. CDPMiner first uses the pathway topology to quantify the weight of genes in different pathways, and optimizes the relations between genes and pathways. Then it constructs an attributed multiplex network consisting of micro RNAs, long noncoding RNAs, genes and pathways, embeds the network through deep joint matrix factorization to mine more essential information for pathway-level analysis and reconstructs the pathway interaction network. Finally, CDPMiner leverages the reconstructed network and mutation data to define the driver weight between pathways to discover cooperative driver pathways. Experimental results on Breast invasive carcinoma and Stomach adenocarcinoma datasets show that CDPMiner can effectively fuse multi-omics data to discover more driver pathways, which indeed cooperatively trigger cancers and are valuable for carcinogenesis analysis. Ablation study justifies CDPMiner for a more comprehensive analysis of cancer by fusing multi-omics data.


Assuntos
Algoritmos , Neoplasias da Mama , Humanos , Feminino , Genômica/métodos , Neoplasias da Mama/genética , Neoplasias da Mama/metabolismo , Mutação , Carcinogênese/genética
4.
Brief Bioinform ; 23(4)2022 07 18.
Artigo em Inglês | MEDLINE | ID: mdl-35696639

RESUMO

With the development of high-throughput genotyping technology, single nucleotide polymorphism (SNP)-SNP interactions (SSIs) detection has become an essential way for understanding disease susceptibility. Various methods have been proposed to detect SSIs. However, given the disease complexity and bias of individual SSI detectors, these single-detector-based methods are generally unscalable for real genome-wide data and with unfavorable results. We propose a novel ensemble learning-based approach (ELSSI) that can significantly reduce the bias of individual detectors and their computational load. ELSSI randomly divides SNPs into different subsets and evaluates them by multi-type detectors in parallel. Particularly, ELSSI introduces a four-stage pipeline (generate, score, switch and filter) to iteratively generate new SNP combination subsets from SNP subsets, score the combination subset by individual detectors, switch high-score combinations to other detectors for re-scoring, then filter out combinations with low scores. This pipeline makes ELSSI able to detect high-order SSIs from large genome-wide datasets. Experimental results on various simulated and real genome-wide datasets show the superior efficacy of ELSSI to state-of-the-art methods in detecting SSIs, especially for high-order ones. ELSSI is applicable with moderate PCs on the Internet and flexible to assemble new detectors. The code of ELSSI is available at https://www.sdu-idea.cn/codes.php?name=ELSSI.


Assuntos
Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Genoma , Estudo de Associação Genômica Ampla/métodos
5.
Bioinformatics ; 39(4)2023 04 03.
Artigo em Inglês | MEDLINE | ID: mdl-36929930

RESUMO

MOTIVATION: The integration of single-cell multi-omics data can uncover the underlying regulatory basis of diverse cell types and states. However, contemporary methods disregard the omics individuality, and the high noise, sparsity, and heterogeneity of single-cell data also impact the fusion effect. Furthermore, available single-cell clustering methods only focus on the cell type clustering, which cannot mine the alternative clustering to comprehensively analyze cells. RESULTS: We propose a single-cell data fusion based multiple clustering (scMCs) approach that can jointly model single-cell transcriptomics and epigenetic data, and explore multiple different clusterings. scMCs first mines the omics-specific and cross-omics consistent representations, then fuses them into a co-embedding representation, which can dissect cellular heterogeneity and impute data. To discover the potential alternative clustering embedded in multi-omics, scMCs projects the co-embedding representation into different salient subspaces. Meanwhile, it reduces the redundancy between subspaces to enhance the diversity of alternative clusterings and optimizes the cluster centers in each subspace to boost the quality of corresponding clustering. Unlike single clustering, these alternative clusterings provide additional perspectives for understanding complex genetic information, such as cell types and states. Experimental results show that scMCs can effectively identify subcellular types, impute dropout events, and uncover diverse cell characteristics by giving different but meaningful clusterings. AVAILABILITY AND IMPLEMENTATION: The code is available at www.sdu-idea.cn/codes.php?name=scMCs.


Assuntos
Algoritmos , Multiômica , Epigenômica , Perfilação da Expressão Gênica , Análise por Conglomerados
6.
Brief Bioinform ; 22(2): 1984-1999, 2021 03 22.
Artigo em Inglês | MEDLINE | ID: mdl-32103253

RESUMO

Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focus on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer. In this study, we proposed a new approach called CDPathway to discover cooperative driver pathways. First, CDPathway introduces a driver impact quantification function to quantify the driver weight of each gene. CDPathway assumes that genes with larger weights contribute more to the occurrence of the target disease and identifies them as candidate driver genes. Next, it constructs a heterogeneous network composed of genes, miRNAs and pathways nodes based on the known intra(inter)-relations between them and assigns the quantified driver weights to gene-pathway and gene-miRNA relational edges. To transfer driver impacts of genes to pathway interaction pairs, CDPathway collaboratively factorizes the weighted adjacency matrices of the heterogeneous network to explore the latent relations between genes, miRNAs and pathways. After this, it reconstructs the pathway interaction network and identifies the pathway pairs with maximal interactive and driver weights as cooperative driver pathways. Experimental results on the breast, uterine corpus endometrial carcinoma and ovarian cancer data from The Cancer Genome Atlas show that CDPathway can effectively identify candidate driver genes [area under the receiver operating characteristic curve (AUROC) of $\geq $0.9] and reconstruct the pathway interaction network (AUROC of>0.9), and it uncovers much more known (potential) driver genes than other competitive methods. In addition, CDPathway identifies 150% more driver pathways and 60% more potential cooperative driver pathways than the competing methods. The code of CDPathway is available at http://mlda.swu.edu.cn/codes.php?name=CDPathway.


Assuntos
Neoplasias da Mama/genética , Redes Reguladoras de Genes , MicroRNAs/genética , Algoritmos , Conjuntos de Dados como Assunto , Feminino , Humanos
7.
Bioinformatics ; 38(19): 4581-4588, 2022 09 30.
Artigo em Inglês | MEDLINE | ID: mdl-35997558

RESUMO

MOTIVATION: High-resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms from a single gene are actual function performers and greatly increase the functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity. Multi-instance learning (MIL)-based solutions have been developed to distribute gene(bag)-level Gene Ontology (GO) annotations to isoforms(instances), but they simply presume that a particular annotation of the gene is responsible by only one isoform, neglect the hierarchical structures and semantics of massive GO terms (labels), or can only handle dozens of terms. RESULTS: We propose an efficacy approach IsofunGO to differentiate massive functions of isoforms by GO embedding. Particularly, IsofunGO first introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones, this strategy not only explores and preserves hierarchy between GO terms but also greatly reduces the prediction load. Next, it develops an attention-based MIL network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations. Extensive experiments on benchmark datasets demonstrate the efficacy of IsofunGO. Both the GO embedding and attention mechanism can boost the performance and interpretability. AVAILABILITYAND IMPLEMENTATION: The code of IsofunGO is available at http://www.sdu-idea.cn/codes.php?name=IsofunGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional , Semântica , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética
8.
Bioinformatics ; 38(22): 5092-5099, 2022 11 15.
Artigo em Inglês | MEDLINE | ID: mdl-36130063

RESUMO

MOTIVATION: Cancer subtype diagnosis is crucial for its precise treatment and different subtypes need different therapies. Although the diagnosis can be greatly improved by fusing multiomics data, most fusion solutions depend on paired omics data, which are actually weakly paired, with different omics views missing for different samples. Incomplete multiview learning-based solutions can alleviate this issue but are still far from satisfactory because they: (i) mainly focus on shared information while ignore the important individuality of multiomics data and (ii) cannot pick out interpretable features for precise diagnosis. RESULTS: We introduce an interpretable and flexible solution (LungDWM) for Lung cancer subtype Diagnosis using Weakly paired Multiomics data. LungDWM first builds an attention-based encoder for each omics to pick out important diagnostic features and extract shared and complementary information across omics. Next, it proposes an individual loss to jointly extract the specific information of each omics and performs generative adversarial learning to impute missing omics of samples using extracted features. After that, it fuses the extracted and imputed features to diagnose cancer subtypes. Experiments on benchmark datasets show that LungDWM achieves a better performance than recent competitive methods, and has a high authenticity and good interpretability. AVAILABILITY AND IMPLEMENTATION: The code is available at http://www.sdu-idea.cn/codes.php?name=LungDWM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Neoplasias Pulmonares , Humanos , Neoplasias Pulmonares/diagnóstico
9.
Methods ; 198: 65-75, 2022 02.
Artigo em Inglês | MEDLINE | ID: mdl-34555529

RESUMO

Epistasis between single nucleotide polymorphisms (SNPs) plays an important role in elucidating the missing heritability of complex diseases. Diverse approaches have been invented for detecting SNP interactions, but they canonically neglect the important and useful connections between SNPs and other bio-molecules (i.e., miRNAs and lncRNAs). To comprehensively model these disease related molecules, a heterogeneous bio-molecular network based solution EpiHNet is introduced for high-order SNP interactions detection. EpiHNet firstly uses case/control data to construct an SNP statistical network, and meta-path based similarity on the heterogeneous network composed with SNPs, genes, lncRNAs, miRNAs and diseases to define another SNP relational network. The SNP relational network can explore and exploit different associations between molecules and diseases to complement the SNP statistical network and search the significantly associated SNPs. Next, EpiHNet integrates these two networks into a composite network, applies the modularity based clustering with fast search strategy to divide SNP nodes into different clusters. After that, it detects SNP interactions based on SNP combinations derived from each cluster. Synthetic experiments on diverse two-locus and three-locus disease models manifest that EpiHNet outperforms competitive baselines, even without the heterogeneous network. For real WTCCC breast cancer data, EpiHNet also demonstrates expressive results on detecting high-order SNP interactions.


Assuntos
Epistasia Genética , Estudo de Associação Genômica Ampla , Algoritmos , Estudos de Casos e Controles , Análise por Conglomerados , Estudo de Associação Genômica Ampla/métodos , Humanos , Polimorfismo de Nucleotídeo Único
10.
Methods ; 205: 18-28, 2022 09.
Artigo em Inglês | MEDLINE | ID: mdl-35690250

RESUMO

Genome-phenome association (GPA) prediction can promote the understanding of biological mechanisms about complex pathology of phenotypes (i.e., traits and diseases). Traditional heterogeneous network-based GPA approaches overwhelmingly need to project heterogeneous data toward homogeneous network for data fusion and prediction, such projections result in the loss of heterogeneous network structure information. Matrix factorization based data fusion can avoid such projection by integrating multi-type data in a coherent way, but they typically perform linear factorization and cannot mine the nonlinear relationships between molecules, which compromise the accuracy of GPA analysis. Furthermore, most of them can not selectively synergy network topology and node attribution information in a principle way. In this paper, we propose a weighted deep matrix factorization based solution (WDGPA) to predict GPAs by selectively and differentially fusing heterogeneous molecular network and diverse attributes of nodes. WDGPA firstly assigns weights to inter/intra-relational data matrices and attribute data matrices, and performs deep matrix factorization on these matrices of heterogeneous network in a cooperative manner to obtain the nonlinear representations of different nodes. In addition, it performs low-rank representation learning on the attribute data with the shared nonlinear representations. In this way, both the network topology and node attributes are jointly mined to explore the representations of molecules and complex interplays between molecules and phenotypes. WDGPA then uses the representational vectors of gene and phenotype nodes to predict GPAs. Experimental results on maize and human datasets confirm that WDGPA outperforms competitive methods by a large margin under different evaluation protocols.


Assuntos
Algoritmos , Genoma , Humanos , Fenótipo
11.
Bioinformatics ; 37(24): 4818-4825, 2021 12 11.
Artigo em Inglês | MEDLINE | ID: mdl-34282449

RESUMO

MOTIVATION: Alternative splicing creates the considerable proteomic diversity and complexity on relatively limited genome. Proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions of this gene, which reflect the functional knowledge of genes at a finer granular level. Recently, some computational approaches have been proposed to differentiate isoform functions using sequence and expression data. However, their performance is far from being desirable, mainly due to the imbalance and lack of annotations at isoform-level, and the difficulty of modeling gene-isoform relations. RESULT: We propose a deep multi-instance learning-based framework (DMIL-IsoFun) to differentiate the functions of isoforms. DMIL-IsoFun firstly introduces a multi-instance learning convolution neural network trained with isoform sequences and gene-level annotations to extract the feature vectors and initialize the annotations of isoforms, and then uses a class-imbalance Graph Convolution Network to refine the annotations of individual isoforms based on the isoform co-expression network and extracted features. Extensive experimental results show that DMIL-IsoFun improves the Smin and Fmax of state-of-the-art solutions by at least 29.6% and 40.8%. The effectiveness of DMIL-IsoFun is further confirmed on a testbed of human multiple-isoform genes, and maize isoforms related with photosynthesis. AVAILABILITY AND IMPLEMENTATION: The code and data are available at http://www.sdu-idea.cn/codes.php?name=DMIL-Isofun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Processamento Alternativo , Proteômica , Humanos , Isoformas de Proteínas/genética , Redes Neurais de Computação , Anotação de Sequência Molecular
12.
Bioinformatics ; 36(1): 303-310, 2020 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-31250882

RESUMO

MOTIVATION: Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. RESULTS: We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun firstly constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene-gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene-level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of respective isoforms of these two genes. AVAILABILITY AND IMPLEMENTATION: The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php? name=IsoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional , Isoformas de Proteínas , Biologia Computacional/métodos , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo
13.
Bioinformatics ; 36(6): 1864-1871, 2020 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-32176770

RESUMO

MOTIVATION: Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene-level, and rarely record the annotations at the isoform-level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. RESULTS: Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and Gene Ontology structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the area under the receiver operating characteristic curve and area under the precision-recall curve of existing solutions by at least 7.7 and 28.9%, respectively. We further investigate DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1 and CFLAR) with known functions at the isoform-level, and observed that DisoFun can differentiate functions of their isoforms with 90.5% accuracy. AVAILABILITY AND IMPLEMENTATION: The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética , Curva ROC
14.
Methods ; 173: 32-43, 2020 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-31226302

RESUMO

Influx evidences show that red long non-coding RNAs (lncRNAs) play important roles in various critical biological processes, and they afffect the development and progression of various human diseases. Therefore, it is necessary to precisely identify the lncRNA-disease associations. The identification precision can be improved by developing data integrative models. However, current models mainly need to project heterogeneous data onto the homologous networks, and then merge these networks into a composite one for integrative prediction. We recognize that this projection overrides the individual structure of the heterogeneous data, and the combination is impacted by noisy networks. As a result, the performance is compromised. Given that, we introduce a weighted matrix factorization model on multi-relational data to predict LncRNA-disease associations (WMFLDA). WMFLDA firstly uses a heterogeneous network to capture the inter(intra)-associations between different types of nodes (including genes, lncRNAs, and Disease Ontology terms). Then, it presets weights to these inter-association and intra-association matrices of the network, and cooperatively decomposes these matrices into low-rank ones to explore the underlying relationships between nodes. Next, it jointly optimizes the low-rank matrices and the weights. After that, WMFLDA approximates the lncRNA-disease association matrix using the optimized matrices and weights, and thus to achieve the prediction. WMFLDA obtains a much better performance than related data integrative solutions across different experiment settings and evaluation metrics. It can not only respect the intrinsic structures of individual data sources, but can also fuse them with selection.


Assuntos
Biologia Computacional/métodos , Predisposição Genética para Doença , RNA Longo não Codificante/genética , Algoritmos , Progressão da Doença , Humanos
15.
BMC Bioinformatics ; 21(Suppl 16): 420, 2020 Dec 16.
Artigo em Inglês | MEDLINE | ID: mdl-33323113

RESUMO

BACKGROUND: Maize (Zea mays ssp. mays L.) is the most widely grown and yield crop in the world, as well as an important model organism for fundamental research of the function of genes. The functions of Maize proteins are annotated using the Gene Ontology (GO), which has more than 40000 terms and organizes GO terms in a direct acyclic graph (DAG). It is a huge challenge to accurately annotate relevant GO terms to a Maize protein from such a large number of candidate GO terms. Some deep learning models have been proposed to predict the protein function, but the effectiveness of these approaches is unsatisfactory. One major reason is that they inadequately utilize the GO hierarchy. RESULTS: To use the knowledge encoded in the GO hierarchy, we propose a deep Graph Convolutional Network (GCN) based model (DeepGOA) to predict GO annotations of proteins. DeepGOA firstly quantifies the correlations (or edges) between GO terms and updates the edge weights of the DAG by leveraging GO annotations and hierarchy, then learns the semantic representation and latent inter-relations of GO terms in the way by applying GCN on the updated DAG. Meanwhile, Convolutional Neural Network (CNN) is used to learn the feature representation of amino acid sequences with respect to the semantic representations. After that, DeepGOA computes the dot product of the two representations, which enable to train the whole network end-to-end coherently. Extensive experiments show that DeepGOA can effectively integrate GO structural information and amino acid information, and then annotates proteins accurately. CONCLUSIONS: Experiments on Maize PH207 inbred line and Human protein sequence dataset show that DeepGOA outperforms the state-of-the-art deep learning based methods. The ablation study proves that GCN can employ the knowledge of GO and boost the performance. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=DeepGOA .


Assuntos
Redes Neurais de Computação , Proteínas de Plantas/metabolismo , Zea mays/metabolismo , Área Sob a Curva , Ontologia Genética , Humanos , Anotação de Sequência Molecular
16.
Hum Mutat ; 41(3): 719-734, 2020 03.
Artigo em Inglês | MEDLINE | ID: mdl-31705708

RESUMO

Detecting epistatic interaction is a typical way of identifying the genetic susceptibility of complex diseases. Multifactor dimensionality reduction (MDR) is a decent solution for epistasis detection. Existing MDR-based methods still suffer from high computational costs or poor performance. In this paper, we propose a new solution that integrates a dual screening strategy with MDR, termed as DualWMDR. Particularly, the first screening employs an adaptive clustering algorithm with part mutual information (PMI) to group single nucleotide polymorphisms (SNPs) and exclude noisy SNPs; the second screening takes into account both the single-locus effect and interaction effect to select dominant SNPs, which effectively alleviates the negative impact of main effects and provides a much smaller but accurate candidate set for MDR. After that, MDR uses the weighted classification evaluation to improve its performance in epistasis identification on the candidate set. The results on diverse simulation datasets show that DualWMDR outperforms existing competitive methods, and the results on three real genome-wide datasets: the age-related macular degeneration (AMD) dataset, breast cancer (BC), and celiac disease (CD) datasets from the Wellcome Trust Case Control Consortium, again corroborate the effectiveness of DualWMDR.


Assuntos
Biologia Computacional/métodos , Epistasia Genética , Modelos Genéticos , Redução Dimensional com Múltiplos Fatores/métodos , Algoritmos , Bases de Dados Genéticas , Loci Gênicos , Predisposição Genética para Doença , Humanos , Degeneração Macular/genética , Polimorfismo de Nucleotídeo Único
17.
Genomics ; 111(3): 334-342, 2019 05.
Artigo em Inglês | MEDLINE | ID: mdl-29477548

RESUMO

Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate the given gene products carrying out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from a massive set of GO terms as defined by GO is a difficult challenge. To combat with this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash firstly measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions to encode massive GO terms via compact binary codes. After that, HPHash utilizes these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and it is robust to the number of hash functions. In addition, we also take HPHash as a plugin for BLAST based gene function prediction. From the experimental results, HPHash again significantly improves the prediction performance. The codes of HPHash are available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.


Assuntos
Ontologia Genética , Software , Animais , Humanos , Camundongos , Ratos , Semântica
18.
Genomics ; 111(5): 1176-1182, 2019 09.
Artigo em Inglês | MEDLINE | ID: mdl-30055230

RESUMO

Single nucleotide polymorphism (SNP) interactions can explain the missing heritability of common complex diseases. Many interaction detection methods have been proposed in genome-wide association studies, and they can be divided into two types: population-based and family-based. Compared with population-based methods, family-based methods are robust vs. population stratification. Several family-based methods have been proposed, among which Multifactor Dimensionality Reduction (MDR)-based methods are popular and powerful. However, current MDR-based methods suffer from heavy computational burden. Furthermore, they do not allow for main effect adjustment. In this work we develop a two-stage model-based MDR approach (TrioMDR) to detect multi-locus interaction in trio families (i.e., two parents and one affected child). TrioMDR combines the MDR framework with logistic regression models to check interactions, so TrioMDR can adjust main effects. In addition, unlike consuming permutation procedures used in traditional MDR-based methods, TrioMDR utilizes a simple semi-parameter P-values correction procedure to control type I error rate, this procedure only uses a few permutations to achieve the significance of a multi-locus model and significantly speeds up TrioMDR. We performed extensive experiments on simulated data to compare the type I error and power of TrioMDR under different scenarios. The results demonstrate that TrioMDR is fast and more powerful in general than some recently proposed methods for interaction detection in trios. The R codes of TrioMDR are available at: https://github.com/TrioMDR/TrioMDR.


Assuntos
Epistasia Genética , Estudo de Associação Genômica Ampla/métodos , Polimorfismo de Nucleotídeo Único , Software , Animais , Estudo de Associação Genômica Ampla/normas , Humanos
19.
BMC Bioinformatics ; 20(Suppl 15): 547, 2019 Dec 24.
Artigo em Inglês | MEDLINE | ID: mdl-31874623

RESUMO

BACKGROUND: Traditional drug research and development is high cost, time-consuming and risky. Computationally identifying new indications for existing drugs, referred as drug repositioning, greatly reduces the cost and attracts ever-increasing research interests. Many network-based methods have been proposed for drug repositioning and most of them apply random walk on a heterogeneous network consisted with disease and drug nodes. However, these methods generally adopt the same walk-length for all nodes, and ignore the different contributions of different nodes. RESULTS: In this study, we propose a drug repositioning approach based on individual bi-random walks (DR-IBRW) on the heterogeneous network. DR-IBRW firstly quantifies the individual work-length of random walks for each node based on the network topology and knowledge that similar drugs tend to be associated with similar diseases. To account for the inner structural difference of the heterogeneous network, it performs bi-random walks with the quantified walk-lengths, and thus to identify new indications for approved drugs. Empirical study on public datasets shows that DR-IBRW achieves a much better drug repositioning performance than other related competitive methods. CONCLUSIONS: Using individual random walk-lengths for different nodes of heterogeneous network indeed boosts the repositioning performance. DR-IBRW can be easily generalized to prioritize links between nodes of a network.


Assuntos
Reposicionamento de Medicamentos , Biologia Computacional , Aprovação de Drogas , Reposicionamento de Medicamentos/métodos
20.
Bioinformatics ; 34(9): 1529-1537, 2018 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-29228285

RESUMO

Motivation: Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be. Results: To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA in short). MFLDA decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA can select and integrate the data sources by assigning different weights to them. An iterative solution is further introduced to simultaneously optimize the weights and low-rank matrices. Next, MFLDA uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thus to identify potential associations. In 5-fold cross validation experiments to identify verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those given by state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA can identify potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers show that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature and further proves the capability of MFLDA in identifying novel lncRNA-disease associations. MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities. Availability and implementation: The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php? name = MFLDA. Contact: gxyu@swu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional/métodos , Simulação por Computador , Predisposição Genética para Doença , RNA Longo não Codificante/genética , Neoplasias da Mama/genética , Feminino , Humanos , Curva ROC
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA