Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 26
Filtrar
Mais filtros








Base de dados
Intervalo de ano de publicação
1.
IEEE Trans Cybern ; 54(1): 486-495, 2024 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-37022240

RESUMO

Finding the causal structure from a set of variables given observational data is a crucial task in many scientific areas. Most algorithms focus on discovering the global causal graph but few efforts have been made toward the local causal structure (LCS), which is of wide practical significance and easier to obtain. LCS learning faces the challenges of neighborhood determination and edge orientation. Available LCS algorithms build on conditional independence (CI) tests, they suffer the poor accuracy due to noises, various data generation mechanisms, and small-size samples of real-world applications, where CI tests do not work. In addition, they can only find the Markov equivalence class, leaving some edges undirected. In this article, we propose a GradieNt-based LCS learning approach (GraN-LCS) to determine neighbors and orient edges simultaneously in a gradient-descent way, and, thus, to explore LCS more accurately. GraN-LCS formulates the causal graph search as minimizing an acyclicity regularized score function, which can be optimized by efficient gradient-based solvers. GraN-LCS constructs a multilayer perceptron (MLP) to simultaneously fit all other variables with respect to a target variable and defines an acyclicity-constrained local recovery loss to promote the exploration of local graphs and to find out direct causes and effects of the target variable. To improve the efficacy, it applies preliminary neighborhood selection (PNS) to sketch the raw causal structure and further incorporates an l1 -norm-based feature selection on the first layer of MLP to reduce the scale of candidate variables and to pursue sparse weight matrix. GraN-LCS finally outputs LCS based on the sparse weighted adjacency matrix learned from MLPs. We conduct experiments on both synthetic and real-world datasets and verify its efficacy by comparing against state-of-the-art baselines. A detailed ablation study investigates the impact of key components of GraN-LCS and the results prove their contribution.

2.
Bioinformatics ; 38(19): 4581-4588, 2022 09 30.
Artigo em Inglês | MEDLINE | ID: mdl-35997558

RESUMO

MOTIVATION: High-resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms from a single gene are actual function performers and greatly increase the functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity. Multi-instance learning (MIL)-based solutions have been developed to distribute gene(bag)-level Gene Ontology (GO) annotations to isoforms(instances), but they simply presume that a particular annotation of the gene is responsible by only one isoform, neglect the hierarchical structures and semantics of massive GO terms (labels), or can only handle dozens of terms. RESULTS: We propose an efficacy approach IsofunGO to differentiate massive functions of isoforms by GO embedding. Particularly, IsofunGO first introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones, this strategy not only explores and preserves hierarchy between GO terms but also greatly reduces the prediction load. Next, it develops an attention-based MIL network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations. Extensive experiments on benchmark datasets demonstrate the efficacy of IsofunGO. Both the GO embedding and attention mechanism can boost the performance and interpretability. AVAILABILITYAND IMPLEMENTATION: The code of IsofunGO is available at http://www.sdu-idea.cn/codes.php?name=IsofunGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional , Semântica , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética
3.
Artigo em Inglês | MEDLINE | ID: mdl-35862332

RESUMO

Personalized federated learning (PFL) learns a personalized model for each client in a decentralized manner, where each client owns private data that are not shared and data among clients are non-independent and identically distributed (i.i.d.) However, existing PFL solutions assume that clients have sufficient training samples to jointly induce personalized models. Thus, existing PFL solutions cannot perform well in a few-shot scenario, where most or all clients only have a handful of samples for training. Furthermore, existing few-shot learning (FSL) approaches typically need centralized training data; as such, these FSL methods are not applicable in decentralized scenarios. How to enable PFL with limited training samples per client is a practical but understudied problem. In this article, we propose a solution called personalized federated few-shot learning (pFedFSL) to tackle this problem. Specifically, pFedFSL learns a personalized and discriminative feature space for each client by identifying which models perform well on which clients, without exposing local data of clients to the server and other clients, and which clients should be selected for collaboration with the target client. In the learned feature spaces, each sample is made closer to samples of the same category and farther away from samples of different categories. Experimental results on four benchmark datasets demonstrate that pFedFSL outperforms competitive baselines across different settings.

4.
IEEE Trans Neural Netw Learn Syst ; 33(1): 304-314, 2022 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-33052870

RESUMO

Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, existing methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the structure of each cluster and, thus, to find the potential correspondence between clusters (and samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantitative loss in a unified objective function. An alternative optimization technique is also proposed to coordinate the correspondence and hash functions and reinforce the reciprocal effects of the two objectives. Experiments on public multimodal data sets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it, indeed, offers a high degree of flexibility for practical cross-modal hashing tasks.

5.
IEEE Trans Neural Netw Learn Syst ; 33(9): 4311-4321, 2022 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-33577462

RESUMO

Multiview multi-instance multilabel learning (M3L) is a framework for modeling complex objects. In this framework, each object (or bag) contains one or more instances, is represented with different feature views, and simultaneously annotated with a set of nonexclusive semantic labels. Given the multiplicity of the studied objects, traditional M3L methods generally demand a large number of labeled bags to train a predictive model to annotate bags (or instances) with semantic labels. However, annotating sufficient bags is very expensive and often impractical. In this article, we present an active learning-based M3L approach (M3AL) to reduce the labeling costs of bags and to improve the performance as much as possible. M3AL first adapts the multiview self-representation learning to evacuate the shared and individual information of bags and to learn the shared/individual similarities between bags across/within views. Next, to avoid scrutinizing all the possible labels, M3AL introduces a new query strategy that leverages the shared and individual information, and the diverse instance distribution of bags across views, to select the most informative bag-label pair for the query. Experimental studies on benchmark data sets show that M3AL can significantly reduce the query costs while achieving a better performance than other related competitive methods at the same cost.

6.
Bioinformatics ; 37(24): 4818-4825, 2021 12 11.
Artigo em Inglês | MEDLINE | ID: mdl-34282449

RESUMO

MOTIVATION: Alternative splicing creates the considerable proteomic diversity and complexity on relatively limited genome. Proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions of this gene, which reflect the functional knowledge of genes at a finer granular level. Recently, some computational approaches have been proposed to differentiate isoform functions using sequence and expression data. However, their performance is far from being desirable, mainly due to the imbalance and lack of annotations at isoform-level, and the difficulty of modeling gene-isoform relations. RESULT: We propose a deep multi-instance learning-based framework (DMIL-IsoFun) to differentiate the functions of isoforms. DMIL-IsoFun firstly introduces a multi-instance learning convolution neural network trained with isoform sequences and gene-level annotations to extract the feature vectors and initialize the annotations of isoforms, and then uses a class-imbalance Graph Convolution Network to refine the annotations of individual isoforms based on the isoform co-expression network and extracted features. Extensive experimental results show that DMIL-IsoFun improves the Smin and Fmax of state-of-the-art solutions by at least 29.6% and 40.8%. The effectiveness of DMIL-IsoFun is further confirmed on a testbed of human multiple-isoform genes, and maize isoforms related with photosynthesis. AVAILABILITY AND IMPLEMENTATION: The code and data are available at http://www.sdu-idea.cn/codes.php?name=DMIL-Isofun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Processamento Alternativo , Proteômica , Humanos , Isoformas de Proteínas/genética , Redes Neurais de Computação , Anotação de Sequência Molecular
7.
IEEE Trans Neural Netw Learn Syst ; 32(4): 1448-1459, 2021 04.
Artigo em Inglês | MEDLINE | ID: mdl-32310798

RESUMO

Crowdsourcing is an economic and efficient strategy aimed at collecting annotations of data through an online platform. Crowd workers with different expertise are paid for their service, and the task requester usually has a limited budget. How to collect reliable annotations for multilabel data and how to compute the consensus within budget are an interesting and challenging, but rarely studied, problem. In this article, we propose a novel approach to accomplish active multilabel crowd consensus (AMCC). AMCC accounts for the commonality and individuality of workers and assumes that workers can be organized into different groups. Each group includes a set of workers who share a similar annotation behavior and label correlations. To achieve an effective multilabel consensus, AMCC models workers' annotations via a linear combination of commonality and individuality and reduces the impact of unreliable workers by assigning smaller weights to their groups. To collect reliable annotations with reduced cost, AMCC introduces an active crowdsourcing learning strategy that selects sample-label-worker triplets. In a triplet, the selected sample and label are the most informative for the consensus model, and the selected worker can reliably annotate the sample at a low cost. Our experimental results on multilabel data sets demonstrate the advantages of AMCC over state-of-the-art solutions on computing crowd consensus and on reducing the budget by choosing cost-effective triplets.

8.
IEEE Trans Cybern ; 51(3): 1716-1727, 2021 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-31751259

RESUMO

In multiview multilabel learning, each object is represented by several heterogeneous feature representations and is also annotated with a set of discrete nonexclusive labels. Previous studies typically focus on capturing the shared latent patterns among multiple views, while not sufficiently considering the diverse characteristics of individual views, which can cause performance degradation. In this article, we propose a novel approach [individuality- and commonality-based multiview multilabel learning (ICM2L)] to explicitly explore the individuality and commonality information of multilabel multiple view data in a unified model. Specifically, a common subspace is learned across different views to capture the shared patterns. Then, multiple individual classifiers are exploited to explore the characteristics of individual views. Next, an ensemble strategy is adopted to make a prediction. Finally, we develop an alternative solution to jointly optimize our model, which can enhance the robustness of the proposed model toward rare labels and reinforce the reciprocal effects of individuality and commonality among heterogeneous views, and thus further improve the performance. Experiments on various real-word datasets validate the effectiveness of ICM2L against the state-of-the-art solutions, and ICM2L can leverage the individuality and commonality information to achieve an improved performance as well as to enhance the robustness toward rare labels.

9.
IEEE Trans Cybern ; 51(7): 3576-3587, 2021 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-31751260

RESUMO

Clustering is a fundamental data exploration task which aims at discovering the hidden grouping structure in the data. The traditional clustering methods typically compute a single partition. However, there often exist different and equally meaningful clusterings in complex data. To solve this issue, multiple clustering approaches have emerged with the goal of exploring alternative clusterings from different perspectives. Existing solutions to this problem mainly focus on one-way clustering, that is, they cluster either the samples or the features. However, for many practical tasks, it is meaningful and desirable to explore alternative two-way clusterings (or co-clusterings), which capture not only the sample cluster structure but also the feature cluster structure. To tackle this interesting and unresolved task, we introduce an approach, called multiple co-clusterings (MultiCCs), to generate multiple alternative co-clusterings at the same time. MultiCC takes advantage of matrix tri-factorization to seek the co-clustering indicator matrices for samples and features and defines the row and column redundancy quantification terms to enforce diversity among co-clusterings based on these indicator matrices. After that, it integrates matrix tri-factorization and two nonredundancy terms into a unified objective function and gives an alternative optimization procedure to optimize the objective function. Extensive experimental results demonstrate that MultiCC performs significantly better than the existing multiple clustering methods. In addition, MultiCC can find out interesting co-clusters, which cannot be made by those comparing methods.

10.
Brief Bioinform ; 22(2): 1984-1999, 2021 03 22.
Artigo em Inglês | MEDLINE | ID: mdl-32103253

RESUMO

Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focus on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer. In this study, we proposed a new approach called CDPathway to discover cooperative driver pathways. First, CDPathway introduces a driver impact quantification function to quantify the driver weight of each gene. CDPathway assumes that genes with larger weights contribute more to the occurrence of the target disease and identifies them as candidate driver genes. Next, it constructs a heterogeneous network composed of genes, miRNAs and pathways nodes based on the known intra(inter)-relations between them and assigns the quantified driver weights to gene-pathway and gene-miRNA relational edges. To transfer driver impacts of genes to pathway interaction pairs, CDPathway collaboratively factorizes the weighted adjacency matrices of the heterogeneous network to explore the latent relations between genes, miRNAs and pathways. After this, it reconstructs the pathway interaction network and identifies the pathway pairs with maximal interactive and driver weights as cooperative driver pathways. Experimental results on the breast, uterine corpus endometrial carcinoma and ovarian cancer data from The Cancer Genome Atlas show that CDPathway can effectively identify candidate driver genes [area under the receiver operating characteristic curve (AUROC) of $\geq $0.9] and reconstruct the pathway interaction network (AUROC of>0.9), and it uncovers much more known (potential) driver genes than other competitive methods. In addition, CDPathway identifies 150% more driver pathways and 60% more potential cooperative driver pathways than the competing methods. The code of CDPathway is available at http://mlda.swu.edu.cn/codes.php?name=CDPathway.


Assuntos
Neoplasias da Mama/genética , Redes Reguladoras de Genes , MicroRNAs/genética , Algoritmos , Conjuntos de Dados como Assunto , Feminino , Humanos
11.
Neural Netw ; 132: 333-341, 2020 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-32977278

RESUMO

The goal of zero-shot learning (ZSL) is to build a classifier that recognizes novel categories with no corresponding annotated training data. The typical routine is to transfer knowledge from seen classes to unseen ones by learning a visual-semantic embedding. Existing multi-label zero-shot learning approaches either ignore correlations among labels, suffer from large label combinations, or learn the embedding using only local or global visual features. In this paper, we propose a Graph Convolution Networks based Multi-label Zero-Shot Learning model, abbreviated as MZSL-GCN. Our model first constructs a label relation graph using label co-occurrences and compensates the absence of unseen labels in the training phase by semantic similarity. It then takes the graph and the word embedding of each seen (unseen) label as inputs to the GCN to learn the label semantic embedding, and to obtain a set of inter-dependent object classifiers. MZSL-GCN simultaneously trains another attention network to learn compatible local and global visual features of objects with respect to the classifiers, and thus makes the whole network end-to-end trainable. In addition, the use of unlabeled training data can reduce the bias toward seen labels and boost the generalization ability. Experimental results on benchmark datasets show that our MZSL-GCN competes with state-of-the-art approaches.


Assuntos
Aprendizado de Máquina , Redes Neurais de Computação , Reconhecimento Automatizado de Padrão/métodos , Humanos , Semântica
12.
Molecules ; 25(9)2020 May 09.
Artigo em Inglês | MEDLINE | ID: mdl-32397410

RESUMO

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de-novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically-active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs. In this paper, we propose a novel clustering-based approach which we demonstrate to significantly reduce an ensemble of generated structures without sacrificing quality. Evaluations are related on both benchmark and CASP target proteins. Structure ensembles subjected to the proposed approach and the source code of the proposed approach are publicly-available at the links provided in Section 1.


Assuntos
Biologia Computacional/métodos , Proteínas/química , Análise por Conglomerados , Modelos Moleculares , Dobramento de Proteína , Estrutura Terciária de Proteína , Software
13.
Bioinformatics ; 36(6): 1864-1871, 2020 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-32176770

RESUMO

MOTIVATION: Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene-level, and rarely record the annotations at the isoform-level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. RESULTS: Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and Gene Ontology structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the area under the receiver operating characteristic curve and area under the precision-recall curve of existing solutions by at least 7.7 and 28.9%, respectively. We further investigate DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1 and CFLAR) with known functions at the isoform-level, and observed that DisoFun can differentiate functions of their isoforms with 90.5% accuracy. AVAILABILITY AND IMPLEMENTATION: The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética , Curva ROC
14.
Bioinformatics ; 36(1): 303-310, 2020 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-31250882

RESUMO

MOTIVATION: Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. RESULTS: We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun firstly constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene-gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene-level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of respective isoforms of these two genes. AVAILABILITY AND IMPLEMENTATION: The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php? name=IsoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional , Isoformas de Proteínas , Biologia Computacional/métodos , Ontologia Genética , Anotação de Sequência Molecular , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo
15.
Methods ; 173: 32-43, 2020 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-31226302

RESUMO

Influx evidences show that red long non-coding RNAs (lncRNAs) play important roles in various critical biological processes, and they afffect the development and progression of various human diseases. Therefore, it is necessary to precisely identify the lncRNA-disease associations. The identification precision can be improved by developing data integrative models. However, current models mainly need to project heterogeneous data onto the homologous networks, and then merge these networks into a composite one for integrative prediction. We recognize that this projection overrides the individual structure of the heterogeneous data, and the combination is impacted by noisy networks. As a result, the performance is compromised. Given that, we introduce a weighted matrix factorization model on multi-relational data to predict LncRNA-disease associations (WMFLDA). WMFLDA firstly uses a heterogeneous network to capture the inter(intra)-associations between different types of nodes (including genes, lncRNAs, and Disease Ontology terms). Then, it presets weights to these inter-association and intra-association matrices of the network, and cooperatively decomposes these matrices into low-rank ones to explore the underlying relationships between nodes. Next, it jointly optimizes the low-rank matrices and the weights. After that, WMFLDA approximates the lncRNA-disease association matrix using the optimized matrices and weights, and thus to achieve the prediction. WMFLDA obtains a much better performance than related data integrative solutions across different experiment settings and evaluation metrics. It can not only respect the intrinsic structures of individual data sources, but can also fuse them with selection.


Assuntos
Biologia Computacional/métodos , Predisposição Genética para Doença , RNA Longo não Codificante/genética , Algoritmos , Progressão da Doença , Humanos
16.
Evol Comput ; 26(1): 43-66, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-27982696

RESUMO

Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this article we discuss a meta-learning algorithm (PSBML) that combines concepts from spatially structured evolutionary algorithms (SSEAs) with concepts from ensemble and boosting methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in combination with a variety of learning classifiers. We perform extensive experiments to investigate the trade-off achieved between scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy.


Assuntos
Algoritmos , Inteligência Artificial , Simulação por Computador , Modelos Teóricos , Bases de Dados Factuais , Humanos
17.
Bioinformatics ; 34(9): 1529-1537, 2018 05 01.
Artigo em Inglês | MEDLINE | ID: mdl-29228285

RESUMO

Motivation: Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be. Results: To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA in short). MFLDA decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA can select and integrate the data sources by assigning different weights to them. An iterative solution is further introduced to simultaneously optimize the weights and low-rank matrices. Next, MFLDA uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thus to identify potential associations. In 5-fold cross validation experiments to identify verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those given by state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA can identify potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers show that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature and further proves the capability of MFLDA in identifying novel lncRNA-disease associations. MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities. Availability and implementation: The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php? name = MFLDA. Contact: gxyu@swu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional/métodos , Simulação por Computador , Predisposição Genética para Doença , RNA Longo não Codificante/genética , Neoplasias da Mama/genética , Feminino , Humanos , Curva ROC
18.
Artigo em Inglês | MEDLINE | ID: mdl-26357091

RESUMO

High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods first optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these approaches result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some approaches optimize the loss of binary classifiers and learn weights for the different kernels iteratively. For multi-class or multi-label data, these methods have to solve the problem of optimizing weights on these kernels for each of the labels, which are computationally expensive and ignore the correlation among labels. In this paper, we propose a method called Predicting Protein Function using Multiple Kernels (ProMK). ProMK iteratively optimizes the phases of learning optimal weights and reduces the empirical loss of multi-label classifier for each of the labels simultaneously. ProMK can integrate kernels selectively and downgrade the weights on noisy kernels. We investigate the performance of ProMK on several publicly available protein function prediction benchmarks and synthetic datasets. We show that the proposed approach performs better than previously proposed protein function prediction approaches that integrate multiple data sources and multi-label multiple kernel learning methods. The codes of our proposed method are available at https://sites.google.com/site/guoxian85/promk.


Assuntos
Algoritmos , Proteínas/química , Proteínas/classificação , Proteômica/métodos , Animais , Humanos , Camundongos , Mapas de Interação de Proteínas , Análise de Sequência de Proteína , Leveduras/genética
19.
BMC Bioinformatics ; 16: 271, 2015 Aug 27.
Artigo em Inglês | MEDLINE | ID: mdl-26310806

RESUMO

BACKGROUND: High-throughput bio-techniques accumulate ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite the advances in gene (or gene's product proteins) functional annotations. Due to experimental techniques and to the research bias in biology, the regularly updated functional annotation databases, i.e., the Gene Ontology (GO), are far from being complete. Given the importance of protein functions for biological studies and drug design, proteins should be more comprehensively and precisely annotated. RESULTS: We proposed downward Random Walks (dRW) to predict missing (or new) functions of partially annotated proteins. Particularly, we apply downward random walks with restart on the GO directed acyclic graph, along with the available functions of a protein, to estimate the probability of missing functions. To further boost the prediction accuracy, we extend dRW to dRW-kNN. dRW-kNN computes the semantic similarity between proteins based on the functional annotations of proteins; it then predicts functions based on the functions estimated by dRW, together with the functions associated with the k nearest proteins. Our proposed models can predict two kinds of missing functions: (i) the ones that are missing for a protein but associated with other proteins of interest; (ii) the ones that are not available for any protein of interest, but exist in the GO hierarchy. Experimental results on the proteins of Yeast and Human show that dRW and dRW-kNN can replenish functions more accurately than other related approaches, especially for sparse functions associated with no more than 10 proteins. CONCLUSION: The empirical study shows that the semantic similarity between GO terms and the ontology hierarchy play important roles in predicting protein function. The proposed dRW and dRW-kNN can serve as tools for replenishing functions of partially annotated proteins.


Assuntos
Proteínas/metabolismo , Proteômica/métodos , Algoritmos , Ontologia Genética , Humanos , Anotação de Sequência Molecular , Proteínas/química , Leveduras/metabolismo
20.
BMC Syst Biol ; 9 Suppl 1: S3, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25707434

RESUMO

BACKGROUND: High throughput techniques produce multiple functional association networks. Integrating these networks can enhance the accuracy of protein function prediction. Many algorithms have been introduced to generate a composite network, which is obtained as a weighted sum of individual networks. The weight assigned to an individual network reflects its benefit towards the protein functional annotation inference. A classifier is then trained on the composite network for predicting protein functions. However, since these techniques model the optimization of the composite network and the prediction tasks as separate objectives, the resulting composite network is not necessarily optimal for the follow-up protein function prediction. RESULTS: We address this issue by modeling the optimization of the composite network and the prediction problems within a unified objective function. In particular, we use a kernel target alignment technique and the loss function of a network based classifier to jointly adjust the weights assigned to the individual networks. We show that the proposed method, called MNet, can achieve a performance that is superior (with respect to different evaluation criteria) to related techniques using the multiple networks of four example species (yeast, human, mouse, and fly) annotated with thousands (or hundreds) of GO terms. CONCLUSION: MNet can effectively integrate multiple networks for protein function prediction and is robust to the input parameters. Supplementary data is available at https://sites.google.com/site/guoxian85/home/mnet. The Matlab code of MNet is available upon request.


Assuntos
Biologia Computacional/métodos , Proteínas/metabolismo , Algoritmos , Animais , Bases de Dados de Proteínas , Proteínas Fúngicas/metabolismo , Ontologia Genética , Humanos , Proteínas de Insetos/metabolismo , Camundongos , Fatores de Tempo
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA