Results 1 - 20 of 26
1.
Brief Bioinform ; 22(2): 1984-1999, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32103253

ABSTRACT

Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focuses on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer. In this study, we proposed a new approach called CDPathway to discover cooperative driver pathways. First, CDPathway introduces a driver impact quantification function to quantify the driver weight of each gene. CDPathway assumes that genes with larger weights contribute more to the occurrence of the target disease and identifies them as candidate driver genes. Next, it constructs a heterogeneous network composed of gene, miRNA and pathway nodes based on the known intra(inter)-relations between them and assigns the quantified driver weights to gene-pathway and gene-miRNA relational edges. To transfer driver impacts of genes to pathway interaction pairs, CDPathway collaboratively factorizes the weighted adjacency matrices of the heterogeneous network to explore the latent relations between genes, miRNAs and pathways. After this, it reconstructs the pathway interaction network and identifies the pathway pairs with maximal interactive and driver weights as cooperative driver pathways. Experimental results on the breast, uterine corpus endometrial carcinoma and ovarian cancer data from The Cancer Genome Atlas show that CDPathway can effectively identify candidate driver genes [area under the receiver operating characteristic curve (AUROC) of ≥0.9] and reconstruct the pathway interaction network (AUROC of >0.9), and it uncovers many more known (potential) driver genes than other competitive methods. In addition, CDPathway identifies 150% more driver pathways and 60% more potential cooperative driver pathways than the competing methods.
The code of CDPathway is available at http://mlda.swu.edu.cn/codes.php?name=CDPathway.


Subject(s)
Breast Neoplasms/genetics , Gene Regulatory Networks , MicroRNAs/genetics , Algorithms , Datasets as Topic , Female , Humans
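
The AUROC figures quoted in the abstract above measure how well a ranking separates known positives from negatives. As a minimal sketch (the gene weights and labels are made-up illustrations, not data from the paper, and `auroc` is a hypothetical helper name), AUROC can be computed as the fraction of positive/negative pairs that the scores order correctly:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive is scored above a
    randomly chosen negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical driver weights for six genes; 1 marks a known driver gene.
weights = [0.9, 0.8, 0.35, 0.7, 0.2, 0.1]
known = [1, 1, 1, 0, 0, 0]
# 8 of the 9 positive/negative pairs are ordered correctly.
print(auroc(weights, known))
```

The quadratic pair loop keeps the definition visible; for large gene sets a rank-based implementation would usually be preferred.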
2.
Bioinformatics ; 38(19): 4581-4588, 2022 09 30.
Article in English | MEDLINE | ID: mdl-35997558

ABSTRACT

MOTIVATION: High-resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms from a single gene are actual function performers and greatly increase the functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity. Multi-instance learning (MIL)-based solutions have been developed to distribute gene (bag)-level Gene Ontology (GO) annotations to isoforms (instances), but they simply presume that a particular annotation of the gene is attributable to only one isoform, neglect the hierarchical structures and semantics of massive GO terms (labels), or can only handle dozens of terms. RESULTS: We propose an effective approach, IsofunGO, to differentiate massive functions of isoforms by GO embedding. Particularly, IsofunGO first introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones; this strategy not only explores and preserves hierarchy between GO terms but also greatly reduces the prediction load. Next, it develops an attention-based MIL network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations. Extensive experiments on benchmark datasets demonstrate the efficacy of IsofunGO. Both the GO embedding and attention mechanism can boost the performance and interpretability. AVAILABILITY AND IMPLEMENTATION: The code of IsofunGO is available at http://www.sdu-idea.cn/codes.php?name=IsofunGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Semantics , Gene Ontology , Molecular Sequence Annotation , Protein Isoforms/genetics
3.
Bioinformatics ; 37(24): 4818-4825, 2021 12 11.
Article in English | MEDLINE | ID: mdl-34282449

ABSTRACT

MOTIVATION: Alternative splicing creates considerable proteomic diversity and complexity from a relatively limited genome. Proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions of this gene, which reflect the functional knowledge of genes at a finer granular level. Recently, some computational approaches have been proposed to differentiate isoform functions using sequence and expression data. However, their performance is far from desirable, mainly due to the imbalance and lack of annotations at the isoform level, and the difficulty of modeling gene-isoform relations. RESULTS: We propose a deep multi-instance learning-based framework (DMIL-IsoFun) to differentiate the functions of isoforms. DMIL-IsoFun first introduces a multi-instance learning convolution neural network trained with isoform sequences and gene-level annotations to extract the feature vectors and initialize the annotations of isoforms, and then uses a class-imbalance Graph Convolution Network to refine the annotations of individual isoforms based on the isoform co-expression network and extracted features. Extensive experimental results show that DMIL-IsoFun improves the Smin and Fmax of state-of-the-art solutions by at least 29.6% and 40.8%. The effectiveness of DMIL-IsoFun is further confirmed on a testbed of human multiple-isoform genes, and maize isoforms related to photosynthesis. AVAILABILITY AND IMPLEMENTATION: The code and data are available at http://www.sdu-idea.cn/codes.php?name=DMIL-Isofun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing , Proteomics , Humans , Protein Isoforms/genetics , Neural Networks, Computer , Molecular Sequence Annotation
4.
Bioinformatics ; 36(6): 1864-1871, 2020 03 01.
Article in English | MEDLINE | ID: mdl-32176770

ABSTRACT

MOTIVATION: Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps in understanding the underlying pathology of various complex diseases at a deeper granularity. Since existing functional genomic databases uniformly record the annotations at the gene level, and rarely record the annotations at the isoform level, differentiating isoform functions is more challenging than the traditional gene-level function prediction. RESULTS: Several approaches have been proposed to differentiate the functions of isoforms. They generally follow the multi-instance learning paradigm by viewing each gene as a bag and the spliced isoforms as its instances, and push functions of bags onto instances. These approaches implicitly assume the collected annotations of genes are complete and only integrate multiple RNA-seq datasets. As such, they have compromised performance. We propose a data integrative solution (called DisoFun) to Differentiate isoform Functions with collaborative matrix factorization. DisoFun assumes the functional annotations of genes are aggregated from those of key isoforms. It collaboratively factorizes the isoform data matrix and gene-term data matrix (storing Gene Ontology annotations of genes) into low-rank matrices to simultaneously explore the latent key isoforms, and achieve function prediction by aggregating predictions to their originating genes. In addition, it leverages the PPI network and Gene Ontology structure to further coordinate the matrix factorization. Extensive experimental results show that DisoFun improves the area under the receiver operating characteristic curve and area under the precision-recall curve of existing solutions by at least 7.7 and 28.9%, respectively.
We further investigated DisoFun on four exemplar genes (LMNA, ADAM15, BCL2L1 and CFLAR) with known functions at the isoform level, and observed that DisoFun can differentiate functions of their isoforms with 90.5% accuracy. AVAILABILITY AND IMPLEMENTATION: The code of DisoFun is available at mlda.swu.edu.cn/codes.php?name=DisoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Gene Ontology , Molecular Sequence Annotation , Protein Isoforms/genetics , ROC Curve
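
The bag/instance view in the abstract above (each gene is a bag, its isoforms are instances, and gene-level annotations are aggregated from key isoforms) can be illustrated with a small sketch. It assumes max-pooling as the aggregation rule, which is a common multi-instance learning convention and not necessarily the rule DisoFun uses; the gene, isoform and term names, the scores, and the function names are all hypothetical.

```python
from collections import defaultdict


def aggregate_to_genes(isoform_scores, isoform_to_gene):
    """Max-pool per-term isoform scores into gene-level scores."""
    gene_scores = defaultdict(dict)
    for isoform, term_scores in isoform_scores.items():
        gene = isoform_to_gene[isoform]
        for term, score in term_scores.items():
            previous = gene_scores[gene].get(term, score)
            gene_scores[gene][term] = max(previous, score)
    return {gene: dict(terms) for gene, terms in gene_scores.items()}


def key_isoform(isoform_scores, isoform_to_gene, gene, term):
    """Return the isoform of `gene` with the highest score for `term`."""
    candidates = [
        (scores.get(term, float("-inf")), iso)
        for iso, scores in isoform_scores.items()
        if isoform_to_gene[iso] == gene
    ]
    return max(candidates)[1]


# Hypothetical predictions for two isoforms of one gene.
iso_to_gene = {"geneX-iso1": "geneX", "geneX-iso2": "geneX"}
iso_scores = {
    "geneX-iso1": {"term_A": 0.9, "term_B": 0.1},
    "geneX-iso2": {"term_A": 0.2, "term_B": 0.7},
}
print(aggregate_to_genes(iso_scores, iso_to_gene))
print(key_isoform(iso_scores, iso_to_gene, "geneX", "term_B"))
```

Under this assumption a gene carries a term whenever at least one of its isoforms scores highly for it, and the highest-scoring isoform is read as the key isoform for that term.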
5.
Bioinformatics ; 36(1): 303-310, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31250882

ABSTRACT

MOTIVATION: Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored. The main bottleneck is that functional annotations of isoforms are generally unavailable and functional genomic databases universally store the functional annotations at the gene level. RESULTS: We propose IsoFun to accomplish Isoform Function prediction based on bi-random walks on a heterogeneous network. IsoFun firstly constructs an isoform functional association network based on the expression profiles of isoforms derived from multiple RNA-seq datasets. Next, IsoFun uses the available Gene Ontology annotations of genes, gene-gene interactions and the relations between genes and isoforms to construct a heterogeneous network. After this, IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between GO terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms. Experimental results show that IsoFun significantly outperforms the state-of-the-art algorithms and improves the area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) by 17% and 44% at the gene-level, respectively. We further validated the performance of IsoFun on the genes ADAM15 and BCL2L1. IsoFun accurately differentiates the functions of respective isoforms of these two genes. AVAILABILITY AND IMPLEMENTATION: The code of IsoFun is available at http://mlda.swu.edu.cn/codes.php?name=IsoFun. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology , Protein Isoforms , Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation , Protein Isoforms/genetics , Protein Isoforms/metabolism
6.
Methods ; 173: 32-43, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31226302

ABSTRACT

Accumulating evidence shows that long non-coding RNAs (lncRNAs) play important roles in various critical biological processes, and they affect the development and progression of various human diseases. Therefore, it is necessary to precisely identify the lncRNA-disease associations. The identification precision can be improved by developing data integrative models. However, current models mainly need to project heterogeneous data onto the homologous networks, and then merge these networks into a composite one for integrative prediction. We recognize that this projection overrides the individual structure of the heterogeneous data, and the combination is impacted by noisy networks. As a result, the performance is compromised. Given that, we introduce a weighted matrix factorization model on multi-relational data to predict LncRNA-disease associations (WMFLDA). WMFLDA first uses a heterogeneous network to capture the inter(intra)-associations between different types of nodes (including genes, lncRNAs, and Disease Ontology terms). Then, it presets weights to these inter-association and intra-association matrices of the network, and cooperatively decomposes these matrices into low-rank ones to explore the underlying relationships between nodes. Next, it jointly optimizes the low-rank matrices and the weights. After that, WMFLDA approximates the lncRNA-disease association matrix using the optimized matrices and weights, and thus achieves the prediction. WMFLDA obtains a much better performance than related data integrative solutions across different experiment settings and evaluation metrics. It can not only respect the intrinsic structures of individual data sources, but can also fuse them with selection.


Subject(s)
Computational Biology/methods , Genetic Predisposition to Disease , RNA, Long Noncoding/genetics , Algorithms , Disease Progression , Humans
7.
Molecules ; 25(9)2020 May 09.
Article in English | MEDLINE | ID: mdl-32397410

ABSTRACT

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs. In this paper, we propose a novel clustering-based approach which we demonstrate to significantly reduce an ensemble of generated structures without sacrificing quality. Evaluations are reported on both benchmark and CASP target proteins. Structure ensembles subjected to the proposed approach and the source code of the proposed approach are publicly available at the links provided in Section 1.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Cluster Analysis , Models, Molecular , Protein Folding , Protein Structure, Tertiary , Software
8.
Bioinformatics ; 34(9): 1529-1537, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29228285

ABSTRACT

Motivation: Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be. Results: To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA in short). MFLDA decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA can select and integrate the data sources by assigning different weights to them. An iterative solution is further introduced to simultaneously optimize the weights and low-rank matrices. Next, MFLDA uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thus to identify potential associations. In 5-fold cross validation experiments to identify verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those given by state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA can identify potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers shows that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature and further proves the capability of MFLDA in identifying novel lncRNA-disease associations.
MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities. Availability and implementation: The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php?name=MFLDA. Contact: gxyu@swu.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology/methods , Computer Simulation , Genetic Predisposition to Disease , RNA, Long Noncoding/genetics , Breast Neoplasms/genetics , Female , Humans , ROC Curve
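
The matrix tri-factorization step mentioned in the MFLDA abstract can be sketched for a single non-negative relation matrix R ≈ G1 · S · G2ᵀ. This is a generic illustration using standard multiplicative updates, under the assumption that all factors are kept non-negative; it omits MFLDA's source weighting and multi-matrix coupling, and every name in it (`tri_factorize`, `k1`, `k2`, the toy matrix) is hypothetical.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(r, g1, s, g2):
    """Squared Frobenius norm of R - G1 S G2^T."""
    approx = matmul(matmul(g1, s), transpose(g2))
    return sum((x - y) ** 2 for ra, rb in zip(r, approx) for x, y in zip(ra, rb))


def scale(m, numer, denom, eps=1e-9):
    """Elementwise multiplicative update m * numer / denom."""
    return [
        [v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
        for rm, rn, rd in zip(m, numer, denom)
    ]


def tri_factorize(r, k1, k2, iters=200, seed=0):
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    g1, s, g2 = init(len(r), k1), init(k1, k2), init(len(r[0]), k2)
    errors = [squared_error(r, g1, s, g2)]
    for _ in range(iters):
        # G1 <- G1 * (R G2 S^T) / (G1 S G2^T G2 S^T)
        h = matmul(g2, transpose(s))
        g1 = scale(g1, matmul(r, h), matmul(g1, matmul(transpose(h), h)))
        # G2 <- G2 * (R^T G1 S) / (G2 S^T G1^T G1 S)
        h = matmul(g1, s)
        g2 = scale(g2, matmul(transpose(r), h), matmul(g2, matmul(transpose(h), h)))
        # S <- S * (G1^T R G2) / (G1^T G1 S G2^T G2)
        g1t, g2t = transpose(g1), transpose(g2)
        numer = matmul(matmul(g1t, r), g2)
        denom = matmul(matmul(matmul(g1t, g1), s), matmul(g2t, g2))
        s = scale(s, numer, denom)
        errors.append(squared_error(r, g1, s, g2))
    return g1, s, g2, errors


# Toy association matrix with two row groups and two column groups.
R = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
G1, S, G2, errs = tri_factorize(R, k1=2, k2=2)
print(errs[0], errs[-1])
```

The reconstructed product G1 · S · G2ᵀ plays the role of the completed association matrix: entries that were zero in R but receive a high reconstructed value would be read as candidate associations.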
9.
Evol Comput ; 26(1): 43-66, 2018.
Article in English | MEDLINE | ID: mdl-27982696

ABSTRACT

Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this article we discuss a meta-learning algorithm (PSBML) that combines concepts from spatially structured evolutionary algorithms (SSEAs) with concepts from ensemble and boosting methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in combination with a variety of learning classifiers. We perform extensive experiments to investigate the trade-off achieved between scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy.


Subject(s)
Algorithms , Artificial Intelligence , Computer Simulation , Models, Theoretical , Databases, Factual , Humans
10.
BMC Bioinformatics ; 16: 1, 2015 Jan 16.
Article in English | MEDLINE | ID: mdl-25591917

ABSTRACT

BACKGROUND: Protein function prediction is to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e. no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction. RESULTS: In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses a regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels. CONCLUSION: The empirical study shows that it is important to consider the incomplete annotation for protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.


Subject(s)
Algorithms , Databases, Protein , Molecular Dynamics Simulation , Proteins/chemistry , Proteins/metabolism
11.
BMC Bioinformatics ; 16: 271, 2015 Aug 27.
Article in English | MEDLINE | ID: mdl-26310806

ABSTRACT

BACKGROUND: High-throughput bio-techniques accumulate an ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite the advances in functional annotation of genes (or their protein products). Due to experimental techniques and to the research bias in biology, the regularly updated functional annotation databases, i.e., the Gene Ontology (GO), are far from being complete. Given the importance of protein functions for biological studies and drug design, proteins should be more comprehensively and precisely annotated. RESULTS: We proposed downward Random Walks (dRW) to predict missing (or new) functions of partially annotated proteins. Particularly, we apply downward random walks with restart on the GO directed acyclic graph, along with the available functions of a protein, to estimate the probability of missing functions. To further boost the prediction accuracy, we extend dRW to dRW-kNN. dRW-kNN computes the semantic similarity between proteins based on the functional annotations of proteins; it then predicts functions based on the functions estimated by dRW, together with the functions associated with the k nearest proteins. Our proposed models can predict two kinds of missing functions: (i) the ones that are missing for a protein but associated with other proteins of interest; (ii) the ones that are not available for any protein of interest, but exist in the GO hierarchy. Experimental results on the proteins of Yeast and Human show that dRW and dRW-kNN can replenish functions more accurately than other related approaches, especially for sparse functions associated with no more than 10 proteins. CONCLUSION: The empirical study shows that the semantic similarity between GO terms and the ontology hierarchy play important roles in predicting protein function. The proposed dRW and dRW-kNN can serve as tools for replenishing functions of partially annotated proteins.


Subject(s)
Proteins/metabolism , Proteomics/methods , Algorithms , Gene Ontology , Humans , Molecular Sequence Annotation , Proteins/chemistry , Yeasts/metabolism
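
The idea of a downward random walk with restart on a directed acyclic graph, as described in the dRW abstract, can be sketched as follows. The sketch assumes uniform transition probabilities from a term to its children and simply drops the mass of a walker that reaches a leaf; the abstract does not confirm either choice, and the toy ontology, term names and `downward_rwr` function are hypothetical.

```python
def downward_rwr(children, start, restart=0.5, iters=100):
    """Random walk with restart that only steps from a term to its children.

    children: dict mapping every term to a list of its child terms.
    start:    set of terms already annotated to the protein (restart set).
    """
    nodes = list(children)
    p0 = {v: (1.0 / len(start) if v in start else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {v: restart * p0[v] for v in nodes}
        for v in nodes:
            kids = children[v]
            if not kids:
                continue  # leaf term: the walker cannot descend further
            share = (1.0 - restart) * p[v] / len(kids)
            for c in kids:
                nxt[c] += share
        p = nxt
    return p


# Toy ontology: root -> a, b; a -> c, d; b -> d.
dag = {"root": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": [], "d": []}
# The protein is currently annotated with term "a" only.
scores = downward_rwr(dag, start={"a"})
print(scores)
```

In this toy run the unannotated descendants `c` and `d` of the known term receive non-zero scores, while `root` and `b`, which lie above or beside the known term, receive none; candidate missing functions would be ranked by these scores.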
12.
IEEE Trans Cybern ; 54(1): 486-495, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37022240

ABSTRACT

Finding the causal structure from a set of variables given observational data is a crucial task in many scientific areas. Most algorithms focus on discovering the global causal graph but few efforts have been made toward the local causal structure (LCS), which is of wide practical significance and easier to obtain. LCS learning faces the challenges of neighborhood determination and edge orientation. Available LCS algorithms build on conditional independence (CI) tests; they suffer from poor accuracy due to noise, various data generation mechanisms, and the small sample sizes of real-world applications, where CI tests do not work. In addition, they can only find the Markov equivalence class, leaving some edges undirected. In this article, we propose a GradieNt-based LCS learning approach (GraN-LCS) to determine neighbors and orient edges simultaneously in a gradient-descent way, and, thus, to explore LCS more accurately. GraN-LCS formulates the causal graph search as minimizing an acyclicity regularized score function, which can be optimized by efficient gradient-based solvers. GraN-LCS constructs a multilayer perceptron (MLP) to simultaneously fit all other variables with respect to a target variable and defines an acyclicity-constrained local recovery loss to promote the exploration of local graphs and to find out direct causes and effects of the target variable. To improve the efficacy, it applies preliminary neighborhood selection (PNS) to sketch the raw causal structure and further incorporates an l1-norm-based feature selection on the first layer of MLP to reduce the scale of candidate variables and to pursue a sparse weight matrix. GraN-LCS finally outputs LCS based on the sparse weighted adjacency matrix learned from MLPs. We conduct experiments on both synthetic and real-world datasets and verify its efficacy by comparing against state-of-the-art baselines.
A detailed ablation study investigates the impact of key components of GraN-LCS and the results prove their contribution.
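
The abstract above refers to an acyclicity-regularized score without stating the penalty. One widely used differentiable choice, shown here purely as an assumption about what such a term can look like, is h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when the weighted adjacency matrix W describes a directed acyclic graph. The sketch evaluates it with a truncated power series; the function name, the `terms` cutoff and the example matrices are hypothetical.

```python
def acyclicity(w, terms=20):
    """Truncated-series value of tr(exp(W * W)) - d for a square matrix w.

    W * W is the elementwise square, so all entries are non-negative and the
    diagonal of its k-th power accumulates the weights of length-k cycles.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]
    # `power` holds A^k / k!, starting from the identity (k = 0).
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace = float(d)
    for k in range(1, terms + 1):
        power = [
            [sum(power[i][m] * a[m][j] for m in range(d)) / k for j in range(d)]
            for i in range(d)
        ]
        trace += sum(power[i][i] for i in range(d))
    return trace - d


# Chain 0 -> 1 -> 2 is acyclic; adding the edge 2 -> 0 closes a cycle.
dag = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.0]]
cyclic = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.5, 0.0, 0.0]]
print(acyclicity(dag), acyclicity(cyclic))
```

Because the penalty is smooth in W, it can be added to a fitting loss and minimized with gradient-based solvers, which is the general pattern the abstract describes.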

13.
IEEE Trans Neural Netw Learn Syst ; 33(1): 304-314, 2022 Jan.
Article in English | MEDLINE | ID: mdl-33052870

ABSTRACT

Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, existing methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the structure of each cluster and, thus, to find the potential correspondence between clusters (and samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantitative loss in a unified objective function. An alternative optimization technique is also proposed to coordinate the correspondence and hash functions and reinforce the reciprocal effects of the two objectives. Experiments on public multimodal data sets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it, indeed, offers a high degree of flexibility for practical cross-modal hashing tasks.

14.
IEEE Trans Neural Netw Learn Syst ; 33(9): 4311-4321, 2022 Sep.
Article in English | MEDLINE | ID: mdl-33577462

ABSTRACT

Multiview multi-instance multilabel learning (M3L) is a framework for modeling complex objects. In this framework, each object (or bag) contains one or more instances, is represented with different feature views, and simultaneously annotated with a set of nonexclusive semantic labels. Given the multiplicity of the studied objects, traditional M3L methods generally demand a large number of labeled bags to train a predictive model to annotate bags (or instances) with semantic labels. However, annotating sufficient bags is very expensive and often impractical. In this article, we present an active learning-based M3L approach (M3AL) to reduce the labeling costs of bags and to improve the performance as much as possible. M3AL first adapts the multiview self-representation learning to excavate the shared and individual information of bags and to learn the shared/individual similarities between bags across/within views. Next, to avoid scrutinizing all the possible labels, M3AL introduces a new query strategy that leverages the shared and individual information, and the diverse instance distribution of bags across views, to select the most informative bag-label pair for the query. Experimental studies on benchmark data sets show that M3AL can significantly reduce the query costs while achieving a better performance than other related competitive methods at the same cost.

15.
Article in English | MEDLINE | ID: mdl-35862332

ABSTRACT

Personalized federated learning (PFL) learns a personalized model for each client in a decentralized manner, where each client owns private data that are not shared and data among clients are non-independent and identically distributed (non-i.i.d.). However, existing PFL solutions assume that clients have sufficient training samples to jointly induce personalized models. Thus, existing PFL solutions cannot perform well in a few-shot scenario, where most or all clients only have a handful of samples for training. Furthermore, existing few-shot learning (FSL) approaches typically need centralized training data; as such, these FSL methods are not applicable in decentralized scenarios. How to enable PFL with limited training samples per client is a practical but understudied problem. In this article, we propose a solution called personalized federated few-shot learning (pFedFSL) to tackle this problem. Specifically, pFedFSL learns a personalized and discriminative feature space for each client by identifying which models perform well on which clients, without exposing local data of clients to the server and other clients, and which clients should be selected for collaboration with the target client. In the learned feature spaces, each sample is made closer to samples of the same category and farther away from samples of different categories. Experimental results on four benchmark datasets demonstrate that pFedFSL outperforms competitive baselines across different settings.

16.
IEEE Trans Neural Netw Learn Syst ; 32(4): 1448-1459, 2021 04.
Article in English | MEDLINE | ID: mdl-32310798

ABSTRACT

Crowdsourcing is an economic and efficient strategy aimed at collecting annotations of data through an online platform. Crowd workers with different expertise are paid for their service, and the task requester usually has a limited budget. How to collect reliable annotations for multilabel data and how to compute the consensus within budget are an interesting and challenging, but rarely studied, problem. In this article, we propose a novel approach to accomplish active multilabel crowd consensus (AMCC). AMCC accounts for the commonality and individuality of workers and assumes that workers can be organized into different groups. Each group includes a set of workers who share a similar annotation behavior and label correlations. To achieve an effective multilabel consensus, AMCC models workers' annotations via a linear combination of commonality and individuality and reduces the impact of unreliable workers by assigning smaller weights to their groups. To collect reliable annotations with reduced cost, AMCC introduces an active crowdsourcing learning strategy that selects sample-label-worker triplets. In a triplet, the selected sample and label are the most informative for the consensus model, and the selected worker can reliably annotate the sample at a low cost. Our experimental results on multilabel data sets demonstrate the advantages of AMCC over state-of-the-art solutions on computing crowd consensus and on reducing the budget by choosing cost-effective triplets.

17.
IEEE Trans Cybern ; 51(3): 1716-1727, 2021 Mar.
Article in English | MEDLINE | ID: mdl-31751259

ABSTRACT

In multiview multilabel learning, each object is represented by several heterogeneous feature representations and is also annotated with a set of discrete nonexclusive labels. Previous studies typically focus on capturing the shared latent patterns among multiple views, while not sufficiently considering the diverse characteristics of individual views, which can cause performance degradation. In this article, we propose a novel approach [individuality- and commonality-based multiview multilabel learning (ICM2L)] to explicitly explore the individuality and commonality information of multilabel multiple view data in a unified model. Specifically, a common subspace is learned across different views to capture the shared patterns. Then, multiple individual classifiers are exploited to explore the characteristics of individual views. Next, an ensemble strategy is adopted to make a prediction. Finally, we develop an alternative solution to jointly optimize our model, which can enhance the robustness of the proposed model toward rare labels and reinforce the reciprocal effects of individuality and commonality among heterogeneous views, and thus further improve the performance. Experiments on various real-world datasets validate the effectiveness of ICM2L against the state-of-the-art solutions, and ICM2L can leverage the individuality and commonality information to achieve an improved performance as well as to enhance the robustness toward rare labels.

18.
IEEE Trans Cybern ; 51(7): 3576-3587, 2021 Jul.
Article in English | MEDLINE | ID: mdl-31751260

ABSTRACT

Clustering is a fundamental data exploration task which aims at discovering the hidden grouping structure in the data. Traditional clustering methods typically compute a single partition. However, there often exist different and equally meaningful clusterings in complex data. To solve this issue, multiple clustering approaches have emerged with the goal of exploring alternative clusterings from different perspectives. Existing solutions to this problem mainly focus on one-way clustering, that is, they cluster either the samples or the features. However, for many practical tasks, it is meaningful and desirable to explore alternative two-way clusterings (or co-clusterings), which capture not only the sample cluster structure but also the feature cluster structure. To tackle this interesting and unresolved task, we introduce an approach, called multiple co-clusterings (MultiCC), to generate multiple alternative co-clusterings at the same time. MultiCC takes advantage of matrix tri-factorization to seek the co-clustering indicator matrices for samples and features and defines row and column redundancy quantification terms to enforce diversity among co-clusterings based on these indicator matrices. After that, it integrates matrix tri-factorization and the two nonredundancy terms into a unified objective function and gives an alternating optimization procedure to optimize the objective function. Extensive experimental results demonstrate that MultiCC performs significantly better than the existing multiple clustering methods. In addition, MultiCC can discover interesting co-clusters that the compared methods cannot find.
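To make the role of tri-factorization in co-clustering concrete, the sketch below scores a single hard co-clustering by how well the block structure reconstructs the data, i.e. X ≈ R S Cᵀ with binary row and column indicator matrices R and C and a block-summary matrix S. It is a generic illustration under stated assumptions, not the MultiCC objective: the redundancy terms and the optimization procedure are omitted, and the names `block_means` and `reconstruction_error` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def block_means(X: Matrix, row_cluster: List[int], col_cluster: List[int],
                k: int, l: int) -> Matrix:
    """Block-summary matrix S: mean of X over each (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            sums[row_cluster[i]][col_cluster[j]] += value
            counts[row_cluster[i]][col_cluster[j]] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def reconstruction_error(X: Matrix, row_cluster: List[int],
                         col_cluster: List[int], S: Matrix) -> float:
    """Squared Frobenius error of X against its block reconstruction."""
    return sum((value - S[row_cluster[i]][col_cluster[j]]) ** 2
               for i, row in enumerate(X) for j, value in enumerate(row))


# Toy data with an obvious 2 x 2 block structure.
X = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 5.0]]
rows = [0, 0, 1]

good_cols = [0, 0, 1]
bad_cols = [0, 1, 1]

err_good = reconstruction_error(X, rows, good_cols, block_means(X, rows, good_cols, 2, 2))
err_bad = reconstruction_error(X, rows, bad_cols, block_means(X, rows, bad_cols, 2, 2))
print(err_good, err_bad)
```

A column grouping that matches the block structure reconstructs the toy matrix exactly, while a mismatched grouping leaves a nonzero error.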

19.
Neural Netw ; 132: 333-341, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32977278

ABSTRACT

The goal of zero-shot learning (ZSL) is to build a classifier that recognizes novel categories with no corresponding annotated training data. The typical routine is to transfer knowledge from seen classes to unseen ones by learning a visual-semantic embedding. Existing multi-label zero-shot learning approaches either ignore correlations among labels, suffer from large label combinations, or learn the embedding using only local or global visual features. In this paper, we propose a graph convolutional network (GCN)-based multi-label zero-shot learning model, abbreviated as MZSL-GCN. Our model first constructs a label relation graph using label co-occurrences and compensates for the absence of unseen labels in the training phase by semantic similarity. It then takes the graph and the word embedding of each seen (unseen) label as inputs to the GCN to learn the label semantic embedding, and to obtain a set of inter-dependent object classifiers. MZSL-GCN simultaneously trains another attention network to learn compatible local and global visual features of objects with respect to the classifiers, and thus makes the whole network end-to-end trainable. In addition, the use of unlabeled training data can reduce the bias toward seen labels and boost the generalization ability. Experimental results on benchmark datasets show that our MZSL-GCN competes with state-of-the-art approaches.
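The first step described above, building a label relation graph from label co-occurrences, might look like the following sketch. It uses conditional co-occurrence frequencies as edge weights, which is one common choice and an assumption here; the abstract does not specify the weighting. The semantic-similarity edges for unseen labels are omitted, and the function name `cooccurrence_matrix` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Iterable, List, Set


def cooccurrence_matrix(label_sets: Iterable[Set[str]],
                        labels: List[str]) -> List[List[float]]:
    """A[i][j] estimates P(label j present | label i present) from training label sets."""
    index = {label: i for i, label in enumerate(labels)}
    single = Counter()  # how often each label occurs
    pair = Counter()    # how often each ordered pair of labels co-occurs
    for label_set in label_sets:
        current = set(label_set)
        single.update(current)
        pair.update(permutations(sorted(current), 2))
    n = len(labels)
    A = [[0.0] * n for _ in range(n)]
    for (a, b), count in pair.items():
        A[index[a]][index[b]] = count / single[a]
    return A


# Hypothetical training annotations: one label set per image.
labels = ["cat", "dog", "sofa"]
label_sets = [{"cat", "sofa"}, {"cat", "sofa"}, {"cat"}, {"dog", "sofa"}]

A = cooccurrence_matrix(label_sets, labels)
for name, row in zip(labels, A):
    print(name, [round(v, 2) for v in row])
```

The resulting matrix is asymmetric: in this toy data "dog" always appears with "sofa", but "sofa" appears with "dog" only a third of the time.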


Subject(s)
Machine Learning , Neural Networks, Computer , Pattern Recognition, Automated/methods , Humans , Semantics

20.
IEEE Trans Neural Netw ; 16(4): 899-909, 2005 Jul.
Article in English | MEDLINE | ID: mdl-16121731

ABSTRACT

The nearest neighbor technique is a simple and appealing approach to addressing classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. Severe bias can be introduced under these conditions when using the nearest neighbor rule. The employment of a locally adaptive metric becomes crucial in order to keep class conditional probabilities close to uniform, thereby minimizing the bias of estimates. We propose a technique that computes a locally flexible metric by means of support vector machines (SVMs). The decision function constructed by SVMs is used to determine the most discriminant direction in a neighborhood around the query. Such a direction provides a local feature weighting scheme. We formally show that our method increases the margin in the weighted space where classification takes place. Moreover, our method has the important advantage of online computational efficiency over competing locally adaptive techniques for nearest neighbor classification. We demonstrate the efficacy of our method using both real and simulated data.
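The effect of a local feature weighting scheme on the nearest neighbor rule can be sketched as follows. The discriminant direction is taken as an input, assumed to have been obtained already (for example, from the gradient of a trained SVM decision function near the query). Normalizing the absolute components of that direction into weights is an illustrative assumption, not the weighting formula of the paper, and the names `feature_weights` and `weighted_knn` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def feature_weights(direction: Sequence[float]) -> List[float]:
    """Turn a discriminant direction into nonnegative weights that sum to 1."""
    magnitudes = [abs(d) for d in direction]
    total = sum(magnitudes)
    if total == 0:
        return [1.0 / len(magnitudes)] * len(magnitudes)
    return [m / total for m in magnitudes]


def weighted_knn(query: Sequence[float], points: Sequence[Sequence[float]],
                 labels: Sequence[str], weights: Sequence[float], k: int = 3) -> str:
    """Majority vote among the k nearest points under a weighted Euclidean distance."""
    distances = sorted(
        (math.sqrt(sum(w * (q - p) ** 2 for w, q, p in zip(weights, query, point))), label)
        for point, label in zip(points, labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: feature 0 separates the classes, feature 1 is noise.
points = [(0.0, 0.0), (0.2, 5.0), (0.1, -4.0), (1.0, 0.1), (1.1, 0.3), (0.9, -0.2)]
labels = ["a", "a", "a", "b", "b", "b"]
query = (0.2, 0.0)

uniform = feature_weights((1.0, 1.0))   # equivalent to plain Euclidean ranking
adapted = feature_weights((1.0, 0.0))   # assumed discriminant direction along feature 0

print(weighted_knn(query, points, labels, uniform))
print(weighted_knn(query, points, labels, adapted))
```

Under uniform weights the noisy feature pushes two class "a" points far from the query and the vote goes to "b"; weighting along the assumed discriminant direction recovers "a".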


Subject(s)
Algorithms , Artificial Intelligence , Breast Neoplasms/diagnosis , Diabetes Mellitus/diagnosis , Diagnosis, Computer-Assisted/methods , Models, Biological , Pattern Recognition, Automated/methods , Computer Simulation , Computing Methodologies , Decision Support Techniques , Humans , Models, Statistical , Numerical Analysis, Computer-Assisted , Stochastic Processes