1 - 20 of 51
1.
bioRxiv ; 2024 Apr 09.
Article En | MEDLINE | ID: mdl-38645128

A main limitation of bulk transcriptomic technologies is that individual measurements normally contain contributions from multiple cell populations, impeding the identification of cellular heterogeneity within diseased tissues. To extract cellular insights from existing large cohorts of bulk transcriptomic data, we present CSsingle, a novel method designed to accurately deconvolve bulk data into a predefined set of cell types using a scRNA-seq reference. Through comprehensive benchmark evaluations and analyses using diverse real data sets, we reveal the systematic bias inherent in existing methods, stemming from differences in cell size or library size. Our extensive experiments demonstrate that CSsingle exhibits superior accuracy and robustness compared to leading methods, particularly when dealing with bulk mixtures originating from cell types of markedly different cell sizes, as well as when handling bulk and single-cell reference data obtained from diverse sources. Our work provides an efficient and robust methodology for the integrated analysis of bulk and scRNA-seq data, facilitating various biological and clinical studies.
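
The cell-size bias mentioned above can be illustrated with a minimal Python sketch. This is not the CSsingle algorithm; it only assumes that each cell type contributes RNA in proportion to its cell count times a per-cell size factor. The function names and the example size factors are hypothetical.

```python
def cell_to_rna_fractions(cell_fractions, cell_sizes):
    """Share of total RNA contributed by each cell type."""
    weighted = [f * s for f, s in zip(cell_fractions, cell_sizes)]
    total = sum(weighted)
    return [w / total for w in weighted]


def rna_to_cell_fractions(rna_fractions, cell_sizes):
    """Undo the size weighting to recover cell-count fractions."""
    adjusted = [f / s for f, s in zip(rna_fractions, cell_sizes)]
    total = sum(adjusted)
    return [a / total for a in adjusted]


# Hypothetical mixture: equal cell counts, but type B carries 3x the RNA of type A.
true_cells = [0.5, 0.5]
sizes = [1.0, 3.0]
# RNA shares: what a method that ignores cell size would tend to estimate.
rna = cell_to_rna_fractions(true_cells, sizes)
# Dividing by the size factors and renormalizing recovers the cell fractions.
recovered = rna_to_cell_fractions(rna, sizes)
```

In this toy case the RNA shares are skewed toward the larger cell type even though the cell counts are equal, which is the kind of systematic bias the abstract attributes to differences in cell size or library size.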

2.
Article En | MEDLINE | ID: mdl-37590112

As one of the effective ways of recognizing ocular disease, early fundus screening can help patients avoid irreversible blindness. Although deep learning is powerful for image-based ocular disease recognition, its performance depends mainly on a large amount of labeled data. For ocular disease, data collection and annotation at a single site usually take a lot of time. If multi-site data are obtained, there are two main issues: 1) data privacy is easily leaked; 2) the domain gap among sites will influence the recognition performance. Motivated by these issues, first, a Gaussian randomized mechanism is adopted in local sites, which are then engaged in a global model to preserve the data privacy of local sites and models. Second, to bridge the domain gap among different sites, a two-step domain adaptation method is introduced, which consists of a domain confusion module and a multi-expert learning strategy. Based on the above, a privacy-preserving federated learning framework with domain adaptation is constructed. In the experimental part, a multi-disease early fundus screening dataset is used, with a detailed ablation study and four experimental settings, to show the stepwise performance, which verifies the efficiency of our proposed framework.
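
A Gaussian randomized mechanism in a federated setting can be sketched as follows. The abstract does not say what exactly is perturbed or how the noise scale is calibrated, so this sketch assumes, as in common differential privacy practice, that each site clips its local model update and adds Gaussian noise before the server averages the updates. All names and values are hypothetical.

```python
import math
import random


def clip_l2(vector, max_norm):
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm or norm == 0.0:
        return list(vector)
    return [x * max_norm / norm for x in vector]


def gaussian_mechanism(vector, max_norm, sigma, rng):
    """Clip a local update, then add independent N(0, sigma^2) noise per coordinate."""
    clipped = clip_l2(vector, max_norm)
    return [x + rng.gauss(0.0, sigma) for x in clipped]


def average(updates):
    """Coordinate-wise mean of the (noisy) site updates."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
site_updates = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical local model updates
noisy = [gaussian_mechanism(u, max_norm=1.0, sigma=0.1, rng=rng) for u in site_updates]
global_update = average(noisy)
```

Clipping bounds how much any single site can influence the average, and the noise scale `sigma` trades privacy against accuracy; choosing it for a formal privacy guarantee is outside the scope of this sketch.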

3.
Article En | MEDLINE | ID: mdl-37028079

In this work, we study a more realistic and challenging scenario in multiview clustering (MVC), referred to as incomplete MVC (IMVC), where some instances in certain views are missing. The key to IMVC is how to adequately exploit complementary and consistency information under the incompleteness of data. However, most existing methods address the incompleteness problem at the instance level, and they require sufficient information to perform data recovery. In this work, we develop a new approach to facilitate IMVC based on the graph propagation perspective. Specifically, a partial graph is used to describe the similarity of samples for incomplete views, such that the issue of missing instances can be translated into the missing entries of the partial graph. In this way, a common graph can be adaptively learned to self-guide the propagation process by exploiting the consistency information, and the propagated graph of each view is in turn used to refine the common self-guided graph in an iterative manner. Thus, the associated missing entries can be inferred through graph propagation by exploiting the consistency information across all views. On the other hand, existing approaches focus on the consistency structure only, and the complementary information has not been sufficiently exploited due to the data incompleteness issue. By contrast, under the proposed graph propagation framework, an exclusive regularization term can be naturally adopted to exploit the complementary information in our method. Extensive experiments demonstrate the effectiveness of the proposed method in comparison with state-of-the-art methods. The source code of our method is available at https://github.com/CLiu272/TNNLS-PGP.

4.
IEEE Trans Biomed Eng ; 70(1): 307-317, 2023 01.
Article En | MEDLINE | ID: mdl-35820001

Advances in high-throughput experimental methods have led to the availability of more diverse omic datasets in clinical analysis applications. Different types of omic data reveal different cellular aspects and contribute to the understanding of disease progression from these aspects. While survival prediction and subgroup identification are two important research problems in clinical analysis, their performance can be further boosted by taking advantage of multiple omics data through multi-view learning. However, these two tasks are generally studied separately, and the possibility that they could reinforce each other through collaborative learning has not been adequately considered. In light of this, we propose a View-aware Collaborative Learning (VaCoL) method to jointly boost the performance of survival prediction and subgroup identification by integrating multiple omics data. Specifically, survival analysis and affinity learning, which respectively perform survival prediction and subgroup identification, are integrated into a unified optimization framework to learn the two tasks in a collaborative way. In addition, by considering the diversity of different types of data, we make use of the log-rank test statistic to evaluate the importance of different views. As a result, the proposed approach can adaptively learn the optimal weight for each view during training. Empirical results on several real datasets show that our method is able to significantly improve the performance of survival prediction and subgroup identification. A detailed model analysis study is also provided to show the effectiveness of the proposed collaborative learning and view-weight learning approaches.
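
For readers unfamiliar with the log-rank test statistic mentioned above, here is a minimal two-group version in plain Python. It is a textbook formulation, not the VaCoL view-weighting procedure; how the method maps the statistic to view weights is not stated in the abstract. The function name and input layout are assumptions made for illustration.

```python
def logrank_statistic(times, events, groups):
    """Two-group log-rank chi-square statistic.

    times  : observed time per subject
    events : 1 if the event occurred at that time, 0 if censored
    groups : 0 or 1, the group label per subject
    """
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = [(g, tt, e) for tt, e, g in zip(times, events, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        d = sum(1 for _, tt, e in at_risk if tt == t and e == 1)
        d1 = sum(1 for g, tt, e in at_risk if tt == t and e == 1 and g == 1)
        # Observed events in group 1 minus their expectation under no group difference.
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0
    return observed_minus_expected ** 2 / variance


# Hypothetical cohort: group 1 fails early, group 0 fails late, no censoring.
separated = logrank_statistic(
    times=[1, 2, 3, 4, 5, 6, 7, 8],
    events=[1, 1, 1, 1, 1, 1, 1, 1],
    groups=[1, 1, 1, 1, 0, 0, 0, 0],
)
```

A larger statistic indicates that the two groups have more clearly separated survival curves, which is why such a statistic can serve as a signal of how informative a grouping (and hence a view) is.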


Interdisciplinary Placement , Machine Learning , Learning , Survival Analysis
5.
IEEE Trans Neural Netw Learn Syst ; 33(2): 654-666, 2022 Feb.
Article En | MEDLINE | ID: mdl-33079681

Recently, multitask learning has been successfully applied to survival analysis problems. A critical challenge in real-world survival analysis tasks is that not all instances and tasks are equally learnable. A survival analysis model can be improved by considering the complexities of instances and tasks during model training. To this end, we propose an asymmetric graph-guided multitask learning approach with self-paced learning for survival analysis applications. The proposed model is able to improve the learning performance by identifying the complex structure among tasks and considering the complexities of training instances and tasks during model training. In particular, by incorporating the self-paced learning strategy and asymmetric graph-guided regularization, the proposed model is able to learn in a progressive way from "easy" to "hard" loss function items. In addition, together with the self-paced learning function, the asymmetric graph-guided regularization allows related knowledge to be transferred from one task to another in an asymmetric way. Consequently, the knowledge acquired from the earlier learned tasks can help to solve complex tasks effectively. The experimental results on both synthetic and real-world TCGA data suggest that the proposed method is indeed useful for improving survival analysis and achieves higher prediction accuracies than the previous state-of-the-art methods.
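
The "easy to hard" idea behind self-paced learning can be sketched with the classic hard-weighting scheme: an item takes part in training only if its current loss is below a threshold, and the threshold grows over rounds. This is a generic illustration, not the weighting function used in the paper; the names, losses, and schedule are hypothetical.

```python
def self_paced_weights(losses, threshold):
    """Hard self-paced weights: 1 for items whose loss is below the threshold, else 0."""
    return [1 if loss < threshold else 0 for loss in losses]


def pace_schedule(initial_threshold, growth, rounds):
    """Thresholds that grow geometrically so harder items enter in later rounds."""
    return [initial_threshold * growth ** r for r in range(rounds)]


losses = [0.1, 0.4, 0.9, 1.5]  # hypothetical per-item losses
history = []
for threshold in pace_schedule(initial_threshold=0.5, growth=2.0, rounds=3):
    # In a real training loop the model would be refit on the selected items
    # and the losses recomputed before the next round.
    history.append(self_paced_weights(losses, threshold))
```

With these values, only the two lowest-loss items are selected in the first round and all four by the third, which mirrors the progressive inclusion of harder items.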

6.
IEEE/ACM Trans Comput Biol Bioinform ; 19(2): 1193-1202, 2022.
Article En | MEDLINE | ID: mdl-32750893

Identifying cancer subtypes by integrating multi-omic data helps improve the understanding of disease progression and enables more precise treatment for patients. Cancer subtype identification is usually accomplished by clustering patients with unsupervised learning approaches. Thus, most existing integrative cancer subtyping methods are performed in an entirely unsupervised way. An integrative cancer subtyping approach can discover clinically more relevant cancer subtypes when it considers the clinical survival response variables. In this study, we propose Survival Supervised Graph Clustering (S2GC) for cancer subtyping by taking survival information into consideration. Specifically, we use a graph to represent the similarity of patients, and develop a multi-omic survival analysis embedding with patient-to-patient similarity graph learning for cancer subtype identification. The multi-view (omic) survival analysis model and the graph of patients are jointly learned in a unified way. The learned optimal graph can be utilized to cluster cancer subtypes directly. In the proposed model, the survival analysis model and adaptive graph learning can positively reinforce each other. Consequently, the survival time can be considered as supervised information to improve the quality of the similarity graph and explore clinically more relevant subgroups of patients. Experiments on several representative multi-omic cancer datasets demonstrate that the proposed method achieves better results than a number of state-of-the-art methods. The results also suggest that our method is able to identify biologically meaningful subgroups for different cancer types. (Our Matlab source code is available online at GitHub: https://github.com/CLiu272/S2GC).


Algorithms , Neoplasms , Cluster Analysis , Humans , Neoplasms/genetics , Software , Survival Analysis
7.
IEEE Trans Neural Netw Learn Syst ; 33(1): 75-88, 2022 Jan.
Article En | MEDLINE | ID: mdl-33048763

Graph-based methods have achieved impressive performance on semisupervised classification (SSC). Traditional graph-based methods have two main drawbacks. First, the graph is predefined before training a classifier, which does not leverage the interactions between the classifier training and similarity matrix learning. Second, when handling high-dimensional data with noisy or redundant features, the graph constructed in the original input space is unsuitable and may lead to poor performance. In this article, we propose an SSC method with novel graph construction (SSC-NGC), in which the similarity matrix is optimized in both the label space and an additional subspace to get a better and more robust result than in the original data space. Furthermore, to obtain a high-quality subspace, we learn the projection matrix of the additional subspace by preserving the local and global structure of the data. Finally, we integrate the classifier training, the graph construction, and the subspace learning into a unified framework. With this framework, the classifier parameters, similarity matrix, and projection matrix of the subspace are adaptively learned in an iterative scheme to obtain an optimal joint result. We conduct extensive comparative experiments against state-of-the-art methods over multiple real-world data sets. Experimental results demonstrate the superiority of the proposed method over other state-of-the-art algorithms.

8.
IEEE Trans Cybern ; 52(5): 3658-3668, 2022 May.
Article En | MEDLINE | ID: mdl-32924945

Ensemble learning has many successful applications because of its effectiveness in boosting the predictive performance of classification models. In this article, we propose a semisupervised multiple choice learning (SemiMCL) approach to jointly train a network ensemble on partially labeled data. Our model mainly focuses on improving the assignment of labeled data among the constituent networks and exploiting unlabeled data to capture domain-specific information, such that semisupervised classification can be effectively facilitated. Different from conventional multiple choice learning models, the constituent networks learn multiple tasks in the training process. Specifically, an auxiliary reconstruction task is included to learn domain-specific representation. To perform implicit labeling on reliable unlabeled samples, we adopt a negative l1-norm regularization when minimizing the conditional entropy with respect to the posterior probability distribution. Extensive experiments on multiple real-world datasets are conducted to verify the effectiveness and superiority of the proposed SemiMCL model.


Learning , Supervised Machine Learning
9.
IEEE Trans Image Process ; 30: 5807-5818, 2021.
Article En | MEDLINE | ID: mdl-34138710

Both target-specific and domain-invariant features can facilitate Open Set Domain Adaptation (OSDA). To exploit these features, we propose a Knowledge Exchange (KnowEx) model which jointly trains two complementary constituent networks: (1) a Domain-Adversarial Network (DAdvNet) learning the domain-invariant representation, through which the supervision in source domain can be exploited to infer the class information of unlabeled target data; (2) a Private Network (PrivNet) exclusive for target domain, which is beneficial for discriminating between instances from known and unknown classes. The two constituent networks exchange training experience in the learning process. Toward this end, we exploit an adversarial perturbation process against DAdvNet to regularize PrivNet. This enhances the complementarity between the two networks. At the same time, we incorporate an adaptation layer into DAdvNet to address the unreliability of the PrivNet's experience. Therefore, DAdvNet and PrivNet are able to mutually reinforce each other during training. We have conducted thorough experiments on multiple standard benchmarks to verify the effectiveness and superiority of KnowEx in OSDA.

10.
IEEE Trans Cybern ; 51(4): 2019-2031, 2021 Apr.
Article En | MEDLINE | ID: mdl-31180903

Healthcare question answering (HQA) systems play a vital role in encouraging patients to seek professional consultation. However, several factors make it challenging to learn and represent the question corpus of HQA datasets, such as high dimensionality, sparseness, noise, and nonprofessional expression. To address these issues, we propose an inception convolutional autoencoder model for Chinese healthcare question clustering (ICAHC). First, we select a set of kernels with different sizes using convolutional autoencoder networks to explore both the diversity and quality in the clustering ensemble. Thus, these kernels encourage the capture of diverse representations. Second, we design four ensemble operators to merge representations based on whether they are independent, and input them into the encoder using different skip connections. Third, the model maps features from the encoder into a lower-dimensional space, followed by clustering. We conduct comparative experiments against other clustering algorithms on a Chinese healthcare dataset. Experimental results show the effectiveness of ICAHC in discovering better clustering solutions. The results can be used in the prediction of patients' conditions and the development of an automatic HQA system.


Cluster Analysis , Delivery of Health Care/methods , Diagnosis, Computer-Assisted/methods , Neural Networks, Computer , Algorithms , China , Humans
11.
IEEE Trans Neural Netw Learn Syst ; 32(8): 3593-3607, 2021 Aug.
Article En | MEDLINE | ID: mdl-32845845

Semisupervised clustering methods improve performance by randomly selecting pairwise constraints, which may lead to redundancy and instability. In this context, active clustering is proposed to maximize the efficacy of annotations by effectively using pairwise constraints. However, existing methods lack an overall consideration of the querying criteria and repeatedly run semisupervised clustering to update labels. In this work, we first propose an active density peak (ADP) clustering algorithm that considers both representativeness and informativeness. Representative instances are selected to capture data patterns, while informative instances are queried to reduce the uncertainty of clustering results. Meanwhile, we design a fast update strategy to update labels efficiently. In addition, we propose an active clustering ensemble framework that combines local and global uncertainties to query the most ambiguous instances for better separation between the clusters. A weighted voting consensus method is introduced for better integration of clustering results. We conducted experiments comparing our methods with state-of-the-art methods on real-world data sets. Experimental results demonstrate the effectiveness of our methods.

12.
IEEE Trans Cybern ; 50(1): 74-86, 2020 Jan.
Article En | MEDLINE | ID: mdl-30137022

Multitask feature selection (MTFS) methods have become more important for many real-world applications, especially in a high-dimensional setting. The most widely used assumption is that all tasks share the same features, and the l2,1 regularization method is usually applied. However, this assumption may not hold when the correlations among tasks are not obvious. Learning unrelated tasks together may result in negative transfer and degrade the performance. In this paper, we present a flexible MTFS approach based on graph-clustered feature sharing. To avoid the above limitation, we adopt a graph to represent the relevance among tasks instead of adopting a hard task set partition. Furthermore, we propose a graph-guided regularization approach such that the sparsity of the solution can be achieved on both the task level and the feature level, and a variant of the smooth proximal gradient method is developed to solve the corresponding optimization problem. An evaluation of the proposed method on multitask regression and multitask binary classification problems has been performed. Extensive experiments on synthetic datasets and real-world datasets demonstrate the effectiveness of the proposed approach in capturing task structure.
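
The baseline l2,1 regularization mentioned above can be sketched in plain Python. The sketch shows only the standard l2,1 norm and its proximal operator (row-wise group soft-thresholding), not the graph-guided regularizer or the smooth proximal gradient variant proposed in the paper. It assumes the weight matrix has one row per feature and one column per task; the names and values are hypothetical.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows (rows = features, columns = tasks)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in matrix)


def prox_l21(matrix, tau):
    """Row-wise group soft-thresholding, the proximal operator of tau * l2,1 norm."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(w * w for w in row))
        # Rows with norm <= tau are zeroed; longer rows are shrunk toward zero.
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        result.append([scale * w for w in row])
    return result


weights = [[3.0, 4.0], [0.1, 0.1]]  # hypothetical 2 features x 2 tasks
shrunk = prox_l21(weights, tau=1.0)
```

Because a whole row is either kept (shrunk) or zeroed, a feature is selected or discarded for all tasks at once, which is exactly the shared-feature assumption the abstract says can fail when tasks are only weakly related.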

13.
IEEE Trans Cybern ; 50(6): 2872-2885, 2020 Jun.
Article En | MEDLINE | ID: mdl-30596592

Clustering ensemble (CE) takes multiple clustering solutions into consideration in order to effectively improve the accuracy and robustness of the final result. To reduce redundancy as well as noise, a CE selection (CES) step is added to further enhance performance. Quality and diversity are two important metrics of CES. However, most of the CES strategies adopt heuristic selection methods or a threshold parameter setting to achieve a tradeoff between quality and diversity. In this paper, we propose a transfer CES (TCES) algorithm which makes use of the relationship between quality and diversity in a source dataset, and transfers it into a target dataset based on three objective functions. Furthermore, a multiobjective self-evolutionary process is designed to optimize these three objective functions. Finally, we construct a transfer CE framework (TCE-TCES) based on TCES to obtain better clustering results. The experimental results on 12 transfer clustering tasks obtained from the 20newsgroups dataset show that TCE-TCES can find a better tradeoff between quality and diversity and obtain more desirable clustering results.

14.
IEEE Trans Neural Netw Learn Syst ; 31(4): 1387-1400, 2020 04.
Article En | MEDLINE | ID: mdl-31265410

The class imbalance problem has become a leading challenge. Although conventional imbalance learning methods are proposed to tackle this problem, they have some limitations: 1) undersampling methods suffer from losing important information and 2) cost-sensitive methods are sensitive to outliers and noise. To address these issues, we propose a hybrid optimal ensemble classifier framework that combines density-based undersampling and cost-effective methods by exploring state-of-the-art solutions using a multi-objective optimization algorithm. Specifically, we first develop a density-based undersampling method to select informative samples from the original training data with probability-based data transformation, which makes it possible to obtain multiple subsets following a balanced distribution across classes. Second, we exploit the cost-sensitive classification method to address the incompleteness of information problem by modifying the weights of misclassified minority samples rather than the majority ones. Finally, we introduce a multi-objective optimization procedure and utilize connections between samples to self-modify the classification result using an ensemble classifier framework. Extensive comparative experiments conducted on real-world data sets demonstrate that our method outperforms the majority of imbalance and ensemble classification approaches.

15.
Article En | MEDLINE | ID: mdl-31603782

In this paper, we explore how to leverage readily available unlabeled data to improve semi-supervised human detection performance. For this purpose, we specifically modify the region proposal network (RPN) for learning on a partially labeled dataset. Based on commonly observed false positive types, a verification module is developed to assess foreground human objects in the candidate regions to provide an important cue for filtering the RPN's proposals. The remaining proposals with high confidence scores are then used as pseudo annotations for re-training our detection model. To reduce the risk of error propagation in the training process, we adopt a self-paced training strategy to progressively include more pseudo annotations generated by the previous model over multiple training rounds. The resulting detector re-trained on the augmented data can be expected to have better detection performance. The effectiveness of the main components of this framework is verified through extensive experiments, and the proposed approach achieves state-of-the-art detection results on multiple scene-specific human detection benchmarks in the semi-supervised setting.

16.
Article En | MEDLINE | ID: mdl-31425030

Using an ensemble of neural networks with consistency regularization is effective for improving performance and stability of deep learning, compared to the case of a single network. In this paper, we present a semi-supervised Deep Coupled Ensemble (DCE) model, which contributes to ensemble learning and classification landmark exploration for better locating the final decision boundaries in the learnt latent space. First, multiple complementary consistency regularizations are integrated into our DCE model to enable the ensemble members to learn from each other and themselves, such that training experience from different sources can be shared and utilized during training. Second, in view of the possibility of producing incorrect predictions on a number of difficult instances, we adopt class-wise mean feature matching to explore important unlabeled instances as classification landmarks, on which the model predictions are more reliable. Minimizing the weighted conditional entropy on unlabeled data is able to force the final decision boundaries to move away from important training data points, which facilitates semi-supervised learning. Ensemble members could eventually have similar performance due to consistency regularization, and thus only one of these members is needed during the test stage, such that the efficiency of our model is the same as the non-ensemble case. Extensive experimental results demonstrate the superiority of our proposed DCE model over existing state-of-the-art semi-supervised learning methods.

17.
Article En | MEDLINE | ID: mdl-29989970

In gene expression data analysis, the problems of cancer classification and gene selection are closely related. Successfully selecting informative genes will significantly improve the classification performance. To identify informative genes from a large number of candidate genes, various methods have been proposed. However, the gene expression data may include some important correlation structures, and some of the genes can be divided into different groups based on their biological pathways. Many existing methods do not take into consideration the exact correlation structure within the data. Therefore, from both the knowledge discovery and biological perspectives, an ideal gene selection method should take this structural information into account. Moreover, better generalization performance can be obtained by discovering the correlation structure within the data. In order to discover structure information among data and improve learning performance, we propose a structured penalized logistic regression model which simultaneously performs feature selection and model learning for gene expression data analysis. An efficient coordinate descent algorithm has been developed to optimize the model. The numerical simulation studies demonstrate that our method is able to select the highly correlated features. In addition, the results from real gene expression datasets show that the proposed method performs competitively with respect to previous approaches.


Computational Biology/methods , Gene Expression Profiling/methods , Logistic Models , Machine Learning , Cluster Analysis , Databases, Genetic , Humans , Neoplasms/classification , Neoplasms/genetics , Neoplasms/metabolism , Transcriptome/genetics
18.
IEEE Trans Cybern ; 49(2): 366-379, 2019 Feb.
Article En | MEDLINE | ID: mdl-29989979

High-dimensional data classification with very limited labeled training data is a challenging task in the area of data mining. In order to tackle this task, we first propose a feature selection-based semi-supervised classifier ensemble framework (FSCE) to perform high-dimensional data classification. Then, we design an adaptive semi-supervised classifier ensemble framework (ASCE) to improve the performance of FSCE. When compared with FSCE, ASCE is characterized by an adaptive feature selection process, an adaptive weighting process (AWP), and an auxiliary training set generation process (ATSGP). The adaptive feature selection process generates a set of compact subspaces based on the selected attributes obtained by the feature selection algorithms, while the AWP associates each basic semi-supervised classifier in the ensemble with a weight value. The ATSGP enlarges the training set with unlabeled samples. In addition, a set of nonparametric tests is adopted to compare multiple semi-supervised classifier ensemble (SSCE) approaches over different datasets. The experiments on 20 high-dimensional real-world datasets show that: 1) the two adaptive processes in ASCE are useful for improving the performance of the SSCE approach and 2) ASCE works well on high-dimensional datasets with very limited labeled training data, and outperforms most state-of-the-art SSCE approaches.

19.
IEEE Trans Cybern ; 49(2): 403-416, 2019 Feb.
Article En | MEDLINE | ID: mdl-29990215

Traditional ensemble learning approaches explore the feature space and the sample space separately, which prevents them from constructing more powerful learning models for noisy real-world dataset classification. The random subspace method only searches over the selection of features, while the bagging approach only searches over the selection of samples. To overcome these limitations, we propose the hybrid incremental ensemble learning (HIEL) approach, which takes into consideration the feature space and the sample space simultaneously to handle noisy datasets. Specifically, HIEL first adopts the bagging technique and linear discriminant analysis to remove noisy attributes, and generates a set of bootstraps and the corresponding ensemble members in the subspaces. Then, the classifiers are selected incrementally based on a classifier-specific criterion function and an ensemble criterion function. The corresponding weights for the classifiers are assigned during the same process. Finally, the final label is determined by a weighted voting scheme, which serves as the final result of the classification. We also explore various classifier-specific criterion functions based on different newly proposed similarity measures, which alleviate the effect of noisy samples on the distance functions. In addition, the computational cost of HIEL is analyzed theoretically. A set of nonparametric tests is adopted to compare HIEL and other algorithms over several datasets. The experimental results show that HIEL performs well on the noisy datasets. HIEL outperforms most of the compared classifier ensemble methods on 14 out of 24 noisy real-world UCI and KEEL datasets.
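
The final weighted voting step can be sketched generically as follows. The sketch assumes each ensemble member outputs one label and already has a weight; how HIEL derives those weights from its criterion functions is not shown. The function name and example values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(labels, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    # Ties are broken in favour of the label that was seen first.
    return max(totals, key=totals.get)


# Hypothetical ensemble of three classifiers predicting one sample.
predictions = ["cat", "dog", "dog"]
member_weights = [0.6, 0.2, 0.3]
final_label = weighted_vote(predictions, member_weights)
```

Here the single high-weight member outvotes two low-weight members, which is the difference from plain majority voting.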

20.
IEEE Trans Cybern ; 49(6): 2280-2293, 2019 Jun.
Article En | MEDLINE | ID: mdl-29993923

Classification of high-dimensional data with very limited labels is a challenging task in the field of data mining and machine learning. In this paper, we propose the multiobjective semisupervised classifier ensemble (MOSSCE) approach to address this challenge. Specifically, a multiobjective subspace selection process (MOSSP) in MOSSCE is first designed to generate the optimal combination of feature subspaces. Three objective functions are then proposed for MOSSP, which include the relevance of features, the redundancy between features, and the data reconstruction error. Then, MOSSCE generates an auxiliary training set based on the sample confidence to improve the performance of the classifier ensemble. Finally, the training set, combined with the auxiliary training set, is used to select the optimal combination of basic classifiers in the ensemble, train the classifier ensemble, and generate the final result. In addition, diversity analysis of the ensemble learning process is applied, and a set of nonparametric statistical tests is adopted for the comparison of semisupervised classification approaches on multiple datasets. The experiments on 12 gene expression datasets and two large image datasets show that MOSSCE has a better performance than other state-of-the-art semisupervised classifiers on high-dimensional data.

...