Results 1 - 20 of 25
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701413

ABSTRACT

With the emergence of large amounts of single-cell RNA sequencing (scRNA-seq) data, the exploration of computational methods has become critical in revealing biological mechanisms. Clustering is a representative approach for deciphering the cellular heterogeneity embedded in scRNA-seq data. However, due to the diversity of datasets, none of the existing single-cell clustering methods performs best on all datasets. Weighted ensemble methods have been proposed to integrate multiple results and improve heterogeneity analysis. These methods usually assign weights according to the reliability of each base clustering result as a whole, ignoring the fact that the same base clustering can perform differently on different cells. In this paper, we propose scEWE, a self-representative ensemble learning framework based on a high-order element-wise weighting strategy. By assigning different base clustering weights to individual cells, we construct and optimize the consensus matrix at a fine-grained level. In addition, we extract high-order information between cells, which enhances the ability to represent the similarity relationships between cells. scEWE is experimentally shown to significantly outperform state-of-the-art methods, which demonstrates the effectiveness of the method and supports its potential application to complex single-cell data analysis problems.
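
The idea of giving each base clustering a different weight for each cell can be illustrated with a small sketch. This is not the scEWE objective, which the abstract does not spell out: the function name `weighted_consensus`, the product form of the pair weight, and the example weights are all assumptions made for illustration.

```python
def weighted_consensus(partitions, weights):
    """Element-wise weighted consensus (co-association) matrix.

    partitions[m][i] is the label of cell i in base clustering m.
    weights[m][i] is a hypothetical reliability of base clustering m for cell i.
    The pair weight w[i] * w[j] is an illustrative choice, not the scEWE formula.
    """
    n = len(partitions[0])
    consensus = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            num = den = 0.0
            for labels, w in zip(partitions, weights):
                pair_weight = w[i] * w[j]
                den += pair_weight
                if labels[i] == labels[j]:
                    num += pair_weight
            consensus[i][j] = num / den if den > 0 else 0.0
    return consensus


# Two base clusterings of three cells that disagree about cell 1.
partitions = [[0, 0, 1], [0, 1, 1]]
uniform = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# Assumed weights: the second clustering is less reliable for cell 1 only.
per_cell = [[1.0, 1.0, 1.0], [1.0, 0.5, 1.0]]

c_uniform = weighted_consensus(partitions, uniform)
c_per_cell = weighted_consensus(partitions, per_cell)
```

With uniform weights, cell 1 is tied between cells 0 and 2. Lowering the weight of the second clustering for cell 1 alone shifts its consensus toward cell 0, which is the kind of per-cell distinction a single weight per clustering cannot express.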


Subjects
RNA Sequence Analysis , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , RNA Sequence Analysis/methods , Algorithms , Computational Biology/methods , Humans , RNA-Seq/methods
2.
Am J Epidemiol ; 193(8): 1146-1154, 2024 Aug 05.
Article in English | MEDLINE | ID: mdl-38576181

ABSTRACT

Multimorbidity, defined as having 2 or more chronic conditions, is a growing public health concern, but research in this area is complicated by the fact that multimorbidity is a highly heterogeneous outcome. Individuals in a sample may have a differing number and varied combinations of conditions. Clustering methods, such as unsupervised machine learning algorithms, may allow us to tease out the unique multimorbidity phenotypes. However, many clustering methods exist, and choosing which to use is challenging because we do not know the true underlying clusters. Here, we demonstrate the use of 3 individual algorithms (partition around medoids, hierarchical clustering, and probabilistic clustering) and a clustering ensemble approach (which pools different clustering approaches) to identify multimorbidity clusters in the AIDS Linked to the Intravenous Experience cohort study. We show how the clusters can be compared based on cluster quality, interpretability, and predictive ability. In practice, it is critical to compare the clustering results from multiple algorithms and to choose the approach that performs best in the domain(s) that align with plans to use the clusters in future analyses.


Subjects
Algorithms , Multimorbidity , Humans , Cluster Analysis , Female , Male , Middle Aged , Unsupervised Machine Learning , Adult
3.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34607358

ABSTRACT

The discovery of cancer subtypes has become a much-researched topic in oncology. Dividing cancer patients into subtypes can enable personalized treatment for heterogeneous patients. High-throughput technologies provide multiple omics data for cancer subtyping. Many computational methods integrate multi-view data to identify cancer subtypes, yet they obtain different subtypes for the same cancer, even when using the same multi-omics data. To a certain extent, the subtypes produced by distinct methods are related, which may offer guidance for cancer subtyping. It is a challenge to effectively use the valuable information in these distinct subtypes to produce more accurate and reliable subtypes. A weighted ensemble sparse latent representation (subtype-WESLR) is proposed to detect cancer subtypes on heterogeneous omics data. Using a weighted ensemble strategy to fuse the base clusterings obtained by distinct methods as prior knowledge, subtype-WESLR projects each sample feature profile from each data type to a common latent subspace while maintaining the local structure of the original sample feature space and consistency with the weighted ensemble, and it optimizes the common subspace by an iterative method to identify cancer subtypes. We conduct experiments on various synthetic datasets and eight public multi-view datasets from The Cancer Genome Atlas. The results demonstrate that subtype-WESLR outperforms competing methods by integrating the base clusterings of existing methods to obtain more precise subtypes.


Subjects
Algorithms , Neoplasms , Cluster Analysis , Humans , Neoplasms/genetics
4.
J Biomed Inform ; 143: 104406, 2023 07.
Article in English | MEDLINE | ID: mdl-37257630

ABSTRACT

Multi-view clustering methods are essential for the stratification of patients into sub-groups of similar molecular characteristics. In recent years, a wide range of methods have been developed for this purpose. However, due to the high diversity of cancer-related data, a single method may not perform sufficiently well in all cases. We present Parea, a multi-view hierarchical ensemble clustering approach for disease subtype discovery. We demonstrate its performance on several machine learning benchmark datasets. We apply and validate our methodology on real-world multi-view patient data, comprising seven types of cancer. Parea outperforms the current state-of-the-art on six out of seven analysed cancer types. We have integrated the Parea method into our Python package Pyrea (https://github.com/mdbloice/Pyrea), which enables the effortless and flexible design of ensemble workflows while incorporating a wide range of fusion and clustering algorithms.


Subjects
Algorithms , Neoplasms , Humans , Cluster Analysis , Neoplasms/genetics , Machine Learning
5.
Brief Bioinform ; 21(5): 1531-1548, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31631226

ABSTRACT

Protein complexes are the fundamental units for many cellular processes. Identifying protein complexes accurately is critical for understanding the functions and organization of cells. With the growth of genome-scale protein-protein interaction (PPI) data for different species, various computational methods focus on identifying protein complexes from PPI networks. In this article, we give a comprehensive and updated review of the state-of-the-art computational methods in the field of protein complex identification, focusing especially on newly developed approaches. The computational methods are organized into three categories: cluster-quality-based methods, node-affinity-based methods and ensemble clustering methods. Furthermore, the advantages and disadvantages of the different methods are discussed, and the performance of 17 state-of-the-art methods is evaluated on two widely used benchmark data sets. Finally, the bottleneck problems in this important field and their potential solutions are discussed.


Subjects
Computational Biology/methods , Proteins/chemistry , Algorithms , Cluster Analysis , Protein Interaction Maps
6.
Sensors (Basel) ; 22(17)2022 Sep 04.
Article in English | MEDLINE | ID: mdl-36081153

ABSTRACT

This paper proposes an innovative methodology for determining how many distinct lifting techniques people with chronic low back pain (CLBP) demonstrate, using camera data collected from 115 participants. The system employs a feature extraction algorithm to calculate the knee, trunk and hip range of motion in the sagittal plane; two clustering approaches for classification, namely Ward's method and a combination of K-means and ensemble clustering; and a Bayesian neural network to validate the results of both clustering approaches. The classification results and effect size show that Ward clustering is the optimal method: precision and recall for all clusters are above 90%, and the overall accuracy of the Bayesian neural network is 97.9%. The statistical analysis reported a significant difference in the range of motion of the knee, hip and trunk between each cluster, F (9, 1136) = 195.67, p < 0.0001. The results of this study suggest that there are four different lifting techniques in people with CLBP. Additionally, the results show that even though the clusters demonstrated similar pain levels, one of the clusters, which uses the least trunk movement and the most knee movement, demonstrates the lowest pain self-efficacy.


Subjects
Low Back Pain , Bayes Theorem , Biomechanical Phenomena , Humans , Lifting , Machine Learning , Self Efficacy
7.
Entropy (Basel) ; 24(10)2022 Sep 21.
Article in English | MEDLINE | ID: mdl-37420344

ABSTRACT

Accurate clustering is a challenging task with unlabeled data. Ensemble clustering aims to combine sets of base clusterings to obtain a better and more stable clustering and has shown its ability to improve clustering accuracy. Dense representation ensemble clustering (DREC) and entropy-based locally weighted ensemble clustering (ELWEC) are two typical methods for ensemble clustering. However, DREC treats each microcluster equally and hence, ignores the differences between each microcluster, while ELWEC conducts clustering on clusters rather than microclusters and ignores the sample-cluster relationship. To address these issues, a divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL) is proposed in this paper. Specifically, the DLWECDL consists of four phases. First, the clusters from the base clustering are used to generate microclusters. Second, a Kullback-Leibler divergence-based ensemble-driven cluster index is used to measure the weight of each microcluster. With these weights, an ensemble clustering algorithm with dictionary learning and the L2,1-norm is employed in the third phase. Meanwhile, the objective function is resolved by optimizing four subproblems and a similarity matrix is learned. Finally, a normalized cut (Ncut) is used to partition the similarity matrix and the ensemble clustering results are obtained. In this study, the proposed DLWECDL was validated on 20 widely used datasets and compared to some other state-of-the-art ensemble clustering methods. The experimental results demonstrated that the proposed DLWECDL is a very promising method for ensemble clustering.
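
For reference, the two quantities named in this abstract have standard textbook definitions, given below. The abstract does not state which distributions the ensemble-driven cluster index compares or which matrix the norm is applied to, so the symbols $P$, $Q$ and $E$ are placeholders, and the row-wise convention for the norm is an assumption (some papers sum over columns instead).

```latex
% Kullback-Leibler divergence between discrete distributions P = (p_k) and Q = (q_k)
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{k} p_k \log \frac{p_k}{q_k}

% L2,1-norm of a matrix E with rows e_i: the sum of the Euclidean norms of its rows
\|E\|_{2,1} = \sum_{i} \|e_i\|_2 = \sum_{i} \sqrt{\sum_{j} E_{ij}^{2}}
```

The L2,1-norm penalizes whole rows at once, which is why it is commonly used when entire samples, or entire atoms in a dictionary, are expected to be noisy or irrelevant.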

8.
Biometrics ; 77(1): 293-304, 2021 03.
Article in English | MEDLINE | ID: mdl-32150282

ABSTRACT

This paper considers the clustering problem of physical step count data recorded on wearable devices. Clustering step data gives an insight into an individual's activity status and further provides the groundwork for health-related policies. However, classical methods, such as K-means clustering and hierarchical clustering, are not suitable for step count data that are typically high-dimensional and zero-inflated. This paper presents a new clustering method for step data based on a novel combination of ensemble clustering and binning. We first construct multiple sets of binned data by changing the size and starting position of the bin, and then merge the clustering results from the binned data using a voting method. The advantage of binning, as a critical component, is that it substantially reduces the dimension of the original data while preserving the essential characteristics of the data. As a result, combining clustering results from multiple binned data can provide an improved clustering result that reflects both local and global structures of the data. Simulation studies and real data analysis were carried out to evaluate the empirical performance of the proposed method and demonstrate its general utility.
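
The binning and voting steps described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the handling of values before the starting position (collected into one leading partial bin), the use of sums within bins, and the assumption that cluster labels are already aligned across clusterings before voting are all choices made here for simplicity.

```python
from collections import Counter


def bin_series(series, size, start=0):
    """Sum consecutive values into bins of `size`, beginning at index `start`.

    Values before `start` are collected into one leading partial bin
    (an assumption made for this sketch).
    """
    bins = []
    if start > 0:
        bins.append(sum(series[:start]))
    for i in range(start, len(series), size):
        bins.append(sum(series[i:i + size]))
    return bins


def majority_vote(label_sets):
    """Per-sample majority vote over label lists assumed to be aligned."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*label_sets)]


# Hypothetical zero-inflated step counts for one person over eight intervals.
steps = [0, 0, 5, 12, 0, 0, 30, 8]
binned_a = bin_series(steps, 4)           # bin size 4, starting at index 0
binned_b = bin_series(steps, 4, start=2)  # same size, shifted start
binned_c = bin_series(steps, 3)           # different bin size

# Hypothetical labels for three samples from clustering three binned data sets.
merged = majority_vote([[0, 0, 1], [0, 1, 1], [0, 0, 1]])
```

Each choice of bin size and starting position yields a different low-dimensional view of the same series, and clustering each view separately is what produces the ensemble that the voting step merges.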


Subjects
Algorithms , Cluster Analysis , Computer Simulation
9.
Molecules ; 26(12)2021 Jun 12.
Article in English | MEDLINE | ID: mdl-34204651

ABSTRACT

The driving forces and conformational pathways leading to amphitropic protein-membrane binding, and in some cases also to protein misfolding and aggregation, are the subject of intensive research. In this study, a chimeric polypeptide, A-Cage-C, derived from α-Lactalbumin is investigated with the aim of elucidating conformational changes promoting interaction with bilayers. From previous studies, it is known that A-Cage-C causes membrane leakages associated with the sporadic formation of amorphous aggregates on solid-supported bilayers. Here we express and purify double-labelled A-Cage-C and prepare partially deuterated bicelles as a membrane mimicking system. We investigate A-Cage-C in the presence and absence of these bicelles at non-binding (pH 7.0) and binding (pH 4.5) conditions. Using in silico analyses, NMR, conformational clustering, and Molecular Dynamics, we provide tentative insights into the conformations of bound and unbound A-Cage-C. The conformation of each state is dynamic and samples a large amount of overlapping conformational space. We identify one of the clusters as likely representing the binding conformation and conclude tentatively that the unfolding around the central W23 segment and its reorientation may be necessary for full intercalation at binding conditions (pH 4.5). We also see evidence for an overall elongation of A-Cage-C in the presence of model bilayers.


Subjects
Oncogene Protein pp60(v-src)/chemistry , Peptide Fragments/chemistry , Peptides/chemistry , Lactalbumin/chemistry , Magnetic Resonance Spectroscopy/methods , Membrane Proteins/chemistry , Membrane Proteins/metabolism , Membranes , Molecular Dynamics Simulation , Oncogene Protein pp60(v-src)/metabolism , Peptide Fragments/metabolism , Peptides/metabolism , Protein Binding , Protein Conformation
10.
J Cancer Res Clin Oncol ; 150(1): 3, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38168012

ABSTRACT

INTRODUCTION: In recent decades, many theories have been proposed about the cause of hereditary diseases such as cancer. However, most studies identify genetic and environmental factors as the most important parameters. It has been shown that gene expression data carry valuable information about hereditary diseases and that their analysis can identify the relationships between these diseases. OBJECTIVE: Damaged genes in various diseases can be identified through the discovery of cell-to-cell biological communications. The extraction of intercellular communications can also identify relationships between different diseases; one example is gene disorders that cause damage to the same cells in both breast and blood cancers. Hence, the purpose is to discover cell-to-cell biological communications in gene expression data. METHODOLOGY: The identification of cell-to-cell biological communications for various cancer diseases has been widely performed by clustering algorithms. However, this field remains open due to the abundance of unprocessed gene expression data. Accordingly, this paper focuses on the development of a semi-supervised ensemble clustering algorithm that can discover relationships between different diseases through the extraction of cell-to-cell biological communications. The proposed clustering framework includes a stratified feature sampling mechanism and a novel similarity metric to deal with high-dimensional data and improve the diversity of primary partitions. RESULTS: The performance of the proposed clustering algorithm is verified with several datasets from the UCI machine learning repository and then applied to the FANTOM5 dataset to extract cell-to-cell biological communications. The version of this dataset used contains 108 cells and 86,427 promoters from 702 samples. The strength of communication between two similar cells from different diseases indicates the relationship between those diseases.
Here, the strength of communication is determined by promoter, so we found the highest cell-to-cell biological communication between "basophils" and "ciliary.epithelial.cells" with 62,809 promoters. CONCLUSION: The maximum cell-to-cell biological similarity in each cluster can be used to detect the relationship between different diseases such as cancer.


Subjects
Hematologic Neoplasms , Neoplasms , Humans , Algorithms , Cluster Analysis , Neoplasms/genetics , Neoplasms/metabolism , Machine Learning , Gene Expression Profiling/methods
11.
Brief Funct Genomics ; 22(4): 329-340, 2023 07 17.
Article in English | MEDLINE | ID: mdl-36848584

ABSTRACT

Single-cell clustering is the most significant part of single-cell RNA sequencing (scRNA-seq) data analysis. One main issue facing scRNA-seq data is noise and sparsity, which poses a great challenge for the development of high-precision clustering algorithms. This study adopts cellular markers to identify differences between cells, which contributes to the feature extraction of single cells. In this work, we propose a high-precision single-cell clustering algorithm, SCMcluster (single-cell cluster using marker genes). This algorithm integrates two cell marker databases (the CellMarker database and the PanglaoDB database) with scRNA-seq data for feature extraction and constructs an ensemble clustering model based on the consensus matrix. We test the efficiency of this algorithm and compare it with eight other popular clustering algorithms on two scRNA-seq datasets derived from human and mouse tissues, respectively. The experimental results show that SCMcluster outperforms the existing methods in both feature extraction and clustering performance. The source code of SCMcluster is available for free at https://github.com/HaoWuLab-Bioinformatics/SCMcluster.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Animals , Humans , Mice , Gene Expression Profiling/methods , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis
12.
Front Genet ; 14: 1183099, 2023.
Article in English | MEDLINE | ID: mdl-37091787

ABSTRACT

Identifying different types of cells in scRNA-seq data is a critical task in single-cell data analysis. In this paper, we propose a method called ProgClust for the decomposition of cell populations and detection of rare cells. ProgClust represents the single-cell data with clustering trees where a progressive searching method is designed to select cell population-specific genes and cluster cells. The obtained trees reveal the structure of both abundant cell populations and rare cell populations. Additionally, it can automatically determine the number of clusters. Experimental results show that ProgClust outperforms the baseline method and is capable of accurately identifying both common and rare cells. Moreover, when applied to real unlabeled data, it reveals potential cell subpopulations which provides clues for further exploration. In summary, ProgClust shows potential in identifying subpopulations of complex single-cell data.

13.
Int J Data Sci Anal ; 14(3): 305-318, 2022.
Article in English | MEDLINE | ID: mdl-35528805

ABSTRACT

This paper describes an ensemble cluster analysis of bivariate profiles of HIV biomarkers, viral load and CD4 cell counts, which jointly measure disease progression. Data are from a prevalent cohort of HIV positive participants in a clinical trial of vitamin supplementation in Botswana. These individuals were HIV positive upon enrollment, but with unknown times of infection. To categorize groups of participants based on their patterns of progression of HIV infection using both biomarkers, we combine univariate shape-based cluster results for multiple biomarkers through the use of ensemble clustering methods. We first describe univariate clustering for each of the individual biomarker profiles, and make use of shape-respecting distances for clustering the longitudinal profile data. In our data, profiles are subject to either missing or irregular measurements as well as unobserved initiation times of the process of interest. Shape-respecting distances that can handle such data issues, preserve time-ordering, and identify similar profile shapes are useful in identifying patterns of disease progression from longitudinal biomarker data. However, their performance with regard to clustering differs by severity of the data issues mentioned above. We provide an empirical investigation of shape-respecting distances (Fréchet and dynamic time warping (DTW)) on benchmark shape data, and use DTW in cluster analysis of biomarker profile observations. These reveal a primary group of 'typical progressors,' as well as a smaller group that shows relatively rapid progression. We then refine the analysis using ensemble clustering for both markers to obtain a single classification. The information from joint evaluation of the two biomarkers combined with ensemble clustering reveals subgroups of patients not identifiable through univariate analyses; noteworthy subgroups are those that appear to represent recently and chronically infected subsets. 
Supplementary Information: The online version contains supplementary material available at 10.1007/s41060-022-00323-2.

14.
Comput Biol Med ; 136: 104631, 2021 09.
Article in English | MEDLINE | ID: mdl-34273770

ABSTRACT

The Spike receptor binding domain (S-RBD) of SARS-CoV-2, a crucial protein for the entrance of the virus into target cells, is known to cause infection by binding to a cell surface protein. Hence, identifying therapeutics that act on the S-RBD of SARS-CoV-2 may offer a significant way to block viral entry into host cells. Herein, through in-silico approaches (molecular docking, molecular dynamics (MD) simulations, and end-state thermodynamics), we aimed to screen natural molecules from different plants for their ability to inhibit the S-RBD of SARS-CoV-2. We prioritized the best interacting molecules (Diacetylcurcumin and Dicaffeoylquinic acid) by analysis of protein-ligand interactions and subjected them to long-term MD simulations. We found that Dicaffeoylquinic acid interacted prominently with essential residues (Lys417, Gln493, Tyr489, Phe456, Tyr473, and Glu484) of S-RBD. These residues are involved in interactions between S-RBD and ACE2, so binding to them could inhibit viral entry into host cells. The in-silico analyses indicated that Dicaffeoylquinic acid and Diacetylcurcumin might have the potential to act as inhibitors of SARS-CoV-2 S-RBD. The present study warrants further in-vitro and in-vivo studies of Dicaffeoylquinic acid and Diacetylcurcumin for validation and acceptance of their inhibitory potential against the S-RBD of SARS-CoV-2.


Subjects
Angiotensin-Converting Enzyme 2/antagonists & inhibitors , Antiviral Agents , COVID-19 , Phytochemicals/pharmacology , Coronavirus Spike Glycoprotein , Antiviral Agents/pharmacology , Humans , Molecular Docking Simulation , Molecular Dynamics Simulation , Protein Binding , SARS-CoV-2 , Coronavirus Spike Glycoprotein/antagonists & inhibitors
15.
Front Genet ; 12: 627964, 2021.
Article in English | MEDLINE | ID: mdl-34262590

ABSTRACT

Chronic lymphocytic leukemia (CLL) is the most common form of adult leukemia in the Western world with a highly variable clinical course. Its striking genetic heterogeneity is not yet fully understood. Although the CLL genetic landscape has been well-described, patient stratification based on mutation profiles remains elusive mainly due to the heterogeneity of data. Here we attempted to decrease the heterogeneity of somatic mutation data by mapping mutated genes in the respective biological processes. From the sequencing data gathered by the International Cancer Genome Consortium for 506 CLL patients, we generated pathway mutation scores, applied ensemble clustering on them, and extracted abnormal molecular pathways with a machine learning approach. We identified four clusters differing in pathway mutational profiles and time to first treatment. Interestingly, common CLL drivers such as ATM or TP53 were associated with particular subtypes, while others like NOTCH1 or SF3B1 were not. This study provides an important step in understanding mutational patterns in CLL.

16.
Environ Sci Pollut Res Int ; 28(30): 40746-40755, 2021 Aug.
Article in English | MEDLINE | ID: mdl-32632685

ABSTRACT

Air pollution can have severe effects on human health. Because it contributes to serious respiratory and other lung diseases, studying air pollution is important. One way to address this issue is to apply clustering techniques. Clustering algorithms face three main problems. First, the exact shape of the clusters and the number of clusters that the input data can produce are unknown. Second, it is not clear how to choose an appropriate algorithm for a particular problem. Finally, multiple runs of the same algorithm lead to different solutions because of factors such as the random initialization of cluster heads. Ensemble algorithms can handle these problems and reduce the bias and variance of the traditional clustering process. Ensemble approaches to clustering have not yet been studied adequately. In this paper, we use an enhanced ensemble clustering method to cluster pollution levels. This study helps in taking the preventive measures needed to control further contamination and reduce alarming levels, and in analyzing the results to find healthy and unhealthy regions in a given area. The ensemble technique also accounts for uncertain objects found during clustering. A distinct advantage of this algorithm is that it requires no prior information about the data. The experiment shows that the implemented ensemble consensus clustering performs better than basic clustering algorithms.


Subjects
Air Pollution , Algorithms , Cluster Analysis , Environmental Pollution , Humans
17.
Algorithms Mol Biol ; 15: 3, 2020.
Article in English | MEDLINE | ID: mdl-32082410

ABSTRACT

BACKGROUND: Advances in molecular biology have resulted in big and complicated data sets; therefore, a clustering approach that is able to capture the actual structure and the hidden patterns of the data is required. Moreover, the geometric space may not reflect the actual similarity between the different objects. As a result, in this research we use a clustering-based space that converts the geometric space of the molecular data to a categorical space based on clustering results. We then use this space to develop a new classification algorithm. RESULTS: In this study, we propose a new classification method named GrpClassifierEC that replaces the given data space with a categorical space based on ensemble clustering (EC). The EC space is defined by tracking the membership of the points over multiple runs of clustering algorithms. Different points that were included in the same clusters are represented as a single point. Our algorithm classifies all these points as a single class. The similarity between two objects is defined as the number of times that these objects did not belong to the same cluster. In order to evaluate our suggested method, we compare its results to the k nearest neighbors, Decision tree and Random forest classification algorithms on several benchmark datasets. The results confirm that the suggested new algorithm GrpClassifierEC outperforms the other algorithms. CONCLUSIONS: Our algorithm can be integrated with many other algorithms. In this research, we use only the k-means clustering algorithm with different k values. For future research, we propose several directions: (1) checking the effect of the clustering algorithm used to build the ensemble clustering space, (2) finding poor clustering results based on the training data, and (3) reducing the volume of the data by combining similar points based on the EC. AVAILABILITY AND IMPLEMENTATION: The KNIME workflow, implementing GrpClassifierEC, is available at https://malikyousef.com.
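
The construction of the EC space can be sketched in a few lines. This is an illustration of the description in the abstract, not the GrpClassifierEC KNIME workflow: the function names are hypothetical, and the cluster labels in `runs` are made-up stand-ins for the output of several k-means runs with different k.

```python
from collections import defaultdict


def ec_space(runs):
    """Map each point to its tuple of cluster labels, one entry per clustering run."""
    return list(zip(*runs))


def group_identical(points):
    """Collapse points that share a cluster in every run into a single group."""
    groups = defaultdict(list)
    for idx, signature in enumerate(points):
        groups[signature].append(idx)
    return dict(groups)


def mismatch_count(a, b):
    """Number of runs in which two points fall into different clusters."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical labels for five points from three clustering runs.
runs = [
    [0, 0, 1, 1, 1],  # assumed run with k = 2
    [0, 0, 1, 1, 2],  # assumed run with k = 3
    [2, 2, 0, 0, 1],  # assumed run with k = 3 and different label names
]
points = ec_space(runs)
groups = group_identical(points)
```

Here points 0 and 1 collapse into one group, as do points 2 and 3, so a classifier working in this space handles three representatives instead of five points. The mismatch count behaves like a Hamming distance between membership tuples: a value of 0 means two points were never separated.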

18.
MethodsX ; 7: 100916, 2020.
Article in English | MEDLINE | ID: mdl-32477894

ABSTRACT

Evidence accumulation clustering (EAC) is an ensemble clustering algorithm that can cluster data for arbitrary shapes and numbers of clusters. Here, we present a variant of EAC in which we aimed to better cluster data with a large number of features, many of which may be uninformative. Our new method builds on the existing EAC algorithm by populating the clustering ensemble with clusterings based on combinations of fewer features than the original dataset at a time. Our method also calls for prewhitening the recombined data and weighting the influence of each individual clustering by an estimate of its informativeness. We provide code of an example implementation of the algorithm in Matlab and demonstrate its effectiveness compared to ordinary evidence accumulation clustering with synthetic data.
• The clustering ensemble is made by clustering on subset combinations of features from the data.
• The recombined data may be prewhitened.
• Evidence accumulation can be improved by weighting the evidence with a goodness-of-clustering measure.
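
For orientation, ordinary (unweighted) evidence accumulation can be sketched as below. This is not the authors' Matlab variant: it omits feature subsetting, prewhitening and weighting, and it replaces a full hierarchical clustering with a single-linkage cut at a fixed threshold, which is an assumption made to keep the example short. The base partitions are made-up inputs.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base clusterings in which each pair of points shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_labels(co, threshold=0.5):
    """Single-linkage cut: link pairs above `threshold`, return connected components."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical base clusterings of six points; label names differ between runs.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
co = co_association(partitions)
labels = consensus_labels(co)
```

Because the co-association matrix only records whether two points were grouped together, the base clusterings need not use the same label names or the same number of clusters.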

19.
Front Genet ; 11: 572242, 2020.
Article in English | MEDLINE | ID: mdl-33329710

ABSTRACT

Advances in technology have made it convenient to obtain large amounts of single-cell RNA sequencing (scRNA-seq) data. Since clustering is a very important step in identifying or defining cellular phenotypes, many clustering approaches have recently been developed for these applications. The general methods can be roughly divided into normal clustering methods and integrated (ensemble) clustering methods, which combine more than two normal clustering methods with the aim of obtaining more informative results. To contrast it with integrated clustering algorithms, a normal clustering method is often called an individual or base clustering method. Note that many individual clustering methods are developed to capture one aspect of the data, and their results depend on the initial parameter settings, such as the cluster number and the distance metric. Compared with individual clustering, integrative clustering may achieve more accurate results, but these results depend on the base clustering results, and integrated systems are often not self-regulating. Therefore, designing a robust unsupervised clustering method remains a challenge. To tackle the above limitations, we propose a novel Ensemble Clustering algorithm based on a Probability Graphical Model with Graph Regularization, called EC-PGMGR for short. On the one hand, we use parameter control in the Probability Graphical Model (PGM) to automatically determine the cluster number without prior knowledge. On the other hand, we add a regularization term to reduce the effect of weak base clustering results. In particular, the results collected from the base clustering methods are combined using self-regulating weights obtained through a pre-learning process, which strengthens the effect of active clustering methods while weakening the effect of inactive ones.
Experiments are carried out on 7 data sets generated by different platforms, with the number of single cells ranging from 822 to 5,132. Results show that EC-PGMGR performs better than 4 alternative individual clustering methods and 2 ensemble methods in terms of accuracy, measured by the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), as well as robustness and effectiveness. EC-PGMGR provides an effective way to integrate different clustering results into more accurate and reliable results for further biological analysis. It may also provide new insights for other applications of clustering.
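
As a reminder of what the ARI measures, the standard pair-counting formula can be written directly in a few lines. This is a generic sketch, unrelated to the EC-PGMGR code; the function name is illustrative, and it assumes at least two samples. Returning 1.0 in the degenerate case where the denominator vanishes is a convention adopted here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # every true pair is split
```

The index is 1 for identical partitions regardless of label names, close to 0 for agreement at chance level, and negative when the two partitions agree less than chance would predict, as in the second example.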

20.
Brain Connect ; 10(4): 183-194, 2020 05.
Article in English | MEDLINE | ID: mdl-32264696

ABSTRACT

This work addresses the problem of constructing a unified, topologically optimal connectivity-based brain atlas. The proposed approach aggregates an ensemble partition from individual parcellations without label agreement, providing a balance between sufficiently flexible individual parcellations and intuitive representation of the average topological structure of the connectome. The methods exploit a previously proposed dense connectivity representation, first performing graph-based hierarchical parcellation of individual brains, and subsequently aggregating the individual parcellations into a consensus parcellation. The search for consensus, based on the hard ensemble (HE) algorithm, approximately minimizes the sum of cluster membership distances, effectively estimating a pseudo-Karcher mean of individual parcellations. Computational stability, graph structure preservation, and biological relevance of the simplified representation resulting from the proposed parcellation are assessed on the Human Connectome Project data set. These aspects are assessed using (1) edge weight distribution divergence with respect to the dense connectome representation, (2) interhemispheric symmetry, (3) network characteristics' stability and agreement with respect to individually and anatomically parcellated networks, and (4) performance of the simplified connectome in a biological sex classification task. Ensemble parcellation was found to be highly stable with respect to subject sampling, outperforming anatomical atlases and other connectome-based parcellations in classification as well as preserving global connectome properties. The HE-based parcellation also showed a degree of symmetry comparable with anatomical atlases and a high degree of spatial contiguity without using explicit priors.


Subjects
Brain/anatomy & histology , Diffusion Magnetic Resonance Imaging/methods , Nerve Net/anatomy & histology , Neuroimaging/methods , Adult , Atlases as Topic , Brain/diagnostic imaging , Connectome , Diffusion Magnetic Resonance Imaging/standards , Female , Humans , Computer-Assisted Image Interpretation/methods , Computer-Assisted Image Processing/methods , Male , Nerve Net/diagnostic imaging , Neuroimaging/standards , Young Adult