Results 1 - 20 of 401
1.
Genet Epidemiol ; 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38533840

ABSTRACT

Copy number variants (CNVs) are prevalent in the human genome and have a profound effect on genomic organization and human disease. Discovering disease-associated CNVs is critical for understanding disease pathogenesis and aiding diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risk adopt a two-stage strategy, first conducting quantitative CNV measurements and then testing for association, which can lead to biased association estimates and low statistical power and remains a major barrier to routine genome-wide assessment of such variation. In this article, we developed One-Stage CNV-disease Association Analysis (OSCAA), a flexible algorithm to discover disease-associated CNVs for both quantitative and qualitative traits. OSCAA employs a two-dimensional Gaussian mixture model built upon the principal components (PCs) of copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for an effect on the outcome trait. In OSCAA, CNVs are identified and their associations with disease risk are evaluated in a single step, taking the uncertainty of CNV identification into account in the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one-stage method and traditional two-stage methods by yielding more accurate estimates of the CNV-disease association, especially for short CNVs or CNVs with weak signals. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, which can easily be applied to different traits and clinical risk predictions.
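The core ingredient OSCAA builds on, a Gaussian mixture over the top principal components of copy-number intensities with soft rather than hard CNV calls, can be sketched as follows. The data, the carrier frequency, and all parameter choices are invented for illustration; this is the generic building block, not the authors' one-stage association model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, p = 300, 10                                  # samples x probes in a CNV region
carrier = rng.random(n) < 0.3                   # hypothetical deletion carriers
# carriers show lower mean probe intensity across the region
X = rng.normal(0.0, 1.0, (n, p)) + np.where(carrier, -2.0, 0.0)[:, None]

pcs = PCA(n_components=2).fit_transform(X)      # 2-D summary of intensities
gmm = GaussianMixture(n_components=2, random_state=0).fit(pcs)
labels = gmm.predict(pcs)
mean_int = X.mean(axis=1)
# the component with the lower mean raw intensity is the carrier group
carrier_comp = int(np.argmin([mean_int[labels == k].mean() for k in (0, 1)]))
p_carrier = gmm.predict_proba(pcs)[:, carrier_comp]   # soft CNV calls
```

Keeping `p_carrier` as a probability, instead of thresholding it, is what lets a downstream association model propagate CNV-call uncertainty, which is the point of the one-stage design.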

2.
Biostatistics ; 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39002144

ABSTRACT

High-dimensional omics data often contain intricate and multifaceted information, resulting in the coexistence of multiple plausible sample partitions based on different subsets of selected features. Conventional clustering methods typically yield only one clustering solution, limiting their capacity to fully capture all facets of cluster structures in high-dimensional data. To address this challenge, we propose a model-based multifacet clustering (MFClust) method based on a mixture of Gaussian mixture models, where the former mixture achieves facet assignment for gene features and the latter mixture determines cluster assignment of samples. We demonstrate superior facet and cluster assignment accuracy of MFClust through simulation studies. The proposed method is applied to three transcriptomic applications from postmortem brain and lung disease studies. The result captures multifacet clustering structures associated with critical clinical variables and provides intriguing biological insights for further hypothesis generation and discovery.

3.
Biostatistics ; 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38637995

ABSTRACT

Computed tomography (CT) has been a powerful diagnostic tool since its emergence in the 1970s. Using CT data, 3D structures of human internal organs and tissues, such as blood vessels, can be reconstructed with professional software. This 3D reconstruction is crucial for surgical operations and can serve as a vivid medical teaching example. However, traditional 3D reconstruction relies heavily on manual operations, which are time-consuming, subjective, and require substantial experience. To address this problem, we develop a novel semiparametric Gaussian mixture model tailored for the 3D reconstruction of blood vessels. This model extends the classical Gaussian mixture model by allowing the component-wise parameters of interest to vary nonparametrically with voxel position. We develop a kernel-based expectation-maximization algorithm for estimating the model parameters, accompanied by a supporting asymptotic theory. Furthermore, we propose a novel regression method for optimal bandwidth selection, which outperforms the conventional cross-validation (CV) approach in both computational and statistical efficiency. In application, this methodology enables fully automated reconstruction of 3D blood vessel structures with remarkable accuracy.

4.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36592058

ABSTRACT

The progress of single-cell RNA sequencing (scRNA-seq) has produced a large volume of scRNA-seq data, which are widely used in biomedical research. The noise in the raw data and the tens of thousands of genes make it challenging to capture the real structure and effective information of scRNA-seq data. Most existing single-cell analysis methods assume that the low-dimensional embedding of the raw data belongs to a Gaussian distribution or a low-dimensional nonlinear space without any prior information, which greatly limits the flexibility and controllability of the model. In addition, many existing methods have a high computational cost, which makes them difficult to apply to large-scale datasets. Here, we design and develop a deep generative model named Gaussian mixture adversarial autoencoders (scGMAAE), which assumes that the low-dimensional embeddings of different types of cells follow different Gaussian distributions and integrates Bayesian variational inference and adversarial training, so as to give an interpretable latent representation of complex data and discover the statistical distribution of different types of cells. scGMAAE offers good controllability, interpretability and scalability; it can therefore process large-scale datasets in a short time and give competitive results. scGMAAE outperforms existing methods in several ways, including dimensionality reduction visualization, cell clustering, differential expression analysis and batch effect removal. Importantly, compared with most deep learning methods, scGMAAE requires fewer iterations to generate its best results.


Subjects
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Normal Distribution , Bayes Theorem , Single-Cell Analysis/methods , Cluster Analysis
5.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36653899

ABSTRACT

Gene regulatory networks govern complex gene expression programs in various biological phenomena, including embryonic development, cell fate decisions and oncogenesis. Single-cell techniques are increasingly being used to study gene expression, providing higher resolution than traditional approaches. However, inferring a comprehensive gene regulatory network across different cell types remains a challenge. Here, we propose to construct context-dependent gene regulatory networks (CDGRNs) from single-cell RNA sequencing data utilizing both spliced and unspliced transcript expression levels. A gene regulatory network is decomposed into subnetworks corresponding to different transcriptomic contexts. Each subnetwork comprises the consensus active regulation pairs of transcription factors and their target genes shared by a group of cells, inferred by a Gaussian mixture model. We find that the union of gene regulation pairs in all contexts is sufficient to reconstruct differentiation trajectories. Functions specific to the cell cycle, cell differentiation or tissue-specific functions are enriched throughout the developmental process in each context. Surprisingly, we also observe that the network entropy of CDGRNs decreases along differentiation trajectories, indicating directionality in differentiation. Overall, CDGRN allows us to establish the connection between gene regulation at the molecular level and cell differentiation at the macroscopic level.


Subjects
Embryonic Development , Gene Regulatory Networks , Cell Differentiation/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Expression Profiling
6.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37080761

ABSTRACT

Advances in spatially resolved transcriptomics (ST) technologies help biologists comprehensively understand organ function and the tissue microenvironment. Accurate spatial domain identification is the foundation for delineating genome heterogeneity and cellular interaction. Motivated by this perspective, a graph deep learning (GDL) based spatial clustering approach is constructed in this paper. First, a deep graph infomax module embedded with a residual gated graph convolutional neural network is leveraged to process the gene expression profiles and spatial positions in ST. Then, a Bayesian Gaussian mixture model is applied to the latent embeddings to generate spatial domains. Designed experiments demonstrate that the presented method is superior to other state-of-the-art GDL-enabled techniques on multiple ST datasets. The code and datasets used in this manuscript are available at https://github.com/narutoten520/SCGDL.


Subjects
Deep Learning , Transcriptome , Bayes Theorem , Gene Expression Profiling , Cell Communication
7.
BMC Bioinformatics ; 25(1): 90, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38429687

ABSTRACT

RNA sequencing of time-course experiments results in three-way count data whose dimensions are the genes, the time points and the biological units. Clustering RNA-seq data allows the extraction of groups of co-expressed genes over time. After standardisation, the normalised counts of individual genes across time points and biological units have properties similar to compositional data. We propose the following procedure to cluster three-way RNA-seq data: (1) pre-process the RNA-seq data by calculating the normalised expression profiles, (2) transform the data using the additive log-ratio transform to map the composition in the D-part Aitchison simplex to a (D-1)-dimensional Euclidean vector, (3) cluster the transformed RNA-seq data using matrix-variate Gaussian mixture models and (4) assess the quality of the overall cluster solution and of individual clusters, based on cluster separation in the transformed space using density-based silhouette information and on compactness of the clusters in the original space using cluster maps as a suitable visualisation. The proposed procedure is illustrated on RNA-seq data from fission yeast, and results are also compared to an analogous two-way approach after flattening out the biological units.


Subjects
RNA , RNA/genetics , Sequence Analysis, RNA/methods , RNA-Seq , Base Sequence , Cluster Analysis
8.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34698349

ABSTRACT

Target identification of small molecules is an important and still challenging task in drug discovery, especially for botanical drug development. An indistinct understanding of ligand-protein interactions is one of the main obstacles to drug repurposing and the identification of off-targets. In this study, we collected 9063 crystal structures of ligand-binding proteins released from January 1995 to April 2021 in the PDB, and split the complexes into 5133 interaction pairs of ligand atoms and protein fragments (covalently linked three heavy atoms) with interatomic distance ≤5 Å. The interaction pairs were grouped into ligand atoms with the same SYBYL atom type surrounding each type of protein fragment, and further clustered via a Bayesian Gaussian mixture model (BGMM). Gaussian distributions with ≥20 ligand atoms were identified as significant interaction patterns. The reliability of the significant interaction patterns was validated by comparing the number of significant interaction patterns between docked poses with higher and lower similarity to the native crystal structures. Fifty-one candidate targets of brucine, strychnine and icajine involved in Semen Strychni (Mǎ Qián Zǐ) and eight candidate targets of astragaloside-IV, formononetin and calycosin-7-glucoside involved in Astragalus (Huáng Qí) were predicted by the significant interaction patterns in combination with docking, consistent with the therapeutic effects of Semen Strychni and Astragalus for cancer and chronic pain. The new strategy in this study improves the accuracy of target identification for small molecules, which will facilitate the discovery of botanical drugs.


Subjects
Bayes Theorem , Ligands , Protein Binding , Reproducibility of Results
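The clustering step described above can be imitated on synthetic geometry: 3-D offsets of ligand atoms around a fragment are clustered with a Bayesian Gaussian mixture, and components holding at least 20 atoms are kept as "significant interaction patterns". The coordinates and cluster locations are made up; only the ≥20-atom rule comes from the abstract.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
pts = np.vstack([                               # 3-D ligand-atom offsets (made up)
    rng.normal([3.0, 0.0, 0.0], 0.3, (120, 3)), # contact geometry A
    rng.normal([0.0, 3.5, 1.0], 0.3, (80, 3)),  # contact geometry B
    rng.uniform(-5.0, 5.0, (30, 3)),            # diffuse background
])
bgmm = BayesianGaussianMixture(n_components=8, max_iter=500,
                               random_state=0).fit(pts)
labels = bgmm.predict(pts)
sizes = np.bincount(labels, minlength=8)
significant = np.flatnonzero(sizes >= 20)       # patterns with >= 20 atoms
```

The variational Bayesian mixture is a natural fit here because it prunes unused components on its own, so the number of interaction patterns need not be fixed in advance.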
9.
Graefes Arch Clin Exp Ophthalmol ; 262(6): 1819-1828, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38446204

ABSTRACT

PURPOSE: The aim of this study is to investigate the distribution of spherical equivalent and axial length in the general population and to analyze the influence of education on spherical equivalent with a focus on ocular biometric parameters. METHODS: The Gutenberg Health Study is a population-based cohort study in Mainz, Germany. Participants underwent comprehensive ophthalmologic examinations as part of the 5-year follow-up examination in 2012-2017, including genotyping. The spherical equivalent and axial length distributions were modeled with Gaussian mixture models. Regression analysis (on the person-individual level) was performed to analyze associations between biometric parameters and educational factors. Mendelian randomization analysis explored the causal effect between spherical equivalent, axial length, and education. Additionally, effect mediation analysis examined the link between spherical equivalent and education. RESULTS: A total of 8532 study participants were included (median age: 57 years, 49% female). The distribution of spherical equivalent and axial length follows a bi-Gaussian function, partially explained by the length of education (i.e., <11 years education vs. 11-20 years). Mendelian randomization indicated an effect of education on refractive error using a genetic risk score of education as an instrument variable (-0.35 diopters per SD increase in the instrument, 95% CI: -0.64 to -0.05, p = 0.02) and an effect of education on axial length (0.63 mm per SD increase in the instrument, 95% CI: 0.22 to 1.04, p = 0.003). Spherical equivalent, axial length and anterior chamber depth were associated with length of education in regression analyses. Mediation analysis revealed that the association between spherical equivalent and education is mainly driven (70%) by alteration in axial length. CONCLUSIONS: The distribution of axial length and spherical equivalent is represented by subgroups of the population (bi-Gaussian). This distribution can be partially explained by length of education. The impact of education on spherical equivalent is mainly driven by alteration in axial length.


Subjects
Axial Length, Eye , Educational Status , Humans , Female , Male , Middle Aged , Germany/epidemiology , Axial Length, Eye/pathology , Normal Distribution , Biometry/methods , Refraction, Ocular/physiology , Follow-Up Studies , Refractive Errors/physiopathology , Refractive Errors/diagnosis , Refractive Errors/genetics , Aged , Adult
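A bi-Gaussian decomposition of the kind used above for the spherical-equivalent distribution can be reproduced with a two-component 1-D mixture. The diopter values below are simulated (a hypothetical emmetropic majority plus a myopic subgroup), not Gutenberg Health Study data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
se = np.r_[rng.normal(0.3, 1.0, 6000),          # emmetropic peak (diopters)
           rng.normal(-3.0, 2.0, 2500)]         # myopic subgroup
gmm = GaussianMixture(n_components=2, random_state=0).fit(se.reshape(-1, 1))
means = np.sort(gmm.means_.ravel())             # [myopic mean, emmetropic mean]
weights = gmm.weights_                          # estimated subgroup proportions
```

The fitted component weights give the relative sizes of the two population subgroups, which is what makes the mixture decomposition interpretable in this setting.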
10.
Sensors (Basel) ; 24(10)2024 May 08.
Article in English | MEDLINE | ID: mdl-38793838

ABSTRACT

Collaborative crowdsensing is a team collaboration model that harnesses the intelligence of a large network of participants, primarily applied in areas such as intelligent computing, federated learning, and blockchain. Unlike traditional crowdsensing, user recruitment in collaborative crowdsensing considers not only the individual capabilities of users but also their collaborative abilities. In this context, this paper models user interactions as a graph, transforming the recruitment challenge into a graph theory problem. The methodology employs an enhanced Prim algorithm to identify optimal team members by finding the maximum spanning tree within the user interaction graph. After recruitment, the collaborative crowdsensing explored in this paper faces a challenge of unfair incentives due to users engaging in free-riding behavior. To address these challenges, the paper introduces the MR-SVIM mechanism. The process begins with a Gaussian mixture model predicting the quality of users' tasks, combined with historical reputation values to calculate their direct reputation. Subsequently, to assess users' significance within the team, aggregation functions and the improved PageRank algorithm are employed for local and global influence evaluation, respectively. Indirect reputation is determined based on users' importance and similarity with interacting peers. Considering the comprehensive reputation value derived from the combined assessment of direct and indirect reputations, and integrating the collaborative capabilities among users, we formulate a feature function for contribution. This function is applied within an enhanced Shapley value method to assess the relative contributions of each user, achieving a more equitable distribution of earnings. Finally, experiments conducted on real datasets validate the fairness of this mechanism.

11.
Sensors (Basel) ; 24(8)2024 Apr 21.
Article in English | MEDLINE | ID: mdl-38676261

ABSTRACT

This study aimed to use a data-driven approach to identify individualized speed thresholds to characterize running demands and athlete workload during games and practices in skill and linemen football players. Data were recorded from wearable sensors over 28 sessions from 30 male Canadian varsity football athletes using a global positioning system, resulting in a total of 287 performances analyzed, including 137 games and 150 practices. Speed zones were identified for each performance by fitting a 5-dimensional Gaussian mixture model (GMM) corresponding to 5 running intensity zones from minimal (zone 1) to maximal (zone 5). Skill players had significantly higher (p < 0.001) speed thresholds, percentage of time spent, and distance covered in maximal intensity zones compared to linemen. The distance covered in game settings was significantly higher (p < 0.001) compared to practices. This study highlighted the use of individualized speed thresholds to determine running intensity and athlete workloads for American and Canadian football athletes, and to compare running performances between practice and game scenarios. This approach can be used to monitor physical workload in athletes with respect to their tactical positions during practices and games, and to ensure that athletes are adequately trained to meet in-game physical demands.


Subjects
Athletes , Running , Humans , Running/physiology , Male , Canada , Athletic Performance/physiology , Geographic Information Systems , Young Adult , Football/physiology , Adult , Soccer/physiology
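A minimal version of the zone-identification step described above: fit a five-component GMM to a session's speed trace, order the components by mean, and read individualized thresholds off the speeds where the zone assignment changes. The speed distributions are invented stand-ins for GPS data, not the study's recordings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
speed = np.concatenate([                        # m/s, one simulated session
    rng.normal(0.5, 0.2, 4000),                 # standing / walking
    rng.normal(2.0, 0.4, 2500),                 # jogging
    rng.normal(3.5, 0.4, 1500),                 # running
    rng.normal(5.0, 0.4, 700),                  # high-speed running
    rng.normal(7.0, 0.5, 300),                  # sprinting
]).clip(min=0)

gmm = GaussianMixture(n_components=5, random_state=0).fit(speed.reshape(-1, 1))
order = np.argsort(gmm.means_.ravel())          # zones 1..5, slow to fast
g = np.linspace(0.0, 10.0, 2001)
zone = gmm.predict_proba(g.reshape(-1, 1))[:, order].argmax(axis=1)
# individualized threshold into zone k+1 = first speed assigned to it
thresholds = [float(g[zone == k][0]) for k in range(1, 5)]
```

Because the mixture is refit per athlete-session, the resulting thresholds adapt to each athlete rather than relying on fixed league-wide speed bands.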
12.
Sensors (Basel) ; 24(7)2024 Mar 28.
Article in English | MEDLINE | ID: mdl-38610387

ABSTRACT

In the realm of road safety and the evolution toward automated driving, Advanced Driver Assistance and Automated Driving (ADAS/AD) systems play a pivotal role. As the complexity of these systems grows, comprehensive testing becomes imperative, with virtual test environments becoming crucial, especially for handling diverse and challenging scenarios. Radar sensors are integral to ADAS/AD units and are known for their robust performance even in adverse conditions. However, accurately modeling the radar's perception, particularly the radar cross-section (RCS), proves challenging. This paper adopts a data-driven approach, using Gaussian mixture models (GMMs) to model the radar's perception for various vehicles and aspect angles. A Bayesian variational approach automatically infers model complexity. The model is expanded into a comprehensive radar sensor model based on object lists, incorporating occlusion effects and RCS-based detectability decisions. The model's effectiveness is demonstrated through accurate reproduction of the RCS behavior and scatter point distribution. The full capabilities of the sensor model are demonstrated in different scenarios. The flexible and modular framework has proven apt for modeling specific aspects and allows for an easy model extension. Simultaneously, alongside model extension, more extensive validation is proposed to refine accuracy and broaden the model's applicability.

13.
Entropy (Basel) ; 26(8)2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39202129

ABSTRACT

We calculate the average differential entropy of a q-component Gaussian mixture in R^n. For simplicity, all components have covariance matrix σ²I, while the means {W_i}_{i=1}^{q} are i.i.d. Gaussian vectors with zero mean and covariance s²I. We obtain a series expansion in µ = s²/σ² for the average differential entropy up to order O(µ²), and we provide a recipe to calculate higher-order terms. Our result provides an analytic approximation with a quantifiable order of magnitude for the error, which is not achieved in previous literature.
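The quantity studied above can be checked numerically: a Monte Carlo estimate of the differential entropy -E[log p(X)] of a q-component mixture with component covariance σ²I must approach the single-Gaussian entropy (n/2)·log(2πeσ²) as µ = s²/σ² → 0. The parameter values below are arbitrary small-µ choices, not the paper's.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(5)
q, n, sigma, s = 4, 3, 1.0, 0.05                # small mu = s^2 / sigma^2
W = rng.normal(0.0, s, (q, n))                  # i.i.d. Gaussian component means
N = 200_000
comp = rng.integers(0, q, N)
X = W[comp] + rng.normal(0.0, sigma, (N, n))    # draws from the mixture

# log p(x): log-sum-exp over q equally weighted isotropic Gaussians
d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1) / (2 * sigma**2)
log_p = logsumexp(-d2, axis=1) - 0.5 * n * np.log(2 * np.pi * sigma**2) - np.log(q)
H_mc = -log_p.mean()                            # Monte Carlo differential entropy
H_single = 0.5 * n * np.log(2 * np.pi * np.e * sigma**2)
```

Monte Carlo gives only a stochastic point estimate per draw of the means; the paper's series expansion in µ is precisely what replaces this with an analytic approximation carrying an explicit error order.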

14.
Entropy (Basel) ; 26(7)2024 Jul 10.
Article in English | MEDLINE | ID: mdl-39056952

ABSTRACT

While collecting training data, even with the manual verification of experts from crowdsourcing platforms, eliminating incorrect annotations (noisy labels) completely is difficult and expensive. In dealing with datasets that contain noisy labels, over-parameterized deep neural networks (DNNs) tend to overfit, leading to poor generalization and classification performance. As a result, noisy label learning (NLL) has received significant attention in recent years. Existing research shows that although DNNs eventually fit all training data, they first prioritize fitting clean samples, then gradually overfit to noisy samples. Mainstream methods utilize this characteristic to divide training data but face two issues: class imbalance in the segmented data subsets and the optimization conflict between unsupervised contrastive representation learning and supervised learning. To address these issues, we propose a Balanced Partitioning and Training framework with Pseudo-Label Relaxed contrastive loss called BPT-PLR, which includes two crucial processes: a balanced partitioning process with a two-dimensional Gaussian mixture model (BP-GMM) and a semi-supervised oversampling training process with a pseudo-label relaxed contrastive loss (SSO-PLR). The former utilizes both semantic feature information and model prediction results to identify noisy labels, introducing a balancing strategy to maintain class balance in the divided subsets as much as possible. The latter adopts the latest pseudo-label relaxed contrastive loss to replace unsupervised contrastive loss, reducing optimization conflicts between semi-supervised and unsupervised contrastive losses to improve performance. We validate the effectiveness of BPT-PLR on four benchmark datasets in the NLL field: CIFAR-10/100, Animal-10N, and Clothing1M. Extensive experiments comparing with state-of-the-art methods demonstrate that BPT-PLR can achieve optimal or near-optimal performance.
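The partitioning idea above in miniature: small-loss samples are likely clean, so fit a mixture to per-sample losses and call the component with the smaller mean "clean". BP-GMM does this in two dimensions (semantic features plus predictions) with a balancing strategy; this sketch shows only the common one-dimensional version on simulated losses.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
clean_truth = rng.random(5000) < 0.7            # 70% correctly labelled
loss = np.where(clean_truth,
                rng.gamma(2.0, 0.1, 5000),      # small losses: well-fitted samples
                rng.normal(2.5, 0.6, 5000))     # large losses: noisy labels
loss = loss.clip(min=0).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(loss)
clean_comp = int(np.argmin(gmm.means_.ravel()))
p_clean = gmm.predict_proba(loss)[:, clean_comp]
clean_mask = p_clean > 0.5                      # candidate clean subset
```

In a noisy-label pipeline the `clean_mask` subset would feed supervised training while the remainder is treated as unlabelled for the semi-supervised stage.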

15.
J Stat Comput Simul ; 94(10): 2291-2319, 2024.
Article in English | MEDLINE | ID: mdl-39176071

ABSTRACT

It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian mixture models and related clustering methods, we show these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show the poor performance is often driven by an explicit or implicit assumption by the clustering algorithm that high-variance features are relevant while lower-variance features are irrelevant, which we call the variance-as-relevance assumption. We develop practical pre-processing approaches that improve performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
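The variance-as-relevance failure mode is easy to reproduce: give the informative feature low variance and an irrelevant feature high variance, and a distance-based clusterer follows the noise until the features are standardised. The simulation below is a deliberately extreme toy case, not one of the paper's scenarios.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 1000)                    # true subgroup labels
signal = np.where(y == 0, -1.0, 1.0) + rng.normal(0, 0.2, 1000)  # low variance, relevant
noise = rng.normal(0, 10.0, 1000)               # high variance, irrelevant
X = np.c_[signal, noise]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
ari_raw = adjusted_rand_score(y, km.fit_predict(X))    # splits along the noise axis
Xs = StandardScaler().fit_transform(X)
ari_std = adjusted_rand_score(y, km.fit_predict(Xs))   # recovers the subgroups
```

Standardisation is only one of the pre-processing fixes one might try; it rescues this toy case but can itself hurt when the high-variance feature is the relevant one.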

16.
Biostatistics ; 24(1): 68-84, 2022 12 12.
Article in English | MEDLINE | ID: mdl-34363675

ABSTRACT

Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse $K$-means provide solutions to continuous data. With the prevalence of RNA-seq technology and lack of count data modeling for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this article, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small $n$) with high-dimensional gene features (large $p$). A modified EM algorithm and Bayesian information criterion are used for inference and determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The result shows the superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.


Subjects
Models, Statistical , Humans , RNA-Seq , Bayes Theorem , Cluster Analysis , Normal Distribution
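The modelling idea above, clustering counts on their native negative-binomial scale instead of Gaussianising them, can be shown with a minimal EM for a univariate two-component NB mixture with known dispersion r. This toy omits everything that makes the paper's method practical (multivariate genes, lasso/fused-lasso regularisation, BIC tuning), and all data are simulated.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(8)
r = 10.0                                        # known/fixed dispersion
true_mu = np.array([5.0, 50.0])                 # component means (counts)
y = np.concatenate([rng.negative_binomial(r, r / (r + m), 1000) for m in true_mu])

mu = np.quantile(y, [0.25, 0.75]).astype(float) # initialise means from quantiles
w = np.array([0.5, 0.5])
for _ in range(200):
    p = r / (r + mu)                            # NB success probabilities
    like = w * nbinom.pmf(y[:, None], r, p)     # (N, 2) component likelihoods
    resp = like / like.sum(axis=1, keepdims=True)            # E-step
    w = resp.mean(axis=0)                                    # M-step: weights
    mu = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)  # M-step: means

mu_hat = np.sort(mu)
```

With r fixed, the responsibility-weighted sample mean is the exact M-step update for each component mean, which keeps the toy EM only a few lines long.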
17.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33300547

ABSTRACT

The rapid development of single-cell RNA sequencing (scRNA-Seq) technology provides strong technical support for accurate and efficient analysis of single-cell gene expression data. However, the analysis of scRNA-Seq data is accompanied by many obstacles, including dropout events and the curse of dimensionality. Here, we propose scGMAI, a new single-cell Gaussian mixture clustering method based on autoencoder networks and fast independent component analysis (FastICA). Specifically, scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and FastICA is used to reduce the dimensions of the reconstructed data. The integration of these computational techniques leads scGMAI to outperform existing tools, including Seurat, in clustering cells from 17 public scRNA-Seq datasets. In summary, scGMAI is an effective tool for accurately clustering and identifying cell types from scRNA-Seq data and shows great potential for scRNA-Seq data analysis. The source code is available at https://github.com/QUST-AIBBDRC/scGMAI/.


Subjects
Algorithms , RNA-Seq , Single-Cell Analysis , Software
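The last two stages of the pipeline described above can be sketched with stock scikit-learn components, FastICA for dimension reduction followed by Gaussian mixture clustering, on a toy expression matrix. The autoencoder reconstruction stage is omitted, and the data are simulated, so this is the general recipe rather than scGMAI itself.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
centers = rng.normal(0.0, 3.0, (3, 100))        # 3 cell types x 100 genes
types = rng.integers(0, 3, 300)                 # true type of each cell
expr = centers[types] + rng.normal(0.0, 1.0, (300, 100))  # toy expression matrix

Z = FastICA(n_components=10, random_state=0, max_iter=1000).fit_transform(expr)
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(Z)
```

Running the mixture in the 10-dimensional ICA space rather than on the raw genes is what sidesteps the curse of dimensionality mentioned in the abstract.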
18.
Eur J Nucl Med Mol Imaging ; 50(11): 3265-3275, 2023 09.
Article in English | MEDLINE | ID: mdl-37272955

ABSTRACT

PURPOSE: Several [18F]Flortaucipir cutoffs have been proposed for tau PET positivity (T+) in Alzheimer's disease (AD), but none were data-driven. The aim of this study was to establish and validate unsupervised T+ cutoffs by applying Gaussian mixture models (GMMs). METHODS: Amyloid-negative (A-) cognitively normal (CN) and amyloid-positive (A+) AD-related dementia (ADRD) subjects from ADNI (n=269) were included. ADNI (n=475) and Geneva Memory Clinic (GMC) cohorts (n=98) were used for validation. GMM-based cutoffs were extracted for the temporal meta-ROI and validated against previously published cutoffs and visual rating. RESULTS: GMM-based cutoffs classified fewer subjects as T+, mainly in the A- CN (<3.4% vs >28.5%) and A+ CN (<14.5% vs >42.9%) groups, and showed higher agreement with visual rating (ICC=0.91 vs ICC<0.62) than published cutoffs. CONCLUSION: We provide reliable data-driven [18F]Flortaucipir cutoffs for in vivo T+ detection in AD. These cutoffs might be useful to select participants in clinical and research studies.


Subjects
Alzheimer Disease , Cognitive Dysfunction , Humans , Alzheimer Disease/diagnostic imaging , tau Proteins , Amyloid beta-Peptides , Positron-Emission Tomography , Amyloid
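How a data-driven positivity cutoff falls out of a GMM: fit two components to an uptake measure and take the point where the posterior probability of the high-uptake component first reaches 0.5. The SUVR-like values below are invented, so this shows the generic recipe, not the ADNI-derived cutoffs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(10)
suvr = np.r_[rng.normal(1.15, 0.08, 600),       # tau-negative mode
             rng.normal(1.60, 0.25, 200)]       # tau-positive mode
gmm = GaussianMixture(n_components=2, random_state=0).fit(suvr.reshape(-1, 1))
hi = int(np.argmax(gmm.means_.ravel()))         # high-uptake component
lo_m, hi_m = np.sort(gmm.means_.ravel())
g = np.linspace(lo_m, hi_m, 2000)               # search between the two means
post_hi = gmm.predict_proba(g.reshape(-1, 1))[:, hi]
cutoff = float(g[post_hi >= 0.5][0])            # first point with P(T+) >= 0.5
```

Unlike a cutoff fixed by expert consensus, this threshold moves with the cohort's actual uptake distribution, which is what "data-driven" means here.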
19.
J Microsc ; 289(1): 58-70, 2023 01.
Article in English | MEDLINE | ID: mdl-36229040

ABSTRACT

Scanning electron microscopy (SEM) has been a powerful technique to investigate the structural and chemical properties of multiphase materials on the micro- and nanoscale due to its high-resolution capabilities. One of the main outcomes of SEM-based analysis is the calculation of the fractions of the material components constituting a multiphase material by means of the segmentation of their backscattered electron SEM images. To segment multiphase images, Gaussian mixture models (GMMs) are commonly used, based on the deconvolution of the image pixel histogram. Despite its extensive use, the accuracy of GMM predictions has not been validated yet. In this paper, we proceed to a systematic evaluation of the accuracy and the limitations of the GMM method when applied to the segmentation of a four-phase material. To this end, we first build a modelling framework and propose an index to quantify the accuracy of GMM predictions for all phases. We then apply this framework to calculate the impact of collective parameters of the image histogram on the accuracy of GMM predictions. Finally, some rules of thumb are concluded to guide SEM users about the suitability of GMM for the segmentation of their SEM images based only on inspection of the image histogram. A suitable histogram for GMM has a number of peaks equal to the number of Gaussian components; if that is not the case, kurtosis and skewness should be smaller than 2.35 and 0.1, respectively.


Subjects
Algorithms , Image Processing, Computer-Assisted , Image Processing, Computer-Assisted/methods , Normal Distribution , Microscopy, Electron, Scanning
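The rule of thumb in the last sentence can be packaged as a quick pre-check on the pixel histogram: count peaks and compute skewness and (excess) kurtosis, then flag suitability. The peak detector and the smoothing window are our own simplifications; only the thresholds 2.35 and 0.1 and the peaks-equal-components rule come from the abstract.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def histogram_report(values, n_components, bins=48):
    counts, _ = np.histogram(values, bins=bins)
    counts = np.convolve(counts, np.ones(5) / 5, mode="same")  # light smoothing
    inner = counts[1:-1]
    # a peak = a bin above both neighbours and above 5% of the maximum
    peaks = int(np.sum((inner > counts[:-2]) & (inner > counts[2:])
                       & (inner > 0.05 * counts.max())))
    k, s = float(kurtosis(values)), float(abs(skew(values)))
    suitable = peaks == n_components or (k < 2.35 and s < 0.1)
    return {"peaks": peaks, "kurtosis": k, "skewness": s, "suitable": suitable}

rng = np.random.default_rng(11)
# well-separated two-phase image: grey levels around 60 and 180
pixels = np.r_[rng.normal(60, 8, 40000), rng.normal(180, 8, 20000)]
report = histogram_report(pixels, n_components=2)
```

A check like this costs almost nothing and can be run before committing to a GMM segmentation of a full image stack.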
20.
Biometrics ; 79(2): 866-877, 2023 06.
Article in English | MEDLINE | ID: mdl-35220585

ABSTRACT

One key challenge encountered in single-cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not be a GMM containing a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.


Subjects
Algorithms , Normal Distribution , Cluster Analysis
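The MAW construction above can be illustrated in its simplest special case: for two GMMs with the same number of components and uniform weights, the transport problem over component pairs reduces to an assignment problem, with the closed-form W2 distance between Gaussians as the pairwise cost (in 1-D, W2² = (m1-m2)² + (s1-s2)²). The general MAW solves a full transport problem over arbitrary weights, which this sketch does not cover.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2sq_gauss1d(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between two 1-D Gaussians."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def maw_sq(comps_a, comps_b):
    """comps_*: lists of (mean, sd); uniform weights, equal counts assumed."""
    cost = np.array([[w2sq_gauss1d(ma, sa, mb, sb)
                      for (mb, sb) in comps_b] for (ma, sa) in comps_a])
    rows, cols = linear_sum_assignment(cost)    # optimal component matching
    return cost[rows, cols].mean()              # average over matched pairs

gmm_a = [(0.0, 1.0), (5.0, 0.5), (9.0, 2.0)]
gmm_b = [(0.5, 1.0), (5.5, 0.5), (9.5, 2.0)]    # gmm_a shifted by 0.5
d2 = maw_sq(gmm_a, gmm_b)                        # d2 == 0.25 for the 0.5 shift
```

Working at the component level is what keeps the cost independent of data size: only the mixture parameters enter, never the original samples.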