ABSTRACT
Single-cell RNA sequencing (scRNA-seq) is widely used to interpret cellular states, detect cell subpopulations, and study disease mechanisms. In scRNA-seq data analysis, cell clustering is a key step that can identify cell types. However, scRNA-seq data are characterized by high dimensionality and significant sparsity, presenting considerable challenges for clustering. In the high-dimensional gene expression space, cells may form complex topological structures. Many conventional scRNA-seq analysis methods focus on identifying cell subgroups rather than exploring these potential high-dimensional structures in detail. Although some methods have begun to consider the topological structures within the data, many still overlook the continuity and complex topology present in single-cell data. We propose scZAG, a deep learning framework that begins by employing a zero-inflated negative binomial (ZINB) model to denoise the highly sparse and over-dispersed scRNA-seq data. Next, scZAG uses an adaptive graph contrastive representation learning approach that combines approximate personalized propagation of neural predictions graph convolution (APPNPGCN) with graph contrastive learning. Using APPNPGCN as the encoder for graph contrastive learning ensures that each cell's representation reflects not only its own features but also its position in the graph and its relationships with other cells. Graph contrastive learning exploits the relationships between nodes to capture the similarity among cells, better representing the data's underlying continuity and complex topology. Finally, the learned low-dimensional latent representations are clustered using Kullback-Leibler (KL) divergence. We validated the superior clustering performance of scZAG against existing state-of-the-art clustering methods on 10 common scRNA-seq datasets.
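For concreteness, the sketch below shows the kind of KL-divergence self-training objective commonly used to cluster low-dimensional latent representations: soft assignments from a Student's t kernel, a sharpened auxiliary target distribution, and the KL loss between them. Function names and the toy data are illustrative assumptions, not scZAG's actual implementation.

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    """Student's t-kernel soft assignment of cells (rows of z) to cluster centroids."""
    dist2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution P used as the target in KL(P || Q)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(q):
    """KL(P || Q) minimized during the self-training clustering phase."""
    p = target_distribution(q)
    return float((p * np.log(p / q)).sum())

# Toy usage: 100 cells in a 10-D latent space, 3 hypothetical clusters.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 10))
centroids = rng.normal(size=(3, 10))
print(kl_clustering_loss(soft_assignments(z, centroids)))
```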
Subjects
Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Humans , RNA-Seq/methods , Sequence Analysis, RNA/methods , Algorithms , Software , Deep Learning , Computational Biology/methods , Single-Cell Gene Expression Analysis
ABSTRACT
As one of the most important tasks in protein structure prediction, protein fold recognition has attracted increasing attention. Several computational predictors have been proposed with the development of machine learning and artificial intelligence techniques. However, these existing computational methods still suffer from certain shortcomings. To address them, we propose a new network-based predictor called ProtFold-DFG for protein fold recognition. We propose the Directed Fusion Graph (DFG) to fuse the ranking lists generated by different methods: it employs the transitive closure to incorporate more relationships among proteins and uses the KL divergence to quantify the relationship between two proteins, improving its generalization ability. Finally, the PageRank algorithm is run on the DFG to recognize protein folds accurately by considering the global interactions among proteins in the DFG. Tested on a widely used and rigorous benchmark, the LINDAHL dataset, ProtFold-DFG outperforms the other 35 competing methods, indicating that it will be a useful method for protein fold recognition. The source code and data of ProtFold-DFG can be downloaded from http://bliulab.net/ProtFold-DFG/download.
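To make the two KL-related ingredients concrete, the sketch below computes a KL divergence between two normalized ranking-score vectors and runs power-iteration PageRank on a weighted directed graph. It assumes the ranking lists have already been converted to probability vectors and the fusion graph to an adjacency matrix; it is a generic illustration, not the ProtFold-DFG source code available at the URL above.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, e.g. normalized ranking scores."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

def pagerank(adj, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank over a weighted directed graph (row i holds i's out-edges)."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    trans = np.where(out > 0, adj / np.where(out == 0, 1, out), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - damping) / n + damping * (trans.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Toy fusion graph over 4 proteins, edge weights derived from ranking agreement.
adj = np.array([[0, 1, 0.5, 0],
                [1, 0, 0.2, 0],
                [0.5, 0.2, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))
print(kl_divergence([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]))
```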
Subjects
Algorithms , Artificial Intelligence , Computational Biology/methods , Protein Folding , Proteins/chemistry , Neural Networks, Computer , Protein Interaction Maps , Proteins/metabolism , Reproducibility of Results
ABSTRACT
Advances in information technologies have made network data increasingly common in a spectrum of big data applications, which are often explored with probabilistic graphical models. To estimate the precision matrix precisely, we propose an optimal model averaging estimator for Gaussian graphs. We prove that the proposed estimator is asymptotically optimal when the candidate models are misspecified. The consistency and asymptotic distribution of the model averaging estimator, as well as the convergence of the weights, are also studied when at least one correct model is included in the candidate set. Furthermore, numerical simulations and a real data analysis on yeast genetic data are conducted to illustrate that the proposed method is promising.
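For intuition only, a model averaging estimator of a precision matrix is a convex combination of candidate estimates; the sketch below shows just that combination step, with the optimal weight-selection criterion (the paper's contribution) deliberately left out.

```python
import numpy as np

def averaged_precision(candidates, weights):
    """Convex combination of candidate precision-matrix estimates; `weights`
    must lie on the probability simplex. Choosing the weights optimally is
    the focus of the paper and is not reproduced here."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * np.asarray(P, dtype=float) for w, P in zip(weights, candidates))

# Toy usage: average two 3x3 candidate estimates with weights (0.7, 0.3).
P1, P2 = np.eye(3), np.eye(3) * 2.0
print(averaged_precision([P1, P2], [0.7, 0.3]))
```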
Subjects
Models, Statistical , Computer Simulation
ABSTRACT
Non-intrusive load monitoring (NILM) systems based on deep learning methods produce high-accuracy end-use detection; however, they are mainly designed with a one-vs-one strategy. This strategy dictates that one model is trained to disaggregate only one appliance, which is sub-optimal in production. Due to the high number of parameters and the multiple models, training and inference can be very costly. A promising solution to this problem is the design of an NILM system in which all the target appliances can be recognized by only one model. This paper suggests a novel multi-appliance power disaggregation model. The proposed architecture is a multi-target regression neural network consisting of two main parts. The first part is a variational encoder with convolutional layers, and the second part has multiple regression heads that share the encoder's parameters. Given the total consumption of an installation, the multi-regressor outputs the individual consumption of all the target appliances simultaneously. The experimental setup includes a comparative analysis against other multi- and single-target state-of-the-art models.
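A minimal sketch of such a shared-encoder, multi-head architecture is shown below in PyTorch. Layer sizes, the latent dimension, and whether each head emits a full window of per-appliance consumption are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiApplianceRegressor(nn.Module):
    """Shared convolutional variational encoder with one regression head per
    appliance (illustrative sketch; layer sizes and head outputs are assumed)."""

    def __init__(self, window_len=256, latent_dim=16, n_appliances=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        # One regression head per target appliance, all sharing the encoder.
        self.heads = nn.ModuleList(nn.Linear(latent_dim, window_len)
                                   for _ in range(n_appliances))

    def forward(self, aggregate):  # aggregate: (batch, 1, window_len)
        h = self.encoder(aggregate)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return [head(z) for head in self.heads], mu, logvar

# Toy forward pass on a batch of 8 aggregate-consumption windows.
model = MultiApplianceRegressor()
outputs, mu, logvar = model(torch.randn(8, 1, 256))
print(len(outputs), outputs[0].shape)
```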
ABSTRACT
Time series (TS) and multiple time series (MTS) predictions have historically paved the way for distinct families of deep learning models. The temporal dimension, distinguished by its evolutionary sequential aspect, is usually modeled by decomposition into the trio of "trend, seasonality, noise", by attempts to copy the functioning of human synapses, and more recently, by transformer models with self-attention on the temporal dimension. These models find applications in finance and e-commerce, where even a performance increase of less than 1% has large monetary repercussions; they also have potential applications in natural language processing (NLP), medicine, and physics. To the best of our knowledge, the information bottleneck (IB) framework has not received significant attention in the context of TS or MTS analyses. One can demonstrate that a compression of the temporal dimension is key in the context of MTS. We propose a new approach with partial convolution, where a time sequence is encoded into a two-dimensional representation resembling an image. Accordingly, we use recent advances in image extension to predict an unseen part of an image from a given one. We show that our model compares well with traditional TS models, has information-theoretical foundations, and can be easily extended to more dimensions than only time and space. An evaluation of our multiple time series-information bottleneck (MTS-IB) model proves its efficiency on electricity production, road traffic, and astronomical data representing solar activity, as recorded by NASA's Interface Region Imaging Spectrograph (IRIS) satellite.
ABSTRACT
BACKGROUND: Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the research and application of deep learning methods in recent years, deep neural networks based on gene expression have become a hot research direction in lung cancer diagnosis, providing an effective route to early diagnosis. Thus, building a deep neural network model is of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. Existing methods cannot address the problems of high dimensionality and imbalanced data, because the overwhelming number of measured variables (genes) relative to the small number of samples results in poor performance in the early diagnosis of lung cancer. METHOD: Given that gene expression datasets are small, high-dimensional, and imbalanced, this paper proposes a gene selection method based on KL divergence, which selects the genes with the highest KL divergence as model features. We then build a deep neural network model using focal loss as the loss function and use k-fold cross-validation (with k set to five in this paper) to validate and select the best model. RESULT: The deep learning model based on KL divergence gene selection proposed in this paper achieves an AUC of 0.99 on the validation set, indicating high generalization performance. CONCLUSION: The deep neural network model based on KL divergence gene selection proposed in this paper is shown to be an accurate and effective method for lung cancer prediction.
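As an illustration of KL divergence-based gene selection, the sketch below scores each gene by a symmetric KL divergence between its class-conditional expression histograms and keeps the top-ranked genes. The histogram density model and the symmetric form are assumptions; the paper's exact criterion may differ.

```python
import numpy as np

def gene_kl_score(expr, labels, bins=20, eps=1e-10):
    """Symmetric KL divergence between a gene's expression histograms in the
    two classes; larger values suggest better class separation."""
    x0, x1 = expr[labels == 0], expr[labels == 1]
    edges = np.histogram_bin_edges(expr, bins=bins)
    p = np.histogram(x0, bins=edges)[0] + eps
    q = np.histogram(x1, bins=edges)[0] + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

def select_genes(X, labels, top_k=200):
    """Keep the top_k genes (columns of X) with the largest KL score."""
    scores = np.array([gene_kl_score(X[:, j], labels) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]

# Toy usage: 60 samples x 500 genes, binary labels.
rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, size=(60, 500))
y = rng.integers(0, 2, size=60)
print(select_genes(X, y, top_k=10))
```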
Subjects
Deep Learning , Lung Neoplasms , China , Gene Expression , Humans , Lung Neoplasms/genetics , Neural Networks, Computer
ABSTRACT
We propose a reinforcement learning (RL) approach to compute the quasi-stationary distribution. Based on the fixed-point formulation of the quasi-stationary distribution, we minimize the KL divergence between the two Markovian path distributions induced by the candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply a reinforcement learning technique by introducing reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. Numerical examples on finite-state Markov chains are presented to demonstrate the new method.
ABSTRACT
It is of great practical importance to compare and combine data from different studies in order to carry out appropriate and more powerful statistical inference. We propose a partition-based measure to quantify the compatibility of two datasets using their respective posterior distributions. We further propose an information gain measure to quantify the information increase (or decrease) from combining two datasets. These measures are well calibrated, and efficient computational algorithms are provided for their calculation. We use examples from a benchmark dose toxicology study, a six-cities pollution dataset, and a melanoma clinical trial to illustrate how these two measures are useful in combining current data with historical data and missing data.
Subjects
Algorithms , Data Analysis , Humans
ABSTRACT
The asymmetric skew divergence smooths one of the distributions by mixing it, to a degree determined by the parameter λ, with the other distribution. Such divergence is an approximation of the KL divergence that does not require the target distribution to be absolutely continuous with respect to the source distribution. In this paper, an information geometric generalization of the skew divergence called the α-geodesical skew divergence is proposed, and its properties are studied.
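A minimal numerical sketch of the (non-generalized) skew divergence follows; the direction of smoothing and the value of λ are chosen for illustration, and the parameterization varies across papers.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence, summing only over the support of p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def skew_divergence(p, q, lam=0.99):
    """Skew divergence: KL of p against a mixture of q and p. The mixture is
    strictly positive wherever p is, so the value stays finite even when q
    misses part of p's support (one common convention for the smoothing)."""
    return kl(p, lam * q + (1.0 - lam) * p)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])   # plain KL(p || q) would be infinite here
print(skew_divergence(p, q, lam=0.9))
```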
ABSTRACT
Detection of faults at the incipient stage is critical to improving the availability and continuity of satellite services. Applying a locally optimal projection vector and the Kullback-Leibler (KL) divergence can improve the detection rate of incipient faults; however, this approach suffers from high time complexity. We propose decomposing the KL divergence in the original optimization model and applying the property of the generalized Rayleigh quotient to reduce the time complexity. Additionally, we establish two distribution models for the subfunctions F1(w) and F3(w) to detect slight anomalies in the mean and covariance. The effectiveness of the proposed method was verified through a numerical simulation case and a real satellite fault case. The results demonstrate the advantages of low computational complexity and high sensitivity to incipient faults.
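For intuition, a generic KL-based incipient-fault indicator along a projection vector w can be written as the KL divergence between one-dimensional Gaussians fitted to projected reference and test data, as sketched below; this is illustrative only and does not reproduce the paper's decomposed subfunctions F1(w) and F3(w).

```python
import numpy as np

def projected_kl(w, X_ref, X_test, eps=1e-12):
    """KL divergence between 1-D Gaussians fitted to reference (fault-free) and
    test data projected onto direction w; a large value flags a shift in the
    projected mean and/or variance."""
    a, b = X_ref @ w, X_test @ w
    m0, v0 = a.mean(), a.var() + eps
    m1, v1 = b.mean(), b.var() + eps
    return 0.5 * (np.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

# Toy usage: an incipient fault appears as a small mean shift along w.
rng = np.random.default_rng(2)
X_ref = rng.normal(size=(500, 5))
X_test = rng.normal(size=(500, 5)) + 0.2
w = np.ones(5) / np.sqrt(5)
print(projected_kl(w, X_ref, X_test))
```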
ABSTRACT
Information theoretic (IT) approaches to quantifying causal influences have experienced some popularity in the literature, in both theoretical and applied (e.g., neuroscience and climate science) domains. While these causal measures are desirable in that they are model agnostic and can capture non-linear interactions, they are fundamentally different from common statistical notions of causal influence in that they (1) compare distributions over the effect rather than values of the effect and (2) are defined with respect to random variables representing a cause rather than specific values of a cause. We here present IT measures of direct, indirect, and total causal effects. The proposed measures are unlike existing IT techniques in that they enable measuring causal effects that are defined with respect to specific values of a cause while still offering the flexibility and general applicability of IT techniques. We provide an identifiability result and demonstrate application of the proposed measures in estimating the causal effect of the El Niño-Southern Oscillation on temperature anomalies in the North American Pacific Northwest.
ABSTRACT
Ensemble clustering combines different basic partitions of a dataset into a more stable and robust one; thus, cluster ensembles play a significant role in applications such as image segmentation. However, existing ensemble methods have several drawbacks, including a lack of diversity among the basic partitions and low accuracy caused by data noise. In this paper, to overcome these difficulties, we propose an efficient fuzzy cluster ensemble method based on the Kullback-Leibler divergence, or simply the KL divergence. The data are first clustered with distinct fuzzy clustering methods. Then, the soft clustering results are aggregated by a fuzzy KL divergence-based objective function. Moreover, for image segmentation problems, we utilize local spatial information in the cluster ensemble algorithm to suppress the effect of noise. Experimental results reveal that the proposed methods outperform many other methods on synthetic and real image-segmentation problems.
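To illustrate KL-based aggregation of soft partitions, the sketch below combines several fuzzy membership matrices into a consensus via the normalized geometric mean, which minimizes the summed KL divergence from the consensus to each member. It assumes the members' cluster labels are already aligned; it is a simple illustrative rule, not the paper's full objective with local spatial information.

```python
import numpy as np

def kl_consensus(memberships, eps=1e-12):
    """Aggregate soft partitions (a list of n_samples x n_clusters membership
    matrices) into a consensus C that minimizes sum_m KL(C || U_m) row-wise;
    the minimizer is the normalized geometric mean of the members."""
    stack = np.stack([np.clip(u, eps, None) for u in memberships])  # (M, N, K)
    geo = np.exp(np.log(stack).mean(axis=0))                        # geometric mean
    return geo / geo.sum(axis=1, keepdims=True)

# Toy usage: fuse two soft partitions of 3 samples into 2 clusters.
u1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
u2 = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
print(kl_consensus([u1, u2]))
```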
ABSTRACT
While Langevin integrators are popular in the study of equilibrium properties of complex systems, it is challenging to estimate the timestep-induced discretization error: the degree to which the sampled phase-space or configuration-space probability density departs from the desired target density due to the use of a finite integration timestep. Sivak et al. introduced a convenient approach to approximating a natural measure of error between the sampled density and the target equilibrium density, the Kullback-Leibler (KL) divergence, in phase space, but did not specifically address the issue of configuration-space properties, which are much more commonly of interest in molecular simulations. Here, we introduce a variant of this near-equilibrium estimator capable of measuring the error in the configuration-space marginal density, validating it against a complex but exact nested Monte Carlo estimator to show that it reproduces the KL divergence with high fidelity. To illustrate its utility, we employ this new near-equilibrium estimator to assess a claim that a recently proposed Langevin integrator introduces extremely small configuration-space density errors up to the stability limit at no extra computational expense. Finally, we show how this approach to quantifying sampling bias can be applied to a wide variety of stochastic integrators by following a straightforward procedure to compute the appropriate shadow work, and we describe how it can be extended to quantify the error in arbitrary marginal or conditional distributions of interest.
ABSTRACT
BACKGROUND: Ultrasound imaging is safer than other imaging modalities because it is noninvasive and nonradiative. Speckle noise degrades the quality of ultrasound images and has negative effects on visual perception and diagnostic operations. METHODS: In this paper, a nonlocal total variation (NLTV) method for ultrasonic speckle reduction is proposed. A spatiogram similarity measure is introduced for the similarity calculation between image patches. It is based on the symmetric Kullback-Leibler (KL) divergence and a signal-dependent speckle model for log-compressed ultrasound images. Each patch is regarded as a spatiogram, and the spatial distribution of each bin of the spatiogram is modeled as a weighted Gamma distribution. The similarity between the corresponding bins of two spatiograms is computed using the symmetric KL divergence. The Split-Bregman fast algorithm is then used to solve the adapted NLTV objective function. A Kolmogorov-Smirnov (KS) test is performed on synthetic noisy images and real ultrasound images. RESULTS: We validate our method on synthetic noisy images and clinical ultrasound images. Three measures are adopted for the quantitative evaluation of despeckling performance: the signal-to-noise ratio (SNR), the structural similarity index (SSIM), and the natural image quality evaluator (NIQE). For synthetic noisy images, as the noise level increases, the proposed algorithm achieves slightly higher SNRs than the other two algorithms, and its SSIMs are clearly higher than those of the other two algorithms. For liver, IVUS, and 3DUS images, the NIQE values are 8.25, 6.42, and 9.01, respectively, all higher than those of the other two algorithms. CONCLUSIONS: The results of the experiments on synthetic and real ultrasound images demonstrate that the proposed method outperforms current state-of-the-art despeckling methods with respect to speckle reduction and tissue texture preservation.
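The symmetric KL divergence between two Gamma-distributed spatiogram bins has a closed form; a sketch under a shape/rate parameterization is given below, with the spatial weighting and the parameter fitting omitted, so it illustrates only the core similarity measure rather than the full method.

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_gamma(a1, b1, a2, b2):
    """KL divergence between Gamma(shape a1, rate b1) and Gamma(shape a2, rate b2)."""
    return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
            + a2 * np.log(b1 / b2) + a1 * (b2 - b1) / b1)

def symmetric_kl_gamma(a1, b1, a2, b2):
    """Symmetric KL used here as a stand-in for comparing two spatiogram bins,
    each modeled by a Gamma distribution."""
    return kl_gamma(a1, b1, a2, b2) + kl_gamma(a2, b2, a1, b1)

print(symmetric_kl_gamma(2.0, 1.0, 2.5, 1.2))
```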
Subjects
Image Interpretation, Computer-Assisted/methods , Liver/diagnostic imaging , Algorithms , Humans , Signal-To-Noise Ratio , Ultrasonography/methods
ABSTRACT
Variational autoencoders (VAEs) are an efficient variational inference technique coupled with a generative network. Due to the uncertainty estimates provided by variational inference, VAEs have been applied in medical image registration. However, a critical problem in VAEs is that a simple prior cannot provide suitable regularization, which leads to a mismatch between the variational posterior and the prior. An optimal prior can close the gap between the true and variational posteriors. In this paper, we propose a multi-stage VAE to learn the optimal prior, which is the aggregated posterior. A lightweight VAE is used to generate the aggregated posterior as a whole; this is an effective way to estimate the distribution of the high-dimensional aggregated posterior that commonly arises in VAE-based medical image registration. A factorized telescoping classifier is trained to estimate the density ratio of a simple given prior and the aggregated posterior, with the aim of calculating the KL divergence between the variational and aggregated posteriors more accurately. We analyze the KL divergence and find that the finer the factorization, the smaller the KL divergence; however, too fine a partition is not conducive to registration accuracy. Moreover, the diagonal hypothesis on the variational posterior's covariance ignores the relationships between latent variables in image registration. To address this issue, we learn a covariance matrix with low-rank information to enable correlations across the dimensions of the variational posterior. The covariance matrix is further used as a measure to reduce the uncertainty of the deformation fields. Experimental results on four public medical image datasets demonstrate that our proposed method outperforms other methods in negative log-likelihood (NLL) and achieves better registration accuracy.
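The density-ratio idea can be sketched with a single probabilistic classifier: with equal-sized sample sets, its log-odds approximate log q(z)/p(z), and averaging them over posterior samples estimates KL(q || p). The logistic-regression stand-in below is an assumption for illustration; the paper's factorized telescoping classifier is more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_kl_density_ratio(post_samples, prior_samples):
    """Density-ratio trick: with equal-sized sample sets, a classifier trained to
    separate aggregated-posterior samples (label 1) from prior samples (label 0)
    has log-odds approximating log q(z) - log p(z); their mean over posterior
    samples estimates KL(q || p)."""
    X = np.vstack([post_samples, prior_samples])
    y = np.concatenate([np.ones(len(post_samples)), np.zeros(len(prior_samples))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return float(clf.decision_function(post_samples).mean())

# Toy check: KL between N(0.5, I) and N(0, I) in 4 dimensions is 0.5.
rng = np.random.default_rng(0)
post = rng.normal(loc=0.5, size=(4000, 4))
prior = rng.normal(loc=0.0, size=(4000, 4))
print(estimate_kl_density_ratio(post, prior))
```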
Subjects
Image Processing, Computer-Assisted , Humans , Image Processing, Computer-Assisted/methods , Algorithms
ABSTRACT
BACKGROUND AND OBJECTIVE: Myocardial infarction (MI) is a life-threatening condition diagnosed acutely on the electrocardiogram (ECG). Several sources of error, such as noise, can impair automated ECG diagnosis. Therefore, quantification and communication of model uncertainty are essential for reliable MI diagnosis. METHODS: A Dirichlet DenseNet model that could analyze out-of-distribution data and detect misclassification of MI and normal ECG signals was developed. The DenseNet model was first trained with the pre-processed MI ECG signals (from the best lead, V6) acquired from the Physikalisch-Technische Bundesanstalt (PTB) database, using the reverse Kullback-Leibler (KL) divergence loss. The model was then tested with newly synthesized ECG signals with added em and ma noise samples. Predictive entropy was used as an uncertainty measure to determine the misclassification of normal and MI signals. Model performance was evaluated using four uncertainty metrics: uncertainty sensitivity (UNSE), uncertainty specificity (UNSP), uncertainty accuracy (UNAC), and uncertainty precision (UNPR); the classification threshold was set at 0.3. RESULTS: The UNSE of the DenseNet model was low but increased over the studied decreasing-noise range (SNR from -6 to 24 dB), indicating that the model grew more confident in classifying the signals as they became less noisy. The model became more certain in its predictions from SNR values of 12 dB and 18 dB onwards, yielding UNAC values of 80% and 82.4% for em and ma noise signals, respectively. UNSP and UNPR values were close to 100% for em and ma noise signals, indicating that the model was aware of what it did and did not know. CONCLUSION: This work establishes that the model is reliable, as it was able to convey when it was not confident in the diagnostic information it was presenting. Thus, the model is trustworthy and can be used in healthcare applications, such as the emergency diagnosis of MI on ECGs.
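Predictive entropy as an uncertainty gate can be sketched in a few lines; the normalization by log(K) used below to keep the score in [0, 1] before applying the 0.3 threshold is an assumption of this sketch, not a detail reported in the abstract.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of the predictive class distribution; higher means more uncertain."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(probs * np.log(probs)).sum())

def flag_uncertain(probs, threshold=0.3):
    """Flag a prediction as potentially misclassified when the normalized
    predictive entropy exceeds the threshold (0.3 in the study)."""
    return predictive_entropy(probs) / np.log(len(probs)) > threshold

print(flag_uncertain([0.55, 0.45]))   # near-uniform prediction -> flagged as uncertain
print(flag_uncertain([0.98, 0.02]))   # confident prediction -> not flagged
```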
Subjects
Electrocardiography , Myocardial Infarction , Humans , Uncertainty , Myocardial Infarction/diagnosis , Databases, Factual , Entropy
ABSTRACT
Introduction: The development of multimodal single-cell omics methods has enabled the collection of data across different omics modalities from the same set of single cells. Each omics modality provides unique information about cell type and function, so the ability to integrate data from different modalities can provide deeper insights into cellular functions. Often, single-cell omics data can prove challenging to model because of high dimensionality, sparsity, and technical noise. Methods: We propose a novel multimodal data analysis method called joint graph-regularized Single-Cell Kullback-Leibler Sparse Non-negative Matrix Factorization (jrSiCKLSNMF, pronounced "junior sickles NMF") that extracts latent factors shared across omics modalities within the same set of single cells. Results: We compare our clustering algorithm to several existing methods on four datasets simulated using third-party software. We also apply our algorithm to a real set of cell line data. Discussion: We show overwhelmingly better clustering performance than several existing methods on the simulated data. On a real multimodal omics dataset, we also find that our method produces scientifically accurate clustering results.
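The KL-divergence NMF at the core of such methods can be sketched with the classic Lee-Seung multiplicative updates, shown below; jrSiCKLSNMF's sparsity penalty, graph regularization, and joint treatment of multiple modalities are not reproduced here.

```python
import numpy as np

def kl_nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates minimizing the generalized KL divergence
    D(V || WH); the building block that KL-based single-cell NMF methods extend."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

# Toy usage: factorize a small non-negative "cells x features" matrix.
V = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
W, H = kl_nmf(V, rank=5)
print(W.shape, H.shape)
```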
ABSTRACT
Lung cancer is a prevalent malignancy that affects individuals of all genders and is often diagnosed late due to delayed symptoms. To catch it early, researchers are developing algorithms to analyze lung cancer images. The primary objective of this work is to propose a novel approach for the detection of lung cancer using histopathological images. In this work, the histopathological images underwent preprocessing, followed by segmentation using a modified KFCM-based approach, and the segmented image intensity values were dimensionally reduced using Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO). Algorithms such as KL divergence and Invasive Weed Optimization (IWO) are used for feature selection. Seven classifiers (SVM, KNN, Random Forest, Decision Tree, Softmax Discriminant, Multilayer Perceptron, and BLDC) were used to analyze and classify the images as benign or malignant. Results were compared using standard metrics, and a kappa analysis assessed classifier agreement. The Decision Tree classifier with GWO feature extraction achieved an accuracy of 85.01% without feature selection or hyperparameter tuning. Furthermore, we present a methodology to enhance the accuracy of the classifiers by employing hyperparameter tuning algorithms based on Adam and RAdam. By combining features from GWO and IWO and using the RAdam algorithm, the Decision Tree classifier achieves a commendable accuracy of 91.57%.
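The kappa analysis mentioned above measures chance-corrected agreement between classifiers; a minimal sketch with placeholder predictions (not study data) is shown below using scikit-learn's cohen_kappa_score.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder benign(0)/malignant(1) predictions from two classifiers on the
# same held-out images; the values are illustrative, not study data.
preds_decision_tree = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
preds_svm = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(preds_decision_tree, preds_svm))
```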
ABSTRACT
Acute lymphoblastic leukemia (ALL) is a life-threatening hematological malignancy that requires early and accurate diagnosis for effective treatment. However, the manual diagnosis of ALL is time-consuming and can delay critical treatment decisions. To address this challenge, researchers have turned to advanced technologies such as deep learning (DL) models. These models leverage the power of artificial intelligence to analyze complex patterns and features in medical images and data, enabling faster and more accurate diagnosis of ALL. However, existing DL-based ALL diagnosis methods suffer from various challenges, such as computational complexity, sensitivity to hyperparameters, and difficulties with noisy or low-quality input images. To address these issues, in this paper we propose a novel Deep Skip Connections-Based Dense Network (DSCNet) tailored for ALL diagnosis using peripheral blood smear images. The DSCNet architecture integrates skip connections, custom image filtering, Kullback-Leibler (KL) divergence loss, and dropout regularization to enhance its performance and generalization abilities. DSCNet leverages skip connections to address the vanishing gradient problem and capture long-range dependencies, while custom image filtering enhances relevant features in the input data. The KL divergence loss serves as the optimization objective, enabling accurate predictions, and dropout regularization is employed to prevent overfitting during training, promoting robust feature representations. The experiments conducted on an augmented ALL dataset highlight the effectiveness of DSCNet. The proposed DSCNet outperforms competing methods, showing gains in accuracy, sensitivity, specificity, F-score, and area under the curve (AUC) of 1.25%, 1.32%, 1.12%, 1.24%, and 1.23%, respectively. The proposed approach demonstrates the potential of DSCNet as an effective tool for early and accurate ALL diagnosis, with potential applications in clinical settings to improve patient outcomes and advance leukemia detection research.
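A minimal sketch of a skip-connected block and a KL-divergence training objective in PyTorch follows; the layer sizes, the dense-style concatenation, and the use of soft targets are illustrative assumptions rather than DSCNet's exact configuration.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Dense-style skip connection: the block's input is concatenated with its
    output, giving gradients a short path (illustrative of skip-connected designs)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

# KL divergence as the training objective: inputs are log-probabilities,
# targets are (possibly soft) class distributions.
kl_loss = nn.KLDivLoss(reduction="batchmean")
log_probs = torch.log_softmax(torch.randn(8, 2), dim=1)   # model outputs for 8 images
targets = torch.softmax(torch.randn(8, 2), dim=1)         # soft targets
print(kl_loss(log_probs, targets))
block = SkipBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)            # channels double after concat
```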
ABSTRACT
In this paper, a weighted multivariate generalized Gaussian mixture model combined with stochastic optimization is proposed for point cloud registration. The mixture model parameters of the target scene and the scene to be registered are updated iteratively by the fixed-point method under the framework of the EM algorithm, and the number of components is determined based on the minimum message length (MML) criterion. The KL divergence between these two mixture models is used as the loss function for stochastic optimization to find the optimal parameters of the transformation model. Self-built point clouds are used to evaluate the performance of the proposed algorithm on rigid registration. Experiments demonstrate that the algorithm dramatically reduces the impact of noise and outliers and effectively extracts the key features of data-intensive regions.
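For reference, the closed-form KL divergence between two plain multivariate Gaussians is sketched below as the simplest special case of comparing mixture components; the paper's mixtures use multivariate generalized Gaussian components, and the mixture-to-mixture KL itself has no closed form and is typically approximated.

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Toy usage: two 3-D components with slightly different means and covariances.
mu0, mu1 = np.zeros(3), np.full(3, 0.1)
cov0, cov1 = np.eye(3), 1.2 * np.eye(3)
print(kl_gaussians(mu0, cov0, mu1, cov1))
```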