ABSTRACT
Single-cell RNA sequencing (scRNA-seq) technology has revolutionized biological research by enabling high-throughput, cellular-resolution gene expression profiling. A critical step in scRNA-seq data analysis is cell clustering, which supports downstream analyses. However, the high-dimensional and sparse nature of scRNA-seq data poses significant challenges to existing clustering methods. Furthermore, integrating gene expression information with potential cell structure data remains largely unexplored. Here, we present scCFIB, a novel information bottleneck (IB)-based clustering algorithm that leverages the power of IB for efficient processing of high-dimensional sparse data and incorporates a cross-view fusion strategy to achieve robust cell clustering. scCFIB constructs a multi-feature space by establishing two distinct views from the original features. We then formulate the cell clustering problem as a target loss function within the IB framework, employing a collaborative information fusion strategy. To further optimize scCFIB's performance, we introduce a novel sequential optimization approach through an iterative process. Benchmarking against established methods on diverse scRNA-seq datasets demonstrates that scCFIB achieves superior performance in scRNA-seq data clustering tasks. Availability: the source code is publicly available on GitHub: https://github.com/weixiaojiao/scCFIB.
Subject(s)
Algorithms , Single-Cell Analysis , Cluster Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Humans , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Software , Computational Biology/methods , Single-Cell Gene Expression Analysis
ABSTRACT
Despite significant advances in deep clustering research, most existing approaches share three critical limitations. First, they often derive the clustering result by associating some distribution-based loss to specific network layers, neglecting the potential benefits of leveraging the contrastive sample-wise relationships. Second, they frequently focus on representation learning at the full-image scale, overlooking the discriminative information latent in partial image regions. Third, although some prior studies perform the learning process at multiple levels, they mostly lack the ability to exploit the interaction between different learning levels. To overcome these limitations, this paper presents a novel deep image clustering approach via Partial Information discrimination and Cross-level Interaction (PICI). Specifically, we utilize a Transformer encoder as the backbone, coupled with two types of augmentations to formulate two parallel views. The augmented samples, integrated with masked patches, are processed through the Transformer encoder to produce the class tokens. Subsequently, three partial information learning modules are jointly enforced, namely, the partial information self-discrimination (PISD) module for masked image reconstruction, the partial information contrastive discrimination (PICD) module for the simultaneous instance- and cluster-level contrastive learning, and the cross-level interaction (CLI) module to ensure the consistency across different learning levels. Through this unified formulation, our PICI approach, for the first time to our knowledge, bridges the gap between masked image modeling and deep contrastive clustering, offering a novel pathway for enhanced representation learning and clustering. Experimental results across six image datasets demonstrate the superiority of our PICI approach over state-of-the-art methods. In particular, our approach achieves an ACC of 0.772 (0.634) on the RSOD (UC-Merced) dataset, which shows an improvement of 29.7% (24.8%) over the best baseline. The source code is available at https://github.com/Regan-Zhang/PICI.
ABSTRACT
The field of data exploration relies heavily on clustering techniques to organize vast datasets into meaningful subgroups, offering valuable insights across various domains. Traditional clustering algorithms have performance limitations: they often get stuck in local minima and struggle with complex datasets of varying shapes and densities. They also require prior knowledge of the number of clusters, which can be a drawback in real-world scenarios. In response to these challenges, we propose the "hybrid raven roosting intelligence framework" (HRIF) algorithm. HRIF draws inspiration from the dynamic behaviors of roosting ravens and from computational intelligence. What distinguishes HRIF is its capacity to navigate the clustering landscape, evading local optima and converging toward optimal solutions. An essential enhancement in HRIF is the incorporation of the Gaussian mutation operator, which adds stochasticity to improve exploration and mitigate the risk of local minima. This research presents the development and evaluation of HRIF, showcasing its fusion of nature-inspired optimization techniques and computational intelligence. Extensive experiments with diverse benchmark datasets demonstrate HRIF's competitive performance, particularly its capability to handle complex data and avoid local minima, resulting in accurate clustering outcomes. HRIF's adaptability to challenging datasets and its potential to enhance clustering efficiency and solution quality position it as a promising solution for data exploration.
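As an illustration of the Gaussian mutation operator mentioned in the abstract above, the following is a minimal sketch and not the HRIF implementation. It assumes a candidate solution is a flat list of centroid coordinates scaled to [0, 1]; the function name `gaussian_mutation` and the parameters `sigma`, `rate`, and `bounds` are hypothetical choices made for this example.

```python
import random

def gaussian_mutation(position, sigma=0.1, rate=0.2, bounds=(0.0, 1.0), rng=None):
    """Return a mutated copy of a candidate solution.

    Each coordinate is perturbed with probability `rate` by zero-mean
    Gaussian noise of standard deviation `sigma`, then clamped to `bounds`.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = rng or random.Random()
    low, high = bounds
    mutated = []
    for value in position:
        if rng.random() < rate:
            value = value + rng.gauss(0.0, sigma)  # stochastic perturbation
            value = min(max(value, low), high)     # keep inside the search space
        mutated.append(value)
    return mutated

# Example: mutate a hypothetical candidate holding two 2-D centroids.
rng = random.Random(0)
candidate = [0.2, 0.8, 0.5, 0.1]
mutant = gaussian_mutation(candidate, sigma=0.05, rate=0.5, rng=rng)
```

A larger `sigma` favours exploration of distant regions, while a smaller one refines the current solution; how HRIF schedules these values is not stated in the abstract.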
ABSTRACT
This article presents experience with applying artificial intelligence in the research process. It gives general information about basic machine learning concepts (clustering and visualization) and considers in more detail an experience of clinical testing. The effectiveness of applying data analysis methods and tools as one of the research stages is demonstrated with a case of processing medical information using machine learning algorithms: assessing the diagnostic value of a proposed indicator (FTF) for determining target age groups. Implementing this digital transformation approach improves the operational effectiveness of research, as well as the quality and availability of the final technological products being developed, namely software for solving expert problems.
Subject(s)
Artificial Intelligence , Humans , Data Analysis , Machine Learning , Algorithms , Forensic Medicine/methods
ABSTRACT
Non-linear dimensionality reduction can be performed by manifold learning approaches, such as stochastic neighbour embedding (SNE), locally linear embedding (LLE) and isometric feature mapping (ISOMAP). These methods aim to produce two- or three-dimensional latent embeddings, primarily to visualise the data in intelligible representations. This manuscript proposes extensions of Student's t-distributed SNE (t-SNE), LLE and ISOMAP, for dimensionality reduction and visualisation of multi-view data. Multi-view data refers to multiple types of data generated from the same samples. The proposed multi-view approaches provide more comprehensible projections of the samples compared to the ones obtained by visualising each data-view separately. Commonly, visualisation is used for identifying underlying patterns within the samples. By incorporating the obtained low-dimensional embeddings from the multi-view manifold approaches into the K-means clustering algorithm, it is shown that clusters of the samples are accurately identified. Through extensive comparisons of novel and existing multi-view manifold learning algorithms on real and synthetic data, the proposed multi-view extension of t-SNE, named multi-SNE, is found to have the best performance, assessed both qualitatively and quantitatively through the clusterings obtained. The applicability of multi-SNE is illustrated by its application to newly developed and challenging multi-omics single-cell data. The aim is to visualise and identify cell heterogeneity and cell types in biological tissues relevant to health and disease. In this application, multi-SNE provides an improved performance over single-view manifold learning approaches and a promising solution for unified clustering of multi-omics single-cell data.
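The final step described above, feeding a low-dimensional embedding into K-means, can be sketched as follows. This is a plain Lloyd's-algorithm K-means in standard-library Python, not the authors' code; the 2-D points stand in for a hypothetical embedding and are invented for the example, and the multi-view embedding step itself is not shown.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D embedding with two well-separated groups of samples.
embedding = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(embedding, k=2)
```

Because K-means only sees the embedding, the quality of the resulting clusters depends on how well the embedding separates the samples, which is the property the abstract uses to compare the manifold learning methods.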
ABSTRACT
Multimedia resources, such as instructional videos, are currently enjoying a certain popularity in the training programs for medical and dental students. The major challenge is to create such resources with quality content that is approved by students. To meet this challenge, it is imperative to find out which features of instructional videos students consider necessary and useful, so that the videos can excite them, hold their attention, and stimulate them to learn with pleasure. AIM: We investigated the opinions of a sample of 551 students from four medical universities in Romania, in order to identify the students' preferred characteristics in instructional videos, both globally and comparatively by gender and age group, and also according to their general preferences for using internet services. MATERIAL AND METHODS: We used univariate (hypothesis testing) and multivariate (two-step clustering) data analysis techniques and revealed three clusters of students, primarily determined by their perceptions of the visual appearance of the instructional videos. RESULTS: The structure of the clusters by gender and age group was relatively similar, but we recorded differences associated with the students' expressed preferences for certain internet services compared to others. The first identified cluster (35.4% of the cases) contains students who prefer instructional videos to contain images used only for aesthetic purposes and to fill the gaps; they use internet services mainly for communication. The second cluster of students (34.8%) prefers videos designed as practical lessons, using explanatory drawings and diagrams drawn at the same time as the explanations; they also use internet services mainly for communication. The last cluster of students (29.8%) prefers videos designed as PowerPoint presentations, with animated pictures, diagrams, and drawings; they are slightly younger than the others and use internet services mainly for information and communication, but also for domestic facilities. CONCLUSIONS: The students' preferences for certain features of instructional videos depend not only on gender and age but are also related to their developmental background and general opinions about modern technologies.
ABSTRACT
The study applies a type of artificial neural network called the Self-Organising Map (SOM) to identify areas of the human abdominal wall that behave in a similar mechanical way. The research is based on data acquired during in vivo tests using the digital image correlation (DIC) technique. The mechanical behaviour of the human abdominal wall is analysed during changing intra-abdominal pressure. The SOM makes it possible to study three variables simultaneously at four time/load steps. The variables refer to the principal strains and their directions. The SOM classifies all the abdominal surface data points into clusters that behave similarly in accordance with the 12 variables. The analysis of the clusters provides a better insight into abdominal wall deformation and its evolution under pressure than observing a single mechanical variable. The presented results may provide a better understanding of the mechanics of the living human abdominal wall. They might be particularly useful when selecting proper implants and when designing surgical meshes for the treatment of abdominal hernias that are mechanically compatible with the identified regions of the human anterior abdominal wall, and may open the way for patient-specific solutions.
Subject(s)
Abdominal Wall , Neural Networks, Computer , Stress, Mechanical , Humans , Abdominal Wall/physiology , Biomechanical Phenomena , Pressure , Mechanical Tests , Male
ABSTRACT
Climate change has intensified the effects of habitat fragmentation in many ecosystems, particularly in riparian habitats. Because riparian systems play a crucial role in landscape connectivity, there is an urgent need to identify keystone connectivity spots to ensure their long-term conservation and sustainable management. This paper aims to identify critical areas for connectivity under two contrasting climate change scenarios (RCP 4.5 and RCP 8.5 models) for the years 2030, 2050 and 2100 and to group these critical areas by similar connectivity into keystone spots for sustainable management. A set of analyses comprising climate analysis, drainage network analysis, configuration of potential riparian habitats, riparian habitat connectivity, data clustering, and statistical analysis was applied within a Spanish river basin (NW Spain). The node and link connectivity would be reduced under the two climate change scenarios (≈2.5 % and 4.4 % reduction, respectively), intensifying riparian habitat fragmentation. Furthermore, 51 different clusters (critical areas) were obtained and classified into five classes (keystone spots) with similar connectivity across the different scenarios of climate change. Each keystone spot obtained by hierarchical classification was associated with one or more climate scenarios. One of these keystone spots was especially susceptible to the worst climate change scenario. Key riparian connectivity spots will be crucial for the management and restoration of highly threatened riparian systems and to ensure long-term biodiversity conservation.
Subject(s)
Climate Change , Ecosystem , Biodiversity , Rivers , Spain , Conservation of Natural Resources
ABSTRACT
In multiview data clustering, exploiting consistent or complementary information across views can improve clustering results. However, the high dimensionality, lack of labels, and redundancy of multiview data degrade clustering performance, posing a challenge to multiview clustering. This study designs a multiview feature selection clustering (MFSC) algorithm that combines similarity graph learning and unsupervised feature selection. In MFSC, local manifold regularization is integrated into similarity graph learning, and the cluster labels obtained from similarity graph learning serve as the standard for unsupervised feature selection. MFSC can retain the characteristics of the cluster labels while maintaining the manifold structure of the multiview data. The algorithm is systematically evaluated using benchmark multiview and simulated data. The clustering experiments show that the MFSC algorithm is more effective than traditional algorithms.
ABSTRACT
Cluster analysis is a crucial stage in the analysis and interpretation of single-cell gene expression (scRNA-seq) data. It is an inherently ill-posed problem whose solutions depend heavily on hyper-parameter and algorithmic choice. The popular approach of K-means clustering, for example, depends heavily on the choice of K and the convergence of the expectation-maximization algorithm to local minima of the objective. Exhaustive search of the space for multiple good quality solutions is known to be a complex problem. Here, we show that quantum computing offers a solution to exploring the cost function of clustering by quantum annealing, implemented on a quantum computing facility offered by D-Wave [1]. Our formulation extracts a minimum vertex cover of an affinity graph to sub-sample the cell population and uses quantum annealing to optimise the cost function. A distribution of low-energy solutions can thus be extracted, offering alternate hypotheses about how genes group together in their space of expressions.
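To make the vertex-cover step above concrete, here is a small sketch of computing a vertex cover of a graph given as an edge list. It deliberately swaps in the classical greedy maximal-matching heuristic, which returns a valid cover at most twice the minimum size, rather than an exact minimum vertex cover. The toy edge list is invented, and how the authors turn the cover into a cell sub-sample is not specified in the abstract, so that part is not shown.

```python
def greedy_vertex_cover(edges):
    """Return a vertex cover using the maximal-matching 2-approximation.

    For every edge that is not yet covered, both endpoints are added.
    The result covers all edges and has at most twice the minimum size.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Hypothetical affinity graph: nodes are cells, edges link similar cells.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
cover = greedy_vertex_cover(edges)

# Every edge has at least one endpoint in the cover.
assert all(u in cover or v in cover for u, v in edges)
```

An exact minimum cover is NP-hard to compute in general, which is one reason approximate or annealing-based formulations are attractive for large cell populations.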
Subject(s)
Computing Methodologies , Quantum Theory , RNA-Seq , Sequence Analysis, RNA , Algorithms , Cluster Analysis , Gene Expression Profiling
ABSTRACT
BACKGROUND: Technology advancement has allowed more frequent monitoring of biomarkers. The resulting data structure entails more frequent follow-ups compared to traditional longitudinal studies, where the number of follow-ups is often small. Such data allow exploration of the role of intra-person variability in understanding disease etiology and characterizing disease processes. A specific example is to characterize the pathogenesis of bacterial vaginosis (BV) using weekly vaginal microbiota Nugent assay scores collected over 2 years in post-menarcheal women from Rakai, Uganda, and to identify risk factors for each vaginal microbiota pattern to inform epidemiological and etiological understanding of the pathogenesis of BV. METHODS: We use a fully data-driven approach to characterize the longitudinal patterns of vaginal microbiota by considering the densely sampled Nugent scores to be random functions over time and performing dimension reduction by functional principal components. Extending a current functional data clustering method, we use a hierarchical functional clustering framework considering multiple data features to help identify clinically meaningful patterns of vaginal microbiota fluctuations. Additionally, multinomial logistic regression was used to identify risk factors for each vaginal microbiota pattern to inform epidemiological and etiological understanding of the pathogenesis of BV. RESULTS: Using weekly Nugent scores over 2 years of 211 sexually active, post-menarcheal women in Rakai, four patterns of vaginal microbiota variation were identified: persistent BV state (high Nugent scores); persistent normal-range Nugent scores; large fluctuations of Nugent scores that are predominantly in the BV state; and large fluctuations of Nugent scores that are predominantly in the normal state. A higher Nugent score at the start of an interval, age younger than 20 years, an unprotected source of bathing water, having an uncircumcised partner, and use of injectable/Norplant hormonal contraceptives for family planning were associated with higher odds of persistent BV. CONCLUSION: The hierarchical functional data clustering method can be used for fully data-driven unsupervised clustering of densely sampled longitudinal data to identify clinically informative clusters and the risk factors associated with each cluster.
Subject(s)
Microbiota , Vaginosis, Bacterial , Female , Humans , Young Adult , Risk Factors , Uganda/epidemiology , Vagina/microbiology , Vaginosis, Bacterial/epidemiology , Vaginosis, Bacterial/microbiology
ABSTRACT
Research producing evidence-based information on the health benefits of green and blue spaces often has, within its design, the potential for inherent or implicit bias that can unconsciously orient the outcomes of such studies towards preconceived hypotheses. Many studies are situated in proximity to specific or generic green and blue spaces (hence constituting a green or blue space led approach), others are conducted because green and blue space data are available (hence applying a green or blue space data led approach), while other studies are shaped by particular interests in the association of particular health conditions with the presence of, or engagement with, green or blue spaces (hence adopting a health or health status led approach). In order to tackle this bias and develop a more objective research design for studying associations between human health outcomes and green and blue spaces, this paper discusses the features of a methodological framework suitable for that purpose, following an initial, year-long, exploratory Irish study. The innovative approach explored by this study (i.e., the health-data led approach) first identifies sample sites with good and poor health outcomes from available health data (using data clustering techniques) before examining the potential role of the presence of, or engagement with, green and blue spaces in creating such health outcomes. By doing so, we argue that some of the bias associated with the other three listed methods can be reduced or even eliminated. Finally, we infer that the principles and paradigm adopted by the health-data led approach can be applicable and effective in analyzing other sustainability problems beyond associations between human health outcomes and green and blue spaces (e.g., health, energy, food, income, environment and climate inequality and justice). This possibility is also discussed in the paper.
Subject(s)
Climate , Food , Humans , Income , Social Justice
ABSTRACT
In this paper, a weighted multivariate generalized Gaussian mixture model combined with stochastic optimization is proposed for point cloud registration. The mixture model parameters of the target scene and the scene to be registered are updated iteratively by the fixed point method under the framework of the EM algorithm, and the number of components is determined based on the minimum message length criterion (MML). The KL divergence between these two mixture models is utilized as the loss function for stochastic optimization to find the optimal parameters of the transformation model. The self-built point clouds are used to evaluate the performance of the proposed algorithm on rigid registration. Experiments demonstrate that the algorithm dramatically reduces the impact of noise and outliers and effectively extracts the key features of the data-intensive regions.
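The use of a KL divergence between two mixture models as a loss, as described above, can be illustrated with a Monte Carlo estimate, since the KL divergence between mixtures has no closed form in general. The sketch below is a simplification under stated assumptions: it uses one-dimensional ordinary Gaussian mixtures rather than the weighted multivariate generalized Gaussian mixtures of the paper, and every parameter value is invented for the example.

```python
import math
import random

def mixture_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )

def sample_mixture(rng, weights, means, stds):
    """Draw one sample: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])

def kl_monte_carlo(p, q, n=20000, seed=0):
    """Estimate KL(p || q) as the mean of log p(x) - log q(x) over x ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_mixture(rng, *p)
        total += math.log(mixture_pdf(x, *p)) - math.log(mixture_pdf(x, *q))
    return total / n

# Two hypothetical mixtures: q is p with both component means shifted by 1.
p = ([0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
q = ([0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
estimate = kl_monte_carlo(p, q)
```

In a registration setting, the parameters of one mixture would depend on the transformation being optimised, and a stochastic optimiser would search for the transformation that minimises an estimate of this kind.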
ABSTRACT
PIM-1 kinase is a serine-threonine phosphorylating enzyme with implications in multiple types of malignancies, including prostate, breast, and blood cancers. Developing better search methodologies for PIM-1 kinase inhibitors may be a good strategy to speed up the discovery of an oncological drug approved for targeting this specific kinase. Computer-aided screening methods are promising approaches for the discovery of novel therapeutics, although certain limitations should be addressed. A frequent omission that is encountered in molecular docking is the lack of proper implementation of scoring functions and algorithms on the post-docking results, which usually alters the outcome of the virtual screening. The current study suggests a method for post-processing docking results, expressed either as binding affinity or score, that considers different binding modes of known inhibitors to the studied targets while making use of in vitro data, where available. The docking protocol successfully discriminated between known PIM-1 kinase inhibitors and decoy molecules, although binding energies alone were not sufficient to ensure a successful prediction. Logistic regression models were trained to predict the probability of PIM-1 kinase inhibitory activity based on binding energies and the presence of interactions with identified key amino acid residues. The selected model showed 80.9% true positive and 81.4% true negative rates. The discussed approach can be further applied in large-scale molecular docking campaigns to increase hit discovery success rates.
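The post-docking classification described above can be sketched as a logistic model over a binding energy and a count of key-residue contacts, followed by the true positive and true negative rates used to report performance. This is an illustrative sketch only: the coefficients, the compounds, and the feature choice are invented, and the names `inhibitor_probability` and `true_rates` are hypothetical, not taken from the study.

```python
import math

def inhibitor_probability(binding_energy, key_contacts, coef):
    """Logistic model: probability of activity from docking-derived features."""
    z = coef["intercept"] + coef["energy"] * binding_energy + coef["contacts"] * key_contacts
    return 1.0 / (1.0 + math.exp(-z))

def true_rates(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Invented coefficients: more negative energy and more contacts raise the score.
COEF = {"intercept": -9.0, "energy": -1.0, "contacts": 1.0}

# Invented compounds: (binding energy in kcal/mol, key contacts, true label).
compounds = [(-9.5, 2, 1), (-8.8, 1, 1), (-7.0, 0, 1), (-6.0, 0, 0), (-9.2, 1, 0)]
y_true = [label for _, _, label in compounds]
y_pred = [1 if inhibitor_probability(e, c, COEF) >= 0.5 else 0 for e, c, _ in compounds]
tpr, tnr = true_rates(y_true, y_pred)
```

The last invented compound shows why binding energy alone can mislead: a decoy with a favourable energy is still scored as active, which lowers the true negative rate.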
ABSTRACT
This research introduces an efficacious model for incremental data clustering using the Entropy weighted-Gradient Namib Beetle Mayfly Algorithm (NBMA). Feature selection is based on support vector machine recursive feature elimination (SVM-RFE), where the weight parameter is fine-tuned using NBMA. Clustering is then carried out with the entropy-weighted power k-means algorithm, and the weights are updated using the designed Gradient NBMA. Finally, incremental data clustering takes place, in which centroid matching is based on the RV coefficient and the centroids are updated using a deep maxout network (DMN). The results show better performance for the proposed method.
ABSTRACT
In recent years, the Variational AutoEncoder (VAE) model has shown good potential and capability in image generation and dimensionality reduction. The combination of VAE and various machine learning frameworks has also worked effectively in different daily life applications; however, its possible use and effectiveness in modern game design has seldom been explored or assessed. The use of its feature extractor for data clustering has also received little discussion in the literature. This study first explores different mathematical properties of the VAE model, in particular the theoretical framework of the encoding and decoding processes, the achievable lower bound, and the loss functions of different applications; it then applies the established VAE model to generate new game levels based on two well-known game settings, and validates the effectiveness of its data clustering mechanism with the aid of the Modified National Institute of Standards and Technology (MNIST) database. Statistical metrics and assessments are used to evaluate the performance of the proposed VAE model in these case studies. Based on the statistical and graphical results, several potential deficiencies are discussed, for example difficulties in handling high-dimensional and vast datasets and insufficient clarity of outputs; measures for future enhancement, such as tokenization and the combination of VAE and GAN models, are also outlined. Hopefully, this can ultimately maximize the strengths and advantages of VAE for future game design tasks and relevant industrial missions.
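For reference, the lower bound mentioned above is, in the standard VAE formulation, the evidence lower bound (ELBO). The notation below (encoder $q_\phi$, decoder $p_\theta$, prior $p(z)$) is the conventional one and is an assumption here, since the abstract does not state the study's own symbols or any application-specific loss terms.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The first term rewards accurate reconstruction of $x$ from the latent code $z$, and the second keeps the encoder's distribution close to the prior; training maximises the right-hand side, or equivalently minimises its negative as the loss.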
ABSTRACT
Multi-objective design approaches can help identify future infrastructure system designs that appropriately balance different engineering, environmental, and other societal goals. Planners benefit from assessing the trade-offs implied by the best-performing infrastructure system solutions. However, the large number of efficient system designs obtained from multi-objective optimization can be overwhelming to interpret. This study attempts to aid decision-making in multi-criteria infrastructure system design by reducing the complexity of the identified set of efficient infrastructure designs, i.e., the Pareto-front. A soft clustering algorithm is applied, which identifies similarities between solutions, partitions the front accordingly, and selects a set of representative solutions while preserving the multi-dimensional structure of the solutions on the efficiency frontier. Three post-optimization decision-making metrics are introduced to help quantify the overall performance of the Pareto-optimal designs and to further summarize design process outputs for decision-makers. We apply the method to an illustrative urban drainage network case study. Results show how the approach can simplify Pareto-fronts with thousands of solutions into sets of highlighted designs that aid in interpreting the trade-offs implied by the best-performing simulated systems.
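As background to the Pareto-front referred to above, the sketch below filters a set of candidate designs down to the non-dominated ones. It shows only how an efficient set is identified, not the soft clustering or the decision-making metrics of the study. It assumes every objective is to be minimised, and the two objectives and their values are invented for the example.

```python
def dominates(a, b):
    """True if design a is at least as good as b everywhere and better somewhere.

    All objectives are assumed to be minimised.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(other, d) for other in designs)]

# Hypothetical designs as (cost, flood volume) pairs, both to be minimised.
designs = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
front = pareto_front(designs)
```

Here the designs (3, 3) and (4, 4) are dropped because (2, 2) is better on both objectives, while the remaining three represent genuine trade-offs. With thousands of such non-dominated designs, grouping similar ones is what makes the front interpretable.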
Subject(s)
Algorithms , Engineering , Decision Making
ABSTRACT
Developing automated systems with a reasonable cost for long-term care for elders is a promising research direction. Such smart systems are based on recognizing activities of daily living (ADLs) to enable aging in place while preserving the quality of life of all inhabitants in smart homes. One research direction is based on localizing items used by elders in order to monitor their activities with fine-grained details of their progress. In this paper, we shed light on this issue by presenting an approach for localizing items in smart homes. The presented method applies machine learning algorithms to Radio Frequency IDentification (RFID) tag readings. Our approach achieves the required task in two stages. The first stage detects the room in which the selected object is located. The second stage then determines the exact position of the selected object inside the detected room. Additionally, we present an efficient approach based on gradient boosted decision trees for detecting the location of the selected object in a real-world smart home. Moreover, we employ over- and under-sampling techniques with data clustering to improve the performance of the presented techniques. Many experiments are conducted in this work to evaluate the performance of the presented approach for localizing objects in a real smart home. The results of the experiments have shown that our approach provides remarkable performance.
ABSTRACT
Healthy life expectancy (HLE) is an indicator that measures the number of years individuals at a given age are expected to live free of disease or disability. HLE forecasting is essential for planning the provision of health care to elderly populations and appropriately pricing Long Term Care insurance products. In this paper, we propose a methodology that simultaneously forecasts HLE for groups of countries and allows for investigating similarities in their HLE patterns. We first apply functional data clustering to the multivariate time series of HLE at birth of different countries for the years 1990-2019 provided by the Global Burden of Disease Study. Three clusters are identified for both genders. Then, we carry out simultaneous HLE forecasting for the populations within each cluster using a multivariate random walk with drift. Numerical results and the statistical significance of the parameters of the identified multivariate processes are shown. Demographic evidence on the differing evolution of HLE between countries is discussed.
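The forecasting step above can be illustrated with the point forecast of a random walk with drift, in which each series advances by its average historical increment. The sketch below is a simplification: it estimates one drift per series and ignores the error covariance that a full multivariate model would also estimate. The function names and the HLE values are invented for the example and are not Global Burden of Disease data.

```python
def fit_drift(series):
    """Estimate the drift vector as the mean of first differences.

    `series` is a list of observations over time, each a list with one
    value per population. The mean of first differences equals
    (last - first) / (number of steps).
    """
    steps = len(series) - 1
    return [(last - first) / steps for first, last in zip(series[0], series[-1])]

def forecast(series, horizon):
    """Point forecasts for 1..horizon steps ahead: last value plus h times drift."""
    drift = fit_drift(series)
    last = series[-1]
    return [[value + h * d for value, d in zip(last, drift)] for h in range(1, horizon + 1)]

# Invented HLE-at-birth values for two populations over three years.
history = [[60.0, 62.0], [60.5, 62.2], [61.0, 62.4]]
drift = fit_drift(history)
projections = forecast(history, horizon=2)
```

Forecasting all populations of a cluster jointly, as the abstract describes, additionally lets correlated shocks across countries inform the forecast uncertainty, which this point-forecast sketch does not capture.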
ABSTRACT
The Crested Ibis (Nipponia nippon) is an endangered animal with an extremely high ecological, humanistic, and scientific value. However, this species still faces survival challenges due to rapidly shrinking foraging grounds, serious interference from human behavior, and increased habitat requirements. The geographical environment is a significant factor in Crested Ibis behavior-pattern analysis and habitat protection. The spatial and temporal trajectory contains habitat location and period information, a vital record of the Crested Ibis' habits, and the basis of all research. Nevertheless, only a handful of studies address missing trajectory data or methods for fusing multiple sources of environmental data. We studied the spatial and temporal habitat use of tracked Crested Ibis in China by fusing multiple data sources. This paper adopts the LSTM (long short-term memory) model to fill in the missing trajectory data and perform cluster mining, and a random forest model is used to predict the habitat of the Crested Ibis with high fitting accuracy (R² = 84.9%). The results show that the Crested Ibis distribution pattern is characterized by high altitude and proximity to woodland and rivers. Additionally, the habitat's dependence on villages suggests that human agricultural activities positively affect the species' reproduction. This paper provides a complete method for analyzing the Crested Ibis' spatial and temporal trajectory by fusing multi-source data, which is crucial for protecting the survival and reproduction of the Crested Ibis.