ABSTRACT
Nonlinear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP) have been widely used for biological data exploration, especially in single-cell analysis. However, existing methods struggle to preserve the geometric and topological structures of the data. A high-dimensional data analysis method called Panoramic manifold projection (Panoramap) was developed as an enhanced deep-learning framework for structure-preserving NLDR. Panoramap augments deep neural networks with cross-layer geometry-preserving constraints; these constraints constitute the loss for deep manifold learning and serve as geometric regularizers when training the NLDR network, so Panoramap better preserves the global structure of the original data. Here, we apply Panoramap to single-cell datasets and show that it excels at delineating cell-type lineages and hierarchies and can reveal rare cell types. Panoramap facilitates trajectory inference and has the potential to aid the early diagnosis of tumors. It yields improved and more biologically plausible visualization and interpretation of single-cell data, and can be readily used in single-cell research as well as other fields that involve high-dimensional data analysis.
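As a rough illustration of the kind of geometry-preserving constraint described above, the following Python/PyTorch sketch penalizes distortion of pairwise distances between a batch and its embedding. All names and sizes are illustrative, and for brevity the constraint is applied only between input and output, whereas Panoramap imposes such constraints across layers; this is not the authors' code.

    import torch
    import torch.nn as nn

    class GeomPreservingNet(nn.Module):
        def __init__(self, d_in, d_out=2):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(d_in, 256), nn.ReLU(),
                nn.Linear(256, 64), nn.ReLU(),
                nn.Linear(64, d_out),
            )

        def forward(self, x):
            return self.encoder(x)

    def geometry_loss(x, z):
        # Penalize distortion of pairwise distances between input and
        # embedding, after normalizing each matrix to a comparable scale.
        dx = torch.cdist(x, x)
        dz = torch.cdist(z, z)
        dx = dx / (dx.mean() + 1e-8)
        dz = dz / (dz.mean() + 1e-8)
        return ((dx - dz) ** 2).mean()

    model = GeomPreservingNet(d_in=100)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(128, 100)          # a mini-batch: cells x genes
    loss = geometry_loss(x, model(x))
    loss.backward()
    opt.step()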
Subjects
Algorithms, Neural Networks (Computer), Single-Cell Analysis

ABSTRACT
Unsupervised representation learning (URL), which learns compact embeddings of high-dimensional data without supervision, has achieved remarkable progress recently. However, URL methods for different requirements have been developed independently, which limits the generality of the algorithms and becomes prohibitive as the number of tasks grows. For example, dimension reduction (DR) methods such as t-SNE and UMAP optimize pairwise data relationships to preserve global geometric structure, while self-supervised learning methods such as SimCLR and BYOL focus on mining the local statistics of instances under specific augmentations. To address this dilemma, we summarize and propose a unified similarity-based URL framework, GenURL, which adapts smoothly to various URL tasks. In this article, we regard URL tasks as different implicit constraints on the geometric structure of the data that help seek optimal low-dimensional representations, and we decompose the problem into data structural modeling (DSM) and low-dimensional transformation (LDT). Specifically, DSM provides a structure-based submodule to describe global structure, and LDT learns compact low-dimensional embeddings for given pretext tasks. Moreover, an objective function, the general Kullback-Leibler (GKL) divergence, is proposed to connect DSM and LDT naturally. Comprehensive experiments demonstrate that GenURL achieves consistent state-of-the-art performance in self-supervised visual learning, unsupervised knowledge distillation (KD), graph embeddings (GEs), and DR.
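A minimal sketch of the DSM/LDT split and a KL-style coupling objective, in the spirit of the framework above (Python/PyTorch; the kernel choice, names, and the exact divergence are assumptions, not GenURL's published implementation):

    import torch

    def pairwise_similarity(x, sigma=1.0):
        # Gaussian-kernel similarities with zeroed self-pairs,
        # row-normalized into a probability distribution.
        d2 = torch.cdist(x, x) ** 2
        p = torch.exp(-d2 / (2 * sigma ** 2))
        p = p * (1 - torch.eye(len(x)))
        return p / p.sum(dim=1, keepdim=True).clamp_min(1e-12)

    def general_kl(p, q, eps=1e-12):
        # KL-style divergence between two row-stochastic similarity matrices.
        return (p * ((p + eps).log() - (q + eps).log())).sum(dim=1).mean()

    x = torch.randn(64, 50)                     # high-dimensional batch (DSM input)
    z = torch.randn(64, 2, requires_grad=True)  # low-dimensional embedding (LDT output)
    loss = general_kl(pairwise_similarity(x), pairwise_similarity(z))
    loss.backward()                             # drives z (or an encoder) to match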
ABSTRACT
Dimension reduction (DR) maps high-dimensional data into a lower-dimensional latent space while minimizing defined optimization objectives. The two independent branches of DR are feature selection (FS) and feature projection (FP). FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure). FP, on the other hand, combines all input features into a lower-dimensional space, aiming to maintain the data structure, but lacks interpretability and sparsity. Moreover, FS and FP have traditionally been incompatible categories that were never unified in a common framework. We therefore consider that the ideal DR approach combines FS and FP into a unified end-to-end manifold learning framework, simultaneously performing fundamental feature discovery while maintaining the intrinsic relationships between data samples in the latent space. This paper proposes such a framework, the Unified Dimensional Reduction Network (UDRN), which integrates FS and FP in an end-to-end way: a stacked feature selection network and a feature projection network implement the FS and FP tasks separately. In addition, we propose a stronger manifold assumption and a novel loss function, which can leverage data-augmentation priors to enhance the generalization ability of UDRN. Finally, comprehensive experimental results on four image and four biological datasets, including very high-dimensional data, demonstrate the advantages of UDRN over existing methods (FS, FP, and FS&FP pipelines), especially in downstream tasks such as classification and visualization.
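The FS+FP coupling can be pictured as a learnable sparse gate feeding a projection network, trained end to end. The sketch below (Python/PyTorch) uses illustrative layer sizes, a placeholder main loss, and an L1-style gate penalty; it is not UDRN's actual architecture:

    import torch
    import torch.nn as nn

    class SelectThenProject(nn.Module):
        def __init__(self, d_in, d_out=2):
            super().__init__()
            self.gate = nn.Parameter(torch.zeros(d_in))   # FS: one logit per feature
            self.project = nn.Sequential(                 # FP: embed the gated input
                nn.Linear(d_in, 128), nn.ReLU(),
                nn.Linear(128, d_out),
            )

        def forward(self, x):
            return self.project(x * torch.sigmoid(self.gate))

        def sparsity_penalty(self):
            # L1 pressure on the gates keeps the selected feature subset small.
            return torch.sigmoid(self.gate).sum()

    model = SelectThenProject(d_in=2000)
    x = torch.randn(32, 2000)
    z = model(x)
    # The real objective would be a structure-preserving manifold loss on z;
    # a placeholder term keeps the sketch self-contained and runnable.
    loss = z.pow(2).mean() + 1e-4 * model.sparsity_penalty()
    loss.backward()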
Subjects
Learning, Neural Networks (Computer), Generalization (Psychology)

ABSTRACT
Dimensionality reduction and visualization play an important role in biological data analysis, such as the interpretation of single-cell RNA sequencing (scRNA-seq) data. A desirable visualization method should not only be applicable to various scenarios, including cell clustering and trajectory inference, but also satisfy a variety of technical requirements, especially the ability to preserve the inherent structure of the data and to handle batch effects. However, no existing method accommodates these requirements in a unified framework. In this paper, we propose a general visualization method, deep visualization (DV), that preserves the inherent structure of the data, handles batch effects, and is applicable to a variety of datasets from different application domains and of different scales. The method embeds a given dataset into a 2- or 3-dimensional visualization space, with either a Euclidean or a hyperbolic metric, depending on whether the task involves static (single time point) or dynamic (a sequence of time points) scRNA-seq data, respectively. Specifically, DV learns a structure graph describing the relationships between data samples, then transforms the data into the visualization space while preserving geometric structure and correcting batch effects in an end-to-end manner. Experimental results on nine datasets of complex tissue from human patients or animal development demonstrate the competitiveness of DV in discovering complex cellular relations, uncovering temporal trajectories, and addressing complex batch factors. We also provide a preliminary attempt to pre-train a DV model for visualizing new incoming data.
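The first step, building a structure graph over samples, can be approximated with a symmetric k-nearest-neighbour graph, as in this Python/scikit-learn sketch (parameter choices are illustrative, not DV's defaults):

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(500, 50))   # cells x features (e.g., PCA of scRNA-seq)

    # Symmetric kNN graph: each edge marks a trusted local relation that the
    # 2-D/3-D embedding (Euclidean or hyperbolic) should then preserve.
    knn = kneighbors_graph(expr, n_neighbors=15, mode="distance", include_self=False)
    graph = knn.maximum(knn.T)           # symmetrize the directed kNN relation
    print(graph.shape, graph.nnz)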
Subjects
Single-Cell Analysis, Single-Cell Gene Expression Analysis, Animals, Humans, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Cluster Analysis

ABSTRACT
Dimension reduction (DR) is commonly utilized to capture intrinsic structure and transform high-dimensional data into a low-dimensional space while retaining meaningful properties of the original data. It is used in various applications, such as image recognition, single-cell sequencing analysis, and biomarker discovery. However, contemporary parameter-free and parametric DR techniques suffer from significant shortcomings, such as failing to preserve both global and local features and poor generalization performance. Regarding explainability, it is crucial to understand the embedding process, in particular how each component and each input feature contributes to the embedding results, in order to identify critical components and diagnose the embedding process. To address these problems, we developed a deep neural network method called EVNet, which provides not only excellent performance in structural maintainability but also explainability of the DR therein. EVNet starts with data augmentation and a manifold-based loss function to improve embedding performance. The explanation is based on saliency maps and examines the trained EVNet parameters and the contributions of components during the embedding process. The proposed techniques are integrated with a visual interface that helps users adjust EVNet to achieve better DR performance and explainability; the interactive interface makes it easier to illustrate data features, compare different DR techniques, and investigate DR. An in-depth experimental comparison shows that EVNet consistently outperforms state-of-the-art methods in both performance measures and explainability.
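The saliency-map idea, how strongly each input feature drives the embedding coordinates, reduces to a gradient computation, sketched here with a stand-in encoder (Python/PyTorch; EVNet's actual architecture and saliency variant may differ):

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))

    x = torch.randn(1, 100, requires_grad=True)
    z = encoder(x)
    z.sum().backward()                        # d(embedding)/d(input), summed over dims

    saliency = x.grad.abs().squeeze(0)        # per-feature influence on the embedding
    top_features = saliency.topk(10).indices  # candidate "critical components"
    print(top_features)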
ABSTRACT
Determination of malignancy in thyroid nodules remains a major diagnostic challenge. Here we report the feasibility and clinical utility of an AI-defined protein-based biomarker panel for the diagnostic classification of thyroid nodules, developed initially on formalin-fixed, paraffin-embedded (FFPE) tissue and further refined for fine-needle aspiration (FNA) specimens, whose minute amounts pose technical challenges for other methods. We first developed a neural network model of 19 protein biomarkers based on the proteomes of 1724 FFPE thyroid tissue samples from a retrospective cohort. This classifier achieved over 91% accuracy in the discovery set for classifying malignant thyroid nodules. The classifier was externally validated by blinded analyses in a retrospective cohort of 288 nodules (89% accuracy; FFPE) and a prospective cohort of 294 FNA biopsies (85% accuracy) from twelve independent clinical centers. This study shows that integrating high-throughput proteomics and AI technology in multi-center retrospective and prospective clinical cohorts enables precise disease diagnosis that is otherwise difficult to achieve.
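For intuition, a classifier over a 19-protein panel can be as small as the following Python/PyTorch sketch; the architecture, training details, and data here are illustrative stand-ins, not the study's model:

    import torch
    import torch.nn as nn

    clf = nn.Sequential(
        nn.Linear(19, 32), nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(32, 2),                    # logits: benign vs malignant
    )

    proteins = torch.randn(8, 19)            # nodules x 19 biomarker intensities
    labels = torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(clf(proteins), labels)
    loss.backward()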
ABSTRACT
Severity prediction of COVID-19 remains one of the major clinical challenges of the ongoing pandemic. Here, we recruited a cohort of 144 COVID-19 patients, yielding a data matrix of 3,065 readings across 124 types of measurements over 52 days. A machine learning model was established to predict disease progression, with the cohort split into training, validation, and internal test sets. A classifier built on a panel of eleven routine clinical factors achieved over 98% accuracy for COVID-19 severity prediction in the discovery set. Validation of the model in an independent cohort of 25 patients achieved 80% accuracy. The overall sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were 0.70, 0.99, 0.93, and 0.93, respectively. Our model captured the predictive dynamics of lactate dehydrogenase (LDH) and creatine kinase (CK) even while their levels remained within the normal range. The model is accessible at https://www.guomics.com/covidAI/ for research purposes.
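The four reported metrics follow directly from the confusion matrix; a short worked example in Python with arbitrary illustrative counts (not the study's data):

    def clinical_metrics(tp, fp, tn, fn):
        sensitivity = tp / (tp + fn)   # fraction of severe cases caught
        specificity = tn / (tn + fp)   # fraction of non-severe cases cleared
        ppv = tp / (tp + fp)           # positive predictive value
        npv = tn / (tn + fn)           # negative predictive value
        return sensitivity, specificity, ppv, npv

    print(clinical_metrics(tp=30, fp=5, tn=50, fn=15))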
ABSTRACT
A novel approach for phenotype prediction is developed for data-independent acquisition (DIA) mass spectrometry (MS) data that requires no peptide precursor identification by existing DIA software tools. The first step converts a DIA-MS data file into a new file format called DIA tensor (DIAT), which allows convenient visualization of all ions from peptide precursors and fragments. DIAT files can be fed directly into a deep neural network for phenotype prediction, as demonstrated with appearances of cats, dogs, and microscopic images. As a proof of principle, we applied this approach to 102 hepatocellular carcinoma samples and achieved 96.8% accuracy in distinguishing malignant from benign samples. We further applied a refined model to classify thyroid nodules: deep learning based on 492 training samples achieved 91.7% accuracy in an independent cohort of 216 test samples. This approach surpassed a deep-learning model based on the peptide and protein matrices generated by OpenSWATH. In summary, we present a new strategy for DIA data analysis based on the novel DIAT format, which enables facile two-dimensional visualization of DIA proteomics data; DIAT files can be used directly in deep learning for biological and clinical phenotype classification. Future research will focus on interpreting the deep-learning models that emerge from DIAT analysis.
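The core DIAT step, binning raw ion signals into a dense two-dimensional array (retention time x m/z) that can be viewed like an image and fed to a network, can be sketched as follows in Python/NumPy (bin counts and the input format are assumptions for illustration):

    import numpy as np

    def to_dia_tensor(rt, mz, intensity, rt_bins=512, mz_bins=512):
        # Accumulate ion intensities on a retention-time x m/z grid.
        img = np.zeros((rt_bins, mz_bins), dtype=np.float32)
        ri = np.clip((rt / rt.max() * (rt_bins - 1)).astype(int), 0, rt_bins - 1)
        mi = np.clip((mz / mz.max() * (mz_bins - 1)).astype(int), 0, mz_bins - 1)
        np.add.at(img, (ri, mi), intensity)   # sum intensities per pixel
        return np.log1p(img)                  # compress dynamic range for viewing

    rng = np.random.default_rng(1)
    tensor = to_dia_tensor(rng.uniform(0, 60, 10_000),      # retention times (min)
                           rng.uniform(400, 1200, 10_000),  # m/z values
                           rng.exponential(1.0, 10_000))    # ion intensities
    print(tensor.shape)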
Subjects
Mass Spectrometry/methods, Proteome/analysis, Proteomics/methods, Hepatocellular Carcinoma/chemistry, Hepatocellular Carcinoma/diagnosis, Deep Learning, Humans, Liver Neoplasms/chemistry, Liver Neoplasms/diagnosis, Peptides/analysis, Software, Thyroid Gland/chemistry

ABSTRACT
In this paper, a hybrid deep neural network scheduler (HDNNS) is proposed to solve job-shop scheduling problems (JSSPs). To mine the state information of schedule processing, a job-shop scheduling problem is divided into several classification-based subproblems, and a deep learning framework is used to solve them. HDNNS applies the convolution two-dimensional transformation method (CTDT) to transform irregular scheduling information into regular features, so that the convolution operations of deep learning can be applied to JSSPs. The simulation experiments designed to test HDNNS cover JSSPs with different numbers of machines and jobs as well as different processing-time distributions. The results show that the makespan index of HDNNS is 9% better than that of HNN and 4% better than that of ANN on the ZLP dataset. With the same neural network structure, the training time of HDNNS is markedly shorter than that of DeepRM. In addition, the scheduler shows excellent generalization performance and can address large-scale scheduling problems with only small-scale training data.
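Recasting a dispatching decision as classification, as described above, can be sketched as a network that scores candidate jobs given an encoded shop state (Python/PyTorch; the state encoding and labels are illustrative, and this toy dense model omits the CTDT reshaping that makes convolutions applicable):

    import torch
    import torch.nn as nn

    N_JOBS, STATE_DIM = 10, 40                  # candidate jobs, features per state

    policy = nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(),
        nn.Linear(64, N_JOBS),                  # class = which job to dispatch next
    )

    state = torch.randn(16, STATE_DIM)          # batch of scheduling states
    best_job = torch.randint(0, N_JOBS, (16,))  # supervision from known schedules
    loss = nn.functional.cross_entropy(policy(state), best_job)
    loss.backward()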