Results 1 - 20 of 92
1.
Cell ; 185(10): 1625-1627, 2022 05 12.
Article in English | MEDLINE | ID: mdl-35561663

ABSTRACT

The generation of spatial transcriptomes of whole embryos has been limited in scale and resolution due to various technological restrictions. In this issue of Cell, Chen et al. introduce a DNA nanoball-based sample-capture technology for spatial transcriptome analysis to generate a molecular atlas of mouse organogenesis at single-cell resolution.


Subject(s)
Organogenesis , Transcriptome , Animals , Embryo, Mammalian , Gene Expression Profiling , Mice , Organogenesis/genetics , Single-Cell Analysis
2.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37096588

ABSTRACT

Advances in single-cell transcriptomic technologies have led to increasing use of single-cell RNA sequencing (scRNA-seq) data in large-scale patient cohort studies. The resulting high-dimensional data can be summarized and incorporated into patient outcome prediction models in several ways; however, there is a pressing need to understand the impact of analytical decisions on the quality of such models. In this study, we evaluate the impact of analytical choices, including model choices, ensemble learning strategies and integration approaches, on patient outcome prediction using five scRNA-seq COVID-19 datasets. First, we examine the difference in performance between using single-view feature space versus multi-view feature space. Next, we survey multiple learning platforms from classical machine learning to modern deep learning methods. Lastly, we compare different integration approaches when combining datasets is necessary. Through benchmarking such analytical combinations, our study highlights the power of ensemble learning, consistency among different learning methods and robustness to dataset normalization when using multiple datasets as the model input.


Subject(s)
Benchmarking , COVID-19 , Humans , Gene Expression Profiling , Machine Learning , Sequence Analysis, RNA/methods
3.
Bioinformatics ; 40(6)2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38889275

ABSTRACT

MOTIVATION: Single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has instilled a significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them 'black boxes' as the reasoning behind predictions is often unknown and not transparent to the user. This has stimulated an increasing body of research for addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validations. RESULTS: In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research. Lastly, we highlight the current limitations and discuss potential future directions.


Subject(s)
Deep Learning , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Computational Biology/methods , Genomics/methods
4.
Nature ; 567(7747): 187-193, 2019 03.
Article in English | MEDLINE | ID: mdl-30814737

ABSTRACT

Dysregulation of lipid homeostasis is a precipitating event in the pathogenesis and progression of hepatosteatosis and metabolic syndrome. These conditions are highly prevalent in developed societies and currently have limited options for diagnostic and therapeutic intervention. Here, using a proteomic and lipidomic-wide systems genetic approach, we interrogated lipid regulatory networks in 107 genetically distinct mouse strains to reveal key insights into the control and network structure of mammalian lipid metabolism. These include the identification of plasma lipid signatures that predict pathological lipid abundance in the liver of mice and humans, defining subcellular localization and functionality of lipid-related proteins, and revealing functional protein and genetic variants that are predicted to modulate lipid abundance. Trans-omic analyses using these datasets facilitated the identification and validation of PSMD9 as a previously unknown lipid regulatory protein. Collectively, our study serves as a rich resource for probing mammalian lipid metabolism and provides opportunities for the discovery of therapeutic agents and biomarkers in the setting of hepatic lipotoxicity.


Subject(s)
Lipid Metabolism/genetics , Lipids/analysis , Lipids/genetics , Proteomics , Animals , HEK293 Cells , Humans , Lipid Metabolism/physiology , Lipids/blood , Lipids/classification , Liver/chemistry , Liver/metabolism , Liver/pathology , Male , Mice , Mice, Inbred C57BL , Mice, Inbred DBA , Obesity/genetics , Obesity/metabolism , Proteasome Endopeptidase Complex/chemistry , Proteasome Endopeptidase Complex/genetics , Proteasome Endopeptidase Complex/metabolism
5.
Mol Cell ; 68(1): 104-117.e6, 2017 Oct 05.
Article in English | MEDLINE | ID: mdl-28985501

ABSTRACT

Eukaryotic gene transcription is regulated at many steps, including RNA polymerase II (Pol II) recruitment, transcription initiation, promoter-proximal Pol II pause release, and transcription termination; however, mechanisms regulating transcription during productive elongation remain poorly understood. Enhancers, which activate gene transcription, themselves undergo Pol II-mediated transcription, but our understanding of enhancer transcription and enhancer RNAs (eRNAs) remains incomplete. Here we show that transcription at intragenic enhancers interferes with and attenuates host gene transcription during productive elongation. While the extent of attenuation correlates positively with nascent eRNA expression, the act of intragenic enhancer transcription alone, but not eRNAs, explains the attenuation. Through CRISPR/Cas9-mediated deletions, we demonstrate a physiological role for intragenic enhancer-mediated transcription attenuation in cell fate determination. We propose that intragenic enhancers not only enhance transcription of one or more genes from a distance but also fine-tune transcription of their host gene through transcription interference, facilitating differential utilization of the same regulatory element for disparate functions.


Subject(s)
Enhancer Elements, Genetic , Gene Expression Regulation , Mouse Embryonic Stem Cells/metabolism , RNA Polymerase II/genetics , Transcription Elongation, Genetic , Animals , CRISPR-Cas Systems , Cell Line , Chromatin/chemistry , Chromatin/metabolism , Embryoid Bodies/cytology , Embryoid Bodies/metabolism , Gene Editing , Mice , Mouse Embryonic Stem Cells/cytology , Promoter Regions, Genetic , RNA/genetics , RNA/metabolism , RNA Polymerase II/metabolism
6.
Nucleic Acids Res ; 51(8): e45, 2023 05 08.
Article in English | MEDLINE | ID: mdl-36912104

ABSTRACT

Multimodal single-cell omics technologies enable multiple molecular programs to be simultaneously profiled at a global scale in individual cells, creating opportunities to study biological systems at a resolution that was previously inaccessible. However, the analysis of multimodal single-cell omics data is challenging due to the lack of methods that can integrate across multiple data modalities generated from such technologies. Here, we present Matilda, a multi-task learning method for integrative analysis of multimodal single-cell omics data. By leveraging the interrelationship among tasks, Matilda learns to perform data simulation, dimension reduction, cell type classification, and feature selection in a single unified framework. We compare Matilda with other state-of-the-art methods on datasets generated from some of the most popular multimodal single-cell omics technologies. Our results demonstrate the utility of Matilda for addressing multiple key tasks on integrative multimodal single-cell omics data analysis. Matilda is implemented in PyTorch and is freely available from https://github.com/PYangLab/Matilda.


Subject(s)
Genomics , Single-Cell Analysis , Genomics/methods , Computer Simulation
7.
Genome Res ; 31(12): 2170-2184, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34667120

ABSTRACT

Bivalent chromatin is characterized by the simultaneous presence of H3K4me3 and H3K27me3, histone modifications generally associated with transcriptionally active and repressed chromatin, respectively. Prevalent in embryonic stem cells (ESCs), bivalency is postulated to poise/prime lineage-controlling developmental genes for rapid activation during embryogenesis while maintaining a transcriptionally repressed state in the absence of activation cues; however, this hypothesis remains to be directly tested. Most gene promoters DNA hypermethylated in adult human cancers are bivalently marked in ESCs, and it was speculated that bivalency predisposes them for aberrant de novo DNA methylation and irreversible silencing in cancer, but evidence supporting this model is largely lacking. Here, we show that bivalent chromatin does not poise genes for rapid activation but protects promoters from de novo DNA methylation. Genome-wide studies in differentiating ESCs reveal that activation of bivalent genes is no more rapid than that of other transcriptionally silent genes, challenging the premise that H3K4me3 is instructive for transcription. H3K4me3 at bivalent promoters, a product of the underlying DNA sequence, persists in nearly all cell types irrespective of gene expression and confers protection from de novo DNA methylation. Bivalent genes in ESCs that are frequent targets of aberrant hypermethylation in cancer are particularly strongly associated with loss of H3K4me3/bivalency in cancer. Altogether, our findings suggest that bivalency protects reversibly repressed genes from irreversible silencing and that loss of H3K4me3 may make them more susceptible to aberrant DNA methylation in diseases such as cancer. Bivalency may thus represent a distinct regulatory mechanism for maintaining epigenetic plasticity.

8.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37314966

ABSTRACT

MOTIVATION: Recent advances in multimodal single-cell omics technologies enable multiple modalities of molecular attributes, such as gene expression, chromatin accessibility, and protein abundance, to be profiled simultaneously at a global level in individual cells. While the increasing availability of multiple data modalities is expected to provide a more accurate clustering and characterization of cells, the development of computational methods that are capable of extracting information embedded across data modalities is still in its infancy. RESULTS: We propose SnapCCESS for clustering cells by integrating data modalities in multimodal single-cell omics data using an unsupervised ensemble deep learning framework. By creating snapshots of embeddings of multimodality using variational autoencoders, SnapCCESS can be coupled with various clustering algorithms for generating consensus clustering of cells. We applied SnapCCESS with several clustering algorithms to various datasets generated from popular multimodal single-cell omics technologies. Our results demonstrate that SnapCCESS is effective and more efficient than conventional ensemble deep learning-based clustering methods and outperforms other state-of-the-art multimodal embedding generation methods in integrating data modalities for clustering cells. The improved clustering of cells from SnapCCESS will pave the way for more accurate characterization of cell identity and types, an essential step for various downstream analyses of multimodal single-cell omics data. AVAILABILITY AND IMPLEMENTATION: SnapCCESS is implemented as a Python package and is freely available from https://github.com/PYangLab/SnapCCESS under the open-source license of GPL-3. The data used in this study are publicly available (see section 'Data availability').


Subject(s)
Deep Learning , Algorithms , Cluster Analysis , Chromatin , Single-Cell Analysis
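Note on record 8: the abstract mentions generating a consensus clustering of cells from multiple clustering runs. The sketch below illustrates one generic way to build a consensus, a co-association matrix followed by thresholding; it is not the SnapCCESS implementation. The function names, the 0.5 threshold, and the toy label vectors are all assumptions made for illustration.

```python
def co_association(labelings):
    """Fraction of clustering runs in which each pair of cells shares a cluster."""
    n_cells = len(labelings[0])
    counts = [[0] * n_cells for _ in range(n_cells)]
    for labels in labelings:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_runs = len(labelings)
    return [[c / n_runs for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Group cells linked by co-association above `threshold` (connected components).

    The threshold is a hypothetical choice, not a value taken from the paper.
    """
    matrix = co_association(labelings)
    n_cells = len(matrix)
    assigned = [-1] * n_cells
    next_label = 0
    for start in range(n_cells):
        if assigned[start] != -1:
            continue
        assigned[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n_cells):
                if assigned[j] == -1 and matrix[i][j] > threshold:
                    assigned[j] = next_label
                    stack.append(j)
        next_label += 1
    return assigned


# Three toy clustering runs over five cells; label values are arbitrary per run.
runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(runs))
```

Because the co-association matrix only records whether two cells share a label within a run, the result does not depend on how each run happens to number its clusters.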
9.
Proteomics ; 23(3-4): e2200068, 2023 02.
Article in English | MEDLINE | ID: mdl-35580145

ABSTRACT

Protein phosphorylation plays an essential role in modulating cell signalling and its downstream transcriptional and translational regulations. Until recently, protein phosphorylation has been studied mostly using low-throughput biochemical assays. The advancement of mass spectrometry (MS)-based phosphoproteomics transformed the field by enabling measurement of proteome-wide phosphorylation events, where tens of thousands of phosphosites are routinely identified and quantified in an experiment. This has brought a significant challenge in analysing large-scale phosphoproteomic data, making computational methods and systems approaches integral parts of phosphoproteomics. Previous works have primarily focused on reviewing the experimental techniques in MS-based phosphoproteomics, yet a systematic survey of the computational landscape in this field is still missing. Here, we review computational methods and tools, and systems approaches that have been developed for phosphoproteomics data analysis. We categorise them into four aspects including data processing, functional analysis, phosphoproteome annotation and their integration with other omics, and in each aspect, we discuss the key methods and example studies. Lastly, we highlight some of the potential research directions on which future work would make a significant contribution to this fast-growing field. We hope this review provides a useful snapshot of the field of computational systems phosphoproteomics and stimulates new research that drives future development.


Subject(s)
Phosphoproteins , Protein Processing, Post-Translational , Phosphoproteins/metabolism , Phosphorylation , Proteome/metabolism , Systems Analysis
10.
Bioinformatics ; 38(7): 1956-1963, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35015814

ABSTRACT

MOTIVATION: The advance of mass spectrometry-based technologies has enabled the profiling of the phosphoproteomes of a multitude of cell and tissue types. However, current research has primarily focused on investigating the phosphorylation dynamics in specific cell types and experimental conditions, whereas the phosphorylation events that are common across cell/tissue types and stable regardless of experimental conditions have, so far, been mostly ignored. RESULTS: Here, we developed a statistical framework to identify the stable phosphoproteome across 53 human phosphoproteomics datasets, covering 40 cell/tissue types and 194 conditions/treatments. We demonstrate that the stably phosphorylated sites (SPSs) identified from our statistical framework are evolutionarily conserved, functionally important and enriched in a range of core signaling and gene pathways. Particularly, we show that SPSs are highly enriched in the RNA splicing pathway, an essential cellular process in mammalian cells, and frequently disrupted by cancer mutations, suggesting a link between the dysregulation of RNA splicing and cancer development through mutations on SPSs. AVAILABILITY AND IMPLEMENTATION: The source code for data analysis in this study is available from the GitHub repository https://github.com/PYangLab/SPSs under the open-source license of GPL-3. The data used in this study are publicly available (see Section 2.8). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neoplasms , Proteome , Animals , Humans , Software , Phosphorylation , Mass Spectrometry , Neoplasms/genetics , Mammals
11.
Bioinformatics ; 38(20): 4745-4753, 2022 10 14.
Article in English | MEDLINE | ID: mdl-36040148

ABSTRACT

MOTIVATION: With the recent surge of large-cohort scale single-cell research, it is of critical importance that analytical methods can fully utilize the comprehensive characterization of cellular systems that single-cell technologies produce to provide insights into samples from individuals. Currently, there is little consensus on the best ways to compress information from the complex data structures of these technologies to summary statistics that represent each sample (e.g. individuals). RESULTS: Here, we present scFeatures, an approach that creates interpretable cellular and molecular representations of single-cell and spatial data at the sample level. We demonstrate that summarizing a broad collection of features at the sample level is important both for understanding underlying disease mechanisms in different experimental studies and for accurately classifying disease status of individuals. AVAILABILITY AND IMPLEMENTATION: scFeatures is publicly available as an R package at https://github.com/SydneyBioX/scFeatures. All data used in this study are publicly available with accession IDs reported in Section 2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Humans
12.
Cytometry A ; 103(7): 593-599, 2023 07.
Article in English | MEDLINE | ID: mdl-36879360

ABSTRACT

Highly multiplexed in situ imaging cytometry assays have made it possible to study the spatial organization of numerous cell types simultaneously. We have addressed the challenge of quantifying complex multi-cellular relationships by proposing a statistical method which clusters local indicators of spatial association. Our approach successfully identifies distinct tissue architectures in datasets generated from three state-of-the-art high-parameter assays demonstrating its value in summarizing the information-rich data generated from these technologies.


Subject(s)
Image Cytometry , Spatial Analysis
13.
Mol Cell Proteomics ; 20: 100030, 2021.
Article in English | MEDLINE | ID: mdl-33583770

ABSTRACT

Many cell surface and secreted proteins are modified by the covalent addition of glycans that play an important role in the development of multicellular organisms. These glycan modifications enable communication between cells and the extracellular matrix via interactions with specific glycan-binding lectins and the regulation of receptor-mediated signaling. Aberrant protein glycosylation has been associated with the development of several muscular diseases, suggesting essential glycan- and lectin-mediated functions in myogenesis and muscle development, but our molecular understanding of the precise glycans, catalytic enzymes, and lectins involved remains incomplete. Here, we quantified dynamic remodeling of the membrane-associated proteome during a time-course of myogenesis in cell culture. We observed widespread changes in the abundance of several important lectins and enzymes facilitating glycan biosynthesis. Glycomics-based quantification of released N-linked glycans confirmed remodeling of the glycome consistent with the regulation of glycosyltransferases and glycosidases responsible for their formation including a previously unknown digalactose-to-sialic acid switch supporting a functional role of these glycoepitopes in myogenesis. Furthermore, dynamic quantitative glycoproteomic analysis with multiplexed stable isotope labeling and analysis of enriched glycopeptides with multiple fragmentation approaches identified glycoproteins modified by these regulated glycans including several integrins and growth factor receptors. Myogenesis was also associated with the regulation of several lectins, most notably the upregulation of galectin-1 (LGALS1). CRISPR/Cas9-mediated deletion of Lgals1 inhibited differentiation and myotube formation, suggesting an early functional role of galectin-1 in the myogenic program.
Importantly, similar changes in N-glycosylation and the upregulation of galectin-1 during postnatal skeletal muscle development were observed in mice. Treatment of new-born mice with recombinant adeno-associated viruses to overexpress galectin-1 in the musculature resulted in enhanced muscle mass. Our data form a valuable resource to further understand the glycobiology of myogenesis and will aid the development of intervention strategies to promote healthy muscle development or regeneration.


Subject(s)
Galectin 1/metabolism , Glycopeptides/metabolism , Muscle Development , Animals , Cell Line , Galectin 1/genetics , Glycomics , Glycosylation , Male , Mice, Inbred C57BL , Muscle, Skeletal/metabolism , Protein Processing, Post-Translational , Proteomics , Rats
14.
Biochem J ; 479(11): 1237-1256, 2022 06 17.
Article in English | MEDLINE | ID: mdl-35594055

ABSTRACT

Trafficking regulator of GLUT4-1, TRARG1, positively regulates insulin-stimulated GLUT4 trafficking and insulin sensitivity. However, the mechanism(s) by which this occurs remain(s) unclear. Using biochemical and mass spectrometry analyses we found that TRARG1 is dephosphorylated in response to insulin in a PI3K/Akt-dependent manner and is a novel substrate for GSK3. Priming phosphorylation of murine TRARG1 at serine 84 allows for GSK3-directed phosphorylation at serines 72, 76 and 80. A similar pattern of phosphorylation was observed in human TRARG1, suggesting that our findings are translatable to human TRARG1. Pharmacological inhibition of GSK3 increased cell surface GLUT4 in cells stimulated with a submaximal insulin dose, and this was impaired following Trarg1 knockdown, suggesting that TRARG1 acts as a GSK3-mediated regulator in GLUT4 trafficking. These data place TRARG1 within the insulin signaling network and provide insights into how GSK3 regulates GLUT4 trafficking in adipocytes.


Subject(s)
Glycogen Synthase Kinase 3 , Phosphatidylinositol 3-Kinases , Adipocytes/metabolism , Animals , Cell Membrane/metabolism , Glucose/metabolism , Glucose Transporter Type 4/genetics , Glucose Transporter Type 4/metabolism , Glycogen Synthase Kinase 3/genetics , Glycogen Synthase Kinase 3/metabolism , Humans , Insulin/metabolism , Mice , Phosphatidylinositol 3-Kinases/metabolism , Phosphorylation , Proto-Oncogene Proteins c-akt/genetics , Proto-Oncogene Proteins c-akt/metabolism , Serine/metabolism
15.
Mol Cell ; 55(5): 708-22, 2014 Sep 04.
Article in English | MEDLINE | ID: mdl-25132174

ABSTRACT

Cell type-specific master transcription factors (TFs) play vital roles in defining cell identity and function. However, the roles ubiquitous factors play in the specification of cell identity remain underappreciated. Here we show that the ubiquitous CCAAT-binding NF-Y complex is required for the maintenance of embryonic stem cell (ESC) identity and is an essential component of the core pluripotency network. Genome-wide studies in ESCs and neurons reveal that NF-Y regulates not only genes with housekeeping functions through cell type-invariant promoter-proximal binding, but also genes required for cell identity by binding to cell type-specific enhancers with master TFs. Mechanistically, NF-Y's distinct DNA-binding mode promotes master/pioneer TF binding at enhancers by facilitating a permissive chromatin conformation. Our studies unearth a conceptually unique function for histone-fold domain (HFD) protein NF-Y in promoting chromatin accessibility and suggest that other HFD proteins with analogous structural and DNA-binding properties may function in similar ways.


Subject(s)
CCAAT-Binding Factor/physiology , Chromatin/metabolism , Histones/metabolism , Animals , Binding Sites , CCAAT-Binding Factor/metabolism , Cells, Cultured , Embryonic Stem Cells/metabolism , Embryonic Stem Cells/ultrastructure , Mice , Models, Genetic , Nucleosomes/chemistry , Nucleosomes/metabolism , Pluripotent Stem Cells , Transcription Factors/chemistry , Transcription Factors/metabolism , Transcription Factors/physiology
16.
Nucleic Acids Res ; 48(4): 1828-1842, 2020 02 28.
Article in English | MEDLINE | ID: mdl-31853542

ABSTRACT

The developmental potential of cells, termed pluripotency, is highly dynamic and progresses through a continuum of naive, formative and primed states. Pluripotency progression of mouse embryonic stem cells (ESCs) from naive to formative and primed state is governed by transcription factors (TFs) and their target genes. Genomic techniques have uncovered a multitude of TF binding sites in ESCs, yet a major challenge lies in identifying target genes from functional binding sites and reconstructing dynamic transcriptional networks underlying pluripotency progression. Here, we integrated time-resolved 'trans-omic' datasets together with TF binding profiles and chromatin conformation data to identify target genes of a panel of TFs. Our analyses revealed that naive TF target genes are more likely to be TFs themselves than those of formative TFs, suggesting denser hierarchies among naive TFs. We also discovered that formative TF target genes are marked by permissive epigenomic signatures in the naive state, indicating that they are poised for expression prior to the initiation of pluripotency transition to the formative state. Finally, our reconstructed transcriptional networks pinpointed the precise timing from naive to formative pluripotency progression and enabled the spatiotemporal mapping of differentiating ESCs to their in vivo counterparts in developing embryos.


Subject(s)
Embryonic Development/genetics , Mouse Embryonic Stem Cells/metabolism , Pluripotent Stem Cells/metabolism , Transcription Factors/genetics , Animals , Binding Sites/genetics , Cell Differentiation/genetics , Chromatin/genetics , Gene Expression Regulation, Developmental/genetics , Gene Regulatory Networks/genetics , Genome/genetics , Mice
17.
Proc Natl Acad Sci U S A ; 116(20): 9775-9784, 2019 05 14.
Article in English | MEDLINE | ID: mdl-31028141

ABSTRACT

Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.


Subject(s)
Meta-Analysis as Topic , Sequence Analysis, RNA , Single-Cell Analysis , Software , Algorithms , Animals , Embryonic Development , Factor Analysis, Statistical , Gene Expression , Humans , Mice
18.
Brief Bioinform ; 20(6): 2316-2326, 2019 11 27.
Article in English | MEDLINE | ID: mdl-30137247

ABSTRACT

Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson's correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson's correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson's correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. 
Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.


Subject(s)
Sequence Analysis, RNA , Algorithms , Cluster Analysis , Humans
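Note on record 18: the abstract contrasts correlation-based metrics (e.g. Pearson's correlation) with distance-based metrics (e.g. Euclidean distance) for clustering scRNA-seq data. The sketch below shows how the two can rank the same cells differently. It is only an illustration under stated assumptions: the three toy expression vectors are invented, and "1 minus Pearson's r" is one common way to turn a correlation into a dissimilarity, not necessarily the form used in the benchmark.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Dissimilarity defined as 1 - Pearson's r (0 means perfectly correlated)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for four genes in three cells.
cell_a = [1.0, 2.0, 0.0, 4.0]
cell_b = [3.0, 6.0, 0.0, 12.0]  # same relative profile as cell_a, three times the total
cell_c = [4.0, 0.0, 2.0, 1.0]   # different profile, same total as cell_a

for name, other in (("b", cell_b), ("c", cell_c)):
    print(
        f"a vs {name}: euclidean={euclidean_distance(cell_a, other):.3f}, "
        f"pearson={pearson_distance(cell_a, other):.3f}"
    )
```

In this toy case Euclidean distance places cell_a closer to cell_c, whereas the correlation-based dissimilarity places it closer to cell_b, whose profile differs only in overall scale. Sensitivity to scale of this kind is one plausible reason the choice of metric matters, although the abstract itself does not state a mechanism.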
19.
Bioinformatics ; 36(14): 4137-4143, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32353146

ABSTRACT

MOTIVATION: Multi-modal profiling of single cells represents one of the latest technological advancements in molecular biology. Among various single-cell multi-modal strategies, cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) allows simultaneous quantification of two distinct species: RNA and cell-surface proteins. Here, we introduce CiteFuse, a streamlined package consisting of a suite of tools for doublet detection, modality integration, clustering, differential RNA and protein expression analysis, antibody-derived tag evaluation, ligand-receptor interaction analysis and interactive web-based visualization of CITE-seq data. RESULTS: We demonstrate the capacity of CiteFuse to integrate the two data modalities and its relative advantage against data generated from single-modality profiling using both simulations and real-world CITE-seq data. Furthermore, we illustrate a novel doublet detection method based on a combined index of cell hashing and transcriptome data. Finally, we demonstrate CiteFuse for predicting ligand-receptor interactions by using multi-modal CITE-seq data. Collectively, we demonstrate the utility and effectiveness of CiteFuse for the integrative analysis of transcriptome and epitope profiles from CITE-seq data. AVAILABILITY AND IMPLEMENTATION: CiteFuse is freely available at http://shiny.maths.usyd.edu.au/CiteFuse/ as an online web service and at https://github.com/SydneyBioX/CiteFuse/ as an R package. CONTACT: pengyi.yang@sydney.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Transcriptome , Epitopes , Gene Expression Profiling , RNA , Sequence Analysis, RNA , Single-Cell Analysis
20.
Mol Syst Biol ; 16(6): e9389, 2020 06.
Article in English | MEDLINE | ID: mdl-32567229

ABSTRACT

Automated cell type identification is a key computational challenge in single-cell RNA-sequencing (scRNA-seq) data. To capitalise on the large collection of well-annotated scRNA-seq datasets, we developed scClassify, a multiscale classification framework based on ensemble learning and cell type hierarchies constructed from single or multiple annotated datasets as references. scClassify enables the estimation of sample size required for accurate classification of cell types in a cell type hierarchy and allows joint classification of cells when multiple references are available. We show that scClassify consistently performs better than other supervised cell type classification methods across 114 pairs of reference and testing data, representing a diverse combination of sizes, technologies and levels of complexity, and further demonstrate the unique components of scClassify through simulations and compendia of experimental datasets. Finally, we demonstrate the scalability of scClassify on large single-cell atlases and highlight a novel application of identifying subpopulations of cells from the Tabula Muris data that were unidentified in the original publication. Together, scClassify represents state-of-the-art methodology in automated cell type identification from scRNA-seq data.


Subject(s)
Cells/metabolism , Animals , Cluster Analysis , Databases as Topic , Humans , Leukocytes, Mononuclear/metabolism , Machine Learning , Mice , Pancreas/metabolism , Sample Size , Software