Search | VHL Regional Portal

1.

DNA binding analysis of rare variants in homeodomains reveals homeodomain specificity-determining residues.

Kock, Kian Hong; Kimes, Patrick K; Gisselbrecht, Stephen S; Inukai, Sachi; Phanor, Sabrina K; Anderson, James T; Ramakrishnan, Gayatri; Lipper, Colin H; Song, Dongyuan; Kurland, Jesse V; Rogers, Julia M; Jeong, Raehoon; Blacklow, Stephen C; Irizarry, Rafael A; Bulyk, Martha L.

Nat Commun ; 15(1): 3110, 2024 Apr 10.

Article in English | MEDLINE | ID: mdl-38600112

ABSTRACT

Homeodomains (HDs) are the second largest class of DNA binding domains (DBDs) among eukaryotic sequence-specific transcription factors (TFs) and are the TF structural class with the largest number of disease-associated mutations in the Human Gene Mutation Database (HGMD). Despite numerous structural studies and large-scale analyses of HD DNA binding specificity, HD-DNA recognition is still not fully understood. Here, we analyze 92 human HD mutants, including disease-associated variants and variants of uncertain significance (VUS), for their effects on DNA binding activity. Many of the variants alter DNA binding affinity and/or specificity. Detailed biochemical analysis and structural modeling identifies 14 previously unknown specificity-determining positions, 5 of which do not contact DNA. The same missense substitution at analogous positions within different HDs often exhibits different effects on DNA binding activity. Variant effect prediction tools perform moderately well in distinguishing variants with altered DNA binding affinity, but poorly in identifying those with altered binding specificity. Our results highlight the need for biochemical assays of TF coding variants and prioritize dozens of variants for further investigations into their pathogenicity and the development of clinical diagnostics and precision therapies.

Subject(s)

Homeodomain Proteins , Transcription Factors , Humans , Homeodomain Proteins/metabolism , Transcription Factors/metabolism , DNA/metabolism , Mutation , Models, Molecular

2.

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics.

Song, Dongyuan; Wang, Qingyang; Yan, Guanao; Liu, Tianyang; Sun, Tianyi; Li, Jingyi Jessica.

Nat Biotechnol ; 42(2): 247-252, 2024 Feb.

Article in English | MEDLINE | ID: mdl-37169966

ABSTRACT

We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

Subject(s)

Benchmarking , Models, Statistical , Research Design

3.

Benchmarking computational methods to identify spatially variable genes and peaks.

Li, Zhijian; Patel, Zain M; Song, Dongyuan; Yan, Guanao; Li, Jingyi Jessica; Pinello, Luca.

bioRxiv ; 2023 Dec 03.

Article in English | MEDLINE | ID: mdl-38076922

ABSTRACT

Spatially resolved transcriptomics offers unprecedented insight by enabling the profiling of gene expression within the intact spatial context of cells, effectively adding a new and essential dimension to data interpretation. To efficiently detect spatial structure of interest, an essential step in analyzing such data involves identifying spatially variable genes. Despite researchers having developed several computational methods to accomplish this task, the lack of a comprehensive benchmark evaluating their performance remains a considerable gap in the field. Here, we present a systematic evaluation of 14 methods using 60 simulated datasets generated by four different simulation strategies, 12 real-world transcriptomics, and three spatial ATAC-seq datasets. We find that spatialDE2 consistently outperforms the other benchmarked methods, and Moran's I achieves competitive performance in different experimental settings. Moreover, our results reveal that more specialized algorithms are needed to identify spatially variable peaks.
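
The abstract above names Moran's I as a competitive baseline. As a rough illustration of what that statistic measures, the following sketch computes Moran's I for a single gene on a toy grid. The binary neighbor weights, the `radius` parameter, the function name `morans_i`, and the 4x4 grid are all assumptions chosen for illustration; they are not the benchmark's configuration.

```python
from math import dist


def morans_i(values, coords, radius=1.0):
    """Moran's I with binary weights: w_ij = 1 if spots i != j lie within `radius`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum over neighbor pairs of dev_i * dev_j
    w_sum = 0.0   # total weight (number of directed neighbor pairs)
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= radius:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0.0 or denom == 0.0:
        raise ValueError("need at least one neighbor pair and non-constant values")
    return (n / w_sum) * (cross / denom)


# Toy 4x4 grid of spots (hypothetical data).
coords = [(x, y) for x in range(4) for y in range(4)]
gradient = [float(x) for x, y in coords]            # smooth left-to-right trend
checker = [float((x + y) % 2) for x, y in coords]   # alternating pattern

i_gradient = morans_i(gradient, coords)  # positive: neighbors are similar
i_checker = morans_i(checker, coords)    # close to -1: neighbors always differ
```

Values near +1 indicate that neighboring spots carry similar expression, values near -1 indicate alternation, and values near 0 indicate no spatial structure under this weighting.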

4.

Decoding Heterogenous Single-cell Perturbation Responses.

Song, Bicna; Liu, Dingyu; Dai, Weiwei; McMyn, Natalie; Wang, Qingyang; Yang, Dapeng; Krejci, Adam; Vasilyev, Anatoly; Untermoser, Nicole; Loregger, Anke; Song, Dongyuan; Williams, Breanna; Rosen, Bess; Cheng, Xiaolong; Chao, Lumen; Kale, Hanuman T; Zhang, Hao; Diao, Yarui; Bürckstümmer, Tilmann; Siliciano, Jenet M; Li, Jingyi Jessica; Siliciano, Robert; Huangfu, Danwei; Li, Wei.

bioRxiv ; 2023 Nov 29.

Article in English | MEDLINE | ID: mdl-37961332

ABSTRACT

Understanding diverse responses of individual cells to the same perturbation is central to many biological and biomedical problems. Current methods, however, neither precisely quantify the strength of perturbation responses nor, more importantly, reveal new biological insights from heterogeneity in responses. Here we introduce the perturbation-response score (PS), based on constrained quadratic optimization, to quantify diverse perturbation responses at a single-cell level. Applied to single-cell transcriptomes of large-scale genetic perturbation datasets (e.g., Perturb-seq), PS outperforms existing methods for quantifying partial gene perturbation responses. In addition, PS presents two major advances. First, PS enables large-scale, single-cell-resolution dosage analysis of perturbation, without the need to titrate perturbation strength. By analyzing the dose-response patterns of over 2,000 essential genes in Perturb-seq, we identify two distinct patterns, depending on whether a moderate reduction in their expression induces strong downstream expression alterations. Second, PS identifies intrinsic and extrinsic biological determinants of perturbation responses. We demonstrate the application of PS in contexts such as T cell stimulation, latent HIV-1 expression, and pancreatic cell differentiation. Notably, PS unveiled a previously unrecognized, cell-type-specific role of coiled-coil domain containing 6 (CCDC6) in guiding liver and pancreatic lineage decisions, where CCDC6 knockouts drive the endoderm cell differentiation towards liver lineage, rather than pancreatic lineage. The PS approach provides an innovative method for dose-to-function analysis and will enable new biological discoveries from single-cell perturbation datasets.

5.

scReadSim: a single-cell RNA-seq and ATAC-seq read simulator.

Yan, Guanao; Song, Dongyuan; Li, Jingyi Jessica.

Nat Commun ; 14(1): 7482, 2023 Nov 18.

Article in English | MEDLINE | ID: mdl-37980428

ABSTRACT

Benchmarking single-cell RNA-seq (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) computational tools demands simulators to generate realistic sequencing reads. However, none of the few read simulators aim to mimic real data. To fill this gap, we introduce scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads (in a FASTQ or BAM file) by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data. Moreover, scReadSim provides ground truths, including unique molecular identifier (UMI) counts for scRNA-seq and open chromatin regions for scATAC-seq. In particular, scReadSim allows users to design cell-type-specific ground-truth open chromatin regions for scATAC-seq data generation. In benchmark applications of scReadSim, we show that UMI-tools achieves the top accuracy in scRNA-seq UMI deduplication, and HMMRATAC and MACS3 achieve the top performance in scATAC-seq peak calling.

Subject(s)

Chromatin Immunoprecipitation Sequencing , Single-Cell Gene Expression Analysis , Single-Cell Analysis , Chromatin/genetics

6.

ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping.

Song, Dongyuan; Li, Kexin; Ge, Xinzhou; Li, Jingyi Jessica.

Res Sq ; 2023 Aug 02.

Article in English | MEDLINE | ID: mdl-37577698

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used twice to define cell clusters as potential cell types and DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
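
The double-dipping problem described in this abstract can be reproduced on toy data. The sketch below is not ClusterDE; it only illustrates the failure mode, under the assumption of a single Gaussian "gene" measured in one homogeneous population. The median split standing in for a clustering algorithm, the helper `welch_t`, and all variable names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, median, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


rng = random.Random(0)

# One homogeneous population: by construction there are no real clusters.
expr = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Double dipping: "cluster" the cells using the gene itself (a median split
# stands in for a clustering algorithm), then test the same gene between clusters.
cut = median(expr)
cluster_a = [v for v in expr if v > cut]
cluster_b = [v for v in expr if v <= cut]
t_double_dip = welch_t(cluster_a, cluster_b)

# Contrast: group labels assigned independently of the expression values.
labels = [rng.random() < 0.5 for _ in expr]
group_a = [v for v, flag in zip(expr, labels) if flag]
group_b = [v for v, flag in zip(expr, labels) if not flag]
t_independent = welch_t(group_a, group_b)
```

The first statistic is very large even though the data contain one population, because the grouping was derived from the values being tested; the second behaves like an ordinary null statistic.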

7.

ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping.

Song, Dongyuan; Li, Kexin; Ge, Xinzhou; Li, Jingyi Jessica.

bioRxiv ; 2023 Jul 25.

Article in English | MEDLINE | ID: mdl-37546812

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.

8.

Author Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 24(1): 32, 2023 Feb 22.

Article in English | MEDLINE | ID: mdl-36814256

9.

Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime.

Cui, Elvis Han; Song, Dongyuan; Wong, Weng Kee; Li, Jingyi Jessica.

Bioinformatics ; 38(16): 3927-3934, 2022 Aug 10.

Article in English | MEDLINE | ID: mdl-35758616

ABSTRACT

MOTIVATION: Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models. RESULTS: Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene's expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes. AVAILABILITY AND IMPLEMENTATION: The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Single-Cell Analysis , Software , Single-Cell Analysis/methods , Algorithms , Likelihood Functions , Gene Expression

10.

scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data.

Song, Dongyuan; Xi, Nan Miles; Li, Jingyi Jessica; Wang, Lin.

Bioinformatics ; 38(11): 3126-3127, 2022 May 26.

Article in English | MEDLINE | ID: mdl-35426898

ABSTRACT

SUMMARY: The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. AVAILABILITY AND IMPLEMENTATION: scSampler is implemented in Python and is published under the MIT source license. It can be installed by "pip install scsampler" and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Transcriptome , Algorithms , Data Analysis

11.

Statistics or biology: the zero-inflation controversy about scRNA-seq data.

Jiang, Ruochen; Sun, Tianyi; Song, Dongyuan; Li, Jingyi Jessica.

Genome Biol ; 23(1): 31, 2022 Jan 21.

Article in English | MEDLINE | ID: mdl-35063006

ABSTRACT

Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.

Subject(s)

Benchmarking , Single-Cell Analysis , Biology , Sequence Analysis, RNA , Exome Sequencing

12.

Simulating Single-Cell Gene Expression Count Data with Preserved Gene Correlations by scDesign2.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

J Comput Biol ; 29(1): 23-26, 2022 Jan.

Article in English | MEDLINE | ID: mdl-35020490

ABSTRACT

scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.

Subject(s)

Gene Expression Profiling/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Software , Algorithms , Animals , Cluster Analysis , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Gene Expression , Mice , Models, Statistical , RNA-Seq/statistics & numerical data

13.

Clipper: p-value-free FDR control on high-throughput data from two conditions.

Ge, Xinzhou; Chen, Yiling Elaine; Song, Dongyuan; McDermott, MeiLu; Woyshner, Kyla; Manousopoulou, Antigoni; Wang, Ning; Li, Wei; Wang, Leo D; Li, Jingyi Jessica.

Genome Biol ; 22(1): 288, 2021 Oct 11.

Article in English | MEDLINE | ID: mdl-34635147

ABSTRACT

High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
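
For context on the p-value-based FDR control that this abstract contrasts Clipper with, the sketch below implements the standard Benjamini-Hochberg step-up procedure. It is not Clipper, and the p-values are made-up numbers for illustration; the function name `benjamini_hochberg` is likewise an assumption.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule at level alpha."""
    m = len(pvalues)
    # Hypothesis indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank
    # Reject every hypothesis ranked at or below k.
    return sorted(order[:k])


# Hypothetical p-values for ten features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)  # [0, 1]
```

The guarantee of this procedure depends on the p-values being valid, which is the requirement the abstract identifies as hard to meet with few replicates or uncertain distributional assumptions.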

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Software , Chromatin Immunoprecipitation Sequencing/methods , Chromosomes , Computer Simulation , Data Interpretation, Statistical , Humans , Mass Spectrometry , Peptides/chemistry , Proteomics/methods , RNA-Seq/methods , Single-Cell Analysis

14.

scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling.

Song, Dongyuan; Li, Kexin; Hemminger, Zachary; Wollman, Roy; Li, Jingyi Jessica.

Bioinformatics ; 37(Suppl_1): i358-i366, 2021 Jul 12.

Article in English | MEDLINE | ID: mdl-34252925

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. RESULTS: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. AVAILABILITY AND IMPLEMENTATION: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Gene Expression Profiling , Single-Cell Analysis , Algorithms , Sequence Analysis, RNA , Software

15.

Publisher Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 22(1): 177, 2021 Jun 09.

Article in English | MEDLINE | ID: mdl-34108038

16.

scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 22(1): 163, 2021 May 25.

Article in English | MEDLINE | ID: mdl-34034771

ABSTRACT

A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.
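
To make the idea of capturing gene correlations via copulas more concrete, here is a minimal sketch that draws correlated counts for two genes. It does not use the scDesign2 package and should not be read as its implementation: the Gaussian copula, the Poisson marginals, the parameter values, and the helper names (`poisson_quantile`, `simulate_pair`, `pearson`) are assumptions made for illustration.

```python
import random
from math import exp, sqrt
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    pmf = exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_pair(rho, lam1, lam2, n, seed=0):
    """Draw n count pairs whose dependence comes from a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Step 1: correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: map to uniforms; the clamp keeps the quantile search finite.
        u1 = min(std_normal.cdf(z1), 1.0 - 1e-12)
        u2 = min(std_normal.cdf(z2), 1.0 - 1e-12)
        # Step 3: map uniforms to counts through each gene's marginal.
        pairs.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return pairs


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


correlated = simulate_pair(rho=0.8, lam1=5.0, lam2=20.0, n=2000)
independent = simulate_pair(rho=0.0, lam1=5.0, lam2=20.0, n=2000, seed=1)
r_correlated = pearson(correlated)
r_independent = pearson(independent)
```

The design point is the separation of concerns: each gene keeps its own marginal count distribution, while the copula alone carries the dependence between genes.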

Subject(s)

Computer Simulation , Gene Expression Regulation , Single-Cell Analysis , Software , Animals , Calibration , Cell Count , Cluster Analysis , Genomics , Goblet Cells/metabolism , Humans , Mice , RNA-Seq

17.

PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data.

Song, Dongyuan; Li, Jingyi Jessica.

Genome Biol ; 22(1): 124, 2021 Apr 29.

Article in English | MEDLINE | ID: mdl-33926517

ABSTRACT

To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.

Subject(s)

Gene Expression Profiling , Gene Expression Regulation, Developmental , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Transcriptome , Algorithms , Cell Lineage/genetics , Computational Biology/methods , Gene Expression Profiling/methods , Gene Ontology , High-Throughput Nucleotide Sequencing , Organ Specificity/genetics

18.

Explaining the ocean's richest biodiversity hotspot and global patterns of fish diversity.

Miller, Elizabeth Christina; Hayashi, Kenji T; Song, Dongyuan; Wiens, John J.

Proc Biol Sci ; 285(1888), 2018 Oct 10.

Article in English | MEDLINE | ID: mdl-30305433

ABSTRACT

For most marine organisms, species richness peaks in the Central Indo-Pacific region and declines longitudinally, a striking pattern that remains poorly understood. Here, we used phylogenetic approaches to address the causes of richness patterns among global marine regions, comparing the relative importance of colonization time, number of colonization events, and diversification rates (speciation minus extinction). We estimated regional richness using distributional data for almost all percomorph fishes (17 435 species total, including approximately 72% of all marine fishes and approximately 33% of all freshwater fishes). The high diversity of the Central Indo-Pacific was explained by its colonization by many lineages 5.3-34 million years ago. These relatively old colonizations allowed more time for richness to build up through in situ diversification compared to other warm-marine regions. Surprisingly, diversification rates were decoupled from marine richness patterns, with clades in low-richness cold-marine habitats having the highest rates. Unlike marine richness, freshwater diversity was largely derived from a few ancient colonizations, coupled with high diversification rates. Our results are congruent with the geological history of the marine tropics, and thus may apply to many other organisms. Beyond marine biogeography, we add to the growing number of cases where colonization and time-for-speciation explain large-scale richness patterns instead of diversification rates.

Subject(s)

Biodiversity , Fishes , Genetic Speciation , Animals , Ecosystem , Indian Ocean , Pacific Ocean
