Search | BVS IEC

1.

Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime.

Cui, Elvis Han; Song, Dongyuan; Wong, Weng Kee; Li, Jingyi Jessica.

Bioinformatics ; 38(16): 3927-3934, 2022 Aug 10.

Article in English | MEDLINE | ID: mdl-35758616

ABSTRACT

MOTIVATION: Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models. RESULTS: Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene's expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes. AVAILABILITY AND IMPLEMENTATION: The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Single-Cell Analysis , Software , Single-Cell Analysis/methods , Algorithms , Likelihood Functions , Gene Expression

2.

scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data.

Song, Dongyuan; Xi, Nan Miles; Li, Jingyi Jessica; Wang, Lin.

Bioinformatics ; 38(11): 3126-3127, 2022 May 26.

Article in English | MEDLINE | ID: mdl-35426898

ABSTRACT

SUMMARY: The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. AVAILABILITY AND IMPLEMENTATION: scSampler is implemented in Python and is published under the MIT source license. It can be installed by "pip install scsampler" and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Software , Transcriptome , Algorithms , Data Analysis

3.

scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling.

Song, Dongyuan; Li, Kexin; Hemminger, Zachary; Wollman, Roy; Li, Jingyi Jessica.

Bioinformatics ; 37(Suppl_1): i358-i366, 2021 Jul 12.

Article in English | MEDLINE | ID: mdl-34252925

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. RESULTS: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. AVAILABILITY AND IMPLEMENTATION: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Gene Expression Profiling , Single-Cell Analysis , Algorithms , Sequence Analysis, RNA , Software

4.

Explaining the ocean's richest biodiversity hotspot and global patterns of fish diversity.

Miller, Elizabeth Christina; Hayashi, Kenji T; Song, Dongyuan; Wiens, John J.

Proc Biol Sci ; 285(1888), 2018 Oct 10.

Article in English | MEDLINE | ID: mdl-30305433

ABSTRACT

For most marine organisms, species richness peaks in the Central Indo-Pacific region and declines longitudinally, a striking pattern that remains poorly understood. Here, we used phylogenetic approaches to address the causes of richness patterns among global marine regions, comparing the relative importance of colonization time, number of colonization events, and diversification rates (speciation minus extinction). We estimated regional richness using distributional data for almost all percomorph fishes (17 435 species total, including approximately 72% of all marine fishes and approximately 33% of all freshwater fishes). The high diversity of the Central Indo-Pacific was explained by its colonization by many lineages 5.3-34 million years ago. These relatively old colonizations allowed more time for richness to build up through in situ diversification compared to other warm-marine regions. Surprisingly, diversification rates were decoupled from marine richness patterns, with clades in low-richness cold-marine habitats having the highest rates. Unlike marine richness, freshwater diversity was largely derived from a few ancient colonizations, coupled with high diversification rates. Our results are congruent with the geological history of the marine tropics, and thus may apply to many other organisms. Beyond marine biogeography, we add to the growing number of cases where colonization and time-for-speciation explain large-scale richness patterns instead of diversification rates.

Subjects

Biodiversity , Fishes , Genetic Speciation , Animals , Ecosystem , Indian Ocean , Pacific Ocean

5.

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics.

Song, Dongyuan; Wang, Qingyang; Yan, Guanao; Liu, Tianyang; Sun, Tianyi; Li, Jingyi Jessica.

Nat Biotechnol ; 42(2): 247-252, 2024 Feb.

Article in English | MEDLINE | ID: mdl-37169966

ABSTRACT

We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

Subjects

Benchmarking , Models, Statistical , Research Design

6.

DNA binding analysis of rare variants in homeodomains reveals homeodomain specificity-determining residues.

Kock, Kian Hong; Kimes, Patrick K; Gisselbrecht, Stephen S; Inukai, Sachi; Phanor, Sabrina K; Anderson, James T; Ramakrishnan, Gayatri; Lipper, Colin H; Song, Dongyuan; Kurland, Jesse V; Rogers, Julia M; Jeong, Raehoon; Blacklow, Stephen C; Irizarry, Rafael A; Bulyk, Martha L.

Nat Commun ; 15(1): 3110, 2024 Apr 10.

Article in English | MEDLINE | ID: mdl-38600112

ABSTRACT

Homeodomains (HDs) are the second largest class of DNA binding domains (DBDs) among eukaryotic sequence-specific transcription factors (TFs) and are the TF structural class with the largest number of disease-associated mutations in the Human Gene Mutation Database (HGMD). Despite numerous structural studies and large-scale analyses of HD DNA binding specificity, HD-DNA recognition is still not fully understood. Here, we analyze 92 human HD mutants, including disease-associated variants and variants of uncertain significance (VUS), for their effects on DNA binding activity. Many of the variants alter DNA binding affinity and/or specificity. Detailed biochemical analysis and structural modeling identifies 14 previously unknown specificity-determining positions, 5 of which do not contact DNA. The same missense substitution at analogous positions within different HDs often exhibits different effects on DNA binding activity. Variant effect prediction tools perform moderately well in distinguishing variants with altered DNA binding affinity, but poorly in identifying those with altered binding specificity. Our results highlight the need for biochemical assays of TF coding variants and prioritize dozens of variants for further investigations into their pathogenicity and the development of clinical diagnostics and precision therapies.

Subjects

Homeodomain Proteins , Transcription Factors , Humans , Homeodomain Proteins/metabolism , Transcription Factors/metabolism , DNA/metabolism , Mutation , Models, Molecular

7.

ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping.

Song, Dongyuan; Li, Kexin; Ge, Xinzhou; Li, Jingyi Jessica.

bioRxiv ; 2023 Jul 25.

Article in English | MEDLINE | ID: mdl-37546812

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.

8.

ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping.

Song, Dongyuan; Li, Kexin; Ge, Xinzhou; Li, Jingyi Jessica.

Res Sq ; 2023 Aug 02.

Article in English | MEDLINE | ID: mdl-37577698

ABSTRACT

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used twice to define cell clusters as potential cell types and DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.

9.

scReadSim: a single-cell RNA-seq and ATAC-seq read simulator.

Yan, Guanao; Song, Dongyuan; Li, Jingyi Jessica.

Nat Commun ; 14(1): 7482, 2023 Nov 18.

Article in English | MEDLINE | ID: mdl-37980428

ABSTRACT

Benchmarking single-cell RNA-seq (scRNA-seq) and single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) computational tools demands simulators to generate realistic sequencing reads. However, none of the few read simulators aim to mimic real data. To fill this gap, we introduce scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads (in a FASTQ or BAM file) by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data. Moreover, scReadSim provides ground truths, including unique molecular identifier (UMI) counts for scRNA-seq and open chromatin regions for scATAC-seq. In particular, scReadSim allows users to design cell-type-specific ground-truth open chromatin regions for scATAC-seq data generation. In benchmark applications of scReadSim, we show that UMI-tools achieves the top accuracy in scRNA-seq UMI deduplication, and HMMRATAC and MACS3 achieve the top performance in scATAC-seq peak calling.

Subjects

Chromatin Immunoprecipitation Sequencing , Single-Cell Gene Expression Analysis , Single-Cell Analysis , Chromatin/genetics

10.

Benchmarking computational methods to identify spatially variable genes and peaks.

Li, Zhijian; Patel, Zain M; Song, Dongyuan; Yan, Guanao; Li, Jingyi Jessica; Pinello, Luca.

bioRxiv ; 2023 Dec 03.

Article in English | MEDLINE | ID: mdl-38076922

ABSTRACT

Spatially resolved transcriptomics offers unprecedented insight by enabling the profiling of gene expression within the intact spatial context of cells, effectively adding a new and essential dimension to data interpretation. To efficiently detect spatial structure of interest, an essential step in analyzing such data involves identifying spatially variable genes. Despite researchers having developed several computational methods to accomplish this task, the lack of a comprehensive benchmark evaluating their performance remains a considerable gap in the field. Here, we present a systematic evaluation of 14 methods using 60 simulated datasets generated by four different simulation strategies, 12 real-world transcriptomics, and three spatial ATAC-seq datasets. We find that spatialDE2 consistently outperforms the other benchmarked methods, and Moran's I achieves competitive performance in different experimental settings. Moreover, our results reveal that more specialized algorithms are needed to identify spatially variable peaks.

11.

Decoding Heterogenous Single-cell Perturbation Responses.

Song, Bicna; Liu, Dingyu; Dai, Weiwei; McMyn, Natalie; Wang, Qingyang; Yang, Dapeng; Krejci, Adam; Vasilyev, Anatoly; Untermoser, Nicole; Loregger, Anke; Song, Dongyuan; Williams, Breanna; Rosen, Bess; Cheng, Xiaolong; Chao, Lumen; Kale, Hanuman T; Zhang, Hao; Diao, Yarui; Bürckstümmer, Tilmann; Siliciano, Jenet M; Li, Jingyi Jessica; Siliciano, Robert; Huangfu, Danwei; Li, Wei.

bioRxiv ; 2023 Nov 29.

Article in English | MEDLINE | ID: mdl-37961332

ABSTRACT

Understanding diverse responses of individual cells to the same perturbation is central to many biological and biomedical problems. Current methods, however, do not precisely quantify the strength of perturbation responses or, more importantly, reveal new biological insights from heterogeneity in responses. Here we introduce the perturbation-response score (PS), based on constrained quadratic optimization, to quantify diverse perturbation responses at a single-cell level. Applied to single-cell transcriptomes of large-scale genetic perturbation datasets (e.g., Perturb-seq), PS outperforms existing methods for quantifying partial gene perturbation responses. In addition, PS presents two major advances. First, PS enables large-scale, single-cell-resolution dosage analysis of perturbation, without the need to titrate perturbation strength. By analyzing the dose-response patterns of over 2,000 essential genes in Perturb-seq, we identify two distinct patterns, depending on whether a moderate reduction in their expression induces strong downstream expression alterations. Second, PS identifies intrinsic and extrinsic biological determinants of perturbation responses. We demonstrate the application of PS in contexts such as T cell stimulation, latent HIV-1 expression, and pancreatic cell differentiation. Notably, PS unveiled a previously unrecognized, cell-type-specific role of coiled-coil domain containing 6 (CCDC6) in guiding liver and pancreatic lineage decisions, where CCDC6 knockouts drive the endoderm cell differentiation towards liver lineage, rather than pancreatic lineage. The PS approach provides an innovative method for dose-to-function analysis and will enable new biological discoveries from single-cell perturbation datasets.

12.

Statistics or biology: the zero-inflation controversy about scRNA-seq data.

Jiang, Ruochen; Sun, Tianyi; Song, Dongyuan; Li, Jingyi Jessica.

Genome Biol ; 23(1): 31, 2022 Jan 21.

Article in English | MEDLINE | ID: mdl-35063006

ABSTRACT

Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.

Subjects

Benchmarking , Single-Cell Analysis , Biology , Sequence Analysis, RNA , Exome Sequencing

13.

Simulating Single-Cell Gene Expression Count Data with Preserved Gene Correlations by scDesign2.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

J Comput Biol ; 29(1): 23-26, 2022 Jan.

Article in English | MEDLINE | ID: mdl-35020490

ABSTRACT

scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.

Subjects

Gene Expression Profiling/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Software , Algorithms , Animals , Cluster Analysis , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Gene Expression , Mice , Models, Statistical , RNA-Seq/statistics & numerical data

14.

PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data.

Song, Dongyuan; Li, Jingyi Jessica.

Genome Biol ; 22(1): 124, 2021 Apr 29.

Article in English | MEDLINE | ID: mdl-33926517

ABSTRACT

To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.

Subjects

Gene Expression Profiling , Gene Expression Regulation, Developmental , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Transcriptome , Algorithms , Cell Lineage/genetics , Computational Biology/methods , Gene Expression Profiling/methods , Gene Ontology , High-Throughput Nucleotide Sequencing , Organ Specificity/genetics

15.

scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 22(1): 163, 2021 May 25.

Article in English | MEDLINE | ID: mdl-34034771

ABSTRACT

A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.

Subjects

Computer Simulation , Gene Expression Regulation , Single-Cell Analysis , Software , Animals , Calibration , Cell Count , Cluster Analysis , Genomics , Goblet Cells/metabolism , Humans , Mice , RNA-Seq

16.

Clipper: p-value-free FDR control on high-throughput data from two conditions.

Ge, Xinzhou; Chen, Yiling Elaine; Song, Dongyuan; McDermott, MeiLu; Woyshner, Kyla; Manousopoulou, Antigoni; Wang, Ning; Li, Wei; Wang, Leo D; Li, Jingyi Jessica.

Genome Biol ; 22(1): 288, 2021 Oct 11.

Article in English | MEDLINE | ID: mdl-34635147

ABSTRACT

High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.

Subjects

High-Throughput Nucleotide Sequencing/methods , Software , Chromatin Immunoprecipitation Sequencing/methods , Chromosomes , Computer Simulation , Data Interpretation, Statistical , Humans , Mass Spectrometry , Peptides/chemistry , Proteomics/methods , RNA-Seq/methods , Single-Cell Analysis

17.

Author Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 24(1): 32, 2023 Feb 22.

Article in English | MEDLINE | ID: mdl-36814256

18.

Publisher Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Sun, Tianyi; Song, Dongyuan; Li, Wei Vivian; Li, Jingyi Jessica.

Genome Biol ; 22(1): 177, 2021 Jun 09.

Article in English | MEDLINE | ID: mdl-34108038
