Search | BVS Search Portal

1.

Data integration and inference of gene regulation using single-cell temporal multimodal data with scTIE.

Lin, Yingxin; Wu, Tung-Yu; Chen, Xi; Wan, Sheng; Chao, Brian; Xin, Jingxue; Yang, Jean Y H; Wong, Wing H; Wang, Y X Rachel.

Genome Res ; 34(1): 119-133, 2024 02 07.

Article in English | MEDLINE | ID: mdl-38190633

ABSTRACT

Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space by using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal data sets, we show scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome data set we generated from differentiating mouse embryonic stem cells over time, we show scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
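The abstract mentions optimal transport as the mechanism for aligning cells across time points. As a rough illustration of that general idea only (this is not scTIE's implementation), entropy-regularized optimal transport between two small groups of cells can be sketched with Sinkhorn iterations. The function name `sinkhorn`, the toy cost matrix, and the parameters `eps` and `iters` are assumptions made for this example.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, iters=500):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost : n x m list of pairwise costs (hypothetical: distances between
           embedded cells at two time points)
    a, b : marginal weights of the two groups, each summing to 1
    Returns the n x m transport plan as a list of lists.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: two cells per time point, cheap to match like with like
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

In this toy case most of the mass stays on the diagonal of `plan`, meaning each cell is matched mainly to its low-cost counterpart at the other time point.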

Subject(s)

Gene Expression Profiling , Single-Cell Analysis , Animals , Mice , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Gene Expression Regulation

2.

Network Modeling in Biology: Statistical Methods for Gene and Brain Networks.

Wang, Y X Rachel; Li, Lexin; Li, Jingyi Jessica; Huang, Haiyan.

Stat Sci ; 36(1): 89-108, 2021 Feb.

Article in English | MEDLINE | ID: mdl-34305304

ABSTRACT

The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion on existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.

3.

GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs.

Ursu, Oana; Boley, Nathan; Taranova, Maryna; Wang, Y X Rachel; Yardimci, Galip Gurkan; Stafford Noble, William; Kundaje, Anshul.

Bioinformatics ; 34(16): 2701-2707, 2018 08 15.

Article in English | MEDLINE | ID: mdl-29554289

ABSTRACT

Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability and implementation: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Supplementary information: Supplementary data are available at Bioinformatics online.
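The key idea stated in the abstract (smooth each contact map with random walks on its graph, then compare the smoothed maps) can be sketched as follows. This is a minimal illustration under stated assumptions, not the GenomeDISCO software: plain row normalization, a fixed number of walk steps, and a score defined as one minus the average per-node absolute difference are all choices made here for the example, and every function name is hypothetical.

```python
def row_normalize(m):
    """Convert a contact-count matrix into a row-stochastic transition matrix."""
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s if s > 0 else 0.0 for v in row])
    return out

def matmul(a, b):
    """Multiply two matrices given as lists of lists."""
    k, p = len(b), len(b[0])
    return [[sum(row[x] * b[x][j] for x in range(k)) for j in range(p)] for row in a]

def smooth(m, steps):
    """Smooth a contact map by a `steps`-step random walk (transition matrix power)."""
    p = row_normalize(m)
    result = p
    for _ in range(steps - 1):
        result = matmul(result, p)
    return result

def smoothed_difference(m1, m2, steps=3):
    """Sum of absolute differences between the smoothed maps, divided by node count."""
    s1, s2 = smooth(m1, steps), smooth(m2, steps)
    n = len(m1)
    return sum(abs(s1[i][j] - s2[i][j]) for i in range(n) for j in range(n)) / n

def concordance_score(m1, m2, steps=3):
    """Assumed scoring convention for this sketch: 1 minus the smoothed difference."""
    return 1.0 - smoothed_difference(m1, m2, steps)

# Toy 3-bin contact maps (symmetric counts)
map_a = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
map_b = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
score_same = concordance_score(map_a, map_a)
score_diff = concordance_score(map_a, map_b)
```

Identical maps give the maximum score under this convention, and maps with rearranged contacts score lower.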

Subject(s)

Chromatin/metabolism , Computational Biology/methods , Software , Cell Line , Chromatin/ultrastructure , Humans , Molecular Conformation , Reproducibility of Results

4.

Generalized correlation measure using count statistics for gene expression data with ordered samples.

Wang, Y X Rachel; Liu, Ke; Theusch, Elizabeth; Rotter, Jerome I; Medina, Marisa W; Waterman, Michael S; Huang, Haiyan; Stegle, Oliver.

Bioinformatics ; 34(4): 617-624, 2018 02 15.

Article in English | MEDLINE | ID: mdl-29040382

ABSTRACT

Motivation: Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species). Results: We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data. Availability and implementation: The code used to perform the analysis is available as part of the Supplementary Material. Contact: msw@usc.edu or hhuang@stat.berkeley.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Gene Expression Profiling/methods , Gene Expression Regulation , Gene Regulatory Networks , Software , Algorithms , Computational Biology/methods , Phenotype , Sequence Analysis, RNA/methods

5.

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data.

Bhaskar, Anand; Wang, Y X Rachel; Song, Yun S.

Genome Res ; 25(2): 268-79, 2015 Feb.

Article in English | MEDLINE | ID: mdl-25564017

ABSTRACT

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.
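To make the notion of fitting a model to the distribution of sample allele frequencies concrete, here is a small sketch for the simplest case only: a constant-size neutral coalescent, under which the expected number of sites with i derived alleles is proportional to 1/i. The piecewise-exponential models and automatic differentiation described in the abstract are not reproduced here, and the function names are hypothetical.

```python
import math

def expected_sfs_constant(n):
    """Normalized expected frequency spectrum for a sample of size n.

    Assumes a constant-size neutral coalescent: entry i-1 is proportional
    to 1/i for derived-allele counts i = 1, ..., n-1.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

def multinomial_loglik(observed, expected):
    """Multinomial log-likelihood (up to an additive constant) of observed
    site counts given expected spectrum proportions."""
    return sum(c * math.log(p) for c, p in zip(observed, expected) if c > 0)

# Toy example: sample of 4 chromosomes, so 3 frequency classes
spectrum = expected_sfs_constant(4)
loglik = multinomial_loglik([6, 3, 2], spectrum)
```

A demographic inference method would replace `expected_sfs_constant` with the expected spectrum under a candidate population size history and maximize the likelihood over the history's parameters.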

Subject(s)

Genetic Loci , Genetic Variation , Genetics, Population , Genomics , Mutation Rate , Population Density , Exome , High-Throughput Nucleotide Sequencing , Humans , Models, Genetic , Models, Statistical , Reproducibility of Results

6.

Gene coexpression measures in large heterogeneous samples using count statistics.

Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan.

Proc Natl Acad Sci U S A ; 111(46): 16371-6, 2014 Nov 18.

Article in English | MEDLINE | ID: mdl-25288767

ABSTRACT

With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
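The general idea of counting local patterns of expression ranks can be illustrated with a short sketch. It is a simplified stand-in, not the statistics defined in the paper: it slides a contiguous window of length `k` over two ordered profiles and counts the windows in which the rank patterns agree or are exactly reversed. The function names and the choice of contiguous windows are assumptions made for this example.

```python
def rank_pattern(values):
    """Rank of each element within the window (0 = smallest).
    Ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return tuple(ranks)

def local_pattern_count(x, y, k=3):
    """Count length-k windows where x and y share a rank pattern,
    or where x matches the pattern of -y (negative association)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    count = 0
    for start in range(len(x) - k + 1):
        window_x = x[start:start + k]
        window_y = y[start:start + k]
        pattern_x = rank_pattern(window_x)
        if pattern_x == rank_pattern(window_y):
            count += 1
        elif pattern_x == rank_pattern([-v for v in window_y]):
            count += 1
    return count

# Two profiles that move together in every window
together = local_pattern_count([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# Two profiles with no matching local pattern
unrelated = local_pattern_count([1, 3, 2, 5, 4], [1, 2, 3, 4, 5])
```

Because only ranks within each window are compared, a single extreme value changes the count by at most the few windows that contain it, which is one intuition for robustness against outliers.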

Subject(s)

Computational Biology/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Algorithms , Arabidopsis/genetics , Arabidopsis/metabolism , Arabidopsis Proteins/biosynthesis , Arabidopsis Proteins/genetics , Cell Cycle Proteins/biosynthesis , Cell Cycle Proteins/genetics , Computational Biology/methods , Computer Simulation , Gene Expression Regulation, Fungal , Gene Expression Regulation, Plant , Genes, Fungal , Genes, Plant , Models, Genetic , Monte Carlo Method , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/biosynthesis , Saccharomyces cerevisiae Proteins/genetics , Time Factors

7.

Review on statistical methods for gene network reconstruction using expression data.

Wang, Y X Rachel; Huang, Haiyan.

J Theor Biol ; 362: 53-61, 2014 Dec 07.

Article in English | MEDLINE | ID: mdl-24726980

ABSTRACT

Network modeling has proven to be a fundamental tool in analyzing the inner workings of a cell. It has revolutionized our understanding of biological processes and made significant contributions to the discovery of disease biomarkers. Much effort has been devoted to reconstruct various types of biochemical networks using functional genomic datasets generated by high-throughput technologies. This paper discusses statistical methods used to reconstruct gene regulatory networks using gene expression data. In particular, we highlight progress made and challenges yet to be met in the problems involved in estimating gene interactions, inferring causality and modeling temporal changes of regulation behaviors. As rapid advances in technologies have made available diverse, large-scale genomic data, we also survey methods of incorporating all these additional data to achieve better, more accurate inference of gene networks.

Subject(s)

Gene Expression Profiling , Gene Expression Regulation , Algorithms , Bayes Theorem , Biomarkers/metabolism , Cluster Analysis , Gene Regulatory Networks , Genomics , Humans , Models, Statistical , Normal Distribution , Pattern Recognition, Automated , Software

8.

An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection.

Steinrücken, Matthias; Wang, Y X Rachel; Song, Yun S.

Theor Popul Biol ; 83: 1-14, 2013 Feb.

Article in English | MEDLINE | ID: mdl-23127866

ABSTRACT

Characterizing time-evolution of allele frequencies in a population is a fundamental problem in population genetics. In the Wright-Fisher diffusion, such dynamics is captured by the transition density function, which satisfies well-known partial differential equations. For a multi-allelic model with general diploid selection, various theoretical results exist on representations of the transition density, but finding an explicit formula has remained a difficult problem. In this paper, a technique recently developed for a diallelic model is extended to find an explicit transition density for an arbitrary number of alleles, under a general diploid selection model with recurrent parent-independent mutation. Specifically, the method finds the eigenvalues and eigenfunctions of the generator associated with the multi-allelic diffusion, thus yielding an accurate spectral representation of the transition density. Furthermore, this approach allows for efficient, accurate computation of various other quantities of interest, including the normalizing constant of the stationary distribution and the rate of convergence to this distribution.

Subject(s)

Alleles , Diploidy , Models, Theoretical

9.

scTIE: data integration and inference of gene regulation using single-cell temporal multimodal data.

Lin, Yingxin; Wu, Tung-Yu; Chen, Xi; Wan, Sheng; Chao, Brian; Xin, Jingxue; Yang, Jean Y H; Wong, Wing H; Wang, Y X Rachel.

bioRxiv ; 2023 May 22.

Article in English | MEDLINE | ID: mdl-37292801

ABSTRACT

Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.

10.

Statistics in everyone's backyard: An impact study via citation network analysis.

Wang, Lijia; Tong, Xin; Wang, Y X Rachel.

Patterns (N Y) ; 3(8): 100532, 2022 Aug 12.

Article in English | MEDLINE | ID: mdl-36033599

ABSTRACT

Statistical methodologies are indispensable in data-driven scientific discoveries. In this paper, we make the first effort to understand the impact of recent statistical innovations on other scientific fields. By collecting comprehensive bibliometric data from the Web of Science database for selected statistical journals, we investigate the citation trends and compositions of citing fields over time, and we find increasing citation diversity. Furthermore, in a new setting, we apply a local clustering technique involving personalized PageRank with graph conductance for size selection to find the most relevant statistical innovation for a given external topic in other fields. Through a number of case studies, we show that the results from our citation data analysis align well with our knowledge and intuition about these external topics. Overall, we have found that the statistical theory and methods recently invented by the statistics community have made increasing impact on other scientific fields.
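Local clustering with personalized PageRank and a conductance-based choice of cluster size, as named in the abstract, can be sketched on a toy graph. This is a generic textbook-style version and not the authors' pipeline: the power iteration, the teleport probability `alpha`, the sweep over degree-normalized scores, and all function names are assumptions made for this example. The graph is assumed to be undirected with no isolated nodes.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Power iteration for PageRank that teleports back to `seed`
    with probability alpha at each step."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha
        for u in adj:
            share = (1 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

def conductance(adj, subset):
    """Edges leaving the subset divided by the smaller side's total degree."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_in = sum(len(adj[u]) for u in subset)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")

def sweep_cut(adj, seed, alpha=0.15):
    """Order nodes by degree-normalized PageRank and keep the prefix
    with the lowest conductance (the size-selection step)."""
    p = personalized_pagerank(adj, seed, alpha)
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    best, best_phi = None, float("inf")
    for size in range(1, len(order)):
        phi = conductance(adj, order[:size])
        if phi < best_phi:
            best, best_phi = set(order[:size]), phi
    return best, best_phi

# Toy citation-like graph: two triangles joined by a single edge
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster, phi = sweep_cut(graph, seed=0)
```

On this toy graph the sweep returns the seed's own triangle, since cutting the single bridging edge gives the lowest conductance.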

11.

scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning.

Lin, Yingxin; Wu, Tung-Yu; Wan, Sheng; Yang, Jean Y H; Wong, Wing H; Wang, Y X Rachel.

Nat Biotechnol ; 40(5): 703-710, 2022 05.

Article in English | MEDLINE | ID: mdl-35058621

ABSTRACT

Single-cell multiomics data continues to grow at an unprecedented pace. Although several methods have demonstrated promising results in integrating several data modalities from the same tissue, the complexity and scale of data compositions present in cell atlases still pose a challenge. Here, we present scJoint, a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semisupervised framework and uses a neural network to simultaneously train labeled and unlabeled data, allowing label transfer and joint visualization in an integrative framework. Using atlas data as well as multimodal datasets generated with ASAP-seq and CITE-seq, we demonstrate that scJoint is computationally efficient and consistently achieves substantially higher cell-type label accuracy than existing methods while providing meaningful joint visualizations. Thus, scJoint overcomes the heterogeneity of different data modalities to enable a more comprehensive understanding of cellular phenotypes.

Subject(s)

Chromatin Immunoprecipitation Sequencing , Single-Cell Analysis , Machine Learning , RNA-Seq , Sequence Analysis, RNA , Single-Cell Analysis/methods , Exome Sequencing

12.

NETWORK MODELLING OF TOPOLOGICAL DOMAINS USING HI-C DATA.

Wang, Y X Rachel; Sarkar, Purnamrita; Ursu, Oana; Kundaje, Anshul; Bickel, Peter J.

Ann Appl Stat ; 13(3): 1511-1536, 2019 Sep.

Article in English | MEDLINE | ID: mdl-32968472

ABSTRACT

Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.
