Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 130
Filtrar
Más filtros

Banco de datos
Tipo del documento
Intervalo de año de publicación
1.
Cell ; 180(3): 568-584.e23, 2020 02 06.
Artículo en Inglés | MEDLINE | ID: mdl-31981491

RESUMEN

We present the largest exome sequencing study of autism spectrum disorder (ASD) to date (n = 35,584 total samples, 11,986 with ASD). Using an enhanced analytical framework to integrate de novo and case-control rare variation, we identify 102 risk genes at a false discovery rate of 0.1 or less. Of these genes, 49 show higher frequencies of disruptive de novo variants in individuals ascertained to have severe neurodevelopmental delay, whereas 53 show higher frequencies in individuals ascertained to have ASD; comparing ASD cases with mutations in these groups reveals phenotypic differences. Expressed early in brain development, most risk genes have roles in regulation of gene expression or neuronal communication (i.e., mutations effect neurodevelopmental and neurophysiological changes), and 13 fall within loci recurrently hit by copy number variants. In cells from the human cortex, expression of risk genes is enriched in excitatory and inhibitory neuronal lineages, consistent with multiple paths to an excitatory-inhibitory imbalance underlying ASD.


Asunto(s)
Trastorno Autístico/genética , Corteza Cerebral/crecimiento & desarrollo , Secuenciación del Exoma/métodos , Regulación del Desarrollo de la Expresión Génica , Neurobiología/métodos , Estudios de Casos y Controles , Linaje de la Célula , Estudios de Cohortes , Exoma , Femenino , Frecuencia de los Genes , Predisposición Genética a la Enfermedad , Humanos , Masculino , Mutación Missense , Neuronas/metabolismo , Fenotipo , Factores Sexuales , Análisis de la Célula Individual/métodos
2.
Cell ; 155(5): 997-1007, 2013 Nov 21.
Artículo en Inglés | MEDLINE | ID: mdl-24267886

RESUMEN

Autism spectrum disorder (ASD) is a complex developmental syndrome of unknown etiology. Recent studies employing exome- and genome-wide sequencing have identified nine high-confidence ASD (hcASD) genes. Working from the hypothesis that ASD-associated mutations in these biologically pleiotropic genes will disrupt intersecting developmental processes to contribute to a common phenotype, we have attempted to identify time periods, brain regions, and cell types in which these genes converge. We have constructed coexpression networks based on the hcASD "seed" genes, leveraging a rich expression data set encompassing multiple human brain regions across human development and into adulthood. By assessing enrichment of an independent set of probable ASD (pASD) genes, derived from the same sequencing studies, we demonstrate a key point of convergence in midfetal layer 5/6 cortical projection neurons. This approach informs when, where, and in what cell types mutations in these specific genes may be productively studied to clarify ASD pathophysiology.


Asunto(s)
Encéfalo/metabolismo , Trastornos Generalizados del Desarrollo Infantil/genética , Trastornos Generalizados del Desarrollo Infantil/fisiopatología , Animales , Encéfalo/embriología , Encéfalo/crecimiento & desarrollo , Encéfalo/patología , Trastornos Generalizados del Desarrollo Infantil/patología , Exoma , Femenino , Feto/metabolismo , Feto/patología , Perfilación de la Expresión Génica , Predisposición Genética a la Enfermedad , Estudio de Asociación del Genoma Completo , Humanos , Masculino , Ratones , Mutación , Neuronas/metabolismo , Corteza Prefrontal/metabolismo , Análisis de Secuencia de ADN
3.
Biostatistics ; 25(4): 1254-1272, 2024 Oct 01.
Artículo en Inglés | MEDLINE | ID: mdl-38649751

RESUMEN

CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present considerable statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens-"thresholded regression"-exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV ("GLM-based errors-in-variables"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g. Microsoft Azure) and high-performance clusters. Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several new insights.


Asunto(s)
Análisis de la Célula Individual , Análisis de la Célula Individual/métodos , Humanos , Repeticiones Palindrómicas Cortas Agrupadas y Regularmente Espaciadas/genética , Sistemas CRISPR-Cas/genética , Modelos Estadísticos
4.
Proc Natl Acad Sci U S A ; 119(34): e2205518119, 2022 08 23.
Artículo en Inglés | MEDLINE | ID: mdl-35969737

RESUMEN

Testing the significance of predictors in a regression model is one of the most important topics in statistics. This problem is especially difficult without any parametric assumptions on the data. This paper aims to test the null hypothesis that given confounding variables Z, X does not significantly contribute to the prediction of Y under the model-free setting, where X and Z are possibly high dimensional. We propose a general framework that first fits nonparametric machine learning regression algorithms on [Formula: see text] and [Formula: see text], then compares the prediction power of the two models. The proposed method allows us to leverage the strength of the most powerful regression algorithms developed in the modern machine learning community. The P value for the test can be easily obtained by permutation. In simulations, we find that the proposed method is more powerful compared to existing methods. The proposed method allows us to draw biologically meaningful conclusions from two gene expression data analyses without strong distributional assumptions: 1) testing the prediction power of sequencing RNA for the proteins in cellular indexing of transcriptomes and epitopes by sequencing data and 2) identification of spatially variable genes in spatially resolved transcriptomics data.


Asunto(s)
Genómica , Aprendizaje Automático , Algoritmos , Análisis de Regresión , Transcriptoma
5.
Proc Natl Acad Sci U S A ; 119(49): e2214414119, 2022 12 06.
Artículo en Inglés | MEDLINE | ID: mdl-36459654

RESUMEN

Recent advances in single-cell technologies enable joint profiling of multiple omics. These profiles can reveal the complex interplay of different regulatory layers in single cells; still, new challenges arise when integrating datasets with some features shared across experiments and others exclusive to a single source; combining information across these sources is called mosaic integration. The difficulties lie in imputing missing molecular layers to build a self-consistent atlas, finding a common latent space, and transferring learning to new data sources robustly. Existing mosaic integration approaches based on matrix factorization cannot efficiently adapt to nonlinear embeddings for the latent cell space and are not designed for accurate imputation of missing molecular layers. By contrast, we propose a probabilistic variational autoencoder model, scVAEIT, to integrate and impute multimodal datasets with mosaic measurements. A key advance is the use of a missing mask for learning the conditional distribution of unobserved modalities and features, which makes scVAEIT flexible to combine different panels of measurements from multimodal datasets accurately and in an end-to-end manner. Imputing the masked features serves as a supervised learning procedure while preventing overfitting by regularization. Focusing on gene expression, protein abundance, and chromatin accessibility, we validate that scVAEIT robustly imputes the missing modalities and features of cells biologically different from the training data. scVAEIT also adjusts for batch effects while maintaining the biological variation, which provides better latent representations for the integrated datasets. We demonstrate that scVAEIT significantly improves integration and imputation across unseen cell types, different technologies, and different tissues.


Asunto(s)
Modelos Estadísticos , Programas Informáticos , Cromatina , Tecnología
6.
BMC Bioinformatics ; 25(1): 113, 2024 Mar 15.
Artículo en Inglés | MEDLINE | ID: mdl-38486150

RESUMEN

BACKGROUND: Single-cell RNA-sequencing (scRNA) datasets are becoming increasingly popular in clinical and cohort studies, but there is a lack of methods to investigate differentially expressed (DE) genes among such datasets with numerous individuals. While numerous methods exist to find DE genes for scRNA data from limited individuals, differential-expression testing for large cohorts of case and control individuals using scRNA data poses unique challenges due to substantial effects of human variation, i.e., individual-level confounding covariates that are difficult to account for in the presence of sparsely-observed genes. RESULTS: We develop the eSVD-DE, a matrix factorization that pools information across genes and removes confounding covariate effects, followed by a novel two-sample test in mean expression between case and control individuals. In general, differential testing after dimension reduction yields an inflation of Type-1 errors. However, we overcome this by testing for differences between the case and control individuals' posterior mean distributions via a hierarchical model. In previously published datasets of various biological systems, eSVD-DE has more accuracy and power compared to other DE methods typically repurposed for analyzing cohort-wide differential expression. CONCLUSIONS: eSVD-DE proposes a novel and powerful way to test for DE genes among cohorts after performing a dimension reduction. Accurate identification of differential expression on the individual level, instead of the cell level, is important for linking scRNA-seq studies to our understanding of the human population.


Asunto(s)
Perfilación de la Expresión Génica , Análisis de Expresión Génica de una Sola Célula , Humanos , Perfilación de la Expresión Génica/métodos , Programas Informáticos , Análisis de la Célula Individual/métodos
7.
Hum Mol Genet ; 31(3): 481-489, 2022 02 03.
Artículo en Inglés | MEDLINE | ID: mdl-34508597

RESUMEN

The use of external controls in genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn's disease. We demonstrate a boost in power over using the cohort samples alone, and that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual level genotypes, rather than summary statistics, are required.


Asunto(s)
Estudio de Asociación del Genoma Completo , Polimorfismo de Nucleótido Simple , Estudios de Cohortes , Genotipo , Humanos , Polimorfismo de Nucleótido Simple/genética , Control de Calidad
8.
Genome Res ; 31(10): 1807-1818, 2021 10.
Artículo en Inglés | MEDLINE | ID: mdl-33837133

RESUMEN

When assessed over a large number of samples, bulk RNA sequencing provides reliable data for gene expression at the tissue level. Single-cell RNA sequencing (scRNA-seq) deepens those analyses by evaluating gene expression at the cellular level. Both data types lend insights into disease etiology. With current technologies, scRNA-seq data are known to be noisy. Constrained by costs, scRNA-seq data are typically generated from a relatively small number of subjects, which limits their utility for some analyses, such as identification of gene expression quantitative trait loci (eQTLs). To address these issues while maintaining the unique advantages of each data type, we develop a Bayesian method (bMIND) to integrate bulk and scRNA-seq data. With a prior derived from scRNA-seq data, we propose to estimate sample-level cell type-specific (CTS) expression from bulk expression data. The CTS expression enables large-scale sample-level downstream analyses, such as detection of CTS differentially expressed genes (DEGs) and eQTLs. Through simulations, we show that bMIND improves the accuracy of sample-level CTS expression estimates and increases the power to discover CTS DEGs when compared to existing methods. To further our understanding of two complex phenotypes, autism spectrum disorder and Alzheimer's disease, we apply bMIND to gene expression data of relevant brain tissue to identify CTS DEGs. Our results complement findings for CTS DEGs obtained from snRNA-seq studies, replicating certain DEGs in specific cell types while nominating other novel genes for those cell types. Finally, we calculate CTS eQTLs for 11 brain regions by analyzing Genotype-Tissue Expression Project data, creating a new resource for biological insights.


Asunto(s)
Trastorno del Espectro Autista , Análisis de la Célula Individual , Trastorno del Espectro Autista/genética , Teorema de Bayes , Expresión Génica , Perfilación de la Expresión Génica/métodos , Humanos , Análisis de Secuencia de ARN/métodos , Análisis de la Célula Individual/métodos
9.
Biometrics ; 80(1)2024 Jan 29.
Artículo en Inglés | MEDLINE | ID: mdl-38465983

RESUMEN

In genomics studies, the investigation of gene relationships often brings important biological insights. Currently, the large heterogeneous datasets impose new challenges for statisticians because gene relationships are often local. They change from one sample point to another, may only exist in a subset of the sample, and can be nonlinear or even nonmonotone. Most previous dependence measures do not specifically target local dependence relationships, and the ones that do are computationally costly. In this paper, we explore a state-of-the-art network estimation technique that characterizes gene relationships at the single cell level, under the name of cell-specific gene networks. We first show that averaging the cell-specific gene relationship over a population gives a novel univariate dependence measure, the averaged Local Density Gap (aLDG), that accumulates local dependence and can detect any nonlinear, nonmonotone relationship. Together with a consistent nonparametric estimator, we establish its robustness on both the population and empirical levels. Then, we show that averaging the cell-specific gene relationship over mini-batches determined by some external structure information (eg, spatial or temporal factor) better highlights meaningful local structure change points. We explore the application of aLDG and its minibatch variant in many scenarios, including pairwise gene relationship estimation, bifurcating point detection in cell trajectory, and spatial transcriptomics structure visualization. Both simulations and real data analysis show that aLDG outperforms existing ones.


Asunto(s)
Algoritmos , Análisis de Expresión Génica de una Sola Célula , Perfilación de la Expresión Génica/métodos , Redes Reguladoras de Genes , Análisis de Secuencia de ARN/métodos
10.
Proc Natl Acad Sci U S A ; 118(51)2021 12 21.
Artículo en Inglés | MEDLINE | ID: mdl-34903665

RESUMEN

Gene coexpression networks yield critical insights into biological processes, and single-cell RNA sequencing provides an opportunity to target inquiries at the cellular level. However, due to the sparsity and heterogeneity of transcript counts, it is challenging to construct accurate gene networks. We develop an approach, locCSN, that estimates cell-specific networks (CSNs) for each cell, preserving information about cellular heterogeneity that is lost with other approaches. LocCSN is based on a nonparametric investigation of the joint distribution of gene expression; hence it can readily detect nonlinear correlations, and it is more robust to distributional challenges. Although individual CSNs are estimated with considerable noise, average CSNs provide stable estimates of networks, which reveal gene communities better than traditional measures. Additionally, we propose downstream analysis methods using CSNs to utilize more fully the information contained within them. Repeated estimates of gene networks facilitate testing for differences in network structure between cell groups. Notably, with this approach, we can identify differential network genes, which typically do not differ in gene expression, but do differ in terms of the coexpression networks. These genes might help explain the etiology of disease. Finally, to further our understanding of autism spectrum disorder, we examine the evolution of gene networks in fetal brain cells and compare the CSNs of cells sampled from case and control subjects to reveal intriguing patterns in gene coexpression.


Asunto(s)
Encéfalo/citología , Redes Reguladoras de Genes/fisiología , Análisis de Secuencia de ARN , Análisis de la Célula Individual/métodos , Trastorno del Espectro Autista/metabolismo , Feto , Regulación de la Expresión Génica , Humanos , Neuronas , RNA-Seq
11.
Proc Natl Acad Sci U S A ; 118(10)2021 03 09.
Artículo en Inglés | MEDLINE | ID: mdl-33658382

RESUMEN

Large, comprehensive collections of single-cell RNA sequencing (scRNA-seq) datasets have been generated that allow for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets or transfer knowledge from one to the other to better understand cellular identity and functions. Here, we present a simple yet surprisingly effective method named common factor integration and transfer learning (cFIT) for capturing various batch effects across experiments, technologies, subjects, and even species. The proposed method models the shared information between various datasets by a common factor space while allowing for unique distortions and shifts in genewise expression in each batch. The model parameters are learned under an iterative nonnegative matrix factorization (NMF) framework and then used for synchronized integration from across-domain assays. In addition, the model enables transferring via low-rank matrix from more informative data to allow for precise identification in data of lower quality. Compared with existing approaches, our method imposes weaker assumptions on the cell composition of each individual dataset; however, it is shown to be more reliable in preserving biological variations. We apply cFIT to multiple scRNA-seq datasets of developing brain from human and mouse, varying by technologies and developmental stages. The successful integration and transfer uncover the transcriptional resemblance across systems. The study helps establish a comprehensive landscape of brain cell-type diversity and provides insights into brain development.


Asunto(s)
Secuenciación del Exoma , Aprendizaje Automático , RNA-Seq , Análisis de la Célula Individual , Programas Informáticos , Transcriptoma , Animales , Humanos , Ratones
12.
Brief Bioinform ; 22(6)2021 11 05.
Artículo en Inglés | MEDLINE | ID: mdl-34459489

RESUMEN

In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive $P$-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably our workflow is based on summary statistics.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Predisposición Genética a la Enfermedad , Pruebas Genéticas/normas , Estudio de Asociación del Genoma Completo/métodos , Estudio de Asociación del Genoma Completo/normas , Alelos , Mapeo Cromosómico , Bases de Datos Genéticas , Pruebas Genéticas/métodos , Humanos , Desequilibrio de Ligamiento , Fenotipo , Polimorfismo de Nucleótido Simple , Sitios de Carácter Cuantitativo
13.
Nucleic Acids Res ; 49(16): e91, 2021 09 20.
Artículo en Inglés | MEDLINE | ID: mdl-34125905

RESUMEN

A wealth of clustering algorithms are available for single-cell RNA sequencing (scRNA-seq) data to enable the identification of functionally distinct subpopulations that each possess a different pattern of gene expression activity. Implementation of these methods requires a choice of resolution parameter to determine the number of clusters, and critical judgment from the researchers is required to determine the desired resolution. This supervised process takes significant time and effort. Moreover, it can be difficult to compare and characterize the evolution of cell clusters from results obtained at one single resolution. To overcome these challenges, we built Multi-resolution Reconciled Tree (MRtree), a highly flexible tree-construction algorithm that generates a cluster hierarchy from flat clustering results attained for a range of resolutions. Because MRtree can be coupled with most scRNA-seq clustering algorithms, it inherits the robustness and versatility of a flat clustering approach, while maintaining the hierarchical structure of cells. The constructed trees from multiple scRNA-seq datasets effectively reflect the extent of transcriptional distinctions among cell groups and align well with levels of functional specializations among cells. Importantly, application to fetal brain cells identified subtypes of cells determined mainly by maturation states, spatial location and terminal specification.


Asunto(s)
RNA-Seq/métodos , Análisis de la Célula Individual/métodos , Programas Informáticos , Análisis por Conglomerados , Humanos
14.
Proc Natl Acad Sci U S A ; 117(26): 15028-15035, 2020 06 30.
Artículo en Inglés | MEDLINE | ID: mdl-32522875

RESUMEN

To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive P-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association P values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allow us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene-gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.


Asunto(s)
Trastorno Bipolar/genética , Esquizofrenia/genética , Algoritmos , Predisposición Genética a la Enfermedad , Estudio de Asociación del Genoma Completo , Genotipo , Humanos , Herencia Multifactorial , Polimorfismo de Nucleótido Simple , Sitios de Carácter Cuantitativo
15.
Bioinformatics ; 37(19): 3228-3234, 2021 Oct 11.
Artículo en Inglés | MEDLINE | ID: mdl-33904573

RESUMEN

MOTIVATION: Marker genes, defined as genes that are expressed primarily in a single-cell type, can be identified from the single-cell transcriptome; however, such data are not always available for the many uses of marker genes, such as deconvolution of bulk tissue. Marker genes for a cell type, however, are highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in the samples. Therefore, when many tissue samples are analyzed, it is possible to identify these marker genes from the correlation pattern. RESULTS: To capitalize on this pattern, we develop a new algorithm to detect marker genes by combining published information about likely marker genes with bulk transcriptome data in the form of a semi-supervised algorithm. The algorithm then exploits the correlation structure of the bulk data to refine the published marker genes by adding or removing genes from the list. AVAILABILITY AND IMPLEMENTATION: We implement this method as an R package markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

16.
Bioinformatics ; 37(16): 2374-2381, 2021 Aug 25.
Artículo en Inglés | MEDLINE | ID: mdl-33624750

RESUMEN

MOTIVATION: Gene-gene co-expression networks (GCN) are of biological interest for the useful information they provide for understanding gene-gene interactions. The advent of single cell RNA-sequencing allows us to examine more subtle gene co-expression occurring within a cell type. Many imputation and denoising methods have been developed to deal with the technical challenges observed in single cell data; meanwhile, several simulators have been developed for benchmarking and assessing these methods. Most of these simulators, however, either do not incorporate gene co-expression or generate co-expression in an inconvenient manner. RESULTS: Therefore, with the focus on gene co-expression, we propose a new simulator, ESCO, which adopts the idea of the copula to impose gene co-expression, while preserving the highlights of available simulators, which perform well for simulation of gene expression marginally. Using ESCO, we assess the performance of imputation methods on GCN recovery and find that imputation generally helps GCN recovery when the data are not too sparse, and the ensemble imputation method works best among leading methods. In contrast, imputation fails to help in the presence of an excessive fraction of zero counts, where simple data aggregating methods are a better choice. These findings are further verified with mouse and human brain cell data. AVAILABILITY AND IMPLEMENTATION: The ESCO implementation is available as R package ESCO. Users can either download the development version via github (https://github.com/JINJINT/ESCO) or the archived version via Zenodo (https://zenodo.org/record/4455890). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

17.
Proc Natl Acad Sci U S A ; 116(2): 466-471, 2019 01 08.
Artículo en Inglés | MEDLINE | ID: mdl-30587579

RESUMEN

Motivated by the dynamics of development, in which cells of recognizable types, or pure cell types, transition into other types over time, we propose a method of semisoft clustering that can classify both pure and intermediate cell types from data on gene expression from individual cells. Called semisoft clustering with pure cells (SOUP), this algorithm reveals the clustering structure for both pure cells and transitional cells with soft memberships. SOUP involves a two-step process: Identify the set of pure cells and then estimate a membership matrix. To find pure cells, SOUP uses the special block structure in the expression similarity matrix. Once pure cells are identified, they provide the key information from which the membership matrix can be computed. By modeling cells as a continuous mixture of K discrete types we obtain more parsimonious results than obtained with standard clustering algorithms. Moreover, using soft membership estimates of cell type cluster centers leads to better estimates of developmental trajectories. The strong performance of SOUP is documented via simulation studies, which show its robustness to violations of modeling assumptions. The advantages of SOUP are illustrated by analyses of two independent datasets of gene expression from a large number of cells from fetal brain.


Asunto(s)
Algoritmos , Diferenciación Celular , Proliferación Celular , Procesamiento Automatizado de Datos , Modelos Biológicos , Animales , Humanos
18.
Bioinformatics ; 36(3): 782-788, 2020 02 01.
Artículo en Inglés | MEDLINE | ID: mdl-31400192

RESUMEN

MOTIVATION: Patterns of gene expression, quantified at the level of tissue or cells, can inform on etiology of disease. There are now rich resources for tissue-level (bulk) gene expression data, which have been collected from thousands of subjects, and resources involving single-cell RNA-sequencing (scRNA-seq) data are expanding rapidly. The latter yields cell type information, although the data can be noisy and typically are derived from a small number of subjects. RESULTS: Complementing these approaches, we develop a method to estimate subject- and cell-type-specific (CTS) gene expression from tissue using an empirical Bayes method that borrows information across multiple measurements of the same tissue per subject (e.g. multiple regions of the brain). Analyzing expression data from multiple brain regions from the Genotype-Tissue Expression project (GTEx) reveals CTS expression, which then permits downstream analyses, such as identification of CTS expression Quantitative Trait Loci (eQTL). AVAILABILITY AND IMPLEMENTATION: We implement this method as an R package MIND, hosted on https://github.com/randel/MIND. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Perfilación de la Expresión Génica , Programas Informáticos , Teorema de Bayes , Análisis de Secuencia de ARN , Análisis de la Célula Individual
19.
Proc Natl Acad Sci U S A ; 115(5): 927-932, 2018 01 30.
Artículo en Inglés | MEDLINE | ID: mdl-29339482

RESUMEN

Community detection is challenging when the network structure is estimated with uncertainty. Dynamic networks present additional challenges but also add information across time periods. We propose a global community detection method, persistent communities by eigenvector smoothing (PisCES), that combines information across a series of networks, longitudinally, to strengthen the inference for each period. Our method is derived from evolutionary spectral clustering and degree correction methods. Data-driven solutions to the problem of tuning parameter selection are provided. In simulations we find that PisCES performs better than competing methods designed for a low signal-to-noise ratio. Recently obtained gene expression data from rhesus monkey brains provide samples from finely partitioned brain regions over a broad time span including pre- and postnatal periods. Of interest is how gene communities develop over space and time; however, once the data are divided into homogeneous spatial and temporal periods, sample sizes are very small, making inference quite challenging. Applying PisCES to medial prefrontal cortex in monkey rhesus brains from near conception to adulthood reveals dense communities that persist, merge, and diverge over time and others that are loosely organized and short lived, illustrating how dynamic community detection can yield interesting insights into processes such as brain development.


Asunto(s)
Análisis por Conglomerados , Redes Reguladoras de Genes , Algoritmos , Animales , Simulación por Computador , Regulación del Desarrollo de la Expresión Génica , Macaca mulatta , Modelos Genéticos , Modelos Neurológicos , Modelos Estadísticos , Corteza Prefrontal/embriología , Corteza Prefrontal/crecimiento & desarrollo , Corteza Prefrontal/metabolismo
20.
Annu Rev Genomics Hum Genet ; 18: 167-187, 2017 08 31.
Artículo en Inglés | MEDLINE | ID: mdl-28426285

RESUMEN

The etiology of autism spectrum disorder (ASD) is complex, involving both genetic and environmental contributions to individual and population-level liability. Early researchers hypothesized that ASD arises from polygenic inheritance, but later results, such as the identification of mutations in certain genes that are responsible for syndromes associated with ASD, led others to propose that de novo mutations of major effect would account for most cases. This yin and yang of monogenic causes and polygenic inheritance continues to this day. The development of genome-wide genotyping and sequencing techniques has resulted in remarkable advances in our understanding of the genetic architecture of risk for ASD. The combined research findings provide solid evidence that ASD is a complex polygenic disorder. Rare de novo and inherited variations act within the context of a common-variant genetic load, and this load accounts for the largest portion of ASD liability.


Asunto(s)
Trastorno del Espectro Autista/genética , Predisposición Genética a la Enfermedad , Mutación , Polimorfismo Genético , Trastorno del Espectro Autista/etiología , Femenino , Humanos , Masculino
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA