Results 1 - 11 of 11
1.
Nat Methods ; 20(2): 229-238, 2023 02.
Article in English | MEDLINE | ID: mdl-36587187

ABSTRACT

Nonnegative matrix factorization (NMF) is widely used to analyze high-dimensional count data because, in contrast to real-valued alternatives such as factor analysis, it produces an interpretable parts-based representation. However, in applications such as spatial transcriptomics, NMF fails to incorporate known structure between observations. Here, we present nonnegative spatial factorization (NSF), a spatially aware probabilistic dimension reduction model based on transformed Gaussian processes that naturally encourages sparsity and scales to tens of thousands of observations. NSF recovers ground truth factors more accurately than real-valued alternatives such as MEFISTO in simulations, and has lower out-of-sample prediction error than probabilistic NMF on three spatial transcriptomics datasets from mouse brain and liver. Since not all patterns of gene expression have spatial correlations, we also propose a hybrid extension of NSF that combines spatial and nonspatial components, enabling quantification of spatial importance for both observations and features. A TensorFlow implementation of NSF is available from https://github.com/willtownes/nsf-paper.


Subject(s)
Algorithms , Gene Expression Profiling , Animals , Mice , Gene Expression Profiling/methods , Genomics , Models, Statistical
2.
Nat Methods ; 20(9): 1379-1387, 2023 09.
Article in English | MEDLINE | ID: mdl-37592182

ABSTRACT

Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples' spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices, or association tests across data modalities.


Subject(s)
Genomics , Models, Statistical , Humans , Normal Distribution
3.
Bioinformatics ; 36(22-23): 5432-5438, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33367522

ABSTRACT

MOTIVATION: Analysis of rare variants in family-based studies remains a challenge. Transmission-based approaches provide robustness against population stratification, but the evaluation of the significance of test statistics based on asymptotic theory can be imprecise. Also, power will depend heavily on the choice of the test statistic and on the underlying genetic architecture of the locus, which will be generally unknown. RESULTS: In our proposed framework, we utilize the FBAT haplotype algorithm to obtain the conditional offspring genotype distribution under the null hypothesis given the sufficient statistic. Based on this conditional offspring genotype distribution, the significance of virtually any association test statistic can be evaluated based on simulations or exact computations, without the need for asymptotic approximations. Besides standard linear burden-type statistics, this enables our approach to also evaluate other test statistics such as variance components statistics, higher criticism approaches, and maximum-single-variant-statistics, where asymptotic theory might be involved or does not provide accurate approximations for rare variant data. Based on these P-values, combined test statistics such as the aggregated Cauchy association test (ACAT) can also be utilized. In simulation studies, we show that our framework outperforms existing approaches for family-based studies in several scenarios. We also applied our methodology to a TOPMed whole-genome sequencing dataset with 897 asthmatic trios from Costa Rica. AVAILABILITY AND IMPLEMENTATION: FBAT software is available at https://sites.google.com/view/fbatwebpage. Simulation code is available at https://github.com/julianhecker/FBAT_rare_variant_test_simulations. Whole-genome sequencing data for 'NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica' is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000988.v4.p1. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
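
The abstract above mentions combining P-values with the aggregated Cauchy association test (ACAT). As a rough illustration only, the following sketch applies the standard Cauchy combination rule to a list of P-values. The function name `acat` and the default of equal weights are assumptions made for this example; this is not the FBAT implementation, and it does not include the numerical safeguards a production version would need for extremely small P-values.

```python
import math


def acat(p_values, weights=None):
    """Combine P-values with the Cauchy combination rule (illustrative sketch).

    Each P-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted average of those variates is then
    converted back to a P-value with the Cauchy survival function.
    """
    if weights is None:
        # Assumption for this sketch: equal weights when none are supplied.
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values) or not p_values:
        raise ValueError("need one weight per P-value and at least one P-value")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("P-values must lie strictly between 0 and 1")

    total_weight = sum(weights)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_weight
    # Survival function of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(acat([0.01, 0.20, 0.65]))
```

A useful sanity check on the rule is that combining several copies of the same P-value returns that P-value unchanged.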

4.
PLoS Comput Biol ; 16(11): e1008429, 2020 11.
Article in English | MEDLINE | ID: mdl-33253142

ABSTRACT

Aging is a complex process with poorly understood genetic mechanisms. Recent studies have sought to classify genes as pro-longevity or anti-longevity using a variety of machine learning algorithms. However, it is not clear which types of features are best for optimizing classification performance and which algorithms are best suited to this task. Further, performance assessments based on held-out test data are lacking. We systematically compare five popular classification algorithms using gene ontology and gene expression datasets as features to predict the pro-longevity versus anti-longevity status of genes for two model organisms (C. elegans and S. cerevisiae) using the GenAge database as ground truth. We find that elastic net penalized logistic regression performs particularly well at this task. Using elastic net, we make novel predictions of pro- and anti-longevity genes that are not currently in the GenAge database.


Subject(s)
Gene Expression , Gene Ontology , Longevity/genetics , Algorithms , Animals , Caenorhabditis elegans/genetics , Genes, Fungal , Machine Learning , Reproducibility of Results , Saccharomyces cerevisiae/genetics
5.
Genet Epidemiol ; 42(1): 123-126, 2018 02.
Article in English | MEDLINE | ID: mdl-29159827

ABSTRACT

For family-based association studies, Horvath et al. proposed an algorithm for the association analysis between haplotypes and arbitrary phenotypes when the phase of the haplotypes is unknown, that is, genotype data is given. Their approach to haplotype analysis maintains the original features of the TDT/FBAT-approach, that is, complete robustness against genetic confounding and misspecification of the phenotype. The algorithm has been implemented in the FBAT and PBAT software package and has been used in numerous substantive manuscripts. Here, we propose a simplification of the original algorithm that maintains the original approach but reduces the computational burden of the approach substantially and gives valuable insights regarding the conditional distribution. With the modified algorithm, the application to whole-genome sequencing (WGS) studies becomes feasible; for example, in sliding window approaches or spatial-clustering approaches. The reduction of the computational burden that our modification provides is especially dramatic when both parental genotypes are missing. For example, for eight variants and 441 nuclear families with mostly offspring-only families, in a WGS study at the APOE locus, the running time decreased from approximately 21 hr for the original algorithm to 0.11 sec after our modification.


Subject(s)
Algorithms , Haplotypes , Nuclear Family , Phenotype , Apolipoproteins E/genetics , Cluster Analysis , Female , Humans , Male , Models, Genetic , Time Factors , Whole Genome Sequencing
6.
Biostatistics ; 19(4): 562-578, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29121214

ABSTRACT

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used, and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.


Subject(s)
Gene Expression Profiling/standards , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, RNA/standards , Single-Cell Analysis/standards , Transcriptome , Animals , Humans
7.
Life Sci Alliance ; 5(12)2022 08 17.
Article in English | MEDLINE | ID: mdl-35977827

ABSTRACT

Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene-variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual's genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA.


Subject(s)
Polymorphism, Single Nucleotide , Quantitative Trait Loci , Gene Expression Regulation , Gene Regulatory Networks , Genotype , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics
8.
Genome Biol ; 21(1): 160, 2020 07 03.
Article in English | MEDLINE | ID: mdl-32620142

ABSTRACT

Single-cell RNA-seq (scRNA-seq) profiles gene expression of individual cells. Unique molecular identifiers (UMIs) remove duplicates in read counts resulting from polymerase chain reaction, a major source of noise. For scRNA-seq data lacking UMIs, we propose quasi-UMIs: quantile normalization of read counts to a compound Poisson distribution empirically derived from UMI datasets. When applied to ground-truth datasets having both reads and UMIs, quasi-UMI normalization has higher accuracy than competing methods. Using quasi-UMIs enables methods designed specifically for UMI data to be applied to non-UMI scRNA-seq datasets.


Subject(s)
Sequence Analysis, RNA , Single-Cell Analysis , Animals , Humans , Normal Distribution , Poisson Distribution
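
The quasi-UMI abstract above describes quantile normalization of read counts to a target count distribution. The sketch below shows only the general idea for a single cell: zeros are kept as zeros, and each distinct nonzero read count is mapped to the target quantile matching its empirical mid-rank. The function name `quantile_normalize_to_target`, the mid-rank convention, and the small hand-written target distribution are all assumptions for illustration; the published method derives its target from UMI datasets and is not reproduced here.

```python
from collections import Counter
from itertools import accumulate


def quantile_normalize_to_target(read_counts, target_pmf):
    """Map one cell's read counts onto a target distribution (sketch).

    read_counts: nonnegative integer read counts for one cell.
    target_pmf:  hypothetical target probabilities, where target_pmf[k - 1]
                 is P(X = k) for k = 1..K (a distribution over nonzero counts).
    """
    target_cdf = list(accumulate(target_pmf))
    nonzero = [c for c in read_counts if c > 0]
    if not nonzero:
        return [0] * len(read_counts)

    frequency = Counter(nonzero)
    n = len(nonzero)
    mapping = {0: 0}  # zeros stay zeros
    below = 0  # number of nonzero counts strictly smaller than `value`
    for value in sorted(frequency):
        # Mid-rank empirical quantile of this read-count value (assumption).
        q = (below + frequency[value] / 2) / n
        # Smallest target value whose cumulative probability reaches q.
        k = next(
            (i + 1 for i, c in enumerate(target_cdf) if c >= q),
            len(target_cdf),
        )
        mapping[value] = k
        below += frequency[value]
    return [mapping[c] for c in read_counts]


if __name__ == "__main__":
    print(quantile_normalize_to_target([0, 5, 5, 12, 40], [0.5, 0.3, 0.2]))
```

Because the mapping depends only on ranks, tied read counts receive the same normalized value and the ordering of genes within the cell is preserved.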
9.
Genome Biol ; 21(1): 179, 2020 Jul 22.
Article in English | MEDLINE | ID: mdl-32698904

ABSTRACT

An amendment to this paper has been published and can be accessed via the original article.

10.
Genome Biol ; 20(1): 295, 2019 12 23.
Article in English | MEDLINE | ID: mdl-31870412

ABSTRACT

Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.


Subject(s)
Models, Statistical , Sequence Analysis, RNA , Single-Cell Analysis
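
The abstract above proposes feature selection using deviance under a multinomial model of UMI counts. One way this might look in practice is sketched below, using a per-gene binomial deviance against a null model in which the gene takes a constant share of every cell's total counts. The function names and the genes-by-cells list layout are assumptions for this example, and the binomial approximation is a simplification rather than the authors' released code.

```python
import math


def _xlogx_ratio(observed, expected):
    """Return observed * log(observed / expected), with 0 * log(0) taken as 0."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def binomial_deviance(gene_counts, cell_totals):
    """Deviance of one gene against a constant-proportion null model."""
    pi_hat = sum(gene_counts) / sum(cell_totals)  # pooled proportion
    if pi_hat <= 0.0 or pi_hat >= 1.0:
        return 0.0  # gene is absent or accounts for every count
    deviance = 0.0
    for y, n in zip(gene_counts, cell_totals):
        deviance += _xlogx_ratio(y, n * pi_hat)
        deviance += _xlogx_ratio(n - y, n * (1.0 - pi_hat))
    return 2.0 * deviance


def rank_genes_by_deviance(count_matrix):
    """Return gene indices sorted from most to least informative.

    count_matrix: list of rows, one row per gene, one column per cell
    (a layout assumed for this sketch).
    """
    cell_totals = [sum(column) for column in zip(*count_matrix)]
    deviances = [binomial_deviance(row, cell_totals) for row in count_matrix]
    return sorted(range(len(count_matrix)), key=lambda g: deviances[g], reverse=True)


if __name__ == "__main__":
    counts = [[10, 20], [0, 30], [90, 150]]  # 3 genes x 2 cells
    print(rank_genes_by_deviance(counts))
```

Genes whose counts scale in proportion to each cell's total have deviance near zero and rank last, while genes whose share changes from cell to cell rank first.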
11.
PLoS One ; 11(10): e0163544, 2016.
Article in English | MEDLINE | ID: mdl-27732614

ABSTRACT

BACKGROUND: The recent Ebola virus disease (EVD) outbreak in West Africa has spread wider than any previous human EVD epidemic. While individual-level risk factors that contribute to the spread of EVD have been studied, the population-level attributes of subnational regions associated with outbreak severity have not yet been considered. METHODS: To investigate the area-level predictors of EVD dynamics, we integrated time series data on cumulative reported cases of EVD from the World Health Organization and covariate data from the Demographic and Health Surveys. We first estimated the early growth rates of epidemics in each second-level administrative district (ADM2) in Guinea, Sierra Leone and Liberia using exponential, logistic and polynomial growth models. We then evaluated how these growth rates, as well as epidemic size within ADM2s, were ecologically associated with several demographic and socio-economic characteristics of the ADM2, using bivariate correlations and multivariable regression models. RESULTS: The polynomial growth model appeared to best fit the ADM2 epidemic curves, displaying the lowest residual standard error. Each outcome was associated with various regional characteristics in bivariate models; however, in stepwise multivariable models only mean education levels were consistently associated with a worse local epidemic. DISCUSSION: By combining two common methods (estimation of epidemic parameters using mathematical models, and estimation of associations using ecological regression models), we identified some factors predicting rapid and severe EVD epidemics in West African subnational regions. While care should be taken interpreting such results as anything more than correlational, we suggest that our approach of using data sources that were publicly available in advance of the epidemic or in real-time provides an analytic framework that may assist countries in understanding the dynamics of future outbreaks as they occur.


Subject(s)
Hemorrhagic Fever, Ebola/epidemiology , Social Class , Adolescent , Adult , Disease Outbreaks , Female , Guinea/epidemiology , Health Surveys , Hemorrhagic Fever, Ebola/economics , Humans , Interviews as Topic , Liberia/epidemiology , Linear Models , Male , Middle Aged , Models, Statistical , Sierra Leone/epidemiology , Young Adult
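
The methods in the abstract above include estimating early epidemic growth rates from cumulative case counts. As a minimal sketch of the exponential case only, the code below fits C(t) = C0 * exp(r * t) by ordinary least squares on the log scale. The function name and the log-linear fitting choice are assumptions for illustration; the study also fitted logistic and polynomial models (and found the polynomial model fit best), which are not shown here.

```python
import math


def exponential_growth_rate(days, cumulative_cases):
    """Estimate (r, C0) in C(t) = C0 * exp(r * t) by log-linear least squares.

    Observations with zero cases are dropped because log(0) is undefined.
    """
    points = [(t, math.log(c)) for t, c in zip(days, cumulative_cases) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two time points with nonzero cases")

    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    if s_tt == 0:
        raise ValueError("time points must not all be identical")
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in points)

    rate = s_ty / s_tt                       # slope on the log scale
    initial = math.exp(mean_y - rate * mean_t)  # back-transformed intercept
    return rate, initial


if __name__ == "__main__":
    # Hypothetical district time series, for illustration only.
    r, c0 = exponential_growth_rate([0, 7, 14, 21], [4, 9, 19, 42])
    print(r, c0)
```

The fitted rate is per unit of whatever time scale `days` uses, so a doubling time can be read off as log(2) divided by the rate.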