1.
Cell ; 184(26): 6281-6298.e23, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34875227

ABSTRACT

While intestinal Th17 cells are critical for maintaining tissue homeostasis, recent studies have implicated their roles in the development of extra-intestinal autoimmune diseases including multiple sclerosis. However, the mechanisms by which tissue Th17 cells mediate these dichotomous functions remain unknown. Here, we characterized the heterogeneity, plasticity, and migratory phenotypes of tissue Th17 cells in vivo by combined fate mapping with profiling of the transcriptomes and TCR clonotypes of over 84,000 Th17 cells at homeostasis and during CNS autoimmune inflammation. Inter- and intra-organ single-cell analyses revealed a homeostatic, stem-like TCF1+ IL-17+ SLAMF6+ population that traffics to the intestine where it is maintained by the microbiota, providing a ready reservoir for the IL-23-driven generation of encephalitogenic GM-CSF+ IFN-γ+ CXCR6+ T cells. Our study defines a direct in vivo relationship between IL-17+ non-pathogenic and GM-CSF+ and IFN-γ+ pathogenic Th17 populations and provides a mechanism by which homeostatic intestinal Th17 cells direct extra-intestinal autoimmune disease.


Subject(s)
Autoimmunity , Intestines/immunology , Stem Cells/metabolism , Th17 Cells/immunology , Animals , Cell Movement , Clone Cells , Experimental Autoimmune Encephalomyelitis/immunology , Granulocyte-Macrophage Colony-Stimulating Factor/metabolism , Homeostasis , Humans , Interferon-gamma/metabolism , Interleukin-17/metabolism , Inbred C57BL Mice , Organ Specificity , RNA/metabolism , RNA-Seq , T-Cell Antigen Receptors/metabolism , CXCR6 Receptors/metabolism , Interleukin Receptors/metabolism , Reproducibility of Results , Signaling Lymphocytic Activation Molecule Family/metabolism , Single-Cell Analysis , Spleen/metabolism
2.
Cell ; 182(6): 1474-1489.e23, 2020 09 17.
Article in English | MEDLINE | ID: mdl-32841603

ABSTRACT

Widespread changes to DNA methylation and chromatin are well documented in cancer, but the fate of higher-order chromosomal structure remains obscure. Here we integrated topological maps for colon tumors and normal colons with epigenetic, transcriptional, and imaging data to characterize alterations to chromatin loops, topologically associated domains, and large-scale compartments. We found that spatial partitioning of the open and closed genome compartments is profoundly compromised in tumors. This reorganization is accompanied by compartment-specific hypomethylation and chromatin changes. Additionally, we identify a compartment at the interface between the canonical A and B compartments that is reorganized in tumors. Remarkably, similar shifts were evident in non-malignant cells that have accumulated excess divisions. Our analyses suggest that these topological changes repress stemness and invasion programs while inducing anti-tumor immunity genes and may therefore restrain malignant progression. Our findings call into question the conventional view that tumor-associated epigenomic alterations are primarily oncogenic.


Subject(s)
Chromatin/metabolism , Chromosomes/metabolism , Colorectal Neoplasms/genetics , Colorectal Neoplasms/metabolism , DNA Methylation , Genetic Epigenesis , Neoplastic Gene Expression Regulation/genetics , Cell Division , Cellular Senescence/genetics , Chromatin Immunoprecipitation Sequencing , Chromosomes/genetics , Cohort Studies , Colorectal Neoplasms/mortality , Colorectal Neoplasms/pathology , Computational Biology , DNA Methylation/genetics , Epigenomics , HCT116 Cells , Humans , Fluorescence In Situ Hybridization , Transmission Electron Microscopy , Molecular Dynamics Simulation , RNA-Seq , Spatial Analysis , Tumor Suppressor Proteins/genetics , Tumor Suppressor Proteins/metabolism
3.
Nat Methods ; 20(8): 1196-1202, 2023 08.
Article in English | MEDLINE | ID: mdl-37429993

ABSTRACT

Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.


Subject(s)
Algorithms , Benchmarking , Humans , Animals , Mice , Cluster Analysis , RNA , Single-Cell Analysis/methods , RNA Sequence Analysis/methods , Gene Expression Profiling/methods
4.
Blood ; 2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39241199

ABSTRACT

Engineered cellular therapy with CD19-targeting chimeric antigen receptor T-cells (CAR-T) has revolutionized outcomes for patients with relapsed/refractory Large B-Cell Lymphoma (LBCL), but the cellular and molecular features associated with response remain largely unresolved. We analyzed serial peripheral blood samples ranging from day of apheresis (day -28/baseline) to 28 days after CAR-T infusion from 50 patients with LBCL treated with axicabtagene ciloleucel (axi-cel) by integrating single-cell RNA and TCR sequencing (scRNA-seq/scTCR-seq), flow cytometry, and mass cytometry (CyTOF) to characterize features associated with response to CAR-T. Pretreatment patient characteristics associated with response included presence of B cells and increased lymphocyte-to-monocyte ratio (ALC/AMC). Infusion products from responders were enriched for clonally expanded, highly activated CD8+ T cells. We expanded these observations to 99 patients from the ZUMA-1 cohort and identified a subset of patients with elevated baseline B cells, 80% of whom were complete responders. We integrated B cell proportion ≥0.5% and ALC/AMC ≥1.2 into a two-factor predictive model and applied this model to the ZUMA-1 cohort. Estimated progression-free survival (PFS) at 1 year in patients meeting one or both criteria was 65% versus 31% for patients meeting neither criterion. Our results suggest that patients' immunologic state at baseline affects likelihood of response to CAR-T through both modulation of the T cell apheresis product composition and promotion of a more favorable circulating immune compartment prior to therapy. These baseline immunologic features, measured readily in the clinical setting prior to CAR-T, can be applied to predict response to therapy.
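
The two-factor rule described in this abstract can be sketched as a simple either/or check. This is an illustration only: it assumes both thresholds act as lower bounds (at least 0.5% B cells, ALC/AMC of at least 1.2), and the function name, argument names, and patient values are hypothetical rather than taken from the study.

```python
def meets_either_criterion(b_cell_pct, alc_amc_ratio,
                           b_cell_cutoff=0.5, ratio_cutoff=1.2):
    """Return True if at least one baseline criterion is met.

    Assumption: both cutoffs are lower bounds, i.e. a patient qualifies
    with B cell proportion >= 0.5 (percent) or ALC/AMC >= 1.2.
    """
    return b_cell_pct >= b_cell_cutoff or alc_amc_ratio >= ratio_cutoff


# Hypothetical baseline values (B cell %, ALC/AMC), made up for illustration.
patients = {"P1": (0.8, 0.9), "P2": (0.1, 1.5), "P3": (0.1, 0.7)}
flags = {pid: meets_either_criterion(b, r) for pid, (b, r) in patients.items()}
```

Here `flags` marks P1 and P2 as meeting one criterion each and P3 as meeting neither, which mirrors the "one or both criteria" versus "neither criterion" grouping used for the PFS comparison.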

5.
Nat Methods ; 19(9): 1076-1087, 2022 09.
Article in English | MEDLINE | ID: mdl-36050488

ABSTRACT

A central problem in spatial transcriptomics is detecting differentially expressed (DE) genes within cell types across tissue context. Challenges to learning DE include changing cell type composition across space and measurement pixels detecting transcripts from multiple cell types. Here, we introduce a statistical method, cell type-specific inference of differential expression (C-SIDE), that identifies cell type-specific DE in spatial transcriptomics, accounting for localization of other cell types. We model gene expression as an additive mixture across cell types of log-linear cell type-specific expression functions. C-SIDE's framework applies to many contexts: DE due to pathology, anatomical regions, cell-to-cell interactions and cellular microenvironment. Furthermore, C-SIDE enables statistical inference across multiple replicates. Simulations and validation experiments on Slide-seq, MERFISH and Visium datasets demonstrate that C-SIDE accurately identifies DE with valid uncertainty quantification. Last, we apply C-SIDE to identify plaque-dependent immune activity in Alzheimer's disease and cellular interactions between tumor and immune cells. We distribute C-SIDE within the R package https://github.com/dmcable/spacexr.


Subject(s)
Gene Expression Profiling , Transcriptome , Gene Expression Profiling/methods
6.
Biostatistics ; 23(4): 1150-1164, 2022 10 14.
Article in English | MEDLINE | ID: mdl-35770795

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find limitations with current approaches due to the reliance on known marker genes or from overfitting because of systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Gene Expression , Gene Expression Profiling/methods , Humans , RNA-Seq , RNA Sequence Analysis/methods , Software
7.
Biostatistics ; 24(1): 1-16, 2022 12 12.
Article in English | MEDLINE | ID: mdl-34467372

ABSTRACT

High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction. This reduction technique finds linear transformations of the data which explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where similar objects are modeled by nearby points in the low-dimensional embedding and dissimilar objects are modeled by distant points with high probability. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often done before applying them, which makes them sensitive to PCA's shortcomings. Also, t-SNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. The linear transformations of PCA are preferable to non-linear transformations provided by methods like t-SNE and UMAP for interpretable feature weights.
Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters using linear transformations that permit post hoc analysis to determine features that define these latent structures.
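
The failure mode described in this abstract, where classes are not separated along the direction of maximum variance, can be reproduced on simulated data. The sketch below is not iDA: it uses known labels and a plain Fisher discriminant direction purely to contrast with the first principal component, and all simulation settings are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two latent classes: large shared variance along x, class separation along y.
labels = np.repeat([0, 1], n)
x = rng.normal(0.0, 10.0, size=2 * n)
y = rng.normal(0.0, 0.2, size=2 * n) + np.where(labels == 0, -1.0, 1.0)
X = np.column_stack([x, y])

# PCA: leading eigenvector of the sample covariance matrix.
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, np.argmax(evals)]

# Fisher discriminant direction: w proportional to Sw^{-1} (mu1 - mu0).
mu0 = X[labels == 0].mean(axis=0)
mu1 = X[labels == 1].mean(axis=0)
Sw = np.cov(X[labels == 0], rowvar=False) + np.cov(X[labels == 1], rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)


def separation(direction):
    """Absolute difference of class means divided by pooled SD along a direction."""
    z = X @ direction
    z0, z1 = z[labels == 0], z[labels == 1]
    pooled_sd = np.sqrt(0.5 * (z0.var(ddof=1) + z1.var(ddof=1)))
    return abs(z1.mean() - z0.mean()) / pooled_sd


pca_sep = separation(pc1)   # small: PC1 follows the high-variance x axis
lda_sep = separation(w)     # large: the discriminant follows the y axis
```

In this setup the first principal component aligns with the noisy x axis and barely separates the classes, while the discriminant direction aligns with y. In the unsupervised setting the labels are unknown, which is the gap an iterative approach has to close.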


Subject(s)
Algorithms , Humans , Principal Component Analysis
8.
Proc Natl Acad Sci U S A ; 117(51): 32772-32778, 2020 12 22.
Article in English | MEDLINE | ID: mdl-33293417

ABSTRACT

Population displacement may occur after natural disasters, permanently altering the demographic composition of the affected regions. Measuring this displacement is vital for both optimal postdisaster resource allocation and calculation of measures of public health interest such as mortality estimates. Here, we analyzed data generated by mobile phones and social media to estimate the weekly island-wide population at risk and within-island geographic heterogeneity of migration in Puerto Rico after Hurricane Maria. We compared these two data sources with population estimates derived from air travel records and census data. We observed a loss of population across all data sources throughout the study period; however, the magnitude and dynamics differ by the data source. Census data predict a population loss of just over 129,000 from July 2017 to July 2018, a 4% decrease; air travel data predict a population loss of 168,295 for the same period, a 5% decrease; mobile phone-based estimates predict a loss of 235,375 from July 2017 to May 2018, an 8% decrease; and social media-based estimates predict a loss of 476,779 from August 2017 to August 2018, a 17% decrease. On average, municipalities with a smaller population size lost a bigger proportion of their population. Moreover, we infer that these municipalities experienced greater infrastructure damage as measured by the proportion of unknown locations stemming from these regions. Finally, our analysis measures a general shift of population from rural to urban centers within the island. Passively collected data provide a promising supplement to current at-risk population estimation procedures; however, each data source has its own biases and limitations.

9.
EMBO J ; 37(6)2018 03 15.
Article in English | MEDLINE | ID: mdl-29335281

ABSTRACT

In the post-genomic era, thousands of putative noncoding regulatory regions have been identified, such as enhancers, promoters, long noncoding RNAs (lncRNAs), and a cadre of small peptides. These ever-growing catalogs require high-throughput assays to test their functionality at scale. Massively parallel reporter assays have greatly enhanced the understanding of noncoding DNA elements en masse. Here, we present a massively parallel RNA assay (MPRNA) that can assay 10,000 or more RNA segments for RNA-based functionality. We applied MPRNA to identify RNA-based nuclear localization domains harbored in lncRNAs. We examined a pool of 11,969 oligos densely tiling 38 human lncRNAs that were fused to a cytosolic transcript. After cell fractionation and barcode sequencing, we identified 109 unique RNA regions that significantly enriched this cytosolic transcript in the nucleus including a cytosine-rich motif. These nuclear enrichment sequences are highly conserved and over-represented in global nuclear fractionation sequencing. Importantly, many of these regions were independently validated by single-molecule RNA fluorescence in situ hybridization. Overall, we demonstrate the utility of MPRNA for future investigation of RNA-based functionalities.


Subject(s)
Long Noncoding RNA/genetics , Cell Nucleus/genetics , HeLa Cells , High-Throughput Nucleotide Sequencing , Humans , Fluorescence In Situ Hybridization , RNA Sequence Analysis
10.
Development ; 146(6)2019 03 28.
Article in English | MEDLINE | ID: mdl-30923056

ABSTRACT

Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering crucial events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved long noncoding RNAs (lncRNAs) that are regulated in a tissue-specific manner and exhibit spatiotemporal expression during neurogenesis with exquisite specificity. lncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource for investigating the function of coding genes and noncoding RNAs during crucial stages of early neurogenesis.


Subject(s)
Drosophila melanogaster/genetics , Developmental Gene Expression Regulation , Nervous System/embryology , Neurogenesis/genetics , Long Noncoding RNA/genetics , Animals , Cell Lineage , Drosophila melanogaster/metabolism , Flow Cytometry , Gene Expression Profiling , Gene Regulatory Networks , Fluorescence In Situ Hybridization , Neuroglia/physiology , Phylogeny , Transcriptome
11.
Epidemiology ; 33(3): 346-353, 2022 05 01.
Article in English | MEDLINE | ID: mdl-35383642

ABSTRACT

Quantifying the impact of natural disasters or epidemics is critical for guiding policy decisions and interventions. When the effects of an event are long-lasting and difficult to detect in the short term, the accumulated effects can be devastating. Mortality is one of the most reliably measured health outcomes, partly due to its unambiguous definition. As a result, excess mortality estimates are an increasingly effective approach for quantifying the effect of an event. However, the fact that indirect effects are often characterized by small, but enduring, increases in mortality rates presents a statistical challenge. This is compounded by sources of variability introduced by demographic changes, secular trends, seasonal and day of the week effects, and natural variation. Here, we present a model that accounts for these sources of variability and characterizes concerning increases in mortality rates with smooth functions of time that provide statistical power. The model permits discontinuities in the smooth functions to model sudden increases due to direct effects. We implement a flexible estimation approach that permits both surveillance of concerning increases in mortality rates and careful characterization of the effect of a past event. We demonstrate our tools' utility by estimating excess mortality after hurricanes in the United States and Puerto Rico. We use Hurricane Maria as a case study to show appealing properties that are unique to our method compared with current approaches. Finally, we show the flexibility of our approach by detecting and quantifying the 2014 Chikungunya outbreak in Puerto Rico and the COVID-19 pandemic in the United States. We make our tools available through the excessmort R package available from https://cran.r-project.org/web/packages/excessmort/.


Subject(s)
COVID-19 , Cyclonic Storms , Humans , Pandemics , Puerto Rico/epidemiology , United States/epidemiology
12.
N Engl J Med ; 379(2): 162-170, 2018 Jul 12.
Article in English | MEDLINE | ID: mdl-29809109

ABSTRACT

BACKGROUND: Quantifying the effect of natural disasters on society is critical for recovery of public health services and infrastructure. The death toll can be difficult to assess in the aftermath of a major disaster. In September 2017, Hurricane Maria caused massive infrastructural damage to Puerto Rico, but its effect on mortality remains contentious. The official death count is 64. METHODS: Using a representative, stratified sample, we surveyed 3299 randomly chosen households across Puerto Rico to produce an independent estimate of all-cause mortality after the hurricane. Respondents were asked about displacement, infrastructure loss, and causes of death. We calculated excess deaths by comparing our estimated post-hurricane mortality rate with official rates for the same period in 2016. RESULTS: From the survey data, we estimated a mortality rate of 14.3 deaths (95% confidence interval [CI], 9.8 to 18.9) per 1000 persons from September 20 through December 31, 2017. This rate yielded a total of 4645 excess deaths during this period (95% CI, 793 to 8498), equivalent to a 62% increase in the mortality rate as compared with the same period in 2016. However, this number is likely to be an underestimate because of survivor bias. The mortality rate remained high through the end of December 2017, and one third of the deaths were attributed to delayed or interrupted health care. Hurricane-related migration was substantial. CONCLUSIONS: This household-based survey suggests that the number of excess deaths related to Hurricane Maria in Puerto Rico is more than 70 times the official estimate. (Funded by the Harvard T.H. Chan School of Public Health and others.).
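
The headline comparisons in this abstract follow from simple arithmetic on the reported figures. The sketch below only re-derives ratios from numbers stated in the abstract; the implied 2016 rate is a back-calculation that assumes the 62% increase refers to the same per-1000 rate, and the variable names are illustrative.

```python
# Figures as reported in the abstract.
official_count = 64            # official death count
excess_deaths = 4645           # survey-based point estimate
ci_low, ci_high = 793, 8498    # 95% CI for excess deaths
rate_2017 = 14.3               # deaths per 1000 persons, Sep 20 to Dec 31, 2017
relative_increase = 0.62       # increase versus the same period in 2016

# "More than 70 times the official estimate".
ratio = excess_deaths / official_count

# Even the lower confidence bound is a multiple of the official count.
ratio_low = ci_low / official_count

# Back-calculated comparison rate for 2016 (assumption: same rate definition).
implied_rate_2016 = rate_2017 / (1 + relative_increase)
```

The point estimate is roughly 72.6 times the official count, and the implied 2016 rate is about 8.8 deaths per 1000 persons.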


Subject(s)
Cyclonic Storms , Disasters/statistics & numerical data , Health Services Accessibility/statistics & numerical data , Mortality , Adolescent , Adult , Age Distribution , Aged , Aged 80 and Over , Cause of Death , Child , Preschool Child , Female , Humans , Male , Middle Aged , Premature Mortality , Puerto Rico/epidemiology , Surveys and Questionnaires , Young Adult
13.
Ann Intern Med ; 173(12): 1004-1007, 2020 12 15.
Article in English | MEDLINE | ID: mdl-32915654

ABSTRACT

As of mid-August 2020, more than 170 000 U.S. residents have died of coronavirus disease 2019 (COVID-19); however, the true number of deaths resulting from COVID-19, both directly and indirectly, is likely to be much higher. The proper attribution of deaths to this pandemic has a range of societal, legal, mortuary, and public health consequences. This article discusses the current difficulties of disaster death attribution and describes the strengths and limitations of relying on death counts from death certificates, estimations of indirect deaths, and estimations of excess mortality. Improving the tabulation of direct and indirect deaths on death certificates will require concerted efforts and consensus across medical institutions and public health agencies. In addition, actionable estimates of excess mortality will require timely access to standardized and structured vital registry data, which should be shared directly at the state level to ensure rapid response for local governments. Correct attribution of direct and indirect deaths and estimation of excess mortality are complementary goals that are critical to our understanding of the pandemic and its effect on human life.


Subject(s)
COVID-19/mortality , Pandemics , Registries , SARS-CoV-2 , Cause of Death/trends , Humans , Survival Rate/trends
14.
Genome Res ; 27(11): 1930-1938, 2017 11.
Article in English | MEDLINE | ID: mdl-29025895

ABSTRACT

The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of functional genomics' public catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, as a result of the experimental protocol's lack of perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC effect varies across experiments, with the effect strong enough to result in a substantial number of peaks called differently when different laboratories perform experiments on the same cell line. However, accounting for GC content bias in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signals with unwanted variability. To account for this challenge, we introduce a statistical approach that accounts for GC effects on both nonspecific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
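
Any GC-bias analysis starts from a per-region GC covariate. The sketch below shows only that first step, computing the GC fraction of fixed-width windows; it is not the statistical model from the paper, and the function names and window scheme are assumptions made for illustration.

```python
def gc_fraction(seq):
    """Fraction of G/C among called bases (A, C, G, T); NaN if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, width):
    """GC fraction of consecutive non-overlapping windows of the given width."""
    return [gc_fraction(seq[i:i + width])
            for i in range(0, len(seq) - width + 1, width)]


example = windowed_gc("GGCCAATT", 4)   # first window all G/C, second all A/T
```

A covariate like this could then be related to observed coverage per window to check whether read depth tracks GC content rather than binding.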


Subject(s)
Base Composition , DNA/metabolism , DNA Sequence Analysis/methods , Algorithms , Binding Sites , Chromatin Immunoprecipitation , DNA/chemistry , False Positive Reactions , Genomics , High-Throughput Nucleotide Sequencing
15.
Biostatistics ; 20(3): 367-383, 2019 07 01.
Article in English | MEDLINE | ID: mdl-29481604

ABSTRACT

With recent advances in sequencing technology, it is now feasible to measure DNA methylation at tens of millions of sites across the entire genome. In most applications, biologists are interested in detecting differentially methylated regions, composed of multiple sites with differing methylation levels among populations. However, current computational approaches for detecting such regions do not provide accurate statistical inference. A major challenge in reporting uncertainty is that a genome-wide scan is involved in detecting these regions, which needs to be accounted for. A further challenge is that sample sizes are limited due to the costs associated with the technology. We have developed a new approach that overcomes these challenges and assesses uncertainty for differentially methylated regions in a rigorous manner. Region-level statistics are obtained by fitting a generalized least squares regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. We develop an inferential approach, based on a pooled null distribution, that can be implemented even when as few as two samples per population are available. Here, we demonstrate the advantages of our method using both experimental data and Monte Carlo simulation. We find that the new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate.


Subject(s)
DNA Methylation , Genomics/methods , Statistical Models , DNA Sequence Analysis/methods , Animals , Computer Simulation , Genomics/standards , Humans , DNA Sequence Analysis/standards , Uncertainty
16.
Nat Methods ; 14(4): 417-419, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28263959

ABSTRACT

We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.


Subject(s)
Algorithms , RNA Sequence Analysis/methods , Base Composition , Bayes Theorem , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , RNA Sequence Analysis/statistics & numerical data
17.
Biostatistics ; 19(4): 562-578, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29121214

ABSTRACT

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, genes expressing RNA, but not at a sufficient level to be detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results.
Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.


Subject(s)
Gene Expression Profiling/standards , High-Throughput Nucleotide Sequencing/standards , RNA Sequence Analysis/standards , Single-Cell Analysis/standards , Transcriptome , Animals , Humans
18.
Biostatistics ; 19(2): 185-198, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29036413

ABSTRACT

Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and are unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
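
For orientation, the sketch below shows standard quantile normalization and a naive per-group variant in NumPy. It is not qsmooth and does not reproduce the qsmooth package's behavior; it only illustrates the two assumptions the abstract contrasts (one shared distribution versus one distribution per biological group). Function names and the toy matrix are hypothetical.

```python
import numpy as np


def quantile_normalize(X):
    """Standard quantile normalization; rows are features, columns are samples.

    Each column is mapped onto the mean of the sorted columns.
    Ties are broken by sort order rather than averaged.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    reference = np.sort(X, axis=0).mean(axis=1)
    return reference[ranks]


def groupwise_quantile_normalize(X, groups):
    """Quantile-normalize samples separately within each group label."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(X)
    for g in np.unique(groups):
        cols = groups == g
        out[:, cols] = quantile_normalize(X[:, cols])
    return out


# Toy data: 4 features measured in 3 samples.
X = np.array([[5, 4, 3],
              [2, 1, 4],
              [3, 4, 6],
              [4, 2, 8]])
Q = quantile_normalize(X)
G = groupwise_quantile_normalize(X, groups=["a", "a", "b"])
```

After `quantile_normalize`, every column has the same set of values. With the per-group variant, samples share a distribution only within a group, so a global shift between groups is preserved instead of being normalized away.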


Subject(s)
Biostatistics/methods , Data Interpretation, Statistical , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Models, Statistical , Humans
19.
Nucleic Acids Res ; 45(2): e9, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
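
The idea of finding contiguous regions at base resolution without annotation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not derfinder's implementation (derfinder is an R/Bioconductor package): the function name `contiguous_regions` and the `cutoff` parameter are hypothetical, and the input is assumed to be a per-base signal such as mean coverage across samples.

```python
import numpy as np

def contiguous_regions(signal, cutoff):
    """Return half-open (start, end) index pairs for runs where signal > cutoff."""
    above = np.asarray(signal, dtype=float) > cutoff
    # Pad with False so runs touching either end still get a start and an end.
    padded = np.concatenate(([False], above, [False]))
    # Positions where the padded mask flips; they alternate start, end, start, end...
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

# Hypothetical coverage matrix: rows are samples, columns are genomic bases.
coverage = np.array([
    [0, 5, 6, 0, 0, 7],
    [0, 7, 8, 0, 0, 9],
])
mean_coverage = coverage.mean(axis=0)      # per-base mean across samples
regions = contiguous_regions(mean_coverage, cutoff=1.0)
```

Each returned pair marks a candidate region defined purely by the signal, with no reference to known gene boundaries; a real analysis would then test each region for differential expression.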


Subject(s)
Gene Expression Profiling/methods , Software , Gene Expression Regulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome , Web Browser
20.
Nat Methods ; 12(2): 115-21, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25633503

ABSTRACT

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.


Subject(s)
Computational Biology , Gene Expression Profiling , Genomics/methods , High-Throughput Screening Assays/methods , Software , Programming Languages , User-Computer Interface