Results 1 - 20 of 147
1.
Cell ; 184(26): 6281-6298.e23, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34875227

ABSTRACT

While intestinal Th17 cells are critical for maintaining tissue homeostasis, recent studies have implicated their roles in the development of extra-intestinal autoimmune diseases including multiple sclerosis. However, the mechanisms by which tissue Th17 cells mediate these dichotomous functions remain unknown. Here, we characterized the heterogeneity, plasticity, and migratory phenotypes of tissue Th17 cells in vivo by combined fate mapping with profiling of the transcriptomes and TCR clonotypes of over 84,000 Th17 cells at homeostasis and during CNS autoimmune inflammation. Inter- and intra-organ single-cell analyses revealed a homeostatic, stem-like TCF1+ IL-17+ SLAMF6+ population that traffics to the intestine where it is maintained by the microbiota, providing a ready reservoir for the IL-23-driven generation of encephalitogenic GM-CSF+ IFN-γ+ CXCR6+ T cells. Our study defines a direct in vivo relationship between IL-17+ non-pathogenic and GM-CSF+ and IFN-γ+ pathogenic Th17 populations and provides a mechanism by which homeostatic intestinal Th17 cells direct extra-intestinal autoimmune disease.


Subjects
Autoimmunity , Intestines/immunology , Stem Cells/metabolism , Th17 Cells/immunology , Animals , Cell Movement , Clone Cells , Encephalomyelitis, Autoimmune, Experimental/immunology , Granulocyte-Macrophage Colony-Stimulating Factor/metabolism , Homeostasis , Humans , Interferon-gamma/metabolism , Interleukin-17/metabolism , Mice, Inbred C57BL , Organ Specificity , RNA/metabolism , RNA-Seq , Receptors, Antigen, T-Cell/metabolism , Receptors, CXCR6/metabolism , Receptors, Interleukin/metabolism , Reproducibility of Results , Signaling Lymphocytic Activation Molecule Family/metabolism , Single-Cell Analysis , Spleen/metabolism
2.
Cell ; 182(6): 1474-1489.e23, 2020 09 17.
Article in English | MEDLINE | ID: mdl-32841603

ABSTRACT

Widespread changes to DNA methylation and chromatin are well documented in cancer, but the fate of higher-order chromosomal structure remains obscure. Here we integrated topological maps for colon tumors and normal colons with epigenetic, transcriptional, and imaging data to characterize alterations to chromatin loops, topologically associated domains, and large-scale compartments. We found that spatial partitioning of the open and closed genome compartments is profoundly compromised in tumors. This reorganization is accompanied by compartment-specific hypomethylation and chromatin changes. Additionally, we identify a compartment at the interface between the canonical A and B compartments that is reorganized in tumors. Remarkably, similar shifts were evident in non-malignant cells that have accumulated excess divisions. Our analyses suggest that these topological changes repress stemness and invasion programs while inducing anti-tumor immunity genes and may therefore restrain malignant progression. Our findings call into question the conventional view that tumor-associated epigenomic alterations are primarily oncogenic.


Subjects
Chromatin/metabolism , Chromosomes/metabolism , Colorectal Neoplasms/genetics , Colorectal Neoplasms/metabolism , DNA Methylation , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic/genetics , Cell Division , Cellular Senescence/genetics , Chromatin Immunoprecipitation Sequencing , Chromosomes/genetics , Cohort Studies , Colorectal Neoplasms/mortality , Colorectal Neoplasms/pathology , Computational Biology , DNA Methylation/genetics , Epigenomics , HCT116 Cells , Humans , In Situ Hybridization, Fluorescence , Microscopy, Electron, Transmission , Molecular Dynamics Simulation , RNA-Seq , Spatial Analysis , Tumor Suppressor Proteins/genetics , Tumor Suppressor Proteins/metabolism
3.
Nat Methods ; 20(8): 1196-1202, 2023 08.
Article in English | MEDLINE | ID: mdl-37429993

ABSTRACT

Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.


Subjects
Algorithms , Benchmarking , Humans , Animals , Mice , Cluster Analysis , RNA , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods
4.
Nat Methods ; 19(9): 1076-1087, 2022 09.
Article in English | MEDLINE | ID: mdl-36050488

ABSTRACT

A central problem in spatial transcriptomics is detecting differentially expressed (DE) genes within cell types across tissue context. Challenges to learning DE include changing cell type composition across space and measurement pixels detecting transcripts from multiple cell types. Here, we introduce a statistical method, cell type-specific inference of differential expression (C-SIDE), that identifies cell type-specific DE in spatial transcriptomics, accounting for localization of other cell types. We model gene expression as an additive mixture across cell types of log-linear cell type-specific expression functions. C-SIDE's framework applies to many contexts: DE due to pathology, anatomical regions, cell-to-cell interactions and cellular microenvironment. Furthermore, C-SIDE enables statistical inference across multiple replicates. Simulations and validation experiments on Slide-seq, MERFISH and Visium datasets demonstrate that C-SIDE accurately identifies DE with valid uncertainty quantification. Last, we apply C-SIDE to identify plaque-dependent immune activity in Alzheimer's disease and cellular interactions between tumor and immune cells. We distribute C-SIDE within the R package https://github.com/dmcable/spacexr.


Subjects
Gene Expression Profiling , Transcriptome , Gene Expression Profiling/methods
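
Editorial illustration for entry 4: the abstract describes expression as "an additive mixture across cell types of log-linear cell type-specific expression functions". The Python sketch below shows one way such a mean function could be written for a single gene. The function name, argument names, the scaling by total transcripts per pixel, and the single covariate are all assumptions made for illustration; this is not the spacexr API or the authors' estimation procedure.

```python
import numpy as np

def expected_counts(total_umi, proportions, covariate, intercepts, slopes):
    """Expected counts of one gene per pixel (hypothetical helper, not from spacexr).

    total_umi:   (n_pixels,) total transcripts per pixel (assumed scaling factor)
    proportions: (n_pixels, n_types) cell type proportions per pixel
    covariate:   (n_pixels,) explanatory variable, e.g. proximity to a plaque
    intercepts, slopes: (n_types,) log-scale coefficients for each cell type
    """
    # Log-linear expression function for each cell type at each pixel
    log_expr = intercepts[None, :] + covariate[:, None] * slopes[None, :]
    # Additive mixture across cell types, weighted by cell type proportions
    return total_umi * np.sum(proportions * np.exp(log_expr), axis=1)

# Two cell types: type 0 does not respond to the covariate, type 1 does.
proportions = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
covariate = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
total_umi = np.full(5, 1000.0)
intercepts = np.log(np.array([0.01, 0.02]))
slopes = np.array([0.0, 1.0])

mu = expected_counts(total_umi, proportions, covariate, intercepts, slopes)
print(mu)
```

In this toy setup the covariate changes expression only in pixels that contain cell type 1, and the mixed pixel shows an intermediate change. That is why a per-pixel comparison that ignores cell type composition could misattribute the effect.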
5.
Biostatistics ; 23(4): 1150-1164, 2022 10 14.
Article in English | MEDLINE | ID: mdl-35770795

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find limitations with current approaches due to the reliance on known marker genes or from overfitting because of systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Gene Expression , Gene Expression Profiling/methods , Humans , RNA-Seq , Sequence Analysis, RNA/methods , Software
6.
Biostatistics ; 24(1): 1-16, 2022 12 12.
Article in English | MEDLINE | ID: mdl-34467372

ABSTRACT

High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used methods as an unsupervised step for dimensionality reduction. This reduction technique finds linear transformations of the data which explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where similar objects are modeled by nearby points in the low-dimensional embedding and dissimilar objects are modeled by distant points with high probability. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often performed before applying them, which makes them sensitive to PCA's shortcomings. Also, t-SNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. The linear transformations of PCA are preferable to non-linear transformations provided by methods like t-SNE and UMAP for interpretable feature weights. Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters using linear transformations that permit post hoc analysis to determine features that define these latent structures.


Subjects
Algorithms , Humans , Principal Component Analysis
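
Editorial illustration for entry 6: the abstract notes that PCA fails to find discrete latent structure when classes are not separated along directions of maximum variance. The Python sketch below simulates that situation and compares the first principal component with a linear discriminant direction. It is not the iDA algorithm: the discriminant here is computed from known labels purely to show what a discriminative linear projection recovers, and all variable names and simulation settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # observations per latent class

# Feature 0: high variance, unrelated to class.
# Feature 1: low variance, but it carries the class separation.
labels = np.repeat([0, 1], n)
x0 = rng.normal(0.0, 10.0, size=2 * n)
x1 = rng.normal(0.0, 0.5, size=2 * n) + 3.0 * labels
X = np.column_stack([x0, x1])
Xc = X - X.mean(axis=0)

# PCA via SVD: rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# Fisher's linear discriminant direction (uses the labels, for illustration only).
mu0 = X[labels == 0].mean(axis=0)
mu1 = X[labels == 1].mean(axis=0)
Sw = np.cov(X[labels == 0], rowvar=False) + np.cov(X[labels == 1], rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
ld1 = Xc @ w

def separation(z, labels):
    """Absolute difference of class means in units of pooled standard deviation."""
    a, b = z[labels == 0], z[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

sep_pca = separation(pc1, labels)
sep_lda = separation(ld1, labels)
print(f"PC1 separation: {sep_pca:.2f}, discriminant separation: {sep_lda:.2f}")
```

The first principal component follows the noisy high-variance feature and barely separates the classes, while the discriminant direction separates them clearly and remains a linear combination of the original features with interpretable weights.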
7.
Proc Natl Acad Sci U S A ; 117(51): 32772-32778, 2020 12 22.
Article in English | MEDLINE | ID: mdl-33293417

ABSTRACT

Population displacement may occur after natural disasters, permanently altering the demographic composition of the affected regions. Measuring this displacement is vital for both optimal postdisaster resource allocation and calculation of measures of public health interest such as mortality estimates. Here, we analyzed data generated by mobile phones and social media to estimate the weekly island-wide population at risk and within-island geographic heterogeneity of migration in Puerto Rico after Hurricane Maria. We compared these two data sources with population estimates derived from air travel records and census data. We observed a loss of population across all data sources throughout the study period; however, the magnitude and dynamics differ by the data source. Census data predict a population loss of just over 129,000 from July 2017 to July 2018, a 4% decrease; air travel data predict a population loss of 168,295 for the same period, a 5% decrease; mobile phone-based estimates predict a loss of 235,375 from July 2017 to May 2018, an 8% decrease; and social media-based estimates predict a loss of 476,779 from August 2017 to August 2018, a 17% decrease. On average, municipalities with a smaller population size lost a bigger proportion of their population. Moreover, we infer that these municipalities experienced greater infrastructure damage as measured by the proportion of unknown locations stemming from these regions. Finally, our analysis measures a general shift of population from rural to urban centers within the island. Passively collected data provide a promising supplement to current at-risk population estimation procedures; however, each data source has its own biases and limitations.

8.
EMBO J ; 37(6)2018 03 15.
Article in English | MEDLINE | ID: mdl-29335281

ABSTRACT

In the post-genomic era, thousands of putative noncoding regulatory regions have been identified, such as enhancers, promoters, long noncoding RNAs (lncRNAs), and a cadre of small peptides. These ever-growing catalogs require high-throughput assays to test their functionality at scale. Massively parallel reporter assays have greatly enhanced the understanding of noncoding DNA elements en masse. Here, we present a massively parallel RNA assay (MPRNA) that can assay 10,000 or more RNA segments for RNA-based functionality. We applied MPRNA to identify RNA-based nuclear localization domains harbored in lncRNAs. We examined a pool of 11,969 oligos densely tiling 38 human lncRNAs that were fused to a cytosolic transcript. After cell fractionation and barcode sequencing, we identified 109 unique RNA regions that significantly enriched this cytosolic transcript in the nucleus including a cytosine-rich motif. These nuclear enrichment sequences are highly conserved and over-represented in global nuclear fractionation sequencing. Importantly, many of these regions were independently validated by single-molecule RNA fluorescence in situ hybridization. Overall, we demonstrate the utility of MPRNA for future investigation of RNA-based functionalities.


Subjects
RNA, Long Noncoding/genetics , Cell Nucleus/genetics , HeLa Cells , High-Throughput Nucleotide Sequencing , Humans , In Situ Hybridization, Fluorescence , Sequence Analysis, RNA
9.
Development ; 146(6)2019 03 28.
Article in English | MEDLINE | ID: mdl-30923056

ABSTRACT

Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering crucial events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved long noncoding RNAs (lncRNAs) that are regulated in a tissue-specific manner and exhibit spatiotemporal expression during neurogenesis with exquisite specificity. lncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource for investigating the function of coding genes and noncoding RNAs during crucial stages of early neurogenesis.


Subjects
Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental , Nervous System/embryology , Neurogenesis/genetics , RNA, Long Noncoding/genetics , Animals , Cell Lineage , Drosophila melanogaster/metabolism , Flow Cytometry , Gene Expression Profiling , Gene Regulatory Networks , In Situ Hybridization, Fluorescence , Neuroglia/physiology , Phylogeny , Transcriptome
10.
Epidemiology ; 33(3): 346-353, 2022 05 01.
Article in English | MEDLINE | ID: mdl-35383642

ABSTRACT

Quantifying the impact of natural disasters or epidemics is critical for guiding policy decisions and interventions. When the effects of an event are long-lasting and difficult to detect in the short term, the accumulated effects can be devastating. Mortality is one of the most reliably measured health outcomes, partly due to its unambiguous definition. As a result, excess mortality estimates are an increasingly effective approach for quantifying the effect of an event. However, the fact that indirect effects are often characterized by small, but enduring, increases in mortality rates presents a statistical challenge. This is compounded by sources of variability introduced by demographic changes, secular trends, seasonal and day of the week effects, and natural variation. Here, we present a model that accounts for these sources of variability and characterizes concerning increases in mortality rates with smooth functions of time that provide statistical power. The model permits discontinuities in the smooth functions to model sudden increases due to direct effects. We implement a flexible estimation approach that permits both surveillance of concerning increases in mortality rates and careful characterization of the effect of a past event. We demonstrate our tools' utility by estimating excess mortality after hurricanes in the United States and Puerto Rico. We use Hurricane Maria as a case study to show appealing properties that are unique to our method compared with current approaches. Finally, we show the flexibility of our approach by detecting and quantifying the 2014 Chikungunya outbreak in Puerto Rico and the COVID-19 pandemic in the United States. We make our tools available through the excessmort R package available from https://cran.r-project.org/web/packages/excessmort/.


Subjects
COVID-19 , Cyclonic Storms , Humans , Pandemics , Puerto Rico/epidemiology , United States/epidemiology
11.
N Engl J Med ; 379(2): 162-170, 2018 Jul 12.
Article in English | MEDLINE | ID: mdl-29809109

ABSTRACT

BACKGROUND: Quantifying the effect of natural disasters on society is critical for recovery of public health services and infrastructure. The death toll can be difficult to assess in the aftermath of a major disaster. In September 2017, Hurricane Maria caused massive infrastructural damage to Puerto Rico, but its effect on mortality remains contentious. The official death count is 64. METHODS: Using a representative, stratified sample, we surveyed 3299 randomly chosen households across Puerto Rico to produce an independent estimate of all-cause mortality after the hurricane. Respondents were asked about displacement, infrastructure loss, and causes of death. We calculated excess deaths by comparing our estimated post-hurricane mortality rate with official rates for the same period in 2016. RESULTS: From the survey data, we estimated a mortality rate of 14.3 deaths (95% confidence interval [CI], 9.8 to 18.9) per 1000 persons from September 20 through December 31, 2017. This rate yielded a total of 4645 excess deaths during this period (95% CI, 793 to 8498), equivalent to a 62% increase in the mortality rate as compared with the same period in 2016. However, this number is likely to be an underestimate because of survivor bias. The mortality rate remained high through the end of December 2017, and one third of the deaths were attributed to delayed or interrupted health care. Hurricane-related migration was substantial. CONCLUSIONS: This household-based survey suggests that the number of excess deaths related to Hurricane Maria in Puerto Rico is more than 70 times the official estimate. (Funded by the Harvard T.H. Chan School of Public Health and others.).


Subjects
Cyclonic Storms , Disasters/statistics & numerical data , Health Services Accessibility/statistics & numerical data , Mortality , Adolescent , Adult , Age Distribution , Aged , Aged, 80 and over , Cause of Death , Child , Child, Preschool , Female , Humans , Male , Middle Aged , Mortality, Premature , Puerto Rico/epidemiology , Surveys and Questionnaires , Young Adult
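
Editorial illustration for entry 11: the "more than 70 times" statement and the 62% increase can be checked directly from the figures quoted in the abstract. The short Python sketch below does only that arithmetic. The implied 2016 baseline rate is derived here from the abstract's rounded numbers and is an assumption of this sketch, not a value stated in the abstract.

```python
# Figures quoted in the abstract
official_count = 64
excess_deaths = 4645
ci_low, ci_high = 793, 8498
post_rate_per_1000 = 14.3   # September 20 through December 31, 2017
relative_increase = 0.62    # versus the same period in 2016

# Ratio of estimated excess deaths to the official count (about 72.6)
ratio = excess_deaths / official_count
ratio_low = ci_low / official_count
ratio_high = ci_high / official_count

# Baseline rate implied by a 62% increase (derived, not quoted in the abstract)
implied_baseline_per_1000 = post_rate_per_1000 / (1 + relative_increase)

print(f"Excess deaths / official count: {ratio:.1f} "
      f"(interval {ratio_low:.1f} to {ratio_high:.1f})")
print(f"Implied 2016 rate: {implied_baseline_per_1000:.2f} per 1000")
```

The width of the interval on the ratio mirrors the wide confidence interval on the excess death estimate, so the "more than 70 times" figure refers to the point estimate.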
12.
Ann Intern Med ; 173(12): 1004-1007, 2020 12 15.
Article in English | MEDLINE | ID: mdl-32915654

ABSTRACT

As of mid-August 2020, more than 170 000 U.S. residents have died of coronavirus disease 2019 (COVID-19); however, the true number of deaths resulting from COVID-19, both directly and indirectly, is likely to be much higher. The proper attribution of deaths to this pandemic has a range of societal, legal, mortuary, and public health consequences. This article discusses the current difficulties of disaster death attribution and describes the strengths and limitations of relying on death counts from death certificates, estimations of indirect deaths, and estimations of excess mortality. Improving the tabulation of direct and indirect deaths on death certificates will require concerted efforts and consensus across medical institutions and public health agencies. In addition, actionable estimates of excess mortality will require timely access to standardized and structured vital registry data, which should be shared directly at the state level to ensure rapid response for local governments. Correct attribution of direct and indirect deaths and estimation of excess mortality are complementary goals that are critical to our understanding of the pandemic and its effect on human life.


Subjects
COVID-19/mortality , Pandemics , Registries , SARS-CoV-2 , Cause of Death/trends , Humans , Survival Rate/trends
13.
Genome Res ; 27(11): 1930-1938, 2017 11.
Article in English | MEDLINE | ID: mdl-29025895

ABSTRACT

The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of functional genomics' public catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, as a result of the experimental protocol's lack of perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC effect varies across experiments, with the effect strong enough to result in a substantial number of peaks called differently when different laboratories perform experiments on the same cell line. However, accounting for GC content bias in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signals with unwanted variability. To account for this challenge, we introduce a statistical approach that accounts for GC effects on both nonspecific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.


Subjects
Base Composition , DNA/metabolism , Sequence Analysis, DNA/methods , Algorithms , Binding Sites , Chromatin Immunoprecipitation , DNA/chemistry , False Positive Reactions , Genomics , High-Throughput Nucleotide Sequencing
14.
Biostatistics ; 20(3): 367-383, 2019 07 01.
Article in English | MEDLINE | ID: mdl-29481604

ABSTRACT

With recent advances in sequencing technology, it is now feasible to measure DNA methylation at tens of millions of sites across the entire genome. In most applications, biologists are interested in detecting differentially methylated regions, composed of multiple sites with differing methylation levels among populations. However, current computational approaches for detecting such regions do not provide accurate statistical inference. A major challenge in reporting uncertainty is that a genome-wide scan is involved in detecting these regions, which needs to be accounted for. A further challenge is that sample sizes are limited due to the costs associated with the technology. We have developed a new approach that overcomes these challenges and assesses uncertainty for differentially methylated regions in a rigorous manner. Region-level statistics are obtained by fitting a generalized least squares regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. We develop an inferential approach, based on a pooled null distribution, that can be implemented even when as few as two samples per population are available. Here, we demonstrate the advantages of our method using both experimental data and Monte Carlo simulation. We find that the new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate.


Subjects
DNA Methylation , Genomics/methods , Models, Statistical , Sequence Analysis, DNA/methods , Animals , Computer Simulation , Genomics/standards , Humans , Sequence Analysis, DNA/standards , Uncertainty
15.
Nat Methods ; 14(4): 417-419, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28263959

ABSTRACT

We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.


Subjects
Algorithms , Sequence Analysis, RNA/methods , Base Composition , Bayes Theorem , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data
16.
Biostatistics ; 19(4): 562-578, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29121214

ABSTRACT

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, genes expressing RNA, but not at a sufficient level to be detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem.


Subjects
Gene Expression Profiling/standards , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, RNA/standards , Single-Cell Analysis/standards , Transcriptome , Animals , Humans
17.
Biostatistics ; 19(2): 185-198, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29036413

ABSTRACT

Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.


Subjects
Biostatistics/methods , Data Interpretation, Statistical , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Models, Statistical , Humans
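
Editorial illustration for entry 17: the abstract contrasts standard quantile normalization, which forces every sample to share one distribution, with the assumption that distributions should match only within biological groups. The Python sketch below implements standard quantile normalization and a plain within-group variant to show that contrast. It is not the qsmooth algorithm or the API of the qsmooth package; the function names are hypothetical and ties are handled naively.

```python
import numpy as np

def quantile_normalize(X):
    """Standard quantile normalization of a features-by-samples matrix."""
    order = np.argsort(X, axis=0)                    # sort order within each sample
    sorted_X = np.take_along_axis(X, order, axis=0)
    reference = sorted_X.mean(axis=1)                # mean of each quantile across samples
    out = np.empty_like(X, dtype=float)
    # Put the reference quantiles back in each sample's original feature order
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

def groupwise_quantile_normalize(X, groups):
    """Quantile-normalize samples separately within each group (illustrative only)."""
    groups = np.asarray(groups)
    out = np.empty_like(X, dtype=float)
    for g in np.unique(groups):
        cols = groups == g
        out[:, cols] = quantile_normalize(X[:, cols])
    return out

# Toy data: six samples, the last three shifted upward as a global biological difference
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[:, 3:] += 5.0
groups = [0, 0, 0, 1, 1, 1]

global_qn = quantile_normalize(X)
group_qn = groupwise_quantile_normalize(X, groups)

print("group mean difference after global QN:",
      round(global_qn[:, 3:].mean() - global_qn[:, :3].mean(), 3))
print("group mean difference after within-group QN:",
      round(group_qn[:, 3:].mean() - group_qn[:, :3].mean(), 3))
```

Global quantile normalization erases the shift between the two groups, whereas the within-group variant preserves it while still aligning samples inside each group. These are the two assumptions the abstract contrasts.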
18.
Nucleic Acids Res ; 45(2): e9, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.


Subjects
Gene Expression Profiling/methods , Software , Gene Expression Regulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome , Web Browser
19.
Nat Methods ; 12(2): 115-21, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25633503

ABSTRACT

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.


Subjects
Computational Biology , Gene Expression Profiling , Genomics/methods , High-Throughput Screening Assays/methods , Software , Programming Languages , User-Computer Interface
20.
Cancer Causes Control ; 28(2): 167-176, 2017 02.
Article in English | MEDLINE | ID: mdl-28097472

ABSTRACT

Molecular pathological epidemiology (MPE) is a transdisciplinary and relatively new scientific discipline that integrates theory, methods, and resources from epidemiology, pathology, biostatistics, bioinformatics, and computational biology. The underlying objective of MPE research is to better understand the etiology and progression of complex and heterogeneous human diseases with the goal of informing prevention and treatment efforts in population health and clinical medicine. Although MPE research has been commonly applied to investigating breast, lung, and colorectal cancers, its methodology can be used to study most diseases. Recent successes in MPE studies include: (1) the development of new statistical methods to address etiologic heterogeneity; (2) the enhancement of causal inference; (3) the identification of previously unknown exposure-subtype disease associations; and (4) better understanding of the role of lifestyle/behavioral factors in modifying prognosis according to disease subtype. Central challenges to MPE include the relative lack of transdisciplinary experts, educational programs, and forums to discuss issues related to the advancement of the field. To address these challenges, highlight recent successes in the field, and identify new opportunities, a series of MPE meetings has been held at the Dana-Farber Cancer Institute in Boston, MA. Herein, we share the proceedings of the Third International MPE Meeting, held in May 2016 and attended by 150 scientists from 17 countries. Special topics included integration of MPE with immunology and health disparity research. This meeting series will continue to provide an impetus to foster further transdisciplinary integration of divergent scientific fields.


Subjects
Epidemiology , Neoplasms , Pathology, Molecular , Boston , Humans