1.
Genome Res ; 31(2): 337-347, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33361113

ABSTRACT

Understanding the changes in diverse molecular pathways underlying the development of breast tumors is critical for improving diagnosis, treatment, and drug development. Here, we used RNA-profiling of canine mammary tumors (CMTs) coupled with a robust analysis framework to model molecular changes in human breast cancer. Our study leveraged a key advantage of the canine model, the frequent presence of multiple naturally occurring tumors at diagnosis, thus providing samples spanning normal tissue and benign and malignant tumors from each patient. We showed that human breast cancer signals, at both the expression and mutation levels, are evident in CMTs. Profiling multiple tumors per patient enabled by the CMT model allowed us to resolve statistically robust transcription patterns and biological pathways specific to malignant tumors versus those arising in benign tumors or shared with normal tissues. We showed that multiple histological samples per patient are necessary to effectively capture these progression-related signatures, and that carcinoma-specific signatures are predictive of survival for human breast cancer patients. To catalyze and support similar analyses and use of the CMT model by other biomedical researchers, we provide FREYA, a robust data processing pipeline and statistical analyses framework.

2.
PLoS Genet ; 17(1): e1009241, 2021 01.
Article in English | MEDLINE | ID: mdl-33465078

ABSTRACT

FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of FST to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.
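The idea of anchoring kinship estimates to the minimum kinship in the dataset can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `kinship_estimate`, the input layout (one list of 0/1/2 allele counts per locus), and the assumption that the least related pair has kinship zero are all introduced here. It relies on the fact that, for allele counts x, the average of (x_j - 1)(x_k - 1) - 1 across loci is proportional to -(1 - kinship_jk), so dividing by the most negative value cancels the unknown proportionality constant.

```python
def kinship_estimate(genotypes):
    """Sketch of a minimum-anchored method-of-moments kinship estimate.

    genotypes: list of loci; each locus is a list of allele counts (0, 1 or 2),
    one per individual. Assumes (hypothetically) that the least related pair
    of individuals has true kinship zero.
    """
    m = len(genotypes)          # number of loci
    n = len(genotypes[0])       # number of individuals
    # a[j][k] = mean over loci of (x_j - 1)(x_k - 1), minus 1; always <= 0
    a = [
        [sum((row[j] - 1) * (row[k] - 1) for row in genotypes) / m - 1.0
         for k in range(n)]
        for j in range(n)
    ]
    a_min = min(min(r) for r in a)   # most negative value = least related pair
    if a_min == 0.0:
        raise ValueError("degenerate data: no variation between individuals")
    # Rescale so that the least related pair gets kinship 0
    return [[1.0 - a[j][k] / a_min for k in range(n)] for j in range(n)]
```

In this sketch the diagonal holds self-kinship, so an individual that is homozygous at every locus receives the value 1. The quality of the estimate depends entirely on how well the minimum is estimated, which matches the condition stated in the abstract.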


Subjects
Population Genetics/statistics & numerical data, Genome-Wide Association Study/statistics & numerical data, Inbreeding, Genetic Models, Genotype, Humans, Pedigree, Single Nucleotide Polymorphism/genetics
3.
Biostatistics ; 22(1): 68-81, 2021 01 28.
Article in English | MEDLINE | ID: mdl-31135886

ABSTRACT

The false discovery rate (FDR) measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the FDR. We develop a new framework for formulating and estimating FDRs and q-values when an additional piece of information, which we call an "informative variable", is available. For a given test, the informative variable provides information about the prior probability a null hypothesis is true or the power of that particular test. The FDR is then treated as a function of this informative variable. We consider two applications in genomics. Our first application is a genetics of gene expression (eQTL) experiment in yeast where every genetic marker and gene expression trait pair are tested for associations. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.
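For orientation, the baseline that this abstract generalizes can be sketched as follows. The code computes ordinary q-values from p-values with one global estimate of the null proportion; it does not implement the functional FDR described above, in which that proportion would vary with the informative variable. The function name `qvalues` and the fixed tuning value `lam=0.5` are assumptions made for this illustration.

```python
def qvalues(pvalues, lam=0.5):
    """Sketch of standard q-value estimation with a single global pi0.

    pi0 (the proportion of true nulls) is estimated from the fraction of
    p-values above `lam`, since large p-values come mostly from nulls.
    """
    m = len(pvalues)
    if m == 0:
        return []
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q
```

A crude way to use an informative variable with this baseline would be to bin the tests by that variable and call `qvalues` within each bin; the framework in the abstract instead treats the FDR as a function of the variable.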


Subjects
Genetic Markers, Genomics, Animals, Gene Expression Profiling, Genetic Markers/genetics, Genomics/methods, Mice, Probability, Saccharomyces cerevisiae/genetics
4.
Bioinformatics ; 37(3): 367-374, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32818252

ABSTRACT

MOTIVATION: Analysis of biological data often involves the simultaneous testing of thousands of genes. This requires two key steps: the ranking of genes and the selection of important genes based on a significance threshold. One such testing procedure, called the optimal discovery procedure (ODP), leverages information across different tests to provide an optimal ranking of genes. This approach can lead to substantial improvements in statistical power compared to other methods. However, current applications of the ODP have only been established for simple study designs using microarray technology. Here, we extend this work to the analysis of complex study designs and RNA-sequencing studies. RESULTS: We apply our extended framework to a static RNA-sequencing study, a longitudinal study, an independent sampling time-series study, and an independent sampling dose-response study. Our method shows improved performance compared to other testing procedures, finding more differentially expressed genes and increasing power for enrichment analysis. Thus, the extended ODP enables a favorable significance analysis of genome-wide gene expression studies. AVAILABILITY AND IMPLEMENTATION: The algorithm is implemented in our freely available R package called edge and can be downloaded at https://www.bioconductor.org/packages/release/bioc/html/edge.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, Gene Expression, Longitudinal Studies, RNA Sequence Analysis
5.
Bioinformatics ; 32(5): 713-21, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26545820

ABSTRACT

MOTIVATION: Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure. The most prominent work on this problem has focused on estimating a model of admixture proportions of ancestral populations for each individual. Here, we instead focus on modeling variation of the genotypes without requiring a higher-level admixture interpretation. RESULTS: We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly admixture model as a special case. Noting some drawbacks of this approach, we introduce a new 'logistic factor analysis' framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the Human Genome Diversity Panel and 1000 Genomes Project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions. AVAILABILITY AND IMPLEMENTATION: A Bioconductor R package called lfa is available at http://www.bioconductor.org/packages/release/bioc/html/lfa.html CONTACT: jstorey@princeton.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
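The logistic factor model described in the abstract can be written compactly as below. The notation is an assumption introduced for illustration (the abstract defines no symbols): x_{ij} is the genotype of individual j at SNP i, \pi_{ij} is the underlying allele probability, h_{kj} are d latent variables capturing population structure, and a_{ik} are SNP-specific coefficients. The binomial sampling step is likewise an assumption of this sketch.

```latex
x_{ij} \mid \pi_{ij} \sim \mathrm{Binomial}(2, \pi_{ij}),
\qquad
\operatorname{logit}(\pi_{ij})
  = \log\frac{\pi_{ij}}{1 - \pi_{ij}}
  = \sum_{k=1}^{d} a_{ik}\, h_{kj}
```

Modeling on the logit scale keeps every fitted \pi_{ij} strictly between 0 and 1, which a linear model on the probability scale does not guarantee.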


Subjects
Genetic Variation, Algorithms, Genotype, Humans, Statistical Models, Probability, Software
6.
Nucleic Acids Res ; 43(20): e131, 2015 Nov 16.
Article in English | MEDLINE | ID: mdl-26130709

ABSTRACT

Understanding the differences between microarray and RNA-Seq technologies for measuring gene expression is necessary for informed design of experiments and choice of data analysis methods. Previous comparisons have come to sometimes contradictory conclusions, which we suggest result from a lack of attention to the intensity-dependent nature of variation generated by the technologies. To examine this trend, we carried out a parallel nested experiment performed simultaneously on the two technologies that systematically split variation into four stages (treatment, biological variation, library preparation and chip/lane noise), allowing a separation and comparison of the sources of variation in a well-controlled cellular system, Saccharomyces cerevisiae. With this novel dataset, we demonstrate that power and accuracy are more dependent on per-gene read depth in RNA-Seq than they are on fluorescence intensity in microarrays. However, we carried out quantitative PCR validations which indicate that microarrays may demonstrate greater systematic bias in low-intensity genes than in RNA-seq.


Subjects
Gene Expression Profiling/methods, Oligonucleotide Array Sequence Analysis/methods, RNA Sequence Analysis/methods, Statistical Data Interpretation, Fungal Genes, Polymerase Chain Reaction, Saccharomyces cerevisiae/genetics
7.
Proc Natl Acad Sci U S A ; 111(6): E645-54, 2014 Feb 11.
Article in English | MEDLINE | ID: mdl-24469817

ABSTRACT

A problem of substantial interest is to systematically map variation in chromatin structure to gene-expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity and genome-wide gene-expression levels in 20 diverse human cell types. We show that ∼25% of genes show cell-type-specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., ±200 kb) capture more genes with this relationship than local regions (e.g., ±2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type-specific expression, which we show have a range of contextual uses. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene-expression variation.


Subjects
Chromatin/physiology, Gene Expression Profiling, Chromatin/chemistry, Deoxyribonuclease I/metabolism, Humans, Single Nucleotide Polymorphism, Transcriptional Regulatory Elements
8.
Bioinformatics ; 31(4): 545-54, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25336500

ABSTRACT

MOTIVATION: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. RESULTS: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. 
The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. AVAILABILITY AND IMPLEMENTATION: An R software package, called jackstraw, is available in CRAN. CONTACT: jstorey@princeton.edu.


Subjects
Algorithms, Statistical Data Interpretation, Genomics/methods, Inflammation/genetics, Statistical Models, Principal Component Analysis, Software, Computer Simulation, Gene Expression Profiling, cdc Genes, Genetic Variation, Humans, Microarray Analysis, Phenotype, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae Proteins/genetics, Post-Traumatic Stress Disorders/genetics
9.
PLoS Comput Biol ; 11(11): e1004509, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26575353

ABSTRACT

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.


Subjects
Computational Biology/methods, Statistical Models, Tertiary Protein Structure, Proteins/chemistry, Algorithms, Protein Databases
10.
Bioinformatics ; 30(23): 3424-6, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25189781

ABSTRACT

MOTIVATION: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment. RESULTS: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth. AVAILABILITY AND IMPLEMENTATION: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.
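The subsampling step can be illustrated with a small sketch. The abstract does not spell out the sampling scheme, so the code below assumes binomial thinning, in which each read is kept independently with a fixed probability. The function name `subsample_counts` and its parameters are hypothetical and are not the subSeq API, which is an R package.

```python
import random


def subsample_counts(counts, proportion, seed=0):
    """Sketch: simulate a lower sequencing depth by binomial thinning.

    counts: per-gene read counts from the full-depth experiment.
    proportion: fraction of reads to keep, between 0 and 1.
    Each read is kept independently with probability `proportion`.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)  # seeded so a given depth is reproducible
    return [sum(rng.random() < proportion for _ in range(c)) for c in counts]
```

Running a differential expression analysis on `subsample_counts(counts, p)` for a grid of proportions, and plotting the number of significant genes against `p`, would show whether power has levelled off at the full depth.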


Subjects
High-Throughput Nucleotide Sequencing/methods, RNA Sequence Analysis/methods, Software, Animals, Rats
11.
Development ; 138(22): 4867-74, 2011 Nov.
Article in English | MEDLINE | ID: mdl-22007136

ABSTRACT

A crucial issue in studies of morphogen gradients relates to their range: the distance over which they can act as direct regulators of cell signaling, gene expression and cell differentiation. To address this, we present a straightforward statistical framework that can be used in multiple developmental systems. We illustrate the developed approach by providing a point estimate and confidence interval for the spatial range of the graded distribution of nuclear Dorsal, a transcription factor that controls the dorsoventral pattern of the Drosophila embryo.


Subjects
Biostatistics/methods, Computational Biology, Drosophila Proteins/metabolism, Drosophila/embryology, Drosophila/metabolism, Developmental Genes, Morphogenesis/genetics, Animals, Ovum Cleavage Stage/metabolism, Computational Biology/methods, Computational Biology/statistics & numerical data, Computer Simulation, Drosophila/genetics, Drosophila Proteins/analysis, Drosophila Proteins/genetics, Nonmammalian Embryo/chemistry, Nonmammalian Embryo/metabolism, Developmental Gene Expression Regulation, Developmental Genes/physiology, Three-Dimensional Imaging, Fluorescence In Situ Hybridization, Morphogenesis/physiology, Osmolar Concentration, Tissue Distribution/genetics
12.
Proc Natl Acad Sci U S A ; 108(9): 3707-12, 2011 Mar 01.
Article in English | MEDLINE | ID: mdl-21317363

ABSTRACT

A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq to the sufficient depth for the discovery of transcriptome elements relevant to the disease process followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.


Subjects
Gene Expression Profiling/methods, High-Throughput Screening Assays/methods, Oligonucleotide Array Sequence Analysis/methods, Alternative Splicing/genetics, Exons/genetics, Humans, Organ Specificity/genetics, Untranslated RNA/genetics, Reproducibility of Results, RNA Sequence Analysis
13.
BMC Bioinformatics ; 14: 360, 2013 Dec 12.
Article in English | MEDLINE | ID: mdl-24330332

ABSTRACT

BACKGROUND: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set replicates in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate in the bagged samples. RESULTS: Using both simulated and publicly-available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this method reflects replication better than each set's p-value. CONCLUSIONS: Our results suggest that gene lists based on p-values are not necessarily stable, and therefore additional steps like gene set bagging may improve biological inference on gene sets.
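The resampling loop behind gene set bagging can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: resampling is taken to be a bootstrap over samples, and `is_significant` is a hypothetical user-supplied callback that runs whatever gene-set analysis is in use on the resampled data and reports whether the set of interest was called significant.

```python
import random


def replication_probability(samples, is_significant, n_boot=200, seed=0):
    """Sketch: estimate R, the fraction of bootstrap replicates in which
    a gene set is called significant.

    samples: list of per-sample records (any type the callback understands).
    is_significant: hypothetical callback taking a resampled list of samples
        and returning True if the gene set is significant in that replicate.
    """
    rng = random.Random(seed)
    n = len(samples)
    hits = 0
    for _ in range(n_boot):
        # Draw n samples with replacement from the original data
        resampled = [samples[rng.randrange(n)] for _ in range(n)]
        if is_significant(resampled):
            hits += 1
    return hits / n_boot
```

A gene set with a small p-value in the original data but a low value from this function would be one whose significance depends on a few particular samples, which is the instability the abstract describes.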


Subjects
DNA Methylation/genetics, DNA Replication/genetics, Genomics/methods, Algorithms, Binding Sites/genetics, Brain Chemistry/genetics, Computer Simulation, Factual Databases, Gene Expression Profiling/methods, Human Genome, Genomics/trends, Humans, Oligonucleotide Array Sequence Analysis/methods, Predictive Value of Tests, Probability, Sample Size, Smoking/genetics, Transcription Factors/genetics, Transcription Factors/metabolism
14.
Bioinformatics ; 28(6): 882-3, 2012 Mar 15.
Article in English | MEDLINE | ID: mdl-22257669

ABSTRACT

Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known sources of latent variation in genomic experiments are batch effects, which arise when samples are processed on different days, in different groups or by different people. However, there are also a large number of other variables that may have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.


Subjects
Software, Gene Expression Profiling, Genomics, High-Throughput Nucleotide Sequencing, Humans, Regression Analysis, Urinary Bladder Neoplasms/genetics
15.
Bioinformatics ; 27(4): 509-15, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21186247

ABSTRACT

MOTIVATION: It is well known that patterns of differential gene expression across biological conditions are often shared by many genes, particularly those within functional groups. Taking advantage of these patterns can lead to increased statistical power and biological clarity when testing for differential expression in a microarray experiment. The optimal discovery procedure (ODP), which maximizes the expected number of true positives for each fixed number of expected false positives, is a framework aimed at this goal. Storey et al. introduced an estimator of the ODP for identifying differentially expressed genes. However, their ODP estimator grows quadratically in computational time with respect to the number of genes. Reducing this computational burden is a key step in making the ODP practical for usage in a variety of high-throughput problems. RESULTS: Here, we propose a new estimate of the ODP called the modular ODP (mODP). The existing 'full ODP' requires that the likelihood function for each gene be evaluated according to the parameter estimates for all genes. The mODP assigns genes to modules according to a Kullback-Leibler distance, and then evaluates the statistic only at the module-averaged parameter estimates. We show that the mODP is relatively insensitive to the choice of the number of modules, but dramatically reduces the computational complexity from quadratic to linear in the number of genes. We compare the full ODP algorithm and mODP on simulated data and gene expression data from a recent study of Moroccan Amazighs. The mODP and full ODP algorithm perform very similarly across a range of comparisons. AVAILABILITY: The mODP methodology has been implemented into EDGE, a comprehensive gene expression analysis software package in R, available at http://genomine.org/edge/.


Subjects
Algorithms, Computational Biology/methods, Gene Expression Profiling/methods, Oligonucleotide Array Sequence Analysis/methods, Computer Simulation, Gene Expression, Humans, Likelihood Functions, Statistical Models, Software
16.
PLoS Med ; 8(9): e1001093, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21931541

ABSTRACT

BACKGROUND: Trauma is the number one killer of individuals 1-44 y of age in the United States. The prognosis and treatment of inflammatory complications in critically injured patients continue to be challenging, with a history of failed clinical trials and poorly understood biology. New approaches are therefore needed to improve our ability to diagnose and treat this clinical condition. METHODS AND FINDINGS: We conducted a large-scale study on 168 blunt-force trauma patients over 28 d, measuring ∼400 clinical variables and longitudinally profiling leukocyte gene expression with ∼800 microarrays. Marshall MOF (multiple organ failure) clinical score trajectories were first utilized to organize the patients into five categories of increasingly poor outcomes. We then developed an analysis framework modeling early within-patient expression changes to produce a robust characterization of the genomic response to trauma. A quarter of the genome shows early expression changes associated with longer-term post-injury complications, captured by at least five dynamic co-expression modules of functionally related genes. In particular, early down-regulation of MHC-class II genes and up-regulation of p38 MAPK signaling pathway were found to strongly associate with longer-term post-injury complications, providing discrimination among patient outcomes from expression changes during the 40-80 h window post-injury. CONCLUSIONS: The genomic characterization provided here substantially expands the scope by which the molecular response to trauma may be characterized and understood. These results may be instrumental in furthering our understanding of the disease process and identifying potential targets for therapeutic intervention. Additionally, the quantitative approach we have introduced is potentially applicable to future genomics studies of rapidly progressing clinical conditions. TRIAL REGISTRATION: ClinicalTrials.gov NCT00257231


Subjects
Gene Expression Regulation, Inflammation/complications, MAP Kinase Signaling System, Multiple Organ Failure/immunology, Wounds and Injuries/immunology, Adolescent, Adult, Antigen Presentation, Female, MHC Class II Genes, Humans, Inflammation/immunology, Leukocytes/cytology, Leukocytes/immunology, Longitudinal Studies, Male, Middle Aged, Multiple Organ Failure/etiology, Multiple Organ Failure/physiopathology, Oligonucleotide Array Sequence Analysis, Prospective Studies, Wounds and Injuries/complications, Wounds and Injuries/genetics, Wounds and Injuries/physiopathology, Young Adult
17.
Bioinformatics ; 26(10): 1308-15, 2010 May 15.
Article in English | MEDLINE | ID: mdl-20363728

ABSTRACT

MOTIVATION: A major challenge in utilizing microarray technologies to measure nucleic acid abundances is 'normalization', the goal of which is to separate biologically meaningful signal from other confounding sources of signal, often due to unavoidable technical factors. It is intuitively clear that true biological signal and confounding factors need to be simultaneously considered when performing normalization. However, the most popular normalization approaches do not utilize what is known about the study, both in terms of the biological variables of interest and the known technical factors in the study, such as batch or array processing date. RESULTS: We show here that failing to include all study-specific biological and technical variables when performing normalization leads to biased downstream analyses. We propose a general normalization framework that fits a study-specific model employing every known variable that is relevant to the expression study. The proposed method is generally applicable to the full range of existing probe designs, as well as to both single-channel and dual-channel arrays. We show through real and simulated examples that the method has favorable operating characteristics in comparison to some of the most highly used normalization methods. AVAILABILITY: An R package called snm implementing the methodology will be made available from Bioconductor (http://bioconductor.org). CONTACT: jstorey@princeton.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Oligonucleotide Array Sequence Analysis/methods, Algorithms, Computational Biology/methods, Gene Expression Profiling
18.
PLoS Biol ; 6(5): e107, 2008 May 06.
Article in English | MEDLINE | ID: mdl-18462017

ABSTRACT

Genetic variants that are associated with common human diseases do not lead directly to disease, but instead act on intermediate, molecular phenotypes that in turn induce changes in higher-order disease traits. Therefore, identifying the molecular phenotypes that vary in response to changes in DNA and that also associate with changes in disease traits has the potential to provide the functional information required to not only identify and validate the susceptibility genes that are directly affected by changes in DNA, but also to understand the molecular networks in which such genes operate and how changes in these networks lead to changes in disease traits. Toward that end, we profiled more than 39,000 transcripts and we genotyped 782,476 unique single nucleotide polymorphisms (SNPs) in more than 400 human liver samples to characterize the genetic architecture of gene expression in the human liver, a metabolically active tissue that is important in a number of common human diseases, including obesity, diabetes, and atherosclerosis. This genome-wide association study of gene expression resulted in the detection of more than 6,000 associations between SNP genotypes and liver gene expression traits, where many of the corresponding genes identified have already been implicated in a number of human diseases. The utility of these data for elucidating the causes of common human diseases is demonstrated by integrating them with genotypic and expression data from other human and mouse populations. This provides much-needed functional support for the candidate susceptibility genes being identified at a growing number of genetic loci that have been identified as key drivers of disease from genome-wide association studies of disease. 
By using an integrative genomics approach, we highlight how the gene RPS26 and not ERBB3 is supported by our data as the most likely susceptibility gene for a novel type 1 diabetes locus recently identified in a large-scale, genome-wide association study. We also identify SORT1 and CELSR2 as candidate susceptibility genes for a locus recently associated with coronary artery disease and plasma low-density lipoprotein cholesterol levels in the process.


Subjects
Gene Expression Profiling , Genetic Predisposition to Disease/genetics , Liver/metabolism , Polymorphism, Single Nucleotide/genetics , Transcription, Genetic/genetics , Adolescent , Adult , Aged , Aged, 80 and over , Animals , Child , Child, Preschool , Cholesterol, LDL/blood , Cholesterol, LDL/genetics , Coronary Artery Disease/genetics , Diabetes Mellitus, Type 1/genetics , Female , Genes, MHC Class II/genetics , Genome, Human , Genotype , Humans , Infant , Male , Mice , Middle Aged , Oligonucleotide Array Sequence Analysis , Quantitative Trait Loci/genetics , RNA, Messenger/analysis , RNA, Messenger/genetics
19.
Nature ; 436(7051): 701-3, 2005 Aug 04.
Article in English | MEDLINE | ID: mdl-16079846

ABSTRACT

Interactions between polymorphisms at different quantitative trait loci (QTLs) are thought to contribute to the genetics of many traits, and can markedly affect the power of genetic studies to detect QTLs. Interacting loci have been identified in many organisms. However, the prevalence of interactions, and the nucleotide changes underlying them, are largely unknown. Here we search for naturally occurring genetic interactions in a large set of quantitative phenotypes--the levels of all transcripts in a cross between two strains of Saccharomyces cerevisiae. For each transcript, we searched for secondary loci interacting with primary QTLs detected by their individual effects. Such locus pairs were estimated to be involved in the inheritance of 57% of transcripts; statistically significant pairs were identified for 225 transcripts. Among these, 67% of secondary loci had individual effects too small to be significant in a genome-wide scan. Engineered polymorphisms in isogenic strains confirmed an interaction between the mating-type locus MAT and the pheromone response gene GPA1. Our results indicate that genetic interactions are widespread in the genetics of transcript levels, and that many QTLs will be missed by single-locus tests but can be detected by two-stage tests that allow for interactions.
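
The two-stage idea in this abstract (first find a primary locus by its individual effect, then scan for secondary loci that interact with it) can be sketched on simulated data. This is only an illustration under stated assumptions, not the authors' procedure: the haploid genotypes coded -1/+1, the unlinked markers, the effect sizes, the nested-model F statistic, and the names `f_stat` and `two_stage_scan` are all hypothetical, and no significance thresholds or permutation testing are included.

```python
import numpy as np

def f_stat(y, X_null, X_full):
    """F statistic comparing two nested least-squares models."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    rss0, rss1 = rss(X_null), rss(X_full)
    df1 = X_full.shape[1] - X_null.shape[1]
    df2 = len(y) - X_full.shape[1]
    return ((rss0 - rss1) / df1) / (rss1 / df2)

def two_stage_scan(y, G):
    """Stage 1: single-locus scan. Stage 2: interaction scan given the primary locus."""
    n, m = G.shape
    ones = np.ones((n, 1))
    # Stage 1: test each marker on its own and keep the strongest one.
    stage1 = np.array([
        f_stat(y, ones, np.column_stack([ones, G[:, j]])) for j in range(m)
    ])
    primary = int(np.argmax(stage1))
    # Stage 2: for every other marker, test only the interaction term
    # with the primary locus, on top of both main effects.
    stage2 = np.full(m, -np.inf)
    for j in range(m):
        if j == primary:
            continue
        g1, g2 = G[:, primary], G[:, j]
        X0 = np.column_stack([ones, g1, g2])
        X1 = np.column_stack([X0, g1 * g2])
        stage2[j] = f_stat(y, X0, X1)
    secondary = int(np.argmax(stage2))
    return primary, secondary, stage1, stage2

# Simulated cross (assumption): 200 segregants, 20 unlinked markers.
rng = np.random.default_rng(1)
n, m = 200, 20
G = rng.choice([-1.0, 1.0], size=(n, m))
# Marker 3 has a main effect; marker 11 acts only through its
# interaction with marker 3, so it has no individual effect.
y = 1.0 * G[:, 3] + 1.0 * G[:, 3] * G[:, 11] + rng.normal(size=n)

primary, secondary, stage1, stage2 = two_stage_scan(y, G)
print("primary locus:", primary)
print("secondary locus:", secondary)
print("single-locus F for the secondary locus:", round(float(stage1[secondary]), 2))
```

In this toy setup the secondary marker is close to invisible in the single-locus scan yet stands out once the scan is restricted to interactions with the primary locus, which is the situation the abstract describes for secondary loci whose individual effects are too small to be significant genome-wide.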


Subjects
Gene Expression Regulation, Fungal/genetics , Polymorphism, Genetic/genetics , Saccharomyces cerevisiae/genetics , Crosses, Genetic , GTP-Binding Protein alpha Subunits/genetics , GTP-Binding Protein alpha Subunits/metabolism , GTP-Binding Protein alpha Subunits, Gq-G11 , Genes, Fungal/genetics , Genes, Mating Type, Fungal , Genome, Fungal , Haploidy , Mutation/genetics , Pheromones/pharmacology , Quantitative Trait Loci/genetics , RNA, Fungal/genetics , RNA, Fungal/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/physiology , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
20.
Proc Natl Acad Sci U S A ; 105(48): 18718-23, 2008 Dec 02.
Article in English | MEDLINE | ID: mdl-19033188

ABSTRACT

We develop a general framework for performing large-scale significance testing in the presence of arbitrarily strong dependence. We derive a low-dimensional set of random vectors, called a dependence kernel, that fully captures the dependence structure in an observed high-dimensional dataset. This result shows a surprising reversal of the "curse of dimensionality" in the high-dimensional hypothesis testing setting. We show theoretically that conditioning on a dependence kernel is sufficient to render statistical tests independent regardless of the level of dependence in the observed data. This framework for multiple testing dependence has implications in a variety of common multiple testing problems, such as in gene expression studies, brain imaging, and spatial epidemiology.
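
The claim that conditioning on a low-dimensional dependence kernel removes dependence between tests can be illustrated with a small simulation. This is a rough sketch under strong assumptions, not the paper's derivation: the kernel `G` is simulated and treated as known (estimating it from data is not shown), all tests are null, the model is linear, and the check looks only at pairwise correlation between residuals rather than full independence. The names `residuals` and `mean_abs_corr` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 100, 2                 # tests, samples, kernel dimension

x = rng.integers(0, 2, size=n).astype(float)   # primary variable (e.g. case/control)
G = rng.normal(size=(r, n))           # assumed dependence kernel (r x n)
Gamma = rng.normal(size=(m, r))       # per-test loadings on the kernel
U = rng.normal(size=(m, n))           # independent noise
Y = Gamma @ G + U                     # every test is null with respect to x

def residuals(Y, covariates):
    """Residuals of each row of Y after regressing on an intercept plus covariates."""
    X = np.column_stack([np.ones(Y.shape[1])] + list(covariates))
    H = X @ np.linalg.pinv(X)         # projection onto the column space of X
    return Y - Y @ H

def mean_abs_corr(R):
    """Average absolute correlation between residual vectors of distinct tests."""
    C = np.corrcoef(R)
    off_diag = C[~np.eye(len(C), dtype=bool)]
    return float(np.abs(off_diag).mean())

# Model that ignores the kernel: residuals stay correlated across tests.
naive = mean_abs_corr(residuals(Y, [x]))
# Model that conditions on the kernel: only sampling noise remains.
conditioned = mean_abs_corr(residuals(Y, [x, *G]))

print("mean |corr| ignoring the kernel:   ", round(naive, 3))
print("mean |corr| conditioning on kernel:", round(conditioned, 3))
```

With only two kernel vectors driving dependence among 200 tests, adding those two vectors as covariates is enough to bring the residual correlations down to the level expected from sampling noise, which is the sense in which the dependence structure is low-dimensional in this example.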


Subjects
Algorithms , Models, Statistical , Computer Simulation , Software , Statistics as Topic