Results 1 - 20 of 61
2.
PLoS Genet ; 17(1): e1009241, 2021 01.
Article in English | MEDLINE | ID: mdl-33465078

ABSTRACT

FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of FST to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.


Subject(s)
Genetics, Population/statistics & numerical data; Genome-Wide Association Study/statistics & numerical data; Inbreeding; Models, Genetic; Genotype; Humans; Pedigree; Polymorphism, Single Nucleotide/genetics
3.
Genome Res ; 31(2): 337-347, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33361113

ABSTRACT

Understanding the changes in diverse molecular pathways underlying the development of breast tumors is critical for improving diagnosis, treatment, and drug development. Here, we used RNA-profiling of canine mammary tumors (CMTs) coupled with a robust analysis framework to model molecular changes in human breast cancer. Our study leveraged a key advantage of the canine model, the frequent presence of multiple naturally occurring tumors at diagnosis, thus providing samples spanning normal tissue and benign and malignant tumors from each patient. We showed human breast cancer signals, at both expression and mutation level, are evident in CMTs. Profiling multiple tumors per patient enabled by the CMT model allowed us to resolve statistically robust transcription patterns and biological pathways specific to malignant tumors versus those arising in benign tumors or shared with normal tissues. We showed that multiple histological samples per patient are necessary to effectively capture these progression-related signatures, and that carcinoma-specific signatures are predictive of survival for human breast cancer patients. To catalyze and support similar analyses and use of the CMT model by other biomedical researchers, we provide FREYA, a robust data processing pipeline and statistical analyses framework.

4.
Biostatistics ; 22(1): 68-81, 2021 01 28.
Article in English | MEDLINE | ID: mdl-31135886

ABSTRACT

The false discovery rate (FDR) measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the FDR. We develop a new framework for formulating and estimating FDRs and q-values when an additional piece of information, which we call an "informative variable", is available. For a given test, the informative variable provides information about the prior probability a null hypothesis is true or the power of that particular test. The FDR is then treated as a function of this informative variable. We consider two applications in genomics. Our first application is a genetics of gene expression (eQTL) experiment in yeast where every genetic marker and gene expression trait pair are tested for associations. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.


Subject(s)
Genetic Markers; Genomics; Animals; Gene Expression Profiling; Genetic Markers/genetics; Genomics/methods; Mice; Probability; Saccharomyces cerevisiae/genetics
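
As a point of reference for the FDR and q-value quantities described in the abstract above, the following is a minimal sketch of the standard Benjamini-Hochberg adjustment, which uses p-values alone. It is not the informative-variable framework proposed in the paper; the function name `bh_adjust` and the example p-values are illustrative assumptions.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: this is the classical p-value-only procedure, shown only
    as a baseline; it does not use any informative variable.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four tests.
    print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```

Tests whose adjusted value falls below a chosen threshold (for example 0.05) would be called significant at that estimated FDR level.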
5.
Bioinformatics ; 37(3): 367-374, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32818252

ABSTRACT

MOTIVATION: Analysis of biological data often involves the simultaneous testing of thousands of genes. This requires two key steps: the ranking of genes and the selection of important genes based on a significance threshold. One such testing procedure, called the optimal discovery procedure (ODP), leverages information across different tests to provide an optimal ranking of genes. This approach can lead to substantial improvements in statistical power compared to other methods. However, current applications of the ODP have only been established for simple study designs using microarray technology. Here, we extend this work to the analysis of complex study designs and RNA-sequencing studies. RESULTS: We apply our extended framework to a static RNA-sequencing study, a longitudinal study, an independent sampling time-series study, and an independent sampling dose-response study. Our method shows improved performance compared to other testing procedures, finding more differentially expressed genes and increasing power for enrichment analysis. Thus, the extended ODP enables a favorable significance analysis of genome-wide gene expression studies. AVAILABILITY AND IMPLEMENTATION: The algorithm is implemented in our freely available R package called edge and can be downloaded at https://www.bioconductor.org/packages/release/bioc/html/edge.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms; Software; Gene Expression; Longitudinal Studies; Sequence Analysis, RNA
6.
Genetics ; 213(3): 759-770, 2019 11.
Article in English | MEDLINE | ID: mdl-31537622

ABSTRACT

Testing for Hardy-Weinberg equilibrium (HWE) is an important component in almost all analyses of population genetic data. Genetic markers that violate HWE are often treated as special cases; for example, they may be flagged as possible genotyping errors, or they may be investigated more closely for evolutionary signatures of interest. The presence of population structure is one reason why genetic markers may fail a test of HWE. This is problematic because almost all natural populations studied in the modern setting show some degree of structure. Therefore, it is important to be able to detect deviations from HWE for reasons other than structure. To this end, we extend statistical tests of HWE to allow for population structure, which we call a test of "structural HWE." Additionally, our new test allows one to automatically choose tuning parameters and identify accurate models of structure. We demonstrate our approach on several important studies, provide theoretical justification for the test, and present empirical evidence for its utility. We anticipate the proposed test will be useful in a broad range of analyses of genome-wide population genetic data.


Subject(s)
Genetics, Population/methods; Genotyping Techniques/methods; Population/genetics; Genetic Markers; Genetics, Population/standards; Genotyping Techniques/standards; Humans; Linkage Disequilibrium; Reference Standards
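
For context on what a test of HWE computes, here is a minimal sketch of the classical chi-square test for a single biallelic marker in an unstructured sample. It is the baseline that the abstract above sets out to extend, not the "structural HWE" test itself; the function name `hwe_chi_square` and the genotype counts are illustrative assumptions.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for Hardy-Weinberg equilibrium at one marker.

    Arguments are observed counts of the three genotypes (AA, AB, BB).
    Assumption: a single homogeneous population with no structure.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Estimated frequency of allele A from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("marker is monomorphic; the test is undefined")
    # Expected genotype counts under HWE: n*p^2, 2*n*p*q, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts exactly at HWE proportions
    print(hwe_chi_square(50, 0, 50))   # complete absence of heterozygotes
```

The statistic is conventionally compared with a chi-square distribution with one degree of freedom. A sample pooled from subpopulations with different allele frequencies tends to show a heterozygote deficit and a large statistic, which is the confounding by structure that the abstract describes.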
7.
Genetics ; 212(4): 1009-1029, 2019 08.
Article in English | MEDLINE | ID: mdl-31028112

ABSTRACT

We introduce a simple and computationally efficient method for fitting the admixture model of genetic population structure, called ALStructure. The strategy of ALStructure is to first estimate the low-dimensional linear subspace of the population admixture components, and then search for a model within this subspace that is consistent with the admixture model's natural probabilistic constraints. Central to this strategy is the observation that all models belonging to this constrained space of solutions are risk-minimizing and have equal likelihood, rendering any additional optimization unnecessary. The low-dimensional linear subspace is estimated through a recently introduced principal components analysis method that is appropriate for genotype data, thereby providing a solution that has both principal components and probabilistic admixture interpretations. Our approach differs fundamentally from other existing methods for estimating admixture, which aim to fit the admixture model directly by searching for parameters that maximize the likelihood function or the posterior probability. We observe that ALStructure typically outperforms existing methods both in accuracy and computational speed under a wide array of simulated and real human genotype datasets. Throughout this work, we emphasize that the admixture model is a special case of a much broader class of models for which algorithms similar to ALStructure may be successfully employed.


Subject(s)
Algorithms; Computational Biology; Genetics, Population; Likelihood Functions; Models, Genetic; Datasets as Topic; Genome, Human; Humans; Principal Component Analysis
8.
Nat Commun ; 9(1): 2656, 2018 07 09.
Article in English | MEDLINE | ID: mdl-29985403

ABSTRACT

Genome-wide analysis of transcription in the malaria parasite Plasmodium falciparum has revealed robust variation in steady-state mRNA abundance throughout the 48-h intraerythrocytic developmental cycle (IDC), suggesting that this process is highly dynamic and tightly regulated. Here, we utilize rapid 4-thiouracil (4-TU) incorporation via pyrimidine salvage to specifically label, capture, and quantify newly-synthesized RNA transcripts at every hour throughout the IDC. This high-resolution global analysis of the transcriptome captures the timing and rate of transcription for each newly synthesized mRNA in vivo, revealing active transcription throughout all IDC stages. Using a statistical model to predict the mRNA dynamics contributing to the total mRNA abundance at each timepoint, we find varying degrees of transcription and stabilization for each mRNA corresponding to developmental transitions. Finally, our results provide new insight into co-regulation of mRNAs throughout the IDC through regulatory DNA sequence motifs, thereby expanding our understanding of P. falciparum mRNA dynamics.


Subject(s)
Genes, Protozoan/genetics; Genome, Protozoan/genetics; Plasmodium falciparum/genetics; Transcription, Genetic; Erythrocytes/parasitology; Gene Expression Profiling; Gene Ontology; Humans; Malaria, Falciparum/parasitology; Plasmodium falciparum/physiology; RNA, Messenger/genetics; RNA, Messenger/metabolism; RNA, Protozoan/genetics; RNA, Protozoan/metabolism
9.
Nat Genet ; 48(12): 1587-1590, 2016 12.
Article in English | MEDLINE | ID: mdl-27819665

ABSTRACT

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.


Subject(s)
Algorithms; Computational Biology/methods; Disease/genetics; Genetic Markers/genetics; Genetic Predisposition to Disease; Models, Statistical; Polymorphism, Single Nucleotide/genetics; Bayes Theorem; Genetics, Population; Humans
10.
Science ; 354(6311)2016 10 28.
Article in English | MEDLINE | ID: mdl-27789812

ABSTRACT

Cellular metabolic fluxes are determined by enzyme activities and metabolite abundances. Biochemical approaches reveal the impact of specific substrates or regulators on enzyme kinetics but do not capture the extent to which metabolite and enzyme concentrations vary across physiological states and, therefore, how cellular reactions are regulated. We measured enzyme and metabolite concentrations and metabolic fluxes across 25 steady-state yeast cultures. We then assessed the extent to which flux can be explained by a Michaelis-Menten relationship between enzyme, substrate, product, and potential regulator concentrations. This revealed three previously unrecognized instances of cross-pathway regulation, which we biochemically verified. One of these involved inhibition of pyruvate kinase by citrate, which accumulated and thereby curtailed glycolytic outflow in nitrogen-limited yeast. Overall, substrate concentrations were the strongest driver of the net rates of cellular metabolic reactions, with metabolite concentrations collectively having more than double the physiological impact of enzymes.


Subject(s)
Metabolic Networks and Pathways; Saccharomyces cerevisiae Proteins/metabolism; Saccharomyces cerevisiae/enzymology; Allosteric Regulation; Citrates/metabolism; Glycolysis; Kinetics; Nitrogen/deficiency; Pyruvate Kinase/antagonists & inhibitors; Pyruvate Kinase/chemistry; Pyruvate Kinase/metabolism; Saccharomyces cerevisiae Proteins/antagonists & inhibitors; Saccharomyces cerevisiae Proteins/chemistry
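
To make the Michaelis-Menten relationship mentioned in the abstract above concrete, the sketch below evaluates the simplest irreversible, single-substrate form, v = kcat · [E] · [S] / (Km + [S]). This is an assumption chosen for illustration: the relationship fitted in the study also involves product and potential regulator concentrations, which are omitted here. The function name and all parameter values are hypothetical.

```python
def michaelis_menten_rate(enzyme, substrate, kcat, km):
    """Reaction rate under irreversible single-substrate Michaelis-Menten kinetics.

    enzyme, substrate: concentrations; kcat: turnover number; km: Michaelis constant.
    Assumption: no product inhibition and no allosteric regulators.
    """
    if km <= 0:
        raise ValueError("km must be positive")
    if enzyme < 0 or substrate < 0:
        raise ValueError("concentrations must be non-negative")
    # Vmax = kcat * enzyme; the substrate term saturates toward 1 as [S] grows.
    return kcat * enzyme * substrate / (km + substrate)


if __name__ == "__main__":
    # Hypothetical values: when [S] equals Km the rate is half of Vmax.
    print(michaelis_menten_rate(enzyme=2.0, substrate=5.0, kcat=10.0, km=5.0))
```

In this form the rate scales linearly with enzyme concentration but saturates in substrate, which is one way to see how flux can respond differently to changes in enzymes and in metabolites.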
11.
Bioinformatics ; 32(5): 713-21, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26545820

ABSTRACT

MOTIVATION: Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure. The most prominent work on this problem has focused on estimating a model of admixture proportions of ancestral populations for each individual. Here, we instead focus on modeling variation of the genotypes without requiring a higher-level admixture interpretation. RESULTS: We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly admixture model as a special case. Noting some drawbacks of this approach, we introduce a new 'logistic factor analysis' framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the Human Genome Diversity Panel and 1000 Genomes Project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions. AVAILABILITY AND IMPLEMENTATION: A Bioconductor R package called lfa is available at http://www.bioconductor.org/packages/release/bioc/html/lfa.html CONTACT: jstorey@princeton.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetic Variation; Algorithms; Genotype; Humans; Models, Statistical; Probability; Software
12.
PLoS Comput Biol ; 11(11): e1004509, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26575353

ABSTRACT

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.


Subject(s)
Computational Biology/methods; Models, Statistical; Protein Structure, Tertiary; Proteins/chemistry; Algorithms; Databases, Protein
13.
Nucleic Acids Res ; 43(20): e131, 2015 Nov 16.
Article in English | MEDLINE | ID: mdl-26130709

ABSTRACT

Understanding the differences between microarray and RNA-Seq technologies for measuring gene expression is necessary for informed design of experiments and choice of data analysis methods. Previous comparisons have come to sometimes contradictory conclusions, which we suggest result from a lack of attention to the intensity-dependent nature of variation generated by the technologies. To examine this trend, we carried out a parallel nested experiment performed simultaneously on the two technologies that systematically split variation into four stages (treatment, biological variation, library preparation and chip/lane noise), allowing a separation and comparison of the sources of variation in a well-controlled cellular system, Saccharomyces cerevisiae. With this novel dataset, we demonstrate that power and accuracy are more dependent on per-gene read depth in RNA-Seq than they are on fluorescence intensity in microarrays. However, we carried out quantitative PCR validations which indicate that microarrays may demonstrate greater systematic bias in low-intensity genes than in RNA-seq.


Subject(s)
Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, RNA/methods; Data Interpretation, Statistical; Genes, Fungal; Polymerase Chain Reaction; Saccharomyces cerevisiae/genetics
14.
Nat Genet ; 47(5): 550-4, 2015 May.
Article in English | MEDLINE | ID: mdl-25822090

ABSTRACT

We present a new statistical test of association between a trait and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as those measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a 'genotype-conditional association test' (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and non-genetic contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods, such as the linear mixed-model and principal-component approaches.


Subject(s)
Genome-Wide Association Study; Computer Simulation; Humans; Linear Models; Models, Genetic; Polymorphism, Single Nucleotide; Principal Component Analysis
15.
Bioinformatics ; 31(4): 545-54, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25336500

ABSTRACT

MOTIVATION: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. RESULTS: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. AVAILABILITY AND IMPLEMENTATION: An R software package, called jackstraw, is available in CRAN. CONTACT: jstorey@princeton.edu.


Subject(s)
Algorithms; Data Interpretation, Statistical; Genomics/methods; Inflammation/genetics; Models, Statistical; Principal Component Analysis; Software; Computer Simulation; Gene Expression Profiling; Genes, cdc; Genetic Variation; Humans; Microarray Analysis; Phenotype; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Stress Disorders, Post-Traumatic/genetics
16.
Bioinformatics ; 30(23): 3424-6, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25189781

ABSTRACT

MOTIVATION: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment. RESULTS: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth. AVAILABILITY AND IMPLEMENTATION: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, RNA/methods; Software; Animals; Rats
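
The idea of randomly sampling lower depths from an existing experiment, described in the abstract above, can be sketched as follows. This is a simple illustration that keeps each read independently with a fixed probability; it is not the subSeq package's code or API, and the function name `subsample_counts` and the example counts are assumptions.

```python
import random


def subsample_counts(counts, proportion, seed=0):
    """Simulate a lower read depth by thinning per-gene read counts.

    Each read is kept independently with probability `proportion`,
    so each gene's new count is a binomial draw from its original count.
    Assumption: reads are independent; `seed` only fixes the random stream.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(c) if rng.random() < proportion)
        for c in counts
    ]


if __name__ == "__main__":
    original = [10, 0, 250, 1200]  # hypothetical per-gene read counts
    for prop in (0.1, 0.5, 1.0):
        print(prop, subsample_counts(original, prop))
```

Repeating a differential expression analysis at each proportion and plotting the number of significant genes against depth would show where power begins to saturate.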
17.
Genome Med ; 6(5): 40, 2014.
Article in English | MEDLINE | ID: mdl-24971157

ABSTRACT

BACKGROUND: Genetic risk scores have been developed for coronary artery disease and atherosclerosis, but are not predictive of adverse cardiovascular events. We asked whether peripheral blood expression profiles may be predictive of acute myocardial infarction (AMI) and/or cardiovascular death. METHODS: Peripheral blood samples from 338 subjects aged 62 ± 11 years with coronary artery disease (CAD) were analyzed in two phases (discovery N = 175, and replication N = 163), and followed for a mean 2.4 years for cardiovascular death. Gene expression was measured on Illumina HT-12 microarrays with two different normalization procedures to control technical and biological covariates. Whole genome genotyping was used to support comparative genome-wide association studies of gene expression. Analysis of variance was combined with receiver operating curve and survival analysis to define a transcriptional signature of cardiovascular death. RESULTS: In both phases, there was significant differential expression between healthy and AMI groups with overall down-regulation of genes involved in T-lymphocyte signaling and up-regulation of inflammatory genes. Expression quantitative trait loci analysis provided evidence for altered local genetic regulation of transcript abundance in AMI samples. On follow-up there were 31 cardiovascular deaths. A principal component (PC1) score capturing covariance of 238 genes that were differentially expressed between deceased and survivors in the discovery phase significantly predicted risk of cardiovascular death in the replication and combined samples (hazard ratio = 8.5, P < 0.0001) and improved the C-statistic (area under the curve 0.82 to 0.91, P = 0.03) after adjustment for traditional covariates. CONCLUSIONS: A specific blood gene expression profile is associated with a significant risk of death in Caucasian subjects with CAD. This comprises a subset of transcripts that are also altered in expression during acute myocardial infarction.

18.
Proc Natl Acad Sci U S A ; 111(6): E645-54, 2014 Feb 11.
Article in English | MEDLINE | ID: mdl-24469817

ABSTRACT

A problem of substantial interest is to systematically map variation in chromatin structure to gene-expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity and genome-wide gene-expression levels in 20 diverse human cell types. We show that ∼25% of genes show cell-type-specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., ±200 kb) capture more genes with this relationship than local regions (e.g., ±2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type-specific expression, which we show have a range of contextual uses. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene-expression variation.


Subject(s)
Chromatin/physiology; Gene Expression Profiling; Chromatin/chemistry; Deoxyribonuclease I/metabolism; Humans; Polymorphism, Single Nucleotide; Regulatory Elements, Transcriptional
19.
G3 (Bethesda) ; 4(1): 11-8, 2014 Jan 10.
Article in English | MEDLINE | ID: mdl-24192834

ABSTRACT

High-throughput quantitative DNA sequencing enables the parallel phenotyping of pools of thousands of mutants. However, the appropriate analytical methods and experimental design that maximize the efficiency of these methods while maintaining statistical power are currently unknown. Here, we have used Bar-seq analysis of the Saccharomyces cerevisiae yeast deletion library to systematically test the effect of experimental design parameters and sequence read depth on experimental results. We present computational methods that efficiently and accurately estimate effect sizes and their statistical significance by adapting existing methods for RNA-seq analysis. Using simulated variation of experimental designs, we found that biological replicates are critical for statistical analysis of Bar-seq data, whereas technical replicates are of less value. By subsampling sequence reads, we found that when using four-fold biological replication, 6 million reads per condition achieved 96% power to detect a two-fold change (or more) at a 5% false discovery rate. Our guidelines for experimental design and computational analysis enable the study of the yeast deletion collection in up to 30 different conditions in a single sequencing lane. These findings are relevant to a variety of pooled genetic screening methods that use high-throughput quantitative DNA sequencing, including Tn-seq.


Subject(s)
RNA, Fungal/analysis; Saccharomyces cerevisiae/genetics; Gene Deletion; Gene Library; High-Throughput Nucleotide Sequencing; Research Design; Sequence Analysis, RNA
20.
BMC Bioinformatics ; 14: 360, 2013 Dec 12.
Article in English | MEDLINE | ID: mdl-24330332

ABSTRACT

BACKGROUND: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set replicates in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate in the bagged samples. RESULTS: Using both simulated and publicly-available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this method reflects replication better than each set's p-value. CONCLUSIONS: Our results suggest that gene lists based on p-values are not necessarily stable, and therefore additional steps like gene set bagging may improve biological inference on gene sets.


Subject(s)
DNA Methylation/genetics; DNA Replication/genetics; Genomics/methods; Algorithms; Binding Sites/genetics; Brain Chemistry/genetics; Computer Simulation; Databases, Factual; Gene Expression Profiling/methods; Genome, Human; Genomics/trends; Humans; Oligonucleotide Array Sequence Analysis/methods; Predictive Value of Tests; Probability; Sample Size; Smoking/genetics; Transcription Factors/genetics; Transcription Factors/metabolism
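
The resampling loop behind gene set bagging, as outlined in the abstract above (resample the data, rerun the gene-set analysis, record whether the set is still significant), can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `is_significant` is a hypothetical callback standing in for a full gene-set analysis, and the function name, the number of resamples, and the toy data are all assumptions.

```python
import random


def replication_probability(samples, is_significant, n_resamples=200, seed=0):
    """Estimate how often a gene set stays significant under resampling.

    samples: list of per-sample records (any objects).
    is_significant: hypothetical callback that runs a gene-set analysis on a
        resampled list and returns True if the set of interest is significant.
    Returns the fraction of bootstrap resamples in which the set is significant.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    n = len(samples)
    hits = 0
    for _ in range(n_resamples):
        # Bootstrap: draw n samples with replacement from the original data.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        if is_significant(resample):
            hits += 1
    return hits / n_resamples


if __name__ == "__main__":
    # Toy stand-in: call the "set" significant when the resample mean exceeds 0.5.
    data = [0.2, 0.9, 0.7, 0.4, 0.8, 0.6]
    print(replication_probability(data, lambda xs: sum(xs) / len(xs) > 0.5))
```

A set that is significant in the original data but only in a small fraction of resamples would be flagged as unstable, which is the situation the abstract warns about.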