Results 1 - 20 of 28
1.
Proc Natl Acad Sci U S A ; 118(44)2021 11 02.
Article in English | MEDLINE | ID: mdl-34716259

ABSTRACT

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms that perform clustering on several small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly bring down the computational cost of traditional algorithms, including spectral clustering, semidefinite programs, modularity-based methods, likelihood-based methods, etc., without losing accuracy, and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Since most traditional algorithms are accurate, and the corresponding optimization problems are much simpler in small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove the consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.

2.
Proc Natl Acad Sci U S A ; 117(47): 29257-29259, 2020 11 24.
Article in English | MEDLINE | ID: mdl-33188088
3.
Proc Natl Acad Sci U S A ; 116(38): 18943-18950, 2019 09 17.
Article in English | MEDLINE | ID: mdl-31484776

ABSTRACT

Rapid advances in genomic technologies have led to a wealth of diverse data, from which novel discoveries can be gleaned through the application of robust statistical and computational methods. Here, we describe GeneFishing, a semisupervised computational approach to reconstruct context-specific portraits of biological processes by leveraging gene-gene coexpression information. GeneFishing incorporates multiple high-dimensional statistical ideas, including dimensionality reduction, clustering, subsampling, and results aggregation, to produce robust results. To illustrate the power of our method, we applied it using 21 genes involved in cholesterol metabolism as "bait" to "fish out" (or identify) genes not previously identified as being connected to cholesterol metabolism. Using simulation and real datasets, we found that the results obtained through GeneFishing were more interesting for our study than those provided by related gene prioritization methods. In particular, application of GeneFishing to the GTEx liver RNA sequencing (RNAseq) data not only reidentified many known cholesterol-related genes, but also pointed to glyoxalase I (GLO1) as a gene implicated in cholesterol metabolism. In a follow-up experiment, we found that GLO1 knockdown in human hepatoma cell lines increased levels of cellular cholesterol ester, validating a role for GLO1 in cholesterol metabolism. In addition, we performed pantissue analysis by applying GeneFishing on various tissues and identified many potential tissue-specific cholesterol metabolism-related genes. GeneFishing appears to be a powerful tool for identifying related components of complex biological systems and may be used across a wide range of applications.


Subjects
Biological Phenomena/genetics ; Computational Biology/methods ; Gene Expression Profiling ; Genomics/methods ; Carcinoma, Hepatocellular/genetics ; Carcinoma, Hepatocellular/metabolism ; Cell Line, Tumor ; Cholesterol/metabolism ; Databases, Genetic ; Humans ; Lactoylglutathione Lyase/genetics ; Lipid Metabolism/genetics ; Organ Specificity/genetics ; Reproducibility of Results ; Workflow
4.
Proc Natl Acad Sci U S A ; 116(10): 4156-4165, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30770453

ABSTRACT

There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the conditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms, such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks, to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a metaalgorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz-continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the metalearners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our X-learner can be used to target treatment regimes and to shed light on underlying mechanisms. A software package is provided that implements our methods.
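
The X-learner idea can be sketched in a few lines. This is a minimal illustration, not the authors' software package: the names `fit_line` and `x_learner` are hypothetical, the base learner is a one-covariate least-squares line standing in for RF or BART, and the weight `g` is assumed to be a constant rather than an estimated propensity score.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function.

    Stand-in base learner (assumption): any regression method could be used.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def x_learner(x, y, w, g=0.5):
    """Return a CATE estimate tau(x) from covariates x, outcomes y, treatment flags w.

    g is the weight on the control-group estimate (assumed constant here).
    """
    x0 = [xi for xi, wi in zip(x, w) if wi == 0]
    y0 = [yi for yi, wi in zip(y, w) if wi == 0]
    x1 = [xi for xi, wi in zip(x, w) if wi == 1]
    y1 = [yi for yi, wi in zip(y, w) if wi == 1]

    # Stage 1: fit the response function in each group.
    mu0 = fit_line(x0, y0)
    mu1 = fit_line(x1, y1)

    # Stage 2: impute individual treatment effects using the other group's model.
    d1 = [yi - mu0(xi) for xi, yi in zip(x1, y1)]  # treated units
    d0 = [mu1(xi) - yi for xi, yi in zip(x0, y0)]  # control units

    # Stage 3: regress the imputed effects on the covariate in each group.
    tau1 = fit_line(x1, d1)
    tau0 = fit_line(x0, d0)

    # Combine the two estimates with weight g.
    return lambda q: g * tau0(q) + (1 - g) * tau1(q)


# Noise-free toy data: control y = 1 + 2x, treated y = 4 + 3x, so CATE(x) = 3 + x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 2.5, 4.5]
w = [0, 0, 0, 0, 0, 0, 1, 1, 1]
y = [1 + 2 * xi if wi == 0 else 4 + 3 * xi for xi, wi in zip(x, w)]
tau = x_learner(x, y, w)
```

Because the imputed effects in the treated group rely on the control-group model, a large control group can make them accurate even when few units are treated, which is the unbalanced setting the abstract highlights.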

5.
Proc Natl Acad Sci U S A ; 116(3): 900-908, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30598455

ABSTRACT

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.


Subjects
Drosophila Proteins ; Embryo, Nonmammalian/embryology ; Embryonic Development/physiology ; Enhancer Elements, Genetic/physiology ; Sequence Analysis, DNA ; Transcription Factors ; Animals ; Drosophila Proteins/genetics ; Drosophila Proteins/metabolism ; Drosophila melanogaster ; Genome-Wide Association Study ; Transcription Factors/genetics ; Transcription Factors/metabolism
6.
Ann Appl Stat ; 13(3): 1511-1536, 2019 Sep.
Article in English | MEDLINE | ID: mdl-32968472

ABSTRACT

Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted, whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.

7.
Proc Natl Acad Sci U S A ; 115(37): 9151-9156, 2018 09 11.
Article in English | MEDLINE | ID: mdl-30150379

ABSTRACT

Projection pursuit is a classical exploratory data analysis method to detect interesting low-dimensional structures in multivariate data. Originally, projection pursuit was applied mostly to data of moderately low dimension. Motivated by contemporary applications, we here study its properties in high-dimensional settings. Specifically, we analyze the asymptotic properties of projection pursuit on structureless multivariate Gaussian data with an identity covariance, as both dimension p and sample size n tend to infinity, with [Formula: see text]. Our main results are that (i) if [Formula: see text], then there exist projections whose corresponding empirical cumulative distribution function can approximate any arbitrary distribution; and (ii) if [Formula: see text], not all limiting distributions are possible. However, depending on the value of γ, various non-Gaussian distributions may still be approximated. In contrast, if we restrict to sparse projections, involving only a few of the p variables, then asymptotically all empirical cumulative distribution functions are Gaussian. And (iii) if [Formula: see text], then asymptotically all projections are Gaussian. Some of these results extend to mean-centered sub-Gaussian data and to projections into k dimensions. Hence, in the "small n, large p" setting, unless sparsity is enforced, and regardless of the chosen projection index, projection pursuit may detect an apparent structure that has no statistical significance. Furthermore, our work reveals fundamental limitations on the ability to detect non-Gaussian signals in high-dimensional data, in particular through independent component analysis and related non-Gaussian component analysis.

8.
Genome Res ; 25(11): 1692-702, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26294687

ABSTRACT

In eukaryotic cells, RNAs exist as ribonucleoprotein particles (RNPs). Despite the importance of these complexes in many biological processes, including splicing, polyadenylation, stability, transportation, localization, and translation, their compositions are largely unknown. We affinity-purified 20 distinct RNA-binding proteins (RBPs) from cultured Drosophila melanogaster cells under native conditions and identified both the RNA and protein compositions of these RNP complexes. We identified "high occupancy target" (HOT) RNAs that interact with the majority of the RBPs we surveyed. HOT RNAs encode components of the nonsense-mediated decay and splicing machinery, as well as RNA-binding and translation initiation proteins. The RNP complexes contain proteins and mRNAs involved in RNA binding and post-transcriptional regulation. Genes with the capacity to produce hundreds of mRNA isoforms, ultracomplex genes, interact extensively with heterogeneous nuclear ribonuclear proteins (hnRNPs). Our data are consistent with a model in which subsets of RNPs include mRNA and protein products from the same gene, indicating the widespread existence of auto-regulatory RNPs. From the simultaneous acquisition and integrative analysis of protein and RNA constituents of RNPs, we identify extensive cross-regulatory and hierarchical interactions in post-transcriptional control.


Subjects
Drosophila Proteins/metabolism ; Drosophila melanogaster/genetics ; Gene Expression Regulation ; RNA-Binding Proteins/metabolism ; Animals ; Drosophila Proteins/genetics ; Heterogeneous-Nuclear Ribonucleoproteins/genetics ; Heterogeneous-Nuclear Ribonucleoproteins/metabolism ; RNA Splicing/genetics ; RNA, Messenger/genetics ; RNA, Messenger/metabolism ; RNA-Binding Proteins/genetics ; Sequence Analysis, RNA ; Transfection
9.
Nature ; 512(7515): 445-8, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164755

ABSTRACT

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.


Subjects
Caenorhabditis elegans/genetics ; Drosophila melanogaster/genetics ; Gene Expression Profiling ; Transcriptome/genetics ; Animals ; Caenorhabditis elegans/embryology ; Caenorhabditis elegans/growth & development ; Chromatin/genetics ; Cluster Analysis ; Drosophila melanogaster/growth & development ; Gene Expression Regulation, Developmental/genetics ; Histones/metabolism ; Humans ; Larva/genetics ; Larva/growth & development ; Models, Genetic ; Molecular Sequence Annotation ; Promoter Regions, Genetic/genetics ; Pupa/genetics ; Pupa/growth & development ; RNA, Untranslated/genetics ; Sequence Analysis, RNA
10.
Nature ; 512(7515): 453-6, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164757

ABSTRACT

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.


Subjects
Caenorhabditis elegans/genetics ; Drosophila melanogaster/genetics ; Evolution, Molecular ; Gene Expression Regulation/genetics ; Gene Regulatory Networks/genetics ; Transcription Factors/metabolism ; Animals ; Binding Sites ; Caenorhabditis elegans/growth & development ; Chromatin Immunoprecipitation ; Conserved Sequence/genetics ; Drosophila melanogaster/growth & development ; Gene Expression Regulation, Developmental/genetics ; Genome/genetics ; Humans ; Molecular Sequence Annotation ; Nucleotide Motifs/genetics ; Organ Specificity/genetics ; Transcription Factors/genetics
11.
Genome Res ; 24(7): 1086-101, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24985912

ABSTRACT

We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on "stage-associated genes" that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmental program in fly. Our results constitute the first comprehensive comparison between D. melanogaster and C. elegans developmental time courses and provide new insights into similarities in their development. We use an analogous approach to compare tissues and cells from fly and worm. Findings include strong transcriptome similarity of fly cell lines, clustering of fly adult tissues by origin regardless of sex and age, and clustering of worm tissues and dissected cells by developmental stage. Gene ontology analysis supports our results and gives a detailed functional annotation of different stages, tissues and cells. Finally, we show that standard correlation analyses could not effectively detect the mappings found by our method.
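
A hypergeometric test of gene-set overlap, the kind of test the abstract mentions for mapping stages, can be sketched as follows. This is an illustrative sketch only: the function name `hypergeom_pvalue` and the toy counts are assumptions, and the abstract does not specify how the authors define the gene universe or handle orthologs.

```python
from math import comb


def hypergeom_pvalue(overlap, n_a, n_b, n_total):
    """Upper-tail probability P(X >= overlap).

    X counts how many of n_b genes, drawn without replacement from a universe
    of n_total genes, fall in a marked set of size n_a.
    """
    upper = min(n_a, n_b)
    denom = comb(n_total, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom


# Toy example (made-up counts): two stages with 40 and 50 stage-associated
# genes out of 1,000 shared genes, 12 of them in common.
p_value = hypergeom_pvalue(overlap=12, n_a=40, n_b=50, n_total=1000)
```

A small p-value says the two stage-associated gene sets overlap more than expected by chance, which is the evidence used to pair stages.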


Subjects
Caenorhabditis elegans/genetics ; Drosophila melanogaster/genetics ; Embryonic Development/genetics ; Gene Expression Regulation, Developmental ; Animals ; Caenorhabditis elegans/embryology ; Caenorhabditis elegans/growth & development ; Cluster Analysis ; Computational Biology/methods ; Drosophila melanogaster/embryology ; Drosophila melanogaster/growth & development ; Female ; Gene Expression Profiling ; High-Throughput Nucleotide Sequencing ; Life Cycle Stages/genetics ; Male ; Molecular Sequence Annotation ; Organ Specificity/genetics ; Transcriptome
12.
PeerJ ; 2: e270, 2014.
Article in English | MEDLINE | ID: mdl-24688849

ABSTRACT

Large scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000-16,000 molecules per cell and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhausser et al. (2011), we find that the median protein detected is expressed at 170,000 molecules per cell and that our corrected protein abundance estimates show a higher correlation with mRNA abundances than do the uncorrected protein data. In addition, we estimated the impact of further errors in mRNA and protein abundances using direct experimental measurements of these errors. The resulting analysis suggests that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwanhausser et al. (2011), though because one major source of error could not be estimated the true percent contribution should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhausser et al. (2011), and that the measured and inferred translation rates correlate poorly (R² = 0.13). Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. We also determined the percent contributions of transcription, RNA degradation, translation and protein degradation to the variance in protein abundances using both of our strategies. While the magnitudes of the two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role. Finally, the above estimates only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.

13.
Nat Biotechnol ; 32(4): 341-6, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24633242

ABSTRACT

The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in the annotation of genomes. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call Generalized RNA Integration Tool, or GRIT. Applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recovered the vast majority of previously annotated transcripts and doubled the total number of transcripts cataloged. We found that 20% of protein coding genes encode multiple protein-localization signals and that, in 20-d-old adult fly heads, genes with multiple polyadenylation sites are more common than genes with alternative splicing or alternative promoters. GRIT demonstrates 30% higher precision and recall than the most widely used transcript assembly tools. GRIT will facilitate the automated generation of high-quality genome annotations without the need for extensive manual annotation.


Subjects
Chromosome Mapping/methods ; Genomics/methods ; Molecular Sequence Annotation/methods ; RNA/chemistry ; RNA/genetics ; Sequence Analysis, RNA/methods ; Animals ; Drosophila melanogaster/genetics ; Genome, Insect/genetics ; RNA/analysis
14.
Methods ; 68(1): 38-47, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24636835

ABSTRACT

modENCODE was a 5-year, NHGRI-funded project (2007-2012) to map the function of every base in the genomes of worms and flies, characterizing positions of modified histones and other chromatin marks, origins of DNA replication, RNA transcripts and the transcription factor binding sites that control gene expression. Here we describe the Drosophila modENCODE datasets and how best to access and use them for genome-wide and individual-gene studies.


Subjects
DNA Replication/genetics ; Databases, Genetic ; Developmental Biology/methods ; Animals ; Chromatin/genetics ; Data Mining ; Drosophila melanogaster/genetics ; Drosophila melanogaster/growth & development ; Genome, Insect
15.
Nature ; 512(7515): 393-9, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-24670639

ABSTRACT

Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.


Subjects
Drosophila melanogaster/genetics ; Gene Expression Profiling ; Transcriptome/genetics ; Alternative Splicing/genetics ; Animals ; Drosophila melanogaster/anatomy & histology ; Drosophila melanogaster/cytology ; Female ; Male ; Molecular Sequence Annotation ; Nerve Tissue/metabolism ; Organ Specificity ; Poly A/genetics ; Polyadenylation ; Promoter Regions, Genetic/genetics ; RNA, Long Noncoding/genetics ; RNA, Messenger/genetics ; RNA, Messenger/metabolism ; Sex Characteristics ; Stress, Physiological/genetics
16.
Proc Natl Acad Sci U S A ; 110(36): 14563-8, 2013 Sep 03.
Article in English | MEDLINE | ID: mdl-23954907

ABSTRACT

We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions.


Subjects
Algorithms ; Likelihood Functions ; Regression Analysis ; Computer Simulation
17.
Proc Natl Acad Sci U S A ; 110(36): 14557-62, 2013 Sep 03.
Article in English | MEDLINE | ID: mdl-23954908

ABSTRACT

We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both large, but p ≤ n. We find an exact stochastic representation for the distribution of $\hat{\beta} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho(Y_i - X_i'\beta)$ at fixed p and n under various assumptions on the objective function ρ and our statistical model. A scalar random variable whose deterministic limit $r_\rho(\kappa)$ can be studied when p/n → κ > 0 plays a central role in this representation. We discover a nonlinear system of two deterministic equations that characterizes $r_\rho(\kappa)$. Interestingly, the system shows that $r_\rho(\kappa)$ depends on ρ through proximal mappings of ρ as well as various aspects of the statistical model underlying our study. Several surprising results emerge. In particular, we show that, when p/n is large enough, least squares becomes preferable to least absolute deviations for double-exponential errors.
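
To make the M-estimation objective concrete, the sketch below evaluates it in the simplest possible case, an intercept-only model, where the minimizers are known in closed form. It illustrates the definition only, not the high-dimensional regime the abstract studies. The helper name `m_objective` and the sample values are assumptions for illustration.

```python
from statistics import mean, median


def m_objective(beta, ys, rho):
    """Sum of rho(residual) for an intercept-only model y_i = beta + error_i."""
    return sum(rho(y - beta) for y in ys)


def square(r):
    return r * r


# Toy sample with one large outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# With rho(r) = r^2 the minimizer is the sample mean (least squares).
beta_ls = mean(ys)

# With rho(r) = |r| the minimizer is the sample median (least absolute deviations).
beta_lad = median(ys)
```

The two choices of ρ give very different estimates on this sample, which is why the choice of objective function matters and why its optimal form can depend on the error distribution and, as the abstract argues, on p/n.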


Subjects
Algorithms ; Linear Models ; Stochastic Processes ; Computer Simulation ; Least-Squares Analysis
18.
Proc Natl Acad Sci U S A ; 109(52): 21330-5, 2012 Dec 26.
Article in English | MEDLINE | ID: mdl-23236164

ABSTRACT

In animals, each sequence-specific transcription factor typically binds to thousands of genomic regions in vivo. Our previous studies of 20 transcription factors show that most genomic regions bound at high levels in Drosophila blastoderm embryos are known or probable functional targets, but genomic regions occupied only at low levels have characteristics suggesting that most are not involved in the cis-regulation of transcription. Here we use transgenic reporter gene assays to directly test the transcriptional activity of 104 genomic regions bound at different levels by the 20 transcription factors. Fifteen genomic regions were selected based solely on the DNA occupancy level of the transcription factor Kruppel. Five of the six most highly bound regions drive blastoderm patterns of reporter transcription. In contrast, only one of the nine lowly bound regions drives transcription at this stage and four of them are not detectably active at any stage of embryogenesis. A larger set of 89 genomic regions chosen using criteria designed to identify functional cis-regulatory regions supports the same trend: genomic regions occupied at high levels by transcription factors in vivo drive patterned gene expression, whereas those occupied only at lower levels mostly do not. These results support studies that indicate that the high cellular concentrations of sequence-specific transcription factors drive extensive, low-occupancy, nonfunctional interactions within the accessible portions of the genome.


Subjects
DNA/metabolism ; Drosophila Proteins/metabolism ; Drosophila melanogaster/genetics ; Gene Expression Regulation, Developmental ; Genes, Reporter/genetics ; Transcription Factors/metabolism ; Animals ; Animals, Genetically Modified ; Drosophila Proteins/genetics ; Drosophila melanogaster/embryology ; Embryo, Nonmammalian/metabolism ; Female ; Genome, Insect/genetics ; Kruppel-Like Transcription Factors/metabolism ; Male ; Protein Binding/genetics
19.
Nat Methods ; 9(6): 609-14, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22522655

ABSTRACT

We evaluated how variations in sequencing depth and other parameters influence interpretation of chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. Using Drosophila melanogaster S2 cells, we generated ChIP-seq data sets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin-state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected. This bias had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP-library complexity at high coverage. Removal of reads originating at the same base reduced false-positives but had little effect on detection sensitivity. Even at mappable-genome coverage depth of ∼1 read per base pair, ∼1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle data sets with deep coverage.
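
The read-filtering step the abstract mentions, removing reads that originate at the same base, can be sketched as below. This is a hypothetical illustration: the function name `dedupe_reads` and the tuple layout are assumptions, and treating reads on opposite strands as distinct is a design choice made here, not a detail stated in the abstract.

```python
def dedupe_reads(reads):
    """Keep the first read seen at each (chromosome, start, strand) position.

    reads: iterable of (chrom, start, strand) tuples; order is preserved.
    """
    seen = set()
    kept = []
    for chrom, start, strand in reads:
        key = (chrom, start, strand)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Toy input: the second read duplicates the first and is dropped.
reads = [
    ("chr2L", 100, "+"),
    ("chr2L", 100, "+"),
    ("chr2L", 100, "-"),
    ("chr2L", 101, "+"),
]
unique_reads = dedupe_reads(reads)
```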


Subjects
Chromatin Immunoprecipitation/methods ; Chromatin/chemistry ; Algorithms ; Animals ; Chromatin Immunoprecipitation/standards ; Drosophila Proteins/genetics ; Drosophila melanogaster ; False Positive Reactions ; Gene Library ; High-Throughput Nucleotide Sequencing ; Histone-Lysine N-Methyltransferase/genetics ; Oligonucleotide Array Sequence Analysis ; Repressor Proteins/genetics ; Sensitivity and Specificity
20.
Proc Natl Acad Sci U S A ; 108(50): 19867-72, 2011 Dec 13.
Article in English | MEDLINE | ID: mdl-22135461

ABSTRACT

Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.


Subjects
RNA, Messenger/genetics ; Sequence Analysis, RNA/methods ; Animals ; Computer Simulation ; Databases, Nucleic Acid ; Drosophila melanogaster/genetics ; Exons/genetics ; Linear Models ; Protein Isoforms/genetics ; Protein Isoforms/metabolism ; RNA, Messenger/metabolism ; Software