Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 28
Filtrar
1.
Proc Natl Acad Sci U S A ; 118(44)2021 11 02.
Artigo em Inglês | MEDLINE | ID: mdl-34716259

RESUMO

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms that perform clustering on several small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly bring down the computational cost of traditional algorithms, including spectral clustering, semidefinite programs, modularity-based methods, likelihood-based methods, etc., without losing accuracy, and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Since most traditional algorithms are accurate, and the corresponding optimization problems are much simpler in small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove the consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.

2.
Proc Natl Acad Sci U S A ; 116(10): 4156-4165, 2019 03 05.
Artigo em Inglês | MEDLINE | ID: mdl-30770453

RESUMO

There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the conditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms-such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks-to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a metaalgorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz-continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the metalearners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our X-learner can be used to target treatment regimes and to shed light on underlying mechanisms. A software package is provided that implements our methods.

3.
Proc Natl Acad Sci U S A ; 116(38): 18943-18950, 2019 09 17.
Artigo em Inglês | MEDLINE | ID: mdl-31484776

RESUMO

Rapid advances in genomic technologies have led to a wealth of diverse data, from which novel discoveries can be gleaned through the application of robust statistical and computational methods. Here, we describe GeneFishing, a semisupervised computational approach to reconstruct context-specific portraits of biological processes by leveraging gene-gene coexpression information. GeneFishing incorporates multiple high-dimensional statistical ideas, including dimensionality reduction, clustering, subsampling, and results aggregation, to produce robust results. To illustrate the power of our method, we applied it using 21 genes involved in cholesterol metabolism as "bait" to "fish out" (or identify) genes not previously identified as being connected to cholesterol metabolism. Using simulation and real datasets, we found that the results obtained through GeneFishing were more interesting for our study than those provided by related gene prioritization methods. In particular, application of GeneFishing to the GTEx liver RNA sequencing (RNAseq) data not only reidentified many known cholesterol-related genes, but also pointed to glyoxalase I (GLO1) as a gene implicated in cholesterol metabolism. In a follow-up experiment, we found that GLO1 knockdown in human hepatoma cell lines increased levels of cellular cholesterol ester, validating a role for GLO1 in cholesterol metabolism. In addition, we performed pantissue analysis by applying GeneFishing on various tissues and identified many potential tissue-specific cholesterol metabolism-related genes. GeneFishing appears to be a powerful tool for identifying related components of complex biological systems and may be used across a wide range of applications.


Assuntos
Fenômenos Biológicos/genética , Biologia Computacional/métodos , Perfilação da Expressão Gênica , Genômica/métodos , Carcinoma Hepatocelular/genética , Carcinoma Hepatocelular/metabolismo , Linhagem Celular Tumoral , Colesterol/metabolismo , Bases de Dados Genéticas , Humanos , Lactoilglutationa Liase/genética , Metabolismo dos Lipídeos/genética , Especificidade de Órgãos/genética , Reprodutibilidade dos Testes , Fluxo de Trabalho
4.
Proc Natl Acad Sci U S A ; 116(3): 900-908, 2019 01 15.
Artigo em Inglês | MEDLINE | ID: mdl-30598455

RESUMO

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveal false-positive rates of at least 70%. We used the pregrastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.


Assuntos
Proteínas de Drosophila , Embrião não Mamífero/embriologia , Desenvolvimento Embrionário/fisiologia , Elementos Facilitadores Genéticos/fisiologia , Análise de Sequência de DNA , Fatores de Transcrição , Animais , Proteínas de Drosophila/genética , Proteínas de Drosophila/metabolismo , Drosophila melanogaster , Estudo de Associação Genômica Ampla , Fatores de Transcrição/genética , Fatores de Transcrição/metabolismo
5.
Proc Natl Acad Sci U S A ; 115(37): 9151-9156, 2018 09 11.
Artigo em Inglês | MEDLINE | ID: mdl-30150379

RESUMO

Projection pursuit is a classical exploratory data analysis method to detect interesting low-dimensional structures in multivariate data. Originally, projection pursuit was applied mostly to data of moderately low dimension. Motivated by contemporary applications, we here study its properties in high-dimensional settings. Specifically, we analyze the asymptotic properties of projection pursuit on structureless multivariate Gaussian data with an identity covariance, as both dimension p and sample size n tend to infinity, with [Formula: see text] Our main results are that (i) if [Formula: see text] then there exist projections whose corresponding empirical cumulative distribution function can approximate any arbitrary distribution; and (ii) if [Formula: see text], not all limiting distributions are possible. However, depending on the value of γ, various non-Gaussian distributions may still be approximated. In contrast, if we restrict to sparse projections, involving only a few of the p variables, then asymptotically all empirical cumulative distribution functions are Gaussian. And (iii) if [Formula: see text], then asymptotically all projections are Gaussian. Some of these results extend to mean-centered sub-Gaussian data and to projections into k dimensions. Hence, in the "small n, large p" setting, unless sparsity is enforced, and regardless of the chosen projection index, projection pursuit may detect an apparent structure that has no statistical significance. Furthermore, our work reveals fundamental limitations on the ability to detect non-Gaussian signals in high-dimensional data, in particular through independent component analysis and related non-Gaussian component analysis.

6.
Nature ; 512(7515): 393-9, 2014 Aug 28.
Artigo em Inglês | MEDLINE | ID: mdl-24670639

RESUMO

Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long non-coding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, with this complexity arising from combinatorial usage of promoters, splice sites and polyadenylation sites.


Assuntos
Drosophila melanogaster/genética , Perfilação da Expressão Gênica , Transcriptoma/genética , Processamento Alternativo/genética , Animais , Drosophila melanogaster/anatomia & histologia , Drosophila melanogaster/citologia , Feminino , Masculino , Anotação de Sequência Molecular , Tecido Nervoso/metabolismo , Especificidade de Órgãos , Poli A/genética , Poliadenilação , Regiões Promotoras Genéticas/genética , RNA Longo não Codificante/genética , RNA Mensageiro/genética , RNA Mensageiro/metabolismo , Caracteres Sexuais , Estresse Fisiológico/genética
7.
Nature ; 512(7515): 445-8, 2014 Aug 28.
Artigo em Inglês | MEDLINE | ID: mdl-25164755

RESUMO

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.


Assuntos
Caenorhabditis elegans/genética , Drosophila melanogaster/genética , Perfilação da Expressão Gênica , Transcriptoma/genética , Animais , Caenorhabditis elegans/embriologia , Caenorhabditis elegans/crescimento & desenvolvimento , Cromatina/genética , Análise por Conglomerados , Drosophila melanogaster/crescimento & desenvolvimento , Regulação da Expressão Gênica no Desenvolvimento/genética , Histonas/metabolismo , Humanos , Larva/genética , Larva/crescimento & desenvolvimento , Modelos Genéticos , Anotação de Sequência Molecular , Regiões Promotoras Genéticas/genética , Pupa/genética , Pupa/crescimento & desenvolvimento , RNA não Traduzido/genética , Análise de Sequência de RNA
8.
Nature ; 512(7515): 453-6, 2014 Aug 28.
Artigo em Inglês | MEDLINE | ID: mdl-25164757

RESUMO

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.


Assuntos
Caenorhabditis elegans/genética , Drosophila melanogaster/genética , Evolução Molecular , Regulação da Expressão Gênica/genética , Redes Reguladoras de Genes/genética , Fatores de Transcrição/metabolismo , Animais , Sítios de Ligação , Caenorhabditis elegans/crescimento & desenvolvimento , Imunoprecipitação da Cromatina , Sequência Conservada/genética , Drosophila melanogaster/crescimento & desenvolvimento , Regulação da Expressão Gênica no Desenvolvimento/genética , Genoma/genética , Humanos , Anotação de Sequência Molecular , Motivos de Nucleotídeos/genética , Especificidade de Órgãos/genética , Fatores de Transcrição/genética
9.
Genome Res ; 25(11): 1692-702, 2015 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-26294687

RESUMO

In eukaryotic cells, RNAs exist as ribonucleoprotein particles (RNPs). Despite the importance of these complexes in many biological processes, including splicing, polyadenylation, stability, transportation, localization, and translation, their compositions are largely unknown. We affinity-purified 20 distinct RNA-binding proteins (RBPs) from cultured Drosophila melanogaster cells under native conditions and identified both the RNA and protein compositions of these RNP complexes. We identified "high occupancy target" (HOT) RNAs that interact with the majority of the RBPs we surveyed. HOT RNAs encode components of the nonsense-mediated decay and splicing machinery, as well as RNA-binding and translation initiation proteins. The RNP complexes contain proteins and mRNAs involved in RNA binding and post-transcriptional regulation. Genes with the capacity to produce hundreds of mRNA isoforms, ultracomplex genes, interact extensively with heterogeneous nuclear ribonuclear proteins (hnRNPs). Our data are consistent with a model in which subsets of RNPs include mRNA and protein products from the same gene, indicating the widespread existence of auto-regulatory RNPs. From the simultaneous acquisition and integrative analysis of protein and RNA constituents of RNPs, we identify extensive cross-regulatory and hierarchical interactions in post-transcriptional control.


Assuntos
Proteínas de Drosophila/metabolismo , Drosophila melanogaster/genética , Regulação da Expressão Gênica , Proteínas de Ligação a RNA/metabolismo , Animais , Proteínas de Drosophila/genética , Ribonucleoproteínas Nucleares Heterogêneas/genética , Ribonucleoproteínas Nucleares Heterogêneas/metabolismo , Splicing de RNA/genética , RNA Mensageiro/genética , RNA Mensageiro/metabolismo , Proteínas de Ligação a RNA/genética , Análise de Sequência de RNA , Transfecção
10.
Proc Natl Acad Sci U S A ; 117(47): 29257-29259, 2020 11 24.
Artigo em Inglês | MEDLINE | ID: mdl-33188088
11.
Genome Res ; 24(7): 1086-101, 2014 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-24985912

RESUMO

We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on "stage-associated genes" that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmental program in fly. Our results constitute the first comprehensive comparison between D. melanogaster and C. elegans developmental time courses and provide new insights into similarities in their development . We use an analogous approach to compare tissues and cells from fly and worm. Findings include strong transcriptome similarity of fly cell lines, clustering of fly adult tissues by origin regardless of sex and age, and clustering of worm tissues and dissected cells by developmental stage. Gene ontology analysis supports our results and gives a detailed functional annotation of different stages, tissues and cells. Finally, we show that standard correlation analyses could not effectively detect the mappings found by our method.


Assuntos
Caenorhabditis elegans/genética , Drosophila melanogaster/genética , Desenvolvimento Embrionário/genética , Regulação da Expressão Gênica no Desenvolvimento , Animais , Caenorhabditis elegans/embriologia , Caenorhabditis elegans/crescimento & desenvolvimento , Análise por Conglomerados , Biologia Computacional/métodos , Drosophila melanogaster/embriologia , Drosophila melanogaster/crescimento & desenvolvimento , Feminino , Perfilação da Expressão Gênica , Sequenciamento de Nucleotídeos em Larga Escala , Estágios do Ciclo de Vida/genética , Masculino , Anotação de Sequência Molecular , Especificidade de Órgãos/genética , Transcriptoma
12.
Nature ; 471(7339): 473-9, 2011 Mar 24.
Artigo em Inglês | MEDLINE | ID: mdl-21179090

RESUMO

Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.


Assuntos
Drosophila melanogaster/crescimento & desenvolvimento , Drosophila melanogaster/genética , Perfilação da Expressão Gênica , Regulação da Expressão Gênica no Desenvolvimento/genética , Transcrição Gênica/genética , Processamento Alternativo/genética , Animais , Sequência de Bases , Proteínas de Drosophila/genética , Drosophila melanogaster/embriologia , Éxons/genética , Feminino , Genes de Insetos/genética , Genoma de Inseto/genética , Masculino , MicroRNAs/genética , Análise de Sequência com Séries de Oligonucleotídeos , Isoformas de Proteínas/genética , Edição de RNA/genética , RNA Mensageiro/análise , RNA Mensageiro/genética , Pequeno RNA não Traduzido/análise , Pequeno RNA não Traduzido/genética , Análise de Sequência , Caracteres Sexuais
13.
Proc Natl Acad Sci U S A ; 110(36): 14563-8, 2013 Sep 03.
Artigo em Inglês | MEDLINE | ID: mdl-23954907

RESUMO

We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions.


Assuntos
Algoritmos , Funções Verossimilhança , Análise de Regressão , Simulação por Computador
14.
Proc Natl Acad Sci U S A ; 110(36): 14557-62, 2013 Sep 03.
Artigo em Inglês | MEDLINE | ID: mdl-23954908

RESUMO

We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both large, but p ≤ n. We find an exact stochastic representation for the distribution of ß = argmin(ß∈ℝ(p)) Σ(i=1)(n) ρ(Y(i) - X(i')ß) at fixed p and n under various assumptions on the objective function ρ and our statistical model. A scalar random variable whose deterministic limit rρ(κ) can be studied when p/n → κ > 0 plays a central role in this representation. We discover a nonlinear system of two deterministic equations that characterizes rρ(κ). Interestingly, the system shows that rρ(κ) depends on ρ through proximal mappings of ρ as well as various aspects of the statistical model underlying our study. Several surprising results emerge. In particular, we show that, when p/n is large enough, least squares becomes preferable to least absolute deviations for double-exponential errors.


Assuntos
Algoritmos , Modelos Lineares , Processos Estocásticos , Simulação por Computador , Análise dos Mínimos Quadrados
15.
Nat Methods ; 9(6): 609-14, 2012 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-22522655

RESUMO

We evaluated how variations in sequencing depth and other parameters influence interpretation of chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. Using Drosophila melanogaster S2 cells, we generated ChIP-seq data sets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin-state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected. This bias had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP-library complexity at high coverage. Removal of reads originating at the same base reduced false-positives but had little effect on detection sensitivity. Even at mappable-genome coverage depth of ∼1 read per base pair, ∼1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle data sets with deep coverage.


Assuntos
Imunoprecipitação da Cromatina/métodos , Cromatina/química , Algoritmos , Animais , Imunoprecipitação da Cromatina/normas , Proteínas de Drosophila/genética , Drosophila melanogaster , Reações Falso-Positivas , Biblioteca Gênica , Sequenciamento de Nucleotídeos em Larga Escala , Histona-Lisina N-Metiltransferase/genética , Análise de Sequência com Séries de Oligonucleotídeos , Proteínas Repressoras/genética , Sensibilidade e Especificidade
16.
Methods ; 68(1): 38-47, 2014 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-24636835

RESUMO

modENCODE was a 5year NHGRI funded project (2007-2012) to map the function of every base in the genomes of worms and flies characterizing positions of modified histones and other chromatin marks, origins of DNA replication, RNA transcripts and the transcription factor binding sites that control gene expression. Here we describe the Drosophila modENCODE datasets and how best to access and use them for genome wide and individual gene studies.


Assuntos
Replicação do DNA/genética , Bases de Dados Genéticas , Biologia do Desenvolvimento/métodos , Animais , Cromatina/genética , Mineração de Dados , Drosophila melanogaster/genética , Drosophila melanogaster/crescimento & desenvolvimento , Genoma de Inseto
17.
Proc Natl Acad Sci U S A ; 109(52): 21330-5, 2012 Dec 26.
Artigo em Inglês | MEDLINE | ID: mdl-23236164

RESUMO

In animals, each sequence-specific transcription factor typically binds to thousands of genomic regions in vivo. Our previous studies of 20 transcription factors show that most genomic regions bound at high levels in Drosophila blastoderm embryos are known or probable functional targets, but genomic regions occupied only at low levels have characteristics suggesting that most are not involved in the cis-regulation of transcription. Here we use transgenic reporter gene assays to directly test the transcriptional activity of 104 genomic regions bound at different levels by the 20 transcription factors. Fifteen genomic regions were selected based solely on the DNA occupancy level of the transcription factor Kruppel. Five of the six most highly bound regions drive blastoderm patterns of reporter transcription. In contrast, only one of the nine lowly bound regions drives transcription at this stage and four of them are not detectably active at any stage of embryogenesis. A larger set of 89 genomic regions chosen using criteria designed to identify functional cis-regulatory regions supports the same trend: genomic regions occupied at high levels by transcription factors in vivo drive patterned gene expression, whereas those occupied only at lower levels mostly do not. These results support studies that indicate that the high cellular concentrations of sequence-specific transcription factors drive extensive, low-occupancy, nonfunctional interactions within the accessible portions of the genome.


Assuntos
DNA/metabolismo , Proteínas de Drosophila/metabolismo , Drosophila melanogaster/genética , Regulação da Expressão Gênica no Desenvolvimento , Genes Reporter/genética , Fatores de Transcrição/metabolismo , Animais , Animais Geneticamente Modificados , Proteínas de Drosophila/genética , Drosophila melanogaster/embriologia , Embrião não Mamífero/metabolismo , Feminino , Genoma de Inseto/genética , Fatores de Transcrição Kruppel-Like/metabolismo , Masculino , Ligação Proteica/genética
18.
Genome Res ; 21(2): 182-92, 2011 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-21177961

RESUMO

Core promoters are critical regions for gene regulation in higher eukaryotes. However, the boundaries of promoter regions, the relative rates of initiation at the transcription start sites (TSSs) distributed within them, and the functional significance of promoter architecture remain poorly understood. We produced a high-resolution map of promoters active in the Drosophila melanogaster embryo by integrating data from three independent and complementary methods: 21 million cap analysis of gene expression (CAGE) tags, 1.2 million RNA ligase mediated rapid amplification of cDNA ends (RLM-RACE) reads, and 50,000 cap-trapped expressed sequence tags (ESTs). We defined 12,454 promoters of 8037 genes. Our analysis indicates that, due to non-promoter-associated RNA background signal, previous studies have likely overestimated the number of promoter-associated CAGE clusters by fivefold. We show that TSS distributions form a complex continuum of shapes, and that promoters active in the embryo and adult have highly similar shapes in 95% of cases. This suggests that these distributions are generally determined by static elements such as local DNA sequence and are not modulated by dynamic signals such as histone modifications. Transcription factor binding motifs are differentially enriched as a function of promoter shape, and peaked promoter shape is correlated with both temporal and spatial regulation of gene expression. Our results contribute to the emerging view that core promoters are functionally diverse and control patterning of gene expression in Drosophila and mammals.


Assuntos
Biologia Computacional , Drosophila melanogaster/genética , Genoma de Inseto/genética , Regiões Promotoras Genéticas , Regiões 3' não Traduzidas/genética , Animais , Mapeamento Cromossômico , Drosophila melanogaster/embriologia , Etiquetas de Sequências Expressas , Perfilação da Expressão Gênica , Regulação da Expressão Gênica/genética , Estudo de Associação Genômica Ampla , Sítio de Iniciação de Transcrição
19.
Proc Natl Acad Sci U S A ; 108(50): 19867-72, 2011 Dec 13.
Artigo em Inglês | MEDLINE | ID: mdl-22135461

RESUMO

Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.


Assuntos
RNA Mensageiro/genética , Análise de Sequência de RNA/métodos , Animais , Simulação por Computador , Bases de Dados de Ácidos Nucleicos , Drosophila melanogaster/genética , Éxons/genética , Modelos Lineares , Isoformas de Proteínas/genética , Isoformas de Proteínas/metabolismo , RNA Mensageiro/metabolismo , Software
20.
Proc Natl Acad Sci U S A ; 106(50): 21068-73, 2009 Dec 15.
Artigo em Inglês | MEDLINE | ID: mdl-19934050

RESUMO

Prompted by the increasing interest in networks in many fields, we present an attempt at unifying points of view and analyses of these objects coming from the social sciences, statistics, probability and physics communities. We apply our approach to the Newman-Girvan modularity, widely used for "community" detection, among others. Our analysis is asymptotic but we show by simulation and application to real examples that the theory is a reasonable guide to practice.


Assuntos
Teoria de Sistemas , Redes Comunitárias , Simulação por Computador , Humanos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA