Results 1 - 20 of 78
1.
Cell ; 187(9): 2336-2341.e5, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38582080

ABSTRACT

The Genome Aggregation Database (gnomAD), widely recognized as the gold-standard reference map of human genetic variation, has largely overlooked tandem repeat (TR) expansions, despite the fact that TRs constitute ∼6% of our genome and are linked to over 50 human diseases. Here, we introduce the TR-gnomAD (https://wlcb.oit.uci.edu/TRgnomAD), a biobank-scale reference of 0.86 million TRs derived from 338,963 whole-genome sequencing (WGS) samples of diverse ancestries (39.5% non-European samples). TR-gnomAD offers critical insights into ancestry-specific disease prevalence using disparities in TR unit number frequencies among ancestries. Moreover, TR-gnomAD is able to differentiate between common, presumably benign TR expansions, which are prevalent in TR-gnomAD, from those potentially pathogenic TR expansions, which are found more frequently in disease groups than within TR-gnomAD. Together, TR-gnomAD is an invaluable resource for researchers and physicians to interpret TR expansions in individuals with genetic diseases.


Subject(s)
Genome, Human , Tandem Repeat Sequences , Humans , Tandem Repeat Sequences/genetics , Whole Genome Sequencing , Databases, Genetic , DNA Repeat Expansion/genetics , Genome-Wide Association Study
2.
Cell ; 173(4): 1014-1030.e17, 2018 05 03.
Article in English | MEDLINE | ID: mdl-29727661

ABSTRACT

Tools to understand how the spliceosome functions in vivo have lagged behind advances in the structural biology of the spliceosome. Here, methods are described to globally profile spliceosome-bound pre-mRNA, intermediates, and spliced mRNA at nucleotide resolution. These tools are applied to three yeast species that span 600 million years of evolution. The sensitivity of the approach enables the detection of canonical and non-canonical events, including interrupted, recursive, and nested splicing. This application of statistical modeling uncovers independent roles for the size and position of the intron and the number of introns per transcript in substrate progression through the two catalytic stages. These include species-specific inputs suggestive of spliceosome-transcriptome coevolution. Further investigations reveal the ATP-dependent discard of numerous endogenous substrates after spliceosome assembly in vivo and connect this discard to intron retention, a form of splicing regulation. Spliceosome profiling is a quantitative, generalizable global technology used to investigate an RNP central to eukaryotic gene expression.


Subject(s)
Ribonucleoproteins, Small Nuclear/metabolism , Saccharomyces cerevisiae Proteins/metabolism , Spliceosomes/metabolism , Adenosine Triphosphate/metabolism , Bayes Theorem , DEAD-box RNA Helicases/genetics , DEAD-box RNA Helicases/metabolism , Immunoprecipitation , RNA Precursors/metabolism , RNA Splicing , RNA Splicing Factors/genetics , RNA Splicing Factors/metabolism , RNA, Fungal/metabolism , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Telomerase/genetics , Telomerase/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
3.
Proc Natl Acad Sci U S A ; 120(6): e2202584120, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36730203

ABSTRACT

Model organisms are instrumental substitutes for human studies to expedite basic, translational, and clinical research. Despite their indispensable role in mechanistic investigation and drug development, molecular congruence of animal models to humans has long been questioned and debated. Little effort has been made for an objective quantification and mechanistic exploration of a model organism's resemblance to humans in terms of molecular response under disease or drug treatment. We hereby propose a framework, namely Congruence Analysis for Model Organisms (CAMO), for transcriptomic response analysis by developing threshold-free differential expression analysis, quantitative concordance/discordance scores incorporating data variabilities, pathway-centric downstream investigation, knowledge retrieval by text mining, and topological gene module detection for hypothesis generation. Instead of a genome-wide vague and dichotomous answer of "poorly" or "greatly" mimicking humans, CAMO assists researchers to numerically quantify congruence, to dissect true cross-species differences from unwanted biological or cohort variabilities, and to visually identify molecular mechanisms and pathway subnetworks that are best or least mimicked by model organisms, which altogether provides foundations for hypothesis generation and subsequent translational decisions.


Subject(s)
Gene Expression Profiling , Transcriptome , Animals , Humans , Genome , Proteomics , Models, Animal
4.
RNA ; 28(6): 808-831, 2022 06.
Article in English | MEDLINE | ID: mdl-35273099

ABSTRACT

Neurons provide a rich setting for studying post-transcriptional control. Here, we investigate the landscape of translational control in neurons and search for mRNA features that explain differences in translational efficiency (TE), considering the interplay between TE, mRNA poly(A)-tail lengths, microRNAs, and neuronal activation. In neurons and brain tissues, TE correlates with tail length, and a few dozen mRNAs appear to undergo cytoplasmic polyadenylation upon light or chemical stimulation. However, the correlation between TE and tail length is modest, explaining <5% of TE variance, and even this modest relationship diminishes when accounting for other mRNA features. Thus, tail length appears to affect TE only minimally. Accordingly, miRNAs, which accelerate deadenylation of their mRNA targets, primarily influence target mRNA levels, with no detectable effect on either steady-state tail lengths or TE. Larger correlates with TE include codon composition and predicted mRNA folding energy. When combined in a model, the identified correlates explain 38%-45% of TE variance. These results provide a framework for considering the relative impact of factors that contribute to translational control in neurons. They indicate that when examined in bulk, translational control in neurons largely resembles that of other types of post-embryonic cells. Thus, detection of more specialized control might require analyses that can distinguish translation occurring in neuronal processes from that occurring in cell bodies.


Subject(s)
MicroRNAs , Gene Expression Regulation , MicroRNAs/genetics , MicroRNAs/metabolism , Neurons/metabolism , Poly A/genetics , Poly A/metabolism , Polyadenylation , Protein Biosynthesis , RNA, Messenger/metabolism
5.
Bioinformatics ; 38(16): 3927-3934, 2022 08 10.
Article in English | MEDLINE | ID: mdl-35758616

ABSTRACT

MOTIVATION: Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models. RESULTS: Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene's expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes. AVAILABILITY AND IMPLEMENTATION: The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Single-Cell Analysis , Software , Single-Cell Analysis/methods , Algorithms , Likelihood Functions , Gene Expression
6.
Bioinformatics ; 38(11): 3126-3127, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35426898

ABSTRACT

SUMMARY: The number of cells measured in single-cell transcriptomic data has grown fast in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the easiest random subsampling is not ideal from the perspective of preserving rare cell types. Therefore, diversity-preserving subsampling is required for fast exploration of cell types in a large-scale dataset. Here, we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. AVAILABILITY AND IMPLEMENTATION: scSampler is implemented in Python and is published under the MIT source license. It can be installed by "pip install scsampler" and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler. An R interface is available at: https://github.com/SONGDONGYUAN1994/rscsampler. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Transcriptome , Algorithms , Data Analysis
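
As a rough illustration of what diversity-preserving subsampling means in the abstract above, the sketch below applies greedy farthest-point selection to toy 2-D coordinates, so that an isolated point (standing in for a rare cell type) is kept even in a very small subsample. This is a generic heuristic chosen for illustration only. It is not the scSampler algorithm or its API, and the function name `farthest_point_subsample` and its parameters are hypothetical.

```python
import math
import random


def farthest_point_subsample(points, k, seed=0):
    """Greedy farthest-point subsampling (illustrative, not scSampler).

    Returns the indices of k points chosen so that each new point is the one
    farthest from everything selected so far.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(n)
    chosen = [first]
    chosen_set = {first}
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # Pick the unchosen point that is farthest from the current selection.
        nxt = max((i for i in range(n) if i not in chosen_set),
                  key=nearest.__getitem__)
        chosen.append(nxt)
        chosen_set.add(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen


# Toy data: 50 tightly packed "common" cells and one distant "rare" cell.
cells = [(i * 0.01, 0.0) for i in range(50)] + [(10.0, 10.0)]
print(farthest_point_subsample(cells, 2))
```

With uniform random subsampling, the single distant point would be drawn only occasionally in a subsample of two; the greedy rule selects it by construction. Each step costs a full pass over the data, so this sketch is not tuned for the large-scale datasets the abstract targets.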
7.
Genome Res ; 29(12): 2056-2072, 2019 12.
Article in English | MEDLINE | ID: mdl-31694868

ABSTRACT

Genome-wide accurate identification and quantification of full-length mRNA isoforms is crucial for investigating transcriptional and posttranscriptional regulatory mechanisms of biological phenomena. Despite continuing efforts in developing effective computational tools to identify or assemble full-length mRNA isoforms from second-generation RNA-seq data, it remains a challenge to accurately identify mRNA isoforms from short sequence reads owing to the substantial information loss in RNA-seq experiments. Here, we introduce a novel statistical method, annotation-assisted isoform discovery (AIDE), the first approach that directly controls false isoform discoveries by implementing the testing-based model selection principle. Solving the isoform discovery problem in a stepwise and conservative manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. We evaluate the performance of AIDE based on multiple simulated and real RNA-seq data sets followed by PCR-Sanger sequencing validation. Our results show that AIDE effectively leverages the annotation information to compensate the information loss owing to short read lengths. AIDE achieves the highest precision in isoform discovery and the lowest error rates in isoform abundance estimation, compared with three state-of-the-art methods Cufflinks, SLIDE, and StringTie. As a robust bioinformatics tool for transcriptome analysis, AIDE enables researchers to discover novel transcripts with high confidence.


Subject(s)
Gene Expression Profiling , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , RNA Isoforms , RNA, Messenger , Sequence Analysis, RNA , Humans , RNA Isoforms/biosynthesis , RNA Isoforms/genetics , RNA, Messenger/biosynthesis , RNA, Messenger/genetics
8.
Bioinformatics ; 37(Suppl_1): i358-i366, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252925

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. RESULTS: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. AVAILABILITY AND IMPLEMENTATION: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Algorithms , Sequence Analysis, RNA , Software
9.
Bioinformatics ; 37(9): 1225-1233, 2021 06 09.
Article in English | MEDLINE | ID: mdl-32814973

ABSTRACT

MOTIVATION: Gene clustering is a widely used technique that has enabled computational prediction of unknown gene functions within a species. However, it remains a challenge to refine gene function prediction by leveraging evolutionarily conserved genes in another species. This challenge calls for a new computational algorithm to identify gene co-clusters in two species, so that genes in each co-cluster exhibit similar expression levels in each species and strong conservation between the species. RESULTS: Here, we develop the bipartite tight spectral clustering (BiTSC) algorithm, which identifies gene co-clusters in two species based on gene orthology information and gene expression data. BiTSC novelly implements a formulation that encodes gene orthology as a bipartite network and gene expression data as node covariates. This formulation allows BiTSC to adopt and combine the advantages of multiple unsupervised learning techniques: kernel enhancement, bipartite spectral clustering, consensus clustering, tight clustering and hierarchical clustering. As a result, BiTSC is a flexible and robust algorithm capable of identifying informative gene co-clusters without forcing all genes into co-clusters. Another advantage of BiTSC is that it does not rely on any distributional assumptions. Beyond cross-species gene co-clustering, BiTSC also has wide applications as a general algorithm for identifying tight node co-clusters in any bipartite network with node covariates. We demonstrate the accuracy and robustness of BiTSC through comprehensive simulation studies. In a real data example, we use BiTSC to identify conserved gene co-clusters of Drosophila melanogaster and Caenorhabditis elegans, and we perform a series of downstream analysis to both validate BiTSC and verify the biological significance of the identified co-clusters. AVAILABILITY AND IMPLEMENTATION: The Python package BiTSC is open-access and available at https://github.com/edensunyidan/BiTSC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drosophila melanogaster , Gene Expression Profiling , Algorithms , Animals , Cluster Analysis , Gene Expression
10.
Bioinformatics ; 37(17): 2741-2743, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33532827

ABSTRACT

SUMMARY: With the advance of genomic sequencing techniques, chromatin accessible regions, transcription factor binding sites and epigenetic modifications can be identified at genome-wide scale. Conventional analyses focus on the gene regulation at proximal regions; however, distal regions are usually less focused, largely due to the lack of reliable tools to link these regions to coding genes. In this study, we introduce RAD (Region Associated Differentially expressed genes), a user-friendly web tool to identify both proximal and distal region associated differentially expressed genes (DEGs). With DEGs and genomic regions of interest (gROI) as input, RAD maps the up- and down-regulated genes associated with any gROI and helps researchers to infer the regulatory function of these regions based on the distance of gROI to differentially expressed genes. RAD includes visualization of the results and statistical inference for significance. AVAILABILITY AND IMPLEMENTATION: RAD is implemented with Python 3.7 and run on a Nginx server. RAD is freely available at https://labw.org/rad as online web service. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
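
The idea of associating genomic regions with differentially expressed genes by distance, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not RAD's actual procedure or statistics, which the abstract does not detail. The tuple layouts, the helper names `distance_to_region` and `degs_near_regions`, and the `max_dist` cutoff are all hypothetical.

```python
def distance_to_region(tss, start, end):
    """Distance in bp from a transcription start site to [start, end]; 0 if inside."""
    if tss < start:
        return start - tss
    if tss > end:
        return tss - end
    return 0


def degs_near_regions(degs, regions, max_dist):
    """Count up- and down-regulated genes lying within max_dist bp of any region.

    degs:    iterable of (chromosome, tss_position, "up" or "down")
    regions: iterable of (chromosome, start, end)
    """
    counts = {"up": 0, "down": 0}
    for chrom, tss, direction in degs:
        if any(c == chrom and distance_to_region(tss, s, e) <= max_dist
               for c, s, e in regions):
            counts[direction] += 1
    return counts


# Toy input: one region of interest and four differentially expressed genes.
regions = [("chr1", 1000, 2000)]
degs = [
    ("chr1", 1500, "up"),     # inside the region
    ("chr1", 2500, "down"),   # 500 bp downstream
    ("chr2", 1500, "up"),     # different chromosome
    ("chr1", 50000, "down"),  # far away
]
print(degs_near_regions(degs, regions, max_dist=1000))
```

Repeating the count over a series of increasing `max_dist` values would give a distance profile for proximal versus distal associations; any test of significance would need a background model that this sketch does not attempt.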

11.
PLoS Comput Biol ; 17(6): e1009095, 2021 06.
Article in English | MEDLINE | ID: mdl-34166361

ABSTRACT

The effectiveness of immune responses depends on the precision of stimulus-responsive gene expression programs. Cells specify which genes to express by activating stimulus-specific combinations of stimulus-induced transcription factors (TFs). Their activities are decoded by a gene regulatory strategy (GRS) associated with each response gene. Here, we examined whether the GRSs of target genes may be inferred from stimulus-response (input-output) datasets, which remains an unresolved model-identifiability challenge. We developed a mechanistic modeling framework and computational workflow to determine the identifiability of all possible combinations of synergistic (AND) or non-synergistic (OR) GRSs involving three transcription factors. Considering different sets of perturbations for stimulus-response studies, we found that two thirds of GRSs are easily distinguishable but that substantially more quantitative data is required to distinguish the remaining third. To enhance the accuracy of the inference with timecourse experimental data, we developed an advanced error model that avoids error overestimates by distinguishing between value and temporal error. Incorporating this error model into a Bayesian framework, we show that GRS models can be identified for individual genes by considering multiple datasets. Our analysis rationalizes the allocation of experimental resources by identifying most informative TF stimulation conditions. Applying this computational workflow to experimental data of immune response genes in macrophages, we found that a much greater fraction of genes are combinatorially controlled than previously reported by considering compensation among transcription factors. Specifically, we revealed that a group of known NFκB target genes may also be regulated by IRF3, which is supported by chromatin immuno-precipitation analysis. Our study provides a computational workflow for designing and interpreting stimulus-response gene expression studies to identify underlying gene regulatory strategies and further a mechanistic understanding.


Subject(s)
Gene Regulatory Networks , Models, Biological , Transcription Factors/genetics , Transcription Factors/metabolism , Animals , Bayes Theorem , Cells, Cultured , Chromatin Immunoprecipitation Sequencing , Computational Biology , Computer Simulation , Gene Expression Profiling , Immunity/genetics , Likelihood Functions , Macrophages/metabolism , Mice , Models, Genetic , RNA-Seq
12.
Stat Sci ; 36(1): 89-108, 2021 Feb.
Article in English | MEDLINE | ID: mdl-34305304

ABSTRACT

The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion on existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.

13.
Nucleic Acids Res ; 47(13): e77, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31045217

ABSTRACT

The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structures and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, there are currently no alignment methods specifically designed for comparing multi-track epigenomic signals and detecting common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm that novelly incorporates varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on the real data from the NIH Roadmap Epigenomics project. EpiAlign is able to extract recurrent chromatin state patterns along a single epigenome, and many of these patterns carry cell-type-specific characteristics. EpiAlign can also detect common chromatin state patterns across multiple epigenomes, and it will serve as a useful tool to group and distinguish epigenomic samples based on genome-wide or local chromatin state patterns.


Subject(s)
Chromatin/ultrastructure , Computational Biology/methods , Epigenomics/methods , Sequence Alignment , Algorithms , Base Sequence , Brain Chemistry , Chromatin/genetics , DNA Methylation , Databases, Genetic , Datasets as Topic , Gene Ontology , Humans , Nerve Tissue Proteins/biosynthesis , Nerve Tissue Proteins/chemistry , Nerve Tissue Proteins/genetics , Software
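
To make the notion of a dynamic-programming local alignment over chromatin state sequences more concrete, the sketch below runs the classic Smith-Waterman recurrence on two lists of state labels. It is a textbook baseline with fixed match, mismatch, and gap scores. It does not reproduce EpiAlign's scoring, which according to the abstract also incorporates the lengths and frequencies of chromatin states. The function name, the score values, and the state labels are assumptions made for illustration.

```python
def local_align_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two state sequences.

    a and b may be strings or lists of chromatin state labels. Only two rows
    of the dynamic-programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical chromatin state labels for two genomic regions.
region_1 = ["Quies", "TssA", "Tx", "Tx", "Enh", "Quies"]
region_2 = ["Het", "TssA", "Tx", "Tx", "Enh"]
print(local_align_score(region_1, region_2))
```

The shared run `TssA, Tx, Tx, Enh` dominates the score here. Weighting each state by its run length or genome-wide frequency would be one way to move this baseline toward the behaviour the abstract describes.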
14.
Proc Natl Acad Sci U S A ; 115(5): E1069-E1074, 2018 01 30.
Article in English | MEDLINE | ID: mdl-29339507

ABSTRACT

Genome-wide characterization by next-generation sequencing has greatly improved our understanding of the landscape of epigenetic modifications. Since 2008, whole-genome bisulfite sequencing (WGBS) has become the gold standard for DNA methylation analysis, and a tremendous amount of WGBS data has been generated by the research community. However, the systematic comparison of DNA methylation profiles to identify regulatory mechanisms has yet to be fully explored. Here we reprocessed the raw data of over 500 publicly available Arabidopsis WGBS libraries from various mutant backgrounds, tissue types, and stress treatments and also filtered them based on sequencing depth and efficiency of bisulfite conversion. This enabled us to identify high-confidence differentially methylated regions (hcDMRs) by comparing each test library to over 50 high-quality wild-type controls. We developed statistical and quantitative measurements to analyze the overlapping of DMRs and to cluster libraries based on their effect on DNA methylation. In addition to confirming existing relationships, we revealed unanticipated connections between well-known genes. For instance, MET1 and CMT3 were found to be required for the maintenance of asymmetric CHH methylation at nonoverlapping regions of CMT2 targeted heterochromatin. Our comparative methylome approach has established a framework for extracting biological insights via large-scale comparison of methylomes and can also be adopted for other genomics datasets.


Subject(s)
Arabidopsis/genetics , DNA Methylation , Epigenomics , Gene Expression Regulation, Plant , Cluster Analysis , Computational Biology , CpG Islands , Epigenesis, Genetic , Gene Library , Genome, Plant , Heterochromatin/chemistry , High-Throughput Nucleotide Sequencing , Plants, Genetically Modified , Sequence Analysis, DNA , Sequence Analysis, RNA , Software
15.
Bioinformatics ; 35(14): i41-i50, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510652

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. RESULTS: Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA-seq computational methods based on specific research goals. AVAILABILITY AND IMPLEMENTATION: We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling , RNA, Small Cytoplasmic , Single-Cell Analysis , Reproducibility of Results , Research Design , Sequence Analysis, RNA , Software
16.
Nature ; 512(7515): 445-8, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164755

ABSTRACT

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.


Subject(s)
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Transcriptome/genetics , Animals , Caenorhabditis elegans/embryology , Caenorhabditis elegans/growth & development , Chromatin/genetics , Cluster Analysis , Drosophila melanogaster/growth & development , Gene Expression Regulation, Developmental/genetics , Histones/metabolism , Humans , Larva/genetics , Larva/growth & development , Models, Genetic , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Pupa/genetics , Pupa/growth & development , RNA, Untranslated/genetics , Sequence Analysis, RNA
17.
Nature ; 512(7515): 453-6, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164757

ABSTRACT

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.


Subject(s)
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Evolution, Molecular , Gene Expression Regulation/genetics , Gene Regulatory Networks/genetics , Transcription Factors/metabolism , Animals , Binding Sites , Caenorhabditis elegans/growth & development , Chromatin Immunoprecipitation , Conserved Sequence/genetics , Drosophila melanogaster/growth & development , Gene Expression Regulation, Developmental/genetics , Genome/genetics , Humans , Molecular Sequence Annotation , Nucleotide Motifs/genetics , Organ Specificity/genetics , Transcription Factors/genetics
18.
Nucleic Acids Res ; 45(20): 11821-11836, 2017 Nov 16.
Article in English | MEDLINE | ID: mdl-29040683

ABSTRACT

Translation rate per mRNA molecule correlates positively with mRNA abundance. As a result, protein levels do not scale linearly with mRNA levels, but instead scale with the abundance of mRNA raised to the power of an 'amplification exponent'. Here we show that to quantitate translational control, the translation rate must be decomposed into two components. One, TRmD, depends on the mRNA level and defines the amplification exponent. The other, TRmIND, is independent of mRNA amount and impacts the correlation coefficient between protein and mRNA levels. We show that in Saccharomyces cerevisiae TRmD represents ∼20% of the variance in translation and directs an amplification exponent of 1.20 with a 95% confidence interval [1.14, 1.26]. TRmIND constitutes the remaining ∼80% of the variance in translation and explains ∼5% of the variance in protein expression. We also find that TRmD and TRmIND are preferentially determined by different mRNA sequence features: TRmIND by the length of the open reading frame and TRmD both by a ∼60 nucleotide element that spans the initiating AUG and by codon and amino acid frequency. Our work provides more appropriate estimates of translational control and implies that TRmIND is under different evolutionary selective pressures than TRmD.


Subject(s)
Fungal Gene Expression Regulation , Protein Biosynthesis/genetics , Messenger RNA/genetics , Saccharomyces cerevisiae/genetics , Algorithms , Base Sequence , Codon/genetics , Initiator Codon/genetics , Genetic Models , Open Reading Frames/genetics , Translational Peptide Chain Initiation/genetics , Messenger RNA/metabolism , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
19.
Nucleic Acids Res ; 45(4): 1657-1672, 2017 Feb 28.
Article in English | MEDLINE | ID: mdl-27980097

ABSTRACT

Distinguishing cell states based only on gene expression data remains a challenging task. This is true even for analyses within a species. In cross-species comparisons, the results obtained by different groups have varied widely. Here, we integrate RNA-seq data from more than 40 cell and tissue types of four mammalian species to identify sets of associated genes as indicators for specific cell states in each species. We employ a statistical method, TROM, to identify both protein-coding and non-coding indicators. Next, we map the cell states within each species and also between species using these indicator genes. We recapitulate known phenotypic similarity between related cell and tissue types and reveal molecular basis for their similarity. We also report novel associations between several tissues and cell types with functional support. Moreover, our identified conserved associated genes are found to be a good resource for studying cell differentiation and reprogramming. Lastly, long non-coding RNAs can serve well as associated genes to indicate cell states. We further infer the biological functions of those non-coding associated genes based on their co-expressed protein-coding genes. This study demonstrates that combining statistical modeling with public RNA-seq data can be powerful for improving our understanding of cell identity control.


Subject(s)
Contig Mapping , Molecular Evolution , Gene Expression Profiling , Gene Expression Regulation , Mammals/genetics , Transcriptome , Algorithms , Animals , Cluster Analysis , Computational Biology/methods , Developmental Gene Expression Regulation , Gene Ontology , High-Throughput Nucleotide Sequencing , Humans , Mice , Molecular Sequence Annotation , Multigene Family , Organ Specificity
20.
BMC Genomics ; 18(1): 234, 2017 Mar 16.
Article in English | MEDLINE | ID: mdl-28302059

ABSTRACT

BACKGROUND: We report a statistical study to find correspondence of D. melanogaster and C. elegans developmental stages based on alternative splicing (AS) characteristics of conserved cassette exons using modENCODE RNA-seq data. We identify "stage-associated exons" to capture the AS characteristics of each stage and use these exons to map pairwise stages within and between the two species by an overlap test. RESULTS: Within fly and worm, adjacent developmental stages are mapped to each other, i.e., a strong diagonal pattern is observed as expected, supporting the validity of our approach. Between fly and worm, two parallel mapping patterns are observed between fly early embryos to early larvae and worm life cycle, and between fly late larvae to adults and worm late embryos to adults. We also apply this approach to compare tissues and cells from fly and worm. Findings include the high similarity between fly/worm adults and fly/worm embryos, groupings of fly cell lines, and strong mappings of fly head tissues to worm late embryos and male adults. Gene ontology and KEGG enrichment analyses provide a detailed functional annotation of the identified stage-associated exons, as well as a functional explanation of the observed correspondence map between fly and worm developmental stages. CONCLUSIONS: Our results suggest that AS dynamics of the exon pairs that share similar DNA sequences are informative for finding transcriptomic similarity of biological samples. Our study is innovative in two aspects. First, to our knowledge, our study is the first comprehensive study of AS events in fly and worm developmental stages, tissues, and cells. AS events provide an alternative perspective of transcriptome dynamics, compared to gene expression events. Second, our results do not entirely rely on the information of orthologous genes. Interesting results are also observed for fly and worm cassette exon pairs with DNA sequence similarity but not in orthologous gene pairs.
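
The abstract above maps stages to each other with an "overlap test" on their stage-associated exons but does not say which test is used. The sketch below assumes a one-sided hypergeometric test, a common choice for asking whether two sets drawn from the same background share more members than chance would predict. The function name, the toy exon identifiers, and the background size are all hypothetical.

```python
from math import comb


def overlap_p_value(set_a: set, set_b: set, universe_size: int) -> float:
    """One-sided hypergeometric p-value for the overlap of two sets.

    Returns P(overlap >= observed) when set_b is treated as a random draw
    of len(set_b) items from a background of universe_size items, of which
    len(set_a) belong to set_a. This is an assumed stand-in for the
    unspecified overlap test in the abstract.
    """
    a, b = len(set_a), len(set_b)
    if universe_size < max(a, b):
        raise ValueError("universe_size must be at least as large as each set")
    observed = len(set_a & set_b)
    total = comb(universe_size, b)
    tail = sum(
        comb(a, i) * comb(universe_size - a, b - i)
        for i in range(observed, min(a, b) + 1)
    )
    return tail / total


# Toy example: two stages with five associated exons each, two of them shared,
# drawn from a hypothetical background of 20 conserved cassette exons.
stage_1 = {"exon0", "exon1", "exon2", "exon3", "exon4"}
stage_2 = {"exon3", "exon4", "exon5", "exon6", "exon7"}
p = overlap_p_value(stage_1, stage_2, universe_size=20)
print(f"overlap p-value: {p:.4f}")
```

Computing this p-value for every pair of stages yields a matrix in which small values mark well-mapped stage pairs. For a between-species comparison the set elements would be conserved exon pairs rather than single exons.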


Subject(s)
Alternative Splicing , Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Exons , Developmental Gene Expression Regulation , Animals , Caenorhabditis elegans/growth & development , Cluster Analysis , Computational Biology/methods , Drosophila melanogaster/growth & development , Molecular Evolution , Gene Expression Profiling , Gene Ontology , Genome , Genomics/methods , Life Cycle Stages/genetics , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome