Results 1 - 20 of 34
1.
Genome Res ; 30(7): 1047-1059, 2020 07.
Article in English | MEDLINE | ID: mdl-32759341

ABSTRACT

We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.


Subject(s)
Transcription, Genetic , Cell Line , Endothelial Cells/metabolism , Epithelial Cells/metabolism , Female , Gene Expression Profiling , Gynecomastia/genetics , Gynecomastia/metabolism , Humans , Male , Mesoderm/cytology , Mesoderm/metabolism , Neoplasms/genetics , Organ Specificity , Sequence Analysis, RNA
2.
Nature ; 543(7644): 199-204, 2017 03 09.
Article in English | MEDLINE | ID: mdl-28241135

ABSTRACT

Long non-coding RNAs (lncRNAs) are largely heterogeneous and functionally uncharacterized. Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5' ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classification of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.


Subject(s)
Databases, Genetic , RNA, Long Noncoding/chemistry , RNA, Long Noncoding/genetics , Transcriptome/genetics , Cells, Cultured , Conserved Sequence/genetics , Datasets as Topic , Enhancer Elements, Genetic/genetics , Epigenesis, Genetic , Gene Expression Profiling , Gene Expression Regulation , Genome, Human/genetics , Genome-Wide Association Study , Genomics , Humans , Internet , Molecular Sequence Annotation , Organ Specificity/genetics , Polymorphism, Single Nucleotide , Promoter Regions, Genetic/genetics , Quantitative Trait Loci/genetics , RNA Stability , RNA, Messenger/genetics
3.
Nature ; 515(7527): 355-64, 2014 Nov 20.
Article in English | MEDLINE | ID: mdl-25409824

ABSTRACT

The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.


Subject(s)
Genome/genetics , Genomics , Mice/genetics , Molecular Sequence Annotation , Animals , Cell Lineage/genetics , Chromatin/genetics , Chromatin/metabolism , Conserved Sequence/genetics , DNA Replication/genetics , Deoxyribonuclease I/metabolism , Gene Expression Regulation/genetics , Gene Regulatory Networks/genetics , Genome-Wide Association Study , Humans , RNA/genetics , Regulatory Sequences, Nucleic Acid/genetics , Species Specificity , Transcription Factors/metabolism , Transcriptome/genetics
4.
Nature ; 512(7515): 445-8, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164755

ABSTRACT

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.


Subject(s)
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Gene Expression Profiling , Transcriptome/genetics , Animals , Caenorhabditis elegans/embryology , Caenorhabditis elegans/growth & development , Chromatin/genetics , Cluster Analysis , Drosophila melanogaster/growth & development , Gene Expression Regulation, Developmental/genetics , Histones/metabolism , Humans , Larva/genetics , Larva/growth & development , Models, Genetic , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Pupa/genetics , Pupa/growth & development , RNA, Untranslated/genetics , Sequence Analysis, RNA
5.
BMC Biol ; 17(1): 108, 2019 12 30.
Article in English | MEDLINE | ID: mdl-31884969

ABSTRACT

BACKGROUND: Comparative genomics studies are central in identifying the coding and non-coding elements associated with complex traits, and the functional annotation of genomes is a critical step to decipher the genotype-to-phenotype relationships in livestock animals. As part of the Functional Annotation of Animal Genomes (FAANG) action, the FR-AgENCODE project aimed to create reference functional maps of domesticated animals by profiling the landscape of transcription (RNA-seq), chromatin accessibility (ATAC-seq) and conformation (Hi-C) in species representing ruminants (cattle, goat), monogastrics (pig) and birds (chicken), using three target samples related to metabolism (liver) and immunity (CD4+ and CD8+ T cells). RESULTS: RNA-seq assays considerably extended the available catalog of annotated transcripts and identified differentially expressed genes with unknown function, including new syntenic lncRNAs. ATAC-seq highlighted an enrichment for transcription factor binding sites in differentially accessible regions of the chromatin. Comparative analyses revealed a core set of conserved regulatory regions across species. Topologically associating domains (TADs) and epigenetic A/B compartments annotated from Hi-C data were consistent with RNA-seq and ATAC-seq data. Multi-species comparisons showed that conserved TAD boundaries had stronger insulation properties than species-specific ones and that the genomic distribution of orthologous genes in A/B compartments was significantly conserved across species. CONCLUSIONS: We report the first multi-species and multi-assay genome annotation results obtained by a FAANG project. Beyond the generation of reference annotations and the confirmation of previous findings on model animals, the integrative analysis of data from multiple assays and species sheds new light on the multi-scale selective pressure shaping genome organization from birds to mammals. Overall, these results emphasize the value of FAANG for research on domesticated animals and reinforce the importance of future meta-analyses of the reference datasets being generated by this community on different species.


Subject(s)
Animals, Domestic/genetics , Chromatin/genetics , Molecular Sequence Annotation , Transcriptome , Animals , Cattle , Chickens , Goats , Phylogeny , Sus scrofa
6.
RNA Biol ; 16(9): 1190-1204, 2019 09.
Article in English | MEDLINE | ID: mdl-31120323

ABSTRACT

To investigate the dynamics of circRNA expression in pig testes, we designed specific strategies to individually study circRNA production from intron lariats and circRNAs originating from back-splicing of two exons. By applying these methods on seven Total-RNA-seq datasets sampled during the testicular puberty, we detected 126 introns in 114 genes able to produce circRNAs and 5,236 exonic circRNAs produced by 2,516 genes. Comparing our RNA-seq datasets to datasets from the literature (embryonic cortex and postnatal muscle stages) revealed highly abundant intronic and exonic circRNAs in one sample each in pubertal testis and embryonic cortex, respectively. This abundance was due to higher production of circRNA by the same genes in comparison to other testis samples, rather than to the recruitment of new genes. No global relationship between circRNA and mRNA production was found. We propose ExoCirc-9244 (SMARCA5) as a marker of a particular stage in testis, which is characterized by a very low plasma estradiol level and a high abundance of circRNA in testis. We hypothesize that the abundance of testicular circRNA is associated with an abrupt switch of the cellular process to overcome a particular challenge that may have arisen in the early stages of steroid production. We also hypothesize that, in certain circumstances, isoforms and circular transcripts from different genes share functions and that a global regulation of circRNA production is established. Our data indicate that this massive production of circRNAs is much more related to the structure of the genes generating circRNAs than to their function. 
Abbreviations: PE: Paired Ends; CR: chimeric Read; SR: Split Read; circRNA: circular RNA; NC: non conventional; ExoCirc-RNA: exonic circular RNA; IntroLCirc-: name of a porcine intronic lariat circRNA; ExoCirc-: name of a porcine exonic circRNA; IntronCircle-: name of a porcine intron circle; sisRNA: stable intronic sequence RNA; P: porcine breed Pietrain; LW: porcine breed Large White; RT: reverse transcription/reverse transcriptase; Total-RNA-seq: RNA-seq obtained from total RNA after ribosomal depletion; mRNA-seq: RNA-seq of poly(A) transcripts; TPM: transcripts per million; CR-PM: chimeric reads per million; RBP: RNA binding protein; miRNA: micro RNA; E2: estradiol; DHT: dihydrotestosterone.


Subject(s)
Gene Expression Regulation , RNA, Circular/genetics , Swine/genetics , Transcriptome/genetics , Animals , Embryo, Mammalian/metabolism , Exons/genetics , Introns/genetics , Male , Muscles/metabolism , RNA, Circular/metabolism , Reproducibility of Results , Swine/embryology , Testis/metabolism
7.
Nature ; 489(7414): 101-8, 2012 Sep 06.
Article in English | MEDLINE | ID: mdl-22955620

ABSTRACT

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.


Subject(s)
DNA/genetics , Encyclopedias as Topic , Genome, Human/genetics , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid/genetics , Transcription, Genetic/genetics , Transcriptome/genetics , Alleles , Cell Line , DNA, Intergenic/genetics , Enhancer Elements, Genetic , Exons/genetics , Gene Expression Profiling , Genes/genetics , Genomics , Humans , Polyadenylation/genetics , Protein Isoforms/genetics , RNA/biosynthesis , RNA/genetics , RNA Editing/genetics , RNA Splicing/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA
8.
BMC Genomics ; 18(1): 7, 2017 01 03.
Article in English | MEDLINE | ID: mdl-28049418

ABSTRACT

BACKGROUND: Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene rather than junction level assessment. RESULTS: Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on this data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which were recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which have the potential to encode a chimeric protein for which we hypothesized a new role. Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved. CONCLUSIONS: ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimeras, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.


Subject(s)
Oncogene Proteins, Fusion , Recombination, Genetic , Software , Transcription, Genetic , Animals , Computational Biology/methods , Computer Simulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans , Mice , Reproducibility of Results , Sequence Analysis, RNA
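
The junction-level benchmark mentioned in the abstract of this record rests on two standard ratios, sensitivity and precision. A minimal sketch in Python follows. It assumes that each chimeric junction is represented as a hypothetical (donor, acceptor) coordinate pair and that a prediction counts as correct only on an exact match. It is not ChimPipe's own evaluation code, and the coordinates are made up for illustration.

```python
def junction_metrics(predicted, validated):
    """Return (sensitivity, precision) for exact junction matches."""
    predicted, validated = set(predicted), set(validated)
    true_positives = len(predicted & validated)
    # Sensitivity: fraction of validated junctions that were recovered.
    sensitivity = true_positives / len(validated) if validated else 0.0
    # Precision: fraction of predicted junctions that are validated.
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Made-up junctions as (donor "chrom:pos", acceptor "chrom:pos") pairs.
validated = {("chr1:1000", "chr2:5000"), ("chr3:200", "chr3:9000")}
predicted = {
    ("chr1:1000", "chr2:5000"),  # exact match
    ("chr4:10", "chr5:20"),      # false positive
    ("chr3:200", "chr3:9001"),   # off by one base, so not an exact match
}
sensitivity, precision = junction_metrics(predicted, validated)
```

Scoring at the junction level is stricter than scoring at the gene level: the third prediction above would count as a hit if only the gene pair were compared.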
9.
Genome Res ; 24(2): 212-26, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24265505

ABSTRACT

Chronic lymphocytic leukemia (CLL) has heterogeneous clinical and biological behavior. Whole-genome and -exome sequencing has contributed to the characterization of the mutational spectrum of the disease, but the underlying transcriptional profile is still poorly understood. We have performed deep RNA sequencing in different subpopulations of normal B-lymphocytes and CLL cells from a cohort of 98 patients, and characterized the CLL transcriptional landscape with unprecedented resolution. We detected thousands of transcriptional elements differentially expressed between the CLL and normal B cells, including protein-coding genes, noncoding RNAs, and pseudogenes. Transposable elements are globally derepressed in CLL cells. In addition, two thousand genes, most of which are not differentially expressed, exhibit CLL-specific splicing patterns. Genes involved in metabolic pathways showed higher expression in CLL, while genes related to spliceosome, proteasome, and ribosome were among the most down-regulated in CLL. Clustering of the CLL samples according to RNA-seq derived gene expression levels unveiled two robust molecular subgroups, C1 and C2. C1/C2 subgroups and the mutational status of the immunoglobulin heavy variable (IGHV) region were the only independent variables in predicting time to treatment in a multivariate analysis with main clinico-biological features. This subdivision was validated in an independent cohort of patients monitored through DNA microarrays. Further analysis shows that B-cell receptor (BCR) activation in the microenvironment of the lymph node may be at the origin of the C1/C2 differences.


Subject(s)
B-Lymphocytes , Gene Expression Regulation, Neoplastic , High-Throughput Nucleotide Sequencing , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Aged , Base Sequence , Female , Gene Expression Profiling , Humans , Immunoglobulin Variable Region , Leukemia, Lymphocytic, Chronic, B-Cell/pathology , Male , Middle Aged , Mutation , Ribosomes/genetics , Spliceosomes/genetics
10.
Genet Sel Evol ; 49(1): 6, 2017 01 10.
Article in English | MEDLINE | ID: mdl-28073357

ABSTRACT

BACKGROUND: Improving functional annotation of the chicken genome is a key challenge in bridging the gap between genotype and phenotype. Among all transcribed regions, long noncoding RNAs (lncRNAs) are a major component of the transcriptome and its regulation, and whole-transcriptome sequencing (RNA-Seq) has greatly improved their identification and characterization. We performed an extensive profiling of the lncRNA transcriptome in the chicken liver and adipose tissue by RNA-Seq. We focused on these two tissues because of their importance in various economic traits for which energy storage and mobilization play key roles and also because of their high cell homogeneity. To predict lncRNAs, we used a recently developed tool called FEELnc, which also classifies them with respect to their distance and strand orientation to the closest protein-coding genes. Moreover, to confidently identify the genes/transcripts expressed in each tissue (a complex task for weakly expressed molecules such as lncRNAs), we probed a particularly large number of biological replicates (16 per tissue) compared to common multi-tissue studies with a larger set of tissues but less sampling. RESULTS: We predicted 2193 lncRNA genes, among which 1670 were robustly expressed across replicates in the liver and/or adipose tissue and which were classified into 1493 intergenic and 177 intragenic lncRNAs located between and within protein-coding genes, respectively. We observed similar structural features between chickens and mammals, with strong synteny conservation but without sequence conservation. As previously reported, we confirm that lncRNAs have a lower and more tissue-specific expression than mRNAs. Finally, we showed that adjacent lncRNA-mRNA genes in divergent orientation have a higher co-expression level when separated by less than 1 kb compared to more distant divergent pairs. Among these, we highlighted for the first time a novel lncRNA candidate involved in lipid metabolism, lnc_DHCR24, which is highly correlated with the DHCR24 gene that encodes a key enzyme of cholesterol biosynthesis. CONCLUSIONS: We provide a comprehensive lncRNA repertoire in the chicken liver and adipose tissue, which shows interesting patterns of co-expression between mRNAs and lncRNAs. It contributes to improving the structural and functional annotation of the chicken genome and provides a basis for further studies on energy storage and mobilization traits in the chicken.


Subject(s)
Adipose Tissue/metabolism , Chickens/genetics , Liver/metabolism , RNA, Long Noncoding/genetics , Transcriptome , Animals , Chickens/metabolism , Conserved Sequence , Evolution, Molecular , Gene Expression Profiling , Gene Expression Regulation , Genome , Genotype , Humans , Lipid Metabolism/genetics , Open Reading Frames , Organ Specificity , Phenotype , Quantitative Trait Loci , RNA, Antisense , RNA, Long Noncoding/chemistry , RNA, Messenger/genetics
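
The abstract of this record reports on divergent lncRNA-mRNA pairs separated by less than 1 kb. A minimal sketch of how such a pair could be recognized follows. It assumes 1-based inclusive coordinates, a hypothetical `Gene` record, and a reading of "divergent" as two non-overlapping genes on opposite strands transcribed away from each other. FEELnc's actual classification rules are not given in the abstract and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with 1-based inclusive coordinates."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def divergent_distance(a, b):
    """Return the gap in bp if a and b form a divergent pair, else None."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return None
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.end >= right.start:
        return None  # overlapping genes are not handled in this sketch
    # Divergent: the left gene is on "-" and the right gene on "+",
    # so both are transcribed away from the shared intergenic gap.
    if left.strand == "-" and right.strand == "+":
        return right.start - left.end - 1
    return None


# Made-up coordinates for illustration.
lnc = Gene("chr1", 1000, 2000, "-")
mrna = Gene("chr1", 2501, 4000, "+")
gap = divergent_distance(lnc, mrna)
is_close_pair = gap is not None and gap < 1000  # the 1 kb cutoff from the abstract
```

The gap here is measured between the two transcription start sites, since for a divergent pair these are the facing ends of the genes.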
11.
Genome Res ; 22(9): 1616-25, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955974

ABSTRACT

Splicing remains an incompletely understood process. Recent findings suggest that chromatin structure participates in its regulation. Here, we analyze the RNA from subcellular fractions obtained through RNA-seq in the cell line K562. We show that in the human genome, splicing occurs predominantly during transcription. We introduce the coSI measure, based on RNA-seq reads mapping to exon junctions and borders, to assess the degree of splicing completion around internal exons. We show that, as expected, splicing is almost fully completed in cytosolic polyA+ RNA. In chromatin-associated RNA (which includes the RNA that is being transcribed), for 5.6% of exons, the removal of the surrounding introns is fully completed, compared with 0.3% of exons for which no intron-removal has occurred. The remaining exons exist as a mixture of spliced and fewer unspliced molecules, with a median coSI of 0.75. Thus, most RNAs undergo splicing while being transcribed: "co-transcriptional splicing." Consistent with co-transcriptional spliceosome assembly and splicing, we have found significant enrichment of spliceosomal snRNAs in chromatin-associated RNA compared with other cellular RNA fractions and other nonspliceosomal snRNAs. CoSI scores decrease along the gene, pointing to a "first transcribed, first spliced" rule, yet more downstream exons carry other characteristics, favoring rapid, co-transcriptional intron removal. Exons with low coSI values, that is, in the process of being spliced, are enriched with chromatin marks, consistent with a role for chromatin in splicing during transcription. For alternative exons and long noncoding RNAs, splicing tends to occur later, and the latter might remain unspliced in some cases.


Subject(s)
Genome, Human , RNA Splicing , RNA, Long Noncoding/metabolism , Transcription, Genetic , Chromatin/metabolism , Cluster Analysis , Computational Biology/methods , Exons , High-Throughput Nucleotide Sequencing , Humans , RNA/genetics , RNA/metabolism , Sequence Analysis, RNA , Spliceosomes/genetics , Spliceosomes/metabolism , Subcellular Fractions/chemistry
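
The abstract of this record describes coSI as a score built from reads mapping to exon junctions and exon borders, ranging from no intron removal to fully completed splicing. The exact formula is not given in the abstract, so the sketch below is an assumed, simplified ratio and not the published definition: reads spanning an exon-exon junction are treated as evidence of completed splicing, and reads crossing an exon-intron border as evidence of unspliced RNA. The function name and read counts are hypothetical.

```python
def splicing_completion(junction_reads, border_reads):
    """Simplified splicing-completion ratio for one internal exon.

    junction_reads: reads spanning exon-exon junctions (assumed spliced).
    border_reads:   reads crossing exon-intron borders (assumed unspliced).
    Returns a value in [0, 1], or None when the exon has no informative reads.
    """
    total = junction_reads + border_reads
    if total == 0:
        return None
    return junction_reads / total


# Made-up counts: 75 spliced and 25 unspliced reads around one exon.
score = splicing_completion(75, 25)
```

Under this reading, a score of 1 corresponds to an exon whose surrounding introns are fully removed and a score of 0 to an exon for which no intron removal has occurred.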
12.
Genome Res ; 22(9): 1658-67, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955978

ABSTRACT

Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.


Subject(s)
Gene Expression Regulation , Genomics , Transcription Factors/metabolism , Transcription, Genetic , Base Composition , Binding Sites/genetics , Cell Line , Chromatin/genetics , Chromatin/metabolism , Computational Biology/methods , Histones/genetics , Humans , Models, Biological , Promoter Regions, Genetic , Protein Binding/genetics , Transcription Initiation Site
13.
Genome Res ; 22(9): 1775-89, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22955988

ABSTRACT

The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally expressed at lower levels than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.


Subject(s)
Databases, Genetic , RNA, Long Noncoding/genetics , Alternative Splicing , Animals , Cell Nucleus/genetics , Cell Nucleus/metabolism , Cluster Analysis , Evolution, Molecular , Exons , Gene Expression Profiling , Gene Expression Regulation , Histones/metabolism , Humans , Molecular Sequence Annotation , Open Reading Frames , Organ Specificity/genetics , Primates/genetics , RNA Processing, Post-Transcriptional , RNA Splice Sites , RNA, Messenger/genetics , Selection, Genetic , Transcription, Genetic
14.
Nucleic Acids Res ; 41(15): 7220-30, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23761436

ABSTRACT

Although protein recognition of DNA motifs in promoter regions has been traditionally considered as a critical regulatory element in transcription, the location of promoters, and in particular transcription start sites (TSSs), still remains a challenge. Here we perform a comprehensive analysis of putative core promoter sequences relative to non-annotated predicted TSSs along the human genome, which were defined by distinct DNA physical properties implemented in our ProStar computational algorithm. A representative sampling of predicted regions was subjected to extensive experimental validation and analyses. Interestingly, the vast majority proved to be transcriptionally active despite the lack of specific sequence motifs, indicating that physical signaling is indeed able to detect promoter activity beyond conventional TSS prediction methods. Furthermore, highly active regions displayed typical chromatin features associated with promoters of housekeeping genes. Our results make it possible to redefine the promoter signatures and analyze the diversity, evolutionary conservation and dynamic regulation of human core promoters at large scale. Moreover, the present study strongly supports the hypothesis of an ancient regulatory mechanism encoded by the intrinsic physical properties of the DNA that may contribute to the complexity of transcription regulation in the human genome.


Subject(s)
Genome, Human , Promoter Regions, Genetic , Software , Animals , Chromatin/genetics , Computational Biology/methods , Conserved Sequence , Epigenesis, Genetic , Genetic Code , Histones/genetics , Histones/metabolism , Humans , Nucleic Acid Conformation , Sequence Analysis, DNA , Transcription, Genetic
15.
Bioinform Adv ; 4(1): vbad188, 2024.
Article in English | MEDLINE | ID: mdl-38213821

ABSTRACT

Motivation: Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with common diseases. These results include a mix of causal and non-causal variants related through strong linkage disequilibrium (LD, i.e. highly correlated). Fine-mapping methods have been developed to decipher the causal from non-causal variants using GWAS results and LD information, assigning to each variant a probability of being causal. In this field, the PAINTOR program has become a standard, one of its advantages being its ability to take into account functional annotations. This approach requires many pre- and post-processing steps. Here, we developed a Nextflow pipeline called PaintorPipe that wraps all these steps and the fine-mapping itself together. PaintorPipe uses three independent sources of information: GWAS summary statistics, LD information and functional annotations, to rank the variants according to their susceptibility to be involved in the disease development. The PAINTOR framework is used to calculate the posterior probability of each variant (single nucleotide polymorphism) to be causal (a.k.a. Bayesian fine-mapping). The resulting credible sets of variants are annotated with their biological functions and visualized using CANVIS. This pipeline requires minimal input from users (a GWAS summary statistics file and a set of functional annotation files) and is designed to be modular and customizable, allowing for an easy integration of diverse functional annotations. Availability and implementation: PaintorPipe is implemented in the Nextflow pipeline specific language, can be run locally or on a slurm cluster and handles containerization using Singularity. PaintorPipe is freely available on GitHub (https://github.com/sdjebali/PaintorPipe).
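
The abstract of this record mentions posterior probabilities of causality and the resulting credible sets. A common way to build a credible set, sketched here under stated assumptions, is to rank variants by posterior probability and keep adding them until a target coverage is reached. The sketch assumes one locus whose probabilities can be normalized to sum to 1 (a single-causal-variant simplification). The function name, the 0.95 default, and the variant labels are illustrative and are not taken from PAINTOR or PaintorPipe.

```python
def credible_set(posteriors, coverage=0.95):
    """Smallest set of top-ranked variants whose normalized posterior
    probabilities sum to at least `coverage`.

    posteriors: dict mapping variant ID -> posterior probability of causality.
    """
    total = sum(posteriors.values())
    if total <= 0:
        return []
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant, probability in ranked:
        chosen.append(variant)
        cumulative += probability / total
        if cumulative >= coverage:
            break
    return chosen


# Made-up posterior probabilities for four variants at one locus.
example = {"snpA": 0.60, "snpB": 0.30, "snpC": 0.08, "snpD": 0.02}
selected = credible_set(example)
```

With these made-up values the first two variants reach only about 0.90 of the probability mass, so the third is needed to pass the 0.95 threshold and the fourth is left out.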

16.
NAR Genom Bioinform ; 5(4): lqad089, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37850035

ABSTRACT

Genome annotation plays a crucial role in providing a comprehensive catalog of genes and transcripts for a particular species. As research projects generate new transcriptome data worldwide, integrating this information into existing annotations becomes essential. However, most bioinformatics pipelines are limited in their ability to effectively and consistently update annotations using new RNA-seq data. Here we introduce TAGADA, an RNA-seq pipeline for Transcripts And Genes Assembly, Deconvolution, and Analysis. Given a genomic sequence, a reference annotation and RNA-seq reads, TAGADA enhances existing gene models by generating an improved annotation. It also computes expression values for both the reference and novel annotations, identifies long non-coding transcripts (lncRNAs), and provides a comprehensive quality control report. Developed using Nextflow DSL2, TAGADA offers user-friendly functionalities and ensures reproducibility across different computing platforms through its containerized environment. In this study, we demonstrate the efficacy of TAGADA using RNA-seq data from the GENE-SWiTCH project alongside chicken and pig genome annotations as references. Results indicate that TAGADA can substantially increase the number of annotated transcripts by approximately [Formula: see text] in these species. Furthermore, we illustrate how TAGADA can integrate Illumina NovaSeq short reads with PacBio Iso-Seq long reads, showcasing its versatility. TAGADA is available at github.com/FAANG/analysis-TAGADA.

17.
Front Bioinform ; 3: 1092853, 2023.
Article in English | MEDLINE | ID: mdl-36909938

ABSTRACT

Differences in cells' functions arise from differential activity of regulatory elements, including enhancers. Enhancers are cis-regulatory elements that cooperate with promoters through transcription factors to activate the expression of one or several genes by getting physically close to them in the 3D space of the nucleus. There is increasing evidence that genetic variants associated with common diseases are enriched in enhancers active in cell types relevant to these diseases. Identifying the enhancers associated with genes and, conversely, the sets of genes activated by each enhancer (the so-called enhancer/gene or E/G relationships) across cell types can help in understanding the genetic mechanisms underlying human diseases. There are three broad approaches for the genome-wide identification of E/G relationships in a cell type: 1) genetic link methods or eQTL, 2) functional link methods based on 1D functional data such as open chromatin, histone marks or gene expression and 3) spatial link methods based on 3D data such as HiC. Since 1) and 3) are costly, the current strategy is to develop functional link methods and to use data from 1) and 3) as references to evaluate them. However, there is still no consensus on the best functional link method to date, and method comparisons remain rare. Here, we compared the relative performances of three recent methods for the identification of enhancer-gene links, TargetFinder, Average-Rank, and the ABC model, using the three latest benchmarks from the field: a reference that combines 3D and eQTL data, called BENGI, and two genetic screening references, called CRiFF and CRISPRi. Overall, no single method performed best on all three references. The CRiFF and CRISPRi reference sets are likely more reliable, but CRiFF is not genome-wide and both CRiFF and CRISPRi are mostly available for the K562 cancer cell line. The BENGI reference set is genome-wide but likely contains many false positives. This study therefore calls for new reliable and genome-wide E/G reference data rather than new functional link E/G identification methods.

18.
Sci Data ; 10(1): 369, 2023 06 08.
Article in English | MEDLINE | ID: mdl-37291142

ABSTRACT

Inspired by the production of reference data sets in the Genome in a Bottle project, we sequenced one Charolais heifer with different technologies: Illumina paired-end, Oxford Nanopore, Pacific Biosciences (HiFi and CLR), 10X Genomics linked-reads, and Hi-C. In order to generate haplotypic assemblies, we also sequenced both parents with short reads. From these data, we built two high-quality, trio-haplotyped reference genomes and a consensus assembly, using up-to-date software packages. The assemblies obtained using PacBio HiFi reach a size of 3.2 Gb, which is significantly larger than the 2.7 Gb ARS-UCD1.2 reference. The consensus assembly reaches a BUSCO completeness of 95.8% among highly conserved mammalian genes. We also identified 35,866 structural variants larger than 50 base pairs. This assembly is a contribution to the bovine pangenome for the "Charolais" breed. These datasets will prove to be useful resources enabling the community to gain additional insight into sequencing technologies for applications such as SNP, indel or structural variant calling, and de novo assembly.


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Animals , Cattle , Female , Benchmarking , Genome , Sequence Analysis, DNA
19.
Nat Methods ; 5(7): 629-35, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18500348

ABSTRACT

Rapid amplification of cDNA ends (RACE) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. To improve the sampling efficiency of human transcripts, we hybridized the products of the RACE reaction onto tiling arrays and used the detected exons to delineate a series of reverse-transcriptase (RT)-PCRs, through which the original RACE transcript population was segregated into simpler transcript populations. We independently cloned the products and sequenced randomly selected clones. This approach, RACEarray, is superior to direct cloning and sequencing of RACE products because it specifically targets new transcripts and often results in overall normalization of transcript abundance. We show theoretically and experimentally that this strategy indeed leads to efficient sampling of new transcripts, and we investigated multiplexing the strategy by pooling RACE reactions from multiple interrogated loci before hybridization.


Subject(s)
DNA, Complementary/genetics , Gene Expression Profiling/methods , Gene Library , Nucleic Acid Amplification Techniques/methods , RNA/genetics , Alternative Splicing , Chromosomes, Human, Pair 21/genetics , Chromosomes, Human, Pair 22/genetics , Cloning, Molecular , Exons , Genome, Human , Humans , Molecular Sequence Data , Oligonucleotide Array Sequence Analysis/methods , Protein Isoforms/genetics , Reverse Transcriptase Polymerase Chain Reaction , Transcription, Genetic
20.
Genes (Basel) ; 12(4)2021 04 09.
Article in English | MEDLINE | ID: mdl-33918852

ABSTRACT

Steroid metabolism is a fundamental process in the porcine testis, providing testosterone as well as estrogens and androstenone, which are essential for the physiology of the boar. This study concerns boars at an early stage of puberty. Using an RT-qPCR approach, we showed that the transcriptional activities of several genes encoding key enzymes involved in this metabolism (such as CYP11A1) are correlated. Surprisingly, HSD17B3, a key gene for testosterone production, was absent from this group. An additional weighted gene co-expression network analysis was performed on two large sets of mRNA-seq data to identify co-expression modules. Of these modules, two containing either CYP11A1 or HSD17B3 were further analyzed. This comprehensive correlation meta-analysis identified a group of 85 genes with CYP11A1 as the hub gene, but did not allow the characterization of a robust correlation network around HSD17B3. As the CYP11A1-group includes most of the genes involved in steroid synthesis pathways (including LHCGR, encoding the LH receptor), it may control the synthesis of most of the testicular steroids. The independent expression of HSD17B3 probably allows part of the production of testosterone to escape this control. This CYP11A1-group also contained the INSL3 and AGT genes, encoding a peptide hormone and an angiotensin peptide precursor, respectively.


Subject(s)
17-Hydroxysteroid Dehydrogenases/metabolism , Cholesterol Side-Chain Cleavage Enzyme/metabolism , Gene Regulatory Networks , Signal Transduction , Testis/metabolism , Testosterone/metabolism , 17-Hydroxysteroid Dehydrogenases/genetics , Animals , Cholesterol Side-Chain Cleavage Enzyme/genetics , Male , Swine