ABSTRACT
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
ABSTRACT
Circadian and circannual cycles trigger physiological changes whose reflection on human transcriptomes remains largely uncharted. We used the time and season of death of 932 individuals from GTEx to jointly investigate transcriptomic changes associated with those cycles across multiple tissues. Overall, most variation across tissues during day-night and among seasons was unique to each cycle. Although all tissues remodeled their transcriptomes, brain and gonadal tissues exhibited the highest seasonality, whereas those in the thoracic cavity showed stronger day-night regulation. Core clock genes displayed marked day-night differences across multiple tissues, which were largely conserved in baboon and mouse, but adapted to their nocturnal or diurnal habits. Seasonal variation of expression affected multiple pathways, and it was enriched among genes associated with the immune response, consistent with the seasonality of viral infections. Furthermore, the seasonal expression changes revealed cytoarchitectural changes in brain regions. Altogether, our results provide the first combined atlas of how transcriptomes from human tissues adapt to major cycling environmental conditions. This atlas may have multiple applications; for example, drug targets with day-night or seasonal variation in gene expression may benefit from temporally adjusted doses.
Subject(s)
Gene Expression Profiling , Transcriptome , Humans , Animals , Mice , Seasons , Transcriptome/genetics , Adaptation, Physiological , Circadian Rhythm/genetics
ABSTRACT
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE [1] and Roadmap Epigenomics [2] data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9% and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
Subject(s)
DNA/genetics , Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Registries , Regulatory Sequences, Nucleic Acid/genetics , Animals , Chromatin/genetics , Chromatin/metabolism , DNA/chemistry , DNA Footprinting , DNA Methylation/genetics , DNA Replication Timing , Deoxyribonuclease I/metabolism , Genome, Human , Histones/metabolism , Humans , Mice , Mice, Transgenic , RNA-Binding Proteins/genetics , Transcription, Genetic/genetics , Transposases/metabolism
ABSTRACT
The tuatara (Sphenodon punctatus), the only living member of the reptilian order Rhynchocephalia (Sphenodontia), once widespread across Gondwana [1,2], is an iconic species that is endemic to New Zealand [2,3]. A key link to the now-extinct stem reptiles (from which dinosaurs, modern reptiles, birds and mammals evolved), the tuatara provides key insights into the ancestral amniotes [2,4]. Here we analyse the genome of the tuatara, which, at approximately 5 Gb, is among the largest of the vertebrate genomes yet assembled. Our analyses of this genome, along with comparisons with other vertebrate genomes, reinforce the uniqueness of the tuatara. Phylogenetic analyses indicate that the tuatara lineage diverged from that of snakes and lizards around 250 million years ago. This lineage also shows moderate rates of molecular evolution, with instances of punctuated evolution. Our genome sequence analysis identifies expansions of proteins, non-protein-coding RNA families and repeat elements, the latter of which show an amalgam of reptilian and mammalian features. The sequencing of the tuatara genome provides a valuable resource for deep comparative analyses of tetrapods, as well as for tuatara biology and conservation. Our study also provides important insights into both the technical challenges and the cultural obligations that are associated with genome sequencing.
Subject(s)
Evolution, Molecular , Genome/genetics , Phylogeny , Reptiles/genetics , Animals , Conservation of Natural Resources/trends , Female , Genetics, Population , Lizards/genetics , Male , Molecular Sequence Annotation , New Zealand , Sex Characteristics , Snakes/genetics , Synteny
ABSTRACT
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
ABSTRACT
While the long noncoding RNAs (ncRNAs) constitute a large portion of the mammalian transcriptome, their biological functions have remained elusive. The few long ncRNAs that have been studied in any detail silence gene expression in processes such as X-inactivation and imprinting. We used a GENCODE annotation of the human genome to characterize over a thousand long ncRNAs that are expressed in multiple cell lines. Unexpectedly, we found an enhancer-like function for a set of these long ncRNAs in human cell lines. Depletion of a number of ncRNAs led to decreased expression of their neighboring protein-coding genes, including the master regulator of hematopoiesis, SCL (also called TAL1), Snai1 and Snai2. Using heterologous transcription assays we demonstrated a requirement for the ncRNAs in activation of gene expression. These results reveal an unanticipated role for a class of long ncRNAs in activation of critical regulators of development and differentiation.
Subject(s)
Enhancer Elements, Genetic , Genome, Human , RNA, Untranslated/metabolism , Cell Line , Cell Line, Tumor , Cells, Cultured , Humans , RNA, Messenger/genetics , Snail Family Transcription Factors , Transcription Factors/genetics , Transcriptional Activation
ABSTRACT
Properties that make organisms ideal laboratory models in developmental and medical research are often the ones that also make them less representative of wild relatives. The waterflea Daphnia magna is an exception, by both sharing many properties with established laboratory models and being a keystone species, a sentinel species for assessing water quality, an indicator of environmental change and an established ecotoxicology model. Yet, Daphnia's potential has not been fully exploited because of the challenges associated with assembling and annotating its gene-rich genome. Here, we present the first hologenome of Daphnia magna, consisting of a chromosomal-level assembly of the D. magna genome and the draft assembly of its metagenome. By sequencing and mapping transcriptomes from exposures to environmental conditions and from developmental morphological landmarks, we expand the previously annotated gene set for this species. We also provide evidence for the potential role of gene-body DNA methylation as a mutagen mediating genome evolution. For the first time, our study shows that the gut microbes provide resistance to commonly used antibiotics and virulence factors, potentially mediating Daphnia's environmental-driven rapid evolution. Key findings in this study improve our understanding of the contribution of DNA methylation and gut microbiota to genome evolution in response to rapidly changing environments.
ABSTRACT
Aquaporin-mediated oocyte hydration is considered important for the evolution of pelagic eggs and the radiative success of marine teleosts. However, the molecular regulatory mechanisms controlling this vital process are not fully understood. Here, we analyzed >400 piscine genomes to uncover a previously unknown teleost-specific aquaporin-1 cluster (TSA1C) comprised of tandemly arranged aqp1aa-aqp1ab2-aqp1ab1 genes. Functional evolutionary analysis of the TSA1C reveals a ~300-million-year history of downstream aqp1ab-type gene loss, neofunctionalization, and subfunctionalization, but with marine species that spawn highly hydrated pelagic eggs almost exclusively retaining at least one of the downstream paralogs. Unexpectedly, one-third of the modern marine euacanthomorph teleosts selectively retain both aqp1ab-type channels and co-evolved protein kinase-mediated phosphorylation sites in the intracellular subdomains together with teleost-specific Ywhaz-like (14-3-3ζ-like) binding proteins for co-operative membrane trafficking regulation. To understand the selective evolutionary advantages of these mechanisms, we show that a two-step regulated channel shunt avoids competitive occupancy of the same plasma membrane space in the oocyte and accelerates hydration. These data suggest that the evolution of the adaptive molecular regulatory features of the TSA1C facilitated the rise of pelagic eggs and their subsequent geodispersal in the oceanic currents.
Subject(s)
14-3-3 Proteins , Oocytes , Animals , 14-3-3 Proteins/genetics , 14-3-3 Proteins/metabolism , Oocytes/metabolism , Evolution, Molecular , Fishes/genetics , Phylogeny
ABSTRACT
Tissue function and homeostasis reflect the gene expression signature by which the combination of ubiquitous and tissue-specific genes contributes to tissue maintenance and stimuli-responsive function. Enhancers are central to controlling this tissue-specific gene expression pattern. Here, we explore the correlation between the genomic location of enhancers and their role in tissue-specific gene expression. We find that enhancers showing tissue-specific activity are highly enriched in intronic regions and regulate the expression of genes involved in tissue-specific functions, whereas housekeeping genes are more often controlled by intergenic enhancers, common to many tissues. Notably, an intergenic-to-intronic continuum of active enhancers is observed in the transition from developmental to adult stages: the most differentiated tissues present higher rates of intronic enhancers, whereas the lowest rates are observed in embryonic stem cells. Altogether, our results suggest that the genomic location of active enhancers is key for the tissue-specific control of gene expression.
Subject(s)
Embryonic Stem Cells , Enhancer Elements, Genetic , Embryonic Stem Cells/metabolism , Genes, Essential , Introns/genetics
ABSTRACT
In contrast to the western honey bee, Apis mellifera, other honey bee species have been largely neglected despite their importance and diversity. The genetic basis of the evolutionary diversification of honey bees remains largely unknown. Here, we provide a genome-wide comparison of three honey bee species, each representing one of the three subgenera of honey bees, namely the dwarf (Apis florea), giant (A. dorsata), and cavity-nesting (A. mellifera) honey bees with bumblebees as an outgroup. Our analyses resolve the phylogeny of honey bees with the dwarf honey bees diverging first. We find that evolution of increased eusocial complexity in Apis proceeds via increases in the complexity of gene regulation, which is in agreement with previous studies. However, this process seems to be related to pathways other than transcriptional control. Positive selection patterns across Apis reveal a trade-off between maintaining genome stability and generating genetic diversity, with a rapidly evolving piRNA pathway leading to genomes depleted of transposable elements, and a rapidly evolving DNA repair pathway associated with high recombination rates in all Apis species. Diversification within Apis is accompanied by positive selection in several genes whose putative functions present candidate mechanisms for lineage-specific adaptations, such as migration, immunity, and nesting behavior.
ABSTRACT
SUMMARY: Large-scale sharing of genomic quantification data requires standardized access interfaces. In this Global Alliance for Genomics and Health project, we developed RNAget, an API for secure access to genomic quantification data in matrix form. RNAget provides for slicing matrices to extract desired subsets of data and is applicable to all expression matrix-format data, including RNA sequencing and microarrays. Further, it generalizes to quantification matrices of other sequence-based genomics such as ATAC-seq and ChIP-seq. AVAILABILITY AND IMPLEMENTATION: https://ga4gh-rnaseq.github.io/schema/docs/index.html.
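The matrix slicing described in this abstract can be illustrated with a minimal Python sketch. It is not the RNAget API itself: the `ExpressionMatrix` class, the `slice_matrix` function, and the gene and sample labels are hypothetical names introduced only to show what extracting a subset of a feature-by-sample quantification matrix means.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ExpressionMatrix:
    """Hypothetical in-memory quantification matrix (not an RNAget type)."""
    features: List[str]        # row labels, e.g. gene identifiers
    samples: List[str]         # column labels, e.g. sample identifiers
    values: List[List[float]]  # values[i][j]: feature i quantified in sample j


def slice_matrix(
    matrix: ExpressionMatrix,
    feature_ids: Optional[Sequence[str]] = None,
    sample_ids: Optional[Sequence[str]] = None,
) -> ExpressionMatrix:
    """Return the sub-matrix for the requested rows and columns.

    None means "keep everything" along that axis. Unknown labels raise KeyError.
    """
    row_index = {name: i for i, name in enumerate(matrix.features)}
    col_index = {name: j for j, name in enumerate(matrix.samples)}

    rows = list(matrix.features) if feature_ids is None else list(feature_ids)
    cols = list(matrix.samples) if sample_ids is None else list(sample_ids)

    values = [
        [matrix.values[row_index[r]][col_index[c]] for c in cols]
        for r in rows
    ]
    return ExpressionMatrix(features=rows, samples=cols, values=values)


# Toy data: three features quantified in two samples.
matrix = ExpressionMatrix(
    features=["geneA", "geneB", "geneC"],
    samples=["s1", "s2"],
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
subset = slice_matrix(matrix, feature_ids=["geneC", "geneA"], sample_ids=["s2"])
print(subset.values)
```

In this sketch the slice preserves the order in which labels were requested, which is a design choice of the example rather than a statement about how any server behaves.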
Subject(s)
RNA , Software , Genomics , Genome , Sequence Analysis, RNA
ABSTRACT
Social insect reproductives and non-reproductives represent ideal models with which to understand the expression and regulation of alternative phenotypes. Most research in this area has focused on the developmental regulation of reproductive phenotypes in obligately social taxa such as honey bees, while relatively few studies have addressed the molecular correlates of reproductive differentiation in species in which the division of reproductive labour is established only in plastic dominance hierarchies. To address this knowledge gap, we generate the first genome for any stenogastrine wasp and analyse brain transcriptomic data for non-reproductives and reproductives of the facultatively social species Liostenogaster flavolineata, a representative of one of the simplest forms of social living. By experimentally manipulating the reproductive 'queues' exhibited by social colonies of this species, we show that reproductive division of labour in this species is associated with transcriptomic signatures that are more subtle and variable than those observed in social taxa in which colony living has become obligate; that variation in gene expression among non-reproductives reflects their investment into foraging effort more than their social rank; and that genes associated with reproductive division of labour overlap to some extent with those underlying division of labour in the separate polistine origin of wasp sociality but only explain a small portion of overall variation in this trait. These results indicate that broad patterns of within-colony transcriptomic differentiation in this species are similar to those in Polistinae but offer little support for the existence of a strongly conserved 'toolkit' for sociality.
Subject(s)
Wasps , Bees/genetics , Animals , Wasps/genetics , Social Behavior , Social Dominance , Gene Expression Profiling , Transcriptome/genetics , Reproduction/genetics
ABSTRACT
BACKGROUND: Long non-coding RNAs (lncRNAs) are pivotal players in cellular processes, and their unique cell-type specific expression patterns render them attractive biomarkers and therapeutic targets. Yet, the functional roles of most lncRNAs remain enigmatic. To address the need to identify new druggable lncRNAs, we developed a comprehensive approach integrating transcription factor binding data with other genetic features to generate a machine learning model, which we have called INFLAMeR (Identifying Novel Functional LncRNAs with Advanced Machine Learning Resources). METHODS: INFLAMeR was trained on high-throughput CRISPR interference (CRISPRi) screens across seven cell lines, and the algorithm was based on 71 genetic features. To validate the predictions, we selected candidate lncRNAs in the human K562 leukemia cell line and determined the impact of their knockdown (KD) on cell proliferation and chemotherapeutic drug response. We further performed transcriptomic analysis for candidate genes. Based on these findings, we assessed the lncRNA small nucleolar RNA host gene 6 (SNHG6) for its role in myeloid differentiation. Finally, we established a mouse K562 leukemia xenograft model to determine whether SNHG6 KD attenuates tumor growth in vivo. RESULTS: The INFLAMeR model successfully reconstituted CRISPRi screening data and predicted functional lncRNAs that were previously overlooked. Intensive cell-based and transcriptomic validation of nearly fifty genes in K562 revealed cell type-specific functionality for 85% of the predicted lncRNAs. In this respect, our cell-based and transcriptomic analyses predicted a role for SNHG6 in hematopoiesis and leukemia. Consistent with its predicted role in hematopoietic differentiation, SNHG6 transcription is regulated by hematopoiesis-associated transcription factors. SNHG6 KD reduced the proliferation of leukemia cells and sensitized them to differentiation. Treatment of K562 leukemic cells with hemin and PMA, respectively, demonstrated that SNHG6 inhibits red blood cell differentiation but strongly promotes megakaryocyte differentiation. Using a xenograft mouse model, we demonstrate that SNHG6 KD attenuated tumor growth in vivo. CONCLUSIONS: Our approach not only improved the identification and characterization of functional lncRNAs through genomic approaches in a cell type-specific manner, but also identified new lncRNAs with roles in hematopoiesis and leukemia. Such approaches can be readily applied to identify novel targets for precision medicine.
Subject(s)
Leukemia , RNA, Long Noncoding , Animals , Humans , Mice , Cell Differentiation/genetics , Cell Line, Tumor , Cell Proliferation/genetics , Gene Expression Regulation, Neoplastic , Genomics , Leukemia/genetics , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism
ABSTRACT
Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.
Subject(s)
Chromosome Mapping , Gene Expression Profiling , Genome, Human , RNA, Long Noncoding , Transcriptome/physiology , Genome-Wide Association Study , Humans , RNA, Long Noncoding/biosynthesis , RNA, Long Noncoding/genetics
ABSTRACT
The European Genome-phenome Archive (EGA; https://ega-archive.org/) is a resource for long-term secure archiving of all types of potentially identifiable genetic, phenotypic, and clinical data resulting from biomedical research projects. Its mission is to foster hosted data reuse, enable reproducibility, and accelerate biomedical and translational research in line with the FAIR principles. Launched in 2008, the EGA has grown quickly, currently archiving over 4,500 studies from nearly one thousand institutions. The EGA operates a distributed data access model in which requests are made to the data controller, not to the EGA; the submitter therefore keeps control over who has access to the data and under which conditions. Given the size and value of data hosted, the EGA is constantly improving its value chain, that is, how the EGA can contribute to enhancing the value of human health data by facilitating its submission, discovery, access, and distribution, as well as leading the design and implementation of standards and methods necessary to deliver the value chain. The EGA has become a key GA4GH Driver Project, leading multiple development efforts and implementing new standards and tools, and has been appointed as an ELIXIR Core Data Resource.
Subject(s)
Confidentiality/legislation & jurisprudence , Genome, Human , Information Dissemination/methods , Phenomics/organization & administration , Translational Research, Biomedical/methods , Datasets as Topic , Genotype , History, 20th Century , History, 21st Century , Humans , Information Dissemination/ethics , Metadata/ethics , Metadata/statistics & numerical data , Phenomics/history , Phenotype
ABSTRACT
We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.
Subject(s)
Transcription, Genetic , Cell Line , Endothelial Cells/metabolism , Epithelial Cells/metabolism , Female , Gene Expression Profiling , Gynecomastia/genetics , Gynecomastia/metabolism , Humans , Male , Mesoderm/cytology , Mesoderm/metabolism , Neoplasms/genetics , Organ Specificity , Sequence Analysis, RNA
ABSTRACT
Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
Subject(s)
Disease Models, Animal , Evolution, Molecular , Gene Expression Regulation , Transcriptome , Animals , Conserved Sequence , Genome, Human , Humans , Mice , RNA, Long Noncoding/genetics
ABSTRACT
Cytoplasmic polyadenylation plays a key role in the translational control of mRNAs driving biological processes such as gametogenesis, cell-cycle progression, and synaptic plasticity. What determines the distinct time of polyadenylation and extent of translational control of a given mRNA, however, is poorly understood. Polyadenylation-regulated translation is controlled by the cytoplasmic polyadenylation element (CPE) and its binding protein, CPEB, which can assemble either translational repression or activation complexes. Using a combination of mutagenesis and experimental validation of genome-wide computational predictions, we show that the number and relative position of two elements, the CPE and the Pumilio-binding element, with respect to the polyadenylation signal define a combinatorial code that determines whether an mRNA will be translationally repressed by CPEB, as well as the extent and time of cytoplasmic polyadenylation-dependent translational activation.
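The inputs to such a combinatorial code (how many CPEs and Pumilio-binding elements a 3' UTR carries, and where they sit relative to the polyadenylation signal) can be tabulated with a short Python sketch. The motif patterns below are assumptions based on commonly cited consensus sequences and are not taken from this abstract; the function names and the toy sequence are hypothetical. The sketch only extracts counts and offsets and does not implement the repression or activation rules themselves.

```python
import re
from typing import Dict, List, Optional, Union

# Assumed consensus motifs (RNA alphabet); not specified in the abstract.
# Lookaheads are used so that overlapping occurrences are all reported.
MOTIFS = {
    "CPE": re.compile(r"(?=UUUUA{1,2}U)"),    # cytoplasmic polyadenylation element
    "PBE": re.compile(r"(?=UGUA[ACGU]AUA)"),  # Pumilio-binding element
    "PAS": re.compile(r"(?=AAUAAA)"),         # polyadenylation signal hexanucleotide
}


def find_sites(utr: str, motif: str) -> List[int]:
    """Return 0-based start positions of a motif; DNA input is converted to RNA."""
    rna = utr.upper().replace("T", "U")
    return [m.start() for m in MOTIFS[motif].finditer(rna)]


def describe_utr(utr: str) -> Dict[str, Union[int, None, List[int]]]:
    """Count CPEs and PBEs and give their offsets from the 3'-most PAS.

    Offsets are motif_start - pas_start, so negative values lie upstream.
    If no PAS is found, the offset lists are left empty.
    """
    cpe, pbe, pas = (find_sites(utr, name) for name in ("CPE", "PBE", "PAS"))
    anchor: Optional[int] = pas[-1] if pas else None
    return {
        "pas_position": anchor,
        "cpe_count": len(cpe),
        "pbe_count": len(pbe),
        "cpe_offsets": [p - anchor for p in cpe] if anchor is not None else [],
        "pbe_offsets": [p - anchor for p in pbe] if anchor is not None else [],
    }


# Toy 3' UTR with one CPE, one PBE and one PAS.
summary = describe_utr("GGUUUUAUCCCUGUAAAUAGGAAUAAAGG")
print(summary)
```

A table of this kind, built over many transcripts, is the sort of feature set a genome-wide prediction could start from; which feature combinations imply repression or activation would have to come from the experimental work itself.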