ABSTRACT
We describe an update of MirGeneDB, the manually curated microRNA gene database. Adhering to uniform and consistent criteria for microRNA annotation and nomenclature, we substantially expanded MirGeneDB with 30 additional species representing previously missing metazoan phyla such as sponges, jellyfish, rotifers and flatworms. MirGeneDB 2.1 now consists of 75 species spanning ~800 million years of animal evolution, and contains a total of 16 670 microRNAs from 1549 families. Over 6000 microRNAs were added in this update using ~550 datasets with ~7.5 billion sequencing reads. By adding new phylogenetically important species, especially those relevant for the study of whole genome duplication events, and through updating evolutionary nodes of origin for many families and genes, we were able to substantially refine our nomenclature system. All changes are traceable in the specifically developed MirGeneDB version tracker. The performance of read-pages is improved and microRNA expression matrices for all tissues and species are now also downloadable. Altogether, this update represents a significant step toward a complete sampling of all major metazoan phyla, and a widely needed foundation for comparative microRNA genomics and transcriptomics studies. MirGeneDB 2.1 is part of RNAcentral and Elixir Norway, publicly and freely available at http://www.mirgenedb.org/.
Subject(s)
Computational Biology, Genetic Databases, Molecular Evolution, Genomics, Animals, Humans, MicroRNAs/classification, MicroRNAs/genetics, Phylogeny
ABSTRACT
Muscle cells have different phenotypes adapted to different usage, and can be grossly divided into fast/glycolytic and slow/oxidative types. While most muscles contain a mixture of such fiber types, we aimed at providing a genome-wide analysis of the epigenetic landscape by ChIP-Seq in two muscle extremes, the fast/glycolytic extensor digitorum longus (EDL) and slow/oxidative soleus muscles. Muscle is a heterogeneous tissue where up to 60% of the nuclei can be of a different origin. Since cellular homogeneity is critical in epigenome-wide association studies we developed a new method for purifying skeletal muscle nuclei from whole tissue, based on the nuclear envelope protein Pericentriolar material 1 (PCM1) being a specific marker for myonuclei. Using antibody labelling and a magnetic-assisted sorting approach, we were able to sort out myonuclei with 95% purity in muscles from mice, rats and humans. The sorting eliminated influence from the other cell types in the tissue and improved the myo-specific signal. A genome-wide comparison of the epigenetic landscape in EDL and soleus reflected the differences in the functional properties of the two muscles, and revealed distinct regulatory programs involving distal enhancers, including a glycolytic super-enhancer in the EDL. The two muscles were also regulated by different sets of transcription factors; e.g. in soleus, binding sites for MEF2C, NFATC2 and PPARA were enriched, while in EDL MYOD1 and SIX1 binding sites were found to be overrepresented. In addition, more novel transcription factors for muscle regulation such as members of the MAF family, ZFX and ZBTB14 were identified.
Subject(s)
Autoantigens/immunology, Cell Cycle Proteins/immunology, Cell Nucleus/metabolism, Genetic Epigenesis, Fast-Twitch Muscle Fibers/metabolism, Slow-Twitch Muscle Fibers/metabolism, Animals, Antibodies, Glycolysis, Humans, Mice, Muscle Cells, Oxidation-Reduction, Rats
ABSTRACT
The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure.
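As an illustration of the two simplest measures named in this abstract, the following is a minimal sketch that computes the Jaccard index and the Forbes coefficient for a pair of equal-length binary vectors. The function names and the toy vectors are assumptions made for this example and are not taken from the authors' software; the tetrachoric correlation is omitted because it requires numerical estimation.

```python
def overlap_counts(x, y):
    """Return (both, only_x_total, only_y_total, n) for two binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)  # positions covered by both
    n_x = sum(1 for a in x if a)                    # positions covered by x
    n_y = sum(1 for b in y if b)                    # positions covered by y
    return both, n_x, n_y, len(x)


def jaccard(x, y):
    """Jaccard index: intersection size divided by union size."""
    both, n_x, n_y, _ = overlap_counts(x, y)
    union = n_x + n_y - both
    return both / union if union else 0.0


def forbes(x, y):
    """Forbes coefficient: observed overlap divided by the overlap
    expected if the two vectors were independent (a fold change)."""
    both, n_x, n_y, n = overlap_counts(x, y)
    if n_x == 0 or n_y == 0:
        raise ValueError("Forbes coefficient is undefined for an empty vector")
    return both * n / (n_x * n_y)


# Hypothetical toy vectors: 1 marks a genome position covered by a feature.
track_x = [1, 1, 0, 0, 1, 0, 0, 0]
track_y = [1, 0, 0, 0, 1, 1, 0, 0]
print(jaccard(track_x, track_y), forbes(track_x, track_y))
```

The Forbes coefficient normalizes the observed overlap by the total vector length, whereas the Jaccard index uses only the covered positions; this difference in normalization is one way to see why the two measures can respond differently to dataset size.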
Subject(s)
Genomics/methods, Algorithms, Datasets as Topic, Gene Expression Regulation, Genetic Variation, Humans
ABSTRACT
Small non-coding RNAs have gained substantial attention due to their roles in animal development and human disorders. Among them, microRNAs are special because individual gene sequences are conserved across the animal kingdom. In addition, unique and mechanistically well understood features can clearly distinguish bona fide miRNAs from the myriad other small RNAs generated by cells. However, making this distinction is not a common practice and, thus, not surprisingly, the heterogeneous quality of available miRNA complements has become a major concern in microRNA research. We addressed this by extensively expanding our curated microRNA gene database - MirGeneDB - to 45 organisms, encompassing a wide phylogenetic swath of animal evolution. By consistently annotating and naming 10,899 microRNA genes in these organisms, we show that previous microRNA annotations contained not only many false positives, but surprisingly lacked >2000 bona fide microRNAs. Indeed, curated microRNA complements of closely related organisms are very similar and can be used to reconstruct ancestral miRNA repertoires. MirGeneDB represents a robust platform for microRNA-based research, providing deeper and more significant insights into the biology and evolution of miRNAs as well as biomedical and biomarker research. MirGeneDB is publicly and freely available at http://mirgenedb.org/.
Subject(s)
Computational Biology/methods, Nucleic Acid Databases, MicroRNAs/genetics, Software, Web Browser, Animals, Conserved Sequence, Molecular Evolution, MicroRNAs/classification, Molecular Sequence Annotation, Phylogeny, User-Computer Interface
ABSTRACT
Functional genomics assays produce sets of genomic regions as one of their main outputs. To biologically interpret such region-sets, researchers often use colocalization analysis, where the statistical significance of colocalization (overlap, spatial proximity) between two or more region-sets is tested. Existing colocalization analysis tools vary in the statistical methodology and analysis approaches, thus potentially providing different conclusions for the same research question. As the findings of colocalization analysis are often the basis for follow-up experiments, it is helpful to use several tools in parallel and to compare the results. We developed the Coloc-stats web service to facilitate such analyses. Coloc-stats provides a unified interface to perform colocalization analysis across various analytical methods and method-specific options (e.g. colocalization measures, resolution, null models). Coloc-stats helps the user to find a method that supports their experimental requirements and allows for a straightforward comparison across methods. Coloc-stats is implemented as a web server with a graphical user interface that assists users with configuring their colocalization analyses. Coloc-stats is freely available at https://hyperbrowser.uio.no/coloc-stats/.
Subject(s)
Genomics/methods, Software, Chromatin Immunoprecipitation, GATA1 Transcription Factor/metabolism, Internet, DNA Sequence Analysis, User-Computer Interface
ABSTRACT
BACKGROUND: The current versions of reference genome assemblies still contain gaps represented by stretches of Ns. Since high throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay a targeted portion of the genomic sequence, meaning that regions from the unassayed portion of the genomic sequence cannot be detected in those experiments. We here refer to all such regions as inaccessible regions, and hypothesize that ignoring these regions in the null model may increase false findings in statistical testing of colocalization of genomic features. RESULTS: Our explorative analyses confirm that the genomic regions in public genomic tracks intersect very little with assembly gaps of human reference genomes (hg19 and hg38). The little intersection was observed only at the beginning and end portions of the gap regions. Further, we simulated a set of synthetic tracks by matching the properties of real genomic tracks in a way that nullified any true association between them. This allowed us to test our hypothesis that not avoiding inaccessible regions (as represented by assembly gaps) in the null model would result in spurious inflation of statistical significance. We contrasted the distributions of test statistics and p-values of Monte Carlo-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing colocalization between a pair of tracks. We observed that the statistical tests that did not account for assembly gaps in the null model resulted in a distribution of the test statistic that is shifted to the right and a distribution of p-values that is shifted to the left (indicating inflated significance). We observed a similar level of inflated significance in hg19 and hg38, despite assembly gaps covering a smaller proportion of the latter reference genome. 
CONCLUSION: We provide empirical evidence that inaccessible regions, even when covering only a few percent of the genome, can lead to a substantial number of false findings if not accounted for in statistical colocalization analysis.
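The contrast between the two null models can be sketched as follows. This is a simplified illustration and not the authors' implementation: the genome is reduced to numbered bins, each track is a set of bin indices, the test statistic is the number of shared bins, and the function name and parameters are assumptions made for this example.

```python
import random


def mc_colocalization_pvalue(track_a, track_b, n_bins, gap_bins,
                             avoid_gaps, n_perm=1000, seed=0):
    """Monte Carlo p-value for the overlap between two sets of genome bins.

    track_a, track_b: sets of bin indices (hypothetical binned tracks).
    gap_bins: set of bin indices treated as inaccessible (e.g. assembly gaps).
    avoid_gaps: if True, the null model places track_b only in accessible bins;
                if False, it places track_b anywhere, gaps included.
    """
    rng = random.Random(seed)
    if avoid_gaps:
        allowed = [i for i in range(n_bins) if i not in gap_bins]
    else:
        allowed = list(range(n_bins))

    observed = len(track_a & track_b)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Null sample: relocate track_b uniformly at random, keep track_a fixed.
        shuffled_b = set(rng.sample(allowed, len(track_b)))
        if len(track_a & shuffled_b) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

When the null model is allowed to place regions inside gaps, fewer randomized regions land where real data can exist, so the observed overlap looks more extreme than it should; this is the inflation of significance described above.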
Subject(s)
Epidemiologic Confounding Factors, Human Genome, High-Throughput Nucleotide Sequencing, Statistics as Topic, Genomics, Humans
ABSTRACT
BACKGROUND: A visualization referred to as rainfall plot has recently gained popularity in genome data analysis. The plot is mostly used for illustrating the distribution of somatic cancer mutations along a reference genome, typically aiming to identify mutation hotspots. In general terms, the rainfall plot can be seen as a scatter plot showing the location of events on the x-axis versus the distance between consecutive events on the y-axis. Despite its frequent use, the motivation for applying this particular visualization and the appropriateness of its usage have never been critically addressed in detail. RESULTS: We show that the rainfall plot allows visual detection even for events occurring at high frequency over very short distances. In addition, event clustering at multiple scales may be detected as distinct horizontal bands in rainfall plots. At the same time, due to the limited size of standard figures, rainfall plots might suffer from inability to distinguish overlapping events, especially when multiple datasets are plotted in the same figure. We demonstrate the consequences of plot congestion, which results in obscured visual data interpretations. CONCLUSIONS: This work provides the first comprehensive survey of the characteristics and proper usage of rainfall plots. We find that the rainfall plot is able to convey a large amount of information without any need for parameterization or tuning. However, we also demonstrate how plot congestion and the use of a logarithmic y-axis may result in obscured visual data interpretations. To aid the productive utilization of rainfall plots, we demonstrate their characteristics and potential pitfalls using both simulated and real data, and provide a set of practical guidelines for their proper interpretation and usage.
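The coordinates behind a rainfall plot, as described in general terms above, can be sketched in a few lines. The function name and the example positions are assumptions made for illustration; the sketch treats a single chromosome and leaves the actual plotting to any scatter-plot library.

```python
def rainfall_points(positions):
    """Return (x, y) pairs for a rainfall plot of events on one chromosome.

    x is the position of an event; y is its distance to the previous event.
    The first event has no predecessor and is therefore omitted.
    """
    ordered = sorted(positions)
    return [(cur, cur - prev) for prev, cur in zip(ordered, ordered[1:])]


# Hypothetical mutation positions; the pair at 50 and 52 forms a tight cluster.
for x, y in rainfall_points([100, 5, 50, 52]):
    print(x, y)
```

Clustered events appear as points with small y values. Note that events at identical positions give a distance of zero, which cannot be drawn on a logarithmic y-axis without special handling.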
Subject(s)
Motivation, Software, Human Genome, Guidelines as Topic, Humans, Mutation/genetics, Pancreatic Neoplasms/genetics
ABSTRACT
Many high-throughput sequencing datasets can be represented as objects with coordinates along a reference genome. Currently, biological investigations often involve a large number of such datasets, for example representing different cell types or epigenetic factors. Drawing overall conclusions from a large collection of results for individual datasets may be challenging and time-consuming. Meaningful interpretation often requires the results to be aggregated according to metadata that represents biological characteristics of interest. In this light, we here propose the hierarchical Genomic Suite HyperBrowser (hGSuite), an open-source extension to the GSuite HyperBrowser platform, which aims to provide a means for extracting key results from an aggregated collection of high-throughput DNA sequencing data. The hGSuite utilizes a metadata-informed data cube to calculate various statistics across the multiple dimensions of the datasets. With this work, we show that the hGSuite and its associated data cube methodology offer a quick and accessible way for exploratory analysis of large genomic datasets. The web-based toolkit named hGsuite Hyperbrowser is available at https://hyperbrowser.uio.no/hgsuite under a GPLv3 license.
Subject(s)
Metadata, Software, Genomics/methods, Genome, Internet
ABSTRACT
Improved transcriptomic sequencing technologies now make it possible to perform longitudinal experiments, thus generating a large amount of data. Currently, there are no dedicated or comprehensive methods for the analysis of these experiments. In this article, we describe our TimeSeries Analysis pipeline (TiSA), which combines differential gene expression, clustering based on recursive thresholding, and a functional enrichment analysis. Differential gene expression is performed for both the temporal and conditional axes. Clustering is performed on the identified differentially expressed genes, with each cluster being evaluated using a functional enrichment analysis. We show that TiSA can be used to analyse longitudinal transcriptomic data from both microarrays and RNA-seq, including small datasets, large datasets, and datasets with missing data points. The tested datasets ranged in complexity, some originating from cell lines while another was from a longitudinal experiment of severity in COVID-19 patients. We have also included custom figures to aid with the biological interpretation of the data; these plots include Principal Component Analyses, Multi Dimensional Scaling plots, functional enrichment dotplots, trajectory plots, and complex heatmaps showing the broad overview of results. To date, TiSA is the first pipeline to provide an easy solution to the analysis of longitudinal transcriptomics experiments.
ABSTRACT
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) provides high-resolution transcriptome data to understand the heterogeneity of cell populations at the single-cell level. The analysis of scRNA-seq data requires the utilization of numerous computational tools. However, nonexpert users usually experience installation issues, a lack of critical functionality or batch analysis modes, and the steep learning curves of existing pipelines. RESULTS: We have developed cellsnake, a comprehensive, reproducible, and accessible single-cell data analysis workflow, to overcome these problems. Cellsnake offers advanced features for standard users and facilitates downstream analyses in both R and Python environments. It is also designed for easy integration into existing workflows, allowing for rapid analyses of multiple samples. CONCLUSION: As an open-source tool, cellsnake is accessible through Bioconda, PyPi, Docker, and GitHub, making it a cost-effective and user-friendly option for researchers. By using cellsnake, researchers can streamline the analysis of scRNA-seq data and gain insights into the complex biology of single cells.
Subject(s)
Software, Transcriptome, Single-Cell Analysis, Workflow, RNA Sequence Analysis, Gene Expression Profiling, RNA
ABSTRACT
Macrophages are a heterogeneous population of cells involved in tissue homeostasis, inflammation, and cancer. Although macrophages are densely distributed throughout the human intestine, our understanding of how gut macrophages maintain tissue homeostasis is limited. Here we show that colonic lamina propria macrophages (LpMs) and muscularis macrophages (MMs) consist of monocyte-like cells that differentiate into multiple transcriptionally distinct subsets. LpMs comprise subsets with proinflammatory properties and subsets with high antigen-presenting and phagocytic capacity. The latter are strategically positioned close to the surface epithelium. Most MMs differentiate along two trajectories: one that upregulates genes associated with immune activation and angiogenesis, and one that upregulates genes associated with neuronal homeostasis. Importantly, MMs are located adjacent to neurons and vessels. Cell-cell interaction and gene network analysis indicated that survival, migration, transcriptional reprogramming, and niche-specific localization of LpMs and MMs are controlled by an extensive interaction with tissue-resident cells and a few key transcription factors.
Subject(s)
Colon/immunology, Macrophages/classification, Single-Cell Analysis/methods, Transcriptome, Aged, Cell Communication, Cell Differentiation, Female, Gene Regulatory Networks, Humans, Macrophages/physiology, Male, Middle Aged, Transcription Factors/physiology
ABSTRACT
Although microRNAs (miRNAs) contribute to all hallmarks of cancer, miRNA dysregulation in metastasis remains poorly understood. The aim of this work was to reliably identify miRNAs associated with metastatic progression of colorectal cancer (CRC) using novel and previously published next-generation sequencing (NGS) datasets generated from 268 samples of primary (pCRC) and metastatic CRC (mCRC; liver, lung and peritoneal metastases) and tumor adjacent tissues. Differential expression analysis was performed using a meticulous bioinformatics pipeline, including only bona fide miRNAs, and utilizing miRNA-tailored quality control and processing. Five miRNAs were identified as up-regulated at multiple metastatic sites: Mir-210_3p, Mir-191_5p, Mir-8-P1b_3p [mir-141-3p], Mir-1307_5p and Mir-155_5p. Several have previously been implicated in metastasis through involvement in epithelial-to-mesenchymal transition and hypoxia, while other identified miRNAs represent novel findings. The use of a publicly available pipeline facilitates reproducibility and allows new datasets to be added as they become available. The set of miRNAs identified here provides a reliable starting-point for further research into the role of miRNAs in metastatic progression.
ABSTRACT
Prenatal exposure to persistent organic pollutants (POPs) is associated with neurodevelopmental disorders. In the present study, we explored whether a human-relevant POP mixture affects the development of chicken embryo cerebellum. We used a defined mixture of 29 POPs, with chemical composition and concentrations based on blood levels in the Scandinavian population. We also evaluated exposure to a prominent compound in the mixture, perfluorooctane sulfonic acid (PFOS), alone. Embryos (n = 7-9 per exposure group) were exposed by injection directly into the allantois at embryonic day 13 (E13). Cerebella were isolated at E17 and subjected to morphological, RNA-seq and shot-gun proteomics analyses. There was a reduction in thickness of the molecular layer of cerebellar cortex in both exposure scenarios. Exposure to the POP mixture significantly affected expression of 65 of 13,800 transcripts, and 43 of 2,568 proteins, when compared to solvent control. PFOS alone affected expression of 80 of 13,859 transcripts, and 69 of 2,555 proteins. Twenty-five genes and 15 proteins were common for both exposure groups. These findings point to alterations in molecular events linked to retinoid X receptor (RXR) signalling, neuronal cell proliferation and migration, cellular stress responses including unfolded protein response, lipid metabolism, and myelination. Exposure to the POP mixture increased methionine oxidation, whereas PFOS decreased oxidation. Several of the altered genes and proteins are involved in a wide variety of neurological disorders. We conclude that POP exposure can interfere with fundamental aspects of neurodevelopment, altering molecular pathways that are associated with adverse neurocognitive and behavioural outcomes.
ABSTRACT
[This corrects the article DOI: 10.1093/narcan/zcaa019.].
ABSTRACT
In B lymphocytes, the uracil N-glycosylase (UNG) excises genomic uracils made by activation-induced deaminase (AID), thus underpinning antibody gene diversification and oncogenic chromosomal translocations, but also initiating faithful DNA repair. Ung-/- mice develop B-cell lymphoma (BCL). However, since UNG has anti- and pro-oncogenic activities, its tumor suppressor relevance is unclear. Moreover, how the constant DNA damage and repair caused by the AID and UNG interplay affects B-cell fitness and thereby the dynamics of cell populations in vivo is unknown. Here, we show that UNG specifically protects the fitness of germinal center B cells, which express AID, and not of any other B-cell subset, coincident with AID-induced telomere damage activating p53-dependent checkpoints. Consistent with AID expression being detrimental in UNG-deficient B cells, Ung-/- mice develop BCL originating from activated B cells but lose AID expression in the established tumor. Accordingly, we find that UNG is rarely lost in human BCL. The fitness preservation activity of UNG contingent to AID expression was confirmed in a B-cell leukemia model. Hence, UNG, typically considered a tumor suppressor, acquires tumor-enabling activity in cancer cell populations that express AID by protecting cell fitness.
ABSTRACT
Genomic locations are represented as coordinates on a specific genome build version, but the build information is frequently missing when coordinates are provided. We show that this information is essential to correctly interpret and analyse the genomic intervals contained in genomic track files. Although not a substitute for best practices, we also provide a tool to predict the genome build version of genomic track files.
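To show why the build matters, here is a minimal sketch of one simple consistency check: an interval that extends past the end of a chromosome in a given build cannot have been produced on that build. This is an illustrative heuristic chosen for this example and is not claimed to be the method used by the authors' tool. The table holds only chr1 and chr2 lengths for hg19 and hg38, and the function name is hypothetical.

```python
# Chromosome lengths in base pairs for two human builds (chr1 and chr2 only,
# to keep the example short; a real check would cover every chromosome).
CHROM_SIZES = {
    "hg19": {"chr1": 249_250_621, "chr2": 243_199_373},
    "hg38": {"chr1": 248_956_422, "chr2": 242_193_529},
}


def compatible_builds(intervals, chrom_sizes=CHROM_SIZES):
    """Return the builds in which every interval fits inside its chromosome.

    intervals: iterable of (chromosome, start, end) tuples.
    Chromosomes missing from the size table are ignored rather than rejected.
    """
    intervals = list(intervals)
    compatible = []
    for build, sizes in chrom_sizes.items():
        fits = all(
            chrom not in sizes or end <= sizes[chrom]
            for chrom, _start, end in intervals
        )
        if fits:
            compatible.append(build)
    return compatible


# Hypothetical interval near the end of chr1: it fits hg19 but overruns hg38.
print(compatible_builds([("chr1", 249_000_000, 249_000_500)]))
```

A check of this kind can only exclude builds. Intervals far from chromosome ends are compatible with several builds, which is why recording the build alongside the coordinates remains the best practice.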
Subject(s)
Genome, Genomics, Animals, Genetic Databases, Datasets as Topic, Human Genome, Humans
ABSTRACT
BACKGROUND: Studies on medication safety in pregnancy often rely on an oversimplification of medication use into exposed or non-exposed, without considering intensity and timing of use in pregnancy, or concomitant medication use. This study uses paracetamol in pregnancy as the motivating example to introduce a method of clustering medication exposures longitudinally throughout pregnancy. The aim of this study was to use hierarchical cluster analysis (HCA) to better identify clusters of medication exposure throughout pregnancy. METHODS: Data from the Norwegian Mother and Child Cohort Study were used to identify subclasses of women using paracetamol during pregnancy. HCA with a customized distance measure was used to identify clusters of medication exposures in pregnancy among children at 18 months. RESULTS: The pregnancies in the study (N = 9 778) were grouped in 5 different clusters depending on their medication exposure profile throughout pregnancy. CONCLUSION: Using HCA, we identified and described profiles of women exposed to different medications in combination with paracetamol during pregnancy. Identifying these clusters allows researchers to define exposure in ways that better reflect real-world medication usage patterns. This method could be extended to other medications and used as pre-analysis for identifying risks associated with different profiles of exposure.
Subject(s)
Acetaminophen/therapeutic use, Cluster Analysis, Cohort Studies, Drug Interactions, Female, Humans, Norway, Pregnancy
ABSTRACT
Both a DNA lesion and an intermediate for antibody maturation, uracil is primarily processed by base excision repair (BER), either initiated by uracil-DNA glycosylase (UNG) or by single-strand selective monofunctional uracil DNA glycosylase (SMUG1). The relative in vivo contributions of each glycosylase remain elusive. To assess the impact of SMUG1 deficiency, we measured uracil and 5-hydroxymethyluracil, another SMUG1 substrate, in Smug1 -/- mice. We found that 5-hydroxymethyluracil accumulated in Smug1 -/- tissues and correlated with 5-hydroxymethylcytosine levels. The highest increase was found in brain, which contained about 26-fold higher genomic 5-hydroxymethyluracil levels than the wild type. Smug1 -/- mice did not accumulate uracil in their genome and Ung -/- mice showed slightly elevated uracil levels. Contrastingly, Ung -/- Smug1 -/- mice showed a synergistic increase in uracil levels with up to 25-fold higher uracil levels than wild type. Whole genome sequencing of UNG/SMUG1-deficient tumours revealed that combined UNG and SMUG1 deficiency leads to the accumulation of mutations, primarily C to T transitions within CpG sequences. This unexpected sequence bias suggests that CpG dinucleotides are intrinsically more mutation prone. In conclusion, we showed that SMUG1 efficiently prevents genomic uracil accumulation, even in the presence of UNG, and identified mutational signatures associated with combined UNG and SMUG1 deficiency.
Subject(s)
Cytosine/metabolism, Dinucleoside Phosphates/metabolism, Uracil-DNA Glycosidase/deficiency, Uracil/metabolism, Animals, CpG Islands, Deamination, Genome, Genomics/methods, Mice, Knockout Mice, Mutation
ABSTRACT
Background: Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. Findings: We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Conclusions: Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no.