ABSTRACT
Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, was harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.
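Of the three harmonisation strategies named in the abstract, standardisation to the Z-distribution is the simplest to illustrate. The following is a minimal sketch, not the C-Surv implementation: the function name, the cohort values, the use of the population standard deviation, and the treatment of missing values as `None` are all assumptions made for the example.

```python
from statistics import mean, pstdev


def z_standardise(values):
    """Return z-scores for one cohort's measurements; None marks a missing value."""
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed)  # assumption: population SD within the cohort
    if sigma == 0:
        raise ValueError("cannot standardise a constant variable")
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical scores for the same construct recorded on different scales.
cohort_a = [12, 15, 9, None, 18]  # instrument scored 0-20
cohort_b = [55, 70, 40, 85]       # instrument scored 0-100

# After standardisation both cohorts are expressed in SD units.
harmonised = {
    "cohort_a": z_standardise(cohort_a),
    "cohort_b": z_standardise(cohort_b),
}
```

Standardising within each cohort makes values comparable in relative terms only; it discards differences in cohort means, which is one way harmonisation can trade granularity for comparability.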
Subjects
Datasets as Topic; Knowledge Discovery; Humans; Reference Standards
ABSTRACT
Although biological effects of endocrine disrupting chemicals (EDCs) are often observed at unexpectedly low doses with occasional nonmonotonic dose-response characteristics, transcriptome-wide profiles of sensitivities or dose-dependent behaviors of the EDC responsive genes have remained unexplored. Here, we describe expressome analysis for the comprehensive examination of dose-dependent gene responses and its applications to characterize estrogen responsive genes in MCF-7 cells. Transcriptomes of MCF-7 cells exposed to varying concentrations of representative natural and xenobiotic estrogens for 48 h were determined by microarray and used for computational calculation of interpolated approximations of estimated transcriptomes for 300 doses uniformly distributed in log space for each chemical. The entire collection of these estimated transcriptomes, designated as the expressome, has provided unique opportunities to profile chemical-specific distributions of ligand sensitivities for large numbers of estrogen responsive genes, revealing that at low concentrations estrogens generally tended to suppress rather than to activate transcription. Gene ontology analysis demonstrated distinct functional enrichment between high- and low-sensitivity estrogen responsive genes, supporting the notion that a single EDC chemical can cause qualitatively distinct biological responses at different doses. Expressomal heatmap visualization of dose-dependent induction of Bisphenol A inducible genes showed a weak gene activation peak at a very low concentration range (ca. 0.1 nM) in addition to the main, strong gene activation peak at and above 100 nM. Thus, expressome analysis is a powerful approach to understanding the EDC dose-dependent dynamic changes in gene expression at the transcriptomal level, providing important information on the overall profiles of ligand sensitivities and nonmonotonic responses.
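The abstract describes estimating transcriptomes at 300 doses uniformly distributed in log space by interpolating between measured concentrations. The sketch below shows that idea for a single gene. The interpolation scheme (piecewise-linear in log10 dose), the function names, and the dose and expression values are assumptions for illustration; the abstract does not specify the method actually used.

```python
import math
from bisect import bisect_right


def log_spaced_doses(lowest, highest, n=300):
    """Return n doses uniformly spaced in log10 between lowest and highest."""
    lo, hi = math.log10(lowest), math.log10(highest)
    step = (hi - lo) / (n - 1)
    return [10 ** (lo + i * step) for i in range(n)]


def interpolate_expression(measured_doses, measured_values, dose):
    """Piecewise-linear interpolation of one gene's expression in log10(dose).

    measured_doses must be sorted in ascending order. Doses outside the
    measured range are clamped to the nearest measured value.
    """
    xs = [math.log10(d) for d in measured_doses]
    x = math.log10(dose)
    if x <= xs[0]:
        return measured_values[0]
    if x >= xs[-1]:
        return measured_values[-1]
    j = bisect_right(xs, x)  # index of the first measured dose above x
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return measured_values[j - 1] + t * (measured_values[j] - measured_values[j - 1])


# Hypothetical measurements: doses in nM and log2 fold changes for one gene.
doses_nm = [0.01, 0.1, 1, 10, 100, 1000]
log2_fc = [0.0, 0.3, 0.1, 0.4, 1.5, 1.8]

grid = log_spaced_doses(0.01, 1000)
profile = [interpolate_expression(doses_nm, log2_fc, d) for d in grid]
```

Repeating this for every gene and every chemical would give one estimated transcriptome per grid dose, which is the kind of collection the abstract calls the expressome.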
Subjects
Endocrine Disruptors/toxicity; Estrogens/toxicity; Gene Expression Profiling/methods; Gene Expression Regulation/drug effects; RNA, Messenger/metabolism; Benzhydryl Compounds; Dose-Response Relationship, Drug; Gene Ontology; Humans; MCF-7 Cells; Microarray Analysis; Phenols
ABSTRACT
Identifying transcription factors (TFs) involved in producing a genome-wide transcriptional profile is an essential step in building a mechanistic model that can explain observed gene expression data. We developed a statistical framework for constructing genome-wide signatures of TF activity, and for using such signatures in the analysis of gene expression data produced by complex transcriptional regulatory programs. Our framework integrates ChIP-seq data and appropriately matched gene expression profiles to identify True REGulatory (TREG) TF-gene interactions. It provides genome-wide quantification of the likelihood of regulatory TF-gene interaction that can be used either to identify regulated genes or as a genome-wide signature of TF activity. To use ChIP-seq data effectively, we introduce a novel statistical model that integrates information from all binding "peaks" within a 2 Mb window around a gene's transcription start site (TSS), and provides gene-level binding scores and probabilities of regulatory interaction. In the second step we integrate these binding scores and regulatory probabilities with gene expression data to assess the likelihood of TREG TF-gene interactions. We demonstrate the advantages of the TREG framework in identifying genes regulated by two TFs with widely different distributions of functional binding events (ERα and E2f1). We also show that TREG signatures of TF activity vastly improve our ability to detect involvement of ERα in producing complex disease-related transcriptional profiles. Through a large study of disease-related transcriptional signatures and transcriptional signatures of drug activity, we demonstrate that the increase in statistical power associated with the use of TREG signatures makes the crucial difference in identifying key targets for treatment, and drugs to use for treatment. All methods are implemented in an open-source R package, treg. The package also contains all data used in the analysis, including 494 TREG binding profiles based on ENCODE ChIP-seq data. The treg package can be downloaded at http://GenomicsPortals.org.
Subjects
Genome-Wide Association Study; Transcription Factors/physiology; Chromatin Immunoprecipitation; Disease; Gene Expression Profiling; Humans; Probability; Transcription Factors/genetics
ABSTRACT
There is common consensus that data sharing accelerates science. Data sharing enhances the utility of data and promotes the creation and competition of scientific ideas. Within the Alzheimer's disease and related dementias (ADRD) community, data types and modalities are spread across many organizations, geographies, and governance structures. The ADRD community is not alone in facing these challenges; however, the problem is even more difficult because of the need to share complex biomarker data from centers around the world. Heavy-handed data sharing mandates have, to date, been met with limited success and often outright resistance. Interest in making data Findable, Accessible, Interoperable, and Reusable (FAIR) has often resulted in centralized platforms. However, when data governance and sovereignty structures do not allow the movement of data, other methods, such as federation, must be pursued. Implementation of fully federated data approaches is not without its challenges. The user experience may become more complicated, and federated analysis of unstructured data types remains challenging. Advancement in federated data sharing should be accompanied by improvement in federated learning methodologies so that federated data sharing becomes functionally equivalent to direct access to record-level data. In this article, we discuss federated data sharing approaches implemented by three data platforms in the ADRD field: Dementias Platform UK (DPUK) in 2014, the Global Alzheimer's Association Interactive Network (GAAIN) in 2012, and the Alzheimer's Disease Data Initiative (ADDI) in 2020. We conclude by addressing open questions that the research community needs to solve together.
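The goal of federated analysis being "functionally equivalent" to record-level access can be illustrated with a simple statistic. In this minimal sketch, each site computes aggregates locally and only those aggregates are combined centrally. The function names, the two sites, and their values are hypothetical, and the sketch does not represent how DPUK, GAAIN, or ADDI implement federation.

```python
from math import sqrt


def site_summary(records):
    """Run locally at each site; only these aggregates leave the site."""
    return {
        "n": len(records),
        "total": sum(records),
        "total_sq": sum(r * r for r in records),
    }


def pooled_mean_sd(summaries):
    """Combine site-level aggregates into a pooled mean and population SD."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    total_sq = sum(s["total_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, sqrt(max(variance, 0.0))  # guard against tiny negative rounding


# Hypothetical biomarker values held at two sites that cannot share records.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

pooled_mean, pooled_sd = pooled_mean_sd([site_summary(site_a), site_summary(site_b)])
```

For means and variances the federated result matches what direct access to the pooled records would give. Many other analyses, particularly on unstructured data, have no such exact decomposition, which is the difficulty the abstract points to.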
ABSTRACT
MOTIVATION: Functional enrichment analysis using primary genomics datasets is an emerging approach to complement established methods for functional enrichment based on predefined lists of functionally related genes. Currently used methods depend on creating lists of 'significant' and 'non-significant' genes based on ad hoc significance cutoffs. This can lead to loss of statistical power and can introduce biases affecting the interpretation of experimental results. RESULTS: We developed and validated a new statistical framework, generalized random set (GRS) analysis, for comparing the genomic signatures in two datasets without the need for gene categorization. In our tests, GRS produced correct measures of statistical significance, and it showed dramatic improvement in the statistical power over other methods currently used in this setting. We also developed a procedure for identifying genes driving the concordance of the genomics profiles and demonstrated a dramatic improvement in functional coherence of genes identified in such analysis. AVAILABILITY: GRS can be downloaded as part of the R package CLEAN from http://ClusterAnalysis.org/. An online implementation is available at http://GenomicsPortals.org/.
Subjects
Gene Expression Profiling/methods; Genomics/methods; Animals; Breast Neoplasms/genetics; Data Interpretation, Statistical; Diet; Female; Gene Expression; Humans; Rats
ABSTRACT
BACKGROUND: A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold tremendous potential for gaining new insights into the functioning of living systems. RESULTS: The Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and a tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc.), and the integration with an extensive knowledge base that can be used in such analysis. CONCLUSION: The integrated access to primary genomics data, functional knowledge and analytical tools makes the Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals back-end databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.
Subjects
Data Mining/methods; Genomics/methods; Software; Animals; Gene Expression Profiling; Humans; Internet; Mice; Rats
ABSTRACT
Pulmonary fibrosis is often triggered by an epithelial injury resulting in the formation of fibrotic lesions in the lung, which progress to impair gas exchange and ultimately cause death. Recent clinical trials using drugs that target either inflammation or a specific molecule have failed, suggesting that multiple pathways and cellular processes need to be attenuated for effective reversal of established and progressive fibrosis. Although activation of the MAPK and PI3K pathways has been detected in human fibrotic lung samples, the therapeutic benefits of in vivo modulation of the MAPK and PI3K pathways in combination are unknown. Overexpression of TGFα in the lung epithelium of transgenic mice results in the formation of fibrotic lesions similar to those found in human pulmonary fibrosis, and previous work from our group shows that inhibitors of either the MAPK or PI3K pathway can alter the progression of fibrosis. In this study, we sought to determine whether simultaneous inhibition of the MAPK and PI3K signaling pathways is a more effective therapeutic strategy for established and progressive pulmonary fibrosis. Our results showed that inhibiting both pathways had additive effects compared to inhibiting either pathway alone in reducing fibrotic burden, including reducing lung weight, pleural thickness, and total collagen in the lungs of TGFα mice. This study demonstrates that inhibiting MEK and PI3K in combination abolishes proliferative changes associated with fibrosis and myofibroblast accumulation and thus may serve as a therapeutic option in the treatment of human fibrotic lung disease where these pathways play a role.
Subjects
MAP Kinase Signaling System/drug effects; Phosphoinositide-3 Kinase Inhibitors; Pulmonary Fibrosis/drug therapy; Analysis of Variance; Animals; Benzimidazoles/pharmacology; Blotting, Western; Drug Therapy, Combination; Gonanes/pharmacology; Immunohistochemistry; Lung/metabolism; Lung/pathology; Mice; Mice, Transgenic; Real-Time Polymerase Chain Reaction; Sequence Analysis, RNA; Transforming Growth Factor alpha/metabolism
ABSTRACT
Ongoing efforts to improve protein structure prediction stimulate the development of scoring functions and methods for model quality assessment (MQA) that can be used to rank and select the best protein models for further refinement. In this work, sequence-based prediction of relative solvent accessibility (RSA) is employed as a basis for a simple MQA method for soluble proteins, and subsequently extended to the much less explored case of (alpha-helical) membrane proteins. In analogy to soluble proteins, the level of exposure to the lipid of amino acid residues in transmembrane (TM) domains is captured in terms of the relative lipid accessibility (RLA), which is predicted from sequence using low-complexity Support Vector Regression (SVR) models. On an independent set of 23 TM proteins, the new SVR-based predictor yields a correlation coefficient (CC) of 0.56 between the predicted and observed RLA profiles, as opposed to a CC of 0.13 for a baseline predictor that utilizes the TMLIP2H empirical lipophilicity scale (with standard deviations of about 0.15). A simple MQA approach is then defined by ranking models of membrane proteins in terms of consistency between predicted and observed RLA profiles, as a measure of similarity to the native structure. The new method does not require a set of decoy models to optimize parameters, circumventing current limitations in this regard. Several different sets of models, including those generated by fragment-based folding simulations, and decoys obtained by swapping TM helices to mimic errors in template-based assignment, are used to assess the new approach. Predicted RLA profiles can be used to successfully discriminate near-native models from non-native decoys in most cases, significantly improving the separation of correctly and incorrectly folded models compared to a simple baseline approach that utilizes TMLIP2H. As suggested by the robust performance of a simple MQA method for soluble proteins that utilizes more accurate RSA predictions, further significant improvements are likely to be achieved. The steady growth in the number of resolved membrane protein structures is expected to yield enhanced RLA predictions, facilitating further efforts to improve de novo and template-based prediction of membrane protein structure.
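The ranking step described in the abstract, ordering candidate models by how consistent their observed RLA profile is with the sequence-based prediction, can be sketched as follows. Pearson correlation is used here as the consistency measure by assumption; the abstract does not state which measure is used for ranking. The function names, model names, and RLA values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_models(predicted_rla, observed_rla_by_model):
    """Rank models, best first, by agreement with the predicted RLA profile."""
    scores = {
        name: pearson(predicted_rla, observed)
        for name, observed in observed_rla_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-residue RLA values for a short transmembrane segment.
predicted = [0.8, 0.1, 0.6, 0.2, 0.9, 0.3]  # predicted from sequence
models = {
    "near_native": [0.7, 0.2, 0.5, 0.1, 0.8, 0.4],    # computed from model 1
    "swapped_helix": [0.2, 0.7, 0.1, 0.8, 0.3, 0.6],  # computed from model 2
}

ranking = rank_models(predicted, models)
```

Because the score depends only on the predicted profile and each model's own geometry, no decoy set is needed to tune parameters, which matches the property the abstract highlights.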