ABSTRACT
The Solve-RD project brings together clinicians, scientists, and patient representatives from 51 institutes spanning 15 countries to collaborate on genetically diagnosing ("solving") rare diseases (RDs). The project aims to significantly increase the diagnostic success rate by co-analyzing data from thousands of RD cases, including phenotypes, pedigrees, exome/genome sequencing, and multiomics data. Here we report on the data infrastructure devised and created to support this co-analysis. This infrastructure enables users to store, find, connect, and analyze data and metadata in a collaborative manner. Pseudonymized phenotypic and raw experimental data are submitted to the RD-Connect Genome-Phenome Analysis Platform and processed through standardized pipelines. Resulting files and novel produced omics data are sent to the European Genome-Phenome Archive, which adds unique file identifiers and provides long-term storage and controlled access services. MOLGENIS "RD3" and Café Variome "Discovery Nexus" connect data and metadata and offer discovery services, and secure cloud-based "Sandboxes" support multiparty data analysis. This successfully deployed and useful infrastructure design provides a blueprint for other projects that need to analyze large amounts of heterogeneous data.
Subject(s)
Rare Diseases , Rare Diseases/genetics , Humans , Databases, Genetic , Phenotype , Metadata , Computational Biology/methods , Genomics/methods
ABSTRACT
PURPOSE: Metadata for data dIscoverability aNd study rEplicability in obseRVAtional studies (MINERVA), a European Medicines Agency-funded project (EUPAS39322), defined a set of metadata to describe real-world data sources (RWDSs) and piloted metadata collection in a prototype catalogue to assist investigators from data source discoverability through study conduct. METHODS: A list of metadata was created from a review of existing metadata catalogues and recommendations, structured interviews, a stakeholder survey, and a technical workshop. The prototype was designed to comply with the FAIR principles (findable, accessible, interoperable, reusable), using MOLGENIS software. Metadata collection was piloted by 15 data access partners (DAPs) from across Europe. RESULTS: A total of 442 metadata variables were defined in six domains: institutions (organizations connected to a data source); data banks (data collections sustained by an organization); data sources (collections of linkable data banks covering a common underlying population); studies; networks (of institutions); and common data models (CDMs). A total of 26 institutions were recorded in the prototype. Each DAP populated the metadata of one data source and its selected data banks. The number of data banks varied by data source; the most common data banks were hospital administrative records and pharmacy dispensation records (10 data sources each). Quantitative metadata were successfully extracted from three data sources conforming to different CDMs and entered into the prototype. CONCLUSIONS: A metadata list was finalized, a prototype was successfully populated, and a good practice guide was developed. Setting up and maintaining a metadata catalogue on RWDSs will require substantial effort to support discoverability of data sources and reproducibility of studies in Europe.
Subject(s)
Metadata , Observational Studies as Topic , Europe , Humans , Pilot Projects , Reproducibility of Results , Observational Studies as Topic/methods , Data Collection/methods , Data Collection/standards , Databases, Factual/statistics & numerical data , Software , Pharmacoepidemiology/methods
ABSTRACT
While its etiology is not fully elucidated, preterm birth represents a major public health concern as it is the leading cause of child mortality and morbidity. Stress is one of the most common perinatal conditions and may increase the risk of preterm birth. In this paper we aimed to investigate the association of maternal perceived stress and anxiety with length of gestation. We used harmonized data from five birth cohorts from Canada, France, and Norway. A total of 5297 pregnancies of singletons were included in the analysis of perceived stress and gestational duration, and 55,775 pregnancies for anxiety. Federated analyses were performed through the DataSHIELD platform using Cox regression models within intervals of gestational age. The models were fit for each cohort separately, and the cohort-specific results were combined using random effects study-level meta-analysis. Moderate and high levels of perceived stress during pregnancy were associated with a shorter length of gestation in the very/moderately preterm interval [moderate: hazard ratio (HR) 1.92 (95%CI 0.83, 4.48); high: 2.04 (95%CI 0.77, 5.37)], albeit not statistically significant. No association was found for the other intervals. Anxiety was associated with gestational duration in the very/moderately preterm interval [1.66 (95%CI 1.32, 2.08)], and in the early term interval [1.15 (95%CI 1.08, 1.23)]. Our findings suggest that perceived stress and anxiety are associated with an increased risk of earlier birth, but only in the earliest gestational ages. We also found an association in the early term period for anxiety, but the result was only driven by the largest cohort, which collected information the latest in pregnancy. This raised a potential issue of reverse causality as anxiety later in pregnancy could be due to concerns about early signs of a possible preterm birth.
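The pooling step described above (cohort-specific hazard ratios combined by random-effects study-level meta-analysis) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not name the between-study variance estimator, so the DerSimonian-Laird estimator is an assumption, as are the function name `pool_random_effects` and the example inputs, which are made up and are not results from the study.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% confidence interval


def pool_random_effects(estimates):
    """Pool per-cohort hazard ratios with a DerSimonian-Laird random-effects model.

    estimates: list of (hr, ci_low, ci_high) tuples, one per cohort.
    Returns the pooled (hr, ci_low, ci_high).
    """
    # Work on the log scale, where the estimates are approximately normal.
    y = [math.log(hr) for hr, _, _ in estimates]
    # Recover each standard error from the width of the 95% CI.
    se = [(math.log(hi) - math.log(lo)) / (2 * Z_95) for _, lo, hi in estimates]

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / s ** 2 for s in se]
    sum_w = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w

    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (s ** 2 + tau2) for s in se]
    sum_w_re = sum(w_re)
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum_w_re
    se_re = math.sqrt(1.0 / sum_w_re)

    return (
        math.exp(y_re),
        math.exp(y_re - Z_95 * se_re),
        math.exp(y_re + Z_95 * se_re),
    )


# Hypothetical cohort-level estimates (HR, lower, upper), for illustration only.
example = [(1.8, 0.9, 3.6), (1.4, 1.1, 1.8), (2.1, 0.7, 6.3)]
print(pool_random_effects(example))
```

In a federated setting such as the one described, only these cohort-level summaries would need to leave each site; the individual-level data stay with the cohort.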
Subject(s)
Anxiety , Gestational Age , Premature Birth , Stress, Psychological , Humans , Female , Pregnancy , Stress, Psychological/epidemiology , Anxiety/epidemiology , Canada/epidemiology , Adult , Premature Birth/epidemiology , Premature Birth/psychology , Birth Cohort , Pregnancy Complications/epidemiology , Pregnancy Complications/psychology , Cohort Studies , Risk Factors , Infant, Newborn , Proportional Hazards Models , Norway/epidemiology
ABSTRACT
With the introduction of Next Generation Sequencing (NGS) techniques increasing numbers of disease-associated variants are being identified. This ongoing progress might lead to diagnoses in formerly undiagnosed patients and novel insights in already solved cases. Therefore, many studies suggest introducing systematic reanalysis of NGS data in routine diagnostics. Introduction will, however, also have ethical, economic, legal and (psycho)social (ELSI) implications that Genetic Health Professionals (GHPs) from laboratories should consider before possible implementation of systematic reanalysis. To get a first impression we performed a scoping literature review. Our findings show that for the vast majority of included articles ELSI aspects were not mentioned as such. However, often these issues were raised implicitly. In total, we identified nine ELSI aspects, such as (perceived) professional responsibilities, implications for consent and cost-effectiveness. The identified ELSI aspects brought forward necessary trade-offs for GHPs to consciously take into account when considering responsible implementation of systematic reanalysis of NGS data in routine diagnostics, balancing the various strains on their laboratories and personnel while creating optimal results for new and former patients. Some important aspects are not well explored yet. For example, our study shows GHPs see the values of systematic reanalysis but also experience barriers, often mentioned as being practical or financial only, but in fact also being ethical or psychosocial. Engagement of these GHPs in further research on ELSI aspects is important for sustainable implementation.
Subject(s)
Genetic Testing , Humans , Genetic Testing/ethics , Genetic Testing/economics , Genetic Testing/legislation & jurisprudence , Genetic Testing/standards , Genetic Testing/methods , High-Throughput Nucleotide Sequencing/ethics , Genomics/ethics , Genomics/legislation & jurisprudence , Genomics/methods , Laboratories, Clinical
ABSTRACT
In this report, we analyse the use of virtual reality (VR) as a method to navigate and explore complex knowledge graphs. Over the past few decades, linked data technologies [Resource Description Framework (RDF) and Web Ontology Language (OWL)] have shown to be valuable to encode such graphs and many tools have emerged to interactively visualize RDF. However, as knowledge graphs get larger, most of these tools struggle with the limitations of 2D screens or 3D projections. Therefore, in this paper, we evaluate the use of VR to visually explore SPARQL Protocol and RDF Query Language (SPARQL) (construct) queries, including a series of tutorial videos that demonstrate the power of VR (see Graph2VR tutorial playlist: https://www.youtube.com/playlist?list=PLRQCsKSUyhNIdUzBNRTmE-_JmuiOEZbdH). We first review existing methods for Linked Data visualization and then report the creation of a prototype, Graph2VR. Finally, we report a first evaluation of the use of VR for exploring linked data graphs. Our results show that most participants enjoyed testing Graph2VR and found it to be a useful tool for graph exploration and data discovery. The usability study also provides valuable insights for potential future improvements to Linked Data visualization in VR.
Subject(s)
Semantic Web , Virtual Reality , Humans , Databases, Factual , Language
ABSTRACT
Research data is accumulating rapidly and with it the challenge of fully reproducible science. As a consequence, implementation of high-quality management of scientific data has become a global priority. The FAIR (Findable, Accessible, Interoperable and Reusable) principles provide practical guidelines for maximizing the value of research data; however, processing data using workflows (systematic executions of a series of computational tools) is equally important for good data management. The FAIR principles have recently been adapted to Research Software (FAIR4RS Principles) to promote the reproducibility and reusability of any type of research software. Here, we propose a set of 10 quick tips, drafted by experienced workflow developers, that will help researchers to apply FAIR4RS principles to workflows. The tips have been arranged according to the FAIR acronym, clarifying the purpose of each tip with respect to the FAIR4RS principles. Altogether, these tips can be seen as practical guidelines for workflow developers who aim to contribute to more reproducible and sustainable computational science, aiming to positively impact the open science and FAIR community.
ABSTRACT
BACKGROUND: Expression quantitative trait loci (eQTL) studies show how genetic variants affect downstream gene expression. Single-cell data allows reconstruction of personalized co-expression networks and therefore the identification of SNPs altering co-expression patterns (co-expression QTLs, co-eQTLs) and the affected upstream regulatory processes using a limited number of individuals. RESULTS: We conduct a co-eQTL meta-analysis across four scRNA-seq peripheral blood mononuclear cell datasets using a novel filtering strategy followed by a permutation-based multiple testing approach. Before the analysis, we evaluate the co-expression patterns required for co-eQTL identification using different external resources. We identify a robust set of cell-type-specific co-eQTLs for 72 independent SNPs affecting 946 gene pairs. These co-eQTLs are replicated in a large bulk cohort and provide novel insights into how disease-associated variants alter regulatory networks. One co-eQTL SNP, rs1131017, that is associated with several autoimmune diseases, affects the co-expression of RPS26 with other ribosomal genes. Interestingly, specifically in T cells, the SNP additionally affects co-expression of RPS26 and a group of genes associated with T cell activation and autoimmune disease. Among these genes, we identify enrichment for targets of five T-cell-activation-related transcription factors whose binding sites harbor rs1131017. This reveals a previously overlooked process and pinpoints potential regulators that could explain the association of rs1131017 with autoimmune diseases. CONCLUSION: Our co-eQTL results highlight the importance of studying context-specific gene regulation to understand the biological implications of genetic variation. With the expected growth of sc-eQTL datasets, our strategy and technical guidelines will facilitate future co-eQTL identification, further elucidating unknown disease mechanisms.
Subject(s)
Autoimmune Diseases , Leukocytes, Mononuclear , Humans , Gene Expression Regulation , Quantitative Trait Loci , Ribosomal Proteins/genetics , Autoimmune Diseases/genetics , Polymorphism, Single Nucleotide , Genome-Wide Association Study
ABSTRACT
The mapping of human-entered data to codified data formats that can be analysed is a common problem across medical research and health care. To identify risk and protective factors for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) susceptibility and coronavirus disease 2019 (COVID-19) severity, frequent questionnaires were sent out to participants of the Lifelines Cohort Study starting 30 March 2020. Because specific drugs were suspected COVID-19 risk factors, the questionnaires contained multiple-choice questions about commonly used drugs and open-ended questions to capture all other drugs used. To classify and evaluate the effects of those drugs and group participants taking similar drugs, the free-text answers needed to be translated into standard Anatomical Therapeutic Chemical (ATC) codes. This translation includes handling misspelt drug names, brand names, comments or multiple drugs listed in one line that would prevent a computer from finding these terms in a simple lookup table. In the past, the translation of free-text responses to ATC codes was time-intensive manual labour for experts. To reduce the amount of manual curation required, we developed a method for the semi-automated recoding of the free-text questionnaire responses into ATC codes suitable for further analysis. For this purpose, we built an ontology containing the Dutch drug names linked to their respective ATC code(s). In addition, we designed a semi-automated process that builds upon the Molgenis method SORTA to map the responses to ATC codes. This method can be applied to support the encoding of free-text responses to facilitate the evaluation, categorization and filtering of free-text responses. Our semi-automatic approach to coding of drugs using SORTA turned out to be more than two times faster than current manual approaches to performing this activity. Database URL https://doi.org/10.1093/database/baad019.
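The kind of recoding described above, in which misspelt names and several drugs on one line are matched to ATC codes and uncertain cases are left for an expert, can be illustrated with a small sketch. It does not use or reproduce SORTA or the authors' ontology: the matching relies on Python's standard-library `difflib` as a stand-in, and the five-entry lookup table, the function name `suggest_atc` and the similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import re

# Tiny illustrative lookup of (Dutch) drug names to ATC codes.
# A real ontology would also hold brand names and synonyms.
ATC_LOOKUP = {
    "paracetamol": "N02BE01",
    "ibuprofen": "M01AE01",
    "metformine": "A10BA02",
    "omeprazol": "A02BC01",
    "simvastatine": "C10AA01",
}

# Separators between several drugs on one line, including Dutch "en" and English "and".
SPLIT_PATTERN = re.compile(r"[,;/+]|\ben\b|\band\b")


def suggest_atc(answer, cutoff=0.8):
    """Suggest an ATC code for each drug mentioned in a free-text answer.

    Returns a list of (term, atc_code_or_None). A None code means no
    candidate reached the cutoff, so the term needs manual curation.
    """
    suggestions = []
    for part in SPLIT_PATTERN.split(answer.lower()):
        term = part.strip()
        if not term:
            continue
        # Best fuzzy match among known names, if similar enough.
        match = difflib.get_close_matches(term, list(ATC_LOOKUP), n=1, cutoff=cutoff)
        suggestions.append((term, ATC_LOOKUP[match[0]] if match else None))
    return suggestions


# A misspelt name plus a second drug on the same line.
print(suggest_atc("paracetemol en ibuprofen"))
```

Returning unmatched terms with `None` instead of dropping them keeps the process semi-automated: clear matches are coded automatically and the remainder is routed to a curator.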
Subject(s)
COVID-19 , Humans , Cohort Studies , COVID-19/epidemiology , SARS-CoV-2 , Surveys and Questionnaires , Databases, Factual
ABSTRACT
(1) Background: Sexual function can be affected up to and beyond 18 months postpartum, with some studies suggesting that spontaneous vaginal birth results in less sexual dysfunction. This review examined the impact of mode of birth on sexual function in the medium term (≥6 months and <12 months postpartum) and the longer term (≥12 months postpartum). (2) Methods: Literature published after January 2000 was identified in PubMed, Embase and CINAHL. Studies that compared at least two modes of birth and used valid sexual function measures were included. Systematic reviews, unpublished articles, protocols and articles not written in English were excluded. Quality was assessed using the Newcastle Ottawa Scale. (3) Results: In the medium term, assisted vaginal birth and vaginal birth with episiotomy were associated with worse sexual function, compared to caesarean section. In the longer term, assisted vaginal birth was associated with worse sexual function, compared with spontaneous vaginal birth and caesarean section; and planned caesarean section was associated with worse sexual function in several domains, compared to spontaneous vaginal birth. (4) Conclusions: Sexual function, in the medium and longer term, can be affected by mode of birth. Women should be encouraged to seek support should their sexual function be affected after birth.
Subject(s)
Cesarean Section , Episiotomy , Female , Humans , Pregnancy , Cesarean Section/adverse effects , Delivery, Obstetric/adverse effects , Delivery, Obstetric/methods , Episiotomy/adverse effects , Parturition , Postpartum Period
ABSTRACT
BACKGROUND: Even with the introduction of new genetic techniques that enable accurate genomic characterization, knowledge about the phenotypic spectrum of rare chromosomal disorders is still limited, both in literature and existing databases. Yet this clinical information is of utmost importance for health professionals and the parents of children with rare diseases. Since existing databases are often hampered by the limited time and willingness of health professionals to input new data, we collected phenotype data directly from parents of children with a chromosome 6 disorder. These parents were reached via social media, and the information was collected via the online Chromosome 6 Questionnaire, which includes 115 main questions on congenital abnormalities, medical problems, behaviour, growth and development. METHODS: Here, we assess data consistency by comparing parent-reported phenotypes to phenotypes based on copies of medical files for the same individual (n = 20) and data availability by comparing the data available on specific characteristics reported by parents (n = 34) to data available in existing literature (n = 39). RESULTS: The reported answers to the main questions on phenotype characteristics were 85-95% consistent, and the consistency of answers to subsequent more detailed questions was 77-96%. For all but two main questions, significantly more data was collected from parents via the Chromosome 6 Questionnaire than was currently available in literature. For the topics developmental delay and brain abnormalities, no significant difference in the amount of available data was found. The only feature for which significantly more data was available in literature was a sub-question on the type of brain abnormality present. CONCLUSION: This is the first study to compare phenotype data collected directly from parents to data extracted from medical files on the same individuals. We found that the data was highly consistent, and phenotype data collected via the online Chromosome 6 Questionnaire resulted in more available information on most clinical characteristics when compared to phenotypes reported in literature reports thus far. We encourage active patient participation in rare disease research and have shown that parent-reported phenotypes are reliable and contribute to our knowledge of the phenotypic spectrum of rare chromosomal disorders.
Subject(s)
Brain Diseases , Chromosomes, Human, Pair 6 , Humans , Chromosome Aberrations , Research Design , Surveys and Questionnaires , Phenotype , Parents
ABSTRACT
BACKGROUND: Terminal 6p deletions are rare, and information on their clinical consequences is scarce, which impedes optimal management and follow-up by clinicians. The parent-driven Chromosome 6 Project collaborates with families of affected children worldwide to better understand the clinical effects of chromosome 6 aberrations and to support clinical guidance. A microarray report is required for participation, and detailed phenotype information is collected directly from parents through a multilingual web-based questionnaire. Information collected from parents is then combined with case data from literature reports. Here, we present our findings on 13 newly identified patients and 46 literature cases with genotypically well-characterised terminal and subterminal 6p deletions. We provide phenotype descriptions for both the whole group and for subgroups based on deletion size and HI gene content. RESULTS: The total group shared a common phenotype characterised by ocular anterior segment dysgenesis, vision problems, brain malformations, congenital defects of the cardiac septa and valves, mild to moderate hearing impairment, eye movement abnormalities, hypotonia, mild developmental delay and dysmorphic features. These characteristics were observed in all subgroups where FOXC1 was included in the deletion, confirming a dominant role for this gene. Additional characteristics were seen in individuals with terminal deletions exceeding 4.02 Mb, namely complex heart defects, corpus callosum abnormalities, kidney abnormalities and orofacial clefting. Some of these additional features may be related to the loss of other genes in the terminal 6p region, such as RREB1 for the cardiac phenotypes and TUBB2A and TUBB2B for the cerebral phenotypes. In the newly identified patients, we observed previously unreported features including gastrointestinal problems, neurological abnormalities, balance problems and sleep disturbances. 
CONCLUSIONS: We present an overview of the phenotypic characteristics observed in terminal and subterminal 6p deletions. This reveals a common phenotype that can be highly attributable to haploinsufficiency of FOXC1, with a possible additional effect of other genes in the 6p25 region. We also delineate the developmental abilities of affected individuals and report on previously unrecognised features, showing the added benefit of collecting information directly from parents. Based on our overview, we provide recommendations for clinical surveillance to support clinicians, patients and families.
Subject(s)
Eye Abnormalities , Heart Defects, Congenital , Social Media , Humans , Phenotype , Chromosome Aberrations , Eye Abnormalities/genetics , Heart Defects, Congenital/genetics , Chromosome Deletion , Chromosomes, Human, Pair 6/genetics
ABSTRACT
The c.40_42delAGA variant in the phospholamban gene (PLN) has been associated with dilated and arrhythmogenic cardiomyopathy, with up to 70% of carriers experiencing a major cardiac event by age 70. However, there are carriers who remain asymptomatic at older ages. To understand the mechanisms behind this incomplete penetrance, we evaluated potential phenotypic and genetic modifiers in 74 PLN:c.40_42delAGA carriers identified in 36,339 participants of the Lifelines population cohort. Asymptomatic carriers (N = 48) showed shorter QRS duration (-5.73 ms, q value = 0.001) compared to asymptomatic non-carriers, an effect we could replicate in two different independent cohorts. Furthermore, symptomatic carriers showed a higher correlation (Pearson r = 0.17) between polygenic predisposition to higher QRS (PGS_QRS) and QRS (p value = 1.98 × 10⁻⁸), suggesting that the effect of the genetic variation on cardiac rhythm might be increased in symptomatic carriers. Our results allow for improved clinical interpretation for asymptomatic carriers, while our approach could guide future studies on genetic diseases with incomplete penetrance.
Subject(s)
Cardiomyopathies , Humans , Aged , Mutation , Cardiomyopathies/diagnosis , Cardiomyopathies/genetics , Calcium-Binding Proteins/genetics , Genotype
ABSTRACT
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection causes severe COVID-19 in some patients and mild COVID-19 in others. Dysfunctional innate immune responses have been identified to contribute to COVID-19 severity, but the key regulators are still unknown. Here, we present an integrative single-cell multi-omics analysis of peripheral blood mononuclear cells from hospitalized and convalescent COVID-19 patients. In classical monocytes, we identified genes that were potentially regulated by differential chromatin accessibility. Then, sub-clustering and motif-enrichment analyses revealed disease condition-specific regulation by transcription factors and their targets, including an interaction between C/EBPs and a long-noncoding RNA LUCAT1, which we validated through loss-of-function experiments. Finally, we investigated genetic risk variants that exhibit allele-specific open chromatin (ASoC) in COVID-19 patients and identified a SNP rs6800484-C, which is associated with lower expression of CCR2 and may contribute to higher viral loads and higher risk of COVID-19 hospitalization. Altogether, our study highlights the diverse genetic and epigenetic regulators that contribute to COVID-19.
ABSTRACT
Optimizing research on the developmental origins of health and disease (DOHaD) involves implementing initiatives maximizing the use of the available cohort study data; achieving sufficient statistical power to support subgroup analysis; and using participant data presenting adequate follow-up and exposure heterogeneity. It also involves being able to undertake comparison, cross-validation, or replication across data sets. To answer these requirements, cohort study data need to be findable, accessible, interoperable, and reusable (FAIR), and more particularly, it often needs to be harmonized. Harmonization is required to achieve or improve comparability of the putatively equivalent measures collected by different studies on different individuals. Although the characteristics of the research initiatives generating and using harmonized data vary extensively, all are confronted by similar issues. Having to collate, understand, process, host, and co-analyze data from individual cohort studies is particularly challenging. The scientific success and timely management of projects can be facilitated by an ensemble of factors. The current document provides an overview of the 'life course' of research projects requiring harmonization of existing data and highlights key elements to be considered from the inception to the end of the project.
Subject(s)
Research Design , Humans , Cohort Studies , Retrospective Studies
ABSTRACT
INTRODUCTION: Rare disease patient data are typically sensitive, present in multiple registries controlled by different custodians, and non-interoperable. Making these data Findable, Accessible, Interoperable, and Reusable (FAIR) for humans and machines at source enables federated discovery and analysis across data custodians. This facilitates accurate diagnosis, optimal clinical management, and personalised treatments. In Europe, twenty-four European Reference Networks (ERNs) work on rare disease registries in different clinical domains. The process and the implementation choices for making data FAIR ('FAIRification') differ among ERN registries. For example, registries use different software systems and are subject to different legal regulations. To support the ERNs in making informed decisions and to harmonise FAIRification, the FAIRification steward team was established to work as liaisons between ERNs and researchers from the European Joint Programme on Rare Diseases. RESULTS: The FAIRification steward team inventoried the FAIRification challenges of the ERN registries and proposed solutions collectively with involved stakeholders to address them. Ninety-eight FAIRification challenges from 24 ERNs' registries were collected and categorised into "training" (31), "community" (9), "modelling" (12), "implementation" (26), and "legal" (20). After curating and aggregating highly similar challenges, 41 unique FAIRification challenges remained. The two categories with the most challenges were "training" (15) and "implementation" (9), followed by "community" (7), and then "modelling" (5) and "legal" (5). To address all challenges, eleven types of solutions were proposed. Among them, the provision of guidelines and the organisation of training activities resolved the "training" challenges; the activities ranged from less-technical "coffee-rounds" to technical workshops and from informal FAIR Games to formal hackathons. Obtaining implementation support from technical experts was the solution type for tackling the "implementation" challenges. CONCLUSION: This work shows that a dedicated team of FAIR data stewards is an asset for harmonising the various processes of making data FAIR in a large organisation with multiple stakeholders. Additionally, multi-levelled training activities are required to accommodate the diverse needs of the ERNs. Finally, the lessons learned from the experience of the FAIRification steward team described in this paper may help to increase FAIR awareness and provide insights into FAIRification challenges and solutions of rare disease registries.
Subject(s)
Rare Diseases , Software , Humans , Europe , Rare Diseases/therapy , Registries
ABSTRACT
Objective: This study investigates current standards and operational gaps in the management and sharing of next generation sequencing (NGS) data within the healthcare and research setting and according to Findable, Accessible, Interoperable and Reusable (FAIR) principles. Methods: The analysis was performed as the basis from which to bridge identified gaps and develop widely accepted working standards that ensure optimal reusability of genomic data in healthcare and research settings in the Netherlands. This work is part of the 'Rational Pharmacotherapy Program' led by ZonMw, The Netherlands Organisation for Health Research and Development, which aims to promote the efficient implementation of NGS and personalised medicine within Dutch healthcare, with an initial focus on oncology and rare diseases. Results: Based on this analysis and as part of this programme, a consortium was formed to develop an instruction manual for FAIR genomic data in clinical care and research based on an inventory of commonly used workflows and standards in the (inter)national field of genome analysis. Conclusions: The gap analysis presented and discussed in this paper represents the starting point for this inventory and is a possible contribution from the Netherlands to the European 1+ Million Genomes Initiative. This paper addresses the topics of data generation, data quality, (meta)data standards, data storage and archiving and data integration and exchange.
ABSTRACT
The genomes of thousands of individuals are profiled within Dutch healthcare and research each year. However, this valuable genomic data, associated clinical data and consent are captured in different ways and stored across many systems and organizations. This makes it difficult to discover rare disease patients, reuse data for personalized medicine and establish research cohorts based on specific parameters. FAIR Genomes aims to enable NGS data reuse by developing metadata standards for the data descriptions needed to FAIRify genomic data while also addressing ELSI issues. We developed a semantic schema of essential data elements harmonized with international FAIR initiatives. The FAIR Genomes schema v1.1 contains 110 elements in 9 modules. It reuses common ontologies such as NCIT, DUO and EDAM, only introducing new terms when necessary. The schema is represented by a YAML file that can be transformed into templates for data entry software (EDC) and programmatic interfaces (JSON, RDF) to ease genomic data sharing in research and healthcare. The schema, documentation and MOLGENIS reference implementation are available at https://fairgenomes.org .
Subject(s)
High-Throughput Nucleotide Sequencing , Metadata , Delivery of Health Care , Genomics , Humans , Software
ABSTRACT
BACKGROUND: The European Platform on Rare Disease Registration (EU RD Platform) aims to address the fragmentation of European rare disease (RD) patient data, scattered among hundreds of independent and non-coordinating registries, by establishing standards for integration and interoperability. The first practical output of this effort was a set of 16 Common Data Elements (CDEs) that should be implemented by all RD registries. Interoperability, however, requires decisions beyond data elements - including data models, formats, and semantics. Within the European Joint Programme on Rare Diseases (EJP RD), we aim to further the goals of the EU RD Platform by generating reusable RD semantic model templates that follow the FAIR Data Principles. RESULTS: Through a team-based iterative approach, we created semantically grounded models to represent each of the CDEs, using the SemanticScience Integrated Ontology as the core framework for representing the entities and their relationships. Within that framework, we mapped the concepts represented in the CDEs, and their possible values, into domain ontologies such as the Orphanet Rare Disease Ontology, Human Phenotype Ontology and National Cancer Institute Thesaurus. Finally, we created an exemplar, reusable ETL pipeline that we will be deploying over these non-coordinating data repositories to assist them in creating model-compliant FAIR data without requiring site-specific coding nor expertise in Linked Data or FAIR. CONCLUSIONS: Within the EJP RD project, we determined that creating reusable, expert-designed templates reduced or eliminated the requirement for our participating biomedical domain experts and rare disease data hosts to understand OWL semantics. This enabled them to publish highly expressive FAIR data using tools and approaches that were already familiar to them.