ABSTRACT
MOTIVATION: The volume and complexity of biological data are increasing rapidly. Many clinical professionals and biomedical researchers without a bioinformatics background are generating big '-omics' data but do not always have the tools to manage, process or publicly share these data. RESULTS: Here we present MOLGENIS Research, an open-source web application to collect, manage, analyze, visualize and share large and complex biomedical datasets without the need for advanced bioinformatics skills. AVAILABILITY AND IMPLEMENTATION: MOLGENIS Research is freely available as open-source software. It can be installed from source code (see http://github.com/molgenis), downloaded as a precompiled WAR file (for your own server), set up inside a Docker container (see http://molgenis.github.io), or requested as a Software-as-a-Service subscription. For a public demo instance and complete installation instructions see http://molgenis.org/research.
Subject(s)
Computational Biology; Software; Algorithms; Genome; Genomics
ABSTRACT
Rett syndrome (RTT) is a rare monogenic disorder that causes severe neurological problems. In most cases, it results from a loss-of-function mutation in the gene encoding methyl-CpG-binding protein 2 (MECP2). Currently, about 900 unique MECP2 variations (benign and pathogenic) have been identified, and it is suspected that different mutations contribute to different levels of disease severity. For researchers and clinicians, it is important that genotype-phenotype information is available to identify disease-causing mutations for diagnosis, to aid in clinical management of the disorder, and to provide counseling for parents. In this study, 13 genotype-phenotype databases were surveyed for their general functionality and the availability of RTT-specific MECP2 variation data. For each database, we investigated findability and interoperability alongside practical user functionality, and the type and amount of genetic and phenotype data. The main conclusions are that these databases, and the specific MECP2 variants held within them, are challenging to find, and that interoperability is as yet poorly developed, so that searching across databases requires effort. Nevertheless, we found several thousand online database entries for MECP2 variations and their associated phenotypes, diagnoses, or predicted variant effects, which is a good starting point for researchers and clinicians who want to provide, annotate, and use the data.
Subject(s)
Databases, Genetic; Methyl-CpG-Binding Protein 2/genetics; Rett Syndrome/genetics; Female; Genotype; Humans; Loss of Function Mutation/genetics; Male; Mutation/genetics; Phenotype; Rett Syndrome/pathology
ABSTRACT
MOTIVATION: Biobanks are indispensable for large-scale genetic/epidemiological studies, yet it remains difficult for researchers to determine which biobanks contain data matching their research questions. RESULTS: To overcome this, we developed a new matching algorithm that identifies pairs of related data elements between biobanks and research variables with high precision and recall. It integrates lexical comparison, Unified Medical Language System ontology tagging and semantic query expansion. The result is BiobankUniverse, a fast matchmaking service for biobanks and researchers. Biobankers upload their data elements and researchers upload their desired study variables; BiobankUniverse then automatically shortlists matching attributes between them. Users can quickly explore matching potential and search for biobanks/data elements matching their research. They can also curate matches and define personalized data-universes. AVAILABILITY AND IMPLEMENTATION: BiobankUniverse is available at http://biobankuniverse.com or can be downloaded as part of the open source MOLGENIS suite at http://github.com/molgenis/molgenis. CONTACT: m.a.swertz@rug.nl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
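As a rough illustration of the lexical comparison step only, the sketch below scores a research variable against biobank data element labels using token-set Jaccard similarity. This is an assumed stand-in, not the published BiobankUniverse algorithm: the function names, the similarity measure and the 0.3 threshold are all hypothetical, and the ontology tagging and semantic query expansion steps are not modelled.

```python
import re


def tokens(label: str) -> set[str]:
    """Lowercase a label and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two labels, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def shortlist(variable: str, data_elements: list[str],
              threshold: float = 0.3) -> list[tuple[str, float]]:
    """Return data elements scoring at or above the (hypothetical) threshold,
    best match first."""
    scored = [(element, jaccard(variable, element)) for element in data_elements]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    elements = ["Weight of body in kg", "Systolic blood pressure", "Body height"]
    for element, score in shortlist("body weight", elements):
        print(f"{score:.2f}  {element}")
```

A purely lexical score like this one also ranks "Body height" as a candidate for "body weight", which hints at why the abstract combines lexical comparison with ontology-based steps.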
Subject(s)
Computational Biology/methods; Databases, Factual; Software; Algorithms
ABSTRACT
MOTIVATION: While the size and number of biobanks, patient registries and other data collections are increasing, biomedical researchers still often need to pool data for statistical power, a task that requires time-intensive retrospective integration. RESULTS: To address this challenge, we developed MOLGENIS/connect, a semi-automatic system to find, match and pool data from different sources. The system shortlists relevant source attributes from thousands of candidates using ontology-based query expansion to overcome variations in terminology. It then generates algorithms that transform source attributes to a common target DataSchema. These include unit conversion, categorical value matching and complex conversion patterns (e.g. calculation of BMI). In comparison to human experts, MOLGENIS/connect was able to auto-generate 27% of the algorithms perfectly, with an additional 46% needing only minor editing, representing a reduction in the human effort and expertise needed to pool data. AVAILABILITY AND IMPLEMENTATION: Source code, binaries and documentation are available as open source under LGPLv3 from http://github.com/molgenis/molgenis and www.molgenis.org/connect. CONTACT: m.a.swertz@rug.nl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
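The three kinds of transformation named above (unit conversion, categorical value matching and a derived value such as BMI) can be sketched by hand as follows. This is a minimal illustration under stated assumptions, not output of MOLGENIS/connect: the source field names (`height_cm`, `weight_kg`, `sex`), the category coding in `SEX_MAP` and the target layout are all hypothetical.

```python
# Hypothetical source coding for sex; real sources would each need their own map.
SEX_MAP = {"M": "male", "F": "female", "1": "male", "2": "female"}


def cm_to_m(height_cm: float) -> float:
    """Unit conversion: centimetres to metres."""
    return height_cm / 100.0


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def harmonize(record: dict) -> dict:
    """Map one source record (hypothetical field names) to a common target layout."""
    height_m = cm_to_m(record["height_cm"])
    return {
        # Categorical value matching via a lookup table.
        "sex": SEX_MAP[str(record["sex"]).upper()],
        # Derived attribute computed from two converted source attributes.
        "bmi": round(bmi(record["weight_kg"], height_m), 1),
    }


if __name__ == "__main__":
    print(harmonize({"sex": "f", "height_cm": 180, "weight_kg": 81}))
```

Each source would need its own version of `harmonize`. Per the abstract, MOLGENIS/connect generates such algorithms semi-automatically rather than requiring them to be written by hand.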
Subject(s)
Biological Specimen Banks; Computational Biology/methods; Phenotype; Software; Algorithms; Biological Ontologies; Humans
ABSTRACT
The capacity to link records associated with the same individual across data sets is a key challenge for data-driven research. The challenge is exacerbated by the potential inclusion of both genomic and clinical data in data sets that may span multiple legal jurisdictions, and by the need to enable re-identification in limited circumstances. Privacy-Preserving Record Linkage (PPRL) methods address these challenges. In 2016, the Interdisciplinary Committee of the International Rare Diseases Research Consortium (IRDiRC) launched a task team to explore approaches to PPRL. The task team is a collaboration with the Global Alliance for Genomics and Health (GA4GH) Regulatory and Ethics and Data Security Work Streams, and aims to prepare policy and technology standards to enable highly reliable linking of records associated with the same individual without disclosing their identity except under conditions in which the use of the data has led to information of importance to the individual's safety or health, and applicable law allows or requires the return of results. The PPRL Task Force has examined the ethico-legal requirements, constraints, and implications of PPRL, and has applied this knowledge to the exploration of technology methods and approaches to PPRL. This paper reports and justifies the findings and recommendations thus far.
Subject(s)
Computer Security; Confidentiality; Genomics; Medical Informatics/methods; Big Data; Databases, Factual; Europe; Genetic Linkage; Genome, Human; Humans; Interdisciplinary Communication; Medical Informatics/standards; Rare Diseases/genetics; United States
ABSTRACT
The Open Source Registry for Rare Diseases (OSSE) provides a concept and software for the management of registries for patients with rare diseases. A disease is defined as rare if fewer than 5 out of 10,000 people are affected. To date, approximately 6,000 rare diseases have been catalogued. Networking and data exchange for research purposes remain challenging because interoperability is lacking and small data sets are stored locally. The "Findable, Accessible, Interoperable, Reusable" (FAIR) Data Principles have been developed to improve research in the field of rare diseases. The OSSE architecture was subsequently adapted to implement the FAIR Data Principles. To this end, a FAIR Data Point was integrated into OSSE to provide a description of metadata in a FAIR manner. OSSE relies on the existing metadata repository (MDR), which is used to define data elements in the system. This is an important step towards unified documentation across multiple registries. The integration and use of new procedures to improve interoperability play an important role in the context of registries for rare diseases.
Subject(s)
Metadata; Rare Diseases; Registries; Statistics as Topic; Humans; Research; Software
ABSTRACT
Rare disease (RD) patient registries are powerful instruments that help develop clinical research, facilitate the planning of appropriate clinical trials, improve patient care, and support healthcare management. They constitute a key information system that supports the activities of European Reference Networks (ERNs) on rare diseases. RD registries have proliferated rapidly in recent years, and there is a need to develop guidance on the minimum requirements, recommendations and standards necessary to maintain a high-quality registry. In response to this heterogeneity, and in the framework of RD-Connect, a European platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research, we report on a list of recommendations, developed by a group of experts that included members of patient organizations, to be used as a framework for improving the quality of RD registries. This list covers governance, Findable, Accessible, Interoperable and Reusable (FAIR) data and information, infrastructure, documentation, training, and quality audit. The list is intended to be used by established as well as new RD registries. Further work includes the development of a toolkit to enable continuous assessment and improvement of their organizational and data quality.
Subject(s)
Quality Improvement; Rare Diseases; Registries/standards; Biomedical Research; Computational Biology; Data Accuracy; Europe; Humans; Information Storage and Retrieval/standards
ABSTRACT
Researchers need biobank samples to be easily discoverable. Currently, there is no uniform way to find samples and negotiate access. Instead, researchers have to communicate with each biobank separately. We present the architecture for the BBMRI-CS IT platform, whose goal is to facilitate sample location and access. We chose a decentralized approach, which allows for strong data protection and provides the high flexibility needed in the highly heterogeneous landscape of European biobanks. This is the first implementation of a decentralized search in the biobank field. With the addition of a Negotiator component, it also allows for easy communication and for following through the lengthy approval process for accessing samples.
Subject(s)
Biological Specimen Banks; Databases, Factual; Negotiating; Access to Information
ABSTRACT
The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
ABSTRACT
Biobanks are the biological back end of data-driven medicine, but lack standards and generic solutions for interoperability and information harmonization. The move toward a global information infrastructure for biobanking demands semantic interoperability through harmonized services and common ontologies. To tackle this issue, the Minimum Information About BIobank data Sharing (MIABIS) was developed in 2012 by the Biobanking and BioMolecular Resources Research Infrastructure of Sweden (BBMRI.se). The wide acceptance of the first version of MIABIS encouraged evolving it to a more structured and descriptive standard. In 2013 a working group was formed under the largest infrastructure for health in Europe, Biobanking and BioMolecular Resources Research Infrastructure (BBMRI-ERIC), with the remit to continue the development of MIABIS (version 2.0) through a multicountry governance process. MIABIS 2.0 Core has been developed with 22 attributes describing Biobanks, Sample Collections, and Studies according to a modular structure that makes it easier to adhere to and to extend the standard. This integration standard will make a great contribution to the discovery and exploitation of biobank resources and lead to a wider and more efficient use of valuable bioresources, thereby speeding up the research on human diseases. Many within the European Union have accepted MIABIS 2.0 Core as the "de facto" biobank information standard.
Subject(s)
Biological Specimen Banks/organization & administration; Specimen Handling/standards; Biological Specimen Banks/standards; Databases, Factual; European Union; Humans; Information Dissemination; Software
ABSTRACT
High-throughput molecular profiling techniques are routinely generating vast amounts of data for translational medicine studies. Secure, access-controlled systems are needed to manage, store, transfer and distribute these data because of their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access to, and management of, long-term archives of bio-molecular data. Each data provider is responsible for ensuring that a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcome of this project: a framework to enable the download of EGA data to a Galaxy server in a secure way. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, where the data can subsequently be processed further. This tool allows a user to run, within the browser, an entire analysis containing sensitive data from EGA, and to make this analysis available to other researchers in a reproducible manner, as shown with a proof-of-concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.
ABSTRACT
Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (~35,000 samples) with the population-specific reference panel created by the Genome of The Netherlands Project and perform association testing with blood lipid levels. We report the discovery of five novel associations at four loci (P value < 6.61 × 10⁻⁴), including a rare missense variant in ABCA6 (rs77542162, p.Cys1359Arg, frequency 0.034), which is predicted to be deleterious. The frequency of this ABCA6 variant is 3.65-fold increased in the Dutch and its effect (β for LDL-C = 0.135, β for TC = 0.140) is estimated to be very similar to those observed for single variants in well-known lipid genes, such as LDLR.
Subject(s)
ATP-Binding Cassette Transporters/genetics; Cholesterol/blood; Mutation, Missense/genetics; Gene Frequency; Genetic Association Studies; Humans; Netherlands
ABSTRACT
Although genome-wide association studies (GWAS) have identified many common variants associated with complex traits, low-frequency and rare variants have not been interrogated in a comprehensive manner. Imputation from dense reference panels, such as the 1000 Genomes Project (1000G), enables testing of ungenotyped variants for association. Here we present the results of imputation using a large, new population-specific panel: the Genome of The Netherlands (GoNL). We benchmarked the performance of the 1000G and GoNL reference sets by comparing imputed genotypes with 'true' genotypes typed on ImmunoChip in three European populations (Dutch, British, and Italian). GoNL showed significant improvement in the imputation quality for rare variants (MAF 0.05-0.5%) compared with 1000G. In Dutch samples, the mean observed squared Pearson correlation, r², increased from 0.61 to 0.71. We also saw improved imputation accuracy for other European populations (in the British samples, r² improved from 0.58 to 0.65, and in the Italians from 0.43 to 0.47). A combined reference set comprising 1000G and GoNL improved the imputation of rare variants even further. The Italian samples benefitted the most from this combined reference (the mean r² increased from 0.47 to 0.50). We conclude that the creation of a large population-specific reference is advantageous for imputing rare variants and that a combined reference panel across multiple populations yields the best imputation results.
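The r² metric reported above is the squared Pearson correlation between imputed and directly typed genotypes. A minimal sketch of that computation for a single variant is given below. It assumes genotypes are coded as allele counts (0, 1, 2) and imputed values as dosages between 0 and 2. The function names and the choice to return 0.0 for a variant with no variation are illustrative assumptions, not details taken from the study.

```python
def pearson_r2(imputed: list[float], truth: list[float]) -> float:
    """Squared Pearson correlation between imputed dosages and typed genotypes
    for one variant across samples."""
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length vectors with at least two samples")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        # Correlation is undefined without variation; 0.0 is an assumed convention.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def mean_r2(variants: list[tuple[list[float], list[float]]]) -> float:
    """Average r² over variants, each given as (imputed dosages, typed genotypes)."""
    return sum(pearson_r2(imp, typed) for imp, typed in variants) / len(variants)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    imputed = [0.1, 0.9, 1.8, 0.2]
    typed = [0, 1, 2, 0]
    print(round(pearson_r2(imputed, typed), 3))
```

Averaging per-variant values within a frequency bin, as `mean_r2` does, is one way to arrive at a summary figure of the kind quoted in the abstract.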