Search | VHL (BVS) Infectious and Parasitic Diseases

dsMTL: a computational framework for privacy-preserving, distributed multi-task machine learning.

Cao, Han; Zhang, Youcheng; Baumbach, Jan; Burton, Paul R; Dwyer, Dominic; Koutsouleris, Nikolaos; Matschinske, Julian; Marcon, Yannick; Rajan, Sivanesan; Rieg, Thilo; Ryser-Welch, Patricia; Späth, Julian; Herrmann, Carl; Schwarz, Emanuel.

Bioinformatics ; 38(21): 4919-4926, 2022 10 31.

Article in English | MEDLINE | ID: mdl-36073911

ABSTRACT

MOTIVATION: In multi-cohort machine learning studies, it is critical to differentiate between effects that are reproducible across cohorts and those that are cohort-specific. Multi-task learning (MTL) is a machine learning approach that facilitates this differentiation through the simultaneous learning of prediction tasks across cohorts. Since multi-cohort data often cannot be combined into a single storage solution, an MTL application for geographically distributed data sources would be of substantial utility.

RESULTS: Here, we describe the development of 'dsMTL', a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised algorithms and one unsupervised algorithm. First, we derive the theoretical properties of these methods and the relevant machine learning workflows to ensure the validity of the software implementation. Second, we implement dsMTL as a library for the R programming language, building on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. Third, we demonstrate the applicability of dsMTL for comorbidity modeling in distributed data. We show that comorbidity modeling using dsMTL outperformed conventional, federated machine learning, as well as the aggregation of multiple models built on the distributed datasets individually. The application of dsMTL was computationally efficient and highly scalable when applied to moderate-size (n < 500), real expression data given the actual network latency.

AVAILABILITY AND IMPLEMENTATION: dsMTL is freely available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Machine Learning, Privacy, Humans, Software, Programming Languages, Algorithms

Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD.

Marcon, Yannick; Bishop, Tom; Avraam, Demetris; Escriba-Montagut, Xavier; Ryser-Welch, Patricia; Wheater, Stuart; Burton, Paul; González, Juan R.

PLoS Comput Biol ; 17(3): e1008880, 2021 03.

Article in English | MEDLINE | ID: mdl-33784300

ABSTRACT

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers' ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access its data. The platform allows an analyst to pass commands to each server, and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal, a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).

Subjects

Big Data, Computer Security, Software, Factual Databases, Genomics, Geographic Information Systems, Humans

Epimutation detection in the clinical context: guidelines and a use case from a new Bioconductor package.

Ruiz-Arenas, Carlos; Abarrategui, Leire; Hernandez-Ferrer, Carles; Escribà-Montagut, Xavier; Pelegrí-Sisó, Dolors; Ryser-Welch, Patricia; Vrijheid, Martine; Bustamante, Mariona; Grazuleviciene, Regina; Lepeule, Johanna; Mathai, Mathew; Vafeiadi, Marina; Beltran, Sergi; Pérez-Jurado, Luis A; González, Juan R.

Epigenetics ; 18(1): 2230670, 2023 12.

Article in English | MEDLINE | ID: mdl-37409354

ABSTRACT

Epimutations are rare alterations of the normal DNA methylation pattern at specific loci, which can lead to rare diseases. Methylation microarrays enable genome-wide epimutation detection, but technical limitations prevent their use in clinical settings: methods applied to rare disease data cannot be easily incorporated into standard analysis pipelines, while epimutation methods implemented in R packages (ramr) have not been validated for rare diseases. We have developed epimutacions, a Bioconductor package (https://bioconductor.org/packages/release/bioc/html/epimutacions.html). epimutacions implements two previously reported methods and four new statistical approaches to detect epimutations, along with functions to annotate and visualize epimutations. Additionally, we have developed a user-friendly Shiny app (https://github.com/isglobal-brge/epimutacionsShiny) to facilitate epimutation detection for non-bioinformatician users. We first compared the performance of the epimutacions and ramr packages using three public datasets with experimentally validated epimutations. Methods in epimutacions performed well at low sample sizes and outperformed methods in ramr. Second, we used two general population children cohorts (INMA and HELIX) to determine the technical and biological factors that affect epimutation detection, providing guidelines on how to design the experiments or preprocess the data. In these cohorts, most epimutations did not correlate with detectable regional gene expression changes. Finally, we exemplified how epimutacions can be used in a clinical context. We ran epimutacions in a cohort of children with autism disorder and identified novel recurrent epimutations in candidate genes for autism. Overall, we present epimutacions, a new Bioconductor package for incorporating epimutation detection into rare disease diagnosis, and provide guidelines for the design and data analyses.

Subjects

DNA Methylation, Software, Child, Humans, Rare Diseases, Genome
