Results 1 - 20 of 601
1.
PLoS One ; 18(9): e0291169, 2023.
Article in English | MEDLINE | ID: mdl-37729186

ABSTRACT

Campaign contributions are a staple of congressional life. Yet, the search for tangible effects of congressional donations often focuses on the association between contributions and votes on congressional bills. We present an alternative approach by considering the relationship between money and legislators' speech. Floor speeches are an important component of congressional behavior, and reflect a legislator's policy priorities and positions in a way that voting cannot. Our research provides the first comprehensive analysis of the association between a legislator's campaign donors and the policy issues they prioritize with congressional speech. Ultimately, we find a robust relationship between donors and speech, indicating a more pervasive role of money in politics than previously assumed. We use a machine learning framework on a new dataset that brings together legislator metadata for all representatives in the US House between 1995 and 2018, including committee assignments, legislative speech, donation records, and information about Political Action Committees. We compare information about donations against other potential explanatory variables, such as party affiliation, home state, and committee assignments, and find that donors consistently have the strongest association with legislators' issue-attention. We further contribute a procedure for identifying speech and donation events that occur in close proximity to one another and share meaningful connections, identifying the proverbial needles in the haystack of speech and donation activity in Congress which may be cases of interest for investigative journalism. Taken together, our framework, data, and findings can help increase the transparency of the role of money in politics.


Subjects
Machine Learning , Tissue Donors , Humans , Metadata , Policy , Politics
2.
Stud Health Technol Inform ; 307: 3-11, 2023 Sep 12.
Article in English | MEDLINE | ID: mdl-37697832

ABSTRACT

Metadata is essential for handling medical data according to FAIR principles. Standards are well-established for many types of electrophysiological methods but are still lacking for microneurographic recordings of peripheral sensory nerve fibers in humans. Developing a new concept to enhance laboratory workflows is a complex process. We propose a standard for structuring and storing microneurography metadata based on odML and odML-tables. Further, we present an extension to the odML-tables GUI that enables user-friendly search functionality of the database. With our open-source repository, we encourage other microneurography labs to incorporate odML-based metadata into their experimental routines.


Subjects
Interior Design and Furnishings , Metadata , Humans , Factual Databases , Laboratories , Workflow
3.
Stud Health Technol Inform ; 307: 31-38, 2023 Sep 12.
Article in English | MEDLINE | ID: mdl-37697835

ABSTRACT

INTRODUCTION: With the increasing availability of reusable biomedical data, from cohort studies to clinical routine data, data re-users face the problem of managing transferred data according to heterogeneous data use agreements. While structured metadata is addressed in many contexts, including informed consent, contracts are to date still unstructured text documents. In particular, within collaborative and active working groups, the actual usage agreement's regulations are highly relevant for daily practice: can I share the data with colleagues from the same university or the same research network, can the data be stored on a PhD student's laptop, can I store the data for further approved data usage requests? METHODS: In this article, we inspect and review seven different data usage agreements. We focus on digital data that is copied and transferred to the requester's environment. RESULTS: We identified 24 metadata items in the four main categories of data usage, storage, sharing, and publication of results. DISCUSSION: While the topics largely overlap across the data use agreements, the actual regulations of the topics are diverse. Although we do not explicitly investigate trusted research environments, where data is offered within an analytics platform, we consider them as a subgroup in which most of the practical questions from the data scientist's perspective also arise. CONCLUSION: With a limited set of structured metadata items, data scientists could have information about the data use agreement at hand along with the transferred data in an easily accessible way.


Subjects
Metadata , Physicians , Humans , Informed Consent , Microcomputers , Trust
4.
Stud Health Technol Inform ; 307: 243-248, 2023 Sep 12.
Article in English | MEDLINE | ID: mdl-37697859

ABSTRACT

To provide clinical data in distributed research architectures, a fundamental challenge involves defining and distributing suitable metadata within Metadata Repositories. Especially for structured data, data elements need to be bound against suitable terminologies; otherwise, other systems will only be able to interpret the data with complex and error-prone manual involvement. As current Metadata Repository implementations lack support for querying externally defined terminologies in FHIR terminology servers, we propose an intermediate solution that uses appropriate annotations on metadata elements to allow run-time Terminology Services mediated queries of that metadata. This allows a very clear separation of concerns between the two related systems, greatly simplifying terminological maintenance. The system performed well in a prototypical deployment.


Subjects
Metadata
5.
Sci Data ; 10(1): 628, 2023 09 16.
Article in English | MEDLINE | ID: mdl-37717051

ABSTRACT

The Two Weeks in the World research project has resulted in a dataset of 3087 clinically relevant bacterial genomes with pertaining metadata, collected from 59 diagnostic units in 35 countries around the world during 2020. A relational database is available with metadata and summary data from selected bioinformatic analysis, such as species prediction and identification of acquired resistance genes.


Subjects
Bacteria , Bacterial Genome , Bacteria/genetics , Computational Biology , Factual Databases , Metadata
6.
Sci Data ; 10(1): 633, 2023 09 18.
Article in English | MEDLINE | ID: mdl-37723189

ABSTRACT

The field of human action recognition has made great strides in recent years, much helped by the availability of a wide variety of datasets that use Kinect to record human movement. Conversely, progress towards the use of Kinect in clinical practice has been hampered by the lack of appropriate data, in particular datasets that contain clinically significant movements and appropriate metadata. This paper proposes a dataset to address this issue, namely KINECAL. It contains the recordings of 90 individuals carrying out 11 movements commonly used in the clinical assessment of balance. The dataset contains relevant metadata, including clinical labelling, falls history labelling and postural sway metrics. KINECAL should be of interest to researchers working on the clinical use of motion capture and motion analysis.


Subjects
Movement , Humans , Benchmarking , Metadata , Motion , Risk Assessment , Accidental Falls
7.
Sci Rep ; 13(1): 14375, 2023 09 01.
Article in English | MEDLINE | ID: mdl-37658079

ABSTRACT

Deep neural network models (DNNs) are essential to modern AI and provide powerful models of information processing in biological neural networks. Researchers in both neuroscience and engineering are pursuing a better understanding of the internal representations and operations that undergird the successes and failures of DNNs. Neuroscientists additionally evaluate DNNs as models of brain computation by comparing their internal representations to those found in brains. It is therefore essential to have a method to easily and exhaustively extract and characterize the results of the internal operations of any DNN. Many models are implemented in PyTorch, the leading framework for building DNN models. Here we introduce TorchLens, a new open-source Python package for extracting and characterizing hidden-layer activations in PyTorch models. Uniquely among existing approaches to this problem, TorchLens has the following features: (1) it exhaustively extracts the results of all intermediate operations, not just those associated with PyTorch module objects, yielding a full record of every step in the model's computational graph, (2) it provides an intuitive visualization of the model's complete computational graph along with metadata about each computational step in a model's forward pass for further analysis, (3) it contains a built-in validation procedure to algorithmically verify the accuracy of all saved hidden-layer activations, and (4) the approach it uses can be automatically applied to any PyTorch model with no modifications, including models with conditional (if-then) logic in their forward pass, recurrent models, branching models where layer outputs are fed into multiple subsequent layers in parallel, and models with internally generated tensors (e.g., injections of noise). 
Furthermore, using TorchLens requires minimal additional code, making it easy to incorporate into existing pipelines for model development and analysis, and useful as a pedagogical aid when teaching deep learning concepts. We hope this contribution will help researchers in AI and neuroscience understand the internal representations of DNNs.


Subjects
Brain , Cognition , Engineering , Metadata , Computer Neural Networks
8.
Foodborne Pathog Dis ; 20(9): 405-413, 2023 09.
Article in English | MEDLINE | ID: mdl-37540138

ABSTRACT

Salmonella enterica (S. enterica) is a commensal organism or a pathogen that causes disease in animals and humans and is widespread in the environment. Antimicrobial resistance (AMR) has increasingly affected both animal and human health and continues to raise public health concerns. A decade ago, it was estimated that the increased use of whole genome sequencing (WGS) combined with sharing of public data would drastically change and improve the surveillance and understanding of Salmonella epidemiology and AMR. This study aimed to evaluate the current usefulness of public WGS data for Salmonella surveillance and to investigate the associations between serovars, antibiotic resistance genes (ARGs), and metadata. Out of 191,306 Salmonella genomes deposited in the European Nucleotide Archive and NCBI databases, 47,452 S. enterica WGS records with sufficient minimum metadata (country, year, and source) were retrieved, from isolates collected in 116 countries between 1905 and 2020. For in silico analysis of the WGS data, KmerFinder, SISTR, and ResFinder were used for species, serovar, and AMR identification, respectively. The results showed that the five most common isolation sources of S. enterica are human (29.10%), avian (22.50%), environment (11.89%), water (9.33%), and swine (6.62%). The most common ARG profiles for each class of antimicrobials are β-lactam (blaTEM-1B; 6.78%), fluoroquinolone [(parC[T57S], qnrB19); 0.87%], folate pathway antagonist (sul2; 8.35%), macrolide [mph(A); 0.39%], phenicol (floR; 5.94%), polymyxin B (mcr-1.1; 0.09%), and tetracycline [tet(A); 12.95%]. Our study reports the first overview of ARG profiles in publicly available Salmonella genomes from online databases. All data sets from this study can be searched at Microreact.


Subjects
Anti-Bacterial Agents , Salmonella enterica , Humans , Animals , Swine , Anti-Bacterial Agents/pharmacology , Metadata , Bacterial Drug Resistance/genetics , Salmonella/genetics , Multiple Bacterial Drug Resistance/genetics
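
The abstract above keeps only genomes with sufficient minimum metadata (country, year, and source). A minimal sketch of such a filter follows. It is not the authors' pipeline: the function name `has_minimum_metadata`, the record layout, and the sample accessions are all hypothetical.

```python
# Hypothetical field names, following the three items named in the abstract.
REQUIRED_FIELDS = ("country", "year", "source")

def has_minimum_metadata(record, required=REQUIRED_FIELDS):
    """Return True if every required field is present and non-empty."""
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True

# Hypothetical records, for illustration only.
genomes = [
    {"accession": "A1", "country": "Denmark", "year": 2015, "source": "human"},
    {"accession": "A2", "country": "", "year": 2018, "source": "avian"},
    {"accession": "A3", "country": "Brazil", "year": 2001},
]

usable = [g for g in genomes if has_minimum_metadata(g)]
```

Only the first record passes, because the second has an empty country and the third has no source.
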
9.
Sci Data ; 10(1): 548, 2023 08 22.
Article in English | MEDLINE | ID: mdl-37607929

ABSTRACT

Extracting meaningful and reproducible models of brain function from stroke images, for both clinical and research purposes, is a daunting task severely hindered by the great variability of lesion frequency and patterns. Large datasets are therefore imperative, as well as fully automated image post-processing tools to analyze them. The development of such tools, particularly with artificial intelligence, is highly dependent on the availability of large datasets for model training and testing. We present a public dataset of 2,888 multimodal clinical MRIs of patients with acute and early subacute stroke, with manual lesion segmentation and metadata. The dataset provides high-quality, large-scale, human-supervised knowledge to feed artificial intelligence models and enable further development of tools to automate several tasks that currently rely on human labor, such as lesion segmentation, labeling, calculation of disease-relevant scores, and lesion-based studies relating function to frequency lesion maps.


Subjects
Magnetic Resonance Imaging , Stroke , Humans , Artificial Intelligence , Computer-Assisted Image Processing , Metadata , Patients , Stroke/diagnostic imaging
10.
Microb Genom ; 9(8)2023 08.
Article in English | MEDLINE | ID: mdl-37650865

ABSTRACT

Inferring the spatiotemporal spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) via Bayesian phylogeography has been complicated by the overwhelming sampling bias present in the global genomic dataset. Previous work has demonstrated the utility of metadata in addressing this bias. Specifically, the inclusion of recent travel history of SARS-CoV-2-positive individuals into extended phylogeographical models has demonstrated increased accuracy of estimates, along with proposing alternative hypotheses that were not apparent using only genomic and geographical data. However, as the availability of comprehensive epidemiological metadata is limited, many of the current estimates rely on sequence data and basic metadata (i.e. sample date and location). As the bias within the SARS-CoV-2 sequence dataset is extensive, the degree to which we can rely on results drawn from standard phylogeographical models (i.e. discrete trait analysis) that lack integrated metadata is of great concern. This is particularly important when estimates influence and inform public health policy. We compared results generated from the same dataset, using two discrete phylogeographical models: one including travel history metadata and one without. We used sequences from Victoria, Australia, in this case study because of two properties. First, a high proportion of cases were sequenced throughout 2020 within Victoria and the rest of Australia. Second, individual travel history was collected from returning travellers in Victoria during the first wave (January to May) of the coronavirus disease 2019 (COVID-19) pandemic. We found that the implementation of individual travel history was essential for the estimation of SARS-CoV-2 movement via discrete phylogeography models. Without the additional information provided by the travel history metadata, the discrete trait analysis could not be fit to the data due to numerical instability.
We also suggest that during the first wave of the COVID-19 pandemic in Australia, the primary driving force behind the spread of SARS-CoV-2 was viral importation from international locations. This case study demonstrates the necessity of robust genomic datasets supplemented with epidemiological metadata for generating accurate estimates from phylogeographical models in datasets that have significant sampling bias. For future work, we recommend the collection of metadata in conjunction with genomic data. Furthermore, we highlight the risk of applying phylogeographical models to biased datasets without incorporating appropriate metadata, especially when estimates influence public health policy decision making.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeography , COVID-19/epidemiology , Bayes Theorem , Metadata , Pandemics , Victoria
11.
PLoS One ; 18(7): e0286330, 2023.
Article in English | MEDLINE | ID: mdl-37467208

ABSTRACT

Many high-throughput sequencing datasets can be represented as objects with coordinates along a reference genome. Currently, biological investigations often involve a large number of such datasets, for example representing different cell types or epigenetic factors. Drawing overall conclusions from a large collection of results for individual datasets may be challenging and time-consuming. Meaningful interpretation often requires the results to be aggregated according to metadata that represents biological characteristics of interest. In this light, we here propose the hierarchical Genomic Suite HyperBrowser (hGSuite), an open-source extension to the GSuite HyperBrowser platform, which aims to provide a means for extracting key results from an aggregated collection of high-throughput DNA sequencing data. The hGSuite utilizes a metadata-informed data cube to calculate various statistics across the multiple dimensions of the datasets. With this work, we show that the hGSuite and its associated data cube methodology offer a quick and accessible way for exploratory analysis of large genomic datasets. The web-based toolkit named hGSuite HyperBrowser is available at https://hyperbrowser.uio.no/hgsuite under a GPLv3 license.


Subjects
Metadata , Software , Genomics/methods , Genome , Internet
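
The metadata-informed aggregation described in the abstract above can be illustrated with a minimal sketch. It is not the hGSuite implementation. The function name `aggregate_by_metadata`, the metadata fields, and the sample values are hypothetical, chosen only to show how per-dataset results might be grouped along one or more metadata dimensions.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_metadata(records, dimensions, value_key):
    """Group per-dataset results by the chosen metadata dimensions and average them."""
    cells = defaultdict(list)
    for record in records:
        # One cell of the cube per combination of metadata values.
        key = tuple(record[d] for d in dimensions)
        cells[key].append(record[value_key])
    return {key: mean(values) for key, values in cells.items()}

# Hypothetical per-dataset results, for illustration only.
records = [
    {"cell_type": "T cell", "mark": "H3K4me3", "overlap": 0.40},
    {"cell_type": "T cell", "mark": "H3K27ac", "overlap": 0.20},
    {"cell_type": "B cell", "mark": "H3K4me3", "overlap": 0.50},
    {"cell_type": "B cell", "mark": "H3K4me3", "overlap": 0.30},
]

by_cell_and_mark = aggregate_by_metadata(records, ["cell_type", "mark"], "overlap")
by_cell = aggregate_by_metadata(records, ["cell_type"], "overlap")
```

Passing fewer dimensions collapses the cube along the omitted ones, which is the kind of roll-up a data cube is typically used for.
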
12.
BMC Bioinformatics ; 24(1): 299, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37482620

ABSTRACT

BACKGROUND: An updated version of the mwtab Python package for programmatic access to the Metabolomics Workbench (MetabolomicsWB) data repository was released at the beginning of 2021. Along with updating the package to match the changes to MetabolomicsWB's 'mwTab' file format specification and enhancing the package's functionality, the included validation facilities were used to detect and catalog file inconsistencies and errors across all publicly available datasets in MetabolomicsWB. RESULTS: The MetabolomicsWB File Status website was developed to provide continuous validation of MetabolomicsWB data files and a useful interface to all inconsistencies and errors found. The list of detectable issues and errors includes format parsing errors, format compliance issues, access problems via MetabolomicsWB's REST interface, and other small inconsistencies that can hinder reusability. The website uses the mwtab Python package to pull down and validate each available analysis file and then generates an HTML report. The website is updated on a weekly basis. Moreover, the Python website design utilizes GitHub and GitHub.io, providing an easy-to-replicate template for implementing other metadata, virtual, and meta- repositories. CONCLUSIONS: The MetabolomicsWB File Status website provides a metadata repository of validation metadata to promote the FAIR use of existing metabolomics datasets from the MetabolomicsWB data repository.


Subjects
Metadata , Software , Metabolomics , Information Storage and Retrieval
13.
Microbiol Spectr ; 11(4): e0101023, 2023 08 17.
Article in English | MEDLINE | ID: mdl-37458594

ABSTRACT

Staphylococcus aureus is an opportunistic pathogen and a leading cause of morbidity and mortality worldwide. Genomic-based surveillance has greatly improved our ability to track the emergence and spread of high-risk clones, but the full potential of genomic data is only reached when used in conjunction with detailed metadata. Here, we demonstrate the utility of an integrated approach by leveraging a curated collection of clinical and epidemiological metadata of S. aureus in the San Matteo Hospital (Italy) through a semisupervised clustering strategy. We sequenced 226 sepsis S. aureus samples, recovered over a period of 9 years. By using existing antibiotic profiling data, we selected strains that capture the full diversity of the population. Genome analysis revealed 49 sequence types, 16 of which are novel. Comparative genomic analyses of hospital- and community-acquired infection ruled out the existence of genomic features differentiating them, while evolutionary analyses of genes and traits of interest highlighted different dynamics of acquisition and loss between antibiotic resistance and virulence genes. Finally, highly resistant clones belonging to clonal complexes (CC) 8 and 22 were found to be responsible for abundant infections and deaths, while the highly virulent CC30 was responsible for rare but deadly episodes of infections. IMPORTANCE Genome sequencing is an important tool in clinical microbiology, as it allows in-depth characterization of isolates of interest and can propel genome-based surveillance studies. Such studies can benefit from ad hoc methods of sample selection to capture the genomic diversity present in a data set. Here, we present an approach based on clustering of antibiotic resistance profiles that allows optimal sample selection for bacterial genomic surveillance. We apply the method to a 9-year collection of Staphylococcus aureus from a large hospital in northern Italy. 
Our method allows us to sequence the genomes of a large variety of strains of this important pathogen, which we then leverage to characterize the epidemiology in the hospital and to perform evolutionary analyses on genes and traits of interest. These analyses highlight different dynamics of acquisition and loss between antibiotic resistance and virulence genes.


Subjects
Methicillin-Resistant Staphylococcus aureus , Staphylococcal Infections , Humans , Staphylococcus aureus , Metadata , Staphylococcal Infections/microbiology , Bacterial Genome , Anti-Bacterial Agents/pharmacology , Hospitals , Methicillin-Resistant Staphylococcus aureus/genetics , Microbial Sensitivity Tests
14.
Sci Rep ; 13(1): 12076, 2023 07 26.
Article in English | MEDLINE | ID: mdl-37495578

ABSTRACT

Glaucoma is an acquired optic neuropathy, which can lead to irreversible vision loss. Deep learning (DL), especially convolutional neural networks (CNNs), has achieved considerable success in the field of medical image recognition due to the availability of large-scale annotated datasets and CNNs. However, obtaining fully annotated datasets like ImageNet in the medical field is still a challenge. Meanwhile, single-modal approaches remain both unreliable and inaccurate due to the diversity of glaucoma disease types and the complexity of symptoms. In this paper, a new multimodal dataset for glaucoma is constructed and a new multimodal neural network for glaucoma diagnosis and classification (GMNNnet) is proposed, aiming to address both of these issues. Specifically, the dataset includes the five most important types of glaucoma labels, electronic medical records and four kinds of high-resolution medical images. The structure of GMNNnet consists of three branches. Branch 1, consisting of convolutional, cyclic and transposition layers, processes patient metadata; branch 2 uses Unet to extract features from glaucoma segmentation based on domain knowledge; and branch 3 uses ResFormer to directly process glaucoma medical images. Branch 1 and branch 2 are mixed together and then processed by the Catboost classifier. We introduce a gradient-weighted class activation mapping (Grad-CAM) method to increase the interpretability of the model and a transfer learning method for the case of insufficient training data, i.e., fine-tuning CNN models pre-trained on natural image datasets for medical image tasks. The results show that GMNNnet can better present the high-dimensional information of glaucoma and achieves excellent performance under multimodal data.


Subjects
Glaucoma , Metadata , Humans , Computer Neural Networks , Computer-Assisted Image Interpretation/methods , Glaucoma/diagnostic imaging , Machine Learning
15.
Eur J Radiol ; 166: 110964, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37453274

ABSTRACT

PURPOSE: The ever-increasing volume of medical imaging data and interest in Big Data research brings challenges to data organization, categorization, and retrieval. Although the radiological value chain is almost entirely digital, data structuring has been widely performed pragmatically, but with insufficient naming and metadata standards for the stringent needs of image analysis. To enable automated data management independent of naming and metadata, this study focused on developing a convolutional neural network (CNN) that classifies medical images based solely on voxel data. METHOD: A 3D CNN (3D-ResNet18) was trained using a dataset of 31,602 prostate MRI volumes with 10 different sequence types of 1243 patients. A five-fold cross-validation approach with patient-based splits was chosen for training and testing. Training was repeated with a gradual reduction in training data assessing classification accuracies to determine the minimum training data required for sufficient performance. The trained model and developed method were tested on three external datasets. RESULTS: The model achieved an overall accuracy of 99.88 % ± 0.13 % in classifying typical prostate MRI sequence types. When being trained with approximately 10 % of the original cohort (112 patients), the CNN still achieved an accuracy of 97.43 % ± 2.10 %. In external testing the model achieved sensitivities of > 90 % for 10/15 tested sequence types. CONCLUSIONS: The herein developed CNN enabled automatic and reliable sequence identification in prostate MRI. Ultimately, such CNN models for voxel-based sequence identification could substantially enhance the management of medical imaging data, improve workflow efficiency and data quality, and allow for robust clinical AI workflows.


Subjects
Metadata , Prostate , Male , Humans , Magnetic Resonance Imaging , Computer Neural Networks , Computer-Assisted Image Processing/methods
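
The abstract above mentions five-fold cross-validation with patient-based splits. A minimal sketch of that idea follows, assuming only that each MRI volume is linked to one patient. It is not the authors' code: the function name `patient_based_folds` and the identifiers are hypothetical.

```python
import random
from collections import defaultdict

def patient_based_folds(volume_to_patient, n_folds=5, seed=0):
    """Assign volumes to folds so that all volumes of a patient share one fold."""
    patients = sorted(set(volume_to_patient.values()))
    random.Random(seed).shuffle(patients)
    # Deal the shuffled patients round-robin into the folds.
    patient_fold = {p: i % n_folds for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for volume, patient in volume_to_patient.items():
        folds[patient_fold[patient]].append(volume)
    return [folds[i] for i in range(n_folds)]

# Hypothetical data: 10 patients with 3 volumes each.
volume_to_patient = {
    f"vol{p}_{s}": f"patient{p}" for p in range(10) for s in range(3)
}
folds = patient_based_folds(volume_to_patient, n_folds=5, seed=0)
```

Splitting by patient rather than by volume keeps images of the same person out of both the training and test sets, which would otherwise inflate the measured accuracy.
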
16.
Sci Data ; 10(1): 389, 2023 Jun 16.
Article in English | MEDLINE | ID: mdl-37328607

ABSTRACT

We present a draft Minimum Information About Geospatial Information System (MIAGIS) standard for facilitating public deposition of geospatial information system (GIS) datasets that follows the FAIR (Findable, Accessible, Interoperable and Reusable) principles. The draft MIAGIS standard includes a deposition directory structure and a minimum JavaScript Object Notation (JSON) formatted metadata file that is designed to capture critical metadata describing GIS layers and maps as well as their sources of data and methods of generation. The associated miagis Python package facilitates the creation of this MIAGIS metadata file and directly supports metadata extraction from both Esri JSON and GeoJSON GIS data formats, plus options for extraction from user-specified JSON formats. We also demonstrate their use in crafting two example depositions of ArcGIS-generated maps. We hope this draft MIAGIS standard, along with the supporting miagis Python package, will assist in establishing a GIS standards group that will develop the draft into a full standard for the wider GIS community as well as a future public repository for GIS datasets.


Subjects
Information Systems , Metadata
17.
Stud Health Technol Inform ; 305: 24-27, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386948

ABSTRACT

Although data quality is well defined, its relationship to data quantity remains unclear. In particular, the big data approach promises advantages of volume over small samples of good quality. The aim of this study was to review this issue. Based on experience with six registries within a German funding initiative, the definition of data quality provided by the International Organization for Standardization (ISO) was confronted with several aspects of data quantity. The results of a literature search combining both concepts were also considered. Data quantity was identified as an umbrella for some inherent characteristics of data, such as case and data completeness. At the same time, quantity could be regarded as a non-inherent characteristic of data beyond the ISO standard, focusing on the breadth and depth of metadata, i.e. data elements along with their value sets. The FAIR Guiding Principles take into account only the latter. Surprisingly, the literature agreed in demanding an increase in data quality with volume, turning the big data approach inside out. The use of data without context, as could be the case in data mining or machine learning, is covered neither by the concept of data quality nor by that of data quantity.


Subjects
Big Data , Data Accuracy , Data Mining , Machine Learning , Metadata
18.
Stud Health Technol Inform ; 305: 620-623, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37387108

ABSTRACT

Learning Health Systems (LHS) and integrated care are challenged by a fragmented health data landscape. An information model is agnostic to the underlying data structures and can potentially contribute to mitigating some of the gaps. In a research project, Valkyrie, we are exploring how metadata can be organized and used to promote service coordination and interoperability across levels of care. An information model is viewed as central in this context and as support for a future integrated LHS. We examined the literature regarding property requirements for data, information and knowledge models in the context of semantic interoperability and an LHS. The requirements were elicited and synthesized into five guiding principles as a vocabulary to inform the information model design of Valkyrie. Further research on requirements and guiding principles for information model design and evaluation is welcome.


Subjects
Learning Health System , Knowledge , Metadata , Research Design
19.
Bioinformatics ; 39(39 Suppl 1): i168-i176, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387172

ABSTRACT

The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality control process is population stratification: identifying the presence of genetic difference in individuals due to subpopulations. One of the common methods used to group genomes of individuals based on ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework which utilizes PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our proposed client-server-based scheme, we initially let the server train a global PCA model on a publicly available genomic dataset which contains individuals from multiple populations. The global PCA model is later used to reduce the dimensionality of the local data by each collaborator (client). After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (in the form of their local PCA outputs) about their research datasets to the server, which then aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.


Subjects
Genomics , Privacy , Humans , Chromosome Mapping , Metadata , Principal Component Analysis
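
The abstract above describes collaborators adding noise to their local PCA outputs to achieve local differential privacy before sending them to the server. The abstract does not say which noise mechanism is used, so the sketch below assumes the standard Laplace mechanism applied to clipped coordinates. The function name `privatize_projection` and the `clip` parameter are hypothetical.

```python
import random

def privatize_projection(coords, epsilon, clip=1.0, rng=None):
    """Clip each PCA coordinate to [-clip, clip], then add Laplace noise.

    Assumption: after clipping, any two inputs differ by at most
    2 * clip * d in L1 norm, so the Laplace scale is 2 * clip * d / epsilon.
    """
    if rng is None:
        rng = random.Random()
    scale = 2.0 * clip * len(coords) / epsilon
    noisy = []
    for x in coords:
        clipped = max(-clip, min(clip, x))
        # The difference of two exponential draws is Laplace-distributed.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(clipped + noise)
    return noisy

# Hypothetical two-dimensional projection of one individual.
report = privatize_projection([0.3, -2.0], epsilon=1.0, rng=random.Random(42))
```

A smaller `epsilon` gives stronger privacy but noisier coordinates, so the achievable stratification accuracy depends on that trade-off.
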
20.
Bioinformatics ; 39(6)2023 Jun 01.
Article in English | MEDLINE | ID: mdl-37354497

ABSTRACT

SUMMARY: Biological data repositories are an invaluable source of publicly available research evidence. Unfortunately, the lack of convergence of the scientific community on a common metadata annotation strategy has resulted in large amounts of data with low FAIRness (Findable, Accessible, Interoperable and Reusable). The possibility of generating high-quality insights from their integration relies on data curation, which is typically an error-prone process while also being expensive in terms of time and human labour. Here, we present ESPERANTO, an innovative framework that enables a standardized semi-supervised harmonization and integration of toxicogenomics metadata and increases their FAIRness in a Good Laboratory Practice-compliant fashion. The harmonization across metadata is guaranteed with the definition of an ad hoc vocabulary. The tool interface is designed to support the user in metadata harmonization in a user-friendly manner, regardless of the background and the type of expertise. AVAILABILITY AND IMPLEMENTATION: ESPERANTO and its user manual are freely available for academic purposes at https://github.com/fhaive/esperanto. The input and the results showcased in Supplementary File S1 are available at the same link.


Subjects
Metadata , Software , Humans , Toxicogenetics , Language , Data Curation