1.
Proc Natl Acad Sci U S A ; 120(43): e2206981120, 2023 Oct 24.
Article in English | MEDLINE | ID: mdl-37831745

ABSTRACT

In January 2023, a new NIH policy on data sharing went into effect. The policy applies to both quantitative and qualitative research (QR) data such as data from interviews or focus groups. QR data are often sensitive and difficult to deidentify, and thus have rarely been shared in the United States. Over the past 5 y, our research team has engaged stakeholders on QR data sharing, developed software to support data deidentification, produced guidance, and collaborated with the ICPSR data repository to pilot the deposit of 30 QR datasets. In this perspective article, we share important lessons learned by addressing eight clusters of questions on issues such as where, when, and what to share; how to deidentify data and support high-quality secondary use; budgeting for data sharing; and the permissions needed to share data. We also offer a brief assessment of the state of preparedness of data repositories, QR journals, and QR textbooks to support data sharing. While QR data sharing could yield important benefits to the research community, we quickly need to develop enforceable standards, expertise, and resources to support responsible QR data sharing. Absent these resources, we risk violating participant confidentiality and wasting a significant amount of time and funding on data that are not useful for either secondary use or data transparency and verification.

2.
Hum Brain Mapp ; 45(9): e26721, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38899549

ABSTRACT

With the rise of open data, identifiability of individuals based on 3D renderings obtained from routine structural magnetic resonance imaging (MRI) scans of the head has become a growing privacy concern. To protect subject privacy, several algorithms have been developed to de-identify imaging data using blurring, defacing or refacing. Completely removing facial structures provides the best re-identification protection but can significantly impact post-processing steps, like brain morphometry. As an alternative, refacing methods that replace individual facial structures with generic templates have a lower effect on the geometry and intensity distribution of original scans, and are able to provide more consistent post-processing results at the price of higher re-identification risk and computational complexity. In the current study, we propose a novel method for anonymized face generation for defaced 3D T1-weighted scans based on a 3D conditional generative adversarial network. To evaluate the performance of the proposed de-identification tool, a comparative study was conducted between several existing defacing and refacing tools, with two different segmentation algorithms (FAST and Morphobox). The aim was to evaluate (i) impact on brain morphometry reproducibility, (ii) re-identification risk, (iii) balance between (i) and (ii), and (iv) the processing time. The proposed method takes 9 s for face generation and is suitable for recovering consistent post-processing results after defacing.


Subject(s)
Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Adult , Brain/diagnostic imaging , Brain/anatomy & histology , Male , Female , Neural Networks, Computer , Imaging, Three-Dimensional/methods , Neuroimaging/methods , Neuroimaging/standards , Data Anonymization , Young Adult , Image Processing, Computer-Assisted/methods , Image Processing, Computer-Assisted/standards , Algorithms
3.
MAGMA ; 2024 Jun 21.
Article in English | MEDLINE | ID: mdl-38904745

ABSTRACT

RATIONALE AND OBJECTIVES: Defacing research MRI brain scans is often a mandatory step. With current defacing software, there are issues with Windows compatibility and researcher doubt regarding the adequacy of preservation of brain voxels in non-T1w scans. To address this, we developed PyFaceWipe, a multiplatform software for multiple MRI contrasts, which was evaluated based on its anonymisation ability and effect on downstream processing. MATERIALS AND METHODS: Multiple MRI brain scan contrasts from the OASIS-3 dataset were defaced with PyFaceWipe and PyDeface and manually assessed for brain voxel preservation, remnant facial features and effect on automated face detection. Original and PyFaceWipe-defaced data from locally acquired T1w structural scans underwent volumetry with FastSurfer and brain atlas generation with ANTS. RESULTS: 214 MRI scans of several contrasts from OASIS-3 were successfully processed with both PyFaceWipe and PyDeface. PyFaceWipe maintained complete brain voxel preservation in all tested contrasts except ASL (45%) and DWI (90%), and PyDeface in all tested contrasts except ASL (95%), BOLD (25%), DWI (40%) and T2* (25%). Manual review of PyFaceWipe showed no failures of facial feature removal. Pinna removal was less successful (6% of T1 scans showed residual complete pinna). PyDeface achieved 5.1% failure rate. Automated detection found no faces in PyFaceWipe-defaced scans, 19 faces in PyDeface scans compared with 78 from the 224 original scans. Brain atlas generation showed no significant difference between atlases created from original and defaced data in both young adulthood and late elderly cohorts. Structural volumetry dice scores were ≥ 0.98 for all structures except for grey matter which had 0.93. PyFaceWipe output was identical across the tested operating systems. 
CONCLUSION: PyFaceWipe is a promising multiplatform defacing tool, demonstrating excellent brain voxel preservation and competitive defacing in multiple MRI contrasts, performing favourably against PyDeface. ASL, BOLD, DWI and T2* scans did not produce recognisable 3D renders and hence should not require defacing. Structural volumetry dice scores (≥ 0.98) were higher than previously published FreeSurfer results, except for grey matter which were comparable. The effect is measurable and care should be exercised during studies. ANTS atlas creation showed no significant effect from PyFaceWipe defacing.
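
The Dice scores quoted in this abstract measure the overlap between a structure segmented on the original scan and the same structure segmented on the defaced scan. A minimal sketch of that overlap measure follows, assuming each segmentation is available as a flat sequence of integer labels; the function name `dice_score` and the toy label values are hypothetical and are not taken from PyFaceWipe or FastSurfer.

```python
def dice_score(labels_a, labels_b, structure):
    """Dice overlap for one structure label between two label sequences.

    Dice = 2 * |A and B| / (|A| + |B|), where A and B are the sets of
    voxels carrying `structure` in each segmentation.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("segmentations must cover the same voxels")
    in_a = [v == structure for v in labels_a]
    in_b = [v == structure for v in labels_b]
    overlap = sum(a and b for a, b in zip(in_a, in_b))
    total = sum(in_a) + sum(in_b)
    # Convention (an assumption here): a structure absent from both is a perfect match.
    return 2 * overlap / total if total else 1.0


# Toy example: label 1 and label 2 each disagree on one voxel.
original = [0, 1, 1, 1, 2, 2, 0, 0]
defaced = [0, 1, 1, 2, 2, 2, 0, 0]
print(dice_score(original, defaced, 1))
print(dice_score(original, defaced, 2))
```

A score of 1.0 means the two segmentations agree exactly on that structure, so values such as 0.93 for grey matter indicate a small but measurable change after defacing.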

4.
J Med Internet Res ; 26: e55676, 2024 May 28.
Article in English | MEDLINE | ID: mdl-38805692

ABSTRACT

BACKGROUND: Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable. OBJECTIVE: This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align. METHODS: As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework with all settings needed to run all pairs is available via Codeberg and GitHub. RESULTS: From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories. 
CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool. CONCLUSIONS: NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement.
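
The precision, recall, and F2-score figures in this abstract all derive from counts of true positives, false positives, and false negatives. The sketch below shows the standard F-beta formula, assuming span-level counts are already available; `precision_recall_fbeta` is a hypothetical helper and is not part of the framework described above.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from raw match counts.

    beta > 1 weights recall more heavily than precision; beta = 2
    gives the F2-score mentioned in the abstract.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Made-up counts for a recall-oriented system: every PII span is found,
# but half of the flagged spans are false alarms.
p, r, f1 = precision_recall_fbeta(tp=50, fp=50, fn=0, beta=1.0)
_, _, f2 = precision_recall_fbeta(tp=50, fp=50, fn=0, beta=2.0)
print(p, r, round(f1, 4), round(f2, 4))
```

Because F2 rewards recall, a system that prioritizes recall over precision scores higher on F2 than on F1, which is why the choice of metric matters when comparing deidentification systems.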


Subject(s)
Natural Language Processing
5.
BMC Med Inform Decis Mak ; 24(1): 147, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816848

ABSTRACT

BACKGROUND: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset's utility. However, finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility. METHODS: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two. RESULTS: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. CONCLUSIONS: As the importance of health data analysis increases, so does the need for effective privacy protection methods. 
While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.


Subject(s)
Confidentiality , Data Anonymization , Humans , Confidentiality/standards , Emergency Service, Hospital , Length of Stay , Republic of Korea , Male
6.
BMC Med Inform Decis Mak ; 24(1): 162, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915012

ABSTRACT

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.


Subject(s)
Natural Language Processing , Humans , Privacy , Sweden , Anonyms and Pseudonyms , Computer Security/standards , Confidentiality/standards , Electronic Health Records/standards
7.
BMC Med Inform Decis Mak ; 24(1): 54, 2024 Feb 16.
Article in English | MEDLINE | ID: mdl-38365677

ABSTRACT

BACKGROUND: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the de-identification pipeline to other clinical centers. METHODS: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. RESULTS: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. 
The model outperformed the other tested models with a significant F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for deidentifying sensitive information. CONCLUSIONS: This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.


Subject(s)
Deep Learning , Humans , Data Anonymization , Electronic Health Records , Cost-Benefit Analysis , Confidentiality , Natural Language Processing
8.
Sensors (Basel) ; 24(10)2024 May 16.
Article in English | MEDLINE | ID: mdl-38794019

ABSTRACT

Differential privacy has emerged as a practical technique for privacy-preserving deep learning. However, recent studies on privacy attacks have demonstrated vulnerabilities in the existing differential privacy implementations for deep models. While encryption-based methods offer robust security, their computational overheads are often prohibitive. To address these challenges, we propose a novel differential privacy-based image generation method. Our approach employs two distinct noise types: one makes the image unrecognizable to humans, preserving privacy during transmission, while the other maintains features essential for machine learning analysis. This allows the deep learning service to provide accurate results, without compromising data privacy. We demonstrate the feasibility of our method on the CIFAR100 dataset, which offers a realistic complexity for evaluation.

9.
Neuroimage ; 276: 120199, 2023 08 01.
Article in English | MEDLINE | ID: mdl-37269958

ABSTRACT

It is now widely known that research brain MRI, CT, and PET images may potentially be re-identified using face recognition, and this potential can be reduced by applying face-deidentification ("de-facing") software. However, for research MRI sequences beyond T1-weighted (T1-w) and T2-FLAIR structural images, the potential for re-identification and quantitative effects of de-facing are both unknown, and the effects of de-facing T2-FLAIR are also unknown. In this work we examine these questions (where applicable) for T1-w, T2-w, T2*-w, T2-FLAIR, diffusion MRI (dMRI), functional MRI (fMRI), and arterial spin labelling (ASL) sequences. Among current-generation, vendor-product research-grade sequences, we found that 3D T1-w, T2-w, and T2-FLAIR were highly re-identifiable (96-98%). 2D T2-FLAIR and 3D multi-echo GRE (ME-GRE) were also moderately re-identifiable (44-45%), and our derived T2* from ME-GRE (comparable to a typical 2D T2*) matched at only 10%. Finally, diffusion, functional and ASL images were each minimally re-identifiable (0-8%). Applying de-facing with mri_reface version 0.3 reduced successful re-identification to ≤8%, while differential effects on popular quantitative pipelines for cortical volumes and thickness, white matter hyperintensities (WMH), and quantitative susceptibility mapping (QSM) measurements were all either comparable with or smaller than scan-rescan estimates. Consequently, high-quality de-facing software can greatly reduce the risk of re-identification for identifiable MRI sequences with only negligible effects on automated intracranial measurements. 
The current-generation echo-planar and spiral sequences (dMRI, fMRI, and ASL) each had minimal match rates, suggesting that they have a low risk of re-identification and can be shared without de-facing, but this conclusion should be re-evaluated if they are acquired without fat suppression, with a full-face scan coverage, or if newer developments reduce the current levels of artifacts and distortion around the face.


Subject(s)
Diffusion Magnetic Resonance Imaging , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Diffusion Magnetic Resonance Imaging/methods , Neuroimaging , Artifacts , Spin Labels
10.
BMC Med Inform Decis Mak ; 23(1): 85, 2023 05 05.
Article in English | MEDLINE | ID: mdl-37147600

ABSTRACT

BACKGROUND: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier. METHODS: We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage. RESULTS: The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). 
Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language. CONCLUSIONS: Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.
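
The abstract says that log odds are turned into a linkage decision through a consideration threshold θ and a leader advantage threshold δ, but it does not spell out the rule. The sketch below shows one plausible reading, assuming the best candidate must reach θ and must also lead the runner-up by at least δ. The function `declare_match`, the candidate IDs, and the numeric thresholds are all hypothetical and do not reproduce the authors' software or its defaults.

```python
def declare_match(log_odds_by_candidate, theta, delta):
    """Return the winning candidate ID, or None if no match is declared.

    Assumed rule (not confirmed by the abstract):
      1. the best candidate's log odds must be at least theta, and
      2. it must exceed the runner-up's log odds by at least delta.
    """
    ranked = sorted(
        log_odds_by_candidate.items(), key=lambda item: item[1], reverse=True
    )
    if not ranked:
        return None
    best_id, best_log_odds = ranked[0]
    if best_log_odds < theta:
        return None  # nobody is plausible enough to consider
    if len(ranked) > 1 and best_log_odds - ranked[1][1] < delta:
        return None  # leader is not far enough ahead: ambiguous
    return best_id


# Hypothetical thresholds and scores, for illustration only.
print(declare_match({"rec_a": 8.0, "rec_b": 1.0}, theta=5.0, delta=3.0))
print(declare_match({"rec_a": 8.0, "rec_b": 7.0}, theta=5.0, delta=3.0))
```

Raising either threshold trades linkage failures for fewer misidentifications, which is the trade-off the reported defaults are said to weight 20-fold in favour of avoiding misidentification.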


Subject(s)
Medical Record Linkage , Privacy , Humans , Male , Bayes Theorem , State Medicine , Software
11.
Article in German | MEDLINE | ID: mdl-36648500

ABSTRACT

Merging sensitive data and tracing their analysis results back to the data subjects is an essential part of data processing in the health sector. This challenges the protection of the data and thus its very purpose, the protection of the data subjects, since the scientific and health findings are often based on certain characteristics in the datasets, which should be preserved in their property as personal in order to make the results of the data analysis fruitful. The EU General Data Protection Regulation (GDPR) establishes a risk-based approach that determines both the identifiability of data and the proportionality of their processing. This paper analyses how the risk-based approach opens the scope of the GDPR and relates it to the risks for the rights and freedoms of data subjects posed by the processing of personal data. Furthermore, the question is explored to what extent the risk-based approach of the GDPR influences the rules for international data transfer and how international data processing in the health sector is currently organised on its basis. Overall, the present analysis sheds light on how the technical measures of data processing and the organisational measures for handling them can contribute to maintaining the proportionality of data processing under the GDPR, which can essentially be determined on a risk-based basis, while at the same time taking into account the specificity of data processing in the health sector.


Subject(s)
Computer Security , Freedom , Humans , European Union , Germany
12.
Entropy (Basel) ; 25(2)2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36832640

ABSTRACT

Privacy protection data processing has been critical in recent years when pervasively equipped mobile devices could easily capture high-resolution personal images and videos that may disclose personal information. We propose a new controllable and reversible privacy protection system to address the concern in this work. The proposed scheme can automatically and stably anonymize and de-anonymize face images with one neural network and provide strong security protection with multi-factor identification solutions. Furthermore, users can include other attributes as identification factors, such as passwords and specific facial attributes. Our solution lies in a modified conditional-GAN-based training framework, the Multi-factor Modifier (MfM), to simultaneously accomplish the function of multi-factor facial anonymization and de-anonymization. It can successfully anonymize face images while generating realistic faces satisfying the conditions specified by the multi-factor features, such as gender, hair colors, and facial appearance. Furthermore, MfM can also de-anonymize de-identified faces to their corresponding original ones. One crucial part of our work is the design of physically meaningful information-theory-based loss functions, which include mutual information between authentic and de-identification images and mutual information between original and re-identification images. Moreover, extensive experiments and analyses show that, with the correct multi-factor feature information, the MfM can effectively achieve nearly perfect reconstruction and generate high-fidelity and diverse anonymized faces to defend attacks from hackers better than other methods with compatible functionalities. Finally, we justify the advantages of this work through perceptual quality comparison experiments. 
Our experiments show that the resulting LPIPS (with a value of 0.35), FID (with a value of 28), and SSIM (with a value of 0.95) of MfM demonstrate significantly better de-identification effects than state-of-the-art works. Additionally, the MfM we designed can achieve re-identification, which improves real-world practicability.

13.
Cas Lek Cesk ; 162(2-3): 61-66, 2023.
Article in English | MEDLINE | ID: mdl-37474288

ABSTRACT

Healthcare data held by state-run organisations is a valuable intangible asset for society. Its use should be a priority for its administrators and the state. A completely paternalistic approach by administrators and the state is undesirable, however much it aims to protect the privacy rights of persons registered in databases. In line with European policies and the global trend, these measures should not outweigh the social benefit that arises from the analysis of these data if the technical possibilities exist to sufficiently protect the privacy rights of individuals. Czech society is having an intense discussion on the topic, but according to the authors, it is insufficiently based on facts and lacks clearly articulated opinions of the expert public. The aim of this article is to fill these gaps. Data anonymization techniques provide a solution to protect individuals' privacy rights while preserving the scientific value of the data. The risk of identifying individuals in anonymised data sets is scalable and can be minimised depending on the type and content of the data and its use by the specific applicant. Finding the optimal form and scope of deidentified data requires competence and knowledge on the part of both the applicant and the administrator. It is in the interest of the applicant, the administrator, as well as the protected persons in the databases that both parties show willingness and have the ability and expertise to communicate during the application and its processing.


Subject(s)
Confidentiality , Data Anonymization , Humans , Privacy
14.
Neuroimage ; 258: 119357, 2022 09.
Article in English | MEDLINE | ID: mdl-35660089

ABSTRACT

It is well known that de-identified research brain images from MRI and CT can potentially be re-identified using face recognition; however, this has not been examined for PET images. We generated face reconstruction images of 182 volunteers using amyloid, tau, and FDG PET scans, and we measured how accurately commercial face recognition software (Microsoft Azure's Face API) automatically matched them with the individual participants' face photographs. We then compared this accuracy with the same experiments using participants' CT and MRI. Face reconstructions from PET images from PET/CT scanners were correctly matched at rates of 42% (FDG), 35% (tau), and 32% (amyloid), while CT were matched at 78% and MRI at 97-98%. We propose that these recognition rates are high enough that research studies should consider using face de-identification ("de-facing") software on PET images, in addition to CT and structural MRI, before data sharing. We also updated our mri_reface de-identification software with extended functionality to replace face imagery in PET and CT images. Rates of face recognition on de-faced images were reduced to 0-4% for PET, 5% for CT, and 8% for MRI. We measured the effects of de-facing on regional amyloid PET measurements from two different measurement pipelines (PETSurfer/FreeSurfer 6.0, and one in-house method based on SPM12 and ANTs), and these effects were small: ICC values between de-faced and original images were > 0.98, biases were <2%, and median relative errors were < 2%. Effects on global amyloid PET SUVR measurements were even smaller: ICC values were 1.00, biases were <0.5%, and median relative errors were also <0.5%.


Subject(s)
Facial Recognition , Positron Emission Tomography Computed Tomography , Amyloid , Brain/diagnostic imaging , Fluorodeoxyglucose F18 , Humans , Magnetic Resonance Imaging/methods , Positron-Emission Tomography/methods
15.
J Biomed Inform ; 135: 104215, 2022 11.
Article in English | MEDLINE | ID: mdl-36195240

ABSTRACT

Electronic Medical Records (EMRs) contain clinical narrative text that is of great potential value to medical researchers. However, this information is mixed with Personally Identifiable Information (PII) that presents risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework to automatically remove PII from Australian hospital discharge summaries. Our corpus included 600 hospital discharge summaries which were extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: (1) Annotation: labelling of PII in the 600 hospital discharge summaries using five pre-defined categories: person, address, date of birth, individual identification number, phone/fax number; (2) Modelling: training six named entity recognition (NER) deep learning base-models on balanced and imbalanced datasets; and evaluating ensembles that combine all six base-models, the three base-models with the best F1 scores and the three base-models with the best recall scores respectively, using token-level majority voting and stacking methods; and (3) De-identification: removing PII from the hospital discharge summaries. Our results showed that the ensemble model combined using the stacking Support Vector Machine (SVM) method on the three base-models with the best F1 scores achieved excellent results with an F1 score of 99.16% on the test set of our corpus. We also evaluated the robustness of our modelling component on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses the token-level majority voting method on all six base-models, achieved the highest F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64% at binary token-level matching compared to two state-of-the-art methods. The end-to-end framework provides a robust solution to de-identifying clinical narrative corpuses safely. 
It can easily be applied to any kind of clinical narrative documents.
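
Token-level majority voting, as named in the abstract above, can be illustrated with a minimal sketch. This is not the authors' implementation: the three toy "models", the `PERSON`/`O` label names, the tie-breaking rule, and the `redact` helper are all assumptions made for illustration (the study itself combines up to six NER base-models).

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several models by majority vote.

    predictions: list of label sequences, one per model, all the same length.
    Ties go to the label that appears first in model order (an assumption).
    """
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same tokens")
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(lab for lab in labels if counts[lab] == top))
    return combined


def redact(tokens, labels, outside="O"):
    """Replace every token whose label is not `outside` with a [LABEL] tag."""
    return " ".join(
        tok if lab == outside else f"[{lab}]" for tok, lab in zip(tokens, labels)
    )


# Hypothetical outputs of three base-models for one sentence.
tokens = ["Mr", "Smith", "was", "admitted"]
model_outputs = [
    ["O", "PERSON", "O", "O"],
    ["PERSON", "PERSON", "O", "O"],
    ["O", "PERSON", "O", "O"],
]
voted = majority_vote(model_outputs)
print(redact(tokens, voted))
```

Stacking, the other combination method mentioned, would instead feed the base-model outputs into a second-level classifier such as an SVM; that is not shown here.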


Subject(s)
Deep Learning , Patient Discharge , Humans , Australia , Electronic Health Records , Hospitals , Natural Language Processing
16.
BMC Med Ethics ; 23(1): 65, 2022 Jun 25.
Article in English | MEDLINE | ID: mdl-35752778

ABSTRACT

BACKGROUND: Sharing anonymized/de-identified clinical trial data and publishing research outcomes in scientific journals, or presenting them at conferences, is key to data-driven scientific exchange. However, when data from scientific publications are linked to other publicly available personal information, the risk of reidentification of trial participants increases, raising privacy concerns. Therefore, we defined a set of criteria allowing us to determine and minimize the risk of data reidentification. We also implemented a review process at Takeda for clinical publications prior to submission for publication in journals or presentation at medical conferences. METHODS: Abstracts, manuscripts, posters, and oral presentations containing study participant information were reviewed and the potential impact on study participant privacy was assessed. Our focus was on direct identifiers (participant ID, initials) and indirect identifiers, such as sex, age or geographical indicators in rare disease studies or studies with small sample size treatment groups. Risk minimization was sought using a generalized presentation of identifier-relevant information and decision-making on data sharing for further research. Additional risk identification was performed based on study participant/personnel parameters present in materials destined for the public domain. The potential for participant/personnel identification was then calculated to facilitate presentation of meaningful but de-identified information. RESULTS: The potential for reidentification was calculated using a risk ratio of the exposed versus available individuals, with a value above the threshold of 0.09 deemed an unacceptable level of reidentification risk. We found that in 13% of Takeda clinical trial publications reviewed, either individuals could potentially be reidentified (despite the use of anonymized data sets) or inappropriate data sharing plans could pose a data privacy risk to study participants. In 1/110 abstracts, 58/275 manuscripts, 5/87 posters and 3/58 presentations, changes were necessary due to data privacy concerns/rules. Despite the implementation of risk-minimization measures prior to release, direct and indirect identifiers were found in 11% and 34% of the analysed documents, respectively. CONCLUSIONS: Risk minimization using de-identification of clinical trial data presented in scientific publications and controlled data sharing conditions improved privacy protection for study participants. Our results also suggest that additional safeguards should be implemented to ensure that higher data privacy standards are met.
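
The threshold check described in the RESULTS can be sketched as follows. The abstract gives only the 0.09 threshold and the phrase "exposed versus available individuals"; reading that as a simple quotient, and the function and parameter names used here, are assumptions for illustration, not the authors' documented formula.

```python
RISK_THRESHOLD = 0.09  # threshold stated in the abstract


def reidentification_risk(exposed, available):
    """Ratio of exposed to available individuals (assumed interpretation)."""
    if available <= 0:
        raise ValueError("available must be positive")
    return exposed / available


def is_acceptable(exposed, available, threshold=RISK_THRESHOLD):
    """True when the ratio does not exceed the threshold."""
    return reidentification_risk(exposed, available) <= threshold


# Hypothetical example: a characteristic shared by 20 people,
# of whom 1 or 2 are described in a publication.
print(is_acceptable(1, 20))  # ratio 0.05
print(is_acceptable(2, 20))  # ratio 0.10
```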


Subject(s)
Computer Security , Privacy , Humans , Information Dissemination , Pharmaceutical Preparations
17.
J Korean Med Sci ; 37(26): e205, 2022 Jul 04.
Article in English | MEDLINE | ID: mdl-35790207

ABSTRACT

BACKGROUND: The advancement of information technology has immensely increased the quality and volume of health data. This has led to an increase in observational studies, as well as to the threat of privacy invasion. Recently, a distributed research network based on the common data model (CDM) has emerged, enabling collaborative international medical research without sharing patient-level data. Although the CDM database for each institution is built inside a firewall, the risk of re-identification requires management. Hence, this study aims to elucidate the perceptions CDM users have towards CDM and risk management for re-identification. METHODS: The survey, designed to answer specific in-depth questions on CDM, was conducted from October to November 2020. We targeted well-experienced researchers who actively use CDM. Basic statistics (total number and percent) were computed for all covariates. RESULTS: There were 33 valid respondents. Of these, 43.8% suggested that additional anonymization was unnecessary beyond the "minimum cell count" policy, which obscures any cell with a value lower than a certain number (usually 5) in shared results to minimize the likelihood of re-identification due to rare conditions. During extract-transform-load processes, 81.8% of respondents assumed that the re-identification risk of structured data is under control. However, respondents noted that dates of birth and death were highly re-identifiable information. The majority of respondents (n = 22, 66.7%) conceded the possibility that the NOTE table contains unstructured data with identifiers. CONCLUSION: Overall, CDM users generally attributed high reliability for privacy protection to the intrinsic nature of CDM. There was little demand for additional de-identification methods. However, unstructured data in the CDM were suspected to carry risks. Respondents urged the creation of a coordinating consortium to define and manage the re-identification risk of CDM.
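
The "minimum cell count" policy mentioned in the RESULTS can be illustrated with a short sketch. The function name, the dictionary layout, and the `<5` mask string are assumptions; the sketch follows the abstract's wording literally and masks every count below the minimum, including zero, which individual networks may handle differently.

```python
def suppress_small_cells(counts, minimum=5):
    """Return a copy of `counts` with every value below `minimum` masked.

    counts: mapping from a cell name to an integer count.
    Masked cells are replaced by the string "<minimum", e.g. "<5".
    """
    mask = f"<{minimum}"
    return {
        cell: (value if value >= minimum else mask)
        for cell, value in counts.items()
    }


# Hypothetical aggregate result shared across sites.
shared = suppress_small_cells({"condition_A": 120, "condition_B": 3, "condition_C": 5})
print(shared)
```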


Subject(s)
Biomedical Research , Cross-Sectional Studies , Databases, Factual , Humans , Reproducibility of Results
18.
Sensors (Basel) ; 22(7)2022 Mar 28.
Article in English | MEDLINE | ID: mdl-35408203

ABSTRACT

A problem with biometric information is that it is more sensitive to external leakage than general authentication methods, because it cannot be changed immediately once exposed. For facial information, a case has been reported in which authentication succeeded using a face produced by a 3D printer. Therefore, a method for minimizing the leakage of biometric information is required. In this paper, metadata and face information are de-identified in stages, so that different levels of identifying information are provided according to the authority of the user. For face information and metadata, the level of de-identification is determined and applied according to the risk level of the de-identified subject. We then propose a mechanism that minimizes leakage paths by classifying access rights to de-identified data into four roles, preventing indiscriminate data access. The proposed mechanism provides only differentially de-identified data according to the authority of the accessor, and the time required to de-identify one image was, on average, 3.6 ms for 300 datapoints, 3.5 ms for 500 datapoints, and 3.47 ms for 1000 datapoints. This indicates that the per-image execution time decreased slightly as the size of the dataset increased. The results for the metadata were similar: 4.3 ms for 300 cases, 3.78 ms for 500 cases, and 3.5 ms for 1000 cases.


Subject(s)
Biometric Identification , Data Anonymization , Biometric Identification/methods , Biometry , Delivery of Health Care , Face/anatomy & histology
19.
Sensors (Basel) ; 22(3)2022 Jan 20.
Article in English | MEDLINE | ID: mdl-35161511

ABSTRACT

Wrist-worn devices equipped with accelerometers constitute a non-intrusive way to achieve active and assisted living (AAL) goals, such as automatic journaling for self-reflection (i.e., lifelogging), and to provide other services, such as general health and wellbeing monitoring and personal autonomy assessment, among others. Human action recognition (HAR), and in particular the recognition of activities of daily living (ADLs), can be used for these types of assessment or journaling. In this paper, a many-objective evolutionary algorithm (MaOEA) is used to maximise action recognition from individuals while concealing (minimising recognition of) gender and age. To validate the proposed method, the PAAL accelerometer signal ADL dataset (v2.0) is used, which includes data from 52 participants (26 men and 26 women) and 24 activity class labels. The results show a drop in gender and age recognition to 58% (from 89%, a 31-percentage-point drop) and to 39% (from 83%, a 44-point drop), respectively, while action recognition stays closer to its initial value, falling to 68% (from 87%, a 19-point drop).
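
The drops quoted in the abstract above are differences in percentage points, not relative reductions. A small worked calculation makes the distinction explicit; the helper name `drops` is hypothetical, and the relative figures are derived here from the abstract's numbers rather than reported by the authors.

```python
def drops(before, after):
    """Return (percentage-point drop, relative drop in percent)."""
    points = before - after
    relative = 100 * points / before
    return points, relative


# Recognition accuracies (in percent) before and after, as given in the abstract.
for task, before, after in [("gender", 89, 58), ("age", 83, 39), ("action", 87, 68)]:
    points, relative = drops(before, after)
    print(f"{task}: {points} points, {relative:.1f}% relative")
```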


Subject(s)
Activities of Daily Living , Pattern Recognition, Automated , Accelerometry , Algorithms , Biological Evolution , Female , Humans , Male , Privacy
20.
J Digit Imaging ; 35(6): 1694-1698, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35715655

ABSTRACT

Natural language processing (NLP) techniques for electronic health records have shown great potential to improve the quality of medical care. The text of radiology reports frequently constitutes a large fraction of EHR data, and can provide valuable information about patients' diagnoses, medical history, and imaging findings. The lack of a major public repository for radiological reports severely limits the development, testing, and application of new NLP tools. De-identification of protected health information (PHI) presents a major challenge to building such repositories, as many automated tools for de-identification were trained or designed for clinical notes and do not perform sufficiently well to build a public database of radiology reports. We developed and evaluated six ensemble models based on three publicly available de-identification tools: MIT de-id, NeuroNER, and Philter. A set of 1023 reports was set aside as the testing partition. Two individuals with medical training annotated the test set for PHI; differences were resolved by consensus. Ensemble methods included simple voting schemes (1-Vote, 2-Votes, and 3-Votes), a decision tree, a naïve Bayesian classifier, and AdaBoost. The 1-Vote ensemble achieved recall of 998 / 1043 (95.7%); the 3-Votes ensemble had precision of 1035 / 1043 (99.2%). F1 scores were: 93.4% for the decision tree, 71.2% for the naïve Bayesian classifier, and 87.5% for the boosting method. Basic voting algorithms and machine learning classifiers incorporating the predictions of multiple tools can outperform each tool acting alone in de-identifying radiology reports. Ensemble methods hold substantial potential to improve automated de-identification tools for radiology reports to make such reports more available for research use to improve patient care and outcomes.
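
The simple voting schemes named in the abstract above (1-Vote, 2-Votes, 3-Votes) can be sketched as a k-of-n rule over the per-token PHI flags of several tools. This is an illustration only: the toy flag vectors and function names are assumptions, and the sketch does not call MIT de-id, NeuroNER, or Philter.

```python
def k_vote(tool_flags, k):
    """Flag a token as PHI when at least k tools flag it.

    tool_flags: list of boolean sequences, one per tool, all the same length.
    """
    return [sum(votes) >= k for votes in zip(*tool_flags)]


def precision_recall(predicted, gold):
    """Token-level precision and recall of `predicted` against `gold`."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-token flags from three tools, plus the annotated truth.
gold = [True, True, False, False, True]
tools = [
    [True, False, True, False, True],
    [True, True, False, False, False],
    [True, False, False, False, True],
]
for k in (1, 2, 3):
    p, r = precision_recall(k_vote(tools, k), gold)
    print(f"{k}-vote: precision={p:.2f} recall={r:.2f}")
```

In this toy data, lowering k favours recall and raising it favours precision, which matches the pattern the abstract reports for the 1-Vote and 3-Votes ensembles.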


Subject(s)
Natural Language Processing , Radiology , Humans , Bayes Theorem , Electronic Health Records , Machine Learning