Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 16 de 16
Filtrar
1.
IEEE J Biomed Health Inform ; 26(5): 2351-2359, 2022 05.
Artículo en Inglés | MEDLINE | ID: mdl-34797768

RESUMEN

A Relational-Sequential dataset (or RS-dataset for short) contains records comprised of a patient's values in demographic attributes and their sequence of diagnosis codes. The task of clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure we design to quantify the pairwise similarity between records of an RS-dataset. Our measure uses a tree structure that encodes hierarchical relationships between records, based on their demographics, as well as an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs k clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters, which preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.


Asunto(s)
Algoritmos , Registros Electrónicos de Salud , Análisis por Conglomerados , Demografía , Humanos , Semántica
2.
J Biomed Inform ; 102: 103360, 2020 02.
Artículo en Inglés | MEDLINE | ID: mdl-31904428

RESUMEN

Clustering data derived from Electronic Health Record (EHR) systems is important to discover relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks, such as classification. However, the heterogeneity of these data makes the application of existing clustering methods difficult and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprised of over 26,000 and 52,000 records, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.


Asunto(s)
Registros Electrónicos de Salud , Análisis por Conglomerados , Demografía , Humanos
3.
J Biomed Inform ; 65: 76-96, 2017 01.
Artículo en Inglés | MEDLINE | ID: mdl-27832965

RESUMEN

Publishing data about patients that contain both demographics and diagnosis codes is essential to perform large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss, incurred by anonymization, to preserve the utility of general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are not suitable for being used in this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce the requirement (i) by applying (k,km)-anonymity, a privacy principle that prevents re-identification from attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture the requirement (ii), we propose the concept of utility constraint for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii), by employing well-established information loss measures for demographics and for diagnosis codes. To realize our approach, we develop an algorithm that enforces (k,km)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and with minimal information loss, according to the measures. Our experiments with a large dataset containing more than 200,000 electronic health records show the effectiveness and efficiency of our algorithm.


Asunto(s)
Algoritmos , Confidencialidad , Grupos Diagnósticos Relacionados , Registros Electrónicos de Salud , Demografía , Humanos
5.
J Biomed Inform ; 50: 4-19, 2014 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-24936746

RESUMEN

The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients' privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats, while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data, in a privacy-preserving way. We review more than 45 algorithms, derive insights on their operation, and highlight their advantages and disadvantages. We also provide a discussion of some promising directions for future research in this area.


Asunto(s)
Algoritmos , Registros Electrónicos de Salud , Privacidad , Revelación de la Verdad
6.
J Biomed Inform ; 50: 46-61, 2014 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-24879898

RESUMEN

The dissemination of Electronic Health Record (EHR) data, beyond the originating healthcare institutions, can enable large-scale, low-cost medical studies that have the potential to improve public health. Thus, funding bodies, such as the National Institutes of Health (NIH) in the U.S., encourage or require the dissemination of EHR data, and a growing number of innovative medical investigations are being performed using such data. However, simply disseminating EHR data, after removing identifying information, may risk privacy, as patients can still be linked with their record, based on diagnosis codes. This paper proposes the first approach that prevents this type of data linkage using disassociation, an operation that transforms records by splitting them into carefully selected subsets. Our approach preserves privacy with significantly lower data utility loss than existing methods and does not require data owners to specify diagnosis codes that may lead to identity disclosure, as these methods do. Consequently, it can be employed when data need to be shared broadly and be used in studies, beyond the intended ones. Through extensive experiments using EHR data, we demonstrate that our method can construct data that are highly useful for supporting various types of clinical case count studies and general medical analysis tasks.


Asunto(s)
Registros Electrónicos de Salud , Privacidad , Humanos , Estados Unidos
7.
8.
PLoS One ; 8(2): e53875, 2013.
Artículo en Inglés | MEDLINE | ID: mdl-23405076

RESUMEN

Health information technologies facilitate the collection of massive quantities of patient-level data. A growing body of research demonstrates that such information can support novel, large-scale biomedical investigations at a fraction of the cost of traditional prospective studies. While healthcare organizations are being encouraged to share these data in a de-identified form, there is hesitation over concerns that it will allow corresponding patients to be re-identified. Currently proposed technologies to anonymize clinical data may make unrealistic assumptions with respect to the capabilities of a recipient to ascertain a patients identity. We show that more pragmatic assumptions enable the design of anonymization algorithms that permit the dissemination of detailed clinical profiles with provable guarantees of protection. We demonstrate this strategy with a dataset of over one million medical records and show that 192 genotype-phenotype associations can be discovered with fidelity equivalent to non-anonymized clinical data.


Asunto(s)
Estudios de Asociación Genética/métodos , Privacidad Genética/normas , Genoma Humano , Estudio de Asociación del Genoma Completo/métodos , Genómica/métodos , Algoritmos , Bases de Datos Factuales , Humanos , Sistemas de Registros Médicos Computarizados
9.
Expert Syst Appl ; 39(10): 9764-9777, 2012 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-22563145

RESUMEN

Transaction data record various information about individuals, including their purchases and diagnoses, and are increasingly published to support large-scale and low-cost studies in domains such as marketing and medicine. However, the dissemination of transaction data may lead to privacy breaches, as it allows an attacker to link an individual's record to their identity. Approaches that anonymize data by eliminating certain values in an individual's record or by replacing them with more general values have been proposed recently, but they often produce data of limited usefulness. This is because these approaches adopt value transformation strategies that do not guarantee data utility in intended applications and objective measures that may lead to excessive data distortion. In this paper, we propose a novel approach for anonymizing data in a way that satisfies data publishers' utility requirements and incurs low information loss. To achieve this, we introduce an accurate information loss measure and an effective anonymization algorithm that explores a large part of the problem space. An extensive experimental study, using click-stream and medical data, demonstrates that our approach permits many times more accurate query answering than the state-of-the-art methods, while it is comparable to them in terms of efficiency.

10.
IEEE Trans Inf Technol Biomed ; 16(3): 413-23, 2012 May.
Artículo en Inglés | MEDLINE | ID: mdl-22287248

RESUMEN

Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information from the primary care domain. At the same time, longitudinal data from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are hesitant to do so because they fear such actions will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited for reidentification purposes. Various approaches have been developed to anonymize clinical data, but they neglect temporal information and are, thus, insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to share patient-specific longitudinal data that offers robust privacy guarantees, while preserving data utility for many biomedical investigations. Our approach aggregates temporal and diagnostic information using heuristics inspired from sequence alignment and clustering methods. We demonstrate that the proposed approach can generate anonymized data that permit effective biomedical analysis using several patient cohorts derived from the EMR system of the Vanderbilt University Medical Center.


Asunto(s)
Confidencialidad , Registros Electrónicos de Salud , Algoritmos , Análisis por Conglomerados , Estudios de Cohortes , Sistemas de Administración de Bases de Datos , Humanos
11.
Hum Genet ; 130(3): 383-92, 2011 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-21739176

RESUMEN

The collection and sharing of person-specific biospecimens has raised significant questions regarding privacy. In particular, the question of identifiability, or the degree to which materials stored in biobanks can be linked to the name of the individuals from which they were derived, is under scrutiny. The goal of this paper is to review the extent to which biospecimens and affiliated data can be designated as identifiable. To achieve this goal, we summarize recent research in identifiability assessment for DNA sequence data, as well as associated demographic and clinical data, shared via biobanks. We demonstrate the variability of the degree of risk, the factors that contribute to this variation, and potential ways to mitigate and manage such risk. Finally, we discuss the policy implications of these findings, particularly as they pertain to biobank security and access policies. We situate our review in the context of real data sharing scenarios and biorepositories.


Asunto(s)
Bancos de Muestras Biológicas , Confidencialidad , Bancos de Muestras Biológicas/normas , Privacidad Genética , Guías como Asunto , Difusión de la Información , Opinión Pública , Riesgo , Gestión de Riesgos , Pesos y Medidas
12.
J Am Med Inform Assoc ; 17(3): 322-7, 2010.
Artículo en Inglés | MEDLINE | ID: mdl-20442151

RESUMEN

OBJECTIVE: De-identified clinical data in standardized form (eg, diagnosis codes), derived from electronic medical records, are increasingly combined with research data (eg, DNA sequences) and disseminated to enable scientific investigations. This study examines whether released data can be linked with identified clinical records that are accessible via various resources to jeopardize patients' anonymity, and the ability of popular privacy protection methodologies to prevent such an attack. DESIGN: The study experimentally evaluates the re-identification risk of a de-identified sample of Vanderbilt's patient records involved in a genome-wide association study. It also measures the level of protection from re-identification, and data utility, provided by suppression and generalization. MEASUREMENT: Privacy protection is quantified using the probability of re-identifying a patient in a larger population through diagnosis codes. Data utility is measured at a dataset level, using the percentage of retained information, as well as its description, and at a patient level, using two metrics based on the difference between the distribution of Internal Classification of Disease (ICD) version 9 codes before and after applying privacy protection. RESULTS: More than 96% of 2800 patients' records are shown to be uniquely identified by their diagnosis codes with respect to a population of 1.2 million patients. Generalization is shown to reduce further the percentage of de-identified records by less than 2%, and over 99% of the three-digit ICD-9 codes need to be suppressed to prevent re-identification. CONCLUSIONS: Popular privacy protection methods are inadequate to deliver a sufficiently protected and useful result when sharing data derived from complex clinical systems. The development of alternative privacy protection models is thus required.


Asunto(s)
Confidencialidad , Registros Electrónicos de Salud , Clasificación Internacional de Enfermedades , Registro Médico Coordinado , Sujetos de Investigación , Anciano , Femenino , Humanos , Masculino , Persona de Mediana Edad , Tennessee , Estados Unidos
13.
Proc Natl Acad Sci U S A ; 107(17): 7898-903, 2010 Apr 27.
Artículo en Inglés | MEDLINE | ID: mdl-20385806

RESUMEN

Genome-wide association studies (GWAS) facilitate the discovery of genotype-phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients' standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data "as is" may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients' identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt's University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks.


Asunto(s)
Algoritmos , Pruebas Anónimas/métodos , Recolección de Datos/métodos , Registros Electrónicos de Salud , Privacidad Genética , Estudio de Asociación del Genoma Completo/métodos , Medicina de Precisión/métodos
14.
AMIA Annu Symp Proc ; 2010: 782-6, 2010 Nov 13.
Artículo en Inglés | MEDLINE | ID: mdl-21347085

RESUMEN

Patient-specific data from electronic medical records (EMRs) is increasingly shared in a de-identified form to support research. However, EMRs are susceptible to noise, error, and variation, which can limit their utility for reuse. One way to enhance the utility of EMRs is to record the number of times diagnosis codes are assigned to a patient when this data is shared. This is, however, challenging because releasing such data may be leveraged to compromise patients' identity. In this paper, we present an approach that, to the best of our knowledge, is the first that can prevent re-identification through repeated diagnosis codes. Our method transforms records to preserve privacy while retaining much of their utility. Experiments conducted using 2676 patients from the EMR system of the Vanderbilt University Medical Center verify that our method is able to retain an average of 95.4% of the diagnosis codes in a common data sharing scenario.


Asunto(s)
Registros Electrónicos de Salud , Privacidad , Codificación Clínica , Confidencialidad , Humanos
15.
IHI ; 2010: 163-172, 2010.
Artículo en Inglés | MEDLINE | ID: mdl-25525631

RESUMEN

Regulations in various countries permit the reuse of health information without patient authorization provided the data is "de-identified". In the United States, for instance, the Privacy Rule of the Health Insurance Portability and Accountability Act defines two distinct approaches to achieve de-identification; the first is Safe Harbor, which requires the removal of a list of identifiers and the second is Expert Determination, which requires that an expert certify the re-identification risk inherent in the data is sufficiently low. In reality, most healthcare organizations eschew the expert route because there are no standardized approaches and Safe Harbor is much simpler to interpret. This, however, precludes a wide range of worthwhile endeavors that are dependent on features suppressed by Safe Harbor, such as gerontological studies requiring detailed ages over 89. In response, we propose a novel approach to automatically discover alternative de-identification policies that contain no more re-identification risk than Safe Harbor. We model this task as a lattice-search problem, introduce a measure to capture the re-identification risk, and develop an algorithm that efficiently discovers polices by exploring the lattice. Using a cohort of approximately 3000 patient records from the Vanderbilt University Medical Center, as well as the Adult dataset from the UCI Machine Learning Repository, we also experimentally verify that a large number of alternative policies can be discovered in an efficient manner.

16.
ITAB Corfu Greece (2010) ; 2010: 1-6, 2010 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-25580471

RESUMEN

Patient-specific records contained in Electronic Medical Record (EMR) systems are increasingly combined with genomic sequences and deposited into bio-repositories. This allows researchers to perform large-scale, low-cost biomedical studies, such as Genome-Wide Association Studies (GWAS) aimed at identifying associations between genetic factors and complex health-related phenomena, which are an integral facet of personalized medicine. Disseminating this data, however, raises serious privacy concerns because patients' genomic sequences can be linked to their identities through diagnosis codes. This work proposes an approach that guards against this type of data linkage by modifying diagnosis codes in a way that limits the probability of associating a patient's identity to their genomic sequence. Experiments using EMRs from the Vanderbilt University Medical Center verify that our approach generates data that can support up to 29.4% more GWAS than the best-so-far method, while permitting biomedical analysis tasks several orders of magnitude more accurately.

SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA
...