Results 1-20 of 109
1.
Nat Rev Genet ; 23(7): 429-445, 2022 07.
Article in English | MEDLINE | ID: mdl-35246669

ABSTRACT

Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.


Subjects
Genomics, Privacy, Genome
2.
Genome Res ; 33(7): 1113-1123, 2023 07.
Article in English | MEDLINE | ID: mdl-37217251

ABSTRACT

The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the data set and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is in the form of either summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public data sets.
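The likelihood ratio test at the heart of these membership-inference attacks can be sketched in a toy simulation. This is not the paper's optimization-based defense, just the attack primitive it defends against; the beacon size, number of variants, and allele frequencies below are illustrative assumptions:

```python
import math
import random

random.seed(7)

N = 50     # individuals in the beacon
M = 400    # variants the attacker queries
freqs = [random.uniform(0.001, 0.02) for _ in range(M)]  # rare alt-allele frequencies

def genome():
    # Diploid: an individual carries the alt allele with prob 1 - (1 - f)^2.
    return [random.random() < 1 - (1 - f) ** 2 for f in freqs]

beacon_db = [genome() for _ in range(N)]
answers = [any(g[i] for g in beacon_db) for i in range(M)]  # Beacon yes/no per variant

def lr_statistic(target):
    # Log-likelihood ratio of "target in beacon" vs "target not in beacon",
    # summed over variants the target carries. A member's carried variants are
    # always answered "yes", so a "no" is crushing evidence of non-membership.
    stat = 0.0
    for i in range(M):
        if not target[i]:
            continue
        p_yes_out = 1 - (1 - freqs[i]) ** (2 * N)  # P(yes | target not in DB)
        stat += math.log(1.0 / p_yes_out) if answers[i] else math.log(1e-9)
    return stat

member, outsiders = beacon_db[0], [genome() for _ in range(30)]
print(lr_statistic(member))                          # positive: looks like a member
print(sum(lr_statistic(g) for g in outsiders) / 30)  # negative on average
```

Suppressing variants or flipping responses, as the paper proposes, works precisely by shrinking the gap between these two score distributions.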


Assuntos
Disseminação de Informação , Privacidade , Humanos , Disseminação de Informação/métodos , Genômica , Frequência do Gene , Alelos
3.
J Biomed Inform ; 153: 104640, 2024 May.
Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.


Assuntos
Inteligência Artificial , Medicina Baseada em Evidências , Humanos , Confiança , Processamento de Linguagem Natural
5.
Clin Infect Dis ; 74(4): 584-590, 2022 03 01.
Article in English | MEDLINE | ID: mdl-34128970

ABSTRACT

BACKGROUND: With limited severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) testing capacity in the United States at the start of the epidemic (January-March 2020), testing was focused on symptomatic patients with a travel history throughout February, obscuring the picture of SARS-CoV-2 seeding and community transmission. We sought to identify individuals with SARS-CoV-2 antibodies in the early weeks of the US epidemic. METHODS: All of Us study participants in all 50 US states provided blood specimens during study visits from 2 January to 18 March 2020. Participants were considered seropositive if they tested positive for SARS-CoV-2 immunoglobulin G (IgG) antibodies with the Abbott Architect SARS-CoV-2 IgG enzyme-linked immunosorbent assay (ELISA) and the EUROIMMUN SARS-CoV-2 ELISA in a sequential testing algorithm. The sensitivity and specificity of these ELISAs and the net sensitivity and specificity of the sequential testing algorithm were estimated, along with 95% confidence intervals (CIs). RESULTS: The estimated sensitivities of the Abbott and EUROIMMUN assays were 100% (107 of 107 [95% CI: 96.6%-100%]) and 90.7% (97 of 107 [83.5%-95.4%]), respectively, and the estimated specificities were 99.5% (995 of 1000 [98.8%-99.8%]) and 99.7% (997 of 1000 [99.1%-99.9%]), respectively. The net sensitivity and specificity of our sequential testing algorithm were 90.7% (97 of 107 [95% CI: 83.5%-95.4%]) and 100.0% (1000 of 1000 [99.6%-100%]), respectively. Of the 24 079 study participants with blood specimens from 2 January to 18 March 2020, 9 were seropositive, 7 before the first confirmed case in the states of Illinois, Massachusetts, Wisconsin, Pennsylvania, and Mississippi. CONCLUSIONS: Our findings identified SARS-CoV-2 infections weeks before the first recognized cases in 5 US states.
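The net operating characteristics of a "both assays must be positive" sequential algorithm follow directly from the per-assay estimates, assuming the two assays err independently; a quick check against the reported figures:

```python
# Per-assay estimates quoted in the abstract
abbott_sens, abbott_spec = 1.000, 0.995
euro_sens, euro_spec = 0.907, 0.997

# A specimen is called positive only if BOTH assays are positive, so
# sensitivities multiply, while false-positive rates multiply (a false
# positive requires both assays to err on the same truly negative sample).
net_sens = abbott_sens * euro_sens
net_spec = 1 - (1 - abbott_spec) * (1 - euro_spec)

print(round(net_sens, 3), round(net_spec, 4))  # matches the reported 90.7% and ~100.0%
```

The sequential design trades the weaker assay's sensitivity for a near-perfect combined specificity, which is the right trade when prevalence is very low.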


Assuntos
COVID-19 , Saúde da População , Anticorpos Antivirais , COVID-19/diagnóstico , Ensaio de Imunoadsorção Enzimática , Humanos , Imunoglobulina G , SARS-CoV-2 , Sensibilidade e Especificidade
6.
J Biomed Inform ; 125: 103977, 2022 01.
Article in English | MEDLINE | ID: mdl-34920126

ABSTRACT

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data on which they are based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
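The paper's attack uses contrastive representation learning; a far simpler distance-to-closest-synthetic-record attack conveys the underlying intuition. Everything below is simulated, and the "generator" that copies training records with small noise is a deliberately leaky stand-in for a partially synthetic pipeline:

```python
import random

random.seed(1)

def make_record():
    return [random.random() for _ in range(8)]

train = [make_record() for _ in range(100)]     # records the "generator" was fit on
holdout = [make_record() for _ in range(100)]   # records it never saw

# Leaky partially-synthetic generator: perturb real training records slightly.
synthetic = [[x + random.gauss(0, 0.01) for x in r] for r in train]

def min_dist(target, synth):
    # Attack score: Euclidean distance from the target to its closest
    # synthetic record. A very small distance suggests training membership.
    return min(sum((a - b) ** 2 for a, b in zip(target, r)) ** 0.5 for r in synth)

threshold = 0.1
hits_members = sum(min_dist(r, synthetic) < threshold for r in train)
hits_nonmembers = sum(min_dist(r, synthetic) < threshold for r in holdout)
print(hits_members, hits_nonmembers)  # members flagged far more often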


Assuntos
Saúde da População , Confidencialidade , Revelação , Genômica , Humanos , Aprendizado de Máquina
7.
J Med Internet Res ; 23(3): e22806, 2021 03 04.
Article in English | MEDLINE | ID: mdl-33661128

ABSTRACT

BACKGROUND: Documentation burden is a common problem with modern electronic health record (EHR) systems. To reduce this burden, various recording methods (eg, voice recorders or motion sensors) have been proposed. However, these solutions are in an early prototype phase and are unlikely to transition into practice in the near future. A more pragmatic alternative is to directly modify the implementation of the existing functionalities of an EHR system. OBJECTIVE: This study aims to assess the nature of free-text comments entered into EHR flowsheets that supplement quantitative vital sign values and examine opportunities to simplify functionality and reduce documentation burden. METHODS: We evaluated 209,055 vital sign comments in flowsheets that were generated in the Epic EHR system at the Vanderbilt University Medical Center in 2018. We applied topic modeling, as well as the natural language processing Clinical Language Annotation, Modeling, and Processing software system, to extract generally discussed topics and detailed medical terms (expressed as probability distributions) to investigate the stories communicated in these comments. RESULTS: Our analysis showed that 63.33% (6053/9557) of the users who entered vital signs made at least one free-text comment in vital sign flowsheet entries. The user roles that were most likely to compose comments were registered nurse, technician, and licensed nurse. The most frequently identified topics were the notification of a result to health care providers (0.347), the context of a measurement (0.307), and an inability to obtain a vital sign (0.224). There were 4187 unique medical terms that were extracted from 46,029 comments (22.0%), including many symptom-related terms such as "pain," "upset," "dizziness," "coughing," "anxiety," "distress," and "fever" and drug-related terms such as "tylenol," "anesthesia," "cannula," "oxygen," "motrin," "rituxan," and "labetalol."
CONCLUSIONS: Considering that flowsheet comments are generally not displayed or automatically pulled into any clinical notes, our findings suggest that the flowsheet comment functionality can be simplified (eg, via structured response fields instead of a text input dialog) to reduce health care provider effort. Moreover, rich and clinically important medical terms such as medications and symptoms should be explicitly recorded in clinical notes for better visibility.
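The study's actual pipeline (topic modeling plus the CLAMP NLP system) is heavyweight; the flavor of the three dominant comment topics it found can be illustrated with a deliberately naive keyword triage. The categories echo the abstract, but the keyword lists and example comments are invented for the demo:

```python
# Illustrative triage of flowsheet comments into the study's three dominant
# topics; keyword lists are invented, not derived from the paper's model.
TOPIC_KEYWORDS = {
    "notify_provider": {"notified", "paged", "md", "aware", "reported"},
    "measurement_context": {"sleeping", "ambulating", "post", "sitting", "crying"},
    "unable_to_obtain": {"unable", "refused", "declined", "missing"},
}

def triage(comment):
    words = {w.strip(".,") for w in comment.lower().split()}
    scores = {t: len(words & kw) for t, kw in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

comments = [
    "MD notified of elevated BP",
    "patient sleeping, axillary temp",
    "unable to obtain, patient refused",
]
print([triage(c) for c in comments])
```

A structured response field, as the conclusion suggests, would replace this kind of after-the-fact text mining with clean categorical data at entry time.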


Assuntos
Documentação , Registros Eletrônicos de Saúde , Centros Médicos Acadêmicos , Humanos , Processamento de Linguagem Natural , Sinais Vitais
8.
BMC Med Inform Decis Mak ; 21(1): 353, 2021 12 18.
Article in English | MEDLINE | ID: mdl-34922536

ABSTRACT

BACKGROUND: Information retrieval (IR) helps clinicians answer questions posed to large collections of electronic medical records (EMRs), such as how best to identify a patient's cancer stage. One of the more promising approaches to IR for EMRs is to expand a keyword query with similar terms (e.g., augmenting cancer with mets). However, there is a large range of clinical chart review tasks, such that a fixed set of similar terms is insufficient. Current language models, such as Bidirectional Encoder Representations from Transformers (BERT) embeddings, do not capture the full non-textual context of a task. In this study, we present new methods that provide similar terms dynamically by adjusting with the context of the chart review task. METHODS: We introduce a medical-context vector space in which each word is represented by a vector that captures the word's usage in different medical contexts (e.g., how frequently cancer is used when ordering a prescription versus describing family history) beyond the context learned from the surrounding text. These vectors are transformed into a vector space for customizing the set of similar terms selected for different chart review tasks. We evaluate the vector space model with multiple chart review tasks, in which supervised machine learning models learn to predict the preferred terms of clinically knowledgeable reviewers. To quantify the usefulness of the predicted similar terms relative to a baseline of standard word2vec embeddings, we measure (1) the prediction performance of the medical-context vector space model using the area under the receiver operating characteristic curve (AUROC) and (2) the labeling effort required to train the models. RESULTS: The vector space outperformed the baseline word2vec embeddings in all three chart review tasks with an average AUROC of 0.80 versus 0.66, respectively.
Additionally, the medical-context vector space significantly reduced the number of labels required to learn and predict the preferred similar terms of reviewers. Specifically, the labeling effort was reduced to 10% of the entire dataset in all three tasks. CONCLUSIONS: The set of preferred similar terms that are relevant to a chart review task can be learned by leveraging the medical context of the task.
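The core idea, representing a word by how often it appears in different medical contexts and ranking candidate expansion terms by proximity in that space, can be sketched with toy counts. The context axes and all numbers below are invented for illustration:

```python
import math

# Toy medical-context vectors: counts of each term in
# (prescription orders, family-history notes, problem lists) -- invented.
context_vecs = {
    "cancer":   [2, 40, 30],
    "mets":     [1, 35, 28],
    "melanoma": [3, 30, 25],
    "tylenol":  [50, 1, 2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Rank candidate similar terms for a query by context-usage similarity.
query = "cancer"
ranked = sorted(
    (t for t in context_vecs if t != query),
    key=lambda t: cosine(context_vecs[query], context_vecs[t]),
    reverse=True,
)
print(ranked)  # clinically related terms rank above the medication term
```

Terms that are used in the same kinds of notes as the query ("mets", "melanoma") rank ahead of a term like "tylenol" that lives almost entirely in prescription contexts, even though plain text co-occurrence might not separate them as cleanly.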


Assuntos
Armazenamento e Recuperação da Informação , Processamento de Linguagem Natural , Área Sob a Curva , Registros Eletrônicos de Saúde , Humanos , Aprendizado de Máquina
9.
J Biomed Inform ; 100: 103334, 2019 12.
Article in English | MEDLINE | ID: mdl-31678588

ABSTRACT

OBJECTIVE: Models for predicting preterm birth generally have focused on very preterm (28-32 weeks) and moderate to late preterm (32-37 weeks) settings. However, extreme preterm birth (EPB), before the 28th week of gestational age, accounts for the majority of newborn deaths. We investigated the extent to which deep learning models that consider temporal relations documented in electronic health records (EHRs) can predict EPB. STUDY DESIGN: EHR data were subjected to word embedding and a temporal deep learning model, in the form of recurrent neural networks (RNNs), to predict EPB. Due to the low prevalence of EPB, the models were trained on datasets where controls were undersampled to balance the case-control ratio. We then applied an ensemble approach to group the trained models to predict EPB in an evaluation setting with a natural EPB ratio. We evaluated the RNN ensemble models with 10 years of EHR data from 25,689 deliveries at Vanderbilt University Medical Center. We compared their performance with traditional machine learning models (logistic regression, support vector machine, gradient boosting) trained on datasets with balanced and natural EPB ratios. Risk factors associated with EPB were identified using an adjusted odds ratio. RESULTS: The RNN ensemble models trained on artificially balanced data achieved a higher AUC (0.827 vs. 0.744) and sensitivity (0.965 vs. 0.682) than the RNN models trained on the datasets with the naturally imbalanced EPB ratio. In addition, the AUC (0.827) and sensitivity (0.965) of the RNN ensemble models were better than the AUC (0.777) and sensitivity (0.819) of the best baseline models trained on balanced data. Also, risk factors, including twin pregnancy, short cervical length, hypertensive disorder, systemic lupus erythematosus, and hydroxychloroquine sulfate, were found to be associated with EPB at a significant level. CONCLUSION: Temporal deep learning can predict EPB up to 8 weeks earlier than its occurrence.
Accurate prediction of EPB may allow healthcare organizations to allocate resources effectively and ensure patients receive appropriate care.
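The undersampling-plus-ensemble recipe (train each model on a balanced case/control subset, then pool predictions at the natural prevalence) is generic; a minimal sketch where a trivial threshold classifier stands in for the paper's RNNs, with simulated data:

```python
import random

random.seed(3)

# Simulated 1-D risk feature: cases (label 1) are rare and shifted upward.
cases = [(random.gauss(2, 1), 1) for _ in range(50)]
controls = [(random.gauss(0, 1), 0) for _ in range(950)]

def train_threshold(data):
    # Trivial stand-in model: midpoint of the two class means as a threshold.
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Undersample: split the 950 controls into chunks of 50 to pair with the
# 50 cases, training one balanced model per chunk.
models = [train_threshold(cases + controls[i:i + 50])
          for i in range(0, len(controls), 50)]

def ensemble_predict(x):
    # Average the member votes: the ensemble score is a fraction in [0, 1].
    return sum(x > t for t in models) / len(models)

print(ensemble_predict(2.5), ensemble_predict(-0.5))
```

Each member sees every case but only a slice of the controls, so no control data are discarded overall while every individual model trains on a balanced set.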


Assuntos
Aprendizado Profundo , Registros Eletrônicos de Saúde , Lactente Extremamente Prematuro , Algoritmos , Conjuntos de Dados como Assunto , Humanos , Recém-Nascido , Classificação Internacional de Doenças
10.
Int J Clin Pract ; 73(11): e13393, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31347754

ABSTRACT

BACKGROUND: Hepatorenal syndrome (HRS) is a life-threatening complication of cirrhosis, and early detection of evolving HRS may provide opportunities for early intervention. We developed an HRS risk model to assist early recognition of inpatient HRS. METHODS: We analysed a retrospective cohort of patients hospitalised at 122 medical centres in the US Department of Veterans Affairs between 1 January 2005 and 31 December 2013. We included cirrhotic patients who had acute kidney injury on admission, as defined by the Kidney Disease: Improving Global Outcomes criteria. We developed a logistic regression risk prediction model to detect HRS on admission using 10 variables. We calculated 95% confidence intervals on the model building dataset and, subsequently, calculated performance on a 1000 sample holdout test set. We report model performance with area under the curve (AUC) for discrimination and several calibration measures. RESULTS: The cohort included 19 368 patients comprising 32 047 inpatient admissions. The event rate for hospitalised HRS was 2810/31 047 (9.1%) and 79/1000 (7.9%) in the model building and validation datasets, respectively. The variable selection procedure yielded a parsimonious model involving ten predictor variables. Final model performance in the validation dataset had an AUC of 0.87, Brier score of 0.05, slope of 1.10 and intercept of 0.04. CONCLUSIONS: We developed a probabilistic risk model to diagnose HRS within 24 hours of hospital admission using routine clinical variables in the largest ever published HRS cohort. The performance was excellent, and this model may help identify high-risk patients for HRS and promote early intervention.
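The two headline metrics, discrimination (AUC) and calibration (Brier score), are straightforward to compute from predicted risks and observed outcomes; a self-contained sketch with toy predictions:

```python
def brier(probs, labels):
    # Mean squared difference between predicted risk and observed outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def auc(probs, labels):
    # Probability a random positive outranks a random negative (ties count 1/2).
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(probs, labels), 3), round(brier(probs, labels), 3))
```

The two metrics answer different questions: AUC measures ranking (do cases score higher than non-cases?), while the Brier score also penalizes probabilities that are systematically too high or too low, which is why the abstract reports calibration slope and intercept alongside it.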


Assuntos
Síndrome Hepatorrenal/diagnóstico , Unidades de Terapia Intensiva , Admissão do Paciente/estatística & dados numéricos , Índice de Gravidade de Doença , Injúria Renal Aguda/diagnóstico , Adulto , Área Sob a Curva , Estudos de Coortes , Feminino , Síndrome Hepatorrenal/epidemiologia , Hospitalização/estatística & dados numéricos , Humanos , Cirrose Hepática/diagnóstico , Modelos Logísticos , Masculino , Pessoa de Meia-Idade , Estudos Retrospectivos
11.
J Biomed Inform ; 77: 1-10, 2018 01.
Article in English | MEDLINE | ID: mdl-29174994

ABSTRACT

OBJECTIVE: The traditional fee-for-service approach to healthcare can lead to the management of a patient's conditions in a siloed manner, inducing various negative consequences. It has been recognized that a bundled approach to healthcare - one that manages a collection of health conditions together - may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions should be managed in a bundled manner. In this study, we investigate if a data-driven approach can automatically learn potential bundles. METHODS: We designed a framework to infer health condition collections (HCCs) based on the similarity of their clinical workflows, according to electronic medical record (EMR) utilization. We evaluated the framework with data from over 16,500 inpatient stays from Northwestern Memorial Hospital in Chicago, Illinois. The plausibility of the inferred HCCs for bundled care was assessed through an online survey of a panel of five experts, whose responses were analyzed via an analysis of variance (ANOVA) at a 95% confidence level. We further assessed the face validity of the HCCs using evidence in the published literature. RESULTS: The framework inferred four HCCs, indicative of (1) fetal abnormalities, (2) late pregnancies, (3) prostate problems, and (4) chronic diseases, with congestive heart failure featuring prominently. Each HCC was substantiated with evidence in the literature and was deemed plausible for bundled care by the experts at a statistically significant level. CONCLUSIONS: The findings suggest that an automated, EMR data-driven framework can provide a basis for discovering bundled care opportunities. Still, translating such findings into actual care management will require further refinement, implementation, and evaluation.


Assuntos
Mineração de Dados/métodos , Atenção à Saúde/organização & administração , Registros Eletrônicos de Saúde , Pacotes de Assistência ao Paciente , Comorbidade , Humanos , Aprendizado de Máquina , Informática Médica , Administração dos Cuidados ao Paciente , Fenótipo , Fluxo de Trabalho
12.
J Biomed Inform ; 80: 87-95, 2018 04.
Article in English | MEDLINE | ID: mdl-29530803

ABSTRACT

OBJECTIVE: Hepatorenal Syndrome (HRS) is a devastating form of acute kidney injury (AKI) in advanced liver disease patients with high morbidity and mortality, but phenotyping algorithms have not yet been developed using large electronic health record (EHR) databases. We evaluated and compared multiple phenotyping methods to achieve an accurate algorithm for HRS identification. MATERIALS AND METHODS: A national retrospective cohort of patients with cirrhosis and AKI admitted to 124 Veterans Affairs hospitals was assembled from electronic health record data collected from 2005 to 2013. AKI was defined by the Kidney Disease: Improving Global Outcomes criteria. Five hundred and four hospitalizations were selected for manual chart review and served as the gold standard. EHR-based predictors were identified from structured data and from free-text clinical notes processed with NLP via the clinical Text Analysis and Knowledge Extraction System (cTAKES). We explored several dimension reduction techniques for the NLP data, including newer high-throughput phenotyping and word embedding methods, and ascertained their effectiveness in identifying the phenotype without structured predictor variables. With the combined structured and NLP variables, we analyzed five phenotyping algorithms: penalized logistic regression, naïve Bayes, support vector machines, random forest, and gradient boosting. Calibration and discrimination metrics were calculated using 100 bootstrap iterations. In the final model, we report odds ratios and 95% confidence intervals. RESULTS: The area under the receiver operating characteristic curve (AUC) for the different models ranged from 0.73 to 0.93, with penalized logistic regression having the best discriminatory performance. Calibration for logistic regression was modest, but gradient boosting and support vector machines were superior.
NLP identified 6985 variables; a priori variable selection performed similarly to dimensionality reduction using high-throughput phenotyping and semantic similarity informed clustering (AUC of 0.81 - 0.82). CONCLUSION: This study demonstrated improved phenotyping of a challenging AKI etiology, HRS, over ICD-9 coding. We also compared performance among multiple approaches to EHR-derived phenotyping, and found similar results between methods. Lastly, we showed that automated NLP dimension reduction is viable for acute illness.


Assuntos
Algoritmos , Diagnóstico por Computador/métodos , Síndrome Hepatorrenal/diagnóstico , Fenótipo , Injúria Renal Aguda , Idoso , Registros Eletrônicos de Saúde , Feminino , Síndrome Hepatorrenal/etiologia , Síndrome Hepatorrenal/fisiopatologia , Humanos , Cirrose Hepática/complicações , Masculino , Pessoa de Meia-Idade , Processamento de Linguagem Natural , Razão de Chances , Curva ROC , Estudos Retrospectivos , Máquina de Vetores de Suporte
13.
J Biomed Inform ; 61: 97-109, 2016 06.
Article in English | MEDLINE | ID: mdl-27020263

ABSTRACT

OBJECTIVE: Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as to support translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible to find and detect potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, such an assumption may not hold true and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, which enables an HCO to optimize their investments given the expected capabilities of an adversarial recipient. METHODS: We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize their investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. In addition, we compare policy options where, when an attacker is fined for misuse, a monetary penalty is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator). 
RESULTS: Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), the precision and recall of the HCO's de-identification are 0.86 and 0.80, respectively. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending all their budget on de-identification and still deter the attacker, with a precision of 0.77 and a recall of 0.61 for the de-identification. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to an HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation. CONCLUSIONS: A game-theoretic framework can lead HCOs to optimized decisions about their natural language de-identification investments before sharing EMR data.
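The leader-follower structure can be illustrated with a toy grid search: the HCO (leader) commits to a de-identification spend, the attacker (follower) best-responds by attacking only when profitable, and the HCO keeps the spend that maximizes its payoff given that response. All dollar figures and the success-probability shape below are invented, not the paper's parameters:

```python
# Toy Stackelberg sketch of the de-identification budget game.
BUDGET = 2000
DATA_VALUE = 5000    # HCO's utility from sharing the data
ATTACK_COST = 800    # attacker's cost to mount a re-identification attack
ATTACK_GAIN = 3000   # attacker's gross gain if the attack succeeds
BREACH_LOSS = 4000   # HCO's loss when an attack succeeds

def success_prob(spend):
    # Assumed shape: more de-identification spend -> lower attack success.
    return max(0.0, 1.0 - spend / BUDGET)

def attacker_best_response(spend):
    # Follower attacks iff the expected gain exceeds the cost.
    return success_prob(spend) * ATTACK_GAIN - ATTACK_COST > 0

def hco_payoff(spend):
    attacked = attacker_best_response(spend)
    loss = success_prob(spend) * BREACH_LOSS if attacked else 0.0
    return DATA_VALUE - spend - loss

# Leader commits to the spend that maximizes its payoff given the follower's response.
best_spend = max(range(0, BUDGET + 1, 100), key=hco_payoff)
print(best_spend, hco_payoff(best_spend))
```

Even in this toy, the optimal commitment sits below the full budget: once the spend makes attacking unprofitable, every additional dollar of de-identification is pure utility loss, which echoes the abstract's finding that an HCO can deter the attacker without exhausting its budget.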


Assuntos
Confidencialidade , Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Humanos , Idioma , Risco
14.
Bioinformatics ; 30(23): 3334-41, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25147357

ABSTRACT

MOTIVATION: Sharing genomic data is crucial to support scientific investigation such as genome-wide association studies. However, recent investigations suggest the privacy of the individual participants in these studies can be compromised, leading to serious concerns and consequences, such as overly restricted access to data. RESULTS: We introduce a novel cryptographic strategy to securely perform meta-analysis for genetic association studies in large consortia. Our methodology is useful for supporting joint studies among disparate data sites, where privacy or confidentiality is of concern. We validate our method using three multisite association studies. Our research shows that genetic associations can be analyzed efficiently and accurately across substudy sites, without leaking information on individual participants or site-level association summaries. AVAILABILITY AND IMPLEMENTATION: Our software for secure meta-analysis of genetic association studies, SecureMA, is publicly available at http://github.com/XieConnect/SecureMA. Our customized secure computation framework is also publicly available at http://github.com/XieConnect/CircuitService.
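The quantity being combined across sites is a standard inverse-variance fixed-effect meta-analysis; the plain-text sketch below shows that statistical core only, with the cryptographic layer (the secure computation SecureMA adds so no site reveals its summary) omitted, and with invented per-site numbers:

```python
import math

# Per-site effect estimates (e.g., log odds ratios) and standard errors -- toy values.
site_betas = [0.30, 0.25, 0.40]
site_ses = [0.10, 0.15, 0.12]

# Inverse-variance weighting: more precise sites count for more.
weights = [1 / se ** 2 for se in site_ses]
beta = sum(w * b for w, b in zip(weights, site_betas)) / sum(weights)
se = math.sqrt(1 / sum(weights))
z = beta / se  # combined test statistic

print(round(beta, 4), round(se, 4), round(z, 2))
```

Because the combined estimate depends on the site summaries only through two sums (sum of weights and sum of weighted betas), it is a natural fit for secure aggregation: each site can contribute its terms without disclosing them individually.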


Assuntos
Estudos de Associação Genética/métodos , Privacidade Genética , Metanálise como Assunto , Estudo de Associação Genômica Ampla/métodos , Genômica , Humanos , Hipotireoidismo/genética , Obesidade/genética , Software
15.
ACM Comput Surv ; 48(1)2015 Sep.
Article in English | MEDLINE | ID: mdl-26640318

ABSTRACT

Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.

16.
J Biomed Inform ; 52: 243-50, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25038554

ABSTRACT

OBJECTIVE: Electronic medical record (EMR) data are increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns it may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions. METHODS: We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries using the protected data and the original data through the correlation (r²) of the p-values of association significance. RESULTS: Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized unto themselves. When evaluated using the correlation of genome-phenome association strengths on anonymized versus original data, all nine simulated sites showed that the largest-scale anonymizations (population ∼100,000) retained better utility than those at smaller sizes (population ∼6,000-75,000). We observed a general trend of increasing r² for larger data set sizes: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately-sized datasets, r²=0.9934 for large-sized datasets.
CONCLUSIONS: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.
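The k-anonymization strategy described above can be illustrated with a minimal sketch (a toy stand-in for the study's protection process; the `k_anonymize` helper, the code values, and the suppression-only policy are illustrative assumptions, since real pipelines also generalize codes rather than only suppress them):

```python
from collections import Counter

def k_anonymize(records, k=5):
    """Suppress any combination of clinical codes shared by fewer than
    k patients, so every released combination appears at least k times."""
    counts = Counter(records)
    return [r if counts[r] >= k else ("*",) * len(r) for r in records]

# Toy cohort: each patient is a tuple of standardized clinical codes.
cohort = [("250.0", "401.9")] * 6 + [("250.0", "493.9")] * 2
protected = k_anonymize(cohort, k=5)
# The common combination (6 patients) survives; the rare one (2 patients)
# is suppressed.
```

Utility can then be measured, as in the study, by correlating the p-values of association tests run on the protected records against those run on the originals.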


Assuntos
Pesquisa Biomédica/métodos , Confidencialidade , Bases de Dados Genéticas , Registros Eletrônicos de Saúde , Estudos de Associação Genética/estatística & dados numéricos , Tamanho da Amostra , Algoritmos , Simulação por Computador , Genótipo , Humanos , Fenótipo , Polimorfismo de Nucleotídeo Único
17.
J Biomed Inform ; 52: 199-211, 2014 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-25038555

RESUMO

The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and conciseness of Limestone-derived phenotypes. Our results show that using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) to achieve an area under the receiver operator characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725). 
Moreover, in consultation with a medical expert, we confirmed that 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful.
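The nonnegative tensor factorization at the heart of this approach can be sketched in plain NumPy (a simplified stand-in, not the Limestone implementation: the multiplicative-update rule, fixed rank, and iteration count are illustrative choices):

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J,R),(K,R) -> (J*K,R)."""
    J, R = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def nonneg_cp(X, rank, n_iter=200, eps=1e-9):
    """Rank-R nonnegative CP decomposition of a 3-way count tensor
    (patients x diagnoses x medications) via multiplicative updates.
    Each factor column corresponds to one candidate phenotype."""
    rng = np.random.default_rng(0)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)  # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        KR = khatri_rao(B, C); A *= (X1 @ KR) / (A @ (KR.T @ KR) + eps)
        KR = khatri_rao(A, C); B *= (X2 @ KR) / (B @ (KR.T @ KR) + eps)
        KR = khatri_rao(A, B); C *= (X3 @ KR) / (C @ (KR.T @ KR) + eps)
    return A, B, C
```

Each column r of the diagnosis factor `B` and medication factor `C` then reads as a candidate phenotype: the diagnoses and medications weighted highly in that column tend to co-occur in the patients weighted highly in `A[:, r]`.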


Assuntos
Mineração de Dados/métodos , Registros Eletrônicos de Saúde/classificação , Algoritmos , Bases de Dados Factuais/classificação , Humanos , Fenótipo
18.
JMIR AI ; 3: e52054, 2024 Mar 15.
Artigo em Inglês | MEDLINE | ID: mdl-38875581

RESUMO

BACKGROUND: Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act. OBJECTIVE: We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task). METHODS: Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers. RESULTS: We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). 
The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10^5 comparisons to 1.41 at 6 × 10^6 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10^6 comparisons. In effect, risk was high for a small search space but dropped as the search space grew. We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively. CONCLUSIONS: Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings.
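The adversarial matching procedure can be sketched as follows (a toy model using cosine similarity over generic embedding vectors; the actual study used a state-of-the-art speaker-identification model, and the `thresh` value here is an arbitrary assumption):

```python
import numpy as np

def fa_ta(known, unknown, labels_known, labels_unknown, thresh=0.7):
    """Match each unknown embedding to its nearest known embedding by
    cosine similarity; count true acceptances (correct identity above
    threshold) and false acceptances (wrong identity above threshold)."""
    kn = known / np.linalg.norm(known, axis=1, keepdims=True)
    un = unknown / np.linalg.norm(unknown, axis=1, keepdims=True)
    sims = un @ kn.T  # one row of comparisons per unknown speaker
    fa = ta = 0
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= thresh:
            if labels_known[j] == labels_unknown[i]:
                ta += 1
            else:
                fa += 1
    return fa, ta
```

Enlarging `known` widens each row of `sims`, giving impostors more chances to score above `thresh`; this is the mechanism behind FAs rising with the number of comparisons while TAs stay roughly flat.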

19.
Sci Rep ; 14(1): 16117, 2024 Jul 12.
Artigo em Inglês | MEDLINE | ID: mdl-38997332

RESUMO

Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages sent by patients diagnosed with breast cancer from December 2004 to November 2017 in the Vanderbilt University Medical Center electronic health record system. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words. 
In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of the tasks' choices between the two groups of reviewers (p = 0.774 under a paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
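The prior-knowledge retrieval step can be sketched as below (illustrative only: it shows how the most similar terms might be pulled from a pretrained embedding matrix, while the way PK-word2vec injects those neighbors into skip-gram training is not reproduced here; the `top_similar` helper and the toy vocabulary are assumptions):

```python
import numpy as np

def top_similar(word, vocab, vectors, n=5):
    """Return the n nearest neighbors of `word` in a pretrained embedding
    matrix by cosine similarity -- the 'prior knowledge' that PK-word2vec
    draws on before training on the small message corpus."""
    idx = vocab.index(word)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ V[idx]                 # cosine similarity to every word
    order = np.argsort(-sims)         # most similar first
    return [vocab[i] for i in order if i != idx][:n]
```

In the study's setting, two such pretrained models (one medical, one general-purpose) would supply neighbor lists for medical and non-medical words, respectively.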


Assuntos
Neoplasias da Mama , Processamento de Linguagem Natural , Portais do Paciente , Humanos , Feminino , Semântica , Registros Eletrônicos de Saúde
20.
Res Sq ; 2024 May 15.
Artigo em Inglês | MEDLINE | ID: mdl-38798621

RESUMO

Background: Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. Objective: We introduce a novel adaptation of the word2vec model, PK-word2vec, for small-scale messages. Methods: PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec to patient portal messages sent by patients diagnosed with breast cancer from December 2004 to November 2017 in the Vanderbilt University Medical Center electronic health record system. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. Results: The dataset was composed of 1,389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7,981 non-medical and 1,116 medical words. 
In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of the tasks' choices between the two groups of reviewers (p = 0.774 under a paired t-test). Conclusions: PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
