1.
medRxiv ; 2024 Apr 27.
Article in English | MEDLINE | ID: mdl-38712148

ABSTRACT

Background: The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective: This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods: We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There are 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias.
We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in our reviewed papers have been conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications give rise to bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regulate the application of LLMs in healthcare.

3.
IEEE Trans Nanobioscience ; 22(4): 808-817, 2023 10.
Article in English | MEDLINE | ID: mdl-37289605

ABSTRACT

Sharing individual-level pandemic data is essential for accelerating the understanding of a disease. For example, COVID-19 data have been widely collected to support public health surveillance and research. In the United States, these data are typically de-identified before publication to protect the privacy of the corresponding individuals. However, current data publishing approaches for this type of data, such as those adopted by the U.S. Centers for Disease Control and Prevention (CDC), have not flexed over time to account for the dynamic nature of infection rates. Thus, the policies generated by these strategies have the potential to either raise privacy risks or overprotect the data and impair the data utility (or usability). To optimize the tradeoff between privacy risk and data utility, we introduce a game theoretic model that adaptively generates policies for the publication of individual-level COVID-19 data according to infection dynamics. We model the data publishing process as a two-player Stackelberg game between a data publisher and a data recipient and then search for the best strategy for the publisher. In this game, we consider 1) average performance of predicting future case counts; and 2) mutual information between the original data and the released data. We use COVID-19 case data from Vanderbilt University Medical Center from March 2020 to December 2021 to demonstrate the effectiveness of the new model. The results indicate that the game theoretic model outperforms all state-of-the-art baseline approaches, including those adopted by CDC, while maintaining low privacy risk. We further perform extensive sensitivity analyses to show that our findings are robust to order-of-magnitude parameter fluctuations.
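
The abstract above names mutual information between the original and released data as one of the quantities considered in the game, but does not say how it is estimated. As a rough illustration only, the following sketch computes a plug-in (empirical) estimate for discrete values. The function name `mutual_information` and the toy age data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(original, released):
    """Empirical mutual information, in bits, between two equal-length
    sequences of discrete values (a plug-in estimate, assumed here)."""
    if len(original) != len(released):
        raise ValueError("sequences must have the same length")
    n = len(original)
    if n == 0:
        raise ValueError("sequences must not be empty")
    joint = Counter(zip(original, released))   # counts of (x, y) pairs
    px = Counter(original)                     # marginal counts of x
    py = Counter(released)                     # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: exact ages and two generalization policies.
ages = [23, 27, 34, 38, 45, 49]
decades = ["20s", "20s", "30s", "30s", "40s", "40s"]
suppressed = ["*"] * len(ages)

mi_exact = mutual_information(ages, ages)          # release unchanged
mi_decades = mutual_information(ages, decades)     # generalize to decades
mi_suppressed = mutual_information(ages, suppressed)  # suppress entirely
```

In this toy case the estimate falls as the release becomes coarser: the unchanged release retains the most information about the original values, decade bins retain less, and full suppression retains none.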


Subjects
COVID-19 , Privacy , Humans , United States/epidemiology , Pandemics , COVID-19/epidemiology , Publishing
4.
Genome Res ; 33(7): 1113-1123, 2023 07.
Article in English | MEDLINE | ID: mdl-37217251

ABSTRACT

The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the data set and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is in the form of either summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public data sets.


Subjects
Information Dissemination , Privacy , Humans , Information Dissemination/methods , Genomics , Gene Frequency , Alleles
5.
Sci Rep ; 13(1): 6932, 2023 04 28.
Article in English | MEDLINE | ID: mdl-37117219

ABSTRACT

As recreational genomics continues to grow in its popularity, many people are afforded the opportunity to share their genomes in exchange for various services, including third-party interpretation (TPI) tools, to understand their predisposition to health problems and, based on genome similarity, to find extended family members. At the same time, these services have increasingly been reused by law enforcement to track down potential criminals through family members who disclose their genomic information. While it has been observed that many potential users shy away from such data sharing when they learn that their privacy cannot be assured, it remains unclear how potential users' valuations of the service will affect a population's behavior. In this paper, we present a game theoretic framework to model interdependent privacy challenges in genomic data sharing online. Through simulations, we find that in addition to the boundary cases when (1) no player and (2) every player joins, there exist pure-strategy Nash equilibria when a relatively small portion of players choose to join the genomic database. The result is consistent under different parametric settings. We further examine the stability of Nash equilibria and illustrate that the only equilibrium that is resistant to a random dropping of players is when all players join the genomic database. Finally, we show that when players consider the impact that their data sharing may have on their relatives, the only pure strategy Nash equilibria are when either no player or every player shares their genomic data.


Subjects
Non-alcoholic Fatty Liver Disease , Privacy , Humans , Information Dissemination , Family , Genomics
6.
J Am Med Inform Assoc ; 30(5): 907-914, 2023 04 19.
Article in English | MEDLINE | ID: mdl-36809550

ABSTRACT

OBJECTIVE: The All of Us Research Program makes individual-level data available to researchers while protecting the participants' privacy. This article describes the protections embedded in the multistep access process, with a particular focus on how the data were transformed to meet generally accepted re-identification risk levels. METHODS: At the time of the study, the resource consisted of 329 084 participants. Systematic amendments were applied to the data to mitigate re-identification risk (eg, generalization of geographic regions, suppression of public events, and randomization of dates). We computed the re-identification risk for each participant using a state-of-the-art adversarial model specifically assuming that it is known that someone is a participant in the program. We confirmed the expected risk is no greater than 0.09, a threshold that is consistent with guidelines from various US state and federal agencies. We further investigated how risk varied as a function of participant demographics. RESULTS: The results indicated that the 95th percentile of the re-identification risk of all the participants is below current thresholds. At the same time, we observed that risk levels were higher for certain racial, ethnic, and gender groups. CONCLUSIONS: While the re-identification risk was sufficiently low, this does not imply that the system is devoid of risk. Rather, All of Us uses a multipronged data protection strategy that includes strong authentication practices, active monitoring of data misuse, and penalization mechanisms for users who violate terms of service.


Subjects
Population Health , Humans , Male , Female , Privacy , Risk Management , Computer Security , Research Personnel
7.
J Med Internet Res ; 25: e42985, 2023 02 15.
Article in English | MEDLINE | ID: mdl-36790847

ABSTRACT

BACKGROUND: By the end of 2022, more than 100 million people were infected with COVID-19 in the United States, and the cumulative death rate in rural areas (383.5/100,000) was much higher than in urban areas (280.1/100,000). As the pandemic spread, people used social media platforms to express their opinions and concerns about COVID-19-related topics. OBJECTIVE: This study aimed to (1) identify the primary COVID-19-related topics in the contiguous United States communicated over Twitter and (2) compare the sentiments urban and rural users expressed about these topics. METHODS: We collected tweets containing geolocation data from May 2020 to January 2022 in the contiguous United States. We relied on the tweets' geolocations to determine if their authors were in an urban or rural setting. We trained multiple word2vec models with several corpora of tweets based on geospatial and timing information. Using a word2vec model built on all tweets, we identified hashtags relevant to COVID-19 and performed hashtag clustering to obtain related topics. We then ran an inference analysis for urban and rural sentiments with respect to the topics based on the similarity between topic hashtags and opinion adjectives in the corresponding urban and rural word2vec models. Finally, we analyzed the temporal trend in sentiments using monthly word2vec models. RESULTS: We created a corpus of 407 million tweets, 350 million (86%) of which were posted by users in urban areas, while 18 million (4.4%) were posted by users in rural areas. There were 2666 hashtags related to COVID-19, which clustered into 20 topics. Rural users expressed stronger negative sentiments than urban users about COVID-19 prevention strategies and vaccination (P<.001). Moreover, there was a clear political divide in the perception of politicians by urban and rural users; these users communicated stronger negative sentiments about Republican and Democratic politicians, respectively (P<.001). 
Regarding misinformation and conspiracy theories, urban users exhibited stronger negative sentiments about the "covidiots" and "China virus" topics, while rural users exhibited stronger negative sentiments about the "Dr. Fauci" and "plandemic" topics. Finally, we observed that urban users' sentiments about the economy appeared to transition from negative to positive in late 2021, which was in line with the US economic recovery. CONCLUSIONS: This study demonstrates there is a statistically significant difference in the sentiments of urban and rural Twitter users regarding a wide range of COVID-19-related topics. This suggests that social media can be relied upon to monitor public sentiment during pandemics in disparate types of regions. This may assist in the geographically targeted deployment of epidemic prevention and management efforts.


Subjects
COVID-19 , Social Media , Humans , United States , COVID-19/epidemiology , Retrospective Studies , SARS-CoV-2 , Attitude
8.
Nat Commun ; 13(1): 7609, 2022 Dec 09.
Article in English | MEDLINE | ID: mdl-36494374

ABSTRACT

Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.


Subjects
Biomedical Research , Electronic Health Records , Privacy , Benchmarking
9.
J Med Internet Res ; 24(3): e31687, 2022 03 11.
Article in English | MEDLINE | ID: mdl-35275077

ABSTRACT

BACKGROUND: In November 2018, a Chinese researcher reported that his team had applied clustered regularly interspaced palindromic repeats or associated protein 9 to delete the gene C-C chemokine receptor type 5 from embryos and claimed that the 2 newborns would have lifetime immunity from HIV infection, an event referred to as #GeneEditedBabies on social media platforms. Although this event stirred a worldwide debate on ethical and legal issues regarding clinical trials with embryonic gene sequences, the focus has mainly been on academics and professionals. However, how the public, especially stratified by geographic region and culture, reacted to these issues is not yet well-understood. OBJECTIVE: The aim of this study is to examine web-based posts about the #GeneEditedBabies event and characterize and compare the public's stance across social media platforms with different user bases. METHODS: We used a set of relevant keywords to search for web-based posts in 4 worldwide or regional mainstream social media platforms: Sina Weibo (China), Twitter, Reddit, and YouTube. We applied structural topic modeling to analyze the main discussed topics and their temporal trends. On the basis of the topics we found, we designed an annotation codebook to label 2000 randomly sampled posts from each platform on whether a supporting, opposing, or neutral stance toward this event was expressed and what the major considerations of those posts were if a stance was described. The annotated data were used to compare stances and the language used across the 4 web-based platforms. RESULTS: We collected >220,000 posts published by approximately 130,000 users regarding the #GeneEditedBabies event. Our results indicated that users discussed a wide range of topics, some of which had clear temporal trends. Our results further showed that although almost all experts opposed this event, many web-based posts supported this event. 
In particular, Twitter exhibited the largest number of posts in opposition (701/816, 85.9%), followed by Sina Weibo (968/1140, 84.91%), Reddit (550/898, 61.2%), and YouTube (567/1078, 52.6%). The primary opposing reason was rooted in ethical concerns, whereas the primary supporting reason was based on the expectation that such technology could prevent the occurrence of diseases in the future. Posts from these 4 platforms had different language uses and patterns when they expressed stances on the #GeneEditedBabies event. CONCLUSIONS: This research provides evidence that posts on web-based platforms can offer insights into the public's stance on gene editing techniques. However, these stances vary across web-based platforms and often differ from those raised by academics and policy makers.


Subjects
HIV Infections , Social Media , China/epidemiology , Humans , Infant, Newborn , Public Opinion
10.
Nat Rev Genet ; 23(7): 429-445, 2022 07.
Article in English | MEDLINE | ID: mdl-35246669

ABSTRACT

Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.


Subjects
Genomics , Privacy , Genome
12.
J Am Med Inform Assoc ; 29(5): 853-863, 2022 04 13.
Article in English | MEDLINE | ID: mdl-35182149

ABSTRACT

OBJECTIVE: Supporting public health research and the public's situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real time sharing of person-level surveillance data. MATERIALS AND METHODS: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework's effectiveness in maintaining the PK11 threshold of 0.01. RESULTS: When sharing COVID-19 county-level case data across all US counties, the framework's approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%. CONCLUSION: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
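
The PK11 measure described above is the proportion of records that fall in a group of fewer than 11 records. The sketch below shows one way such a proportion could be computed for a given release, assuming that groups are defined by shared quasi-identifier values. The function name `pk`, the field names `county` and `age_group`, and the toy records are all hypothetical. The framework in the abstract forecasts this quantity through simulation, which this sketch does not attempt.

```python
from collections import Counter


def pk(records, quasi_identifiers, k=11):
    """Proportion of records whose quasi-identifier combination is shared
    by fewer than k records in the same release (k=11 gives PK11)."""
    if not records:
        return 0.0
    # One key per record: the tuple of its quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    small = sum(1 for key in keys if group_sizes[key] < k)
    return small / len(records)


# Hypothetical toy release: 12 cases in county A and 3 in county B,
# all in the same age group.
records = (
    [{"county": "A", "age_group": "30-39"} for _ in range(12)]
    + [{"county": "B", "age_group": "30-39"} for _ in range(3)]
)

pk11_with_county = pk(records, ["county", "age_group"])  # 3 of 15 records
pk11_without_county = pk(records, ["age_group"])         # one group of 15
```

Here, dropping the county field merges the two groups into one of 15 records, so no record remains in a group smaller than 11. A policy search of the kind the abstract describes would compare such generalization choices against the chosen threshold.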


Subjects
COVID-19 , Privacy , Humans , Pandemics , Policy , Prospective Studies , Public Health , Retrospective Studies
13.
JMIR Infodemiology ; 2(2): e35702, 2022.
Article in English | MEDLINE | ID: mdl-37113452

ABSTRACT

Background: As direct-to-consumer genetic testing services have grown in popularity, the public has increasingly relied upon online forums to discuss and share their test results. Initially, users did so anonymously, but more recently, they have included face images when discussing their results. Various studies have shown that sharing images on social media tends to elicit more replies. However, users who do this forgo their privacy. When these images truthfully represent a user, they have the potential to disclose that user's identity. Objective: This study investigates the face image sharing behavior of direct-to-consumer genetic testing users in an online environment to determine if there exists an association between face image sharing and the attention received from other users. Methods: This study focused on r/23andme, a subreddit dedicated to discussing direct-to-consumer genetic testing results and their implications. We applied natural language processing to infer the themes associated with posts that included a face image. We applied a regression analysis to characterize the association between the attention that a post received, in terms of the number of comments, the karma score (defined as the number of upvotes minus the number of downvotes), and whether the post contained a face image. Results: We collected over 15,000 posts from the r/23andme subreddit, published between 2012 and 2020. Face image posting began in late 2019 and grew rapidly, with over 800 individuals revealing their faces by early 2020. The topics in posts including a face were primarily about sharing, discussing ancestry composition, or sharing family reunion photos with relatives discovered via direct-to-consumer genetic testing. On average, posts including a face image received 60% (5/8) more comments and had karma scores 2.4 times higher than other posts. 
Conclusions: Direct-to-consumer genetic testing consumers in the r/23andme subreddit are increasingly posting face images and testing reports on social platforms. The association between face image posting and a greater level of attention suggests that people are forgoing their privacy in exchange for attention from others. To mitigate this risk, platform organizers and moderators could inform users about the risk of posting face images in a direct, explicit manner to make it clear that their privacy may be compromised if personal images are shared.

14.
AMIA Annu Symp Proc ; 2022: 279-288, 2022.
Article in English | MEDLINE | ID: mdl-37128430

ABSTRACT

Data access limitations have stifled COVID-19 disparity investigations in the United States. Though federal and state legislation permits publicly disseminating de-identified data, methods for de-identification, including a recently proposed dynamic policy approach to pandemic data sharing, remain unproved in their ability to support pandemic disparity studies. Thus, in this paper, we evaluate how such an approach enables timely, accurate, and fair disparity detection, with respect to potential adversaries with varying prior knowledge about the population. We show that, when considering reasonably enabled adversaries, dynamic policies support up to three times earlier disparity detection in partially synthetic data than data sharing policies derived from two current, public datasets. Using real-world COVID-19 data, we also show how granular date information, which dynamic policies were designed to share, improves disparity characterization. Our results highlight the potential of the dynamic policy approach to publish data that supports disparity investigations in current and future pandemics.


Subjects
COVID-19 , Humans , United States , COVID-19/epidemiology , Policy , Information Dissemination , Pandemics , Public Health Surveillance/methods
15.
Sci Adv ; 7(50): eabe9986, 2021 Dec 10.
Article in English | MEDLINE | ID: mdl-34890225

ABSTRACT

Person-specific biomedical data are now widely collected, but their sharing raises privacy concerns, specifically about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform decisions about whether and how to share data; current techniques, however, focus on scenarios where the data recipients use only one resource for re-identification purposes. This is a concern because recent attacks show that adversaries can access multiple resources, combining them in a stage-wise manner, to enhance the chance of an attack's success. In this work, we represent a re-identification game using a two-player Stackelberg game of perfect information, which can be applied to assess risk, and suggest an optimal data sharing strategy based on a privacy-utility tradeoff. We report on experiments with large-scale genomic datasets to show that, using game theoretic models accounting for adversarial capabilities to launch multistage attacks, most data can be effectively shared with low re-identification risk.

16.
J Am Med Inform Assoc ; 28(4): 744-752, 2021 03 18.
Article in English | MEDLINE | ID: mdl-33448306

ABSTRACT

OBJECTIVE: Re-identification risk methods for biomedical data often assume a worst case, in which attackers know all identifiable features (eg, age and race) about a subject. Yet, worst-case adversarial modeling can overestimate risk and induce heavy editing of shared data. The objective of this study is to introduce a framework for assessing the risk considering the attacker's resources and capabilities. MATERIALS AND METHODS: We integrate 3 established risk measures (ie, prosecutor, journalist, and marketer risks) and compute re-identification probabilities for data subjects. This probability is dependent on an attacker's capabilities (eg, ability to obtain external identified resources) and the subject's decision on whether to reveal their participation in a dataset. We illustrate the framework through case studies using data from over 1 000 000 patients from Vanderbilt University Medical Center and show how re-identification risk changes when attackers are pragmatic and use 2 known resources for attack: (1) voter registration lists and (2) social media posts. RESULTS: Our framework illustrates that the risk is substantially smaller in the pragmatic scenarios than in the worst case. Our experiments yield a median worst-case risk of 0.987 (where 0 is least risky and 1 is most risky); however, the median reduction in risk was 90.1% in the voter registration scenario and 100% in the social media posts scenario. Notably, these observations hold true for a wide range of adversarial capabilities. CONCLUSIONS: This research illustrates that re-identification risk is situationally dependent and that appropriate adversarial modeling may permit biomedical data sharing on a wider scale than is currently the case.


Subjects
Computer Security , Confidentiality , Data Anonymization , Probability , Humans , Risk , Risk Assessment
17.
AMIA Annu Symp Proc ; 2021: 793-802, 2021.
Article in English | MEDLINE | ID: mdl-35309009

ABSTRACT

Numerous studies have shown that a person's health status is closely related to their socioeconomic status. It is evident that incorporating socioeconomic data associated with a patient's geographic area of residence into clinical datasets will promote medical research. However, most socioeconomic variables are unique in combination and are affiliated with small geographical regions (e.g., census tracts) that are often associated with fewer than 20,000 people. Thus, sharing such tract-level data can violate the Safe Harbor implementation of de-identification under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). In this paper, we introduce a constraint-based k-means clustering approach to generate census tract-level socioeconomic data that is de-identification compliant. Our experimental analysis with data from the American Community Survey illustrates that the approach generates a protected dataset with high similarity to the unaltered values, and achieves substantially better data utility than the HIPAA Safe Harbor recommendation of 3-digit ZIP code.


Subjects
Biomedical Research , Census Tract , Cluster Analysis , Health Insurance Portability and Accountability Act , Humans , Social Class , United States
18.
AMIA Annu Symp Proc ; 2019: 607-616, 2019.
Article in English | MEDLINE | ID: mdl-32308855

ABSTRACT

To accelerate medical knowledge discovery, an increasing number of research programs are gathering and sharing data on a large number of participants. Due to the privacy concerns and legal restrictions on data sharing, these programs apply various strategies to mitigate privacy risk. However, the activities of participants and research program sponsors, particularly on social media, might reveal an individual's membership in a study, making it easier to recognize participants' records and uncover the information they have yet to disclose. This behavior can jeopardize the privacy of the participants themselves, the reputation of the projects, sponsors, and the research enterprise. To investigate the dangers of self-disclosure behavior, we gathered and analyzed 4,020 tweets, and uncovered over 100 tweets disclosing the individuals' memberships in over 15 programs. Our investigation showed that self-disclosure on social media can reveal participants' membership in research cohorts, and such activity might lead to the leakage of a person's identity, genomic, and other sensitive health information.


Subjects
Biomedical Research , Disclosure , Information Dissemination , Self Disclosure , Social Media , Clinical Trials as Topic , Female , Humans , Male , Privacy
19.
J Am Med Inform Assoc ; 25(1): 25-31, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29036325

ABSTRACT

Objective: Biomedical science is driven by datasets that are being accumulated at an unprecedented rate, with ever-growing volume and richness. There are various initiatives to make these datasets more widely available to recipients who sign Data Use Certificate agreements, whereby penalties are levied for violations. A particularly popular penalty is the temporary revocation, often for several months, of the recipient's data usage rights. This policy is based on the assumption that the value of biomedical research data depreciates significantly over time; however, no studies have been performed to substantiate this belief. This study investigates whether this assumption holds true and the data science policy implications. Methods: This study tests the hypothesis that the value of data for scientific investigators, in terms of the impact of the publications based on the data, decreases over time. The hypothesis is tested formally through a mixed linear effects model using approximately 1200 publications between 2007 and 2013 that used datasets from the Database of Genotypes and Phenotypes, a data-sharing initiative of the National Institutes of Health. Results: The analysis shows that the impact factors for publications based on Database of Genotypes and Phenotypes datasets depreciate in a statistically significant manner. However, we further discover that the depreciation rate is slow, only ∼10% per year, on average. Conclusion: The enduring value of data for subsequent studies implies that revoking usage for short periods of time may not sufficiently deter those who would violate Data Use Certificate agreements and that alternative penalty mechanisms may need to be invoked.
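
To make the conclusion above concrete, the following sketch estimates how much value a dataset would retain after a revocation period if it lost about 10% of its value per year. Geometric (compounding) depreciation is an assumption made here for illustration. The abstract reports only an average annual rate from a mixed linear effects model, and the function name `remaining_value` is hypothetical.

```python
def remaining_value(months, annual_depreciation=0.10):
    """Fraction of a dataset's value left after `months`, assuming the
    value shrinks geometrically at `annual_depreciation` per year."""
    if months < 0:
        raise ValueError("months must be non-negative")
    if not 0.0 <= annual_depreciation < 1.0:
        raise ValueError("annual_depreciation must be in [0, 1)")
    return (1.0 - annual_depreciation) ** (months / 12.0)


# Value retained after revocations of several lengths (in months).
retained = {m: remaining_value(m) for m in (3, 6, 12)}
```

Under this assumption, a three-month revocation leaves roughly 97% of the value in place, which is in line with the abstract's point that short revocations may be a weak deterrent.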


Subjects
Biomedical Research , Datasets as Topic , Information Dissemination , Journal Impact Factor , Publications/standards , Time Factors
20.
AMIA Annu Symp Proc ; 2018: 760-769, 2018.
Article in English | MEDLINE | ID: mdl-30815118

ABSTRACT

As the quantity and detail of association studies between clinical phenotypes and genotypes grows, there is a push to make summary statistics widely available. Genome wide summary statistics have been shown to be vulnerable to the inference of a targeted individual's presence. In this paper, we show that presence attacks are feasible with phenome wide summary statistics as well. We use data from three healthcare organizations and an online resource that publishes summary statistics. We introduce a novel attack that achieves over 80% recall and precision within a population of 16,346, where 8,173 individuals are targets. However, the feasibility of the attack is dependent on the attacker's knowledge about 1) the targeted individual and 2) the reference dataset. Within a population of over 2 million, where 8,173 individuals are targets, our attack achieves 31% recall and 17% precision. As a result, it is plausible that sharing of phenomic summary statistics may be accomplished with an acceptable level of privacy risk.
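
The recall and precision figures above can be turned into approximate counts using the standard definitions (recall = true positives / targets, precision = true positives / individuals flagged). The sketch below does that arithmetic. The function name `attack_counts` is hypothetical, and the results are only as exact as the rounded percentages in the abstract.

```python
def attack_counts(n_targets, recall, precision):
    """Approximate counts implied by a membership attack's recall and
    precision, using the standard definitions of both metrics."""
    if n_targets < 0 or not 0.0 < precision <= 1.0 or not 0.0 <= recall <= 1.0:
        raise ValueError("invalid inputs")
    true_positives = recall * n_targets      # targets correctly flagged
    flagged = true_positives / precision     # everyone the attack flags
    return {
        "true_positives": true_positives,
        "flagged": flagged,
        "false_positives": flagged - true_positives,
    }


# Large-population scenario from the abstract: 8,173 targets,
# 31% recall, 17% precision.
large = attack_counts(8173, recall=0.31, precision=0.17)
```

For the large-population scenario this works out to roughly 2,500 targets correctly flagged among roughly 14,900 flagged individuals, so most of the attacker's claims in that setting would be wrong.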


Subjects
Computer Security , Personally Identifiable Information , Phenotype , Genome-Wide Association Study , Genotype , Humans