1.
Nat Rev Genet; 23(7): 429-445, 2022 07.
Article in English | MEDLINE | ID: mdl-35246669

ABSTRACT

Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.


Subject(s)
Genomics; Privacy; Genome
2.
Genome Res; 33(7): 1113-1123, 2023 07.
Article in English | MEDLINE | ID: mdl-37217251

ABSTRACT

The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the data set and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is in the form of either summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public data sets.
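
To make the first attack model more concrete, here is a minimal sketch of a log-likelihood-ratio membership score. It is not the paper's method: the function name `log_likelihood_ratio`, the inputs (a binary genotype vector, released pool allele frequencies, and reference-population frequencies), and the assumption that variants are independent are all illustrative.

```python
import math


def log_likelihood_ratio(genotype, pool_freq, ref_freq, eps=1e-9):
    """Simplified membership score (illustrative, not the paper's exact test).

    genotype  : list of 0/1 indicators for the target's alleles
    pool_freq : allele frequencies released for the data set (hypothetical input)
    ref_freq  : allele frequencies in a reference population (hypothetical input)
    A larger score means the target looks more like the pool than the reference.
    """
    score = 0.0
    for x, p_pool, p_ref in zip(genotype, pool_freq, ref_freq):
        # Clamp frequencies away from 0 and 1 so the logarithms stay finite.
        p_pool = min(max(p_pool, eps), 1 - eps)
        p_ref = min(max(p_ref, eps), 1 - eps)
        if x:
            score += math.log(p_pool / p_ref)
        else:
            score += math.log((1 - p_pool) / (1 - p_ref))
    return score
```

An attacker in this simplified setting would claim membership when the score exceeds a chosen threshold; suppressing or perturbing variants, as the abstract describes, aims to shrink that score for real members.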


Subject(s)
Information Dissemination; Privacy; Humans; Information Dissemination/methods; Genomics; Gene Frequency; Alleles
3.
J Biomed Inform; 153: 104640, 2024 May.
Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.


Subject(s)
Artificial Intelligence; Evidence-Based Medicine; Humans; Trust; Natural Language Processing
4.
J Med Internet Res; 26: e49445, 2024 04 24.
Article in English | MEDLINE | ID: mdl-38657232

ABSTRACT

BACKGROUND: Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that it is no longer reasonably related to a person. Yet, such alterations have the potential to influence the data set's statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not broadly been adopted in clinical practice. OBJECTIVE: The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study. METHODS: The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case-specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results. RESULTS: Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics. 
For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50%. The use case-specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy. CONCLUSIONS: Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case-specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data. TRIAL REGISTRATION: German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1093/ndt/gfr456.
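
The abstract does not define its overlap measure, so the sketch below assumes one common definition: the length of the intersection of the two intervals, averaged over its share of each interval. The function name `ci_overlap` and the `(lower, upper)` tuple layout are hypothetical.

```python
def ci_overlap(original, anonymized):
    """Overlap of two confidence intervals, from 0.0 (disjoint) to 1.0 (identical).

    Each argument is a (lower, upper) tuple. This is one common definition,
    assumed here for illustration; the study may have used a different one.
    """
    width_orig = original[1] - original[0]
    width_anon = anonymized[1] - anonymized[0]
    if width_orig <= 0 or width_anon <= 0:
        raise ValueError("intervals must have positive width")
    lower = max(original[0], anonymized[0])
    upper = min(original[1], anonymized[1])
    intersection = max(0.0, upper - lower)
    # Average the share of each interval that is covered by the intersection.
    return 0.5 * (intersection / width_orig + intersection / width_anon)
```

Under this definition, a nonoverlapping pair of intervals scores 0, and an estimate whose interval is unchanged by anonymization scores 1.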


Subject(s)
Data Anonymization; Humans; Renal Insufficiency, Chronic/therapy; Information Dissemination/methods; Algorithms; Germany; Confidentiality; Privacy
5.
J Med Internet Res; 26: e52508, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38696776

ABSTRACT

The number of papers presenting machine learning (ML) models that are being submitted to and published in the Journal of Medical Internet Research and other JMIR Publications journals has steadily increased. Editors and peer reviewers involved in the review process for such manuscripts often go through multiple review cycles to enhance the quality and completeness of reporting. The use of reporting guidelines or checklists can help ensure consistency in the quality of submitted (and published) scientific manuscripts and, for example, avoid instances of missing information. In this Editorial, the editors of JMIR Publications journals discuss the general JMIR Publications policy regarding authors' application of reporting guidelines and specifically focus on the reporting of ML studies in JMIR Publications journals, using the Consolidated Reporting of Machine Learning Studies (CREMLS) guidelines, with an example of how authors and other journals could use the CREMLS checklist to ensure transparency and rigor in reporting.


Subject(s)
Machine Learning; Humans; Guidelines as Topic; Prognosis; Checklist
7.
J Med Internet Res; 25: e42985, 2023 02 15.
Article in English | MEDLINE | ID: mdl-36790847

ABSTRACT

BACKGROUND: By the end of 2022, more than 100 million people were infected with COVID-19 in the United States, and the cumulative death rate in rural areas (383.5/100,000) was much higher than in urban areas (280.1/100,000). As the pandemic spread, people used social media platforms to express their opinions and concerns about COVID-19-related topics. OBJECTIVE: This study aimed to (1) identify the primary COVID-19-related topics in the contiguous United States communicated over Twitter and (2) compare the sentiments urban and rural users expressed about these topics. METHODS: We collected tweets containing geolocation data from May 2020 to January 2022 in the contiguous United States. We relied on the tweets' geolocations to determine if their authors were in an urban or rural setting. We trained multiple word2vec models with several corpora of tweets based on geospatial and timing information. Using a word2vec model built on all tweets, we identified hashtags relevant to COVID-19 and performed hashtag clustering to obtain related topics. We then ran an inference analysis for urban and rural sentiments with respect to the topics based on the similarity between topic hashtags and opinion adjectives in the corresponding urban and rural word2vec models. Finally, we analyzed the temporal trend in sentiments using monthly word2vec models. RESULTS: We created a corpus of 407 million tweets, 350 million (86%) of which were posted by users in urban areas, while 18 million (4.4%) were posted by users in rural areas. There were 2666 hashtags related to COVID-19, which clustered into 20 topics. Rural users expressed stronger negative sentiments than urban users about COVID-19 prevention strategies and vaccination (P<.001). Moreover, there was a clear political divide in the perception of politicians by urban and rural users; these users communicated stronger negative sentiments about Republican and Democratic politicians, respectively (P<.001). 
Regarding misinformation and conspiracy theories, urban users exhibited stronger negative sentiments about the "covidiots" and "China virus" topics, while rural users exhibited stronger negative sentiments about the "Dr. Fauci" and "plandemic" topics. Finally, we observed that urban users' sentiments about the economy appeared to transition from negative to positive in late 2021, which was in line with the US economic recovery. CONCLUSIONS: This study demonstrates there is a statistically significant difference in the sentiments of urban and rural Twitter users regarding a wide range of COVID-19-related topics. This suggests that social media can be relied upon to monitor public sentiment during pandemics in disparate types of regions. This may assist in the geographically targeted deployment of epidemic prevention and management efforts.


Subject(s)
COVID-19; Social Media; Humans; United States; COVID-19/epidemiology; Retrospective Studies; SARS-CoV-2; Attitude
8.
J Med Internet Res; 25: e43251, 2023 03 24.
Article in English | MEDLINE | ID: mdl-36961506

ABSTRACT

The potential of artificial intelligence (AI) to reduce health care disparities and inequities is recognized, but it can also exacerbate these issues if not implemented in an equitable manner. This perspective identifies potential biases in each stage of the AI life cycle, including data collection, annotation, machine learning model development, evaluation, deployment, operationalization, monitoring, and feedback integration. To mitigate these biases, we suggest involving a diverse group of stakeholders, using human-centered AI principles. Human-centered AI can help ensure that AI systems are designed and used in a way that benefits patients and society, which can reduce health disparities and inequities. By recognizing and addressing biases at each stage of the AI life cycle, AI can achieve its potential in health care.


Subject(s)
Artificial Intelligence; Machine Learning; Humans; Healthcare Disparities; Bias
9.
J Med Internet Res; 25: e48193, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37976095

ABSTRACT

BACKGROUND: Alzheimer disease or related dementias (ADRD) are severe neurological disorders that impair the thinking and memory skills of older adults. Most persons living with dementia receive care at home from their family members or other unpaid informal caregivers; this results in significant mental, physical, and financial challenges for these caregivers. To combat these challenges, many informal ADRD caregivers seek social support in online environments. Although research examining online caregiving discussions is growing, few investigations have distinguished caregivers according to their kin relationships with persons living with dementias. Various studies have suggested that caregivers in different relationships experience distinct caregiving challenges and support needs. OBJECTIVE: This study aims to examine and compare the online behaviors of adult-child and spousal caregivers, the 2 largest groups of informal ADRD caregivers, in an open online community. METHODS: We collected posts from ALZConnected, an online community managed by the Alzheimer's Association. To gain insights into online behaviors, we first applied structural topic modeling to identify topics and topic prevalence between adult-child and spousal caregivers. Next, we applied VADER (Valence Aware Dictionary for Sentiment Reasoning) and LIWC (Linguistic Inquiry and Word Count) to evaluate sentiment changes in the online posts over time for both types of caregivers. We further built machine learning models to distinguish the posts of each caregiver type and evaluated them in terms of precision, recall, F1-score, and area under the precision-recall curve. Finally, we applied the best prediction model to compare the temporal trend of relationship-predicting capacities in posts between the 2 types of caregivers. 
RESULTS: Our analysis showed that the number of posts from both types of caregivers followed a long-tailed distribution, indicating that most caregivers in this online community were infrequent users. In comparison with adult-child caregivers, spousal caregivers tended to be more active in the community, publishing more posts and engaging in discussions on a wider range of caregiving topics. Spousal caregivers also exhibited slower growth in positive emotional communication over time. The best machine learning model for predicting adult-child, spousal, or other caregivers achieved an area under the precision-recall curve of 81.3%. The subsequent trend analysis showed that it became more difficult to predict adult-child caregiver posts than spousal caregiver posts over time. This suggests that adult-child and spousal caregivers might gradually shift their discussions from questions that are more directly related to their own experiences and needs to questions that are more general and applicable to other types of caregivers. CONCLUSIONS: Our findings suggest that it is important for researchers and community organizers to consider the heterogeneity of caregiving experiences and subsequent online behaviors among different types of caregivers when tailoring online peer support to meet the specific needs of each caregiver group.


Subject(s)
Adult Children; Alzheimer Disease; Caregivers; Aged; Humans; Caregivers/psychology; Communication; Family; Social Support; Adult Children/psychology
10.
Clin Infect Dis; 74(4): 584-590, 2022 03 01.
Article in English | MEDLINE | ID: mdl-34128970

ABSTRACT

BACKGROUND: With limited severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) testing capacity in the United States at the start of the epidemic (January-March 2020), testing was focused on symptomatic patients with a travel history throughout February, obscuring the picture of SARS-CoV-2 seeding and community transmission. We sought to identify individuals with SARS-CoV-2 antibodies in the early weeks of the US epidemic. METHODS: All of Us study participants in all 50 US states provided blood specimens during study visits from 2 January to 18 March 2020. Participants were considered seropositive if they tested positive for SARS-CoV-2 immunoglobulin G (IgG) antibodies with the Abbott Architect SARS-CoV-2 IgG enzyme-linked immunosorbent assay (ELISA) and the EUROIMMUN SARS-CoV-2 ELISA in a sequential testing algorithm. The sensitivity and specificity of these ELISAs and the net sensitivity and specificity of the sequential testing algorithm were estimated, along with 95% confidence intervals (CIs). RESULTS: The estimated sensitivities of the Abbott and EUROIMMUN assays were 100% (107 of 107 [95% CI: 96.6%-100%]) and 90.7% (97 of 107 [83.5%-95.4%]), respectively, and the estimated specificities were 99.5% (995 of 1000 [98.8%-99.8%]) and 99.7% (997 of 1000 [99.1%-99.9%]), respectively. The net sensitivity and specificity of our sequential testing algorithm were 90.7% (97 of 107 [95% CI: 83.5%-95.4%]) and 100.0% (1000 of 1000 [99.6%-100%]), respectively. Of the 24 079 study participants with blood specimens from 2 January to 18 March 2020, 9 were seropositive, 7 before the first confirmed case in the states of Illinois, Massachusetts, Wisconsin, Pennsylvania, and Mississippi. CONCLUSIONS: Our findings identified SARS-CoV-2 infections weeks before the first recognized cases in 5 US states.
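
As a rough check on the reported net figures, the textbook formulas for serial testing (a specimen counts as positive only if both assays are positive) can be sketched as follows. The formulas assume the two assays err independently given the true status, an assumption the abstract does not state; the study itself reports empirical counts. The function names are illustrative.

```python
def serial_net_sensitivity(se_first, se_second):
    """Net sensitivity when a positive call requires both tests to be positive.

    Assumes the tests are conditionally independent given true infection status.
    """
    return se_first * se_second


def serial_net_specificity(sp_first, sp_second):
    """Net specificity for the same two-step (serial) algorithm.

    A true negative is cleared by the first test, or, if the first test gives a
    false positive, by the second test.
    """
    return sp_first + (1 - sp_first) * sp_second
```

With the point estimates from the abstract (sensitivities 107/107 and 97/107, specificities 0.995 and 0.997), these formulas give a net sensitivity of about 0.907 and a net specificity above 0.9999, in line with the reported 90.7% and 100.0%.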


Subject(s)
COVID-19; Population Health; Antibodies, Viral; COVID-19/diagnosis; Enzyme-Linked Immunosorbent Assay; Humans; Immunoglobulin G; SARS-CoV-2; Sensitivity and Specificity
11.
J Biomed Inform; 125: 103977, 2022 01.
Article in English | MEDLINE | ID: mdl-34920126

ABSTRACT

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.


Subject(s)
Population Health; Confidentiality; Disclosure; Genomics; Humans; Machine Learning
12.
J Med Internet Res; 24(3): e31687, 2022 03 11.
Article in English | MEDLINE | ID: mdl-35275077

ABSTRACT

BACKGROUND: In November 2018, a Chinese researcher reported that his team had applied clustered regularly interspaced short palindromic repeats (CRISPR)-associated protein 9 to delete the gene C-C chemokine receptor type 5 from embryos and claimed that the 2 newborns would have lifetime immunity from HIV infection, an event referred to as #GeneEditedBabies on social media platforms. Although this event stirred a worldwide debate on ethical and legal issues regarding clinical trials with embryonic gene sequences, the focus has mainly been on academics and professionals. However, how the public, especially stratified by geographic region and culture, reacted to these issues is not yet well understood. OBJECTIVE: The aim of this study is to examine web-based posts about the #GeneEditedBabies event and characterize and compare the public's stance across social media platforms with different user bases. METHODS: We used a set of relevant keywords to search for web-based posts in 4 worldwide or regional mainstream social media platforms: Sina Weibo (China), Twitter, Reddit, and YouTube. We applied structural topic modeling to analyze the main discussed topics and their temporal trends. On the basis of the topics we found, we designed an annotation codebook to label 2000 randomly sampled posts from each platform on whether a supporting, opposing, or neutral stance toward this event was expressed and what the major considerations of those posts were if a stance was described. The annotated data were used to compare stances and the language used across the 4 web-based platforms. RESULTS: We collected >220,000 posts published by approximately 130,000 users regarding the #GeneEditedBabies event. Our results indicated that users discussed a wide range of topics, some of which had clear temporal trends. Our results further showed that although almost all experts opposed this event, many web-based posts supported this event. In particular, Twitter exhibited the largest proportion of posts in opposition (701/816, 85.9%), followed by Sina Weibo (968/1140, 84.91%), Reddit (550/898, 61.2%), and YouTube (567/1078, 52.6%). The primary opposing reason was rooted in ethical concerns, whereas the primary supporting reason was based on the expectation that such technology could prevent the occurrence of diseases in the future. Posts from these 4 platforms had different language uses and patterns when they expressed stances on the #GeneEditedBabies event. CONCLUSIONS: This research provides evidence that posts on web-based platforms can offer insights into the public's stance on gene editing techniques. However, these stances vary across web-based platforms and often differ from those raised by academics and policy makers.


Subject(s)
HIV Infections; Social Media; China/epidemiology; Humans; Infant, Newborn; Public Opinion
13.
J Med Internet Res; 23(3): e22806, 2021 03 04.
Article in English | MEDLINE | ID: mdl-33661128

ABSTRACT

BACKGROUND: Documentation burden is a common problem with modern electronic health record (EHR) systems. To reduce this burden, various recording methods (eg, voice recorders or motion sensors) have been proposed. However, these solutions are in an early prototype phase and are unlikely to transition into practice in the near future. A more pragmatic alternative is to directly modify the implementation of the existing functionalities of an EHR system. OBJECTIVE: This study aims to assess the nature of free-text comments entered into EHR flowsheets that supplement quantitative vital sign values and examine opportunities to simplify functionality and reduce documentation burden. METHODS: We evaluated 209,055 vital sign comments in flowsheets that were generated in the Epic EHR system at the Vanderbilt University Medical Center in 2018. We applied topic modeling, as well as the natural language processing Clinical Language Annotation, Modeling, and Processing software system, to extract generally discussed topics and detailed medical terms (expressed as probability distribution) to investigate the stories communicated in these comments. RESULTS: Our analysis showed that 63.33% (6053/9557) of the users who entered vital signs made at least one free-text comment in vital sign flowsheet entries. The user roles that were most likely to compose comments were registered nurse, technician, and licensed nurse. The most frequently identified topics were the notification of a result to health care providers (0.347), the context of a measurement (0.307), and an inability to obtain a vital sign (0.224). There were 4187 unique medical terms that were extracted from 46,029 (0.220) comments, including many symptom-related terms such as "pain," "upset," "dizziness," "coughing," "anxiety," "distress," and "fever" and drug-related terms such as "tylenol," "anesthesia," "cannula," "oxygen," "motrin," "rituxan," and "labetalol." 
CONCLUSIONS: Considering that flowsheet comments are generally not displayed or automatically pulled into any clinical notes, our findings suggest that the flowsheet comment functionality can be simplified (eg, via structured response fields instead of a text input dialog) to reduce health care provider effort. Moreover, rich and clinically important medical terms such as medications and symptoms should be explicitly recorded in clinical notes for better visibility.


Subject(s)
Documentation; Electronic Health Records; Academic Medical Centers; Humans; Natural Language Processing; Vital Signs
14.
BMC Med Inform Decis Mak; 21(1): 353, 2021 12 18.
Article in English | MEDLINE | ID: mdl-34922536

ABSTRACT

BACKGROUND: Information retrieval (IR) helps clinicians answer questions posed to large collections of electronic medical records (EMRs), such as how best to identify a patient's cancer stage. One of the more promising approaches to IR for EMRs is to expand a keyword query with similar terms (e.g., augmenting cancer with mets). However, there is a large range of clinical chart review tasks, such that fixed sets of similar terms are insufficient. Current language models, such as Bidirectional Encoder Representations from Transformers (BERT) embeddings, do not capture the full non-textual context of a task. In this study, we present new methods that provide similar terms dynamically by adjusting to the context of the chart review task. METHODS: We introduce a medical-context vector space in which each word is represented by a vector that captures the word's usage in different medical contexts (e.g., how frequently cancer is used when ordering a prescription versus describing family history) beyond the context learned from the surrounding text. These vectors are transformed into a vector space for customizing the set of similar terms selected for different chart review tasks. We evaluate the vector space model with multiple chart review tasks, in which supervised machine learning models learn to predict the preferred terms of clinically knowledgeable reviewers. To quantify the usefulness of the predicted similar terms relative to a baseline of standard word2vec embeddings, we measure (1) the prediction performance of the medical-context vector space model using the area under the receiver operating characteristic curve (AUROC) and (2) the labeling effort required to train the models. RESULTS: The vector space outperformed the baseline word2vec embeddings in all three chart review tasks with an average AUROC of 0.80 versus 0.66, respectively. Additionally, the medical-context vector space significantly reduced the number of labels required to learn and predict the preferred similar terms of reviewers. Specifically, the labeling effort was reduced to 10% of the entire dataset in all three tasks. CONCLUSIONS: The set of preferred similar terms that are relevant to a chart review task can be learned by leveraging the medical context of the task.


Subject(s)
Information Storage and Retrieval; Natural Language Processing; Area Under Curve; Electronic Health Records; Humans; Machine Learning
15.
Am J Hum Genet; 100(2): 316-322, 2017 02 02.
Article in English | MEDLINE | ID: mdl-28065469

ABSTRACT

Emerging scientific endeavors are creating big data repositories of data from millions of individuals. Sharing data in a privacy-respecting manner could lead to important discoveries, but high-profile demonstrations show that links between de-identified genomic data and named persons can sometimes be reestablished. Such re-identification attacks have focused on worst-case scenarios and spurred the adoption of data-sharing practices that unnecessarily impede research. To mitigate concerns, organizations have traditionally relied upon legal deterrents, like data use agreements, and are considering suppressing or adding noise to genomic variants. In this report, we use a game theoretic lens to develop more effective, quantifiable protections for genomic data sharing. This is a fundamentally different approach because it accounts for adversarial behavior and capabilities and tailors protections to anticipated recipients with reasonable resources, not adversaries with unlimited means. We demonstrate this approach via a new public resource with genomic summary data from over 8,000 individuals, the Sequence and Phenotype Integration Exchange (SPHINX), and show that risks can be balanced against utility more effectively than with traditional approaches. We further show the generalizability of this framework by applying it to other genomic data collection and sharing endeavors. Recognizing that such models are dependent on a variety of parameters, we perform extensive sensitivity analyses to show that our findings are robust to their fluctuations.


Subject(s)
Databases, Genetic; Genetic Privacy/legislation & jurisprudence; Genomics; Information Dissemination; Models, Theoretical; Electronic Health Records; Humans; Polymorphism, Single Nucleotide
16.
J Biomed Inform; 100: 103334, 2019 12.
Article in English | MEDLINE | ID: mdl-31678588

ABSTRACT

OBJECTIVE: Models for predicting preterm birth generally have focused on very preterm (28-32 weeks) and moderate to late preterm (32-37 weeks) settings. However, extreme preterm birth (EPB), before the 28th week of gestational age, accounts for the majority of newborn deaths. We investigated the extent to which deep learning models that consider temporal relations documented in electronic health records (EHRs) can predict EPB. STUDY DESIGN: EHR data were processed with word embedding and a temporal deep learning model, in the form of recurrent neural networks (RNNs), to predict EPB. Due to the low prevalence of EPB, the models were trained on datasets where controls were undersampled to balance the case-control ratio. We then applied an ensemble approach to group the trained models to predict EPB in an evaluation setting with a natural EPB ratio. We evaluated the RNN ensemble models with 10 years of EHR data from 25,689 deliveries at Vanderbilt University Medical Center. We compared their performance with traditional machine learning models (logistic regression, support vector machine, gradient boosting) trained on the datasets with balanced and natural EPB ratios. Risk factors associated with EPB were identified using adjusted odds ratios. RESULTS: The RNN ensemble models trained on artificially balanced data achieved a higher AUC (0.827 vs. 0.744) and sensitivity (0.965 vs. 0.682) than the RNN models trained on the datasets with a naturally imbalanced EPB ratio. In addition, the AUC (0.827) and sensitivity (0.965) of the RNN ensemble models were better than the AUC (0.777) and sensitivity (0.819) of the best baseline models trained on balanced data. Also, risk factors, including twin pregnancy, short cervical length, hypertensive disorder, systemic lupus erythematosus, and hydroxychloroquine sulfate, were found to be associated with EPB at a significant level. CONCLUSION: Temporal deep learning can predict EPB up to 8 weeks earlier than its occurrence. Accurate prediction of EPB may allow healthcare organizations to allocate resources effectively and ensure patients receive appropriate care.
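
The undersample-then-ensemble design described in this abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`balanced_subsets`, `fit_midpoint_model`, `ensemble_score`), the toy one-dimensional data, and the threshold "model" standing in for an RNN are all hypothetical.

```python
import random
from statistics import mean

def balanced_subsets(cases, controls, n_models, seed=0):
    """Build n_models training sets, each pairing all cases with an
    equally sized random sample of controls (control undersampling)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_models):
        sampled_controls = rng.sample(controls, k=len(cases))
        subsets.append((list(cases), sampled_controls))
    return subsets

def fit_midpoint_model(cases, controls):
    """Toy stand-in for an RNN (assumption): a threshold placed halfway
    between the mean of the cases and the mean of the sampled controls."""
    threshold = (mean(cases) + mean(controls)) / 2
    return lambda x: 1.0 if x >= threshold else 0.0

def ensemble_score(models, x):
    """Average the member predictions into a single ensemble score."""
    return mean(model(x) for model in models)

# Hypothetical one-dimensional risk scores: 5 cases and 200 controls.
data_rng = random.Random(42)
cases = [data_rng.gauss(2.0, 1.0) for _ in range(5)]
controls = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

subsets = balanced_subsets(cases, controls, n_models=7)
models = [fit_midpoint_model(c, k) for c, k in subsets]
score = ensemble_score(models, 1.0)  # fraction of members voting "case"
```

Each member sees a balanced sample, while the ensemble is scored on data with the natural case-control ratio, which is the evaluation setting the abstract describes.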


Subject(s)
Deep Learning; Electronic Health Records; Infant, Extremely Premature; Algorithms; Datasets as Topic; Humans; Infant, Newborn; International Classification of Diseases
17.
Int J Clin Pract ; 73(11): e13393, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31347754

ABSTRACT

BACKGROUND: Hepatorenal syndrome (HRS) is a life-threatening complication of cirrhosis, and early detection of evolving HRS may provide opportunities for early intervention. We developed an HRS risk model to assist early recognition of inpatient HRS. METHODS: We analysed a retrospective cohort of patients hospitalised at 122 medical centres in the US Department of Veterans Affairs between 1 January 2005 and 31 December 2013. We included cirrhotic patients who had acute kidney injury on admission, as defined by the Kidney Disease: Improving Global Outcomes criteria. We developed a logistic regression risk prediction model to detect HRS on admission using 10 variables. We calculated 95% confidence intervals on the model building dataset and, subsequently, calculated performance on a 1000-sample holdout test set. We report model performance with area under the curve (AUC) for discrimination and several calibration measures. RESULTS: The cohort included 19 368 patients comprising 32 047 inpatient admissions. The event rate for hospitalised HRS was 2810/31 047 (9.1%) and 79/1000 (7.9%) in the model building and validation datasets, respectively. The variable selection procedure produced a parsimonious model with ten predictor variables. In the validation dataset, the final model had an AUC of 0.87, a Brier score of 0.05, a calibration slope of 1.10 and an intercept of 0.04. CONCLUSIONS: We developed a probabilistic risk model to diagnose HRS within 24 hours of hospital admission using routine clinical variables in the largest ever published HRS cohort. The performance was excellent, and this model may help identify patients at high risk of HRS and promote early intervention.
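
For readers unfamiliar with the two headline metrics reported here, the sketch below shows how an AUC and a Brier score are computed from predicted probabilities. The four labels and probabilities are made-up illustrative values, not data from the study, and the helper names `auc` and `brier_score` are our own.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome
    (0 is perfect; lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Probability that a randomly chosen positive is scored above a
    randomly chosen negative, counting ties as half a win."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: two patients without the outcome, two with it.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

example_auc = auc(y_true, y_prob)            # 3 of 4 pairs ranked correctly
example_brier = brier_score(y_true, y_prob)  # approximately 0.158
```

The pairwise AUC above is quadratic in sample size, which is fine for illustration; rank-based formulas are the usual choice for large cohorts.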


Subject(s)
Hepatorenal Syndrome/diagnosis; Intensive Care Units; Patient Admission/statistics & numerical data; Severity of Illness Index; Acute Kidney Injury/diagnosis; Adult; Area Under Curve; Cohort Studies; Female; Hepatorenal Syndrome/epidemiology; Hospitalization/statistics & numerical data; Humans; Liver Cirrhosis/diagnosis; Logistic Models; Male; Middle Aged; Retrospective Studies
18.
J Biomed Inform ; 77: 1-10, 2018 01.
Article in English | MEDLINE | ID: mdl-29174994

ABSTRACT

OBJECTIVE: The traditional fee-for-service approach to healthcare can lead to the management of a patient's conditions in a siloed manner, inducing various negative consequences. It has been recognized that a bundled approach to healthcare, one that manages a collection of health conditions together, may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions should be managed in a bundled manner. In this study, we investigate whether a data-driven approach can automatically learn potential bundles. METHODS: We designed a framework to infer health condition collections (HCCs) based on the similarity of their clinical workflows, according to electronic medical record (EMR) utilization. We evaluated the framework with data from over 16,500 inpatient stays from Northwestern Memorial Hospital in Chicago, Illinois. The plausibility of the inferred HCCs for bundled care was assessed through an online survey of a panel of five experts, whose responses were analyzed via an analysis of variance (ANOVA) at a 95% confidence level. We further assessed the face validity of the HCCs using evidence in the published literature. RESULTS: The framework inferred four HCCs, indicative of (1) fetal abnormalities, (2) late pregnancies, (3) prostate problems, and (4) chronic diseases, with congestive heart failure featuring prominently. Each HCC was substantiated with evidence in the literature and was deemed plausible for bundled care by the experts at a statistically significant level. CONCLUSIONS: The findings suggest that an automated, EMR data-driven framework can provide a basis for discovering bundled care opportunities. Still, translating such findings into actual care management will require further refinement, implementation, and evaluation.


Subject(s)
Data Mining/methods; Delivery of Health Care/organization & administration; Electronic Health Records; Patient Care Bundles; Comorbidity; Humans; Machine Learning; Medical Informatics; Patient Care Management; Phenotype; Workflow
19.
J Biomed Inform ; 80: 87-95, 2018 04.
Article in English | MEDLINE | ID: mdl-29530803

ABSTRACT

OBJECTIVE: Hepatorenal Syndrome (HRS) is a devastating form of acute kidney injury (AKI) in advanced liver disease patients with high morbidity and mortality, but phenotyping algorithms have not yet been developed using large electronic health record (EHR) databases. We evaluated and compared multiple phenotyping methods to achieve an accurate algorithm for HRS identification. MATERIALS AND METHODS: A national retrospective cohort of patients with cirrhosis and AKI admitted to 124 Veterans Affairs hospitals was assembled from EHR data collected from 2005 to 2013. AKI was defined by the Kidney Disease: Improving Global Outcomes criteria. Five hundred and four hospitalizations were selected for manual chart review and served as the gold standard. EHR-based predictors were identified using structured and free-text clinical data, with the free text processed by natural language processing (NLP) using the clinical Text Analysis and Knowledge Extraction System. We explored several dimension reduction techniques for the NLP data, including newer high-throughput phenotyping and word embedding methods, and ascertained their effectiveness in identifying the phenotype without structured predictor variables. With the combined structured and NLP variables, we analyzed five phenotyping algorithms: penalized logistic regression, naïve Bayes, support vector machines, random forest, and gradient boosting. Calibration and discrimination metrics were calculated using 100 bootstrap iterations. In the final model, we report odds ratios and 95% confidence intervals. RESULTS: The area under the receiver operating characteristic curve (AUC) for the different models ranged from 0.73 to 0.93, with penalized logistic regression having the best discriminatory performance. Calibration for logistic regression was modest, but gradient boosting and support vector machines were superior. NLP identified 6985 variables; a priori variable selection performed similarly to dimensionality reduction using high-throughput phenotyping and semantic similarity informed clustering (AUC of 0.81-0.82). CONCLUSION: This study demonstrated improved phenotyping of a challenging AKI etiology, HRS, over ICD-9 coding. We also compared performance among multiple approaches to EHR-derived phenotyping, and found similar results between methods. Lastly, we showed that automated NLP dimension reduction is viable for acute illness.
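
The abstract mentions computing metrics over 100 bootstrap iterations. A minimal sketch of that general idea follows, assuming a simple percentile bootstrap and using accuracy as the metric for brevity; the study's actual resampling scheme and metrics are not specified here, and `bootstrap_metric`, `accuracy`, and the toy labels are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_metric(y_true, y_pred, metric, n_iter=100, seed=0):
    """Resample cases with replacement n_iter times and return the mean
    metric with an approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    values.sort()
    lower = values[int(0.025 * (n_iter - 1))]
    upper = values[int(0.975 * (n_iter - 1))]
    return sum(values) / n_iter, lower, upper

# Toy chart-review labels versus algorithm output (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

mean_acc, lower, upper = bootstrap_metric(y_true, y_pred, accuracy)
```

With a metric such as AUC, a resample that happens to contain only one class would need to be skipped or redrawn, which this sketch does not handle.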


Subject(s)
Algorithms; Diagnosis, Computer-Assisted/methods; Hepatorenal Syndrome/diagnosis; Phenotype; Acute Kidney Injury; Aged; Electronic Health Records; Female; Hepatorenal Syndrome/etiology; Hepatorenal Syndrome/physiopathology; Humans; Liver Cirrhosis/complications; Male; Middle Aged; Natural Language Processing; Odds Ratio; ROC Curve; Retrospective Studies; Support Vector Machine
20.
J Biomed Inform ; 66: 42-51, 2017 02.
Article in English | MEDLINE | ID: mdl-28007583

ABSTRACT

BACKGROUND: The last few years have witnessed an increasing number of clinical research networks (CRNs) focused on building large collections of data from electronic health records (EHRs), claims, and patient-reported outcomes (PROs). Many of these CRNs provide a service for the discovery of research cohorts with various health conditions, which is especially useful for rare diseases. Supporting patient privacy can enhance the scalability and efficiency of such processes; however, current practice mainly relies on policy, such as guidelines defined in the Health Insurance Portability and Accountability Act (HIPAA), which are insufficient for CRNs (e.g., HIPAA does not require encryption of data, which can mitigate insider threats). By combining policy with privacy enhancing technologies, we can enhance the trustworthiness of CRNs. The goal of this research is to determine whether searchable encryption can instill privacy in CRNs without sacrificing their usability. METHODS: We developed a technique, implemented in working software, to enable privacy-preserving cohort discovery (PPCD) services in large distributed CRNs based on elliptic curve cryptography (ECC). This technique also incorporates a block indexing strategy to improve the performance (in terms of computational running time) of PPCD. We evaluated the PPCD service with three real cohort definitions of varied query complexity: (1) elderly cervical cancer patients who underwent radical hysterectomy, (2) oropharyngeal and tongue cancer patients who underwent robotic transoral surgery, and (3) female breast cancer patients who underwent mastectomy. These definitions were tested in an encrypted database of 7.1 million records derived from the publicly available Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS). We assessed the performance of the PPCD service in terms of (1) accuracy in cohort discovery, (2) computational running time, and (3) privacy afforded to the underlying records during PPCD. RESULTS: The empirical results indicate that the proposed PPCD can execute cohort discovery queries in a reasonable amount of time, with query runtime in the range of 165-262 s for the 3 use cases, with zero compromise in accuracy. We further show that the search performance is practical because it supports a highly parallelized design for secure evaluation over encrypted records. Additionally, our security analysis shows that the proposed construction is resilient to standard adversaries. CONCLUSIONS: PPCD services can be designed for clinical research networks. The security construction presented in this work specifically achieves high privacy guarantees by preventing threats originating from both within and beyond the network.


Subject(s)
Computer Security; Electronic Health Records; Health Insurance Portability and Accountability Act; Confidentiality; Female; Humans; United States