ABSTRACT
OBJECTIVE: To provide a brief overview of artificial intelligence (AI) applications within the field of eating disorders (EDs) and propose focused solutions for research. METHOD: An overview and summary of AI applications pertinent to EDs is provided, with a focus on AI's ability to address issues relating to data sharing and pooling (and associated privacy concerns), data augmentation, and bias within datasets. RESULTS: In addition to clinical applications, AI offers useful tools to help address commonly encountered challenges in ED research, including the low prevalence of specific subpopulations of patients, small overall sample sizes, and bias within datasets. DISCUSSION: There is tremendous potential to embed and utilize various facets of AI to help improve our understanding of EDs and to further evaluate and investigate questions that ultimately seek to improve outcomes. Beyond the technology, regulation of AI, the establishment of ethical guidelines for its application, and the trust of providers and patients are all needed for its ultimate adoption and acceptance into ED practice. PUBLIC SIGNIFICANCE: Artificial intelligence (AI) holds significant promise within the realm of eating disorders (EDs) and encompasses a broad set of techniques that offer utility in various facets of ED research and, by extension, the delivery of clinical care. Beyond the technology, regulation, the establishment of ethical guidelines for application, and the trust of providers and patients are needed for the ultimate adoption and acceptance of AI into ED practice.
Subjects
Artificial Intelligence, Feeding and Eating Disorders, Humans, Feeding and Eating Disorders/therapy, Biomedical Research
ABSTRACT
The number of papers presenting machine learning (ML) models that are being submitted to and published in the Journal of Medical Internet Research and other JMIR Publications journals has steadily increased. Editors and peer reviewers involved in the review process for such manuscripts often go through multiple review cycles to enhance the quality and completeness of reporting. The use of reporting guidelines or checklists can help ensure consistency in the quality of submitted (and published) scientific manuscripts and, for example, avoid instances of missing information. In this Editorial, the editors of JMIR Publications journals discuss the general JMIR Publications policy regarding authors' application of reporting guidelines and specifically focus on the reporting of ML studies in JMIR Publications journals, using the Consolidated Reporting of Machine Learning Studies (CREMLS) guidelines, with an example of how authors and other journals could use the CREMLS checklist to ensure transparency and rigor in reporting.
Subjects
Machine Learning, Humans, Guidelines as Topic, Prognosis, Checklist
ABSTRACT
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets in which the records do not correspond to real individuals but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals in Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that evaluate the structure and general patterns in the data, as well as by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used the Hellinger distance to quantify the difference in distributions between the real and synthetic datasets for event types (0.027) and attributes (mean 0.0417), and to compare Markov transition matrices (order 1 mean absolute difference: 0.0896, SD: 0.159; order 2 mean Hellinger distance: 0.2195, SD: 0.2724). The Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from the real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both the real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produces analytic results similar to those from real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
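As a hedged illustration of the Hellinger-distance utility assessment described above, the sketch below compares the empirical distributions of a categorical variable (such as event type) in a real versus a synthetic dataset; the example data and column contents are hypothetical.

```python
import numpy as np
import pandas as pd

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def categorical_hellinger(real: pd.Series, synth: pd.Series) -> float:
    """Align categories, convert counts to proportions, and compare."""
    categories = sorted(set(real) | set(synth))
    p = real.value_counts(normalize=True).reindex(categories, fill_value=0).to_numpy()
    q = synth.value_counts(normalize=True).reindex(categories, fill_value=0).to_numpy()
    return hellinger(p, q)

# Hypothetical example: distribution of event types in real vs. synthetic claims.
real_events = pd.Series(["visit", "lab", "rx", "visit", "visit", "lab"])
synth_events = pd.Series(["visit", "lab", "rx", "rx", "visit", "lab"])
print(categorical_hellinger(real_events, synth_events))  # 0 = identical, 1 = disjoint
```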
Subjects
Disclosure, Privacy, Humans, Data Collection
ABSTRACT
BACKGROUND: The reporting of machine learning (ML) prognostic and diagnostic modeling studies is often inadequate, making it difficult to understand and replicate such studies. To address this issue, multiple consensus and expert reporting guidelines for ML studies have been published. However, these guidelines cover different parts of the analytics lifecycle, and individually, none of them provide a complete set of reporting requirements. OBJECTIVE: We aimed to consolidate the ML reporting guidelines and checklists in the literature to provide reporting items for prognostic and diagnostic ML in in-silico and shadow mode studies. METHODS: We conducted a literature search that identified 192 unique peer-reviewed English articles that provide guidance and checklists for reporting ML studies. The articles were screened by their title and abstract against a set of 9 inclusion and exclusion criteria. Articles that passed screening had their quality evaluated by 2 raters using a 9-point checklist constructed from guideline development good practices. The average κ was 0.71 across all quality criteria. The 17 articles with a quality score equal to or higher than the median were retained as high-quality source papers. The reporting items in these 17 articles were consolidated and screened against a set of 6 inclusion and exclusion criteria. The resulting reporting items were sent to an external group of 11 ML experts for review and updated accordingly. The updated checklist was used to assess the reporting in 6 recent modeling papers in JMIR AI. Feedback from the external review and initial validation efforts was used to improve the reporting items. RESULTS: In total, 37 reporting items were identified and grouped into 5 categories based on the stage of the ML project: defining the study details, defining and collecting the data, modeling methodology, model evaluation, and explainability. None of the 17 source articles covered all the reporting items. The study details and data description reporting items were the most common in the source literature, with explainability and methodology guidance (ie, data preparation and model training) having the least coverage. For instance, a median of 75% of the data description reporting items appeared in each of the 17 high-quality source guidelines, but only a median of 33% of the explainability reporting items appeared. The highest-quality source articles tended to have more items on reporting study details. Other categories of reporting items were not related to the source article quality. We converted the reporting items into a checklist to support more complete reporting. CONCLUSIONS: Our findings supported the need for a set of consolidated reporting items, given that existing high-quality guidelines and checklists do not individually provide complete coverage. The consolidated set of reporting items is expected to improve the quality and reproducibility of ML modeling studies.
Subjects
Checklist, Machine Learning, Humans, Prognosis, Reproducibility of Results, Consensus
ABSTRACT
This article provides a state-of-the-art summary of location privacy issues and geoprivacy-preserving methods in public health interventions and health research involving disaggregate geographic data about individuals. Synthetic data generation (from real data using machine learning) is discussed in detail as a promising privacy-preserving approach. To fully achieve their goals, privacy-preserving methods should form part of a wider, comprehensive socio-technical framework for the appropriate disclosure, use and dissemination of data containing personally identifiable information. Select highlights are also presented from a related December 2021 AAG (American Association of Geographers) webinar that explored ethical and other issues surrounding the use of geospatial data to address public health issues during challenging crises, such as the COVID-19 pandemic.
Subjects
COVID-19, Privacy, Confidentiality, Humans, Pandemics, Public Health, SARS-CoV-2, Social Justice
ABSTRACT
BACKGROUND: Novartis and the University of Oxford's Big Data Institute (BDI) have established a research alliance with the aim of improving health care and drug development by making them more efficient and targeted. By combining the latest statistical machine learning technology with an innovative IT platform developed to manage large volumes of anonymised data from numerous data sources and types, we plan to identify clinically relevant patterns that cannot be detected by humans alone, as well as phenotypes and early predictors of patient disease activity and progression. METHOD: The collaboration focuses on highly complex autoimmune diseases and develops a computational framework to assemble a research-ready dataset across numerous modalities. For the Multiple Sclerosis (MS) project, the collaboration has anonymised and integrated phase II to phase IV clinical and imaging trial data from ≈35,000 patients across all clinical phenotypes, collected in more than 2200 centres worldwide. For the "IL-17" project, the collaboration has anonymised and integrated clinical and imaging data from over 30 phase II and III Cosentyx clinical trials including more than 15,000 patients suffering from four autoimmune disorders (Psoriasis, Axial Spondyloarthritis, Psoriatic arthritis (PsA) and Rheumatoid arthritis (RA)). RESULTS: A fundamental component of successful data analysis and of the collaborative development of novel machine learning methods on these rich data sets has been the construction of a research informatics framework that captures the data at regular intervals, anonymises images, integrates them with the de-identified clinical data, and quality-controls and compiles the result into a research-ready relational database available to multi-disciplinary analysts. Collaborative development by a group of software developers, data wranglers, statisticians, clinicians, and domain scientists across both organisations has been key. This framework is innovative in that it facilitates collaborative data management and makes a complicated clinical trial data set from a pharmaceutical company available to academic researchers who become associated with the project. CONCLUSIONS: An informatics framework has been developed to capture clinical trial data into a pipeline of anonymisation, quality control, data exploration, and subsequent integration into a database. Establishing this framework has been integral to the development of analytical tools.
Subjects
Data Science, Information Dissemination, Factual Databases, Drug Development, Humans, Research Design
ABSTRACT
The application of big data, radiomics, machine learning, and artificial intelligence (AI) algorithms in radiology requires access to large data sets containing personal health information. Because machine learning projects often require collaboration between different sites or data transfer to a third party, precautions are required to safeguard patient privacy. Safety measures are required to prevent inadvertent access to and transfer of identifiable information. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI Ethical and Legal standing committee with the mandate to guide the medical imaging community in terms of best practices in data management, access to health care data, de-identification, and accountability practices. Part 2 of this article will inform CAR members on the practical aspects of medical imaging de-identification, the strengths and limitations of de-identification approaches, available de-identification software and tools, and perspectives on future directions.
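As a hedged, simplified illustration of image de-identification (not the CAR's recommended workflow), the sketch below blanks a handful of direct-identifier DICOM elements and strips private tags with pydicom; the tag list and file paths are assumptions, and a production pipeline would apply a complete confidentiality profile.

```python
import pydicom

# Hypothetical, minimal tag list -- not a complete confidentiality profile
# (a real pipeline would cover many more elements and imaging-specific risks).
DIRECT_IDENTIFIER_KEYWORDS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "ReferringPhysicianName", "InstitutionName",
]

def deidentify_dicom(in_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(in_path)
    for keyword in DIRECT_IDENTIFIER_KEYWORDS:
        if keyword in ds:            # only blank elements that are present
            setattr(ds, keyword, "")
    ds.remove_private_tags()         # drop vendor-specific private elements
    ds.save_as(out_path)

# Hypothetical paths:
# deidentify_dicom("study/slice_001.dcm", "deid/slice_001.dcm")
```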
Subjects
Artificial Intelligence/ethics, Data Anonymization/ethics, Diagnostic Imaging/ethics, Radiologists/ethics, Algorithms, Canada, Humans, Machine Learning, Medical Societies
ABSTRACT
The application of big data, radiomics, machine learning, and artificial intelligence (AI) algorithms in radiology requires access to large data sets containing personal health information. Because machine learning projects often require collaboration between different sites or data transfer to a third party, precautions are required to safeguard patient privacy. Safety measures are required to prevent inadvertent access to and transfer of identifiable information. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI Ethical and Legal standing committee with the mandate to guide the medical imaging community in terms of best practices in data management, access to health care data, de-identification, and accountability practices. Part 1 of this article will inform CAR members on principles of de-identification, pseudonymization, encryption, direct and indirect identifiers, k-anonymization, risks of reidentification, implementations, data set release models, and validation of AI algorithms, with a view to developing appropriate standards to safeguard patient information effectively.
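As a hedged sketch of one concept covered in Part 1, k-anonymization, the snippet below computes the smallest equivalence-class size over a set of quasi-identifiers; the column names and records are hypothetical.

```python
import pandas as pd

def min_equivalence_class_size(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Smallest group size over the quasi-identifiers; the dataset is
    k-anonymous for any k less than or equal to this value."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical quasi-identifiers in an imaging metadata extract.
records = pd.DataFrame({
    "year_of_birth": [1980, 1980, 1975, 1975, 1980],
    "sex": ["F", "F", "M", "M", "F"],
    "region": ["ON", "ON", "QC", "QC", "ON"],
})
k = min_equivalence_class_size(records, ["year_of_birth", "sex", "region"])
print(f"dataset satisfies k-anonymity with k = {k}")  # k = 2 for this toy extract
```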
Subjects
Artificial Intelligence/ethics, Data Anonymization/ethics, Diagnostic Imaging/ethics, Radiologists/ethics, Algorithms, Canada, Humans, Machine Learning, Medical Societies
ABSTRACT
BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
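The full meaningful-identity-disclosure model is not reproduced here; as a deliberately simplified, hedged sketch, the function below estimates the share of synthetic records whose quasi-identifier combination matches a unique real record, ignoring the verification and learn-something-new components of the published model; the data and columns are hypothetical.

```python
import pandas as pd

def naive_identity_match_rate(synthetic: pd.DataFrame, real: pd.DataFrame,
                              quasi_identifiers: list[str]) -> float:
    """Share of synthetic records whose quasi-identifier combination matches a
    real record that is unique in the real data -- only a crude proxy for the
    study's meaningful identity disclosure risk."""
    unique_real = real.groupby(quasi_identifiers).filter(lambda g: len(g) == 1)
    matched = synthetic.merge(unique_real[quasi_identifiers],
                              on=quasi_identifiers, how="inner")
    return len(matched) / len(synthetic)

# Hypothetical quasi-identifiers for a hospital discharge extract.
real = pd.DataFrame({"age_group": ["60-64", "60-64", "85+"],
                     "sex": ["M", "M", "F"],
                     "region": ["King", "King", "Pierce"]})
synth = pd.DataFrame({"age_group": ["60-64", "85+"],
                      "sex": ["M", "F"],
                      "region": ["King", "Pierce"]})
print(naive_identity_match_rate(synth, real, ["age_group", "sex", "region"]))  # 0.5
```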
Subjects
COVID-19/epidemiology, Disclosure/standards, Information Dissemination/methods, Humans, Reproducibility of Results, Risk Factors, SARS-CoV-2
ABSTRACT
OBJECTIVES: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine whether these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases. METHODS: We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall, in order to illustrate the difference between the proposed evaluation framework and the current baseline method. RESULTS: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. Using our evaluation framework, we obtained a mean probability of re-identification of 0.0074 for direct identifiers and 0.0022 for quasi-identifiers. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release, following commonly used de-identification criteria for structured data. DISCUSSION: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification. CONCLUSIONS: This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools.
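For contrast with the proposed framework, the snippet below shows the baseline metric it critiques, micro-averaged recall over the identifiers detected by a de-identification system; the per-document counts are hypothetical.

```python
def micro_average_recall(annotations: list[dict]) -> float:
    """Micro-averaged recall: total identifiers correctly detected divided by
    total identifiers present. This is the baseline metric the framework above
    argues can overstate protection, because it ignores which identifiers are
    missed and the context of the data release."""
    detected = sum(doc["true_positives"] for doc in annotations)
    total = sum(doc["true_positives"] + doc["false_negatives"] for doc in annotations)
    return detected / total if total else 1.0

# Hypothetical per-document counts from a de-identification run.
docs = [
    {"true_positives": 12, "false_negatives": 0},
    {"true_positives": 3,  "false_negatives": 2},   # leaked identifiers concentrate here
]
print(micro_average_recall(docs))  # ~0.88 overall, despite one risky document
```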
Subjects
Confidentiality, Data Anonymization, Electronic Health Records, Humans, Risk
ABSTRACT
OBJECTIVE: Some phase 1 clinical trials offer strong financial incentives for healthy individuals to participate in their studies. There is evidence that some individuals enroll in multiple trials concurrently. This creates safety risks and introduces data quality problems into the trials. Our objective was to construct a privacy-preserving protocol to track phase 1 participants to detect concurrent enrollment. DESIGN: We developed a protocol that uses secure probabilistic querying against a database of trial participants and allows for screening during telephone interviews and at on-site enrollment. The match variables consisted of demographic information. MEASUREMENT: The accuracy (sensitivity, precision, and negative predictive value) of the matching and its computational performance in seconds were measured in simulated environments. Accuracy was also compared to non-secure matching methods. RESULTS: The protocol's performance scales linearly with the database size. At the largest database size of 20,000 participants, a query takes under 20 seconds on a 64-core machine. Sensitivity, precision, and negative predictive value of the queries were consistently at or above 0.9 and were very similar to those of non-secure versions of the protocol. CONCLUSION: The protocol provides a reasonable solution to the concurrent enrollment problem in phase 1 clinical trials and is able to ensure that personal information about participants is kept secure.
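The study's secure probabilistic querying protocol is not reproduced here; as a hedged, exact-match simplification of the underlying idea, the sketch below derives a keyed hash of normalized demographic match variables so that sites can check for a prior enrollment without exchanging raw identifiers; the key, fields, and values are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"shared-secret-held-by-the-matching-service"   # hypothetical

def demographic_fingerprint(first_name: str, last_name: str,
                            birth_date: str, sex: str) -> str:
    """Keyed hash of normalized demographics. This is a simplified exact-match
    sketch; the study's protocol is probabilistic and tolerates minor
    discrepancies in the match variables."""
    normalized = "|".join(v.strip().lower() for v in (first_name, last_name, birth_date, sex))
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

enrolled = {demographic_fingerprint("Ana", "Silva", "1990-04-02", "F")}
candidate = demographic_fingerprint("ana", "silva ", "1990-04-02", "f")
print(candidate in enrolled)   # True: flag for concurrent-enrollment review
```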
Subjects
Clinical Trials as Topic, Confidentiality, Factual Databases, Data Accuracy, Humans, Statistics as Topic
ABSTRACT
Rapid release of prepublication data has served the field of genomics well. Attendees at a workshop in Toronto recommend extending the practice to other biological data sets.
Subjects
Access to Information, Guidelines as Topic, Publishing, Research, Cooperative Behavior, Human Genome Project, Humans, Ontario, Publishing/ethics, Publishing/standards, Research/standards, Research Personnel/ethics, Research Personnel/standards
ABSTRACT
BACKGROUND: A linear programming (LP) model was proposed to create de-identified data sets that maximally include spatial detail (e.g., geocodes such as ZIP or postal codes, census blocks, and locations on maps) while complying with the HIPAA Privacy Rule's Expert Determination method, i.e., ensuring that the risk of re-identification is very small. The LP model determines the transition probability from a patient's original location to a new, randomized location. However, it has a limitation for areas with a small population (e.g., a median of 10 people in a ZIP code). METHODS: We extend the previous LP model to accommodate cases where some locations have small populations, while still creating de-identified patient spatial data sets that ensure the risk of re-identification is very small. RESULTS: Our LP model was applied to a data set of 11,740 postal codes in the City of Ottawa, Canada. On this data set we demonstrated the limitations of the previous LP model, in that it produces improbable results, and showed how our extensions to deal with small areas allow the de-identification of the whole data set. CONCLUSIONS: The LP model described in this study can be used to de-identify geospatial information for areas with small populations with minimal distortion to postal codes. Our LP model can be extended to include other information, such as age and gender.
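As a hedged, toy illustration of casting location randomization as an LP (not the published model's objective or constraints), the snippet below chooses transition probabilities from one small-population postal code to a few candidate destinations, minimizing expected displacement subject to a cap on any single destination's probability mass; all numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical distances (km) from one small-population postal code to four
# candidate destination codes, and a toy cap on each destination's share.
distances = np.array([0.0, 1.2, 2.5, 4.0])   # objective: expected displacement
risk_cap = 0.6                               # no destination may absorb > 60% of the mass

# Minimize sum_j d_j * p_j  subject to  sum_j p_j = 1  and  0 <= p_j <= risk_cap.
result = linprog(
    c=distances,
    A_eq=np.ones((1, len(distances))),
    b_eq=[1.0],
    bounds=[(0.0, risk_cap)] * len(distances),
    method="highs",
)
print(result.x)   # transition probabilities from the original code to each candidate
```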
Subjects
Theoretical Models, Privacy, Linear Programming/standards, Spatial Analysis, Humans, Ontario/epidemiology, Patient Identification Systems/methods, Patient Identification Systems/standards
ABSTRACT
BACKGROUND: The increased use of human biological material for cell-based research and clinical interventions poses risks to the privacy of patients and donors, including the possibility of re-identification of individuals from anonymized cell lines and associated genetic data. These risks will increase as technologies and databases used for re-identification become affordable and more sophisticated. Policies that require ongoing linkage of cell lines to donors' clinical information for research and regulatory purposes, and existing practices that limit research participants' ability to control what is done with their genetic data, amplify the privacy concerns. DISCUSSION: To date, the privacy issues associated with cell-based research and interventions have not received much attention in the academic and policymaking contexts. This paper, arising out of a multi-disciplinary workshop, aims to rectify this by outlining the issues, proposing novel governance strategies and policy recommendations, and identifying areas where further evidence is required to make sound policy decisions. The authors of this paper take the position that existing rules and norms can be reasonably extended to address privacy risks in this context without compromising emerging developments in the research environment, and that exceptions from such rules should be justified using a case-by-case approach. In developing new policies, the broader framework of regulations governing cell-based research and related areas must be taken into account, as well as the views of impacted groups, including scientists, research participants and the general public. SUMMARY: This paper outlines deliberations at a policy development workshop focusing on privacy challenges associated with cell-based research and interventions. The paper provides an overview of these challenges, followed by a discussion of key themes and recommendations that emerged from discussions at the workshop. The paper concludes that privacy risks associated with cell-based research and interventions should be addressed through evidence-based policy reforms that account for both well-established legal and ethical norms and current knowledge about actual or anticipated harms. The authors also call for research studies that identify and address gaps in understanding of privacy risks.
Subjects
Anonymous Testing, Confidentiality, Informed Consent, Policy Making, Research Subjects, Stem Cell Research, Anonymous Testing/ethics, Anonymous Testing/legislation & jurisprudence, Confidentiality/ethics, Confidentiality/legislation & jurisprudence, Congresses as Topic, Female, Humans, Informed Consent/legislation & jurisprudence, Male, Research Design, Stem Cell Research/ethics, Stem Cell Research/legislation & jurisprudence
ABSTRACT
Data bias is a major concern in biomedical research, especially when evaluating large-scale observational datasets. It leads to imprecise predictions and inconsistent estimates in standard regression models. We compare the performance of commonly used bias-mitigating approaches (resampling, algorithmic, and post hoc approaches) against a synthetic data-augmentation method that uses sequential boosted decision trees to synthesize under-represented groups, an approach we call synthetic minority augmentation (SMA). Through simulations and analysis of real health datasets on a logistic regression workload, the approaches are evaluated across various bias scenarios (types and severity levels). Performance was assessed based on area under the curve, calibration (Brier score), precision of parameter estimates, confidence interval overlap, and fairness. Overall, SMA produces the results closest to the ground truth under low to medium bias (a missing proportion of 50% or less). Under high bias (a missing proportion of 80% or more), the advantage of SMA is not obvious, and no specific method consistently outperforms the others.
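The SMA method itself is not reproduced here; as a hedged sketch on simulated data, the snippet below simply computes two of the evaluation metrics named above (area under the curve and the Brier score) for a logistic regression workload with an under-represented group; the data-generating code and group variable are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Simulated data with an under-represented group (about 10% of rows) -- a
# stand-in for the kind of bias scenario evaluated in the study.
n = 2000
group = rng.binomial(1, 0.1, n)                 # 1 = under-represented group
x = rng.normal(size=n) + 0.5 * group
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x - 0.3 * group))))

X = np.column_stack([x, group])
model = LogisticRegression().fit(X, y)
prob = model.predict_proba(X)[:, 1]

# Two of the evaluation criteria used in the study (computed in-sample here
# for brevity; the study compares mitigation approaches against ground truth).
print("AUC:  ", roc_auc_score(y, prob))
print("Brier:", brier_score_loss(y, prob))
```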
ABSTRACT
BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. OBJECTIVE: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. METHODS: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
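As a hedged sketch of the general idea above, the snippet below uses plain CP decomposition via tensorly's parafac as a stand-in for the study's generalized CP (GCP) decomposition, and shows how a re-sampled patient factor matrix plugs back into the model; the re-sampling shown is deliberately naive (independent Gaussians), not the sequential-tree, copula, or Hamiltonian Monte Carlo samplers used in the paper, and the tensor is hypothetical.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Hypothetical longitudinal tensor: patients x time points x variables.
rng = np.random.default_rng(1)
tensor = tl.tensor(rng.poisson(2.0, size=(50, 10, 6)).astype(float))

# Plain CP decomposition as a stand-in for GCP; the patient factor matrix is
# the object the paper synthesizes and re-samples.
weights, factors = parafac(tensor, rank=4, n_iter_max=200)
patient_factors, time_factors, variable_factors = factors

# Naive re-sampling of patient factors, column by column, just to show how
# new "patients" are reconstructed from the factor model.
new_patients = rng.normal(patient_factors.mean(axis=0),
                          patient_factors.std(axis=0),
                          size=(25, patient_factors.shape[1]))
synthetic = tl.cp_to_tensor((weights, [new_patients, time_factors, variable_factors]))
print(synthetic.shape)   # (25, 10, 6): synthetic longitudinal records
```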
ABSTRACT
Synthetic data generation is being increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters obtained after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, and our study suggests that sequential synthesis has good replicability characteristics for common health research workloads.
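As a hedged sketch of the combining step described above, the function below implements one common statement of the combining rules for fully synthetic datasets (Raghunathan et al., 2003): the point estimate is the mean of the per-dataset estimates, and the variance is (1 + 1/m)·B − Ū, truncated at zero when negative. The input estimates are hypothetical, and this is not necessarily the exact estimator used in the study.

```python
import numpy as np

def combine_fully_synthetic(estimates, variances):
    """Combine per-dataset estimates from m fully synthetic datasets into a
    single point estimate and standard error (hedged sketch of the rules)."""
    q = np.asarray(estimates, dtype=float)     # e.g., log-odds ratios, one per dataset
    u = np.asarray(variances, dtype=float)     # squared standard errors, one per dataset
    m = len(q)
    q_bar = q.mean()                           # combined point estimate
    b = q.var(ddof=1)                          # between-synthesis variance
    u_bar = u.mean()                           # average within-synthesis variance
    t = max((1.0 + 1.0 / m) * b - u_bar, 0.0)  # truncate negative variance at zero
    return q_bar, np.sqrt(t)

# Hypothetical log-odds-ratio estimates and variances from 10 synthetic datasets.
est = [0.42, 0.39, 0.47, 0.44, 0.40, 0.45, 0.41, 0.43, 0.46, 0.38]
var = [0.0004] * 10
point, se = combine_fully_synthetic(est, var)
print(point, se)
```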
Subjects
Benchmarking, Disclosure, Computer Simulation, Logistic Models, Mental Processes
ABSTRACT
Patients, families, healthcare providers and funders face multiple comparable treatment options without knowing which provides the best quality of care. As a step towards improving this, the REthinking Clinical Trials (REaCT) pragmatic trials program started in 2014 to break down many of the traditional barriers to performing clinical trials. However, until other innovative methodologies become widely used, the impact of this program will remain limited. These innovations include the incorporation of near equivalence analyses and of artificial intelligence (AI) into clinical trial design. Near equivalence analyses allow for the comparison of different treatments (drug and non-drug) using quality of life, toxicity, cost-effectiveness, and pharmacokinetic/pharmacodynamic data. AI offers unique opportunities to maximize the information gleaned from clinical trials, reduce sample size estimates, and potentially "rescue" poorly accruing trials. On 2 May 2023, the first REaCT international symposium took place to connect clinicians and scientists, set goals and identify future avenues for investigator-led clinical trials. Here, we summarize the topics presented at this meeting to promote sharing and to support other similarly motivated groups in learning from and sharing their experiences.
Subjects
Neoplasms, Quality of Life, Humans, Artificial Intelligence, Health Personnel, Neoplasms/therapy, Quality of Health Care, Clinical Trials as Topic
ABSTRACT
BACKGROUND: Participants in medical forums often reveal personal health information about themselves in their online postings. To feel comfortable revealing sensitive personal health information, some participants may hide their identity by posting anonymously. They can do this by using fake identities, nicknames, or pseudonyms that cannot readily be traced back to them. However, individual writing styles have unique features, and it may be possible to determine the true identity of an anonymous user through authorship attribution analysis. Although there has been previous work on the authorship attribution problem, there has been a dearth of research on automated authorship attribution on medical forums. The focus of this paper is to demonstrate that character-based authorship attribution works better than word-based methods on medical forums. OBJECTIVE: The goal was to build a system that accurately attributes authorship of messages posted on medical forums. The Authorship Attributor system uses text analysis techniques to crawl medical forums and automatically correlate messages written by the same authors. Authorship Attributor processes unstructured texts regardless of the document type, context, and content. METHODS: The messages were labeled with the nicknames of the forum participants. We evaluated the system's performance through its accuracy on 6000 messages gathered from 2 medical forums on an in vitro fertilization (IVF) support website. RESULTS: Given 2 lists of candidate authors (30 and 50 candidates, respectively), we obtained F scores of 75% to 80% in detecting the authors of messages containing 100 to 150 words on average, and 97.9% for longer messages containing at least 300 words. CONCLUSIONS: Authorship can be successfully detected in short free-form messages posted on medical forums. This raises a concern about the meaningfulness of anonymous posting on such forums. Authorship attribution tools can be used to warn consumers wishing to post anonymously about the likelihood of their identity being determined.
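As a hedged, toy-scale sketch of character-based attribution (not the Authorship Attributor system itself), the snippet below trains a character n-gram model over a few hypothetical forum messages using scikit-learn; real experiments used thousands of messages per candidate-author list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical forum messages labeled by nickname.
messages = [
    "so excited, our second ivf cycle starts monday!!",
    "anyone else get headaches after the trigger shot??",
    "Day 3 transfer went well. Resting now and staying positive.",
    "Bloodwork came back fine. Transfer scheduled for Friday.",
]
authors = ["hopeful_mom", "hopeful_mom", "quiet_reader", "quiet_reader"]

# Character n-grams capture spelling, punctuation, and casing habits that
# word-based features miss -- the signal the study found most useful.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(messages, authors)
print(model.predict(["headaches again after this trigger shot?!"]))
```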
Subjects
Authorship, Confidentiality, Internet, Humans
ABSTRACT
BACKGROUND: Our objective was to develop a model for measuring re-identification risk that more closely mimics the behaviour of an adversary by accounting for repeated attempts at matching and verification of matches, and to apply it to evaluate the risk of re-identification for Canada's post-marketing adverse drug event database (ADE). Re-identification is only demonstrably plausible for deaths in ADE. A matching experiment between ADE records and virtual obituaries constructed from Statistics Canada vital statistics was simulated. A new re-identification risk measure is considered; it assumes that after gathering all the potential matches for a patient record (all records in the obituaries that are potential matches for an ADE record), an adversary tries to verify these potential matches. Two adversary scenarios were considered: (a) a mildly motivated adversary who will stop after one verification attempt, and (b) a highly motivated adversary who will attempt to verify all the potential matches and is limited only by practical or financial considerations. METHODS: The mean percentage of records in ADE that had a high probability of being re-identified was computed. RESULTS: Under scenario (a), the risk of re-identification from disclosing the province, age at death, gender, and exact date of the report is quite high, but the removal of province brings the risk down significantly. By generalizing the date of reporting to month and year only, and including all other variables, the risk is always low. All ADE records have a high risk of re-identification under scenario (b), but the plausibility of that scenario is limited because of the financial and practical deterrents, even for highly motivated adversaries. CONCLUSIONS: It is possible to disclose Canada's adverse drug event database while ensuring that plausible re-identification risks are acceptably low. Our new re-identification risk model is suitable for such risk assessments.
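As a hedged, heavily simplified sketch of scenario (a), the function below counts candidate obituary matches for each ADE death record on a set of disclosed variables and treats 1/candidates as the chance that a single verification attempt picks the right person; the data and match variables are hypothetical and this is not the paper's full risk model.

```python
import pandas as pd

def mean_one_attempt_risk(ade: pd.DataFrame, obituaries: pd.DataFrame,
                          match_vars: list[str]) -> float:
    """For each ADE death record, count candidate obituary matches on the
    disclosed variables and take 1/candidates as the success probability of a
    single verification attempt (0 when there is no candidate)."""
    counts = obituaries.groupby(match_vars).size().rename("candidates")
    merged = ade.merge(counts.reset_index(), on=match_vars, how="left")
    merged["risk"] = 1.0 / merged["candidates"].fillna(float("inf"))
    return float(merged["risk"].mean())

# Hypothetical match variables: province, sex, age at death, report month.
obits = pd.DataFrame({"prov": ["ON", "ON", "QC"], "sex": ["F", "F", "M"],
                      "age": [81, 81, 64], "month": ["2011-03"] * 3})
ade = pd.DataFrame({"prov": ["ON", "QC"], "sex": ["F", "M"],
                    "age": [81, 64], "month": ["2011-03", "2011-03"]})
print(mean_one_attempt_risk(ade, obits, ["prov", "sex", "age", "month"]))  # 0.75
```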