1.
Article De | MEDLINE | ID: mdl-38753021

The digital health progress hubs pilot the extensibility of the concepts and solutions of the Medical Informatics Initiative to improve regional healthcare and research. The six funded projects address different diseases, areas in regional healthcare, and methods of cross-institutional data linking and use. Despite the diversity of the scenarios and regional conditions, the technical, regulatory, and organizational challenges and barriers that the progress hubs encounter in the actual implementation of the solutions are often similar. This results in some common approaches to solutions, but also in political demands that go beyond the Health Data Utilization Act, which is considered a welcome improvement by the progress hubs. In this article, we present the digital progress hubs and discuss achievements, challenges, and approaches to solutions that enable the shared use of data from university hospitals and non-academic institutions in the healthcare system and can make a sustainable contribution to improving medical care and research.


Hospitals, University , Hospitals, University/organization & administration , Germany , Humans , Medical Record Linkage/methods , Electronic Health Records/trends , Models, Organizational , National Health Programs/trends , National Health Programs/organization & administration , Medical Informatics/organization & administration , Medical Informatics/trends , Digital Health
2.
Article De | MEDLINE | ID: mdl-38753022

The interoperability Working Group of the Medical Informatics Initiative (MII) is the platform for the coordination of overarching procedures, data structures, and interfaces between the data integration centers (DIC) of the university hospitals and national and international interoperability committees. The goal is the joint content-related and technical design of a distributed infrastructure for the secondary use of healthcare data that can be used via the Research Data Portal for Health. Important general conditions are data privacy and IT security for the use of health data in biomedical research. To this end, suitable methods are used in dedicated task forces to enable procedural, syntactic, and semantic interoperability for data use projects. The MII core dataset was developed as several modules with corresponding information models and implemented using the HL7® FHIR® standard to enable content-related and technical specifications for the interoperable provision of healthcare data through the DIC. International terminologies and consented metadata are used to describe these data in more detail. The overall architecture, including overarching interfaces, implements the methodological and legal requirements for a distributed data use infrastructure, for example, by providing pseudonymized data or by federated analyses. With these results of the Interoperability Working Group, the MII is presenting a future-oriented solution for the exchange and use of healthcare data, the applicability of which goes beyond the purpose of research and can play an essential role in the digital transformation of the healthcare system.


Health Information Interoperability , Humans , Datasets as Topic , Electronic Health Records , Germany , Health Information Interoperability/standards , Medical Informatics , Medical Record Linkage/methods , Systems Integration
3.
Stud Health Technol Inform ; 313: 124-128, 2024 Apr 26.
Article En | MEDLINE | ID: mdl-38682516

BACKGROUND: Electronic health records (EHRs) emerged as a digital record of the data generated in healthcare. OBJECTIVES: This paper compares the transfer times of EHRs over the Hypertext Transfer Protocol (HTTP) and WebSocket in both a local network and a wide area network (WAN). METHODS: A Python web application serving Fast Healthcare Interoperability Resources (FHIR) records was created, and the transfer times of the EHRs over both HTTP and WebSocket connections were measured. A total of 45,000 test Patient resources were transferred in Bundles of 20, 50, 100, and 200 resources. RESULTS: WebSocket showed much better transfer times for large amounts of data: 18 s shorter in the local network and 342 s shorter in the WAN for the transfer with 20 resources per Bundle. CONCLUSION: RESTful APIs are a convenient way to implement EHR servers; however, HTTP becomes a bottleneck when transferring large amounts of data. WebSocket shows better transfer times and is therefore superior in such situations. The problem can be addressed by developing a new communication protocol or by using network tunneling to handle large EHR data transfers.


Electronic Health Records , Humans , Medical Record Linkage/methods , Internet , Health Information Interoperability , Software
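
Editorial sketch for entry 3: the abstract does not include the study's code, so the following is only an illustration of how the number of Bundle transfers grows as the Bundle size shrinks, which is one plausible reason per-request overhead matters most for the 20-resource case. The function names, the `collection` Bundle type, and the placeholder Patient records are assumptions made for this example, not details from the paper.

```python
from typing import Iterator, List


def make_bundles(resources: List[dict], per_bundle: int) -> Iterator[dict]:
    """Yield FHIR-style Bundle dicts holding at most `per_bundle` entries each."""
    if per_bundle <= 0:
        raise ValueError("per_bundle must be positive")
    for start in range(0, len(resources), per_bundle):
        chunk = resources[start:start + per_bundle]
        yield {
            "resourceType": "Bundle",
            "type": "collection",  # assumed Bundle type, not stated in the abstract
            "entry": [{"resource": r} for r in chunk],
        }


# Hypothetical stand-ins for the 45,000 test Patient resources.
patients = [{"resourceType": "Patient", "id": str(i)} for i in range(45000)]


def bundle_count(per_bundle: int) -> int:
    """Number of separate Bundle transfers needed for all patients."""
    return sum(1 for _ in make_bundles(patients, per_bundle))


for size in (20, 50, 100, 200):
    # Prints 2250, 900, 450 and 225 transfers respectively.
    print(size, bundle_count(size))
```

Under these assumptions, the 20-resource setting needs ten times as many transfers as the 200-resource setting, so any fixed cost per request is paid ten times as often.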
4.
Article De | MEDLINE | ID: mdl-38684526

Healthcare data are an important resource in applied medical research and are available across multiple centers. However, it remains a challenge to enable standardized data exchange processes across federal states, each with its own laws and regulations. The Medical Informatics Initiative (MII) was founded in 2016 to implement processes that enable cross-clinic access to healthcare data in Germany. Several working groups (WGs) have been set up to coordinate standardized data structures (WG Interoperability), patient information and declarations of consent (WG Consent), and regulations on data exchange (WG Data Sharing). Here we present the most important results of the Data Sharing working group, which include agreed terms of use, legal regulations, and data access processes. They are already being implemented by the established Data Integration Centers (DIC) and Use and Access Committees (UACs). We describe the services that are necessary to provide researchers with standardized data access. They are implemented with the Research Data Portal for Health, among others. Since the pilot phase, 385 active researchers have used these processes, which, as of April 2024, has resulted in 19 registered projects and 31 submitted research applications.


Electronic Health Records , Information Dissemination , Humans , Biomedical Research , Electronic Health Records/statistics & numerical data , Germany , Health Services Research , Medical Informatics , Medical Record Linkage/methods , Models, Organizational
5.
Stat Methods Med Res ; 33(6): 966-980, 2024 Jun.
Article En | MEDLINE | ID: mdl-38592341

The Fellegi-Sunter model is a latent class model widely used in probabilistic linkage to identify records that belong to the same entity. Record linkage practitioners typically employ all available matching fields in the model on the premise that more fields convey greater information about the true match status and hence result in improved match performance. In the context of model-based clustering, it is well known that such a premise is incorrect and that the inclusion of noisy variables could compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly because of the ubiquity of missing data in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for missing data common in record linkage. Through simulation studies, our proposed method is shown to select the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the use of our proposed selection procedure to identify informative matching fields for probabilistic record linkage algorithms.


Algorithms , Latent Class Analysis , Medical Record Linkage , Humans , Medical Record Linkage/methods , Models, Statistical , Cluster Analysis , Computer Simulation
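
Editorial sketch for entry 5: to make the role of individual matching fields concrete, the example below computes the standard Fellegi-Sunter log-likelihood-ratio weights for each field. It is not the variable selection procedure from the paper. The field names and the m/u probabilities are hypothetical values chosen for illustration.

```python
import math
from typing import Dict, Tuple


def field_weights(m: float, u: float) -> Tuple[float, float]:
    """Return (agreement, disagreement) log2 weights for one field.

    m = P(field agrees | record pair is a true match)
    u = P(field agrees | record pair is a non-match)
    """
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def match_score(agreements: Dict[str, bool],
                params: Dict[str, Tuple[float, float]]) -> float:
    """Sum the per-field weights, assuming conditional independence of fields."""
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(*params[field])
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical parameters; "noisy" agrees equally often for matches and non-matches.
params = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "noisy": (0.50, 0.50),
}

print(field_weights(*params["noisy"]))  # both weights are 0.0: the field carries no information
print(match_score({"surname": True, "birth_year": True, "noisy": False}, params))
```

With known parameters, a field whose m and u are equal contributes nothing to the score. In practice m and u are estimated from data, so such fields can add estimation noise, which is the motivation the abstract gives for selecting fields rather than using all of them.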
6.
J Korean Med Sci ; 39(14): e127, 2024 Apr 15.
Article En | MEDLINE | ID: mdl-38622936

BACKGROUND: To overcome the limitations of relying on data from a single institution, many researchers have studied data linkage methodologies. Data linkage includes errors owing to legal issues surrounding personal information and technical issues related to data processing. Linkage errors affect selection bias, and external and internal validity. Therefore, quality verification for each connection method with adherence to personal information protection is an important issue. This study evaluated the linkage quality of linked data and analyzed the potential bias resulting from linkage errors. METHODS: This study analyzed claims data submitted to the Health Insurance Review and Assessment Service (HIRA DATA). The linkage errors of the two deterministic linkage methods were evaluated based on the use of the match key. The first deterministic linkage uses a unique identification number, and the second deterministic linkage uses the name, gender, and date of birth as a set of partial identifiers. The linkage error of the second method was assessed using Cohen's absolute standardized difference (ASD) across baseline characteristics, and the linkage quality was evaluated through the following indicators: linked rate, false match rate, missed match rate, positive predictive value, sensitivity, specificity, and F1-score. RESULTS: For the deterministic linkage method that used the name, gender, and date of birth as a set of partial identifiers, the true match rate was 83.5% and the missed match rate was 16.5%. Although there was bias in some characteristics of the data, most of the ASD values were less than 0.1, with no case greater than 0.5. Therefore, it is difficult to determine whether linked data constructed with deterministic linkages have substantial differences. CONCLUSION: This study confirms the possibility of building health and medical data at the national level as the first data linkage quality verification study using big data from the HIRA. Analyzing the quality of linkages is crucial for comprehending linkage errors and generating reliable analytical outcomes. Linkers should increase the reliability of linked data by providing linkage error-related information to researchers. The results of this study will serve as reference data to increase the reliability of multicenter data linkage studies.


Information Storage and Retrieval , Medical Record Linkage , Humans , Reproducibility of Results , Medical Record Linkage/methods , Predictive Value of Tests , Health Services
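
Editorial sketch for entry 6: the abstract lists several linkage quality indicators without giving formulas. The example below uses one common set of definitions based on a confusion matrix of record pairs. These definitions are an assumption, since the study's exact formulas are not stated, and the counts are hypothetical values picked only so that sensitivity comes out at 0.835.

```python
def linkage_quality(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Linkage quality indicators from pair-level counts.

    tp: true matches that were linked      fp: non-matches that were linked
    fn: true matches that were not linked  tn: non-matches that were not linked
    """
    sensitivity = tp / (tp + fn)  # share of true matches that were found
    ppv = tp / (tp + fp)          # share of links that are correct
    return {
        "sensitivity": sensitivity,
        "missed_match_rate": fn / (tp + fn),  # complement of sensitivity
        "positive_predictive_value": ppv,
        "false_match_rate": fp / (tp + fp),   # complement of PPV (assumed definition)
        "specificity": tn / (tn + fp),
        "f1_score": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Hypothetical counts, not taken from the study.
quality = linkage_quality(tp=835, fp=20, fn=165, tn=8980)
for name, value in quality.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions the missed match rate and sensitivity always sum to 1, which matches the pattern of the two percentages reported in the abstract.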
7.
Int J Popul Data Sci ; 9(1): 2137, 2024.
Article En | MEDLINE | ID: mdl-38425790

Introduction: Recent years have seen an increase in linkages between survey and administrative data. It is important to evaluate the quality of such data linkages to discern the likely reliability of ensuing research. Evaluation of linkage quality and bias can be conducted using different approaches, but many of these are not possible when there is a separation of processes for linkage and analysis to help preserve privacy, as is typically the case in the UK (and elsewhere). Objectives: We aimed to describe a suite of generalisable methods to evaluate linkage quality and population representativeness of linked survey and administrative data which remain tractable when users of the linked data are not party to the linkage process itself. We emphasise issues particular to longitudinal survey data throughout. Methods: Our proposed approaches cover several areas: i) Linkage rates, ii) Selection into response, linkage consent and successful linkage, iii) Linkage quality, and iv) Linked data population representativeness. We illustrate these methods using a recent linkage between the 1958 National Child Development Study (NCDS; a cohort following an initial 17,415 people born in Great Britain in a single week of 1958) and Hospital Episode Statistics (HES) databases (containing important information regarding admissions, accident and emergency attendances and outpatient appointments at NHS hospitals in England). Results: Our illustrative analyses suggest that the linkage quality of the NCDS-HES data is high and that the linked sample maintains an excellent level of population representativeness with respect to the single dimension we assessed. Conclusions: Through this work we hope to encourage providers and users of linked data resources to undertake and publish thorough evaluations. We further hope that providing illustrative analyses using linked NCDS-HES data will improve the quality and transparency of research using this particular linked data resource.


Child Development , Medical Record Linkage , Child , Humans , Reproducibility of Results , Medical Record Linkage/methods , Hospitalization , Hospitals
8.
Int J Med Inform ; 185: 105387, 2024 May.
Article En | MEDLINE | ID: mdl-38428200

BACKGROUND: Cancer registries link a large number of electronic health records reported by medical institutions to already registered records of the matching individual and tumor. Records are automatically linked using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. For application, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries. METHODS: We compare the tradeoff between linkage quality and manual effort of five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) to a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). A cancer registry collected and linked the dataset consisting of categorical variables matching 145,755 reported records with 33,289 registered tumors. RESULTS: In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F1 score (accuracy: 0.968-0.978, F1 score: 0.983-0.988) than the deterministic baseline (accuracy: 0.964, F1 score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F1 score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases. CONCLUSION: Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed.


Electronic Health Records , Neoplasms , Humans , Medical Record Linkage/methods , Neoplasms/epidemiology , Registries , Databases, Factual
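
Editorial sketch for entry 8: the trade-off between linkage quality and manual effort in a three-class setting can be illustrated with two score thresholds. This is a generic illustration, not the models or thresholds used in the study. The scores and threshold values are hypothetical.

```python
from typing import List, Tuple


def classify(score: float, lower: float, upper: float) -> str:
    """Map a match probability to no-match / undecided / match."""
    if score >= upper:
        return "match"
    if score < lower:
        return "no-match"
    return "undecided"  # routed to manual review


def triage(scores: List[float], lower: float, upper: float) -> Tuple[List[str], float]:
    """Return the labels and the share of pairs that need manual processing."""
    labels = [classify(s, lower, upper) for s in scores]
    manual_share = labels.count("undecided") / len(labels)
    return labels, manual_share


# Hypothetical classifier outputs for five candidate record pairs.
scores = [0.02, 0.35, 0.55, 0.91, 0.99]

print(triage(scores, lower=0.3, upper=0.7))  # two of five pairs go to manual review
print(triage(scores, lower=0.5, upper=0.5))  # equal thresholds: two-class setting, no manual review
```

Widening the gap between the two thresholds sends more pairs to manual review, and setting them equal removes the undecided class entirely.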
9.
Aust Health Rev ; 48(1): 8-15, 2024 Feb.
Article En | MEDLINE | ID: mdl-38118279

Objective: Data linkage is a very powerful research tool in epidemiology; however, establishing it can be a lengthy and intensive process. This paper reports on the complex landscape of conducting data linkage projects in Australia. Methods: We reviewed the processes, documentation, and applications required to conduct multi-jurisdictional data linkage across Australia in 2023. Results: Obtaining the necessary approvals to conduct linkage will likely take nearly 2 years (estimated 730 days, including 605 days from initial submission to obtaining all ethical approvals and an estimated further 125 days for the issuance of unexpected additionally required approvals). Ethical review for linkage projects ranged from 51 to 128 days from submission to ethical approval, and applications consisted of 9-25 documents. Conclusions: Major obstacles to conducting multi-jurisdictional data linkage included the complexity of the process and substantial time and financial costs. The process was characterised by inefficiencies at several levels, reduplication, and a lack of any key accountabilities for timely performance of processes. Data linkage is an invaluable resource for epidemiological research. Further streamlining, established accountability, and greater collaboration between jurisdictions are needed to ensure data linkage is both accessible and feasible for researchers.


Heart Defects, Congenital , Medical Record Linkage , Humans , Medical Record Linkage/methods , Registries , Australia/epidemiology , Information Storage and Retrieval , Heart Defects, Congenital/epidemiology
10.
PLoS One ; 18(10): e0291581, 2023.
Article En | MEDLINE | ID: mdl-37862306

Research with administrative records involves the challenge of limited information in any single data source to answer policy-related questions. Record linkage provides researchers with a tool to supplement administrative datasets with other information about the same people when identified in separate sources as matched pairs. Several solutions are available for undertaking record linkage, producing linkage keys for merging data sources for positively matched pairs of records. In the current manuscript, we demonstrate a new application of the Python RecordLinkage package to family-based record linkages with machine learning algorithms for probability scoring, which we call probabilistic record linkage for families (PRLF). First, a simulation of administrative records identifies PRLF accuracy with variations in match and data degradation percentages. Accuracy is largely influenced by degradation (e.g., missing data fields, mismatched values) compared to the percentage of simulated matches. Second, an application of data linkage is presented to compare regression model estimate performance across three record linkage solutions (PRLF, ChoiceMaker, and Link Plus). Our findings indicate that all three solutions, when optimized, provide similar results for researchers. Strengths of our process, such as the use of ensemble methods, to improve match accuracy are discussed. We then identify caveats of record linkage in the context of administrative data.


Algorithms , Medical Record Linkage , Humans , Medical Record Linkage/methods , Computer Simulation , Probability , Information Storage and Retrieval
11.
Stat Med ; 42(27): 4931-4951, 2023 Nov 30.
Article En | MEDLINE | ID: mdl-37652076

In many healthcare and social science applications, information about units is dispersed across multiple data files. Linking records across files is necessary to estimate the associations of interest. Common record linkage algorithms only rely on similarities between linking variables that appear in all the files. Moreover, analysis of linked files often ignores errors that may arise from incorrect or missed links. Bayesian record linking methods allow for natural propagation of linkage error, by jointly sampling the linkage structure and the model parameters. We extend an existing Bayesian record linkage method to integrate associations between variables exclusive to each file being linked. We show analytically, and using simulations, that the proposed method can improve the linking process, and can result in accurate inferences. We apply the method to link Meals on Wheels recipients to Medicare enrollment records.


Medical Record Linkage , Medicare , Aged , Humans , United States , Bayes Theorem , Medical Record Linkage/methods , Algorithms
12.
Gesundheitswesen ; 85(S 02): S154-S161, 2023 Mar.
Article De | MEDLINE | ID: mdl-36940697

BACKGROUND: The aim of the project "Effectiveness of care in oncological centres" (WiZen), funded by the innovation fund of the federal joint committee, is to investigate the effectiveness of certification in oncology. The project uses nationwide data from the statutory health insurance AOK and data from clinical cancer registries from three different federal states from 2006-2017. To combine the strengths of both data sources, these will be linked for eight different cancer entities in compliance with data protection regulations. METHODS: Data linkage was performed using indirect identifiers and validated using the health insurance's patient ID ("Krankenversichertennummer") as a direct identifier and gold standard. This enables quantification of the quality of different linkage variants. Sensitivity and specificity as well as hit accuracy and a score addressing the quality of the linkage were used as evaluation criteria. The distributions of relevant variables resulting from the linkage were validated against the original distributions in the individual datasets. RESULTS: Depending on the combination of indirect identifiers, we found a range of 22,125 to 3,092,401 linkage hits. An almost perfect linkage could be achieved by combining information on cancer type, date of birth, gender and postal code. A total of 74,586 one-to-one linkages were achieved with these characteristics. The median hit quality for the different entities was more than 98%. In addition, both the age and sex distributions and the dates of death, if any, showed a high degree of agreement. DISCUSSION AND CONCLUSION: SHI and cancer registry data can be linked with high internal and external validity at the individual level. This robust linkage enables completely new possibilities for analysis through simultaneous access to variables from both data sets ("the best of both worlds"): Information on the UICC stage that stems from the registries can now be combined, for instance, with comorbidities from the SHI data at the individual level. Due to the use of readily available variables and the high success of the linkage, our procedure constitutes a promising method for future linkage processes in health care research.


Neoplasms , Routinely Collected Health Data , Humans , Germany/epidemiology , Registries , Information Storage and Retrieval , Insurance, Health , Neoplasms/epidemiology , Medical Record Linkage/methods
13.
Community Dent Oral Epidemiol ; 51(1): 75-78, 2023 02.
Article En | MEDLINE | ID: mdl-36749677

OBJECTIVES: Poor oral health, impacting health and wellbeing across the life-course, is a costly and wicked problem. Data (or record) linkage is the linking of different sets of data (often administrative data gathered for non-research purposes) that are matched to an individual and may include records such as medical data, housing information and sociodemographic information. It often uses population-level data or 'big data'. Data linkage provides the opportunity to analyse complex associations from different sources for total populations. The aim of the paper is to explore data linkage, how it is important for oral health research and what promise it holds for the future. METHODS: This is a narrative review of an approach (data linkage) in oral health research. RESULTS: Data linkage may be a powerful method for bringing together various population datasets. It has been used to explore a wide variety of topics with many varied datasets. It has substantial current and potential application in oral health research. CONCLUSIONS: Use of population data linkage is increasing in oral health research where the approach has been very useful in exploring the complexity of oral health. It offers promise for exploring many new areas in the field.


Medical Record Linkage , Oral Health , Humans , Medical Record Linkage/methods , Information Storage and Retrieval
14.
Int J Epidemiol ; 52(1): 214-226, 2023 02 08.
Article En | MEDLINE | ID: mdl-35748342

BACKGROUND: Methods for linking records between two datasets are well established. However, guidance is needed for linking more than two datasets. Using all 'pairwise linkages'-linking each dataset to every other dataset-is the most inclusive, but resource-intensive, approach. The 'spine' approach links each dataset to a designated 'spine dataset', reducing the number of linkages, but potentially reducing linkage quality. METHODS: We compared the pairwise and spine linkage approaches using real-world data on patients undergoing emergency bowel cancer surgery between 31 October 2013 and 30 April 2018. We linked an administrative hospital dataset (Hospital Episode Statistics; HES) capturing patients admitted to hospitals in England, and two clinical datasets comprising patients diagnosed with bowel cancer and patients undergoing emergency bowel surgery. RESULTS: The spine linkage approach, with HES as the spine dataset, created an analysis cohort of 15 826 patients, equating to 98.3% of the 16 100 patients identified using the pairwise linkage approach. There were no systematic differences in patient characteristics between these analysis cohorts. Associations of patient and tumour characteristics with mortality, complications and length of stay were not sensitive to the linkage approach. When eligibility criteria were applied before linkage, spine linkage included 14 509 patients (90.0% compared with pairwise linkage). CONCLUSION: Spine linkage can be used as an efficient alternative to pairwise linkage if case ascertainment in the spine dataset and data quality of linkage variables are high. These aspects should be systematically evaluated in the nominated spine dataset before spine linkage is used to create the analysis cohort.


Colorectal Neoplasms , Electronic Health Records , Humans , Medical Record Linkage/methods , Hospitals , Hospitalization
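
Editorial sketch for entry 14: the resource argument for spine linkage follows from counting how many dataset-to-dataset linkages each approach needs. The example below only enumerates the linkage plans and performs no record matching. The names of the two clinical datasets are hypothetical placeholders.

```python
from itertools import combinations
from typing import List, Tuple


def pairwise_plan(datasets: List[str]) -> List[Tuple[str, str]]:
    """Link every dataset to every other dataset: n * (n - 1) / 2 linkages."""
    return list(combinations(datasets, 2))


def spine_plan(datasets: List[str], spine: str) -> List[Tuple[str, str]]:
    """Link every non-spine dataset to the designated spine: n - 1 linkages."""
    if spine not in datasets:
        raise ValueError("spine must be one of the datasets")
    return [(spine, d) for d in datasets if d != spine]


# HES is named in the abstract; the other two names are placeholders.
datasets = ["HES", "cancer_audit", "surgery_audit"]

print(pairwise_plan(datasets))            # 3 linkages
print(spine_plan(datasets, spine="HES"))  # 2 linkages
```

With three datasets the saving is small, but the gap grows quickly: six datasets would need 15 pairwise linkages against 5 spine linkages. The cost, as the abstract notes, is that records missing from the spine cannot be linked at all.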
15.
BMC Res Notes ; 15(1): 337, 2022 Oct 31.
Article En | MEDLINE | ID: mdl-36316778

OBJECTIVE: The aim of this study was to determine whether a secure, privacy-preserving record linkage (PPRL) methodology can be implemented in a scalable manner for use in a large national clinical research network. RESULTS: We established the governance and technical capacity to support the use of PPRL across the National Patient-Centered Clinical Research Network (PCORnet®). As a pilot, four sites used the Datavant software to transform patient personally identifiable information (PII) into de-identified tokens. We queried the sites for patients with a clinical encounter in 2018 or 2019 and matched their tokens to determine whether overlap existed. We described patient overlap among the sites and generated a "deduplicated" table of patient demographic characteristics. Overlapping patients were found in 3 of the 6 site-pairs. Following deduplication, the total patient count was 3,108,515 (0.11% reduction), with the largest reduction in count for patients with an "Other/Missing" value for Sex, from 198 to 163 (17.6% reduction). The PPRL solution successfully links patients across data sources using distributed queries without directly accessing patient PII. The overlap queries and analysis performed in this pilot are being replicated across the full network to provide additional insight into patient linkages among a distributed research network.


Electronic Health Records , Privacy , Humans , Medical Record Linkage/methods , Databases, Factual , Patient-Centered Care
16.
J Am Med Inform Assoc ; 29(12): 2105-2109, 2022 11 14.
Article En | MEDLINE | ID: mdl-36305781

Healthcare systems are hampered by incomplete and fragmented patient health records. Record linkage is widely accepted as a solution to improve the quality and completeness of patient records. However, there is no systematic approach for manually reviewing patient records to create gold standard record linkage data sets. We propose a robust framework for creating and evaluating manually reviewed gold standard data sets for measuring the performance of patient matching algorithms. Our 8-point approach covers data preprocessing, blocking, record adjudication, linkage evaluation, and reviewer characteristics. This framework can help record linkage method developers provide necessary transparency when creating and validating gold standard reference matching data sets. In turn, this transparency will support both the internal and external validity of record linkage studies and improve the robustness of new record linkage strategies.


Health Records, Personal , Medical Record Linkage , Humans , Medical Record Linkage/methods , Algorithms , Information Storage and Retrieval , Data Collection
17.
Epidemiol Serv Saude ; 31(3): e20211272, 2022.
Article En, Pt | MEDLINE | ID: mdl-36287481

OBJECTIVE: To present a standardized methodology for linking different public health databases. METHODS: This was a methodological review article specifically describing data processing procedures for deterministic linkage between structured databases. It instructs on how to treat data, select linkage keys, and link databases, using two databases simulated in R software. RESULTS: The commands used for deterministic linkage of the inner_join type were presented. The linkage process resulted in a database with 40,108 pairs using only the "Name" key. Adding the second key, "Name of mother", the result dropped to 112 pairs. By adding the third key, "Date of birth", only two pairs were identified. CONCLUSION: Database linkage and its analysis are valid and valuable tools for health services in supporting health surveillance actions.


Information Storage and Retrieval , Medical Record Linkage , Humans , Medical Record Linkage/methods , Brazil , Databases, Factual , Software
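
Editorial sketch for entry 17: the article demonstrates deterministic linkage with inner_join in R. The example below reproduces the same idea in plain Python for illustration only, showing how the number of candidate pairs shrinks as linkage keys are added. The records, field names, and the `inner_join` helper are all hypothetical and are not the article's code or data.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def inner_join(left: List[Record], right: List[Record],
               keys: List[str]) -> List[Tuple[Record, Record]]:
    """Return all (left, right) pairs whose values agree exactly on every key."""
    index: Dict[tuple, List[Record]] = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    pairs = []
    for rec in left:
        for match in index.get(tuple(rec[k] for k in keys), []):
            pairs.append((rec, match))
    return pairs


# Two tiny simulated databases (hypothetical records).
db_a = [
    {"id": "A1", "name": "MARIA SILVA", "mother": "ANA SILVA", "dob": "1990-01-01"},
    {"id": "A2", "name": "MARIA SILVA", "mother": "RITA SILVA", "dob": "1985-05-05"},
    {"id": "A3", "name": "JOAO SOUZA", "mother": "LUCIA SOUZA", "dob": "1970-03-03"},
]
db_b = [
    {"id": "B1", "name": "MARIA SILVA", "mother": "ANA SILVA", "dob": "1990-01-01"},
    {"id": "B2", "name": "MARIA SILVA", "mother": "ANA SILVA", "dob": "1992-02-02"},
    {"id": "B3", "name": "JOAO SOUZA", "mother": "CARLA SOUZA", "dob": "1970-03-03"},
]

for keys in (["name"], ["name", "mother"], ["name", "mother", "dob"]):
    # Prints 5, then 2, then 1 pair as keys are added.
    print(keys, len(inner_join(db_a, db_b, keys)))
```

A common name on its own produces many spurious pairs, and each additional key filters them, which mirrors the drop from 40,108 to 112 to 2 pairs reported in the abstract.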
18.
Appl Clin Inform ; 13(4): 901-909, 2022 08.
Article En | MEDLINE | ID: mdl-36170880

BACKGROUND: Chronic kidney disease (CKD) is a major global health problem that affects approximately one in 10 adults. Up to 90% of individuals with CKD go undetected until its progression to advanced stages, invariably leading to death in the absence of treatment. The project aims to fill information gaps around the burden of CKD in the Western Australian (WA) population, including incidence, prevalence, rate of progression, and economic cost to the health system. METHODS: Given the sensitivity of the information involved, the project employed a privacy-preserving record linkage methodology to link data from four major pathology providers in WA to hospital records, to establish a CKD registry with a continuous medical record for individuals with biochemical specification for CKD. This method uses encrypted personal identifying information in a probability-based linkage framework (Bloom filters) to help mitigate risk while maximizing linkage quality. RESULTS: The project developed interoperable technology to create a transparent CKD data catalogue which is linkable to other datasets. This technology has been designed to support the aspirations of the research program to provide linked de-identified pathology, morbidity, and mortality data that can be used to derive insights to enable better CKD patient outcomes. The cohort includes over 1 million individuals with creatinine results over the period 2002 to 2021. CONCLUSION: Using linked data from across the care continuum, researchers are able to evaluate the effectiveness of service delivery and provide evidence for policy and program development. The CKD registry will enable an innovative review of the epidemiology of CKD in WA. Linking pathology records can identify cases of CKD that are missed in the early stages due to disaggregation of results, enabling identification of at-risk populations that represent targets for early intervention and management.


Privacy , Renal Insufficiency, Chronic , Adult , Australia , Creatinine , Humans , Medical Record Linkage/methods , Renal Insufficiency, Chronic/diagnosis , Renal Insufficiency, Chronic/epidemiology , Renal Insufficiency, Chronic/therapy , Semantic Web
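
Editorial sketch for entry 18: the abstract mentions Bloom filters as the basis for privacy-preserving linkage. The example below is a minimal, generic illustration of that technique (keyed hashing of character bigrams into a bit set, compared with the Dice coefficient). It is not the project's implementation. The filter size, number of hash functions, secret key, and function names are assumptions made for this example.

```python
import hashlib
import hmac
from typing import Set


def bigrams(text: str) -> Set[str]:
    """Split a padded, lower-cased string into overlapping character pairs."""
    padded = f"_{text.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text: str, secret: bytes, size: int = 1000, num_hashes: int = 20) -> Set[int]:
    """Return the set of bit positions switched on in the Bloom filter."""
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            message = f"{k}|{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice coefficient between two encoded values (1.0 means identical bit sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-linkage-key"  # hypothetical key agreed between data providers
smith = bloom_encode("Smith", secret)
smyth = bloom_encode("Smyth", secret)
jones = bloom_encode("Jones", secret)

print(dice(smith, smith))  # 1.0
print(dice(smith, smyth) > dice(smith, jones))  # spelling variants stay more similar
```

The design point is that similar names share most bigrams and therefore most bit positions, so approximate matching remains possible on the encoded values without exchanging the names themselves.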
19.
J Med Internet Res ; 24(9): e33775, 2022 09 29.
Article En | MEDLINE | ID: mdl-36173664

BACKGROUND: Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage. OBJECTIVE: In this study, we aimed to evaluate the extent to which incorporating the missing at random (MAR) assumption in the Fellegi-Sunter model and using data-driven field selection improve patient-matching accuracy in real-world use cases. METHODS: We adapted the Fellegi-Sunter model to accommodate missing data using the MAR assumption and compared the adaptation to the common strategy of treating missing values as disagreement, with matching fields either specified by experts or selected by data-driven methods. We used 4 use cases, each containing a random sample of record pairs with match statuses ascertained by manual review. The use cases, which represent real-world clinical and public health scenarios, were health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records. Matching performance was evaluated using the sensitivity, specificity, positive predictive value, negative predictive value, and F1-score. RESULTS: Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F1-scores, regardless of whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption and data-driven fields optimized the F1-scores in the 4 use cases. CONCLUSIONS: MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F1-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven selection of fields coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.


Health Information Exchange , Medical Record Linkage , Algorithms , Humans , Infant, Newborn , Medical Record Linkage/methods , Registries , Research Design
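
Note: the two missing-data strategies compared in the abstract above can be sketched as follows. This is a simplified illustration, not the authors' model. It assumes exact-match field comparisons, and the field names and the m- and u-probabilities in `FIELD_PARAMS` are hypothetical values, not estimates from the study. Under the MAR simplification used here, a field with a missing value contributes no evidence (weight 0); under the common alternative, it is scored as a disagreement.

```python
import math
from typing import Optional

# Hypothetical parameters per field:
# m = P(fields agree | records match), u = P(fields agree | records do not match)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.005),
    "zip_code": (0.85, 0.05),
}


def compare(a: Optional[str], b: Optional[str]) -> Optional[bool]:
    """Return True/False for agreement, or None if either value is missing."""
    if a is None or b is None:
        return None
    return a == b


def match_weight(rec_a: dict, rec_b: dict, missing_as_disagreement: bool) -> float:
    """Sum Fellegi-Sunter log2 weights over all fields for one record pair."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        outcome = compare(rec_a.get(field), rec_b.get(field))
        if outcome is None:
            if not missing_as_disagreement:
                continue  # MAR simplification: missing field adds no evidence
            outcome = False  # common strategy: count missing as disagreement
        total += math.log2(m / u) if outcome else math.log2((1 - m) / (1 - u))
    return total


rec_a = {"last_name": "nguyen", "birth_date": "1980-02-14", "zip_code": None}
rec_b = {"last_name": "nguyen", "birth_date": "1980-02-14", "zip_code": "46202"}

w_mar = match_weight(rec_a, rec_b, missing_as_disagreement=False)
w_dis = match_weight(rec_a, rec_b, missing_as_disagreement=True)
print(f"MAR weight: {w_mar:.2f}, missing-as-disagreement weight: {w_dis:.2f}")
```

For this pair, the missing ZIP code lowers the total weight only under the missing-as-disagreement strategy, which is one way a true match can fall below a decision threshold when data are incomplete.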
20.
PLoS One ; 17(9): e0267893, 2022.
Article En | MEDLINE | ID: mdl-36137086

Linking several databases containing information on the same person is an essential step of many data workflows. Due to the potential sensitivity of the data, the identity of the persons should be kept private. Privacy-preserving record linkage (PPRL) techniques have been developed to link persons without violating their privacy, despite errors in the identifiers used to link the databases. The basic approach is to use encoded quasi-identifiers instead of plain quasi-identifiers for making the linkage decision. Ideally, the encoded quasi-identifiers should prevent re-identification but still allow for good linkage quality. While several PPRL techniques have been proposed so far, Bloom filter-based PPRL schemes (BF-PPRL) are among the most popular due to their scalability. However, a recently proposed attack on BF-PPRL based on graph similarities seems to allow the re-identification of individuals from encoded quasi-identifiers. The graph matching attack is therefore widely considered a serious threat to many PPRL approaches, and BF-PPRL schemes have been rejected as insecure. In this work, we argue that this view is not fully justified. We show by experiments that the success of graph matching attacks requires a high overlap between the encoded and plain records used for the attack. As soon as this condition is not fulfilled, the success rate decreases sharply and the attacks become hardly effective. This necessary condition severely limits the applicability of these attacks in practice and also allows for simple but effective countermeasures.


Computer Security , Privacy , Confidentiality , Databases, Factual , Humans , Medical Record Linkage/methods
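
Note: the abstract above describes an attack based on graph similarities. The property such an attack would rely on, and that linkage itself needs, is that similarities between encodings track similarities between the plaintext values. The sketch below only illustrates that property; it is not the attack from the paper, and the unkeyed SHA-256 encoding, the parameters `m` and `k`, and the example names are assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def bigrams(value: str) -> set:
    """Split a normalised, padded string into overlapping 2-grams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value: str, m: int = 512, k: int = 4) -> set:
    """Bloom-filter bit positions for `value` (assumed, simplified encoding)."""
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{g}".encode()).digest()[:8], "big") % m
        for g in bigrams(value)
        for i in range(k)
    }


def dice(a: set, b: set) -> float:
    """Dice coefficient between two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


names = ["smith", "smyth", "jones"]

# Pairwise similarities computed on plaintext bigrams and on the encodings.
plain_sims = {p: dice(bigrams(p[0]), bigrams(p[1])) for p in combinations(names, 2)}
encoded_sims = {p: dice(encode(p[0]), encode(p[1])) for p in combinations(names, 2)}

for pair in plain_sims:
    print(pair, f"plain={plain_sims[pair]:.2f}", f"encoded={encoded_sims[pair]:.2f}")
```

The most similar plaintext pair is also the most similar encoded pair, which is what makes the two similarity graphs comparable. As the abstract argues, exploiting this in practice additionally requires that the attacker's plain records overlap heavily with the encoded ones.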
...