Results 1 - 20 of 633
1.
Cell ; 173(2): 400-416.e11, 2018 04 05.
Article in English | MEDLINE | ID: mdl-29625055

ABSTRACT

For a decade, The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types. TCGA clinical data contain key features representing the democratized nature of the data collection process. To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints. In addition to detailing major challenges and statistical limitations encountered during the effort of integrating the acquired clinical data, we present a summary that includes endpoint usage recommendations for each cancer type. These TCGA-CDR findings appear to be consistent with cancer genomics studies independent of the TCGA effort and provide opportunities for investigating cancer biology using clinical correlates at an unprecedented scale.


Subject(s)
Neoplasms/pathology , Databases, Genetic , Genomics , Humans , Kaplan-Meier Estimate , Neoplasms/genetics , Neoplasms/mortality , Proportional Hazards Models
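As an illustration of how the four TCGA-CDR outcome endpoints can be used, below is a minimal sketch of a per-cancer-type Kaplan-Meier analysis. It assumes the column layout of the public TCGA-CDR release (an event flag such as "OS" paired with a time-in-days column such as "OS.time", plus a "type" column with the cancer acronym); the file path is a placeholder.

```python
# Sketch: Kaplan-Meier overall survival per cancer type from a TCGA-CDR-style
# table. Column names ("OS", "OS.time", "type") follow the public release;
# adjust them to your copy. The path is a placeholder.
import pandas as pd
from lifelines import KaplanMeierFitter

cdr = pd.read_excel("TCGA-CDR-SupplementalTableS1.xlsx")  # placeholder path

kmf = KaplanMeierFitter()
for cancer_type, grp in cdr.dropna(subset=["OS", "OS.time"]).groupby("type"):
    kmf.fit(durations=grp["OS.time"], event_observed=grp["OS"], label=cancer_type)
    print(f"{cancer_type}: median OS = {kmf.median_survival_time_} days")
```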
2.
Hum Genomics ; 18(1): 71, 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38915066

ABSTRACT

OBJECTIVE: To investigate the association between liver enzymes and ovarian cancer (OC), and to validate their potential as biomarkers and their mechanisms in OC. METHODS: Genome-wide association studies for OC and for levels of enzymes such as alkaline phosphatase (ALP), aspartate aminotransferase (AST), alanine aminotransferase, and gamma-glutamyltransferase were analyzed. Univariate and multivariate Mendelian randomization (MR), complemented by the Steiger test, identified enzymes with a potential causal relationship to OC. Single-cell transcriptomics from the GSE130000 dataset pinpointed pivotal cellular clusters, enabling further examination of enzyme-encoding gene expression. Transcription factors (TFs) governing these genes were predicted to construct TF-mRNA networks. Additionally, liver enzyme levels were retrospectively analyzed in healthy individuals and OC patients, alongside the evaluation of correlations with cancer antigen 125 (CA125) and human epididymis protein 4 (HE4). RESULTS: A total of 283 and 209 single nucleotide polymorphisms (SNPs) related to ALP and AST, respectively, were identified. Using the inverse-variance weighted method, univariate MR (UVMR) analysis revealed that ALP (P = 0.050, OR = 0.938) and AST (P = 0.017, OR = 0.906) were inversely associated with OC risk, suggesting their roles as protective factors. Multivariate MR (MVMR) confirmed the causal effect of ALP (P = 0.005, OR = 0.938) on OC, with no evidence of reverse causality. Key cellular clusters, including T cells, ovarian cells, endothelial cells, macrophages, cancer-associated fibroblasts (CAFs), and epithelial cells, were identified, with epithelial cells showing high expression of the genes encoding AST and ALP. Notably, TFs such as TCE4 were implicated in the regulation of the GOT2 and ALPL genes. OC patient samples exhibited decreased ALP levels in both blood and tumor tissues, and a negative correlation between ALP and CA125 levels was observed. CONCLUSION: This study established a causal link of AST and ALP with OC, identifying them as protective factors. The increased expression of the genes encoding these enzymes in epithelial cells provides a theoretical basis for developing novel disease markers and targeted therapies for OC.


Subject(s)
Alkaline Phosphatase , Biomarkers, Tumor , Genome-Wide Association Study , Mendelian Randomization Analysis , Ovarian Neoplasms , Polymorphism, Single Nucleotide , Single-Cell Analysis , Humans , Female , Ovarian Neoplasms/genetics , Ovarian Neoplasms/pathology , Polymorphism, Single Nucleotide/genetics , Single-Cell Analysis/methods , Alkaline Phosphatase/genetics , Alkaline Phosphatase/blood , Biomarkers, Tumor/genetics , WAP Four-Disulfide Core Domain Protein 2/genetics , WAP Four-Disulfide Core Domain Protein 2/metabolism , Aspartate Aminotransferases/genetics , Aspartate Aminotransferases/blood , Liver/pathology , Liver/metabolism , Alanine Transaminase/blood , Alanine Transaminase/genetics , gamma-Glutamyltransferase/genetics , gamma-Glutamyltransferase/blood , CA-125 Antigen/genetics , Gene Expression Regulation, Neoplastic/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Membrane Proteins/genetics , Middle Aged
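The inverse-variance weighted (IVW) estimate used in the UVMR analysis has a simple closed form over per-SNP summary statistics. The sketch below applies it to simulated numbers, purely to show the mechanics; these are not the study's data.

```python
# Minimal sketch of the fixed-effects inverse-variance weighted (IVW) MR
# estimator. beta_x: SNP-exposure effects; beta_y: SNP-outcome effects;
# se_y: standard errors of the SNP-outcome effects. All values simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_snps = 283                       # e.g., the number of ALP-associated SNPs
beta_x = rng.normal(0.08, 0.02, n_snps)
beta_y = -0.06 * beta_x + rng.normal(0, 0.01, n_snps)  # simulated protective effect
se_y = np.full(n_snps, 0.01)

w = beta_x**2 / se_y**2                                # inverse-variance weights
beta_ivw = np.sum(beta_x * beta_y / se_y**2) / np.sum(w)
se_ivw = np.sqrt(1.0 / np.sum(w))
p = 2 * stats.norm.sf(abs(beta_ivw / se_ivw))
print(f"IVW log-OR = {beta_ivw:.3f}, OR = {np.exp(beta_ivw):.3f}, p = {p:.3g}")
```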
3.
J Hepatol ; 2024 May 03.
Article in English | MEDLINE | ID: mdl-38703829

ABSTRACT

BACKGROUND & AIMS: Idiosyncratic drug-induced liver injury (DILI) is a complex and unpredictable event caused by drugs and herbal or dietary supplements. Early identification of human hepatotoxicity at preclinical stages remains a major challenge, in which the selection of validated in vitro systems and test drugs has a significant impact. In this systematic review, we analyzed the compounds used in hepatotoxicity assays and established a list of DILI-positive and -negative control drugs for the validation of in vitro models of DILI, supported by literature and clinical evidence and endorsed by an expert committee from the COST Action ProEuroDILI Network (CA17112). METHODS: Following the 2020 PRISMA guidelines, original research articles focusing on DILI that used in vitro human models and performed at least one hepatotoxicity assay with positive and negative control compounds were included. Bias of the studies was assessed with a modified 'Toxicological Data Reliability Assessment Tool'. RESULTS: A total of 51 studies (out of 2,936) met the inclusion criteria, with 30 categorized as reliable without restrictions. Although there was broad consensus on positive compounds, the selection of negative compounds lacked clarity. 2D monocultures, short exposure times, and cytotoxicity endpoints were the most commonly tested conditions, and there was no consensus on drug concentrations. CONCLUSIONS: This extensive analysis highlighted the lack of agreement on control compounds for in vitro DILI assessment. Following comprehensive in vitro and clinical data analysis, together with input from the expert committee, an evidence-based, consensus-driven list of 10 positive and negative control drugs for the validation of in vitro models of DILI is proposed. IMPACT AND IMPLICATIONS: Prediction of human toxicity early in the drug development process remains a major challenge, necessitating the development of more physiologically relevant liver models and careful selection of drug-induced liver injury (DILI)-positive and -negative control drugs to better predict the risk of DILI associated with new drug candidates. This systematic study therefore has crucial implications for standardizing the validation of new in vitro models of DILI. By establishing a consensus-driven list of positive and negative control drugs, it provides a scientifically justified framework for enhancing the consistency of preclinical testing, thereby addressing a significant challenge in early hepatotoxicity identification. Practically, these findings can guide researchers in evaluating the safety profiles of new drugs, refining in vitro models, and informing regulatory agencies on potential improvements to regulatory guidelines, ensuring a more systematic and efficient approach to drug safety assessment.

4.
Oncologist ; 29(2): 106-116, 2024 Feb 02.
Article in English | MEDLINE | ID: mdl-37878787

ABSTRACT

Rare cancers and other rare nonmalignant tumors comprise 25% of all cancer diagnoses and account for 25% of all cancer deaths. They are difficult to study due to many factors, including infrequent occurrence, the lack of a universal infrastructure for data and/or tissue collection, and a paucity of disease models to test potential treatments. For each individual rare cancer, the limited number of diagnosed cases makes it difficult to recruit sufficient patients for clinical studies, and rare cancer research studies are often siloed. As a result, progress has been slow for many of these cancers. While rare cancer research efforts have increased over time, the breadth of the research landscape is not known. A recent literature search revealed a sharp increase in rare tumor and rare cancer publications beginning in the early 2000s. To identify rare cancer research efforts being conducted in the US and globally, we conducted an online search of rare tumor/rare cancer research programs and identified 76 programs. To gain a deeper understanding of these programs, we designed and administered a survey asking programs for details about their research efforts. Of the 42 programs contacted to complete the survey, 23 responded. Survey results show that most programs are collecting clinical data, molecular data, and biospecimens, and many are conducting molecular analyses. This landscape analysis demonstrates that multiple rare cancer research efforts are ongoing and that the rare cancer community may benefit from collaboration among stakeholders to accelerate research and improve patient outcomes.


Subject(s)
Neoplasms , Humans , Tissue Banks
5.
J Transl Med ; 22(1): 185, 2024 02 20.
Article in English | MEDLINE | ID: mdl-38378565

ABSTRACT

Clinical data mining with predictive models offers significant advantages for re-evaluating and leveraging large amounts of complex clinical real-world data and experimental comparison data for tasks such as risk stratification, diagnosis, classification, and survival prediction. However, its translational application is still limited. One challenge is that clinical requirements and data mining efforts are often out of sync. Additionally, predictions mined from external data are difficult to apply directly in local medical institutions. Hence, it is necessary to critically review the translational application of clinical data mining and to provide an analytical workflow for developing and validating prediction models that remains scientifically valid in response to clinical questions. This review systematically revisits the purpose, process, and principles of clinical data mining and discusses the key causes of its detachment from practice and of the misuse of model verification in developing predictive models for research. On this basis, we propose a niche-targeting framework of four principles: Clinical Contextual, Subgroup-Oriented, Confounder- and False Positive-Controlled (CSCF), to guide clinical data mining before model development in clinical settings. Ultimately, we hope this review can help guide future research and the development of personalized predictive models, with the goals of discovering subgroups with varied remedial benefits or risks and ensuring that precision medicine delivers its full potential.


Subject(s)
Data Mining , Precision Medicine
6.
Article in English | MEDLINE | ID: mdl-38466930

ABSTRACT

OBJECTIVES: To assess whether prodromal symptoms of rheumatoid arthritis (RA), as recorded in the Clinical Practice Research Datalink Aurum (CPRD) database of English primary care records, differ by ethnicity and socioeconomic status. METHODS: A cross-sectional study of the coding of common symptoms (≥0.1% in the sample) in the 24 months preceding RA diagnosis in CPRD Aurum, recorded between January 1st 2004 and May 1st 2022. Eligible cases were adults with a code for RA diagnosis. For each symptom, a logistic regression was performed with the symptom as the dependent variable and ethnicity and socioeconomic status as independent variables. Results were adjusted for sex, age, BMI, and smoking status. White ethnicity and the highest socioeconomic quintile were the comparators. RESULTS: In total, 70,115 cases were eligible for inclusion, of whom 66.4% were female. Twenty-one symptoms were coded in >0.1% of cases and were therefore included in the analysis. Patients of South Asian ethnicity had a higher frequency of codes for several symptoms, the largest differences by odds ratio being muscle cramps (OR 1.71, 1.44-2.57) and shoulder pain (1.44, 1.25-1.66). Patients of Black ethnicity had a higher prevalence of several codes, including unintended weight loss (2.02, 1.25-3.28) and ankle pain (1.51, 1.02-2.23). Low socioeconomic status was associated with morning stiffness (1.74, 1.08-2.80) and falls (1.37, 2.03-1.82). CONCLUSION: There are significant differences in coded symptoms between demographic groups, which must be considered in clinical practice in diverse populations and in order to avoid algorithmic bias in prediction tools derived from routinely collected healthcare data.
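The per-symptom model described here is straightforward to express with the statsmodels formula API. The sketch below uses hypothetical column names and simulated data; it is not the authors' code, only the modeling pattern: symptom as dependent variable, White ethnicity and the top socioeconomic quintile as reference categories, adjusted for sex, age, BMI, and smoking.

```python
# Sketch of one per-symptom logistic regression (hypothetical column names,
# simulated data). Exponentiated coefficients give adjusted odds ratios.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
cases = pd.DataFrame({
    "muscle_cramps": rng.integers(0, 2, n),
    "ethnicity": rng.choice(["White", "South Asian", "Black"], n),
    "imd_quintile": rng.choice(["Q1", "Q2", "Q3", "Q4", "Q5"], n),
    "sex": rng.choice(["F", "M"], n),
    "age": rng.normal(60, 12, n),
    "bmi": rng.normal(27, 5, n),
    "smoking": rng.choice(["never", "ex", "current"], n),
})

model = smf.logit(
    "muscle_cramps ~ C(ethnicity, Treatment('White'))"
    " + C(imd_quintile, Treatment('Q1')) + C(sex) + age + bmi + C(smoking)",
    data=cases,
).fit(disp=0)
print(np.exp(model.params).round(2))   # adjusted odds ratios
```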

7.
BMC Med Res Methodol ; 24(1): 55, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38429658

ABSTRACT

BACKGROUND: Research Electronic Data Capture (REDCap) is a web application for creating and managing online surveys and databases. Clinical data management is an essential process before performing any statistical analysis to ensure the quality and reliability of study information. Processing REDCap data in R can be complex and often benefits from automation. While there are several R packages available for specific tasks, none offers a comprehensive approach to data management. RESULTS: REDCapDM is an R package for accessing and managing REDCap data. It imports data from REDCap into R using either an API connection or files in R format exported directly from REDCap. It provides several functions for data processing and transformation, and it helps generate and manage queries to clarify or resolve discrepancies found in the data. CONCLUSION: The REDCapDM package is a valuable tool for data scientists and clinical data managers who use REDCap and R. It assists in tasks such as importing, processing, and quality-checking data from their research studies.


Subject(s)
Data Management , Software , Humans , Reproducibility of Results , Surveys and Questionnaires , Records
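REDCapDM itself is an R package, but the REDCap record-export API it builds on is a plain authenticated POST, shown below as a language-agnostic Python sketch. The instance URL and token are placeholders for your institution's REDCap project.

```python
# Sketch of the raw REDCap record-export API call that packages such as
# REDCapDM wrap. URL and token are placeholders.
import requests

REDCAP_URL = "https://redcap.example.org/api/"   # placeholder instance
payload = {
    "token": "YOUR_API_TOKEN",        # project-specific API token
    "content": "record",
    "action": "export",
    "format": "json",
    "type": "flat",                   # one row per record/event
    "rawOrLabel": "raw",
}
resp = requests.post(REDCAP_URL, data=payload, timeout=60)
resp.raise_for_status()
records = resp.json()                 # list of dicts, one per record/event
print(f"exported {len(records)} rows")
```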
8.
J Am Acad Dermatol ; 90(4): 693-701, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37343834

ABSTRACT

Throughout the 21st century, national and local governments, private health sectors, health insurance companies, healthcare professionals, labor unions, and consumers have been striving to develop an effective approach to evaluating, reporting, and improving the quality of healthcare. As medicine improves and health systems grow to meet patient needs, the performance measurement system for care effectiveness must also evolve. Continual efforts should be undertaken to effectively measure quality of care in order to create a more informed public, improve health outcomes, and reduce healthcare costs. As such, recent policy reform has necessitated that performance systems be implemented in healthcare, with the "performance measure" being the foundation of a system in which all of healthcare must be actively engaged to ensure optimal care for patients. The development of performance measures can be highly complex, particularly when creating specialty-specific measures. To help dermatologists understand the process of creating dermatology-specific performance measures and engage in creating or implementing performance measures at the local or national level, this article in the two-part continuing medical education series reviews the types, components, and process of developing, reviewing, and implementing performance measures.


Subject(s)
Dermatology , Humans , Delivery of Health Care , Insurance, Health
9.
J Pediatr Gastroenterol Nutr ; 78(5): 1126-1134, 2024 May.
Article in English | MEDLINE | ID: mdl-38482890

ABSTRACT

OBJECTIVES: Vedolizumab (VDZ) and ustekinumab (UST) are second-line treatments in pediatric patients with ulcerative colitis (UC) refractory to antitumor necrosis factor (anti-TNF) therapy. Pediatric studies comparing the effectiveness of these medications are lacking. Using a registry from ImproveCareNow (ICN), a global research network in pediatric inflammatory bowel disease, we compared the effectiveness of UST and VDZ in anti-TNF refractory UC. METHODS: We performed a propensity-score weighted regression analysis to compare corticosteroid-free clinical remission (CFCR) at 6 months from starting second-line therapy. Sensitivity analyses tested the robustness of our findings to different ways of handling missing outcome data. Secondary analyses evaluated alternative proxies of response and infection risk. RESULTS: Our cohort included 262 patients on VDZ and 74 patients on UST. At baseline, the two groups differed on their mean pediatric UC activity index (PUCAI) (p = 0.03) but were otherwise similar. At Month 6, 28.3% of patients on VDZ and 25.8% of those on UST achieved CFCR (p = 0.76). Our primary model showed no difference in CFCR (odds ratio: 0.81; 95% confidence interval [CI]: 0.41-1.59) (p = 0.54). The time to biologic discontinuation was similar in both groups (hazard ratio: 1.26; 95% CI: 0.76-2.08) (p = 0.36), with the reference group being VDZ, and we found no differences in clinical response, growth parameters, hospitalizations, surgeries, infections, or malignancy risk. Sensitivity analyses supported these findings of similar effectiveness. CONCLUSIONS: UST and VDZ are similarly effective for inducing clinical remission in anti-TNF refractory UC in pediatric patients. Providers should consider safety, tolerability, cost, and comorbidities when deciding between these therapies.


Subject(s)
Antibodies, Monoclonal, Humanized , Colitis, Ulcerative , Gastrointestinal Agents , Ustekinumab , Humans , Colitis, Ulcerative/drug therapy , Ustekinumab/therapeutic use , Female , Male , Child , Antibodies, Monoclonal, Humanized/therapeutic use , Adolescent , Gastrointestinal Agents/therapeutic use , Treatment Outcome , Tumor Necrosis Factor-alpha/antagonists & inhibitors , Remission Induction/methods , Propensity Score , Registries
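The propensity-score weighted comparison described here follows the usual IPTW recipe: model treatment assignment from baseline covariates, weight each patient by the inverse probability of the treatment actually received, then fit a weighted outcome model. Below is a generic sketch on simulated data; it is not the registry data and not the authors' code.

```python
# Generic IPTW sketch: weighted comparison of corticosteroid-free clinical
# remission (CFCR) between two drugs. All data simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 336
X = rng.normal(size=(n, 3))        # baseline covariates, e.g., PUCAI, age, duration
treat = rng.integers(0, 2, n)      # 1 = UST, 0 = VDZ
outcome = rng.integers(0, 2, n)    # CFCR at 6 months

ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]
w = np.where(treat == 1, 1 / ps, 1 / (1 - ps))   # inverse probability of treatment

# Weighted outcome model: treatment effect on CFCR after balancing covariates
effect = LogisticRegression().fit(treat.reshape(-1, 1), outcome, sample_weight=w)
print("weighted OR (UST vs VDZ):", np.exp(effect.coef_[0][0]).round(2))
```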
10.
J Biomed Inform ; 157: 104711, 2024 Aug 30.
Article in English | MEDLINE | ID: mdl-39182632

ABSTRACT

OBJECTIVE: This study aimed to develop a novel approach using routinely collected electronic health record (EHR) data to improve the prediction of a rare event. We illustrate this with the example of improving early prediction of an autism diagnosis, given its low prevalence, by leveraging correlations between autism and other neurodevelopmental conditions (NDCs). METHODS: To achieve this, we introduced a conditional multi-label model that merges conditional learning and multi-label methodologies. The conditional learning approach breaks a hard task into more manageable pieces at each stage, and the multi-label approach utilizes information from related neurodevelopmental conditions to learn predictive latent features. The study involved forecasting autism diagnosis by age 5.5 years using data from the first 18 months of life, and analyzing feature importance correlations to explore the alignment of the feature space across different conditions. RESULTS: Analyzing the health records of 18,156 children, we were able to generate a model that predicts a future autism diagnosis with moderate performance (AUROC = 0.76). The proposed conditional multi-label method significantly improves predictive performance, with an AUROC of 0.80 (p < 0.001). Further examination shows that the conditional and multi-label approaches alone each provided only a marginal lift in model performance compared to a one-stage, one-label approach. We also demonstrated the generalizability and applicability of this method using simulated data with high correlation between the feature vectors for different labels. CONCLUSION: Our findings underscore the effectiveness of the developed conditional multi-label model for early prediction of an autism diagnosis. The study introduces a versatile strategy applicable to prediction tasks involving limited target populations that share underlying features or etiology with related groups.
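One way to read the two-stage design is sketched below: stage 1 screens for any NDC, and stage 2 is a multi-label model over the related conditions fit on the screened subset, so the rare autism label borrows strength from its neighbors. This is an interpretation on simulated data, not the authors' implementation.

```python
# Sketch of a conditional multi-label classifier (one reading of the design;
# simulated data). Stage 1: screen for any NDC. Stage 2: multi-label model
# on the screened subset, with autism as label 0.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(3)
n, d = 5000, 20
X = rng.normal(size=(n, d))                    # features from first 18 months
Y = (X[:, :3] + rng.normal(0, 1, (n, 3)) > 1.5).astype(int)  # [autism, NDC1, NDC2]
any_ndc = Y.max(axis=1)

stage1 = LogisticRegression().fit(X, any_ndc)              # easier screening task
mask = stage1.predict_proba(X)[:, 1] > 0.2                 # conditional subset
stage2 = MultiOutputClassifier(LogisticRegression()).fit(X[mask], Y[mask])

# Approximate decomposition: P(autism) ~ P(screen positive) * P(autism | screened)
p_autism = (stage1.predict_proba(X)[:, 1]
            * stage2.estimators_[0].predict_proba(X)[:, 1])
print("mean predicted autism risk:", p_autism.mean().round(3))
```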

11.
BMC Med Imaging ; 24(1): 67, 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38504179

ABSTRACT

BACKGROUND: Clinical data warehouses provide access to massive amounts of medical images, but these images are often heterogeneous. They can, for instance, include images acquired both with and without the injection of a gadolinium-based contrast agent. Harmonizing such data sets is thus fundamental to guaranteeing unbiased results, for example when performing differential diagnosis. Furthermore, classical neuroimaging software tools for feature extraction are typically applied only to images without gadolinium. The objective of this work is to evaluate how image translation can help exploit a highly heterogeneous data set containing both contrast-enhanced and non-contrast-enhanced images from a clinical data warehouse. METHODS: We propose and compare different 3D U-Net and conditional GAN models to convert contrast-enhanced T1-weighted (T1ce) into non-contrast-enhanced (T1nce) brain MRI. These models were trained using 230 image pairs and tested on 77 image pairs from the clinical data warehouse of the Greater Paris area. RESULTS: Validation using standard image similarity measures demonstrated that the similarity between real and synthetic T1nce images was higher than that between real T1nce and T1ce images for all the models compared. The best-performing models were further validated on a segmentation task. We showed that tissue volumes extracted from synthetic T1nce images were closer to those of real T1nce images than volumes extracted from T1ce images. CONCLUSION: We showed that deep learning models initially developed with research-quality data can synthesize T1nce from T1ce images of clinical quality, and that reliable features can be extracted from the synthetic images, demonstrating the ability of such methods to help exploit data sets from clinical data warehouses.


Subject(s)
Data Warehousing , Gadolinium , Humans , Brain/diagnostic imaging , Magnetic Resonance Imaging/methods , Neuroimaging/methods , Image Processing, Computer-Assisted/methods
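The "standard image similarity measures" step of the validation can be illustrated with SSIM and PSNR from scikit-image; the sketch below uses synthetic arrays in place of the MRI volumes.

```python
# Sketch of the validation step: comparing a synthetic T1nce volume against
# the real one with standard similarity measures (SSIM, PSNR). Synthetic
# random arrays stand in for the MRI volumes here.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

rng = np.random.default_rng(4)
real_t1nce = rng.random((64, 64, 64)).astype(np.float32)
synthetic_t1nce = np.clip(
    real_t1nce + rng.normal(0, 0.05, real_t1nce.shape), 0, 1
).astype(np.float32)

ssim = structural_similarity(real_t1nce, synthetic_t1nce, data_range=1.0)
psnr = peak_signal_noise_ratio(real_t1nce, synthetic_t1nce, data_range=1.0)
print(f"SSIM = {ssim:.3f}, PSNR = {psnr:.1f} dB")
```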
12.
J Med Internet Res ; 26: e54580, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38551633

ABSTRACT

BACKGROUND: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature-hallucination issues of LLMs require further attention. OBJECTIVE: This study aimed to introduce a novel modular LLM pipeline that can semantically extract features from textual patient admission records. METHODS: The pipeline was designed as a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, and was tested with 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People's Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation, with the help of a local expert's annotation. The pipeline was evaluated with the metrics of accuracy, precision, null ratio, and time consumption. Additionally, we evaluated its performance via a quantized version of Qwen-14B-Chat on a consumer-grade GPU. RESULTS: The pipeline demonstrated a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantized version of QWEN delivered enhanced performance, with 97.28% accuracy and a 0% null ratio. CONCLUSIONS: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records.


Subject(s)
Data Mining , Electronic Health Records , Humans , Data Mining/methods , Natural Language Processing , China , Language
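The evaluation metrics lend themselves to a compact sketch. The operationalization below (accuracy over all fields, precision over extracted fields, null ratio as the share of fields the pipeline failed to return) is one plausible reading of the metric names, not the paper's exact definitions, and the records are toy examples.

```python
# Sketch of the evaluation metrics under an assumed operationalization:
# extracted feature values compared against expert annotations; "null"
# means the pipeline returned nothing for that field. Toy data.
extracted = [{"gravidity": "2", "hypertension": "yes", "gest_age": None},
             {"gravidity": "1", "hypertension": "no",  "gest_age": "39w"}]
annotated = [{"gravidity": "2", "hypertension": "no",  "gest_age": "38w"},
             {"gravidity": "1", "hypertension": "no",  "gest_age": "39w"}]

pairs = [(e[k], a[k]) for e, a in zip(extracted, annotated) for k in a]
n_null = sum(e is None for e, _ in pairs)
n_correct = sum(e == a for e, a in pairs if e is not None)
n_extracted = len(pairs) - n_null

print(f"accuracy   = {n_correct / len(pairs):.2%}")    # correct / all fields
print(f"precision  = {n_correct / n_extracted:.2%}")   # correct / extracted fields
print(f"null ratio = {n_null / len(pairs):.2%}")
```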
13.
J Med Internet Res ; 26: e55779, 2024 Apr 09.
Article in English | MEDLINE | ID: mdl-38593431

ABSTRACT

Practitioners of digital health are familiar with disjointed data environments that often inhibit effective communication among different elements of the ecosystem. This fragmentation leads in turn to issues such as inconsistencies between services and payments, wastage, and, notably, care that falls short of best practice. Despite the long-standing recognition of interoperable data as a potential solution, efforts to achieve interoperability have been disjointed and inconsistent, resulting in numerous incompatible standards despite widespread agreement that fewer standards would enhance interoperability. This paper introduces a framework for understanding health care data needs and discusses the challenges and opportunities of open data standards in the field. It emphasizes the necessity of acknowledging diverse data standards, each catering to specific viewpoints and needs, and proposes a categorization of health care data into three domains, each with distinct characteristics and challenges. It also outlines overarching design requirements applicable to all domains, as well as specific requirements unique to each domain.


Subject(s)
Delivery of Health Care , Humans
14.
J Med Internet Res ; 26: e50049, 2024 06 10.
Article in English | MEDLINE | ID: mdl-38857066

ABSTRACT

BACKGROUND: It is necessary to harmonize and standardize the data variables used in case report forms (CRFs) of clinical studies to facilitate the merging and sharing of collected patient data across several clinical studies. This is particularly true for clinical studies that focus on infectious diseases, where public health may depend heavily on the studies' findings. Hence, there is an elevated urgency to generate meaningful, reliable insights, ideally based on a high sample number and high-quality data. The implementation of core data elements and the incorporation of interoperability standards can facilitate the creation of harmonized clinical data sets. OBJECTIVE: This study's objective was to compare, harmonize, and standardize variables focused on diagnostic tests used as part of CRFs in 6 international clinical studies of infectious diseases and, ultimately, to make the resulting panstudy common data elements (CDEs) available for ongoing and future studies to foster the interoperability and comparability of collected data across trials. METHODS: We reviewed and compared the metadata that comprised the CRFs used for data collection in and across all 6 infectious disease studies under consideration in order to identify CDEs. We examined the availability of international semantic standard codes within the Systematized Nomenclature of Medicine Clinical Terms, the National Cancer Institute Thesaurus, and the Logical Observation Identifiers Names and Codes system for the unambiguous representation of the diagnostic testing information that makes up the CDEs. We then proposed 2 data models that incorporate semantic and syntactic standards for the identified CDEs. RESULTS: Of 216 variables considered in the scope of the analysis, we identified 11 CDEs describing diagnostic tests (in particular, serology and sequencing) for infectious diseases: viral lineage/clade; test date, type, performer, and manufacturer; target gene; quantitative and qualitative results; and specimen identifier, type, and collection date. CONCLUSIONS: The identification of CDEs for infectious diseases is the first step in facilitating the exchange and possible merging of a subset of data across clinical studies (and, with that, large research projects) for possible shared analysis to increase the power of findings. The path to harmonization and standardization of clinical study data in the interest of interoperability can be paved in 2 ways. First, a map to standard terminologies ensures that each data element's (variable's) definition is unambiguous and has a single, unique interpretation across studies. Second, the exchange of these data is assisted by "wrapping" them in a standard exchange format, such as Fast Healthcare Interoperability Resources (FHIR) or the Clinical Data Interchange Standards Consortium's Clinical Data Acquisition Standards Harmonization Model.


Subject(s)
Communicable Diseases , Semantics , Humans , Communicable Diseases/diagnosis , Common Data Elements
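As a concrete picture of the "wrapping" idea, the sketch below places the 11 diagnostic-test CDEs into a FHIR R4 Observation, expressed as a Python dict. Element names follow the R4 Observation resource; the LOINC code and all values are illustrative, not taken from the study.

```python
# Sketch: diagnostic-test CDEs wrapped in a FHIR R4 Observation. Element
# names follow the R4 Observation resource; codes/values are illustrative.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "94309-2",           # illustrative LOINC code
                         "display": "SARS-CoV-2 RNA NAA+probe"}]},  # test type
    "effectiveDateTime": "2023-04-01",                # test date
    "performer": [{"display": "Example Lab"}],        # test performer
    "device": {"display": "Example sequencer"},       # test manufacturer/device
    "valueCodeableConcept": {"text": "positive"},     # qualitative result
    "component": [
        {"code": {"text": "viral lineage/clade"}, "valueString": "BA.5"},
        {"code": {"text": "target gene"}, "valueString": "N gene"},
        {"code": {"text": "quantitative result"},
         "valueQuantity": {"value": 23.4, "unit": "Ct"}},
    ],
    "specimen": {"reference": "Specimen/example"},    # specimen id/type/collection
}
```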
15.
BMC Med Inform Decis Mak ; 24(1): 58, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38408983

ABSTRACT

BACKGROUND: To gain insight into the real-life care of patients in the healthcare system, data from hospital information systems and insurance systems are required. Consequently, linking clinical data with claims data is necessary. To ensure their syntactic and semantic interoperability, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) from the Observational Health Data Sciences and Informatics (OHDSI) community was chosen. However, there is no detailed guide that would allow researchers to follow a generic process for data harmonization, i.e., the transformation of local source data into the standardized OMOP CDM format. Thus, the aim of this paper is to conceptualize a generic data harmonization process for OMOP CDM. METHODS: For this purpose, we conducted a literature review focusing on publications that address the harmonization of clinical or claims data in OMOP CDM. Subsequently, the process steps used and their chronological order, as well as the OHDSI tools applied, were extracted for each included publication. The results were then compared to derive a generic sequence of the process steps. RESULTS: From the 23 publications included, a generic data harmonization process for OMOP CDM was conceptualized, consisting of nine process steps: dataset specification, data profiling, vocabulary identification, coverage analysis of vocabularies, semantic mapping, structural mapping, the extract-transform-load (ETL) process, and qualitative and quantitative data quality analysis. Furthermore, we identified seven OHDSI tools that supported five of the process steps. CONCLUSIONS: The generic data harmonization process can be used as a step-by-step guide to assist other researchers in harmonizing source data in OMOP CDM.


Subject(s)
Medical Informatics , Vocabulary , Humans , Databases, Factual , Data Science , Semantics , Electronic Health Records
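The structural-mapping/ETL step can be pictured with a minimal source-to-OMOP mapping. The sketch below moves one source diagnosis row into a condition_occurrence record; field names follow OMOP CDM v5.x, while the concept ids and the vocabulary lookup are illustrative stand-ins for a proper semantic mapping.

```python
# Sketch of the structural-mapping/ETL step: one source diagnosis record
# mapped onto OMOP CDM's condition_occurrence table (field names per CDM
# v5.x; concept ids and the lookup dict are illustrative).
source_row = {"patient_id": "P001", "icd10": "E11.9", "dx_date": "2023-05-10"}

icd10_to_omop = {"E11.9": 201826}   # illustrative standard concept id

condition_occurrence = {
    "condition_occurrence_id": 1,
    "person_id": 42,                                   # from the person table
    "condition_concept_id": icd10_to_omop[source_row["icd10"]],
    "condition_start_date": source_row["dx_date"],
    "condition_type_concept_id": 32020,                # illustrative type concept
    "condition_source_value": source_row["icd10"],
}
print(condition_occurrence)
```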
16.
BMC Med Inform Decis Mak ; 24(1): 147, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816848

ABSTRACT

BACKGROUND: Securing adequate data privacy is critical for the productive utilization of data. De-identification, which involves masking or replacing specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies have investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility. METHODS: Predictive modeling of emergency department length of stay was used as the data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from the clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether a viable tradeoff between the two can be identified. RESULTS: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and the complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. CONCLUSIONS: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and high utility is a complex task that requires understanding the data's intended use and involving input from data users. This approach can help find a suitable compromise between data privacy and utility.


Subject(s)
Confidentiality , Data Anonymization , Humans , Confidentiality/standards , Emergency Service, Hospital , Length of Stay , Republic of Korea , Male
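The privacy side of the tradeoff examined here is commonly quantified via k-anonymity, which tools like ARX compute over generalization hierarchies. Below is a toy pandas sketch of the idea: generalize a quasi-identifier, then check the smallest equivalence class.

```python
# Sketch of the generalization/k-anonymity idea that tools such as ARX
# automate: coarsen quasi-identifiers, then check the smallest equivalence
# class size k (worst-case re-identification risk is 1/k). Toy data.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 36, 35, 61, 63, 62],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "los_hours": [3, 5, 7, 2, 9, 4],     # outcome used by the utility model
})

df["age_band"] = (df["age"] // 10) * 10  # generalize age to 10-year bands
k = df.groupby(["age_band", "sex"]).size().min()
print(f"k = {k}, worst-case re-identification risk = {1 / k:.2f}")
```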
17.
J Allergy Clin Immunol ; 151(1): 272-279, 2023 01.
Article in English | MEDLINE | ID: mdl-36243223

ABSTRACT

BACKGROUND: Identification of patients with underlying inborn errors of immunity and an inherent susceptibility to infection remains challenging. The ensuing protracted diagnostic odyssey for such patients often results in greater morbidity and suboptimal outcomes, underscoring a need to develop systematic methods for improving diagnostic rates. OBJECTIVE: The principal aim of this study was to build and validate a generalizable analytical pipeline for population-wide detection of infection susceptibility and risk of primary immunodeficiency. METHODS: This prospective, longitudinal cohort study coupled weighted rules with a machine learning classifier for risk stratification. Claims data were analyzed from a diverse population (n = 427,110) iteratively over 30 months. Cohort outcomes were enumerated for new diagnoses, hospitalizations, and acute care visits. The study followed TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) standards. RESULTS: Cohort members initially identified as high risk were proportionally more likely to receive a diagnosis of primary immunodeficiency than those at low-to-medium risk or those without claims of interest (9% vs 1.5% vs 0.2%, respectively; P < .001, chi-square test). Subsequent machine learning stratification enabled an annualized individual snapshot of complexity for triaging referrals. The study's top-performing machine learning model for visit-level prediction used a single dense layer neural network architecture (area under the receiver operating characteristic curve = 0.98; F1 score = 0.98). CONCLUSIONS: A 2-step analytical pipeline can facilitate the identification of individuals with primary immunodeficiency and accurately quantify clinical risk.


Subject(s)
Artificial Intelligence , Machine Learning , Humans , Prospective Studies , Longitudinal Studies , Prognosis
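For the top-performing model, "single dense layer neural network" can be read as a network with one hidden dense layer; the sketch below fits that architecture with scikit-learn on simulated claims-derived features. This is an assumed reading, not the authors' implementation.

```python
# Sketch of a visit-level classifier with a single hidden dense layer (one
# reading of the architecture described), on simulated features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(4000, 30))               # engineered claims features
y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, 4000) > 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
print(f"AUROC = {roc_auc_score(y_te, prob):.2f}, "
      f"F1 = {f1_score(y_te, clf.predict(X_te)):.2f}")
```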
18.
Int J Mol Sci ; 25(11)2024 May 30.
Article in English | MEDLINE | ID: mdl-38892228

ABSTRACT

Primary sclerosing cholangitis (PSC) is a rare, progressive disease characterized by inflammation and fibrosis of the bile ducts, and it lacks reliable prognostic biomarkers of disease activity. Machine learning applied to broad proteomic profiling of sera allowed for the discovery of markers of disease presence, severity, and cirrhosis, and for exploration of the involvement of CCL24, a chemokine with fibro-inflammatory activity. Sera from 30 healthy controls and 45 PSC patients were profiled with a proximity extension assay quantifying the expression of 2870 proteins and used to train an elastic net model. Proteins that contributed most to the model were tested for correlation with the enhanced liver fibrosis (ELF) score and used to perform pathway analysis. Statistical modeling for the presence of cirrhosis was performed with principal component analysis (PCA), and receiver operating characteristic (ROC) curves were used to assess the usability of potential biomarkers. The model successfully predicted the presence of PSC; the top-ranked proteins were associated with cell adhesion, immune response, and inflammation, and each had an area under the receiver operating characteristic (AUROC) curve greater than 0.9 for disease presence and greater than 0.8 for the ELF score. Pathway analysis showed enrichment for functions associated with PSC, overlapping with pathways enriched in patients with high levels of CCL24. Patients with cirrhosis showed higher levels of CCL24. This data-driven approach to characterizing PSC and its severity highlights potential serum protein biomarkers and the importance of CCL24 in the disease, implying its therapeutic potential in PSC.


Subject(s)
Biomarkers , Chemokine CCL24 , Cholangitis, Sclerosing , Liver Cirrhosis , Machine Learning , Adult , Female , Humans , Male , Middle Aged , Biomarkers/blood , Case-Control Studies , Chemokine CCL24/metabolism , Chemokine CCL24/blood , Cholangitis, Sclerosing/blood , Cholangitis, Sclerosing/metabolism , Disease Progression , Liver Cirrhosis/blood , Liver Cirrhosis/metabolism , Liver Cirrhosis/pathology , Proteomics/methods , ROC Curve
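An elastic net on a wide proteomic matrix, as described here, is a few lines in scikit-learn. The sketch below mirrors the cohort shape (75 samples, 2870 proteins) with simulated data and a planted signal; the hyperparameters are placeholders, not the study's settings.

```python
# Sketch of an elastic net classifier for disease presence from a wide
# proteomic matrix (simulated data mirroring the cohort size).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(75, 2870))                  # NPX-like protein expression
y = np.array([0] * 30 + [1] * 45)                # 30 controls, 45 PSC patients
X[y == 1, :25] += 0.8                            # planted signal in 25 proteins

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.1, max_iter=5000),
)
auroc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUROC = {auroc.mean():.2f} +/- {auroc.std():.2f}")

# Rank proteins by absolute coefficient, as for the ELF correlation step
model.fit(X, y)
coefs = model.named_steps["logisticregression"].coef_[0]
print("top protein indices:", np.argsort(-np.abs(coefs))[:5])
```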
19.
J Stroke Cerebrovasc Dis ; 33(9): 107848, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38964525

ABSTRACT

OBJECTIVES: Cerebral venous thrombosis (CVT) poses diagnostic challenges due to the variability of its disease course and symptoms, and its prognosis relies on early diagnosis. Our study focuses on developing a machine learning-based screening algorithm using clinical data from a large neurology referral center in southern Iran. METHODS: The Iran Cerebral Venous Thrombosis Registry (ICVTR code: 9001013381) provided data on 382 CVT cases from Namazi Hospital. The control group comprised adult headache patients without CVT, as confirmed by neuroimaging, retrospectively selected from those admitted to the same hospital. We collected 60 clinical and demographic features for model development and validation. Our modeling pipeline involved imputing missing values and evaluating four machine learning algorithms: a generalized linear model, random forest, support vector machine, and extreme gradient boosting. RESULTS: A total of 314 CVT cases and 575 controls were included. The highest AUROC was reached when imputation was used to estimate missing values for all variables, combined with the support vector machine model (AUROC = 0.910, recall = 0.73, precision = 0.88). The best recall was also achieved by the support vector machine model when only variables with less than a 50% missing rate were included (AUROC = 0.887, recall = 0.77, precision = 0.86). The random forest model yielded the best precision when using variables with less than a 50% missing rate (AUROC = 0.882, recall = 0.61, precision = 0.94). CONCLUSION: The application of machine learning techniques to clinical data showed promising results in accurately diagnosing CVT within our study population. This approach offers a valuable complementary assistive tool, or an alternative, to resource-intensive imaging methods.


Subject(s)
Intracranial Thrombosis , Predictive Value of Tests , Registries , Support Vector Machine , Venous Thrombosis , Humans , Female , Male , Iran/epidemiology , Adult , Retrospective Studies , Middle Aged , Intracranial Thrombosis/diagnostic imaging , Intracranial Thrombosis/diagnosis , Intracranial Thrombosis/therapy , Venous Thrombosis/diagnostic imaging , Venous Thrombosis/diagnosis , Reproducibility of Results , Diagnosis, Computer-Assisted , Machine Learning , Aged
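The modeling pipeline described (imputation followed by a classifier, reported with AUROC, recall, and precision) maps naturally onto a scikit-learn pipeline. The sketch below uses simulated data shaped like the cohort (314 cases, 575 controls, 60 features, roughly 20% missingness); it is not the registry data or the authors' code.

```python
# Sketch of the described pipeline: impute missing values, fit an SVM with
# probability outputs, report AUROC/recall/precision. Simulated data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, recall_score, precision_score

rng = np.random.default_rng(7)
y = np.r_[np.ones(314), np.zeros(575)].astype(int)   # 314 CVT cases, 575 controls
X = rng.normal(size=(889, 60))                       # 60 clinical/demographic features
X[y == 1, :5] += 1.0                                 # planted signal for the sketch
X[rng.random(X.shape) < 0.2] = np.nan                # ~20% missingness

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = make_pipeline(SimpleImputer(strategy="median"),
                      StandardScaler(),
                      SVC(probability=True, random_state=0)).fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]
pred = model.predict(X_te)
print(f"AUROC = {roc_auc_score(y_te, prob):.3f}, "
      f"recall = {recall_score(y_te, pred):.2f}, "
      f"precision = {precision_score(y_te, pred):.2f}")
```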
20.
Aten Primaria ; 56(5): 102848, 2024 May.
Article in Spanish | MEDLINE | ID: mdl-38228052

ABSTRACT

INTRODUCTION: Technological advances continue to transform society, including the health sector. The decentralized and verifiable nature of blockchain technology presents great potential for addressing current challenges in healthcare data management. DISCUSSION: This article reports on how the widespread adoption of blockchain faces important challenges and barriers that must be addressed, such as the lack of regulation, technical complexity, safeguarding of privacy, and economic and technological costs. Collaboration between medical professionals, technologists, and legislators is essential to establish a solid regulatory framework and adequate training. CONCLUSION: Blockchain technology has the potential to revolutionize data management in the healthcare sector by improving the quality of medical care, empowering users, and promoting the secure sharing of data, but an important cultural change, along with more evidence, is needed to reveal its advantages over existing technological alternatives.


Subject(s)
Blockchain , Computer Security , Computer Security/standards , Humans , Data Management