1.
Am J Hum Genet ; 109(9): 1591-1604, 2022 09 01.
Article in English | MEDLINE | ID: mdl-35998640

ABSTRACT

Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with those annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed true associations via manual review of a list of sampled pairs. Compared to the manually curated annotation, OARD is 100% data-driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.
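
As a rough illustration of the association analysis mentioned in this abstract, the sketch below scores one disease-phenotype pair from aggregated patient counts. The abstract does not say which statistic OARD uses; the odds ratio, the 0.5 continuity correction, and the function name are assumptions chosen for illustration only.

```python
def odds_ratio(n_both, n_disease, n_phenotype, n_total):
    """Odds ratio for one disease-phenotype pair from aggregated counts.

    n_both      -- patients with both the disease and the phenotype
    n_disease   -- patients with the disease
    n_phenotype -- patients with the phenotype
    n_total     -- all patients in the cohort
    (Assumed inputs; not the OARD data format.)
    """
    # Cells of the 2x2 contingency table.
    a = n_both
    b = n_disease - n_both
    c = n_phenotype - n_both
    d = n_total - n_disease - n_phenotype + n_both
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent")
    # Add 0.5 to every cell so empty cells do not cause division by zero.
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)
```

A value well above 1 suggests the phenotype is over-represented among patients with the disease; only summary counts are needed, which fits the aggregated-statistics sharing the abstract describes.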


Subject(s)
Natural Language Processing, Rare Diseases, Electronic Health Records, Humans, Phenotype, Rare Diseases/genetics
2.
J Biomed Inform ; 155: 104659, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38777085

ABSTRACT

OBJECTIVE: This study aims to promote interoperability in precision medicine and translational research by aligning the Observational Medical Outcomes Partnership (OMOP) and Phenopackets data models. Phenopackets is an expert knowledge-driven schema designed to facilitate the storage and exchange of multimodal patient data, and support downstream analysis. The first goal of this paper is to explore model alignment by characterizing the common data models using a newly developed data transformation process and evaluation method. Second, using OMOP normalized clinical data, we evaluate the mapping of real-world patient data to Phenopackets. We evaluate the suitability of Phenopackets as a patient data representation for real-world clinical cases. METHODS: We identified mappings between OMOP and Phenopackets and applied them to a real patient dataset to assess the transformation's success. We analyzed gaps between the models and identified key considerations for transforming data between them. Further, to improve ambiguous alignment, we incorporated Unified Medical Language System (UMLS) semantic type-based filtering to direct individual concepts to their most appropriate domain and conducted a domain-expert evaluation of the mapping's clinical utility. RESULTS: The OMOP to Phenopacket transformation pipeline was executed for 1,000 Alzheimer's disease patients and successfully mapped all required entities. However, due to missing values in OMOP for required Phenopacket attributes, 10.2 % of records were lost. The use of UMLS-semantic type filtering for ambiguous alignment of individual concepts resulted in 96 % agreement with clinical thinking, increased from 68 % when mapping exclusively by domain correspondence. CONCLUSION: This study presents a pipeline to transform data from OMOP to Phenopackets. We identified considerations for the transformation to ensure data quality, handling restrictions for successful Phenopacket validation and discrepant data formats. 
We identified unmappable Phenopacket attributes that focus on specialty use cases, such as genomics or oncology, which OMOP does not currently support. We introduce UMLS semantic type filtering to resolve ambiguous alignment to Phenopacket entities to be most appropriate for real-world interpretation. We provide a systematic approach to align OMOP and Phenopackets schemas. Our work facilitates future use of Phenopackets in clinical applications by addressing key barriers to interoperability when deriving a Phenopacket from real-world patient data.
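
The idea of semantic-type filtering described in this abstract can be sketched as a two-step routing rule: use the semantic type when it settles the question, and otherwise fall back to plain domain correspondence. Both lookup tables below are hypothetical; the abstract does not list the actual mappings used in the pipeline.

```python
# Hypothetical default routing from an OMOP domain to a Phenopacket field.
DOMAIN_TO_FIELD = {
    "Condition": "diseases",
    "Measurement": "measurements",
    "Drug": "medicalActions",
    "Procedure": "medicalActions",
}

# Hypothetical overrides keyed by UMLS semantic type.
SEMANTIC_TYPE_TO_FIELD = {
    "Sign or Symptom": "phenotypicFeatures",
    "Finding": "phenotypicFeatures",
    "Disease or Syndrome": "diseases",
}


def route_concept(omop_domain, semantic_type=None):
    """Return the Phenopacket field for a concept, or None if unmappable."""
    # The semantic type, when informative, wins over the domain default.
    if semantic_type in SEMANTIC_TYPE_TO_FIELD:
        return SEMANTIC_TYPE_TO_FIELD[semantic_type]
    return DOMAIN_TO_FIELD.get(omop_domain)
```

For example, `route_concept("Condition", "Sign or Symptom")` sends a symptom recorded in the OMOP Condition domain to `phenotypicFeatures` instead of `diseases`.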


Subject(s)
Unified Medical Language System, Humans, Semantics, Electronic Health Records, Precision Medicine/methods, Translational Biomedical Research, Medical Informatics/methods, Natural Language Processing, Alzheimer Disease
3.
J Biomed Inform ; 154: 104649, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38697494

ABSTRACT

OBJECTIVE: Automated identification of eligible patients is a bottleneck of clinical research. We propose Criteria2Query (C2Q) 3.0, a system that leverages GPT-4 for the semi-automatic transformation of clinical trial eligibility criteria text into executable clinical database queries. MATERIALS AND METHODS: C2Q 3.0 integrated three GPT-4 prompts for concept extraction, SQL query generation, and reasoning. Each prompt was designed and evaluated separately. The concept extraction prompt was benchmarked against manual annotations from 20 clinical trials by two evaluators, who later also measured SQL generation accuracy and identified errors in GPT-generated SQL queries from 5 clinical trials. The reasoning prompt was assessed by three evaluators on four metrics: readability, correctness, coherence, and usefulness, using corrected SQL queries and an open-ended feedback questionnaire. RESULTS: Out of 518 concepts from 20 clinical trials, GPT-4 achieved an F1-score of 0.891 in concept extraction. For SQL generation, 29 errors spanning seven categories were detected, with logic errors being the most common (n = 10; 34.48 %). Reasoning evaluations yielded a high coherence rating, with the mean score being 4.70 but relatively lower readability, with a mean of 3.95. Mean scores of correctness and usefulness were identified as 3.97 and 4.37, respectively. CONCLUSION: GPT-4 significantly improves the accuracy of extracting clinical trial eligibility criteria concepts in C2Q 3.0. Continued research is warranted to ensure the reliability of large language models.


Subject(s)
Clinical Trials as Topic, Humans, Natural Language Processing, Software, Patient Selection
4.
J Biomed Inform ; 142: 104375, 2023 06.
Article in English | MEDLINE | ID: mdl-37141977

ABSTRACT

OBJECTIVE: Feasible, safe, and inclusive eligibility criteria are crucial to successful clinical research recruitment. Existing expert-centered methods for eligibility criteria selection may not be representative of real-world populations. This paper presents a novel model called OPTEC (OPTimal Eligibility Criteria) based on the Multiple Attribute Decision Making method boosted by an efficient greedy algorithm. METHODS: It systematically identifies the optimal criteria combination for a given medical condition with the optimal tradeoff among feasibility, patient safety, and cohort diversity. The model offers flexibility in attribute configurations and generalizability to various clinical domains. The model was evaluated on two clinical domains (i.e., Alzheimer's disease and Neoplasm of pancreas) using two datasets (i.e., MIMIC-III dataset and NewYork-Presbyterian/Columbia University Irving Medical Center (NYP/CUIMC) database). RESULTS: We simulated the process of automatically optimizing eligibility criteria according to user-specified prioritization preferences and generated recommendations based on the top-ranked criteria combination accordingly (top 0.41-2.75%) with OPTEC. Harnessing the power of the model, we designed an interactive criteria recommendation system and conducted a case study with an experienced clinical researcher using the think-aloud protocol. CONCLUSIONS: The results demonstrated that OPTEC could be used to recommend feasible eligibility criteria combinations, and to provide actionable recommendations for clinical study designers to construct a feasible, safe, and diverse cohort definition during early study design.
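
A minimal sketch of the general technique named in this abstract (multi-attribute scoring combined with a greedy search) is shown below. It is not the OPTEC implementation: the weighted-sum score, the stopping rule, the toy evaluation function, and every name and number are assumptions, and cohort diversity is left out to keep the example short.

```python
def weighted_score(attributes, weights):
    """Weighted sum of attribute values, each assumed to lie in [0, 1]."""
    return sum(weights[name] * attributes[name] for name in weights)


def greedy_select(criteria, evaluate, weights):
    """Add one criterion at a time while the weighted score keeps improving."""
    chosen = []
    best = weighted_score(evaluate(chosen), weights)
    remaining = list(criteria)
    while remaining:
        candidates = [
            (weighted_score(evaluate(chosen + [c]), weights), c) for c in remaining
        ]
        top_score, top_criterion = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining criterion improves the score
        chosen.append(top_criterion)
        remaining.remove(top_criterion)
        best = top_score
    return chosen, best


# Toy model: (fraction of patients retained, safety gain) per criterion.
TOY_EFFECTS = {"criterion_a": (0.9, 0.2), "criterion_b": (0.5, 0.05)}


def toy_evaluate(combo):
    feasibility, safety = 1.0, 0.5
    for c in combo:
        retained, gain = TOY_EFFECTS[c]
        feasibility *= retained
        safety = min(1.0, safety + gain)
    return {"feasibility": feasibility, "safety": safety}
```

With equal weights on feasibility and safety, the toy run keeps `criterion_a` and rejects `criterion_b`, because the latter halves the cohort for a small safety gain. A greedy search is fast but does not guarantee the globally best combination.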


Subject(s)
Algorithms, Research Design, Humans, Patient Selection, Eligibility Determination, Research Personnel
5.
J Biomed Inform ; 127: 104032, 2022 03.
Article in English | MEDLINE | ID: mdl-35189334

ABSTRACT

OBJECTIVE: To present an approach on using electronic health record (EHR) data that assesses how different eligibility criteria, either individually or in combination, can impact patient count and safety (exemplified by all-cause hospitalization risk) and further assist with criteria selection for prospective clinical trials. MATERIALS AND METHODS: Trials in three disease domains - relapsed/refractory (r/r) lymphoma/leukemia; hepatitis C virus (HCV); stages 3 and 4 chronic kidney disease (CKD) - were analyzed as case studies for this approach. For each disease domain, criteria were identified and all criteria combinations were used to create EHR cohorts. Per combination, two values were derived: (1) number of eligible patients meeting the selected criteria; (2) hospitalization risk, measured as the hazard ratio between those that qualified and those that did not. From these values, k-means clustering was applied to derive which criteria combinations maximized patient counts but minimized hospitalization risk. RESULTS: Criteria combinations that reduced hospitalization risk without substantial reductions on patient counts were as follows: for r/r lymphoma/leukemia (23 trials; 9 criteria; 623 patients), applying no infection and adequate absolute neutrophil count while forgoing no prior malignancy; for HCV (15; 7; 751), applying no human immunodeficiency virus and no hepatocellular carcinoma while forgoing no decompensated liver disease/cirrhosis; for CKD (10; 9; 23893), applying no congestive heart failure. CONCLUSIONS: Within each disease domain, the more drastic effects were generally driven by a few criteria. Similar criteria across different disease domains introduce different changes. Although results are contingent on the trial sample and the EHR data used, this approach demonstrates how EHR data can inform the impact on safety and available patients when exploring different criteria combinations for designing clinical trials.
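
The first derived value in this abstract, the number of eligible patients per criteria combination, can be sketched as an exhaustive enumeration. The data layout (one dict of booleans per patient) is an assumption, and the hazard-ratio and clustering steps are not shown.

```python
from itertools import combinations


def eligible_counts(patients, criteria):
    """Count patients satisfying every criterion in each possible combination.

    patients -- list of dicts mapping criterion name -> bool
                (True means the patient satisfies that criterion)
    criteria -- list of criterion names
    Returns a dict keyed by tuples of criterion names.
    """
    results = {}
    for size in range(len(criteria) + 1):
        for combo in combinations(criteria, size):
            # all() over an empty combo is True, so () counts every patient.
            results[combo] = sum(all(p[c] for c in combo) for p in patients)
    return results
```

With 9 criteria, as in two of the disease domains above, this enumerates 2^9 = 512 combinations, which is small enough to evaluate exhaustively.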


Subject(s)
Electronic Health Records, HIV Infections, Eligibility Determination, Humans, Patient Selection, Prospective Studies
6.
Nucleic Acids Res ; 47(W1): W566-W570, 2019 07 02.
Article in English | MEDLINE | ID: mdl-31106327

ABSTRACT

We present Doc2Hpo, an interactive web application that enables interactive and efficient phenotype concept curation from clinical text with automated concept normalization using the Human Phenotype Ontology (HPO). Users can edit the HPO concepts automatically extracted by Doc2Hpo in real time, and export the extracted HPO concepts into gene prioritization tools. Our evaluation showed that Doc2Hpo significantly reduced manual effort while achieving high accuracy in HPO concept curation. Doc2Hpo is freely available at https://impact2.dbmi.columbia.edu/doc2hpo/. The source code is available at https://github.com/stormliucong/doc2hpo for local installation for protected health data.


Subject(s)
Biological Ontologies, Data Curation, Phenotype, Software, Genes, Humans, Internet, User-Computer Interface
7.
J Med Internet Res ; 23(9): e31122, 2021 09 30.
Article in English | MEDLINE | ID: mdl-34543225

ABSTRACT

BACKGROUND: COVID-19 has threatened the health of tens of millions of people all over the world. Massive research efforts have been made in response to the COVID-19 pandemic. Utilization of clinical data can accelerate these research efforts to combat the pandemic since important characteristics of the patients are often found by examining the clinical data. Publicly accessible clinical data on COVID-19, however, remain limited despite the immediate need. OBJECTIVE: To provide shareable clinical data to catalyze COVID-19 research, we present Columbia Open Health Data for COVID-19 Research (COHD-COVID), a publicly accessible database providing clinical concept prevalence, clinical concept co-occurrence, and clinical symptom prevalence for hospitalized patients with COVID-19. COHD-COVID also provides data on hospitalized patients with influenza and general hospitalized patients as comparator cohorts. METHODS: The data used in COHD-COVID were obtained from NewYork-Presbyterian/Columbia University Irving Medical Center's electronic health records database. Condition, drug, and procedure concepts were obtained from the visits of identified patients from the cohorts. Rare concepts were excluded, and the true concept counts were perturbed using Poisson randomization to protect patient privacy. Concept prevalence, concept prevalence ratio, concept co-occurrence, and symptom prevalence were calculated using the obtained concepts. RESULTS: Concept prevalence and concept prevalence ratio analyses showed the clinical characteristics of the COVID-19 cohorts, confirming the well-known characteristics of COVID-19 (eg, acute lower respiratory tract infection and cough). The concepts related to the well-known characteristics of COVID-19 recorded high prevalence and high prevalence ratio in the COVID-19 cohort compared to the hospitalized influenza cohort and general hospitalized cohort. Concept co-occurrence analyses showed potential associations between specific concepts. 
In case of acute lower respiratory tract infection in the COVID-19 cohort, a high co-occurrence ratio was obtained with COVID-19-related concepts and commonly used drugs (eg, disease due to coronavirus and acetaminophen). Symptom prevalence analysis indicated symptom-level characteristics of the cohorts and confirmed that well-known symptoms of COVID-19 (eg, fever, cough, and dyspnea) showed higher prevalence than the hospitalized influenza cohort and the general hospitalized cohort. CONCLUSIONS: We present COHD-COVID, a publicly accessible database providing useful clinical data for hospitalized patients with COVID-19, hospitalized patients with influenza, and general hospitalized patients. We expect COHD-COVID to provide researchers and clinicians quantitative measures of COVID-19-related clinical features to better understand and combat the pandemic.
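
The privacy steps described in the methods above (dropping rare concepts, then perturbing counts with Poisson randomization) and the prevalence ratio can be sketched as follows. The threshold of 10, the function names, and the use of the true count as the Poisson mean are assumptions; the abstract does not give these details.

```python
import random


def poisson_sample(lam, rng):
    """Draw from Poisson(lam) by counting unit-rate exponential arrivals before time lam."""
    k = 0
    t = rng.expovariate(1.0)
    while t < lam:
        k += 1
        t += rng.expovariate(1.0)
    return k


def perturb_counts(counts, min_count=10, seed=None):
    """Drop concepts with count <= min_count, then replace each remaining
    true count n with a draw from Poisson(n)."""
    rng = random.Random(seed)
    return {c: poisson_sample(n, rng) for c, n in counts.items() if n > min_count}


def prevalence_ratio(count_a, total_a, count_b, total_b):
    """Prevalence of a concept in cohort A divided by its prevalence in cohort B."""
    return (count_a / total_a) / (count_b / total_b)
```

Because a Poisson draw has a standard deviation of about the square root of its mean, large counts change little in relative terms while small counts are blurred more, which is the intended privacy effect.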


Subject(s)
COVID-19, Human Influenza, Factual Databases, Humans, Human Influenza/epidemiology, Pandemics, SARS-CoV-2
8.
J Biomed Inform ; 100: 103325, 2019 12.
Article in English | MEDLINE | ID: mdl-31676459

ABSTRACT

This special communication describes activities, products, and lessons learned from a recent hackathon that was funded by the National Center for Advancing Translational Sciences via the Biomedical Data Translator program ('Translator'). Specifically, Translator team members self-organized and worked together to conceptualize and execute, over a five-day period, a multi-institutional clinical research study that aimed to examine, using open clinical data sources, relationships between sex, obesity, diabetes, and exposure to airborne fine particulate matter among patients with severe asthma. The goal was to develop a proof of concept that this new model of collaboration and data sharing could effectively produce meaningful scientific results and generate new scientific hypotheses. Three Translator Clinical Knowledge Sources, each of which provides open access (via Application Programming Interfaces) to data derived from the electronic health record systems of major academic institutions, served as the source of study data. Jupyter Python notebooks, shared in GitHub repositories, were used to call the knowledge sources and analyze and integrate the results. The results replicated established or suspected relationships between sex, obesity, diabetes, exposure to airborne fine particulate matter, and severe asthma. In addition, the results demonstrated specific differences across the three Translator Clinical Knowledge Sources, suggesting cohort- and/or environment-specific factors related to the services themselves or the catchment area from which each service derives patient data. Collectively, this special communication demonstrates the power and utility of intense, team-oriented hackathons and offers general technical, organizational, and scientific lessons learned.


Subject(s)
Asthma/physiopathology, Diabetes Mellitus/physiopathology, Environmental Exposure, Information Storage and Retrieval, Obesity/physiopathology, Particulate Matter/toxicity, Sex Factors, Asthma/complications, Female, Humans, Male, Obesity/complications, Severity of Illness Index
9.
J Biomed Inform ; 100: 103318, 2019 12.
Article in English | MEDLINE | ID: mdl-31655273

ABSTRACT

BACKGROUND: Manually curating standardized phenotypic concepts such as Human Phenotype Ontology (HPO) terms from narrative text in electronic health records (EHRs) is time consuming and error prone. Natural language processing (NLP) techniques can facilitate automated phenotype extraction and thus improve the efficiency of curating clinical phenotypes from clinical texts. While individual NLP systems can perform well for a single cohort, an ensemble-based method might shed light on increasing the portability of NLP pipelines across different cohorts. METHODS: We compared four NLP systems, MetaMapLite, MedLEE, ClinPhen and cTAKES, and four ensemble techniques, including intersection, union, majority-voting and machine learning, for extracting generic phenotypic concepts. We addressed two important research questions regarding automated phenotype recognition. First, we evaluated the performance of different approaches in identifying generic phenotypic concepts. Second, we compared the performance of different methods to identify patient-specific phenotypic concepts. To better quantify the effects caused by concept granularity differences on performance, we developed a novel evaluation metric that considered concept hierarchies and frequencies. Each of the approaches was evaluated on a gold standard set of clinical documents annotated by clinical experts. One dataset containing 1,609 concepts derived from 50 clinical notes from two different institutions was used in both evaluations, and an additional dataset of 608 concepts derived from 50 case report abstracts obtained from PubMed was used for evaluation of identifying generic phenotypic concepts only. RESULTS: For generic phenotypic concept recognition, the top three performers in the NYP/CUIMC dataset are union ensemble (F1, 0.634), training-based ensemble (F1, 0.632), and majority vote-based ensemble (F1, 0.622). 
In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.642), cTAKES (F1, 0.615), and MedLEE (F1, 0.559). In the PubMed dataset, the top three are majority vote-based ensemble (F1, 0.719), training-based (F1, 0.696) and MetaMapLite (F1, 0.694). For identifying patient-specific phenotypes, the top three performers in the NYP/CUIMC dataset are majority vote-based ensemble (F1, 0.610), MedLEE (F1, 0.609), and training-based ensemble (F1, 0.585). In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.604), cTAKES (F1, 0.531) and MedLEE (F1, 0.527). CONCLUSIONS: Our study demonstrates that ensembles of natural language processing systems can improve both generic phenotypic concept recognition and patient-specific phenotypic concept identification over individual systems. Among the individual NLP systems, each performed best when applied to the dataset it was primarily designed for. However, combining multiple NLP systems to create an ensemble can generally improve the performance. Specifically, the ensemble can increase the reproducibility of results across different cohorts and tasks, and thus provide a more portable phenotyping solution compared to individual NLP systems.
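
Three of the four ensemble techniques compared in this abstract (intersection, union, and majority voting) reduce to set operations over each system's extracted concepts. The sketch below assumes each NLP system's output is a set of concept identifiers; the function names are illustrative, and the machine-learning ensemble is not shown.

```python
from collections import Counter


def ensemble(outputs, method):
    """Combine concept sets from several NLP systems.

    outputs -- non-empty list of sets of concept IDs, one set per system
    method  -- "union", "intersection", or "majority"
    """
    if method == "union":
        return set().union(*outputs)
    if method == "intersection":
        return set.intersection(*outputs)
    if method == "majority":
        votes = Counter(c for out in outputs for c in out)
        # Keep concepts reported by more than half of the systems.
        return {c for c, v in votes.items() if v > len(outputs) / 2}
    raise ValueError(f"unknown method: {method}")


def f1(predicted, gold):
    """F1 score of a predicted concept set against a gold-standard set."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Union favors recall, intersection favors precision, and majority voting sits between the two, which is one plausible reason it ranks first or near the top across the datasets reported above.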


Subject(s)
Natural Language Processing, Phenotype, Datasets as Topic, Electronic Health Records, Humans, Reproducibility of Results
10.
J Biomed Inform ; 99: 103293, 2019 11.
Article in English | MEDLINE | ID: mdl-31542521

ABSTRACT

BACKGROUND: Implementation of phenotype algorithms requires phenotype engineers to interpret human-readable algorithms and translate the description (text and flowcharts) into computable phenotypes - a process that can be labor intensive and error prone. To address the critical need for reducing the implementation efforts, it is important to develop portable algorithms. METHODS: We conducted a retrospective analysis of phenotype algorithms developed in the Electronic Medical Records and Genomics (eMERGE) network and identified common customization tasks required for implementation. A novel scoring system was developed to quantify portability from three aspects: Knowledge conversion, clause Interpretation, and Programming (KIP). Tasks were grouped into twenty representative categories. Experienced phenotype engineers were asked to estimate the average time spent on each category and evaluate time saving enabled by a common data model (CDM), specifically the Observational Medical Outcomes Partnership (OMOP) model, for each category. RESULTS: A total of 485 distinct clauses (phenotype criteria) were identified from 55 phenotype algorithms, corresponding to 1153 customization tasks. In addition to 25 non-phenotype-specific tasks, 46 tasks are related to interpretation, 613 tasks are related to knowledge conversion, and 469 tasks are related to programming. A score between 0 and 2 (0 for easy, 1 for moderate, and 2 for difficult portability) is assigned for each aspect, yielding a total KIP score range of 0 to 6. The average clause-wise KIP score to reflect portability is 1.37 ± 1.38. Specifically, the average knowledge (K) score is 0.64 ± 0.66, interpretation (I) score is 0.33 ± 0.55, and programming (P) score is 0.40 ± 0.64. 5% of the categories can be completed within one hour (median). 70% of the categories take from days to months to complete. The OMOP model can assist with vocabulary mapping tasks. 
CONCLUSION: This study presents firsthand knowledge of the substantial implementation efforts in phenotyping and introduces a novel metric (KIP) to measure portability of phenotype algorithms for quantifying such efforts across the eMERGE Network. Phenotype developers are encouraged to analyze and optimize portability with regard to knowledge, interpretation, and programming. CDMs can be used to improve the portability for some 'knowledge-oriented' tasks.
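
The KIP score arithmetic described in this abstract (three aspect scores of 0, 1, or 2, summed to a total between 0 and 6) can be sketched in a few lines. The function names and the tuple layout are assumptions; how each aspect score is assigned to a clause is not covered here.

```python
def kip_score(knowledge, interpretation, programming):
    """Total KIP score for one clause: each aspect is 0 (easy), 1 (moderate), or 2 (difficult)."""
    for score in (knowledge, interpretation, programming):
        if score not in (0, 1, 2):
            raise ValueError("each aspect score must be 0, 1, or 2")
    return knowledge + interpretation + programming


def mean_kip(clauses):
    """Average total KIP score over a list of (K, I, P) tuples."""
    totals = [kip_score(k, i, p) for k, i, p in clauses]
    return sum(totals) / len(totals)
```

Because the total is a plain sum, the mean total equals the sum of the aspect means, which matches the figures reported above: 0.64 + 0.33 + 0.40 = 1.37.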


Subject(s)
Electronic Health Records/classification, Medical Informatics/methods, Algorithms, Genomics, Humans, Phenotype, Retrospective Studies
11.
Radiology ; 286(3): 1062-1071, 2018 03.
Article in English | MEDLINE | ID: mdl-29072980

ABSTRACT

Purpose: To assess the performance of computer-aided diagnosis (CAD) systems and to determine the dominant ultrasonographic (US) features when classifying benign versus malignant focal liver lesions (FLLs) by using contrast material-enhanced US cine clips. Materials and Methods: One hundred six US data sets in all subjects enrolled by three centers from a multicenter trial that included 54 malignant, 51 benign, and one indeterminate FLL were retrospectively analyzed. The 105 benign or malignant lesions were confirmed at histologic examination, contrast-enhanced computed tomography (CT), dynamic contrast-enhanced magnetic resonance (MR) imaging, and/or 6 or more months of clinical follow-up. Data sets included 3-minute cine clips that were automatically corrected for in-plane motion and automatically filtered to remove frames acquired off plane. B-mode and contrast-specific features were automatically extracted on a pixel-by-pixel basis and analyzed by using an artificial neural network (ANN) and a support vector machine (SVM). Areas under the receiver operating characteristic curve (AUCs) for CAD were compared with those for one experienced and one inexperienced blinded reader. A third observer graded cine quality to assess its effects on CAD performance. Results: CAD, the inexperienced observer, and the experienced observer were able to analyze 95, 100, and 102 cine clips, respectively. The AUCs for the SVM, ANN, and experienced and inexperienced observers were 0.883 (95% confidence interval [CI]: 0.793, 0.940), 0.829 (95% CI: 0.724, 0.901), 0.843 (95% CI: 0.756, 0.903), and 0.702 (95% CI: 0.586, 0.782), respectively; only the difference between SVM and the inexperienced observer was statistically significant. 
Accuracy improved from 71.3% (67 of 94; 95% CI: 60.6%, 79.8%) to 87.7% (57 of 65; 95% CI: 78.5%, 93.8%) and from 80.9% (76 of 94; 95% CI: 72.3%, 88.3%) to 90.3% (65 of 72; 95% CI: 80.6%, 95.8%) when CAD was in agreement with the inexperienced reader and when it was in agreement with the experienced reader, respectively. B-mode heterogeneity and contrast material washout were the most discriminating features selected by CAD for all iterations. CAD selected time-based time-intensity curve (TIC) features 99.0% (207 of 209) of the time to classify FLLs, versus 1.0% (two of 209) of the time for intensity-based features. None of the 15 video-quality criteria had a statistically significant effect on CAD accuracy; all P values were greater than the Holm-Sidak α-level correction for multiple comparisons. Conclusion: CAD systems classified benign and malignant FLLs with an accuracy similar to that of an expert reader. CAD improved the accuracy of both readers. Time-based features of TIC were more discriminating than intensity-based features. © RSNA, 2017. Online supplemental material is available for this article.
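
The "accuracy when CAD agrees with the reader" figures in this abstract are accuracies computed on the subset of cases where the two classifications coincide. A minimal sketch, assuming three parallel lists of labels (the function name and data layout are illustrative, not taken from the study):

```python
def accuracy_when_agreeing(cad, reader, truth):
    """Reader accuracy restricted to cases where CAD and the reader agree.

    cad, reader, truth -- equal-length lists of labels (e.g. 1 = malignant, 0 = benign)
    Returns (accuracy, number_of_agreeing_cases); accuracy is None if they never agree.
    """
    agreeing = [(r, t) for c, r, t in zip(cad, reader, truth) if c == r]
    if not agreeing:
        return None, 0
    correct = sum(r == t for r, t in agreeing)
    return correct / len(agreeing), len(agreeing)
```

This explains why the denominators shrink in the reported results: the inexperienced reader's accuracy is 67 of 94 overall (71.3%) but 57 of 65 (87.7%) on the cases where CAD gave the same answer.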


Subject(s)
Contrast Media/therapeutic use, Computer-Assisted Image Interpretation/methods, Liver Neoplasms/diagnostic imaging, Ultrasonography/methods, Humans, ROC Curve, Retrospective Studies
12.
J Plast Reconstr Aesthet Surg ; 88: 330-339, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38061257

ABSTRACT

BACKGROUND: Autologous breast reconstruction is composed of diverse techniques and results in a variety of outcome trajectories. We propose employing an unsupervised machine learning method to characterize such heterogeneous patterns in large-scale datasets. METHODS: A retrospective cohort study of autologous breast reconstruction patients was conducted through the National Surgical Quality Improvement Program database. Patient characteristics, intraoperative variables, and occurrences of acute postoperative complications were collected. The cohort was classified into patient subgroups via the K-means clustering algorithm, a similarity-based unsupervised learning approach. The characteristics of each cluster were compared for differences from the complementary sample (p < 2 × 10⁻⁴) and validated with a test set. RESULTS: A total of 14,274 female patients were included in the final study cohort. Clustering identified seven optimal subgroups, ordered by increasing rate of postoperative complication. Cluster 1 (2027 patients) featured breast reconstruction with free flaps (50%) and latissimus dorsi flaps (40%). In addition to its low rate of complications (14%, p < 2 × 10⁻⁴), its patient population was younger and with lower comorbidities when compared with the whole cohort. In the other extreme, cluster 7 (1112 patients) almost exclusively featured breast reconstruction with free flaps (94%) and possessed the highest rates of unplanned reoperations, readmissions, and dehiscence (p < 2 × 10⁻⁴). The reoperation profile of cluster 3 was also significantly different from the general cohort and featured lower proportions of vascular repair procedures (p < 8 × 10⁻⁴). CONCLUSIONS: This study presents a novel, generalizable application of an unsupervised learning model to organize patient subgroups with associations between comorbidities, modality of breast reconstruction, and postoperative outcomes.
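
For readers unfamiliar with the K-means algorithm named in this abstract, the following is a bare-bones sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its members). The first-k initialization and the tuple-based data layout are simplifying assumptions; the study's feature encoding, choice of k, and software are not given in the abstract.

```python
def kmeans(points, k, iters=100):
    """Plain K-means on a list of equal-length numeric tuples.

    Uses the first k points as initial centroids (a naive, deterministic
    choice made for this sketch). Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid by squared distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

In practice the result depends on the initialization and on how features are scaled, so production analyses typically rely on a tested library implementation with several random restarts.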


Subject(s)
Breast Neoplasms, Free Tissue Flaps, Mammaplasty, Humans, Female, Unsupervised Machine Learning, Retrospective Studies, Mammaplasty/methods, Postoperative Complications/etiology, Free Tissue Flaps/surgery, Breast Neoplasms/complications
13.
AMIA Jt Summits Transl Sci Proc ; 2023: 388-397, 2023.
Article in English | MEDLINE | ID: mdl-37350869

ABSTRACT

This reproducibility study presents an algorithm to weight biomedical embedding training by the race distribution data of clinical research study samples. We extracted 12,864 PubMed abstracts published between January 1st, 2000 and January 1st, 2022 and weighted them based on the race distribution data extracted from their corresponding clinical trials registered on ClinicalTrials.gov. We trained Word2vec and BERT embeddings and evaluated their performance on predicting length of hospital stay (LHS) and intensive care unit (ICU) readmission using MIMIC-IV electronic health record data. We observed that models trained using race-sensitive embeddings do not consistently outperform those trained using neutral embeddings when used for LHS prediction (with similar Mean Absolute Error 1.975 vs. 2.008) or ICU readmission prediction (with similar accuracy 74.61% vs. 75.17% and the same AUC 0.775). We conclude that demographic-sensitive embeddings do not necessarily significantly improve the accuracy of health predictive models as previously reported in the literature.

14.
J Am Med Inform Assoc ; 30(2): 256-272, 2023 01 18.
Article in English | MEDLINE | ID: mdl-36255273

ABSTRACT

OBJECTIVE: To identify and characterize clinical subgroups of hospitalized Coronavirus Disease 2019 (COVID-19) patients. MATERIALS AND METHODS: Electronic health records of hospitalized COVID-19 patients at NewYork-Presbyterian/Columbia University Irving Medical Center were temporally sequenced and transformed into patient vector representations using Paragraph Vector models. K-means clustering was performed to identify subgroups. RESULTS: A diverse cohort of 11 313 patients with COVID-19 and hospitalizations between March 2, 2020 and December 1, 2021 was identified; median [IQR] age: 61.2 [40.3-74.3]; 51.5% female. Twenty subgroups of hospitalized COVID-19 patients, labeled by increasing severity, were characterized by their demographics, conditions, outcomes, and severity (mild-moderate/severe/critical). Subgroup temporal patterns were characterized by the durations in each subgroup, transitions between subgroups, and the complete paths throughout the course of hospitalization. DISCUSSION: Several subgroups had mild-moderate severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections but were hospitalized for underlying conditions (pregnancy, cardiovascular disease [CVD], etc.). Subgroup 7 included solid organ transplant recipients who mostly developed mild-moderate or severe disease. Subgroup 9 had a history of type-2 diabetes, kidney disease, and CVD, and suffered the highest rates of heart failure (45.2%) and end-stage renal disease (80.6%). Subgroup 13 was the oldest (median: 82.7 years) and had mixed severity but high mortality (33.3%). Subgroup 17 had critical disease and the highest mortality (64.6%), with age (median: 68.1 years) being the only notable risk factor. Subgroups 18-20 had critical disease with high complication rates and long hospitalizations (median: 40+ days). All subgroups are detailed in the full text. A chord diagram depicts the most common transitions, and paths with the highest prevalence, longest hospitalizations, and lowest and highest mortalities are presented. Understanding these subgroups and their pathways may aid clinicians in their decisions for better management and earlier intervention for patients.
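
The temporal patterns described in this abstract (durations in each subgroup, transitions between subgroups, and complete paths) can be derived from a sequence of subgroup labels, as in the sketch below. The abstract does not state the time granularity or the data layout, so the one-label-per-day input and the function name `summarize_path` are assumptions for illustration.

```python
from collections import Counter
from itertools import groupby
from typing import List, Tuple


def summarize_path(daily_subgroups: List[int]) -> Tuple[List[Tuple[int, int]], Counter]:
    """Collapse a per-day subgroup sequence into runs and count transitions.

    Returns (runs, transitions), where runs is a list of (subgroup, days)
    tuples in order, and transitions counts each (from, to) subgroup change.
    """
    # Consecutive identical labels form one stay in that subgroup.
    runs = [(label, len(list(days))) for label, days in groupby(daily_subgroups)]
    # Every adjacent pair of runs is one transition between subgroups.
    transitions = Counter((a[0], b[0]) for a, b in zip(runs, runs[1:]))
    return runs, transitions


# Hypothetical patient: 2 days in subgroup 3, 3 days in 7, back to 3, then 18.
runs, transitions = summarize_path([3, 3, 7, 7, 7, 3, 18])
# runs == [(3, 2), (7, 3), (3, 1), (18, 1)]
```

Summing the transition counters across patients would give the pairwise flows that a chord diagram visualizes.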


Subject(s)
COVID-19, Cardiovascular Diseases, Humans, Female, Middle Aged, Aged, Male, SARS-CoV-2, Electronic Health Records, Hospitalization
15.
Res Sq ; 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38045411

ABSTRACT

Rare disease patients often endure prolonged diagnostic odysseys and may remain undiagnosed for years. Selecting the appropriate genetic tests is crucial for timely diagnosis. Phenotypic features offer great potential for aiding genomic diagnosis in rare disease cases, and we see great promise in effectively integrating phenotypic information into genetic test selection workflows. In this study, we present a phenotype-driven molecular genetic test recommendation (Phen2Test) for pediatric rare disease diagnosis. Phen2Test was constructed using a frequency matrix of phecodes and demographic data recorded in the EHR before genetic tests were ordered, with the objective of streamlining the selection of molecular genetic tests (whole-exome / whole-genome sequencing, or gene panels) for clinicians with minimal genetic training. We developed and evaluated binary classifiers based on 1,005 individuals referred to genetic counselors for potential genetic evaluation. In the evaluation using the gold standard cohort, the model achieved strong performance with an AUROC of 0.82 and an AUPRC of 0.92. Furthermore, we tested the model on another silver standard cohort (n=6,458), achieving an overall AUROC of 0.72 and an AUPRC of 0.671. Phen2Test was adjusted to align with current clinical guidelines, showing superior performance with more recent data, demonstrating its potential for use within a learning healthcare system as a genomic medicine intervention that adapts to guideline updates. This study showcases the practical utility of phenotypic features in recommending molecular genetic tests with performance comparable to clinical geneticists. Phen2Test could assist clinicians with limited genetic training and knowledge to order appropriate genetic tests.
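
For readers unfamiliar with the AUROC figures quoted in this abstract, the sketch below computes AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. It is a generic pairwise formulation written for illustration, with hypothetical labels and scores; it is not the Phen2Test evaluation code.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive is scored
    higher, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = genetic test was ordered, 0 = not ordered.
example_auroc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])  # 0.75
```

The pairwise form is quadratic in cohort size, so a rank-based implementation would be preferable for cohorts as large as the silver standard one.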

16.
J Am Med Inform Assoc ; 30(6): 1022-1031, 2023 05 19.
Article in English | MEDLINE | ID: mdl-36921288

ABSTRACT

OBJECTIVE: To develop a computable representation for medical evidence and to contribute a gold standard dataset of annotated randomized controlled trial (RCT) abstracts, along with a natural language processing (NLP) pipeline for transforming free-text RCT evidence in PubMed into the structured representation. MATERIALS AND METHODS: Our representation, EvidenceMap, consists of 3 levels of abstraction: Medical Evidence Entity, Proposition and Map, to represent the hierarchical structure of medical evidence composition. Randomly selected RCT abstracts were annotated following EvidenceMap based on the consensus of 2 independent annotators to train an NLP pipeline. Via a user study, we measured how the EvidenceMap improved evidence comprehension and analyzed its representative capacity by comparing the evidence annotation with EvidenceMap representation and without following any specific guidelines. RESULTS: Two corpora including 229 disease-agnostic and 80 COVID-19 RCT abstracts were annotated, yielding 12 725 entities and 1602 propositions. EvidenceMap saves users 51.9% of the time compared to reading raw-text abstracts. Most evidence elements identified during the freeform annotation were successfully represented by EvidenceMap, and users gave the enrollment, study design, and study Results sections mean 5-scale Likert ratings of 4.85, 4.70, and 4.20, respectively. The end-to-end evaluations of the pipeline show that the evidence proposition formulation achieves F1 scores of 0.84 and 0.86 in the adjusted random index score. CONCLUSIONS: EvidenceMap extends the participant, intervention, comparator, and outcome framework into 3 levels of abstraction for transforming free-text evidence from the clinical literature into a computable structure. It can be used as an interoperable format for better evidence retrieval and synthesis and an interpretable representation to efficiently comprehend RCT findings.
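
The three levels of abstraction named in this abstract (Medical Evidence Entity, Proposition, and Map) suggest a nested data structure, sketched below. All class names, field names, and the example content are hypothetical; the abstract does not specify the actual EvidenceMap schema, and the entity roles here are borrowed only from the participant, intervention, comparator, and outcome framework it mentions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceEntity:
    """Lowest level: a text span with a role (hypothetical field names)."""
    text: str
    role: str  # e.g. "participant", "intervention", "comparator", "outcome"


@dataclass
class EvidenceProposition:
    """Middle level: a group of entities that together state one finding."""
    entities: List[EvidenceEntity]


@dataclass
class EvidenceMapSketch:
    """Top level: all propositions extracted from one abstract."""
    propositions: List[EvidenceProposition] = field(default_factory=list)

    def entity_count(self) -> int:
        return sum(len(p.entities) for p in self.propositions)


# Invented example content, for illustration only.
proposition = EvidenceProposition(
    entities=[
        EvidenceEntity("adults with condition X", "participant"),
        EvidenceEntity("drug A", "intervention"),
        EvidenceEntity("placebo", "comparator"),
        EvidenceEntity("30-day mortality", "outcome"),
    ]
)
evidence_map = EvidenceMapSketch(propositions=[proposition])
```

Keeping entities inside propositions, and propositions inside a map, mirrors the hierarchical composition the abstract describes and makes counts such as entities per abstract straightforward to compute.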


Subject(s)
COVID-19, Comprehension, Humans, Natural Language Processing, PubMed
17.
AMIA Jt Summits Transl Sci Proc ; 2023: 281-290, 2023.
Article in English | MEDLINE | ID: mdl-37350899

ABSTRACT

Participant recruitment continues to be a challenge to the success of randomized controlled trials, resulting in increased costs, extended trial timelines, and delayed treatment availability. The literature provides evidence that study design features (e.g., trial phase, study site involvement) and trial sponsor are significantly associated with recruitment success. Principal investigators oversee the conduct of clinical trials, including recruitment. Through a cross-sectional survey and a thematic analysis of free-text responses, we assessed the perceptions of sixteen principal investigators regarding success factors for participant recruitment. From the perspective of the principal investigators, study site involvement and funding source do not necessarily make recruitment easier or more challenging. The most commonly used recruitment strategies are also the most effort-inefficient (e.g., in-person recruitment, reviewing the electronic medical records for prescreening). Finally, we recommend actionable steps, such as improving staff support and leveraging informatics-driven approaches, to allow clinical researchers to enhance participant recruitment.

18.
J Clin Transl Sci ; 7(1): e199, 2023.
Article in English | MEDLINE | ID: mdl-37830010

ABSTRACT

Background: Randomized clinical trials (RCTs) are the foundation for medical advances, but participant recruitment remains a persistent barrier to their success. This retrospective data analysis aims to (1) identify clinical trial features associated with successful participant recruitment, measured by accrual percentage, and (2) compare the characteristics of the RCTs with the most and least successful recruitment, as defined by varying thresholds of accrual percentage: ≥ 90% vs ≤ 10%, ≥ 80% vs ≤ 20%, and ≥ 70% vs ≤ 30%. Methods: Data from the internal research registry at Columbia University Irving Medical Center and Aggregated Analysis of ClinicalTrials.gov were collected for 393 randomized interventional treatment studies closed to further enrollment. We compared two regularized linear regression and six tree-based machine learning models for predicting accrual percentage (i.e., reported accrual to date divided by the target accrual). The best-performing model and Tree SHapley Additive exPlanations were used for feature importance analysis for participant recruitment. The identified features were compared between the two subgroups. Results: The CatBoost regressor outperformed the other models. Key features positively associated with recruitment success, as measured by accrual percentage, include government funding and compensation. Meanwhile, cancer research and non-conventional recruitment methods (e.g., websites) are negatively associated with recruitment success. Statistically significant subgroup differences (corrected p-value < .05) were found in 15 of the top 30 most important features. Conclusion: This multi-source retrospective study highlighted key features influencing RCT participant recruitment, offering actionable steps for improvement, including flexible recruitment infrastructure and appropriate participant compensation.
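
The accrual percentage defined in this abstract (reported accrual to date divided by the target accrual) and the threshold-based split into most and least successful trials can be sketched as follows. The function names, the returned labels, and the example counts are assumptions for illustration; only the formula and the threshold pairs come from the abstract.

```python
from typing import Optional


def accrual_percentage(reported_accrual: int, target_accrual: int) -> float:
    """Reported accrual to date divided by target accrual, as a percentage."""
    if target_accrual <= 0:
        raise ValueError("target accrual must be positive")
    return 100.0 * reported_accrual / target_accrual


def recruitment_subgroup(pct: float, high: float = 90.0, low: float = 10.0) -> Optional[str]:
    """Assign a trial to a comparison subgroup for one threshold pair.

    Trials between the two thresholds belong to neither subgroup (None).
    """
    if pct >= high:
        return "most successful"
    if pct <= low:
        return "least successful"
    return None


# Hypothetical trials: (reported accrual, target accrual).
pct_a = accrual_percentage(45, 50)   # 90.0
pct_b = accrual_percentage(3, 60)    # 5.0
group_a = recruitment_subgroup(pct_a)            # "most successful"
group_b = recruitment_subgroup(pct_b)            # "least successful"
group_c = recruitment_subgroup(50.0, 80.0, 20.0)  # None
```

Passing `high=80.0, low=20.0` or `high=70.0, low=30.0` reproduces the other two threshold pairs listed in the abstract.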

19.
J Clin Transl Sci ; 7(1): e214, 2023.
Article in English | MEDLINE | ID: mdl-37900350

ABSTRACT

Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly "Question-of-the-Month (QotM) Challenge" series. Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.

20.
Stud Health Technol Inform ; 290: 1054-1055, 2022 Jun 06.
Article in English | MEDLINE | ID: mdl-35673202

ABSTRACT

Bidirectional recurrent neural networks (RNNs) have improved the performance of various natural language processing tasks and have recently been used for diagnosis prediction. The advantages of general bidirectional RNNs, however, do not readily carry over to the diagnosis prediction task. In this study, we present a simple way to efficiently apply bidirectional RNNs to diagnosis prediction without using any additional networks or parameters.


Subject(s)
Natural Language Processing, Neural Networks, Computer