Results 1 - 20 of 105
1.
J Biomed Inform ; 155: 104656, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38782170

ABSTRACT

OBJECTIVE: Healthcare continues to grapple with the persistent issue of treatment disparities, sparking concerns regarding the equitable allocation of treatments in clinical practice. While various fairness metrics have emerged to assess fairness in decision-making processes, a growing focus has been on causality-based fairness concepts due to their capacity to mitigate confounding effects and reason about bias. However, the application of causal fairness notions in evaluating the fairness of clinical decision-making with electronic health record (EHR) data remains an understudied domain. This study aims to address the methodological gap in assessing causal fairness of treatment allocation with electronic health records data. In addition, we investigate the impact of social determinants of health on the assessment of causal fairness of treatment allocation. METHODS: We propose a causal fairness algorithm to assess fairness in clinical decision-making. Our algorithm accounts for the heterogeneity of patient populations and identifies potential unfairness in treatment allocation by conditioning on patients who have the same likelihood to benefit from the treatment. We apply this framework to a patient cohort with coronary artery disease derived from an EHR database to evaluate the fairness of treatment decisions. RESULTS: Our analysis reveals notable disparities in coronary artery bypass grafting (CABG) allocation among different patient groups. Women were found to be 4.4%-7.7% less likely to receive CABG than men in two out of four treatment response strata. Similarly, Black or African American patients were 5.4%-8.7% less likely to receive CABG than others in three out of four response strata. These results were similar when social determinants of health (insurance and area deprivation index) were dropped from the algorithm. 
These findings highlight the presence of disparities in treatment allocation among similar patients, suggesting potential unfairness in the clinical decision-making process. CONCLUSION: This study introduces a novel approach for assessing the fairness of treatment allocation in healthcare. By incorporating responses to treatment into the fairness framework, our method explores the potential of quantifying fairness from a causal perspective using EHR data. Our research advances the methodological development of fairness assessment in healthcare and highlights the importance of causality in determining treatment fairness.


Subject(s)
Algorithms; Electronic Health Records; Humans; Male; Female; Clinical Decision-Making; Coronary Artery Disease/therapy; Healthcare Disparities; Middle Aged; Social Determinants of Health; Causality
2.
J Biomed Inform ; 142: 104343, 2023 06.
Article in English | MEDLINE | ID: mdl-36935011

ABSTRACT

Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.


Subject(s)
Data Science; Medical Informatics; Humans; Electronic Health Records; Natural Language Processing; Narration
3.
J Emerg Med ; 64(1): 83-92, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36450614

ABSTRACT

BACKGROUND: Work Relative Value Units (wRVUs) are a component of many compensation models, and a proxy for the effort required to care for a patient. Accurate prediction of wRVUs generated per patient at triage could facilitate real-time load balancing between physicians and provide many practical operational and clinical benefits. OBJECTIVE: We examined whether deep-learning approaches could predict the wRVUs generated by a patient's visit using data commonly available at triage. METHODS: Adult patients presenting to an urban, academic emergency department from July 1, 2016-March 1, 2020 were included. Deidentified triage information included structured data (age, sex, vital signs, Emergency Severity Index score, language, race, standardized chief complaint) and unstructured data (free-text chief complaint) with wRVUs as outcome. Five models were examined: average wRVUs per chief complaint, linear regression, neural network and gradient-boosted tree on structured data, and neural network on unstructured textual data. Models were evaluated using mean absolute error. RESULTS: We analyzed 204,064 visits between July 1, 2016 and March 1, 2020. The median wRVUs were 3.80 (interquartile range 2.56-4.21), with significant effects of age, gender, and race. Models demonstrated lower error as complexity increased. Predictions using averages from chief complaints alone demonstrated a mean error of 2.17 predicted wRVUs per visit (95% confidence interval [CI] 2.07-2.27), the linear regression model: 1.00 wRVUs (95% CI 0.97-1.04), gradient-boosted tree: 0.85 wRVUs (95% CI 0.84-0.86), neural network with structured data: 0.86 wRVUs (95% CI 0.85-0.87), and neural network with unstructured data: 0.78 wRVUs (95% CI 0.76-0.80). CONCLUSIONS: Chief complaints are a poor predictor of the effort needed to evaluate a patient; however, deep-learning techniques show promise. 
These algorithms have the potential to provide many practical applications, including balancing workloads and compensation between emergency physicians, quantifying crowding and mobilizing resources, and reducing bias in the triage process.
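The models in this abstract are compared by mean absolute error (MAE). As a minimal sketch of that metric, assuming made-up wRVU values rather than any data from the study, MAE can be computed as follows; the function name `mean_absolute_error` and the sample numbers are illustrative only.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical wRVUs for four visits (illustrative values, not study data).
observed = [3.8, 2.5, 4.2, 3.0]
predicted = [3.5, 3.0, 4.0, 3.5]

mae = mean_absolute_error(observed, predicted)
print(f"MAE: {mae:.3f} wRVUs per visit")
```

A lower MAE means predictions sit closer, on average, to the wRVUs actually generated, which is the sense in which the abstract ranks its five models.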


Subject(s)
Emergency Service, Hospital; Workload; Adult; Humans; Triage/methods; Algorithms; Machine Learning
4.
J Gerontol Nurs ; 49(4): 6-11, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36989473

ABSTRACT

The current study examined the frequency and predictors of older adults' engagement with symptom reporting in COVIDWATCHER, a mobile health (mHealth) citizen science application. Citizen science is a type of participatory research that leverages information provided by community members. There were 1,028 COVIDWATCHER participants who engaged with symptom reporting between April 2020 and January 2021. Approximately 13.5% (n = 139) were adults aged ≥65 years. We used a Wilcoxon test to compare the mean frequency of engagement with symptom reporting by older adults (i.e., aged ≥65 years) to younger adults (i.e., aged ≤64 years) and multivariable linear regression to explore the predictors of engagement with symptom reporting. There was a significant difference in engagement with symptom reporting between adults aged ≥65 years compared to those aged ≤64 years (p < 0.001). In our final model, age (β = 26.0; 95% confidence interval [14.8, 34.2]) was a significant predictor for engagement with symptom reporting. These results help further our understanding of older adult engagement with mHealth-enabled citizen science for symptom reporting. [Journal of Gerontological Nursing, 49(4), 6-11.].


Subject(s)
COVID-19; Citizen Science; Telemedicine; Humans; Aged; COVID-19/epidemiology
5.
Med Educ ; 54(12): 1159-1170, 2020 12.
Article in English | MEDLINE | ID: mdl-32776345

ABSTRACT

OBJECTIVES: Observed Structured Clinical Exams (OSCEs) allow assessment of, and provide feedback to, medical students. Clinical examiners and standardised patients (SP) typically complete itemised checklists and global scoring scales, which have known shortcomings. In this study, we applied machine learning (ML) to label some communication skills and interview content information in OSCE transcripts and to compare several ML methodologies by performance and transferability. METHODS: One hundred and twenty-one transcripts of two OSCE scenarios were manually annotated per utterance across 19 communication skills and content areas. Utterances were converted to two types of numeric sentence vector representations and were paired with three types of ML algorithms. First, ML models (MLMs) were evaluated using a five-fold cross-validation technique on all transcripts in one scenario to generate precision and recall, and their harmonic mean, F1 scores. Second, ML models were trained on all 101 transcripts from scenario 1 and tested for transferability on 20 scenario 2 transcripts. RESULTS: Performance testing in the K-fold cross-validation demonstrated relatively high mean F1 scores: median 0.87 and range 0.53-0.98 across all 19 labels. Transferability testing demonstrated success: F1 median 0.76 and range 0.46-0.97. The combination of a bi-directional long short-term memory neural network (biLSTM) algorithm with GenSen numeric sentence vector representations was associated with greater F1 scores across both performance and transferability (P < .005). CONCLUSIONS: We report the first application of ML in the context of student-SP OSCEs. We demonstrated that several MLMs automatically labelled OSCE transcripts for a range of interview content and some clinical communications skills. Some MLMs achieved greater performance and transferability. 
Optimised MLMs could provide automated and accurate assessment of OSCEs with potential to track student progress and identify areas for further practice.
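The abstract reports precision, recall, and their harmonic mean (F1) per label. The sketch below shows how those three quantities relate for a single binary label; the function name and the toy label vectors are assumptions for illustration and are not drawn from the study.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for one binary label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical per-utterance annotations for one skill (1 = skill present).
annotated = [1, 1, 1, 0, 0, 1]
model_output = [1, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(annotated, model_output)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which makes it a stricter summary than a simple average.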


Subject(s)
Educational Measurement; Students, Medical; Clinical Competence; Communication; Humans; Machine Learning
6.
J Clin Monit Comput ; 33(1): 95-105, 2019 Feb.
Article in English | MEDLINE | ID: mdl-29556884

ABSTRACT

To develop and validate a prediction model for delayed cerebral ischemia (DCI) after subarachnoid hemorrhage (SAH) using a temporal unsupervised feature engineering approach, demonstrating improved precision over standard features. 488 consecutive SAH admissions from 2006 to 2014 to a tertiary care hospital were included. Models were trained on 80%, while 20% were set aside for validation testing. Baseline information and standard grading scales were evaluated: age, sex, Hunt Hess grade, modified Fisher Scale (mFS), and Glasgow Coma Scale (GCS). An unsupervised approach applying random kernels was used to extract features from physiological time series (systolic and diastolic blood pressure, heart rate, respiratory rate, and oxygen saturation). Classifiers (Partial Least Squares, linear and kernel Support Vector Machines) were trained on feature subsets of the derivation dataset. Models were applied to the validation dataset. The performances of the best classifiers on the validation dataset are reported by feature subset. Standard grading scale (mFS): AUC 0.58. Combined demographics and grading scales: AUC 0.60. Random kernel derived physiologic features: AUC 0.74. Combined baseline and physiologic features with redundant feature reduction: AUC 0.77. Current DCI prediction tools rely on admission imaging and are advantageously simple to employ. However, using an agnostic and computationally inexpensive learning approach for high-frequency physiologic time series data, we demonstrated that our models achieve higher classification accuracy.
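The feature subsets above are compared by area under the ROC curve (AUC). One way to read an AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, not the authors' code or data.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores (1 = event occurred, 0 = it did not).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

auc = auc_by_pairs(labels, scores)
print(f"AUC: {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking, which puts the reported values of 0.58 to 0.77 in context.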


Subject(s)
Brain Ischemia/diagnostic imaging; Diagnosis, Computer-Assisted/methods; Subarachnoid Hemorrhage/diagnostic imaging; Aged; Area Under Curve; Critical Care; False Positive Reactions; Female; Glasgow Coma Scale; Humans; Least-Squares Analysis; Male; Middle Aged; Patient Admission; Predictive Value of Tests; Reproducibility of Results; Risk Factors; Severity of Illness Index; Support Vector Machine; Tertiary Care Centers; Time Factors
7.
J Biomed Inform ; 85: 21-29, 2018 09.
Article in English | MEDLINE | ID: mdl-30036675

ABSTRACT

Most laboratory results are valid for only a certain time period (the laboratory test's shelf-life), after which they are outdated and the test needs to be re-administered. Currently, laboratory test shelf-lives are not centrally available anywhere; they exist only as the implicit knowledge of doctors. In this work, we propose an automated method to learn laboratory test-specific shelf-life by identifying prevalent laboratory test order patterns in electronic health records. The resulting shelf-lives performed well in the evaluation of internal validity, clinical interpretability, and external validity.


Subject(s)
Clinical Laboratory Techniques/statistics & numerical data; Blood Glucose/analysis; Clinical Laboratory Techniques/standards; Computational Biology; Electronic Health Records/statistics & numerical data; Humans; Longitudinal Studies; Models, Statistical; Phenotype; Reproducibility of Results; Time Factors
8.
J Biomed Inform ; 69: 1-9, 2017 05.
Article in English | MEDLINE | ID: mdl-28323113

ABSTRACT

Identifying topics of discussions in online health communities (OHC) is critical to various information extraction applications, but can be difficult because topics of OHC content are usually heterogeneous and domain-dependent. In this paper, we provide a multi-class schema, an annotated dataset, and supervised classifiers based on convolutional neural network (CNN) and other models for the task of classifying discussion topics. We apply the CNN classifier to the most popular breast cancer online community, and carry out cross-sectional and longitudinal analyses to show topic distributions and topic dynamics throughout members' participation. Our experimental results suggest that CNN outperforms other classifiers in the task of topic classification and identify several patterns and trajectories. For example, although members discuss mainly disease-related topics, their interest may change through time and vary with their disease severities.


Subject(s)
Breast Neoplasms; Internet; Neural Networks, Computer; Cross-Sectional Studies; Female; Humans; Patient Participation
9.
J Biomed Inform ; 60: 334-41, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26923634

ABSTRACT

Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentations are worth exploiting toward this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embedding, and word embedding. We experiment on a novel dataset of 36,828 clinical notes with 5103 gold-standard speculation annotations on 2000 notes, and compare the systems in which word embeddings are calculated based on word segmentations given by general and by domain specific segmenters respectively. Our systems are able to reach performance as high as 92.2% measured by F score. We demonstrate that word segmentation is critical to produce high quality word embedding to facilitate downstream information extraction applications, and suggest that a domain dependent word segmenter can be vital to such a clinical NLP task in Chinese language.


Subject(s)
Data Mining/methods; Electronic Health Records/instrumentation; Natural Language Processing; China; Computer Systems; Humans; Language; Medical Informatics/methods; Reproducibility of Results; Workflow
10.
J Biomed Inform ; 58: 156-165, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26464024

ABSTRACT

We present the Unsupervised Phenome Model (UPhenome), a probabilistic graphical model for large-scale discovery of computational models of disease, or phenotypes. We tackle this challenge through the joint modeling of a large set of diseases and a large set of clinical observations. The observations are drawn directly from heterogeneous patient record data (notes, laboratory tests, medications, and diagnosis codes), and the diseases are modeled in an unsupervised fashion. We apply UPhenome to two qualitatively different mixtures of patients and diseases: records of extremely sick patients in the intensive care unit with constant monitoring, and records of outpatients regularly followed by care providers over multiple years. We demonstrate that the UPhenome model can learn from these different care settings, without any additional adaptation. Our experiments show that (i) the learned phenotypes combine the heterogeneous data types more coherently than baseline LDA-based phenotypes; (ii) they each represent single diseases rather than a mix of diseases more often than the baseline ones; and (iii) when applied to unseen patient records, they are correlated with the patients' ground-truth disorders. Code for training, inference, and quantitative evaluation is made available to the research community.


Subject(s)
Electronic Health Records; Learning; Probability; Humans; Phenotype
11.
J Biomed Inform ; 51: 24-34, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24727481

ABSTRACT

Electronic health record (EHR) data show promise for deriving new ways of modeling human disease states. Although EHR researchers often use numerical values of laboratory tests as features in disease models, a great deal of information is contained in the context within which a laboratory test is taken. For example, the same numerical value of a creatinine test has different interpretation for a chronic kidney disease patient and a patient with acute kidney injury. We study whether EHR research studies are subject to biased results and interpretations if laboratory measurements taken in different contexts are not explicitly separated. We show that the context of a laboratory test measurement can often be captured by the way the test is measured through time. We perform three tasks to study the properties of these temporal measurement patterns. In the first task, we confirm that laboratory test measurement patterns provide additional information to the stand-alone numerical value. The second task identifies three measurement pattern motifs across a set of 70 laboratory tests performed for over 14,000 patients. Of these, one motif exhibits properties that can lead to biased research results. In the third task, we demonstrate the potential for biased results on a specific example. We conduct an association study of lipase test values to acute pancreatitis. We observe a diluted signal when using only a lipase value threshold, whereas the full association is recovered when properly accounting for lipase measurements in different contexts (leveraging the lipase measurement patterns to separate the contexts). Aggregating EHR data without separating distinct laboratory test measurement patterns can intermix patients with different diseases, leading to the confounding of signals in large-scale EHR analyses. This paper presents a methodology for leveraging measurement frequency to identify and reduce laboratory test biases.


Subject(s)
Artifacts; Clinical Laboratory Information Systems/statistics & numerical data; Data Interpretation, Statistical; Data Mining/methods; Electronic Health Records/classification; Electronic Health Records/statistics & numerical data; Pattern Recognition, Automated/methods; Clinical Laboratory Information Systems/classification; Confounding Factors, Epidemiologic; New York
12.
Sci Adv ; 10(4): eadf9033, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38266089

ABSTRACT

Without comprehensive examination of available literature on health disparities and minority health (HDMH), the field is left vulnerable to disproportionately focus on specific populations or conditions, curtailing our ability to fully advance health equity. Using scalable open-source methods, we conducted a computational scoping review of more than 200,000 articles to investigate major populations, conditions, and themes as well as notable gaps. We also compared trends in studied conditions to their relative prevalence using insurance claims (42 million Americans). HDMH publications represent 1% of articles in Medical Literature Analysis and Retrieval System Online (MEDLINE). Most studies are observational in nature, although randomized trial reporting has increased fivefold in the past 20 years. Half of HDMH articles concentrate on only three disease groups (cancer, mental health, and endocrine/metabolic disorders), while hearing, vision, and skin-related conditions are among the least well represented despite substantial prevalence. To support further investigation, we present HDMH Monitor, an interactive dashboard and repository generated from the HDMH bibliome.


Subject(s)
Hearing; Minority Health; Humans; Mental Health; Health Inequities
13.
medRxiv ; 2024 Apr 12.
Article in English | MEDLINE | ID: mdl-38645158

ABSTRACT

Sexually transmitted infections (STIs) continue to pose a substantial public health challenge in the United States (US). Surveillance, a cornerstone of disease control and prevention, can be strengthened to promote more timely, efficient, and equitable practices by incorporating health information exchange (HIE) and other large-scale health data sources into reporting. New York City patient-level electronic health record data between January 1, 2018 and June 30, 2023 were obtained from Healthix, the largest US public HIE. Healthix data were linked to neighborhood-level information from the American Community Survey. In this case-control study, chlamydia, gonorrhea, and HIV-positive cases were compared to controls to estimate the odds of receiving a specific laboratory test or positive result using generalized estimating equations with logit function and robust standard errors. Among 1,519,121 tests performed for chlamydia, 1,574,772 for gonorrhea, and 1,200,560 for HIV, 2%, 0.6% and 0.3% were positive for chlamydia, gonorrhea, and HIV, respectively. Chlamydia and gonorrhea co-occurred in 1,854 cases (7% of chlamydia and 21% of gonorrhea total cases). Testing behavior was often incongruent with geographic and sociodemographic patterns of positive cases. For example, people living in areas with the highest levels of poverty were less likely to test for gonorrhea but almost twice as likely to test positive compared to those in low poverty areas. Regional HIE enabled review of testing and cases using granular and complementary data not typically available given existing reporting practices. Enhanced surveillance spotlights potential incongruencies between testing patterns and STI risk in certain populations, signaling potential under- and over-testing. These and future insights derived from HIE data may be used to continuously inform public health practice and drive further improvements in provision and evaluation of services and programs.

14.
BMJ Qual Saf ; 33(2): 132-135, 2024 01 19.
Article in English | MEDLINE | ID: mdl-38071526

ABSTRACT

Studying near-miss errors is essential to preventing errors from reaching patients. When an error is committed, it may be intercepted (near-miss) or it will reach the patient; estimates of the proportion that reach the patient vary widely. To better understand this relationship, we conducted a retrospective cohort study using two objective measures to identify wrong-patient imaging order errors involving radiation, estimating the proportion of errors that are intercepted and those that reach the patient. This study was conducted at a large integrated healthcare system using data from 1 January to 31 December 2019. The study used two outcome measures of wrong-patient orders: (1) wrong-patient orders that led to misadministration of radiation reported to the New York Patient Occurrence Reporting and Tracking System (NYPORTS) (misadministration events); and (2) wrong-patient orders identified by the Wrong-Patient Retract-and-Reorder (RAR) measure, a measure identifying orders placed for a patient, retracted and rapidly reordered by the same clinician on a different patient (near-miss events). All imaging orders that involved radiation were extracted retrospectively from the healthcare system data warehouse. Among 293 039 total eligible orders, 151 were wrong-patient orders (3 misadministration events, 148 near-miss events), for an overall rate of 51.5 per 100 000 imaging orders involving radiation placed on the wrong patient. Of all wrong-patient imaging order errors, 2% reached the patient, translating to 50 near-miss events for every 1 error that reached the patient. This proportion provides a more accurate and reliable estimate and reinforces the utility of systematic measure of near-miss errors as an outcome for preventative interventions.
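The rate and ratio quoted in this abstract follow directly from its reported counts. The short calculation below reproduces them as a check on the arithmetic; the variable names are illustrative, and the only inputs are the three counts stated in the abstract.

```python
# Counts reported in the abstract.
total_orders = 293_039
misadministration_events = 3   # wrong-patient orders that reached the patient
near_miss_events = 148         # wrong-patient orders that were intercepted

wrong_patient_orders = misadministration_events + near_miss_events

# Wrong-patient orders per 100,000 imaging orders involving radiation.
rate_per_100k = wrong_patient_orders / total_orders * 100_000

# Share of wrong-patient orders that reached the patient.
share_reaching_patient = misadministration_events / wrong_patient_orders

# Near-miss events for every error that reached the patient.
near_misses_per_reached = near_miss_events / misadministration_events

print(f"wrong-patient orders: {wrong_patient_orders}")
print(f"rate per 100,000 orders: {rate_per_100k:.1f}")
print(f"share reaching patient: {share_reaching_patient:.1%}")
print(f"near misses per error reaching patient: {near_misses_per_reached:.1f}")
```

The unrounded ratio is about 49.3 near misses per error that reached the patient, which the abstract reports as approximately 50 to 1.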


Subject(s)
Delivery of Health Care, Integrated; Humans; Retrospective Studies; New York
15.
medRxiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38712122

ABSTRACT

Background: Endometriosis affects 10% of reproductive-age women, and yet, it goes undiagnosed for 3.6 years on average after symptom onset. Despite large GWAS meta-analyses (N > 750,000), only a few dozen causal loci have been identified. We hypothesized that the challenges in identifying causal genes for endometriosis stem from heterogeneity across clinical and biological factors underlying endometriosis diagnosis. Methods: We extracted known endometriosis risk factors, symptoms, and concomitant conditions from the Penn Medicine Biobank (PMBB) and performed unsupervised spectral clustering on 4,078 women with endometriosis. The 5 clusters were characterized by utilizing additional electronic health record (EHR) variables, such as endometriosis-related comorbidities and confirmed surgical phenotypes. From four EHR-linked genetic datasets, PMBB, eMERGE, AOU, and UKBB, we extracted lead variants and tag variants for 39 known endometriosis loci for association testing. We meta-analyzed ancestry-stratified case/control tests for each locus and cluster in addition to a positive control (Total N endometriosis cases = 10,108). Results: We have designated the five subtype clusters as pain comorbidities, uterine disorders, pregnancy complications, cardiometabolic comorbidities, and EHR-asymptomatic based on enriched features from each group. One locus, RNLS, surpassed the genome-wide significance threshold in the positive control. Thirteen more loci reached a Bonferroni threshold of 1.3 × 10⁻³ (0.05 / 39) in the positive control. The cluster-stratified tests yielded more significant associations than the positive control for anywhere from 5 to 15 loci depending on the cluster. Bonferroni-significant loci were identified for four out of five clusters, including WNT4 and GREB1 for the uterine disorders cluster, RNLS for the cardiometabolic cluster, FSHB for the pregnancy complications cluster, and SYNE1 and CDKN2B-AS1 for the EHR-asymptomatic cluster. 
This study enhances our understanding of the clinical presentation patterns of endometriosis subtypes, showcasing the innovative approach employed to investigate this complex disease.
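The Bonferroni threshold quoted in this abstract is the family-wise error rate divided by the number of loci tested (0.05 / 39). The sketch below reproduces that arithmetic and applies the threshold to two p-values; the locus names and p-values are hypothetical placeholders, not results from the study.

```python
n_loci = 39    # known endometriosis loci tested, per the abstract
alpha = 0.05   # family-wise error rate

# About 1.28e-3, which the abstract rounds to 1.3e-3.
threshold = alpha / n_loci


def passes_bonferroni(p_value, n_tests=n_loci, family_alpha=alpha):
    """True if a p-value is below the Bonferroni-corrected threshold."""
    return p_value < family_alpha / n_tests


# Hypothetical p-values for illustration only (not results from the study).
hypothetical_p_values = {"locus_A": 4.0e-4, "locus_B": 2.0e-3}
significant = [
    name for name, p in hypothetical_p_values.items() if passes_bonferroni(p)
]

print(f"Bonferroni threshold: {threshold:.2e}")
print(f"significant loci: {significant}")
```

This correction is far less strict than the conventional genome-wide significance threshold because only 39 pre-selected loci are tested rather than variants across the whole genome.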

16.
BMC Bioinformatics ; 14: 10, 2013 Jan 16.
Article in English | MEDLINE | ID: mdl-23323800

ABSTRACT

BACKGROUND: The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? RESULTS: We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. 
For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. CONCLUSIONS: Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
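The abstract does not describe its fingerprinting algorithm in detail, so the following is only a generic sketch of the idea, not the authors' method. It assumes that each note is fingerprinted as a set of hashed word trigrams and that a note is dropped when at least half of its fingerprints already appear in a kept note. The function names, the trigram size, and the 0.5 threshold are all assumptions for illustration.

```python
import hashlib


def fingerprints(text, n=3):
    """Hash every run of n consecutive words into a set of fingerprints."""
    words = text.lower().split()
    count = max(len(words) - n + 1, 1)
    shingles = (" ".join(words[i:i + n]) for i in range(count))
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def redundancy(note, earlier_note, n=3):
    """Fraction of a note's fingerprints already present in an earlier note."""
    new_fp = fingerprints(note, n)
    return len(new_fp & fingerprints(earlier_note, n)) / len(new_fp)


def drop_redundant(notes, threshold=0.5, n=3):
    """Keep a note only if it is not largely copied from a kept note."""
    kept = []
    for note in notes:
        if all(redundancy(note, k, n) < threshold for k in kept):
            kept.append(note)
    return kept


# Hypothetical notes; the second mostly copies the first.
notes = [
    "patient reports chest pain radiating to left arm",
    "patient reports chest pain radiating to left arm today improved",
    "follow up visit for diabetes management and diet counseling",
]
kept = drop_redundant(notes)
print(len(kept), "of", len(notes), "notes kept")
```

This pairwise version is quadratic in the number of notes; a corpus-scale implementation would typically index fingerprints so that each note is compared against a lookup table instead of every kept note.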


Subject(s)
Data Mining/methods; Electronic Health Records; Algorithms; Health Records, Personal; Humans; Semantics
17.
J Biomed Inform ; 46(6): 1088-98, 2013 Dec.
Article in English | MEDLINE | ID: mdl-23954592

ABSTRACT

Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work.


Subject(s)
Biomedical Research , Natural Language Processing , Controlled Vocabulary
18.
Math Biosci ; 358: 108979, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36792027

ABSTRACT

A normally functioning menstrual cycle requires significant crosstalk between hormones originating in ovarian and brain tissues. Reproductive hormone dysregulation may cause abnormal function and sometimes infertility. The inherent complexity in this endocrine system is a challenge to identifying mechanisms of cycle disruption, particularly given the large number of unknown parameters in existing mathematical models. We develop a new endocrine model to limit model complexity and use simulated distributions of unknown parameters for model analysis. By employing a comprehensive model evaluation, we identify a collection of mechanisms that differentiate normal and abnormal phenotypes. We also discover an intermediate phenotype (displaying relatively normal hormone levels and cycle dynamics) that is grouped statistically with the irregular phenotype. Results provide insight into how clinical symptoms associated with ovulatory disruption may not be detected through hormone measurements alone.


Subject(s)
Polycystic Ovary Syndrome , Humans , Female , Polycystic Ovary Syndrome/diagnosis , Hormones , Menstrual Cycle/physiology
19.
medRxiv ; 2023 Oct 17.
Article in English | MEDLINE | ID: mdl-37905137

ABSTRACT

Without comprehensive examination of available literature on health disparities and minority health (HDMH), the field is left vulnerable to disproportionately focus on specific populations or conditions, curtailing our ability to fully advance health equity. Using scalable open-source methods, we conducted a computational scoping review of more than 200,000 articles to investigate major populations, conditions, and themes in the literature as well as notable gaps. We also compared trends in studied conditions to their relative prevalence in the general population using insurance claims (42 million Americans). HDMH publications represent 1% of articles in MEDLINE. Most studies are observational in nature, though randomized trial reporting has increased five-fold in the last twenty years. Half of all HDMH articles concentrate on only three disease groups (cancer, mental health, endocrine/metabolic disorders), while hearing, vision, and skin-related conditions are among the least well represented despite substantial prevalence. To support further investigation, we also present HDMH Monitor, an interactive dashboard and repository generated from the HDMH bibliome.

20.
Proc Conf Assoc Comput Linguist Meet ; 2023: 2680-2697, 2023 Jul.
Article in English | MEDLINE | ID: mdl-38770277

ABSTRACT

Two-step approaches, in which summary candidates are generated and then re-ranked to return a single summary, can improve ROUGE scores over the standard single-step approach. Yet, standard decoding methods (i.e., beam search, nucleus sampling, and diverse beam search) produce candidates with redundant, and often low quality, content. In this paper, we design a novel method to generate candidates for re-ranking that addresses these issues. We ground each candidate abstract on its own unique content plan and generate distinct plan-guided abstracts using a model's top beam. More concretely, a standard language model (a BART LM) auto-regressively generates elemental discourse unit (EDU) content plans with an extractive copy mechanism. The top K beams from the content plan generator are then used to guide a separate LM, which produces a single abstractive candidate for each distinct plan. We apply an existing re-ranker (BRIO) to abstractive candidates generated from our method, as well as baseline decoding methods. We show large relevance improvements over previously published methods on widely used single document news article corpora, with ROUGE-2 F1 gains of 0.88, 2.01, and 0.38 on CNN / DailyMail, NYT, and XSum, respectively. A human evaluation on CNN / DM validates these results. Similarly, on 1k samples from CNN / DM, we show that prompting GPT-3 to follow EDU plans outperforms sampling-based methods by 1.05 ROUGE-2 F1 points. Code to generate and realize plans is available at https://github.com/griff4692/edu-sum.
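To make the generate-then-rerank idea and the ROUGE-2 F1 metric concrete, here is a minimal sketch. It is not the BRIO re-ranker or the authors' code: it uses a simplified ROUGE-2 (lowercased whitespace tokens, no stemming) and an oracle selection that scores candidates against a reference summary, which is only possible when a reference is available. The function names `rouge2_f1` and `pick_best` are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent token pairs (simplified tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pick_best(candidates, reference):
    """Oracle re-ranking: return the candidate with the highest ROUGE-2 F1."""
    return max(candidates, key=lambda c: rouge2_f1(c, reference))


if __name__ == "__main__":
    reference = "the cat ran"
    candidates = ["a dog ran", "the cat sat"]
    print(pick_best(candidates, reference))
```

A learned re-ranker replaces the reference-based score with a model-predicted score, but the overall shape is the same: produce several candidates, score each, and return the top one.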
