ABSTRACT
Social factors affect patient morbidity and mortality. Family physicians currently document social needs widely, but in free-text clinical notes; the unstructured format of this information in electronic health records limits providers' ability to address these issues. A proposed solution is to use natural language processing to identify social needs from the electronic health record. This could help physicians capture structured social needs information that is consistent and reproducible, without increasing documentation burden.
Subject(s)
Electronic Health Records; Natural Language Processing; Humans; Documentation; Physicians, Family
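To make the NLP approach proposed in the preceding abstract concrete, here is a minimal illustrative sketch of rule-based extraction of social needs from note text. The categories and trigger phrases are hypothetical; the article does not specify a lexicon or model.

```python
import re

# Hypothetical social-need categories and trigger phrases; the article does not
# specify a lexicon, so these terms are illustrative only.
SOCIAL_NEED_PATTERNS = {
    "housing": re.compile(r"\b(homeless|housing insecur\w*|eviction)\b", re.I),
    "food": re.compile(r"\b(food insecur\w*|food bank|skipping meals)\b", re.I),
    "transportation": re.compile(r"\b(no transportation|missed .{0,20}ride)\b", re.I),
}

def extract_social_needs(note_text: str) -> dict:
    """Return a structured flag for each social-need category found in a note."""
    return {need: bool(pat.search(note_text)) for need, pat in SOCIAL_NEED_PATTERNS.items()}

print(extract_social_needs("Patient reports food insecurity and recent eviction."))
# {'housing': True, 'food': True, 'transportation': False}
```

In practice a trained classifier would likely replace the regular expressions, but the output shape, one structured flag per social-need category, is the point of the proposal.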
ABSTRACT
Background: A better understanding of the contribution of neighborhood-level factors is needed to increase the precision of cancer control interventions that target geographic determinants of cancer health disparities. This study characterized the distribution of neighborhood deprivation in a racially diverse cohort of prostate cancer survivors. Methods: A retrospective cohort of 253 prostate cancer patients treated with radical prostatectomy from 2011 to 2019 was established at the Medical University of South Carolina. Individual-level data on clinical variables (e.g., stage, grade) and race were abstracted. Social Deprivation Index (SDI) and Healthcare Professional Shortage (HPS) status were obtained from the Robert Graham Center and assigned to participants based on their residential census tract. Data were analyzed with descriptive statistics and multivariable logistic regression. Results: The cohort of 253 men consisted of 168 white, 81 African American, 1 Hispanic, and 3 multiracial men. Approximately 49% of 249 men lived in areas with high SDI (i.e., SDI scores of 48 to 98). The mean SDI was 44.5 (±27.4), with a range of 97 (1−98) across all study participants. African American men had a significantly greater likelihood of living in a socially deprived neighborhood than white men (OR = 3.7, 95% CI 2.1−6.7, p < 0.01), and men who lived in areas with higher HPS shortage status were significantly more likely to live in a neighborhood with high SDI than men in areas with lower HPS shortages (OR = 4.7, 95% CI 2.1−10.7, p < 0.01). African American men had a higher likelihood of developing biochemical recurrence than white men (OR = 3.7, 95% CI 1.7−8.0). There was no significant association between SDI and clinical characteristics of prostate cancer. Conclusions: This study demonstrates that SDI varies considerably by race among men with prostate cancer treated with radical prostatectomy. Using SDI to understand the social environment could be particularly useful as part of precision medicine and precision public health approaches, and could be used by cancer centers, public health providers, and other health care specialists to inform operational decisions about how to target health promotion and disease prevention efforts in catchment areas and patient populations.
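A hedged sketch of the kind of multivariable logistic regression described in the Methods, using statsmodels on synthetic data. The column names, coding, and simulated prevalences are assumptions; the study's actual data and covariate set are not reproduced here, so the printed odds ratios are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns; the study's actual variable coding is not given in the abstract.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "high_sdi": rng.binomial(1, 0.49, 253),           # 1 = SDI score 48-98
    "african_american": rng.binomial(1, 0.32, 253),   # roughly 81/253
    "high_hps_shortage": rng.binomial(1, 0.5, 253),
})

X = sm.add_constant(df[["african_american", "high_hps_shortage"]])
model = sm.Logit(df["high_sdi"], X).fit(disp=0)

# Odds ratios with 95% confidence intervals, in the style the abstract reports
ors = np.exp(model.params)
ci = np.exp(model.conf_int())
print(pd.DataFrame({"OR": ors, "2.5%": ci[0], "97.5%": ci[1]}))
```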
ABSTRACT
BACKGROUND: Identifying patients at risk of hereditary cancer based on their family health history is a highly nuanced task. Frequently, patients at risk are not referred for genetic counseling because providers lack the time and training to collect and assess their family health history. Consequently, patients at risk do not receive the genetic counseling and testing they need to determine the preventive steps they should take to mitigate their risk. OBJECTIVE: This study aims to automate clinical practice guideline recommendations for hereditary cancer risk based on patient family health history. METHODS: We combined chatbots, web application programming interfaces, clinical practice guidelines, and ontologies into a web service-oriented system that can automate family health history collection and assessment. We used Owlready2 and Protégé to develop a lightweight, patient-centric clinical practice guideline domain ontology using hereditary cancer criteria from the American College of Medical Genetics and Genomics and the National Comprehensive Cancer Network. RESULTS: The domain ontology has 758 classes, 20 object properties, 23 datatype properties, and 42 individuals and encompasses 44 cancers, 144 genes, and 113 clinical practice guideline criteria. So far, it has been used to assess >5000 family health history cases. We created 192 test cases to ensure concordance with clinical practice guidelines. The average test case completes in 4.5 (SD 1.9) seconds, the longest in 19.6 seconds, and the shortest in 2.9 seconds. CONCLUSIONS: Web service-enabled, chatbot-oriented family health history collection and ontology-driven clinical practice guideline criteria risk assessment is a simple and effective method for automating hereditary cancer risk screening.
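Since the abstract names Owlready2 as the ontology tooling, the following minimal sketch shows how a lightweight domain ontology of this kind might be declared with it. The IRI, class names, properties, and individuals are hypothetical stand-ins, not the published ontology's actual identifiers.

```python
from owlready2 import Thing, DataProperty, ObjectProperty, get_ontology

# Hypothetical IRI; the published ontology's actual namespace is not given.
onto = get_ontology("http://example.org/fhx-cpg.owl")

with onto:
    class Cancer(Thing): pass
    class Gene(Thing): pass
    class CPGCriterion(Thing): pass

    class associatedGene(ObjectProperty):  # links a cancer to its risk genes
        domain = [Cancer]
        range = [Gene]

    class minAffectedRelatives(DataProperty):  # threshold a criterion checks
        domain = [CPGCriterion]
        range = [int]

    # Example individuals, purely illustrative
    brca1 = Gene("BRCA1")
    breast_cancer = Cancer("BreastCancer")
    breast_cancer.associatedGene = [brca1]

onto.save(file="fhx-cpg.owl")
```

Declaring criteria as ontology individuals with datatype properties, as sketched here, is one way a risk-assessment service could query thresholds at runtime instead of hard-coding guideline logic.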
ABSTRACT
Glycomics researchers have identified the need for integrated database systems that collect glycomics information in a consistent format. The goal is to create a resource for knowledge discovery and dissemination to wider research communities. This approach has the potential, and has shown initial success, to extend the research community to include biologists, clinicians, chemists, and computer scientists. This chapter discusses the technology and approach needed to create integrated data resources and informatics ecosystems that empower the broader community to leverage extant glycomics data. The focus is on glycosaminoglycan (GAG) and proteoglycan research, but the approach can be generalized. The methods described span the development of glycomics standards from CarbBank to Glyco Connection Tables. Integrated data sets provide a foundation for novel methods of analysis, such as machine learning and deep learning, for knowledge discovery. The implications of predictive analysis are examined in relation to disease biomarkers, with the aim of expanding the target audience of GAG and proteoglycan research.
Subject(s)
Ecosystem; Glycomics; Informatics; Polysaccharides; Proteoglycans
ABSTRACT
INTRODUCTION: Primary care providers (PCPs) and oncologists lack time and training to appropriately identify patients at increased risk for hereditary cancer using family health history (FHx) and clinical practice guideline (CPG) criteria. We built a tool, "ItRunsInMyFamily" (ItRuns), that automates FHx collection and risk assessment using CPGs. The purpose of this study was to evaluate ItRuns by measuring the level of concordance in referral patterns for genetic counseling/testing (GC/GT) between the CPGs as applied by the tool and genetic counselors (GCs), compared with oncologists and PCPs. The extent to which non-GCs are discordant with CPGs is a gap that health information technology, such as ItRuns, can help close to facilitate the identification of individuals at risk for hereditary cancer. METHODS: We curated 18 FHx cases and surveyed GCs and non-GCs (oncologists and PCPs) to assess concordance with ItRuns CPG criteria for referring patients for GC/GT. Percent agreement was used to describe concordance, and logistic regression to compare the providers' and the tool's concordance with CPG criteria. RESULTS: GCs had the best overall concordance with the CPGs used in ItRuns at 82.2%, followed by oncologists with 66.0% and PCPs with 60.6%. GCs were significantly more likely to concur with CPGs (OR = 4.04, 95% CI = 3.35-4.89) than non-GCs. All providers had higher concordance with CPGs for FHx cases that met the criteria for genetic counseling/testing than for cases that did not. DISCUSSION/CONCLUSION: The risk assessment provided by ItRuns was highly concordant with that of the GCs, particularly for at-risk individuals. The use of such technology-based tools improves efficiency and can lead to greater numbers of at-risk individuals accessing genetic counseling, testing, and mutation-based interventions to improve health.
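As a small illustration of the concordance measure used above, this sketch computes percent agreement between one provider's referral calls and the CPG-derived calls over 18 cases. The decisions shown are invented for the example, not the study's data.

```python
import numpy as np

def percent_agreement(provider_calls: np.ndarray, cpg_calls: np.ndarray) -> float:
    """Share of cases where a provider's referral decision matches the CPG criteria."""
    return float(np.mean(provider_calls == cpg_calls)) * 100

# Hypothetical referral decisions (1 = refer for GC/GT) over 18 FHx cases
cpg = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
gc_provider = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0])
print(f"{percent_agreement(gc_provider, cpg):.1f}% agreement")  # 88.9%
```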
ABSTRACT
BACKGROUND: Suicide is an important public health concern in the United States and around the world. There has been significant work examining machine learning approaches to identify and predict intentional self-harm and suicide using existing data sets. With recent advances in computing, deep learning applications in health care are gaining momentum. OBJECTIVE: This study aimed to leverage the information in clinical notes using deep neural networks (DNNs) to (1) improve the identification of patients treated for intentional self-harm and (2) predict future self-harm events. METHODS: We extracted clinical text notes from the electronic health records (EHRs) of 835 patients with International Classification of Diseases (ICD) codes for intentional self-harm and 1670 matched controls who never had any intentional self-harm ICD codes. The data were divided into training and holdout test sets. Using the training set, we tested a number of algorithms on the clinical notes associated with the intentional self-harm codes, including several traditional bag-of-words-based models and 2 DNN models: a convolutional neural network (CNN) and a long short-term memory model. We also evaluated the predictive performance of the DNNs on a subset of patients who had clinical notes 1 to 6 months before the first intentional self-harm event. Finally, we evaluated the impact of a model pretrained with Word2vec (W2V) on performance. RESULTS: The area under the receiver operating characteristic curve (AUC) for the CNN on the phenotyping task, that is, the detection of intentional self-harm in clinical notes concurrent with the events, was 0.999, with an F1 score of 0.985. In the predictive task, the CNN achieved the highest performance, with an AUC of 0.882 and an F1 score of 0.769. Although pretraining with W2V shortened the DNN training time, it did not improve performance. CONCLUSIONS: The strong performance on the first task, namely, phenotyping based on clinical notes, suggests that such models could be used effectively for surveillance of intentional self-harm in clinical text in an EHR. The modest performance on the predictive task notwithstanding, the results using DNN models on clinical text alone are competitive with other reports in the literature using risk factors from structured EHR data.
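A minimal Keras sketch of a CNN text classifier of the kind described above, assuming tokenized and padded note sequences as input. The vocabulary size, sequence length, and layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hyperparameters are illustrative; the paper's exact architecture is not reproduced here.
VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20000, 1000, 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),               # padded token-ID sequences
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),      # learned (or W2V-initialized) embeddings
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),                  # strongest n-gram signal per filter
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),        # self-harm case vs. matched control
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```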
ABSTRACT
Precision medicine informatics is a field of research that incorporates learning systems that generate new knowledge to improve individualized treatments using integrated data sets and models. Given the ever-increasing volumes of data that are relevant to patient care, artificial intelligence (AI) pipelines need to be a central component of such research to speed discovery. Applying AI methodology to complex multidisciplinary information retrieval can support efforts to discover bridging concepts within collaborating communities. This dovetails with precision medicine research, given the information-rich multi-omic data used in precision medicine analysis pipelines. In this perspective article, we define a prototype AI pipeline to facilitate discovering research connections between bioinformatics and clinical researchers. We propose building knowledge representations that are iteratively improved through AI and human-informed learning feedback loops supported through crowdsourcing. To illustrate this, we explore the specific use case of nonalcoholic fatty liver disease, a growing health care problem. We examine AI pipeline construction and utilization in relation to bench-to-bedside bridging concepts, with interconnecting knowledge representations applicable to bioinformatics researchers and clinicians.
ABSTRACT
BACKGROUND: Machine learning has been used extensively in clinical text classification tasks, and deep learning approaches using word embeddings have recently been gaining momentum in biomedical applications. In an effort to automate the identification of altered mental status (AMS) in emergency department provider notes for the purpose of decision support, we compared the performance of classic bag-of-words-based machine learning classifiers and newer deep learning approaches. METHODS: We used a case-control study design to extract an adequate number of clinical notes with and without AMS based on ICD codes. The notes were parsed to extract the history of present illness, which was used as the clinical text for the classifiers, and were manually labeled by clinicians. As a baseline for comparison, we tested several traditional bag-of-words-based classifiers. We then tested several deep learning models using a convolutional neural network architecture with three different types of word embeddings: a pre-trained word2vec model and two models without pre-training but with different word-embedding dimensions. RESULTS: We evaluated the models on 1130 labeled notes from the emergency department. The deep learning models had the best overall performance, with an area under the ROC curve of 98.5% and an accuracy of 94.5%. Pre-training word embeddings on the unlabeled corpus reduced training iterations and yielded performance that was not statistically different from that of the other deep learning models. CONCLUSION: This supervised deep learning approach performs exceedingly well for the detection of AMS symptoms in clinical text in our environment. Further work is needed to establish the generalizability of these findings, including evaluation of these models in other types of clinical notes and other environments. The results seem promising for the ultimate use of these types of classifiers, in combination with other information derived from the electronic health records, as input for clinical decision support.
Subject(s)
Decision Support Systems, Clinical; Deep Learning; Emergency Service, Hospital; Mental Disorders/diagnosis; Adult; Case-Control Studies; Electronic Health Records; Female; Humans; International Classification of Diseases; Male; Neural Networks, Computer; Sensitivity and Specificity
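To illustrate the pre-training step compared in the preceding study, here is a hedged sketch that trains a word2vec model with gensim on a toy corpus and builds an embedding matrix that could initialize a CNN embedding layer. The corpus, vocabulary handling, and dimensions are assumptions; in the study the unlabeled ED note corpus would play this role.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus; in the study this would be the unlabeled ED note corpus.
corpus = [
    ["patient", "alert", "and", "oriented"],
    ["patient", "confused", "with", "altered", "mental", "status"],
]

EMBED_DIM = 100
w2v = Word2Vec(sentences=corpus, vector_size=EMBED_DIM, window=5,
               min_count=1, workers=4)

# Build an embedding matrix to initialize a CNN's embedding layer.
vocab = {word: i + 1 for i, word in enumerate(w2v.wv.index_to_key)}  # 0 = padding
embedding_matrix = np.zeros((len(vocab) + 1, EMBED_DIM))
for word, idx in vocab.items():
    embedding_matrix[idx] = w2v.wv[word]
print(embedding_matrix.shape)  # (vocabulary size + 1, EMBED_DIM)
```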
ABSTRACT
With the rapid growth of health-related data, including genomic, proteomic, imaging, and clinical data, the already arduous task of data integration is further complicated by the size and diversity of the data and the complexity of the environment. This report examines the role of data integration strategies for big data predictive analytics in precision medicine research. Infrastructure-as-code methodologies are discussed as a means of integrating and managing data, including how and when these strategies can be used to lower barriers and address issues of consistency and interoperability within medical research environments. The goal is to support translational research and enable healthcare organizations to integrate and utilize infrastructure to accelerate the adoption of precision medicine.
Subject(s)
Forecasting/methods; Precision Medicine/methods; Biomedical Research/methods; Data Interpretation, Statistical; Databases, Factual; Electronic Health Records; Genomics/methods; Humans; Proteomics/methods; Translational Research, Biomedical/methods
ABSTRACT
The integration of phenotypes and genotypes is at an unprecedented level and offers new opportunities to establish deep phenotypes. There are a number of challenges to overcome, specifically, accelerated growth of data, data silos, incompleteness, inaccuracies, and heterogeneity within and across data sources. This perspective report discusses artificial intelligence (AI) approaches that hold promise in addressing these challenges by automating computable phenotypes and integrating them with genotypes. Collaborations between biomedical and AI researchers will be highlighted in order to describe initial successes with an eye toward the future.
ABSTRACT
Entity-attribute-value (EAV) tables are widely used to store data in electronic medical records and clinical study data management systems. Before they can be used by various analytical (e.g., data mining and machine learning) programs, EAV-modeled data usually must be transformed into conventional relational table format through pivot operations. This time-consuming and resource-intensive process is often performed repeatedly on a regular basis, e.g., to provide a daily refresh of the content in a clinical data warehouse. Thus, it would be beneficial to make pivot operations as efficient as possible. In this paper, we present three techniques for improving the efficiency of pivot operations: 1) filtering out EAV tuples related to unneeded clinical parameters early on; 2) supporting pivoting across multiple EAV tables; and 3) conducting multi-query optimization. We demonstrate the effectiveness of our techniques through implementation, showing that our optimized execution method of pivoting significantly outperforms the current basic execution method. Our techniques can be used to build a data extraction tool that simplifies the specification, and improves the efficiency, of extracting data from the EAV tables in electronic medical records and clinical study data management systems.
Subject(s)
Data Mining/methods; Database Management Systems; Electronic Health Records; Models, Theoretical; Clinical Studies as Topic; Humans
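A minimal pandas sketch of the pivot operation described in the preceding abstract, including its first technique (filtering unneeded clinical parameters before the pivot). The table and column names are illustrative, and the authors' optimized implementation operates at the database level rather than in pandas.

```python
import pandas as pd

# Toy EAV table; column names are illustrative.
eav = pd.DataFrame({
    "entity":    [1, 1, 1, 2, 2, 2],
    "attribute": ["sbp", "hr", "temp", "sbp", "hr", "temp"],
    "value":     [120, 72, 37.0, 135, 88, 38.2],
})

# Technique 1 from the paper: filter out tuples for unneeded clinical
# parameters early, before the (more expensive) pivot.
needed = {"sbp", "hr"}
filtered = eav[eav["attribute"].isin(needed)]

# The pivot itself: one row per entity, one column per attribute.
wide = filtered.pivot(index="entity", columns="attribute", values="value")
print(wide)
#            hr    sbp
# entity
# 1        72.0  120.0
# 2        88.0  135.0
```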
ABSTRACT
OBJECTIVES: To examine the feasibility of deploying a virtual web service for sharing data within a research network, and to evaluate the impact on data consistency and quality. MATERIAL AND METHODS: Virtual machines (VMs) encapsulated an open-source, semantically and syntactically interoperable secure web service infrastructure along with a shadow database. The VMs were deployed to 8 Collaborative Pediatric Critical Care Research Network Clinical Centers. RESULTS: Virtual web services could be deployed in hours. The interoperability of the web services reduced format misalignment from 56% to 1%, and 99% of the data transferred consistently using the data dictionary, with the remaining 1% requiring human curation. CONCLUSIONS: Use of virtualized, open-source, secure web service technology could enable direct electronic abstraction of data from hospital databases for research purposes.
Subject(s)
Access to Information; Computer Communication Networks; Critical Care; Information Dissemination/methods; Internet; Pediatrics/organization & administration; Computer Systems; Databases as Topic; Feasibility Studies; Humans; Software
ABSTRACT
Glycomics researchers have identified the need for integrated database systems that collect glycomics information in a consistent format. The goal is to create a resource for knowledge discovery and dissemination to wider research communities. This has the potential to extend the research community to include biologists, clinicians, chemists, and computer scientists. This chapter discusses the technology and approach needed to create integrated data resources that empower the broader community to leverage extant glycomics data. The focus is on glycosaminoglycan (GAG) and proteoglycan research, but the approach can be generalized. The methods described span the development of glycomics standards from CarbBank to Glyco Connection Tables. The existence of integrated data sets provides a foundation for novel methods of analysis, such as machine learning, for knowledge discovery. The implications of predictive analysis are examined in relation to disease biomarkers, with the aim of expanding the target audience of GAG and proteoglycan research.
Subject(s)
Computational Biology/methods; Glycosaminoglycans/chemistry; Proteoglycans/chemistry; Artificial Intelligence; Models, Molecular; Systems Integration
ABSTRACT
Glioblastoma multiforme (GBM), a highly aggressive form of brain cancer, results in a median survival of 12-15 months. For decades, researchers have explored the effects of clinical and molecular factors on this disease and have identified several candidate prognostic markers. In this study, we evaluated the use of multivariate classification models for differentiating between subsets of patients who survive a relatively long or short time. Data for this study came from The Cancer Genome Atlas (TCGA), a public repository containing clinical, treatment, histological and biomolecular variables for hundreds of patients. We applied variable-selection and classification algorithms in a cross-validated design and observed that predictive performance of the resulting models varied substantially across the algorithms and categories of data. The best-performing models were based on age, treatments and global DNA methylation. In this paper, we summarise our findings, discuss lessons learned in analysing TCGA data and offer recommendations for performing such analyses.
Subject(s)
Brain Neoplasms/mortality; Glioblastoma/mortality; Algorithms; Brain Neoplasms/diagnosis; Brain Neoplasms/genetics; DNA Methylation; Glioblastoma/diagnosis; Glioblastoma/genetics; Humans; Kaplan-Meier Estimate; Models, Molecular; Prognosis; Survival Rate
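A hedged sklearn sketch of the cross-validated variable-selection-plus-classification design described in the preceding GBM study, run on synthetic data. Keeping selection inside the pipeline confines it to each training fold, avoiding selection bias in the reported performance; the specific algorithms and TCGA variables the study used are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for TCGA features (e.g., methylation values).
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

# Variable selection lives inside the pipeline so it is refit on each
# cross-validation training fold, never on held-out data.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.2f} (+/- {scores.std():.2f})")
```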
ABSTRACT
BACKGROUND: With the advent of whole-genome analysis for profiling tumor tissue, a pressing need has emerged for principled methods of organizing the large amounts of resulting genomic information. We propose the concept of multiplicity measures on cancer and gene networks to organize the information in a clinically meaningful manner. Multiplicity applied in this context extends Fearon and Vogelstein's multi-hit genetic model of colorectal carcinoma across multiple cancers. METHODS: Using the Catalogue of Somatic Mutations in Cancer (COSMIC), we construct networks of interacting cancers and genes. Multiplicity is calculated by evaluating the number of cancers and genes linked by the measurement of a somatic mutation. The Kamada-Kawai algorithm is used to find a two-dimensional minimum energy solution with multiplicity as an input similarity measure. Cancers and genes are positioned in two dimensions according to this similarity. A third dimension is added to the network by assigning a maximal multiplicity to each cancer or gene. Hierarchical clustering within this three-dimensional network is used to identify similar clusters in somatic mutation patterns across cancer types. RESULTS: The clustering of genes in a three-dimensional network reveals a similarity in acquired mutations across different cancer types. Surprisingly, the clusters separate known causal mutations. The multiplicity clustering technique identifies a set of causal genes with an area under the ROC curve of 0.84 versus 0.57 when clustering on gene mutation rate alone. The cluster multiplicity value and number of causal genes are positively correlated via Spearman's Rank Order correlation (rs(8) = 0.894, Spearman's t = 17.48, p < 0.05). A clustering analysis of cancer types segregates different types of cancer. All blood tumors cluster together, and the cluster multiplicity values differ significantly (Kruskal-Wallis, H = 16.98, df = 2, p < 0.05). CONCLUSION: We demonstrate the principle of multiplicity for organizing somatic mutations and cancers in clinically relevant clusters. These clusters of cancers and mutations provide representations that identify segregations of cancer and genes driving cancer progression.
Subject(s)
Algorithms; Mutation; Neoplasms/genetics; Cluster Analysis; Humans
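A simplified networkx sketch of the construction described in the preceding Methods: cancer-gene links, a multiplicity value per node, a Kamada-Kawai two-dimensional embedding, and hierarchical clustering with multiplicity as the third dimension. The edge list is toy data rather than COSMIC, and using node degree as the multiplicity measure is a simplification of the paper's definition.

```python
import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage

# Toy cancer-gene links; in the paper these come from COSMIC somatic mutations.
edges = [("colorectal", "KRAS"), ("colorectal", "TP53"), ("lung", "TP53"),
         ("lung", "KRAS"), ("melanoma", "BRAF"), ("colorectal", "APC")]
G = nx.Graph(edges)

# Simplified multiplicity: number of linked cancers/genes (node degree).
multiplicity = {n: G.degree(n) for n in G}

# Two-dimensional minimum-energy embedding via the Kamada-Kawai algorithm.
pos = nx.kamada_kawai_layout(G)

# Add multiplicity as a third dimension and hierarchically cluster the points.
points = [(x, y, multiplicity[n]) for n, (x, y) in pos.items()]
labels = fcluster(linkage(points, method="ward"), t=2, criterion="maxclust")
print(dict(zip(G.nodes, labels)))
```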
ABSTRACT
A polygenic model for predicting susceptibility to late-onset, sporadic forms of breast cancer based on an individual's SNP profile was developed. The model was validated using a publicly available data set with genome-wide SNP markers for cases and controls. Preliminary results show that this method performs better than expected by chance.
Subject(s)
Breast Neoplasms/diagnosis; Breast Neoplasms/genetics; Chromosome Mapping/methods; Genetic Predisposition to Disease/epidemiology; Genetic Predisposition to Disease/genetics; Models, Genetic; Polymorphism, Single Nucleotide/genetics; Breast Neoplasms/epidemiology; Computer Simulation; Diagnosis, Computer-Assisted/methods; Female; Genetic Testing/methods; Humans; Proportional Hazards Models; Risk Assessment/methods; Risk Factors; Utah
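One common way to realize a polygenic model like the one in the preceding abstract is a weighted sum of risk-allele counts. The minimal sketch below assumes hypothetical per-SNP log odds ratio weights, since the abstract does not give the model's actual form or SNP set.

```python
import numpy as np

# Hypothetical per-SNP weights (log odds ratios) and risk-allele counts (0/1/2);
# the model's actual SNP set and weights are not given in the abstract.
log_or = np.array([0.10, 0.25, -0.05, 0.18])       # one weight per SNP
genotypes = np.array([[0, 1, 2, 1],                # one row per individual
                      [2, 2, 0, 0]])

def polygenic_score(genotypes: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted sum of risk-allele counts across SNPs for each individual."""
    return genotypes @ weights

print(polygenic_score(genotypes, log_or))  # [0.33 0.7 ]
```

Scores like these would then be thresholded or fed to a downstream classifier and evaluated against case/control status, as in the validation the abstract describes.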
ABSTRACT
The advancement of cancer diagnosis, prognosis, and treatment would be hastened by a robust method for identifying patterns that indicate a tumor's state. Prior research has established that sporadic colorectal-cancer pathogenesis involves a series of genetic mutations that allow benign polyps to develop and eventually progress to malignant tumors in distinguishable patterns. Using a publicly available database of somatic mutations for many cancer types, we identified somatic-mutation signatures. Our results for colorectal cancer are consistent with extant biological models described in the literature. This approach is potentially useful for identifying previously undiscovered patterns and generating hypotheses related to biological pathways. Such signatures could prove valuable for eventual translation into clinical practice.
Subject(s)
Biomarkers, Tumor/genetics; Colorectal Neoplasms/diagnosis; Colorectal Neoplasms/genetics; Genetic Markers/genetics; Mutation/genetics; Neoplasm Proteins/genetics; Databases, Factual; Humans
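The preceding abstract does not state how the signatures were identified; the sketch below uses non-negative matrix factorization, a common signature-extraction technique, as a stand-in on a toy mutation-count matrix. It illustrates the general idea only and may differ from the authors' actual method.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy mutation-count matrix: rows = tumor samples, columns = mutation categories.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=5, size=(50, 96)).astype(float)

# Factor counts ~= exposures @ signatures, with everything non-negative.
model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
exposures = model.fit_transform(counts)   # per-sample signature activity
signatures = model.components_            # per-signature category weights
print(exposures.shape, signatures.shape)  # (50, 3) (3, 96)
```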
ABSTRACT
We use Backward Chaining Rule Induction (BCRI), a novel data mining method for hypothesizing causative mechanisms, to mine lung cancer gene expression array data for mechanisms that could impact survival. Initially, a supervised learning system is used to generate a prediction model in the form of "IF