ABSTRACT
INTRODUCTION: The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. We here introduce the architecture of the de-identification pipeline of GeMTeX. METHODS: This pipeline comprises the export of raw clinical documents from the local hospital information system, the import into the annotation platform INCEpTION, fully automatic pre-tagging with protected health information (PHI) items by the Averbis Health Discovery pipeline, a manual curation step of these pre-annotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators each at the Data Integration Centers of the University Hospitals Leipzig and Erlangen. RESULTS: As a proof of concept, the publicly available Graz Synthetic Text Clinical Corpus (GRASCCO) was enhanced with PHI annotations in an annotation campaign for which reasonable inter-annotator agreement values of Krippendorff's α ≈ 0.97 can be reported. CONCLUSION: These curated 1.4K PHI annotations are released as open-source data constituting the first publicly available German clinical language text corpus with PHI metadata.
Subject(s)
Electronic Health Records , Pilot Projects , Germany , Natural Language Processing , Confidentiality , Humans , Computer Security
ABSTRACT
The extraction of medication information from unstructured clinical documents has been a major application of clinical NLP in the past decade as evidenced by the conduct of two shared tasks under the I2B2 and N2C2 umbrella. We here propose a new methodological approach which has already shown a tremendous potential for increasing system performance for general NLP tasks, but has so far not been applied to medication extraction from EHR data, namely deep learning based on transformer models. We ran experiments on established clinical data sets for English (exploiting I2B2 and N2C2 corpora) and German (based on the 3000PA corpus, a German reference data set). Our results reveal that transformer models are on a par with current state-of-the-art results for English, but yield new ones for German data. We further address the influence of context on the overall performance of transformer-based medication relation extraction.
Subject(s)
Data Analysis , Pharmaceutical Preparations , Deep Learning
ABSTRACT
We here report on one of the outcomes of a large-scale German research program, the Medical Informatics Initiative (MII), aiming at the development of a solid data and software infrastructure for German-language clinical natural language processing. Within this framework, we have developed 3000PA, a national clinical reference corpus composed of patient records from three clinical university sites and annotated with a multitude of semantic annotation layers (including medical named entities, semantic and temporal relations between entities, as well as certainty and negation information related to entities and relations). This non-sharable corpus has been complemented by three sharable ones (JSYNCC, GGPONC, and GRASCCO). Overall, 3000PA, JSYNCC and GRASCCO feature about 2.1 million metadata points.
Subject(s)
Language , Medical Informatics , Humans , Semantics , Metadata , Natural Language Processing
ABSTRACT
We present GePI, a novel Web server for large-scale text mining of molecular interactions from the scientific biomedical literature. GePI leverages natural language processing techniques to identify genes and related entities, interactions between those entities and biomolecular events involving them. GePI supports rapid retrieval of interactions based on powerful search options to contextualize queries targeting (lists of) genes of interest. Contextualization is enabled by full-text filters constraining the search for interactions to either sentences or paragraphs, with or without pre-defined gene lists. Our knowledge graph is updated several times a week, ensuring that the most recent information is available at all times. The result page provides an overview of the outcome of a search, with accompanying interaction statistics and visualizations. A table (downloadable in Excel format) gives direct access to the retrieved interaction pairs, together with information about the molecular entities, the factual certainty of the interactions (as expressed verbatim by the authors), and a text snippet from the original document that verbalizes each interaction. In summary, our Web application offers free, easy-to-use, and up-to-date monitoring of gene and protein interaction information, together with flexible query formulation and filtering options. GePI is available at https://gepi.coling.uni-jena.de/.
Subject(s)
Data Mining , Software , Data Mining/methods
ABSTRACT
Phosphorylation-dependent signal transduction plays an important role in regulating the functions and fate of skeletal muscle cells. Central players in the phospho-signaling network are the protein kinases AKT, S6K, and RSK as part of the PI3K-AKT-mTOR-S6K and RAF-MEK-ERK-RSK pathways. However, despite their functional importance, knowledge about their specific targets is incomplete because these kinases share the same basophilic substrate motif RxRxxp[ST]. To address this, we performed a multifaceted quantitative phosphoproteomics study of skeletal myotubes following kinase inhibition. Our data corroborate a cross talk between AKT and RAF, a negative feedback loop of RSK on ERK, and a putative connection between RSK and PI3K signaling. Altogether, we report a kinase target landscape containing 49 so far unknown target sites. AKT, S6K, and RSK phosphorylate numerous proteins involved in muscle development, integrity, and functions, and signaling converges on factors that are central for the skeletal muscle cytoskeleton. Whereas AKT controls insulin signaling and impinges on GTPase signaling, nuclear signaling is characteristic for RSK. Our data further support a role of RSK in glucose metabolism. Shared targets have functions in RNA maturation, stability, and translation, which suggests that these basophilic kinases establish an intricate signaling network to orchestrate and regulate processes involved in translation.
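The shared basophilic substrate motif RxRxxp[ST] mentioned above can be illustrated with a small sequence scan. This is only a minimal sketch and not the authors' analysis pipeline: it assumes that `x` stands for any residue and that the phosphoacceptor is the S or T at the sixth motif position. The function name `find_basophilic_sites` and the toy sequence are hypothetical.

```python
import re

# Assumption: RxRxxp[ST] is read as R, any residue, R, two arbitrary residues,
# then a serine or threonine that would carry the phosphate group.
# The lookahead lets overlapping motif instances be reported as well.
MOTIF = re.compile(r"(?=(R.R..[ST]))")

def find_basophilic_sites(sequence):
    """Return (1-based position, residue) for each candidate phosphoacceptor."""
    sites = []
    for match in MOTIF.finditer(sequence):
        acceptor_index = match.start() + 5  # 0-based index of the S/T residue
        sites.append((acceptor_index + 1, sequence[acceptor_index]))
    return sites

if __name__ == "__main__":
    toy_sequence = "AARSRAASGGRTRLLTK"  # made-up sequence, not a real protein
    for position, residue in find_basophilic_sites(toy_sequence):
        print(f"candidate site: {residue}{position}")
```

Because AKT, S6K, and RSK all recognize this motif, a sequence match alone cannot assign a site to one kinase, which is why the study relies on kinase inhibition and quantitative phosphoproteomics.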
Subject(s)
Phosphatidylinositol 3-Kinases , Proto-Oncogene Proteins c-akt , Muscle Fibers, Skeletal/metabolism , Phosphatidylinositol 3-Kinases/metabolism , Phosphorylation , Proto-Oncogene Proteins c-akt/genetics , Proto-Oncogene Proteins c-akt/metabolism , Signal Transduction/physiology , Ribosomal Protein S6 Kinases, 90-kDa , Ribosomal Protein S6 Kinases, 70-kDa
ABSTRACT
BACKGROUND: Childhood asthma is a result of a complex interaction of genetic and environmental components causing epigenetic and immune dysregulation, airway inflammation and impaired lung function. Although different microarray based EWAS studies have been conducted, the impact of epigenetic regulation in asthma development is still widely unknown. We have therefore applied unbiased whole genome bisulfite sequencing (WGBS) to characterize global DNA-methylation profiles of asthmatic children compared to healthy controls. METHODS: Peripheral blood samples of 40 asthmatic and 42 control children aged 5-15 years from three birth cohorts were sequenced together with paired cord blood samples. Identified differentially methylated regions (DMRs) were categorized in genotype-associated, cell-type-dependent, or prenatally primed. Network analysis and subsequent natural language processing of DMR-associated genes was complemented by targeted analysis of functional translation of epigenetic regulation on the transcriptional and protein level. RESULTS: In total, 158 DMRs were identified in asthmatic children compared to controls of which 37% were related to the eosinophil content. A global hypomethylation was identified affecting predominantly enhancer regions and regulating key immune genes such as IL4, IL5RA, and EPX. These DMRs were confirmed in n = 267 samples and could be linked to aberrant gene expression. Out of the 158 DMRs identified in the established phenotype, 56 were perturbed already at birth and linked, at least in part, to prenatal influences such as tobacco smoke exposure or phthalate exposure. CONCLUSION: This is the first epigenetic study based on whole genome sequencing to identify marked dysregulation of enhancer regions as a hallmark of childhood asthma.
Subject(s)
Asthma , Epigenesis, Genetic , Female , Pregnancy , Humans , DNA Methylation , Asthma/genetics , DNA
ABSTRACT
We describe the creation of GRASCCO, a novel German-language corpus composed of some 60 clinical documents with more than 43,000 tokens. GRASCCO is a synthetic corpus resulting from a series of alienation steps to obfuscate privacy-sensitive information contained in real clinical documents, the true origin of all GRASCCO texts. Therefore, it is publicly shareable without any legal restrictions. We also explore whether this corpus still represents common clinical language use by comparison with a real (non-shareable) clinical corpus we developed as a contribution to the Medical Informatics Initiative in Germany (MII) within the SMITH consortium. We find evidence that such a claim can indeed be made.
Subject(s)
Language , Natural Language Processing , Germany
ABSTRACT
We describe the adaptation of a non-clinical pseudonymization system, originally developed for a German email corpus, for clinical use. This tool replaces previously identified Protected Health Information (PHI) items as carriers of privacy-sensitive information (original names for people, organizations, places, etc.) with semantic type-conformant, yet fictitious, surrogates. We evaluate the generated substitutes for grammatical correctness and for semantic and medical plausibility, and find particularly low numbers of error instances (less than 1%) on all of these dimensions.
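The replacement of PHI items with type-conformant surrogates can be sketched roughly as follows. This is a simplified illustration and not the system described in the abstract: the surrogate lists, the span format `(start, end, phi_type)`, and the function name `pseudonymize` are all assumptions, and the sketch ignores the grammatical adaptation (for example, inflection) that a real German system has to handle.

```python
import random

# Hypothetical surrogate lexicon, one list per PHI type.
SURROGATES = {
    "NAME": ["Erika Beispiel", "Max Muster"],
    "LOCATION": ["Musterstadt", "Beispieldorf"],
}

def pseudonymize(text, phi_spans, rng):
    """Replace each (start, end, phi_type) span with a surrogate of the same type.

    Spans are assumed to be non-overlapping character offsets. Identical
    originals of the same type receive the same surrogate, so that repeated
    mentions stay consistent within the document.
    """
    mapping = {}
    result = text
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, phi_type in sorted(phi_spans, reverse=True):
        key = (phi_type, text[start:end])
        if key not in mapping:
            mapping[key] = rng.choice(SURROGATES[phi_type])
        result = result[:start] + mapping[key] + result[end:]
    return result

if __name__ == "__main__":
    note = "Herr Schmidt wohnt in Jena."
    spans = [(5, 12, "NAME"), (22, 26, "LOCATION")]
    print(pseudonymize(note, spans, random.Random(0)))
```

Passing the random number generator explicitly keeps the sketch reproducible. A production system would also need collision checks so that a surrogate never coincides with a real name occurring elsewhere in the corpus.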
Subject(s)
Confidentiality , Privacy , Humans
ABSTRACT
Automated identification of advanced chronic kidney disease (CKD ≥ III) and of no known kidney disease (NKD) can support both clinicians and researchers. We hypothesized that identification of CKD and NKD can be improved by combining information from different electronic health record (EHR) resources, comprising laboratory values, discharge summaries and ICD-10 billing codes, compared to using each component alone. We included EHRs from 785 elderly multimorbid patients, hospitalized between 2010 and 2015, that were divided into a training and a test (n = 156) dataset. We used both the area under the receiver operating characteristic (AUROC) and under the precision-recall curve (AUCPR) with a 95% confidence interval for evaluation of different classification models. In the test dataset, the combination of EHR components as a simple classifier identified CKD ≥ III (AUROC 0.96[0.93-0.98]) and NKD (AUROC 0.94[0.91-0.97]) better than laboratory values (AUROC CKD 0.85[0.79-0.90], NKD 0.91[0.87-0.94]), discharge summaries (AUROC CKD 0.87[0.82-0.92], NKD 0.84[0.79-0.89]) or ICD-10 billing codes (AUROC CKD 0.85[0.80-0.91], NKD 0.77[0.72-0.83]) alone. Logistic regression and machine learning models improved recognition of CKD ≥ III compared to the simple classifier if only laboratory values were used (AUROC 0.96[0.92-0.99] vs. 0.86[0.81-0.91], p < 0.05) and improved recognition of NKD if information from previous hospital stays was used (AUROC 0.99[0.98-1.00] vs. 0.95[0.92-0.97], p < 0.05). Depending on the availability of data, correct automated identification of CKD ≥ III and NKD from EHRs can be improved by generating classification models based on the combination of different EHR components.
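For readers unfamiliar with the AUROC values reported above, the metric can be computed directly from its pairwise definition. The sketch below is a generic illustration with made-up labels and scores; it is not the study's evaluation code and does not reproduce its confidence intervals. The function name `auroc` is an assumption.

```python
def auroc(labels, scores):
    """Area under the ROC curve via its pairwise (Mann-Whitney) definition.

    Returns the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case; ties count as 0.5.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

if __name__ == "__main__":
    # Toy data: 1 = condition present, 0 = condition absent (invented values).
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(auroc(labels, scores))
```

On this toy input, three of the four positive-negative pairs are ranked correctly, so the result is 0.75. The quadratic loop is fine for small test sets such as the n = 156 one above; larger datasets would normally use a rank-based implementation.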
ABSTRACT
Aryl hydrocarbon receptor (AHR) activation by tryptophan (Trp) catabolites enhances tumor malignancy and suppresses anti-tumor immunity. The context specificity of AHR target genes has so far impeded systematic investigation of AHR activity and its upstream enzymes across human cancers. A pan-tissue AHR signature, derived by natural language processing, revealed that across 32 tumor entities, interleukin-4-induced-1 (IL4I1) associates more frequently with AHR activity than IDO1 or TDO2, hitherto recognized as the main Trp-catabolic enzymes. IL4I1 activates the AHR through the generation of indole metabolites and kynurenic acid. It associates with reduced survival in glioma patients, promotes cancer cell motility, and suppresses adaptive immunity, thereby enhancing the progression of chronic lymphocytic leukemia (CLL) in mice. Immune checkpoint blockade (ICB) induces IDO1 and IL4I1. As IDO1 inhibitors do not block IL4I1, IL4I1 may explain the failure of clinical studies combining ICB with IDO1 inhibition. Taken together, IL4I1 blockade opens new avenues for cancer therapy.
Subject(s)
L-Amino Acid Oxidase/metabolism , Receptors, Aryl Hydrocarbon/metabolism , Adult , Aged , Animals , Cell Line , Cell Line, Tumor , Disease Progression , Female , Glioma/immunology , Glioma/metabolism , Glioma/therapy , HEK293 Cells , Humans , Immune Checkpoint Inhibitors/pharmacology , Indoleamine-Pyrrole 2,3,-Dioxygenase/metabolism , Leukemia, Lymphocytic, Chronic, B-Cell/immunology , Leukemia, Lymphocytic, Chronic, B-Cell/metabolism , Leukemia, Lymphocytic, Chronic, B-Cell/therapy , Male , Mice , Mice, Inbred C57BL , Middle Aged , Rats
ABSTRACT
OBJECTIVES: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes, diseases and drugs (or medications), and relations between them. METHODS: For the time period from 2017 to early 2020, we searched for relevant publications from three major scientific communities: medicine and medical informatics, natural language processing, as well as neural networks and artificial intelligence. RESULTS: In the past decade, the field of Natural Language Processing (NLP) has undergone a profound methodological shift from symbolic to distributed representations based on the paradigm of Deep Learning (DL). Meanwhile, this trend is, although with some delay, also reflected in the medical NLP community. In the reporting period, overwhelming experimental evidence has been gathered, as illustrated in this survey for medical IE, that DL-based approaches outperform non-DL ones by often large margins. Still, small-sized and access-limited corpora create intrinsic problems for data-greedy DL as do special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies. CONCLUSIONS: The paradigm shift from (feature-engineered) ML to DNNs changes the fundamental methodological rules of the game for medical NLP. This change is by no means restricted to medical IE but should also deeply influence other areas of medical informatics, either NLP- or non-NLP-based.
Subject(s)
Information Storage and Retrieval/methods , Natural Language Processing , Neural Networks, Computer , Datasets as Topic , Deep Learning , Disease , Drug Interactions , Humans , Medical Informatics , Pharmaceutical Preparations
ABSTRACT
We here describe the evolution of annotation guidelines for major clinical named entities, namely Diagnosis, Findings and Symptoms, on a corpus of approximately 1,000 German discharge letters. Due to their intrinsic opaqueness and complexity, clinical annotation tasks require continuous guideline tuning, beginning from the initial definition of crucial entities and the subsequent iterative evolution of guidelines based on empirical evidence. We describe rationales for adaptation, with focus on several metrical criteria and task-centered clinical constraints.
Subject(s)
Data Curation , Patient Discharge , Humans
ABSTRACT
The PI3K/Akt pathway promotes skeletal muscle growth and myogenic differentiation. Although its importance in skeletal muscle biology is well documented, many of its substrates remain to be identified. We here studied PI3K/Akt signaling in contracting skeletal muscle cells by quantitative phosphoproteomics. We identified the extended basophilic phosphosite motif RxRxxp[S/T]xxp[S/T] in various proteins including filamin-C (FLNc). Importantly, this extended motif, located in a unique insert in Ig-like domain 20 of FLNc, is doubly phosphorylated. The protein kinases responsible for this dual-site phosphorylation are Akt and PKCα. Proximity proteomics and interaction analysis identified filamin A-interacting protein 1 (FILIP1) as direct FLNc binding partner. FILIP1 binding induces filamin degradation, thereby negatively regulating its function. Here, dual-site phosphorylation of FLNc not only reduces FILIP1 binding, providing a mechanism to shield FLNc from FILIP1-mediated degradation, but also enables fast dynamics of FLNc necessary for its function as signaling adaptor in cross-striated muscle cells.
Subject(s)
Carrier Proteins/metabolism , Cytoskeletal Proteins/metabolism , Filamins/metabolism , Muscle Fibers, Skeletal/metabolism , Phosphoproteins/metabolism , Proteome/metabolism , Amino Acid Motifs , HEK293 Cells , Humans , Muscle Development , Muscle Fibers, Skeletal/cytology , Phosphatidylinositol 3-Kinases/metabolism , Phosphorylation , Protein Binding , Proteolysis , Proteome/analysis , Proto-Oncogene Proteins c-akt/metabolism , Signal Transduction
ABSTRACT
We devised annotation guidelines for the de-identification of German clinical documents and assembled a corpus of 1,106 discharge summaries and transfer letters with 44K annotated protected health information (PHI) items. After three iteration rounds, our annotation team finally reached an inter-annotator agreement of 0.96 on the instance level and 0.97 on the token level of annotation (averaged pair-wise F1 score). To establish a baseline for automatic de-identification on our corpus, we trained a recurrent neural network (RNN) and achieved F1 scores greater than 0.9 on most major PHI categories.
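The averaged pair-wise F1 score used above as an agreement measure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: each annotator's output is modeled as a set of `(start, end, phi_type)` tuples for instance-level agreement, and the function names `pairwise_f1` and `mean_pairwise_f1` are hypothetical. Token-level agreement would use the same computation on sets of `(token_index, phi_type)` pairs.

```python
from itertools import combinations

def pairwise_f1(a, b):
    """F1 between two annotation sets; symmetric, so no annotator is 'gold'."""
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))

def mean_pairwise_f1(annotations):
    """Average F1 over all unordered annotator pairs.

    `annotations` maps an annotator id to that annotator's set of items.
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(pairwise_f1(a, b) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Invented toy annotations for one document.
    annotations = {
        "annotator_1": {(0, 5, "NAME"), (10, 14, "DATE")},
        "annotator_2": {(0, 5, "NAME")},
        "annotator_3": {(0, 5, "NAME"), (10, 14, "DATE")},
    }
    print(round(mean_pairwise_f1(annotations), 3))
```

The identity 2|A ∩ B| / (|A| + |B|) equals the harmonic mean of precision and recall when either annotator is treated as the reference, which is why the score does not depend on the order of the pair.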
Subject(s)
Data Anonymization , Electronic Health Records , Natural Language Processing , Neural Networks, Computer
ABSTRACT
All cells and organisms exhibit stress-coping mechanisms to ensure survival. Cytoplasmic protein-RNA assemblies termed stress granules are increasingly recognized to promote cellular survival under stress. Thus, they might represent tumor vulnerabilities that are currently poorly explored. The translation-inhibitory eIF2α kinases are established as main drivers of stress granule assembly. Using a systems approach, we identify the translation enhancers PI3K and MAPK/p38 as pro-stress-granule-kinases. They act through the metabolic master regulator mammalian target of rapamycin complex 1 (mTORC1) to promote stress granule assembly. When highly active, PI3K is the main driver of stress granules; however, the impact of p38 becomes apparent as PI3K activity declines. PI3K and p38 thus act in a hierarchical manner to drive mTORC1 activity and stress granule assembly. Of note, this signaling hierarchy is also present in human breast cancer tissue. Importantly, only the recognition of the PI3K-p38 hierarchy under stress enabled the discovery of p38's role in stress granule formation. In summary, we assign a new pro-survival function to the key oncogenic kinases PI3K and p38, as they hierarchically promote stress granule formation.
Subject(s)
Cytoplasmic Granules/metabolism , Mechanistic Target of Rapamycin Complex 1/metabolism , Phosphatidylinositol 3-Kinases/metabolism , Stress, Physiological/physiology , p38 Mitogen-Activated Protein Kinases/metabolism , Arsenites/pharmacology , Cell Survival/drug effects , Computer Simulation , Gene Knockdown Techniques , HEK293 Cells , HeLa Cells , Humans , MCF-7 Cells , Phosphorylation/drug effects , Proto-Oncogene Proteins c-akt/genetics , Proto-Oncogene Proteins c-akt/metabolism , Signal Transduction/drug effects , Transfection
ABSTRACT
INTRODUCTION: This article is part of the Focus Theme of Methods of Information in Medicine on the German Medical Informatics Initiative. "Smart Medical Information Technology for Healthcare (SMITH)" is one of four consortia funded by the German Medical Informatics Initiative (MI-I) to create an alliance of universities, university hospitals, research institutions and IT companies. SMITH's goals are to establish Data Integration Centers (DICs) at each SMITH partner hospital and to implement use cases which demonstrate the usefulness of the approach. OBJECTIVES: To give insight into architectural design issues underlying SMITH data integration and to introduce the use cases to be implemented. GOVERNANCE AND POLICIES: SMITH implements a federated approach for both its governance structure and its information system architecture. SMITH has designed a generic concept for its data integration centers. They share identical services and functionalities to take best advantage of the interoperability architectures and of the data use and access process planned. The DICs provide access to the local hospitals' Electronic Medical Records (EMR). This is based on data trustee and privacy management services. DIC staff will curate and amend EMR data in the Health Data Storage. METHODOLOGY AND ARCHITECTURAL FRAMEWORK: To share medical and research data, SMITH's information system is based on communication and storage standards. We use the Reference Model of the Open Archival Information System and will consistently implement profiles of Integrating the Health Care Enterprise (IHE) and Health Level Seven (HL7) standards. Standard terminologies will be applied. The SMITH Market Place will be used for devising agreements on data access and distribution.
3LGM2 for enterprise architecture modeling supports a consistent development process. The DIC reference architecture determines the services, applications and the standards-based communication links needed for efficiently supporting the ingesting, data nourishing, trustee, privacy management and data transfer tasks of the SMITH DICs. The reference architecture is adopted at the local sites. Data sharing services and the market place enable interoperability. USE CASES: The methodological use case "Phenotype Pipeline" (PheP) constructs algorithms for annotations and analyses of patient-related phenotypes according to classification rules or statistical models based on structured data. Unstructured textual data will be subject to natural language processing to permit integration into the phenotyping algorithms. The clinical use case "Algorithmic Surveillance of ICU Patients" (ASIC) focusses on patients in Intensive Care Units (ICU) with the acute respiratory distress syndrome (ARDS). A model-based decision-support system will give advice for mechanical ventilation. The clinical use case HELP develops a "hospital-wide electronic medical record-based computerized decision support system to improve outcomes of patients with blood-stream infections" (HELP). ASIC and HELP use the PheP. The clinical benefit of the use cases ASIC and HELP will be demonstrated in a change-of-care clinical trial based on a stepped-wedge design. DISCUSSION: SMITH's strength is the modular, reusable IT architecture based on interoperability standards, the integration of the hospitals' information management departments and the public-private partnership. The project aims at sustainability beyond the first 4-year funding period.
Subject(s)
Delivery of Health Care , Information Technology , Algorithms , Clinical Governance , Communication , Decision Support Systems, Clinical , Electronic Health Records , Information Storage and Retrieval , Intensive Care Units , Models, Theoretical , Phenotype , Policy
ABSTRACT
We introduce 3000PA, a clinical document corpus composed of 3,000 EPRs from three different clinical sites, which will serve as the backbone of a national reference language resource for German clinical NLP. We outline its design principles, results from a medication annotation campaign and the evaluation of a first medication information extraction prototype using a subset of 3000PA.
Subject(s)
Information Storage and Retrieval , Natural Language Processing , Humans , Language
ABSTRACT
We present the outcome of an annotation effort targeting the content-sensitive segmentation of German clinical reports into sections. We recruited an annotation team of up to eight medical students to annotate a clinical text corpus on a sentence-by-sentence basis in four pre-annotation iterations and one final main annotation step. The annotation scheme we came up with adheres to categories developed for clinical documents in the HL7-CDA (Clinical Document Architecture) standard for section headings. Once the scheme became stable, we ran the main annotation campaign on the complete set of roughly 1,000 clinical documents. Due to its reliance on the CDA standard, the annotation scheme allows the integration of legacy and newly produced clinical documents within a common pipeline. We then made direct use of the annotations by training a baseline classifier to automatically identify sections in clinical reports.
Subject(s)
Language , Patient Discharge Summaries/classification , Data Curation , Germany , Humans
ABSTRACT
With the increasing availability of complete full texts (journal articles), rather than their surrogates (titles, abstracts), as resources for text analytics, entirely new opportunities arise for information extraction and text mining from scholarly publications. Yet, we gathered evidence that a range of problems are encountered for full-text processing when biomedical text analytics simply reuse existing NLP pipelines which were developed on the basis of abstracts (rather than full texts). We conducted experiments with four different relation extraction engines, all of which were top performers in previous BioNLP Event Extraction Challenges. We found that abstract-trained engines lose up to 6.6% F-score points when run on full-text data. Hence, the reuse of existing abstract-based NLP software in a full-text scenario is considered harmful because of heavy performance losses. Given the current lack of annotated full-text resources to train on, our study quantifies the price paid for this shortcut.
Subject(s)
Data Mining , Information Storage and Retrieval , PubMed , Natural Language Processing , Software
ABSTRACT
Amino acids (aa) are not only building blocks for proteins, but also signalling molecules, with the mammalian target of rapamycin complex 1 (mTORC1) acting as a key mediator. However, little is known about whether aa, independently of mTORC1, activate other kinases of the mTOR signalling network. To delineate aa-stimulated mTOR network dynamics, we here combine a computational-experimental approach with text mining-enhanced quantitative proteomics. We report that AMP-activated protein kinase (AMPK), phosphatidylinositide 3-kinase (PI3K) and mTOR complex 2 (mTORC2) are acutely activated by aa-readdition in an mTORC1-independent manner. AMPK activation by aa is mediated by Ca2+/calmodulin-dependent protein kinase kinase β (CaMKKβ). In response, AMPK impinges on the autophagy regulators Unc-51-like kinase-1 (ULK1) and c-Jun. AMPK is widely recognized as an mTORC1 antagonist that is activated by starvation. We find that aa acutely activate AMPK concurrently with mTOR. We show that AMPK under aa sufficiency acts to sustain autophagy. This may be required to maintain protein homoeostasis and deliver metabolite intermediates for biosynthetic processes.