ABSTRACT
OBJECTIVES: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes, diseases and drugs (or medications), and relations between them. METHODS: For the time period from 2017 to early 2020, we searched for relevant publications from three major scientific communities: medicine and medical informatics, natural language processing, as well as neural networks and artificial intelligence. RESULTS: In the past decade, the field of Natural Language Processing (NLP) has undergone a profound methodological shift from symbolic to distributed representations based on the paradigm of Deep Learning (DL). Meanwhile, this trend is, although with some delay, also reflected in the medical NLP community. In the reporting period, overwhelming experimental evidence has been gathered, as illustrated in this survey for medical IE, that DL-based approaches outperform non-DL ones by often large margins. Still, small-sized and access-limited corpora create intrinsic problems for data-greedy DL as do special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies. CONCLUSIONS: The paradigm shift from (feature-engineered) ML to DNNs changes the fundamental methodological rules of the game for medical NLP. This change is by no means restricted to medical IE but should also deeply influence other areas of medical informatics, either NLP- or non-NLP-based.
Subjects
Information Storage and Retrieval/methods , Natural Language Processing , Neural Networks, Computer , Datasets as Topic , Deep Learning , Disease , Drug Interactions , Humans , Medical Informatics , Pharmaceutical Preparations
ABSTRACT
Word embeddings have become the predominant token-level representation scheme for various clinical natural language processing (NLP) tasks. More recently, character-level neural language models, exploiting recurrent neural networks, have again received attention, because they achieved similar performance on various NLP benchmarks. We investigated to what extent character-based language models can be applied to the clinical domain and whether they are able to capture reasonable lexical semantics using this maximally fine-grained representation scheme. We trained a long short-term memory network on an excerpt from a table of de-identified 50-character long problem list entries in German, each of which was assigned to an ICD-10 code. We modelled the task as a time series of one-hot encoded single character inputs. After the training phase we accessed the top 10 most similar character-induced word embeddings related to a clinical concept via a nearest neighbour search and evaluated the expected interconnected semantics. Results showed that traceable semantics were captured on a syntactic level above single characters, addressing the idiosyncratic nature of clinical language. The results support recent work on general language modelling that raised the question whether token-based representation schemes are still necessary for specific NLP tasks.
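The input representation and the neighbour lookup mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character inventory ALPHABET, the function names, and the use of cosine similarity are assumptions made for illustration, and the LSTM training step is omitted entirely.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß .-"  # assumed character inventory
MAX_LEN = 50  # the abstract states that entries are 50 characters long


def one_hot_sequence(text):
    """Encode a string as MAX_LEN one-hot vectors, one per character."""
    # Shorter entries are padded with spaces, which are part of ALPHABET.
    padded = text.lower()[:MAX_LEN].ljust(MAX_LEN)
    sequence = []
    for ch in padded:
        vec = [0] * len(ALPHABET)
        idx = ALPHABET.find(ch)
        if idx >= 0:  # characters outside the inventory stay all-zero
            vec[idx] = 1
        sequence.append(vec)
    return sequence


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, embeddings, k=10):
    """Return the k words whose vectors are most similar to the query's vector."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word)
              for word, vec in embeddings.items() if word != query]
    return [word for _, word in sorted(scored, reverse=True)[:k]]
```

Here `embeddings` stands for a hypothetical dictionary mapping words to vectors read out of a trained model; with toy two-dimensional vectors, `nearest_neighbours("herz", embeddings, k=1)` returns the single closest word.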
Subjects
Language , Natural Language Processing , Cluster Analysis , Neural Networks, Computer , Patient Care , Semantics
ABSTRACT
Acronyms frequently occur in clinical text, which makes their identification, disambiguation and resolution an important task in clinical natural language processing. This paper contributes to acronym resolution in Spanish through the creation of a set of sense inventories organized by clinical specialty containing acronyms, their expansions, and corpus-driven features. The new acronym resource is composed of 51 clinical specialties with 3,603 acronyms in total, from which we identified 228 language-independent acronyms and 391 language-dependent expansions. We further analyzed the sense inventory across specialties and present novel insights into acronym usage in biomedical Spanish texts.
Subjects
Abbreviations as Topic , Natural Language Processing , PubMed , Artificial Intelligence , Humans , Language
ABSTRACT
OBJECTIVE: Automated clinical phenotyping is challenging because word-based features quickly turn it into a high-dimensional problem, in which the small, privacy-restricted training datasets might lead to overfitting. Pretrained embeddings might solve this issue by reusing input representation schemes trained on a larger dataset. We sought to evaluate shallow and deep learning text classifiers and the impact of pretrained embeddings in a small clinical dataset. MATERIALS AND METHODS: We participated in the 2018 National NLP Clinical Challenges (n2c2) Shared Task on cohort selection and received an annotated dataset with medical narratives of 202 patients for multilabel binary text classification. We set our baseline to a majority classifier, to which we compared a rule-based classifier and orthogonal machine learning strategies: support vector machines, logistic regression, and long short-term memory neural networks. We evaluated logistic regression and long short-term memory using both self-trained and pretrained BioWordVec word embeddings as input representation schemes. RESULTS: The rule-based classifier showed the highest overall micro F1 score (0.9100), with which we finished first in the challenge. Shallow machine learning strategies showed lower overall micro F1 scores, but still higher than deep learning strategies and the baseline. We could not show a difference in classification efficiency between self-trained and pretrained embeddings. DISCUSSION: Clinical context, negation, and value-based criteria hindered shallow machine learning approaches, while deep learning strategies could not capture the term diversity due to the small training dataset. CONCLUSION: Shallow methods for clinical phenotyping can still outperform deep learning methods in small imbalanced data, even when supported by pretrained embeddings.
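As a worked illustration of the evaluation setting, the Python sketch below computes a micro-averaged F1 score over multilabel binary decisions and builds a per-label majority baseline. It is a sketch under assumptions: it scores the positive label only, ties in the baseline are resolved towards 1, and the function names are hypothetical. The challenge's official scorer may differ in detail.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over all (patient, criterion) binary decisions.

    gold and predicted are lists of rows; each row holds one 0/1 value
    per criterion. Counts are pooled over all labels before F1 is computed.
    """
    tp = fp = fn = 0
    for gold_row, pred_row in zip(gold, predicted):
        for g, p in zip(gold_row, pred_row):
            if g and p:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def majority_baseline(train_labels):
    """Return, for each criterion, the value seen most often in training."""
    n_rows = len(train_labels)
    # Assumption: a tie between 0 and 1 is resolved towards 1.
    return [1 if 2 * sum(column) >= n_rows else 0
            for column in zip(*train_labels)]


gold = [[1, 0, 1], [1, 1, 0]]
pred = [[1, 0, 0], [1, 1, 1]]
print(micro_f1(gold, pred))  # 3 true positives, 1 false positive, 1 false negative -> 0.75
```

Pooling the counts before computing F1 means frequent criteria dominate the score, which is one reason a majority baseline can look competitive on small imbalanced data.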
Subjects
Clinical Trials as Topic/methods , Data Mining/methods , Machine Learning , Natural Language Processing , Patient Selection , Classification , Deep Learning , Humans , Logistic Models , Neural Networks, Computer
ABSTRACT
Pathology reports are a main source of information regarding cancer diagnosis and are commonly written following semi-structured templates that include tumour localisation and behaviour. In this work, we evaluated the efficiency of support vector machines (SVMs) to classify pathology reports written in Portuguese into the International Classification of Diseases for Oncology (ICD-O), a biaxial classification of cancer topography and morphology. A partnership program with the Brazilian hospital A.C. Camargo Cancer Center provided anonymised pathology reports and structured data from 94,980 patients used for training and validation. We employed SVMs with a tf-idf weighting scheme in a bag-of-words approach and report an F1 score of 0.82 for 18 sites and 0.73 for 49 morphology classes. With the largest dataset ever used in such a task, our work provides reliable estimates for the classification of pathology reports in Portuguese and agrees with a few similar studies published on the same kind of data in other languages.
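The tf-idf bag-of-words weighting named in this abstract can be sketched in a few lines of Python. The abstract does not say which tf-idf variant was used, so the formula below (relative term frequency times the logarithm of N over document frequency) is an assumption, as are the function name and the toy tokens. The SVM training step is not shown; the resulting sparse vectors are what such a classifier would take as input.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenised documents into sparse tf-idf dictionaries (bag of words).

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy, already tokenised reports (hypothetical data).
reports = [
    ["carcinoma", "ductal", "mama"],
    ["adenocarcinoma", "prostata"],
    ["carcinoma", "pulmao"],
]
vectors = tfidf_vectors(reports)
```

In this toy example "carcinoma" occurs in two of three reports and therefore receives a lower weight than "mama", which occurs in only one; a term present in every report would receive weight zero.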
Subjects
International Classification of Diseases/organization & administration , Neoplasms/pathology , Support Vector Machine , Brazil , Humans , Neoplasms/diagnosis , Registries
ABSTRACT
Clinical narratives are typically produced under time pressure, which encourages the use of abbreviations and acronyms. Expanding such short forms correctly eases text comprehension and further semantic processing. We propose a completely unsupervised and data-driven algorithm for the resolution of non-lexicalised and potentially ambiguous abbreviations. Based on the lookup of word bigrams and unigrams extracted from a corpus of 30,000 pseudonymised cardiology reports in German, our method achieved an F
Subjects
Algorithms , Narration , Natural Language Processing , Humans , Pressure , Semantics
ABSTRACT
Clinical trials are studies designed to assess whether a new intervention is better than the current alternatives. However, most of them fail to recruit participants on schedule. It is hard to use Electronic Health Record (EHR) data to find eligible patients; studies therefore rely on manual assessment, which is time-consuming, inefficient and requires specialized training. In this work we describe the design and development of an information retrieval system with the objective of finding eligible patients for cancer trials. The Recruit system has been in use at A.C. Camargo Cancer Center since August 2014 and contains data from more than 500,000 patients and 9 databases. It uses ontologies to integrate data from several sources and represent medical knowledge, which helps enhance results. One can search both in structured data and inside free-text reports. Preliminary quality assessments show excellent recall rates. Recruit proved to be a useful tool for researchers, and its modular design could be applied to other clinical conditions and hospitals.
Subjects
Biological Ontologies , Clinical Trials as Topic/methods , Data Mining/methods , Electronic Health Records/classification , Neoplasms/classification , Patient Selection , Brazil , Clinical Trials as Topic/organization & administration , Eligibility Determination/methods , Eligibility Determination/organization & administration , Natural Language Processing
ABSTRACT
This work develops an automated classifier of pathology reports which infers the topography and the morphology classes of a tumor using codes from the International Classification of Diseases for Oncology (ICD-O). Data from 94,980 patients of the A.C. Camargo Cancer Center were used for training and validation of Naive Bayes classifiers, evaluated by the F1-score. Measures greater than 74% in the topographic group and 61% in the morphologic group are reported. Our work provides a successful baseline for future research on the classification of medical documents written in Portuguese and in other domains.
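A Naive Bayes text classifier of the kind named in this abstract can be sketched in plain Python. The abstract does not specify the variant, so the multinomial model with add-one (Laplace) smoothing below is an assumption, and the class name, tokens and the use of the topography codes C50 and C61 as toy labels are illustrative only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed variant)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior plus summed log likelihoods of the known terms
            score = math.log(n_docs / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for term in tokens:
                if term in self.vocabulary:  # unseen terms are ignored
                    count = self.term_counts[label][term]
                    score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical tokens and labels).
model = NaiveBayes().fit(
    [["carcinoma", "mama"], ["carcinoma", "ductal", "mama"],
     ["adenocarcinoma", "prostata"]],
    ["C50", "C50", "C61"],
)
```

With this toy data, `model.predict(["mama"])` selects C50 and `model.predict(["prostata"])` selects C61; in the setting described above, one such classifier would be trained for topography and another for morphology.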
Subjects
Data Mining/methods , Diagnosis, Computer-Assisted/methods , Natural Language Processing , Neoplasms/diagnosis , Neoplasms/pathology , Pathology/classification , Brazil/epidemiology , Decision Support Systems, Clinical , International Classification of Diseases , Neoplasms/epidemiology , Pattern Recognition, Automated/methods , Prevalence , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
Part-of-speech (POS) taggers need a considerable amount of data to train their models. Such data is not readily available for medical texts in Portuguese. We evaluated the accuracy of a morphological tagger against a gold standard when trained with corpora of different sizes and domains. Accuracy was highest with a medical corpus throughout the training process, reaching 91.5%. Training on a newswire corpus achieved only 75.3%. Furthermore, an active learning technique was adapted to the POS tagging task. The algorithm uses a committee of POS taggers to isolate the sentences with the highest disagreement indexes for manual correction. However, the method was not able to reduce training and tagging times when compared to a random selection strategy. We recommend that future work invest some effort in annotating a small amount of random data in the domain of study, which should be enough to achieve higher accuracy rates.
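The committee-based selection step described in this abstract can be sketched as follows. The abstract does not define the disagreement index, so the measure used here (the fraction of token positions on which the committee members do not all agree) is an assumption, as are the function names and the representation of a tagger as a callable that maps a list of tokens to a list of tags.

```python
def disagreement_index(taggings):
    """Fraction of token positions on which the taggers do not all agree.

    taggings holds one tag sequence per committee member, all of equal length.
    """
    n_tokens = len(taggings[0])
    if n_tokens == 0:
        return 0.0
    disagreements = sum(1 for tags in zip(*taggings) if len(set(tags)) > 1)
    return disagreements / n_tokens


def select_for_annotation(sentences, committee, batch_size):
    """Pick the batch_size sentences with the highest committee disagreement."""
    scored = []
    for sentence in sentences:
        taggings = [tagger(sentence) for tagger in committee]
        scored.append((disagreement_index(taggings), sentence))
    # Stable sort: sentences with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:batch_size]]
```

The selected sentences would then be corrected manually and added to the training data before the committee is retrained. The random selection strategy used for comparison would replace the ranking with a uniform sample of the same size.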