1.
BMC Med Inform Decis Mak ; 24(1): 162, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915012

ABSTRACT

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large numbers of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.


Subjects
Natural Language Processing, Humans, Privacy, Sweden, Anonyms and Pseudonyms, Computer Security/standards, Confidentiality/standards, Electronic Health Records/standards
2.
Artif Intell Med ; 146: 102695, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38042595

ABSTRACT

Clinical prediction models tend only to incorporate structured healthcare data, ignoring information recorded in other data modalities, including free-text clinical notes. Here, we demonstrate how multimodal models that effectively leverage both structured and unstructured data can be developed for predicting COVID-19 outcomes. The models are trained end-to-end using a technique we refer to as multimodal fine-tuning, whereby a pre-trained language model is updated based on both structured and unstructured data. The multimodal models are trained and evaluated using a multicenter cohort of COVID-19 patients encompassing all encounters at the emergency department of six hospitals. Experimental results show that multimodal models, leveraging the notion of multimodal fine-tuning and trained to predict (i) 30-day mortality, (ii) safe discharge and (iii) readmission, outperform unimodal models trained using only structured or unstructured healthcare data on all three outcomes. Sensitivity analyses are performed to better understand how well the multimodal models perform on different patient groups, while an ablation study is conducted to investigate the impact of different types of clinical notes on model performance. We argue that multimodal models that make effective use of routinely collected healthcare data to predict COVID-19 outcomes may facilitate patient management and contribute to the effective use of limited healthcare resources.


Subjects
COVID-19, Humans, COVID-19/epidemiology, Hospital Emergency Service, Hospitals, Language, Patient Discharge, Natural Language Processing
3.
Antimicrob Resist Infect Control ; 12(1): 117, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37884948

ABSTRACT

BACKGROUND: In patients who underwent colorectal surgery, an existing semi-automated surveillance algorithm based on structured data achieves high sensitivity in detecting deep surgical site infections (SSI) but generates a significant number of false positives. The inclusion of unstructured clinical narratives in the algorithm may decrease the number of patients requiring manual chart review. The aim of this study was to investigate the performance of this semi-automated surveillance algorithm augmented with a natural language processing (NLP) component to improve positive predictive value (PPV) and thus workload reduction (WR). METHODS: Retrospective, observational cohort study in patients who underwent colorectal surgery from January 1, 2015, through September 30, 2020. NLP was used to detect keyword counts in clinical notes. Several NLP algorithms were developed with different count input types and classifiers, and added as a component to the original semi-automated algorithm. Traditional manual surveillance was compared with the NLP-augmented surveillance algorithms, and sensitivity, specificity, PPV and WR were calculated. RESULTS: Among the NLP-augmented models, the decision tree models with discretized counts or binary counts had the best performance (sensitivity 95.1% (95% CI 83.5-99.4%), WR 60.9%) and improved PPV and WR by only 2.6% and 3.6%, respectively, compared to the original algorithm. CONCLUSIONS: The addition of an NLP component to the existing algorithm had a modest effect on WR (decrease of 1.4-12.5%), at the cost of sensitivity. Future implementation will involve a trade-off between optimal case-finding techniques and practical considerations such as acceptability and availability of resources.


Subjects
Colorectal Surgery, Surgical Wound Infection, Humans, Retrospective Studies, Surgical Wound Infection/diagnosis, Surgical Wound Infection/prevention & control, Colorectal Surgery/adverse effects, Cohort Studies, Predictive Value of Tests
4.
Sci Rep ; 13(1): 11760, 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37474597

ABSTRACT

Sepsis is a leading cause of mortality and early identification improves survival. With increasing digitalization of health care data, automated sepsis prediction models hold promise to aid in prompt recognition. Most previous studies have focused on the intensive care unit (ICU) setting. Yet only a small proportion of sepsis develops in the ICU and there is an apparent clinical benefit to identifying patients earlier in the disease trajectory. In this cohort of 82,852 hospital admissions and 8038 sepsis episodes classified according to the Sepsis-3 criteria, we demonstrate that a machine learned score can predict sepsis onset within 48 h using sparse routine electronic health record data outside the ICU. Our score was based on a causal probabilistic network model, SepsisFinder, which has similarities with clinical reasoning. A prediction was generated hourly on all admissions, provided a new variable was registered. Compared to the National Early Warning Score (NEWS2), which is an established method to identify sepsis, the SepsisFinder triggered earlier and had a higher area under receiver operating characteristic curve (AUROC) (0.950 vs. 0.872), as well as area under precision-recall curve (APR) (0.189 vs. 0.149). A machine learning comparator based on a gradient-boosting decision tree model had similar AUROC (0.949) and higher APR (0.239) than SepsisFinder but triggered later than both NEWS2 and SepsisFinder. The precision of SepsisFinder increased if screening was restricted to the earlier admission period and in episodes with bloodstream infection. Furthermore, the SepsisFinder signaled a median of 5.5 h prior to antibiotic administration. Identifying a high-risk population with this method could be used to tailor clinical interventions and improve patient care.


Subjects
Electronic Health Records, Sepsis, Humans, Retrospective Studies, Sepsis/diagnosis, Sepsis/epidemiology, Algorithms, Hospitalization, ROC Curve, Intensive Care Units, Hospital Mortality
5.
Article in English | MEDLINE | ID: mdl-36310782

ABSTRACT

We developed and validated a set of fully automated surveillance algorithms for healthcare-onset Clostridioides difficile infection (CDI) using electronic health records. In a validation data set of 750 manually annotated admissions, the algorithm based on International Classification of Diseases, Tenth Revision (ICD-10) code A04.7 had insufficient sensitivity. Algorithms based on microbiological test results, with or without the addition of symptoms, performed well.

6.
Crit Care Med ; 50(3): e272-e283, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-34406170

ABSTRACT

OBJECTIVES: Sequential Organ Failure Assessment score is the basis of the Sepsis-3 criteria and requires arterial blood gas analysis to assess respiratory function. Peripheral oxygen saturation is a noninvasive alternative but is included in neither the Sequential Organ Failure Assessment score nor Sepsis-3. We aimed to assess the association between worst peripheral oxygen saturation during onset of suspected infection and mortality. DESIGN: Cohort study of hospital admissions from a main cohort and emergency department visits from four external validation cohorts between 2011 and 2018. Data were collected from electronic health records and prospectively by study investigators. SETTING: Eight academic and community hospitals in Sweden and Canada. PATIENTS: Adult patients with suspected infection episodes. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: The main cohort included 19,396 episodes (median age, 67.0 [53.0-77.0]; 9,007 [46.4%] women; 1,044 [5.4%] died). The validation cohorts included 10,586 episodes (range of median age, 61.0-76.0; women 42.1-50.2%; mortality 2.3-13.3%). Peripheral oxygen saturation levels 96-95% were not significantly associated with increased mortality in the main or pooled validation cohorts. At peripheral oxygen saturation 94%, the adjusted odds ratio of death was 1.56 (95% CI, 1.10-2.23) in the main cohort and 1.36 (95% CI, 1.00-1.85) in the pooled validation cohorts and increased gradually below this level. Respiratory assessment using peripheral oxygen saturation 94-91% and less than 91% to generate 1 and 2 Sequential Organ Failure Assessment points, respectively, improved the discrimination of the Sequential Organ Failure Assessment score from area under the receiver operating characteristics 0.75 (95% CI, 0.74-0.77) to 0.78 (95% CI, 0.77-0.80; p < 0.001). Peripheral oxygen saturation/Fio2 ratio had slightly better predictive performance compared with peripheral oxygen saturation alone, but the clinical impact was minor. CONCLUSIONS: These findings provide evidence for assessing respiratory function with peripheral oxygen saturation in the Sequential Organ Failure Assessment score and the Sepsis-3 criteria. Our data support using peripheral oxygen saturation thresholds 94% and 90% to get 1 and 2 Sequential Organ Failure Assessment respiratory points, respectively. This has important implications primarily for emergency practice, rapid response teams, surveillance, research, and resource-limited settings.
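
The threshold rule described in this abstract can be sketched as a small function. This is an illustration only: the function name is hypothetical, a score of 0 above 94% is an assumption, and the abstract does not say how SpO2 would map to the higher respiratory scores (3 and 4) of the full Sequential Organ Failure Assessment. Fractional readings follow the "less than 91%" wording.

```python
def spo2_respiratory_points(spo2_percent: float) -> int:
    """Map worst peripheral oxygen saturation (%) to respiratory points.

    Thresholds taken from the abstract: 94-91% gives 1 point and
    less than 91% gives 2 points. Returning 0 above 94% is an assumption.
    """
    if not 0 <= spo2_percent <= 100:
        raise ValueError("SpO2 must be a percentage between 0 and 100")
    if spo2_percent < 91:
        return 2
    if spo2_percent <= 94:
        return 1
    return 0


if __name__ == "__main__":
    for reading in (97, 94, 91, 90):
        print(reading, spo2_respiratory_points(reading))
```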


Subjects
Intensive Care Units, Organ Dysfunction Scores, Oxygen Consumption/physiology, Oxygen Saturation/physiology, Sepsis/blood, Sepsis/mortality, Aged, Cohort Studies, Female, Hospital Mortality, Humans, Male, Middle Aged, Oxygen/blood, Retrospective Studies, Systemic Inflammatory Response Syndrome
7.
BMJ Qual Saf ; 29(9): 735-745, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32029574

ABSTRACT

BACKGROUND: Surveillance of sepsis incidence is important for directing resources and evaluating quality-of-care interventions. The aim was to develop and validate a fully-automated Sepsis-3 based surveillance system in non-intensive care wards using electronic health record (EHR) data, and demonstrate utility by determining the burden of hospital-onset sepsis and variations between wards. METHODS: A rule-based algorithm was developed using EHR data from a cohort of all adult patients admitted at an academic centre between July 2012 and December 2013. Time in intensive care units was censored. To validate algorithm performance, a stratified random sample of 1000 hospital admissions (674 with and 326 without suspected infection) was classified according to the Sepsis-3 clinical criteria (suspected infection defined as having any culture taken and at least two doses of antimicrobials administered, and an increase in Sequential Organ Failure Assessment (SOFA) score by >2 points) and the likelihood of infection by physician medical record review. RESULTS: In total 82 653 hospital admissions were included. The Sepsis-3 clinical criteria determined by physician review were met in 343 of 1000 episodes. Among them, 313 (91%) had possible, probable or definite infection. Based on this reference, the algorithm achieved sensitivity 0.887 (95% CI: 0.799 to 0.964), specificity 0.985 (95% CI: 0.978 to 0.991), positive predictive value 0.881 (95% CI: 0.833 to 0.926) and negative predictive value 0.986 (95% CI: 0.973 to 0.996). When applied to the total cohort taking into account the sampling proportions of those with and without suspected infection, the algorithm identified 8599 (10.4%) sepsis episodes. The burden of hospital-onset sepsis (>48 hours after admission) and related in-hospital mortality varied between wards. CONCLUSIONS: A fully-automated Sepsis-3 based surveillance algorithm using EHR data performed well compared with physician medical record review in non-intensive care wards, and exposed variations in hospital-onset sepsis incidence between wards.
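
The four validation measures reported in this abstract follow directly from a confusion matrix against the reference standard. The sketch below is a minimal illustration with a hypothetical function name and made-up counts; it does not reproduce the study's figures, which additionally account for the stratified sampling proportions, and it assumes every denominator is non-zero.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute sensitivity, specificity, PPV and NPV from confusion counts.

    tp/fp/fn/tn are algorithm results compared with a reference standard
    (for example, physician record review). Unweighted; assumes no
    denominator is zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were flagged
        "specificity": tn / (tn + fp),  # share of non-cases left unflagged
        "ppv": tp / (tp + fp),          # share of flagged episodes that are true cases
        "npv": tn / (tn + fn),          # share of unflagged episodes that are non-cases
    }


if __name__ == "__main__":
    # Illustrative counts only, not data from the study.
    print(diagnostic_metrics(tp=90, fp=10, fn=10, tn=890))
```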


Subjects
Physicians, Sepsis, Adult, Electronic Health Records, Female, HIV Infections, Hospital Mortality, General Hospitals, Humans, Intensive Care Units, Retrospective Studies
8.
Stud Health Technol Inform ; 235: 201-205, 2017.
Article in English | MEDLINE | ID: mdl-28423783

ABSTRACT

Automated diagnosis coding can be provided efficiently by learning predictive models from historical data; however, discriminating between thousands of codes while allowing a variable number of codes to be assigned is extremely difficult. Here, we explore various text representations and classification models for assigning ICD-9 codes to discharge summaries in MIMIC-III. It is shown that the relative effectiveness of the investigated representations depends on the frequency of the diagnosis code under consideration and that the best performance is obtained by combining models built using different representations.


Subjects
International Classification of Diseases, Patient Discharge Summaries, Clinical Coding, Electronic Health Records, Humans, Machine Learning
9.
Stud Health Technol Inform ; 235: 216-220, 2017.
Article in English | MEDLINE | ID: mdl-28423786

ABSTRACT

Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying them to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.


Subjects
Confidentiality, Electronic Health Records, Natural Language Processing, Humans, Machine Learning, Medical Records, Prevalence, Sweden
10.
Stud Health Technol Inform ; 235: 241-245, 2017.
Article in English | MEDLINE | ID: mdl-28423790

ABSTRACT

The mapping of unstructured clinical text to an ontology facilitates meaningful secondary use of health records but is non-trivial due to lexical variation and the abundance of misspellings in hurriedly produced notes. Here, we apply several spelling correction methods to Swedish medical text and evaluate their impact on SNOMED CT mapping; first in a controlled evaluation using medical literature text with induced errors, followed by a partial evaluation on clinical notes. It is shown that the best-performing method is context-sensitive, taking into account trigram frequencies and utilizing a corpus-based dictionary.


Subjects
Electronic Health Records/organization & administration, Natural Language Processing, Systematized Nomenclature of Medicine, Algorithms, Meaningful Use, Health Care Quality Assurance/methods, Sweden, Controlled Vocabulary
11.
Stud Health Technol Inform ; 245: 393-397, 2017.
Article in English | MEDLINE | ID: mdl-29295123

ABSTRACT

To enable secondary use of healthcare data in a privacy-preserving manner, there is a need for methods capable of automatically identifying protected health information (PHI) in clinical text. To that end, learning predictive models from labeled examples has emerged as a promising alternative to rule-based systems. However, little is known about differences with respect to PHI prevalence in different types of clinical notes and how potential domain differences may affect the performance of predictive models trained on one particular type of note and applied to another. In this study, we analyze the performance of a predictive model trained on an existing PHI corpus of Swedish clinical notes and applied to a variety of clinical notes: written (i) in different clinical specialties, (ii) under different headings, and (iii) by persons in different professions. The results indicate that domain adaptation is needed for effective detection of PHI in heterogeneous clinical notes.


Subjects
Electronic Health Records, Privacy, Humans, Natural Language Processing, Prevalence, Sweden
12.
J Biomed Semantics ; 7(1): 58, 2016 Sep 26.
Article in English | MEDLINE | ID: mdl-27671202

ABSTRACT

BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in languages other than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs. METHODS: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies. RESULTS: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25% for a candidate list of top n terms and a recall of 68% for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58%, compared to 46% for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23% for Body Part, compared to 16% for Medical Finding. CONCLUSIONS: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
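
The evaluation measure used in this abstract, recall within a candidate list of length n or 10n, can be sketched as follows. The function name and the toy terms are hypothetical; the study's candidate lists come from its clustering method, which is not reproduced here.

```python
def recall_at_k(ranked_candidates: list, reference_terms: set, k: int) -> float:
    """Fraction of reference terms found among the top-k ranked candidates."""
    if not reference_terms:
        raise ValueError("reference_terms must not be empty")
    top_k = set(ranked_candidates[:k])
    return len(reference_terms & top_k) / len(reference_terms)


if __name__ == "__main__":
    # Toy data: 'ranked' stands in for candidates ordered by closeness to
    # cluster centroids, 'held_out' for the n vocabulary terms to retrieve.
    ranked = ["fever", "table", "cough", "rash", "blog", "nausea"]
    held_out = {"cough", "nausea", "headache"}
    n = len(held_out)
    print(recall_at_k(ranked, held_out, n))       # top n
    print(recall_at_k(ranked, held_out, 10 * n))  # top 10n
```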

13.
BMC Med Inform Decis Mak ; 16 Suppl 2: 69, 2016 Jul 21.
Article in English | MEDLINE | ID: mdl-27459846

ABSTRACT

BACKGROUND: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events - modeled in an ensemble of semantic spaces - for the purpose of predictive modeling. METHODS: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events - diagnosis codes, drug codes, measurements, and words in clinical notes - are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces - corresponding to the considered data types - of a given context window size. RESULTS: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. CONCLUSIONS: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy - significantly outperforming the considered alternatives - involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.


Subjects
Decision Trees, Drug-Related Side Effects and Adverse Reactions, Electronic Health Records, Machine Learning, Theoretical Models, Pharmacovigilance, Humans, Semantics
14.
BMC Med Inform Decis Mak ; 16 Suppl 2: 71, 2016 Jul 21.
Article in English | MEDLINE | ID: mdl-27459993

ABSTRACT

BACKGROUND: Longitudinal data sources, such as electronic health records (EHRs), are very valuable for monitoring adverse drug events (ADEs). However, ADEs are heavily under-reported in EHRs. Using machine learning algorithms to automatically detect patients that should have had ADEs reported in their health records is an efficient and effective solution. One of the challenges to that end is how to take into account the temporality of clinical events, which are time stamped in EHRs, and provide these as features for machine learning algorithms to exploit. Previous research on this topic suggests that representing EHR data as a bag of temporally weighted clinical events is promising; however, the weights were in that case pre-assigned according to their time stamps, which is limited and potentially less accurate. This study therefore focuses on how to learn weights that effectively take into account the temporality and importance of clinical events for ADE detection. METHODS: Variable importance obtained from the random forest learning algorithm is used for extracting temporal weights. Two strategies are proposed for applying the learned weights: weighted aggregation and weighted sampling. The first strategy aggregates the weighted clinical events from different time windows to form new features; the second strategy retains the original features but samples them by using their weights as probabilities when building each tree in the forest. The predictive performance of random forest models using the learned weights with the two strategies is compared to using pre-assigned weights. In addition, to assess the sensitivity of the weight-learning procedure, weights from different granularity levels are evaluated and compared. RESULTS: In the weighted sampling strategy, using learned weights significantly improves the predictive performance, in comparison to using pre-assigned weights; however, there is no significant difference between them in the weighted aggregation strategy. Moreover, the granularity of the weight learning procedure has a significant impact on the former, but not on the latter. CONCLUSIONS: Learning temporal weights is significantly beneficial in terms of predictive performance with the weighted sampling strategy. Moreover, weighted aggregation generally diminishes the impact of temporal weighting of the clinical events, irrespective of whether the weights are pre-assigned or learned.
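
The weighted sampling strategy in this abstract, in which features are drawn with probability proportional to their weights when a tree is built, can be sketched roughly as below. Everything here is an assumption made for illustration: the function name, the feature names, the weight values, and the choice to draw without replacement. The abstract does not specify these details, and the random forest itself is not implemented.

```python
import random


def sample_features(weights: dict, k: int, rng: random.Random) -> list:
    """Draw k distinct feature names, each draw proportional to its weight.

    Assumes weights are non-negative and that at least k of them are
    positive; random.Random.choices raises ValueError otherwise.
    """
    if k > len(weights):
        raise ValueError("k cannot exceed the number of features")
    remaining = dict(weights)
    chosen = []
    for _ in range(k):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement (an assumption)
    return chosen


if __name__ == "__main__":
    # Hypothetical learned weights for clinical events in two time windows.
    learned = {
        "drug_A_0-30d": 0.40,
        "drug_A_31-90d": 0.10,
        "diagnosis_B_0-30d": 0.35,
        "diagnosis_B_31-90d": 0.15,
    }
    print(sample_features(learned, k=2, rng=random.Random(0)))
```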


Subjects
Drug-Related Side Effects and Adverse Reactions, Electronic Health Records, Machine Learning, Pharmacovigilance, Humans
15.
Int J Data Min Bioinform ; 13(4): 395-411, 2015.
Article in English | MEDLINE | ID: mdl-26547986

ABSTRACT

The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories. Features describing whether a given word belongs to a certain category are then provided to the learning algorithm. It has been shown that using multiple distributional semantic models, each employing a different word order strategy, can lead to enhanced predictive performance. Here, another hyperparameter, the size of the context window, is also varied, and an experimental investigation shows that this leads to further performance gains.


Subjects
Data Mining/methods, Electronic Health Records/organization & administration, Natural Language Processing, Semantics, Terminology as Topic, Controlled Vocabulary, Database Management Systems/organization & administration, Machine Learning, Automated Pattern Recognition/methods
16.
BMC Med Inform Decis Mak ; 15 Suppl 4: S1, 2015.
Article in English | MEDLINE | ID: mdl-26606038

ABSTRACT

BACKGROUND: The digitization of healthcare data, resulting from the increasingly widespread adoption of electronic health records, has greatly facilitated its analysis by computational methods and thereby enabled large-scale secondary use thereof. This can be exploited to support public health activities such as pharmacovigilance, wherein the safety of drugs is monitored to inform regulatory decisions about sustained use. To that end, electronic health records have emerged as a potentially valuable data source, providing access to longitudinal observations of patient treatment and drug use. A nascent line of research concerns predictive modeling of healthcare data for the automatic detection of adverse drug events, which presents its own set of challenges: it is not yet clear how to represent the heterogeneous data types in a manner conducive to learning high-performing machine learning models. METHODS: Datasets from an electronic health record database are used for learning predictive models with the purpose of detecting adverse drug events. The use and representation of two data types, as well as their combination, are studied: clinical codes, describing prescribed drugs and assigned diagnoses, and measurements. Feature selection is conducted on the various types of data to reduce dimensionality and sparsity, while allowing for an in-depth feature analysis of the usefulness of each data type and representation. RESULTS: Within each data type, combining multiple representations yields better predictive performance compared to using any single representation. The use of clinical codes for adverse drug event detection significantly outperforms the use of measurements; however, there is no significant difference over datasets between using only clinical codes and their combination with measurements. For certain adverse drug events, the combination does, however, outperform using only clinical codes. Feature selection leads to increased predictive performance for both data types, in isolation and combined. CONCLUSIONS: We have demonstrated how machine learning can be applied to electronic health records for the purpose of detecting adverse drug events and proposed solutions to some of the challenges this presents, including how to represent the various data types. Overall, clinical codes are more useful than measurements and, in specific cases, it is beneficial to combine the two.


Subjects
Drug-Related Side Effects and Adverse Reactions/diagnosis, Electronic Health Records, Machine Learning, Pharmacovigilance, Algorithms, Computer Simulation, Factual Databases, Drug-Related Side Effects and Adverse Reactions/etiology, Forecasting, Humans, Patient Safety
17.
J Biomed Inform ; 57: 333-49, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26291578

ABSTRACT

For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics - i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words - and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.


Subjects
Drug-Related Side Effects and Adverse Reactions , Electronic Health Records , Semantics , Data Curation , Data Mining , Humans
19.
AMIA Annu Symp Proc ; 2015: 1371-80, 2015.
Article in English | MEDLINE | ID: mdl-26958278

ABSTRACT

Using longitudinal data in electronic health records (EHRs) for post-marketing adverse drug event (ADE) detection allows for monitoring patients throughout their medical history. Machine learning methods have been shown to be efficient and effective in screening health records and detecting ADEs. How best to exploit historical data, as encoded by clinical events in EHRs, is, however, not very well understood. In this study, three strategies for handling temporality of clinical events are proposed and evaluated using an EHR database from Stockholm, Sweden. The random forest learning algorithm is applied to predict fourteen ADEs using clinical events collected from different lengths of patient history. The results show that, in general, including longer patient history leads to improved predictive performance, and that assigning weights to events according to time distance from the ADE yields the biggest improvement.
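The idea of weighting clinical events by their time distance from the ADE can be sketched as follows. The abstract does not state the weighting function, so the inverse-distance weight `1 / (1 + distance)`, the function name `weighted_event_features`, the `window_days` parameter, and the toy event codes are all assumptions for illustration only. The resulting feature dictionary is the kind of input one might pass to a random forest learner.

```python
from collections import defaultdict


def weighted_event_features(events, index_day, window_days=90):
    """Turn a patient history into time-weighted event features.

    events      -- list of (event_code, day) pairs
    index_day   -- day of the (suspected) ADE
    window_days -- how much history to include before index_day

    Assumption: events closer to index_day get a larger weight,
    here 1 / (1 + distance in days). This is a hypothetical choice.
    """
    features = defaultdict(float)
    for code, day in events:
        distance = index_day - day
        if 0 <= distance <= window_days:
            features[code] += 1.0 / (1.0 + distance)
    return dict(features)


# Toy, invented history: (event code, day of occurrence).
history = [("drug:A", 98), ("drug:A", 40), ("dx:B", 100), ("dx:C", 0)]
features = weighted_event_features(history, index_day=100, window_days=90)
```

Changing `window_days` corresponds to varying the length of patient history; setting every weight to 1 would give the unweighted count baseline.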


Subjects
Algorithms , Drug-Related Side Effects and Adverse Reactions , Electronic Health Records , Machine Learning , Databases, Factual , Humans , Product Surveillance, Postmarketing
20.
J Biomed Semantics ; 5(1): 6, 2014 Feb 05.
Article in English | MEDLINE | ID: mdl-24499679

ABSTRACT

BACKGROUND: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.
RESULTS: A combination of two distributional models - Random Indexing and Random Permutation - employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora - a corpus of clinical text and a corpus of medical journal articles - further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
CONCLUSIONS: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models - with different model parameters - and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
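The combination strategy mentioned in the results, summing the cosine similarity scores of candidate terms across semantic spaces, together with recall in a list of ten candidates, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the restriction to terms present in every space, and the two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, spaces, k=10):
    """Rank candidate terms by the sum of cosine similarities over all spaces.

    spaces -- list of dicts mapping term -> vector (one dict per semantic space)
    Assumption: only terms present in every space are considered.
    """
    if any(query not in space for space in spaces):
        return []
    shared = set.intersection(*(set(space) for space in spaces)) - {query}
    scores = {
        term: sum(cosine(space[query], space[term]) for space in spaces)
        for term in shared
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def recall_at_k(ranked, gold):
    """Fraction of gold-standard terms found among the ranked candidates."""
    return len(set(ranked) & set(gold)) / len(gold) if gold else 0.0


# Toy, invented semantic spaces standing in for a clinical and a journal corpus.
clinical_space = {"mi": [1.0, 0.0], "heart attack": [0.9, 0.1], "fracture": [0.0, 1.0]}
journal_space = {"mi": [1.0, 1.0], "heart attack": [1.0, 0.8], "fracture": [1.0, -1.0]}

ranked = rank_candidates("mi", [clinical_space, journal_space], k=10)
recall = recall_at_k(ranked, {"heart attack"})
```

Summing scores lets a candidate that is moderately similar in both spaces outrank one that is highly similar in only one, which is one plausible reason such a simple combination can work well.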
