Pesquisa | Portal Regional da BVS

A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records.

Catelli, Rosario; Gargiulo, Francesco; Casola, Valentina; De Pietro, Giuseppe; Fujita, Hamido; Esposito, Massimo.

IEEE Access ; 9: 19097-19110, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34786303

RESUMO

In the last years, the need to de-identify privacy-sensitive information within Electronic Health Records (EHRs) has become increasingly felt and extremely relevant to encourage the sharing and publication of their content in accordance with the restrictions imposed by both national and supranational privacy authorities. In the field of Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to face this issue, significantly improving the effectiveness in identifying sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To this aim, a new de-identification data set in Italian has been developed in this work, starting from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, the remaining 50 were used for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As additional contribution, combined with the best performing Bi-LSTM + CRF sequence labeling architecture, a stacked word representation form, not yet experimented for the Italian clinical de-identification scenario, has been tested, based both on a contextualized linguistic model to manage word polysemy and its morpho-syntactic variations and on sub-word embeddings to better capture latent syntactic and semantic similarities. Finally, other cutting-edge approaches were compared with the proposed model, which achieved the best performance highlighting the goodness of the promoted approach.

Crosslingual named entity recognition for clinical de-identification applied to a COVID-19 Italian data set.

Catelli, Rosario; Gargiulo, Francesco; Casola, Valentina; De Pietro, Giuseppe; Fujita, Hamido; Esposito, Massimo.

Appl Soft Comput ; 97: 106779, 2020 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-33052197

RESUMO

The COrona VIrus Disease 19 (COVID-19) pandemic required the work of all global experts to tackle it. Despite the abundance of new studies, privacy laws prevent their dissemination for medical investigations: through clinical de-identification, the Protected Health Information (PHI) contained therein can be anonymized so that medical records can be shared and published. The automation of clinical de-identification through deep learning techniques has proven to be less effective for languages other than English due to the scarcity of data sets. Hence a new Italian de-identification data set has been created from the COVID-19 clinical records made available by the Italian Society of Radiology (SIRM). Therefore, two multi-lingual deep learning systems have been developed for this low-resource language scenario: the objective is to investigate their ability to transfer knowledge between different languages while maintaining the necessary features to correctly perform the Named Entity Recognition task for de-identification. The systems were trained using four different strategies, using both the English Informatics for Integrating Biology & the Bedside (i2b2) 2014 and the new Italian SIRM COVID-19 data sets, then evaluated on the latter. These approaches have demonstrated the effectiveness of cross-lingual transfer learning to de-identify medical records written in a low resource language such as Italian, using one with high resources such as English.

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA