A new approach for cleansing geographical dataset using Levenshtein distance, prior knowledge and contextual information.
Stud Health Technol Inform 2015; 210: 227-9.
Article in English | MEDLINE | ID: mdl-25991137
ABSTRACT
Epidemiological studies are necessary for public health decision-making, and their relevance depends on the quality of the underlying data. Doctors in continuous care collect large amounts of data that can be used for epidemiological purposes, but spatial data may be dirty: localization based on city names is imprecise, even more so if a name is misspelled. The only way to identify a city without ambiguity is to use its identifier, which can be retrieved by cleansing geographical textual data. In France, cities are organized into administrative zones called departments, and some city names are shared by several cities across several departments. Clearly identifying both the department and the city name makes it possible to deduce the city's unique identifier and to perform spatial analyses such as epidemiological studies. In this paper, we propose a multi-step method to cleanse such data. After standardizing the text to be cleansed, we use the Levenshtein distance to generate a first set of candidate cities. The candidates are then filtered by removing the less likely ones until only one remains, which becomes the chosen city. Tested on a dataset of 9818 entries, our method achieved 89.1% concordance, whereas the standard Levenshtein distance alone achieved 70.5%, demonstrating that our method performs better.
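The pipeline the abstract describes (standardize, generate Levenshtein candidates, filter by department) could be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the reference table, the standardization rules, and the function names are all hypothetical assumptions.

```python
from typing import Optional

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def standardize(name: str) -> str:
    """Minimal standardization step: uppercase and normalize separators."""
    return name.upper().replace("-", " ").strip()

# Hypothetical reference table: (standardized city name, department, unique id).
# In France the unique id would be the INSEE code; these rows are examples only.
CITIES = [
    ("SAINT DENIS", "93", "93066"),
    ("SAINT DENIS", "974", "97411"),
    ("SAINT GENIS", "01", "01354"),
]

def resolve(raw_name: str, department: str) -> Optional[str]:
    """Return the id of the closest-matching city within the given department."""
    query = standardize(raw_name)
    # Candidate generation by edit distance, filtered on prior knowledge
    # (the department), so homonymous cities elsewhere are excluded.
    candidates = [(levenshtein(query, city), ident)
                  for city, dept, ident in CITIES if dept == department]
    if not candidates:
        return None
    candidates.sort()
    return candidates[0][1]
```

For example, `resolve("saint-dénis".replace("é", "e"), "93")` disambiguates between the two homonymous Saint-Denis entries by department, and a misspelling such as `"saint denys"` is still matched to the nearest candidate by edit distance.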
Database: MEDLINE
Main subject: Natural Language Processing / Pattern Recognition, Automated / Databases, Factual / Topography, Medical / Data Accuracy / Machine Learning
Country as subject: Europe
Language: English
Publication year: 2015
Document type: Article