An unsupervised machine learning approach to segmentation of clinician-entered free text.
AMIA Annu Symp Proc. 2007 Oct 11: 811-5.
Article in English | MEDLINE | ID: mdl-18693949
Natural language processing, an important tool in biomedicine, fails without successful segmentation of words and sentences. Tokenization is a form of segmentation that identifies boundaries separating semantic units, for example words, dates, numbers, and symbols, within a text. We sought to construct a highly generalizable tokenization algorithm with no prior knowledge of characters or their function, based solely on the inherent statistical properties of token and sentence boundaries. Tokenizing clinician-entered free text, we achieved precision and recall of 92% and 93%, respectively, compared to a whitespace token boundary detection algorithm. We classified over 80% of punctuation characters correctly, based on manual disambiguation with high inter-rater agreement (kappa = 0.916). Our algorithm effectively discovered properties of whitespace and punctuation in the corpus without prior knowledge of either. Given the dynamic nature of biomedical language and the variety of distinct sublanguages in use, the effectiveness and generalizability of our novel tokenization algorithm make it a valuable tool.
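The abstract does not describe the algorithm's internals, so the sketch below is only a minimal illustration of the general idea of discovering token boundaries from corpus statistics alone, not the authors' method. It scores each inter-character position by the pointwise mutual information (PMI) of the adjacent character pair and splits where the pair co-occurs more rarely than chance; the corpus, the PMI criterion, the threshold, and the function names (train_counts, tokenize) are all illustrative assumptions.

```python
# Hypothetical sketch: unsupervised boundary detection from character
# co-occurrence statistics, with no prior knowledge of whitespace or
# punctuation. Not the published algorithm.
import math
from collections import Counter

def train_counts(corpus):
    """Count character unigram and adjacent-bigram frequencies."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    return unigrams, bigrams

def tokenize(text, unigrams, bigrams, threshold=0.0):
    """Split at positions where the adjacent character pair has low PMI,
    i.e. the two characters co-occur less often than chance predicts."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values()) or 1
    tokens, start = [], 0
    for i, (a, b) in enumerate(zip(text, text[1:]), start=1):
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # Unseen pairs get -inf, so they are always treated as boundaries.
        pmi = math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
        if pmi < threshold:  # rare transition -> candidate token boundary
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    return tokens

# Toy clinician-note-like corpus, purely for demonstration.
corpus = ["pt seen 2007-10-11, bp 120/80; follow up in 2 wks."] * 50
uni, bi = train_counts(corpus)
print(tokenize("bp 120/80; follow up", uni, bi))
```

On this construction, whitespace and punctuation tend to emerge as boundary characters because they pair with many different neighbors, which depresses pairwise PMI; that mirrors, at a toy scale, the paper's claim that such properties can be discovered rather than hard-coded.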
Full text: 1
Database: MEDLINE
Main subject: Algorithms / Natural Language Processing / Medical Records
Study type: Evaluation_studies / Guideline / Prognostic_studies
Language: English
Publication year: 2007
Document type: Article