ABSTRACT
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
ABSTRACT
BACKGROUND: Primary care databases are a major source of data for epidemiological and health services research. However, most studies are based on coded information, ignoring information stored in free text. Using the early presentation of rheumatoid arthritis (RA) as an exemplar, our objective was to estimate the extent of data hidden within free text, using a keyword search. METHODS: We examined the electronic health records (EHRs) of 6,387 patients from the UK, aged 30 years and older, with a first coded diagnosis of RA between 2005 and 2008. We listed indicators for RA which were present in coded format and ran keyword searches for similar information held in free text. The frequency of indicator code groups and keywords from one year before to 14 days after RA diagnosis were compared, and temporal relationships examined. RESULTS: One or more keywords for RA were found in the free text in 29% of patients prior to the RA diagnostic code. Keywords for inflammatory arthritis diagnoses were present for 14% of patients whereas only 11% had a diagnostic code. Codes for synovitis were found in 3% of patients, but keywords were identified in an additional 17%. In 13% of patients there was evidence of a positive rheumatoid factor test in text only, uncoded. No gender differences were found. Keywords generally occurred close in time to the coded diagnosis of rheumatoid arthritis. They were often found under codes indicating letters and communications. CONCLUSIONS: Potential cases may be missed or wrongly dated when coded data alone are used to identify patients with RA, as diagnostic suspicions are frequently confined to text. The use of EHRs to create disease registers or assess quality of care will be misleading if free text information is not taken into account. Methods to facilitate the automated processing of text need to be developed and implemented.
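As an illustration only, the keyword search described in the METHODS above could be sketched as follows. The keyword patterns, the `(date, text)` record layout, and the function name `keywords_in_window` are assumptions made for this sketch; the study's actual keyword lists and data model are not given in the abstract.

```python
import re
from datetime import date, timedelta

# Hypothetical keyword groups; the study's real keyword lists are not shown here.
KEYWORDS = {
    "rheumatoid_arthritis": re.compile(r"\brheumatoid\b", re.IGNORECASE),
    "synovitis": re.compile(r"\bsynovitis\b", re.IGNORECASE),
}


def keywords_in_window(entries, diagnosis_date):
    """Return the keyword groups found in free-text entries dated from
    one year before to 14 days after the coded diagnosis date.

    `entries` is an iterable of (entry_date, free_text) pairs (assumed layout).
    """
    start = diagnosis_date - timedelta(days=365)
    end = diagnosis_date + timedelta(days=14)
    found = set()
    for entry_date, text in entries:
        if start <= entry_date <= end:
            for group, pattern in KEYWORDS.items():
                if pattern.search(text):
                    found.add(group)
    return found


# Example: one entry inside the window, one well outside it.
example_entries = [
    (date(2006, 1, 10), "?rheumatoid arthritis, refer to rheumatology"),
    (date(2004, 1, 1), "synovitis left wrist"),
]
example_found = keywords_in_window(example_entries, date(2006, 3, 1))
```

A plain pattern match like this cannot tell a suspected or negated mention from a confirmed one, which is one reason the abstract calls for better automated text processing.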
Subjects
Arthritis, Rheumatoid/epidemiology; Databases, Factual; Electronic Health Records; Primary Health Care; Aged; Clinical Coding; Female; Humans; Incidence; Male; Middle Aged; Prevalence; United Kingdom/epidemiology
ABSTRACT
Electronic health records are increasingly used for research. The definition of cases or endpoints often relies on the use of coded diagnostic data, using a pre-selected group of codes. Validation of these cases, as 'true' cases of the disease, is crucial. There are, however, ambiguities in what is meant by validation in the context of electronic records. Validation usually implies comparison of a definition against a gold standard of diagnosis and the ability to identify false negatives ('true' cases which were not detected) as well as false positives (detected cases which did not have the condition). We argue that two separate concepts of validation are often conflated in existing studies. Firstly, whether the GP thought the patient was suffering from a particular condition (which we term confirmation or internal validation) and secondly, whether the patient really had the condition (external validation). Few studies have the ability to detect false negatives who have not received a diagnostic code. Natural language processing is likely to open up the use of free text within the electronic record which will facilitate both the validation of the coded diagnosis and searching for false negatives.
Subjects
Databases, Factual/standards; Electronic Health Records/standards; Forms and Records Control; Natural Language Processing; Disease/classification; Electronic Health Records/organization & administration; Validation Studies as Topic
ABSTRACT
OBJECTIVES: Much research with electronic health records (EHRs) uses coded or structured data only; important information captured in the free text remains unused. One dimension of EHR data quality assessment is 'currency' or timeliness, that is, data are representative of the patient state at the time of measurement. We explored the use of free text in UK general practice patient records to evaluate delays in recording of rheumatoid arthritis (RA) diagnosis. We also aimed to locate and quantify disease and diagnostic information recorded only in text. SETTING: UK general practice patient records from the Clinical Practice Research Datalink. PARTICIPANTS: 294 individuals with incident diagnosis of RA between 2005 and 2008; 204 women and 85 men, median age 63 years. PRIMARY AND SECONDARY OUTCOME MEASURES: Assessment of (1) quantity and timing of text entries for disease-modifying antirheumatic drugs (DMARDs) as a proxy for the RA disease code, and (2) quantity, location and timing of free text information relating to RA onset and diagnosis. RESULTS: Inflammatory markers, pain and DMARDs were the most common categories of disease information in text prior to RA diagnostic code; 10-37% of patients had such information only in text. Read codes associated with RA-related text included correspondence, general consultation and arthritis codes. 64 patients (22%) had DMARD text entries >14 days prior to RA code; these patients had more and earlier referrals to rheumatology, tests, swelling, pain and DMARD prescriptions, suggestive of an earlier implicit diagnosis than was recorded by the diagnostic code. CONCLUSIONS: RA-related symptoms, tests, referrals and prescriptions were recorded in free text with 22% of patients showing strong evidence of delay in coding of diagnosis. Researchers using EHRs may need to mitigate for delayed codes by incorporating text into their case-ascertainment strategies. Natural language processing techniques have the capability to do this at scale.
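The delay criterion reported above (a DMARD text entry more than 14 days before the RA code) can be expressed as a simple date comparison. The sketch below is illustrative only: the function names and the assumption that each patient has a single first DMARD text date and a single RA code date are hypothetical, not taken from the study.

```python
from datetime import date


def coding_delay_days(first_dmard_text_date, ra_code_date):
    """Days between the first DMARD mention in free text and the RA diagnostic code.

    Positive values mean the text entry came before the code.
    """
    return (ra_code_date - first_dmard_text_date).days


def has_delayed_code(first_dmard_text_date, ra_code_date, threshold_days=14):
    """True when the DMARD text entry precedes the RA code by more than
    `threshold_days` (14 days, following the threshold stated in the abstract)."""
    return coding_delay_days(first_dmard_text_date, ra_code_date) > threshold_days


# Example: text entry 15 days before the code counts as delayed; 14 days does not.
delayed = has_delayed_code(date(2006, 1, 1), date(2006, 1, 16))
not_delayed = has_delayed_code(date(2006, 1, 1), date(2006, 1, 15))
```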