RESUMO
Information retrieval algorithms based on natural language processing (NLP) of the free text of medical records have been used to find documents of interest from databases. Homelessness is a high priority non-medical diagnosis that is noted in electronic medical records of Veterans in Veterans Affairs (VA) facilities. Using a human-reviewed reference standard corpus of clinical documents of Veterans with evidence of homelessness and those without, an open-source NLP tool (Automated Retrieval Console v2.0, ARC) was trained to classify documents. The best performing model based on document level work-flow performed well on a test set (Precision 94%, Recall 97%, F-Measure 96). Processing of a naïve set of 10,000 randomly selected documents from the VA using this best performing model yielded 463 documents flagged as positive, indicating a 4.7% prevalence of homelessness. Human review noted a precision of 70% for these flags resulting in an adjusted prevalence of homelessness of 3.3% which matches current VA estimates. Further refinements are underway to improve the performance. We demonstrate an effective and rapid lifecycle of using an off-the-shelf NLP tool for screening targets of interest from medical records.