Results 1 - 20 of 21
1.
Stud Health Technol Inform ; 245: 501-505, 2017.
Article in English | MEDLINE | ID: mdl-29295145

ABSTRACT

Concept mapping is important in natural language processing (NLP) for bioinformatics. The UMLS Metathesaurus provides a rich synonym thesaurus and is a popular resource for concept mapping. Query expansion using synonyms for subterm substitutions is an effective technique to increase recall for UMLS concept mapping. Synonyms used to substitute subterms are called element synonyms. The completeness and quality of both the element synonyms and the UMLS synonym thesaurus are key to success in such applications. The Lexical Systems Group (LSG) has developed a new system for element synonym acquisition, based on enhanced requirements and a new design, for better performance. The results show: 1) a 36.71-fold growth of synonyms in the Lexicon (lexSynonym) in the 2017 release; 2) improved recall and F1 for concept mapping, with similar precision, when lexSynonym.2017 is used as the source of element synonyms, owing to its broader coverage and better quality.
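The subterm substitution described in this abstract can be illustrated with a minimal Python sketch. It is not the LSG implementation: the synonym table, its entries, and the function name `expand_query` are hypothetical, and only single-word subterms are handled.

```python
from itertools import product

# Hypothetical element-synonym table. Real entries would come from a resource
# such as lexSynonym, whose file format is not described in the abstract.
ELEMENT_SYNONYMS = {
    "kidney": ["renal"],
    "cancer": ["carcinoma", "tumor"],
}

def expand_query(term, synonyms=ELEMENT_SYNONYMS):
    """Return the term plus every variant made by swapping subterms for synonyms."""
    words = term.lower().split()
    # For each word, the choices are the word itself followed by its synonyms.
    choices = [[word] + synonyms.get(word, []) for word in words]
    # The Cartesian product enumerates every combination of choices.
    return [" ".join(combo) for combo in product(*choices)]

variants = expand_query("kidney cancer")
for variant in variants:
    print(variant)
```

Each variant would then be looked up in the concept thesaurus. More variants can raise recall, which is why the quality of the element synonyms matters for precision.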


Subject(s)
Natural Language Processing , Unified Medical Language System , Semantics , Vocabulary, Controlled
2.
AMIA Annu Symp Proc ; 2015: 707-16, 2015.
Article in English | MEDLINE | ID: mdl-26958206

ABSTRACT

The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) requires that clinical documents be stripped of personally identifying information before they can be released to researchers and others. We have been manually annotating clinical text since 2008 in order to test and evaluate an algorithmic clinical text de-identification tool, NLM Scrubber, which we have been developing in parallel. Although HIPAA provides some guidance about what must be de-identified, translating those guidelines into practice is not straightforward, especially when one deals with free text. As a result, we have changed our manual annotation labels and methods six times. This paper explains why we made those annotation choices, which have evolved over seven years of practice in this field. The aim of this paper is to start a community discussion towards developing standards for clinical text annotation, with the end goal of studying and comparing clinical text de-identification systems more accurately.


Subject(s)
Confidentiality , Data Anonymization , Electronic Health Records , Health Insurance Portability and Accountability Act , Algorithms , Confidentiality/legislation & jurisprudence , Data Anonymization/standards , Humans , Personally Identifiable Information , Privacy/legislation & jurisprudence , United States
3.
J Am Med Inform Assoc ; 21(3): 423-31, 2014.
Article in English | MEDLINE | ID: mdl-24026308

ABSTRACT

OBJECTIVE: To understand the factors that influence success in scrubbing personal names from narrative text. MATERIALS AND METHODS: We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports, hand-tagged words in a set of gold standard narrative reports as personal names or not, and measured the scrubbing success of NLM-NS and that of four other scrubbing/name recognition tools (MIST, MITdeid, LingPipe, and ANNIE/GATE) against the gold standard reports. We ran three comparisons which used increasingly large name lists. RESULTS: The test reports contained more than 1 million words, of which 2388 were patient and 20,160 were provider name tokens. NLM-NS failed to scrub only 2 of the 2388 instances of patient name tokens. Its sensitivity was 0.999 on both patient and provider name tokens, and it missed fewer instances of patient name tokens than the other scrubbers in all comparisons. MIST produced the best all-token specificity and F-measure for name instances in our most relevant study (study 2), with values of 0.997 and 0.938, respectively. In that same comparison, NLM-NS was second best, with values of 0.986 and 0.748, respectively, and MITdeid was a close third, with values of 0.985 and 0.796, respectively. With the addition of the Clinical Center name list to their native name lists, LingPipe, MITdeid, MIST, and ANNIE/GATE all improved substantially. MITdeid and LingPipe gained the most, reaching patient name sensitivity of 0.995 (F-measure=0.705) and 0.989 (F-measure=0.386), respectively. DISCUSSION: The privacy risk due to the two name tokens missed by NLM-NS was statistically negligible, since neither individual could be distinguished among more than 150,000 people listed in the US Social Security Registry. CONCLUSIONS: The nature and size of name lists have substantial influences on scrubbing success. The use of very large name lists with frequency statistics accounts for much of the scrubbing success of NLM-NS.
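For readers unfamiliar with the metrics reported here, the following Python sketch shows the standard definitions of sensitivity, specificity, and F-measure from token-level counts. The function names are illustrative. The only counts taken from the abstract are the 2 missed tokens out of 2388 patient name tokens; the abstract does not report the other counts, so the F-measure example uses made-up numbers.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real name tokens that were scrubbed (also called recall)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-name tokens that were correctly left in place."""
    return true_neg / (true_neg + false_pos)

def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# From the abstract: 2 of 2388 patient name tokens were missed.
patient_sensitivity = sensitivity(true_pos=2388 - 2, false_neg=2)
print(round(patient_sensitivity, 3))

# Made-up counts, for illustration only.
print(f_measure(true_pos=8, false_pos=2, false_neg=2))
```

A tool can combine very high sensitivity with a modest F-measure when it over-scrubs, because false positives lower precision without affecting sensitivity.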


Subject(s)
Confidentiality , Electronic Health Records , Names , Natural Language Processing , Humans , National Library of Medicine (U.S.) , United States
4.
AMIA Annu Symp Proc ; 2014: 353-8, 2014.
Article in English | MEDLINE | ID: mdl-25954338

ABSTRACT

We created a Gold Standard corpus comprising over 20,000 records of annotated narrative clinical reports for use in the training and evaluation of NLM Scrubber, a de-identification software system for medical records. Our experience with designing the corpus demonstrated the conceptual complexity of the task.


Subject(s)
Confidentiality , Electronic Health Records , Software , Health Insurance Portability and Accountability Act , Humans , United States
5.
AMIA Annu Symp Proc ; 2014: 767-76, 2014.
Article in English | MEDLINE | ID: mdl-25954383

ABSTRACT

INTRODUCTION: The Privacy Rule of the Health Insurance Portability and Accountability Act requires that clinical documents be stripped of personally identifying information before they can be released to researchers and others. We have been developing a software application, NLM Scrubber, to de-identify narrative clinical reports. METHODS: We compared NLM Scrubber with MIT's and MITRE's de-identification systems on 3,093 clinical reports about 1,636 patients. The performance of each system was analyzed on address, date, and alphanumeric identifier recognition separately. Their overall performance on de-identification and on conservation of the remaining clinical text was analyzed as well. RESULTS: NLM Scrubber's sensitivity on de-identifying these identifiers was 99%. Its specificity on conserving the text with no personal identifiers was 99% as well. CONCLUSION: The current version of the system recognizes and redacts patient names, alphanumeric identifiers, addresses, and dates. We plan to make the system available prior to the AMIA Annual Symposium in 2014.


Subject(s)
Confidentiality , Electronic Health Records , Software , Computer Security , Health Insurance Portability and Accountability Act , United States
6.
AMIA Annu Symp Proc ; : 1030, 2008 Nov 06.
Article in English | MEDLINE | ID: mdl-18998786

ABSTRACT

Journal Descriptor Indexing (JDI) is a vector-based text classification system developed at NLM (National Library of Medicine), originally in Lisp and now as a Java tool. Consequently, a testing suite was developed to verify the training set data and the results of the JDI tool. A methodology was developed and implemented to compare two sets of JD vectors, resulting in a single index (from 0 to 1) measuring their similarity. This methodology is fast, effective, and accurate.
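The abstract does not say how the similarity index is computed. One common way to reduce two sets of non-negative score vectors to a single value between 0 and 1 is to average per-key cosine similarity, sketched below in Python. This is an assumed stand-in, not the JDI methodology; the data layout (word mapped to a dictionary of descriptor scores) and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {descriptor: score}."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty or all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def set_similarity(set_a, set_b):
    """Average cosine over all keys; a key missing from one set scores 0."""
    keys = set_a.keys() | set_b.keys()
    if not keys:
        return 1.0  # two empty sets are treated as identical
    total = sum(cosine(set_a.get(k, {}), set_b.get(k, {})) for k in keys)
    return total / len(keys)

# Hypothetical JD vectors produced by two implementations.
lisp_run = {"heart": {"Cardiology": 0.9, "Surgery": 0.1}}
java_run = {"heart": {"Cardiology": 0.9, "Surgery": 0.1}}
print(set_similarity(lisp_run, java_run))
```

With non-negative scores, the cosine stays within 0 and 1, so the averaged index does too.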


Subject(s)
Artificial Intelligence , Information Storage and Retrieval/methods , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Vocabulary, Controlled , Algorithms , United States
7.
AMIA Annu Symp Proc ; : 1031, 2008 Nov 06.
Article in English | MEDLINE | ID: mdl-18998787

ABSTRACT

Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. It is widely used in multilingual NLP (natural language processing) projects. On the other hand, some NLP projects still deal only with ASCII characters. This paper describes methods of utilizing lexical tools to convert Unicode characters (UTF-8) to ASCII (7-bit) characters.
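A minimal Python sketch of Unicode-to-ASCII conversion follows. It uses the standard library's `unicodedata` module, not the NLM lexical tools, whose actual methods are not detailed in the abstract. The fallback table and the `to_ascii` name are assumptions made for illustration.

```python
import unicodedata

# Hypothetical hand-made mappings for characters that have no Unicode
# decomposition to ASCII.
FALLBACK = {
    "\u00df": "ss",  # sharp s
    "\u00e6": "ae",  # ae ligature
    "\u2013": "-",   # en dash
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def to_ascii(text, unknown="?"):
    """Convert text to 7-bit ASCII, one character at a time."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII
        elif ch in FALLBACK:
            out.append(FALLBACK[ch])  # table lookup
        else:
            # Compatibility decomposition splits accents and ligatures apart;
            # dropping the combining marks leaves the base characters.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            if base and all(ord(c) < 128 for c in base):
                out.append(base)
            else:
                out.append(unknown)  # no ASCII equivalent found
    return "".join(out)

print(to_ascii("na\u00efve caf\u00e9"))
```

Characters with no ASCII counterpart, such as Greek letters, fall through to the placeholder here. A production converter would need a policy for them, for example spelling out the letter name.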


Subject(s)
Artificial Intelligence , Information Storage and Retrieval/methods , Natural Language Processing , Pattern Recognition, Automated/methods , Terminology as Topic , Vocabulary, Controlled , Algorithms , United States
8.
AMIA Annu Symp Proc ; : 956, 2008 Nov 06.
Article in English | MEDLINE | ID: mdl-18999120

ABSTRACT

Certain texts, such as clinical reports and clinical trial records, are written by professionals for professionals while being increasingly accessed by lay people. To improve the comprehensibility of such documents to the lay audience, we conducted a pilot study to analyze terms used primarily by health professionals, and explore ways to make them more comprehensible to lay people.


Subject(s)
Comprehension , Dictionaries, Medical as Topic , Information Dissemination/methods , Physician-Patient Relations , Terminology as Topic , Boston , Consumer Health Information , Public Opinion
9.
J Am Med Inform Assoc ; 15(4): 496-505, 2008.
Article in English | MEDLINE | ID: mdl-18436906

ABSTRACT

OBJECTIVE: This study has two objectives: first, to identify and characterize consumer health terms not found in the Unified Medical Language System (UMLS) Metathesaurus (2007 AB); second, to describe the procedure for creating new concepts in the process of building a consumer health vocabulary. How do the unmapped consumer health concepts relate to the existing UMLS concepts? What is the place of these new concepts in professional medical discourse? DESIGN: The consumer health terms were extracted from two large corpora derived in the process of Open Access Collaboratory Consumer Health Vocabulary (OAC CHV) building. Terms that could not be mapped to existing UMLS concepts via machine and manual methods prompted creation of new concepts, which were then ascribed semantic types, related to existing UMLS concepts, and coded according to specified criteria. RESULTS: This approach identified 64 unmapped concepts, 17 of which were labeled as uniquely "lay" and not feasible for inclusion in professional health terminologies. The remaining terms constituted potential candidates for inclusion in professional vocabularies, or could be constructed by post-coordinating existing UMLS terms. The relationship between new and existing concepts differed depending on the corpora from which they were extracted. CONCLUSION: Non-mapping concepts constitute a small proportion of consumer health terms, but a proportion that is likely to affect the process of consumer health vocabulary building. We have identified a novel approach for identifying such concepts.


Subject(s)
Consumer Health Information/classification , Unified Medical Language System , Vocabulary , Humans , Terminology as Topic
10.
J Am Med Inform Assoc ; 15(4): 484-95, 2008.
Article in English | MEDLINE | ID: mdl-18436912

ABSTRACT

OBJECTIVE: Despite the proliferation of consumer health sites, lay individuals often experience difficulty finding health information online. The present study attempts to understand users' information seeking difficulties by drawing on a hypothesis testing explanatory framework. It also addresses the role of user competencies and their interaction with internet resources. DESIGN: Twenty participants were interviewed about their understanding of a hypothetical scenario about a family member suffering from stable angina and then searched MedlinePlus consumer health information portal for information on the problem presented in the scenario. Participants' understanding of heart disease was analyzed via semantic analysis. Thematic coding was used to describe information seeking trajectories in terms of three key strategies: verification of the primary hypothesis, narrowing search within the general hypothesis area and bottom-up search. RESULTS: Compared to an expert model, participants' understanding of heart disease involved different key concepts, which were also differently grouped and defined. This understanding provided the framework for search-guiding hypotheses and results interpretation. Incorrect or imprecise domain knowledge led individuals to search for information on irrelevant sites, often seeking out data to confirm their incorrect initial hypotheses. Online search skills enhanced search efficiency, but did not eliminate these difficulties. CONCLUSIONS: Regardless of their web experience and general search skills, lay individuals may experience difficulty with health information searches. These difficulties may be related to formulating and evaluating hypotheses that are rooted in their domain knowledge. Informatics can provide support at the levels of health information portals, individual websites, and consumer education tools.


Subject(s)
Consumer Health Information , Information Storage and Retrieval/methods , MedlinePlus , Angina Pectoris , Humans , Information Theory , Internet , User-Computer Interface
11.
Stud Health Technol Inform ; 129(Pt 1): 545-9, 2007.
Article in English | MEDLINE | ID: mdl-17911776

ABSTRACT

The National Library of Medicine has developed a tool to identify medical concepts from the Unified Medical Language System in free text. This tool, MetaMap (and its Java version, MMTx), has been used extensively for biomedical text mining applications. We have developed a module for MetaMap which has high performance in terms of processing speed. We evaluated our module independently against MetaMap for the task of identifying UMLS concepts in free-text clinical radiology reports. A set of 1000 sentences from neuro-radiology reports was collected and processed using our technique and the MMTx program. An evaluation showed that our technique was able to identify 91% of the concepts found by MMTx in 14% of the time taken by MMTx. An error analysis showed that the missing concepts were largely those which were not direct lexical matches but inferential matches of multiple concepts. Our method also identified multi-phrase concepts which MMTx failed to identify. We suggest that this module be implemented as an option in MMTx for real-time text mining applications where single concepts found in the UMLS need to be identified.


Subject(s)
Information Storage and Retrieval/methods , Natural Language Processing , Unified Medical Language System , Humans , Medical Records Systems, Computerized , Neurology , Radiology Department, Hospital , Radiology Information Systems
12.
J Med Internet Res ; 9(1): e4, 2007 Feb 28.
Article in English | MEDLINE | ID: mdl-17478413

ABSTRACT

BACKGROUND: The development of consumer health information applications such as health education websites has motivated research on consumer health vocabulary (CHV). Term identification is a critical task in vocabulary development. Because of the heterogeneity and ambiguity of consumer expressions, term identification for CHV is more challenging than for professional health vocabularies. OBJECTIVE: For the development of a CHV, we explored several term identification methods, including collaborative human review and automated term recognition methods. METHODS: A set of criteria was established to ensure consistency in the collaborative review, which analyzed 1893 strings. Using the results from the human review, we tested two automated methods: the C-value formula and a logistic regression model. RESULTS: The study identified 753 consumer terms and found the logistic regression model to be highly effective for CHV term identification (area under the receiver operating characteristic curve = 95.5%). CONCLUSIONS: The collaborative human review and logistic regression methods were effective for identifying terms for CHV development.
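The C-value formula mentioned in this abstract is a published termhood measure (Frantzi, Ananiadou, and Mima). The Python sketch below is a simplified version for illustration and may differ from the variant used in the study. The candidate frequencies are made up, and the function names are hypothetical.

```python
import math

def _contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    if len(longer) <= n:
        return False
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(term, freq):
    """Simplified C-value for `term`, given {candidate string: frequency}."""
    words = term.split()
    # Longer candidate strings in which this term appears nested.
    containers = [t for t in freq if t != term and _contains(t.split(), words)]
    adjusted = freq[term]
    if containers:
        # Discount occurrences explained by the longer candidates.
        adjusted -= sum(freq[t] for t in containers) / len(containers)
    # log2 of the length in words favors longer terms. Single-word terms
    # score 0 under this form, so some variants use log2(length + 1).
    return math.log2(len(words)) * adjusted

# Made-up frequencies, for illustration only.
freq = {
    "heart attack": 10,
    "massive heart attack": 4,
    "heart attack symptoms": 2,
}
print(c_value("heart attack", freq))
```

In this example, "heart attack" keeps a high score because most of its occurrences are not inside the two longer candidates.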


Subject(s)
Health Education/methods , Vocabulary, Controlled , Automation/methods , Cooperative Behavior , Humans , Logistic Models , Models, Theoretical , ROC Curve
13.
AMIA Annu Symp Proc ; : 721-5, 2007 Oct 11.
Article in English | MEDLINE | ID: mdl-18693931

ABSTRACT

The UMLS Knowledge Source Server (UMLSKS), developed at the National Library of Medicine (NLM), makes the knowledge sources of the Unified Medical Language System (UMLS) available to the research community over the Internet. In 2003, the UMLSKS was redesigned utilizing state-of-the-art technologies available at that time. That design offered a significant improvement over the prior version but presented a set of technology-dependent issues that limited its functionality and usability. Four areas of desired improvement were identified: software interfaces, web interface content, system maintenance/deployment, and user authentication. By employing next generation web technologies, newer authentication paradigms and further refinements in modular design methods, these areas could be addressed and corrected to meet the ever increasing needs of UMLSKS developers. In this paper we detail the issues present with the existing system and describe the new system's design using new technologies considered entrants in the Web 2.0 development era.


Subject(s)
Software Design , Unified Medical Language System , Internet , Knowledge Bases , Software/trends
14.
AMIA Annu Symp Proc ; : 941, 2007 Oct 11.
Article in English | MEDLINE | ID: mdl-18694041

ABSTRACT

We are developing a freely available Spanish medical syntactic lexicon, initially populated with medical terms from a bilingual list, and then from corpus-based term discovery. The lexical records are a simplification of the SPECIALIST English lexicon. Lexical variant generation and normalization tools will be provided along with the lexicon.


Subject(s)
Terminology as Topic , Language
15.
AMIA Annu Symp Proc ; : 200-3, 2006.
Article in English | MEDLINE | ID: mdl-17238331

ABSTRACT

The Lexical Systems Group at the National Library of Medicine (NLM) has developed a Part-of-Speech (POS) tagger, dTagger, to be freely distributed with the SPECIALIST NLP Tools. dTagger is specifically designed for use with the SPECIALIST lexicon, but it can be used with an arbitrary tag set. It is capable of single or multi-word chunking. It is trainable with previously annotated text, and a version that is tunable with untagged text is in development. The tagger allows users to add local lexicon content. It can report likelihoods for each sentence tagged. New words seen while tagging (the unknowns) are handled by shape identification, including heuristics based on suffix statistics gleaned during training. The performance of the supervised training is noted to be 95% on a modified version of the MedPost hand-annotated Medline abstracts. Eight percent of the terms within this corpus were multi-word entities.


Subject(s)
Abstracting and Indexing , Linguistics , Natural Language Processing , Vocabulary, Controlled , Algorithms , MEDLINE , Markov Chains , Software
16.
AMIA Annu Symp Proc ; : 960, 2006.
Article in English | MEDLINE | ID: mdl-17238579

ABSTRACT

A JDI (Journal Descriptor Indexing) tool has been developed at NLM that automatically categorizes biomedical text given as input, returning a ranked list, with scores between 0 and 1, of either JDs (Journal Descriptors, corresponding to biomedical disciplines) or STs (UMLS Semantic Types). Possible applications include WSD (Word Sense Disambiguation) and retrieval according to discipline. The Lexical Systems Group plans to distribute an open source Java version of this tool.


Subject(s)
Abstracting and Indexing/methods , Natural Language Processing , Medical Subject Headings , Periodicals as Topic , Semantics , Unified Medical Language System
17.
AMIA Annu Symp Proc ; : 1155, 2006.
Article in English | MEDLINE | ID: mdl-17238774

ABSTRACT

The Consumer Health Vocabulary Initiative (http://consumerhealthvocab.org/) is a multi-disciplinary effort to facilitate the research and development of consumer health vocabularies (CHVs). We are currently investigating different types of lexical forms used in lay expressions (i.e., words and phrases): consumer-friendly display forms, consumer forms that have different semantics in professional and lay contexts, and consumer forms not covered by professional health terminologies. The next step will address lay and professional conceptual differences in second-generation CHVs.


Subject(s)
Medical Informatics Applications , Vocabulary , Delivery of Health Care , Humans
18.
AMIA Annu Symp Proc ; : 859-63, 2005.
Article in English | MEDLINE | ID: mdl-16779162

ABSTRACT

We have developed a systematic methodology using corpus-based text analysis followed by human review to assign "consumer-friendly display (CFD) names" to medical concepts from the National Library of Medicine (NLM) Unified Medical Language System (UMLS) Metathesaurus. Using NLM MedlinePlus queries as a corpus of consumer expressions and a collaborative Web-based tool to facilitate review, we analyzed 425 frequently occurring concepts. As a preliminary test of our method, we evaluated 34 analyzed concepts and their CFD names, using a questionnaire modeled on standard reading assessments. The initial results that consumers (n=10) are more likely to understand and recognize CFD names than alternate labels suggest that the approach is useful in the development of consumer health vocabularies for displaying understandable health information.


Subject(s)
Patients , Unified Medical Language System , Vocabulary , Community Participation , Humans
19.
AMIA Annu Symp Proc ; : 564-8, 2003.
Article in English | MEDLINE | ID: mdl-14728236

ABSTRACT

The U.S. National Institutes of Health, through its National Library of Medicine, developed ClinicalTrials.gov to provide the public with easy access to information on clinical trials on a wide range of conditions or diseases. Only English language information retrieval is currently supported. Given the growing number of Spanish speakers in the U.S. and their increasing use of the Web, we anticipate a significant increase in Spanish-speaking users. This study compares the effectiveness of two common cross-language information retrieval methods using machine translation, query translation versus document translation, using a subset of genuine user queries from ClinicalTrials.gov. Preliminary experiments conducted with the ClinicalTrials.gov search engine show that, in our environment, query translation is statistically significantly better than document translation. We discuss possible reasons for this result and conclude with suggestions for future work.


Subject(s)
Clinical Trials as Topic , Databases as Topic , Information Storage and Retrieval/methods , Translating , Humans , Language , National Library of Medicine (U.S.) , Software , Spain , Unified Medical Language System , United States
20.
AMIA Annu Symp Proc ; : 798, 2003.
Article in English | MEDLINE | ID: mdl-14728303

ABSTRACT

A variety of resources developed for use with the Unified Medical Language System are presented. These resources include the UMLS Knowledge Source Server, the SPECIALIST lexicon, a set of lexical tools that work with the SPECIALIST lexicon, and a variety of other NLP document processing tools. These tools manage lexical variation, tokenize and parse text strings, suggest spelling variants, and provide text-to-concept mapping capabilities. The UMLS Knowledge Source Server is available under a license agreement. The other tools are freely downloadable.


Subject(s)
Natural Language Processing , Unified Medical Language System , Abstracting and Indexing