1.
J Biomed Inform ; 157: 104707, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39142598

ABSTRACT

OBJECTIVE: Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) to supplant traditional systems poses questions of the quality and extent of the medical knowledge in the models' internal knowledge representations and the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing the UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlations between human judgments and the UMLS-based metrics for generations by LLMs. METHODS: We evaluated diagnoses generated by LLMs from consumer health questions and daily care notes in the electronic health records using the ConsumerQA and Problem Summarization datasets. Probing LLMs for the UMLS knowledge was performed by prompting the LLM to complete the diagnosis-related UMLS knowledge paths. Grounding the predictions was examined in an approach that integrated the UMLS graph paths and clinical notes in prompting the LLMs. The results were compared to prompting without the UMLS paths. The final experiments examined the alignment of different evaluation metrics, UMLS-based and non-UMLS, with human expert evaluation. RESULTS: In probing the UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. Grounding diagnosis predictions with the UMLS paths improved the results for both models on both tasks, with the highest improvement (4%) in SapBERT score. There was a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments. 
CONCLUSION: We found that while popular LLMs contain some medical knowledge in their internal representations, augmentation with the UMLS knowledge provides performance gains around diagnosis generation. The UMLS needs to be tailored for the task to improve the LLMs predictions. Finding evaluation metrics that are aligned with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.
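
The F1 score reported for completing one-hop UMLS paths can be illustrated with a small sketch. The abstract does not say how the score was computed, so this assumes a set-based comparison between the concepts a model proposes and the concepts actually linked to the source concept; the function name and the neighbour strings are hypothetical and are not real UMLS data.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 between predicted and gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical one-hop neighbours of a concept (illustrative strings only).
gold_neighbours = {"fever", "cough", "dyspnea", "chest pain"}
model_neighbours = {"fever", "cough", "headache"}

precision, recall, f1 = precision_recall_f1(model_neighbours, gold_neighbours)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A set-based score of this kind rewards only exact matches, which is one possible reason such metrics can diverge from human judgments of a generated diagnosis.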


Subjects
Electronic Health Records, Unified Medical Language System, Humans, Machine Learning, Natural Language Processing, Clinical Decision Support Systems, Computer-Assisted Diagnosis/methods
2.
BMC Med Inform Decis Mak ; 23(Suppl 4): 299, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326827

ABSTRACT

BACKGROUND: In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research, and terminology mapping is a necessary step to harmonize heterogeneous data. For example, the mapping between the Medical Dictionary for Regulatory Activities (MedDRA) and the International Classification of Diseases (ICD) is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS: The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. Of the 2400 sampled unmapped MedDRA PTs, 56 could have exact match terms from ICD. CONCLUSION: Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PT terms, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.


Subjects
Adverse Drug Reaction Reporting Systems, International Classification of Diseases, Humans, Unified Medical Language System, Pharmacovigilance, Algorithms
3.
J Med Internet Res ; 24(11): e40361, 2022 Nov 25.
Article in English | MEDLINE | ID: mdl-36427233

ABSTRACT

BACKGROUND: Electronic medical records (EMRs) of patients with lung cancer (LC) capture a variety of health factors. Understanding the distribution of these factors will help identify key factors for risk prediction in preventive screening for LC. OBJECTIVE: We aimed to generate an integrated biomedical graph from EMR data and Unified Medical Language System (UMLS) ontology for LC, and to generate an LC health factor distribution from a hospital EMR of approximately 1 million patients. METHODS: The data were collected from 2 sets of 1397 patients with and those without LC. A patient-centered health factor graph was plotted with 108,000 standardized data, and a graph database was generated to integrate the graphs of patient health factors and the UMLS ontology. With the patient graph, we calculated the connection delta ratio (CDR) for each of the health factors to measure the relative strength of the factor's relationship to LC. RESULTS: The patient graph had 93,000 relations between the 2794 patient nodes and 650 factor nodes. An LC graph with 187 related biomedical concepts and 188 horizontal biomedical relations was plotted and linked to the patient graph. Searching the integrated biomedical graph with any number or category of health factors resulted in graphical representations of relationships between patients and factors, while searches using any patient presented the patient's health factors from the EMR and the LC knowledge graph (KG) from the UMLS in the same graph. Sorting the health factors by CDR in descending order generated a distribution of health factors for LC. The top 70 CDR-ranked factors of disease, symptom, medical history, observation, and laboratory test categories were verified to be concordant with those found in the literature. 
CONCLUSIONS: By collecting standardized data of thousands of patients with and those without LC from the EMR, it was possible to generate a hospital-wide patient-centered health factor graph for graph search and presentation. The patient graph could be integrated with the UMLS KG for LC and thus enable hospitals to bring continuously updated international standard biomedical KGs from the UMLS for clinical use in hospitals. CDR analysis of the graph of patients with LC generated a CDR-sorted distribution of health factors, in which the top CDR-ranked health factors were concordant with the literature. The resulting distribution of LC health factors can be used to help personalize risk evaluation and preventive screening recommendations.


Subjects
Electronic Health Records, Lung Neoplasms, Humans, Retrospective Studies, Unified Medical Language System, Lung Neoplasms/epidemiology, Hospitals
4.
Inf Serv Use ; 42(1): 95-106, 2022.
Article in English | MEDLINE | ID: mdl-35600122

ABSTRACT

Donald A.B. Lindberg M.D. arrived at the U.S. National Library of Medicine in 1984 and quickly launched the Unified Medical Language System (UMLS) research and development project to help computers understand biomedical meaning and to enable retrieval and integration of information from disparate electronic sources, e.g., patient records, biomedical literature, knowledge bases. This chapter focuses on how Lindberg's thinking, preferred ways of working, and decision-making guided UMLS goals and development and on what made the UMLS markedly "new and different" and ahead of its time.

5.
J Surg Res ; 268: 552-561, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34464893

ABSTRACT

BACKGROUND: The Unified Medical Language System (UMLS) maps relationships between and within >100 biomedical vocabularies, including Current Procedural Terminology (CPT) codes, creating a powerful knowledge resource that can accelerate clinical research. METHODS: We used the synonymy and concepts relating the hierarchical structure of CPT codes within the UMLS to (1) guide surgical experts in expanding the Operative Stress Score (OSS) from 565 originally rated CPT codes to an additional 1,853 related procedures and (2) establish the validity of the association between the added OSS ratings and 30-day outcomes in VASQIP (2015-2018). RESULTS: The UMLS Metathesaurus and Semantic Network were converted into an interactive graph database (https://github.com/dbmi-pitt/UMLS-Graph) delineating ontology relatedness. From this UMLS-graph, the CPT hierarchy was queried, obtaining all paths from each code to the hierarchical apex. Of 1,853 added ratings, 43% and 76% were siblings and cousins of original OSS CPT codes. Of 857,577 VASQIP cases (mean age, 64 ± 11 years; 91% male; 75% white), 786,122 (92%) and 71,455 (8%) were rated in the original and added OSS. Compared to original, added OSS cases included more females (14% versus 9%) and frail patients (25% versus 19%) undergoing high stress procedures (11% versus 8%; all P <.001). Postoperative mortality consistently increased with OSS. Very low stress procedures had <0.5% (original, 0.4% [95% CI, 0.4%-0.5%] versus added, 0.9% [95% CI, 0.6%-1.2%]) and very high 3.8% (original, 3.5% [95% CI, 3.0%-4.0%] versus added, 5.8% [95% CI, 4.6%-7.3%]) mortality rates. CONCLUSIONS: The synonymy and concepts relating biomedical data within the UMLS can be abstracted and efficiently used to expand the utility of existing clinical research tools.
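
The idea of classifying codes as siblings or cousins from their paths to the hierarchical apex can be sketched as follows. This is not the authors' graph-database query: it assumes a single-parent hierarchy stored as a plain child-to-parent dictionary, defines siblings as codes sharing a parent and cousins as codes sharing a grandparent but not a parent, and uses made-up code names rather than real CPT codes.

```python
def path_to_apex(code, parent_of):
    """Return [code, parent, grandparent, ..., apex] for a single-parent hierarchy."""
    path = [code]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def relatedness(code_a, code_b, parent_of):
    """Classify two codes as 'sibling', 'cousin' or 'other' (assumed definitions)."""
    path_a = path_to_apex(code_a, parent_of)
    path_b = path_to_apex(code_b, parent_of)
    if len(path_a) > 1 and len(path_b) > 1 and path_a[1] == path_b[1]:
        return "sibling"  # same parent
    if len(path_a) > 2 and len(path_b) > 2 and path_a[2] == path_b[2]:
        return "cousin"  # same grandparent, different parents
    return "other"


# Hypothetical toy hierarchy (child -> parent); names are placeholders.
toy_parent_of = {
    "A1": "A", "A2": "A", "B1": "B",
    "A": "SECTION", "B": "SECTION",
    "SECTION": "APEX",
}

print(relatedness("A1", "A2", toy_parent_of))
print(relatedness("A1", "B1", toy_parent_of))
print(relatedness("A1", "A", toy_parent_of))
```

In a real terminology a code may have several parents, in which case each code would have several paths to the apex and the comparison would need to run over all of them.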


Subjects
Abstracting and Indexing, Unified Medical Language System, Aged, Factual Databases, Female, Humans, Male, Middle Aged
6.
BMC Bioinformatics ; 20(Suppl 10): 251, 2019 May 29.
Article in English | MEDLINE | ID: mdl-31138105

ABSTRACT

BACKGROUND: The quantity of documents being published requires researchers to specialize to a narrower field, meaning that inferable connections between publications (particularly from different domains) can be missed. This has given rise to automatic literature based discovery (LBD). However, unless heavily filtered, LBD generates more potential new knowledge than can be manually verified and another form of selection is required before the results can be passed onto a user. Since a large proportion of the automatically generated hidden knowledge is valid but generally known, we investigate the hypothesis that non trivial, interesting, hidden knowledge can be treated as an anomaly and identified using anomaly detection approaches. RESULTS: Two experiments are conducted: (1) to avoid errors arising from incorrect extraction of relations, the hypothesis is validated using manually annotated relations appearing in a thesaurus, and (2) automatically extracted relations are used to investigate the hypothesis on publication abstracts. These allow an investigation of a potential upper bound and the detection of limitations yielded by automatic relation extraction. CONCLUSION: We apply one-class SVM and isolation forest anomaly detection algorithms to a set of hidden connections to rank connections by identifying outlying (interesting) ones and show that the approach increases the F1 measure by a factor of 10 while greatly reducing the quantity of hidden knowledge to manually verify. We also demonstrate the statistical significance of this result.


Subjects
Knowledge, Algorithms, Automation, Humans, Knowledge Discovery, Publications, Semantics
7.
BMC Bioinformatics ; 20(1): 653, 2019 Dec 11.
Article in English | MEDLINE | ID: mdl-31829175

ABSTRACT

BACKGROUND: Computational compound repositioning has the potential for identifying new uses for existing drugs, and new algorithms and data source aggregation strategies provide ever-improving results via in silico metrics. However, even with these advances, the number of compounds successfully repositioned via computational screening remains low. New strategies for algorithm evaluation that more accurately reflect the repositioning potential of a compound could provide a better target for future optimizations. RESULTS: Using a text-mined database, we applied a previously described network-based computational repositioning algorithm, yielding strong results via cross-validation, averaging 0.95 AUROC on test-set indications. However, to better approximate a real-world scenario, we built a time-resolved evaluation framework. At various time points, we built networks corresponding to prior knowledge for use as a training set, and then predicted on a test set comprised of indications that were subsequently described. This framework showed a marked reduction in performance, peaking in performance metrics with the 1985 network at an AUROC of .797. Examining performance reductions due to removal of specific types of relationships highlighted the importance of drug-drug and disease-disease similarity metrics. Using data from future timepoints, we demonstrate that further acquisition of these kinds of data may help improve computational results. CONCLUSIONS: Evaluating a repositioning algorithm using indications unknown to input network better tunes its ability to find emerging drug indications, rather than finding those which have been randomly withheld. Focusing efforts on improving algorithmic performance in a time-resolved paradigm may further improve computational repositioning predictions.
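
The difference between a time-resolved evaluation and random withholding comes down to how the test set is chosen. The sketch below assumes indications are available as (drug, disease, year first described) triples and simply splits them at a cutoff year; the function name, the data layout, and the example triples are hypothetical and do not come from the paper.

```python
def time_resolved_split(indications, cutoff_year):
    """Split (drug, disease, year) triples into prior knowledge and later-described pairs."""
    train = [(drug, disease) for drug, disease, year in indications if year <= cutoff_year]
    test = [(drug, disease) for drug, disease, year in indications if year > cutoff_year]
    return train, test


# Hypothetical indications with the year each was first described.
toy_indications = [
    ("drugA", "disease1", 1970),
    ("drugB", "disease2", 1985),
    ("drugA", "disease3", 1992),
    ("drugC", "disease1", 2001),
]

train_pairs, test_pairs = time_resolved_split(toy_indications, cutoff_year=1985)
print("train:", train_pairs)
print("test:", test_pairs)
```

Under this split the model never sees information from after the cutoff, whereas a random cross-validation fold can mix earlier and later knowledge in the training set.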


Subjects
Computational Biology/methods, Data Mining, Drug Repositioning, Knowledge Bases, Algorithms, Disease, Humans, Machine Learning, Reproducibility of Results, Time Factors
8.
Eur Arch Psychiatry Clin Neurosci ; 268(1): 99-104, 2018 Feb.
Article in English | MEDLINE | ID: mdl-28032254

ABSTRACT

Disorganized speech is related to functional abnormalities in schizophrenia. To test the association between formal thought disorders (FTDs) and white matter microstructure, we applied a behavioral rating and diffusion tensor imaging in 61 patients with schizophrenia spectrum disorders. The Bern Psychopathology Scale was used to rate the dimension of language abnormalities ranging from negative FTDs, basically unaltered speech, to positive FTDs. Tract-based spatial statistics indicated increased fractional anisotropy in left hemispheric pathways of the language system in patients with negative FTDs. Thus, altered white matter properties in relevant fiber tracts may represent vulnerability to specific formal thought disorders.


Subjects
Schizophrenia/complications, Speech Disorders/etiology, Statistics as Topic, White Matter/pathology, Adult, Analysis of Variance, Anisotropy, Diffusion Tensor Imaging, Female, Functional Laterality, Humans, Computer-Assisted Image Processing, Male, Middle Aged, Psychiatric Status Rating Scales, Speech Disorders/diagnostic imaging, White Matter/diagnostic imaging
9.
Anaesthesia ; 73(12): 1531-1534, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30412290

ABSTRACT

We examined the prevalence of novel acronyms in the titles of anaesthetic and related studies and the response of anaesthetists to them. We separately analysed trainee-led research projects in the UK supported by the Research and Audit Federation of Trainees (RAFT), and a 10-year cohort of papers identified using the PubMed literature search tool. We also conducted a survey of 20 anaesthetists within our institution regarding the utility and impact of titles containing acronyms, and their recall of the associated topics. Finally, we developed a scoring system for acronym accuracy and complexity, the ORigin of AcroNym letterinG Used Term AppropriateNess (ORANGUTAN) score, and measured the progression of acronym usage over the 10-year period studied. Our results show that while acronyms themselves are sometimes considered memorable, they do not aid recall of topics and are, in general, not considered helpful. There has been an increase in the prevalence of acronymic titles over 10 years, and in the complexity of acronyms used, suggesting that there is currently a selective pressure favouring the use of acronyms even if they are of limited benefit.


Subjects
Abbreviations as Topic, Anesthesiology/methods, Anesthetics/pharmacology, Anesthesiology/education, Cohort Studies, Humans, Internship and Residency/methods, Mental Recall, Reproducibility of Results
10.
J Med Internet Res ; 20(10): e274, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30322834

ABSTRACT

BACKGROUND: Medical coding is essential for standardized communication and integration of clinical data. The Unified Medical Language System by the National Library of Medicine is the largest clinical terminology system for medical coders and Natural Language Processing tools. However, the abundance of ambiguous codes leads to low rates of uniform coding among different coders. OBJECTIVE: The objective of our study was to measure uniform coding among different medical experts in terms of interrater reliability and to analyze the effect on interrater reliability of using an expert- and Web-based code suggestion system. METHODS: We conducted a quasi-experimental study in which 6 medical experts coded 602 medical items from structured quality assurance forms or free-text eligibility criteria of 20 different clinical trials. The medical item content was selected on the basis of mortality-leading diseases according to World Health Organization data. The intervention comprised using a semiautomatic code suggestion tool that is linked to a European information infrastructure providing a large medical text corpus of >300,000 medical form items with expert-assigned semantic codes. Krippendorff alpha (Kalpha) with bootstrap analysis was used for the interrater reliability analysis, and coding times were measured before and after the intervention. RESULTS: The intervention improved interrater reliability in structured quality assurance form items (from Kalpha=0.50, 95% CI 0.43-0.57 to Kalpha=0.62, 95% CI 0.55-0.69) and free-text eligibility criteria (from Kalpha=0.19, 95% CI 0.14-0.24 to Kalpha=0.43, 95% CI 0.37-0.50) while preserving or slightly reducing the mean coding time per item for all 6 coders. Regardless of the intervention, precoordination and structured items were associated with significantly higher interrater reliability, but the proportion of items that were precoordinated significantly increased after intervention (eligibility criteria: OR 4.92, 95% CI 2.78-8.72; quality assurance: OR 1.96, 95% CI 1.19-3.25). CONCLUSIONS: The Web-based code suggestion mechanism improved interrater reliability toward moderate or even substantial intercoder agreement. Precoordination and the use of structured versus free-text data elements are key drivers of higher interrater reliability.
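
Krippendorff alpha for nominal codes can be computed from a coincidence matrix, as in the minimal sketch below. It covers only the point estimate for nominal data; the bootstrap confidence intervals mentioned in the abstract are not reproduced, and the function name, input layout, and code labels ("C1", "C2") are hypothetical placeholders rather than the study's data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff alpha.

    units: list of lists; each inner list holds the codes assigned to one
    item by the coders who rated it (missing ratings are simply omitted).
    """
    coincidences = Counter()
    for codes in units:
        m = len(codes)
        if m < 2:
            continue  # an item with fewer than two codes carries no agreement information
        for a, b in permutations(codes, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item coded by two or more coders")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one code was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Hypothetical example: two coders, four items, placeholder codes.
toy_units = [["C1", "C1"], ["C2", "C2"], ["C1", "C2"], ["C2", "C2"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance, which is the scale on which the reported Kalpha figures should be read.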


Subjects
Clinical Coding/methods, Non-Randomized Controlled Trials as Topic/methods, Unified Medical Language System/standards, Humans, Internet, Natural Language Processing, Reproducibility of Results
11.
J Biomed Inform ; 68: 96-103, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28286031

ABSTRACT

BACKGROUND AND OBJECTIVE: Passage retrieval, the identification of top-ranked passages that may contain the answer to a given biomedical question, is a crucial component of any biomedical question answering (QA) system. Passage retrieval in open-domain QA is a longstanding challenge that has been widely studied over the last decades. However, it still requires further effort in biomedical QA. In this paper, we present a new biomedical passage retrieval method based on Stanford CoreNLP sentence/passage length, a probabilistic information retrieval (IR) model, and UMLS concepts. METHODS: In the proposed method, we first use our document retrieval system, based on the PubMed search engine and UMLS similarity, to retrieve documents relevant to a given biomedical question. We then take the abstracts from the retrieved documents and use the Stanford CoreNLP sentence splitter to produce a set of sentences, i.e., candidate passages. Using stemmed words and UMLS concepts as features for the BM25 model, we finally compute the similarity scores between the biomedical question and each of the candidate passages and keep the N top-ranked ones. RESULTS: Experimental evaluations performed on large standard datasets, provided by the BioASQ challenge, show that the proposed method achieves good performance compared with the current state-of-the-art methods. The proposed method significantly outperforms the current state-of-the-art methods by an average of 6.84% in terms of mean average precision (MAP). CONCLUSION: We have proposed an efficient passage retrieval method which can be used to retrieve relevant passages in biomedical QA systems with high mean average precision.
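
The final ranking step, BM25 over stemmed words and UMLS concepts, can be sketched as follows. This is a generic Okapi BM25 scorer, not the authors' implementation: it assumes passages are already tokenized into stems plus concept tokens, uses commonly chosen default parameters (k1=1.2, b=0.75), and the "concept:..." tokens are hypothetical stand-ins for UMLS concept identifiers.

```python
import math
from collections import Counter


def bm25_scores(query_terms, passages, k1=1.2, b=0.75):
    """Score each tokenized passage against the query with Okapi BM25."""
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Document frequency: number of passages containing each term.
    doc_freq = Counter(term for p in passages for term in set(p))
    scores = []
    for passage in passages:
        tf = Counter(passage)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical candidate passages: stemmed words plus placeholder concept tokens.
toy_passages = [
    ["asthma", "treat", "inhal", "corticosteroid", "concept:asthma"],
    ["diabet", "treat", "insulin", "concept:diabetes"],
    ["asthma", "symptom", "wheez", "concept:asthma"],
]
toy_query = ["asthma", "treat", "concept:asthma"]

toy_scores = bm25_scores(toy_query, toy_passages)
ranking = sorted(range(len(toy_passages)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

Treating concept identifiers as extra tokens lets a passage match the question even when the surface wording differs, which is the role the abstract assigns to UMLS concepts.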


Subjects
Information Storage and Retrieval, Natural Language Processing, PubMed, Unified Medical Language System, Statistical Models
12.
J Biomed Inform ; 67: 1-10, 2017 Mar.
Article in English | MEDLINE | ID: mdl-28131722

ABSTRACT

OBJECTIVE: The utility of biomedical information retrieval environments can be severely limited when users lack expertise in constructing effective search queries. To address this issue, we developed a computer-based query recommendation algorithm that suggests semantically interchangeable terms based on an initial user-entered query. In this study, we assessed the value of this approach, which has broad applicability in biomedical information retrieval, by demonstrating its application as part of a search engine that facilitates retrieval of information from electronic health records (EHRs). MATERIALS AND METHODS: The query recommendation algorithm utilizes MetaMap to identify medical concepts from search queries and indexed EHR documents. Synonym variants from UMLS are used to expand the concepts along with a synonym set curated from historical EHR search logs. The empirical study involved 33 clinicians and staff who evaluated the system through a set of simulated EHR search tasks. User acceptance was assessed using the widely used technology acceptance model. RESULTS: The search engine's performance was rated consistently higher with the query recommendation feature turned on vs. off. The relevance of computer-recommended search terms was also rated high, and in most cases the participants had not thought of these terms on their own. The questions on perceived usefulness and perceived ease of use received overwhelmingly positive responses. A vast majority of the participants wanted the query recommendation feature to be available to assist in their day-to-day EHR search tasks. DISCUSSION AND CONCLUSION: Challenges persist for users to construct effective search queries when retrieving information from biomedical documents including those from EHRs. This study demonstrates that semantically-based query recommendation is a viable solution to addressing this challenge.
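
The core of synonym-based query recommendation can be sketched with a simple lookup. This does not model MetaMap concept identification or the curated search-log synonyms described above; it assumes synonym groups are already available as plain sets of strings, and the function name and example groups are hypothetical.

```python
def expand_query(query_terms, synonym_sets):
    """Return the query terms plus terms sharing a synonym set with them, in stable order."""
    expanded = []
    for term in query_terms:
        candidates = [term]
        for group in synonym_sets:
            if term in group:
                candidates.extend(sorted(group))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical synonym groups (not taken from the UMLS or from search logs).
toy_synonyms = [
    {"heart attack", "myocardial infarction", "mi"},
    {"high blood pressure", "hypertension"},
]

suggestions = expand_query(["heart attack", "smoking"], toy_synonyms)
print(suggestions)
```

In an interactive search engine the added terms would be offered as suggestions for the user to accept or reject rather than applied silently.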


Subjects
Algorithms, Electronic Health Records, Natural Language Processing, Search Engine, Humans, Information Storage and Retrieval, Semantics
13.
BMC Med Inform Decis Mak ; 17(1): 155, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-29191207

ABSTRACT

BACKGROUND: The medical subdomain of a clinical note, such as cardiology or neurology, is useful content-derived metadata for developing machine learning downstream applications. To classify the medical subdomain of a note accurately, we have constructed a machine learning-based natural language processing (NLP) pipeline and developed medical subdomain classifiers based on the content of the note. METHODS: We constructed the pipeline using the clinical NLP system, clinical Text Analysis and Knowledge Extraction System (cTAKES), the Unified Medical Language System (UMLS) Metathesaurus, Semantic Network, and learning algorithms to extract features from two datasets - clinical notes from Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n = 431) and Massachusetts General Hospital (MGH) (n = 91,237), and built medical subdomain classifiers with different combinations of data representation methods and supervised learning algorithms. We evaluated the performance of classifiers and their portability across the two datasets. RESULTS: The convolutional recurrent neural network with neural word embeddings trained-medical subdomain classifier yielded the best performance measurement on iDASH and MGH datasets with area under receiver operating characteristic curve (AUC) of 0.975 and 0.991, and F1 scores of 0.845 and 0.870, respectively. Considering better clinical interpretability, linear support vector machine-trained medical subdomain classifier using hybrid bag-of-words and clinically relevant UMLS concepts as the feature representation, with term frequency-inverse document frequency (tf-idf)-weighting, outperformed other shallow learning classifiers on iDASH and MGH datasets with AUC of 0.957 and 0.964, and F1 scores of 0.932 and 0.934 respectively. We trained classifiers on one dataset, applied to the other dataset and yielded the threshold of F1 score of 0.7 in classifiers for half of the medical subdomains we studied. 
CONCLUSION: Our study shows that a supervised learning-based NLP approach is useful for developing medical subdomain classifiers. The deep learning algorithm with distributed word representation yields better performance, yet shallow learning algorithms with the word and concept representation achieve comparable performance with better clinical interpretability. Portable classifiers may also be used across datasets from different institutions.


Subjects
Clinical Decision-Making, Machine Learning, Medical Records, Natural Language Processing, Unified Medical Language System, Humans
14.
Neuroimage ; 139: 74-93, 2016 Oct 01.
Article in English | MEDLINE | ID: mdl-27261158

ABSTRACT

The majority of functional neuroimaging investigations aim to characterize an average human brain. However, another important goal of cognitive neuroscience is to understand the ways in which individuals differ from one another and the significance of these differences. This latter goal is given special weight by the recent reconceptualization of neurological disorders where sharp boundaries are no longer drawn either between health and neuropsychiatric and neurodevelopmental disorders, or among different disorders (e.g., Insel et al., 2010). Consequently, even the variability in the healthy population can inform our understanding of brain disorders. However, because the use of functional neural markers is still in its infancy, no consensus presently exists about which measures (e.g., effect size?, extent of activation?, degree of lateralization?) are the best ones to use. We here attempt to address this question with respect to one large-scale neural system: the set of brain regions in the frontal and temporal cortices that jointly support high-level linguistic processing (e.g., Binder et al., 1997; Fedorenko, Hsieh, Nieto-Castanon, Whitfield-Gabrieli, & Kanwisher, 2010). In particular, using data from 150 individuals all of whom had performed a language "localizer" task contrasting sentences and nonword sequences (Fedorenko et al., 2010), we: a) characterize the distributions of the values for four key neural measures of language activity (region effect sizes, region volumes, lateralization based on effect sizes, and lateralization based on volumes); b) test the reliability of these measures in a subset of 32 individuals who were scanned across two sessions; c) evaluate the relationship among the different regions of the language system; and d) evaluate the relationship among the different neural measures. Based on our results, we provide some recommendations for future studies of brain-behavior and brain-genes relationships. 
Although some of our conclusions are specific to the language system, others (e.g., the fact that effect-size-based measures tend to be more reliable than volume-based measures) are likely to generalize to the rest of the brain.


Subjects
Brain Mapping/methods, Brain/anatomy & histology, Brain/physiology, Functional Laterality/physiology, Computer-Assisted Image Interpretation/methods, Language, Nerve Net/physiology, Adolescent, Adult, Biomarkers, Cognition/physiology, Comprehension/physiology, Female, Humans, Magnetic Resonance Imaging/methods, Male, Middle Aged, Nerve Net/anatomy & histology, Organ Size/physiology, Reference Values, Reproducibility of Results, Sensitivity and Specificity, Young Adult
15.
J Biomed Inform ; 64: 320-332, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27815227

ABSTRACT

Ambiguity in the biomedical domain represents a major issue when performing Natural Language Processing tasks over the huge amount of available information in the field. For this reason, Word Sense Disambiguation is critical for achieving accurate systems able to tackle complex tasks such as information extraction, summarization or document classification. In this work we explore whether multilinguality can help to solve the problem of ambiguity, and the conditions required for a system to improve the results obtained by monolingual approaches. Also, we analyze the best ways to generate those useful multilingual resources, and study different languages and sources of knowledge. The proposed system, based on co-occurrence graphs containing biomedical concepts and textual information, is evaluated on a test dataset frequently used in biomedicine. We can conclude that multilingual resources are able to provide a clear improvement of more than 7% compared to monolingual approaches, for graphs built from a small number of documents. Also, empirical results show that automatically translated resources are a useful source of information for this particular task.


Subjects
Data Mining, Natural Language Processing, Algorithms, Humans, Knowledge Bases, Unified Medical Language System
16.
J Biomed Inform ; 58 Suppl: S103-S110, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26375493

ABSTRACT

The second track of the 2014 i2b2 challenge asked participants to automatically identify risk factors for heart disease among diabetic patients using natural language processing techniques for clinical notes. This paper describes a rule-based system developed using a combination of regular expressions, concepts from the Unified Medical Language System (UMLS), and freely-available resources from the community. With a performance (F1=90.7) that is significantly higher than the median (F1=87.20) and close to the top performing system (F1=92.8), it was the best rule-based system of all the submissions in the challenge. We also used this system to evaluate the utility of different terminologies in the UMLS towards the challenge task. Of the 155 terminologies in the UMLS, 129 (76.78%) have no representation in the corpus. The Consumer Health Vocabulary had very good coverage of relevant concepts and was the most useful terminology for the challenge task. While segmenting notes into sections and lists has a significant impact on the performance, identifying negations and experiencer of the medical event results in negligible gain.


Subjects
Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Unified Medical Language System/organization & administration , Aged , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Coronary Artery Disease/diagnosis , Coronary Artery Disease/epidemiology , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Ohio/epidemiology , Pattern Recognition, Automated/methods , Risk Assessment/methods , Terminology as Topic , Vocabulary, Controlled
17.
J Biomed Inform ; 47: 18-27, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24036003

ABSTRACT

OBJECTIVE: In medical information retrieval research, semantic resources have been mostly used by expanding the original query terms or estimating the concept importance weight. However, implicit term-dependency information contained in semantic concept terms has been overlooked or at least underused in most previous studies. In this study, we incorporate a semantic concept-based term-dependence feature into a formal retrieval model to improve its ranking performance. DESIGN: Standardized medical concept terms used by medical professionals were assumed to have implicit dependency within the same concept. We hypothesized that, by elaborately revising the ranking algorithms to favor documents that preserve those implicit dependencies, the ranking performance could be improved. The implicit dependence features are harvested from the original query using MetaMap. These semantic concept-based dependence features were incorporated into a semantic concept-enriched dependence model (SCDM). We designed four different variants of the model, with each variant having distinct characteristics in the feature formulation method. MEASUREMENTS: We performed leave-one-out cross validations on both a clinical document corpus (TREC Medical records track) and a medical literature corpus (OHSUMED), which are representative test collections in medical information retrieval research. RESULTS: Our semantic concept-enriched dependence model consistently outperformed other state-of-the-art retrieval methods. Analysis shows that the performance gain has occurred independently of the concept's explicit importance in the query. CONCLUSION: By capturing implicit knowledge with regard to the query term relationships and incorporating them into a ranking model, we could build a more robust and effective retrieval model, independent of the concept importance.


Subjects
Medical Informatics/organization & administration , Semantics , Algorithms , Humans , Information Storage and Retrieval , Internet , Medical Informatics/methods , Models, Statistical , Software , Unified Medical Language System
18.
J Biomed Inform ; 52: 319-28, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25066773

ABSTRACT

OBJECTIVE: Automatic summarization of biomedical literature usually relies on domain knowledge from external sources to build rich semantic representations of the documents to be summarized. In this paper, we investigate the impact of the knowledge source used on the quality of the summaries that are generated. MATERIALS AND METHODS: We present a method for representing a set of documents relevant to a given biological entity or topic as a semantic graph of domain concepts and relations. Different graphs are created by using different combinations of ontologies and vocabularies within the UMLS (including GO, SNOMED-CT, HUGO and all available vocabularies in the UMLS) to retrieve domain concepts, and different types of relationships (co-occurrence and semantic relations from the UMLS Metathesaurus and Semantic Network) are used to link the concepts in the graph. The different graphs are next used as input to a summarization system that produces summaries composed of the most relevant sentences from the original documents. RESULTS AND CONCLUSIONS: Our experiments demonstrate that the choice of the knowledge source used to model the text has a significant impact on the quality of the automatic summaries. In particular, we find that, when summarizing gene-related literature, using GO, SNOMED-CT and HUGO to extract domain concepts results in significantly better summaries than using all available vocabularies in the UMLS. This finding suggests that successful biomedical summarization requires the selection of the appropriate knowledge source, whose coverage, specificity and relations must be in accordance with the type of documents to be summarized.


Subjects
Biomedical Research , Data Mining/standards , Databases, Factual , Information Storage and Retrieval/standards , Semantics , Vocabulary, Controlled , Cluster Analysis
19.
J Biomed Inform ; 48: 54-65, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24316051

ABSTRACT

Rapid, automated determination of the mapping of free-text phrases to pre-defined concepts could assist in the annotation of clinical notes and increase the speed of natural language processing systems. The aim of this study was to design and evaluate a token-order-specific naïve Bayes-based machine learning system (RapTAT) to predict associations between phrases and concepts. Performance was assessed using a reference standard generated from 2860 VA discharge summaries containing 567,520 phrases that had been mapped to 12,056 distinct Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts by the MCVS natural language processing system. It was also assessed on the manually annotated 2010 i2b2 challenge data. Performance was established with regard to precision, recall, and F-measure for each of the concepts within the VA documents using bootstrapping. Within that corpus, concepts identified by MCVS were broadly distributed throughout SNOMED CT, and the token-order-specific language model achieved better performance based on precision, recall, and F-measure (0.95±0.15, 0.96±0.16, and 0.95±0.16, respectively; mean±SD) than the bag-of-words-based naïve Bayes model (0.64±0.45, 0.61±0.46, and 0.60±0.45, respectively) that has previously been used for concept mapping. Precision, recall, and F-measure on the i2b2 test set were 92.9%, 85.9%, and 89.2%, respectively, using the token-order-specific model. RapTAT required just 7.2 ms to map all phrases within a single discharge summary, and mapping rate did not decrease as the number of processed documents increased. The high performance attained by the tool in terms of both accuracy and speed was encouraging, and the mapping rate should be sufficient to support near-real-time, interactive annotation of medical narratives. These results demonstrate the feasibility of rapidly and accurately mapping phrases to a wide range of medical concepts based on a token-order-specific naïve Bayes model and machine learning.
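
The contrast between a token-order-specific and a bag-of-words naïve Bayes model can be illustrated with a minimal sketch. This is not the RapTAT implementation, which the abstract does not detail. It is one plausible reading in which each feature is a (position, token) pair, so the same token at a different position counts as a different feature. The class name, the Laplace smoothing, and the concept labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class PositionalNaiveBayes:
    """Toy phrase-to-concept mapper; features are (position, token) pairs.

    Hypothetical illustration only, not the published RapTAT code.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                           # Laplace smoothing constant (assumed)
        self.concept_counts = Counter()              # training phrases seen per concept
        self.feature_counts = defaultdict(Counter)   # concept -> (position, token) counts
        self.vocab = set()                           # every (position, token) feature seen

    @staticmethod
    def features(phrase):
        # Pairing each token with its index makes the model order-sensitive;
        # a bag-of-words model would keep the token and drop the index.
        return list(enumerate(phrase.lower().split()))

    def fit(self, pairs):
        """Train on an iterable of (phrase, concept_label) pairs."""
        for phrase, concept in pairs:
            self.concept_counts[concept] += 1
            for feat in self.features(phrase):
                self.feature_counts[concept][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, phrase):
        """Return the concept with the highest smoothed log-posterior."""
        total = sum(self.concept_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for concept, n in self.concept_counts.items():
            counts = self.feature_counts[concept]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)              # log prior
            for feat in self.features(phrase):
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best, best_score = concept, score
        return best
```

Trained on two made-up pairs, ("kidney failure", "RENAL_FAILURE") and ("failure to thrive", "FAILURE_TO_THRIVE"), the sketch maps "failure to gain" to the second label, because "failure" in first position was only seen with that concept. The labels are placeholders, not SNOMED CT identifiers.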


Subjects
Artificial Intelligence , Natural Language Processing , Algorithms , Automation , Bayes Theorem , Databases, Factual , Electronic Health Records , Hospitals, Veterans , Humans , Models, Statistical , Reproducibility of Results , Software , Systematized Nomenclature of Medicine , Tennessee , Terminology as Topic , Unified Medical Language System , Vocabulary, Controlled
20.
J Am Med Inform Assoc ; 31(2): 426-434, 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-37952122

ABSTRACT

OBJECTIVE: To construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to help better represent the often underrepresented physical and psychological CIH approaches in standard terminologies, and to also apply state-of-the-art natural language processing (NLP) techniques to help recognize them in the biomedical literature. MATERIALS AND METHODS: We constructed the CIHLex by integrating various resources, compiling and integrating data from biomedical literature and relevant sources of knowledge. The Lexicon encompasses 724 unique concepts with 885 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS), and we developed and utilized BERT models comparing their efficiency in CIH named entity recognition to well-established models including MetaMap and CLAMP, as well as the large language model GPT3.5-turbo. RESULTS: Of the 724 unique concepts in CIHLex, 27.2% could be matched to at least one term in the UMLS. About 74.9% of the mapped UMLS Concept Unique Identifiers were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro-average F1-score of 0.91, surpassing other models. CONCLUSION: Our CIHLex significantly augments representation of CIH approaches in biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing standardization and recognition of CIH terminology in biomedical contexts.


Subjects
Algorithms , Unified Medical Language System , Natural Language Processing , Language