Results 1 - 20 of 72
1.
Diagnostics (Basel) ; 14(11)2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38893730

ABSTRACT

In recent years, Convolutional Neural Network (CNN) models have demonstrated notable advancements in various domains such as image classification and Natural Language Processing (NLP). Despite their success in image classification tasks, their potential impact on medical image retrieval, particularly in text-based medical image retrieval (TBMIR) tasks, has not yet been fully realized. This could be attributed to the complexity of the ranking process, as there is ambiguity in treating TBMIR as an image retrieval task rather than a traditional information retrieval or NLP task. To address this gap, our paper proposes a novel approach to re-ranking medical images using a Deep Matching Model (DMM) and Medical-Dependent Features (MDF). These features incorporate categorical attributes such as medical terminologies and imaging modalities. Specifically, our DMM aims to generate effective representations for query and image metadata using a personalized CNN, facilitating matching between these representations. By using MDF, a semantic similarity matrix based on Unified Medical Language System (UMLS) meta-thesaurus, and a set of personalized filters taking into account some ranking features, our deep matching model can effectively consider the TBMIR task as an image retrieval task, as previously mentioned. To evaluate our approach, we performed experiments on the medical ImageCLEF datasets from 2009 to 2012. The experimental results show that the proposed model significantly enhances image retrieval performance compared to the baseline and state-of-the-art approaches.

2.
BMC Med Inform Decis Mak ; 23(Suppl 4): 299, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326827

ABSTRACT

BACKGROUND: In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research. Thus, terminology mapping is a necessary step to harmonize heterogeneous data. Taking the Medical Dictionary for Regulatory Activities (MedDRA) and the International Classification of Diseases (ICD) as an example, the mapping between them is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS: The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. For the 2400 sampled unmapped terms, 56 of the 2400 MedDRA Preferred Terms (PT) could have exact match terms from ICD. CONCLUSION: Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PT terms, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.


Subjects
Adverse Drug Reaction Reporting Systems , International Classification of Diseases , Humans , Unified Medical Language System , Pharmacovigilance , Algorithms
3.
Artif Intell Med ; 148: 102758, 2024 02.
Article in English | MEDLINE | ID: mdl-38325934

ABSTRACT

The development of intelligent systems that use social media data for decision-making processes in numerous domains such as politics, business, marketing, and finance has been made possible by the popularity of social media platforms. However, the utilization of textual data from social media in the healthcare management industry is still somewhat limited compared to other industries. Investigating how current machine learning and natural language processing technologies can be used in the healthcare industry to gauge public sentiment is an important area of study. Earlier works on healthcare sentiment analysis have utilized traditional word embedding models trained on general and medical corpora. However, integration of medical knowledge into pre-trained word embedding models has not been considered yet. Word embedding models trained on a general corpus lack medical knowledge, and models trained on a small medical corpus are limited in capturing semantic and syntactic properties. This research proposes a new word embedding model named Word Embedding Integrated with Medical Knowledge Vector (WE-iMKVec). The proposed model integrates sentiment lexicons and medical knowledgebases into the pre-trained word embedding to enrich the properties of word embedding. A new medical-aware sentiment polarity score is proposed for use in learning neural-network sentiment vectors, and these vectors are incorporated with the original pre-trained word vectors. The resulting vectors are enriched with lexicon vectors, and the medical knowledge vectors, namely the Adverse Drug Reaction (ADR) vector and the Unified Medical Language System (UMLS) vector, are used to build the proposed WE-iMKVec model. WE-iMKVec is validated on five different social media healthcare review datasets, and the empirical results show its superiority over traditional word embedding models in medical sentiment analysis.
The highest improvement can be found in the patients.info medical condition dataset, where the proposed model outperforms three conventional word2vec models (Google-News, PubMed-PMC, and Drug Reviews) by 12.7%, 31.4%, and 25.4%, respectively, in terms of F1 score.


Subjects
Deep Learning , Sentiment Analysis , Humans , Neural Networks, Computer , Machine Learning , Natural Language Processing
4.
Heliyon ; 10(1): e22766, 2024 Jan 15.
Article in English | MEDLINE | ID: mdl-38163107

ABSTRACT

A transient ischemic attack (TIA) affects millions of people worldwide. Although TIA risk factors have been identified individually, a systemic quantitative analysis of all health factors relevant to TIA using electronic medical records (EMR) remains lacking. This study employed a data-driven approach, leveraging hospital EMR data to create a TIA patient health factor graph. This graph consisted of 737 TIA and 737 control patient nodes, 740 health factor nodes, and over 33,000 relations between patients and factors. For all health factors in the graph, the connection delta ratios (CDRs) were determined and ranked, generating a quantitative distribution of TIA health factors. A literature review confirmed 56 risk factors in the distribution and unveiled a potential new risk factor "rhinosinusitis" for future validation. Moreover, the patient graph was visualized together with the TIA knowledge graph in the Unified Medical Language System. This integration enables clinicians to access and visualize patient data and international standard knowledge within a unified graph. In conclusion, graph CDR analysis can effectively quantify the distribution of TIA risk factors. The resulting TIA risk factor distribution might be instrumental in developing new risk prediction machine learning models for screening and early detection of TIA.

5.
Heliyon ; 9(6): e16818, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37332929

ABSTRACT

Embeddings are fundamental resources often reused for building intelligent systems in the biomedical context. As a result, evaluating the quality of previously trained embeddings and ensuring they cover the desired information is critical for the success of applications. This paper proposes a new evaluation methodology to test the coverage of embeddings against a targeted domain of interest. It defines measures to assess the terminology, similarity, and analogy coverage, which are core aspects of the embeddings. Then, it discusses the experimentation carried out on existing biomedical embeddings in the specific context of pulmonary diseases. The proposed methodology and measures are general and may be applied to any application domain.

6.
Stud Health Technol Inform ; 305: 97-101, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386967

ABSTRACT

Currently, there is very little research aimed at developing medical knowledge extraction tools for major West Slavic languages (Czech, Polish, and Slovak). This project lays the groundwork for a general medical knowledge extraction pipeline, introducing the resource vocabularies available for the respective languages (UMLS resources, ICD-10 translations and national drug databases). It demonstrates the utility of this approach on a case study using a large proprietary corpus of Czech oncology records consisting of more than 40 million words written about more than 4,000 patients. After correlating MedDRA terms found in patients' records with drugs prescribed to them, significant non-obvious associations were found between selected medical conditions being mentioned and the probability of certain drugs being prescribed over the course of the patient's treatment, in some cases increasing the probability of prescriptions by over 250%. This direction of research, producing large amounts of annotated data, is a prerequisite for training deep learning models and predictive systems.


Subjects
Databases, Pharmaceutical , Language , Humans , International Classification of Diseases , Knowledge , Medical Oncology
7.
Stud Health Technol Inform ; 305: 186-189, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386992

ABSTRACT

The development of clinical search engines is a pressing task in medical informatics. The main challenge in this area is high-quality processing of unstructured texts. The UMLS, an ontological interdisciplinary metathesaurus, can be used to solve this problem. Currently, there is no unified method for aggregating relevant information from the UMLS. In this research, we represented the UMLS as a graph model and performed a spot check of the UMLS structure to identify basic problems. We then created a new graph metric and integrated it into two program modules that we developed for aggregating relevant knowledge from the UMLS.


Subjects
Medical Informatics , Unified Medical Language System , Interdisciplinary Studies , Knowledge , Search Engine
8.
Stud Health Technol Inform ; 302: 823-824, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203506

ABSTRACT

This paper describes a first attempt to map UMLS concepts to pictographs as a resource for translation systems for the medical domain. An evaluation of pictographs from two freely available sets shows that for many concepts no pictograph could be found and that word-based lookup is inadequate for this task.


Subjects
Unified Medical Language System
9.
Artif Intell Med ; 140: 102551, 2023 06.
Article in English | MEDLINE | ID: mdl-37210157

ABSTRACT

Text-Based Medical Image Retrieval (TBMIR) has been known to be successful in retrieving medical images with textual descriptions. Usually, these descriptions are very brief and cannot express the whole visual content of the image in words, which negatively affects retrieval performance. One of the solutions offered in the literature is to form a Bayesian Network thesaurus taking advantage of some medical terms extracted from the image datasets. Although this solution is interesting, it is not efficient, as it depends heavily on the co-occurrence measure, the layer arrangement, and the arc directions. A significant drawback of the co-occurrence measure is the generation of many uninteresting co-occurring terms. Several studies have applied association rule mining and its measures to discover the correlation between the terms. In this paper, we propose a new efficient association Rule Based Bayesian Network (R2BN) model for TBMIR using updated medically-dependent features (MDF) based on the Unified Medical Language System (UMLS). The MDF are a set of medical terms that refer to the imaging modalities, the image color, the searched object dimension, etc. The proposed model presents the association rules mined from MDF in the form of a Bayesian Network model. Then, it exploits the association rule measures (support, confidence, and lift) to prune the Bayesian Network model for efficient computation. The proposed R2BN model is combined with a probabilistic model from the literature to predict the relevance of an image to a given query. Experiments are carried out with ImageCLEF medical retrieval task collections from 2009 to 2013. Results show that our proposed model significantly enhances image retrieval accuracy compared to the state-of-the-art retrieval models.
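
The three association rule measures named in this abstract (support, confidence, and lift) have standard textbook definitions, which can be sketched as follows. This is only an illustration, not the authors' R2BN implementation: the function name `rule_measures`, the toy term sets, and the example rule are all assumptions made here.

```python
from typing import Iterable, List, Set, Tuple


def rule_measures(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Standard definitions over a list of term sets:
      support    = P(A and C)
      confidence = P(A and C) / P(A)
      lift       = confidence / P(C)
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0, 0.0
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical toy data: each set holds the terms attached to one image description.
toy_descriptions = [
    {"mri", "brain"},
    {"mri", "brain", "tumor"},
    {"ct", "lung"},
    {"mri", "knee"},
]

support, confidence, lift = rule_measures(toy_descriptions, ["mri"], ["brain"])
print(support, confidence, lift)
```

A lift above 1 indicates that the two terms co-occur more often than independence would predict; a pruning step could then discard rules below chosen thresholds, although the thresholds used in the paper are not given in the abstract.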


Subjects
Information Storage and Retrieval , Models, Statistical , Bayes Theorem , Unified Medical Language System
10.
Heliyon ; 9(3): e14636, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37020943

ABSTRACT

Background and objectives: Medical notes are narratives that describe the health of the patient in free text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk for developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions. Methods: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select keywords with the highest discriminative ability between at risk dementia patients and healthy controls. The medical notes are then summarized in the form of occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding consists of the average of the vector embedding of each keyword occurrence as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to UMLS concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding and a gradient boosted trees model is applied to the second encoding. Patients from a single health care institution are used to develop all the classifiers which are then evaluated on held-out patients from the same health care institution as well as test patients from two other health care institutions. 
Results: The results indicate that it is possible to identify patients at risk for dementia one year ahead of the onset of the disease using medical notes with an AUC of 75% when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space and when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included. Conclusion: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes; the availability of sufficient training data for each disease condition; and the variabilities resulting from different feature engineering techniques.

11.
J Med Internet Res ; 24(11): e40361, 2022 11 25.
Article in English | MEDLINE | ID: mdl-36427233

ABSTRACT

BACKGROUND: Electronic medical records (EMRs) of patients with lung cancer (LC) capture a variety of health factors. Understanding the distribution of these factors will help identify key factors for risk prediction in preventive screening for LC. OBJECTIVE: We aimed to generate an integrated biomedical graph from EMR data and Unified Medical Language System (UMLS) ontology for LC, and to generate an LC health factor distribution from a hospital EMR of approximately 1 million patients. METHODS: The data were collected from 2 sets of 1397 patients with and those without LC. A patient-centered health factor graph was plotted with 108,000 standardized data, and a graph database was generated to integrate the graphs of patient health factors and the UMLS ontology. With the patient graph, we calculated the connection delta ratio (CDR) for each of the health factors to measure the relative strength of the factor's relationship to LC. RESULTS: The patient graph had 93,000 relations between the 2794 patient nodes and 650 factor nodes. An LC graph with 187 related biomedical concepts and 188 horizontal biomedical relations was plotted and linked to the patient graph. Searching the integrated biomedical graph with any number or category of health factors resulted in graphical representations of relationships between patients and factors, while searches using any patient presented the patient's health factors from the EMR and the LC knowledge graph (KG) from the UMLS in the same graph. Sorting the health factors by CDR in descending order generated a distribution of health factors for LC. The top 70 CDR-ranked factors of disease, symptom, medical history, observation, and laboratory test categories were verified to be concordant with those found in the literature. 
CONCLUSIONS: By collecting standardized data of thousands of patients with and those without LC from the EMR, it was possible to generate a hospital-wide patient-centered health factor graph for graph search and presentation. The patient graph could be integrated with the UMLS KG for LC and thus enable hospitals to bring continuously updated international standard biomedical KGs from the UMLS for clinical use in hospitals. CDR analysis of the graph of patients with LC generated a CDR-sorted distribution of health factors, in which the top CDR-ranked health factors were concordant with the literature. The resulting distribution of LC health factors can be used to help personalize risk evaluation and preventive screening recommendations.


Subjects
Electronic Health Records , Lung Neoplasms , Humans , Retrospective Studies , Unified Medical Language System , Lung Neoplasms/epidemiology , Hospitals
12.
Proc Int World Wide Web Conf ; 2022: 1037-1046, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36108322

ABSTRACT

The Unified Medical Language System (UMLS) Metathesaurus construction process mainly relies on lexical algorithms and manual expert curation for integrating over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms and largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM has the potential for being improved further because it only uses lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem. In this paper, we develop multiple variants of context-enriched learning models (ConLMs) by adding to the LexLM the types of contextual information listed above. We represent these context types in context-enriched knowledge graphs (ConKGs) with four variants ConSS, ConSG, ConHR, and ConAll. We train these ConKG embeddings using seven KG embedding techniques. We create the ConLMs by concatenating the ConKG embedding vectors with the word embedding vectors from the LexLM. We evaluate the performance of the ConLMs using the UVA generalization test datasets with hundreds of millions of pairs. Our extensive experiments show a significant performance improvement from the ConLMs over the LexLM, namely +5.0% in precision (93.75%), +0.69% in recall (93.23%), +2.88% in F1 (93.49%) for the best ConLM. Our experiments also show that the ConAll variant including the three context types takes more time, but does not always perform better than other variants with a single context type. 
Finally, our experiments show that the pairs of terms with high lexical similarity benefit most from adding contextual information, namely +6.56% in precision (94.97%), +2.13% in recall (93.23%), +4.35% in F1 (94.09%) for the best ConLM. The pairs with lower degrees of lexical similarity also show performance improvement with +0.85% in F1 (96%) for low similarity and +1.31% in F1 (96.34%) for no similarity. These results demonstrate the importance of using contextual information in the UVA problem.
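
The F1 values reported in this abstract can be checked against the reported precision and recall using the usual harmonic-mean definition. The snippet below is a small worked check, not code from the paper; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Best ConLM overall: precision 93.75%, recall 93.23% (figures from the abstract).
overall_f1 = f1_score(93.75, 93.23)
# High lexical similarity pairs: precision 94.97%, recall 93.23%.
high_similarity_f1 = f1_score(94.97, 93.23)

print(f"{overall_f1:.2f}")          # about 93.49, as reported
print(f"{high_similarity_f1:.2f}")  # about 94.09, as reported
```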

13.
J Biomed Inform ; 131: 104118, 2022 07.
Article in English | MEDLINE | ID: mdl-35690349

ABSTRACT

OBJECTIVE: To propose a new vector-based relatedness metric that derives word vectors from the intrinsic structure of biomedical ontologies, without consulting external resources such as large-scale biomedical corpora. MATERIALS AND METHODS: SNOMED CT on the mapping layer of UMLS was used as a testbed ontology. Vectors were created for every concept at the end of all semantic relations (attribute-value relations and descendants as well as the is_a relation) of the defining concept. The cosine similarity between the averages of those vectors with respect to each defining concept was computed to produce a final semantic relatedness. RESULTS: Two benchmark sets that include a total of 62 biomedical term pairs were used for evaluation. Spearman's rank coefficient of the current method was 0.655, 0.744, and 0.742 with the relatedness rated by physicians, coders, and medical experts, respectively. The proposed method was comparable to a word-embedding method and outperformed path-based, information content-based, and another multiple relation-based relatedness metrics. DISCUSSION: The current study demonstrated that the addition of attribute relations to the is_a hierarchy of SNOMED CT better conforms to the human sense of relatedness than models based on taxonomic relations. The current approach also showed that it is robust to the design inconsistency of ontologies. CONCLUSION: Unlike the previous vector-based approach, the current study exploited the intrinsic semantic structure of an ontology, precluding the need for external textual resources to obtain context information of defining terms. Future research is recommended to prove the validity of the current method with other biomedical ontologies.
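
The final step described in this abstract, the cosine similarity between the averages of two sets of vectors, can be sketched as below. How the per-concept vectors are actually built from SNOMED CT relations is not specified in the abstract, so the toy vectors and the function names (`mean_vector`, `cosine`, `relatedness`) are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise average of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(
    vectors_a: Sequence[Sequence[float]],
    vectors_b: Sequence[Sequence[float]],
) -> float:
    """Cosine similarity between the averaged vectors of two defining concepts."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))


# Hypothetical vectors for the concepts related to two defining concepts.
concept_a_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
concept_b_vectors = [[1.0, 1.0, 0.0]]
concept_c_vectors = [[0.0, 0.0, 1.0]]

print(relatedness(concept_a_vectors, concept_b_vectors))
print(relatedness(concept_a_vectors, concept_c_vectors))
```

In this toy case the averaged vector for concept A points in the same direction as the vector for concept B, so their relatedness is at the maximum, while concept C is orthogonal to both.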


Subjects
Biological Ontologies , Systematized Nomenclature of Medicine , Humans , Natural Language Processing , Semantics , Unified Medical Language System
14.
BMC Med Res Methodol ; 22(1): 141, 2022 05 14.
Article in English | MEDLINE | ID: mdl-35568796

ABSTRACT

BACKGROUND: Screening for eligible patients continues to pose a great challenge for many clinical trials. This has led to a rapidly growing interest in standardizing computable representations of eligibility criteria (EC) in order to develop tools that leverage data from electronic health record (EHR) systems. Although laboratory procedures (LP) represent a common entity of EC that is readily available and retrievable from EHR systems, there is a lack of interoperable data models for this entity of EC. A public, specialized data model that utilizes international, widely-adopted terminology for LP, e.g. Logical Observation Identifiers Names and Codes (LOINC®), is much needed to support automated screening tools. OBJECTIVE: The aim of this study is to establish a core dataset for LP most frequently requested to recruit patients for clinical trials using LOINC terminology. Employing such a core dataset could enhance the interface between study feasibility platforms and EHR systems and significantly improve automatic patient recruitment. METHODS: We used a semi-automated approach to analyze 10,516 screening forms from the Medical Data Models (MDM) portal's data repository that are pre-annotated with Unified Medical Language System (UMLS). An automated semantic analysis based on concept frequency is followed by an extensive manual expert review performed by physicians to analyze complex recruitment-relevant concepts not amenable to automatic approach. RESULTS: Based on analysis of 138,225 EC from 10,516 screening forms, 55 laboratory procedures represented 77.87% of all UMLS laboratory concept occurrences identified in the selected EC forms. We identified 26,413 unique UMLS concepts from 118 UMLS semantic types and covered the vast majority of Medical Subject Headings (MeSH) disease domains. 
CONCLUSIONS: Only a small set of common LP covers the majority of laboratory concepts in screening EC forms which supports the feasibility of establishing a focused core dataset for LP. We present ELaPro, a novel, LOINC-mapped, core dataset for the most frequent 55 LP requested in screening for clinical trials. ELaPro is available in multiple machine-readable data formats like CSV, ODM and HL7 FHIR. The extensive manual curation of this large number of free-text EC as well as the combining of UMLS and LOINC terminologies distinguishes this specialized dataset from previous relevant datasets in the literature.
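
The coverage figure in this abstract (55 laboratory procedures accounting for 77.87% of laboratory concept occurrences) is a top-k share of a frequency distribution. A minimal sketch of that kind of calculation follows; it is not the authors' pipeline, and the function name `top_k_coverage` and the toy concept labels are hypothetical.

```python
from collections import Counter
from typing import Iterable


def top_k_coverage(concept_occurrences: Iterable[str], k: int) -> float:
    """Share of all occurrences accounted for by the k most frequent concepts."""
    counts = Counter(concept_occurrences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


# Hypothetical occurrences of annotated laboratory concepts across screening forms.
toy_occurrences = (
    ["hemoglobin"] * 5 + ["creatinine"] * 3 + ["alt"] * 1 + ["ast"] * 1
)

print(top_k_coverage(toy_occurrences, 2))
```

Here the two most frequent concepts cover 8 of 10 occurrences; raising k until the coverage curve flattens is one way to decide how large a core dataset needs to be.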


Subjects
Logical Observation Identifiers Names and Codes , Medical Subject Headings , Humans , Semantics
15.
BMC Med Inform Decis Mak ; 22(1): 114, 2022 04 29.
Article in English | MEDLINE | ID: mdl-35488252

ABSTRACT

BACKGROUND: Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is a great interest in applying machine learning tools on medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes, is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. METHODS: In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. RESULTS: To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps the semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that learned vector embeddings are helpful in downstream medical informatics applications. CONCLUSION: This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.


Subjects
Electronic Health Records , Unified Medical Language System , Algorithms , Humans , Machine Learning
16.
Front Artif Intell ; 5: 1051724, 2022.
Article in English | MEDLINE | ID: mdl-36714202

ABSTRACT

Objective: The adoption of electronic health records (EHRs) has produced enormous amounts of data, creating research opportunities in clinical data sciences. Several concept recognition systems have been developed to facilitate clinical information extraction from these data. While studies exist that compare the performance of many concept recognition systems, they are typically developed internally and may be biased due to different internal implementations, parameters used, and limited number of systems included in the evaluations. The goal of this research is to evaluate the performance of existing systems to retrieve relevant clinical concepts from EHRs. Methods: We investigated six concept recognition systems, including CLAMP, cTAKES, MetaMap, NCBO Annotator, QuickUMLS, and ScispaCy. Clinical concepts extracted included procedures, disorders, medications, and anatomical location. The system performance was evaluated on two datasets: the 2010 i2b2 and the MIMIC-III. Additionally, we assessed the performance of these systems in five challenging situations, including negation, severity, abbreviation, ambiguity, and misspelling. Results: For clinical concept extraction, CLAMP achieved the best performance on exact and inexact matching, with an F-score of 0.70 and 0.94, respectively, on i2b2; and 0.39 and 0.50, respectively, on MIMIC-III. Across the five challenging situations, ScispaCy excelled in extracting abbreviation information (F-score: 0.86) followed by NCBO Annotator (F-score: 0.79). CLAMP outperformed in extracting severity terms (F-score 0.73) followed by NCBO Annotator (F-score: 0.68). CLAMP outperformed other systems in extracting negated concepts (F-score 0.63). Conclusions: Several concept recognition systems exist to extract clinical information from unstructured data. This study provides an external evaluation by end-users of six commonly used systems across different extraction tasks. 
Our findings suggest that CLAMP provides the most comprehensive set of annotations for clinical concept extraction tasks and associated challenges. Comparing standard extraction tasks across systems provides guidance to other clinical researchers when selecting a concept recognition system relevant to their clinical information extraction task.
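
The distinction between exact and inexact matching in this abstract can be illustrated with a small span-based F-score. The sketch assumes one common convention: an exact match requires identical character offsets and concept type, while an inexact match requires only overlapping offsets with the same type. The abstract does not spell out the criteria used in the study, and the function names and toy spans below are hypothetical.

```python
from typing import List, Tuple

# (start offset, end offset, concept type); the end offset is exclusive.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], predicted: List[Span], exact: bool = True) -> float:
    """F-score under exact (same boundaries) or inexact (overlap) matching."""

    def is_match(p: Span, g: Span) -> bool:
        if p[2] != g[2]:
            return False
        return (p[0], p[1]) == (g[0], g[1]) if exact else overlaps(p, g)

    matched_pred = sum(1 for p in predicted if any(is_match(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(is_match(p, g) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: the second prediction has a slightly different start.
gold_spans = [(0, 10, "disorder"), (20, 30, "medication")]
predicted_spans = [(0, 10, "disorder"), (22, 30, "medication")]

print(f_score(gold_spans, predicted_spans, exact=True))
print(f_score(gold_spans, predicted_spans, exact=False))
```

The gap between the two scores in the toy example mirrors the pattern in the reported results, where inexact matching yields noticeably higher F-scores than exact matching.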

17.
Article in English | MEDLINE | ID: mdl-36776766

ABSTRACT

Biomedical ontologies provide formalized information and knowledge in the biomedical domain. Over the years, biomedical ontologies have played an important role in facilitating biomedical research and applications. Common quality issues of biomedical ontologies include inconsistent naming of concepts, redundant concepts, redundant relations, incomplete/incorrect concept definitions, and incomplete/incorrect class hierarchies. In this work, we focus on addressing the incompleteness of the class hierarchy in SNOMED CT. We develop a substring replacement approach, leveraging concepts' lexical features and existing IS-A relations to identify potential missing IS-A relations in SNOMED CT. To evaluate the effectiveness of our approach, we performed both automated and manual validation. For the automated evaluation, we leverage relations from external terminologies in the Unified Medical Language System (UMLS) to validate the identified missing IS-A relations. For the manual validation, 100 randomly selected samples from the results were reviewed by a domain expert. Applying our approach to the March 2022 release of SNOMED CT US Edition, we identified 3,228 potential missing IS-A relations, among which 63 were validated through the UMLS. The evaluation by the domain expert revealed that 89 out of 100 (a precision of 89%) missing IS-A relations are valid cases, showing the effectiveness of this substring replacement approach to facilitate the quality assurance of IS-A relations in SNOMED CT.

18.
Proc Int World Wide Web Conf ; 2021: 2672-2683, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34514472

ABSTRACT

With 214 source vocabularies, the construction and maintenance process of the UMLS (Unified Medical Language System) Metathesaurus terminology integration system is costly, time-consuming, and error-prone as it primarily relies on (1) lexical and semantic processing for suggesting groupings of synonymous terms, and (2) the expertise of UMLS editors for curating these synonymy predictions. This paper aims to improve the UMLS Metathesaurus construction process by developing a novel supervised learning approach for improving the task of suggesting synonymous pairs that can scale to the size and diversity of the UMLS source vocabularies. We evaluate this deep learning (DL) approach against a rule-based approach (RBA) that approximates the current UMLS Metathesaurus construction process. The key to the generalizability of our approach is the use of various degrees of lexical similarity in negative pairs during the training process. Our initial experiments demonstrate the strong performance across multiple datasets of our DL approach in terms of recall (91-92%), precision (88-99%), and F1 score (89-95%). Our DL approach largely outperforms the RBA method in recall (+23%), precision (+2.4%), and F1 score (+14.1%). This novel approach has great potential for improving the UMLS Metathesaurus construction process by providing better synonymy suggestions to the UMLS editors.

19.
Comput Struct Biotechnol J ; 19: 4559-4573, 2021.
Article in English | MEDLINE | ID: mdl-34471499

ABSTRACT

Drug repurposing has become a widely used strategy to accelerate the process of finding treatments. While classical de novo drug development involves high costs, risks, and time-consuming paths, drug repurposing allows already-existing and approved drugs to be reused for new indications. Numerous studies have been carried out in this field, both in vitro and in silico. Computational drug repurposing methods make use of modern heterogeneous biomedical data to identify and prioritize new indications for old drugs. In the current paper, we present a new complete methodology to evaluate new potentially repurposable drugs based on disease-gene and disease-phenotype associations, identifying significant differences between repurposing and non-repurposing data. We have collected a set of known successful drug repurposing case studies from the literature and we have analysed their dissimilarities with other biomedical data not necessarily participating in repurposing processes. The information used has been obtained from the DISNET platform. We have performed three analyses (at the genetic, phenotypic, and categorization levels), to conclude that there is a statistically significant difference between actual repurposing-related information and non-repurposing data. The insights obtained could be relevant when suggesting new potential drug repurposing hypotheses.

20.
Front Big Data ; 4: 701966, 2021.
Article in English | MEDLINE | ID: mdl-34458724

ABSTRACT

The entire scientific and academic community has been mobilized to gain a better understanding of the COVID-19 disease and its impact on humanity. Most research related to COVID-19 needs to analyze large amounts of data in very little time. This urgency has made Big Data Analysis, and related questions around the privacy and security of the data, an extremely important part of research in the COVID-19 era. The White House OSTP has, for example, released a large dataset of papers related to COVID research from which the research community can extract knowledge and information. We show an example system with a machine learning-based knowledge extractor which draws out key medical information from COVID-19 related academic research papers. We represent this knowledge in a Knowledge Graph that uses the Unified Medical Language System (UMLS). However, publicly available studies rely on datasets that might contain sensitive data. Extracting information from academic papers can potentially leak sensitive data, and protecting the security and privacy of this data is equally important. In this paper, we address the key challenges around the privacy and security of such information extraction and analysis systems. Policy regulations like HIPAA have updated the guidelines to access data, specifically, data related to COVID-19, securely. In the US, healthcare providers must also comply with the Office for Civil Rights (OCR) rules to protect data integrity in matters like plasma donation, media access to health care data, telehealth communications, etc. Privacy policies are typically short and unstructured HTML or PDF documents. We have created a framework to extract relevant knowledge from the health centers' policy documents and also represent these as a knowledge graph. Our framework helps to understand the extent to which individual provider policies comply with regulations and define access control policies that enforce the regulation rules on data in the knowledge graph extracted from COVID-related papers. Along with being compliant, privacy policies must also be transparent and easily understood by the clients. We analyze the relative readability of healthcare privacy policies and discuss the impact. In this paper, we develop a framework for access control decisions that uses policy compliance information to securely retrieve COVID data. We show how policy compliance information can be used to restrict access to COVID-19 data and information extracted from research papers.
