Results 1 - 20 of 144
1.
Brief Bioinform ; 17(5): 819-30, 2016 09.
Article in English | MEDLINE | ID: mdl-26420780

ABSTRACT

Phenotypes have gained increased prominence in the clinical and biological domains owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and its storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state of the art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we outline visions for future research in this field that could enable better applications of phenotype data.


Subjects
Phenotype, Humans, Information Storage and Retrieval, Research Design, Translational Biomedical Research
2.
J Biomed Inform ; 78: 177-184, 2018 02.
Article in English | MEDLINE | ID: mdl-29274386

ABSTRACT

OBJECTIVE: We introduce a structural-lexical approach for auditing SNOMED CT using a combination of non-lattice subgraphs of the underlying hierarchical relations and enriched lexical attributes of fully specified concept names. Our goal is to develop a scalable and effective approach that automatically identifies missing hierarchical IS-A relations. METHODS: Our approach involves 3 stages. In stage 1, all non-lattice subgraphs of SNOMED CT's IS-A hierarchical relations are extracted. In stage 2, lexical attributes of fully specified concept names in such non-lattice subgraphs are extracted. For each concept in a non-lattice subgraph, we enrich its set of attributes with attributes from its ancestor concepts within the non-lattice subgraph. In stage 3, subset inclusion relations between the lexical attribute sets of each pair of concepts in each non-lattice subgraph are compared to existing IS-A relations in SNOMED CT. For concept pairs within each non-lattice subgraph, if a subset relation is identified but an IS-A relation is not present in the SNOMED CT IS-A transitive closure, then a missing IS-A relation is reported. The September 2017 release of SNOMED CT (US edition) was used in this investigation. RESULTS: A total of 14,380 non-lattice subgraphs were extracted, from which we suggested a total of 41,357 missing IS-A relations. For evaluation purposes, 200 non-lattice subgraphs were randomly selected from 996 smaller subgraphs (of size 4, 5, or 6) within the "Clinical Finding" and "Procedure" sub-hierarchies. Two domain experts confirmed 185 of the 223 suggested missing IS-A relations, a precision of 82.96%. CONCLUSIONS: Our results demonstrate that analyzing the lexical features of concepts in non-lattice subgraphs is an effective approach for auditing SNOMED CT.


Subjects
Biological Ontologies, Data Mining/methods, Health Care Quality Assurance/standards, Systematized Nomenclature of Medicine, Algorithms, Electronic Health Records, Humans, Medical Audit, Semantics
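The stage-3 subset check described above can be sketched as follows. The toy concepts, attribute sets, and IS-A pairs are hypothetical illustrations, not data from the study.

```python
def missing_isa_candidates(attrs, isa_closure):
    """attrs: concept -> set of lexical attributes (already enriched with
    ancestor attributes); isa_closure: set of (child, parent) pairs in the
    IS-A transitive closure of the non-lattice subgraph."""
    candidates = []
    concepts = list(attrs)
    for a in concepts:
        for b in concepts:
            if a == b:
                continue
            # A proper subset of attributes suggests b is more general than a;
            # if no IS-A relation exists, report it as potentially missing.
            if attrs[b] < attrs[a] and (a, b) not in isa_closure:
                candidates.append((a, b))
    return candidates

# Hypothetical non-lattice subgraph: "nose bleeding" should be-a "bleeding"
attrs = {
    "bleeding": {"bleeding"},
    "nose bleeding": {"bleeding", "nose"},
    "nose disorder": {"nose", "disorder"},
}
isa = {("nose bleeding", "nose disorder")}
candidates = missing_isa_candidates(attrs, isa)
```

On this toy subgraph the check reports the single missing relation between "nose bleeding" and "bleeding".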
3.
J Biomed Inform ; 76: 41-49, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29081385

ABSTRACT

OBJECTIVE: Improving mechanisms to detect adverse drug reactions (ADRs) is key to strengthening post-marketing drug safety surveillance. Signal detection is presently unimodal, relying on a single information source. Multimodal signal detection is based on jointly analyzing multiple information sources. Building on and expanding the work done in prior studies, the aim of this article is to further research on multimodal signal detection, explore its potential benefits, and propose methods for its construction and evaluation. MATERIAL AND METHODS: Four data sources are investigated: FDA's adverse event reporting system, insurance claims, the MEDLINE citation database, and the logs of major Web search engines. Published methods are used to generate and combine signals from each data source. Two distinct reference benchmarks, corresponding to well-established and recently labeled ADRs respectively, are used to evaluate the performance of multimodal signal detection in terms of area under the ROC curve (AUC) and lead time to detection, with the latter relative to labeling revision dates. RESULTS: Limited to our reference benchmarks, multimodal signal detection provides AUC improvements ranging from 0.04 to 0.09 based on a widely used evaluation benchmark, and a comparative added lead time of 7-22 months relative to labeling revision dates from a time-indexed benchmark. CONCLUSIONS: The results support the notion that utilizing and jointly analyzing multiple data sources may lead to improved signal detection. Given certain data and benchmark limitations, the early stage of development, and the complexity of ADRs, it is currently not possible to make definitive statements about the ultimate utility of the concept. Continued development of multimodal signal detection requires a deeper understanding of the data sources used, additional benchmarks, and further research on methods to generate and synthesize signals.


Subjects
Adverse Drug Reaction Reporting Systems, Factual Databases, Humans, United States, United States Food and Drug Administration
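One simple way to realize the joint analysis described above is to average the per-source signal scores available for each drug-event pair. This is a hypothetical sketch; the source names and scores are invented, and the paper's actual combination methods come from published work.

```python
def combine_scores(per_source):
    """per_source: (drug, event) -> {source_name: score}. Returns the mean
    of whichever sources reported a score for that pair."""
    return {
        pair: sum(scores.values()) / len(scores)
        for pair, scores in per_source.items()
        if scores
    }

# Toy scores from four hypothetical unimodal detectors
signals = {
    ("drugX", "liver injury"): {"faers": 0.9, "claims": 0.7, "medline": 0.8},
    ("drugY", "rash"): {"faers": 0.4, "web_search": 0.6},
}
combined = combine_scores(signals)
```

A pair supported by several sources keeps its high score, while a pair flagged strongly by only one source is tempered by the others.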
4.
J Biomed Inform ; 57: 425-35, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26342964

ABSTRACT

BACKGROUND: Traditional approaches to pharmacovigilance center on signal detection from spontaneous reports, e.g., the U.S. Food and Drug Administration (FDA) adverse event reporting system (FAERS). In order to enrich the scientific evidence and enhance the detection of emerging adverse drug events that can lead to unintended harmful outcomes, pharmacovigilance activities need to evolve to encompass novel complementary data streams, for example the biomedical literature available through MEDLINE. OBJECTIVES: (1) To review how the characteristics of MEDLINE indexing influence the identification of adverse drug events (ADEs); (2) to leverage this knowledge to inform the design of a system for extracting ADEs from MEDLINE indexing; and (3) to assess the specific contribution of some characteristics of MEDLINE indexing to the performance of this system. METHODS: We analyze the characteristics of MEDLINE indexing. We integrate three specific characteristics into the design of a system for extracting ADEs from MEDLINE indexing. We experimentally assess the specific contribution of these characteristics over a baseline system based on co-occurrence between drug descriptors qualified by "adverse effects" and disease descriptors qualified by "chemically induced". RESULTS: Our system extracted 405,300 ADEs from 366,120 MEDLINE articles. The baseline system accounts for 297,093 ADEs (73%). 85,318 ADEs (21%) can be extracted only after integrating specific pre-coordinated MeSH descriptors and additional qualifiers. 22,889 ADEs (6%) can be extracted only after considering indirect links between the drug of interest and the descriptor that bears the ADE context. CONCLUSIONS: In this paper, we demonstrate significant improvement over a baseline approach to identifying ADEs from MEDLINE indexing, which mitigates some of the inherent limitations of MEDLINE indexing for pharmacovigilance. ADEs extracted from MEDLINE indexing are complementary to, not a replacement for, other sources.


Subjects
Drug-Related Side Effects and Adverse Reactions, MEDLINE, Medical Subject Headings, Pharmacovigilance, Adverse Drug Reaction Reporting Systems, Data Mining, Humans, Information Storage and Retrieval, United States, United States Food and Drug Administration
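The baseline co-occurrence rule can be sketched as below. The qualified descriptor pairs are hypothetical toy tuples; real MEDLINE indexing is accessed through MeSH descriptor/qualifier fields.

```python
def baseline_ades(indexing):
    """indexing: pmid -> list of (descriptor, qualifier) pairs.
    Flags a candidate ADE when a drug descriptor qualified by
    'adverse effects' co-occurs with a disease descriptor qualified by
    'chemically induced' in the same article's indexing."""
    ades = set()
    for pmid, terms in indexing.items():
        drugs = [d for d, q in terms if q == "adverse effects"]
        diseases = [d for d, q in terms if q == "chemically induced"]
        ades.update((drug, disease) for drug in drugs for disease in diseases)
    return ades

# Hypothetical indexing for one article
indexing = {
    11111: [("Rosiglitazone", "adverse effects"),
            ("Myocardial Infarction", "chemically induced"),
            ("Diabetes Mellitus", "drug therapy")],
}
candidates = baseline_ades(indexing)
```

The "drug therapy" pairing is ignored, which is exactly why the paper's extensions (pre-coordinated descriptors, indirect links) recover ADEs this baseline misses.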
5.
J Biomed Inform ; 54: 141-57, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25661592

ABSTRACT

BACKGROUND: Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting scientific literature. Prior approaches to LBD include use of: (1) domain expertise and structured background knowledge to manually filter and explore the literature, (2) distributional statistics and graph-theoretic measures to rank interesting connections, and (3) heuristics to help eliminate spurious connections. However, manual approaches to LBD are not scalable and purely distributional approaches may not be sufficient to obtain insights into the meaning of poorly understood associations. While several graph-based approaches have the potential to elucidate associations, their effectiveness has not been fully demonstrated. A considerable degree of a priori knowledge, heuristics, and manual filtering is still required. OBJECTIVES: In this paper we implement and evaluate a context-driven, automatic subgraph creation method that captures multifaceted complex associations between biomedical concepts to facilitate LBD. Given a pair of concepts, our method automatically generates a ranked list of subgraphs, which provide informative and potentially unknown associations between such concepts. METHODS: To generate subgraphs, the set of all MEDLINE articles that contain either of the two specified concepts (A, C) are first collected. Then binary relationships or assertions, which are automatically extracted from the MEDLINE articles, called semantic predications, are used to create a labeled directed predications graph. In this predications graph, a path is represented as a sequence of semantic predications. The hierarchical agglomerative clustering (HAC) algorithm is then applied to cluster paths that are bounded by the two concepts (A, C). HAC relies on implicit semantics captured through Medical Subject Heading (MeSH) descriptors, and explicit semantics from the MeSH hierarchy, for clustering. 
Paths that exceed a threshold of semantic relatedness are clustered into subgraphs based on their shared context. Finally, the automatically generated clusters are provided as a ranked list of subgraphs. RESULTS: The subgraphs generated using this approach facilitated the rediscovery of 8 out of 9 existing scientific discoveries. In particular, they directly (or indirectly) led to the recovery of several intermediates (or B-concepts) between A- and C-terms, while also providing insights into the meaning of the associations. Such meaning is derived from predicates between the concepts, as well as the provenance of the semantic predications in MEDLINE. Additionally, by generating subgraphs on different thematic dimensions (such as Cellular Activity, Pharmaceutical Treatment and Tissue Function), the approach may enable a broader understanding of the nature of complex associations between concepts. Finally, in a statistical evaluation to determine the interestingness of the subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE on average. CONCLUSION: These results suggest that leveraging the implicit and explicit semantics provided by manually assigned MeSH descriptors is an effective representation for capturing the underlying context of complex associations, along multiple thematic dimensions in LBD situations.


Subjects
Cluster Analysis, Data Mining/methods, Knowledge Discovery/methods, Algorithms, Factual Databases, Humans, Medical Subject Headings, Theoretical Models, Semantics
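A simplified stand-in for the clustering step can be sketched by representing each path as the set of MeSH descriptors along it and grouping paths whose overlap exceeds a threshold. This greedy single-linkage grouping and the Jaccard measure are illustrative substitutes for the paper's HAC with MeSH-hierarchy semantics; the toy paths are invented.

```python
def jaccard(a, b):
    """Set overlap between two descriptor sets."""
    return len(a & b) / len(a | b)

def cluster_paths(paths, threshold=0.5):
    """paths: list of sets of MeSH descriptors for A-to-C paths.
    A path joins the first cluster containing any sufficiently
    similar member; otherwise it starts a new cluster (subgraph)."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            if any(jaccard(path, member) >= threshold for member in cluster):
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return clusters

# Toy descriptor sets for three A-to-C paths
paths = [
    {"Fish Oils", "Blood Viscosity"},
    {"Fish Oils", "Blood Viscosity", "Platelet Aggregation"},
    {"Fish Oils", "Vascular Diseases"},
]
clusters = cluster_paths(paths)
```

The first two paths share enough context to fall into one subgraph, while the third forms its own, mirroring how shared context drives subgraph formation.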
6.
J Biomed Inform ; 46(2): 238-51, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23026233

ABSTRACT

OBJECTIVES: This paper presents a methodology for recovering and decomposing Swanson's Raynaud Syndrome-Fish Oil hypothesis semi-automatically. The methodology leverages the semantics of assertions extracted from biomedical literature (called semantic predications) along with structured background knowledge and graph-based algorithms to semi-automatically capture the informative associations originally discovered manually by Swanson. Demonstrating that Swanson's manually intensive techniques can be undertaken semi-automatically paves the way for fully automatic semantics-based hypothesis generation from scientific literature. METHODS: Semantic predications obtained from biomedical literature allow the construction of labeled directed graphs which contain various associations among concepts from the literature. By aggregating such associations into informative subgraphs, some of the relevant details originally articulated by Swanson have been uncovered. By further leveraging background knowledge to bridge important knowledge gaps in the literature, we developed a methodology for semi-automatically capturing the detailed associations originally explicated in natural language by Swanson. RESULTS: Our methodology not only recovered the three associations commonly recognized as Swanson's hypothesis, but also decomposed them into an additional 16 detailed associations, formulated as chains of semantic predications. Altogether, 14 out of the 19 associations that can be attributed to Swanson were retrieved using our approach. To the best of our knowledge, such an in-depth recovery and decomposition of Swanson's hypothesis has never been attempted. CONCLUSION: In this work, therefore, we presented a methodology to semi-automatically recover and decompose Swanson's RS-DFO hypothesis using semantic representations and graph algorithms. Our methodology provides new insights into potential prerequisites for semantics-driven Literature-Based Discovery (LBD). 
Based on our observations, three critical aspects of LBD include: (1) the need for more expressive representations beyond Swanson's ABC model; (2) an ability to accurately extract semantic information from text; and (3) the semantic integration of scientific literature and structured background knowledge.


Subjects
Computational Biology/methods, Data Mining/methods, Knowledge Discovery/methods, Theoretical Models, Semantics, Blood Viscosity, Computational Biology/trends, Data Mining/trends, Humans, Platelet Aggregation, Raynaud Disease
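Swanson's ABC pattern, which this methodology extends with chains of semantic predications, can be sketched over subject-predicate-object triples. The triples below are illustrative stand-ins, not predications extracted in the study.

```python
def abc_intermediates(predications, a, c):
    """Find B-concepts linked from A and to C via any predicate,
    i.e., candidates for the middle of an A-B-C chain."""
    from_a = {obj for subj, pred, obj in predications if subj == a}
    to_c = {subj for subj, pred, obj in predications if obj == c}
    return from_a & to_c

# Toy semantic predications (subject, predicate, object)
triples = [
    ("Dietary Fish Oils", "REDUCES", "Blood Viscosity"),
    ("Blood Viscosity", "ASSOCIATED_WITH", "Raynaud Syndrome"),
    ("Dietary Fish Oils", "INHIBITS", "Platelet Aggregation"),
]
bs = abc_intermediates(triples, "Dietary Fish Oils", "Raynaud Syndrome")
```

The paper's point is precisely that this flat ABC view is too coarse: the recovered associations are chains of labeled predications, not bare intermediates.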
7.
J Am Med Inform Assoc ; 30(12): 1887-1894, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37528056

ABSTRACT

OBJECTIVE: Use heuristic, deep learning (DL), and hybrid AI methods to predict semantic group (SG) assignments for new UMLS Metathesaurus atoms, with target accuracy ≥95%. MATERIALS AND METHODS: We used train-test datasets from successive 2020AA-2022AB UMLS Metathesaurus releases. Our heuristic "waterfall" approach employed a sequence of 7 different SG prediction methods. Atoms not qualifying for a method were passed on to the next method. The DL approach generated BioWordVec and SapBERT embeddings for atom names, BioWordVec embeddings for source vocabulary names, and BioWordVec embeddings for atom names of the second-to-top nodes of an atom's source hierarchy. We fed a concatenation of the 4 embeddings into a fully connected multilayer neural network with an output layer of 15 nodes (one for each SG). For both approaches, we developed methods to estimate the probability that their predicted SG for an atom would be correct. Based on these estimations, we developed 2 hybrid SG prediction methods combining the strengths of heuristic and DL methods. RESULTS: The heuristic waterfall approach accurately predicted 94.3% of SGs for 1,563,692 new unseen atoms. The DL accuracy on the same dataset was also 94.3%. The hybrid approaches achieved an average accuracy of 96.5%. CONCLUSION: Our study demonstrated that AI methods can predict SG assignments for new UMLS atoms with sufficient accuracy to be potentially useful as an intermediate step in the time-consuming task of assigning new atoms to UMLS concepts. We showed that for SG prediction, combining heuristic methods and DL methods can produce better results than either alone.


Subjects
Deep Learning, Heuristics, Semantics, Unified Medical Language System, Neural Networks (Computer)
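The "waterfall" control flow can be sketched generically: each prediction method either returns a semantic group or passes the atom on. The two toy predictors and their rules below are placeholders, not the seven methods used in the study.

```python
def waterfall_predict(atom, methods, default="UNASSIGNED"):
    """Try each SG prediction method in order; an atom that does not
    qualify for a method (method returns None) falls through to the next."""
    for method in methods:
        group = method(atom)
        if group is not None:
            return group
    return default

def by_source_vocab(atom):
    # Hypothetical rule: atoms from a drug vocabulary map to CHEM
    return "CHEM" if atom.get("source") == "RXNORM" else None

def by_keyword(atom):
    # Hypothetical rule: obvious disorder names map to DISO
    return "DISO" if "disease" in atom["name"].lower() else None

atom = {"name": "Addison disease", "source": "SNOMEDCT_US"}
predicted = waterfall_predict(atom, [by_source_vocab, by_keyword])
```

An atom that no method claims falls out the bottom with the default label, which is where the hybrid step can hand it to the DL model instead.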
8.
J Am Med Inform Assoc ; 30(10): 1614-1621, 2023 09 25.
Article in English | MEDLINE | ID: mdl-37407272

ABSTRACT

OBJECTIVE: The aim of this study was to derive and evaluate a practical strategy of replacing ICD-10-CM codes by ICD-11 for morbidity coding in the United States, without the creation of a Clinical Modification. MATERIALS AND METHODS: A stepwise strategy is described, using first the ICD-11 stem codes from the Mortality and Morbidity Statistics (MMS) linearization, followed by exposing Foundation entities, then adding postcoordination (with existing codes and adding new stem codes if necessary), with creating new stem codes as the last resort. The strategy was evaluated by recoding 2 samples of ICD-10-CM codes comprised of frequently used codes and all codes from the digestive diseases chapter. RESULTS: Among the 1725 ICD-10-CM codes examined, the cumulative coverage at the stem code, Foundation, and postcoordination levels is 35.2%, 46.5%, and 89.4%, respectively. 7.1% of codes require new extension codes and 3.5% require new stem codes. Among the new extension codes, severity scale values and anatomy are the most common categories. 5.5% of codes are not one-to-one matches (1 ICD-10-CM code matched to 1 ICD-11 stem code or Foundation entity), which could potentially be challenging. CONCLUSION: Existing ICD-11 content can achieve full representation of almost 90% of ICD-10-CM codes, provided that postcoordination can be used and the coding guidelines and hierarchical structures of ICD-10-CM and ICD-11 can be harmonized. The various options examined in this study should be carefully considered before embarking on the traditional approach of a full-fledged ICD-11-CM.


Subjects
Clinical Coding, International Classification of Diseases, United States, Morbidity
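The stepwise fallback can be sketched as successive lookups, stopping at the first level that covers a code. All maps and code values here are hypothetical placeholders, not actual ICD-10-CM/ICD-11 map entries.

```python
def map_code(icd10cm, stem_map, foundation_map, postcoord_map):
    """Return (level, target) for the first level covering the ICD-10-CM
    code: MMS stem code, then Foundation entity, then a postcoordinated
    expression; otherwise flag that new content would be needed."""
    if icd10cm in stem_map:
        return "stem", stem_map[icd10cm]
    if icd10cm in foundation_map:
        return "foundation", foundation_map[icd10cm]
    if icd10cm in postcoord_map:
        return "postcoordination", postcoord_map[icd10cm]
    return "new-content", None

# Hypothetical map fragments (codes are illustrative only)
stem = {"K35.80": "DB10.0"}
foundation = {"K50.011": "entity:123456"}
post = {"K50.012": "DD70.0&XA9780"}
```

Tallying the level at which each code resolves reproduces the kind of cumulative coverage figures reported above.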
9.
J Am Med Inform Assoc ; 30(3): 475-484, 2023 02 16.
Article in English | MEDLINE | ID: mdl-36539234

ABSTRACT

OBJECTIVE: SNOMED CT is the largest clinical terminology worldwide. Quality assurance of SNOMED CT is of utmost importance to ensure that it provides accurate domain knowledge to various SNOMED CT-based applications. In this work, we introduce a deep learning-based approach to uncover missing is-a relations in SNOMED CT. MATERIALS AND METHODS: Our focus is to identify missing is-a relations between concept-pairs exhibiting a containment pattern (ie, the set of words of one concept being a proper subset of that of the other concept). We use hierarchically related containment concept-pairs as positive instances and hierarchically unrelated containment concept-pairs as negative instances to train a model predicting whether an is-a relation exists between 2 concepts exhibiting the containment pattern. The model is a binary classifier leveraging concept name features, hierarchical features, enriched lexical attribute features, and logical definition features. We introduce a cross-validation inspired approach to identify missing is-a relations among all hierarchically unrelated containment concept-pairs. RESULTS: We trained and applied our model on the Clinical finding subhierarchy of SNOMED CT (September 2019 US edition). Our model (based on the validation sets) achieved a precision of 0.8164, recall of 0.8397, and F1 score of 0.8279. Applying the model to predict actual missing is-a relations, we obtained a total of 1661 potential candidates. Domain experts evaluated 230 randomly selected samples and verified that 192 (83.48%) are valid. CONCLUSIONS: The results showed that our deep learning approach is effective in uncovering missing is-a relations between containment concept-pairs in SNOMED CT.


Subjects
Deep Learning, Systematized Nomenclature of Medicine
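The containment pattern that defines the candidate pairs can be sketched as a word-set check over concept names. The concept names below are toy examples, not SNOMED CT concepts.

```python
def containment_pairs(names):
    """Yield (specific, general) pairs where the word set of one concept
    name is a proper subset of the other's; these are the candidate
    pairs the classifier is trained and applied on."""
    words = {n: set(n.lower().split()) for n in names}
    return [
        (a, b)
        for a in names for b in names
        if a != b and words[b] < words[a]
    ]

names = ["disorder of lung", "disorder", "lung"]
pairs = containment_pairs(names)
```

The classifier then decides, per pair, whether the containment actually reflects an is-a relation, since word containment alone is a noisy signal.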
10.
Bioinformatics ; 27(3): 408-15, 2011 Feb 01.
Article in English | MEDLINE | ID: mdl-21138947

ABSTRACT

MOTIVATION: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations. RESULTS: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases. DISCUSSION: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles. AVAILABILITY: Freely available at: http://bioinf.umbc.edu/EMU/ftp.


Subjects
Algorithms, Computational Biology/methods, Factual Databases, Information Storage and Retrieval/methods, Point Mutation/genetics, Publications, Humans, Neoplasms/genetics, Reproducibility of Results, Software
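Point-mutation mentions in the compact wild-type/position/mutant form (e.g., R273H) can be matched with a simple pattern. This is only an illustrative sketch; EMU's actual extraction, its handling of other mutation formats, and its sequence-based precision filter are considerably more involved.

```python
import re

# One-letter amino-acid codes; matches mentions like "R273H"
AA = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AA}])(\d+)([{AA}])\b")

def find_point_mutations(text):
    """Return (wild-type residue, position, mutant residue) triples."""
    return [(wt, int(pos), mut) for wt, pos, mut in POINT_MUTATION.findall(text)]

sentence = "The R273H substitution in TP53 was frequent in this cohort."
mutations = find_point_mutations(sentence)
```

Patterns like this over-generate (gene symbols and cell-line names can match), which is exactly why the sequence-analysis filter described above raises precision so sharply.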
11.
J Biomed Inform ; 45(5): 835-41, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22683993

ABSTRACT

OBJECTIVES: To explore the notion of mutation-centric pharmacogenomic relation extraction and to evaluate our approach against reference pharmacogenomic relations. METHODS: From a corpus of MEDLINE abstracts relevant to genetic variation, we identify co-occurrences between drug mentions extracted using MetaMap and RxNorm, and genetic variants extracted by EMU. The recall of our approach is evaluated against reference relations curated manually in PharmGKB. We also reviewed a random sample of 180 relations in order to evaluate its precision. RESULTS: One crucial aspect of our strategy is the use of biological knowledge for identifying specific genetic variants in text, not simply gene mentions. On the 104 reference abstracts from PharmGKB, the recall of our mutation-centric approach is 33-46%. Applied to 282,000 abstracts from MEDLINE, our approach identifies pharmacogenomic relations in 4534 abstracts, with a precision of 65%. CONCLUSIONS: Compared to a relation-centric approach, our mutation-centric approach shows similar recall, but slightly lower precision. We show that both approaches have limited overlap in their results, but are complementary and can be used in combination. Rather than a solution for the automatic curation of pharmacogenomic knowledge, we see these high-throughput approaches as tools to assist biocurators in the identification of pharmacogenomic relations of interest from the published literature. This investigation also identified three challenging aspects of the extraction of pharmacogenomic relations, namely processing full-text articles, sequence validation of DNA variants and resolution of genetic variants to reference databases, such as dbSNP.


Subjects
Data Mining/methods, Genetic Databases, Mutation, Pharmacogenetics/methods, Humans, Knowledge Bases, MEDLINE
12.
Stud Health Technol Inform ; 290: 116-119, 2022 Jun 06.
Article in English | MEDLINE | ID: mdl-35672982

ABSTRACT

BACKGROUND: Terminology integration at the scale of the UMLS Metathesaurus (i.e., over 200 source vocabularies) remains challenging despite recent advances in ontology alignment techniques based on neural networks. OBJECTIVES: To improve the performance of the neural network architecture we developed for predicting synonymy between terms in the UMLS Metathesaurus, specifically through the addition of an attention layer. METHODS: We modify our original Siamese neural network architecture with Long Short-Term Memory (LSTM) and create two variants by (1) adding an attention layer on top of the existing LSTM, and (2) replacing the existing LSTM layer by an attention layer. RESULTS: Adding an attention layer to the LSTM layer resulted in increasing precision to 92.38% (+3.63%) and F1 score to 91.74% (+1.13%), with limited impact on recall at 91.12% (-1.42%). CONCLUSIONS: Although limited, this increase in precision substantially reduces the false positive rate and minimizes the need for manual curation.


Subjects
Neural Networks (Computer), Unified Medical Language System, Attention
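The effect of placing an attention layer over the LSTM outputs can be illustrated with a minimal softmax-weighted pooling in NumPy. The scoring vector here is a stand-in for trained attention parameters, and the whole sketch is an assumption about the mechanism, not the paper's implementation.

```python
import numpy as np

def attention_pool(hidden_states, score_vector):
    """hidden_states: (timesteps, dim) LSTM outputs; score_vector: (dim,)
    stand-in for learned attention parameters. Returns the attention-
    weighted summary vector and the attention weights."""
    scores = hidden_states @ score_vector
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ hidden_states, weights

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.zeros(2)  # equal scores -> uniform attention over timesteps
pooled, w = attention_pool(h, v)
```

With a trained scoring vector the weights concentrate on informative tokens instead of averaging uniformly, which is the intuition behind the precision gain reported above.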
13.
Stud Health Technol Inform ; 290: 96-100, 2022 Jun 06.
Article in English | MEDLINE | ID: mdl-35672978

ABSTRACT

BACKGROUND: ICD-11 will be used to report mortality statistics by WHO member countries starting in 2022. In the US, ICD-10-CM will likely continue to be used for morbidity coding for a long period of time. A map between ICD-10-CM and ICD-11 will therefore be useful for interoperability purposes between datasets coded with ICD-10-CM and ICD-11. OBJECTIVES: The objective of this study is to explore novel approaches to automatically derive a map between ICD-10-CM and ICD-11 through the sequential use of existing maps. METHODS AND RESULTS: Sequential mapping through ICD-10 yielded better coverage and accuracy compared to mapping through SNOMED CT. CONCLUSIONS: Sequential mapping is useful in automatically creating a draft map from ICD-10-CM to ICD-11 and would reduce manual curation efforts in creating the final map. The various approaches offer different trade-offs among coverage, recall and precision.


Subjects
International Classification of Diseases, Systematized Nomenclature of Medicine
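Sequential mapping amounts to composing two lookup tables. The code pairs below are hypothetical placeholders, not entries from the actual ICD maps.

```python
def compose_maps(first, second):
    """Compose code maps: follow first (e.g., ICD-10-CM -> ICD-10), then
    second (e.g., ICD-10 -> ICD-11). Codes whose intermediate target is
    missing from the second map are left unmapped, which is one source
    of the coverage differences between pivot terminologies."""
    return {
        src: second[mid]
        for src, mid in first.items()
        if mid in second
    }

# Hypothetical map fragments
icd10cm_to_icd10 = {"E11.9": "E11", "K35.80": "K35"}
icd10_to_icd11 = {"E11": "5A11"}
draft = compose_maps(icd10cm_to_icd10, icd10_to_icd11)
```

Swapping in an ICD-10-CM-to-SNOMED CT map as `first` gives the alternative pivot the study compares against.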
14.
Proc Int World Wide Web Conf ; 2022: 1037-1046, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36108322

ABSTRACT

The Unified Medical Language System (UMLS) Metathesaurus construction process mainly relies on lexical algorithms and manual expert curation for integrating over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms and largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM has the potential for being improved further because it only uses lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem. In this paper, we develop multiple variants of context-enriched learning models (ConLMs) by adding to the LexLM the types of contextual information listed above. We represent these context types in context-enriched knowledge graphs (ConKGs) with four variants ConSS, ConSG, ConHR, and ConAll. We train these ConKG embeddings using seven KG embedding techniques. We create the ConLMs by concatenating the ConKG embedding vectors with the word embedding vectors from the LexLM. We evaluate the performance of the ConLMs using the UVA generalization test datasets with hundreds of millions of pairs. Our extensive experiments show a significant performance improvement from the ConLMs over the LexLM, namely +5.0% in precision (93.75%), +0.69% in recall (93.23%), +2.88% in F1 (93.49%) for the best ConLM. Our experiments also show that the ConAll variant including the three context types takes more time, but does not always perform better than other variants with a single context type. 
Finally, our experiments show that the pairs of terms with high lexical similarity benefit most from adding contextual information, namely +6.56% in precision (94.97%), +2.13% in recall (93.23%), +4.35% in F1 (94.09%) for the best ConLM. The pairs with lower degrees of lexical similarity also show performance improvement with +0.85% in F1 (96%) for low similarity and +1.31% in F1 (96.34%) for no similarity. These results demonstrate the importance of using contextual information in the UVA problem.
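Building a ConLM input for a term amounts to concatenating its context-enriched knowledge-graph embedding with its LexLM word embedding. Dimensions and values below are toy stand-ins, not the embeddings used in the paper.

```python
import numpy as np

def conlm_input(term, word_emb, kg_emb):
    """Concatenate the lexical (word) embedding with the context-enriched
    knowledge-graph (ConKG) embedding for one term; pairs of such vectors
    feed the synonymy classifier."""
    return np.concatenate([word_emb[term], kg_emb[term]])

# Toy embedding tables
word_emb = {"myocardial infarction": np.array([0.1, 0.2])}
kg_emb = {"myocardial infarction": np.array([0.9])}
vec = conlm_input("myocardial infarction", word_emb, kg_emb)
```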

15.
Article in English | MEDLINE | ID: mdl-36093038

ABSTRACT

Recent work uses a Siamese Network, initialized with BioWordVec embeddings (distributed word embeddings), for predicting synonymy among biomedical terms to automate a part of the UMLS (Unified Medical Language System) Metathesaurus construction process. We evaluate the use of contextualized word embeddings extracted from nine different biomedical BERT-based models for synonymy prediction in the UMLS by replacing BioWordVec embeddings with embeddings extracted from each biomedical BERT model using different feature extraction methods. Surprisingly, we find that Siamese Networks initialized with BioWordVec embeddings still outperform Siamese Networks initialized with embeddings extracted from the biomedical BERT models.

16.
BMC Bioinformatics ; 12: 461, 2011 Nov 29.
Article in English | MEDLINE | ID: mdl-22126369

ABSTRACT

BACKGROUND: A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to the patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate scientific process, and associate trust value with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle and they do not incorporate "domain semantics", which is essential to support domain-specific querying and analysis by scientists. RESULTS: We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata: (a) provenance collection, during data generation; (b) provenance representation, to support interoperability and reasoning, and to incorporate domain semantics; (c) provenance storage and propagation, to allow efficient storage and seamless propagation of provenance as the data is transferred across applications; and (d) provenance query, to support queries of increasing complexity over large data sizes and to support knowledge discovery applications. We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T.cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness. 
CONCLUSIONS: The SPF provides a unified framework to effectively manage provenance of translational research data during pre and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir that is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis.
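The collection and query stages of a provenance framework can be sketched in a few lines. This is a toy illustration only, with invented class and field names (not from Provenir or the SPF), showing how derivation links let a query walk a result back to its sources:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceRecord:
    entity: str                                  # the data resource
    generated_by: str                            # process that produced it
    derived_from: List[str] = field(default_factory=list)

class ProvenanceStore:
    def __init__(self):
        self.records = {}

    def collect(self, record):                   # stage (a): collection
        self.records[record.entity] = record

    def lineage(self, entity):                   # stage (d): query
        """Walk derived_from links back to the original sources."""
        rec = self.records.get(entity)
        out = [entity]
        if rec is not None:
            for src in rec.derived_from:
                out.extend(self.lineage(src))
        return out

store = ProvenanceStore()
store.collect(ProvenanceRecord("raw_assay", "mass_spectrometer"))
store.collect(ProvenanceRecord("normalized", "normalize_script", ["raw_assay"]))
store.collect(ProvenanceRecord("published_result", "analysis_pipeline", ["normalized"]))
lineage = store.lineage("published_result")
```

In the SPF itself, provenance is represented as ontology instances (RDF/OWL) so that reasoning and SPARQL-style querying apply; the dictionary here merely mimics that graph structure.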


Assuntos
Bases de Dados Factuais , Armazenamento e Recuperação da Informação , Pesquisa Translacional Biomédica , Doença de Chagas/parasitologia , Humanos , Bases de Conhecimento , Publicações Periódicas como Assunto , Semântica , Trypanosoma cruzi/genética
17.
Proc Int World Wide Web Conf ; 2021: 2672-2683, 2021 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-34514472

RESUMO

With 214 source vocabularies, the construction and maintenance process of the UMLS (Unified Medical Language System) Metathesaurus terminology integration system is costly, time-consuming, and error-prone as it primarily relies on (1) lexical and semantic processing for suggesting groupings of synonymous terms, and (2) the expertise of UMLS editors for curating these synonymy predictions. This paper aims to improve the UMLS Metathesaurus construction process by developing a novel supervised learning approach for improving the task of suggesting synonymous pairs that can scale to the size and diversity of the UMLS source vocabularies. We evaluate this deep learning (DL) approach against a rule-based approach (RBA) that approximates the current UMLS Metathesaurus construction process. The key to the generalizability of our approach is the use of various degrees of lexical similarity in negative pairs during the training process. Our initial experiments demonstrate the strong performance across multiple datasets of our DL approach in terms of recall (91-92%), precision (88-99%), and F1 score (89-95%). Our DL approach largely outperforms the RBA method in recall (+23%), precision (+2.4%), and F1 score (+14.1%). This novel approach has great potential for improving the UMLS Metathesaurus construction process by providing better synonymy suggestions to the UMLS editors.
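The abstract identifies negative pairs of varying lexical similarity as the key to generalizability. A small sketch of that sampling idea, using character-trigram Jaccard similarity as one possible lexical measure (the synonym sets, terms, and the 0.25 threshold are invented for illustration):

```python
from itertools import combinations

# Hypothetical toy synonym sets standing in for UMLS concepts.
concepts = {
    "C1": ["myocardial infarction", "heart attack"],
    "C2": ["myocardial ischemia", "cardiac ischemia"],
    "C3": ["bone fracture", "broken bone"],
}

def trigrams(s):
    s = s.replace(" ", "_")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def lexical_similarity(a, b):
    """Jaccard overlap of character trigrams (one possible similarity measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# Positives: pairs of terms within the same concept.
positives = [(a, b) for terms in concepts.values()
             for a, b in combinations(terms, 2)]

# Negatives: cross-concept pairs, binned so training sees both "easy"
# (lexically distant) and "hard" (lexically similar) non-synonyms.
negatives = {"easy": [], "hard": []}
flat = [(c, t) for c, ts in concepts.items() for t in ts]
for (c1, t1), (c2, t2) in combinations(flat, 2):
    if c1 != c2:
        bin_ = "hard" if lexical_similarity(t1, t2) >= 0.25 else "easy"
        negatives[bin_].append((t1, t2))
```

Hard negatives such as "myocardial infarction" vs. "myocardial ischemia" are precisely the cases a purely lexical grouping would get wrong, which is why mixing them into training helps the model generalize.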

18.
J Am Med Inform Assoc ; 28(11): 2404-2411, 2021 10 12.
Artigo em Inglês | MEDLINE | ID: mdl-34383897

RESUMO

OBJECTIVE: The study sought to assess the feasibility of replacing the International Classification of Diseases-Tenth Revision-Clinical Modification (ICD-10-CM) with the International Classification of Diseases-11th Revision (ICD-11) for morbidity coding based on content analysis. MATERIALS AND METHODS: The most frequently used ICD-10-CM codes from each chapter covering 60% of patients were identified from Medicare claims and hospital data. Each ICD-10-CM code was recoded in the ICD-11, using postcoordination (combination of codes) if necessary. Recoding was performed by 2 terminologists independently. Failure analysis was done for cases where full representation was not achieved even with postcoordination. After recoding, the coding guidance (inclusions, exclusions, and index) of the ICD-10-CM and ICD-11 codes were reviewed for conflict. RESULTS: Overall, 23.5% of 943 codes could be fully represented by the ICD-11 without postcoordination. Postcoordination is the potential game changer. It supports the full representation of 8.6% of 943 codes. Moreover, with the addition of only 9 extension codes, postcoordination supports the full representation of 35.2% of 943 codes. Coding guidance review identified potential conflicts in 10% of codes, but mostly not affecting recoding. The majority of the conflicts resulted from differences in granularity and default coding assumptions between the ICD-11 and ICD-10-CM. CONCLUSIONS: With some minor enhancements to postcoordination, the ICD-11 can fully represent almost 60% of the most frequently used ICD-10-CM codes. Even without postcoordination, 23.5% full representation is comparable to the 24.3% of ICD-9-CM codes with exact match in the ICD-10-CM, so migrating from the ICD-10-CM to the ICD-11 is not necessarily more disruptive than from the International Classification of Diseases-Ninth Revision-Clinical Modification to the ICD-10-CM. 
Therefore, the ICD-11 (without a CM) should be considered as a candidate to replace the ICD-10-CM for morbidity coding.
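A back-of-the-envelope check of the coverage figures, under one plausible reading of the abstract in which the "almost 60%" combines the 23.5% exact matches with the 35.2% achievable via postcoordination plus the nine extension codes:

```python
TOTAL_CODES = 943  # most frequently used ICD-10-CM codes in the study

pct_exact = 23.5             # fully represented without postcoordination
pct_postcoordination = 35.2  # fully represented with postcoordination + 9 extension codes

combined_pct = pct_exact + pct_postcoordination           # 58.7%, "almost 60%"
combined_codes = round(combined_pct / 100 * TOTAL_CODES)  # roughly 554 of 943 codes
```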


Assuntos
Classificação Internacional de Doenças , Medicare , Idoso , Codificação Clínica , Estudos de Viabilidade , Humanos , Morbidade , Estados Unidos
19.
Bioinformatics ; 25(12): i69-76, 2009 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-19478019

RESUMO

MOTIVATION: For many years, the Unified Medical Language System (UMLS) semantic network (SN) has been used as an upper-level semantic framework for the categorization of terms from terminological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL DL as a formal representation language, which has become standard in the semantic Web. In order to make logic-based reasoning available for the resources annotated or categorized with the SN, a mapping ontology was developed aligning the SN with BioTop. METHODS: The theoretical foundations and the practical realization of the alignment are described, with a focus on the design decisions taken, the problems encountered, and the adaptations of BioTop that became necessary. For evaluation purposes, UMLS concept pairs obtained from MEDLINE abstracts by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic-type combinations that occur in the UMLS Metathesaurus were checked for satisfiability. RESULTS: The effort-intensive alignment process required major design changes and enhancements of BioTop and revealed several design errors that could then be fixed. A comparison between a human curator and the ontology yielded only low agreement. Ontology reasoning was also used to successfully identify 133 inconsistent semantic-type combinations. AVAILABILITY: BioTop, the OWL DL representation of the UMLS SN, and the mapping ontology are available at http://www.purl.org/biotop/.
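The satisfiability check in the evaluation can be illustrated without a DL reasoner: a combination of semantic types is unsatisfiable if it violates a disjointness axiom. A toy sketch, with invented disjointness axioms standing in for the OWL DL reasoning actually used:

```python
# Invented disjointness axioms for illustration; the real check runs an
# OWL DL reasoner over BioTop and the SN mapping ontology.
disjoint_pairs = {
    frozenset({"Anatomical Structure", "Biologic Function"}),
    frozenset({"Organism", "Chemical"}),
}

def satisfiable(type_combination):
    """A combination is unsatisfiable if it contains a disjoint pair."""
    types = list(type_combination)
    return not any(
        frozenset({a, b}) in disjoint_pairs
        for i, a in enumerate(types)
        for b in types[i + 1:]
    )

assert not satisfiable({"Organism", "Chemical"})
assert satisfiable({"Organism", "Biologic Function"})
```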


Assuntos
Biologia Computacional/métodos , Armazenamento e Recuperação da Informação/métodos , Unified Medical Language System/normas , Bases de Dados Factuais , Reconhecimento Automatizado de Padrão , Semântica , Vocabulário Controlado
20.
Nucleic Acids Res ; 36(8): 2777-86, 2008 May.
Artigo em Inglês | MEDLINE | ID: mdl-18367472

RESUMO

A number of previous studies have predicted transcription factor binding sites (TFBSs) by exploiting the position of genomic landmarks like the transcriptional start site (TSS). The studies' methods are generally too computationally intensive for genome-scale investigation, so the full potential of 'positional regulomics' to discover TFBSs and determine their function remains unknown. Because databases often annotate the genomic landmarks in DNA sequences, the methodical exploitation of positional regulomics has become increasingly urgent. Accordingly, we examined a set of 7914 human putative promoter regions (PPRs) with a known TSS. Our methods identified 1226 eight-letter DNA words with significant positional preferences with respect to the TSS, of which only 608 of the 1226 words matched known TFBSs. Many groups of genes whose PPRs contained a common word displayed similar expression profiles and related biological functions, however. Most interestingly, our results included 78 words, each of which clustered significantly in two or three different positions relative to the TSS. Often, the gene groups corresponding to different positional clusters of the same word corresponded to diverse functions, e.g. activation or repression in different tissues. Thus, different clusters of the same word likely reflect the phenomenon of 'positional regulation', i.e. a word's regulatory function can vary with its position relative to a genomic landmark, a conclusion inaccessible to methods based purely on sequence. Further integrative analysis of words co-occurring in PPRs also yielded 24 different groups of genes, likely identifying cis-regulatory modules de novo. Whereas comparative genomics requires precise sequence alignments, positional regulomics exploits genomic landmarks to provide a 'poor man's alignment'. By exploiting the phenomenon of positional regulation, it uses position to differentiate the biological functions of subsets of TFBSs sharing a common sequence motif.
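The central observation above, that one word can cluster at two or three distinct positions relative to the TSS, can be sketched by binning occurrence positions and keeping the dense bins. The positions, bin size, and count threshold below are invented for illustration, not taken from the study:

```python
from collections import Counter

# Toy positions (bp relative to the TSS) at which one 8-letter word
# occurs across a set of putative promoter regions.
occurrences = [-32, -30, -29, -31, -118, -120, -119, -33, -121, -117]

def positional_clusters(positions, bin_size=20, min_count=3):
    """Bin occurrence positions and report bins dense enough to suggest
    a positional preference relative to the TSS."""
    bins = Counter(p // bin_size for p in positions)
    return sorted(b * bin_size for b, n in bins.items() if n >= min_count)

clusters = positional_clusters(occurrences)
# Two distinct positional clusters for the same word; per the paper, such
# clusters can correspond to different regulatory functions of the word.
```

The study's actual method assesses positional preference statistically across 7914 PPRs; this sketch only conveys the binning intuition behind detecting multi-modal positional clusters.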


Assuntos
Regulação da Expressão Gênica , Regiões Promotoras Genéticas , Fatores de Transcrição/metabolismo , Sítio de Iniciação de Transcrição , Sítios de Ligação , Análise por Conglomerados , Biologia Computacional , Perfilação da Expressão Gênica , Genômica , Humanos