Pesquisa | BVS Doenças Infecciosas e Parasitárias

1.

Opportunities and challenges for ChatGPT and large language models in biomedicine and health.

Tian, Shubo; Jin, Qiao; Yeganova, Lana; Lai, Po-Ting; Zhu, Qingqing; Chen, Xiuying; Yang, Yifan; Chen, Qingyu; Kim, Won; Comeau, Donald C; Islamaj, Rezarta; Kapoor, Aadit; Gao, Xin; Lu, Zhiyong.

Brief Bioinform ; 25(1)2023 11 22.

Artigo em Inglês | MEDLINE | ID: mdl-38168838

RESUMO

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

Assuntos

Armazenamento e Recuperação da Informação , Idioma , Humanos , Privacidade , Pesquisadores

2.

Evolving use of ancestry, ethnicity, and race in genetics research-A survey spanning seven decades.

Byeon, Yen Ji Julia; Islamaj, Rezarta; Yeganova, Lana; Wilbur, W John; Lu, Zhiyong; Brody, Lawrence C; Bonham, Vence L.

Am J Hum Genet ; 108(12): 2215-2223, 2021 12 02.

Artigo em Inglês | MEDLINE | ID: mdl-34861173

RESUMO

To inform continuous and rigorous reflection about the description of human populations in genomics research, this study investigates the historical and contemporary use of the terms "ancestry," "ethnicity," "race," and other population labels in The American Journal of Human Genetics from 1949 to 2018. We characterize these terms' frequency of use and assess their odds of co-occurrence with a set of social and genetic topical terms. Throughout The Journal's 70-year history, "ancestry" and "ethnicity" have increased in use, appearing in 33% and 26% of articles in 2009-2018, while the use of "race" has decreased, occurring in 4% of articles in 2009-2018. Although its overall use has declined, the odds of "race" appearing in the presence of "ethnicity" has increased relative to the odds of occurring in its absence. Forms of population descriptors "Caucasian" and "Negro" have largely disappeared from The Journal (<1% of articles in 2009-2018). Conversely, the continental labels "African," "Asian," and "European" have increased in use and appear in 18%, 14%, and 42% of articles from 2009-2018, respectively. Decreasing uses of the terms "race," "Caucasian," and "Negro" are indicative of a transition away from the field's history of explicitly biological race science; at the same time, the increasing use of "ancestry," "ethnicity," and continental labels should serve to motivate ongoing reflection as the terminology used to describe genetic variation continues to evolve.

Assuntos

Pesquisa em Genética , Genética Humana/tendências , Etnicidade , Pesquisa em Genética/história , História do Século XX , História do Século XXI , Genética Humana/história , Humanos , Editoração/história , Grupos Raciais

3.

MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval.

Jin, Qiao; Kim, Won; Chen, Qingyu; Comeau, Donald C; Yeganova, Lana; Wilbur, W John; Lu, Zhiyong.

Bioinformatics ; 39(11)2023 11 01.

Artigo em Inglês | MEDLINE | ID: mdl-37930897

RESUMO

MOTIVATION: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. RESULTS: To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks. AVAILABILITY AND IMPLEMENTATION: The MedCPT code and model are available at https://github.com/ncbi/MedCPT.

Assuntos

Armazenamento e Recuperação da Informação , Semântica , Idioma , Processamento de Linguagem Natural , PubMed , Literatura de Revisão como Assunto

4.

Towards a unified search: Improving PubMed retrieval with full text.

Kim, Won; Yeganova, Lana; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

J Biomed Inform ; 134: 104211, 2022 10.

Artigo em Inglês | MEDLINE | ID: mdl-36152950

RESUMO

OBJECTIVE: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts and achieve an overall improved retrieval performance. MATERIALS AND METHODS: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores which can be treated uniformly. We further propose a method that successfully combines scores from two heterogenous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness. RESULTS AND CONCLUSIONS: Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.

Assuntos

Gerenciamento de Dados , Armazenamento e Recuperação da Informação , PubMed

5.

Discovering themes in biomedical literature using a projection-based algorithm.

Yeganova, Lana; Kim, Sun; Balasanov, Grigory; Wilbur, W John.

BMC Bioinformatics ; 19(1): 269, 2018 07 16.

Artigo em Inglês | MEDLINE | ID: mdl-30012087

RESUMO

BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously. RESULTS: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed â documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method. CONCLUSIONS: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.

Assuntos

Algoritmos , Publicações , Análise por Conglomerados , Bases de Dados Genéticas , Humanos , Polimorfismo de Nucleotídeo Único/genética

6.

Meshable: searching PubMed abstracts by utilizing MeSH and MeSH-derived topical terms.

Kim, Sun; Yeganova, Lana; Wilbur, W John.

Bioinformatics ; 32(19): 3044-6, 2016 10 01.

Artigo em Inglês | MEDLINE | ID: mdl-27288493

RESUMO

UNLABELLED: Medical Subject Headings (MeSH(®)) is a controlled vocabulary for indexing and searching biomedical literature. MeSH terms and subheadings are organized in a hierarchical structure and are used to indicate the topics of an article. Biologists can use either MeSH terms as queries or the MeSH interface provided in PubMed(®) for searching PubMed abstracts. However, these are rarely used, and there is no convenient way to link standardized MeSH terms to user queries. Here, we introduce a web interface which allows users to enter queries to find MeSH terms closely related to the queries. Our method relies on co-occurrence of text words and MeSH terms to find keywords that are related to each MeSH term. A query is then matched with the keywords for MeSH terms, and candidate MeSH terms are ranked based on their relatedness to the query. The experimental results show that our method achieves the best performance among several term extraction approaches in terms of topic coherence. Moreover, the interface can be effectively used to find full names of abbreviations and to disambiguate user queries. AVAILABILITY AND IMPLEMENTATION: https://www.ncbi.nlm.nih.gov/IRET/MESHABLE/ CONTACT: sun.kim@nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

MEDLINE , Medical Subject Headings , PubMed , Ferramenta de Busca , Indexação e Redação de Resumos

7.

Retro: concept-based clustering of biomedical topical sets.

Yeganova, Lana; Kim, Won; Kim, Sun; Wilbur, W John.

Bioinformatics ; 30(22): 3240-8, 2014 Nov 15.

Artigo em Inglês | MEDLINE | ID: mdl-25075115

RESUMO

MOTIVATION: Clustering methods can be useful for automatically grouping documents into meaningful clusters, improving human comprehension of a document collection. Although there are clustering algorithms that can achieve the goal for relatively large document collections, they do not always work well for small and homogenous datasets. METHODS: In this article, we present Retro-a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. Further, the statistical significance of candidate clusters is tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering. RESULTS: We test our system on five disease datasets from OMIM(®) and evaluate the results based on MeSH(®) term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of quality of clusters. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene(®) database, a resource in PubMed(®). Empirical evaluation confirms the method is useful for small homogenous datasets in producing meaningful clusters with descriptive titles. AVAILABILITY AND IMPLEMENTATION: A web-based demonstration of the algorithm applied to a collection of sets from the HomoloGene database is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/CLUSTERING_HOMOLOGENE/index.html. CONTACT: lana.yeganova@nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Algoritmos , Mineração de Dados/métodos , Análise por Conglomerados , Bases de Dados Genéticas , Medical Subject Headings

8.

Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach.

Kim, Sun; Liu, Haibin; Yeganova, Lana; Wilbur, W John.

J Biomed Inform ; 55: 23-30, 2015 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-25796456

RESUMO

Identifying unknown drug interactions is of great benefit in the early detection of adverse drug reactions. Despite existence of several resources for drug-drug interaction (DDI) information, the wealth of such information is buried in a body of unstructured medical text which is growing exponentially. This calls for developing text mining techniques for identifying DDIs. The state-of-the-art DDI extraction methods use Support Vector Machines (SVMs) with non-linear composite kernels to explore diverse contexts in literature. While computationally less expensive, linear kernel-based systems have not achieved a comparable performance in DDI extraction tasks. In this work, we propose an efficient and scalable system using a linear kernel to identify DDI information. The proposed approach consists of two steps: identifying DDIs and assigning one of four different DDI types to the predicted drug pairs. We demonstrate that when equipped with a rich set of lexical and syntactic features, a linear SVM classifier is able to achieve a competitive performance in detecting DDIs. In addition, the one-against-one strategy proves vital for addressing an imbalance issue in DDI type classification. Applied to the DDIExtraction 2013 corpus, our system achieves an F1 score of 0.670, as compared to 0.651 and 0.609 reported by the top two participating teams in the DDIExtraction 2013 challenge, both based on non-linear kernel methods.

Assuntos

Mineração de Dados/métodos , Interações Medicamentosas , Processamento de Linguagem Natural , Publicações Periódicas como Assunto , Máquina de Vetores de Suporte , Vocabulário Controlado , Reconhecimento Automatizado de Padrão/métodos , Semântica

9.

Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health.

Tian, Shubo; Jin, Qiao; Yeganova, Lana; Lai, Po-Ting; Zhu, Qingqing; Chen, Xiuying; Yang, Yifan; Chen, Qingyu; Kim, Won; Comeau, Donald C; Islamaj, Rezarta; Kapoor, Aadit; Gao, Xin; Lu, Zhiyong.

ArXiv ; 2023 Oct 17.

Artigo em Inglês | MEDLINE | ID: mdl-37904734

RESUMO

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

10.

Identifying well-formed biomedical phrases in MEDLINE® text.

Kim, Won; Yeganova, Lana; Comeau, Donald C; Wilbur, W John.

J Biomed Inform ; 45(6): 1035-41, 2012 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-22683889

RESUMO

In the modern world people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well formed, and high quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and a hybrid approach combining these two. In this paper we propose a supervised learning approach for identifying high quality phrases. First we obtain a set of known well-formed useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases. Our goal is to identify these additional high quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies in the large collection strings that are likely to be high quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are humanly judged to be of high quality.

Assuntos

MEDLINE , Vocabulário Controlado , Algoritmos , Humanos , Armazenamento e Recuperação da Informação/métodos , Processamento de Linguagem Natural , Software , Estados Unidos

11.

Machine learning with naturally labeled data for identifying abbreviation definitions.

Yeganova, Lana; Comeau, Donald C; Wilbur, W John.

BMC Bioinformatics ; 12 Suppl 3: S6, 2011 Jun 09.

Artigo em Inglês | MEDLINE | ID: mdl-21658293

RESUMO

BACKGROUND: The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data. METHODS: In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data. RESULTS: We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on Ab3P corpus, and an F-measure of 87.13% on BIOADI corpus which are superior to the results reported by Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.

Assuntos

Abreviaturas como Assunto , Algoritmos , Inteligência Artificial , Reconhecimento Automatizado de Padrão/métodos , Armazenamento e Recuperação da Informação

12.

PDC - a probabilistic distributional clustering algorithm: a case study on suicide articles in PubMed.

Islamaj, Rezarta; Yeganova, Lana; Kim, Won; Xie, Natalie; Wilbur, W John; Lu, Zhiyong.

AMIA Jt Summits Transl Sci Proc ; 2020: 259-268, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-32477645

RESUMO

The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, and through this, to help create a rich environment of multifaceted data to guide health care researchers in their endeavor to better understand the breadth, depth and scope of the problem. We demonstrate the usefulness of the proposed algorithm by providing a web portal that allows mental health researchers to peruse the suicide-related literature in PubMed.

13.

Better synonyms for enriching biomedical search.

Yeganova, Lana; Kim, Sun; Chen, Qingyu; Balasanov, Grigory; Wilbur, W John; Lu, Zhiyong.

J Am Med Inform Assoc ; 27(12): 1894-1902, 2020 12 09.

Artigo em Inglês | MEDLINE | ID: mdl-33083825

RESUMO

OBJECTIVE: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be. MATERIALS AND METHODS: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms. RESULTS: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods. CONCLUSIONS: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.

Assuntos

Pesquisa Biomédica , Armazenamento e Recuperação da Informação/métodos , Linguística , PubMed , Algoritmos , Probabilidade , Terminologia como Assunto

14.

A Field Sensor: computing the composition and intent of PubMed queries.

Yeganova, Lana; Kim, Won; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

Database (Oxford) ; 20182018 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-30010750

RESUMO

PubMed® is a search engine providing access to a collection of over 27 million biomedical bibliographic records as of 2017. PubMed processes millions of queries a day, and understanding these queries is one of the main building blocks for successful information retrieval. In this work, we present Field Sensor, a domain-specific tool for understanding the composition and predicting the user intent of PubMed queries. Given a query, the Field Sensor infers a field for each token or sequence of tokens in a query in multi-step process that includes syntactic chunking, rule-based tagging and probabilistic field prediction. In this work, the fields of interest are those associated with (meta-)data elements of each PubMed record such as article title, abstract, author name(s), journal title, volume, issue, page and date. We evaluate the accuracy of our algorithm on a human-annotated corpus of 10 000 PubMed queries, as well as a new machine-annotated set of 103 000 PubMed queries. The Field Sensor achieves an accuracy of 93 and 91% on the two corresponding corpora and finds that nearly half of all searches are navigational (e.g. author searches, article title searches etc.) and half are informational (e.g. topical searches). The Field Sensor has been integrated into PubMed since June 2017 to detect informational queries for which results sorted by relevance can be suggested as an alternative to those sorted by the default date sort. In addition, the composition of PubMed queries as computed by the Field Sensor proves to be essential for understanding how users query PubMed.

Assuntos

Algoritmos , PubMed , Ferramenta de Busca , Curadoria de Dados , Publicações , Padrões de Referência

15.

PubMed Phrases, an open set of coherent phrases for searching biomedical literature.

Kim, Sun; Yeganova, Lana; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

Sci Data ; 5: 180104, 2018 06 12.

Artigo em Inglês | MEDLINE | ID: mdl-29893755

RESUMO

In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.

16.

Topics in machine learning for biomedical literature analysis and text retrieval.

Islamaj Dogan, Rezarta; Yeganova, Lana.

BMC Bioinformatics ; 12 Suppl 3: I1, 2011 Jun 09.

Artigo em Inglês | MEDLINE | ID: mdl-21658287

Assuntos

Inteligência Artificial , Armazenamento e Recuperação da Informação/métodos , Algoritmos , Processamento de Linguagem Natural

17.

Set Separation Problems and Global Optimization.

Falk, James E; Dandurova, Yelena; Yeganova, Lana.

Nonlinear Anal Theory Methods Appl ; 47(3): 1857-1867, 2001 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-29503498

RESUMO

Given a pair of finite, disjoint sets A and B in Rn , a fundamental problem with numerous applications is to find a simple function f(x) defined over Rn which separates the sets in the sense that f(a) > 0 for all a ∈ A and f(b) < 0 for all b ∈ B. This can always be done (e.g., with the piecewise linear function defined by the Voronoi partition implied by the points in A â B). However typically one seeks a linear (or possibly a quadratic) function f if possible, in which case we say that A and B are linearly (quadratically) separable. If A and B are separable in a linear or quadratic sense, there are generally many such functions which separate. In this case we seek a 'robust' separator, one that is best in a sense to be defined. When A and B are not separable in a linear or quadratic sense we seek a function which comes as close as possible to separating, according to some well defined criterion. In this paper we examine the optimization problems associated with the set separation problem, characterize them (convex or non-convex) and suggest algorithms for their solutions.

18.

Robust Set Separation Via Exponentials.

Dandurova, Yelena; Yeganova, Lana; Falk, James E.

Nonlinear Anal Theory Methods Appl ; 47(3): 1893-1904, 2001 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-29503499

RESUMO

Given a pair of finite disjoint sets A and B in Euclidean n-space, a fundamental problem with numerous applications is to efficiently determine a hyperplane H(ω, Î³) which separates these sets when they are separable, or 'nearly' separates them when they are not. We seek a hyperplane that separates them in the sense that a measure of the Euclidean distance between the separating hyperplane and all of the points is as large as possible. This is done by 'weighting' points relative to A âª B according to their distance to H(ω, Î³), with the closer points getting a higher weight, but still taking into account the points distant from H(ω, Î³). The negative exponential is chosen for that purpose. In this paper we examine the optimization problem associated with this set separation problem and characterize it (convex or non-convex).

19.

Robust Separation of Multiple Sets.

Yeganova, Lana E; Falk, James E; Dandurova, Yelena V.

Nonlinear Anal Theory Methods Appl ; 47(3): 1845-1856, 2001 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-29503497

RESUMO

Given K finite disjoint sets {Ak }, k = 1, , K in Euclidean n-space, a general problem with numerous applications is to find K simple nontrivial functions fk (x) which separate the sets {Ak } in the sense that fk (a) ≤ fi (a) for all a â Ak and i ≠ k, k = 1, , K. This can always be done (e.g., with the piecewise linear function obtained by the Voronoi Partition defined for the points in [Formula: see text]). However, typically one seeks linear functions fk (x) if possible, in which case we say the sets {Ak } are piecewise linear separable. If the sets are separable in a linear sense, there are generally many such functions that separate, in which case we seek a 'best' (in some sense) separator that is referred as a robust separator. If the sets are not separable in a linear sense, we seek a function which comes as close as possible to separating, according to some criterion.

20.

Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora.

Islamaj Dogan, Rezarta; Comeau, Donald C; Yeganova, Lana; Wilbur, W John.

Database (Oxford) ; 20142014.

Artigo em Inglês | MEDLINE | ID: mdl-24914232

RESUMO

BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information-that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed(®) citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators and their consistency and quality levels have been improved. We converted them to BioC-format and described the representation of the annotations. These corpora are used to measure the three abbreviation-finding algorithms and the results are given. The BioC-compatible modules, when compared with their original form, have no difference in their efficiency, running time or any other comparable aspects. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.

Assuntos

Abreviaturas como Assunto , Ontologias Biológicas , Mineração de Dados , Processamento de Linguagem Natural , Algoritmos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA