Search | VHL Regional Portal

1.

KEBLM: Knowledge-Enhanced Biomedical Language Models.

Lai, Tuan Manh; Zhai, ChengXiang; Ji, Heng.

J Biomed Inform ; 143: 104392, 2023 07.

Article in English | MEDLINE | ID: mdl-37211194

ABSTRACT

Pretrained language models (PLMs) have demonstrated strong performance on many natural language processing (NLP) tasks. Despite their great success, these PLMs are typically pretrained only on unstructured free texts without leveraging existing structured knowledge bases that are readily available for many domains, especially scientific domains. As a result, these PLMs may not achieve satisfactory performance on knowledge-intensive tasks such as biomedical NLP. Comprehending a complex biomedical document without domain-specific knowledge is challenging, even for humans. Inspired by this observation, we propose a general framework for incorporating various types of domain knowledge from multiple sources into biomedical PLMs. We encode domain knowledge using lightweight adapter modules, bottleneck feed-forward networks that are inserted into different locations of a backbone PLM. For each knowledge source of interest, we pretrain an adapter module to capture the knowledge in a self-supervised way. We design a wide range of self-supervised objectives to accommodate diverse types of knowledge, ranging from entity relations to description sentences. Once a set of pretrained adapters is available, we employ fusion layers to combine the knowledge encoded within these adapters for downstream tasks. Each fusion layer is a parameterized mixer of the available trained adapters that can identify and activate the most useful adapters for a given input. Our method diverges from prior work by including a knowledge consolidation phase, during which we teach the fusion layers to effectively combine knowledge from both the original PLM and newly-acquired external knowledge using a large collection of unannotated texts. After the consolidation phase, the complete knowledge-enhanced model can be fine-tuned for any downstream task of interest to achieve optimal performance. 
Extensive experiments on many biomedical NLP datasets show that our proposed framework consistently improves the performance of the underlying PLMs on various downstream tasks such as natural language inference, question answering, and entity linking. These results demonstrate the benefits of using multiple sources of external knowledge to enhance PLMs and the effectiveness of the framework for incorporating knowledge into PLMs. While primarily focused on the biomedical domain in this work, our framework is highly adaptable and can be easily applied to other domains, such as the bioenergy sector.
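
The adapter and fusion ideas in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ReLU activation, the residual connection, the dot-product scoring rule in `fuse`, and all function names are assumptions made for illustration. It only shows the general shape of a bottleneck feed-forward module and of a parameterized mixer that weights several adapter outputs for a given input.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def adapter(h: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Bottleneck adapter: project down, apply ReLU, project up, add the residual."""
    z = [max(0.0, a) for a in matvec(w_down, h)]
    return [x + d for x, d in zip(h, matvec(w_up, z))]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(h: Vector, adapter_outputs: List[Vector]) -> Vector:
    """Mix adapter outputs using softmax weights.

    Assumption: each adapter is scored by the dot product between the
    hidden state h and that adapter's output.
    """
    scores = [sum(a * b for a, b in zip(h, out)) for out in adapter_outputs]
    weights = softmax(scores)
    return [
        sum(w * out[i] for w, out in zip(weights, adapter_outputs))
        for i in range(len(h))
    ]
```

In this sketch the softmax weights play the role of "activating" the most useful adapter: an adapter whose output aligns with the input hidden state receives most of the weight.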

Subject(s)

Language , Natural Language Processing , Humans , Knowledge Bases , Software

2.

Exploring collaborative caption editing to augment video-based learning.

Bhavya, Bhavya; Chen, Si; Zhang, Zhilin; Li, Wenting; Zhai, Chengxiang; Angrave, Lawrence; Huang, Yun.

Educ Technol Res Dev ; 70(5): 1755-1779, 2022.

Article in English | MEDLINE | ID: mdl-35855355

ABSTRACT

Captions play a major role in making educational videos accessible to all and are known to benefit a wide range of learners. However, many educational videos either do not have captions or have inaccurate captions. Prior work has shown the benefits of using crowdsourcing to obtain accurate captions in a cost-efficient way, though there is a lack of understanding of how learners edit captions of educational videos either individually or collaboratively. In this work, we conducted a user study where 58 learners (in a course of 387 learners) participated in the editing of captions in 89 lecture videos that were generated by Automatic Speech Recognition (ASR) technologies. For each video, different learners conducted two rounds of editing. Based on editing logs, we created a taxonomy of errors in educational video captions (e.g., Discipline-Specific, General, Equations). From the interviews, we identified individual and collaborative error editing strategies. We then further demonstrated the feasibility of applying machine learning models to assist learners in editing. Our work provides practical implications for advancing video-based learning and for educational video caption editing.

3.

Estimating the influence of Twitter on pre-exposure prophylaxis use and HIV testing as a function of rates of men who have sex with men in the United States.

Chan, Man-Pui Sally; Morales, Alex; Zlotorzynska, Maria; Sullivan, Patrick; Sanchez, Travis; Zhai, Chengxiang; Albarracín, Dolores.

AIDS ; 35(Suppl 1): S101-S109, 2021 05 01.

Article in English | MEDLINE | ID: mdl-33867493

ABSTRACT

OBJECTIVES: Acceptance of pre-exposure prophylaxis (PrEP) and testing for HIV is likely to vary as a function of the norms and communications within a geographic area. This study examined associations involving county tweets, in-person communications, and HIV prevention and testing in regions with higher (vs. lower) estimated rates of men who have sex with men (MSM). DESIGN AND METHODS: Ecological analyses examined (a) tweets about HIV (i.e. tweet rates per 100 000 county population and topic probabilities in 1959 US counties); (b) individual-level survey data about HIV prevention and testing and communications about PrEP and HIV (N = 30 675 participants); and (c) estimated county-level MSM rates (per 1 000 adult men). RESULTS: In counties with higher rates of MSM, tweet rates were directly associated with PrEP use and HIV testing (rs = .06, BF10 > 10). Topics correlated with PrEP use (rs = -0.06 to 0.07, BF10 > 10) and HIV testing (rs = -0.05 to 0.05, BF10 > 10). Mediation analyses showed that hearing about and discussing PrEP mediated the relations between tweet rates and PrEP use (bi∗ = 0.01 to 0.05, BF10 > 100) and between topics and PrEP use (bi∗ = -0.04 to 0.05, BF10 > 10). Moreover, hearing about PrEP was associated with PrEP use, which was in turn associated with tweet rates (bi∗ = 0.01, BF10 > 100) and topics (bi∗ = -0.03 to 0.01, BF10 > 10). CONCLUSIONS: Rates of MSM appear to lead to HIV tweets in a region, in-person communications about PrEP, and, ultimately, actual PrEP use. Also, as more men hear about PrEP, they may use PrEP more and may tweet about HIV.

Subject(s)

Anti-HIV Agents , HIV Infections , Pre-Exposure Prophylaxis , Sexual and Gender Minorities , Social Media , Adult , Anti-HIV Agents/therapeutic use , HIV Infections/diagnosis , HIV Infections/drug therapy , HIV Infections/prevention & control , HIV Testing , Homosexuality, Male , Humans , Male , United States

4.

Biosystems Design by Machine Learning.

Volk, Michael Jeffrey; Lourentzou, Ismini; Mishra, Shekhar; Vo, Lam Tung; Zhai, Chengxiang; Zhao, Huimin.

ACS Synth Biol ; 9(7): 1514-1533, 2020 07 17.

Article in English | MEDLINE | ID: mdl-32485108

ABSTRACT

Biosystems such as enzymes, pathways, and whole cells have been increasingly explored for biotechnological applications. However, the intricate connectivity and resulting complexity of biosystems pose a major hurdle in designing biosystems with desirable features. As -omics and other high-throughput technologies have been rapidly developed, the promise of applying machine learning (ML) techniques in biosystems design has started to become a reality. ML models enable the identification of patterns within complicated biological data across multiple scales of analysis and can augment biosystems design applications by predicting new candidates for optimized performance. ML is being used at every stage of biosystems design to help find nonobvious engineering solutions with fewer design iterations. In this review, we first describe commonly used models and modeling paradigms within ML. We then discuss some applications of these models that have already shown success in biotechnological applications. Moreover, we discuss successful applications at all scales of biosystems design, including nucleic acids, genetic circuits, proteins, pathways, genomes, and bioprocesses. Finally, we discuss some limitations of these methods and potential solutions as well as prospects of the combination of ML and biosystems design.

Subject(s)

Biotechnology , Machine Learning , Proteins , Gene Editing , Gene Regulatory Networks , Linear Models , Metabolic Engineering , Proteins/chemistry , Proteins/metabolism

5.

HIV messaging on Twitter: an analysis of current practice and data-driven recommendations.

Lohmann, Sophie; White, Benjamin X; Zuo, Zhen; Chan, Man-Pui Sally; Morales, Alex; Li, Bo; Zhai, Chengxiang; Albarracín, Dolores.

AIDS ; 32(18): 2799-2805, 2018 11 28.

Article in English | MEDLINE | ID: mdl-30289801

ABSTRACT

OBJECTIVES: Social media messages have been increasingly used in health campaigns about prevention, testing, and treatment of HIV. We identified factors leading to the retransmission of messages from expert social media accounts to create data-driven recommendations for online HIV messaging. DESIGN AND METHODS: We sampled 20 201 HIV-related tweets (posted between 2010 and 2017) from 37 HIV experts. Potential predictors of retransmission were identified based on prior literature and machine learning methods, and were subsequently analyzed using multilevel negative binomial models. RESULTS: Fear-related language, longer messages, and including images (e.g. photos, GIFs, or videos) were the strongest predictors of retweet counts. These findings were similar for messages authored by HIV experts and for messages retransmitted by experts but created by nonexperts (e.g. celebrities or politicians). CONCLUSIONS: Fear appeals affect how much HIV messages spread on Twitter, as do structural characteristics, like the length of the tweet and inclusion of images. A set of five data-driven recommendations for increasing message spread is derived and discussed in the context of current Centers for Disease Control and Prevention social media guidelines.

Subject(s)

Behavior Therapy/methods , Disease Transmission, Infectious/prevention & control , HIV Infections/prevention & control , Health Education/methods , Social Media , HIV Infections/diagnosis , Humans

6.

An Online Risk Index for the Cross-Sectional Prediction of New HIV, Chlamydia, and Gonorrhea Diagnoses Across U.S. Counties and Across Years.

Chan, Man-Pui Sally; Lohmann, Sophie; Morales, Alex; Zhai, Chengxiang; Ungar, Lyle; Holtgrave, David R; Albarracín, Dolores.

AIDS Behav ; 22(7): 2322-2333, 2018 Jul.

Article in English | MEDLINE | ID: mdl-29427233

ABSTRACT

The present study evaluated the potential use of Twitter data for providing risk indices of STIs. We developed online risk indices (ORIs) based on tweets to predict new HIV, gonorrhea, and chlamydia diagnoses, across U.S. counties and across 5 years. We analyzed over one hundred million tweets from 2009 to 2013 using open-vocabulary techniques and estimated the ORIs for a particular year by entering tweets from the same year into multiple semantic models (one for each year). The ORIs were moderately to strongly associated with the actual rates (.35 < rs < .68 for 93% of models), both nationwide and when applied to single states (California, Florida, and New York). Later models were slightly better than older ones at predicting gonorrhea and chlamydia, but not at predicting HIV. The proposed technique using free social media data provides signals of community health at a high temporal and spatial resolution.

Subject(s)

Big Data , Chlamydia Infections/epidemiology , Gonorrhea/epidemiology , HIV Infections/epidemiology , Social Media , California/epidemiology , Chlamydia Infections/diagnosis , Cross-Sectional Studies , Florida/epidemiology , Gonorrhea/diagnosis , HIV , HIV Infections/diagnosis , Humans , New York/epidemiology , Public Health , Risk Assessment , Sexually Transmitted Diseases/epidemiology , United States/epidemiology

7.

VisAGE: Integrating external knowledge into electronic medical record visualization.

Huang, Edward W; Wang, Sheng; Zhai, ChengXiang.

Pac Symp Biocomput ; 23: 578-589, 2018.

Article in English | MEDLINE | ID: mdl-29218916

ABSTRACT

In this paper, we present VisAGE, a method that visualizes electronic medical records (EMRs) in a low-dimensional space. Effective visualization of new patients allows doctors to view similar, previously treated patients and to identify the new patients' disease subtypes, reducing the chance of misdiagnosis. However, EMRs are typically incomplete or fragmented, resulting in patients who are missing many available features being placed near unrelated patients in the visualized space. VisAGE integrates several external data sources to enrich EMR databases to solve this issue. We evaluated VisAGE on a dataset of Parkinson's disease patients. We qualitatively and quantitatively show that VisAGE can more effectively cluster patients, which allows doctors to better discover patient subtypes and thus improve patient care.

Subject(s)

Electronic Health Records/statistics & numerical data , Algorithms , Computational Biology/methods , Computer Graphics/statistics & numerical data , Databases, Factual/statistics & numerical data , Disease Progression , False Positive Reactions , Female , Humans , Information Storage and Retrieval/statistics & numerical data , Knowledge Bases , Male , Parkinson Disease/drug therapy , Parkinson Disease/etiology , Polymorphism, Single Nucleotide , Protein Interaction Maps

8.

Who is Saying What on Twitter: An Analysis of Messages with References to HIV and HIV Risk Behavior.

Lohmann, Sophie; Lourentzou, Ismini; Zhai, Chengxiang; Albarracín, Dolores.

Acta Investig Psicol ; 8(1): 95-100, 2018 Apr.

Article in English | MEDLINE | ID: mdl-31105910

ABSTRACT

This research aimed to determine the nature of social media discussions about HIV. With the goal of conducting a descriptive analysis, we collected almost 1,000 tweets posted February to September 2015. The sample of tweets included keywords related to HIV or behavioral risk factors (e.g., sex, drug use) and was coded for content (e.g., HIV), behavior change strategies, and message source. Seven percent of tweets concerned HIV/AIDS, which were often referred to as jokes or insults. The majority of tweets coded as behavior change attempts involved attitude change strategies. The majority of the tweets (80%) came from private users (vs. organizations). Different types of sources employed different types of behavior change strategies: For instance, private users, compared to experts or organizations, included more strategies to decrease detrimental attitudes (29% versus 6%, p < .001), and also more strategies to counter myths and misinformation (6% versus 1%, p = .008). In summary, tweets related to HIV/AIDS and associated risk factors frequently use the terms in jokes and insults, come largely from private users, and entail attitudinal and informational strategies. Online health campaigns with clear calls to action and corrections of misinformation may make important contributions to social media conversations about HIV/AIDS.

Resumen (translated from Spanish): This research aimed to characterize discussions about HIV on social media. With the goal of conducting a descriptive analysis, we collected around one thousand tweets between February and September 2015. These tweets were selected if they included keywords related to HIV or to behavioral risk factors such as sex or drug use. Four coders classified the tweets by content (e.g., HIV as a disease, reference to a product or service), behavior change strategy (behavior change, call to action, or correction of myths), and message source (e.g., private users, experts, commercial companies). The majority of the tweets (80%) came from private rather than institutional users. Seven percent of the tweets referred strictly to HIV or other sexually transmitted infections, frequently using those terms as jokes or insults, such as writing that an unpleasant experience "gave me AIDS". The majority of behavior change attempts included strategies to reduce negative attitudes. Different types of sources employed different types of behavior change strategies. For example, private users (compared with experts, commercial organizations, and other organizations such as newspapers and NGOs) posted more messages classified as strategies promoting negative attitudes (29% versus 6%, p < .001) and included more corrections of myths (6% versus 1%, p = .008). In summary, tweets that mention HIV or HIV risk factors very frequently use the terms in jokes and insults, come mostly from private users, and include attitude change strategies. Internet campaigns with clear calls to action and corrections of myths can make important contributions to conversations about HIV on social media.

9.

Who is saying what on Twitter: An analysis of messages with references to HIV and HIV risk behavior / Quién dice qué en Twitter: Mensajes con referencia a VIH y conducta de riesgo de VIH

Lohmann, Sophie; Lourentzou, Ismini; Zhai, Chengxiang; Albarracín, Dolores.

Acta investigación psicol. (en línea) ; 8(1): 95-100, abr. 2018. tab

Article in English | LILACS | ID: biblio-949481

ABSTRACT

Abstract: This research aimed to determine the nature of social media discussions about HIV. With the goal of conducting a descriptive analysis, we collected almost 1,000 tweets posted February to September 2015. The sample of tweets included keywords related to HIV or behavioral risk factors (e.g., sex, drug use) and was coded for content (e.g., HIV), behavior change strategies, and message source. Seven percent of tweets concerned HIV/AIDS, which were often referred to as jokes or insults. The majority of tweets coded as behavior change attempts involved attitude change strategies. The majority of the tweets (80%) came from private users (vs. organizations). Different types of sources employed different types of behavior change strategies: For instance, private users, compared to experts or organizations, included more strategies to decrease detrimental attitudes (29% versus 6%, p < .001), and also more strategies to counter myths and misinformation (6% versus 1%, p = .008). In summary, tweets related to HIV/AIDS and associated risk factors frequently use the terms in jokes and insults, come largely from private users, and entail attitudinal and informational strategies. Online health campaigns with clear calls to action and corrections of misinformation may make important contributions to social media conversations about HIV/AIDS.

Resumen (translated from Spanish): This research aimed to characterize discussions about HIV on social media. With the goal of conducting a descriptive analysis, we collected around one thousand tweets between February and September 2015. These tweets were selected if they included keywords related to HIV or to behavioral risk factors such as sex or drug use. Four coders classified the tweets by content (e.g., HIV as a disease, reference to a product or service), behavior change strategy (behavior change, call to action, or correction of myths), and message source (e.g., private users, experts, commercial companies). The majority of the tweets (80%) came from private rather than institutional users. Seven percent of the tweets referred strictly to HIV or other sexually transmitted infections, frequently using those terms as jokes or insults, such as writing that an unpleasant experience "gave me AIDS". The majority of behavior change attempts included strategies to reduce negative attitudes. Different types of sources employed different types of behavior change strategies. For example, private users (compared with experts, commercial organizations, and other organizations such as newspapers and NGOs) posted more messages classified as strategies promoting negative attitudes (29% versus 6%, p < .001) and included more corrections of myths (6% versus 1%, p = .008). In summary, tweets that mention HIV or HIV risk factors very frequently use the terms in jokes and insults, come mostly from private users, and include attitude change strategies. Internet campaigns with clear calls to action and corrections of myths can make important contributions to conversations about HIV on social media.

10.

Framing Electronic Medical Records as Polylingual Documents in Query Expansion.

Huang, Edward W; Wang, Sheng; Lee, Doris Jung-Lin; Zhang, Runshun; Liu, Baoyan; Zhou, Xuezhong; Zhai, ChengXiang.

AMIA Annu Symp Proc ; 2017: 940-949, 2017.

Article in English | MEDLINE | ID: mdl-29854161

ABSTRACT

We present a study of electronic medical record (EMR) retrieval that emulates situations in which a doctor treats a new patient. Given a query consisting of a new patient's symptoms, the retrieval system returns the set of most relevant records of previously treated patients. However, due to semantic, functional, and treatment synonyms in medical terminology, queries are often incomplete and thus require enhancement. In this paper, we present a topic model that frames symptoms and treatments as separate languages. Our experimental results show that this method improves retrieval performance over several baselines with statistical significance. These baselines include methods used in prior studies as well as state-of-the-art embedding techniques. Finally, we show that our proposed topic model discovers all three types of synonyms to improve medical record retrieval.

Subject(s)

Electronic Health Records , Information Storage and Retrieval/methods , Multilingualism , Humans , Natural Language Processing , Semantics , Terminology as Topic

11.

DeepMeSH: deep semantic representation for improving large-scale MeSH indexing.

Peng, Shengwen; You, Ronghui; Wang, Hongning; Zhai, Chengxiang; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 32(12): i70-i79, 2016 06 15.

Article in English | MEDLINE | ID: mdl-27307646

ABSTRACT

MOTIVATION: Medical Subject Headings (MeSH) indexing, which is to assign a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and MeSH side. For the citation side, all existing methods, including Medical Text Indexer (MTI) by National Library of Medicine and the state-of-the-art method, MeSHLabeler, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well. METHODS: We propose DeepMeSH that incorporates deep semantic information for large-scale MeSH indexing. It addresses the two challenges in both citation and MeSH sides. The citation side challenge is solved by a new deep semantic representation, D2V-TFIDF, which concatenates both sparse and dense semantic representations. The MeSH side challenge is solved by using the 'learning to rank' framework of MeSHLabeler, which integrates various types of evidence generated from the new semantic representation. RESULTS: DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218 of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge data with 6000 citations. AVAILABILITY AND IMPLEMENTATION: The software is available upon request. CONTACT: zhusf@fudan.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
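
The Micro F-measure reported in this abstract can be illustrated with a short sketch. It assumes the usual micro-averaged definition (pooling label decisions over all citations before computing precision and recall); the abstract does not spell out the exact evaluation script, and the function names and toy labels here are hypothetical. The second helper recomputes the relative gains from the three scores quoted above.

```python
from typing import List, Set


def micro_f(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F-measure over per-citation label sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Toy example with made-up label sets for two citations.
gold = [{"Humans", "Semantics"}, {"Software"}]
pred = [{"Humans"}, {"Software", "MEDLINE"}]
toy_score = micro_f(gold, pred)

# Relative gains implied by the scores quoted in the abstract.
gain_over_meshlabeler = relative_gain(0.6323, 0.6218)
gain_over_mti = relative_gain(0.6323, 0.5637)
```

Under this definition the quoted scores correspond to relative gains of roughly 1.7% over MeSHLabeler and 12.2% over MTI, consistent with the rounded "2%" and "12%" in the abstract.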

Subject(s)

Medical Subject Headings , Semantics , Software , Abstracting and Indexing , Data Mining , MEDLINE , National Library of Medicine (U.S.) , United States

12.

Big Data: Astronomical or Genomical?

Stephens, Zachary D; Lee, Skylar Y; Faghri, Faraz; Campbell, Roy H; Zhai, Chengxiang; Efron, Miles J; Iyer, Ravishankar; Schatz, Michael C; Sinha, Saurabh; Robinson, Gene E.

PLoS Biol ; 13(7): e1002195, 2015 Jul.

Article in English | MEDLINE | ID: mdl-26151137

ABSTRACT

Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a "four-headed beast"--it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the "genomical" challenges of the next decade.

Subject(s)

Genomics/trends , Astronomy/trends , Information Storage and Retrieval , Social Media/trends , Statistics as Topic

13.

MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence.

Liu, Ke; Peng, Shengwen; Wu, Junqiu; Zhai, Chengxiang; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 31(12): i339-47, 2015 Jun 15.

Article in English | MEDLINE | ID: mdl-26072501

ABSTRACT

MOTIVATION: Medical Subject Headings (MeSHs) are used by National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates the applications of biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), for assisting MeSH annotation, which uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as prediction by MeSH classifiers (trained separately), can also be used for automatic MeSH annotation. However, existing methods cannot effectively integrate multiple evidence for MeSH annotation. METHODS: We propose a novel framework, MeSHLabeler, to integrate multiple evidence for accurate MeSH annotation by using 'learning to rank'. Evidence includes numerous predictions from MeSH classifiers, KNN, pattern matching, MTI and the correlation between different MeSH terms, etc. Each MeSH classifier is trained independently, and thus prediction scores from different classifiers are incomparable. To address this issue, we have developed an effective score normalization procedure to improve the prediction accuracy. RESULTS: MeSHLabeler won the first place in Task 2A of 2014 BioASQ challenge, achieving the Micro F-measure of 0.6248 for 9,040 citations provided by the BioASQ challenge. Note that this accuracy is around 9.15% higher than 0.5724, obtained by MTI. AVAILABILITY AND IMPLEMENTATION: The software is available upon request.
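
The abstract does not describe its score normalization procedure, so the following is only a generic stand-in that shows why normalization matters when classifiers are trained independently. It uses plain z-score standardization per classifier, which is an assumption and not the MeSHLabeler procedure; the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def z_normalize(scores: List[float]) -> List[float]:
    """Standardize one classifier's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0.0:
        # All scores identical: no ranking information to preserve.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Two hypothetical classifiers that rank three citations identically
# but report scores on very different scales.
classifier_a = z_normalize([1.0, 2.0, 3.0])
classifier_b = z_normalize([10.0, 20.0, 30.0])
```

After standardization the two score lists coincide, so a downstream ranker can compare them directly.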

Subject(s)

Abstracting and Indexing/methods , Medical Subject Headings , Software , Algorithms , Data Mining , MEDLINE , Reproducibility of Results

14.

Exploiting ontology graph for predicting sparsely annotated gene function.

Wang, Sheng; Cho, Hyunghoon; Zhai, ChengXiang; Berger, Bonnie; Peng, Jian.

Bioinformatics ; 31(12): i357-64, 2015 Jun 15.

Article in English | MEDLINE | ID: mdl-26072504

ABSTRACT

MOTIVATION: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this 'overfitting' issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog. RESULTS: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions. AVAILABILITY AND IMPLEMENTATION: https://github.com/wangshenguiuc/clusDCA.

Subject(s)

Algorithms , Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation , Proteins/metabolism , Saccharomyces cerevisiae Proteins/metabolism , Animals , Gene Regulatory Networks , Humans , Mice , Proteins/genetics , Saccharomyces cerevisiae Proteins/genetics , Vocabulary, Controlled

15.

Understanding user intents in online health forums.

Zhang, Thomas; Cho, Jason H D; Zhai, Chengxiang.

IEEE J Biomed Health Inform ; 19(4): 1392-8, 2015 Jul.

Article in English | MEDLINE | ID: mdl-25823052

ABSTRACT

Online health forums provide a convenient way for patients to obtain medical information and connect with physicians and peers outside of clinical settings. However, large quantities of unstructured and diversified content generated on these forums make it difficult for users to digest and extract useful information. Understanding user intents would enable forums to find and recommend relevant information to users by filtering out threads that do not match particular intents. In this paper, we derive a taxonomy of intents to capture user information needs in online health forums and propose novel pattern-based features for use with a multiclass support vector machine (SVM) classifier to classify original thread posts according to their underlying intents. Since no dataset existed for this task, we employ three annotators to manually label a dataset of 1192 HealthBoards posts spanning four forum topics. Experimental results show that an SVM using pattern-based features is highly capable of identifying user intents in forum posts, reaching a maximum precision of 75%, and that an SVM-based hierarchical classifier using both pattern and word features outperforms its SVM counterpart that uses only word features. Furthermore, comparable classification performance can be achieved by training and testing on posts from different forum topics.

Subject(s)

Health Information Exchange/classification , Intention , Internet , Support Vector Machine , Computational Biology , Humans , Machine Learning , Pattern Recognition, Automated/methods

16.

Integer Linear Programming for Constrained Multi-Aspect Committee Review Assignment.

Karimzadehgan, Maryam; Zhai, Chengxiang.

Inf Process Manag ; 48(4): 725-740, 2012 Jul 01.

Article in English | MEDLINE | ID: mdl-22711970

ABSTRACT

Automatic review assignment can significantly improve the productivity of many people such as conference organizers, journal editors and grant administrators. A general setup of the review assignment problem involves assigning a set of reviewers on a committee to a set of documents to be reviewed under the constraint of review quota so that the reviewers assigned to a document can collectively cover multiple topic aspects of the document. No previous work has addressed such a setup of committee review assignments while also considering matching multiple aspects of topics and expertise. In this paper, we tackle the problem of committee review assignment with multi-aspect expertise matching by casting it as an integer linear programming problem. The proposed algorithm can naturally accommodate any probabilistic or deterministic method for modeling multiple aspects to automate committee review assignments. Evaluation using a multi-aspect review assignment test set constructed using ACM SIGIR publications shows that the proposed algorithm is effective and efficient for committee review assignments based on multi-aspect expertise matching.
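
The constrained assignment problem described in this abstract can be sketched on a toy instance. The formulation below is an assumption, not the paper's model: each document receives a fixed number of reviewers, no reviewer exceeds a quota, and the objective counts covered (document, aspect) pairs. Exhaustive search stands in for an integer linear programming solver, which is only practical at this tiny scale; all names are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, Set, Tuple


def assign(
    expertise: Dict[str, Set[str]],
    doc_aspects: Dict[str, Set[str]],
    reviewers_per_doc: int,
    quota: int,
) -> Tuple[int, Dict[str, Tuple[str, ...]]]:
    """Return (covered aspect count, assignment) maximizing aspect coverage.

    Returns (-1, {}) when no assignment satisfies the quota constraint.
    """
    docs = sorted(doc_aspects)
    panels = list(combinations(sorted(expertise), reviewers_per_doc))
    best_score, best = -1, {}
    for choice in product(panels, repeat=len(docs)):
        # Count how many documents each reviewer is assigned to.
        load: Dict[str, int] = {}
        for panel in choice:
            for reviewer in panel:
                load[reviewer] = load.get(reviewer, 0) + 1
        if any(n > quota for n in load.values()):
            continue  # violates a reviewer's quota
        score = 0
        for doc, panel in zip(docs, choice):
            covered = set().union(*(expertise[r] for r in panel))
            score += len(covered & doc_aspects[doc])
        if score > best_score:
            best_score, best = score, dict(zip(docs, choice))
    return best_score, best


# Toy instance: three reviewers, two documents, two reviewers per document.
expertise = {"r1": {"ir"}, "r2": {"ml"}, "r3": {"ir", "nlp"}}
doc_aspects = {"d1": {"ir", "ml"}, "d2": {"nlp", "ml"}}
score, plan = assign(expertise, doc_aspects, reviewers_per_doc=2, quota=2)
```

In an ILP version of the same sketch, each (reviewer, document) pair would become a binary variable and the quota and panel-size rules would become linear constraints.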

17.

Leveraging medical thesauri and physician feedback for improving medical literature retrieval for case queries.

Sondhi, Parikshit; Sun, Jimeng; Zhai, ChengXiang; Sorrentino, Robert; Kohn, Martin S.

J Am Med Inform Assoc ; 19(5): 851-8, 2012.

Article in English | MEDLINE | ID: mdl-22437075

ABSTRACT

OBJECTIVE: This paper presents a study of methods for medical literature retrieval for case queries, in which the goal is to retrieve literature articles similar to a given patient case. In particular, it focuses on analyzing the performance of state-of-the-art general retrieval methods and improving them by the use of medical thesauri and physician feedback. MATERIALS AND METHODS: The Kullback-Leibler divergence retrieval model with Dirichlet smoothing is used as the state-of-the-art general retrieval method. Pseudorelevance feedback and term weighting methods are proposed by leveraging MeSH and UMLS thesauri. Evaluation is performed on a test collection recently created for the ImageCLEF medical case retrieval challenge. RESULTS: Experimental results show that a well-tuned state-of-the-art general retrieval model achieves a mean average precision of 0.2754, but the performance can be improved by over 40% to 0.3980, through the proposed methods. DISCUSSION: The results over the ImageCLEF test collection, which is currently the best collection available for the task, are encouraging. There are, however, limitations due to small evaluation set size. The analysis shows that further refinement of the methods is necessary before they can be really useful in a clinical setting. CONCLUSION: Medical case-based literature retrieval is a critical search application that presents a number of unique challenges. This analysis shows that the state-of-the-art general retrieval models are reasonably good for the task, but the performance can be significantly improved by developing new task-specific retrieval models that incorporate medical thesauri and physician feedback.
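The baseline named in this abstract, the Kullback-Leibler divergence retrieval model with Dirichlet smoothing, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a maximum-likelihood query model, whitespace-tokenized documents, and a smoothing parameter `mu` chosen arbitrarily. The function names `kl_score` and `rank` are hypothetical. The score is the rank-equivalent form of the negative KL divergence, the sum over query words of p(w|Q) log p(w|D), where p(w|D) = (c(w,D) + mu * p(w|C)) / (|D| + mu).

```python
import math
from collections import Counter


def kl_score(query_tokens, doc_tokens, coll_counts, coll_len, mu=2000.0):
    """Rank-equivalent negative KL divergence with a Dirichlet-smoothed document model."""
    q_counts = Counter(query_tokens)
    d_counts = Counter(doc_tokens)
    q_len, d_len = len(query_tokens), len(doc_tokens)
    score = 0.0
    for word, count in q_counts.items():
        p_coll = coll_counts.get(word, 0) / coll_len
        # Dirichlet smoothing: interpolate document counts with the collection model.
        p_doc = (d_counts.get(word, 0) + mu * p_coll) / (d_len + mu)
        if p_doc == 0.0:
            continue  # word never occurs in the collection; skip it
        score += (count / q_len) * math.log(p_doc)
    return score


def rank(query_tokens, docs, mu=2000.0):
    """Return document indices sorted from best to worst score."""
    coll_counts = Counter(w for doc in docs for w in doc)
    coll_len = sum(coll_counts.values())
    scores = [kl_score(query_tokens, d, coll_counts, coll_len, mu) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Illustrative toy collection only.
docs = [["chest", "pain", "fever"], ["knee", "injury", "pain"]]
order = rank(["fever", "chest"], docs, mu=10.0)
```

The thesaurus-based feedback and term weighting methods the abstract proposes would modify the query model p(w|Q); the sketch keeps only the unexpanded baseline.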

Subject(s)

Feedback , Information Storage and Retrieval/methods , Medical Subject Headings , Natural Language Processing , Unified Medical Language System , Algorithms , Humans , Physicians , User-Computer Interface

18.

BeeSpace Navigator: exploratory analysis of gene function using semantic indexing of biological literature.

Sen Sarma, Moushumi; Arcoleo, David; Khetani, Radhika S; Chee, Brant; Ling, Xu; He, Xin; Jiang, Jing; Mei, Qiaozhu; Zhai, ChengXiang; Schatz, Bruce.

Nucleic Acids Res ; 39(Web Server issue): W462-9, 2011 Jul.

Article in English | MEDLINE | ID: mdl-21558175

ABSTRACT

With the rapid decrease in cost of genome sequencing, the classification of gene function is becoming a primary problem. Such classification has been performed by human curators who read biological literature to extract evidence. BeeSpace Navigator is a prototype software for exploratory analysis of gene function using biological literature. The software supports an automatic analogue of the curator process to extract functions, with a simple interface intended for all biologists. Since extraction is done on selected collections that are semantically indexed into conceptual spaces, the curation can be task specific. Biological literature containing references to gene lists from expression experiments can be analyzed to extract concepts that are computational equivalents of a classification such as Gene Ontology, yielding discriminating concepts that differentiate gene mentions from other mentions. The functions of individual genes can be summarized from sentences in biological literature, to produce results resembling a model organism database entry that is automatically computed. Statistical frequency analysis based on literature phrase extraction generates offline semantic indexes to support these gene function services. The website with BeeSpace Navigator is free and open to all; there is no login requirement at www.beespace.illinois.edu for version 4. Materials from the 2010 BeeSpace Software Training Workshop are available at www.beespace.illinois.edu/bstwmaterials.php.

Subject(s)

Abstracting and Indexing/methods , Genes , Software , Animals , Internet , MEDLINE

19.

Discovery of gene network variability across samples representing multiple classes.

Ko, Younhee; Zhai, ChengXiang; Rodriguez-Zas, Sandra L.

Int J Bioinform Res Appl ; 6(4): 402-17, 2010.

Article in English | MEDLINE | ID: mdl-20940126

ABSTRACT

Gene networks have been predicted using the expression profiles from microarray experiments that include multiple samples representing each of several classes or states (e.g., treatments, developmental stages, health status). A framework that integrates Bayesian networks, mixture of gene co-expression models and clustering is proposed to further mine information from the variation of samples within and across classes and enhance the understanding of gene networks. The approach was evaluated on two independent pathways using data from two microarray experiments. Our algorithm succeeded in reconstructing the topology of the gene pathways when benchmarked against empirical reports and randomised data sets. The majority or all of the samples within a class shared the same co-expression model and were classified within the corresponding class. Our approach uncovered both gene relationships and profiles that are unique to a particular class or shared across classes.

Subject(s)

Gene Expression Profiling/methods , Gene Regulatory Networks , Algorithms , Bayes Theorem , Cluster Analysis , Databases, Factual , Oligonucleotide Array Sequence Analysis

20.

BSQA: integrated text mining using entity relation semantics extracted from biological literature of insects.

He, Xin; Li, Yanen; Khetani, Radhika; Sanders, Barry; Lu, Yue; Ling, Xu; Zhai, Chengxiang; Schatz, Bruce.

Nucleic Acids Res ; 38(Web Server issue): W175-81, 2010 Jul.

Article in English | MEDLINE | ID: mdl-20576702

ABSTRACT

Text mining is one promising way of extracting information automatically from the vast biological literature. To maximize its potential, the knowledge encoded in the text should be translated to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. We present the BeeSpace question/answering (BSQA) system that performs integrated text mining for insect biology, covering diverse aspects from molecular interactions of genes to insect behavior. BSQA recognizes a number of entities and relations in Medline documents about the model insect, Drosophila melanogaster. For any text query, BSQA exploits entity annotation of retrieved documents to identify important concepts in different categories. By utilizing the extracted relations, BSQA is also able to answer many biologically motivated questions, from simple ones, such as which anatomical part a gene is expressed in, to more complex ones involving multiple types of relations. BSQA is freely available at http://www.beespace.uiuc.edu/QuestionAnswer.

Subject(s)

Data Mining , Genes, Insect , Insecta/genetics , Software , Animals , Behavior, Animal , Drosophila Proteins , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , Drosophila melanogaster/physiology , Gene Expression Regulation , Homeodomain Proteins/genetics , Homeodomain Proteins/metabolism , Insecta/metabolism , Internet , Systems Integration , Trans-Activators/genetics , Trans-Activators/metabolism
