Search | Biblioteca Virtual em Saúde (Virtual Health Library)

Automated Phrase Mining from Massive Text Corpora.

Shang, Jingbo; Liu, Jialu; Jiang, Meng; Ren, Xiang; Voss, Clare R; Han, Jiawei.

IEEE Trans Knowl Data Eng ; 30(10): 1825-1837, 2018 Oct.

Article in English | MEDLINE | ID: mdl-31105412

ABSTRACT

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to perform unsatisfactorily on text corpora of new domains and genres without extra but expensive adaptation. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
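The abstract's central idea, replacing human-labeled phrases with a general knowledge base, can be illustrated with a small sketch. This is not the AutoPhrase implementation: the function names `candidate_ngrams` and `label_candidates`, the whitespace tokenization, and the `min_count` threshold are all assumptions made for illustration. Frequent n-grams that match a knowledge-base title are treated as positive examples, and the remaining frequent n-grams form an unlabeled (noisy) pool.

```python
from collections import Counter


def candidate_ngrams(tokens, max_len=3):
    """Yield every contiguous n-gram (as a tuple) with 2..max_len tokens."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def label_candidates(corpus, kb_titles, min_count=2, max_len=3):
    """Split frequent n-grams into KB-matched positives and an unlabeled pool.

    corpus    -- list of document strings (tokenized on whitespace; an assumption)
    kb_titles -- iterable of knowledge-base entry titles, e.g. article titles
    """
    counts = Counter()
    for doc in corpus:
        counts.update(candidate_ngrams(doc.lower().split(), max_len))

    # Keep only candidates that occur often enough to be worth scoring.
    frequent = {gram for gram, c in counts.items() if c >= min_count}

    # Knowledge-base titles act as the positive labels; no human labeling.
    kb = {tuple(title.lower().split()) for title in kb_titles}
    positives = frequent & kb
    unlabeled = frequent - kb
    return positives, unlabeled
```

In a full system the two pools would feed a phrase-quality classifier; the sketch stops at the labeling step because the abstract does not specify the classifier.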

Representing Documents via Latent Keyphrase Inference.

Liu, Jialu; Ren, Xiang; Shang, Jingbo; Cassidy, Taylor; Voss, Clare R; Han, Jiawei.

Proc Int World Wide Web Conf ; 2016: 1057-1067, 2016 Apr.

Article in English | MEDLINE | ID: mdl-28229132

ABSTRACT

Many text mining approaches adopt bag-of-words or n-gram models to represent documents. Looking beyond just the words, i.e., the explicit surface forms, in a document can improve a computer's understanding of text. Being aware of this, researchers have proposed concept-based models that rely on a human-curated knowledge base to incorporate other related concepts in the document representation. But these methods are not desirable when applied to vertical domains (e.g., literature, enterprise) due to low coverage of in-domain concepts in the general knowledge base and interference from out-of-domain concepts. In this paper, we propose a data-driven model named Latent Keyphrase Inference (LAKI) that represents documents with a vector of closely related domain keyphrases instead of single words or existing concepts in the knowledge base. We show that given a corpus of in-domain documents, topical content units can be learned for each domain keyphrase, which enables a computer to do smart inference to discover latent document keyphrases, going beyond just explicit mentions. Compared with the state-of-the-art document representation approaches, LAKI fills the gap between bag-of-words and concept-based models by using domain keyphrases as the basic representation unit. It removes dependency on a knowledge base while providing, with keyphrases, readily interpretable representations. When evaluated against 8 other methods on two text mining tasks over two corpora, LAKI outperformed all.
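A toy sketch can show what it means to represent a document as a vector of keyphrases that need not appear in the text. This is not the LAKI model: the function `keyphrase_vector`, the weighted-overlap score, and the hand-written `content_units` table are assumptions for illustration. In the paper the content units are learned from an in-domain corpus and the inference is probabilistic.

```python
def keyphrase_vector(document, content_units):
    """Score each keyphrase by the weighted share of its content units found.

    document      -- raw text, tokenized on whitespace (an assumption)
    content_units -- hypothetical mapping: keyphrase -> {unit word: weight}
    Returns a dict mapping each keyphrase to a score in [0, 1].
    """
    tokens = set(document.lower().split())
    vector = {}
    for keyphrase, units in content_units.items():
        total = sum(units.values())
        matched = sum(weight for unit, weight in units.items() if unit in tokens)
        vector[keyphrase] = matched / total if total else 0.0
    return vector
```

With such a table, a document that mentions "LDA" and "Dirichlet" can receive a non-zero score for the keyphrase "topic modeling" even though that phrase never occurs in it, which is the sense in which a keyphrase is latent.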

ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering.

Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Tao, Fangbo; Voss, Clare R; Ji, Heng; Han, Jiawei.

KDD ; 2015: 995-1004, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26705503

ABSTRACT

Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition, with increased name ambiguity and context sparsity, and requires entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. Then we predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks, type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres (news, Yelp reviews, and tweets) demonstrate the effectiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the best compared method.
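The notion of predicting an entity's type from the type signatures of its co-occurring relation phrases can be sketched with simple vote counting. This is not the ClusType joint optimization: there is no soft clustering of relation phrases and no surface-name indicator here, and the function `propagate_types`, the triple format, and the seed dictionary are assumptions for illustration.

```python
from collections import Counter, defaultdict


def propagate_types(triples, seed_types):
    """Assign types to unlabeled mentions by voting through relation phrases.

    triples    -- list of (subject, relation_phrase, object) strings
    seed_types -- mentions whose type is already known (distant supervision)
    Returns a dict mapping each unlabeled mention to its most-voted type.
    """
    # Step 1: build a type signature for each (relation phrase, argument slot)
    # from the seed mentions that fill that slot.
    signatures = defaultdict(Counter)
    for subj, rel, obj in triples:
        for slot, mention in (("subj", subj), ("obj", obj)):
            if mention in seed_types:
                signatures[(rel, slot)][seed_types[mention]] += 1

    # Step 2: each unlabeled mention collects votes from the signatures of
    # the relation-phrase slots it occupies.
    votes = defaultdict(Counter)
    for subj, rel, obj in triples:
        for slot, mention in (("subj", subj), ("obj", obj)):
            if mention not in seed_types:
                votes[mention].update(signatures[(rel, slot)])

    return {m: c.most_common(1)[0][0] for m, c in votes.items() if c}
```

A single pass like this only reaches mentions that share a relation phrase with a seed; the clustering of relation phrases described in the abstract is what lets type information flow between different phrasings of a similar relation.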
