Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach.

Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Han, Jiawei.

KDD ; 2015: 2319-2320, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26705508

ABSTRACT

In today's computerized and information-based society, we are inundated with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To unlock the value of this unstructured text data from various domains, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets, including news articles and tweets, how these typed entities aid in knowledge discovery and management.

ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering.

Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Tao, Fangbo; Voss, Clare R; Ji, Heng; Han, Jiawei.

KDD ; 2015: 995-1004, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26705503

ABSTRACT

Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition with increased name ambiguity and context sparsity, requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. Then we predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks: type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres (news, Yelp reviews and tweets) demonstrate the effectiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the best compared method.
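
As a rough illustration of the type-propagation idea described in the abstract above, the following minimal Python sketch infers the type of an unlabeled mention from the type signatures of the relation phrases it co-occurs with. It is not the ClusType algorithm: there is no phrase mining, no soft clustering of relation phrases, no surface-name type indicator, and no joint optimization. The function name `predict_types`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def predict_types(triples, seed_types):
    """Toy type propagation through relation phrases (illustrative only).

    triples: iterable of (left_mention, relation_phrase, right_mention).
    seed_types: mention -> type for seed mentions (e.g. from distant supervision).
    Returns mention -> predicted type for unseeded mentions that have evidence.
    """
    # Step 1: build a type signature for each (relation phrase, argument slot)
    # by counting the seed types observed in that slot.
    signatures = defaultdict(Counter)
    for left, rel, right in triples:
        if left in seed_types:
            signatures[(rel, "left")][seed_types[left]] += 1
        if right in seed_types:
            signatures[(rel, "right")][seed_types[right]] += 1

    # Step 2: each unseeded mention accumulates votes from the signatures
    # of the relation-phrase slots it fills.
    votes = defaultdict(Counter)
    for left, rel, right in triples:
        if left not in seed_types:
            votes[left].update(signatures[(rel, "left")])
        if right not in seed_types:
            votes[right].update(signatures[(rel, "right")])

    # Step 3: keep the most-voted type; mentions with no votes are skipped.
    return {m: c.most_common(1)[0][0] for m, c in votes.items() if c}


# Hypothetical toy corpus: two seeded sentences and one unseeded sentence.
triples = [
    ("Obama", "was born in", "Hawaii"),
    ("Merkel", "was born in", "Hamburg"),
    ("Macron", "was born in", "Amiens"),
]
seed_types = {
    "Obama": "person",
    "Merkel": "person",
    "Hawaii": "location",
    "Hamburg": "location",
}
predicted = predict_types(triples, seed_types)
```

In this toy setting, the unseeded mentions `Macron` and `Amiens` inherit their types from the left and right argument slots of the shared relation phrase "was born in".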

Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks.

Wang, Chenguang; Song, Yangqiu; El-Kishky, Ahmed; Roth, Dan; Zhang, Ming; Han, Jiawei.

KDD ; 2015: 1215-1224, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26705504

ABSTRACT

One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. The key challenges are then how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base that is automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
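
To make the phrase "heterogeneous information network" in the abstract above more concrete, here is a minimal Python sketch of one possible representation: nodes carry a type label (document, word, entity, type) and edges connect documents to their words and entities, and entities to their types. This is only an assumed toy data structure, not the representation or clustering algorithm used in the paper. The function `build_hin`, the dictionary `entity_types` (standing in for a knowledge-base lookup), and the example documents are hypothetical.

```python
from collections import defaultdict


def build_hin(docs, entity_types):
    """Build a toy heterogeneous network as an undirected adjacency map.

    docs: doc_id -> list of tokens.
    entity_types: token -> entity type (stand-in for a knowledge-base lookup).
    Nodes are (node_type, name) tuples so that different node types stay distinct.
    """
    edges = defaultdict(set)

    def link(a, b):
        # Store each edge in both directions (undirected network).
        edges[a].add(b)
        edges[b].add(a)

    for doc_id, tokens in docs.items():
        doc_node = ("document", doc_id)
        for token in tokens:
            if token in entity_types:
                # Known entities connect to the document and to their type.
                entity_node = ("entity", token)
                link(doc_node, entity_node)
                link(entity_node, ("type", entity_types[token]))
            else:
                # All other tokens become plain word nodes.
                link(doc_node, ("word", token))
    return edges


# Hypothetical toy input.
docs = {
    "d1": ["Obama", "visited", "Hawaii"],
    "d2": ["Merkel", "visited", "Hamburg"],
}
entity_types = {
    "Obama": "person",
    "Merkel": "person",
    "Hawaii": "location",
    "Hamburg": "location",
}
hin = build_hin(docs, entity_types)
```

In this sketch the two documents are connected both through the shared word node `visited` and through the shared type nodes `person` and `location`, which is the kind of indirect link that typed world knowledge can add on top of plain word overlap.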
