ABSTRACT
This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
ABSTRACT
Orthography-semantics consistency (OSC) is a measure that quantifies the degree of semantic relatedness between a word and its orthographic relatives. OSC is computed as the frequency-weighted average semantic similarity between the meaning of a given word and the meanings of all the words containing that same orthographic string, as captured by distributional semantic models. We present a resource including optimized estimates of OSC for 15,017 English words. In a series of analyses, we progressively optimize the OSC variable. We show that computing OSC from word-embedding models (in place of traditional count models), limiting preprocessing of the corpus used for inducing semantic vectors (in particular, avoiding part-of-speech tagging and lemmatization), and relying on a wider pool of orthographic relatives all improve the performance of the measure in a lexical-processing task. We further show that OSC is an important and significant predictor of reaction times in visual word recognition and word naming, and that it correlates only weakly with other psycholinguistic variables (e.g., family size, word frequency), indicating that it captures a novel source of variance in lexical access. Finally, we discuss some theoretical and methodological implications of adopting OSC as a predictor of reaction times in studies of visual word recognition.
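For concreteness, the following is a minimal sketch of the OSC computation described above, assuming embeddings and corpus frequencies are available as plain Python dictionaries. The function names (osc, cosine), the substring-based selection of orthographic relatives, and the toy vectors and frequencies are illustrative assumptions for this example, not part of the paper's released resource.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def osc(word, vectors, frequencies):
    """Orthography-semantics consistency of `word`: the frequency-weighted
    mean cosine similarity between `word` and its orthographic relatives,
    taken here to be all other vocabulary words containing `word` as a
    substring (the paper's exact relative-selection criterion may differ)."""
    target = vectors[word]
    num = den = 0.0
    for rel, vec in vectors.items():
        if rel != word and word in rel:
            f = frequencies.get(rel, 0)
            num += f * cosine(target, vec)
            den += f
    return num / den if den else float("nan")

# Toy usage with made-up 3-d vectors and frequencies:
vecs = {"whisk": np.array([0.9, 0.1, 0.0]),
        "whisker": np.array([0.2, 0.9, 0.1]),
        "whiskers": np.array([0.1, 0.9, 0.2])}
freqs = {"whisker": 120, "whiskers": 80}
print(round(osc("whisk", vecs, freqs), 3))
```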
Subjects
Databases, Factual , Language , Reading , Semantics , Word Association Tests , Behavioral Research/methods , Female , Humans , Male , Psycholinguistics/methods , Reaction Time , Task Performance and Analysis
ABSTRACT
BACKGROUND: Semantic Category Disambiguation (SCD) is the task of assigning the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example Protein to "Fibrin". SCD is relevant to Natural Language Processing tasks such as Named Entity Recognition, coreference resolution and coordination resolution. In this work, we study machine learning-based SCD methods using large lexical resources and approximate string matching, aiming to generalise these methods with regard to domains, lexical resources and the composition of data sets. We specifically consider the applicability of SCD for the purposes of supporting human annotators and acting as a pipeline component for other Natural Language Processing systems. RESULTS: While previous research has mostly cast SCD purely as a classification task, we consider a task setting that allows for multiple semantic categories to be suggested, aiming to minimise the number of suggestions while maintaining high recall. We argue that this setting reflects aspects which are essential for both a pipeline component and when supporting human annotators. We introduce an SCD method based on a recently introduced machine learning-based system and evaluate it on 15 corpora covering biomedical, clinical and newswire texts and ranging in the number of semantic categories from 2 to 91. With appropriate settings, our system maintains an average recall of 99% while reducing the number of candidate semantic categories on average by 65% over all data sets. CONCLUSIONS: Machine learning-based SCD using large lexical resources and approximate string matching is sensitive to the selection and granularity of lexical resources, but generalises well to a wide range of text domains and data sets given appropriate resources and parameter settings. By substantially reducing the number of candidate categories while only very rarely excluding the correct one, our method is shown to be applicable to manual annotation support tasks and use as a high-recall component in text processing pipelines. The introduced system and all related resources are freely available for research purposes at: https://github.com/ninjin/simsem.
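As a rough illustration of the task setting only, and not of the simsem system itself (which is machine-learning based), the sketch below suggests multiple candidate semantic categories for a text span by approximate string matching against a lexical resource. The toy lexicon, the similarity cutoff, and the fall-back-to-all-categories rule for preserving recall are assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: surface string -> semantic category.
LEXICON = {
    "fibrin": "Protein",
    "fibrinogen": "Protein",
    "aspirin": "Chemical",
    "e. coli": "Organism",
}

def suggest_categories(span, lexicon=LEXICON, cutoff=0.75):
    """Suggest a small set of candidate semantic categories for `span`
    via approximate string matching, keeping every category whose best
    lexicon match clears `cutoff`, ranked by that best match."""
    span = span.lower()
    scores = {}
    for entry, category in lexicon.items():
        sim = SequenceMatcher(None, span, entry).ratio()
        if sim >= cutoff:
            scores[category] = max(scores.get(category, 0.0), sim)
    # When nothing clears the cutoff, fall back to all categories so the
    # correct one is never excluded (the high-recall setting above).
    if not scores:
        return sorted(set(lexicon.values()))
    return [c for c, _ in sorted(scores.items(), key=lambda x: -x[1])]

print(suggest_categories("Fibrin"))  # ['Protein']
```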
ABSTRACT
Because of privacy concerns and the expense involved in creating an annotated corpus, existing small annotated corpora might not contain sufficient examples for learning to statistically extract all named entities precisely. In this work, we evaluate the value of automatically generated features based on distributional semantics in machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are distributional semantic features. Adding the n-nearest-words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons but also replace them. This phenomenon is observed when extracting concepts from both biomedical literature and clinical notes.
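The following is a minimal sketch of how an n-nearest-words feature of this kind might be generated from word embeddings and attached to per-token features for an NER learner. The function names, the feature keys, and the assumption of unit-normalised vectors are illustrative, not taken from the paper.

```python
import numpy as np

def n_nearest_words(token, vectors, n=5):
    """Return the n words closest to `token` in embedding space, to serve
    as distributional-semantic features for an NER model.

    vectors: dict mapping word -> unit-normalised np.ndarray, so the dot
    product equals cosine similarity."""
    if token not in vectors:
        return []
    target = vectors[token]
    sims = {w: float(np.dot(target, v))
            for w, v in vectors.items() if w != token}
    return sorted(sims, key=sims.get, reverse=True)[:n]

def token_features(token, vectors):
    """Feature dict for one token, mixing simple surface features with
    the n-nearest-words feature sketched above."""
    feats = {"lower": token.lower(), "is_title": token.istitle()}
    for i, neighbour in enumerate(n_nearest_words(token, vectors)):
        feats[f"nn_{i}"] = neighbour
    return feats
```

In a typical setup, dictionaries like these would be fed to a sequence labeller (e.g., a CRF) alongside conventional orthographic and contextual features.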