Balinese story texts dataset for narrative text analyses.

Bimantara, I Made Satria; Purwitasari, Diana; Er, Ngurah Agus Sanjaya; Natha, Putu Gede Suarya

Affiliation

Bimantara IMS; Informatics Department, Faculty of Intelligent Electrical and Informatics Technology, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia.
Purwitasari D; Informatics Department, Faculty of Intelligent Electrical and Informatics Technology, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia.
Er NAS; Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Badung 80361, Indonesia.
Natha PGS; Old Javanese Literatures Study Program, Faculty of Humanities, Udayana University, Badung 80361, Indonesia.

Data Brief ; 56: 110781, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39252773

ABSTRACT

Automatic narrative text analysis is gaining traction as artificial intelligence-based computational linguistic tools, such as named entity recognition systems and natural language processing (NLP) toolkits, become more prevalent. Character identification is the first stage in narrative text analysis; however, it is difficult because of the diversity of character appearances and the distinctive characteristics among regions. More challenging analyses, such as role classification, emotion and personality profiling, and character network development, depend on successful character identification. Because so many annotated English datasets exist, computational linguistic tools are mostly focused on English literature. In contrast, few tools are available for analyzing Balinese story texts because datasets for this low-resource language are scarce. This study presents the first annotated Balinese story texts dataset for narrative text analyses, consisting of four sub-datasets for character identification, alias clustering (named entity linking, alias resolution), and character classification. The dataset is a compilation of 120 manually annotated Balinese stories from books and public websites, spanning multiple genres such as folk tales, fairy tales, fables, and mythology. Two Balinese native speakers, including an expert in sociolinguistics and macrolinguistics, annotated the dataset using predetermined guidelines set by an expert. The inter-annotator agreement (IAA) score is calculated using Cohen's Kappa Coefficient, the Jaccard Similarity Coefficient, and the mean F1-score to measure the level of agreement between annotators and the consistency and reliability of the dataset. The first sub-dataset consists of 89,917 annotated words with five labels referring to the Balinese character named entities. Each character entity's appearance in 6,634 sentences is further annotated in the second sub-dataset.
These two sub-datasets can be used for character identification at the word and sentence levels. The third sub-dataset annotates the list of character groups, where each group collects the various aliases of one character entity, for alias clustering purposes. It contains 930 character groups from 120 story texts, with each story text containing an average of 7 to 8 character groups. In the fourth sub-dataset, 848 of the 930 character groups in the third sub-dataset have been categorized as protagonists or antagonists. Protagonists (66.16%) make up most of these character groups, with antagonists (33.84%) making up the rest. The fourth sub-dataset can be used for computational classification of characters into two roles, protagonist and antagonist. These datasets have the potential to improve research in narrative text analyses, especially in the areas of computational linguistic tools and advanced machine learning (ML) and deep learning (DL) models for low-resource languages. They can also be used for further research, including character network development, character relationship extraction, and character classification beyond protagonist and antagonist.
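
The agreement measures named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors computed them, so everything here is an assumption: the function names, the label strings `CHAR` and `O`, and the alias placeholders are hypothetical, and the set-based F1 is only one plausible reading of "mean F1-score".

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (e.g. words)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def jaccard(items_a, items_b):
    """Jaccard similarity between two annotators' sets (e.g. aliases in one character group)."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def set_f1(items_a, items_b):
    """F1 between two sets, treating one annotator as the reference (symmetric for sets)."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical word-level labels from two annotators (not taken from the dataset).
annotator_1 = ["CHAR", "O", "O", "CHAR"]
annotator_2 = ["CHAR", "O", "CHAR", "CHAR"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Hypothetical alias groups for one character, as placeholders.
group_1 = ["alias_a", "alias_b"]
group_2 = ["alias_a", "alias_b", "alias_c"]
jac = jaccard(group_1, group_2)
f1 = set_f1(group_1, group_2)
```

Under these assumptions, kappa suits the word-level and sentence-level labels, while the set-based measures suit the alias groups, where annotators produce sets rather than one label per item.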

Keywords

Alias clustering; Automatic narrative text understanding; Character classification; Character extraction; Character identification; Computational linguistic; Named entity linking; Named entity recognition

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Data Brief Publication year: 2024 Document type: Article Country of affiliation: Indonesia Country of publication: Netherlands