Results 1 - 20 of 48
1.
J Appl Stat ; 51(10): 1878-1893, 2024.
Article in English | MEDLINE | ID: mdl-39071253

ABSTRACT

As the online market grows rapidly, people rely more on product reviews when making purchases. Hence, many companies and researchers are interested in analyzing product reviews, which are essentially text data. In the current literature, it is common to use only text analysis tools to analyze text datasets. In our work, we propose a method that combines a text analysis method, topic modeling, with a statistical network model to build a network among individuals and find interesting communities. We introduce a promising framework that uses topic modeling to define the edges among individuals and form a network, and uses stochastic blockmodels (SBM) to find the communities. The power of the proposed method is demonstrated in a real-world application to an Amazon product review dataset.
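The abstract does not say how topic-model output is turned into edges. As a minimal sketch of one plausible construction, the code below links two reviewers when the cosine similarity of their topic-proportion vectors reaches a cut-off. The function names, the `threshold` value, and the toy data are all assumptions made for illustration, not the authors' procedure; the resulting edge set is what a stochastic blockmodel would then take as input.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_edges(topic_weights, threshold=0.8):
    """Return undirected edges (a, b) between reviewers with similar topic profiles.

    topic_weights maps a reviewer id to a list of topic proportions.
    threshold is a hypothetical tuning parameter.
    """
    edges = set()
    for a, b in combinations(sorted(topic_weights), 2):
        if cosine(topic_weights[a], topic_weights[b]) >= threshold:
            edges.add((a, b))
    return edges


# Hypothetical topic proportions for four reviewers over three topics.
reviewers = {
    "r1": [0.9, 0.1, 0.0],
    "r2": [0.8, 0.2, 0.0],
    "r3": [0.0, 0.1, 0.9],
    "r4": [0.1, 0.0, 0.9],
}
edges = build_edges(reviewers)
print(sorted(edges))
```

With this toy input only the pairs that share a dominant topic are linked, which gives the two-block structure a community-detection step would be expected to recover.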

2.
Pain Manag ; 14(4): 183-194, 2024.
Article in English | MEDLINE | ID: mdl-38717373

ABSTRACT

Background: Chronic neck and low back pain are very common and have detrimental effects for people and society. In this study, we explore the experiences of individuals with neck and/or back pain using a written narrative methodology. Materials & methods: A total of 92 individuals explained their pain experience using written narratives. Narratives were analyzed through thematic analysis and text data mining. Results: Participants wrote about their experience in terms of pain characteristics, diagnosis process, pain consequences, coping strategies, pain triggers, well-being and future expectations. Text data mining allowed us to identify concurrent networks that were basically related with pain characteristics, management and triggers. Conclusion: Written narratives are useful to understand individuals' experiences from their point of view.


Subjects
Chronic Pain , Low Back Pain , Narration , Neck Pain , Humans , Low Back Pain/psychology , Low Back Pain/therapy , Low Back Pain/diagnosis , Male , Female , Chronic Pain/psychology , Chronic Pain/therapy , Chronic Pain/diagnosis , Neck Pain/psychology , Neck Pain/therapy , Neck Pain/diagnosis , Adult , Middle Aged , Adaptation, Psychological , Aged , Young Adult , Qualitative Research
3.
J Forensic Sci ; 69(4): 1289-1303, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38558223

ABSTRACT

We investigate likelihood ratio models motivated by digital forensics problems involving time-stamped user-generated event data from a device or account. Of specific interest are scenarios where the data may have been generated by a single individual (the device/account owner) or by two different individuals (the device/account owner and someone else), such as instances in which an account was hacked or a device was stolen before being associated with a crime. Existing likelihood ratio methods in this context require that a precise time is specified at which the device or account is purported to have changed hands (the changepoint); this is the known changepoint likelihood ratio model. In this paper, we develop a likelihood ratio model that instead accommodates uncertainty in the changepoint using Bayesian techniques, that is, an unknown changepoint likelihood ratio model. We show that the likelihood ratio in this case can be calculated in closed form as an expression that is straightforward to compute. In experiments with simulated changepoints using real-world data sets, the results demonstrate that the unknown changepoint model attains comparable performance to the known changepoint model that uses a perfectly specified changepoint, and considerably outperforms the known changepoint model that uses a misspecified changepoint, illustrating the benefit of capturing uncertainty in the changepoint.
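To make the idea of averaging over an uncertain changepoint concrete, here is a toy sketch. It is not the paper's model: it assumes daily event counts that are Poisson with a known rate for the owner and a known rate for a second user, and a discrete prior over the handover day. Because the single-user likelihood does not depend on the changepoint, the unknown-changepoint ratio reduces to a prior-weighted average of known-changepoint ratios. All names, rates, and counts below are hypothetical.

```python
from math import exp


def poisson_ratio(count, rate_other, rate_owner):
    """Pois(count; rate_other) / Pois(count; rate_owner); the factorials cancel."""
    return (rate_other / rate_owner) ** count * exp(-(rate_other - rate_owner))


def known_changepoint_lr(counts, tau, rate_owner, rate_other):
    """LR of 'two users, handover at index tau' versus 'owner only'."""
    lr = 1.0
    for count in counts[tau:]:
        lr *= poisson_ratio(count, rate_other, rate_owner)
    return lr


def unknown_changepoint_lr(counts, prior, rate_owner, rate_other):
    """Average the known-changepoint LR over a discrete prior on tau."""
    return sum(
        weight * known_changepoint_lr(counts, tau, rate_owner, rate_other)
        for tau, weight in prior.items()
    )


# Hypothetical daily event counts; activity jumps from day index 3 onwards.
counts = [2, 3, 2, 9, 8, 10]
uniform_prior = {tau: 1 / 5 for tau in range(1, 6)}
lr_unknown = unknown_changepoint_lr(counts, uniform_prior, 2.5, 9.0)
print(lr_unknown)
```

Putting all prior mass on a single day recovers the known-changepoint ratio, which is a quick sanity check on the averaging step.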

4.
Stud Health Technol Inform ; 310: 1584-1585, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38426881

ABSTRACT

This study examined the effects of language differences between Korean and English on the performance of natural language processing in the classification task of identifying inpatient falls from unstructured nursing notes.


Subjects
Deep Learning , Humans , Accidental Falls/prevention & control , Inpatients , Electronic Health Records , Language , Natural Language Processing
5.
Diagnostics (Basel) ; 14(2)2024 Jan 08.
Article in English | MEDLINE | ID: mdl-38248014

ABSTRACT

This study aims to establish advanced sampling methods in free-text data for efficiently building semantic text mining models using deep learning, such as identifying vertebral compression fracture (VCF) in radiology reports. We enrolled a total of 27,401 radiology free-text reports of X-ray examinations of the spine. The predictive effects were compared between text mining models built using supervised long short-term memory networks, independently derived by four sampling methods: vector sum minimization, vector sum maximization, stratified, and simple random sampling, using four fixed percentages. The drawn samples were applied to the training set, and the remaining samples were used to validate each group using different sampling methods and ratios. The predictive accuracy was measured using the area under the receiver operating characteristics (AUROC) to identify VCF. At the sampling ratios of 1/10, 1/20, 1/30, and 1/40, the highest AUROC was revealed in the sampling methods of vector sum minimization as confidence intervals of 0.981 (95%CIs: 0.980-0.983)/0.963 (95%CIs: 0.961-0.965)/0.907 (95%CIs: 0.904-0.911)/0.895 (95%CIs: 0.891-0.899), respectively. The lowest AUROC was demonstrated in the vector sum maximization. This study proposes an advanced sampling method, vector sum minimization, in free-text data that can be efficiently applied to build the text mining models by smartly drawing a small amount of critical representative samples.
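The AUROC figures reported above can be read as the probability that a randomly chosen positive report receives a higher model score than a randomly chosen negative one. The sketch below computes that quantity directly from pairwise comparisons, counting ties as one half. The labels and scores are made up for illustration, and the quadratic loop is only meant for small examples.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fracture reported) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auroc(labels, scores))
```

Here eight of the nine positive-negative pairs are ordered correctly, so the value is 8/9.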

6.
BMC Bioinformatics ; 25(1): 23, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38216898

ABSTRACT

BACKGROUND: With the exponential growth of high-throughput technologies, multiple pathway analysis methods have been proposed to estimate pathway activities from gene expression profiles. These pathway activity inference methods can be divided into two main categories: non-Topology-Based (non-TB) and Pathway Topology-Based (PTB) methods. Although some review and survey articles discussed the topic from different aspects, there is a lack of systematic assessment and comparisons on the robustness of these approaches. RESULTS: Thus, this study presents comprehensive robustness evaluations of seven widely used pathway activity inference methods using six cancer datasets based on two assessments. The first assessment seeks to investigate the robustness of pathway activity in pathway activity inference methods, while the second assessment aims to assess the robustness of risk-active pathways and genes predicted by these methods. The mean reproducibility power and total number of identified informative pathways and genes were evaluated. Based on the first assessment, the mean reproducibility power of pathway activity inference methods generally decreased as the number of pathway selections increased. Entropy-based Directed Random Walk (e-DRW) distinctly outperformed other methods in exhibiting the greatest reproducibility power across all cancer datasets. On the other hand, the second assessment shows that no methods provide satisfactory results across datasets. CONCLUSION: However, PTB methods generally appear to perform better in producing greater reproducibility power and identifying potential cancer markers compared to non-TB methods.


Subjects
Neoplasms , Humans , Reproducibility of Results , Neoplasms/genetics , Entropy , Gene Expression
7.
JMIR Res Protoc ; 12: e48521, 2023 Nov 09.
Article in English | MEDLINE | ID: mdl-37943599

ABSTRACT

BACKGROUND: Hospital-induced delirium is one of the most common and costly iatrogenic conditions, and its incidence is predicted to increase as the population of the United States ages. An academic and clinical interdisciplinary systems approach is needed to reduce the frequency and impact of hospital-induced delirium. OBJECTIVE: The long-term goal of our research is to enhance the safety of hospitalized older adults by reducing iatrogenic conditions through an effective learning health system. In this study, we will develop models for predicting hospital-induced delirium. In order to accomplish this objective, we will create a computable phenotype for our outcome (hospital-induced delirium), design an expert-based traditional logistic regression model, leverage machine learning techniques to generate a model using structured data, and use machine learning and natural language processing to produce an integrated model with components from both structured data and text data. METHODS: This study will explore text-based data, such as nursing notes, to improve the predictive capability of prognostic models for hospital-induced delirium. By using supervised and unsupervised text mining in addition to structured data, we will examine multiple types of information in electronic health record data to predict medical-surgical patient risk of developing delirium. Development and validation will be compliant to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. RESULTS: Work on this project will take place through March 2024. For this study, we will use data from approximately 332,230 encounters that occurred between January 2012 to May 2021. Findings from this project will be disseminated at scientific conferences and in peer-reviewed journals. 
CONCLUSIONS: Success in this study will yield a durable, high-performing research-data infrastructure that will process, extract, and analyze clinical text data in near real time. This model has the potential to be integrated into the electronic health record and provide point-of-care decision support to prevent harm and improve quality of care. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/48521.

8.
Digit Health ; 9: 20552076231203672, 2023.
Article in English | MEDLINE | ID: mdl-37846404

ABSTRACT

Objective: Digital twins (DTs) have received widespread attention recently, providing new ideas and possibilities for future healthcare. This review aims to provide a quantitative review to analyze specific study contents, research focus, and trends of DT in healthcare. Simultaneously, this review intends to expand the connotation of "healthcare" into two directions, namely "Disease treatment" and "Health enhancement" to analyze the content within the "DT + healthcare" field thoroughly. Methods: A data mining method named Structure Topic Modeling (STM) was used as the analytical tool due to its topic analysis ability and versatility. Google Scholar, Web of Science, and China National Knowledge Infrastructure supplied the material papers in this review. Results: A total of 94 high-quality papers published between 2018 and 2022 were gathered and categorized into eight topics, collectively covering the transformative impact across a broader spectrum in healthcare. Three main findings have emerged: (1) papers published in healthcare predominantly concentrate on technology development (artificial intelligence, Internet of Things, etc.) and application scenarios(personalized, precise, and real-time health service); (2) the popularity of research topics is influenced by various factors, including policies, COVID-19, and emerging technologies; and (3) the preference for topics is diverse, with a general inclination toward the attribute of "Health enhancement." Conclusions: This review underscores the significance of real-time capability and accuracy in shaping the future of DT, where algorithms and data transmission methods assume central importance in achieving these goals. Moreover, technological advancements, such as omics and Metaverse, have opened up new possibilities for DT in healthcare. These findings contribute to the existing literature by offering quantitative insights and valuable guidance to keep researchers ahead of the curve.

9.
MethodsX ; 11: 102339, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37693657

ABSTRACT

The need for technical support for data handling and visualization solutions has increased in tandem with the complexity of today's data and information, which come from multiple sources, are huge in size, and arrive in different formats. This study focuses on handling and analyzing text-based data. Despite many available text analysis tools, there is a high demand among researchers for easy-to-use yet scalable tools with incomparable visualization features. Recently, there has been a significant focus on utilizing VOSviewer, an open-source software for bibliometric analysis. This software is able to analyze a significant amount of data and provide excellent network data mapping. However, there is a lack of existing work evaluating this sophisticated tool for text analysis. Thus, this article explores the capability of VOSviewer and presents an evidence-based implementation of this software for text analysis. Specifically, this study demonstrates the usage of VOSviewer to analyze text based on YouTube interviews related to ChatGPT. Hence, this study significantly contributes by processing textual data and producing visualization network maps that are different from bibliometric data. The study recognizes VOSviewer as a powerful tool for data visualization in mapping text data and illustrates the potential of this software for analyzing text networks in various fields.
• The study illustrates how text analysis and visualization can be realized using VOSviewer, an open-source software mostly used for bibliometric analysis.
• The study presents the workflow indicating how the dataset can be prepared as input for VOSviewer for text analysis.
• The study proves that VOSviewer is a powerful tool for data visualization and network mapping for any type of network data including transcripts from social media.

10.
BMC Med Inform Decis Mak ; 23(1): 132, 2023 07 22.
Article in English | MEDLINE | ID: mdl-37481523

ABSTRACT

BACKGROUND: Topic models are a class of unsupervised machine learning models, which facilitate summarization, browsing and retrieval from large unstructured document collections. This study reviews several methods for assessing the quality of unsupervised topic models estimated using non-negative matrix factorization. Techniques for topic model validation have been developed across disparate fields. We synthesize this literature, discuss the advantages and disadvantages of different techniques for topic model validation, and illustrate their usefulness for guiding model selection on a large clinical text corpus. DESIGN, SETTING AND DATA: Using a retrospective cohort design, we curated a text corpus containing 382,666 clinical notes collected between 01/01/2017 through 12/31/2020 from primary care electronic medical records in Toronto Canada. METHODS: Several topic model quality metrics have been proposed to assess different aspects of model fit. We explored the following metrics: reconstruction error, topic coherence, rank biased overlap, Kendall's weighted tau, partition coefficient, partition entropy and the Xie-Beni statistic. Depending on context, cross-validation and/or bootstrap stability analysis were used to estimate these metrics on our corpus. RESULTS: Cross-validated reconstruction error favored large topic models (K ≥ 100 topics) on our corpus. Stability analysis using topic coherence and the Xie-Beni statistic also favored large models (K = 100 topics). Rank biased overlap and Kendall's weighted tau favored small models (K = 5 topics). Few model evaluation metrics suggested mid-sized topic models (25 ≤ K ≤ 75) as being optimal. However, human judgement suggested that mid-sized topic models produced expressive low-dimensional summarizations of the corpus. CONCLUSIONS: Topic model quality indices are transparent quantitative tools for guiding model selection and evaluation. Our empirical illustration demonstrated that different topic model quality indices favor models of different complexity; and may not select models aligning with human judgment. This suggests that different metrics capture different aspects of model goodness of fit. A combination of topic model quality indices, coupled with human validation, may be useful in appraising unsupervised topic models.
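Two of the metrics named above, the partition coefficient and the partition entropy, come from the fuzzy-clustering literature and are simple to compute once each document's topic weights are rescaled to sum to one. The sketch below assumes that rescaling step (the abstract does not state how the authors normalised their factor matrices), and the function names are illustrative. The partition coefficient ranges from 1/K for uniform memberships to 1 for one-hot memberships; the partition entropy ranges from 0 for one-hot memberships to log K for uniform ones.

```python
from math import log


def normalise_rows(weights):
    """Scale each non-negative row to sum to 1 so it reads as topic memberships."""
    normalised = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each document needs at least one positive topic weight")
        normalised.append([w / total for w in row])
    return normalised


def partition_coefficient(memberships):
    """Mean over documents of the sum of squared memberships."""
    return sum(sum(u * u for u in row) for row in memberships) / len(memberships)


def partition_entropy(memberships):
    """Mean over documents of the Shannon entropy of the membership row."""
    total = 0.0
    for row in memberships:
        total -= sum(u * log(u) for u in row if u > 0)
    return total / len(memberships)


# Hypothetical document-topic weights for K = 2 topics.
crisp = normalise_rows([[2.0, 0.0], [0.0, 5.0]])   # each document in one topic
fuzzy = normalise_rows([[1.0, 1.0], [3.0, 3.0]])   # each document split evenly
print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(fuzzy), partition_entropy(fuzzy))
```

Both indices reward documents that load on few topics, which is one reason they can disagree with metrics that reward reconstruction accuracy.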


Subjects
Algorithms , Benchmarking , Humans , Retrospective Studies , Canada , Electronic Health Records
11.
SN Comput Sci ; 4(2): 190, 2023.
Article in English | MEDLINE | ID: mdl-36748096

ABSTRACT

Sentiment analysis is an effective technique for mining opinions from unstructured data containing text, such as product reviews and movie reviews. It is used as a key tool to gather responses from consumers, reviews of brands, marketing analyses, and political campaigns. In the field of natural language processing, sentiment analysis of data obtained from Twitter is considered a recent line of study. The dataset is gathered using the Twitter API and the Twitter package. Twitter data are analyzed automatically through text data analysis to determine the public's view on a specified topic. Here, an improvised sentiment analysis model is proposed to identify the polarity of tweets as positive, neutral, or negative. In this paper, a stochastic gradient descent (SGD) algorithm is used with a stochastic gradient neural network (SGNN) to categorize the sentiment of tweets provided by Twitter users, and the proposed stochastic gradient descent optimization algorithm based on a stochastic gradient neural network (SGDOA-SGNN) provides better performance than the existing Forest-Whale Optimization Algorithm based on a deep neural network (F-WOA-DNN) model.

12.
Health Informatics J ; 29(1): 14604582221115667, 2023.
Article in English | MEDLINE | ID: mdl-36639910

ABSTRACT

Background/Objectives: Unsupervised topic models are often used to facilitate improved understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish concurrent-, convergent- and discriminant-validity of learned topic models. Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada between 01/01/2017 through 12/31/2020. Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified ICD-9 diagnostic codes most strongly associated with each latent topical vector; and qualitatively interpreted how these codes could be used for external validation of the learned topic model. Results: The DTM consisted of 382,666 documents and 2210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions, etc.). Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.
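The abstract does not name the association measure used between each Boolean diagnostic code and each continuous topic vector. One simple candidate is the point-biserial correlation, which is just the Pearson correlation when one variable is coded 0/1. The sketch below ranks codes by that correlation for a single topic; the code labels `code_A` and `code_B`, the function names, and the numbers are placeholders rather than real ICD-9 entries or study data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; equals the point-biserial correlation when xs is 0/1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_codes(code_flags, topic_weights):
    """Return (code, correlation) pairs sorted from strongest positive association down."""
    pairs = [(code, pearson(flags, topic_weights)) for code, flags in code_flags.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: four notes, one topic, two placeholder diagnostic codes.
topic = [0.9, 0.8, 0.1, 0.0]
codes = {
    "code_A": [1, 1, 0, 0],
    "code_B": [0, 1, 0, 1],
}
ranked = rank_codes(codes, topic)
print(ranked)
```

The top-ranked codes for a topic can then be compared with the topic's highest-weight words to judge whether the two sources agree semantically.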


Subjects
Electronic Health Records , International Classification of Diseases , Humans , Female , Retrospective Studies , Learning , Primary Health Care
13.
Behav Res Methods ; 55(8): 4455-4477, 2023 12.
Article in English | MEDLINE | ID: mdl-36443583

ABSTRACT

Understanding what groups stand for is integral to a diverse array of social processes, ranging from understanding political conflicts to organisational behaviour to promoting public health behaviours. Traditionally, researchers rely on self-report methods such as interviews and surveys to assess groups' collective self-understandings. Here, we demonstrate the value of using naturally occurring online textual data to map the similarities and differences between real-world groups' collective self-understandings. We use machine learning algorithms to assess similarities between 15 diverse online groups' linguistic style, and then use multidimensional scaling to map the groups in two-dimensional space (N=1,779,098 Reddit comments). We then use agglomerative and k-means clustering techniques to assess how the 15 groups cluster, finding there are four behaviourally distinct group types - vocational, collective action (comprising political and ethnic/religious identities), relational and stigmatised groups, with stigmatised groups having a less distinctive behavioural profile than the other group types. Study 2 is a secondary data analysis where we find strong relationships between the coordinates of each group in multidimensional space and the groups' values. In Study 3, we demonstrate how this approach can be used to track the development of groups' collective self-understandings over time. Using transgender Reddit data (N= 1,095,620 comments) as a proof-of-concept, we track the gradual politicisation of the transgender group over the past decade. The automaticity of this methodology renders it advantageous for monitoring multiple online groups simultaneously. This approach has implications for both governmental agencies and social researchers more generally. Future research avenues and applications are discussed.


Subjects
Linguistics , Humans , Machine Learning , Social Media
14.
Int J Mach Learn Cybern ; 14(1): 135-150, 2023.
Article in English | MEDLINE | ID: mdl-35432623

ABSTRACT

In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.

15.
Sensors (Basel) ; 22(23)2022 Nov 30.
Article in English | MEDLINE | ID: mdl-36502024

ABSTRACT

This article focuses on the problem of detecting disinformation about COVID-19 in online discussions. As the Internet expands, so does the amount of content on it. In addition to content based on facts, a large amount of content is being manipulated, which negatively affects the whole society. This effect is currently compounded by the ongoing COVID-19 pandemic, which caused people to spend even more time online and to get more invested in this fake content. This work gives a brief overview of what toxic information looks like, how it is spread, and how to potentially prevent its dissemination by early recognition of disinformation using deep learning. We investigated the overall suitability of deep learning for solving the problem of detecting disinformation in conversational content. We also provided a comparison of architectures based on convolutional and recurrent principles. We trained three detection models based on three architectures: CNN (convolutional neural networks), LSTM (long short-term memory), and their combination. We achieved the best results using LSTM (F1 = 0.8741, Accuracy = 0.8628). However, the results of all three architectures were comparable; for example, the CNN+LSTM architecture achieved F1 = 0.8672 and Accuracy = 0.852. The paper finds that introducing a convolutional component does not bring significant improvement. In comparison with our previous works, we noted that of all forms of antisocial posts, disinformation is the most difficult to recognize, since disinformation has no distinctive language of its own, unlike hate speech or toxic posts.
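The two figures quoted for each architecture, accuracy and F1, are both derived from the same confusion counts. The sketch below shows the arithmetic for a binary task. The label vectors are invented for illustration and have nothing to do with the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical labels: 1 = disinformation, 0 = other content.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

F1 ignores true negatives, so it can differ noticeably from accuracy when the classes are imbalanced.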


Subjects
COVID-19 , Deep Learning , Humans , COVID-19/diagnosis , Pandemics , Neural Networks, Computer , Language
16.
JMIR Med Inform ; 10(12): e40102, 2022 Dec 19.
Article in English | MEDLINE | ID: mdl-36534443

ABSTRACT

BACKGROUND: Health care organizations are collecting increasing volumes of clinical text data. Topic models are a class of unsupervised machine learning algorithms for discovering latent thematic patterns in these large unstructured document collections. OBJECTIVE: We aimed to comparatively evaluate several methods for estimating temporal topic models using clinical notes obtained from primary care electronic medical records from Ontario, Canada. METHODS: We used a retrospective closed cohort design. The study spanned from January 01, 2011, through December 31, 2015, discretized into 20 quarterly periods. Patients were included in the study if they generated at least 1 primary care clinical note in each of the 20 quarterly periods. These patients represented a unique cohort of individuals engaging in high-frequency use of the primary care system. The following temporal topic modeling algorithms were fitted to the clinical note corpus: nonnegative matrix factorization, latent Dirichlet allocation, the structural topic model, and the BERTopic model. RESULTS: Temporal topic models consistently identified latent topical patterns in the clinical note corpus. The learned topical bases identified meaningful activities conducted by the primary health care system. Latent topics displaying near-constant temporal dynamics were consistently estimated across models (eg, pain, hypertension, diabetes, sleep, mood, anxiety, and depression). Several topics displayed predictable seasonal patterns over the study period (eg, respiratory disease and influenza immunization programs). 
CONCLUSIONS: Nonnegative matrix factorization, latent Dirichlet allocation, structural topic model, and BERTopic are based on different underlying statistical frameworks (eg, linear algebra and optimization, Bayesian graphical models, and neural embeddings), require tuning unique hyperparameters (optimizers, priors, etc), and have distinct computational requirements (data structures, computational hardware, etc). Despite the heterogeneity in statistical methodology, the learned latent topical summarizations and their temporal evolution over the study period were consistently estimated. Temporal topic models represent an interesting class of models for characterizing and monitoring the primary health care system.

17.
Front Comput Neurosci ; 16: 992296, 2022.
Article in English | MEDLINE | ID: mdl-36185709

ABSTRACT

Text categorization is an effective activity that can be accomplished using a variety of classification algorithms. In machine learning, the classifier is built by learning the features of categories from a set of preset training data. Similarly, deep learning offers enormous benefits for text classification since it performs highly accurately with lower-level engineering and processing. This paper employs machine and deep learning techniques to classify textual data. Textual data contains much useless information and must therefore be pre-processed. We clean the data, impute missing values, and eliminate the repeated columns. Next, we employ machine learning algorithms (logistic regression, random forest, and K-nearest neighbors (KNN)) and deep learning algorithms (long short-term memory (LSTM), artificial neural network (ANN), and gated recurrent unit (GRU)) for classification. Results reveal that LSTM achieves 92% accuracy, outperforming all other models and baseline studies.

18.
Front Public Health ; 10: 1000313, 2022.
Article in English | MEDLINE | ID: mdl-36187685

ABSTRACT

Under the epidemic situation of COVID-19, university students have different levels of anxiety, depression, and other psychological problems, and these differing levels present different challenges. Therefore, universities and relevant departments should carry out accurate psychological health education for university students. Through research, this paper found that students' psychological problems during the COVID-19 epidemic were mainly reflected in four aspects: depression, interpersonal relationship, sleep and eating disorders, and compulsive behavior. Through the discussion of family of origin, self-awareness and motivation attribution, and social pressure, this paper analyzed the causes of psychological problems. The information resources of the network are usually unstructured data, and the text information, as the most typical unstructured data, occupies a large proportion. Moreover, this text information often contains users' emotional response to major events. In this paper, a data preprocessing system is designed, and three data preprocessing rules are defined: expression data conversion rules, data deduplication rules and invalid data cleaning rules. The characteristics of online community text data are analyzed, and the text feature extraction method is selected according to its characteristics. The results of this study show that the proportion of university students with psychological problems is about 23%, which is slightly higher than the research results during the non-epidemic period. This paper suggests that college students should master methods of self-regulation, improve their levels of physical exercise, improve their physical fitness, and establish and improve their defense mechanisms to alleviate psychological conflicts and pressures.


Subjects
COVID-19 , Mental Health , COVID-19/epidemiology , Depression/epidemiology , Emotions , Humans , SARS-CoV-2 , Students/psychology , Surveys and Questionnaires
19.
Data Brief ; 44: 108545, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36060819

ABSTRACT

With this article, we present a repository containing datasets, analysis code, and some outputs related to a paper in press at Cognition. The data were collected as part of a pre-test, pilot test, and main study, all designed in SurveyGizmo, with participants recruited via Prolific.co (combined N=303). Datasets consist of raw and annotated data, where participant responses are free-text entries about what unexpected events might occur after a series of events presented to them, based on everyday scenarios. The code consists of all computational additions to the data and the analysis carried out for the results presented in the article. This data is released for the purpose of transparency and to allow for reproducibility of the work. This human-labelled data should also be of use to machine learning researchers working on text analytics, natural language processing and sources of common-sense knowledge.

20.
Sensors (Basel) ; 22(17)2022 Aug 27.
Article in English | MEDLINE | ID: mdl-36080927

ABSTRACT

This article focuses on the problem of detecting toxicity in online discussions. Toxicity is currently a serious problem because people are largely influenced by opinions on social networks. We offer a solution based on classification models using machine learning methods to classify short texts on social networks into multiple degrees of toxicity. The classification models used both classic methods of machine learning, such as naïve Bayes and SVM (support vector machine), as well as ensemble methods, such as bagging and RF (random forest). The models were created using text data, which we extracted from social networks in the Slovak language. The labelling of our dataset of short texts into multiple classes (the degrees of toxicity) was provided automatically by our method based on the lexicon approach to text processing. This lexicon method required creating a dictionary of toxic words in the Slovak language, which is another contribution of the work. Finally, an application was created based on the learned machine learning models, which can be used to detect the degree of toxicity of new social network comments as well as for experimentation with various machine learning methods. We achieved the best results using an SVM: average accuracy = 0.89 and F1 = 0.79. This model also outperformed the ensemble learning by the RF and Bagging methods; however, the ensemble learning methods achieved better results than the naïve Bayes method.
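A lexicon approach to automatic labelling can be sketched as: tokenise the text, sum the weights of any dictionary words found, and map the total to a degree of toxicity. The example below is only an illustration under stated assumptions. The lexicon uses English stand-in words rather than the authors' Slovak dictionary, and the weights, the three degree names, and the cut-offs are all invented.

```python
import re

# Hypothetical lexicon: word -> toxicity weight (English stand-ins, not the Slovak dictionary).
TOXIC_LEXICON = {"idiot": 2, "stupid": 1, "trash": 1}


def toxicity_score(text, lexicon=TOXIC_LEXICON):
    """Sum the lexicon weights of every word token in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def toxicity_degree(text, lexicon=TOXIC_LEXICON):
    """Map the score to one of three illustrative degrees (cut-offs are arbitrary)."""
    score = toxicity_score(text, lexicon)
    if score == 0:
        return "non-toxic"
    if score <= 1:
        return "mildly toxic"
    return "toxic"


for comment in ["Have a nice day", "That idea is stupid.", "You stupid idiot!"]:
    print(comment, "->", toxicity_degree(comment))
```

Labels produced this way can serve as training targets for the classifiers, at the cost of inheriting whatever the lexicon misses, such as sarcasm or misspelled insults.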


Subjects
Machine Learning , Support Vector Machine , Bayes Theorem , Humans