1.
Immunity ; 51(4): 696-708.e9, 2019 10 15.
Article in English | MEDLINE | ID: mdl-31618654

ABSTRACT

Signaling abnormalities in immune responses in the small intestine can trigger chronic type 2 inflammation involving interaction of multiple immune cell types. To systematically characterize this response, we analyzed 58,067 immune cells from the mouse small intestine by single-cell RNA sequencing (scRNA-seq) at steady state and after induction of a type 2 inflammatory reaction to ovalbumin (OVA). Computational analysis revealed broad shifts in both cell-type composition and cell programs in response to the inflammation, especially in group 2 innate lymphoid cells (ILC2s). Inflammation induced the expression of exon 5 of Calca, which encodes the alpha-calcitonin gene-related peptide (α-CGRP), in intestinal KLRG1+ ILC2s. α-CGRP antagonized KLRG1+ ILC2s proliferation but promoted IL-5 expression. Genetic perturbation of α-CGRP increased the proportion of intestinal KLRG1+ ILC2s. Our work highlights a model where α-CGRP-mediated neuronal signaling is critical for suppressing ILC2 expansion and maintaining homeostasis of the type 2 immune machinery.


Subjects
Calcitonin Gene-Related Peptide/metabolism, Inflammation/immunology, Intestines/immunology, Lymphocytes/immunology, Neuropeptides/metabolism, Animals, Calcitonin Gene-Related Peptide/genetics, Cultured Cells, Computational Biology, Innate Immunity, Interleukin-5/genetics, Interleukin-5/metabolism, C-Type Lectins/metabolism, Mice, Inbred BALB C Mice, Transgenic Mice, Neuropeptides/genetics, Immunologic Receptors/metabolism, RNA Sequence Analysis, Signal Transduction, Single-Cell Analysis, Th2 Cells/immunology, Transcriptome, Up-Regulation
2.
Biostatistics ; 24(4): 1045-1065, 2023 10 18.
Article in English | MEDLINE | ID: mdl-35657012

ABSTRACT

Topic modeling is a popular method used to describe biological count data. With topic models, the user must specify the number of topics $K$. Since there is no definitive way to choose $K$ and since a true value might not exist, we develop a method, which we call topic alignment, to study the relationships across models with different $K$. In addition, we present three diagnostics based on the alignment. These techniques can show how many topics are consistently present across different models, if a topic is only transiently present, or if a topic splits into more topics when $K$ increases. This strategy gives more insight into the process of generating the data than choosing a single value of $K$ would. We design a visual representation of these cross-model relationships, show the effectiveness of these tools for interpreting the topics on simulated and real data, and release an accompanying R package, alto.
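
The idea of relating topics across models with different $K$ can be illustrated with a minimal sketch. It assumes that each fitted model yields a document-by-topic membership matrix and that the overlap between a topic in one model and a topic in another is measured by the inner product of their membership columns. The function names `alignment_weights` and `normalize_rows` are hypothetical and are not the API of the alto package; the weighting is one plausible choice, not necessarily the one the authors use.

```python
# Hypothetical sketch: relate topics of a K=2 model to topics of a K=3 model
# fitted on the same documents, using inner products of topic memberships.

def alignment_weights(gamma_a, gamma_b):
    """Return a K_a x K_b matrix of overlap weights.

    gamma_a and gamma_b are lists of rows, one row per document, holding the
    document's membership in each topic of model A and model B respectively.
    """
    k_a, k_b = len(gamma_a[0]), len(gamma_b[0])
    weights = [[0.0] * k_b for _ in range(k_a)]
    for row_a, row_b in zip(gamma_a, gamma_b):
        for i, a in enumerate(row_a):
            for j, b in enumerate(row_b):
                weights[i][j] += a * b
    return weights


def normalize_rows(weights):
    """Scale each row to sum to 1 so it reads as 'where does this topic go'."""
    out = []
    for row in weights:
        total = sum(row)
        out.append([w / total for w in row] if total > 0 else list(row))
    return out


# Toy memberships for four documents (assumed data, for illustration only).
gamma_k2 = [[1, 0], [1, 0], [0, 1], [0, 1]]
gamma_k3 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

flows = normalize_rows(alignment_weights(gamma_k2, gamma_k3))
print(flows)  # [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
```

In this toy case the first topic persists unchanged, while the second topic of the smaller model splits evenly into two topics of the larger model, which is the kind of pattern the diagnostics in the abstract are meant to reveal.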

3.
Biometrics ; 80(2)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38682463

ABSTRACT

Inferring the cancer-type specificities of ultra-rare, genome-wide somatic mutations is an open problem. Traditional statistical methods cannot handle such data due to their ultra-high dimensionality and extreme data sparsity. To harness information in rare mutations, we have recently proposed a formal multilevel multilogistic "hidden genome" model. Through its hierarchical layers, the model condenses information in ultra-rare mutations through meta-features embodying mutation contexts to characterize cancer types. Consistent, scalable point estimation of the model can incorporate 10s of millions of variants across thousands of tumors and permit impressive prediction and attribution. However, principled statistical inference is infeasible due to the volume, correlation, and noninterpretability of mutation contexts. In this paper, we propose a novel framework that leverages topic models from computational linguistics to effectuate dimension reduction of mutation contexts producing interpretable, decorrelated meta-feature topics. We propose an efficient MCMC algorithm for implementation that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of existing out-of-the-box inferential high-dimensional multi-class regression methods and software. Applying our model to the Pan Cancer Analysis of Whole Genomes dataset reveals interesting biological insights including somatic mutational topics associated with UV exposure in skin cancer, aging in colorectal cancer, and strong influence of epigenome organization in liver cancer. Under cross-validation, our model demonstrates highly competitive predictive performance against blackbox methods of random forest and deep learning.


Subjects
Algorithms, Bayes Theorem, Mutation, Neoplasms, Humans, Neoplasms/genetics, Statistical Models, Skin Neoplasms/genetics
4.
Philos Trans A Math Phys Eng Sci ; 382(2270): 20230145, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38403059

ABSTRACT

We apply a dynamic influence model to the opinions of the US federal courts to examine the role of the US Supreme Court in influencing the direction of legal discourse in the federal courts. We propose two mechanisms for how the Court affects innovation in legal language: a selection mechanism where the Court's influence primarily derives from its discretionary jurisdiction, and an authorship mechanism in which the Court's influence derives directly from its own innovations. To test these alternative hypotheses, we develop a novel influence measure based on a dynamic topic model that separates the Court's own language innovations from those of the lower courts. Applying this measure to the US federal courts, we find that the Supreme Court primarily exercises influence through the selection mechanism, with modest additional influence attributable to the authorship mechanism. This article is part of the theme issue 'A complexity science approach to law and governance'.

5.
Omega (Westport) ; : 302228241253972, 2024 May 13.
Article in English | MEDLINE | ID: mdl-38739857

ABSTRACT

Stigma surrounding suicide is a massive problem in Indonesia. Thus, it is important to study how conversations about suicide take place. We take a machine learning approach and study tweets with suicide keywords to understand how people converse about suicide or express suicide ideation. Tweets with suicide-related keywords were extracted from May to June 2023. 20,057 tweets were subject to topic modelling with an 11-topic solution. While most topics contain negative messages, no purely stigmatizing topics emerge, despite prior research suggesting overwhelming stigma. Various kinds of existential, emotional, and social tweets about suicide take place among Indonesian users, indicating that Indonesian Twitter users utilize the platform to express their thoughts and emotions. Notably, religious-spiritual keywords are highly prevalent, suggesting that in a highly religious society, there is a need for policy makers and awareness campaigns to frame their positive messaging within the society's religious context.

6.
Stat Med ; 42(30): 5541-5554, 2023 12 30.
Article in English | MEDLINE | ID: mdl-37850249

ABSTRACT

We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the various assumptions and restrictions of each approach, which eases identifying the appropriate method for specific applications for applied medical researchers. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.
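
The shared low-rank structure can be written out as a short sketch. The notation below ($X$, $W$, $H$, $n$, $p$, $K$) is assumed for illustration and is not taken from the paper; the constraints listed are the standard textbook ones for each method. For LDA, $X$ should be read as the expected row-normalized count matrix.

```latex
X \approx W H, \qquad
X \in \mathbb{R}^{n \times p},\;
W \in \mathbb{R}^{n \times K},\;
H \in \mathbb{R}^{K \times p},\;
K \ll \min(n, p)

\begin{array}{ll}
\text{PCA:}     & H H^{\top} = I_K \quad (\text{orthonormal components, } X \text{ column-centered}) \\
\text{K-means:} & W_{ik} \in \{0, 1\},\ \sum_{k} W_{ik} = 1 \quad (\text{rows of } H \text{ are cluster centroids}) \\
\text{NMF:}     & W \ge 0,\ H \ge 0 \\
\text{LDA:}     & W \ge 0,\ \sum_{k} W_{ik} = 1; \quad H \ge 0,\ \sum_{j} H_{kj} = 1
\end{array}
```

Read this way, the methods differ mainly in the constraints placed on the two factors and in the noise model assumed for $X$ given $WH$.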


Subjects
Algorithms, Unsupervised Machine Learning, Humans, Statistical Models, Genomics, Cluster Analysis
7.
J Med Internet Res ; 25: e45777, 2023 04 04.
Article in English | MEDLINE | ID: mdl-37014691

ABSTRACT

BACKGROUND: Anxiety disorder has become a major clinical and public health problem, causing a significant economic burden worldwide. Public attitudes toward anxiety can impact the psychological state, help-seeking behavior, and social activities of people with anxiety disorder. OBJECTIVE: The purpose of this study was to explore public attitudes toward anxiety disorders and the changing trends of these attitudes by analyzing the posts related to anxiety disorders on Sina Weibo, a Chinese social media platform that has about 582 million users, as well as the psycholinguistic and topical features in the text content of the posts. METHODS: From April 2018 to March 2022, 325,807 Sina Weibo posts with the keyword "anxiety disorder" were collected and analyzed. First, we analyzed the changing trends in the number and total length of posts every month. Second, a Chinese Linguistic Psychological Text Analysis System (TextMind) was used to analyze the changing trends in the language features of the posts, in which 20 linguistic features were selected and presented. Third, a topic model (biterm topic model) was used for semantic content analysis to identify specific themes in Weibo users' attitudes toward anxiety. RESULTS: The changing trends in the number and the total length of posts indicated that anxiety-related posts significantly increased from April 2018 to March 2022 (R2=0.6512; P<.001 to R2=0.8133; P<.001, respectively) and were greatly impacted by the beginning of a new semester (spring/fall). The analysis of linguistic features showed that the frequency of the cognitive process (R2=0.1782; P=.003), perceptual process (R2=0.1435; P=.008), biological process (R2=0.3225; P<.001), and assent words (R2=0.4412; P<.001) increased significantly over time, while the frequency of the social process words (R2=0.2889; P<.001) decreased significantly, and public anxiety was greatly impacted by the COVID-19 pandemic. 
Feature correlation analysis showed that the frequencies of words related to work and family are almost negatively correlated with those of other psychological words. Semantic content analysis identified 5 common topical areas: discrimination and stigma, symptoms and physical health, treatment and support, work and social, and family and life. Our results showed that the occurrence probability of the topical area "discrimination and stigma" was the highest, accounting for 26.66% on average over the 4-year period. The occurrence probability of the topical area "family and life" (R2=0.1888; P=.09) decreased over time, while that of the other 4 topical areas increased. CONCLUSIONS: The findings of our study indicate that public discrimination and stigma against anxiety disorder remain high, particularly in the aspects of self-denial and negative emotions. People with anxiety disorders should receive more social support to reduce the impact of discrimination and stigma.


Subjects
COVID-19, Social Media, Humans, COVID-19/epidemiology, Pandemics, Linguistics, Anxiety, Attitude, China/epidemiology
8.
J Med Internet Res ; 25: e45019, 2023 09 21.
Article in English | MEDLINE | ID: mdl-37733396

ABSTRACT

BACKGROUND: Social networks have become one of the main channels for obtaining health information. However, they have also become a source of health-related misinformation, which seriously threatens the public's physical and mental health. Governance of health-related misinformation can be implemented through topic identification of rumors on social networks. However, little attention has been paid to studying the types and routes of dissemination of health rumors on the internet, especially rumors regarding health-related information in Chinese social media. OBJECTIVE: This study aims to explore the types of health-related misinformation favored by WeChat public platform users and their prevalence trends and to analyze the modeling results of the text by using the Latent Dirichlet Allocation model. METHODS: We used a web crawler tool to capture health rumor-dispelling articles on WeChat rumor-dispelling public accounts. We collected information from health-debunking articles posted between January 1, 2016, and August 31, 2022. Following word segmentation of the collected text, a document topic generation model called Latent Dirichlet Allocation was used to identify and generalize the most common topics. The proportion distribution of the themes was calculated, and the negative impact of various health rumors in different periods was analyzed. Additionally, the prevalence of health rumors was analyzed by the number of health rumors generated at each time point. RESULTS: We collected 9366 rumor-refuting articles from January 1, 2016, to August 31, 2022, from WeChat official accounts. 
Through topic modeling, we divided the health rumors into 8 topics, that is, rumors on prevention and treatment of infectious diseases (1284/9366, 13.71%), disease therapy and its effects (1037/9366, 11.07%), food safety (1243/9366, 13.27%), cancer and its causes (946/9366, 10.10%), regimen and disease (1540/9366, 16.44%), transmission (914/9366, 9.76%), healthy diet (1068/9366, 11.40%), and nutrition and health (1334/9366, 14.24%). Furthermore, we summarized the 8 topics under 4 themes, that is, public health, disease, diet and health, and spread of rumors. CONCLUSIONS: Our study shows that topic modeling can provide analysis and insights into health rumor governance. The rumor development trends showed that most rumors were on public health, disease, and diet and health problems. Governments still need to implement relevant and comprehensive rumor management strategies based on the rumors prevalent in their countries and formulate appropriate policies. Apart from regulating the content disseminated on social media platforms, the national quality of health education should also be improved. Governance of social networks should be clearly implemented, as these rapidly developed platforms come with privacy issues. Both disseminators and receivers of information should ensure a realistic attitude and disseminate health information correctly. In addition, we recommend that sentiment analysis-related studies be conducted to verify the impact of health rumor-related topics.


Subjects
Health Education, Social Media, Humans, Healthy Diet, Government, Communication, China
9.
BMC Med Inform Decis Mak ; 23(1): 132, 2023 07 22.
Article in English | MEDLINE | ID: mdl-37481523

ABSTRACT

BACKGROUND: Topic models are a class of unsupervised machine learning models, which facilitate summarization, browsing and retrieval from large unstructured document collections. This study reviews several methods for assessing the quality of unsupervised topic models estimated using non-negative matrix factorization. Techniques for topic model validation have been developed across disparate fields. We synthesize this literature, discuss the advantages and disadvantages of different techniques for topic model validation, and illustrate their usefulness for guiding model selection on a large clinical text corpus. DESIGN, SETTING AND DATA: Using a retrospective cohort design, we curated a text corpus containing 382,666 clinical notes collected between 01/01/2017 and 12/31/2020 from primary care electronic medical records in Toronto, Canada. METHODS: Several topic model quality metrics have been proposed to assess different aspects of model fit. We explored the following metrics: reconstruction error, topic coherence, rank biased overlap, Kendall's weighted tau, partition coefficient, partition entropy and the Xie-Beni statistic. Depending on context, cross-validation and/or bootstrap stability analysis were used to estimate these metrics on our corpus. RESULTS: Cross-validated reconstruction error favored large topic models (K ≥ 100 topics) on our corpus. Stability analysis using topic coherence and the Xie-Beni statistic also favored large models (K = 100 topics). Rank biased overlap and Kendall's weighted tau favored small models (K = 5 topics). Few model evaluation metrics suggested mid-sized topic models (25 ≤ K ≤ 75) as being optimal. However, human judgement suggested that mid-sized topic models produced expressive low-dimensional summarizations of the corpus. CONCLUSIONS: Topic model quality indices are transparent quantitative tools for guiding model selection and evaluation. 
Our empirical illustration demonstrated that different topic model quality indices favor models of different complexity and may not select models aligning with human judgment. This suggests that different metrics capture different aspects of model goodness of fit. A combination of topic model quality indices, coupled with human validation, may be useful in appraising unsupervised topic models.
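
One of the stability metrics named above, rank biased overlap, can be sketched in a few lines. The version below is a truncated prefix sum over two top-word lists of equal depth, with a persistence parameter `p` that down-weights agreement at deeper ranks. The abstract does not say which variant or which value of `p` the authors used, so both are assumptions here, and the function name `rbo_truncated` is hypothetical.

```python
# Hypothetical sketch: truncated rank-biased overlap between two ranked
# top-word lists, e.g. the same topic estimated on two bootstrap samples.

def rbo_truncated(list_a, list_b, p=0.9):
    """Return (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d.

    Assumes each list contains no duplicate items. With finite depth k the
    score of two identical lists is 1 - p**k rather than exactly 1.
    """
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the two prefixes
        total += p ** (d - 1) * agreement
    return (1 - p) * total


# Toy top-word lists (assumed data, for illustration only).
run_1 = ["pain", "knee", "injury", "swelling"]
run_2 = ["pain", "injury", "knee", "fracture"]
print(rbo_truncated(run_1, run_2, p=0.9))
```

Averaging such scores over matched topics from repeated fits gives one number per candidate K, which is how a stability-style metric can end up favoring small models whose few topics are reproduced almost identically across runs.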


Subjects
Algorithms, Benchmarking, Humans, Retrospective Studies, Canada, Electronic Health Records
10.
Environ Manage ; 71(6): 1213-1227, 2023 06.
Article in English | MEDLINE | ID: mdl-36781453

ABSTRACT

The rapid transition of livestock husbandry in the 20th century involved a broad adoption of slurry-based livestock housing systems that resulted in farm economic benefits, but also in societal debate related to the environment and animal welfare. In this article, we apply the method of topic modeling to four major German newspapers to identify thematic emphases and changes in coverage around "slurry". We considered more than 2300 articles published between 1971 and 2020. Our results show that reporting encompasses economic, environmental, and social topics in which slurry is represented mostly critically ("poisonous substance"), occasionally neutrally ("scent of countryside"), or rarely positively ("input for the bioeconomy"). Three meta-themes overarch the majority of issues and reflect public discourse on agriculture: (i) the dichotomy of agricultural industrialization and family farming; (ii) contrasting actualities of factory farming and animal welfare; and (iii) the responsibility of policy for the emergence, existence and solution of livestock and slurry-related problems. A more balanced recognition of mutual values and constraints by the media could contribute to a discursive reconciliation of public and private interests.


Subjects
Agriculture, Livestock, Animals, Environment, Farms, Germany
11.
IEEE Trans Knowl Data Eng ; 35(2): 1402-1420, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36798878

ABSTRACT

Shortening the time to knowledge discovery and adapting prior domain knowledge are challenges for computational and data-intensive communities such as bioinformatics and neuroscience. The challenge for a domain scientist lies in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics when investigating new methods, developing new tools, or integrating datasets. In this paper, we propose a novel "domain-specific topic model" (DSTM) to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplary scientific domains. Our DSTM is a generative model that extends the Latent Dirichlet Allocation (LDA) model and uses the Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include more than 25,000 papers over the last ten years, featuring hundreds of tools and datasets that are commonly used in relevant studies. Evaluation experiments based on generalization and information retrieval metrics show that our model has better performance than the state-of-the-art baseline models for discovering highly specific latent topics within a domain. Lastly, we demonstrate applications that benefit from our DSTM to discover intra-domain, cross-domain and trend knowledge patterns.

12.
J Ment Health ; 32(2): 386-395, 2023 Apr.
Article in English | MEDLINE | ID: mdl-34582309

ABSTRACT

BACKGROUND: Depression raises a double challenge: besides the negative mood and the intrusive thoughts, the relation to the self also becomes difficult. Online forums are analysed as communicative platforms enabling the interactive reconstruction of the self. AIMS: The discourses of online depression forums are explored. Firstly, narrative patterns are identified according to their thematic focus (e.g. dysfunctional body, challenges of intimacy) and discursive logic (e.g. information exchange, support). Secondly, narratives are analysed in order to describe various ways of grounding a depressed self. METHODS: Approximately 70,000 depression-related posts from the biggest English-speaking online forums (e.g. www.reddit.com/r/depression, www.healthunlocked.com) were analysed. Quantitative (LDA topic modelling) and qualitative (deep reading) approaches were used simultaneously to determine the optimal number of topics and their interpretation. RESULTS: 13 topics were identified and interpreted according to their content and communicative function. Based on the inter-topic distances, four clusters were identified (medicalized, intimacy-oriented, critical and uninhabitable self-narratives). CONCLUSIONS: The clusters of the 13 topics highlight various ways of narrating depression and the depressed self. Based on a comparison with a systematic review of mental illness recovery narratives, depression forums cover most narrative genres and emotional tones, thus creating a unique opportunity for integrating the depressing experiences into the self.


Subjects
Depression, Mental Disorders, Humans, Narration, Communication
13.
Comput Stat ; 38(2): 647-674, 2023.
Article in English | MEDLINE | ID: mdl-37223721

ABSTRACT

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.

14.
World Wide Web ; 26(1): 55-70, 2023.
Article in English | MEDLINE | ID: mdl-35308294

ABSTRACT

Every epidemic affects the real lives of many people around the world and leads to terrible consequences. Recently, many tweets about the COVID-19 pandemic have been shared publicly on social media platforms. The analysis of these tweets is helpful for emergency response organizations to prioritize their tasks and make better decisions. However, most of these tweets are non-informative, which is a challenge for establishing an automated system to detect useful information in social media. Furthermore, existing methods ignore unlabeled data and topic background knowledge, which can provide additional semantic information. In this paper, we propose a novel Topic-Aware BERT (TABERT) model to solve the above challenges. TABERT first leverages a topic model to extract the latent topics of tweets. Secondly, a flexible framework is used to combine topic information with the output of BERT. Finally, we adopt adversarial training to achieve semi-supervised learning, and a large amount of unlabeled data can be used to improve inner representations of the model. Experimental results on the dataset of COVID-19 English tweets show that our model outperforms classic and state-of-the-art baselines.

15.
J Med Internet Res ; 24(10): e39676, 2022 10 13.
Article in English | MEDLINE | ID: mdl-36191167

ABSTRACT

BACKGROUND: The COVID-19 pandemic and its corresponding preventive and control measures have increased the mental burden on the public. Understanding and tracking changes in public mental status can facilitate optimizing public mental health intervention and control strategies. OBJECTIVE: This study aimed to build a social media-based pipeline that tracks public mental changes and use it to understand public mental health status regarding the pandemic. METHODS: This study used COVID-19-related tweets posted from February 2020 to April 2022. The tweets were downloaded using unique identifiers through the Twitter application programming interface. We created a lexicon of 4 mental health problems (depression, anxiety, insomnia, and addiction) to identify mental health-related tweets and developed a dictionary for identifying health care workers. We analyzed temporal and geographic distributions of public mental health status during the pandemic and further compared distributions among health care workers versus the general public, supplemented by topic modeling on their underlying foci. Finally, we used interrupted time series analysis to examine the statewide impact of a lockdown policy on public mental health in 12 states. RESULTS: We extracted 4,213,005 tweets related to mental health and COVID-19 from 2,316,817 users. Of these tweets, 2,161,357 (51.3%) were related to "depression," whereas 1,923,635 (45.66%), 225,205 (5.35%), and 150,006 (3.56%) were related to "anxiety," "insomnia," and "addiction," respectively. Compared to the general public, health care workers had higher risks of all 4 types of problems (all P<.001), and they were more concerned about clinical topics than about everyday issues (eg, "students' pressure," "panic buying," and "fuel problems"). 
Finally, the lockdown policy had significant associations with public mental health in 4 out of the 12 states we studied, among which Pennsylvania showed a positive association, whereas Michigan, North Carolina, and Ohio showed the opposite (all P<.05). CONCLUSIONS: The impact of COVID-19 and the corresponding control measures on the public's mental status is dynamic and shows variability among different cohorts regarding disease types, occupations, and regional groups. Health agencies and policy makers should primarily focus on depression (reported by 51.3% of the tweets) and insomnia (which has had an ever-increasing trend since the beginning of the pandemic), especially among health care workers. Our pipeline tracks and analyzes public mental health changes in a timely manner, especially when primary studies and large-scale surveys are difficult to conduct.


Subjects
COVID-19, Sleep Initiation and Maintenance Disorders, Social Media, COVID-19/epidemiology, COVID-19/prevention & control, Communicable Disease Control, Humans, Infodemiology, Mental Health, Pandemics/prevention & control, Policy
16.
Sensors (Basel) ; 22(3)2022 Jan 23.
Article in English | MEDLINE | ID: mdl-35161598

ABSTRACT

With the rapid proliferation of social networking sites (SNS), automatic topic extraction from various text messages posted on SNS is becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the popular topic models in the area of Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation for short-text analysis due to a lack of word co-occurrence information in each short text. To cope with this problem, various techniques are evolving for interpretable topic modeling of short texts; combining pretrained word embedding from an external corpus with topic models is one of them. Due to recent developments of deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging to achieve flexibility and high performance in topic modeling. However, there are very few research works on neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with an original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study with eight neural-topic models has been completed to check the effectiveness of additional fine-tuning and pretrained word embedding in generating interpretable topics by simulation experiments with several benchmark datasets. The extracted topics are evaluated by different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks. 
Our study concludes that although auxiliary word embedding with a large external corpus improves the topic coherence of short texts, an additional fine-tuning stage is needed for generating more corpus-specific topics from short-text data.


Subjects
Text Messaging, Cluster Analysis, Information Storage and Retrieval, Natural Language Processing, Computer Neural Networks
17.
Sensors (Basel) ; 22(3)2022 Jan 29.
Article in English | MEDLINE | ID: mdl-35161807

ABSTRACT

Combinatorial fusion algorithm (CFA) is a machine learning and artificial intelligence (ML/AI) framework for combining multiple scoring systems using the rank-score characteristic (RSC) function and cognitive diversity (CD). When measuring the relevance of a publication or document with respect to the 17 Sustainable Development Goals (SDGs) of the United Nations, a classification scheme is used. However, this classification process is a challenging task due to the overlapping goals and contextual differences of those diverse SDGs. In this paper, we use CFA to combine a topic model classifier (Model A) and a semantic link classifier (Model B) to improve the precision of the classification process. We characterize and analyze each of the individual models using the RSC function and CD between Models A and B. We evaluate the classification results from combining the models using a score combination and a rank combination, when compared to the results obtained from human experts. In summary, we demonstrate that the combination of Models A and B can improve classification precision only if these individual models perform well and are diverse.
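
The two combination modes and the rank-score characteristic function can be illustrated with a small sketch. It assumes that each model assigns one relevance score per document, that scores are min-max normalized before averaging, and that cognitive diversity is measured as the root-mean-square gap between the two RSC functions. These are illustrative choices rather than the paper's exact definitions, and all function names are hypothetical.

```python
import math

# Hypothetical sketch of combinatorial fusion for two scoring systems
# (for example, a topic model classifier A and a semantic link classifier B).

def normalize(scores):
    """Min-max scale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ranks(scores):
    """Rank 1 is the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, item in enumerate(order):
        result[item] = position + 1
    return result


def rsc(scores):
    """Rank-score characteristic: normalized score found at each rank."""
    return sorted(normalize(scores), reverse=True)


def score_combination(scores_a, scores_b):
    """Average the normalized scores per item (higher is better)."""
    return [(a + b) / 2 for a, b in zip(normalize(scores_a), normalize(scores_b))]


def rank_combination(scores_a, scores_b):
    """Average the ranks per item (lower is better)."""
    return [(a + b) / 2 for a, b in zip(ranks(scores_a), ranks(scores_b))]


def cognitive_diversity(scores_a, scores_b):
    """Root-mean-square gap between the two RSC functions (assumed form)."""
    f_a, f_b = rsc(scores_a), rsc(scores_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(f_a, f_b)) / len(f_a))


# Toy scores for three documents (assumed data, for illustration only).
model_a = [4, 2, 0]
model_b = [1, 3, 2]
print(score_combination(model_a, model_b))  # [0.5, 0.75, 0.25]
print(rank_combination(model_a, model_b))   # [2.0, 1.5, 2.5]
```

Here both combinations place the second document first even though model A alone would not, which shows how fusion can change the final ordering when the two systems disagree.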


Subjects
Artificial Intelligence, Sustainable Development, Global Health, Humans, Machine Learning, United Nations
18.
Entropy (Basel) ; 23(10)2021 Oct 03.
Article in English | MEDLINE | ID: mdl-34682025

ABSTRACT

This study constructs a comprehensive index to effectively judge the optimal number of topics in the LDA topic model. Based on the requirements for selecting the number of topics, a comprehensive judgment index of perplexity, isolation, stability, and coincidence is constructed to select the number of topics. This method provides four advantages to selecting the optimal number of topics: (1) good predictive ability, (2) high isolation between topics, (3) no duplicate topics, and (4) repeatability. First, we use three general datasets to compare our proposed method with existing methods, and the results show that the optimal topic number selection method has better selection results. Then, we collected the patent policies of various provinces and cities in China (excluding Hong Kong, Macao, and Taiwan) as datasets. By using the optimal topic number selection method proposed in this study, we can classify patent policies well.

19.
Cogn Process ; 21(1): 1-21, 2020 Feb.
Article in English | MEDLINE | ID: mdl-31555943

ABSTRACT

In recent years, latent semantic analysis (LSA) has reached a level of maturity at which its presence is ubiquitous in technology as well as in the simulation of cognitive processes. Despite this, LSA has increasingly been subjected to criticism, usually because it is compared to other models in very specific tasks and conditions, sometimes without a good understanding of what the semantic representation of LSA means, and without exploiting the possibilities LSA offers beyond the cosine. This paper provides a critical review to clarify some of the misunderstandings regarding LSA and other space models. The review covers the historical stability of the predecessors of LSA, the representational structure of word meaning and the multiple topologies that could arise from a semantic space, the computation of similarity, the myth that LSA dimensions have no meaning, the computational and algorithmic plausibility of LSA as an account of meaning acquisition (in contrast to other models based on online mechanisms), the possibilities of spatial models to substantiate recent proposals, and, in general, the characteristics of classic vector models and their ease and flexibility in simulating some cognitive phenomena. The review highlights the similarity between LSA and other techniques and proposes applying the long experience gained with LSA to other models, especially predictive models such as word2vec. In sum, it emphasizes the lessons that can be learned from comparing LSA-based models to other models, rather than making statements about "the best."


Subjects
Learning , Semantics , Algorithms , Humans , Knowledge , Theoretical Models
20.
Entropy (Basel) ; 22(3)2020 Mar 12.
Article in English | MEDLINE | ID: mdl-33286100

ABSTRACT

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model (the topic-specific words, document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics; such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM's model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
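
The trade-off between model complexity and goodness of fit that BIC encodes can be illustrated with the standard textbook form, BIC = k ln(n) - 2 ln(L). The sketch below uses that generic form only; it is not the customized BIC or the lossless coding scheme described in the abstract, and the candidate models and their numbers are hypothetical.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Standard BIC: complexity penalty minus twice the log-likelihood."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood


def select_model(candidates, num_obs):
    """Pick the candidate with the lowest BIC.

    `candidates` maps a model name to (log_likelihood, num_params).
    """
    return min(candidates, key=lambda name: bic(*candidates[name], num_obs))


# Hypothetical candidates: the dense model fits slightly better but
# spends many more parameters, so the sparse model wins under BIC.
models = {"sparse": (-105.0, 3), "dense": (-100.0, 10)}
chosen = select_model(models, num_obs=100)
```

Here the denser model gains five units of log-likelihood but pays for seven extra parameters, so minimizing BIC favors the sparser one, which is the kind of pressure toward parsimony the abstract describes.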
