ABSTRACT
Internet health forums are a rich textual resource with content generated through free exchanges among patients and, in certain cases, health professionals. We tackle the problem of retrieving clinically relevant information from such forums, with relevant topics being defined from clinical auto-questionnaires. Texts in forums are largely unstructured and noisy, calling for adapted preprocessing and query methods. We minimize the number of false negatives in queries by using a synonym tool to achieve query expansion of initial topic keywords. To avoid false positives, we propose a new measure based on a statistical comparison of frequent co-occurrences in a large reference corpus (Web) to keep only relevant expansions. Our work is motivated by a study of breast cancer patients' health-related quality of life (QoL). We consider topics defined from a breast-cancer specific QoL-questionnaire. We quantify and structure occurrences in posts of a specialized French forum and outline important future developments.
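The expand-then-filter idea described above can be sketched as follows. The abstract does not specify the statistical measure used, so this sketch swaps in pointwise mutual information (PMI) over document counts as a stand-in. The function names, the synonym table, the counts, and the threshold are all hypothetical illustrations, not the authors' tool or data.

```python
import math


def pmi(pair_count, count_a, count_b, total_docs):
    """Pointwise mutual information (base 2) from raw document counts.

    Stand-in measure: the abstract does not say which statistic is used.
    """
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log2((pair_count * total_docs) / (count_a * count_b))


def expand_query(keyword, synonyms, doc_counts, pair_counts, total_docs,
                 threshold=1.0):
    """Return the keyword plus the synonyms that co-occur with it often enough.

    synonyms:    dict mapping a keyword to candidate expansions (hypothetical)
    doc_counts:  dict mapping a term to its document count in a reference corpus
    pair_counts: dict mapping frozenset({a, b}) to the co-occurrence count
    """
    kept = [keyword]
    for candidate in synonyms.get(keyword, []):
        pair = pair_counts.get(frozenset((keyword, candidate)), 0)
        score = pmi(pair, doc_counts.get(keyword, 0),
                    doc_counts.get(candidate, 0), total_docs)
        if score >= threshold:
            kept.append(candidate)
    return kept


# Made-up counts, for illustration only.
synonyms = {"fatigue": ["tiredness", "wear"]}
doc_counts = {"fatigue": 1000, "tiredness": 800, "wear": 5000}
pair_counts = {
    frozenset(("fatigue", "tiredness")): 200,
    frozenset(("fatigue", "wear")): 5,
}
expanded = expand_query("fatigue", synonyms, doc_counts, pair_counts,
                        total_docs=1_000_000)
print(expanded)
```

With these invented counts, the candidate that rarely co-occurs with the keyword is dropped and the frequent one is kept. Expansion reduces false negatives, and the filter limits the false positives that expansion would otherwise add.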
Subjects
Breast Neoplasms/epidemiology , Breast Neoplasms/psychology , Data Mining/methods , Health Information Exchange/statistics & numerical data , Quality of Life/psychology , Social Media/statistics & numerical data , Vocabulary, Controlled , Artificial Intelligence , Female , Humans , Natural Language Processing , Surveys and Questionnaires
ABSTRACT
PNAS article classification is rooted in long-standing disciplinary divisions that do not necessarily reflect the structure of modern scientific research. We reevaluate that structure using latent pattern models from statistical machine learning, also known as mixed-membership models, that identify semantic structure in co-occurrence of words in the abstracts and references. Our findings suggest that the latent dimensionality of patterns underlying PNAS research articles in the Biological Sciences is only slightly larger than the number of categories currently in use, but it differs substantially in the content of the categories. Further, the number of articles that are listed under multiple categories is only a small fraction of what it should be. These findings together with the sensitivity analyses suggest ways to reconceptualize the organization of papers published in PNAS.
Subjects
Periodicals as Topic/classification , Publications/classification , Classification , Methods , National Academy of Sciences, U.S. , Statistics as Topic , United States
ABSTRACT
Data on functional disability are of widespread policy interest in the United States, especially with respect to planning for Medicare and Social Security for a growing population of elderly adults. We consider an extract of functional disability data from the National Long Term Care Survey (NLTCS) and attempt to develop disability profiles using variations of the Grade of Membership (GoM) model. We first describe GoM as an individual-level mixture model that allows individuals to have partial membership in several mixture components simultaneously. We then prove the equivalence between individual-level and population-level mixture models, and use this property to develop a Markov Chain Monte Carlo algorithm for Bayesian estimation of the model. We use our approach to analyze functional disability data from the NLTCS.
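As a rough illustration of the partial-membership idea, a GoM model for binary items is commonly written as below. The notation is an assumption introduced here and does not appear in the abstract: $g_{ik}$ is the membership of individual $i$ in profile $k$, $\lambda_{kj}$ is the probability of a positive response to item $j$ under profile $k$, and $z_{ij}$ is an item-level latent class indicator. The second display shows one standard augmentation that connects the individual-level and population-level views; it is a sketch, not necessarily the authors' proof.

```latex
% Individual-level form: responses are conditionally independent given g_i,
% with g_{ik} \ge 0 and \sum_{k=1}^{K} g_{ik} = 1.
\Pr(x_{ij} = 1 \mid g_i) = \sum_{k=1}^{K} g_{ik}\,\lambda_{kj},
\qquad
\Pr(x_i \mid g_i) = \prod_{j=1}^{J}
  \Bigl(\sum_{k=1}^{K} g_{ik}\lambda_{kj}\Bigr)^{x_{ij}}
  \Bigl(1 - \sum_{k=1}^{K} g_{ik}\lambda_{kj}\Bigr)^{1 - x_{ij}}

% Augmented form: draw a latent class for each item, then the response.
z_{ij} \mid g_i \sim \mathrm{Categorical}(g_{i1}, \dots, g_{iK}),
\qquad
x_{ij} \mid z_{ij} = k \sim \mathrm{Bernoulli}(\lambda_{kj})

% Summing out z_{ij} recovers the individual-level form.
\Pr(x_{ij} = 1 \mid g_i)
  = \sum_{k=1}^{K} \Pr(z_{ij} = k \mid g_i)\,\Pr(x_{ij} = 1 \mid z_{ij} = k)
  = \sum_{k=1}^{K} g_{ik}\,\lambda_{kj}
```

Under this augmentation, a Gibbs-style sampler can alternate between updating the indicators $z_{ij}$ and the parameters, which is the kind of structure a Markov Chain Monte Carlo scheme can exploit.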