1.
AMIA Annu Symp Proc ; 2021: 1109-1118, 2021.
Article in English | MEDLINE | ID: mdl-35308915

ABSTRACT

Mental illness, a serious problem across the globe, requires multi-pronged solutions including effective computational models to predict illness. Mental illness diagnosis is complicated by the pronounced sharing of symptoms and mutual predispositions. Set in this context, we offer a systematic comparison of seven deep learning and two conventional machine learning models for predicting mental illness from the history of present illness free-text descriptions in patient records. The models tested include a new architecture, CB-MH, which ranks best for F1 (0.62), while another attention model is best for F2 (0.71). We also explore model decisions using the Integrated Gradients interpretability method, which we use to identify key influential features. Overall, the majority of true positives have key features appearing in meaningful contexts. False negatives are most challenging, with most key features appearing in unclear contexts. False positives are mostly true positives in actuality, as supported by a small-scale clinician-based user judgement study.
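
The abstract reports both F1 and F2, so a short sketch of the general F-beta score may help show why different models can lead on each measure: F2 weights recall more heavily than precision. The precision and recall values below are hypothetical and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values for illustration only (not results from the paper).
precision, recall = 0.6, 0.9
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```

With recall higher than precision, as in this made-up case, F2 comes out above F1, which is one way a recall-oriented model could rank first on F2 without ranking first on F1.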


Subjects
Deep Learning , Mental Disorders , Humans , Machine Learning , Mental Disorders/diagnosis
2.
BMC Med Inform Decis Mak ; 17(1): 49, 2017 Apr 21.
Article in English | MEDLINE | ID: mdl-28431582

ABSTRACT

BACKGROUND: It is becoming increasingly common for individuals and organizations to use social media platforms such as Facebook. These are being used for a wide variety of purposes including disseminating, discussing and seeking health related information. U.S. Federal health agencies are leveraging these platforms to 'engage' social media users to read, spread, promote and encourage health related discussions. However, different agencies and their communications get varying levels of engagement. In this study we use statistical models to identify factors that associate with engagement. METHODS: We analyze over 45,000 Facebook posts from 72 Facebook accounts belonging to 24 health agencies. Account usage, user activity, sentiment and content of these posts are studied. We use the hurdle regression model to identify factors associated with the level of engagement and Cox proportional hazards model to identify factors associated with duration of engagement. RESULTS: In our analysis we find that agencies and accounts vary widely in their usage of social media and activity they generate. Statistical analysis shows, for instance, that Facebook posts with more visual cues such as photos or videos or those which express positive sentiment generate more engagement. We further find that posts on certain topics such as occupation or organizations negatively affect the duration of engagement. CONCLUSIONS: We present the first comprehensive analyses of engagement with U.S. Federal health agencies on Facebook. In addition, we briefly compare and contrast findings from this study to our earlier study with similar focus but on Twitter to show the robustness of our methods.


Subjects
Information Dissemination , Information Seeking Behavior , Social Media , Social Networking , United States Dept. of Health and Human Services , Communication , Humans , Models, Statistical , Patient Acceptance of Health Care , United States
3.
PLoS One ; 11(3): e0150881, 2016.
Article in English | MEDLINE | ID: mdl-26982323

ABSTRACT

Life satisfaction refers to a somewhat stable cognitive assessment of one's own life. Life satisfaction is an important component of subjective well being, the scientific term for happiness. The other component is affect: the balance between the presence of positive and negative emotions in daily life. While affect has been studied using social media datasets (particularly from Twitter), life satisfaction has received little to no attention. Here, we examine trends in posts about life satisfaction from a two-year sample of Twitter data. We apply a surveillance methodology to extract expressions of both satisfaction and dissatisfaction with life. A noteworthy result is that consistent with their definitions trends in life satisfaction posts are immune to external events (political, seasonal etc.) unlike affect trends reported by previous researchers. Comparing users we find differences between satisfied and dissatisfied users in several linguistic, psychosocial and other features. For example the latter post more tweets expressing anger, anxiety, depression, sadness and on death. We also study users who change their status over time from satisfied with life to dissatisfied or vice versa. Noteworthy is that the psychosocial tweet features of users who change from satisfied to dissatisfied are quite different from those who stay satisfied over time. Overall, the observations we make are consistent with intuition and consistent with observations in the social science research. This research contributes to the study of the subjective well being of individuals through social media.


Subjects
Happiness , Personal Satisfaction , Social Media , Humans
4.
BMC Bioinformatics ; 16: 198, 2015 Jun 20.
Article in English | MEDLINE | ID: mdl-26091670

ABSTRACT

BACKGROUND: The rapid pace of bioscience research makes it very challenging to track relevant articles in one's area of interest. MEDLINE, a primary source for biomedical literature, offers access to more than 20 million citations with three-quarters of a million new ones added each year. Thus it is not surprising to see active research in building new document retrieval and sentence retrieval systems. We present Ferret, a prototype retrieval system, designed to retrieve and rank sentences (and their documents) conveying gene-centric relationships of interest to a scientist. The prototype has several features. For example, it is designed to handle gene name ambiguity and perform query expansion. Inputs can be a list of genes with an optional list of keywords. Sentences are retrieved across species but the species discussed in the records are identified. Results are presented in the form of a heat map and sentences corresponding to specific cells of the heat map may be selected for display. Ferret is designed to assist bioscientists at different stages of research from early idea exploration to advanced analysis of results from bench experiments. RESULTS: Three live case studies in the field of plant biology are presented related to Arabidopsis thaliana. The first is to discover genes that may relate to the phenotype of open immature flower in Arabidopsis. The second case is about finding associations reported between ethylene signaling and a set of 300+ Arabidopsis genes. The third case is on searching for potential gene targets of an Arabidopsis transcription factor hypothesized to be involved in plant stress responses. Ferret was successful in finding valuable information in all three cases. In the first case, the bZIP family of genes was identified. In the second case, sentences indicating relevant associations were found in other species such as potato and jasmine. In the third case, sentences led to new research questions about the plant hormone salicylic acid.
CONCLUSIONS: Ferret successfully retrieved relevant gene-centric sentences from PubMed records. The three case studies demonstrate end user satisfaction with the system.


Subjects
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Databases, Bibliographic , Information Storage and Retrieval/methods , PubMed , Software , Stress, Physiological/genetics , Ethylenes/metabolism , Flowers/chemistry , Phenotype , Salicylic Acid/metabolism , Semantics
5.
PLoS One ; 9(11): e112235, 2014.
Article in English | MEDLINE | ID: mdl-25379727

ABSTRACT

OBJECTIVE: To investigate factors associated with engagement of U.S. Federal Health Agencies via Twitter. Our specific goals are to study factors related to a) numbers of retweets, b) time between the agency tweet and first retweet and c) time between the agency tweet and last retweet. METHODS: We collect 164,104 tweets from 25 Federal Health Agencies and their 130 accounts. We use negative binomial hurdle regression models and Cox proportional hazards models to explore the influence of 26 factors on agency engagement. Account features include network centrality, tweet count, numbers of friends, followers, and favorites. Tweet features include age, the use of hashtags, user-mentions, URLs, sentiment measured using Sentistrength, and tweet content represented by fifteen semantic groups. RESULTS: A third of the tweets (53,556) had zero retweets. Less than 1% (613) had more than 100 retweets (mean = 284). The hurdle analysis shows that hashtags, URLs and user-mentions are positively associated with retweets; sentiment has no association with retweets; and tweet count has a negative association with retweets. Almost all semantic groups, except for geographic areas, occupations and organizations, are positively associated with retweeting. The survival analyses indicate that engagement is positively associated with tweet age and the follower count. CONCLUSIONS: Some of the factors associated with higher levels of Twitter engagement cannot be changed by the agencies, but others can be modified (e.g., use of hashtags, URLs). Our findings provide the background for future controlled experiments to increase public health engagement via Twitter.
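
As a quick arithmetic check on the proportions quoted in the RESULTS, the counts given in the abstract can be turned into shares of the full collection. The variable names are illustrative; only the three counts come from the abstract.

```python
# Counts as reported in the abstract.
total_tweets = 164_104
zero_retweets = 53_556
over_100_retweets = 613

# Shares of the full tweet collection.
share_zero = zero_retweets / total_tweets
share_over_100 = over_100_retweets / total_tweets

print(f"zero retweets: {share_zero:.1%}")
print(f"more than 100 retweets: {share_over_100:.2%}")
```

The first share is close to one third and the second is well under 1%, consistent with the wording of the abstract.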


Subjects
Social Media , United States Dept. of Health and Human Services , Humans , Information Dissemination , Proportional Hazards Models , Regression Analysis , Social Media/statistics & numerical data , United States , United States Dept. of Health and Human Services/statistics & numerical data
6.
AMIA Annu Symp Proc ; 2012: 1030-9, 2012.
Article in English | MEDLINE | ID: mdl-23304379

ABSTRACT

We present and test the intuition that letters to the editor in journals carry early signals of adverse drug events (ADEs). Surprisingly, these letters have not yet been exploited for automatic ADE detection, unlike, for example, clinical records and PubMed. Part of the challenge is that it is not easy to access the full text of letters (for the most part these do not appear in PubMed). Also, letters are likely underrated in comparison with full articles. Besides demonstrating that this intuition holds, we contribute techniques for post-market drug surveillance. Specifically, we test an automatic approach for ADE detection from letters using off-the-shelf machine learning tools. We also involve natural language processing for feature definitions. Overall, we achieve high accuracy in our experiments, and our method also works well on a second new test set. Our results encourage us to further pursue this line of research.


Subjects
Adverse Drug Reaction Reporting Systems , Correspondence as Topic , Drug-Related Side Effects and Adverse Reactions , Natural Language Processing , Humans , Product Surveillance, Postmarketing
7.
BMC Bioinformatics ; 12 Suppl 8: S4, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151968

ABSTRACT

BACKGROUND: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested. RESULTS: A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. 
Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of correct gene and its identifier, such as contextual information to assist in disambiguation. DISCUSSION: The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.


Subjects
Data Mining/methods , Genes , Animals , Computational Biology/methods , Periodicals as Topic , Plants/genetics , Plants/metabolism
8.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
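
The relative improvements quoted in the RESULTS (12.4%, 21.8% and 26.6%) follow directly from the TAP-k scores given in the abstract. The sketch below recomputes them; the dictionary names are illustrative, and only the scores themselves come from the abstract.

```python
# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

# Relative improvement of the composite system over the best team, in percent.
improvement = {
    k: (composite[k] - best_team[k]) / best_team[k] * 100
    for k in best_team
}

for k, pct in improvement.items():
    print(f"k={k}: {pct:.1f}%")
```

Rounded to one decimal place, the three values agree with the percentages stated in the abstract.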


Subjects
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
9.
Bioinformatics ; 27(13): i120-8, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685060

ABSTRACT

MOTIVATION: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents. RESULTS: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts. CONTACT: sanmitra-bhattacharya@uiowa.edu; padmini-srinivasan@uiowa.edu.
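
To make the ROUGE-1 F-scores above more concrete, here is a simplified sketch of a unigram-overlap F-score. It is not the official ROUGE toolkit and makes no claim about the authors' evaluation settings: it assumes whitespace tokenization and lowercasing, with no stemming or stop-word removal, and the two example strings are invented.

```python
from collections import Counter


def rouge_1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-score based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example strings, for illustration only.
reference = "mesh terms select important text segments"
candidate = "mesh terms select segments from full text"
score = rouge_1_f(candidate, reference)
print(f"ROUGE-1 F = {score:.4f}")
```

Here five unigrams are shared, giving precision 5/7, recall 5/6 and an F-score of 10/13. ROUGE-2 and ROUGE-SU4 apply the same idea to bigrams and skip-bigrams, respectively.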


Subjects
Information Storage and Retrieval , Medical Subject Headings , MEDLINE , United States
10.
Indian J Community Med ; 35(4): 498-501, 2010 Oct.
Article in English | MEDLINE | ID: mdl-21278870

ABSTRACT

BACKGROUND: Millions of workers are occupationally exposed to dyes in the world, but little is known about their knowledge and attitudes toward the effects of dye on their health. OBJECTIVES: The aim of this study was to assess the fabric dyers' and fabric printers' knowledge, attitude, and practice toward the health hazard of dyes. MATERIALS AND METHODS: The present study was conducted in the Madurai district, which is situated in southern Tamil Nadu, India. One hundred and forty-two workers employed in small-scale dyeing and printing units participated in a face-to-face confidential interview. RESULTS: The mean age of fabric dyers and fabric printers was 42 years (±10.7). When asked whether dyes affect body organ(s), all the workers agreed that dye(s) affect the skin, but they were not aware that dyes could affect other parts of the body. All the workers believed that safe methods of handling dyes and disposing of contaminated dye packaging need to be considered. It was found that 34% of the workers were using personal protective equipment (PPE) such as rubber hand gloves during work. CONCLUSION: The workers had knowledge regarding the occupational hazards, and their attitudinal approach toward the betterment of the work environment is positive.

11.
BMC Bioinformatics ; 8 Suppl 9: S3, 2007 Nov 27.
Article in English | MEDLINE | ID: mdl-18047704

ABSTRACT

BACKGROUND: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. The goal in this paper, based on this approach, is to develop automatic annotation methods that can supplement the expensive manual annotation processes currently in place. RESULTS: Using a set of Support Vector Machine (SVM) classifiers, we were able to achieve F-scores of 0.49, 0.41 and 0.33 for codes of the molecular function, cellular component and biological process GO hierarchies, respectively. We find that alternative term weighting strategies are not different from each other in performance and that feature selection strategies reduce performance. The best thresholding strategy is one where a single threshold is picked for each hierarchy. Hierarchy level is important, especially for molecular function and biological process. The cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research shows that it is possible to beneficially exploit the hierarchical structures by defining and testing relaxed criteria for classification correctness. Finally, it is possible to build classifiers for codes with very few associated documents but, as expected, a huge penalty is paid in performance. CONCLUSION: The GO annotation problem is complex. Several key observations have been made, for example about topic drift, that may be important to consider in annotation strategies.


Subjects
Artificial Intelligence , Database Management Systems , Databases, Protein , Documentation/methods , Information Storage and Retrieval/methods , Proteins/chemistry , Proteins/metabolism , Natural Language Processing , Proteins/classification , Proteins/genetics
12.
AMIA Annu Symp Proc ; : 681-5, 2007 Oct 11.
Article in English | MEDLINE | ID: mdl-18693923

ABSTRACT

Gene annotations with Gene Ontology codes offer scientists important options in their study of genes and their functions. Automatic GO annotation methods have the potential to supplement the intensive manual annotation processes. Annotation approaches using MEDLINE documents are generally two-phased where the first is to annotate documents with GO codes and the second is to annotate gene products via the documents. In this paper we study document annotation with GO codes using a temporal perspective. Specifically, we build adaptive code-specific classifiers. We also study topic drift i.e., changes in the contextual characteristics of annotations over time. We show that topic drift is significant especially in the biological process GO hierarchy. This at least partially explains the particular challenges faced with codes of this hierarchy.


Subjects
Databases, Genetic , Genes , Natural Language Processing , Vocabulary, Controlled , Computational Biology , MEDLINE , Terminology as Topic
13.
BMC Bioinformatics ; 7: 220, 2006 Apr 21.
Article in English | MEDLINE | ID: mdl-16630348

ABSTRACT

BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings. RESULTS: Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate. CONCLUSION: We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.


Subjects
Abstracting and Indexing/methods , Artificial Intelligence , Genes , Genome, Human , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Algorithms , Humans , Vocabulary, Controlled
14.
AMIA Annu Symp Proc ; : 874-8, 2005.
Article in English | MEDLINE | ID: mdl-16779165

ABSTRACT

This paper explores methods to compare concept spaces derived from different discourses in a common health domain. The concept spaces are generated from the research literature and from message board discussions on the Internet. We explore a number of methods for comparing and contrasting concept space pairs. We experiment with five select health domains in this exploratory research: Autism, AIDS, Fibromyalgia, Irritable Bowel Syndrome and Multiple Sclerosis. The paper concludes with a discussion about the potential of our methods. Future work on refinements to our techniques is also outlined.


Subjects
Communication , Concept Formation , Information Storage and Retrieval/methods , Internet , Acquired Immunodeficiency Syndrome , Autistic Disorder , Electronic Mail , Fibromyalgia , Humans , Irritable Bowel Syndrome , Multiple Sclerosis , Patients , Physicians , PubMed , Unified Medical Language System , Vocabulary
15.
Stud Health Technol Inform ; 107(Pt 2): 808-12, 2004.
Article in English | MEDLINE | ID: mdl-15360924

ABSTRACT

Our aim is to contribute to biomedical text extraction and mining research. In this paper we present exploratory research on the MeSH terms assigned to MEDLINE citations. We analyze MeSH based co-occurrences and identify the interesting ones, i.e., those that are likely to be semantically meaningful. For each selected co-occurring pair we derive a weighted vector representation that emphasizes the verb based functional aspects of the underlying semantics. Preliminary experiments exploring the potential value of these vectors gave us very good results. The larger goal of this project is to contribute to knowledge discovery research by mining the knowledge that is latent within the biomedical literature. It is also to provide a method capable of suggesting cross-disciplinary connections via the pairs derived from all of MEDLINE.


Subjects
Information Storage and Retrieval , MEDLINE , Medical Subject Headings , Semantics
16.
Bioinformatics ; 20 Suppl 1: i290-6, 2004 Aug 04.
Article in English | MEDLINE | ID: mdl-15262811

ABSTRACT

MOTIVATION: Text mining systems aim at knowledge discovery from text collections. This work presents our text mining algorithm and demonstrates its use to uncover information that could form the basis of new hypotheses. In particular, we use it to discover novel uses for Curcuma longa, a dietary substance, which is highly regarded for its therapeutic properties in Asia. RESULTS: Several diseases were identified that offer novel research contexts for curcumin. We analyze select suggestions, such as retinal diseases, Crohn's disease and disorders related to the spinal cord. Our analysis suggests that there is strong evidence in favor of a beneficial role for curcumin in these diseases. The evidence is based on curcumin's influence on several genes, such as COX-2, TNF-alpha, JNK, p38 MAPK and TGF-beta. This research suggests that our discovery algorithm may be used to suggest novel uses for dietary and pharmacological substances. More generally, our text mining algorithm may be used to uncover information that potentially sheds new light on a given topic of interest. AVAILABILITY: Contact authors.


Subjects
Crohn Disease/diet therapy , Curcumin/therapeutic use , Diet Therapy , MEDLINE , Natural Language Processing , Retinal Diseases/diet therapy , Spinal Cord Diseases/diet therapy , Abstracting and Indexing/methods , Artificial Intelligence , Humans , Statistics as Topic
17.
AMIA Annu Symp Proc ; : 440-4, 2003.
Article in English | MEDLINE | ID: mdl-14728211

ABSTRACT

This study evaluated the use of machine learning techniques in the classification of sentence type. 7253 structured abstracts and 204 unstructured abstracts of Randomized Controlled Trials from MEDLINE were parsed into sentences and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature of relative sentence location improved performance markedly for some models, increasing the overall average ROC area to 0.95. Linear classifier performance was significantly worse than the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCTs. Identification of sentence types may be helpful for providing context to end users or other text summarization techniques.
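
The two feature types mentioned above, a bag of words plus relative sentence location, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how location was encoded, so the position-in-[0, 1] encoding, the `__rel_pos__` key and the example sentences are all hypothetical.

```python
from collections import Counter


def sentence_features(sentences: list[str]) -> list[dict[str, float]]:
    """Build one feature dict per sentence: word counts plus relative position."""
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        # Bag of words: lowercased, whitespace-tokenized word counts.
        feats = {w: float(c) for w, c in Counter(sentence.lower().split()).items()}
        # Assumed encoding: 0.0 for the first sentence, 1.0 for the last.
        feats["__rel_pos__"] = i / (n - 1) if n > 1 else 0.0
        features.append(feats)
    return features


# Invented three-sentence abstract, for illustration only.
abstract = [
    "We studied drug X in adults",
    "Patients were randomized to drug X or placebo",
    "Drug X reduced symptoms",
]
features = sentence_features(abstract)
print(features[1]["__rel_pos__"])
```

Dicts of this shape could then be vectorized and passed to a classifier such as an SVM; that step is omitted here to keep the sketch dependency-free.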


Subjects
Abstracting and Indexing , Artificial Intelligence , Linguistics , Linear Models , MEDLINE , ROC Curve , Randomized Controlled Trials as Topic
18.
Proc AMIA Symp ; : 722-6, 2002.
Article in English | MEDLINE | ID: mdl-12463919

ABSTRACT

We present a text mining application that exploits the MeSH heading-subheading combinations present in MEDLINE records. The process begins with a user-specified pair of subheadings. Co-occurring concepts qualified by these subheadings are regarded as being conceptually related and thus extracted. A parallel process using SemRep, a linguistic tool, also extracts conceptually related concept pairs from the titles of MEDLINE records. The pairs extracted via MeSH and the pairs extracted via SemRep are compared to yield a high-confidence subset. These pairs are then combined to project a summary view associated with the selected subheading pair. For each concept, the "diversity" in the set of related concepts is assessed. We suggest that this summary and the diversity indicators will be useful to a health care practitioner or researcher. We illustrate this application with the subheading pair "drug therapy" and "therapeutic use", which approximates the treatment relationship between Drugs and Diseases.


Subjects
Information Storage and Retrieval/methods , MEDLINE , Subject Headings , Drug Therapy