Results 1 - 20 of 22
1.
Sci Rep ; 14(1): 12204, 2024 May 28.
Article in English | MEDLINE | ID: mdl-38806483

ABSTRACT

In the era of social media, the use of emojis and code-mixed language has become essential in online communication. However, selecting the appropriate emoji that matches a particular sentiment or emotion in the code-mixed text can be difficult. This paper presents a novel task of predicting multiple emojis in English-Hindi code-mixed sentences and proposes a new dataset called SENTIMOJI, which extends the SemEval 2020 Task 9 SentiMix dataset. Our approach is based on exploiting the relationship between emotion, sentiment, and emojis to build an end-to-end framework. We replace the self-attention sublayers in the transformer encoder with simple linear transformations and use the RMS-layer norm instead of the normal layer norm. Moreover, we employ Gated Linear Unit and Fully Connected layers to predict emojis and identify the emotion and sentiment of a tweet. Our experimental results on the SENTIMOJI dataset demonstrate that the proposed multi-task framework outperforms the single-task framework. We also show that emojis are strongly linked to sentiment and emotion and that identifying sentiment and emotion can aid in accurately predicting the most suitable emoji. Our work contributes to the field of natural language processing and can help in the development of more effective tools for sentiment analysis and emotion recognition in code-mixed languages. The codes and data will be available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SENTIMOJI to facilitate research.
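The RMS layer norm used in place of standard layer norm is easy to state concretely. Below is a minimal pure-Python sketch (function and argument names are ours, for illustration only, not the paper's released code):

```python
import math

def rms_norm(x, gain=None, eps=1e-8):
    """RMS layer norm: rescale x by its root-mean-square statistic.

    Unlike standard layer norm, no mean is subtracted and no bias is
    added, which removes the mean-centering cost while keeping
    re-scaling invariance.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # a learnable per-dimension gain in practice
    return [g * v / rms for g, v in zip(gain, x)]
```

For example, `rms_norm([3.0, 4.0])` rescales by sqrt((9 + 16) / 2) ≈ 3.536, giving roughly `[0.849, 1.131]`.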

2.
J Intell Inf Syst ; : 1-22, 2023 Jun 06.
Article in English | MEDLINE | ID: mdl-37363075

ABSTRACT

With the growing presence of multimodal content on the web, a specific category of fake news is rampant on popular social media outlets. In this category of fake online information, real multimedia contents (images, videos) are used in different but related contexts with manipulated texts to mislead the readers. The presence of seemingly non-manipulated multimedia content reinforces the belief in the associated fabricated textual content. Detecting this category of misleading multimedia fake news is almost impossible without reference to prior knowledge. In addition, the presence of highly novel and emotion-invoking content can fuel the rapid dissemination of such fake news. To counter this problem, in this paper, we first introduce a novel multimodal fake news dataset that includes background knowledge (from authenticated sources) of the misleading articles. Second, we design a multimodal framework using Supervised Contrastive Learning (SCL)-based novelty detection and emotion prediction tasks for fake news detection. We perform extensive experiments to reveal that our proposed model outperforms the state-of-the-art (SOTA) models.
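The supervised contrastive objective at the core of the novelty-detection task can be sketched in a few lines. This is a generic SupCon-style loss in pure Python, assuming L2-normalised embeddings; it is illustrative, not the paper's code:

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: pull same-label embeddings together,
    push different-label embeddings apart, at temperature tau."""
    n = len(embeddings)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / tau)
                    for a in range(n) if a != i)
        total += -sum(math.log(math.exp(dot(embeddings[i], embeddings[p]) / tau)
                               / denom)
                      for p in positives) / len(positives)
        anchors += 1
    return total / anchors
```

When same-label embeddings coincide and different-label ones are orthogonal, the loss is near zero; mixing the labels up makes it large.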

3.
Artif Intell Med ; 139: 102535, 2023 05.
Article in English | MEDLINE | ID: mdl-37100505

ABSTRACT

Medical dialog systems have the potential to assist e-medicine in improving access to healthcare services, improving patient treatment quality, and lowering medical expenses. In this research, we describe a knowledge-grounded conversation generation model that demonstrates how large-scale medical information in the form of knowledge graphs can aid in language comprehension and generation in medical dialog systems. Existing generative dialog systems often produce generic responses, resulting in monotonous and uninteresting conversations. To solve this problem, we combine various pre-trained language models with a medical knowledge base (UMLS) to generate clinically correct and human-like medical conversations using the recently released MedDialog-EN dataset. The medical knowledge graph contains broadly three types of medical information: diseases, symptoms, and laboratory tests. We perform reasoning over the retrieved knowledge graph by reading the triples in each graph using MedFact attention, which allows us to use semantic information from the graphs for better response generation. In order to preserve medical information, we employ a policy network, which effectively injects relevant entities associated with each dialog into the response. We also study how transfer learning can significantly improve performance by utilizing a relatively small corpus, created by extending the recently released CovidDialog dataset, containing dialogs for diseases that are symptoms of Covid-19. Empirical results on the MedDialog corpus and the extended CovidDialog dataset demonstrate that our proposed model significantly outperforms the state-of-the-art methods in terms of both automatic evaluation and human judgment.


Subjects
COVID-19; Pattern Recognition, Automated; Humans; Semantics; Unified Medical Language System; Communication
4.
Sci Rep ; 13(1): 3310, 2023 02 27.
Article in English | MEDLINE | ID: mdl-36849466

ABSTRACT

Smart healthcare systems that make use of abundant health data can improve access to healthcare services, reduce medical costs and provide consistently high-quality patient care. Medical dialogue systems that generate medically appropriate and human-like conversations have been developed using various pre-trained language models and a large-scale medical knowledge base based on the Unified Medical Language System (UMLS). However, most knowledge-grounded dialogue models use only the local structure of the observed triples, suffer from knowledge graph incompleteness, and hence cannot incorporate any information from the dialogue history while creating entity embeddings. As a result, the performance of such models decreases significantly. To address this problem, we propose a general method to embed the triples in each graph into large, scalable models and thereby generate clinically correct responses based on the conversation history, using the recently released MedDialog(EN) dataset. Given a set of triples, we first mask the head entities of the triples overlapping with the patient's utterance and then compute the cross-entropy loss against the triples' respective tail entities while predicting the masked entity. This process yields a representation of the medical concepts in a graph that can learn contextual information from dialogues, which ultimately helps lead to the gold response. We also fine-tune the proposed Masked Entity Dialogue (MED) model on a smaller corpus, named the Covid Dataset, which contains dialogues focusing only on the Covid-19 disease. In addition, since UMLS and other existing medical graphs lack data-specific medical information, we re-curate and perform plausible augmentation of the knowledge graphs using our newly created Medical Entity Prediction (MEP) model. Empirical results on the MedDialog(EN) and Covid Dataset demonstrate that our proposed model outperforms the state-of-the-art methods in terms of both automatic and human evaluation metrics.
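The masking step described above reduces, per training example, to an ordinary cross-entropy over candidate tail entities. A toy sketch, assuming a small entity vocabulary and pre-computed logits (all names are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_entity_loss(logits, target_idx):
    """Cross-entropy for predicting the masked entity:
    -log p(target tail entity | context)."""
    return -math.log(softmax(logits)[target_idx])
```

With logits `[2.0, 1.0, 0.0]` and the gold tail at index 0, the loss is -log(e^2 / (e^2 + e^1 + e^0)) ≈ 0.408; a wrong gold index yields a larger loss.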


Subjects
COVID-19; Humans; COVID-19/epidemiology; Benchmarking; Communication; Entropy; Gold
5.
PLoS One ; 18(2): e0280458, 2023.
Article in English | MEDLINE | ID: mdl-36795731

ABSTRACT

Neural open-domain dialogue systems often fail to engage humans in long-term interactions on popular topics such as sports, politics, fashion, and entertainment. However, to have more socially engaging conversations, we need to formulate strategies that consider emotion, relevant facts, and user behaviour in multi-turn conversations. Establishing such engaging conversations using maximum likelihood estimation (MLE) based approaches often suffers from the problem of exposure bias. Since the MLE loss evaluates sentences at the word level, we focus on sentence-level judgment for training. In this paper, we present a method named EmoKbGAN for automatic response generation that makes use of a Generative Adversarial Network (GAN) in a multiple-discriminator setting, jointly minimising the losses provided by each attribute-specific discriminator model (a knowledge and an emotion discriminator). Experimental results on two benchmark datasets, i.e., the Topical Chat and Document Grounded Conversation datasets, show that our proposed method significantly improves overall performance over the baseline models in terms of both automated and human evaluation metrics, asserting that the model can generate fluent sentences with better control over emotion and content quality.
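In the multiple-discriminator setting above, the generator's objective combines the losses from each attribute-specific discriminator. A minimal sketch of that joint loss, assuming a non-saturating -log D(generated) form per discriminator (the form and weights are our illustrative assumptions, not necessarily the paper's exact formulation):

```python
import math

def generator_loss(disc_scores, weights=None):
    """Joint generator loss in a multiple-discriminator GAN:
    a weighted sum of -log D_k(generated sentence) over the attribute
    discriminators (e.g. a knowledge and an emotion discriminator)."""
    if weights is None:
        weights = [1.0] * len(disc_scores)  # equal weighting by default
    return sum(-w * math.log(s) for w, s in zip(weights, disc_scores))
```

Higher discriminator scores on the generated response (i.e. more convincing knowledge grounding and emotion) give a lower joint loss.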


Subjects
Communication; Neural Networks, Computer; Humans; Emotions; Image Processing, Computer-Assisted/methods
6.
PLoS One ; 18(2): e0269856, 2023.
Article in English | MEDLINE | ID: mdl-36758020

ABSTRACT

Effective dialogue generation for task completion is challenging. The task requires the response generation system to generate responses consistent with intent and slot values, to produce diverse responses, and to handle multiple domains. The response also needs to be relevant to the context of the previous utterances in the conversation. In this paper, we build six different models containing Bi-directional Long Short Term Memory (Bi-LSTM) and Bidirectional Encoder Representations from Transformers (BERT) based encoders. To generate the correct slot values, we implement a copy mechanism on the decoder side. To capture the conversation context and the current state of the conversation, we introduce a simple heuristic to build a conversational knowledge graph. Using this novel algorithm, we are able to capture important aspects of a conversation. This conversational knowledge graph is then used by our response generation model to generate more relevant and consistent responses. With this knowledge graph, we do not need the entire utterance history; only the last utterance is needed to capture the conversational context. We conduct experiments showing the effectiveness of the knowledge graph in capturing the context and generating good responses. We compare these results against hierarchical encoder-decoder models and show that using triples from the conversational knowledge graph is an effective method to capture context and the user requirement. With this knowledge graph, we show an average performance gain of 0.75 BLEU score across different models. Similar results also hold across different manual evaluation metrics.
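A conversational knowledge graph of the kind described can be maintained with a very simple heuristic: keep the latest object for every (subject, relation) pair, so the graph plus the last utterance summarises the dialogue so far. A toy sketch (the restaurant-booking triples below are hypothetical, not from the paper):

```python
def update_graph(graph, triples):
    """Maintain a conversational knowledge graph as a
    (subject, relation) -> object map, so a later slot value
    overwrites an earlier one for the same slot."""
    for subj, rel, obj in triples:
        graph[(subj, rel)] = obj
    return graph

# Hypothetical dialogue: each turn yields extracted triples.
graph = {}
update_graph(graph, [("restaurant", "cuisine", "italian")])
update_graph(graph, [("restaurant", "area", "centre"),
                     ("restaurant", "cuisine", "thai")])  # user changed their mind
```

After these two turns the graph holds `area = centre` and `cuisine = thai`; together with the last utterance it carries the context without the full history.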


Subjects
Communication; Pattern Recognition, Automated; Algorithms; Benchmarking; Electric Power Supplies
7.
PLoS One ; 18(1): e0278323, 2023.
Article in English | MEDLINE | ID: mdl-36607963

ABSTRACT

In a task-oriented dialogue setting, the user's mood and demands can change during an ongoing dialogue, which may lead to a non-informative conversation or result in conversation drop-off. To rectify such scenarios, a conversational agent should be able to learn the user's behaviour online and form informative, empathetic and interactive responses. To incorporate these three aspects, we propose a novel end-to-end dialogue system, GenPADS. First, we build and train two models, viz. a politeness classifier to extract the politeness information present in the user's and agent's utterances, and a generation model (G) to generate varying but semantically correct responses. We then incorporate both models in a reinforcement learning (RL) setting using two different politeness-oriented reward algorithms to adapt and generate polite responses. To train our politeness classifier, we annotate the recently released Taskmaster dataset with four fine-grained classes depicting politeness and impoliteness. Further, to train our generator model, we prepare a GenDD dataset from the same Taskmaster dataset. Lastly, we train GenPADS and perform automatic and human evaluation by building seven different user simulators. Detailed analysis reveals that GenPADS performs better than the two considered baselines, viz. a transformer-based seq2seq generator model for the user's and agent's utterances and a retrieval-based politeness adaptive dialogue system (PADS).


Subjects
Algorithms; Communication; Humans; Learning; Reinforcement, Psychology; Adaptation, Physiological
8.
Int J Digit Libr ; 23(3): 289-301, 2022.
Article in English | MEDLINE | ID: mdl-35873651

ABSTRACT

Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles nowadays is a critical use case for researchers to understand the underlying research quickly and move forward, especially in this age of infodemic. MRC on research articles can also provide helpful information to the reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, firstly, we introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion having more than 100k human-annotated context-question-answer triples. Secondly, we implement one baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). Additionally, we implement two models: the first one is based on Science BERT (SciBERT), and the second is the combination of SciBERT and Bi-Directional Attention Flow (Bi-DAF). The best model (i.e., SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens up a new avenue for scholarly document processing research by providing a benchmark QA dataset and standard baseline. We make our dataset and codes available here at https://github.com/TanikSaikh/Scientific-Question-Answering.
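The BERT-style extractive QA baselines above all share the same decoding step: choose the answer span maximising the sum of start and end logits, subject to start ≤ end. A minimal sketch of that step alone (the logit values in the usage below are made up):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the answer span (s, e) with s <= e that maximises
    start_logits[s] + end_logits[e], as in BERT-style extractive QA.
    max_len caps the span length, a common practical constraint."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

For example, `best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0])` returns `(1, 2)`: token 1 is the most likely start and token 2 the most likely end.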

9.
Sci Rep ; 12(1): 4457, 2022 03 15.
Article in English | MEDLINE | ID: mdl-35292695

ABSTRACT

With the upsurge in suicide rates worldwide, timely identification of the at-risk individuals using computational methods has been a severe challenge. Anyone presenting with suicidal thoughts, mainly recurring and containing a deep desire to die, requires urgent and ongoing psychiatric treatment. This work focuses on investigating the role of temporal orientation and sentiment classification (auxiliary tasks) in jointly analyzing the victims' emotional state (primary task). Our model leverages the effectiveness of multitask learning by sharing features among the tasks through a novel multi-layer cascaded shared-private attentive network. We conducted our experiments on a diversified version of the prevailing standard emotion annotated corpus of suicide notes in English, CEASE-v2.0. Experiments show that our proposed multitask framework outperforms the existing state-of-the-art system by 3.78% in the Emotion task, with a cross-validation Mean Recall (MR) of 60.90%. From our empirical and qualitative analysis of results, we observe that learning the tasks of temporality and sentiment together has a clear correlation with emotion recognition.


Subjects
Suicide; Attitude; Emotions; Humans; Learning; Suicidal Ideation
10.
Sci Rep ; 12(1): 493, 2022 01 11.
Article in English | MEDLINE | ID: mdl-35017584

ABSTRACT

Temporal orientation is an important aspect of human cognition which shows how an individual emphasizes past, present, and future. Theoretical research in psychology shows that one's emotional state can influence his/her temporal orientation. We hypothesize that measuring human temporal orientation can benefit from concurrent learning of emotion. To test this hypothesis, we propose a deep learning-based multi-task framework where we concurrently learn a unified model for temporal orientation (our primary task) and emotion analysis (secondary task) using tweets. Our multi-task framework takes users' tweets as input and produces three temporal orientation labels (past, present or future) and four emotion labels (joy, sadness, anger, or fear) with intensity values as outputs. The classified tweets are then grouped for each user to obtain the user-level temporal orientation and emotion. Finally, we investigate the associations between the users' temporal orientation and their emotional state. Our analysis reveals that joy and anger are correlated to future orientation while sadness and fear are correlated to the past orientation.

11.
PLoS One ; 17(1): e0259238, 2022.
Article in English | MEDLINE | ID: mdl-35085252

ABSTRACT

Peer review is at the heart of scholarly communication and the cornerstone of scientific publishing. However, academia often criticizes the peer review system as non-transparent, biased, arbitrary, and a flawed process at the heart of science, leading researchers to question its reliability and quality. These problems persist partly because peer-review texts are rarely available for study, owing to various proprietary and confidentiality clauses. Peer-review texts could serve as a rich source for Natural Language Processing (NLP) research on understanding the scholarly communication landscape, and thereby help build systems that mitigate those pertinent problems. In this work, we present a first-of-its-kind multi-layered dataset of 1199 open peer-review texts manually annotated at the sentence level (∼ 17k sentences) across four layers, viz. Paper Section Correspondence, Paper Aspect Category, Review Functionality, and Review Significance. Given a text written by the reviewer, we annotate: to which sections (e.g., Methodology, Experiments, etc.) and what aspects (e.g., Originality/Novelty, Empirical/Theoretical Soundness, etc.) of the paper the review text corresponds, what role the review text plays (e.g., appreciation, criticism, summary, etc.), and the importance of the review statement (major, minor, general) within the review. We also annotate the sentiment of the reviewer (positive, negative, neutral) for the first two layers to judge the reviewer's perspective on the different sections and aspects of the paper. We further introduce four novel tasks with this dataset, which could serve as indicators of the exhaustiveness of a peer review and can be a step towards the automatic judgment of review quality. We also present baseline experiments and results for the different tasks for further investigation. We believe our dataset will provide a benchmark experimental testbed for automated systems that leverage current state-of-the-art NLP techniques to address different issues with peer review quality, thereby ushering in increased transparency and trust in the holy grail of scientific research validation. Our dataset and associated codes are available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Peer-Review-Analyze.


Subjects
Benchmarking/standards; Databases, Factual; Humans; Natural Language Processing; Peer Review, Research; Reproducibility of Results
12.
IEEE/ACM Trans Comput Biol Bioinform ; 19(2): 1105-1116, 2022.
Article in English | MEDLINE | ID: mdl-32853152

ABSTRACT

MOTIVATION: To minimize the accelerating amount of time invested in biomedical literature search, numerous approaches for automated knowledge extraction have been proposed. Relation extraction is one such task, where semantic relations between entities are identified from free text. In the biomedical domain, extraction of regulatory pathways, metabolic processes, adverse drug reactions or disease models necessitates knowledge of the individual relations, for example, physical or regulatory interactions between genes, proteins, drugs, chemicals, diseases or phenotypes. RESULTS: In this paper, we study the relation extraction task on three major biomedical and clinical tasks, namely drug-drug interaction, protein-protein interaction, and medical concept relation extraction. Towards this, we model the relation extraction problem in a multi-task learning (MTL) framework, and introduce for the first time the concept of a structured self-attentive network complemented with an adversarial learning approach for the prediction of relationships from biomedical and clinical text. The fundamental notion of MTL is to learn multiple problems simultaneously by utilizing a shared representation. Additionally, we generate a highly efficient single-task model, which exploits the shortest dependency path embedding learned over an attentive gated recurrent unit, to compare against our proposed MTL models. The proposed framework significantly improves over all the baselines (deep learning techniques) and single-task models for predicting the relationships, without compromising the performance on any of the tasks.


Subjects
Semantics
13.
PLoS One ; 15(11): e0241271, 2020.
Article in English | MEDLINE | ID: mdl-33151948

ABSTRACT

Multimodal dialogue systems, due to their many-fold applications, have gained much attention from researchers and developers in recent times. With the release of the large-scale multimodal dialogue dataset of Saha et al. (2018) on the fashion domain, it has become possible to investigate dialogue systems having both textual and visual modalities. Response generation is an essential aspect of every dialogue system, and making the responses diverse is an important problem. For any goal-oriented conversational agent, the system's responses must be informative, diverse and polite, which may lead to better user experiences. In this paper, we propose an end-to-end neural framework for generating varied responses in a multimodal dialogue setup, capturing information from both text and image. A multimodal encoder with co-attention between the text and image is used to focus on the different modalities and obtain better contextual information. For effective information sharing across the modalities, we combine the information of text and images using the BLOCK fusion technique, which helps in learning an improved multimodal representation. We employ stochastic beam search with the Gumbel top-k trick to achieve diversified responses while preserving the content and politeness of the responses. Experimental results show that our proposed approach performs significantly better than the existing and baseline methods in terms of distinct metrics, and thereby generates more diverse responses that are informative, interesting and polite without any loss of information. Empirical evaluation also reveals that images, when used along with the text, improve the model's ability to generate diversified responses.
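Stochastic beam search builds on the Gumbel top-k trick: perturb log-probabilities with Gumbel noise and keep the k largest, which samples k distinct candidates without replacement. A minimal sketch of the trick on its own, outside any beam-search machinery (names are ours):

```python
import math
import random

def gumbel_top_k(log_probs, k, rng=random):
    """Sample k distinct indices without replacement: add Gumbel noise
    g = -log(-log(U)) to each log-probability and take the top k."""
    keys = [lp - math.log(-math.log(rng.random())) for lp in log_probs]
    return sorted(range(len(log_probs)), key=lambda i: keys[i],
                  reverse=True)[:k]
```

Each call returns k distinct indices; higher-probability candidates are chosen more often across repeated draws, which is what yields diverse yet plausible beams.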


Subjects
Task Performance and Analysis; Algorithms; Databases as Topic; Humans; Models, Theoretical; Motivation
14.
Interdiscip Sci ; 12(4): 537-546, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32193856

ABSTRACT

Many kinds of disease-related data are now available, and researchers are constantly attempting to mine useful information from them. Medical data are not always homogeneous or structured, and they are mostly time-stamped. Thus, special care is required to prevent any kind of information loss while mining such data. Mining medical data is challenging, as inaccurate predictions are often not acceptable in this domain. In this paper, we analyze a partially annotated coronary artery disease (CAD) dataset that was originally in a semi-structured form. We created a set of well-defined features from the dataset, and then built predictive models for CAD risk identification using different supervised learning algorithms. We further enhanced the performance of the models using a feature selection technique. Experimental results are promising, and are expected to help medical practitioners in investigating CAD risk in patients.


Subjects
Coronary Artery Disease; Algorithms; Humans
15.
Heliyon ; 5(9): e02504, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31687594

ABSTRACT

Different machine translation systems have different strengths. We propose an improved statistical system combination approach that exploits the advantages of existing machine translation systems. The primary task is to score all the phrases of the outputs of the different machine translation systems selected for combination. Three steps are involved in the proposed statistical system combination approach, viz. alignment, decoding, and scoring. Pair alignment is done in the first step to prevent duplication, so that only a single phrase is chosen from the various phrases containing the same information. Thus both an alignment and a scoring strategy are implemented in our approach. Hypotheses are built in the second step. In the third step, we calculate the scores for all the hypotheses, and the hypothesis with the highest score is chosen as the final translated output. Incorrect scoring can prevent identifying the best parts of the different systems. It may be noted that a particular phrase may appear in various forms in different translations. To resolve these challenges, we incorporate WordNet in the alignment phase and word2vec in the scoring phase, along with the existing factors. We find that the system combination model with WordNet and word2vec injection improves machine translation accuracy. In this work, we merge three systems, viz. a hierarchical machine translation system, Bing Microsoft Translate, and Google Translate. Broad translation tests on eight language pairs with benchmark datasets demonstrate that the proposed system achieves better quality than the individual systems and state-of-the-art system combination models.
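The word2vec injection in the scoring phase amounts to comparing phrases by embedding similarity rather than surface form, so paraphrases produced by different systems are not penalised. A minimal cosine-based sketch (the vectors stand in for real word2vec phrase embeddings; the scoring rule is our illustrative simplification):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def phrase_score(candidate_vec, reference_vecs):
    """Score a candidate phrase by its best embedding match against
    phrases from the other systems' outputs."""
    return max(cosine(candidate_vec, r) for r in reference_vecs)
```

A candidate whose embedding matches any other system's phrase closely scores near 1 even when the surface strings differ.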

16.
PLoS One ; 14(2): e0211872, 2019.
Article in English | MEDLINE | ID: mdl-30785900

ABSTRACT

Time Perspective (TP) is an important area of research within the 'psychological time' paradigm. TP, or the manner in which individuals conduct themselves as a reflection of their cogitation of the past, the present, and the future, is considered a basic facet of human functioning. These perceptions of time influence our actions, perceptions, and emotions. Assessment of TP based on human language on Twitter opens up a new avenue for research on the subjective view of time at a large scale. In order to assess users' TP from their tweets, the foremost task is to resolve grammatical tense into the underlying temporal orientation of tweets, since for many tweets the grammatical tense and the underlying temporal orientation differ. In this article, we first resolve the grammatical tense of users' tweets to identify their underlying temporal orientation: past, present, or future. We develop a minimally supervised classification framework for the temporal orientation task that enables incorporating linguistic knowledge into a deep neural network. The temporal orientation model achieves an accuracy of 78.7% when tested on a manually annotated test set, performing better than the state-of-the-art technique. Secondly, we apply the classification model to classify users' tweets into the past, present, or future categories. Tweets classified this way are then grouped for each user, which gives rise to unidimensional TP. Valence (positive, negative, or neutral) is added to the temporal orientation dimension to produce bidimensional TP. We finally investigate, in a large-scale empirical manner, the association between Twitter users' unidimensional and bidimensional TP and their age, education and six basic emotions. Our analysis shows that people tend to think more about the past, and more positively about the future, as they age. We also observe that future-negative people are less joyful, sadder, more disgusted, and angrier, while past-negative people show more fear.


Subjects
Psycholinguistics/trends; Social Media; Time Perception/physiology; Emotions/physiology; Fear/psychology; Humans; Language; Linguistics; Neural Networks, Computer
17.
IEEE J Biomed Health Inform ; 20(4): 1171-7, 2016 07.
Article in English | MEDLINE | ID: mdl-26208367

ABSTRACT

Studying the patterns hidden in gene-expression data helps in understanding the functionality of genes. In general, clustering techniques are widely used for the identification of natural partitionings from gene expression data. In order to constrain dimensionality, feature selection is a key issue, because not all features are important from a clustering point of view. Moreover, a limited amount of supervised information can help to fine-tune the obtained clustering solution. In this paper, the problem of simultaneous feature selection and semisupervised clustering is formulated as a multiobjective optimization (MOO) task. A modern simulated annealing-based MOO technique, namely AMOSA, is utilized as the underlying optimization methodology. Here, features and cluster centers are represented in the form of a string, and the assignment of genes to different clusters is done using a point symmetry-based distance. Six optimization criteria based on several internal and external cluster validity indices are utilized. In order to generate the supervised information, a popular clustering technique, Fuzzy C-means, is utilized. An appropriate subset of features, the proper number of clusters and the proper partitioning are determined using the search capability of AMOSA. The effectiveness of the proposed semisupervised clustering technique, Semi-FeaClustMOO, is demonstrated on five publicly available benchmark gene-expression datasets. Comparison with existing techniques for gene-expression data clustering again reveals the superiority of the proposed technique. Statistical and biological significance tests have also been carried out.
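The point-symmetry-based distance used for assigning genes to clusters can be sketched directly from its definition: reflect a point through a candidate cluster centre and check whether the reflection lands near existing data points. A simplified version following the common two-nearest-neighbour formulation (not the paper's exact code; names are ours):

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ps_distance(x, centre, data, knear=2):
    """Point-symmetry-based distance: reflect x through the cluster
    centre, average the distances from the reflected point to its
    knear nearest data points, and scale by the Euclidean distance
    from x to the centre."""
    reflected = tuple(2 * c - xi for xi, c in zip(x, centre))
    nearest = sorted(euclid(reflected, p) for p in data)[:knear]
    return (sum(nearest) / knear) * euclid(x, centre)
```

If the dataset contains a near-mirror-image of x about the centre, the first nearest-neighbour distance is near zero and the symmetry distance stays small, favouring that centre.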


Subjects
Computational Biology/methods; Gene Expression Profiling/methods; Supervised Machine Learning; Algorithms; Animals; Arabidopsis/genetics; Arabidopsis/metabolism; Cluster Analysis; Databases, Genetic; Fuzzy Logic; Humans; Rats; Yeasts/genetics; Yeasts/metabolism
18.
Int J Data Min Bioinform ; 11(4): 365-91, 2015.
Article in English | MEDLINE | ID: mdl-26336665

ABSTRACT

Named Entity Recognition and Classification (NERC) is an important task in information extraction for the biomedical domain. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which, in general, have complex structures and are difficult to recognise. In this paper, we propose a single objective optimisation-based classifier ensemble technique using the search capability of a Genetic Algorithm (GA) for NERC in biomedical texts. Here, the GA is used to quantify the amount of voting for each class in each classifier. We use diverse classification methods, namely Conditional Random Fields and Support Vector Machines, to build a number of models depending upon various representations of the set of features and/or feature templates. The proposed technique is evaluated on two benchmark datasets, namely JNLPBA 2004 and GENETAG. Experiments yield overall F-measure values of 75.97% and 95.90%, respectively. Comparisons with existing systems show that our proposed system achieves state-of-the-art performance.
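At prediction time, a GA-tuned ensemble of this kind reduces to a weighted vote: each classifier's vote for a class counts with a classifier- and class-specific weight, which is the quantity the GA optimises during training. A toy sketch with hypothetical weights and labels (not from the paper):

```python
def weighted_vote(per_classifier_labels, weights, classes):
    """Combine classifier outputs with per-(classifier, class) vote
    weights; returns the class with the highest weighted vote total."""
    score = {c: 0.0 for c in classes}
    for m, label in enumerate(per_classifier_labels):
        score[label] += weights[m][label]
    return max(score, key=score.get)
```

With weights that trust classifiers 0 and 2 on the entity class, two "B-protein" votes outweigh one "O" vote even though the margins per classifier differ.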


Subjects
Computational Biology/methods; Data Mining/methods; Support Vector Machine; Algorithms; Databases, Genetic; Humans
19.
J Cheminform ; 7(Suppl 1 Text mining for chemistry and the CHEMDNER track): S2, 2015.
Article in English | MEDLINE | ID: mdl-25810773

ABSTRACT

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.

20.
Springerplus ; 3: 465, 2014.
Article in English | MEDLINE | ID: mdl-25279282

ABSTRACT

In this paper we couple the feature selection problem with semi-supervised clustering. Semi-supervised clustering utilizes information from both unsupervised and supervised learning in order to overcome the problems of each. In general, however, not all features present in a data set are important for clustering purposes; appropriate selection of features is therefore highly relevant from a clustering point of view. In this paper we solve the problems of automatic feature selection and semi-supervised clustering using multiobjective optimization. A recently created simulated annealing-based multiobjective optimization technique, titled archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization technique. Here, features and cluster centers are encoded in the form of a string. We assume that for each data set, class-level information is known for 10% of the data points. Two internal cluster validity indices reflecting different data properties, an external cluster validity index measuring the similarity between the obtained partitioning and the true labelling of the 10% labelled data points, and a measure counting the number of features present in a particular string are optimized using the search capability of AMOSA. AMOSA is utilized to detect the appropriate subset of features, the appropriate number of clusters, and the appropriate partitioning from any given data set. The effectiveness of the proposed semi-supervised feature selection technique as compared to existing techniques is shown on seven real-life data sets of varying complexities.
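AMOSA itself maintains an archive of non-dominated solutions across several objectives; as a much-simplified illustration of the underlying search over string-encoded feature subsets, here is a single-objective simulated annealing with bit-flip moves (all names and the cost function are ours, not AMOSA's):

```python
import math
import random

def anneal_features(n_features, cost, steps=200, t0=1.0, rng=None):
    """Single-chain simulated annealing over feature subsets encoded
    as boolean strings: flip one feature in/out per step, always accept
    improvements, and accept worsenings with probability exp(-delta/T)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    state = [rng.random() < 0.5 for _ in range(n_features)]
    best, best_cost = list(state), cost(state)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9  # linear cooling schedule
        cand = list(state)
        i = rng.randrange(n_features)
        cand[i] = not cand[i]  # flip one feature in or out
        delta = cost(cand) - cost(state)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            state = cand
            if cost(state) < best_cost:
                best, best_cost = list(state), cost(state)
    return best
```

In the paper's setting the scalar `cost` is replaced by multiple cluster validity objectives and an archive of trade-off solutions; this sketch shows only the annealing acceptance rule at the core of that search.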
