Results 1 - 20 of 134
1.
Proc Natl Acad Sci U S A ; 120(25): e2220726120, 2023 06 20.
Article in English | MEDLINE | ID: mdl-37307492

ABSTRACT

Large-scale language datasets and advances in natural language processing offer opportunities for studying people's cognitions and behaviors. We show how representations derived from language can be combined with laboratory-based word norms to predict implicit attitudes for diverse concepts. Our approach achieves substantially higher correlations than existing methods. We also show that our approach is more predictive of implicit attitudes than are explicit attitudes, and that it captures variance in implicit attitudes that is largely unexplained by explicit attitudes. Overall, our results shed light on how implicit attitudes can be measured by combining standard psychological data with large-scale language data. In doing so, we pave the way for highly accurate computational modeling of what people think and feel about the world around them.
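The abstract does not specify the mapping from language representations to norms, but a common recipe is a regularized regression from embedding vectors to laboratory ratings. A minimal sketch under that assumption, with random stand-in vectors and scores rather than the paper's data:

```python
# Hypothetical sketch: predict attitude-style norm scores for concepts from
# their word-embedding vectors with a regularized linear model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_concepts, dim = 200, 300                    # e.g., 300-d embeddings for 200 concepts
X = rng.normal(size=(n_concepts, dim))        # embedding vectors (toy)
y = X @ rng.normal(size=dim) + rng.normal(scale=0.5, size=n_concepts)  # toy norm scores

model = Ridge(alpha=10.0)
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())

# Fit on all labeled concepts, then score unseen concepts from their embeddings.
model.fit(X, y)
print("predicted norm scores:", model.predict(rng.normal(size=(5, dim))))
```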


Subject(s)
Cognition , Emotions , Humans , Computer Simulation , Laboratories , Attitude
2.
Proc Natl Acad Sci U S A ; 119(28): e2121798119, 2022 07 12.
Article in English | MEDLINE | ID: mdl-35787033

ABSTRACT

Using word embeddings from 850 billion words in English-language Google Books, we provide an extensive analysis of historical change and stability in social group representations (stereotypes) across a long timeframe (from 1800 to 1999), for a large number of social group targets (Black, White, Asian, Irish, Hispanic, Native American, Man, Woman, Old, Young, Fat, Thin, Rich, Poor), and their emergent, bottom-up associations with 14,000 words and a subset of 600 traits. The results provide a nuanced picture of change and persistence in stereotypes across 200 y. Change was observed in the top-associated words and traits: Whether analyzing the top 10 or 50 associates, at least 50% of top associates changed across successive decades. Despite this changing content of top-associated words, the average valence (positivity/negativity) of these top stereotypes was generally persistent. Ultimately, through advances in the availability of historical word embeddings, this study offers a comprehensive characterization of both change and persistence in social group representations as revealed through books of the English-speaking world from 1800 to 1999.
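The core operation described here, taking the top-associated words of a group term in each decade's embedding space and averaging a valence norm over them, can be sketched as follows; the embeddings and valence lexicon below are random placeholders, not the Google Books models:

```python
# Hypothetical sketch: per-decade top-k cosine associates of a group term,
# plus the mean valence of those associates.
import numpy as np

rng = np.random.default_rng(1)
vocab = [f"word{i}" for i in range(1000)] + ["woman"]
valence = {w: rng.uniform(-1, 1) for w in vocab}        # toy valence norms

def top_associates(emb, vocab, target, k=50):
    """Return the k nearest neighbours of `target` by cosine similarity."""
    idx = vocab.index(target)
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]

for decade in (1800, 1900, 1990):
    emb = rng.normal(size=(len(vocab), 100))            # stand-in for that decade's model
    assoc = top_associates(emb, vocab, "woman", k=50)
    print(decade, "mean valence of top associates:",
          round(float(np.mean([valence[w] for w in assoc])), 3))
```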


Subject(s)
Books , Search Engine , Female , History, 19th Century , History, 20th Century , Humans , Language , Male , Population Groups/history , Stereotyping
3.
Proc Natl Acad Sci U S A ; 119(10): e2108801119, 2022 03 08.
Article in English | MEDLINE | ID: mdl-35239440

ABSTRACT

Significance: We introduce an approach to identify latent topics in large-scale text data. Our approach integrates two prominent methods of computational text analysis: topic modeling and word embedding. We apply our approach to written narratives of violent death (e.g., suicides and homicides) in the National Violent Death Reporting System (NVDRS). Many of our topics reveal aspects of violent death not captured in existing classification schemes. We also extract gender bias in the topics themselves (e.g., a topic about long guns is particularly masculine). Our findings suggest new lines of research that could contribute to reducing suicides or homicides. Our methods are broadly applicable to text data and can unlock similar information in other administrative databases.


Subject(s)
Databases, Factual , Homicide , Models, Theoretical , Violence , Humans , United States
4.
BMC Med Inform Decis Mak ; 24(Suppl 2): 114, 2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38689287

ABSTRACT

BACKGROUND: Traditional literature-based discovery connects knowledge pairs extracted from separate publications via a common midpoint to derive previously unseen knowledge pairs. To avoid the overgeneration often associated with this approach, we explore an alternative method based on word evolution. Word evolution examines the changing contexts of a word to identify changes in its meaning or associations. We investigate the possibility of using changing word contexts to detect drugs suitable for repurposing. RESULTS: Word embeddings, which represent a word's context, are constructed from chronologically ordered publications in MEDLINE at bi-monthly intervals, yielding a time series of word embeddings for each word. Focusing on clinical drugs only, any drugs repurposed in the final time segment of the time series are annotated as positive examples. The decision regarding a drug's repurposing is based either on the Unified Medical Language System (UMLS) or on semantic triples extracted from MEDLINE using SemRep. CONCLUSIONS: The annotated data allows deep learning classification, with 5-fold cross-validation, to be performed and multiple architectures to be explored. Performance of 65% using UMLS labels and 81% using SemRep labels is attained, indicating the technique's suitability for detecting candidate drugs for repurposing. The investigation also shows that different architectures are linked to the quantities of training data available, and therefore that a different model should be trained for every annotation approach.
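One simple way to set up the classification step described here is to represent each drug by its sequence of bi-monthly embedding vectors (already aligned into a common space) and fit a classifier on the repurposed / not-repurposed labels. The sketch below uses toy data and a logistic regression as a stand-in for the paper's deep architectures:

```python
# Hypothetical sketch of the classification setup, with random stand-in data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
n_drugs, n_slices, dim = 300, 10, 50
trajectories = rng.normal(size=(n_drugs, n_slices, dim))   # per-slice drug vectors (toy)
labels = rng.integers(0, 2, size=n_drugs)                  # 1 = later repurposed (toy)

X = trajectories.reshape(n_drugs, -1)                      # flatten each time series
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5).mean())        # 5-fold cross-validation
```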


Subject(s)
Drug Repositioning , Humans , Unified Medical Language System , MEDLINE , Deep Learning , Natural Language Processing , Semantics
5.
Disasters ; 48 Suppl 1: e12631, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38860638

ABSTRACT

Smooth interaction with a disaster-affected community can create and strengthen its social capital, leading to greater effectiveness in the provision of successful post-disaster recovery aid. Understanding the relationship between the types of interaction, the strength of social capital generated, and the provision of successful post-disaster recovery aid requires intricate ethnographic qualitative research, but such research is likely to remain illustrative because it rests, at least to some degree, on the researcher's intuition. This paper therefore offers an innovative research method employing a quantitative artificial intelligence (AI)-based language model, which allows researchers to re-examine data, thereby validating the findings of the qualitative research, and to glean additional insights that might otherwise have been missed. This paper argues that well-connected personnel and religiously based communal activities help to enhance social capital by bonding within a community and linking to outside agencies, and that mixed methods based on the AI-based language model effectively strengthen text-based qualitative research.


Subject(s)
Artificial Intelligence , Disasters , Social Capital , Humans , Indonesia , Qualitative Research , Relief Work/organization & administration , Language
6.
Behav Res Methods ; 56(2): 952-967, 2024 Feb.
Article in English | MEDLINE | ID: mdl-36897503

ABSTRACT

Recent approaches to text analysis from social media and other corpora rely on word lists to detect topics, measure meaning, or select relevant documents. These lists are often generated by applying computational lexicon expansion methods to small, manually curated sets of seed words. Despite the wide use of this approach, we still lack an exhaustive comparative analysis of the performance of lexicon expansion methods and of how they can be improved with additional linguistic data. In this work, we present LEXpander, a method for lexicon expansion that leverages novel data on colexification, i.e., semantic networks connecting words with multiple meanings according to shared senses. We evaluate LEXpander in a benchmark including widely used methods for lexicon expansion based on word embedding models and synonym networks. We find that LEXpander outperforms existing approaches in terms of both precision and the trade-off between precision and recall of the generated word lists in a variety of tests. Our benchmark includes several linguistic categories, such as words relating to the financial domain or to the concept of friendship, as well as sentiment variables in English and German. We also show that the expanded word lists constitute a high-performing text analysis method when applied to various English corpora. In this way, LEXpander offers a systematic, automated solution for expanding short lists of words into exhaustive and accurate word lists that closely approximate word lists generated by experts in psychology and linguistics.
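For orientation, the embedding-based baseline that such benchmarks typically include (not LEXpander's colexification method) can be sketched as ranking the vocabulary by cosine similarity to the centroid of the seed list; the vectors below are toy stand-ins:

```python
# Hypothetical sketch of embedding-based lexicon expansion from seed words.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["money", "bank", "loan", "friend", "trust", "tree", "river"]
emb = {w: rng.normal(size=50) for w in vocab}           # placeholder embeddings

def expand(seeds, emb, k=3):
    centroid = np.mean([emb[w] for w in seeds], axis=0)
    centroid /= np.linalg.norm(centroid)
    scored = [(w, float(v @ centroid / np.linalg.norm(v)))
              for w, v in emb.items() if w not in seeds]
    return sorted(scored, key=lambda t: -t[1])[:k]

print(expand(["money", "loan"], emb))    # candidate additions to a 'finance' lexicon
```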


Subject(s)
Linguistics , Social Media , Humans , Semantics
7.
Behav Res Methods ; 2024 Aug 15.
Article in English | MEDLINE | ID: mdl-39147946

ABSTRACT

We introduce a novel dataset of affective, semantic, and descriptive norms for all facial emojis at the point of data collection. We gathered and examined subjective ratings of emojis from 138 German speakers along five essential dimensions: valence, arousal, familiarity, clarity, and visual complexity. Additionally, we provide absolute frequency counts of emoji use, drawn from an extensive Twitter corpus, as well as a much smaller WhatsApp database. Our results replicate the well-established quadratic relationship between arousal and valence of lexical items, also known for words. We also report associations among the variables: for example, the subjective familiarity of an emoji is strongly correlated with its usage frequency, and positively associated with its emotional valence and clarity of meaning. We establish the meanings associated with face emojis, by asking participants for up to three descriptions for each emoji. Using this linguistic data, we computed vector embeddings for each emoji, enabling an exploration of their distribution within the semantic space. Our description-based emoji vector embeddings not only capture typical meaning components of emojis, such as their valence, but also surpass simple definitions and direct emoji2vec models in reflecting the semantic relationship between emojis and words. Our dataset stands out due to its robust reliability and validity. This new semantic norm for face emojis impacts the future design of highly controlled experiments focused on the cognitive processing of emojis, their lexical representation, and their linguistic properties.
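The description-based emoji embeddings mentioned here can be approximated by averaging the word vectors of the free-text descriptions collected for each emoji; a minimal sketch with invented descriptions and random stand-in word vectors (the real study uses a pretrained German embedding model):

```python
# Hypothetical sketch: emoji vector = mean of the word vectors in its descriptions.
import numpy as np

rng = np.random.default_rng(13)
word_vec = {w: rng.normal(size=100) for w in
            ["froehlich", "lachen", "freude", "traurig", "weinen"]}

descriptions = {                                # toy free-text descriptions per emoji
    "grinning_face": [["froehlich", "lachen"], ["freude"]],
    "crying_face": [["traurig", "weinen"], ["traurig"]],
}

def emoji_vector(desc_lists):
    tokens = [t for desc in desc_lists for t in desc]
    return np.mean([word_vec[t] for t in tokens], axis=0)

emoji_emb = {e: emoji_vector(d) for e, d in descriptions.items()}
print({e: v.shape for e, v in emoji_emb.items()})
```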

8.
Behav Res Methods ; 56(6): 5622-5646, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38114881

ABSTRACT

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, achieving an optimal equilibrium between textual representations of the language and our embodied experiences remains an open problem. Some common concerns are the following. Is visual grounding advantageous for abstract words, or is its effectiveness restricted to concrete words? What is the optimal way of bridging the gap between text and vision? To what extent is perceptual knowledge from images advantageous for acquiring high-quality embeddings? Leveraging the current advances in machine learning and natural language processing, the present study addresses these questions by proposing a simple yet very effective computational grounding model for pre-trained word embeddings. Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information while simultaneously preserving the distributional statistics that characterize word usage in text corpora. By applying a learned alignment, we are able to indirectly ground unseen words including abstract words. A series of evaluations on a range of behavioral datasets shows that visual grounding is beneficial not only for concrete words but also for abstract words, lending support to the indirect theory of abstract concepts. Moreover, our approach offers advantages for contextualized embeddings, such as those generated by BERT (Devlin et al., 2018), but only when trained on corpora of modest, cognitively plausible sizes. Code and grounded embeddings for English are available at ( https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2 ).
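One simple instance of a "learned alignment" of this kind (not necessarily the paper's exact model) is a regularized linear map from text vectors of words that have images to their image-derived vectors, which can then be applied to unseen, including abstract, words. A sketch with random placeholder vectors:

```python
# Hypothetical sketch: least-squares text-to-vision alignment, then grounding
# of an unseen word by blending its text vector with its predicted visual vector.
import numpy as np

rng = np.random.default_rng(3)
d_text, d_img, n_seen = 300, 512, 1000
T_seen = rng.normal(size=(n_seen, d_text))     # text vectors of imaged words (toy)
V_seen = rng.normal(size=(n_seen, d_img))      # matching visual vectors (toy)

lam = 1.0                                      # ridge-regularized solution of V ≈ T @ W
W = np.linalg.solve(T_seen.T @ T_seen + lam * np.eye(d_text), T_seen.T @ V_seen)

def ground(text_vec, alpha=0.5):
    """Blend the original text vector with its predicted visual counterpart."""
    visual = text_vec @ W
    return np.concatenate([alpha * text_vec, (1 - alpha) * visual])

abstract_word_vec = rng.normal(size=d_text)    # e.g., an unseen abstract word
print(ground(abstract_word_vec).shape)         # grounded embedding
```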


Subject(s)
Language , Natural Language Processing , Humans , Machine Learning , Visual Perception/physiology
9.
Sensors (Basel) ; 23(18)2023 Sep 16.
Article in English | MEDLINE | ID: mdl-37765992

ABSTRACT

Access Control Policies (ACPs) are essential for ensuring secure and authorized access to resources in IoT networks. Recognizing these policies involves identifying relevant statements within project documents expressed in natural language. While current research focuses on improving recognition accuracy through algorithm enhancements, the challenge of limited labeled data from individual clients is often overlooked, which impedes the training of highly accurate models. To address this issue and harness the potential of IoT networks, this paper presents FL-Bert-BiLSTM, a novel model that combines federated learning and pre-trained word embedding techniques for access control policy recognition. By leveraging the capabilities of IoT networks, the proposed model enables real-time and distributed training on IoT devices, effectively mitigating the scarcity of labeled data and enhancing accessibility for IoT applications. Additionally, the model incorporates pre-trained word embeddings to leverage the semantic information embedded in textual data, resulting in improved accuracy for access control policy recognition. Experimental results substantiate that the proposed model not only enhances accuracy and generalization capability but also preserves data privacy, making it well-suited for secure and efficient access control in IoT networks.
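The federated part of such a model typically reduces to clients training locally on their own labelled policy sentences and a server averaging the resulting weights (FedAvg); a minimal sketch with plain arrays standing in for the BERT-BiLSTM parameters:

```python
# Hypothetical sketch of the federated-averaging step across IoT clients.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client parameter dicts (FedAvg)."""
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
            for k in keys}

rng = np.random.default_rng(4)
clients = [{"dense.w": rng.normal(size=(8, 2))} for _ in range(3)]  # toy local updates
sizes = [120, 80, 200]                                              # local dataset sizes
global_weights = fed_avg(clients, sizes)
print(global_weights["dense.w"].shape)
```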

10.
BMC Bioinformatics ; 22(Suppl 14): 631, 2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36384559

ABSTRACT

BACKGROUND: Decisions in healthcare usually rely on the goodness and completeness of data that could be coupled with heuristics to improve the decision process itself. However, this is often an incomplete process. Structured interviews denominated Delphi surveys investigate experts' opinions and solve by consensus complex matters like those underlying surgical decision-making. Natural Language Processing (NLP) is a field of study that combines computer science, artificial intelligence, and linguistics. NLP can then be used as a valuable help in building a correct context in surgical data, contributing to the amelioration of surgical decision-making. RESULTS: We applied NLP coupled with machine learning approaches to predict the context (words) owning high accuracy from the words nearest to Delphi surveys, used as input. CONCLUSIONS: The proposed methodology has increased the usefulness of Delphi surveys favoring the extraction of keywords that can represent a specific clinical context. It permits the characterization of the clinical context suggesting words for the evaluation process of the data.


Subject(s)
Artificial Intelligence , Breast Neoplasms , Humans , Female , Breast Neoplasms/surgery , Natural Language Processing , Machine Learning
11.
J Biomed Inform ; 125: 103971, 2022 01.
Article in English | MEDLINE | ID: mdl-34920127

ABSTRACT

OBJECTIVE: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings. MATERIALS AND METHODS: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively. RESULTS: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%) de-identified notes. Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS). DISCUSSION: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on intended use. CONCLUSION: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
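The scaled Brier score used for the mortality task is commonly defined as one minus the model's Brier score divided by the Brier score of a reference that always predicts the outcome prevalence; a short sketch under that assumption:

```python
# Hypothetical sketch of the scaled Brier score (SBS) under the common definition.
import numpy as np

def scaled_brier(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    brier = np.mean((y_prob - y_true) ** 2)
    prevalence = y_true.mean()
    brier_ref = np.mean((prevalence - y_true) ** 2)   # "always predict prevalence"
    return 1.0 - brier / brier_ref

y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9]               # e.g., predicted mortality risks
print(round(scaled_brier(y_true, y_prob), 3))
```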


Subject(s)
Electronic Health Records , Publications , Humans , Natural Language Processing , PubMed , Reproducibility of Results
12.
J Biomed Inform ; 126: 103998, 2022 02.
Article in English | MEDLINE | ID: mdl-35063668

ABSTRACT

Formal thought disorder (ThD) is a clinical sign of schizophrenia amongst other serious mental health conditions. ThD can be recognized by observing incoherent speech - speech in which it is difficult to perceive connections between successive utterances and which lacks a clear global theme. Automated assessment of the coherence of speech in patients with schizophrenia has been an active area of research for over a decade, in an effort to develop an objective and reliable instrument through which to quantify ThD. However, this work has largely been conducted in controlled settings using structured interviews and has depended upon manual transcription services to render audio recordings amenable to computational analysis. In this paper, we present an evaluation of such automated methods in the context of a fully automated system using Automated Speech Recognition (ASR) in place of a manual transcription service, with "audio diaries" collected in naturalistic settings from participants experiencing Auditory Verbal Hallucinations (AVH). We show that performance lost due to ASR errors can often be restored through the application of Time-Series Augmented Representations for Detection of Incoherent Speech (TARDIS), a novel approach that treats the sequence of coherence scores from a transcript as a time series, providing features for machine learning. With ASR, TARDIS improves average AUC across coherence metrics for detection of severe ThD by 0.09; average correlation with human-labeled derailment scores by 0.10; and average correlation between coherence estimates from manual and ASR-derived transcripts by 0.29. In addition, TARDIS improves the agreement between coherence estimates from manual transcripts and human judgment and the correlation with self-reported estimates of AVH symptom severity. As such, TARDIS eliminates a fundamental barrier to the deployment of automated methods to detect linguistic indicators of ThD to monitor and improve clinical care in serious mental illness.
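The underlying idea, a per-transcript coherence time series summarized into classifier features, can be sketched as the cosine similarity between embeddings of successive utterances followed by a few summary statistics; the utterance vectors are toy stand-ins and the real TARDIS feature set is richer than this:

```python
# Hypothetical sketch: coherence as a time series of adjacent-utterance cosines,
# summarized into simple features for a downstream classifier.
import numpy as np

def coherence_series(utterance_vectors):
    v = utterance_vectors / np.linalg.norm(utterance_vectors, axis=1, keepdims=True)
    return np.sum(v[:-1] * v[1:], axis=1)             # cosine of adjacent utterances

def ts_features(series):
    return {
        "mean": float(series.mean()),
        "min": float(series.min()),
        "std": float(series.std()),
        "frac_low": float((series < 0.2).mean()),      # share of low-coherence steps
    }

rng = np.random.default_rng(5)
utts = rng.normal(size=(12, 300))                      # 12 utterances from one diary (toy)
print(ts_features(coherence_series(utts)))
```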


Subject(s)
Schizophrenia , Speech , Hallucinations , Humans , Linguistics , Machine Learning
13.
Proc Natl Acad Sci U S A ; 116(10): 4176-4181, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30770443

ABSTRACT

By middle childhood, humans are able to learn abstract semantic relations (e.g., antonym, synonym, category membership) and use them to reason by analogy. A deep theoretical challenge is to show how such abstract relations can arise from nonrelational inputs, thereby providing key elements of a protosymbolic representation system. We have developed a computational model that exploits the potential synergy between deep learning from "big data" (to create semantic features for individual words) and supervised learning from "small data" (to create representations of semantic relations between words). Given as inputs labeled pairs of lexical representations extracted by deep learning, the model creates augmented representations by remapping features according to the rank of differences between values for the two words in each pair. These augmented representations aid in coping with the feature alignment problem (e.g., matching those features that make "love-hate" an antonym with the different features that make "rich-poor" an antonym). The model extracts weight distributions that are used to estimate the probabilities that new word pairs instantiate each relation, capturing the pattern of human typicality judgments for a broad range of abstract semantic relations. A measure of relational similarity can be derived and used to solve simple verbal analogies with human-level accuracy. Because each acquired relation has a modular representation, basic symbolic operations are enabled (notably, the converse of any learned relation can be formed without additional training). Abstract semantic relations can be induced by bootstrapping from nonrelational inputs, thereby enabling relational generalization and analogical reasoning.

14.
Proc Natl Acad Sci U S A ; 116(13): 5862-5871, 2019 03 26.
Article in English | MEDLINE | ID: mdl-30833402

ABSTRACT

Intergroup attitudes (evaluations) are generalized valence attributions to social groups (e.g., white-bad/Asian-good), whereas intergroup beliefs (stereotypes) are specific trait attributions to social groups (e.g., white-dumb/Asian-smart). When explicit (self-report) measures are used, attitudes toward and beliefs about the same social group are often related to each other but can also be dissociated. The present work used three approaches (correlational, experimental, and archival) to conduct a systematic investigation of the relationship between implicit (indirectly revealed) intergroup attitudes and beliefs. In study 1 (n = 1,942), we found significant correlations and, in some cases, evidence for redundancy, between Implicit Association Tests (IATs) measuring attitudes toward and beliefs about the same social groups (mean r = 0.31, 95% confidence interval: [0.24; 0.39]). In study 2 (n = 383), manipulating attitudes via evaluative conditioning produced parallel changes in belief IATs, demonstrating that implicit attitudes can causally drive implicit beliefs when information about the specific semantic trait is absent. In study 3, we used word embeddings derived from a large corpus of online text to show that the relative distance of 22 social groups from positive vs. negative words (reflecting generalized attitudes) was highly correlated with their distance from warm vs. cold, and even competent vs. incompetent, words (reflecting specific beliefs). Overall, these studies provide convergent evidence for tight connections between implicit attitudes and beliefs, suggesting that the dissociations observed using explicit measures may arise uniquely from deliberate judgment processes.
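The study-3 analysis, relative distance of group terms from positive vs. negative words (attitudes) and from warm vs. cold words (beliefs), correlated across groups, can be sketched as follows; all vectors and group names below are random placeholders, not the corpus embeddings used in the paper:

```python
# Hypothetical sketch: correlate embedding-based attitude and belief associations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(6)
emb = lambda: rng.normal(size=100)
groups = {g: emb() for g in ["teachers", "lawyers", "nurses", "bankers"]}
pos, neg = [emb() for _ in range(5)], [emb() for _ in range(5)]
warm, cold = [emb() for _ in range(5)], [emb() for _ in range(5)]

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
def rel_assoc(v, a_words, b_words):
    """Mean similarity to the first word set minus mean similarity to the second."""
    return np.mean([cos(v, w) for w in a_words]) - np.mean([cos(v, w) for w in b_words])

attitude = [rel_assoc(v, pos, neg) for v in groups.values()]
belief = [rel_assoc(v, warm, cold) for v in groups.values()]
print(pearsonr(attitude, belief))
```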


Subject(s)
Attitude , Culture , Group Processes , Humans , Psychological Tests , Psychology, Social , Stereotyping
15.
J Med Internet Res ; 24(11): e34067, 2022 11 02.
Article in English | MEDLINE | ID: mdl-36040993

ABSTRACT

BACKGROUND: Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpora, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient. OBJECTIVE: We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Further, imminent themes can be predicted using machine learning on the evolving associations between words. METHODS: Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles published on the World Health Organization database, collected at monthly intervals starting from February 2020. Word embeddings trained on each month's literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month's network were forecasted based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months. RESULTS: We found that thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an area under the receiver operating characteristic curve of 0.87. Predictive modeling revealed predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications based on the patterns observed in previous months. CONCLUSIONS: Machine learning-based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.
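The monthly network construction described here, entities as nodes and cosine similarities as edge weights, can be sketched with toy entity vectors standing in for a given month's embeddings:

```python
# Hypothetical sketch: one month's entity network with cosine-similarity edges.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
entities = ["fever", "cough", "thrombosis", "anosmia", "fatigue"]
vecs = {e: rng.normal(size=200) for e in entities}      # that month's embeddings (toy)

def build_network(vecs, threshold=0.0):
    edges = {}
    for a, b in combinations(vecs, 2):
        va, vb = vecs[a], vecs[b]
        sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        if sim > threshold:
            edges[(a, b)] = sim                         # edge weight = cosine similarity
    return edges

month_net = build_network(vecs, threshold=0.0)
print(len(month_net), "edges above threshold")
```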


Subject(s)
COVID-19 , Humans , Machine Learning , Semantics , Supervised Machine Learning , Post-Acute COVID-19 Syndrome
16.
Sociol Methods Res ; 51(4): 1484-1539, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37974911

ABSTRACT

Public culture is a powerful source of cognitive socialization; for example, media language is full of meanings about body weight. Yet it remains unclear how individuals process meanings in public culture. We suggest that schema learning is a core mechanism by which public culture becomes personal culture. We propose that a burgeoning approach in computational text analysis - neural word embeddings - can be interpreted as a formal model for cultural learning. Embeddings allow us to empirically model schema learning and activation from natural language data. We illustrate our approach by extracting four lower-order schemas from news articles: the gender, moral, health, and class meanings of body weight. Using these lower-order schemas we quantify how words about body weight "fill in the blanks" about gender, morality, health, and class. Our findings reinforce ongoing concerns that machine-learning models (e.g., of natural language) can encode and reproduce harmful human biases.
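The "fill in the blanks" operation can be illustrated by building a schema dimension from paired contrast words (e.g., woman-man) and projecting body-weight words onto it; the embeddings below are random placeholders rather than vectors trained on news articles:

```python
# Hypothetical sketch: project body-weight words onto a gender dimension.
import numpy as np

rng = np.random.default_rng(8)
emb = {w: rng.normal(size=100) for w in
       ["woman", "man", "she", "he", "thin", "curvy", "overweight", "slender"]}

pairs = [("woman", "man"), ("she", "he")]
gender_axis = np.mean([emb[a] - emb[b] for a, b in pairs], axis=0)
gender_axis /= np.linalg.norm(gender_axis)

for word in ["thin", "curvy", "overweight", "slender"]:
    v = emb[word] / np.linalg.norm(emb[word])
    print(word, round(float(v @ gender_axis), 3))   # >0 leans toward the 'feminine' pole (toy)
```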

17.
BMC Med Inform Decis Mak ; 22(1): 83, 2022 03 29.
Article in English | MEDLINE | ID: mdl-35351120

ABSTRACT

BACKGROUND: Analyzing the unstructured textual data contained in electronic health records (EHRs) has always been a challenging task. Word embedding methods have become an essential foundation for neural network-based approaches in natural language processing (NLP), learning dense and low-dimensional word representations from large unlabeled corpora that capture the implicit semantics of words. Models like Word2Vec, GloVe or FastText have been broadly applied and reviewed in the bioinformatics and healthcare fields, most often to embed clinical notes or activity and diagnostic codes. Visualization of the learned embeddings has been used in a subset of these works, whether for exploratory or evaluation purposes. However, visualization practices tend to be heterogeneous and lack overall guidelines. OBJECTIVE: This scoping review aims to describe the methods and strategies used to visualize medical concepts represented using word embedding methods. We aim to understand the objectives of the visualizations and their limits. METHODS: This scoping review summarizes different methods used to visualize word embeddings in healthcare. We followed the methodology proposed by Arksey and O'Malley (Int J Soc Res Methodol 8:19-32, 2005) and by Levac et al. (Implement Sci 5:69, 2010) to better analyze the data and provide a synthesis of the literature on the matter. RESULTS: We first obtained 471 unique articles from a search conducted in the PubMed, MedRxiv and arXiv databases. Of these, 30 were reviewed in full, based on our inclusion and exclusion criteria. 23 articles were excluded at the full-review stage, resulting in the analysis of 7 papers that fully correspond to our inclusion criteria. Included papers pursued a variety of objectives and used distinct methods to evaluate their embeddings and to visualize them. Visualization also served heterogeneous purposes, being used alternately to explore the embeddings, to evaluate them, or merely to illustrate properties otherwise formally assessed. CONCLUSIONS: Visualization helps to explore embedding results (through further dimensionality reduction or synthetic representation). However, it neither exhausts the information conveyed by the embeddings nor constitutes a self-sufficient method for evaluating their pertinence.
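A typical visualization of the kind surveyed here reduces concept embeddings to two dimensions and scatter-plots them; a minimal sketch with random stand-in vectors (t-SNE or UMAP are common alternatives to PCA):

```python
# Hypothetical sketch: 2-D PCA projection of medical-concept embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
concepts = ["diabetes", "insulin", "metformin", "asthma", "salbutamol"]
vecs = rng.normal(size=(len(concepts), 200))        # placeholder concept embeddings

xy = PCA(n_components=2).fit_transform(vecs)
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), name in zip(xy, concepts):
    plt.annotate(name, (x, y))
plt.title("2-D projection of concept embeddings")
plt.savefig("embedding_map.png")
```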


Subject(s)
Natural Language Processing , Semantics , Databases, Factual , Electronic Health Records , Humans , PubMed
18.
Sensors (Basel) ; 22(3)2022 Jan 29.
Article in English | MEDLINE | ID: mdl-35161808

ABSTRACT

Short text representation is one of the basic and key tasks of NLP. The traditional method simply merges the bag-of-words model and the topic model, which may lead to ambiguity in semantic information and leave topic information sparse. We propose an unsupervised text representation method that fuses word embeddings with extended topic information. Two fusion strategies for weighted word embeddings and extended topic information are designed: static linear fusion and dynamic fusion. This method can highlight important semantic information, flexibly fuse topic information, and improve the capability of short text representation. We use classification and prediction tasks to verify the effectiveness of the method. The test results show that the method is effective.
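Static linear fusion, in its simplest reading, combines a text's averaged word embeddings with its topic vector using fixed weights; a sketch under that assumption, with toy vectors and a topic vector assumed to be mapped into the same space:

```python
# Hypothetical sketch of static linear fusion of word embeddings and topic information.
import numpy as np

rng = np.random.default_rng(10)
dim = 100
word_vecs = {w: rng.normal(size=dim) for w in ["cheap", "flights", "to", "rome"]}
topic_vec = rng.normal(size=dim)                 # topic information in the same space (toy)

def fuse_static(tokens, topic_vec, alpha=0.7):
    word_part = np.mean([word_vecs[t] for t in tokens], axis=0)
    return alpha * word_part + (1 - alpha) * topic_vec   # fixed-weight combination

print(fuse_static(["cheap", "flights", "to", "rome"], topic_vec).shape)
```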

19.
Behav Res Methods ; 54(6): 3015-3042, 2022 12.
Article in English | MEDLINE | ID: mdl-35167112

ABSTRACT

Age of acquisition (AoA) is a measure of word complexity which refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that allow for measurement error and increase the cost and effort of production. In this paper, we introduce Age of Exposure (AoE) version 2, a proxy for human exposure to new vocabulary terms that expands AoA word lists by training regressors to predict AoA scores. Word2vec embeddings are trained on cumulatively increasing corpora of texts, word exposure trajectories are generated by aligning the word2vec vector spaces, and features of words are derived for modeling AoA scores. Our prediction models achieve low errors (from 13%, with a corresponding R2 of .35, down to 7% with an R2 of .74), can be uniformly applied to different AoA word lists, and generalize to the entire vocabulary of a language. Our method benefits from using existing readability indices to define the order of texts in the corpora, and our analyses confirm that the generated AoA scores accurately predict text difficulty (R2 of .84, surpassing related previous work). Further, we provide evidence of the internal reliability of our word trajectory features, demonstrate their effectiveness when contrasted with simple lexical features, and show that excluding features that rely on external resources does not significantly impact performance.
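The alignment step that makes word trajectories comparable across training stages is commonly done with orthogonal Procrustes on the shared vocabulary; a minimal sketch of that idea with random stand-in embeddings (the real pipeline uses word2vec models trained on cumulatively larger corpora):

```python
# Hypothetical sketch: align a later embedding space to a reference space, then
# read off a word's trajectory as features for AoA regression.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(11)
shared_vocab = [f"w{i}" for i in range(500)]
stage1 = {w: rng.normal(size=100) for w in shared_vocab}   # early-corpus vectors (toy)
stage2 = {w: rng.normal(size=100) for w in shared_vocab}   # later-corpus vectors (toy)

A = np.stack([stage2[w] for w in shared_vocab])   # space to rotate
B = np.stack([stage1[w] for w in shared_vocab])   # reference space
R, _ = orthogonal_procrustes(A, B)                # rotation minimizing ||A @ R - B||

def trajectory(word):
    """The word's vector at each stage, expressed in the reference space."""
    return np.stack([stage1[word], stage2[word] @ R])

print(trajectory("w42").shape)   # (n_stages, dim): input features for the AoA regressor
```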


Subject(s)
Language , Vocabulary , Child , Humans , Reproducibility of Results
20.
BMC Bioinformatics ; 22(Suppl 1): 599, 2021 Dec 17.
Article in English | MEDLINE | ID: mdl-34920708

ABSTRACT

BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving access to and integration of information from unstructured data such as the biomedical literature. METHODS: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge. RESULTS: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. The supervised model uses the training set to find previously seen concepts, and the unsupervised model is based on a 6-step architecture that uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. CONCLUSION: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature: we achieve 91.41% precision, 90.14% recall, and a 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves a 92.67% F1-score, with 92.44% recall and 92.91% precision. With these results, we would place first in the challenge's final ranking.
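The dictionary-plus-Levenshtein step of the indexing pipeline can be sketched as picking the code whose synonym has the smallest edit distance to the recognized mention; the codes and synonyms below are invented for illustration, not real SNOMED-CT entries:

```python
# Hypothetical sketch: assign a code by minimum Levenshtein distance over a synonym dictionary.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

synonyms = {                        # toy synonym dictionary: code -> synonyms
    "111111111": ["paracetamol", "acetaminophen"],
    "222222222": ["warfarin", "warfarin sodium"],
}

def assign_code(mention):
    mention = mention.lower()
    return min(((code, min(levenshtein(mention, s) for s in syns))
                for code, syns in synonyms.items()), key=lambda t: t[1])

print(assign_code("paracetamool"))   # -> ('111111111', 1)
```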


Subject(s)
Information Storage and Retrieval , Medical Informatics , Pharmaceutical Preparations , Medical Informatics/methods , Semantics , Unified Medical Language System