ABSTRACT
Biomedical scientific literature is growing at a very rapid pace, which makes it increasingly difficult for human experts to spot the most relevant results hidden in the papers. Automated information extraction tools based on text mining techniques are therefore needed to assist them in this task. In the last few years, techniques based on deep neural networks have contributed significantly to advancing the state of the art in this research area. Although the contribution of supervised methods to this progress is relatively well known, that of other kinds of learning, namely unsupervised and self-supervised learning, is less so. Unsupervised learning does not incur the cost of creating labels, which is very useful in the exploratory stages of a biomedical study, where agile techniques are needed to rapidly explore many paths. In particular, clustering techniques applied to biomedical text mining make it possible to gather large sets of documents into more manageable groups, and deep learning techniques have made it possible to produce new clustering-friendly representations of the data. Self-supervised learning, on the other hand, is a kind of supervised learning in which the labels do not have to be created manually by humans but are derived automatically from relations found in the input texts. In combination with innovative network architectures (e.g. transformer-based architectures), self-supervised techniques have made it possible to design increasingly effective vector-based word representations (word embeddings). We show in this survey how word representations obtained in this way interact successfully with common supervised modules (e.g. classification networks), to whose performance they contribute greatly.
Subject(s)
Data Mining/methods , Deep Learning , Supervised Machine Learning , Unsupervised Machine Learning , Algorithms , Cluster Analysis , Neural Networks, Computer
ABSTRACT
Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We explore this potential in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies make it possible to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Our objective is therefore to augment GWAS results with unsupervised text mining techniques, namely text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors. We propose a generic framework, which we used to characterize the relationships between 10 genes reported to be associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value < 0.01). Clustering the observed and randomly selected genes also made it possible to generate hypotheses about potential functional relationships between these genes, and thus contributed to the discovery of new candidate genes for asthma.
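The comparison of candidate genes against random genes described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the gene vectors are built or how the p-value is computed, so the code assumes each gene is already represented by a numeric text-derived vector and estimates a one-sided empirical p-value by repeatedly drawing random gene sets of the same size. The function names, the number of draws, and the synthetic toy vectors are all hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def one_sided_p_value(candidates, background, n_draws=200, seed=0):
    """Estimate how often a random gene set is at least as cohesive as the candidates."""
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(candidates)
    hits = 0
    for _ in range(n_draws):
        # Draw a random gene set of the same size as the candidate set.
        sample = rng.sample(background, len(candidates))
        if mean_pairwise_similarity(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_draws + 1)


# Synthetic toy data (an assumption for illustration, not real gene vectors):
# five "candidate" vectors scattered around a shared profile, and 100 sparse
# random "background" vectors standing in for randomly selected genes.
rng = random.Random(42)
dim = 20
base = [rng.random() for _ in range(dim)]
candidates = [[b + rng.uniform(-0.05, 0.05) for b in base] for _ in range(5)]
background = [
    [rng.random() if rng.random() < 0.3 else 0.0 for _ in range(dim)]
    for _ in range(100)
]

observed, p_value = one_sided_p_value(candidates, background)
print(f"mean pairwise cosine = {observed:.3f}, one-sided p = {p_value:.4f}")
```

With real data, the vectors would come from whatever text representation the framework uses, and the background would be the pool of genes from which random sets are drawn. The same similarity function could also feed a clustering step over the combined candidate and random genes.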