Learning supervised embeddings for large scale sequence comparisons.

Kimothi, Dhananjay; Biyani, Pravesh; Hogan, James M; Soni, Akshay; Kelly, Wayne

Affiliations

Kimothi D; Department of ECE, Indraprastha Institute of Information Technology-Delhi, New Delhi, India.
Biyani P; School of Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia.
Hogan JM; Department of ECE, Indraprastha Institute of Information Technology-Delhi, New Delhi, India.
Soni A; School of Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia.
Kelly W; Microsoft, Sunnyvale, California, United States of America.

PLoS One ; 15(3): e0216636, 2020.

Article in English | MEDLINE | ID: mdl-32168338

ABSTRACT


Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives that exploit the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches, SuperVec and SuperVecX, to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose a hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods outperform existing (unsupervised) RepL-based approaches on retrieval and classification tasks. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.
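The hybrid filtering idea in the abstract can be illustrated with a minimal sketch. It is not the SuperVec or SuperVecX method: plain k-mer count vectors stand in for the learned embeddings, and the names `kmer_vector`, `cosine`, `prefilter`, and the toy `database` are hypothetical, chosen only for illustration. The sketch ranks database records by cosine similarity to the query and keeps the top few, which a slower alignment-based tool could then examine.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers. Assumption: this simple count vector
    stands in for a learned sequence embedding."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prefilter(query, database, top_n=2, k=3):
    """Return ids of the top_n records most similar to the query.
    Only these candidates would be passed to a slower, more accurate method."""
    q = kmer_vector(query, k)
    scored = [(cosine(q, kmer_vector(seq, k)), seq_id)
              for seq_id, seq in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seq_id for _, seq_id in scored[:top_n]]


# Toy collection (hypothetical data, for illustration only).
database = {
    "seqA": "ACGTACGTACGT",
    "seqB": "ACGTACGAACGT",
    "seqC": "TTTTGGGGCCCC",
}
candidates = prefilter("ACGTACGTACGA", database, top_n=2)
print(candidates)
```

In this toy run the unrelated record `seqC` shares no 3-mers with the query and is filtered out, so a subsequent alignment step would only need to process the two remaining candidates.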

Subjects

Algorithms; Machine Learning; Sequence Homology; Area Under Curve; Databases, Factual; Time Factors

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / Sequence Homology / Machine Learning Type of study: Prognostic_studies Language: En Journal: PLoS One Year of publication: 2020 Document type: Article
