MetaMLP: A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples.
Arango-Argoty, Gustavo A; Heath, Lenwood S; Pruden, Amy; Vikesland, Peter J; Zhang, Liqing.
Affiliation
  • Arango-Argoty GA; Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA.
  • Heath LS; Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA.
  • Pruden A; Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, Virginia, USA.
  • Vikesland PJ; Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, Virginia, USA.
  • Zhang L; Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA.
J Comput Biol; 28(11): 1063-1074, 2021 Nov.
Article in En | MEDLINE | ID: mdl-34665648
Functional profiling of metagenomic samples improves understanding of microbial populations in the environment. Such profiling consists of assigning short sequencing reads to functional categories. Functional assignment normally relies on manually curated databases in which genes are arranged into classes. Sequence alignment has been widely used to profile metagenomic samples against curated databases, but it is time-consuming and computationally demanding. Although several alignment-free methods based on k-mer composition have been developed in recent years, they still require large amounts of main memory. This article presents MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple neural network with one hidden layer to profile functional categories. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet to build sequence embeddings from full and partial k-mers. MetaMLP identifies slightly more reads than DIAMOND (one of the fastest sequence alignment methods) and makes accurate predictions, with 0.99 precision and 0.99 recall. MetaMLP can process 100M reads in ∼10 minutes on a laptop computer, which is 50 times faster than DIAMOND.
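
The idea in the abstract (reduced alphabet, full and partial k-mers, embeddings, one hidden layer) can be illustrated with a minimal sketch. This is not the MetaMLP implementation. Everything specific here is an assumption made for illustration: that the input is a protein sequence, the particular grouping of amino acids, the reading of "partial k-mers" as shorter substrings, the hashing of k-mers into a fixed-size embedding table, the table and class sizes, and all function names. The weights are random and untrained, so the output probabilities carry no meaning; the sketch only shows the data flow.

```python
import math
import random
import zlib

# Hypothetical reduced amino-acid alphabet: residues grouped by broad
# physico-chemical class. The grouping is illustrative, not the paper's.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # small / special
}
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group representative ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def kmers(seq, k=3, min_k=2):
    """Return shorter 'partial' k-mers (lengths min_k..k-1) and full k-mers."""
    return [seq[i:i + n]
            for n in range(min_k, k + 1)
            for i in range(len(seq) - n + 1)]


def make_table(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(seq, table):
    """Average the hashed k-mer vectors; the average acts as the hidden layer."""
    dim = len(table[0])
    tokens = kmers(reduce_sequence(seq))
    hidden = [0.0] * dim
    for token in tokens:
        # crc32 gives a hash that is stable across runs, unlike hash(str).
        row = table[zlib.crc32(token.encode()) % len(table)]
        for j in range(dim):
            hidden[j] += row[j]
    count = max(len(tokens), 1)
    return [h / count for h in hidden]


def predict(seq, table, out_weights):
    """Linear output layer followed by a softmax over the classes."""
    hidden = embed(seq, table)
    scores = [sum(w * h for w, h in zip(ws, hidden)) for ws in out_weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
table = make_table(1000, 8, rng)     # 1000 hash buckets, 8-dimensional vectors
out_weights = make_table(3, 8, rng)  # 3 hypothetical functional classes
probs = predict("MKVLAAGIVR", table, out_weights)
```

Under this grouping, `reduce_sequence("ILV")` and `reduce_sequence("VMA")` both yield `"AAA"`, so the two sequences receive the same embedding. That is one way a reduced alphabet can let non-identical sequences match.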

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Sequence Alignment / Computational Biology / Metagenomics Language: En Journal: J Comput Biol Journal subject: MOLECULAR BIOLOGY / MEDICAL INFORMATICS Year: 2021 Type: Article Affiliation country: United States