Machine learning classification by fitting amplicon sequences to existing OTUs.

Armour, Courtney R; Sovacool, Kelly L; Close, William L; Topçuoglu, Begüm D; Wiens, Jenna; Schloss, Patrick D

Affiliation

Armour CR; Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
Sovacool KL; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.
Close WL; Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
Topçuoglu BD; Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
Wiens J; Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA.
Schloss PD; Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.

mSphere; 8(5): e0033623, 2023 Oct 24.

Article in English | MEDLINE | ID: mdl-37615431

ABSTRACT

The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs), whose composition changes when new data are added. We previously developed a reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether fitting new sequence data into existing OTUs affects the performance of classification models relative to models trained and tested using only de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen-relevant neoplasia (SRN). We compared the performance of this model to standard methods, including de novo and database-reference-based clustering. We found that models built using OptiFit performed as well as or better than these methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.

IMPORTANCE There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign-and-retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
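
The idea of fitting new data to fixed OTUs so that a trained model can be reused may be easier to see in a toy sketch. The Python below is not the OptiFit algorithm and is not taken from the article: it assumes each existing OTU can be stood in for by a single aligned reference sequence, uses a simple mismatch fraction as the distance, and drops sequences that match no OTU. The function names, the `cutoff` parameter, and the example sequences are all hypothetical. What it illustrates is only that the OTU set, and therefore the feature columns a classifier was trained on, stays unchanged when new samples arrive.

```python
from collections import Counter


def mismatch_fraction(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fit_to_existing_otus(new_seqs, otu_refs, cutoff=0.03):
    """Assign each new sequence to its closest existing OTU (toy stand-in, not OptiFit).

    otu_refs maps an OTU label to one representative sequence (an assumption
    made for this sketch). Sequences farther than `cutoff` from every OTU are
    counted under "unassigned"; the existing OTUs are never modified.
    """
    counts = Counter()
    for seq in new_seqs:
        best_otu, best_dist = None, None
        for otu, ref in otu_refs.items():
            dist = mismatch_fraction(seq, ref)
            if best_dist is None or dist < best_dist:
                best_otu, best_dist = otu, dist
        if best_dist is not None and best_dist <= cutoff:
            counts[best_otu] += 1
        else:
            counts["unassigned"] += 1
    return counts


def relative_abundance(counts, otu_order):
    """Feature vector in the fixed OTU order used at training time."""
    total = sum(counts[otu] for otu in otu_order)
    return [counts[otu] / total if total else 0.0 for otu in otu_order]


# Hypothetical existing OTUs and a new sample (10-base toy sequences).
otu_refs = {"Otu1": "ACGTACGTAC", "Otu2": "TTTTGGGGCC"}
new_sample = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "GGGGGGGGGG"]

counts = fit_to_existing_otus(new_sample, otu_refs, cutoff=0.1)
features = relative_abundance(counts, ["Otu1", "Otu2"])
```

Because `features` always has one entry per original OTU in the same order, a classifier trained on the original OTU table could in principle score the new sample without reclustering or retraining, which is the workflow the abstract describes.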

Subjects

Metagenomics; Microbiota; Humans; Sequence Analysis, DNA/methods; RNA, Ribosomal, 16S/genetics; Metagenomics/methods; Algorithms; Microbiota/genetics

Keywords

bioinformatics; diagnostics; machine learning; microbial ecology; microbiome

Full text: 1 Database: MEDLINE Main subject: Metagenomics / Microbiota Language: English Year of publication: 2023 Document type: Article
