Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors.

Sahoo, Bikram; Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray; Zelikovsky, Alexander

Affiliation

Sahoo B; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA.
Ali S; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA.
Chen PY; IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA.
Patterson M; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA.
Zelikovsky A; Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA.

Biomolecules; 13(6), 2023 Jun 02.

Article in En | MEDLINE | ID: mdl-37371514

ABSTRACT

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and unreliable downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate the error profiles of long-read sequencing platforms, and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and that the spaced k-mers and weighted k-mers embedding methods are highly accurate in classifying error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers seeking to evaluate ML models effectively and to better understand approaches for the accurate identification of critical SARS-CoV-2 genome sequences.
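The two ideas the abstract relies on, a fixed-length k-mer embedding and the introduction of random errors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmer_spectrum` and `add_random_substitutions`, the choice of k = 3, the substitution-only error model, and the toy sequence are all assumptions made for illustration, and the spaced and weighted k-mer variants evaluated in the study are not reproduced here.

```python
import random
from itertools import product

ALPHABET = "ACGT"

def kmer_spectrum(seq, k=3):
    """Count each length-k substring over ACGT (hypothetical helper).

    The result always has 4**k entries, so sequences of different
    lengths map to vectors of the same fixed length.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            vec[j] += 1
    return vec

def add_random_substitutions(seq, rate, seed=0):
    """Replace each base with a different base with probability `rate`.

    Assumption: substitutions only. Real long-read error profiles also
    include insertions and deletions, which this sketch ignores.
    """
    rng = random.Random(seed)
    out = []
    for base in seq:
        if base in ALPHABET and rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Toy sequence, not real SARS-CoV-2 data.
clean = "ACGTACGTTTGACCA"
noisy = add_random_substitutions(clean, rate=0.1, seed=42)
clean_vec = kmer_spectrum(clean)
noisy_vec = kmer_spectrum(noisy)
```

Both vectors have 64 entries regardless of sequence length, so either one could be passed to a standard classifier; comparing a model's accuracy on clean versus noisy vectors is one simple way to probe the kind of robustness the abstract describes.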

Subjects

COVID-19; SARS-CoV-2; Humans; SARS-CoV-2/genetics; Sequence Analysis, DNA/methods; Pandemics; High-Throughput Nucleotide Sequencing/methods; Algorithms; Machine Learning

Keywords

classification; embedding methods; long read; machine learning; sequencing error; third-generation single-molecule sequencing (TGS)

Full text: 1 | Database: MEDLINE | Main subject: SARS-CoV-2 / COVID-19 | Language: En | Publication year: 2023 | Document type: Article