Predicting binding affinities of emerging variants of SARS-CoV-2 using spike protein sequencing data: observations, caveats and recommendations.

Zhang, Ruibo; Ghosh, Souparno; Pal, Ranadip

Affiliation

Zhang R; Department of Electrical and Computer Engineering, Texas Tech University, TX, USA.
Ghosh S; Department of Statistics, University of Nebraska - Lincoln, NB, USA.
Pal R; Department of Electrical and Computer Engineering, Texas Tech University, TX, USA.

Brief Bioinform; 23(3), 2022 May 13.

Article in En | MEDLINE | ID: mdl-35437577

ABSTRACT

Predicting protein properties from amino acid sequences is an important problem in biology and pharmacology. Protein-protein interactions among the SARS-CoV-2 spike protein, human receptors and antibodies are key determinants of the potency of this virus and its ability to evade the human immune response. As a rapidly evolving virus, SARS-CoV-2 has already developed into many variants that differ considerably in virulence. Utilizing the proteomic data of SARS-CoV-2 to predict its viral characteristics will, therefore, greatly aid in disease control and prevention. In this paper, we review and compare recent successful prediction methods based on long short-term memory (LSTM), transformer and convolutional neural network (CNN) models, together with a similarity-based topological regression (TR) model, and offer recommendations about the appropriate predictive methodology depending on the similarity between training and test datasets. We compare the effectiveness of these models in predicting the binding affinity and expression of SARS-CoV-2 spike protein sequences. We also explore how effective these predictive methods are when they are trained on laboratory-created data and tasked with predicting the binding affinity of in-the-wild SARS-CoV-2 spike protein sequences obtained from the GISAID datasets. We observe that TR is the better method when the sample size is small and the test protein sequences are sufficiently similar to the training sequences. However, when the training sample size is sufficiently large and prediction requires extrapolation, the LSTM embedding and CNN-based predictive models show superior performance.

Subject(s)

COVID-19; SARS-CoV-2; Amino Acid Sequence; COVID-19/genetics; Humans; Protein Binding; Proteomics; SARS-CoV-2/genetics; Sequence Analysis, Protein; Spike Glycoprotein, Coronavirus/metabolism

Key words

COVID-19; biological sequence analysis; machine learning; performance evaluation; protein-protein interaction; topological regression

Full text: 1; Collection: 01-internacional; Database: MEDLINE; Main subject: SARS-CoV-2 / COVID-19; Type of study: Prognostic_studies / Risk_factors_studies; Limits: Humans; Language: En; Journal: Brief Bioinform; Journal subject: Biology / Medical Informatics; Year: 2022; Document type: Article; Affiliation country: United States
