Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone.
Biochem Biophys Res Commun; 533(3): 553-558, 2020 Dec 10.
Article in English | MEDLINE | ID: mdl-32981683
Coronaviruses infect many animals, including humans, through interspecies transmission. Three of the known human coronaviruses cause severe disease: MERS-CoV, SARS-CoV-1, and SARS-CoV-2, the pathogen responsible for the COVID-19 pandemic. Improved methods to predict host specificity of coronaviruses will be valuable for identifying and controlling future outbreaks. The coronavirus spike (S) protein plays a key role in host specificity by attaching the virus to receptors on the cell membrane. We analyzed 1238 spike sequences for their host specificity. Spike sequences readily segregate in t-SNE embeddings into clusters of similar hosts and/or virus species. Machine learning with SVM, Logistic Regression, Decision Tree, and Random Forest classifiers gave high average accuracies, F1 scores, sensitivities, and specificities of 0.95-0.99. Importantly, sites identified by Decision Tree correspond to protein regions with known biological importance. These results demonstrate that spike sequences alone can be used to predict host specificity.
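The abstract does not state how the spike sequences were encoded or how the classifiers were configured, so the following is only a minimal sketch of the general idea that a tree-based model can point to individual sequence sites that separate hosts. It is not the authors' pipeline. It assumes pre-aligned sequences of equal length and uses a one-level decision stump scored by Gini impurity as a simplified stand-in for a full Decision Tree. The names `TOY_ALIGNMENT`, `gini`, `best_site`, `fit_stump`, and `predict` are hypothetical, and the toy peptides and host labels are invented for illustration, not real spike data.

```python
from collections import Counter

# Hypothetical toy data: short, pre-aligned peptide fragments with made-up
# host labels. These are NOT real spike sequences.
TOY_ALIGNMENT = [
    ("MKTLNYA", "human"),
    ("MKTLNYG", "human"),
    ("MKSLNYA", "human"),
    ("MRTLDYA", "bat"),
    ("MRTLDYG", "bat"),
    ("MRSLDYA", "bat"),
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_site(alignment):
    """Return (position, score) for the column that best separates hosts.

    Each alignment column is split by residue identity (one branch per
    amino acid) and scored by the size-weighted Gini impurity of the
    branches. Positions are 0-based; ties keep the earliest position.
    """
    length = len(alignment[0][0])
    total = len(alignment)
    best = None
    for pos in range(length):
        branches = {}
        for seq, host in alignment:
            branches.setdefault(seq[pos], []).append(host)
        score = sum(len(b) / total * gini(b) for b in branches.values())
        if best is None or score < best[1]:
            best = (pos, score)
    return best


def fit_stump(alignment):
    """Fit a one-level tree: choose the best site, then map residue -> host."""
    pos, _ = best_site(alignment)
    votes = {}
    for seq, host in alignment:
        votes.setdefault(seq[pos], Counter())[host] += 1
    # Majority host for each residue observed at the chosen site.
    rule = {res: c.most_common(1)[0][0] for res, c in votes.items()}
    # Fallback for residues never seen at this site during fitting.
    default = Counter(host for _, host in alignment).most_common(1)[0][0]
    return pos, rule, default


def predict(stump, seq):
    """Predict the host label for one aligned sequence."""
    pos, rule, default = stump
    return rule.get(seq[pos], default)


if __name__ == "__main__":
    stump = fit_stump(TOY_ALIGNMENT)
    print("most informative site (0-based):", stump[0])
    print("prediction for MKTLDYA:", predict(stump, "MKTLDYA"))
```

On real data one would more likely one-hot encode a full alignment and use an established library classifier with cross-validation; the stump only illustrates why a tree's chosen split positions can be read back as candidate sites of biological interest.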
Databases: MEDLINE
Main subject: Coronavirus / Computational Biology / Host Specificity / Spike Glycoprotein, Coronavirus / Machine Learning
Study type: Prognostic_studies / Risk_factors_studies
Limits: Animals / Humans
Language: English
Journal: Biochem Biophys Res Commun
Publication year: 2020
Document type: Article