Asymmetric trichotomous partitioning overcomes dataset limitations in building machine learning models for predicting siRNA efficacy.

Monopoli, Kathryn R; Korkin, Dmitry; Khvorova, Anastasia

Affiliation

Monopoli KR; Department of Bioinformatics & Computational Biology, Worcester Polytechnic Institute, Worcester, MA 01609, USA.
Korkin D; RNA Therapeutics Institute, University of Massachusetts Chan Medical School, Worcester, MA 01655, USA.
Khvorova A; Department of Bioinformatics & Computational Biology, Worcester Polytechnic Institute, Worcester, MA 01609, USA.

Mol Ther Nucleic Acids; 33: 93-109, 2023 Sep 12.

Article in English | MEDLINE | ID: mdl-37456778

ABSTRACT

Chemically modified small interfering RNAs (siRNAs) are promising therapeutics guiding sequence-specific silencing of disease genes. Identifying chemically modified siRNA sequences that effectively silence target genes remains challenging. Such determinations necessitate computational algorithms. Machine learning is a powerful predictive approach for tackling biological problems but typically requires datasets significantly larger than most available siRNA datasets. Here, we describe a framework applying machine learning to a small dataset (356 modified sequences) for siRNA efficacy prediction. To overcome noise and biological limitations in siRNA datasets, we apply a trichotomous, two-threshold partitioning approach, producing several combinations of classification threshold pairs. We then test the effects of different thresholds on random forest machine learning model performance using a novel evaluation metric accounting for class imbalances. We identify thresholds yielding a random forest model with high predictive power that outperformed a linear model generated from the same data and was predictive upon experimental evaluation. Using a novel model feature extraction method, we observe target site base importances and base preferences consistent with our current understanding of the siRNA-mediated silencing mechanism, with the random forest providing higher resolution than the linear model. This framework applies to any classification challenge involving small biological datasets, providing an opportunity to develop high-performing design algorithms for oligonucleotide therapies.
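The two-threshold partitioning mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the class labels, the example values, and the assumption that efficacy is recorded as percent target expression remaining (lower meaning stronger silencing) are all hypothetical and chosen only for illustration.

```python
def partition(expression_pct, low, high):
    """Assign each value to one of three classes using two thresholds.

    Assumption (not from the abstract): values are percent target
    expression remaining, so values below `low` count as effective and
    values at or above `high` count as ineffective. Everything in
    between forms the third, intermediate class.
    """
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    labels = []
    for value in expression_pct:
        if value < low:
            labels.append("effective")
        elif value >= high:
            labels.append("ineffective")
        else:
            labels.append("undefined")
    return labels


def threshold_pairs(lows, highs):
    """Enumerate every valid (low, high) pair of candidate thresholds."""
    return [(lo, hi) for lo in lows for hi in highs if lo < hi]


# Hypothetical measurements and thresholds, for illustration only.
values = [12.0, 48.5, 91.0, 30.0, 75.0]
labels = partition(values, low=25.0, high=60.0)
pairs = threshold_pairs([20, 30], [25, 60])
```

Each threshold pair produced this way could then be used to label the data before fitting a classifier, so that model performance can be compared across pairs.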

Keywords

MT: Oligonucleotides: Therapies and Applications; RNA interference; artificial intelligence; chemical modifications; computational model development; machine learning; oligonucleotide therapeutics; oligonucleotides; random forest; siRNA; supervised learning

Full text: 1 | Databases: MEDLINE | Study type: Prognostic_studies / Risk_factors_studies | Language: En | Journal: Mol Ther Nucleic Acids | Publication year: 2023 | Document type: Article | Country of affiliation: United States
