Transfer learning for small molecule retention predictions.
Osipenko, Sergey; Botashev, Kazii; Nikolaev, Eugene; Kostyukevich, Yury.
Affiliation
  • Osipenko S; Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Nobel Str., 3, 121205 Moscow, Russia.
  • Botashev K; Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Nobel Str., 3, 121205 Moscow, Russia.
  • Nikolaev E; Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Nobel Str., 3, 121205 Moscow, Russia. Electronic address: e.nikolaev@skoltech.ru.
  • Kostyukevich Y; Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Nobel Str., 3, 121205 Moscow, Russia. Electronic address: y.kostyukevich@skoltech.ru.
J Chromatogr A ; 1644: 462119, 2021 May 10.
Article in English | MEDLINE | ID: mdl-33845426
Small molecule retention time prediction is a challenging task because the wide variety of separation techniques leaves only fragmented data available for training machine learning models. Predictions are typically made with traditional machine learning methods such as support vector machines, random forests, or gradient boosting. Another approach is to train on large data sets and then project the predictions onto the target separation system. Here we evaluate the applicability of transfer learning, a state-of-the-art technique in natural language processing (NLP), as a new approach to small molecule retention prediction from small retention data sets. We propose using text-based molecular representations (SMILES), widely used in cheminformatics, for NLP-like modeling of molecules. We use self-supervised pre-training to capture relevant features from a large corpus of one million molecules, followed by fine-tuning on task-specific data. The mean absolute error (MAE) of the predictions was in the range of 88-248 s for the tested reversed-phase data sets and 66 s for the HILIC data set, which is comparable with the MAE reported on the same data for traditional descriptor-based machine learning models and for projection approaches.
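The pre-training stage described above relies on a self-supervised objective over SMILES strings: tokenize each string into chemically meaningful units, corrupt some tokens, and train a model to recover them from context. The sketch below illustrates this idea with a minimal SMILES tokenizer and a masking step; the regular expression, the `<mask>` symbol, and the function names are illustrative assumptions, not the authors' actual implementation.

```python
import random
import re

# Minimal SMILES tokenizer (hypothetical sketch): multi-character atoms and
# bracket expressions are matched before single characters so that, e.g.,
# "Cl" is one token rather than "C" + "l".
SMILES_TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|@@|@|[BCNOPSFIbcnops]|[0-9]|[()=#+\-/\\%.]"
)

def tokenize(smiles):
    """Split a SMILES string into tokens; verify nothing was dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Randomly replace tokens with a mask symbol.

    Returns the corrupted sequence and a parallel list of targets
    (the original token where masked, None elsewhere).  The
    self-supervised objective is to predict the targets from context.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets
```

After pre-training on such masked sequences, the learned encoder would be fine-tuned with a small regression head on the task-specific retention data, reusing the representation rather than training from scratch.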

Full text: 1 Database: MEDLINE Main subject: Machine Learning Study type: Prognostic_studies / Risk_factors_studies Language: En Journal: J Chromatogr A Year: 2021 Document type: Article
