Machine learning models and performance dependency on 2D chemical descriptor space for retention time prediction of pharmaceuticals.

Beck, Armen G; Fine, Jonathan; Aggarwal, Pankaj; Regalado, Erik L; Levorse, Dorothy; De Jesus Silva, Jordan; Sherer, Edward C

Affiliation

Beck AG; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.
Fine J; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.
Aggarwal P; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA. Electronic address: pankaj.aggarwal@merck.com.
Regalado EL; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.
Levorse D; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.
De Jesus Silva J; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.
Sherer EC; Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ 07065, USA.

J Chromatogr A ; 1730: 465109, 2024 Aug 16.

Article in English | MEDLINE | ID: mdl-38968662

ABSTRACT

The predictive modeling of liquid chromatography methods can be an invaluable asset, potentially saving countless hours of labor while also reducing solvent consumption and waste. Tasks such as physicochemical screening and preliminary method screening systems where large amounts of chromatography data are collected from fast and routine operations are particularly well suited for both leveraging large datasets and benefiting from predictive models. Therefore, the generation of predictive models for retention time is an active area of development. However, for these predictive models to gain acceptance, researchers first must have confidence in model performance and the computational cost of building them should be minimal. In this study, a simple and cost-effective workflow for the development of machine learning models to predict retention time using only Molecular Operating Environment 2D descriptors as input for support vector regression is developed. Furthermore, we investigated the relative performance of models based on molecular descriptor space by utilizing uniform manifold approximation and projection and clustering with Gaussian mixture models to identify chemically distinct clusters. Results outlined herein demonstrate that local models trained on clusters in chemical space perform equivalently when compared to models trained on all data. Through 10-fold cross-validation on a comprehensive set containing 67,950 of our company's proprietary analytes, these models achieved coefficients of determination of 0.84 and 3 % error in terms of retention time. This promising statistical significance is found to translate from cross-validation to prospective prediction on an external test set of pharmaceutically relevant analytes. The observed equivalency of global and local modeling of large datasets is retained with METLIN's SMRT dataset, thereby confirming the wider applicability of the developed machine learning workflows for global models.
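The evaluation protocol described in the abstract (10-fold cross-validation, scored by coefficient of determination and percent error, comparing one global model with per-cluster local models) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, a one-descriptor least-squares line stands in for support vector regression, the cluster labels are assigned directly rather than obtained from UMAP and a Gaussian mixture model, and "percent error" is assumed to mean mean absolute percentage error, which the abstract does not specify. All function and variable names are hypothetical.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_percent_error(y_true, y_pred):
    """Mean absolute percentage error (assumed definition of 'percent error')."""
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; a simple stand-in for SVR."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def k_fold_predict(xs, ys, k=10, seed=0):
    """Return out-of-fold predictions: each point is predicted by a model
    that never saw it during fitting."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(xs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a * xs[i] + b
    return preds


def local_k_fold_predict(xs, ys, labels, k=10, seed=0):
    """Run k-fold cross-validation separately inside each cluster."""
    preds = [0.0] * len(xs)
    for cluster in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        sub = k_fold_predict([xs[i] for i in members],
                             [ys[i] for i in members], k, seed)
        for i, p in zip(members, sub):
            preds[i] = p
    return preds


# Synthetic data (assumption): one descriptor, two clusters, and a retention
# time that depends linearly on the descriptor plus Gaussian noise.
rng = random.Random(42)
descriptor, retention, labels = [], [], []
for cluster, (low, high) in enumerate([(0.0, 5.0), (5.0, 10.0)]):
    for _ in range(100):
        x = rng.uniform(low, high)
        descriptor.append(x)
        retention.append(2.0 + 0.8 * x + rng.gauss(0.0, 0.2))
        labels.append(cluster)

global_preds = k_fold_predict(descriptor, retention)
local_preds = local_k_fold_predict(descriptor, retention, labels)

r2_global = r_squared(retention, global_preds)
r2_local = r_squared(retention, local_preds)
err_global = mean_percent_error(retention, global_preds)
err_local = mean_percent_error(retention, local_preds)

print(f"global: R2={r2_global:.3f}, error={err_global:.1f}%")
print(f"local:  R2={r2_local:.3f}, error={err_local:.1f}%")
```

Because the synthetic retention times follow the same relationship in both clusters, the global and local scores come out close to each other here by construction. That mirrors the form of the comparison reported in the abstract but says nothing about real chromatographic data.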

Subject(s)

Machine Learning; Pharmaceutical Preparations/analysis; Pharmaceutical Preparations/chemistry; Chromatography, Liquid/methods; Support Vector Machine; Cluster Analysis

Keywords

Gaussian mixture models; Liquid chromatography; Retention time prediction; Support vector regression; Uniform Manifold Approximation & Projection

Full text: 1 Database: MEDLINE Main subject: Machine Learning Language: En Journal: J Chromatogr A Year: 2024 Document type: Article
