Search | Virtual Health Library

SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning.

Pinheiro, Gabriel A; Da Silva, Juarez L F; Quiles, Marcos G.

J Chem Inf Model ; 62(17): 3948-3960, 2022 09 12.

Article in English | MEDLINE | ID: mdl-36044610

ABSTRACT

Machine learning as a tool for chemical space exploration broadens horizons to work with known and unknown molecules. At its core lies molecular representation, an essential key to improve learning about structure-property relationships. Recently, contrastive frameworks have shown impressive results for representation learning in diverse domains. Therefore, this paper proposes a contrastive framework that embraces multimodal molecular data. Specifically, our approach jointly trains a graph encoder and an encoder for the simplified molecular-input line-entry system (SMILES) string to perform the contrastive learning objective. Since SMILES is the basis of our method, i.e., we built the molecular graph from the SMILES, we call our framework SMILES Contrastive Learning (SMICLR). When stacking a nonlinear regressor on SMICLR's pretrained encoder and fine-tuning the entire model, we reduced the prediction error by, on average, 44% and 25% for the energetic and electronic properties of the QM9 data set, respectively, over the supervised baseline. We further improved our framework's performance by applying data augmentations to each molecular-input representation. Moreover, SMICLR demonstrated competitive representation learning results in an unsupervised setting.
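The abstract does not state which contrastive objective is used. As a minimal sketch, assuming an NT-Xent-style (SimCLR-type) loss between graph embeddings and SMILES embeddings of the same batch of molecules, the objective could look like the following. The function name `nt_xent_loss`, the `temperature` value, and the use of plain NumPy arrays in place of trained encoders are illustrative assumptions, not details from the paper.

```python
import numpy as np

def nt_xent_loss(z_graph, z_smiles, temperature=0.1):
    """NT-Xent-style loss between two embedding batches (assumed form).

    z_graph, z_smiles: arrays of shape (N, D); row i of each array is
    assumed to be an embedding of the same molecule i.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = z_graph / np.linalg.norm(z_graph, axis=1, keepdims=True)
    b = z_smiles / np.linalg.norm(z_smiles, axis=1, keepdims=True)
    z = np.concatenate([a, b], axis=0)        # shape (2N, D)
    n = a.shape[0]
    sim = z @ z.T / temperature               # shape (2N, 2N)
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    # The positive for row i is row i + N, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )
    return -log_prob[np.arange(2 * n), pos].mean()

# Toy check with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = nt_xent_loss(z, z)          # matching pairs
loss_shuffled = nt_xent_loss(z, z[::-1])   # mismatched pairs
```

In this toy check the aligned batch yields a lower loss than the mismatched one, which is the behaviour a contrastive objective of this kind rewards during pretraining.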

Subjects

Machine Learning

Screening of the Role of the Chemical Structure in the Electrochemical Stability Window of Ionic Liquids: DFT Calculations Combined with Data Mining.

Moraes, Alex S; Pinheiro, Gabriel A; Lourenço, Tuanan C; Lopes, Mauro C; Quiles, Marcos G; Dias, Luis G; Da Silva, Juarez L F.

J Chem Inf Model ; 62(19): 4702-4712, 2022 Oct 10.

Article in English | MEDLINE | ID: mdl-36122418

ABSTRACT

Ionic liquids have attracted the attention of researchers as possible electrolytes for electrochemical energy storage devices. However, their properties, such as the electrochemical stability window (ESW), ionic conductivity, and diffusivity, are influenced both by the chemical structures of cations and anions and by their combinations. Most studies in the literature focus on the understanding of common ionic liquids, and little effort has been made to find ways to improve our atomistic understanding of those systems. The goal of this paper is to explore the structural characteristics of cations and anions that form ionic liquids that can expand the HOMO/LUMO gap, a property directly linked to the ESW of the electrolyte. For that, we design a framework for randomly generating new ions by combining their fragments. Within this framework, we generate about 10^4 cations and 10^4 anions and fully optimize their structures using density functional theory. Our calculations show that aromatic cations yield less stable ionic liquids than aliphatic ones, an expected result if chemical rationale is used. More importantly, we can improve the gap by adding electron-donating and electron-withdrawing functional groups to the cations and anions, respectively. The increase can be about 2 V, depending on the case. This improvement is reflected in a wider ESW.

Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition.

Cesar de Azevedo, Luis; Pinheiro, Gabriel A; Quiles, Marcos G; Da Silva, Juarez L F; Prati, Ronaldo C.

J Chem Inf Model ; 61(9): 4210-4223, 2021 09 27.

Article in English | MEDLINE | ID: mdl-34387994

ABSTRACT

Most machine learning applications in quantum-chemistry (QC) data sets rely on a single statistical error parameter such as the mean square error (MSE) to evaluate their performance. However, this approach has limitations or can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP), to evaluate the prediction process's performance using different numbers of molecules in training samples and the effect of bias and variance on the final MSE. Overall, low sample sizes are related to higher MSE. Moreover, the bias component strongly influences the larger MSEs. Furthermore, there is little agreement among molecules with higher errors (outliers) across different descriptors. However, there is a high prevalence among the outliers intersection set and the convex hull volume of geometric coordinates (VCH). According to the obtained results with the distribution of MSE (and its components bias and variance) and the appearance of outliers, it is suggested to use ensembles of models with a low bias to minimize the MSE, more specifically when using a small number of molecules in the training set.
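The decomposition the abstract refers to can be illustrated on synthetic data. The sketch below is not the authors' pipeline: it assumes an ensemble of cubic polynomial fits trained on small bootstrap samples of a noisy sine curve, and the helper name `bias_variance` is hypothetical. Measured against fixed targets, the average MSE over the ensemble splits exactly into a squared-bias term and a variance term.

```python
import numpy as np

def bias_variance(preds, y_true):
    """Decompose the ensemble-averaged MSE into bias^2 and variance.

    preds:  array (n_models, n_test), one row of predictions per model.
    y_true: array (n_test,), reference values for the test points.
    """
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - y_true) ** 2)   # squared bias
    variance = np.mean(preds.var(axis=0))          # spread across models
    mse = np.mean((preds - y_true) ** 2)           # averaged over models
    return mse, bias_sq, variance

# Synthetic stand-in for a property-prediction task (illustrative only).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 200)
x_test = np.linspace(-1, 1, 50)
y_test = np.sin(3 * x_test)

preds = []
for _ in range(30):
    # Each model sees a small bootstrap sample of the training pool.
    idx = rng.integers(0, len(x_train), size=20)
    coeffs = np.polyfit(x_train[idx], y_train[idx], deg=3)
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)

mse, bias_sq, variance = bias_variance(preds, y_test)
print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Changing the bootstrap sample size (here 20) is a simple way to see how the two components respond to the number of training samples, which is the effect the abstract investigates.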

Subjects

Algorithms, Machine Learning, Bias

Machine Learning Prediction of Nine Molecular Properties Based on the SMILES Representation of the QM9 Quantum-Chemistry Dataset.

Pinheiro, Gabriel A; Mucelini, Johnatan; Soares, Marinalva D; Prati, Ronaldo C; Da Silva, Juarez L F; Quiles, Marcos G.

J Phys Chem A ; 124(47): 9854-9866, 2020 Nov 25.

Article in English | MEDLINE | ID: mdl-33174750

ABSTRACT

Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds into their respective target properties. In this realm, a crucial step is encoding the molecular systems into the ML model, in which the molecular representation plays a crucial role. Most of the representations are based on the use of atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. Therefore, we evaluate a feed-forward neural network (FNN) model's prediction performance over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ~130k organic molecules. Our best results reached a mean absolute error, close to chemical accuracy, of ~0.05 eV for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the results obtained an out-of-sample error nine times less than the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results showed how limited the model's accuracy is by employing such low computational cost representation that carries less information about the molecular structure than the most state-of-the-art methods.
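To make the idea of a coordinate-free descriptor concrete, the sketch below turns a SMILES string into a fixed-length vector of character counts. This is purely illustrative: the abstract does not list the descriptors or feature selection methods used in the paper, and the vocabulary `VOCAB` and the function `smiles_counts` are hypothetical names chosen for this example.

```python
from collections import Counter

# Hypothetical vocabulary covering the C, N, O, F chemistry of small
# organic molecules, plus a few bond, branch, and ring-closure symbols.
VOCAB = ["C", "N", "O", "F", "c", "n", "o", "=", "#", "(", ")", "1", "2", "3"]

def smiles_counts(smiles):
    """Return a fixed-length vector counting each VOCAB symbol in `smiles`."""
    counts = Counter(smiles)
    return [counts.get(tok, 0) for tok in VOCAB]

# Acetic acid and benzene as example inputs.
acetic_acid = smiles_counts("CC(=O)O")
benzene = smiles_counts("c1ccccc1")
```

A vector like this needs no geometry optimization or atomic coordinates, which is where the reduced computational cost comes from, but it also discards most structural information, in line with the accuracy limitation the abstract notes.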
