Results 1 - 14 of 14
1.
J Chem Inf Model ; 63(21): 6515-6524, 2023 Nov 13.
Article in English | MEDLINE | ID: mdl-37857374

ABSTRACT

We introduce an exploratory active learning (AL) algorithm using Gaussian process regression and a marginalized graph kernel (GPR-MGK) to sample chemical compound space (CCS) at minimal cost. Targeting 251,728 enumerated alkane molecules with 4-19 carbon atoms, we applied the AL algorithm to select a diverse and representative set of molecules and then conducted high-throughput molecular simulations on these selected molecules. To demonstrate the power of the AL algorithm, we built directed message-passing neural networks (D-MPNN) using simulation data as the training set to predict liquid densities, heat capacities, and vaporization enthalpies of the CCS. Validations show that D-MPNN models built on the smallest training set considered in this work, which consists of 313 molecules or 0.124% of the original CCS, predict the properties with R² > 0.99 against the computational data and R² > 0.94 against the experimental data. The advantage of the presented AL algorithm is that the predicted uncertainty of GPR depends only on the molecular structures, which renders it compatible with high-throughput data generation.
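
The last point of the abstract, that the GPR uncertainty depends only on the inputs, can be illustrated with a minimal sketch. It is not the authors' implementation: the RBF kernel on random feature vectors is a stand-in for the marginalized graph kernel, and `select_by_variance`, `X_pool`, and the noise level are hypothetical names and values chosen for illustration. The sketch greedily picks the pool item with the largest posterior variance, which requires no property labels and so can run before any simulation.

```python
import numpy as np

def rbf_kernel(X, Y, length_scale=1.0):
    """Stand-in similarity (assumption); the paper uses a marginalized graph kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def select_by_variance(K, n_select, noise=1e-6, start=0):
    """Greedily pick the pool item with the largest GP posterior variance.

    K is the full pool-by-pool kernel matrix. No labels are used, because the
    GP posterior variance is a function of the inputs alone.
    """
    selected = [start]
    for _ in range(n_select - 1):
        K_ss = K[np.ix_(selected, selected)] + noise * np.eye(len(selected))
        K_sp = K[selected, :]  # shape (n_selected, n_pool)
        # var_j = K_jj - k_j^T K_ss^{-1} k_j for every pool item j
        var = np.diag(K) - np.einsum("ij,ij->j", K_sp, np.linalg.solve(K_ss, K_sp))
        var[selected] = -np.inf  # never re-pick an already selected item
        selected.append(int(np.argmax(var)))
    return selected

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))  # hypothetical descriptors for 200 molecules
K = rbf_kernel(X_pool, X_pool)
chosen = select_by_variance(K, n_select=10)
```

Only the molecules in `chosen` would then be passed to the (expensive) simulation step.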


Subject(s)
Alkanes , Neural Networks, Computer , Thermodynamics , Algorithms , Molecular Structure
2.
J Chem Inf Model ; 63(15): 4633-4640, 2023 Aug 14.
Article in English | MEDLINE | ID: mdl-37504964

ABSTRACT

Marginalized graph kernels have shown competitive performance in molecular machine learning tasks but currently lack measures of interpretability, which are important to improve trust in the models, detect biases, and inform molecular optimization campaigns. We here conceive and implement two interpretability measures for Gaussian process regression using a marginalized graph kernel (GPR-MGK) to quantify (1) the contribution of specific training data to the prediction and (2) the contribution of specific nodes of the graph to the prediction. We demonstrate the applicability of these interpretability measures for molecular property prediction. We compare GPR-MGK to graph neural networks on four logic and two real-world toxicology data sets and find that the atomic attribution of GPR-MGK generally outperforms the atomic attribution of graph neural networks. We also perform a detailed molecular attribution analysis using the FreeSolv data set, showing how molecules in the training set influence machine learning predictions and why Morgan fingerprints perform poorly on this data set. This is the first systematic examination of the interpretability of GPR-MGK and thereby is an important step in the further maturation of marginalized graph kernel methods for interpretable molecular predictions.
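
One natural way to quantify the contribution of individual training data to a GPR prediction follows from the form of the posterior mean, which is a weighted sum of kernel values. The sketch below is an illustration under that assumption and may differ from the measure defined in the paper; the function name `training_contributions` and the toy matrices are hypothetical, and a zero-mean prior is assumed.

```python
import numpy as np

def training_contributions(K_train, k_star, y, noise=1e-6):
    """Split a zero-mean GPR prediction into per-training-point terms.

    The posterior mean is sum_i k(x*, x_i) * alpha_i with
    alpha = (K + noise * I)^{-1} y, so term i is k_star[i] * alpha[i].
    """
    alpha = np.linalg.solve(K_train + noise * np.eye(len(y)), y)
    return k_star * alpha

# Toy kernel values (hypothetical): three training molecules, one query.
K_train = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
k_star = np.array([0.9, 0.4, 0.05])
y = np.array([1.0, 2.0, -1.0])

contrib = training_contributions(K_train, k_star, y)
prediction = contrib.sum()  # the contributions add up to the GPR mean
```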

3.
J Chem Inf Model ; 61(11): 5414-5424, 2021 Nov 22.
Article in English | MEDLINE | ID: mdl-34723539

ABSTRACT

This work proposes a state-of-the-art hybrid kernel to calculate molecular similarity. Combined with Gaussian process models, the performance of the hybrid kernel in predicting molecular properties is comparable to that of the directed message-passing neural network (D-MPNN). The hybrid kernel consists of a marginalized graph kernel (MGK) and a radial basis function (RBF) kernel that operate on molecular graphs and global molecular features, respectively. Bayesian optimization was used to obtain the optimal hyperparameters for both models. The comparisons are performed on 11 publicly available data sets. Our results show that the performances of the two models are similar, their prediction errors are correlated, and the ensemble predictions of the two models perform better than either model alone. Through principal component analysis, we found that the molecular embeddings of the hybrid kernel and the D-MPNN are also similar. The advantage of D-MPNN lies in its computational efficiency and scalability to large-scale data, while the advantage of the graph kernel models lies in their accurate uncertainty quantification.
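
A hybrid kernel of this kind can be sketched as follows. The abstract does not state how the two kernels are combined, so the elementwise product used here is an assumption, as are the names `hybrid_kernel`, `K_graph`, and `F`; the graph-kernel matrix is a random positive semidefinite placeholder rather than a real MGK. The product of two positive semidefinite kernel matrices is again positive semidefinite (Schur product theorem), so the result remains a valid covariance.

```python
import numpy as np

def rbf_kernel(F1, F2, length_scale=1.0):
    """RBF kernel on global molecular feature vectors."""
    d2 = ((F1[:, None, :] - F2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def hybrid_kernel(K_graph, F1, F2, length_scale=1.0):
    """Elementwise product of a graph-kernel matrix and an RBF kernel.

    The product form is an assumption made for illustration.
    """
    return K_graph * rbf_kernel(F1, F2, length_scale)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
K_graph = A @ A.T                    # placeholder PSD matrix, not a real MGK
d = np.sqrt(np.diag(K_graph))
K_graph = K_graph / np.outer(d, d)   # normalize to unit diagonal
F = rng.normal(size=(6, 3))          # hypothetical global features
K = hybrid_kernel(K_graph, F, F)
```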


Subject(s)
Neural Networks, Computer , Bayes Theorem
4.
Phys Chem Chem Phys ; 23(43): 24892-24904, 2021 Nov 10.
Article in English | MEDLINE | ID: mdl-34724700

ABSTRACT

The solvation free energy of organic molecules is a critical parameter in determining emergent properties such as solubility, liquid-phase equilibrium constants, pKa and redox potentials in an organic redox flow battery. In this work, we present a machine learning (ML) model that can learn and predict the aqueous solvation free energy of an organic molecule using the Gaussian process regression method based on a new molecular graph kernel. To investigate how well the ML model captures the electrostatic interaction, the nonpolar contribution of the solvent, and the conformational entropy of the solute in the solvation free energy, we test three data sets that differ in the water solvent model (implicit or explicit) and in the contribution of the solute's conformational entropy. We demonstrate that our ML model can predict the solvation free energy of molecules at chemical accuracy with a mean absolute error of less than 1 kcal mol⁻¹ for subsets of the QM9 dataset and the FreeSolv database. To solve the general data scarcity problem for a graph-based ML model, we propose a dimension reduction algorithm based on the distance between molecular graphs, which can be used to examine the diversity of the molecular data set. It provides a promising way to build a minimum training set to improve prediction for certain test sets where the space of molecular structures is predetermined.

5.
J Phys Chem A ; 125(20): 4488-4497, 2021 May 27.
Article in English | MEDLINE | ID: mdl-33999627

ABSTRACT

This work presents a Gaussian process regression (GPR) model on top of a novel graph representation of chemical molecules that predicts thermodynamic properties of pure substances in single, double, and triple phases. A transferable molecular graph representation is proposed as the input for a marginalized graph kernel, which is the major component of the covariance function in our GPR models. Radial basis function kernels of temperature and pressure are also incorporated into the covariance function when necessary. We predicted three types of representative properties of pure substances in single, double, and triple phases, i.e., critical temperature, vapor-liquid equilibrium (VLE) density, and pressure-temperature density. The accuracy of the models is nearly identical to the precision of the experimental measurements. Moreover, the reliability of our predictions can be quantified on a per-sample basis using the posterior uncertainty of the GPR model. We compare our model against Morgan fingerprints and a graph neural network to further demonstrate the advantage of the proposed method.

6.
J Chem Phys ; 150(4): 044107, 2019 Jan 28.
Article in English | MEDLINE | ID: mdl-30709286

ABSTRACT

Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaption, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. To apply the marginalized graph kernel, a spatial adjacency rule is first employed to convert molecules into graphs whose vertices and edges are labeled by elements and interatomic distances, respectively. We then derive formulas for the efficient evaluation of the kernel. Specific functional components for the marginalized graph kernel are proposed, while the effects of the associated hyperparameters on accuracy and predictive confidence are examined. We show that the graph kernel is particularly suitable for predicting extensive properties because its convolutional structure coincides with that of the covariance formula between sums of random variables. Using an active learning procedure, we demonstrate that the proposed method can achieve a mean absolute error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7 dataset.
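
The remark about extensive properties can be made concrete with a short identity. The notation is introduced here for illustration only: the atomic contributions $\varepsilon_v$ and the node-level base kernel $\kappa$ are assumptions, and the double sum is a schematic form of a convolutional graph kernel rather than the paper's exact expression.

```latex
% Covariance is bilinear, so for sums of random variables
\operatorname{Cov}\Big(\sum_{i} a_i,\ \sum_{j} b_j\Big)
  = \sum_{i}\sum_{j} \operatorname{Cov}(a_i, b_j).

% If an extensive property is a sum of atomic contributions,
% E(G) = \sum_{v \in G} \varepsilon_v, with prior covariance
% \operatorname{Cov}(\varepsilon_v, \varepsilon_{v'}) = \kappa(v, v'), then
\operatorname{Cov}\big(E(G),\, E(G')\big)
  = \sum_{v \in G}\sum_{v' \in G'} \kappa(v, v'),

% which has the same double-sum structure as a convolutional graph kernel
% K(G, G') = \sum_{v \in G}\sum_{v' \in G'} \kappa(v, v').
```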

7.
J Chem Phys ; 148(3): 034101, 2018 Jan 21.
Article in English | MEDLINE | ID: mdl-29352799

ABSTRACT

Molecular fingerprints, i.e., feature vectors describing atomistic neighborhood configurations, are an important abstraction and a key ingredient for data-driven modeling of potential energy surfaces and interatomic forces. In this paper, we present the density-encoded canonically aligned fingerprint algorithm, which is robust and efficient, for fitting per-atom scalar and vector quantities. The fingerprint is essentially a continuous density field formed through the superimposition of smoothing kernels centered on the atoms. Rotational invariance of the fingerprint is achieved by aligning, for each fingerprint instance, the neighboring atoms onto a local canonical coordinate frame computed from a kernel minisum optimization procedure. We show that this approach is superior to principal components analysis-based methods, especially when the atomistic neighborhood is sparse and/or contains symmetry. We propose that the "distance" between the density fields be measured using a volume integral of their pointwise difference. This can be efficiently computed using optimal quadrature rules, which only require discrete sampling at a small number of grid points. We also experiment on the choice of weight functions for constructing the density fields and characterize their performance for fitting interatomic potentials. The applicability of the fingerprint is demonstrated through a set of benchmark problems.

8.
Comput Phys Commun ; 217: 171-179, 2017 Aug.
Article in English | MEDLINE | ID: mdl-29104303

ABSTRACT

Mesoscopic numerical simulations provide a unique approach for the quantification of the chemical influences on red blood cell functionalities. The transport Dissipative Particle Dynamics (tDPD) method can lead to such effective multiscale simulations due to its ability to simultaneously capture mesoscopic advection, diffusion, and reaction. In this paper, we present a GPU-accelerated red blood cell simulation package based on a tDPD adaptation of our red blood cell model, which can correctly recover the cell membrane viscosity, elasticity, bending stiffness, and cross-membrane chemical transport. The package essentially processes all computational workloads in parallel by GPU, and it incorporates multi-stream scheduling and non-blocking MPI communications to improve inter-node scalability. Our code is validated for accuracy and compared against the CPU counterpart for speed. Strong scaling and weak scaling are also presented to characterize scalability. We observe a speedup of 10.1 on one GPU over all 16 cores within a single node, and a weak scaling efficiency of 91% across 256 nodes. The program enables quick-turnaround and high-throughput numerical simulations for investigating chemical-driven red blood cell phenomena and disorders.
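
For readers unfamiliar with the metrics quoted above, the sketch below spells out the standard definitions of speedup, strong scaling efficiency, and weak scaling efficiency. The function names and all timings are hypothetical and are not taken from the paper.

```python
def speedup(t_reference, t_accelerated):
    """Ratio of reference wall time to accelerated wall time."""
    return t_reference / t_accelerated

def strong_scaling_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Fixed total problem size: ideal time on n nodes is t_one_node / n."""
    return t_one_node / (n_nodes * t_n_nodes)

def weak_scaling_efficiency(t_one_node, t_n_nodes):
    """Fixed problem size per node: ideal time stays equal to t_one_node."""
    return t_one_node / t_n_nodes

# Hypothetical wall times in seconds, for illustration only.
s = speedup(120.0, 12.0)
e_strong = strong_scaling_efficiency(100.0, 30.0, 4)
e_weak = weak_scaling_efficiency(100.0, 125.0)
```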

9.
Biophys J ; 112(10): 2030-2037, 2017 May 23.
Article in English | MEDLINE | ID: mdl-28538143

ABSTRACT

We present OpenRBC, a coarse-grained molecular dynamics code, which is capable of performing an unprecedented in silico experiment, simulating an entire mammalian red blood cell lipid bilayer and cytoskeleton as modeled by multiple millions of mesoscopic particles, using a single shared-memory commodity workstation. To achieve this, we invented an adaptive spatial-searching algorithm to accelerate the computation of short-range pairwise interactions in an extremely sparse three-dimensional space. The algorithm is based on a Voronoi partitioning of the point cloud of coarse-grained particles, and is continuously updated over the course of the simulation. The algorithm enables the construction of the key spatial searching data structure in our code, i.e., a lattice-free cell list, with a time and space cost linearly proportional to the number of particles in the system. The position and the shape of the cells also adapt automatically to the local density and curvature. The code implements OpenMP parallelization and scales to hundreds of hardware threads. It outperforms a legacy simulator by almost an order of magnitude in time-to-solution and >40 times in problem size, thus providing, to our knowledge, a new platform for probing the biomechanics of red blood cells.


Subject(s)
Erythrocytes/metabolism , Molecular Dynamics Simulation , Software , Algorithms , Animals , Cell Membrane/metabolism , Cluster Analysis , Cytoskeleton/metabolism , Erythrocytes/cytology , Models, Cardiovascular
10.
Phys Rev E ; 93(3): 033312, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27078489

ABSTRACT

We analyze hydrodynamic fluctuations of a hybrid simulation under shear flow. The hybrid simulation is based on the Navier-Stokes (NS) equations on one domain and dissipative particle dynamics (DPD) on the other. The two domains overlap, and there is an artificial boundary for each one within the overlapping region. To impose the artificial boundary of the NS solver, a simple spatial-temporal averaging is performed on the DPD simulation. In the artificial boundary of the particle simulation, four popular strategies of constraint dynamics are implemented, namely the Maxwell buffer [Hadjiconstantinou and Patera, Int. J. Mod. Phys. C 08, 967 (1997)], the relaxation dynamics [O'Connell and Thompson, Phys. Rev. E 52, R5792 (1995)], the least constraint dynamics [Nie et al., J. Fluid Mech. 500, 55 (2004); Werder et al., J. Comput. Phys. 205, 373 (2005)], and the flux imposition [Flekkøy et al., Europhys. Lett. 52, 271 (2000)], to achieve a target mean value given by the NS solver. Going beyond the mean flow field of the hybrid simulations, we investigate the hydrodynamic fluctuations in the DPD domain. Toward that end, we calculate the transversal autocorrelation functions of the fluctuating variables in k space to evaluate the generation, transport, and dissipation of fluctuations in the presence of a hybrid interface. We quantify the unavoidable errors in the fluctuations, due to both the truncation of the domain and the constraint dynamics performed in the artificial boundary. Furthermore, we compare the four methods of constraint dynamics and demonstrate how to reduce the errors in fluctuations. The analysis and findings of this work are directly applicable to other hybrid simulations of fluid flow with thermal fluctuations.

11.
Interface Focus ; 6(1): 20150065, 2016 Feb 06.
Article in English | MEDLINE | ID: mdl-26855752

ABSTRACT

Sickle-cell anaemia (SCA) is an inherited blood disorder exhibiting heterogeneous cell morphology and abnormal rheology, especially under hypoxic conditions. By using a multiscale red blood cell (RBC) model with parameters derived from patient-specific data, we present a mesoscopic computational study of the haemodynamic and rheological characteristics of blood from SCA patients with hydroxyurea (HU) treatment (on-HU) and those without HU treatment (off-HU). We determine the shear viscosity of blood in health as well as in different states of disease. Our results suggest that treatment with HU improves or worsens the rheological characteristics of blood in SCA depending on the degree of hypoxia. However, on-HU groups always have higher levels of haematocrit-to-viscosity ratio (HVR) than off-HU groups, indicating that HU can indeed improve the oxygen transport potential of blood. Our patient-specific computational simulations suggest that the HVR level, rather than the shear viscosity of sickle RBC suspensions, may be a more reliable indicator in assessing the response to HU treatment.

12.
BMC Genomics ; 17 Suppl 1: 4, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26818118

ABSTRACT

BACKGROUND: The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. It is acknowledged that MIs are an important type of genomic variation and may play roles in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. RESULTS: The algorithm of MID is based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. CONCLUSIONS: To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore our tool is expected to be useful for improving the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID .
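
To make the notion of a micro-inversion concrete, the sketch below shows a brute-force check on a single read: an inverted segment appears in the read as the reverse complement of the reference. This is a toy illustration, not MID's dynamic-programming algorithm; it handles only one inversion, assumes the read and reference window have equal length, and the function names and the example sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_micro_inversion(read, ref, min_len=4):
    """Return (start, end) such that reverse-complementing read[start:end]
    makes the read identical to ref, or None if no such segment exists.

    Brute force over all segments; illustrative only.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    n = len(read)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            candidate = read[:start] + reverse_complement(read[start:end]) + read[end:]
            if candidate == ref:
                return start, end
    return None

# Build a toy read by inverting positions 6..13 of a made-up reference.
ref = "ACGTTGCAAGGCTTACGATC"
read = ref[:6] + reverse_complement(ref[6:14]) + ref[14:]
hit = find_micro_inversion(read, ref)
```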


Subject(s)
Algorithms , Chromosome Inversion/genetics , High-Throughput Nucleotide Sequencing , DNA/chemistry , DNA/genetics , DNA/metabolism , Genome, Human , Humans , Internet , Sequence Alignment , Sequence Analysis, DNA , User-Computer Interface
13.
Chem Commun (Camb) ; 51(55): 11038-40, 2015 Jul 14.
Article in English | MEDLINE | ID: mdl-26062446

ABSTRACT

We present a non-isothermal mesoscopic model for investigation of the phase transition dynamics of thermoresponsive polymers. Since this model conserves energy in the simulations, it is able to correctly capture not only the transient behavior of polymer precipitation from solvent, but also the energy variation associated with the phase transition process. Simulations provide dynamic details of the thermally induced phase transition and confirm two different mechanisms dominating the phase transition dynamics. A shift of endothermic peak with concentration is observed and the underlying mechanism is explored.


Subject(s)
Phase Transition , Polymers/chemistry , Models, Molecular , Thermodynamics
14.
Chem Commun (Camb) ; 50(61): 8306-8, 2014 Aug 07.
Article in English | MEDLINE | ID: mdl-24938634

ABSTRACT

We present large-scale simulation results on the self-assembly of amphiphilic systems in bulk solution and under soft confinement. Self-assembled unilamellar and multilamellar vesicles are formed from amphiphilic molecules in bulk solution. The system is simulated by placing amphiphilic molecules inside large unilamellar vesicles (LUVs) and the dynamic soft confinement-induced self-assembled vesicles are investigated. Moreover, the self-assembly of sickle hemoglobin (HbS) is simulated in a crowded and fluctuating intracellular space and our results demonstrate that the HbS self-assembles into polymer fibers causing the LUV shape to be distorted.


Subject(s)
Unilamellar Liposomes/chemistry , Erythrocytes/chemistry , Erythrocytes/physiology , Hemoglobin, Sickle/chemistry , Hemoglobin, Sickle/metabolism , Humans , Hydrophobic and Hydrophilic Interactions , Unilamellar Liposomes/metabolism