Results 1 - 20 of 31
1.
Proc Natl Acad Sci U S A ; 119(27): e2120333119, 2022 Jul 05.
Article in English | MEDLINE | ID: mdl-35776544

ABSTRACT

Conventional machine-learning (ML) models in computational chemistry learn to directly predict molecular properties using quantum chemistry only for reference data. While these heuristic ML methods show quantum-level accuracy with speeds several orders of magnitude faster than traditional quantum chemistry methods, they suffer from poor extensibility and transferability; i.e., their accuracy degrades on large or new chemical systems. Incorporating quantum chemistry frameworks into the ML models directly solves this problem. Here we take the structure of semiempirical quantum mechanics (SEQM) methods to construct dynamically responsive Hamiltonians. SEQM methods use empirical parameters fitted to experimental properties to construct reduced-order Hamiltonians, facilitating much faster calculations than ab initio methods but with compromised accuracy. By replacing these static parameters with machine-learned dynamic values inferred from the local environment, we greatly improve the accuracy of the SEQM methods. Trained on molecular energies and atomic forces, these dynamically generated Hamiltonian parameters show a strong correlation with atomic hybridization and bonding. Trained with only about 60,000 small organic molecular conformers, the resulting model retains interpretability, extensibility, and transferability when testing on much larger chemical systems and predicting various molecular properties. Overall, this work demonstrates the virtues of incorporating physics-based descriptions with ML to develop models that are simultaneously accurate, transferable, and interpretable.

2.
PLoS Comput Biol ; 19(6): e1011075, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37289841

ABSTRACT

Interactions between stressed organisms and their microbiome environments may provide new routes for understanding and controlling biological systems. However, microbiomes are a form of high-dimensional data, with thousands of taxa present in any given sample, which makes untangling the interaction between an organism and its microbial environment a challenge. Here we apply Latent Dirichlet Allocation (LDA), a technique for language modeling, which decomposes the microbial communities into a set of topics (non-mutually-exclusive sub-communities) that compactly represent the distribution of full communities. LDA provides a lens into the microbiome at broad and fine-grained taxonomic levels, which we show on two datasets. In the first dataset, from the literature, we show how LDA topics succinctly recapitulate many results from a previous study on diseased coral species. We then apply LDA to a new dataset of maize soil microbiomes under drought, and find a large number of significant associations between the microbiome topics and plant traits as well as associations between the microbiome and the experimental factors, e.g., watering level. This yields new information on the plant-microbial interactions in maize and shows that the LDA technique is useful for studying the coupling between microbiomes and stressed organisms.


Subjects
Microbiota, Microbial Interactions, Phenotype
3.
BMC Bioinformatics ; 24(1): 441, 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-37990143

ABSTRACT

BACKGROUND: Correlation metrics are widely utilized in genomics analysis and often implemented with little regard to assumptions of normality, homoscedasticity, and independence of values. This is especially true when comparing values between replicated sequencing experiments that probe chromatin accessibility, such as assays for transposase-accessible chromatin via sequencing (ATAC-seq). Such data can possess several regions across the human genome with little to no sequencing depth and are thus non-normal with a large portion of zero values. Despite widespread use in the epigenomics field, few studies have evaluated and benchmarked how correlation and association statistics behave across ATAC-seq experiments with known differences or the effects of removing specific outliers from the data. Here, we developed a computational simulation of ATAC-seq data to elucidate the behavior of correlation statistics and to compare their accuracy under set conditions of reproducibility. RESULTS: Using these simulations, we monitored the behavior of several correlation statistics, including Pearson's R and Spearman's ρ coefficients as well as Kendall's τ and Top-Down correlation. We also tested the behavior of association measures, including the coefficient of determination R², Kendall's W, and normalized mutual information. Our experiments reveal an insensitivity of most statistics, including Spearman's ρ, Kendall's τ, and Kendall's W, to increasing differences between simulated ATAC-seq replicates. The removal of co-zeros (regions lacking mapped sequenced reads) between simulated experiments greatly improves the estimates of correlation and association. After removing co-zeros, the R² coefficient and normalized mutual information display the best performance, having a closer one-to-one relationship with the known portion of shared, enhanced loci between simulated replicates. When comparing values between experimental ATAC-seq data using a random forest model, mutual information best predicts ATAC-seq replicate relationships. CONCLUSIONS: Collectively, this study demonstrates how measures of correlation and association can behave in epigenomics experiments. We provide improved strategies for quantifying relationships in these increasingly prevalent and important chromatin accessibility assays.
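
The effect of co-zero removal described in this abstract can be illustrated with a minimal Python sketch. The bin counts are invented toy values, not data from the study, and `pearson_r` and `drop_co_zeros` are hypothetical helper names introduced only for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_co_zeros(x, y):
    """Keep only positions where at least one replicate has a nonzero count."""
    kept = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
    return [a for a, _ in kept], [b for _, b in kept]


# Toy read counts per genomic bin for two replicates (illustrative values only).
rep1 = [0, 0, 0, 0, 0, 0, 5, 1, 4, 2]
rep2 = [0, 0, 0, 0, 0, 0, 1, 5, 2, 4]

r_all = pearson_r(rep1, rep2)
r_nonzero = pearson_r(*drop_co_zeros(rep1, rep2))
print(f"with co-zeros:    r = {r_all:.3f}")
print(f"without co-zeros: r = {r_nonzero:.3f}")
```

In this toy case the shared empty bins make two replicates look positively correlated even though their covered bins are perfectly anticorrelated, which is the kind of distortion the abstract attributes to co-zeros.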


Subjects
Chromatin, High-Throughput Nucleotide Sequencing, Humans, Chromatin/genetics, Reproducibility of Results, Chromatin Immunoprecipitation Sequencing, Transposases/genetics, DNA Sequence Analysis
4.
J Phys Chem A ; 127(17): 3768-3778, 2023 May 04.
Article in English | MEDLINE | ID: mdl-37078657

ABSTRACT

Highly energetic electron-hole pairs (hot carriers) formed from plasmon decay in metallic nanostructures promise sustainable pathways for energy-harvesting devices. However, efficient collection before thermalization remains an obstacle to realizing their full energy-generating potential. Addressing this challenge requires detailed understanding of physical processes from plasmon excitation in the metal to their collection in a molecule or a semiconductor, where atomistic theoretical investigation may be particularly beneficial. Unfortunately, first-principles theoretical modeling of these processes is extremely costly, preventing a detailed analysis over a large number of potential nanostructures and limiting the analysis to systems with a few hundred atoms. Recent advances in machine-learned interatomic potentials suggest that dynamics can be accelerated with surrogate models which replace the full solution of the Schrödinger equation. Here, we modify an existing neural network, the Hierarchically Interacting Particle Neural Network (HIP-NN), to predict plasmon dynamics in Ag nanoparticles. The model takes a minimum of three time steps of the reference real-time time-dependent density functional theory (rt-TDDFT) calculated charges as history and predicts trajectories for 5 fs in close agreement with the reference simulation. Further, we show that a multistep training approach in which the loss function includes errors from future time-step predictions can stabilize the model predictions for the entire simulated trajectory (∼25 fs). This extends the model's capability to accurately predict plasmon dynamics in large nanoparticles of up to 561 atoms, not present in the training data set. More importantly, with machine learning models on GPUs, we gain a speed-up factor of ∼10^3 as compared with the rt-TDDFT calculations when predicting important physical quantities such as dynamic dipole moments in Ag55 and a factor of ∼10^4 for extended nanoparticles that are 10 times larger. This underscores the promise of future machine learning accelerated electron/nuclear dynamics simulations for understanding fundamental properties of plasmon-driven hot carrier devices.

5.
J Chem Phys ; 158(18), 2023 May 14.
Article in English | MEDLINE | ID: mdl-37158328

ABSTRACT

Atomistic machine learning focuses on the creation of models that obey fundamental symmetries of atomistic configurations, such as permutation, translation, and rotation invariances. In many of these schemes, translation and rotation invariance are achieved by building on scalar invariants, e.g., distances between atom pairs. There is growing interest in molecular representations that work internally with higher-rank rotational tensors, e.g., vector displacements between atoms, and tensor products thereof. Here, we present a framework for extending the Hierarchically Interacting Particle Neural Network (HIP-NN) with Tensor Sensitivity information (HIP-NN-TS) from each local atomic environment. Crucially, the method employs a weight tying strategy that allows direct incorporation of many-body information while adding very few model parameters. We show that HIP-NN-TS is more accurate than HIP-NN, with negligible increase in parameter count, for several datasets and network sizes. As the dataset becomes more complex, tensor sensitivities provide greater improvements to model accuracy. In particular, HIP-NN-TS achieves a record mean absolute error of 0.927 kcal/mol for conformational energy variation on the challenging COMP6 benchmark, which includes a broad set of organic molecules. We also compare the computational performance of HIP-NN-TS to HIP-NN and other models in the literature.

6.
J Chem Phys ; 159(11), 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37712780

ABSTRACT

Catalyzed by enormous success in the industrial sector, many research programs have been exploring data-driven, machine learning approaches. Performance can be poor when the model is extrapolated to new regions of chemical space, e.g., new bonding types or new many-body interactions. Another important limitation is the spatial locality assumption in model architecture, and this limitation cannot be overcome with larger or more diverse datasets. The outlined challenges are primarily associated with the lack of electronic structure information in surrogate models such as interatomic potentials. Given the fast development of machine learning and computational chemistry methods, we expect some limitations of surrogate models to be addressed in the near future; nevertheless, the spatial locality assumption will likely remain a limiting factor for their transferability. Here, we suggest focusing on an equally important effort: the design of physics-informed models that leverage domain knowledge and employ machine learning only as a corrective tool. In the context of materials science, we will focus on semi-empirical quantum mechanics, using machine learning to predict corrections to the reduced-order Hamiltonian model parameters. The resulting models are broadly applicable, retain the speed of semiempirical chemistry, and frequently achieve accuracy on par with much more expensive ab initio calculations. These early results indicate that future work, in which machine learning and quantum chemistry methods are developed jointly, may provide the best of all worlds for chemistry applications that demand both high accuracy and high numerical efficiency.

7.
J Chem Inf Model ; 61(8): 3846-3857, 2021 Aug 23.
Article in English | MEDLINE | ID: mdl-34347460

ABSTRACT

Machine learning (ML) plays a growing role in the design and discovery of chemicals, aiming to reduce the need to perform expensive experiments and simulations. ML for such applications is promising but difficult, as models must generalize to vast chemical spaces from small training sets and must have reliable uncertainty quantification metrics to identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often take advantage of differences between chemical conditions, rather than their absolute structure or state, to generate more reliable results. We have developed an analogous comparison-based approach for ML regression, called pairwise difference regression (PADRE), which is applicable to arbitrary underlying learning models and operates on pairs of input data points. During training, the model learns to predict differences between all possible pairs of input points. During prediction, the test points are paired with all training set points, giving rise to a set of predictions that can be treated as a distribution of which the mean is treated as a final prediction and the dispersion is treated as an uncertainty measure. Pairwise difference regression was shown to reliably improve the performance of the random forest algorithm across five chemical ML tasks. Additionally, the pair-derived dispersion is both well correlated with model error and performs well in active learning. We also show that this method is competitive with state-of-the-art neural network techniques. Thus, pairwise difference regression is a promising tool for candidate selection algorithms used in chemical discovery.
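
The train-on-pairs, predict-against-all-training-points idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the authors' implementation: the base learner is a one-parameter least-squares model on the feature difference (the abstract says any learner can be used), the data are invented, and `fit_difference_model` and `padre_predict` are hypothetical names.

```python
from statistics import mean, stdev


def fit_difference_model(xs, ys):
    """Fit y_i - y_j ~ w * (x_i - x_j) over all ordered training pairs."""
    num = den = 0.0
    for xi, yi in zip(xs, ys):
        for xj, yj in zip(xs, ys):
            dx, dy = xi - xj, yi - yj
            num += dx * dy
            den += dx * dx
    return num / den


def padre_predict(x_new, xs, ys, w):
    """Pair x_new with every training point; return mean and spread of the estimates."""
    # Each training point j gives one estimate: y_j plus the predicted difference.
    estimates = [yj + w * (x_new - xj) for xj, yj in zip(xs, ys)]
    return mean(estimates), stdev(estimates)


# Toy 1-D training data (illustrative): roughly y = 2x with small deviations.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.1, 1.9, 4.2, 5.8, 8.1]

w = fit_difference_model(train_x, train_y)
prediction, uncertainty = padre_predict(2.5, train_x, train_y, w)
print(f"w = {w:.3f}, prediction = {prediction:.3f}, dispersion = {uncertainty:.3f}")
```

The mean of the per-pair estimates plays the role of the final prediction, and their standard deviation is the pair-derived dispersion that the abstract uses as an uncertainty measure.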


Subjects
Algorithms, Machine Learning, Computer Neural Networks, Uncertainty
8.
J Chem Phys ; 154(24): 244108, 2021 Jun 28.
Article in English | MEDLINE | ID: mdl-34241371

ABSTRACT

The Hückel Hamiltonian is an incredibly simple tight-binding model known for its ability to capture qualitative physics phenomena arising from electron interactions in molecules and materials. Part of its simplicity arises from using only two types of empirically fit physics-motivated parameters: the first describes the orbital energies on each atom and the second describes electronic interactions and bonding between atoms. By replacing these empirical parameters with machine-learned dynamic values, we vastly increase the accuracy of the extended Hückel model. The dynamic values are generated with a deep neural network, which is trained to reproduce orbital energies and densities derived from density functional theory. The resulting model retains interpretability, while the deep neural network parameterization is smooth and accurate and reproduces insightful features of the original empirical parameterization. Overall, this work shows the promise of utilizing machine learning to formulate simple, accurate, and dynamically parameterized physics models.
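
To make the two parameter types concrete, the following Python sketch builds a simple Hückel matrix for a six-membered ring with one on-site energy (`ALPHA`) and one neighbour interaction (`BETA`), then checks the textbook ring-orbital energies against it. The numerical values are placeholders rather than fitted parameters, and the static constants stand in for the environment-dependent values that the paper generates with a neural network.

```python
import math


def huckel_ring(n_atoms, alpha, beta):
    """Hückel matrix for a ring: alpha on the diagonal, beta between bonded neighbours."""
    h = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        h[i][i] = alpha
        h[i][(i + 1) % n_atoms] = beta
        h[(i + 1) % n_atoms][i] = beta
    return h


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Illustrative parameters in eV (placeholders, not fitted values).
ALPHA, BETA, N = -11.0, -2.5, 6

hamiltonian = huckel_ring(N, ALPHA, BETA)

# For a ring, the orbitals are known analytically: c_j = cos(2*pi*k*j/N)
# with energy alpha + 2*beta*cos(2*pi*k/N). Verify H c = E c for each k.
energies = []
for k in range(N // 2 + 1):
    orbital = [math.cos(2 * math.pi * k * j / N) for j in range(N)]
    energy = ALPHA + 2 * BETA * math.cos(2 * math.pi * k / N)
    h_orbital = matvec(hamiltonian, orbital)
    assert all(abs(hc - energy * c) < 1e-9 for hc, c in zip(h_orbital, orbital))
    energies.append(energy)

print([round(e, 3) for e in energies])
```

Every orbital energy depends only on `ALPHA` and `BETA`, which is why replacing those two quantities with learned, geometry-dependent values is enough to change the whole spectrum.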

9.
J Chem Phys ; 153(10): 104502, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32933279

ABSTRACT

Predicting the functional properties of many molecular systems relies on understanding how atomistic interactions give rise to macroscale observables. However, current attempts to develop predictive models for the structural and thermodynamic properties of condensed-phase systems often rely on extensive parameter fitting to empirically selected functional forms whose effectiveness is limited to a narrow range of physical conditions. In this article, we illustrate how these traditional fitting paradigms can be superseded using machine learning. Specifically, we use the results of molecular dynamics simulations to train machine learning protocols that are able to produce the radial distribution function, pressure, and internal energy of a Lennard-Jones fluid with increased accuracy in comparison to previous theoretical methods. The radial distribution function is determined using a variant of the segmented linear regression with the multivariate function decomposition approach developed by Craven et al. [J. Phys. Chem. Lett. 11, 4372 (2020)]. The pressure and internal energy are determined using expressions containing the learned radial distribution function and also a kernel ridge regression process that is trained directly on thermodynamic properties measured in simulation. The presented results suggest that the structural and thermodynamic properties of fluids may be determined more accurately through machine learning than through human-guided functional forms.

10.
J Chem Phys ; 148(24): 241715, 2018 Jun 28.
Article in English | MEDLINE | ID: mdl-29960311

ABSTRACT

We introduce the Hierarchically Interacting Particle Neural Network (HIP-NN) to model molecular properties from datasets of quantum calculations. Inspired by a many-body expansion, HIP-NN decomposes properties, such as energy, as a sum over hierarchical terms. These terms are generated from a neural network (a composition of many nonlinear transformations) acting on a representation of the molecule. HIP-NN achieves state-of-the-art performance on a dataset of 131k ground state organic molecules and predicts energies with 0.26 kcal/mol mean absolute error. With minimal tuning, our model is also competitive on a dataset of molecular dynamics trajectories. In addition to enabling accurate energy predictions, the hierarchical structure of HIP-NN helps to identify regions of model uncertainty.

11.
J Chem Phys ; 148(24): 241733, 2018 Jun 28.
Article in English | MEDLINE | ID: mdl-29960353

ABSTRACT

The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub) which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original random sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
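
The Query by Committee selection step described in this abstract can be sketched as follows. This is a simplified illustration: the energies, the threshold, and the function names are hypothetical, and the disagreement measure is a plain standard deviation without any normalization by system size that a production workflow might apply.

```python
from statistics import pstdev


def committee_disagreement(predictions):
    """Standard deviation of the committee's energy predictions for one structure."""
    return pstdev(predictions)


def select_for_labeling(candidates, threshold):
    """Return the names of structures whose committee disagreement exceeds threshold."""
    return [name for name, preds in candidates.items()
            if committee_disagreement(preds) > threshold]


# Hypothetical energies (kcal/mol) predicted by a 4-member ensemble.
candidates = {
    "conformer_a": [-10.1, -10.0, -10.2, -10.1],
    "conformer_b": [-3.0, -5.5, -1.2, -4.1],
    "conformer_c": [7.9, 8.0, 8.1, 8.0],
}

selected = select_for_labeling(candidates, threshold=0.5)
print(selected)
```

Only structures on which the ensemble members disagree are sent for new reference calculations, which is how the approach avoids labeling configurations the potential already describes well.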

12.
J Chem Theory Comput ; 20(2): 891-901, 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38168674

ABSTRACT

A light-matter hybrid quasiparticle, called a polariton, is formed when molecules are strongly coupled to an optical cavity. Recent experiments have shown that polariton chemistry can manipulate chemical reactions. Polariton chemistry is a collective phenomenon, and its effects increase with the number of molecules in a cavity. However, simulating an ensemble of molecules in the excited state coupled to a cavity mode is theoretically and computationally challenging. Recent advances in machine learning (ML) techniques have shown promising capabilities in modeling ground-state chemical systems. This work presents a general protocol to predict excited-state properties, such as energies, transition dipoles, and nonadiabatic coupling vectors, with the Hierarchically Interacting Particle Neural Network. ML predictions are then applied to compute the potential energy surfaces and electronic spectra of a prototype azomethane molecule in the collective coupling scenario. These computational tools provide a much-needed framework to model and understand the emerging excited-state polariton chemistry of many-molecule systems.

13.
J Chem Theory Comput ; 20(3): 1274-1281, 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38307009

ABSTRACT

Methodologies for training machine learning potentials (MLPs) with quantum-mechanical simulation data have recently seen tremendous progress. Experimental data have a very different character than simulated data, and most MLP training procedures cannot be easily adapted to incorporate both types of data into the training process. We investigate a training procedure based on iterative Boltzmann inversion that produces a pair potential correction to an existing MLP using equilibrium radial distribution function data. By applying these corrections to an MLP for pure aluminum based on density functional theory, we observe that the resulting model largely addresses previous overstructuring in the melt phase. Interestingly, the corrected MLP also exhibits improved performance in predicting experimental diffusion constants, which are not included in the training procedure. The presented method does not require autodifferentiating through a molecular dynamics solver and does not make assumptions about the MLP architecture. Our results suggest a practical framework for incorporating experimental data into machine learning models to improve the accuracy of molecular dynamics simulations.
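
A single iterative Boltzmann inversion step of the kind this abstract builds on can be sketched in Python. The update shown is the standard textbook form, ΔV(r) ← ΔV(r) + k_B T ln[g_model(r) / g_target(r)]; the grid, the RDF values, the `damping` factor, and the function name are assumptions made for illustration and are not taken from the paper.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def ibi_update(correction, g_model, g_target, temperature, damping=1.0, g_min=1e-8):
    """One iterative Boltzmann inversion step on a radial grid.

    correction[i] is the current pair-potential correction at grid point i;
    g_model and g_target are the simulated and reference RDFs at the same points.
    """
    updated = []
    for dv, gm, gt in zip(correction, g_model, g_target):
        if gm < g_min or gt < g_min:
            # The RDF vanishes inside the core, so the logarithm is undefined:
            # leave the correction unchanged there.
            updated.append(dv)
        else:
            updated.append(dv + damping * K_B * temperature * math.log(gm / gt))
    return updated


# Toy RDF values on a 5-point grid (illustrative, not from the paper).
g_target = [0.0, 0.8, 2.4, 1.1, 1.0]
g_model = [0.0, 0.8, 2.9, 0.9, 1.0]  # first peak too high: overstructured
correction = [0.0] * 5

correction = ibi_update(correction, g_model, g_target, temperature=1000.0)
print([round(v, 4) for v in correction])
```

Where the model RDF is too high the correction becomes repulsive, and where it is too low the correction becomes attractive; in practice the step is repeated, with a new simulation after each update, until the two RDFs agree.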

14.
Nat Chem ; 16(5): 727-734, 2024 May.
Article in English | MEDLINE | ID: mdl-38454071

ABSTRACT

Atomistic simulation has a broad range of applications from drug design to materials discovery. Machine learning interatomic potentials (MLIPs) have become an efficient alternative to computationally expensive ab initio simulations. For this reason, chemistry and materials science would greatly benefit from a general reactive MLIP, that is, an MLIP that is applicable to a broad range of reactive chemistry without the need for refitting. Here we develop a general reactive MLIP (ANI-1xnr) through automated sampling of condensed-phase reactions. ANI-1xnr is then applied to study five distinct systems: carbon solid-phase nucleation, graphene ring formation from acetylene, biofuel additives, combustion of methane and the spontaneous formation of glycine from early earth small molecules. In all studies, ANI-1xnr closely matches experiment (when available) and/or previous studies using traditional model chemistry methods. As such, ANI-1xnr proves to be a highly general reactive MLIP for C, H, N and O elements in the condensed phase, enabling high-throughput in silico reactive chemistry experimentation.

15.
J Chem Theory Comput ; 19(11): 3209-3222, 2023 Jun 13.
Article in English | MEDLINE | ID: mdl-37163680

ABSTRACT

Extended Lagrangian Born-Oppenheimer molecular dynamics (XL-BOMD) in its most recent shadow potential energy version has been implemented in the semiempirical PyTorch-based software PySeQM. The implementation includes finite electronic temperatures, canonical density matrix perturbation theory, and an adaptive Krylov subspace approximation for the integration of the electronic equations of motion within the XL-BOMD approach (KSA-XL-BOMD). The PyTorch implementation leverages GPUs and machine learning hardware accelerators for the simulations. The new XL-BOMD formulation allows studying more challenging chemical systems with charge instabilities and low electronic energy gaps. The current public release of PySeQM continues our development of a modular architecture for large-scale simulations employing semi-empirical quantum-mechanical treatment. Applied to a molecular dynamics simulation of 840 carbon atoms, one integration time step executes in 4 s on a single Nvidia RTX A6000 GPU.

16.
Nat Comput Sci ; 3(3): 230-239, 2023 Mar.
Article in English | MEDLINE | ID: mdl-38177878

ABSTRACT

Machine learning (ML) models, if trained on data sets of high-fidelity quantum simulations, produce accurate and efficient interatomic potentials. Active learning (AL) is a powerful tool to iteratively generate diverse data sets. In this approach, the ML model provides an uncertainty estimate along with its prediction for each new atomic configuration. If the uncertainty estimate passes a certain threshold, then the configuration is included in the data set. Here we develop a strategy to more rapidly discover configurations that meaningfully augment the training data set. The approach, uncertainty-driven dynamics for active learning (UDD-AL), modifies the potential energy surface used in molecular dynamics simulations to favor regions of configuration space for which there is large model uncertainty. The performance of UDD-AL is demonstrated for two AL tasks: sampling the conformational space of glycine and sampling the promotion of proton transfer in acetylacetone. The method is shown to efficiently explore the chemically relevant configuration space, which may be inaccessible using regular dynamical sampling at target temperature conditions.


Subjects
Fabaceae, Uncertainty, Glycine, Machine Learning, Molecular Dynamics Simulation
17.
Sci Rep ; 13(1): 16262, 2023 Sep 27.
Article in English | MEDLINE | ID: mdl-37758757

ABSTRACT

Throughout computational science, there is a growing need to utilize the continual improvements in raw computational horsepower to achieve greater physical fidelity through scale-bridging over brute-force increases in the number of mesh elements. For instance, quantitative predictions of transport in nanoporous media, critical to hydrocarbon extraction from tight shale formations, are impossible without accounting for molecular-level interactions. Similarly, inertial confinement fusion simulations rely on numerical diffusion to simulate molecular effects such as non-local transport and mixing without truly accounting for molecular interactions. With these two disparate applications in mind, we develop a novel capability which uses an active learning approach to optimize the use of local fine-scale simulations for informing coarse-scale hydrodynamics. Our approach addresses three challenges: forecasting continuum coarse-scale trajectory to speculatively execute new fine-scale molecular dynamics calculations, dynamically updating coarse-scale from fine-scale calculations, and quantifying uncertainty in neural network models.

18.
Phys Rev E ; 105(4-2): 045301, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35590626

ABSTRACT

We propose a data-driven method to describe consistent equations of state (EOS) for arbitrary systems. Complex EOS are traditionally obtained by fitting suitable analytical expressions to thermophysical data. A key aspect of EOS is that the relationships between state variables are given by derivatives of the system free energy. In this work, we model the free energy with an artificial neural network and utilize automatic differentiation to directly learn the derivatives of the free energy. We demonstrate this approach on two different systems, the analytic van der Waals EOS and published data for the Lennard-Jones fluid, and we show that it is advantageous over direct learning of thermodynamic properties (i.e., not as derivatives of the free energy but as independent properties), in terms of both accuracy and the exact preservation of the Maxwell relations. Furthermore, the method implicitly provides the free energy of a system without explicit integration.
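
The central point of this abstract, that state variables follow from derivatives of a single free-energy function, can be checked on the analytic van der Waals case with a short Python sketch. It is an illustration only: a central finite difference stands in for the automatic differentiation used in the paper, the closed-form free energy stands in for the neural network, and the CO2 constants are approximate literature values.

```python
import math

R = 8.314462618  # gas constant, J/(mol K)


def helmholtz_vdw(volume, temperature, a, b):
    """Molar Helmholtz free energy of a van der Waals fluid, up to a T-only term."""
    return -R * temperature * math.log(volume - b) - a / volume


def pressure_from_free_energy(free_energy, volume, temperature, h=1e-8, **params):
    """P = -(dF/dV)_T via a central finite difference (stand-in for autodiff)."""
    f_plus = free_energy(volume + h, temperature, **params)
    f_minus = free_energy(volume - h, temperature, **params)
    return -(f_plus - f_minus) / (2 * h)


# Approximate van der Waals constants for CO2 in SI units (illustrative).
A_CO2, B_CO2 = 0.364, 4.27e-5
V, T = 1.0e-3, 300.0  # m^3/mol, K

p_derived = pressure_from_free_energy(helmholtz_vdw, V, T, a=A_CO2, b=B_CO2)
p_analytic = R * T / (V - B_CO2) - A_CO2 / V**2
print(f"derived: {p_derived:.1f} Pa, analytic: {p_analytic:.1f} Pa")
```

Because the pressure is obtained from the free energy rather than fitted separately, any other property derived from the same function is automatically consistent with it, which is the advantage the abstract claims over learning each property independently.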

19.
Sci Data ; 9(1): 579, 2022 Oct 03.
Article in English | MEDLINE | ID: mdl-36192410

ABSTRACT

Physical processes that occur within porous materials have wide-ranging applications including, but not limited to, carbon sequestration, battery technology, membranes, oil and gas, geothermal energy, nuclear waste disposal, and water resource management. The equations that describe these physical processes have been studied extensively; however, approximating them numerically requires immense computational resources due to the complex behavior that arises from the geometrically intricate solid boundary conditions in porous materials. Here, we introduce a new dataset of unprecedented scale and breadth, DRP-372: a catalog of 3D geometries, simulation results, and structural properties of samples hosted on the Digital Rocks Portal. The dataset includes 1736 flow and electrical simulation results on 217 samples, which required more than 500 core years of computation. These data can be used for many purposes, such as constructing empirical models, validating new simulation codes, and developing machine learning algorithms that closely match the extensive purely physical simulation. This article offers a detailed description of the contents of the dataset including the data collection, simulation schemes, and data validation.

20.
Nat Rev Chem ; 6(9): 653-672, 2022 Sep.
Article in English | MEDLINE | ID: mdl-37117713

ABSTRACT

Machine learning (ML) is becoming a method of choice for modelling complex chemical processes and materials. ML provides a surrogate model trained on a reference dataset that can be used to establish a relationship between a molecular structure and its chemical properties. This Review highlights developments in the use of ML to evaluate chemical properties such as partial atomic charges, dipole moments, spin and electron densities, and chemical bonding, as well as to obtain a reduced quantum-mechanical description. We overview several modern neural network architectures, their predictive capabilities, generality and transferability, and illustrate their applicability to various chemical properties. We emphasize that learned molecular representations resemble quantum-mechanical analogues, demonstrating the ability of the models to capture the underlying physics. We also discuss how ML models can describe non-local quantum effects. Finally, we conclude by compiling a list of available ML toolboxes, summarizing the unresolved challenges and presenting an outlook for future development. The observed trends demonstrate that this field is evolving towards physics-based models augmented by ML, which is accompanied by the development of new methods and the rapid growth of user-friendly ML frameworks for chemistry.
