ABSTRACT
The accurate yet fast calculation of molecular excited states remains a very challenging topic. For many applications, detailed knowledge of the energy funnel in larger molecular aggregates is of key importance, requiring highly accurate excitation energies. To this end, machine learning techniques can be a very useful tool, though the cost of generating highly accurate training data sets remains a severe challenge. To overcome this hurdle, this work proposes the use of multifidelity machine learning, in which a small amount of training data at a high level of accuracy is combined with cheaper, less accurate data to achieve the accuracy of the costlier level. In the present study, the approach is employed to predict vertical excitation energies to the first excited state for three molecules of increasing size, namely benzene, naphthalene, and anthracene. The energies are trained and tested on conformations stemming from classical molecular dynamics and density-functional-based tight-binding simulations. The multifidelity machine learning model achieves the same accuracy as a machine learning model built only on high-cost training data while expending much less computational effort to generate the data. The numerical gain observed in these benchmark test calculations was over a factor of 30 and can be expected to be considerably higher for more accurate reference data.
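To illustrate the multifidelity idea described above, the following is a minimal sketch in which a large, cheap training set models a low-fidelity surface and a small, expensive one models only the correction to the high-fidelity surface. All functions, sample sizes, and hyperparameters here are synthetic stand-ins chosen for illustration, not the authors' actual data or settings:

```python
import numpy as np

def krr_fit(X, y, gamma=0.5, lam=1e-4):
    """Gaussian-kernel ridge regression: solve (K + lam*I) alpha = y."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: np.exp(-gamma * (Xq[:, None] - X[None, :]) ** 2) @ alpha

rng = np.random.default_rng(0)
# Toy 1D stand-ins for two electronic-structure fidelities: the "high"
# surface differs from the "low" one only by a smooth correction.
low  = lambda x: np.sin(x)            # cheap, less accurate level
high = lambda x: np.sin(x) + 0.2 * x  # costly, accurate level

X_low  = rng.uniform(0, 5, 200)  # many cheap training points
X_high = rng.uniform(0, 5, 10)   # only a few expensive ones

base  = krr_fit(X_low, low(X_low))                   # model of the cheap surface
delta = krr_fit(X_high, high(X_high) - low(X_high))  # model of the correction
predict = lambda Xq: base(Xq) + delta(Xq)            # multifidelity estimate

X_test  = np.linspace(0.5, 4.5, 50)
mae     = np.abs(predict(X_test) - high(X_test)).mean()   # multifidelity error
mae_low = np.abs(base(X_test) - high(X_test)).mean()      # low-fidelity-only error
```

The design point is that the low-to-high correction is typically smoother than either surface itself, so it can be learned from far fewer expensive samples than a direct model of the high-fidelity data would need.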
ABSTRACT
Inspired by Pople diagrams popular in quantum chemistry, we introduce a hierarchical scheme, based on the multilevel combination (C) technique, to combine various levels of approximations made when molecular energies are calculated. When combined with quantum machine learning (QML) models, the resulting CQML model is a generalized unified recursive kernel ridge regression that exploits correlations implicitly encoded in training data composed of multiple levels in multiple dimensions. Here, we have investigated up to three dimensions: chemical space, basis set, and electron correlation treatment. Numerical results have been obtained for atomization energies of a set of ~7000 organic molecules with up to 7 atoms (not counting hydrogens) containing CHONFClS, as well as for ~6000 constitutional isomers of C7H10O2. CQML learning curves for atomization energies suggest a dramatic reduction in necessary training samples calculated with the most accurate and costly method. In order to generate millisecond estimates of CCSD(T)/cc-pVDZ atomization energies with prediction errors reaching chemical accuracy (~1 kcal/mol), the CQML model requires only ~100 training instances at CCSD(T)/cc-pVDZ level, rather than thousands within conventional QML, while more training molecules are required at lower levels. Our results suggest a possibly favorable trade-off between various hierarchical approximations whose computational cost scales differently with electron number.
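The recursive structure of such a multilevel model can be sketched along a single dimension (the electron-correlation hierarchy): each level is trained only on the residual that all lower levels fail to capture, with training-set sizes shrinking as the level gets costlier. This is a hedged, one-dimensional toy, not the authors' CQML implementation, which combines levels across chemical space, basis set, and correlation treatment simultaneously; the mock "levels" and sample counts below are invented for illustration:

```python
import numpy as np

def krr_fit(X, y, gamma=0.5, lam=1e-4):
    """Gaussian-kernel ridge regression fit; returns a predictor closure."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: np.exp(-gamma * (Xq[:, None] - X[None, :]) ** 2) @ alpha

rng = np.random.default_rng(1)
# Mock hierarchy of "theory levels" (stand-ins for, e.g., HF -> MP2 -> CCSD(T)):
# each level adds a smooth correction on top of the previous one.
hierarchy = [
    lambda x: np.cos(x),
    lambda x: np.cos(x) + 0.3 * np.sin(2 * x),
    lambda x: np.cos(x) + 0.3 * np.sin(2 * x) + 0.1 * x,  # most accurate level
]
n_train = [200, 40, 10]  # training-set sizes shrink as the level gets costlier

models = []
for level, n in zip(hierarchy, n_train):
    X = rng.uniform(0, 5, n)
    residual = level(X) - sum(m(X) for m in models)  # what lower levels miss
    models.append(krr_fit(X, residual))

cqml = lambda Xq: sum(m(Xq) for m in models)  # recursive multilevel prediction

X_test = np.linspace(0.5, 4.5, 50)
mae = np.abs(cqml(X_test) - hierarchy[-1](X_test)).mean()
```

Summing the learned residual models reproduces the top-level surface while concentrating the expensive evaluations at the bottom of the hierarchy, mirroring the trade-off the abstract describes.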