Predicting atmospheric optical properties for radiative transfer computations using neural networks

The radiative transfer equations are well-known, but radiation parametrizations in atmospheric models are computationally expensive. A promising tool for accelerating parametrizations is the use of machine learning techniques. In this study, we develop a machine learning-based parametrization for the gaseous optical properties by training neural networks to emulate the Rapid Radiative Transfer model for General circulation Model applications - Parallel (RRTMGP). To minimize computational costs, we reduce the range of atmospheric conditions for which the neural networks are applicable and use machine-specific optimised BLAS functions to accelerate matrix computations. To generate training data, we use a set of randomly perturbed atmospheric profiles and calculate optical properties using RRTMGP. Predicted optical properties are highly accurate and the resulting radiative fluxes have errors within 1.2 W m$^{-2}$ for longwave and 0.12 W m$^{-2}$ for shortwave radiation. Our parametrization is 3 to 7 times faster than RRTMGP, depending on the size of the neural networks. We further test the trade-off between speed and accuracy by training neural networks for a single LES case, so smaller and therefore faster networks can achieve a desired accuracy, especially for shortwave radiation. We conclude that our machine learning-based parametrization can speed-up radiative transfer computations whilst retaining high accuracy.

The radiative transfer equations are well-known, but radiation parametrizations in atmospheric models are computationally expensive. A promising tool for accelerating parametrizations is the use of machine learning techniques. In this study, we develop a machine learning-based parametrization for the gaseous optical properties by training neural networks to emulate the Rapid Radiative Transfer model for General circulation Model applications -Parallel (RRTMGP). To minimize computational costs, we reduce the range of atmospheric conditions for which the neural networks are applicable and use machine-specific optimised BLAS functions to accelerate matrix computations. To generate training data, we use a set of randomly perturbed atmospheric profiles and calculate optical properties using RRTMGP. Predicted optical properties are highly accurate and the resulting radiative fluxes have errors within 1.2 W m −2 for longwave and 0.12 W m −2 for shortwave radiation. Our parametrization is 3 to 7 times faster than RRTMGP, depending on the size of the neural networks. We further test the trade-off between speed and accuracy by training neural networks for a single LES case, so smaller and therefore faster networks can achieve a desired accuracy, especially for shortwave radiation. We conclude that our machine learningbased parametrization can speed-up radiative transfer computations whilst retaining high accuracy.

Introduction
Accurate calculations of radiative fluxes are key to capturing the coupling between radiation, the atmosphere, and the surface. Unlike many parametrizations of subgrid processes, the radiative transfer equations are well-known and the accuracy of radiative transfer computations therefore depends mainly on the available input. However, fully computing the radiative fluxes requires solving the radiative transfer equation for all spectral lines and other absorption features in the solar and thermal spectrum. Direct integration across the spectrum requires enormous numbers of computations to resolve each line; this integration is often parameterized with correlated-k distribution methods [1][2][3] to drastically reduce the number of quadrature points. Even with this approximation radiative transfer schemes in weather and climate models remain a large computational burden. An important part of radiative transfer parametrizations is therefore to find approaches or further approximations that further reduce the computational costs, for example by coarsening the spatial and temporal resolution of the radiative transfer computations [4,5] or random sampling in spectral space [6].
A promising and increasingly explored approach to accelerate or improve parametrizations is the use of machine-learning techniques [8]. The application of machine learning to accelerate expensive radiative transfer computations is the first use of machine learning in the atmospheric sciences [9] and several studies have already applied machine learning to predict vertical profiles of longwave [9][10][11][12][13][14] and shortwave [11,12,14] radiative fluxes in weather and climate models. This end-to-end approach, i.e. predicting radiative fluxes by fully emulating a radiative transfer scheme, may result in speed-ups of more than one order of magnitude [9,10,12]. However, such an approach is not very flexible with respect to changes in the vertical grid as the neural networks are trained for a fixed number of vertical layers [9][10][11][12]. Moreover, this approach does not fully respect the well-understood underlying physics as it also involves replacing the two-stream [15] equations that converge to the well-known radiative transfer equations.
In this study we present a machine-learning approach for accelerating radiative transfer computations that respects the radiative transfer equation while emulating only the most ad hoc parts. Radiative transfer solvers require information on the optical properties of the atmosphere to compute upwelling and downwelling radiative fluxes. These atmospheric optical properties determine how much radiation is emitted (Plank source function), absorbed or scattered (optical depth, single scattering albedo), and the direction of scattering (asymmetry parameter). Our aim is to emulate the gaseous optical properties calculations of the Rapid Radiative Transfer Model for General circulation model applications -Parallel (RRTMGP) [16]. We use neural networks as a computationally efficient tool to replace the lookup-tables used in RRTMGP and train the networks for a narrowed range of atmospheric conditions, e.g. by neglecting variations of many trace gases and limiting the range of temperatures and pressures. Constraining the range and number of inputs allows further optimisation of the neural networks, which helps to reduce the computational costs of our parametrization compared to RRTMGP. This approximation can be particularly suitable for limited-area models, such as large-eddy simulations (LES), in which the range of values of thermodynamic variables is smaller than in numerical weather predictions or climate simulations and concentrations of many trace gases can often be assumed in time and space.

Training data generation
We train three sets of artificial neural networks to predict all optical properties: • One set (NWP) that is trained for a wide range of atmospheric conditions, roughly representing the variability expected in numerical weather prediction, but with all gases except water vapour and ozone kept constant; Due to the relatively small variability of atmospheric conditions within one LES case, the LEStuned networks may contain substantially fewer layers and nodes than the NWP-tuned networks and can therefore be faster. This may provide a great speed-up, when the LES case is used for a large number of numerical experiments.
To generate the data required to train and validate the neural networks we use the new radiation package RTE+RRTMGP [16]. For each grid cell in the domain, RRTMGP calculates all optical properties from temperature (T ), pressure (p), and the concentrations of a wide range of gases. Given these optical properties, RTE (Radiative Transfer for Energetics) calculates the radiative fluxes throughout each column.
RRTMGP covers the spectral range of radiation relevant to atmospheric problems using a correlated k-distribution [3] with 14 shortwave (200 nm-12 µm) bands, 16 longwave (12 µm-1000 µm) bands, and 16 g-points per band. We therefore need to predict 224 (14 × 16) values for the short wave optical properties and 256 (16 × 16) for the longwave optical depth. For the Planck source function we also need to predict the upward and downward emission at each layer interface, resulting in 768 (3 × 16 × 16) values.
The NWP-tuned neural networks are trained with only T , p, H 2 O and O 3 as input, which are time-varying 3D variables in global weather prediction models. To create training data, we use the set of 100 atmospheric profiles of temperature, pressure and several gas concentrations from the Radiative Forcing Model Intercomparison Project (RFMIP) [17], but with all gas concentrations except H 2 O and O 3 set constant to reduce the degrees of freedom. These 100 profiles were chosen such that the weighted sum of the radiative fluxes through the profiles represents the global mean radiative flux [17] but do not represent the full diversity of atmospheric conditions on earth. Therefore, we generate extra data spanning a wider range of atmospheric conditions, by generating permutations of this set of 100 profiles with random perturbations in T , p, H 2 O and O 3 : where r 1 , r 2 , r 3 and r 4 are random numbers between −1 and 1 and i is the index of each layer. Since the four random numbers are generated independently, the perturbations are uncorrelated. This gives a larger variation of different atmospheric conditions in the training data and reduces the risk of overfitting but may result in combinations of T , p, H 2 O and O 3 that are unlikely to occur in reality. To reduce the risk of generating unrealistic data, we recompute H 2 O with another random number whenever H 2 O is larger than the saturation water vapour mixing ratio. The surface temperature is randomly chosen between T 0 −10 K and T 0 +10 K, where T 0 is the temperature at the lowest pressure level. For the two LES-tuned sets of neural networks, we compute O 3 as a monotonic function of pressure following [18] with a lower boundary of 5 × 10 −3 ppmv. Consequently, we train these networks using only T , p and H 2 O. For the first LES-tuned set (Cabauw), we run a 10-hour LES simulation (07 UTC to 17 UTC) of a developing convective boundary layer over grassland near the Cabauw Experimental Site for Atmospheric Research (CESAR) in the Netherlands with shallow cumulus clouds forming in the afternoon (see [19] and [20] for a detailed description of the case). The simulation is performed using the Dutch Atmospheric Large-Eddy Simulation (DALES) [21], with domain size of 19.2 × 19.2 × 5.47 km 3 and a resolution of 100 × 100 × 24 m 3 . For the second LES-tuned set (RCEMIP), we run a 100-day simulation with MicroHH [22] following the specification for cloud resolving models (see RCE_small300 in [18]) of the Radiative Convective Equilibrium Model Intercomparison Project (RCEMIP) [18]. This is a case with deep convection over a tropical ocean with an atmosphere in radiative convective equilibrium, meaning that radiative cooling is balanced by convective heating [18]. The simulation is performed with a domain size of 100 × 100 × 32 km 3 , a horizontal resolution of 1 km and 72 vertical levels. For each LES-tuned set, we then determine the minimum and maximum H 2 O and T values of each vertical layer and subsequently generate 1000 random profiles of p, H 2 O and T to cover the full parameter space of the corresponding simulation. To deal with negative or unrealistically low water vapour concentrations in the simulations, we set a lower H 2 O limit in these profiles of 16 ppmv and 5 ppmv for the Cabauw and RCEMIP simulation, respectively.
For every combination of T , p, O 3 and H 2 O (NWP) or T , p, and H 2 O (Cabauw, RCEMIP), we then calculate the optical properties at each g-point using RRTMGP. To improve convergence of the neural network, we log-scale the optical depths, Planck source function, H 2 O, (O 3 ,) and p during both training and inference using a fast approximation of the natural logarithm to keep the computational effort feasible: where we take n = 16. This approximation follows from the fast approximation of the exponential we use during inference (Eq. 3.2). Subsequently, we normalize all variables to a zero mean and unit variance, with different means and standard deviations for the upper (p <9948 Pa) and lower (p >9948 Pa) atmosphere per variable. A random 90% of the dataset is used to train the neural networks and 5% of the data is used for validation. The remaining 5% was reserved to test the trained networks, but is no longer used as we use a newly generated set of profiles for the tests presented in this study.

Artificial neural networks (a) Network architecture and training
The neural networks are designed in and trained with TensorFlow [23], version 1.11/1.12. We need to predict 4 different optical properties: the single scattering albedo w 0 (shortwave only), the shortwave τsw and longwave τ lw optical depth, and the Planck source function B. For each optical property we train two neural networks, for the upper atmosphere (p <9948 Pa) and lower (p >9948 Pa) atmosphere, because this distinction is also made in RRTMGP. By training seperate neural networks for each optical property we can reduce the complexity of each network. Furthermore, this allows us to (re-)train the networks for the different optical properties independently. The neural networks are trained to predict the optical properties of each grid cell or layer from the values of T , p, (O 3 ), and H 2 O of only that grid cell or layer. As such, each network has either 4 (NWP-tuned set) or 3 (LES-tuned sets) inputs and either 224 (τsw, w 0 ), 256 (τ lw ) or 768 (B) outputs.
All networks are feed-forward multi-layer perceptrons with densely-connected layers. We use a leaky ReLu activation function [24] with a slope of 0.2 in all hidden layers and a linear activation function in the output layer. We train for 5 × 10 6 steps or 658 epochs, where one epoch is one iteration over the entire training dataset. During training, we use the Adam optimiser [25] to optimise the weights, with a batch size of 128 and an initial learning rate of 0.01 that decays every 10 epochs. As the loss function, we use the mean squared error (MSE): where N batch and N gpt are the batch size and number of g-points, respectively. N N and RR are the optical properties predicted by the neural networks and calculated by RRTMGP, respectively. Although several studies have already shown successful attempts to optimise neural network architecture for accuracy with machine learning [26,27], choosing the numbers of hidden layers and the number of nodes per layer is often still a matter of manual tuning. Wider and deeper network are able to learn more complex functions, with the risk of overfitting, but are slower during both training and inference. We test several network sizes (Table 1) to investigate the trade off between accuracy and performance for different network sizes. This includes a network without hidden layers (Linear) that performs only linear regression as a reference.

(b) Implementation in radiative transfer solver
We use the trained neural networks as a parametrization for the gas optics in the RTE+RRTMGP framework. We replace the RRTMGP gas optics and source function routine and pair the new parametrization with the RTE radiative transfer solver. The new parametrization  The matrix products are the computationally most expensive parts of the neural network-solver. We therefore make use of the Level 3 functions of the BLAS library for which machine-specific optimised versions exist. In our implementation, we use the C-interface to the sgemm function of Intel's Math Kernel Library (MKL). For the exponentiation, we use a fast approximation in line with the natural logarithm approximation (eq. 2.1) neural networks is the evolution of the MSE as training progresses (Fig. 1). For each of the four optical properties, the MSE of the linear networks remains approximately constant throughout the training. It must be noted here that Figure 1 only shows the MSE of the evaluations, which is done every 50.000 training steps (≈6.6 epochs) using the full evaluation dataset, meaning that the linear networks have mostly converged before the first evaluation step. However, the MSE of the linear networks is over an order of magnitude larger than the MSE of the other networks, which suggests that a linear approach is not suitable here.
Comparing the neural networks with one or more hidden layers, we generally see that as the size of the networks increases, the number of epochs needed to reach convergence increases and the MSE at the end of the training decreases. The relatively large difference between the MSE's of the 2L-64_64 and 3L-32_64_128 networks suggests that we may still be able to strongly reduce the MSE by increasing the network size. This was not done, however, because the higher computational costs of larger networks will also limit the speed-up that can be achieved. Interestingly, the 1L_64 networks perform slightly better than the 2L_32_32 networks despite having the same number of hidden nodes: due to the larger number of output nodes (224/256), the 1L_64 networks have more connections between nodes. This means that more complex functions can be learned, but may also lead to higher computational costs.
To give a more intuitive measure of the predictive skill of the network, we generate a new set of 100 randomly perturbed profiles and calculate R-squared values (R 2 ) between the optical properties computed by the neural networks and by RRTMGP. The R 2 -values are determined for each g-point separately and subsequently averaged over all 224 or 256 g-points (R 2 ) to represent the overall performance . The networks with one or more hidden layers typically have R 2 > 0.9998 for the optical depths and the Planck source function and R 2 > 0.998 for the single scattering albedo. These high correlation coefficients give us confidence that the neural networks are able to predict the optical properties with very high accuracy. Averaged over the four optical properties, the Linear networks only have R 2 ≈ 0.95 and are thus clearly less accurate than the other networks. Although this R 2 also seems very high, this accuracy is already insufficient obtain accurate radiative fluxes (Section ii). Although the neural networks have high predictive skill, for atmospheric modelling applications we are mainly interested in the radiative fluxes through the atmosphere and at the surface. The errors of the radiative fluxes that follow from the predicted optical properties therefore help to assess whether the achieved accuracy of the neural networks is sufficient. To this end, we use the implementation of the neural networks in RTE+RRTMGP to calculate radiative fluxes based on the optical properties predicted by the neural networks.
The optical properties of the linear networks result in flux errors over 5 W m −2 (not shown), which is significantly larger than the flux errors of the other networks and not acceptable for application in NWP or LES. For this reason, the linear networks will not be included in further analyses. For the other network sizes (Table 1), the errors generally decrease as the complexity of the neural networks increases (Fig. 2). Interestingly, the mean flux errors differ only slightly, whereas the spread of the flux errors is clearly smaller for the more complex networks. This indicates that reducing the size of the neural networks used to predict optical properties does not introduce a significant bias in the radiative fluxes, but results in larger flux errors for individual atmospheric profiles because smaller networks predict less accurate optical properties. Nonetheless, the flux errors are more than two orders of magnitude smaller than the mean radiative fluxes: at the surface, the average downwelling and upwelling longwave fluxes are approximately 303 W m −2 and 789 W m −2 , respectively, and the average downwelling and upwelling shortwave fluxes are approximately 320 W m −2 and 55 W m −2 , respectively. Errors of net radiative fluxes computed by radiative transfer parameterizations in climate models can be up to 4 W m −2 for longwave and 5 W m −2 for shortwave radiation at the surface and the top of the atmosphere [28], depending on atmospheric conditions. These errors are larger than the range of our radiative flux errors with respect to RRTMGP, illustrating the accuracy of our neural network based parametrization.
Additionally, our errors in the downwelling surface fluxes and upwelling top of atmosphere fluxes with respect to RRTMGP are on average similar to or smaller than the average errors of RRTMGP with respect to the highly-accurate, but computationally very expensive, line-by-line radiative transfer model (LBLRTM) [29]. Since the neural networks are trained against RRTMGP, this suggest that further increasing the predictive skill of the networks will not improve the radiative fluxes much, as the flux errors compared to the "ground truth" will then be dominated by the errors of RRTMGP with respect to LBLRTM. The errors of radiative heating rates, which are proportional to the divergence of the radiative fluxes, are within about 0.01 K d −1 in the shortwave for all network sizes, except at the top of the atmosphere. The longwave radiative heating rates have errors within 1 K d −1 for most of the profile, although the smaller networks give errors up to 3 K d −1 at the surface. However, given the strong the strong dependency of the near-surface longwave heating rates on the surface temperature, we may expect that heating rates errors dampen out quickly due to adjustments of the surface temperature. Although in atmospheric models we are mainly interested in the total, spectrally-integrated longwave and shortwave fluxes, radiative fluxes per spectral band may be of interest for accurate predictions of the UV-index [30] or photosynthetically activate radiation (PAR). Therefore, we also determine the radiative fluxes for each spectral band and subsequently the mean absolute errors per band of the downwelling surface fluxes with respect to RRTMGP. The maximum error of the flux per band ranges between 0.0081 W m −2 and 0.049 W m −2 in the longwave spectrum and between 0.0055 W m −2 and 0.030 W m −2 in the shortwave spectrum (Figure 3), depending on the network sizes. Using the neural network-predicted optical properties, we can thus calculate the radiative fluxes of each spectral band with errors below 0.05 W m −2 . Moreover, the minimum errors of the radiative flux per band are even several orders of magnitude smaller than the maximum. Besides the accuracy of the predicted optical properties and the resulting radiative fluxes, the computational costs of the neural networks are of great interest since our main goal is to accelerate radiation computations. To evaluate the difference in runtime between RRTMGP and the neural network-solver, we generate 11 sets of 100 profiles with 1020 vertical layers from the RFMIP profiles. The optical properties of these sets are evaluated sequentially and the runtimes of the last 10 sets are averaged. The runtime of the first set is neglected to allow some spin-up time, mainly for the initialization of BLAS.
The neural network-solver is approximately 3 to 7 times faster than RRTMGP, depending on size of the networks (Fig. 4). The speed-up of the neural network-solver generally decreases for increasing network complexity. This is as expected, because the number of matrix multiplications scales with the number of layers and because the size of the matrices, and thus the computational effort required for the matrix multiplications, scales with the number of nodes per layer. The choice of a network size thus depends on the acceptable accuracy for a particular LES application.
Although our neural network-solver reduces the computational costs of the optical properties calculations significantly, end-to-end machine learning approaches that predict radiative fluxes immediately [9][10][11][12] have achieved speed-ups up to 80 times for the full radiative schemes. However, these end-to-end approaches also replace the governing radiative transfer equations and will give little flexibilty to changes in grid resolution. With our approach we keep the radiative transfer equations intact. However, this means that we still have to perform the spectral integration by predicting optical properties and calculating fluxes for all g-points.

(b) LES-tuned networks
The predicted optical properties of LES-tuned neural networks (Cabauw, RCEMIP), which are trained only for the range of atmospheric conditions expected in one LES case, are also very accurate. To further test the LES-tuned neural networks, we sample 1000 random atmospheric profiles each from the Cabauw and RCEMIP simulations and calculate profiles of optical properties with the Cabauw and RCEMIP sets of neural networks, respectively. For the Cabauw networks, we generally observe an improvement in the accuracy of the radiative heating rates, the downwelling surface fluxes and the upwelling top-of-atmosphere fluxes ( Figure 5) due to the LES tuning. This improvement is especially large in the shortwave spectrum, where for some networks sizes the mean squared errors of the surface and top-of-atmosphere fluxes are over an order of magnitude lower with the Cabauw networks than with the NWP networks. The lower mean squared errors of the Cabauw neural networks show that with LES tuning, we can use relatively small networks (e.g. 1L-32) to achieve similar or even higher accuracy than the more complex networks . . (2L-64_64, 3L-32_64_128) of the NWP set, which results in a larger speed-up compared to RRTMGP (Figure 3).
For the RCEMIP neural networks, the accuracy improvement in the shortwave spectrum we achieve with LES tuning (Figure 6) is similar to the improvement achieved by Cabauw networks. In the longwave spectrum, the accuracy of the radiative heating rates is higher with the RCEMIP networks than with the NWP networks. However, for some network sizes the accuracy of the radiative surface and top-of-atmosphere fluxes is lower with the RCEMIP networks. This suggests that not all RCEMIP networks have learned to predict optical properties for all combinations of T , p and H 2 O in the RCEMIP simulation as good as the NWP networks.
The mean absolute errors of the NWP networks on the profiles of the Cabauw ( Figure 5) and RCEMIP ( Figure 6) simulations are frequently larger than the errors of the NWP networks on the RFMIP-based profiles (Figure 2). On one hand, we may attribute the higher mean absolute errors on differences in the mean radiative fluxes, assuming the relative error remains approximately the same. However, the lower errors of NWP networks on the RFMIP-based profiles may also be a sign of overfitting due to insufficiently independent training and testing data. Nevertheless, given that the mean absolute errors are well within 0.5 W m −2 we are still confident that the NWP neural networks can be accurately used on a relatively wide range of atmospheric conditions. and heating rates are based on 1000 random profiles of RCEMIP simulation up to 3 K d −1 in the longwave spectrum, mainly near the surface, but we expect these errors to decrease rapidly after adjustment of the surface and air temperatures. The neural networks tested in this study are approximately 3 to 7 times faster than RRTMGP, depending on network size. The larger networks achieve lower speed-ups than the small networks, but result in more accurate radiative fluxes and heating rates, clearly showing a tradeoff between accuracy and computational speed. To further investigate this trade-off, we trained two additional sets of neural networks that are tuned for a single LES case each (Cabauw, RCEMIP) In general, these LES-tuned networks are more accurate on profiles of their respective simulations than the NWP networks, especially for shortwave radiation. This indicates that with LES tuning, smaller and therefore faster neural networks suffice to achieve a desired accuracy.
Given that RRTMGP uses linear interpolation from look-up tables to to compute optical properties [16], the computational efficiency of our neural network-based parametrization may be surprising. We attribute the speed-ups achieved by our parametrization to a large extent to the case-specific tuning, i.e. considering only a few gases or greatly limiting the range of atmospheric conditions (Cabauw and RCEMIP only), which reduces the problem size for which the neural networks have to be trained. Furthermore, the matrix computations required to solve the neural networks allow the use of machine-specific optimised BLAS libraries and reduces the memory use at the expense of floating point operations.
The speed-ups we achieve are less than those achieved by end-to-end approaches that emulate full radiative transfer parametrizations [9][10][11][12]. An advantage of our machine learning approach is that it still respects the governing radiative transfer equations. A promising future approach would be the application of machine learning to optimise the spectral integration. With such a machine learning approach the radiative transfer equations will still be solved, while the number of quadrature points may be reduced.