
Building Thermal Aggregates

Why Aggregate Weather Data?

ERA5 data provides extremely detailed meteorological information. For each day, and at each of several thousand geographical points, we have access to:

  • Temperatures
  • HDD (heating degree days)
  • CDD (cooling degree days)
  • HDD²
  • CDD²

This wealth of information is both a strength and a challenge. In theory, more data should lead to better forecasts. In practice, however, using all these series directly would introduce major complications. The variables are highly redundant, and many consist primarily of noise, which makes the forecasting models unstable and significantly increases the risk of overfitting. To circumvent these issues, we will adopt a much more robust approach by summarizing the overall thermal state of the country using just a handful of synthetic indicators.

This strategy is highly intuitive given our specific objective. Because we are forecasting national electricity consumption, the model needs to capture the overall thermal baseline, widespread cold snaps, and major heatwaves, rather than getting bogged down in local micro-variations.

Building Spatial Aggregates

We will now transform the thousands of ERA5 series into a few synthetic variables.

Spatial Mean Temperature

For each day, we calculate the mean temperature across all sites using the following equation: \(T_{\text{mean}}(t) = \frac{1}{N} \sum_{i=1}^{N} T_i(t)\)

where:

  • \(N\) represents the number of ERA5 points.
  • \(T_i(t)\) is the temperature of site \(i\) on day \(t\).

This variable summarizes the average thermal baseline of the studied territory.

Extreme Temperatures

We also calculate the spatial minimum and maximum temperatures:

  • \(T_{\text{min}}(t) = \min_{i} T_i(t)\)
  • and \(T_{\text{max}}(t) = \max_{i} T_i(t)\)

These two variables help detect extreme thermal events:

  • \(T_{\text{min}}\) captures the coldest snaps
  • and \(T_{\text{max}}\) captures the peak heatwaves.

Although national consumption primarily depends on the mean temperature, extreme conditions can trigger major demand spikes.
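As a minimal sketch of the three temperature aggregates defined above, assume the ERA5 temperatures are already loaded into a NumPy array with one row per day and one column per grid point. The function name `spatial_temperature_aggregates` and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np


def spatial_temperature_aggregates(temps):
    """Collapse a (n_days, n_points) temperature array into daily aggregates.

    Assumption: each column is one ERA5 grid point, each row is one day.
    """
    temps = np.asarray(temps, dtype=float)
    return {
        "T_mean": temps.mean(axis=1),  # (1/N) * sum_i T_i(t)
        "T_min": temps.min(axis=1),    # min_i T_i(t)
        "T_max": temps.max(axis=1),    # max_i T_i(t)
    }


# Toy data: 3 days, 4 grid points (values in °C, invented for illustration).
toy_temps = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [-1.0, 0.0, 3.0, 2.0],
    [10.0, 12.0, 11.0, 15.0],
])

temp_agg = spatial_temperature_aggregates(toy_temps)
for name, series in temp_agg.items():
    print(name, series)
```

Each output series has one value per day, so the spatial dimension disappears entirely.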

Aggregating Derived Thermal Variables

We then apply the exact same principle to the derived thermal variables:

  • \(\text{HDD}\)
  • \(\text{CDD}\)
  • \(\text{HDD}^2\)
  • \(\text{CDD}^2\)

For each of these, we calculate a daily spatial mean:

\(\text{HDD}_{\text{mean}}(t) = \frac{1}{N} \sum_{i=1}^{N} \text{HDD}_i(t)\)

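and analogously for \(\text{CDD}\), \(\text{HDD}^2\), and \(\text{CDD}^2\). Because the squared variables are defined at each point, \(\text{HDD}^2_{\text{mean}}\) is the spatial mean of the squares, not the square of \(\text{HDD}_{\text{mean}}\).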
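The same averaging can be sketched for the derived variables. This example assumes per-point HDD and CDD arrays of shape (days, points) are already available, and it computes the squared variables point by point before averaging. The function name `aggregate_derived` and the toy numbers are hypothetical.

```python
import numpy as np


def aggregate_derived(hdd, cdd):
    """Daily spatial means of HDD, CDD and their per-point squares.

    Assumption: hdd and cdd have shape (n_days, n_points).
    """
    hdd = np.asarray(hdd, dtype=float)
    cdd = np.asarray(cdd, dtype=float)
    return {
        "HDD_mean": hdd.mean(axis=1),
        "CDD_mean": cdd.mean(axis=1),
        # Square at each point first, then average over points.
        "HDD2_mean": (hdd ** 2).mean(axis=1),
        "CDD2_mean": (cdd ** 2).mean(axis=1),
    }


# Toy data: 2 days, 2 grid points (invented values).
toy_hdd = np.array([[0.0, 4.0], [2.0, 2.0]])
toy_cdd = np.array([[1.0, 3.0], [0.0, 0.0]])

derived = aggregate_derived(toy_hdd, toy_cdd)
for name, series in derived.items():
    print(name, series)
```

On the first toy day the mean HDD is 2 while the mean of HDD² is 8, which shows that the squared aggregate also reflects how unevenly the cold is spread across the territory, something the plain mean cannot capture.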

Massive Reduction in Complexity

Following this aggregation step, we no longer have to handle tens of thousands of meteorological time series. Instead, we are left with just 7 synthetic variables:

  • \(T_{\text{mean}}\)
  • \(T_{\text{min}}\)
  • \(T_{\text{max}}\)
  • \(\text{HDD}_{\text{mean}}\)
  • \(\text{CDD}_{\text{mean}}\)
  • \(\text{HDD}^2_{\text{mean}}\)
  • \(\text{CDD}^2_{\text{mean}}\)

This represents a massive reduction in dimensionality. We shift from a gigantic meteorological feature space to a small set of robust, easily interpretable variables. This simplification generally improves model stability, generalization capability, and overall readability.

Adding Meteorological Lag Memory

Much like electricity consumption, the effect of weather on demand exhibits a certain degree of inertia. A prolonged cold snap often impacts electricity demand over several consecutive days because:

  • Buildings cool down progressively.
  • Human behavior adapts with a delay.
  • Heating systems themselves react with inherent inertia.

Therefore, we will apply the same principle to our weather variables as we did for consumption autoregression: using a rolling time window that looks back at previous days' observations. As with the AR model, we choose a 14-day window. To predict the residual for day \(D\), each observation includes:

  • The 7 thermal variables observed at \(D-14\)
  • Those from \(D-13\)
  • Those from \(D-12\)
  • ...all the way up to \(D-1\).

In other words, the model leverages meteorological data from the 14 days leading up to the day being forecast. This yields \(14 \times 7 = 98\) meteorological features used to predict the residuals of our base model.
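The construction of the 98 lagged features can be sketched as follows. This is an illustrative version, not the code of the tutorial script: the function name `build_lag_features`, the column ordering (oldest day first), and the random input data are all assumptions.

```python
import numpy as np


def build_lag_features(daily, n_lags=14):
    """Build a lagged feature matrix from daily aggregates.

    daily: array of shape (n_days, n_vars), e.g. the 7 thermal variables.
    Returns (X, target_days) where row k of X holds the values from
    D-n_lags ... D-1 for target day D = target_days[k]. Day D itself is
    never included, so the features only use past observations.
    """
    daily = np.asarray(daily, dtype=float)
    n_days, n_vars = daily.shape
    if n_days <= n_lags:
        raise ValueError("need more days than lags")
    target_days = np.arange(n_lags, n_days)
    # Slice [d - n_lags, d) covers D-14 ... D-1 and excludes D.
    X = np.stack([daily[d - n_lags:d].ravel() for d in target_days])
    return X, target_days


# Placeholder data: 30 days of 7 aggregated variables (random, for shape only).
rng = np.random.default_rng(0)
toy_daily = rng.normal(size=(30, 7))

X_meteo, target_days = build_lag_features(toy_daily, n_lags=14)
print(X_meteo.shape)
```

The first 14 days have no complete history and therefore produce no row, so the residual series being predicted would need to be trimmed to the same target days.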

An Intentionally Simple Approach

The strategy chosen for this tutorial is intentionally straightforward. We are not trying to build an overly sophisticated weather model, optimize every single geographical location, or automatically select from hundreds of candidate variables.

Instead, our goal is different: to demonstrate that a simple, physically coherent, and robust thermal representation can already significantly improve an energy forecast. This philosophy is crucial in applied machine learning: in many real-world scenarios, a simpler, more robust, and more interpretable model is highly preferable to an extremely complex solution that struggles to generalize.

Script integrating the meteorological features: scripts/with_meteo.py