improver.calibration.rainforest_calibration module

RainForests calibration plugins.

RainForests calibration

RainForests calibration is a situation-dependent non-parametric method for the calibration of rainfall. It is loosely based on the ECPoint method of Hewson and Pillosu (Hewson & Pillosu, 2021).

Sub-grid variability as a means of calibration

Like ECPoint, RainForests aims to calibrate grid-scale rainfall forecasts by accounting for sub-grid variability. Sub-grid variability in this context describes the relationship between the point observations one would expect to measure within a grid box and the areal average grid-scale NWP forecast value. This relationship is represented by a mapping that takes each NWP forecast value and maps it to a distribution of expected observed values.

Naturally the relationship between this distribution and the forecast value is contingent on the underlying rainfall processes at play. For instance, the distribution associated with a rainband will be characteristically different to that associated with post-frontal showers and different still to that associated with deep-tropical convection. To this end, a suitable set of meteorological variables can be used to distinguish different rainfall regimes and identify the associated distribution that describes the underlying sub-grid variability.

RainForests (like ECPoint) processes each ensemble member separately to produce a per-realization output. This output can be considered a pseudo-ensemble representing the conditional probability distribution of observed outcomes, given that the realised atmospheric state is consistent with that of the input ensemble member forecast. The predicted distributions for the different ensemble members are then blended in probability space to produce the final calibrated probability output.
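As a minimal illustration of the blending step, the calibrated ensemble probability of exceeding each threshold can be taken as the equal-weight mean of the per-member exceedance probabilities; all values below are invented:

import numpy as np

# Each row holds one member's P(rain > t) for three model thresholds.
per_member_probs = np.array([
    [0.9, 0.4, 0.1],  # member 0
    [0.8, 0.5, 0.2],  # member 1
    [0.7, 0.3, 0.0],  # member 2
])

# Equal-weight blend in probability space gives the calibrated
# exceedance probabilities for the full ensemble.
blended = per_member_probs.mean(axis=0)  # -> array([0.8, 0.4, 0.1])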

One advantage of the ECPoint approach is that the calibration is inherently non-local. As the calibration is done by identifying distinct weather types, the model bias and scale difference should be independent of any given location and time, since the underlying physical process should be identical. Thus a grid point can be calibrated using data from any location, provided the underlying weather type is consistent. This enables effective calibration to be applied to areas that typically lack sufficient cases to calibrate against.

The RainForests method

RainForests uses machine-learning tree models, namely gradient-boosted decision tree (GBDT) models, to replace the manually constructed decision tree model of ECPoint.

Our aim is to produce a probability distribution of the expected rainfall, given the NWP forecasts of relevant variables. We define a set of rainfall thresholds, suitably spaced so as to accurately model the distribution. Then we train a separate GBDT model for each lead time and threshold (note that each GBDT model itself consists of several hundred trees).
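The sketch below shows how such a family of models might be trained with LightGBM; the feature matrix, observations and threshold values are invented for illustration, and in practice a separate family of models is trained for each lead time:

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # illustrative feature matrix
y = rng.gamma(1.0, 1.0, size=1000)   # illustrative rainfall observations
thresholds = [0.1, 0.5, 1.0]         # illustrative rainfall thresholds

# One GBDT model per threshold, each predicting P(rainfall > threshold).
models = {}
for threshold in thresholds:
    labels = (y > threshold).astype(float)  # binary exceedance targets
    models[threshold] = lgb.train(
        {"objective": "regression", "verbosity": -1},
        lgb.Dataset(X, label=labels),
        num_boost_round=500,  # each model is itself several hundred trees
    )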

The set of variables that feed into the GBDT models are what allows the calibration to distinguish between different weather situations and build the appropriate probability distribution accordingly. These input variables are typically diagnostic variables for the ensemble realization (including the variable of interest), but can include static and dynamic ancillary variables, such as local solar time, and whole-of-ensemble values for diagnostic variables, such as mean or standard deviation.

Here we use LightGBM for training the models, and compile the models with Treelite for efficient prediction.
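A minimal sketch of that compilation step, assuming a LightGBM model saved in text format and the treelite 2.x/3.x API (which provides the treelite_runtime.Predictor referred to later on this page; newer Treelite releases have reorganised this interface):

import treelite
import treelite_runtime

# Compile the saved LightGBM model into a shared library.
model = treelite.Model.load("lightgbm_model.txt", model_format="lightgbm")
model.export_lib(toolchain="gcc", libpath="./lightgbm_model.so")

# Load the compiled model for fast multi-threaded prediction.
predictor = treelite_runtime.Predictor("./lightgbm_model.so", nthread=1)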

GBDT vs manually constructed DT

The choice of using GBDT models in place of the manually constructed decision tree of ECPoint comes with some advantages, but at the expense of some trade-offs:

Advantages:

  • GBDT is a sum of many trees rather than a single tree. This means outputs are near-continuous relative to the inputs.

  • Trees are built algorithmically, not manually, with each branch split automatically chosen to be optimal relative to the loss function. In principle this gives better accuracy, and makes it easier to retrain on new data.

Trade-offs:

  • By using an ensemble of trees, the intuitive connection between weather type and feature variables becomes obscured.

  • Using a series of decision tree ensembles in place of a single decision tree increases the computational demand significantly.

  • Some initial effort is required to select a good set of model hyper-parameters that neither under-fits nor over-fits. This process is more challenging and less transparent than for ECPoint; however, it is required only once rather than each time the decision tree(s) are constructed.
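For a sense of what that tuning involves, the LightGBM hyper-parameters below are the kind that control the under-/over-fitting balance; the values shown are assumptions, not the operational settings:

params = {
    "objective": "regression",
    "num_leaves": 31,         # tree complexity; larger values risk over-fitting
    "learning_rate": 0.05,    # shrinkage applied to each boosting step
    "min_data_in_leaf": 100,  # regularisation against over-fitting
    "feature_fraction": 0.8,  # per-tree feature subsampling
}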

Implementation details

Model training

The model training process is relatively simple: a series of forecast-observation pairs, along with the associated feature variables, is collated into a single pandas dataframe. In general, each ensemble member yields a separate row of the dataframe, although for reasons of computational efficiency it may be desirable to use only a subset of members, for example the control realization of each ensemble. As each model predicts the probability that rainfall will exceed a particular threshold, the output is expected to be a number between 0 and 1. On occasion, however, the GBDT models may predict values slightly outside this range, and so the predictions are clipped back onto this interval. Such values are possible due to the current choice of training loss function, namely mean squared error, which is akin to the Brier score when working in probability space.
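A minimal sketch of the collation and clipping described above; the column names and data are invented, not the IMPROVER schema:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # forecast-observation pairs (one row per ensemble member)

df = pd.DataFrame({
    "rainfall_rate": rng.gamma(1.0, 1.0, n),    # forecast of the target variable
    "cape": rng.gamma(2.0, 100.0, n),           # example diagnostic feature
    "local_solar_time": rng.uniform(0, 24, n),  # example ancillary feature
    "observation": rng.gamma(1.0, 1.0, n),      # matched point observation
})

# Raw GBDT predictions can stray slightly outside [0, 1] under the MSE
# loss, so they are clipped back onto the unit interval.
raw_predictions = np.array([-0.02, 0.4, 1.03])
probabilities = np.clip(raw_predictions, 0.0, 1.0)  # -> [0.0, 0.4, 1.0]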

Currently model training is done offline, using a minimum 12-month period to capture the full seasonal cycle.

Forecast calibration

Forecast calibration uses the trained GBDT models, along with the forecast cube and associated feature cubes. The tree-models are passed in via a model-config json which identifies the appropriate tree-model file for each lead time and threshold.

The decision-tree models are used to construct representative probability distributions for each input ensemble member which are then blended to give the calibrated distribution for the full ensemble.

The distributions for the individual ensemble members are formed in a two-step process:

  1. Evaluate the CDF defined over the specified model thresholds for each ensemble member. Each threshold exceedance probability is evaluated using the corresponding decision-tree model.

  2. Interpolate each ensemble member distribution to the output thresholds.
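As a minimal illustration of the two steps for a single ensemble member (all values invented):

import numpy as np

# Step 1 output: exceedance probabilities at the model thresholds.
model_thresholds = np.array([0.0, 0.1, 0.5, 1.0, 5.0])  # e.g. mm/h
exceedance = np.array([1.0, 0.7, 0.35, 0.2, 0.02])

# Step 2: linearly interpolate the member's CDF onto the output thresholds.
output_thresholds = np.array([0.25, 2.0])
calibrated = np.interp(output_thresholds, model_thresholds, exceedance)
# -> array([0.56875, 0.155])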

Deterministic forecasts can also be calibrated using the same approach to produce a calibrated CDF; in this case inputs are treated as an ensemble of size 1.

class ApplyRainForestsCalibration(model_config_dict: Dict[str, Dict[str, Dict[str, str]]], threads: int = 1, bin_data: bool = False)[source]

Bases: PostProcessingPlugin

Generic class to calibrate input forecast via RainForests.

The choice of tree-model library is determined from package availability, and whether all required model files are available. Treelite is the preferred option, with the plugin falling back to LightGBM if the Treelite requirements are missing.

_check_num_features(features)[source]

Check that the correct number of features has been passed into the model.

Parameters:

features (CubeList) – Cubelist containing feature variables.

Return type:

None

_get_feature_splits(model_config_dict)[source]

Get the combined feature splits (over all thresholds) for each lead time.

Parameters:

model_config_dict – dictionary of the same format expected by __init__

Return type:

Dict[int, List[ndarray]]

Returns:

dict where keys are the lead times and the values are lists of lists. The outer list has length equal to the number of model features, and it contains the lists of feature splits for each feature. Each feature’s list of splits is ordered.

_get_num_features()[source]

Subclasses should override this function.

Return type:

int

_parse_model_config(model_config_dict)[source]

Parse the model config dictionary, set self.lead_times and self.model_thresholds, and return a sorted version of the config dictionary.

Parameters:
  • model_config_dict (Dict[str, Dict[str, Dict[str, str]]]) – Nested dictionary with string keys. Keys of the outer level are lead times, and keys of the inner level are thresholds.

Return type:

Dict[float32, Dict[float32, Dict[str, str]]]

Returns:

Dictionary with the same nested structure as model_config_dict, but the lead time and threshold keys now have type np.float32.

static check_filenames(key_name, model_config_dict)[source]

Check whether files specified by model_config_dict exist, and raise an error if any are missing.

Parameters:
  • key_name (str) – ‘treelite_model’ or ‘lightgbm_model’ are the expected names.

  • model_config_dict (Dict[str, Dict[str, Dict[str, str]]]) – Dictionary containing Rainforests model configuration variables.

process()[source]

Subclasses should override this function.

Return type:

None

class ApplyRainForestsCalibrationLightGBM(model_config_dict, threads=1, bin_data=False)[source]

Bases: ApplyRainForestsCalibration

Class to calibrate an input forecast via the RainForests approach, using LightGBM tree models.

__init__(model_config_dict, threads=1, bin_data=False)[source]

Initialise the tree model variables used in the application of RainForests Calibration. LightGBM Boosters are used for tree model predictors.

Parameters:
  • model_config_dict (Dict[str, Dict[str, Dict[str, str]]]) – Dictionary containing Rainforests model configuration variables.

  • threads (int) – Number of threads to use during prediction with tree-model objects.

  • bin_data (bool) – Bin data according to splits used in models. This speeds up prediction if there are many data points which fall into the same bins for all threshold models. Limits the calculation of common feature values by only calculating them once.

Dictionary is of format:

{
    "24": {
        "0.000010": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
        "0.000050": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
        "0.000100": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
    }
}

The keys specify the lead times and model threshold values, while the associated values are the path to the corresponding tree-model objects for that lead time and threshold.
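For example, the plugin might be instantiated from a config file of the above format (the file name is an assumption):

import json

from improver.calibration.rainforest_calibration import (
    ApplyRainForestsCalibrationLightGBM,
)

# Load the nested lead-time -> threshold -> model-path mapping shown above.
with open("model_config.json") as f:
    model_config_dict = json.load(f)

plugin = ApplyRainForestsCalibrationLightGBM(model_config_dict, threads=4)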

_align_feature_variables(feature_cubes, forecast_cube)[source]

Ensure that feature cubes have consistent dimension coordinates. If a realization dimension is present in any cube, all cubes lacking this dimension will have a realization dimension added and broadcast along this new dimension.

This situation occurs when derived fields (such as accumulated solar radiation) are used as predictors. As these fields do not contain a realization dimension, they must be broadcast to match the NWP fields that do contain realization, so that all features have consistent shape.

In the case of deterministic models (those without a realization dimension), a realization dimension is added to allow consistent behaviour between ensemble and deterministic models.

Parameters:
  • feature_cubes (CubeList) – Cubelist containing feature variables to align.

  • forecast_cube (Cube) – Cube containing the forecast variable to align.

Return type:

Tuple[CubeList, Cube]

Returns:

  • feature_cubes with realization coordinate added to each cube if absent

  • forecast_cube with realization coordinate added if absent

Raises:

ValueError – if feature/forecast variables have inconsistent dimension coordinates (excluding realization dimension), or if feature/forecast variables have different length realization coordinate over cubes containing a realization coordinate.

_calculate_threshold_probabilities(forecast_cube, feature_cubes)[source]

Evaluate the threshold exceedance probabilities for each ensemble member in forecast_cube using the tree_models, with the associated feature_cubes taken as inputs to the tree_model predictors.

Parameters:
  • forecast_cube (Cube) – Cube containing the variable to be calibrated.

  • feature_cubes (CubeList) – Cubelist containing the independent feature variables for prediction.

Return type:

Cube

Returns:

A cube containing threshold exceedance probabilities.

Raises:

ValueError – If an unsupported model object is passed. Expects lightgbm Booster, or treelite_runtime Predictor (if treelite dependency is available).

_evaluate_probabilities(input_data, lead_time_hours, output_data)[source]

Evaluate probability that forecast exceeds thresholds.

Parameters:
  • input_data (ndarray) – 2-d array of data for the feature variables of the model

  • lead_time_hours (int) – lead time in hours

  • output_data (ndarray) – array to populate with output; will be modified in place

Return type:

None

_get_ensemble_distributions(per_realization_CDF, forecast, output_thresholds)[source]

Interpolate probabilities calculated at model thresholds to extract probabilities at output thresholds for all realizations.

Parameters:
  • per_realization_CDF (Cube) – Cube containing the CDF probabilities for each ensemble member at model thresholds, with threshold as the first dimension.

  • forecast (Cube) – Cube containing NWP ensemble forecast.

  • output_thresholds (ndarray) – Sorted array of thresholds at which to calculate the output probabilities.

Return type:

Cube

Returns:

Cube containing probabilities at output thresholds for all realizations. Dimensions are the same as the forecast cube, with an additional threshold dimension first.

_get_num_features()[source]

Subclasses should override this function.

Return type:

int

_make_decreasing(probability_data)[source]

Enforce monotonicity on the CDF data, where threshold dimension is assumed to be the leading dimension.

This is achieved by identifying the minimum value progressively along the leading dimension, comparing each value with all preceding probability values along this dimension. The same is done for maximum values, comparing each value with all succeeding values along the leading dimension. Averaging the two resulting arrays yields an array that decreases monotonically in the threshold dimension, as illustrated in the sketch after this entry.

Parameters:

probability_data (ndarray) – The probability data as exceedance probabilities.

Return type:

ndarray

Returns:

The probability data, enforced to be monotonically decreasing along the leading dimension.
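A minimal numpy sketch of this scheme, with illustrative names:

import numpy as np

def make_decreasing(probability_data: np.ndarray) -> np.ndarray:
    # Running minimum over all preceding values along the threshold axis.
    lower = np.minimum.accumulate(probability_data, axis=0)
    # Running maximum over all succeeding values along the threshold axis.
    flipped = np.flip(probability_data, axis=0)
    upper = np.flip(np.maximum.accumulate(flipped, axis=0), axis=0)
    # Both arrays are non-increasing, so their average is too.
    return (lower + upper) / 2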

_prepare_features_array(feature_cubes)[source]

Convert gridded feature cubes into a numpy array, with feature variables sorted alphabetically.

Note: It is expected that feature_cubes has been aligned using _align_feature_variables prior to calling this function.

Parameters:

feature_cubes (CubeList) – Cubelist containing the independent feature variables for prediction.

Return type:

ndarray

Returns:

Array containing flattened feature variables.

Raises:

ValueError – If the flattened cubes have differing lengths.

_prepare_threshold_probability_cube(forecast_cube, thresholds)[source]

Initialise a cube with the same dimensions as the input forecast_cube, with an additional threshold dimension added as the leading dimension.

Parameters:
  • forecast_cube – Cube containing the forecast to be calibrated.

  • thresholds – Points of the threshold dimension.

Returns:

An empty probability cube.

process(forecast_cube, feature_cubes, output_thresholds, threshold_units=None)[source]

Apply rainforests calibration to forecast cube.

Ensemble forecasts must be in realization representation. Deterministic forecasts can be processed to produce a pseudo-ensemble; a realization dimension will be added to deterministic forecast cubes if one is not already present.

The calibration is done in a situation dependent fashion using a series of decision-tree models to construct representative probability distributions for each input ensemble member which are then blended to give the calibrated distribution for the full ensemble.

These distributions are formed in a two-step process:

  1. Evaluate the CDF defined over the specified model thresholds for each ensemble member. Each threshold exceedance probability is evaluated using the corresponding decision-tree model.

  2. Interpolate each ensemble member distribution to the output thresholds.

Parameters:
  • forecast_cube (Cube) – Cube containing the forecast to be calibrated; must be given as realizations.

  • feature_cubes (CubeList) – Cubelist containing the feature variables (physical parameters) used as inputs to the tree-models for the generation of the associated probability distributions. Feature cubes are expected to have the same dimensions as forecast_cube, with the exception of the realization dimension. Where the feature_cube contains a realization dimension this is expected to be consistent, otherwise the cube will be broadcast along the realization dimension.

  • output_thresholds (List) – Set of output thresholds.

  • threshold_units (Optional[str]) – Units in which output_thresholds are specified. If None, assumed to be the same as forecast_cube.

Return type:

Cube

Returns:

The calibrated forecast cube.

Raises:

RuntimeError – If the number of tree models is inconsistent with the number of model thresholds.
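A hypothetical end-to-end call; the file names and threshold values are assumptions for illustration:

import json

import iris

from improver.calibration.rainforest_calibration import (
    ApplyRainForestsCalibrationLightGBM,
)

with open("model_config.json") as f:
    model_config_dict = json.load(f)

forecast_cube = iris.load_cube("rainfall_rate_forecast.nc")
feature_cubes = iris.load("feature_variables.nc")

plugin = ApplyRainForestsCalibrationLightGBM(model_config_dict, threads=4)
probability_cube = plugin.process(
    forecast_cube,
    feature_cubes,
    output_thresholds=[0.0, 0.1, 1.0, 5.0],
    threshold_units="mm h-1",
)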

class ApplyRainForestsCalibrationTreelite(model_config_dict, threads=1, bin_data=False)[source]

Bases: ApplyRainForestsCalibrationLightGBM

Class to calibrate an input forecast via the RainForests approach, using Treelite-compiled tree models.

__init__(model_config_dict, threads=1, bin_data=False)[source]

Initialise the tree model variables used in the application of RainForests Calibration. Treelite Predictors are used for tree model predictors.

Parameters:
  • model_config_dict (Dict[str, Dict[str, Dict[str, str]]]) – Dictionary containing Rainforests model configuration variables.

  • threads (int) – Number of threads to use during prediction with tree-model objects.

  • bin_data (bool) – Bin data according to splits used in models. This speeds up prediction if there are many data points which fall into the same bins for all threshold models. Limits the calculation of common feature values by only calculating them once.

Dictionary is of format:

{
    "24": {
        "0.000010": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
        "0.000050": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
        "0.000100": {
            "lightgbm_model": "<path_to_lightgbm_model_object>",
            "treelite_model": "<path_to_treelite_model_object>"
        },
    }
}

The keys specify the lead times and model threshold values, while the associated values are the paths to the corresponding tree-model objects for that lead time and threshold.

_get_num_features()[source]

Subclasses should override this function.

Return type:

int