improver.calibration.load_and_train_quantile_regression_random_forest module

Script to load inputs and train a model using Quantile Regression Random Forest (QRF).

class LoadForTrainQRF(feature_config, parquet_diagnostic_names, cf_names, forecast_periods, cycletime, training_length, experiments, unique_site_id_keys='wmo_id')

Bases: PostProcessingPlugin

Plugin to load input files for training a Quantile Regression Random Forest (QRF) model.

__init__(feature_config, parquet_diagnostic_names, cf_names, forecast_periods, cycletime, training_length, experiments, unique_site_id_keys='wmo_id')

Initialise the LoadForTrainQRF plugin.

Parameters:

feature_config (dict[str, list[str]]) – Feature configuration defining the features to be used for Quantile Regression Random Forests.
parquet_diagnostic_names (Union[list[str], str]) – A list containing the diagnostic names that will be used for filtering the forecast and truth DataFrames read in from the parquet files. The target diagnostic name is expected to be the first item in the list. These names could be different from the CF name e.g. ‘temperature_at_screen_level’. This is expected to be the same length as the cf_names and experiments lists.
cf_names (Union[list[str], str]) – A list containing the CF names of the diagnostics. The CF names should match the order of the parquet_diagnostic_names. The target diagnostic to be calibrated is expected to be the first item in the list. These names could be different from the diagnostic name used to identify in the parquet files. For example, the diagnostic name could be ‘temperature_at_screen_level’ and the corresponding CF name could be ‘air_temperature’. This is expected to be the same length as the parquet_diagnostic_names and experiments lists.
forecast_periods (str) – Range of forecast periods to be calibrated in hours in the form: “start:end:interval” e.g. “6:18:6” or a single forecast period e.g. “6”. Multiple ranges can be specified using semicolon separation, e.g. “1:133:1;135:199:3” for hourly T+1 to T+132 and 3-hourly T+135 to T+198.
cycletime (str) – The time at which the forecast is valid in the form: YYYYMMDDTHHMMZ.
training_length (int) – The number of days of training data to use.
experiments (list[str]) – The names of the experiment (step) that calibration is applied to. This is used to filter the forecast DataFrame on load. This is expected to be the same length as the parquet_diagnostic_names and cf_names lists.
unique_site_id_keys – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or “latitude,longitude”.
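
The constraints described in the parameter list (three parallel lists of equal length with the target diagnostic first, and a cycletime in the form YYYYMMDDTHHMMZ) can be checked before the plugin is constructed. The following is a minimal sketch using only the standard library. `check_load_arguments` is a hypothetical helper, not part of IMPROVER, and treating the trailing "Z" as UTC is an assumption.

```python
from datetime import datetime, timezone


def _as_list(value):
    """Wrap a single string in a list, since the arguments accept either form."""
    return [value] if isinstance(value, str) else list(value)


def check_load_arguments(parquet_diagnostic_names, cf_names, experiments, cycletime):
    """Hypothetical pre-flight check; not part of the IMPROVER API."""
    names = _as_list(parquet_diagnostic_names)
    cf = _as_list(cf_names)
    exps = _as_list(experiments)
    if not (len(names) == len(cf) == len(exps)):
        raise ValueError(
            "parquet_diagnostic_names, cf_names and experiments must be the same length."
        )
    # Documented format is YYYYMMDDTHHMMZ; the trailing Z is assumed to mean UTC.
    parsed = datetime.strptime(cycletime, "%Y%m%dT%H%MZ").replace(tzinfo=timezone.utc)
    # The first entry of each list describes the target diagnostic.
    target = (names[0], cf[0], exps[0])
    return parsed, target
```

A call such as `check_load_arguments("temperature_at_screen_level", "air_temperature", "my_experiment", "20250101T0300Z")` would return the parsed cycletime together with the target triple; the experiment name here is purely illustrative.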

_parse_forecast_periods()

Parse the forecast periods argument to produce a list of forecast periods in seconds.

Return type: list[int]
Returns: List of forecast periods in seconds, deduplicated and sorted.
Raises: ValueError – If the forecast_periods argument is not a single integer, a range in the form ‘start:end:interval’, or semicolon-separated ranges.
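
The parsing behaviour described here can be illustrated with a short standalone function. This is a sketch under stated assumptions rather than the plugin's implementation: `parse_forecast_periods` is a hypothetical name, and the exclusive range end is inferred from the documented example, where “1:133:1” covers T+1 to T+132.

```python
def parse_forecast_periods(forecast_periods: str) -> list[int]:
    """Hypothetical stand-in for the documented parsing behaviour."""
    hours: set[int] = set()
    for part in forecast_periods.split(";"):
        fields = part.strip().split(":")
        try:
            values = [int(field) for field in fields]
        except ValueError:
            raise ValueError(f"Invalid forecast period specification: {part!r}") from None
        if len(values) == 1:
            # A single forecast period, e.g. "6".
            hours.add(values[0])
        elif len(values) == 3:
            start, end, interval = values
            if interval <= 0:
                raise ValueError(f"Interval must be positive: {part!r}")
            # The end is treated as exclusive (assumption based on the example).
            hours.update(range(start, end, interval))
        else:
            raise ValueError(f"Invalid forecast period specification: {part!r}")
    # Convert hours to seconds; the set removes duplicates, sorted() orders them.
    return sorted(hour * 3600 for hour in hours)
```

Overlapping ranges such as “1:4:1;3:10:3” contribute each forecast period only once, because the values are collected in a set before being sorted.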

_read_parquet_files(forecast_table_path, truth_table_path, forecast_periods)

Read the forecast and truth data from parquet files.

Parameters:

forecast_table_path (Path | str) – Path to the forecast parquet file.
truth_table_path (Path | str) – Path to the truth parquet file.
forecast_periods (list[int]) – List of forecast periods in seconds.

Returns:

Tuple containing:

DataFrame containing the forecast data.
DataFrame containing the truth data.

Raises:

ValueError – If the forecast parquet file does not contain the expected fields, which must include either a “percentile” or a “realization” column.
ValueError – If the forecast parquet file does not contain the expected features.
ValueError – If the truth parquet file does not contain the expected fields.
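
The first of these checks can be sketched as follows, assuming the column names of the forecast table are already available as a collection of strings. `find_ensemble_column` is a hypothetical helper, and preferring “percentile” when both columns are present is an arbitrary choice made for this illustration.

```python
def find_ensemble_column(columns):
    """Hypothetical helper: return whichever ensemble column is present."""
    # "percentile" is checked first, so it wins if both columns exist.
    for name in ("percentile", "realization"):
        if name in columns:
            return name
    raise ValueError(
        "The forecast table must contain either a 'percentile' "
        "or a 'realization' column."
    )
```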

process(file_paths)

Load input files for training a Quantile Regression Random Forest (QRF) model. Two sources of input data must be provided: historical forecasts and historical truth data (to use in calibration).

Parameters:

file_paths (cli.inputpaths) – A list of input paths containing:

- The path to a Parquet file containing the truths to be used for calibration. The expected columns within the Parquet file are: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude.
- The path to a Parquet file containing the forecasts to be used for calibration. The expected columns within the Parquet file are: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, percentile, diagnostic, latitude, longitude, period, height, cf_name, units. Please note that the presence of a forecast_period column is used to separate the forecast parquet file from the truth parquet file.
- Optionally, paths to NetCDF files containing additional predictors.

Returns:

Tuple containing:

- DataFrame containing the forecast data.
- DataFrame containing the truth data.
- List of cubes containing additional features.

A tuple of (None, None, None) is returned if:

- The quantile_forest package is not installed.
- No parquet files are provided.
- Either the forecast or truth parquet files are missing.
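
The rule that a forecast_period column distinguishes the forecast table from the truth table can be sketched without reading any files. The example below assumes the column names for each path have already been obtained (for instance from the Parquet schema); `split_forecast_and_truth` and its input mapping are hypothetical and only illustrate the documented rule, including the fallback to None values when either table is missing.

```python
def split_forecast_and_truth(columns_by_path):
    """Hypothetical helper mapping {path: column names} to (forecast, truth)."""
    forecast_path = None
    truth_path = None
    for path, columns in columns_by_path.items():
        if "forecast_period" in columns:
            # Only the forecast table carries a forecast_period column.
            forecast_path = path
        else:
            truth_path = path
    if forecast_path is None or truth_path is None:
        # Mirrors the documented behaviour of returning None values
        # when either table is missing.
        return None, None
    return forecast_path, truth_path
```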

class PrepareAndTrainQRF(feature_config, target_cf_name, n_estimators=100, max_depth=None, max_samples=None, random_state=None, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id', **kwargs)

Bases: PostProcessingPlugin

Plugin to prepare and train a Quantile Regression Random Forest (QRF) model.

__init__(feature_config, target_cf_name, n_estimators=100, max_depth=None, max_samples=None, random_state=None, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id', **kwargs)

Initialise the PrepareAndTrainQRF plugin.

Parameters:

feature_config (dict[str, list[str]]) – Feature configuration defining the features to be used for Quantile Regression Random Forests.
target_cf_name (str) – A string containing the CF name of the forecast to be calibrated e.g. air_temperature.
n_estimators (int) – The number of trees in the forest.
max_depth (Optional[int]) – The maximum depth of the trees.
max_samples (Optional[float]) – The maximum number of samples to draw from the total number of samples to train each tree.
random_state (Optional[int]) – Seed used by the random number generator.
transformation (Optional[str]) – Transformation to be applied to the data before fitting.
pre_transform_addition (float) – Value to be added before transformation.
unique_site_id_keys – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or [“latitude”, “longitude”].
kwargs – Additional keyword arguments for the quantile regression model.
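
The interaction between transformation and pre_transform_addition can be illustrated as below. This is a sketch only: the transformation names “log” and “sqrt” are assumptions made for the example (this page does not list the accepted values), and `forward_transform` is a hypothetical function. The addition is applied first, which allows, for example, a logarithm to be taken of data containing zeros.

```python
import math

# Assumed transformation names, used for illustration only.
TRANSFORMS = {"log": math.log, "sqrt": math.sqrt}


def forward_transform(values, transformation=None, pre_transform_addition=0.0):
    """Add a constant to each value, then apply the named transformation."""
    if transformation is None:
        # No transformation requested: return the values unchanged.
        return list(values)
    func = TRANSFORMS[transformation]
    return [func(value + pre_transform_addition) for value in values]
```

With pre_transform_addition=1, a value of 0.0 maps to log(1) = 0.0 instead of raising a math domain error.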

_add_static_features_from_cubes_to_df(forecast_df, cube_inputs)

Add features to the forecast DataFrame from cubes based on the feature configuration. Other features are expected to already be present in the forecast DataFrame.

Parameters:

forecast_df (DataFrame) – DataFrame containing the forecast data.
cube_inputs (CubeList) – List of cubes containing additional features.

Return type:

DataFrame

Returns:

DataFrame with additional features added from the input cubes.

static _check_matching_times(forecast_df, truth_df)

Find the intersecting times available within the forecast and truth DataFrames.

Parameters:

forecast_df (DataFrame) – DataFrame containing the forecast data.
truth_df (DataFrame) – DataFrame containing the truth data.

Return type:

list[Timestamp]

Returns:

List of intersecting times as pandas Timestamp objects.
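
Finding the intersecting times amounts to a set intersection. The sketch below uses plain datetime objects in place of pandas Timestamps so that it runs with the standard library alone; `intersecting_times` is a hypothetical name, and returning the result in sorted order is an assumption.

```python
from datetime import datetime


def intersecting_times(forecast_times, truth_times):
    """Return the times present in both inputs, sorted (ordering is assumed)."""
    return sorted(set(forecast_times) & set(truth_times))


forecast_times = [datetime(2025, 1, 1, 6), datetime(2025, 1, 1, 12)]
truth_times = [datetime(2025, 1, 1, 12), datetime(2025, 1, 1, 18)]
common = intersecting_times(forecast_times, truth_times)
```

An empty result corresponds to the case in which no valid times are shared between the forecast and truth data.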

filter_bad_sites(forecast_df, truth_df)

Remove sites that have NaNs in the data.

Parameters:

forecast_df (DataFrame) – DataFrame containing the forecast data with features.
truth_df (DataFrame) – DataFrame containing the truth data.

Returns:

Tuple containing:

DataFrame containing the forecast data with bad sites removed.
DataFrame containing the truth data with bad sites removed.

Raises:

ValueError – If the truth DataFrame is empty after removing NaNs.
ValueError – If there are no matching sites and times between the forecast and truth DataFrames after removing NaNs.
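
The idea of removing sites with NaNs can be sketched on plain Python rows. This example assumes that a site is dropped entirely if any of its rows contains a NaN, which is one reading of the description above; `drop_sites_with_nans` and the row layout are hypothetical.

```python
import math


def drop_sites_with_nans(rows, site_key="wmo_id", value_key="forecast"):
    """Drop every row belonging to a site that has at least one NaN value."""
    bad_sites = {row[site_key] for row in rows if math.isnan(row[value_key])}
    return [row for row in rows if row[site_key] not in bad_sites]


rows = [
    {"wmo_id": "03001", "forecast": 280.0},
    {"wmo_id": "03002", "forecast": float("nan")},
    {"wmo_id": "03002", "forecast": 281.0},
]
clean_rows = drop_sites_with_nans(rows)
```

Here both rows for site "03002" are removed, including the one holding a valid value.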

process(forecast_df, truth_df, cube_inputs=None)

Load input files and train a Quantile Regression Random Forest (QRF) model. This model can be applied later to calibrate the forecast. Two sources of input data must be provided: historical forecasts and historical truth data (to use in calibration). The model is output as a pickle file.

Parameters:

forecast_df (DataFrame) – DataFrame containing the forecast data.
truth_df (DataFrame) – DataFrame containing the truth data.
cube_inputs (Optional[CubeList]) – List of cubes containing additional features.

Return type:

Optional[tuple[RandomForestQuantileRegressor, str, float]]

Returns: A tuple containing:

The trained RandomForestQuantileRegressor model.
The transformation applied to the data before fitting.
The value added before transformation.

Raises: ValueError – If there are no matching times between the forecast and truth data.
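
Because the return type is optional and the result is a three-element tuple, code that later applies the model needs to handle both cases. The sketch below shows a round trip through pickle under stated assumptions: a plain dict stands in for the trained RandomForestQuantileRegressor, and `unpack_training_result` is a hypothetical helper.

```python
import pickle


def unpack_training_result(payload):
    """Unpickle a training result, returning None if no model was trained."""
    result = pickle.loads(payload)
    if result is None:
        return None
    model, transformation, pre_transform_addition = result
    return model, transformation, pre_transform_addition


# A dict is used as a placeholder for the trained model in this illustration.
payload = pickle.dumps(({"placeholder": "model"}, "log", 1.0))
unpacked = unpack_training_result(payload)
```

Keeping the transformation and the pre-transform addition alongside the model means the same transformation can be reproduced when the model is applied.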

class RandomForestQuantileRegressor

Bases: object
