improver.calibration.quantile_regression_random_forest module
Plugins to perform quantile regression using random forests.
- class ApplyQuantileRegressionRandomForests(target_name, feature_config, quantiles, transformation=None, pre_transform_addition=0, unique_site_id_keys=['wmo_id'])
Bases: PostProcessingPlugin
Plugin to apply a trained model using quantile regression random forests.
- __init__(target_name, feature_config, quantiles, transformation=None, pre_transform_addition=0, unique_site_id_keys=['wmo_id'])
Initialise the plugin.
- Parameters:
target_name (str) – Name of the target variable to be calibrated.
feature_config (dict) – Feature configuration defining the features to be used for quantile regression. The configuration is a dictionary whose keys are the names of columns within the dataframe and whose values are lists of strings naming the features. Some features may be used as provided within the dataframe, whilst others may be computed from the data, e.g. mean, std. If the key is itself the feature, e.g. distance to water, then the value should be [“static”]. In this case, the name of the feature, e.g. ‘distance_to_water’, is expected to be a column name in the input dataframe. The config has the structure “DYNAMIC_VARIABLE_CF_NAME”: [“FEATURE1”, “FEATURE2”], e.g. { “air_temperature”: [“mean”, “std”, “altitude”], “visibility_at_screen_level”: [“mean”, “std”], “distance_to_water”: [“static”], }
quantiles (float) – Quantiles used for prediction (values ranging from 0 to 1).
transformation (str) – Transformation to be applied to the data before fitting.
pre_transform_addition (float) – Value to be added before transformation.
unique_site_id_keys (list[str]) – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or [“latitude”, “longitude”].
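As a rough illustration of the feature_config structure described above, the sketch below writes the example configuration as a Python dictionary and separates the static entries from those computed from the data. The helper split_static_and_computed is hypothetical and is not part of the plugin.

```python
# Example configuration taken from the parameter description above.
feature_config = {
    "air_temperature": ["mean", "std", "altitude"],
    "visibility_at_screen_level": ["mean", "std"],
    "distance_to_water": ["static"],
}


def split_static_and_computed(config):
    """Hypothetical helper: separate static columns from computed features."""
    static = []
    computed = {}
    for variable, features in config.items():
        if features == ["static"]:
            # The key itself is the feature and must be a DataFrame column.
            static.append(variable)
        else:
            computed[variable] = list(features)
    return static, computed


static_columns, computed_features = split_static_and_computed(feature_config)
```

Here static_columns lists the keys used as provided, and computed_features maps each remaining variable to the features requested for it.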
- _reverse_transformation(forecast)
Reverse the transformation applied to the data prior to fitting the QRF.
- Parameters:
forecast (ndarray) – Calibrated forecast.
- Returns:
Forecast with the transformation reversed.
- Return type:
forecast
- process(qrf_model, forecast_df)
Apply a quantile regression random forests model.
- Parameters:
qrf_model (RandomForestQuantileRegressor) – A trained QRF model.
forecast_df (DataFrame) – DataFrame containing the forecast information and features.
- Returns:
Calibrated forecast as a numpy array.
- Return type:
ndarray
- class TrainQuantileRegressionRandomForests(target_name, feature_config, n_estimators, max_depth=None, max_samples=None, random_state=None, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id', **kwargs)
Bases: BasePlugin
Plugin to train a model using quantile regression random forests.
- __init__(target_name, feature_config, n_estimators, max_depth=None, max_samples=None, random_state=None, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id', **kwargs)
Initialise the plugin.
- Parameters:
target_name (str) – Name of the target variable to be calibrated e.g. ‘air_temperature’.
feature_config (dict) – Feature configuration defining the features to be used for quantile regression. The configuration is a dictionary whose keys are the names of columns within the dataframe and whose values are lists of strings naming the features. Some features may be used as provided within the dataframe, whilst others may be computed from the data, e.g. mean, std. If the key is itself the feature, e.g. distance to water, then the value should be [“static”]. In this case, the name of the feature, e.g. ‘distance_to_water’, is expected to be a column name in the input dataframe. The config has the structure “DYNAMIC_VARIABLE_CF_NAME”: [“FEATURE1”, “FEATURE2”], e.g. { “air_temperature”: [“mean”, “std”, “altitude”], “visibility_at_screen_level”: [“mean”, “std”], “distance_to_water”: [“static”], }
n_estimators (int) – Number of trees in the forest.
max_depth (int) – Maximum depth of the tree.
max_samples (float) – If an int, then it is the number of samples to draw to train each tree. If a float, then it is the fraction of samples to draw to train each tree. If None, then each tree contains the same total number of samples as originally provided.
random_state (int) – Random seed for reproducibility.
transformation (str) – Transformation to be applied to the data before fitting.
pre_transform_addition (float) – Value to be added before transformation.
unique_site_id_keys (Union[list[str],str]) – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or [“latitude”, “longitude”].
kwargs – Additional keyword arguments for the quantile regression model.
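The three cases described for max_samples can be sketched as a small function. This is only an illustration: the helper samples_per_tree is hypothetical, and the rounding applied to the fractional case is an assumption rather than something stated in the parameter description.

```python
def samples_per_tree(max_samples, n_total):
    """Hypothetical helper: resolve max_samples into a per-tree sample count.

    Args:
        max_samples: None, an int (absolute count) or a float (fraction).
        n_total: Total number of samples originally provided.
    """
    if max_samples is None:
        # Each tree draws the same total number of samples as provided.
        return n_total
    if isinstance(max_samples, int):
        # An int is used directly as the number of samples per tree.
        return max_samples
    # A float is a fraction of the total. Rounding to the nearest integer,
    # with a minimum of one sample, is an assumption made for this sketch.
    return max(1, round(max_samples * n_total))
```

For example, samples_per_tree(0.25, 1000) gives 250 samples per tree, while samples_per_tree(None, 1000) gives the full 1000.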
- fit_qrf(forecast_features, target)
Fit the quantile regression random forest model.
- Parameters:
forecast_features (ndarray) – Array of forecast features.
target (ndarray) – Array of target values.
- Returns:
Fitted quantile regression model.
- Return type:
qrf_model (RandomForestQuantileRegressor)
- process(forecast_df, truth_df)
Train a quantile regression random forests model.
- Parameters:
- Return type:
References
Johnson, R. A. (2024). quantile-forest: A Python Package for Quantile Regression Forests. Journal of Open Source Software, 9(93), 5976. https://doi.org/10.21105/joss.05976
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(35), 983–999. http://jmlr.org/papers/v7/meinshausen06a.html
Taillardat, M., Mestre, O., Zamo, M., and Naveau, P. (2016). Calibrated Ensemble Forecasts Using Quantile Regression Forests and Ensemble Model Output Statistics. Monthly Weather Review, 144, 2375–2393. https://doi.org/10.1175/MWR-D-15-0260.1
Taillardat, M. and Mestre, O. (2020). From research to applications – examples of operational ensemble post-processing in France using machine learning. Nonlinear Processes in Geophysics, 27, 329–347. https://doi.org/10.5194/npg-27-329-2020
- _check_valid_transformation(transformation)
Check if the transformation is one of the supported types.
- Parameters:
transformation (str) – Transformation to be checked.
- Raises:
ValueError – If the transformation is not one of the supported types.
- _drop_nans_from_forecast_df(forecast_df, merge_columns, feature_column_names, valid_forecast_proportion=0.5)
Drop any NaNs from the forecast DataFrame. Extraneous columns, e.g. period, are excluded so that NaNs can be dropped across all columns without removing data because of NaNs in unused columns.
- Parameters:
forecast_df (DataFrame) – Input forecast DataFrame.
merge_columns (list[str]) – Columns used for merging forecast and truth DataFrames.
feature_column_names (list[str]) – Names of the feature columns.
valid_forecast_proportion (float) – Proportion of forecast data that can be removed.
- Raises:
ValueError – If more than the specified proportion of the forecast data has been removed after dropping NaNs.
- Return type:
- apply_transformation(data, transformation, pre_transform_addition)
Apply the specified transformation to the data.
- Parameters:
data (ndarray) – Data to be transformed.
transformation (str) – Transformation to be applied.
pre_transform_addition (float) – Value to be added before transformation.
- Returns:
Transformed data.
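The relationship between applying a transformation with a pre-transform addition and reversing it afterwards can be sketched as below. This is a minimal illustration, not the plugin's implementation: the set of transformation names ("log", "log10", "sqrt", "cbrt") is an assumption, since the supported types are not listed here, and the functions transform and reverse_transform are hypothetical and operate on plain lists rather than arrays.

```python
import math

# Assumed transformation names; the supported list is not given in this page.
FORWARD = {
    "log": math.log,
    "log10": math.log10,
    "sqrt": math.sqrt,
    "cbrt": lambda x: math.copysign(abs(x) ** (1.0 / 3.0), x),
}
INVERSE = {
    "log": math.exp,
    "log10": lambda y: 10.0 ** y,
    "sqrt": lambda y: y ** 2,
    "cbrt": lambda y: y ** 3,
}


def transform(values, transformation, pre_transform_addition=0.0):
    """Add the offset, then apply the named transformation to each value."""
    if transformation is None:
        return list(values)
    if transformation not in FORWARD:
        raise ValueError(f"Unsupported transformation: {transformation}")
    func = FORWARD[transformation]
    return [func(value + pre_transform_addition) for value in values]


def reverse_transform(values, transformation, pre_transform_addition=0.0):
    """Invert the transformation, then subtract the offset again."""
    if transformation is None:
        return list(values)
    if transformation not in INVERSE:
        raise ValueError(f"Unsupported transformation: {transformation}")
    func = INVERSE[transformation]
    return [func(value) - pre_transform_addition for value in values]


original = [0.0, 1.5, 10.0]
transformed = transform(original, "log", pre_transform_addition=1.0)
recovered = reverse_transform(transformed, "log", pre_transform_addition=1.0)
```

In this sketch the addition of 1.0 keeps the logarithm defined for a value of zero, and recovered matches original to within floating-point precision.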
- prep_feature(df, variable_name, feature_name, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id')
Prepare features that require computation from the input DataFrame.
Options available are the mean, standard deviation, min, max and percentiles of the input feature, counts of the members above and of the members below a threshold, the day of year, sine of day of year, cosine of day of year, hour of day, sine of hour of day and cosine of hour of day.
When computing the mean or standard deviation, these will be computed over either the percentile or realization column, depending upon which is available. When a percentile column is provided, the expectation is that these percentiles are equally spaced between 0 and 100, so that these percentiles can be treated as being equally likely.
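The ensemble statistics described above can be sketched for a single site and time using only the standard library. The values are illustrative, and two details are assumptions made for this sketch because they are not stated here: the standard deviation is taken as the population form, and the member counts use strict inequalities.

```python
import statistics

# Illustrative values at five equally spaced percentiles (e.g. 10, 30, 50,
# 70 and 90) for one site and time. Equal spacing lets each value be treated
# as equally likely, so an unweighted mean is appropriate.
percentile_values = [271.2, 272.0, 273.4, 274.1, 275.3]
threshold = 273.15

mean_value = statistics.fmean(percentile_values)
# Population standard deviation is an assumption made for this sketch.
std_value = statistics.pstdev(percentile_values)
# Strict inequalities are also an assumption.
members_above = sum(value > threshold for value in percentile_values)
members_below = sum(value < threshold for value in percentile_values)
```

The same calculations apply to a realization column, where each realization is likewise treated as equally likely.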
- Parameters:
df (DataFrame) – Input DataFrame.
variable_name (str) – Name of the variable to be used for the computation.
feature_name (str) – Feature to be computed. Options are “mean”, “std”, “min”, “max”, “percentile_<perc>” where <perc> is the required percentile between 0 and 100, “members_below_<threshold>” where <threshold> is the threshold value to count the number of members below, “members_above_<threshold>” where <threshold> is the threshold value to count the number of members above, “day_of_year”, “day_of_year_sin”, “day_of_year_cos”, “hour_of_day”, “hour_of_day_sin” and “hour_of_day_cos”.
transformation (Optional[str]) – Transformation to be applied to the data before fitting. This is only used when computing members_below or members_above features.
pre_transform_addition (float32) – Value to be added before transformation. This is only used when computing members_below or members_above features.
unique_site_id_keys (Union[list[str],str]) – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or [“latitude”, “longitude”].
- Returns:
DataFrame with the computed feature added.
- Return type:
df
- Raises:
ValueError – If all computed values for the feature are NaN.
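The sine and cosine features named above encode cyclic quantities so that, for example, hour 23 and hour 0 end up close together. A minimal sketch of one possible encoding follows. The function cyclic_time_features is hypothetical, and the choice of 365.25 days per year and 24 hours per day as the periods is an assumption, not taken from the plugin.

```python
import math
from datetime import datetime


def cyclic_time_features(when, days_in_year=365.25):
    """Hypothetical helper: derive day-of-year and hour-of-day features."""
    day_of_year = when.timetuple().tm_yday
    hour_of_day = when.hour
    # Map each quantity onto an angle around a circle of the assumed period.
    day_angle = 2.0 * math.pi * day_of_year / days_in_year
    hour_angle = 2.0 * math.pi * hour_of_day / 24.0
    return {
        "day_of_year": day_of_year,
        "day_of_year_sin": math.sin(day_angle),
        "day_of_year_cos": math.cos(day_angle),
        "hour_of_day": hour_of_day,
        "hour_of_day_sin": math.sin(hour_angle),
        "hour_of_day_cos": math.cos(hour_angle),
    }


features = cyclic_time_features(datetime(2024, 1, 1, 6))
```

Using both the sine and the cosine gives each time a unique position on the circle, which a single sine or cosine value alone would not.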
- prep_features_from_config(df, feature_config, transformation=None, pre_transform_addition=0, unique_site_id_keys='wmo_id')
Process the feature_config to prepare the features as required and return the expected column names that will be used as features with the QRF.
- Parameters:
df (DataFrame) – Input DataFrame.
feature_config (dict[str,list[str]]) – Feature configuration defining the features to be used for QRF.
transformation (Optional[str]) – Transformation to be applied to the data before fitting. This is only used when computing members_below or members_above features.
pre_transform_addition (float32) – Value to be added before transformation. This is only used when computing members_below or members_above features.
unique_site_id_keys (Union[list[str],str]) – The names of the coordinates that uniquely identify each site, e.g. “wmo_id” or [“latitude”, “longitude”].
- Return type:
- Returns:
Processed DataFrame and a list of expected column names that will be used as features with the QRF.
- Raises:
ValueError – If a variable expected in the feature_config is not present in the DataFrame e.g. "surface temperature".
ValueError – If a feature expected for a specific variable in the feature_config is not supported e.g. "interquartile_range".
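The two error conditions above can be illustrated with a small validation sketch. The functions find_config_problems and check_feature_config are hypothetical, and the regular expression used to recognise the parameterised feature names is an assumption about their format; the feature names themselves are those listed for prep_feature.

```python
import re

# Feature names listed for prep_feature, plus "static" from feature_config.
SIMPLE_FEATURES = {
    "mean", "std", "min", "max", "static",
    "day_of_year", "day_of_year_sin", "day_of_year_cos",
    "hour_of_day", "hour_of_day_sin", "hour_of_day_cos",
}
# Assumed format of the parameterised names, e.g. "percentile_50".
PARAMETERISED = re.compile(
    r"^(percentile|members_below|members_above)_-?\d+(\.\d+)?$"
)


def find_config_problems(feature_config, available_columns):
    """Hypothetical helper: list problems found in a feature configuration."""
    problems = []
    for variable, features in feature_config.items():
        if variable not in available_columns:
            problems.append(f"Variable '{variable}' is not in the DataFrame.")
        for feature in features:
            if feature not in SIMPLE_FEATURES and not PARAMETERISED.match(feature):
                problems.append(f"Feature '{feature}' is not supported.")
    return problems


def check_feature_config(feature_config, available_columns):
    """Raise ValueError if the configuration has any problems."""
    problems = find_config_problems(feature_config, available_columns)
    if problems:
        raise ValueError(" ".join(problems))


columns = ["air_temperature", "distance_to_water", "percentile"]
good = {"air_temperature": ["mean", "percentile_50"], "distance_to_water": ["static"]}
bad = {"surface temperature": ["mean"], "air_temperature": ["interquartile_range"]}
good_problems = find_config_problems(good, columns)
bad_problems = find_config_problems(bad, columns)
```

Collecting every problem before raising is a design choice of this sketch, so that a user sees all configuration errors at once.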
- quantile_forest_package_available()
Return True if quantile_forest package is available, False otherwise.
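One common way to implement an availability check of this kind is with importlib.util.find_spec from the standard library, which looks up a package without importing it. The sketch below shows the general technique; it is not necessarily how this function is written, and the helper name package_available is hypothetical.

```python
import importlib.util


def package_available(name):
    """Return True if the named top-level package can be found."""
    # find_spec returns None when no top-level package of that name exists.
    return importlib.util.find_spec(name) is not None


# Check for the optional dependency before relying on it.
have_quantile_forest = package_available("quantile_forest")
```

A check like this lets code that depends on an optional package be skipped, or fail with a clear message, when the package is not installed.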
- sanitise_forecast_dataframe(df, feature_config)
Sanitise the forecast DataFrame by removing columns that are no longer required. Following the computation of e.g. the mean or standard deviation, the original feature can be removed. The column over which the mean or standard deviation has been computed (e.g. the percentile or realization column) is also removed.