improver.clustering.realization_clustering module

Plugins to perform clustering on realizations within a cube.

class RealizationClusterAndMatch(hierarchy, model_id_attr, clustering_method, target_grid_name=None, regrid_mode='esmf-area-weighted', regrid_for_clustering=True, renumber_primary_realizations=True, regrid_kwargs=None, cycletime=None, **kwargs)[source]#

Bases: BasePlugin

Cluster primary input realizations and match secondary inputs to clusters.

This plugin performs KMedoids clustering on a primary input, then matches secondary input realizations to the resulting clusters based on mean squared error. When multiple secondary inputs are provided, their order in the secondary_inputs dictionary of the hierarchy determines their precedence: inputs listed earlier (leftmost) have higher priority and can overwrite matches from later, lower-priority inputs for overlapping forecast periods. See the Parameters section of __init__ for details on how the hierarchy is specified and used.

See also

For a practical usage example, see: doc/source/examples/realization_cluster_and_match_example_data.py.

__init__(hierarchy, model_id_attr, clustering_method, target_grid_name=None, regrid_mode='esmf-area-weighted', regrid_for_clustering=True, renumber_primary_realizations=True, regrid_kwargs=None, cycletime=None, **kwargs)[source]#

Initialise the clustering and matching class.

Parameters:

hierarchy (dict[str, str | dict[str, list[int]]]) –
The hierarchy of inputs defining the primary input, which is clustered, and secondary inputs, which are matched to each cluster. The order of the secondary_inputs is used as the priority for matching. The list values for secondary inputs specify forecast periods in hours. A two-element list [start, end] will be expanded to include all hours in that inclusive range. Lists with other lengths are treated as explicit lists of forecast period hours. All values will be automatically converted to seconds to match the forecast_period coordinate units in the input cubes:
```
{
    "primary_input": "input1",
    "secondary_inputs": {"input2": [0, 24], "input3": [0, 6]},
}
```
In this example, input2 will use forecast periods in the range 0 to 24 hours inclusive (i.e., any forecast periods between 0 and 86400 seconds), and input3 will use the range 0 to 6 hours inclusive (0 to 21600 seconds). For lead times where secondary inputs are not provided, the primary input will be used. Only forecast periods that actually exist in the input cubes within these ranges will be processed.
model_id_attr (str) – The model ID attribute used to identify different models within the input cubes.
target_grid_name (str | None) – The name of the target grid cube for regridding. Only required if regrid_for_clustering is True.
clustering_method (str) – The clustering method to use.
regrid_mode (str) – The regridding mode to use. Default is “esmf-area-weighted”. See RegridLandSea for available modes.
regrid_for_clustering (bool) – If True, regrid all cubes (primary and secondary) to the target grid before clustering and matching. This regridding step speeds up the computation by reducing the data size and, importantly, emphasises larger-scale spatial features in the data, rather than small-scale detail. This helps the clustering focus on the most relevant broad patterns rather than being dominated by fine-scale noise. If False, clustering and matching are performed on the original grids without regridding. Default is True.
renumber_primary_realizations (bool) – If True (default), primary input cubes will have their realization coordinates renumbered to contiguous integers (0 to n_realizations-1) after clustering and matching. This allows seamless merging of primary cubes with different realization numbering schemes. If False, original realization numbering is preserved. When False, a UserWarning is issued if primary input cubes have differing realization numbering, as this may cause merge failures. Defaults to True for automatic renumbering behaviour.
regrid_kwargs (dict[str, Any] | None) –
Additional keyword arguments to pass to RegridLandSea. Common options include:
- mdtol (float): Tolerance of missing data (default 1).
- extrapolation_mode (str): Mode to fill regions outside domain.
- landmask (Cube): Land-sea mask for mask-aware regridding.
- landmask_vicinity (float): Radius for coastline search.
cycletime (str | None) – The forecast_reference_time on the input cubes will be reset to this value. The forecast periods will be adjusted accordingly with the validity times kept fixed. cycletime should be provided in the format YYYYMMDDTHHMMZ (e.g., 20240101T0000Z). If not provided, the forecast_reference_time on the input cubes will be left unchanged.
**kwargs (Any) – Additional arguments for the clustering method.

Raises:

ValueError – If regrid_for_clustering is True but target_grid_name is None.
NotImplementedError – If the clustering method is not supported (currently only KMedoids is supported).
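
The cycletime behaviour described above (reset the forecast_reference_time, keep the validity time fixed, adjust the forecast period) can be illustrated with plain datetime arithmetic. This is a minimal sketch, not the plugin's implementation: the helper names parse_cycletime and adjusted_forecast_period are hypothetical, and the only thing taken from the documentation is the YYYYMMDDTHHMMZ format.

```python
from datetime import datetime, timedelta, timezone


def parse_cycletime(cycletime: str) -> datetime:
    """Parse a YYYYMMDDTHHMMZ string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(cycletime, "%Y%m%dT%H%MZ")
    return parsed.replace(tzinfo=timezone.utc)


def adjusted_forecast_period(validity_time: datetime, new_cycletime: datetime) -> int:
    """Forecast period in seconds once the reference time has been reset.

    The validity time is held fixed, so the forecast period is simply the
    gap between the validity time and the new reference time.
    """
    return int((validity_time - new_cycletime).total_seconds())


# A forecast valid 9 hours after a 00Z cycle ...
old_cycle = parse_cycletime("20240101T0000Z")
validity = old_cycle + timedelta(hours=9)

# ... becomes a 3 hour forecast when the cycle is reset to 06Z.
new_cycle = parse_cycletime("20240101T0600Z")
new_forecast_period = adjusted_forecast_period(validity, new_cycle)
```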

_abc_impl = <_abc._abc_data object>#

_categorise_secondary_inputs(cubes, n_clusters, primary_cube)[source]#

Categorise secondary inputs by full or partial realizations.

This method also validates that secondary inputs don’t request forecast periods not present in the primary input. If such forecast periods are found, a warning is issued and those periods are filtered out.

Parameters:

cubes (CubeList) – CubeList containing all input cubes (primary and secondary) required for clustering and matching. Each cube should be identifiable by the model_id_attr.
n_clusters (int) – Number of clusters (realizations), created from the primary input, required to be considered a ‘full’ input.
primary_cube (Cube) – The primary input cube, used to filter forecast periods.

Returns:

Tuple of (full_realization_inputs, partial_realization_inputs):

full_realization_inputs: List of (model_name, forecast_periods) tuples for secondary inputs with at least n_clusters realizations for the relevant forecast periods. The forecast_periods are the forecast periods (in seconds) that exist in the cubes within the specified hour range and are present in the primary input.
partial_realization_inputs: List of (model_name, forecast_periods) tuples for secondary inputs with fewer than n_clusters realizations for the relevant forecast periods. The forecast_periods are the forecast periods (in seconds) that exist in the cubes within the specified hour range and are present in the primary input.

static _convert_hours_to_seconds(hours)[source]#

Convert a list of hours to seconds.

Parameters: hours (list[int]) – List of forecast period values in hours.
Return type: list[int]
Returns: List of forecast period values in seconds.

_ensure_forecast_period_is_dimension(cube)[source]#

Ensure forecast_period is a dimension coordinate and realization is first.

If forecast_period exists but is not a dimension coordinate (i.e., it’s scalar or auxiliary), promote it to a dimension coordinate using new_axis. Then ensure realization is the leading dimension. Also ensures that the time coordinate is associated with the forecast_period dimension to avoid it being scalar.

Parameters: cube (Cube) – The cube to check and potentially modify.
Return type: Cube
Returns: The cube with forecast_period as a dimension coordinate (if it exists), time associated with the forecast_period dimension, and realization as the first dimension.

static _expand_forecast_period_range(fp_range)[source]#

Expand a forecast period range [start, end] to a list of integers.

Parameters: fp_range (list[int]) – A list containing either [start, end] values defining a range in hours, or a list of specific forecast period hours.
Return type: list[int]
Returns: If fp_range has 2 elements, returns integers from start to end inclusive in steps of 1 hour. Otherwise, returns the list as-is.
Raises: ValueError – If start > end (when 2 elements provided).
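
The two static helpers above (range expansion and hour-to-second conversion) can be sketched in a few lines. This is an illustration written from the documented behaviour only, not the module's source; the function names drop the leading underscore to make clear they are stand-ins.

```python
def expand_forecast_period_range(fp_range: list[int]) -> list[int]:
    """Expand [start, end] to every hour in the inclusive range.

    Lists with any other length are treated as explicit hours and
    returned unchanged.
    """
    if len(fp_range) == 2:
        start, end = fp_range
        if start > end:
            raise ValueError(f"Range start {start} is greater than end {end}.")
        return list(range(start, end + 1))
    return list(fp_range)


def convert_hours_to_seconds(hours: list[int]) -> list[int]:
    """Convert forecast periods from hours to seconds."""
    return [hour * 3600 for hour in hours]


# A two-element list is a range; a three-element list is taken literally.
range_seconds = convert_hours_to_seconds(expand_forecast_period_range([0, 6]))
explicit_seconds = convert_hours_to_seconds(expand_forecast_period_range([0, 6, 12]))
```

Note that an explicit list of exactly two hours cannot be expressed this way, because any two-element list is interpreted as a range.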

_initialise_matched_cubes_with_primary(clustered_primary_cube)[source]#

Initialise matched_cubes with clustered primary cube for all periods.

This ensures we always have a full set of realizations to work with as a base, which can then be selectively replaced by secondary inputs.

Parameters: clustered_primary_cube (Cube) – The clustered primary cube containing all forecast periods.
Return type: CubeList
Returns: A CubeList containing one cube per forecast period from the clustered primary cube, each with forecast_period as a dimension coordinate.

_maybe_regrid_candidate_cube(candidate_cube, target_grid_cube)[source]#

Regrid the candidate cube if regrid_for_clustering is True, otherwise return as is.

Parameters:

candidate_cube (Cube) – The input candidate Cube to potentially regrid.
target_grid_cube (Cube) – The target grid Cube to regrid onto if regridding is enabled.

Return type:

Cube

Returns:

The regridded candidate Cube if regrid_for_clustering is True, otherwise the original candidate Cube.

_process_full_realization_inputs(full_realization_inputs, cubes, target_grid_cube, regridded_clustered_primary_cube, replaced_realizations, matched_cubes, cluster_sources, secondary_input_realizations_to_clusters)[source]#

Process full realization inputs in reverse precedence order.

This method replaces entire forecast period cubes with secondary inputs that have at least as many realizations as the number of clusters to which they are being matched. It processes inputs working from lowest to highest precedence so that higher precedence inputs can overwrite lower precedence ones.

Parameters:

full_realization_inputs (list[tuple[str, list[int]]]) – List of (name, forecast_periods) tuples for inputs with full realization sets.
cubes (CubeList) – The input CubeList containing all data.
target_grid_cube (Cube) – The target grid cube for regridding.
regridded_clustered_primary_cube (Cube) – The regridded clustered primary cube.
replaced_realizations (dict[int, set[int]]) – Dictionary tracking which (forecast_period, cluster) pairs have been replaced. Modified in-place.
matched_cubes (CubeList) – CubeList containing cubes to modify. Modified in-place.
cluster_sources (dict[int, dict[str, list[int]]]) – Dictionary tracking which input was used for each cluster at each forecast period. Modified in-place. Format: {cluster_idx: {model_name: [fp1, fp2, …]}}
secondary_input_realizations_to_clusters (dict[str, dict[int, list[int]]]) – Dictionary tracking which realizations from secondary inputs correspond to which clusters. Modified in-place. Format: {secondary_input_name: {forecast_period: {cluster_index: [realization_indices]}}}

Return type:

None

_process_partial_realization_inputs(partial_realization_inputs, cubes, target_grid_cube, regridded_clustered_primary_cube, replaced_realizations, matched_cubes, cluster_sources, secondary_input_realizations_to_clusters)[source]#

Process partial realization inputs in reverse precedence order.

This method selectively replaces specific realizations at specific forecast periods. It processes inputs with fewer realizations than the number of clusters to which they are being matched. It works from lowest to highest precedence so that higher precedence inputs can overwrite lower precedence ones.

Parameters:

partial_realization_inputs (list[tuple[str, list[int]]]) – List of (name, forecast_periods) tuples for inputs with partial realization sets.
cubes (CubeList) – CubeList containing all primary and secondary input cubes needed for clustering and matching. Each cube must have the model_id_attr attribute set, and all relevant models, forecast periods, and realizations to be processed or matched should be included.
target_grid_cube (Cube) – The target grid cube for regridding.
regridded_clustered_primary_cube (Cube) – The regridded clustered primary cube.
replaced_realizations (dict[int, set[int]]) – Dictionary tracking which (forecast_period, cluster) pairs have been replaced. Modified in-place.
matched_cubes (CubeList) – CubeList to append/modify matched results. Modified in-place.
cluster_sources (dict[int, dict[str, list[int]]]) – Dictionary tracking which input was used for each cluster at each forecast period. Modified in-place. Format: {cluster_idx: {model_name: [fp1, fp2, …]}}
secondary_input_realizations_to_clusters (dict[str, dict[int, list[int]]]) – Dictionary tracking which realizations from secondary inputs correspond to which clusters. Modified in-place. Format: {secondary_input_name: {forecast_period: {cluster_index: [realization_indices]}}}

Return type:

None

_select_realizations_for_kmedoid_clusters(primary_cube, clustering_result)[source]#

Select the realizations corresponding to the medoid indices from the clustering result.

Parameters:

primary_cube (Cube) – The input cube to select realizations from.
clustering_result (KMedoids) – The result of the clustering.

Returns:

cube_clustered: The clustered cube.

Raises:

ValueError – If the number of clusters is greater than the number of realizations in the input cube.

_update_cluster_sources(cluster_sources, cluster_indices, candidate_name, fp)[source]#

Update cluster sources tracking when replacing data from one model with another.

This method removes the forecast period from the primary input’s tracking and adds it to the secondary input for the specified clusters, maintaining a record of which model provided data for each cluster at each forecast_period.

Parameters:

cluster_sources (dict[int, dict[str, list[int]]]) – Dictionary tracking which input was used for each cluster at each forecast period. Modified in-place. Format: {cluster_idx: {model_name: [fp1, fp2, …]}}
cluster_indices (list[int]) – List of cluster indices being updated.
candidate_name (str) – Name of the secondary input being added e.g. ‘secondary_input1’.
fp (int) – Forecast period value in seconds.

Return type:

None

cluster_primary_input(primary_cube, target_grid_cube)[source]#

Cluster the primary input cube. If regridding is enabled, the primary input cube is regridded to the target grid before clustering using the specified regridding method. Please see RegridLandSea for available modes.

Parameters:

primary_cube (Cube) – The primary input cube to cluster.
target_grid_cube (Cube | None) – The target grid cube for regridding. Can be None if regrid_for_clustering is False.

Return type:

tuple[Cube, Cube]

Returns:

Tuple of the clustered primary input cube and the regridded clustered primary input cube (or the same cube twice if regridding is disabled).

compact_secondary_mapping(secondary_input_realizations_to_clusters)[source]#

Compact the mapping of secondary input realizations to clusters by grouping forecast periods for each unique realization assignment per cluster.

Parameters:

secondary_input_realizations_to_clusters (dict[str, dict[int, dict[int, list[int]]]]) –

A nested dictionary mapping secondary input names to forecast periods, then to cluster indices, then to lists of realization indices:

{
    secondary_input_name: {
        forecast_period: {
            cluster_idx: [realization_index]
        }
    }
}

Return type:

dict[str, dict[int, list[dict[str, list[list[int]] | list[int]]]]]

Returns:

A compacted dictionary mapping each secondary input name to a dictionary of cluster indices, each containing a list of dicts with:

- “realization”: the realization index assigned.
- “forecast_periods”: a sorted list of forecast periods.

Example

{
    secondary_input_name: {
        cluster_idx: [
            {"realization": 3, "forecast_periods": [3600, 7200, 10800]}
        ]
    }
}

Note

Only one realization is assigned to each cluster for each forecast period.
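
The compaction described above can be sketched with plain dictionaries. This is a hedged illustration based only on the documented input and output formats, not the method's source: the function name compact_mapping and the input name "secondary_input1" are hypothetical, and the sketch relies on the note that each cluster has a single realization per forecast period.

```python
from collections import defaultdict


def compact_mapping(mapping):
    """Group forecast periods by (cluster, realization) for each secondary input."""
    compacted = {}
    for name, by_forecast_period in mapping.items():
        # cluster index -> realization index -> list of forecast periods
        grouped = defaultdict(lambda: defaultdict(list))
        for forecast_period, by_cluster in by_forecast_period.items():
            for cluster_idx, realizations in by_cluster.items():
                # Assumption: exactly one realization per cluster and period.
                grouped[cluster_idx][realizations[0]].append(forecast_period)
        compacted[name] = {
            cluster_idx: [
                {"realization": realization, "forecast_periods": sorted(periods)}
                for realization, periods in by_realization.items()
            ]
            for cluster_idx, by_realization in grouped.items()
        }
    return compacted


# Cluster 0 uses realization 3 at 1 h and 2 h, then realization 5 at 3 h.
example = {"secondary_input1": {7200: {0: [3]}, 3600: {0: [3]}, 10800: {0: [5]}}}
compacted_example = compact_mapping(example)
```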

process(cubes)[source]#

Cluster and match the data.

This method clusters the primary input realizations and matches secondary input realizations to the resulting clusters, according to the specified hierarchy and precedence. The realizations in the primary input can be renumbered if desired.

Parameters:

cubes (CubeList) –

The input CubeList containing all primary and secondary input cubes required for clustering and matching. Each cube must have the model_id_attr attribute set to identify its source/model. For each model (primary and secondary), include all forecast periods and realizations that should be considered for matching or replacement.

Expected input shapes:

2D: (y, x)
    for single realization, single forecast period fields.
3D: (realization, y, x)
    for multiple realizations at a single forecast period.
4D: (realization, forecast_period, y, x)
    for multiple realizations and multiple forecast periods.

The leading dimension must always be realization if present. For 4D cubes, the second dimension must be forecast_period.

Return type:

Cube

Returns:

The matched cube containing all secondary inputs matched to clusters. The output cube will have realization and forecast_period as leading dimensions (if present in the input), followed by spatial dimensions (y, x). The returned cube includes the following JSON string attributes:

- ‘primary_input_realizations_to_clusters’: tracks which primary input realizations were assigned to each cluster.
- ‘secondary_input_realizations_to_clusters’: tracks which secondary input realization was assigned to each cluster per forecast period.
- ‘cluster_sources’: tracks which input model provided the final data for each cluster-forecast_period pairing.

Raises:

ValueError – If no primary cube is found with the specified model_id_attr.

Warning

UserWarning: If primary cubes have different realization numbering schemes when renumber_primary_realizations=False, which may cause merge failures.
UserWarning: If no secondary inputs have forecast periods that overlap with the primary input, in which case only the clustered primary input will be returned.
UserWarning: If secondary inputs have forecast periods not present in the primary input, which will be ignored.

track_secondary_realizations_to_clusters(secondary_input_realizations_to_clusters, cluster_indices, realization_indices, candidate_name, fp, candidate_cube)[source]#

Track which secondary realizations contributed to each cluster for a given secondary input, forecast period, and candidate cube.

This updates the provided dictionary in-place, ensuring all keys and values are native Python ints for serialization compatibility.

Parameters:

secondary_input_realizations_to_clusters (dict[str, dict[int, dict[int, list[int]]]]) –
Nested dictionary to update, mapping secondary input names to forecast periods, then to cluster indices, then to lists of realization indices:

{
    secondary_input_name: {
        forecast_period: {
            cluster_idx: [realization_indices]
        }
    }
}
cluster_indices (list[int]) – List of cluster indices assigned for this forecast period.
realization_indices (list[int]) – List of realization indices from the candidate cube that were assigned to each cluster.
candidate_name (str) – Name of the secondary input/model.
fp (int) – Forecast period (in seconds).
candidate_cube (Cube) – The candidate Cube from which realization indices are drawn.

Return type:

None

Returns:

None. The dictionary is updated in-place.

class RealizationClustering(clustering_method, **kwargs)[source]#

Bases: BasePlugin

Class to perform clustering on realizations of a cube. For example, this can be used to cluster a large number of ensemble members based on their spatial patterns into a smaller set of distinct clusters. If the input is precipitation forecasts, the resultant clusters could represent different types of precipitation events.

__init__(clustering_method, **kwargs)[source]#

Initialise the RealizationClustering class.

Parameters:

clustering_method (str) – The clustering method to use. See the improver.clustering.FitClustering class.
**kwargs (Any) – Additional arguments for the clustering method.

_abc_impl = <_abc._abc_data object>#

static _convert_to_2d(array)[source]#

Convert an array with arbitrary dimensions to a 2D array by maintaining the zeroth dimension and flattening all other dimensions.

This prepares the data for clustering algorithms that expect 2D input, where rows are samples (realizations) and columns are features (e.g. spatial points x forecast periods). For example, an array of shape (18, 4, 100, 100) is converted to shape (18, 40000).

Parameters: array (ndarray) – The input array to convert. Can have any number of dimensions.
Returns: array_2d: The converted 2D array with shape (array.shape[0], -1).
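
The shape arithmetic behind this flattening can be checked without any array library. The sketch below only computes the resulting shape; the helper name flattened_2d_shape is hypothetical, and with NumPy the equivalent operation would be a reshape to (array.shape[0], -1) as stated in the Returns entry.

```python
import math


def flattened_2d_shape(shape: tuple[int, ...]) -> tuple[int, int]:
    """Return the 2D shape obtained by keeping axis 0 and flattening the rest."""
    # math.prod of an empty tuple is 1, so a 1D input maps to (n, 1).
    return (shape[0], math.prod(shape[1:]))


# 18 realizations, 4 forecast periods, 100 x 100 grid points.
shape_2d = flattened_2d_shape((18, 4, 100, 100))
```

Each row then holds every forecast period and grid point for one realization, which is the samples-by-features layout that clustering algorithms typically expect.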

process(cube)[source]#

Apply the clustering method to the cube.

Cubes with more than 2 dimensions are converted to 2D arrays before clustering by flattening all dimensions except the leading dimension. The leading dimension is assumed to be the realization dimension.

Parameters: cube (Cube) – The input cube to cluster with the realization dimension as the leading dimension.
Return type: Any
Returns: The result of the clustering algorithm applied to the input data.
Raises: ValueError – If the leading dimension of the input cube is not the realization dimension.

class RealizationSelection(forecast_period, model_id_attr='mosg__model_configuration', cycletime=None, selection_attr=None, selection_attr_value='cluster_medoid')[source]#

Bases: BasePlugin

Plugin to select realizations based on clustering results.

This plugin is intended to be used with the output from the RealizationClusterAndMatch plugin. A typical use case is where RealizationClusterAndMatch has performed clustering and matching using a subset of forecast periods (for computational efficiency or other reasons), but you wish to apply the resulting cluster assignments to any forecast period. The RealizationSelection plugin enables this by selecting and relabelling realizations from the original forecast cubes according to the cluster mapping attributes stored in the cluster cube output by RealizationClusterAndMatch.

To use this plugin, provide as input the same forecast cubes as were supplied to RealizationClusterAndMatch (but strictly only at a single forecast period), together with the cluster cube output from RealizationClusterAndMatch.

__init__(forecast_period, model_id_attr='mosg__model_configuration', cycletime=None, selection_attr=None, selection_attr_value='cluster_medoid')[source]#

Initialise the RealizationSelection plugin.

Parameters:

forecast_period (int) – The forecast period (in seconds) to use for interrogating the cluster mapping attributes in order to select the appropriate realizations.
model_id_attr (str) – The name of the cube attribute used to identify the model source.
cycletime (Optional[str]) – The forecast_reference_time on the input forecast cubes will be reset to this value. The forecast periods will be adjusted accordingly with the validity times kept fixed. cycletime should be provided in the format YYYYMMDDTHHMMZ (e.g., 20240101T0000Z). If not provided, the forecast_reference_time on the input cubes will be left unchanged.
selection_attr (Optional[str]) – Optional name of a cube attribute to add to the output to identify that these realizations were selected using this plugin. If not provided (None), no attribute is added. Example: “realization_selection_method”.
selection_attr_value (str) – The value (e.g. a description of the selection method) to assign to the selection_attr attribute. Default is “cluster_medoid”. Only used if selection_attr is provided.

_abc_impl = <_abc._abc_data object>#

_extract_primary_model_from_cluster_sources(cluster_cube)[source]#

Extract the primary model name from the cluster_sources attribute.

The primary model is identified as the model that appears in the most clusters, which corresponds to the model used for initial clustering.

Parameters: cluster_cube (Cube) – The cluster cube output from RealizationClusterAndMatch, containing the cluster_sources attribute as a JSON string.
Return type: str
Returns: The primary model name.
Raises: ValueError – If cluster_sources attribute is not found in the cube.
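
The "appears in the most clusters" rule can be sketched with the standard library. This is an illustration of the documented rule, not the method's source: the function name and the model names "global" and "regional" are hypothetical, and the JSON layout follows the cluster_sources format documented for RealizationClusterAndMatch ({cluster_idx: {model_name: [fp1, fp2, …]}}).

```python
import json
from collections import Counter


def primary_model_from_cluster_sources(cluster_sources_json: str) -> str:
    """Return the model name that appears in the largest number of clusters."""
    cluster_sources = json.loads(cluster_sources_json)
    # Count each model once per cluster in which it is listed.
    counts = Counter(
        model_name
        for sources in cluster_sources.values()
        for model_name in sources
    )
    return counts.most_common(1)[0][0]


# JSON object keys are always strings, so cluster indices are serialised as "0", "1", ...
attribute = json.dumps(
    {
        0: {"global": [0, 3600]},
        1: {"global": [0], "regional": [3600]},
        2: {"global": [0, 3600]},
    }
)
primary_model = primary_model_from_cluster_sources(attribute)
```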

_remove_blend_time_from_selected_cubes(selected_cubes)[source]#

Remove blend_time coordinate from all selected cubes if present on any.

blend_time is removed to avoid ambiguity in the merged output, as selected cubes may come from different source models with differing blend_time values.

Parameters: selected_cubes (list[Cube]) – Realization-selected cubes, modified in place.
Return type: None

build_cluster_to_selection(nearest_fp, use_secondary, secondary_map, primary_map, cluster_cube)[source]#

Build a mapping from cluster index to (model name, realization index) for selection.

Parameters:

nearest_fp (int) – The forecast period (in seconds) from the secondary mapping that is nearest while being greater than or equal to the requested forecast period.
use_secondary (bool) – Whether to use the secondary mapping (True) or fall back to the primary mapping (False). Determined by find_nearest_secondary_mapping_fp method.
secondary_map (dict[str, dict[int, list[dict[str, list[int]]]]]) – Dictionary mapping secondary input names to cluster mappings, where each cluster index maps to a list of dicts with “realization” and “forecast_periods”.
primary_map (dict[str, int]) – Dictionary mapping cluster index (as string) to medoid realization index (int).
cluster_cube (Cube) – The cluster cube output from RealizationClusterAndMatch, containing the cluster mapping attributes. Used to determine the model name for the primary input when assigning realizations to clusters.

Return type:

dict[int, tuple[str, int]]

Returns:

Dictionary mapping cluster index (int) to a tuple of (model name, realization index).

find_nearest_secondary_mapping_fp(mapping_fps, fp)[source]#

Find the nearest forecast period in the secondary mapping that is greater than or equal to the requested forecast period.

Parameters:

mapping_fps (Optional[set[int]]) – Set of forecast periods (in seconds) available in the secondary mapping.
fp (int) – The forecast period (in seconds) for which to find the nearest greater-than-or-equal mapping.

Returns:

A tuple containing:

nearest_fp: The smallest forecast period from mapping_fps that is greater than or equal to fp (or fp if mapping_fps is empty).
use_secondary: Boolean indicating whether the secondary mapping should be used (True if at least one forecast period in mapping_fps is greater than or equal to fp, else False).
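
The lookup described above amounts to a filtered minimum. The sketch below follows the documented return values; the function name find_nearest_fp is a stand-in, and returning fp itself when mapping_fps is non-empty but contains no period at or beyond fp is an assumption, since the documentation only specifies that case for an empty set.

```python
from typing import Optional


def find_nearest_fp(mapping_fps: Optional[set[int]], fp: int) -> tuple[int, bool]:
    """Return (nearest_fp, use_secondary) for a requested forecast period."""
    candidates = [mapped for mapped in (mapping_fps or ()) if mapped >= fp]
    if candidates:
        # Smallest mapped forecast period that is not before the request.
        return min(candidates), True
    # Assumption: fall back to the requested period and the primary mapping.
    return fp, False


mapped_periods = {3600, 10800, 21600}
inside = find_nearest_fp(mapped_periods, 7200)
beyond = find_nearest_fp(mapped_periods, 25200)
missing = find_nearest_fp(None, 3600)
```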

parse_mapping_attributes(cluster_cube)[source]#

Parse and decode the mapping attributes from the cluster cube.

Parameters:

cluster_cube (Cube) – The cube output from RealizationClusterAndMatch, containing the mapping attributes as JSON-encoded strings.

Returns:

A tuple containing:

primary_map: Dictionary mapping cluster index (as string) to medoid realization index (int).
secondary_map: Dictionary mapping secondary input names to cluster mappings, where each cluster index maps to a list of dicts with “realization” and “forecast_periods”.

Raises:

TypeError – If the mapping attributes are not in the expected format (str or dict).

process(cubes)[source]#

Select realizations from input forecast cubes according to cluster assignments defined by the cluster_cube’s attributes.

Parameters: cubes (list of Cube) – List of input cubes, including forecast cubes and a cluster cube. The forecast cubes are from all source models for a common validity time, each containing a “realization” coordinate that contributed to the clustering. Each cube must have the model_id_attr attribute set to identify its source model. The cluster cube is output from RealizationClusterAndMatch, containing the cluster mapping attributes. The cluster cube is identified by the presence of the “primary_input_realization_to_cluster_medoid” attribute.
Return type: Cube
Returns: A merged Cube containing the selected realizations, with realization indices matching the cluster indices in cluster_cube.

select_realizations_for_clusters(cluster_to_selection, forecast_cubes)[source]#

Select and relabel realizations from the input cubes according to the cluster-to-selection mapping.

Parameters:

cluster_to_selection (dict[int, tuple[str, int]]) – Dictionary mapping cluster index (int) to a tuple of (model name, realization index).
forecast_cubes (CubeList) – CubeList of input forecast cubes, each for a single forecast period.

Return type:

list[Cube]

Returns:

A list of Cube objects, each containing a single realization relabelled to the cluster index. Forecast cubes without a realization coordinate are treated as deterministic inputs and selected directly.

Raises:

ValueError – If no forecast cube is found for a specified model name.
ValueError – If a specified realization index is out of bounds for the corresponding model cube.

split_cubes_forecast_and_cluster(cubes)[source]#

Split a CubeList into forecast cubes and the cluster cube.

The cluster cube is identified by the presence of the “primary_input_realization_to_cluster_medoid” attribute. The forecast cubes are assumed to be the cubes without such an attribute and that share a common validity time.

Parameters:

cubes (CubeList) – CubeList of input cubes expected to contain forecast cubes and a cluster cube.

Returns:

Tuple of (forecast_cubes, cluster_cube):

forecast_cubes: CubeList of forecast cubes.
cluster_cube: The cluster cube.

Raises:

ValueError – If no cluster cube is found.
ValueError – If no forecast cubes are found.

validate_common_validity_time(forecast_cubes)[source]#

Validate that all forecast cubes share a common validity time.

Parameters: forecast_cubes (CubeList) – CubeList of forecast cubes.
Raises: ValueError – If forecast cubes do not share a common validity time.
Return type: None

class RealizationToClusterMatcher[source]#

Bases: BasePlugin

Match candidate realizations to clusters based on mean squared error (MSE). In this context, ‘candidate realizations’ refers to the set of realizations being considered for assignment to clusters from a secondary input. These are matched to clusters derived from a primary input by minimizing mean squared error (MSE).

Assigns realizations from a secondary input (e.g. a high-resolution regional ensemble model) to clusters defined by a primary input (e.g. a coarse global ensemble model). When multiple candidates compete for the same cluster, only the candidate with the lowest MSE is assigned; other candidates are not assigned to any cluster.

Supports only 3D cubes with dimensions (realization, y, x) and 4D cubes with dimensions (realization, forecast_period, y, x).

__init__()[source]#

Initialise the plugin.

_abc_impl = <_abc._abc_data object>#

_mean_squared_error_per_realization(clustered_array, candidate_array, n_realizations)[source]#

Calculate MSE between clustered and candidate realization arrays. Lower MSE indicates a candidate realization better matches a cluster’s representative member.

For 3D cubes, the MSE is calculated by averaging over spatial dimensions (y, x). For 4D cubes, the mean is calculated over spatial dimensions first, then the MSE is averaged over forecast_period.

Parameters:

clustered_array (ndarray) – The clustered array with shape (n_clusters, y, x) or (n_clusters, forecast_period, y, x).
candidate_array (ndarray) – The candidate array with shape (n_realizations, y, x) or (n_realizations, forecast_period, y, x).
n_realizations (int) – The number of realizations in the candidate array.

Return type:

ndarray

Returns:

Array of MSE values with shape (n_realizations, n_clusters) with element [i, j] containing the MSE between candidate realization i and cluster j.
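
The shape of the returned MSE array is easiest to see on a toy example. The following is a pure-Python sketch for the 3D case only, using nested lists in place of arrays; the function name mse_matrix and the sample values are hypothetical, and the real method operates on ndarrays.

```python
def mse_matrix(clustered, candidates):
    """Return mse[i][j] between candidate realization i and cluster j.

    Both inputs are nested lists shaped (n, y, x); the squared error is
    averaged over the spatial dimensions.
    """

    def flatten(field):
        return [value for row in field for value in row]

    def mean_squared_error(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    flat_clusters = [flatten(field) for field in clustered]
    return [
        [mean_squared_error(flatten(candidate), cluster) for cluster in flat_clusters]
        for candidate in candidates
    ]


# Two clusters and three candidate realizations on a 2 x 2 grid.
clusters = [[[0, 0], [0, 0]], [[2, 2], [2, 2]]]
candidates = [[[1, 1], [1, 1]], [[2, 2], [2, 2]], [[0, 0], [0, 2]]]
mse = mse_matrix(clusters, candidates)
```

Rows index candidate realizations and columns index clusters, matching the documented (n_realizations, n_clusters) layout.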

_validate_cube_dimensions(clusters_cube, candidate_cube)[source]#

Validate that both the clustered and candidate cubes have matching dimensions.

Parameters:

clusters_cube (Cube) – The clustered cube.
candidate_cube (Cube) – The candidate cube.

Raises:

ValueError – If cube dimensions don’t match.
ValueError – If dimension coordinate names don’t match.

Return type:

None

_validate_forecast_period_coords(clusters_cube, candidate_cube)[source]#

Validate matching forecast_period coordinates for 4D cubes.

Parameters:

clusters_cube (Cube) – The clustered cube.
candidate_cube (Cube) – The candidate cube.

Raises:

ValueError – If forecast period coords do not match between clustered and candidate cubes.

Return type:

None

assign_clusters(realization_cluster_mse)[source]#

Assign clusters to candidate realizations using greedy MSE minimization.

This method assigns candidate realizations to clusters by minimizing mean squared error. The algorithm iterates through realizations in descending order of their “MSE cost” (the sum of differences between each cluster’s MSE and the minimum MSE for that realization). Realizations with higher costs (those with more uniform MSE across clusters, i.e. without a cluster that they are “well matched” to) are processed first; lower-cost realizations (those with a stronger “preference” for one cluster) are processed last. During each iteration, if the cluster is unassigned or the realization’s MSE is lower than the current holder’s MSE, the realization takes over that cluster; otherwise the cluster remains assigned to its current realization. This continues through all realizations, so early assignments by flexible realizations are often replaced by later-processed realizations that have stronger (lower MSE) matches to clusters.

Note: This greedy algorithm is chosen for its relative simplicity and computational efficiency. While optimal assignment algorithms (such as the Hungarian algorithm) could guarantee globally optimal solutions, this approach provides good results with O(n²) complexity and deterministic behavior.

Parameters:

realization_cluster_mse (ndarray) – The MSE array with shape (n_realizations, n_clusters).

Returns:

Tuple of (cluster_indices, realization_indices):

cluster_indices: List of cluster indices that were assigned (may be < n_clusters), sorted in ascending order.
realization_indices: List of realization indices assigned to each cluster (one per assigned cluster).
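
The replacement rule at the heart of this method can be sketched as follows. This is a simplified reading of the description above, not the actual implementation: it assumes each realization bids only for its single lowest-MSE cluster, and it processes realizations in index order rather than by MSE cost, since under this simplified rule the processing order only matters for exact ties. The function name is hypothetical.

```python
def assign_clusters_sketch(mse):
    """Assign at most one realization to each cluster from an MSE table.

    mse[i][j] is the MSE between candidate realization i and cluster j.
    """
    holders = {}  # cluster index -> (realization index, MSE of the holder)
    for realization_idx, row in enumerate(mse):
        # Assumption: a realization competes only for its best-matching cluster.
        best_cluster = min(range(len(row)), key=row.__getitem__)
        current = holders.get(best_cluster)
        if current is None or row[best_cluster] < current[1]:
            holders[best_cluster] = (realization_idx, row[best_cluster])
    cluster_indices = sorted(holders)
    realization_indices = [holders[cluster][0] for cluster in cluster_indices]
    return cluster_indices, realization_indices


# Realizations 0 and 1 both prefer cluster 0; realization 0 has the lower MSE.
contested = assign_clusters_sketch([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]])
# Both realizations prefer cluster 0, so cluster 1 is left unassigned.
unfilled = assign_clusters_sketch([[0.5, 0.9], [0.2, 0.8]])
```

The second call shows why cluster_indices can be shorter than n_clusters: a realization that loses its preferred cluster is not reassigned elsewhere.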

process(clusters_cube, candidate_cube)[source]#

Assign candidate realizations to clusters by mean squared error (MSE).

This method takes a cube of clustered realizations (e.g. from a global model) and candidate realizations (e.g. from a higher-resolution model), then assigns each cluster to the candidate realization with the lowest MSE for that cluster. When multiple candidates compete for the same cluster, only the one with the lowest MSE is assigned; other candidates are not assigned to any cluster.

Supports both 3D cubes (realization, y, x) and 4D cubes (realization, forecast_period, y, x). When using 4D cubes, both input cubes must have matching forecast_period coordinates.

Parameters:

clusters_cube (Cube) – The cube containing clustered realizations (e.g., from KMedoids clustering). Shape: (n_clusters, y, x) or (n_clusters, forecast_period, y, x).
candidate_cube (Cube) – The input cube with realizations to assign to clusters. Shape: (n_realizations, y, x) or (n_realizations, forecast_period, y, x).

Returns:

Tuple of (cluster_indices, realization_indices):

cluster_indices: List of cluster indices that were assigned. May have length < n_clusters if there are fewer candidates than clusters.
realization_indices: List of realization indices assigned to each cluster (one per assigned cluster).
