improver.categorical.decision_tree module

Module containing categorical decision tree implementation.

class ApplyDecisionTree(decision_tree, model_id_attr=None, record_run_attr=None, target_period=None, title=None)[source]

Bases: BasePlugin

Definition and implementation of a categorical decision tree. This plugin uses a variety of diagnostic inputs and the decision tree logic to determine the most representative category for each location defined in the input cubes. This can be used for generating categorical data such as weather symbols.

Decision trees

Decision trees use diagnostic fields to diagnose a suitable category to represent the weather conditions, such as a weather symbol. The tree comprises a series of interconnected decision nodes, leaf nodes and a stand-alone meta node. At each decision node one or more forecast diagnostics are compared to predefined threshold values. The decision node has an if_true and an if_false path on to the next node. By traversing the nodes it should be possible, given the right weather conditions, to arrive at any of the leaf nodes, which describe the leaf name, code, and optional information for night-time and modal grouping.

The first few nodes of a decision tree are represented in the schematic below.

Schematic of thundery nodes in a decision tree

There are two thresholds being used in these nodes. The first is the diagnostic threshold, which identifies the critical value for a given diagnostic. In the first node of the schematic shown, this threshold is the count of lightning flashes in an hour exceeding 0.0. The second threshold is the probability of exceeding (in this case) this diagnostic threshold. In this first node it is a probability of 0.3 (30%). So the node overall states that if there is a 30% or greater probability of any lightning flashes being forecast in the hour, proceed to the if_true node, else move to the if_false node.
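As a minimal sketch of this test (the probability values below are invented for illustration), the comparison at the lightning node amounts to an element-wise inequality:

import numpy as np

# Hypothetical probabilities of exceeding 0.0 lightning flashes m-2 in the
# forecast hour, one value per grid point.
prob_lightning = np.array([0.05, 0.30, 0.75])

# The node's test: points with probability >= 0.3 follow the if_true path,
# the rest follow the if_false path.
follows_if_true = prob_lightning >= 0.3  # array([False, True, True])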

Encoding a decision tree

The meta node provides the name to use for the metadata of the resulting cube and can be anywhere in the decision tree, but must have “meta” as its key. This becomes the cube name and is also used for two attributes that describe the categorical data: <name> and <name>_meaning:

{
  "meta": {
      "name": "weather_code",
  },
}

The first decision node in the thundery nodes shown above is encoded as follows:

{
  "lightning": {
      "if_true": "lightning_cloud",
      "if_false": "heavy_precipitation",
      "if_diagnostic_missing": "if_false",
      "probability_thresholds": [0.3],
      "threshold_condition": ">=",
      "condition_combination": "",
      "diagnostic_fields": [
          "probability_of_number_of_lightning_flashes_per_unit_area_in_vicinity_above_threshold"
      ],
      "diagnostic_thresholds": [[0.0, "m-2"]],
      "diagnostic_conditions": ["above"]
  },
}

The key at the first level, “lightning” in this case, names the node so that it can be targeted as an if_true or if_false destination from other nodes. The dictionary accessed with this key contains the essentials that make the node function.

  • if_true (str): The next node if the condition in this node is true.

  • if_false (str): The next node if the condition in this node is false.

  • if_diagnostic_missing (str, optional): If the expected diagnostic is not provided, should the tree proceed to the if_true or if_false node. This can be useful if the tree is to be applied to output from different models, some of which do not provide all the diagnostics that might be desirable.

  • probability_thresholds (list(float)): The probability threshold(s) that must be exceeded or not exceeded (see threshold_condition) for the node to progress to the if_true target. Two values are required if condition_combination is being used.

  • threshold_condition (str): Defines the inequality test to be applied to the probability threshold(s). Inequalities that can be used are “<=”, “<”, “>”, “>=”.

  • condition_combination (str): If multiple tests are being applied in a single node, this value determines the logic with which they are combined. The values can be “AND”, “OR”.

  • diagnostic_fields (List(str or List(str))): The name(s) of the diagnostic(s) that will form the test condition in this node. There may be multiple diagnostics if they are being combined in the test using a condition_combination. Alternatively, if they are being manipulated within the node (e.g. added together), they must be separated by the desired operators (e.g. ‘diagnostic1’, ‘+’, ‘diagnostic2’).

  • diagnostic_thresholds (List(List(float, str, Optional(int)))): The diagnostic threshold value and units being used in the test. An optional third value provides a period in seconds that is associated with the threshold value. For example, a precipitation accumulation threshold might be given for a 1-hour period (3600 seconds). If the decision tree instead generates data representing a 3-hour period using 3-hour precipitation accumulations, the threshold value will be scaled up by a factor of 3 (a sketch of this scaling follows this list). Only thresholds with an associated period are scaled in this way. A threshold [value, units] pair must be provided for each diagnostic field with the same nested list structure; as the basic unit is a list of value and unit, the overall nested structure is one list deeper.

  • diagnostic_conditions (as diagnostic_fields): The expected inequality that has been used to construct the input probability field. This is checked against the spp__relative_to_threshold attribute of the threshold coordinate in the provided diagnostic.
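The period scaling described for diagnostic_thresholds can be sketched as follows; the function below is illustrative only, not the plugin's own implementation:

def scale_threshold(value, threshold_period, target_period):
    """Scale a period-associated threshold value to the target period.

    Illustrative only: a 1.0 mm accumulation threshold defined for a
    3600 s period becomes 3.0 mm when the tree is applied with a
    10800 s (3-hour) target_period.
    """
    return value * target_period / threshold_period


assert scale_threshold(1.0, 3600, 10800) == 3.0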

It is also possible to build a node which uses a deterministic forecast. This is not currently used within the weather symbols decision tree but, as an example, the following shows how such a node would be encoded:

{
  "precip_rate": {
      "if_true": "rain",
      "if_false": "dry",
      "if_diagnostic_missing": "if_false",
      "thresholds": [0],
      "threshold_condition": ">",
      "diagnostic_fields": ["precipitation_rate"],
      "deterministic": True
  },
}

The keys of this dictionary have the same meanings as for a probabilistic node, with the following additional keys:

  • thresholds (list(float)): The threshold(s) that must be exceeded or not exceeded (see threshold_condition) for the node to progress to the if_true target. Two values are required if condition_combination is being used.

  • deterministic (boolean): Determines whether the node is expecting a deterministic input.

The first leaf node above is encoded as follows:

{
  "Thunder_Shower_Day": {
      "leaf": 29,
      "if_night": "Thunder_Shower_Night",
      "group": "convection",
      "is_unreachable": True,
  },
}

The key at the first level, “Thunder_Shower_Day” in this case, names the node so that it can be targeted as an if_true or if_false destination from decision nodes. The key also forms part of the metadata attribute defining the category meanings. The dictionary accessed with this key contains the following.

  • leaf (int): The category code associated with this leaf

  • if_night (str, optional): The alternate leaf node to be used when a night-time symbol is required.

  • group (str, optional): Indicates which group this leaf belongs to when determining the modal category.

  • is_unreachable (bool): True for a leaf which needs to be included in the metadata but cannot be reached.

The modal category also relies on the severity of symbols generally increasing with the category value, so that in the case of ties, the more severe category is selected.

Every decision tree must have a starting node, and this is taken as the first node defined in the dictionary, or second if the first node is the meta node.
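For illustration, the starting node could be picked out of the tree dictionary as follows (a sketch of the rule above, not necessarily the plugin's exact code):

def find_start_node(decision_tree):
    """Return the first node defined in the tree, skipping the meta node."""
    return next(name for name in decision_tree if name != "meta")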

Manipulation of the diagnostics is possible using the decision tree configuration to enable more complex comparisons. For example:

"heavy_rain_or_sleet_shower": {
    "if_true": 14,
    "if_false": 17,
    "probability_thresholds": [0.0],
    "threshold_condition": "<",
    "condition_combination": "",
    "diagnostic_fields": [
        [
            "probability_of_lwe_sleetfall_rate_above_threshold",
            "+",
            "probability_of_lwe_snowfall_rate_above_threshold",
            "-",
            "probability_of_rainfall_rate_above_threshold"
        ]
    ],
    "diagnostic_thresholds": [[[1.0, "mm hr-1"], [1.0, "mm hr-1"], [1.0, "mm hr-1"]]],
    "diagnostic_conditions": [["above", "above", "above"]]
},

This node uses three diagnostics. It combines them according to the mathematical operators that separate the names in the diagnostic_fields list. The resulting value is compared to the probability threshold value using the threshold condition. In this example the purpose is to check whether the probability of the rain rate exceeding 1.0 mm/hr is greater than the combined probability of the same rate being exceeded by sleet and snow.
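With invented probability values, the combination performed by this node is simply:

import numpy as np

# Hypothetical probabilities of each rate exceeding 1.0 mm hr-1.
prob_sleet = np.array([0.2, 0.1])
prob_snow = np.array([0.1, 0.1])
prob_rain = np.array([0.5, 0.1])

# diagnostic_fields combination: sleet + snow - rain, tested with "< 0.0".
combined = prob_sleet + prob_snow - prob_rain
follows_if_true = combined < 0.0  # True where rain is more probable than sleet and snow combined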

__init__(decision_tree, model_id_attr=None, record_run_attr=None, target_period=None, title=None)[source]

Define a decision tree for determining a category based upon the input diagnostics. Use this decision tree to allocate a category to each location.

Parameters:
  • decision_tree (dict) – Decision tree definition, provided as a dictionary.

  • model_id_attr (Optional[str]) – Name of attribute recording source models that should be inherited by the output cube. The source models are expected as a space-separated string.

  • record_run_attr (Optional[str]) – Name of attribute used to record models and cycles used in constructing the output.

  • target_period (Optional[int]) – The period in seconds that the category being produced should represent. This should correspond with any period diagnostics, e.g. precipitation accumulation, being used as input. This is used to scale any threshold values that are defined with an associated period in the decision tree. It will only be used if the decision tree provided has threshold values defined with an associated period.

  • title (Optional[str]) – An optional title to assign to the title attribute of the resulting output. This will override the title generated from the inputs, where this generated title is only set if all of the inputs share a common title.

float_tolerance defines the tolerance when matching thresholds to allow for the difficulty of float comparisons. float_abs_tolerance defines the tolerance for when the threshold is zero. It has to be sufficiently small that a valid rainfall rate or snowfall rate could not trigger it.
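A minimal sketch of this kind of tolerant threshold matching, assuming illustrative tolerance values (the plugin's actual values may differ):

import numpy as np

float_tolerance = 0.01       # illustrative relative tolerance for non-zero thresholds
float_abs_tolerance = 1e-12  # illustrative absolute tolerance for zero thresholds


def thresholds_match(coord_value, tree_value):
    """Compare a threshold from a cube coordinate with one from the tree."""
    if tree_value == 0.0:
        return abs(coord_value - tree_value) <= float_abs_tolerance
    return bool(np.isclose(coord_value, tree_value, rtol=float_tolerance))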

static _invert_comparator(comparator)[source]

Inverts a single comparator string.

Return type:

str

static _set_reference_time(cube, cubes)[source]

Replace the forecast_reference_time and/or blend_time coordinate points (where present) on the cube with the latest value from the cubes. The forecast_period is also updated.

check_coincidence(cubes)[source]

Check that all the provided cubes are valid at the same time and that, where any of the input cubes have time bounds, these match.

The last input cube with bounds (or first input cube if none have bounds) is selected as a template_cube for later producing the output cube.

Parameters:

cubes (Union[List[Cube], CubeList]) – List of input cubes used in the decision tree

Raises:
  • ValueError – If validity times differ for diagnostics.

  • ValueError – If period diagnostics have different periods.

  • ValueError – If period diagnostics do not match target_period.

static compare_array_to_threshold(arr, comparator, threshold)[source]

Compare two arrays element-wise and return a boolean array.

Parameters:
  • arr (ndarray) –

  • comparator (str) – One of ‘<’, ‘>’, ‘<=’, ‘>=’.

  • threshold (float) –

Return type:

ndarray

Returns:

Array of booleans.

Raises:

ValueError – If comparator is not one of ‘<’, ‘>’, ‘<=’, ‘>=’.
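A usage sketch following the documented signature (the input values are invented):

import numpy as np

from improver.categorical.decision_tree import ApplyDecisionTree

probabilities = np.array([0.1, 0.3, 0.9])
result = ApplyDecisionTree.compare_array_to_threshold(probabilities, ">=", 0.3)
# Expected, following the documented behaviour: array([False, True, True])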

construct_extract_constraint(diagnostic, threshold, coord_named_threshold)[source]

Construct an iris constraint.

Parameters:
  • diagnostic (str) – The name of the diagnostic to be extracted from the CubeList.

  • threshold (AuxCoord) – The threshold within the given diagnostic cube that is needed, including units. Note this is NOT a coord from the original cubes, just a construct to associate units with values.

  • coord_named_threshold (bool) – If true, use the old naming convention for threshold coordinates (coord.long_name=threshold). Otherwise, extract the threshold coordinate name from the diagnostic name.

Return type:

Constraint

Returns:

A constraint
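A usage sketch, assuming a plugin instance built from a decision tree as described above and using the lightning diagnostic from the earlier example; the AuxCoord here is only a construct pairing a value with units, as noted:

from iris.coords import AuxCoord

from improver.categorical.decision_tree import ApplyDecisionTree

plugin = ApplyDecisionTree(decision_tree)  # decision_tree assumed already defined

# A construct associating the threshold value with its units (not a coord
# taken from an input cube).
threshold = AuxCoord(0.0, units="m-2")

constraint = plugin.construct_extract_constraint(
    "probability_of_number_of_lightning_flashes_per_unit_area_in_vicinity_above_threshold",
    threshold,
    False,  # coord_named_threshold: use the standard naming convention
)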

create_categorical_cube(cubes)[source]

Create an empty categorical cube taking the cube name and categorical attribute names from the meta node, and categorical attribute values from the leaf nodes. The reference time is the latest from the set of input cubes and the optional record run attribute is a combination from all source cubes. Everything else comes from the template cube, which is the first cube with time bounds, or the first cube if none have time bounds.

Parameters:

cubes (Union[List[Cube], CubeList]) – List of input cubes used in the decision tree

Return type:

Cube

Returns:

A cube with suitable metadata to describe the categories that will fill it and data initialised with the value -1 to allow any unset points to be readily identified.

create_condition_chain(test_conditions)[source]

Construct a list of all the conditions specified in a single query.

Parameters:

test_conditions (Dict) – A query from the decision tree.

Returns:

A valid condition chain, defined recursively: (1) If each a_1, …, a_n is an extract expression (i.e. a constraint, or a list of constraints, operator strings and floats), and b is either “AND”, “OR” or “”, then [[a_1, …, a_n], b] is a valid condition chain. (2) If a_1, …, a_n are each valid condition chains, and b is either “AND” or “OR”, then [[a_1, …, a_n], b] is a valid condition chain.

Return type:

List
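For illustration, a condition chain built from two hypothetical constraints might look like this (the diagnostic names are taken from the examples above):

from iris import Constraint

# Hypothetical extract expressions for two diagnostics.
a_1 = Constraint(name="probability_of_rainfall_rate_above_threshold")
a_2 = Constraint(name="probability_of_lwe_snowfall_rate_above_threshold")

# Rule (1): two extract expressions combined with "AND".
flat_chain = [[a_1, a_2], "AND"]

# Rule (2): condition chains may themselves be nested and combined.
nested_chain = [[flat_chain, [[a_1], ""]], "OR"]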

evaluate_condition_chain(cubes, condition_chain)[source]

Recursively evaluate the list of conditions.

We can safely use recursion here since the depth will be small.

Parameters:
  • cubes (CubeList) – A cubelist containing the diagnostics required for the decision tree, all at coincident times.

  • condition_chain (List) – A valid condition chain is defined recursively: (1) If each a_1, …, a_n is an extract expression (i.e. a constraint, or a list of constraints, operator strings and floats), and b is either “AND”, “OR” or “”, then [[a_1, …, a_n], b] is a valid condition chain. (2) If a_1, …, a_n are each valid condition chains, and b is either “AND” or “OR”, then [[a_1, …, a_n], b] is a valid condition chain.

Return type:

ndarray

Returns:

An array or masked array of booleans

evaluate_extract_expression(cubes, expression)[source]

Evaluate a single condition.

Parameters:
  • cubes (CubeList) – A cubelist containing the diagnostics required for the decision tree, all at coincident times.

  • expression (Union[Constraint, List]) – Defined recursively: A list consisting of an iris.Constraint or a list of iris.Constraint, strings (representing operators) and floats is a valid expression. A list consisting of valid expressions, strings (representing operators) and floats is a valid expression.

Return type:

ndarray

Returns:

An array or masked array of booleans
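Continuing the heavy_rain_or_sleet_shower example above, a valid expression could be assembled as follows (the constraints here are simple illustrative name constraints):

from iris import Constraint

sleet = Constraint(name="probability_of_lwe_sleetfall_rate_above_threshold")
snow = Constraint(name="probability_of_lwe_snowfall_rate_above_threshold")
rain = Constraint(name="probability_of_rainfall_rate_above_threshold")

# Each constraint is used to extract a probability field from the cubelist;
# the fields are then combined using the interleaved operator strings.
expression = [sleet, "+", snow, "-", rain]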

static find_all_routes(graph, start, end, route=None)[source]

Function to trace all routes through the decision tree.

Parameters:
  • graph (Dict) – A dictionary that describes each node in the tree, e.g. {<node_name>: [<if_true_name>, <if_false_name>]}

  • start (str) – The node name of the tree root (currently always lightning).

  • end (int) – The category code to which we are tracing all routes.

  • route (Optional[List[str]]) – A list of node names found so far.

Return type:

List[str]

Returns:

A list of node names that defines the route from the tree root to the category leaf (end of chain).

References

Method based upon Python Patterns - Implementing Graphs essay https://www.python.org/doc/essays/graphs/
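The essay's algorithm, on which this method is based, looks roughly like the following sketch (not the plugin's exact code):

def find_all_paths(graph, start, end, path=None):
    """Trace every route from start to end in a graph of the form
    {node: [successor, ...]}, following the cited Python Patterns essay."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for node in graph.get(start, []):
        if node not in path:
            paths.extend(find_all_paths(graph, node, end, path))
    return paths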

invert_condition(condition)[source]

Invert a comparison condition to allow positive identification of conditions satisfying the negative case.

Parameters:

condition (Dict) – A single query from the decision tree.

Return type:

Tuple[str, str]

Returns:

  • A string representing the inverted comparison.

  • A string representing the inverted combination

prepare_input_cubes(cubes)[source]

Check that the input cubes contain all the diagnostics and thresholds required by the decision tree. Sets self.coord_named_threshold to “True” if threshold-type coordinates have the name “threshold” (as opposed to the standard name of the diagnostic), for backward compatibility. A cubelist containing only cubes of the required diagnostic-threshold combinations is returned.

Parameters:

cubes (CubeList) – A CubeList containing the input diagnostic cubes.

Return type:

Tuple[CubeList, Optional[List[str]]]

Returns:

  • A CubeList containing only the required cubes.

  • A list of node names where the diagnostic data is missing and this is indicated as allowed by the presence of the if_diagnostic_missing key.

Raises:

IOError – Raises an IOError if any of the required input data is missing. The error includes details of which fields are missing.

process(cubes)[source]

Apply the decision tree to the input cubes to produce categorical output.

Parameters:

cubes (CubeList) – A cubelist containing the diagnostics required for the decision tree, all at coincident times.

Return type:

Cube

Returns:

A cube of categorical data.
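A usage sketch, assuming a decision tree dictionary and an iris CubeList of the required diagnostics have already been prepared:

from improver.categorical.decision_tree import ApplyDecisionTree

# decision_tree: dict encoded as described above.
# cubes: CubeList of the required probability diagnostics at coincident times.
plugin = ApplyDecisionTree(decision_tree, target_period=3600)
category_cube = plugin.process(cubes)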

remove_optional_missing(optional_node_data_missing)[source]

Some decision tree nodes are optional and have an “if_diagnostic_missing” key to enable passage through the tree in the absence of the required input diagnostic. This code modifies the tree in the following ways:

  • Rewrites the decision tree to skip the missing nodes by connecting the nodes that precede them to the node targeted by “if_diagnostic_missing”.

  • If the node(s) missing are those at the start of the decision tree, the start_node is modified to find the first available node.

Parameters:

optional_node_data_missing (List[str]) – List of node names for which data is missing but for which this is allowed.

_define_invertible_conditions()[source]

Returns a dictionary of boolean comparator strings where the value is the logical inverse of the key.

Return type:

Dict[str, str]
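A hedged sketch of what such a mapping looks like, using the comparators listed earlier:

# Each comparator string maps to its logical inverse.
invertible_conditions = {
    ">=": "<",
    ">": "<=",
    "<=": ">",
    "<": ">=",
}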