Statistical Processing

Context

IMPROVER both processes forecast data that has been statistically processed and supports a range of statistical processing steps. As a result, it makes significant use of the CF Metadata Conventions cell methods.

Overview of cell methods

CF Metadata Conventions provide the attribute cell_methods which can be used to describe the application of simple statistical processing to a variable (e.g. a maximum of the temperature over a period of time). This is represented as a string attribute comprising a list of blank-separated words of the form name: method. Each name: method pair indicates that for an axis identified by name, the values representing the variable have been determined or derived by the specified method.

name

Can be a dimension of the variable, a scalar coordinate variable, a valid CF Standard Name, or the word area.

method

Should be selected from a list: point, maximum, maximum_absolute_value, median, mid_range, minimum, minimum_absolute_value, mean, mean_absolute_value, mean_of_upper_decile, mode, range, root_mean_square, standard_derivation, sum, sum_of_squares, variance.

The default interpretation for variables which do not have a cell_methods attribute depends on whether the quantity being described is extensive (depends on the size of the cell along the axis of interest) or intensive (no dependence on the size of the cell along the axis of interest). The CF Metadata Conventions document uses precipitation as an example. Precipitation rate is intensive; it would be described by cell_methods = "time: point" and requires no time bounds. Precipitation accumulation is extensive; it would be described by cell_methods = "time: sum" and does require time bounds. In principle, the cell_methods could be omitted in both cases, but the inclusion of cell_methods = "time: sum" in the accumulation case is good practice, as this flags up the need to refer to the time bounds to fully interpret the quantity. The point method is somewhat unique as it means that no statistical processing has been applied to the underlying quantity (along the axis of interest) and it is usual to omit this.

As an example of the use of cell methods, consider a one-dimensional array of maximum air temperatures, which could be represented as:

variables:
    float air_temperature(time) ;
        air_temperature:cell_methods = "time: maximum" ;
    int64 time ;
        time:bounds = "time_bnds" ;
        time:units = "seconds since 1970-01-01 00:00:00" ;
        time:standard_name = "time" ;
        time:calendar = "gregorian" ;
    int64 time_bnds(bnds) ;

The time_bnds values define the periods over which the statistical processing has been calculated. If more than one cell method is to be indicated, they should be arranged in the order they were applied. The left-most operation is assumed to have been applied first.

It is possible to indicate more precisely how the cell method was applied by including extra information in parentheses after the method. This information includes standardized and non-standardized parts. The only standardized option is interval, used to provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing.

Warning

It is important to understand that the following example, does not (necessarily) represent a 1-hour maximum temperature, as the period over which the maximum is derived is included in the bounds (not shown), but rather that the temperature data used to calculate the maximum was provided in 1-hour steps.

air_temperature: cell_methods = "time: maximum (interval: 1 hour)" ;

The non-standardized part is included in the same way (unless there is also a standardized part), for example:

air_temperature: cell_methods = "time: mean (time-weighted)" ;

If both standardized and non-standardized parts are present, the latter must be preceded by the comment: keyword; for an example of this, see the CF Metadata Conventions.

Cell methods can be used to describe any simple statistical method that has been applied to a variable. However, this leads to the question of whether a statistical process that has been applied is significant for, or of relevance to, the end user? This is likely to depend on exactly who that user is and what they need to know about the data. Two extreme examples of how the cell methods might be presented for the same variable (in this case a maximum temperature in a period) are shown below:

  1. Simple version, just describing the information required to correctly interpret the variable; note that without the cell_methods, this would be a different variable, an instantaneous temperature:

air_temperature:cell_methods = "time: maximum"
  1. Complex version, including a whole chain of processes that have been applied to the variable:

air_temperature:cell_methods = "time: maximum realization: mean area: mean (neighbourhood: square topographic) forecast_reference_time: mean (time-weighted) area: mean (recursive-filter) model: mean (model-weighted)" ;

The issue with the complex version is that only the time: maximum is required by any user to correctly interpret and use the variable. The other processing steps tell you more about how it was generated and are really acting as a substitute for provenance metadata. This can obscure the essential statistical information, making it harder to understand what the variable actually represents.

Use of cell methods in IMPROVER

IMPROVER should only use cell methods to represent the what metadata of the variable, i.e. information that is required to correctly interpret the variable. See the section on Conformance to the CF Metadata Conventions in Principles.

The use of the interval within the extra information in cell methods is unhelpful and potentially confusing within IMPROVER and should be omitted.

There are two main ways in which cell methods are used within IMPROVER at present:

  1. Maximum, minimum and sum methods applied to time for percentile values, using a cell methods statement of the form below:

float air_temperature(percentile) ;
        air_temperature:standard_name = "air_temperature" ;
        air_temperature:units = "K" ;
air_temperature:cell_methods = "time: maximum" ;
  1. Maximum, minimum and sum methods applied to time for probability values, using a cell methods statement of the form below; note that there is a non-standard comment to indicate that the statistical processing is over the base variable rather than the probability.

float probability_of_air_temperature_above_threshold(threshold) ;
    probability_of_air_temperature_above_threshold:long_name = "probability_of_air_temperature_above_threshold" ;
    probability_of_air_temperature_above_threshold:units = "1" ;
    probability_of_air_temperature_above_threshold:cell_methods = "time: maximum (comment: of air_temperature)" ;

References

CF Metadata Conventions

CF Standard Name