Principles
Overview
The principles applied to the metadata within IMPROVER have been listed in the Introduction, but are discussed in more detail in this section.
Conformance to the CF Metadata Conventions
As far as possible the use of metadata within IMPROVER should conform to the CF Metadata Conventions (CF), whilst recognising the need to extend this in some areas.
However, whilst CF does a good job with ‘where’ and ‘what’ metadata (see note below), it does not really cover ‘who’ metadata. Therefore, it is helpful to keep CF (and direct extensions to this) focused on these aspects and not try to stretch it to cover ‘who’.
An example is CF cell methods, which are intended to cover ‘what’ (i.e. to ensure that data can be correctly interpreted) but can unwittingly be extended into ‘who’ (e.g. to try to describe the post-processing that has been applied), which can complicate its main purpose.
Note
Metadata Categorization
To inform decisions about the implementation of metadata, it can be helpful to consider three types of metadata in terms of what they are trying to achieve:
Where - location in space and time, and additional ‘data space’ information
What - what is this value? i.e. phenomenon, units, extra characteristics
Who - data identity, i.e. where it came from, provenance, data processing, etc
Clear purpose
Metadata should only be present, if it has a purpose. It should not be included just because it might be useful in future, as if no one is using it, it is unlikely to be properly maintained.
There should be just enough metadata to describe the data sufficiently.
Nothing misleading or unnecessarily restrictive
Metadata that is misleading, or simply wrong, should not be present.
Also, metadata that restricts wider use should be avoided, if possible.
Referencing more detailed external documentation
Ideally, the metadata should support the ability to ‘refer out’ to further information about the data to avoid ‘overloading’ the files with metadata. This makes it easier to provide more detail of the provenance of the data and the post-processing applied (the ‘where’ metadata described in the earlier note).
One approach to help support this that has been partially adopted within IMPROVER is the use of name-spacing for new (non-CF) attributes. This has two purposes:
It clearly separates the attribute from CF standard attributes, identifying it as belonging to a separate metadata convention.
It provides the potential for this attribute to be ‘resolved’ to a definition held in an external registry in the future
Namespacing is implemented using a double underscore to maintain CF conformance
whilst clearly identifying the namespace part of the attribute name.
For example, “ssp” is used to denote the statistical post-processing namespace,
with the attributes in this namespace prefixed by spp__
.
At present this is only used for a single attribute ssp__relative_to_threshold
,
which takes one of the four values:
greater_than
greater_than_or_equal_to
less_than
less_than_or_equal_to
The other namespace used is “mosg”, which is used to indicate
a Met Office standard grid attribute.
Despite the name this can, and generally should, be used.
At present this is used for two attributes:
mosg__model_configuration
, which identifies the
sources that have contributed to the blend, and
mosg__model_run
, which extends the information provided by
the contribution of specific model runs and
the level of their contribution.
At present, IMPROVER does not provide explicit references to external metadata. Earlier, in the development process,the use of NetCDF Linked Data was explored (see note below), but the method of implementation at the time created issues that outweighed the benefits. The intent is to revisit ways of referencing external metadata in the future.
Note
NetCDF Linked Data (NetCDF-LD)
NetCDF Linked Data was a proposed standard being developed by the Open Geospatial Consortium (OGC) NetCDF Standards Group, building on Linked Data, the widely used standard for managing relations between data items. The NetCDF-LD standard allows proxies to URIs to act as resolvable identifiers, which can reference to external metadata held in registries. This approach can be used to link the files to a much richer set of metadata (such as the details of the model which generated the data).
Practical considerations
IMPROVER code is usually implemented in a series of processing chains, so it makes sense to consider the metadata in this context. From a post-processing perspective, it is helpful to divide the metadata into three different types:
Low-level, specific to a particular plug-in (Open Source)
Centralised, appropriate to all users (Open Source)
Organisation-specific metadata (Bespoke)
Metadata is set, updated or removed at three stages in a post-processing chain:
Start - standardise CLI / amend_metadata plug-in
Plug-ins - reflecting changes made to the data and enforce wider standards
End - set metadata in the files that will be shared
Note
amend_metadata makes use of JSON dictionaries to flexibly update metadata (delete, set)
The general approach proposed is to be conservative with metadata; remove attributes on the source data that do not serve a purpose for the post-processed data. In particular, processing stage 1, will remove or transform most organisation-specific metadata, to ensure that the metadata does not become out of date. For this reason, only 6 global attributes are expected (being either retained or set at the start).
Conventions
institution
source
title
mosg__model_configuration
mosg__model_run
Organisation-specific metadata may be added in at the end of the processing chain.
Low-level metadata will usually only be transitory, required for certain processing steps, but not exposed in the final output.
Centralised metadata is key to the use of the final output, providing the required information to understand and exploit the data. This will be continually updated and added to as the data are tranformed in the post-processing steps. A couple of the most significant of these changes are:
Thresholding to generate probabilities:
variable name - prefixed with
probability_of _
standard_name or long_name ‘top and tailed’ with
probability_of_
and_above_threshold
or_below_threshold
, respectivelyunits - set to
1
Blend grid (multi-model):
source - changed to be
IMPROVER
title - updated to describe the blend appropriately
mosg__model_configuration - set to list of model identifiers
mosg__model_run - set to list of model runs and weights