improver.calibration.dataframe_utilities module

Functionality to convert a pandas DataFrame in the expected format into an iris cube.

Ingestion of DataFrames into iris cubes

DataFrames of forecasts and truths (observations) can be provided for use with Ensemble Model Output Statistics (EMOS). The formats expected for the forecast and truth DataFrames are described below. The forecasts are ensemble site forecasts, in percentile or realization format, at a set of observation sites. The truths are observations from observation sites.

Forecast DataFrame

The forecast DataFrame is expected to contain the following compulsory columns: forecast; altitude; blend_time; forecast_period; forecast_reference_time; time; wmo_id; diagnostic; latitude; longitude; period; height; cf_name; units; experiment; and exactly one of percentile or realization. Optionally, the DataFrame may also contain station_id. If the truth DataFrame also contains station_id, then forecast and truth data will be matched using both wmo_id and station_id. The station_id data may be either string or int. Any other columns not mentioned above will be ignored.

A summary of the expected contents of a forecast table is shown below.

Column	Dtype	Notes
forecast	float64	The value for a particular forecast.
altitude	float32	The altitude in metres.
blend_time	datetime64[ns,UTC]	The time at which a blend of models was produced.
forecast_period	timedelta64[ns]	The difference between the blend time (and forecast reference time) and the validity time.
forecast_reference_time	datetime64[ns,UTC]	The time at which the forecast analysis was made for a forecast from a single source. Equal to the blend_time for a forecast created from blending multiple forecast sources.
latitude	float32	The latitude in degrees.
longitude	float32	The longitude in degrees.
time	datetime64[ns,UTC]	The validity time of the forecasts. Signifies the end of the forecast period for period diagnostics.
wmo_id	object	The five digit WMO ID.
station_id	object or int	Optional additional site identifier.
cf_name	object	The CF name for the diagnostic. For DataFrames consisting of one diagnostic this is expected to be constant.
units	object	The units of the forecast value. For DataFrames consisting of one diagnostic this is expected to be constant.
percentile	float64	The percentile value.
realization	int	The realization number.
period	timedelta64[ns]	The period over which the forecast is valid. Set to missing data for instantaneous forecasts.
height	float32	The height of the forecast value. For DataFrames consisting of one diagnostic this is expected to be constant.
diagnostic	category	The name of the diagnostic. For DataFrames consisting of one diagnostic this is expected to be constant.
experiment	object	A value used for identifying how the data was generated when the table contains multiple equivalent forecasts.

An example forecast table for an instantaneous diagnostic is shown below.

index	forecast	altitude	blend_time	forecast_period	forecast_reference_time	latitude	longitude	time	wmo_id	cf_name	units	percentile	period	height	diagnostic	experiment
0	282.69	15	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	60	-5	2021-08-02 18:00:00+00:00	03001	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
1	283.2	82	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	59	-4	2021-08-02 18:00:00+00:00	03002	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
2	282.62	30	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	58	-3	2021-08-02 18:00:00+00:00	03003	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
3	286.17	4	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	57	-2	2021-08-02 18:00:00+00:00	03004	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
4	284.43	15	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	56	-1	2021-08-02 18:00:00+00:00	03005	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
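As a minimal sketch, a forecast table like the one above could be assembled with pandas as follows. The values are copied from the first three example rows; the explicit casts are only there to match the dtypes listed in the summary table, and the variable names are illustrative rather than part of the module.

```python
import pandas as pd

n_sites = 3
blend_time = pd.Timestamp("2021-08-01 18:00", tz="UTC")
validity_time = pd.Timestamp("2021-08-02 18:00", tz="UTC")

forecast_df = pd.DataFrame(
    {
        "forecast": [282.69, 283.2, 282.62],
        "altitude": [15.0, 82.0, 30.0],
        "blend_time": [blend_time] * n_sites,
        "forecast_period": [validity_time - blend_time] * n_sites,
        "forecast_reference_time": [blend_time] * n_sites,
        "latitude": [60.0, 59.0, 58.0],
        "longitude": [-5.0, -4.0, -3.0],
        "time": [validity_time] * n_sites,
        # WMO IDs are kept as strings so that leading zeros survive.
        "wmo_id": ["03001", "03002", "03003"],
        "cf_name": ["air_temperature"] * n_sites,
        "units": ["K"] * n_sites,
        "percentile": [5.0] * n_sites,
        # Instantaneous diagnostic: the period is missing data (NaT).
        "period": pd.Series([pd.NaT] * n_sites, dtype="timedelta64[ns]"),
        "height": [1.5] * n_sites,
        "diagnostic": ["temperature_at_screen_level"] * n_sites,
        "experiment": ["threshold"] * n_sites,
    }
).astype(
    {
        "altitude": "float32",
        "latitude": "float32",
        "longitude": "float32",
        "height": "float32",
        "diagnostic": "category",
    }
)
```

Note that the forecast_period column equals the difference between the time and forecast_reference_time columns, as described in the summary table.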

An example forecast table for an instantaneous diagnostic including station_id is shown below. The last 3 rows will be represented as different spot_index values in the output, since they have different station_id values despite sharing the same wmo_id.

index	forecast	altitude	blend_time	forecast_period	forecast_reference_time	latitude	longitude	time	wmo_id	station_id	cf_name	units	percentile	period	height	diagnostic	experiment
0	282.69	15	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	60	-5	2021-08-02 18:00:00+00:00	03001	029233	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
1	283.2	82	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	59	-4	2021-08-02 18:00:00+00:00	03002	029234	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
2	282.62	30	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	58	-3	2021-08-02 18:00:00+00:00	00000	029235	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
3	286.17	4	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	57	-2	2021-08-02 18:00:00+00:00	00000	029236	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold
4	284.43	15	2021-08-01 18:00:00+00:00	1 days	2021-08-01 18:00:00+00:00	56	-1	2021-08-02 18:00:00+00:00	00000	029237	air_temperature	K	5	NaT	1.5	temperature_at_screen_level	threshold

An example forecast table for a period diagnostic is shown below.

index	forecast	altitude	blend_time	forecast_period	forecast_reference_time	latitude	longitude	time	wmo_id	cf_name	units	percentile	period	height	diagnostic	experiment
0	282.69	15	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	60	-5	2021-08-01 21:00:00+00:00	03001	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
1	283.2	82	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	59	-4	2021-08-01 21:00:00+00:00	03002	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
2	282.62	30	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	58	-3	2021-08-01 21:00:00+00:00	03003	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
3	286.17	4	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	57	-2	2021-08-01 21:00:00+00:00	03004	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
4	284.43	15	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	56	-1	2021-08-01 21:00:00+00:00	03005	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold

An example forecast table for a period diagnostic including station_id is shown below.

index	forecast	altitude	blend_time	forecast_period	forecast_reference_time	latitude	longitude	time	wmo_id	station_id	cf_name	units	percentile	period	height	diagnostic	experiment
0	282.69	15	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	60	-5	2021-08-01 21:00:00+00:00	03001	029233	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
1	283.2	82	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	59	-4	2021-08-01 21:00:00+00:00	03002	029234	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
2	282.62	30	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	58	-3	2021-08-01 21:00:00+00:00	00000	029235	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
3	286.17	4	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	57	-2	2021-08-01 21:00:00+00:00	00000	029236	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold
4	284.43	15	2021-08-01 00:00:00+00:00	0 days 09:00:00	2021-08-01 00:00:00+00:00	56	-1	2021-08-01 21:00:00+00:00	00000	029237	temperature_at_screen_level_daytime_max	K	5	0 days 12:00:00	1.5	temperature_at_screen_level_max-daytime	threshold

Truth DataFrame

The truth DataFrame is expected to contain the following compulsory columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally, the DataFrame may also contain station_id and units. If the forecast DataFrame also contains station_id, then forecast and truth data will be matched using both wmo_id and station_id. The station_id data may be either string or int. If the truth DataFrame contains a units column, then it will be used for the units of the output truth cube. Otherwise, the units of the truth cube will be copied from the units of the forecast DataFrame. Any other columns not mentioned above will be ignored.

A summary of the expected contents of a truth table is shown below.

Column	Dtype	Notes
time	datetime64[ns,UTC]	The time of the observation.
wmo_id	object	The five digit WMO ID.
latitude	float32	The latitude in degrees.
longitude	float32	The longitude in degrees.
altitude	float32	The altitude in metres.
ob_value	float32	The value for a particular observation.
diagnostic	category	The name of the diagnostic.
units	str	Optional units of the observation values.

An example truth table is shown below.

index	ob_value	altitude	latitude	longitude	time	wmo_id	diagnostic
0	283.45	15	60	-5	2021-08-02 18:00:00+00:00	03001	temperature_at_screen_level
1	283.91	82	59	-4	2021-08-02 18:00:00+00:00	03002	temperature_at_screen_level
2	281.63	30	58	-3	2021-08-02 18:00:00+00:00	03003	temperature_at_screen_level
3	286.55	4	57	-2	2021-08-02 18:00:00+00:00	03004	temperature_at_screen_level
4	283.19	15	56	-1	2021-08-02 18:00:00+00:00	03005	temperature_at_screen_level
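A truth table in this format could be built along the following lines. This is a sketch using the first three example rows; the casts mirror the dtypes in the summary table and the variable name is illustrative.

```python
import pandas as pd

truth_df = pd.DataFrame(
    {
        "ob_value": [283.45, 283.91, 281.63],
        "altitude": [15.0, 82.0, 30.0],
        "latitude": [60.0, 59.0, 58.0],
        "longitude": [-5.0, -4.0, -3.0],
        # Observation times are timezone-aware and in UTC.
        "time": pd.to_datetime(["2021-08-02 18:00"] * 3, utc=True),
        # WMO IDs are strings so that leading zeros are preserved.
        "wmo_id": ["03001", "03002", "03003"],
        "diagnostic": ["temperature_at_screen_level"] * 3,
    }
).astype(
    {
        "ob_value": "float32",
        "altitude": "float32",
        "latitude": "float32",
        "longitude": "float32",
        "diagnostic": "category",
    }
)
```

Because no units column is supplied here, the units of the truth cube would be taken from the forecast DataFrame, as described above.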

_dataframe_column_check(df, compulsory_columns)

Check that the compulsory columns are present on the DataFrame. Any other columns within the DataFrame are ignored.

Parameters:

df (DataFrame) – Dataframe expected to contain the compulsory columns.
compulsory_columns (Sequence) – The names of the compulsory columns.

Raises:

ValueError – Raise an error if a compulsory column is missing.

Return type:

None
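The check described above can be sketched as follows. The function name check_compulsory_columns and the error message are hypothetical; this illustrates the behaviour described, not the module's actual implementation.

```python
from typing import Sequence

import pandas as pd


def check_compulsory_columns(df: pd.DataFrame, compulsory_columns: Sequence[str]) -> None:
    """Raise a ValueError if any compulsory column is absent (hypothetical helper)."""
    missing = [col for col in compulsory_columns if col not in df.columns]
    if missing:
        raise ValueError(f"The following compulsory columns are missing: {missing}")


# Extra columns are ignored, so this call passes silently.
check_compulsory_columns(
    pd.DataFrame({"ob_value": [283.45], "wmo_id": ["03001"], "extra": [0]}),
    ["ob_value", "wmo_id"],
)
```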

_define_height_coord(height)

Define a height coordinate. A unit of metres is assumed.

Parameters:

height – The value for the height coordinate in metres.

Return type:

AuxCoord

Returns:

The height coordinate.

_define_time_coord(adate, time_bounds=None)

Define a time coordinate. The coordinate will have bounds, if bounds are provided.

Parameters:

adate (Timestamp) – The point for the time coordinate.
time_bounds (Optional[Sequence[Timestamp]]) – The values defining the bounds for the time coordinate.

Return type:

DimCoord

Returns:

A time coordinate. This coordinate will have bounds, if bounds are provided.

_drop_duplicates(df, cols)

Drop duplicates and then sort the DataFrame.

Parameters:

df (DataFrame) – DataFrame to have duplicates removed.
cols (Sequence[str]) – Columns for use in removing duplicates and for sorting.

Return type:

DataFrame

Returns:

A DataFrame with duplicates removed (only the last duplicate is kept). The DataFrame is sorted according to the columns provided.
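One way to obtain this behaviour with pandas is sketched below. The function name drop_duplicates_and_sort is hypothetical, and the sketch assumes that the same columns are used both to identify duplicates and to sort.

```python
from typing import Sequence

import pandas as pd


def drop_duplicates_and_sort(df: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Keep the last row of each duplicate group, then sort by the same columns."""
    deduplicated = df.drop_duplicates(subset=list(cols), keep="last")
    return deduplicated.sort_values(by=list(cols), ignore_index=True)


validity_time = pd.Timestamp("2021-08-02 18:00", tz="UTC")
example_df = pd.DataFrame(
    {
        "wmo_id": ["03002", "03001", "03001"],
        "time": [validity_time] * 3,
        "forecast": [283.2, 282.69, 282.75],
    }
)
# Site 03001 appears twice for the same time; only its last row (282.75) is kept.
result = drop_duplicates_and_sort(example_df, ["wmo_id", "time"])
```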

_ensure_consistent_static_cols(forecast_df, static_cols, site_id_col)

Ensure that the columns expected to have the same value for a given site actually have the same value. These “static” columns could change if, for example, the altitude of a site is corrected.

Parameters:

forecast_df (DataFrame) – Forecast DataFrame.
static_cols (List[str]) – List of columns that are expected to be “static”.
site_id_col (str) – The name of the column containing the site ID.

Return type:

DataFrame

Returns:

Forecast DataFrame with the same value for a given site for the static columns provided.
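A sketch of one possible approach is given below. It assumes that the value in the last row for each site should be applied to all rows for that site; the description above does not state which value is chosen, so this choice and the function name make_static_cols_consistent are assumptions.

```python
from typing import List

import pandas as pd


def make_static_cols_consistent(
    forecast_df: pd.DataFrame, static_cols: List[str], site_id_col: str
) -> pd.DataFrame:
    """Overwrite each static column with the last value seen for each site.

    Assumption: the last row for a site holds the corrected value.
    """
    result = forecast_df.copy()
    result[static_cols] = result.groupby(site_id_col)[static_cols].transform("last")
    return result


example_df = pd.DataFrame(
    {
        "wmo_id": ["03001", "03001", "03002"],
        "altitude": [15.0, 16.0, 82.0],  # altitude of 03001 was corrected to 16
        "forecast": [282.69, 282.75, 283.2],
    }
)
consistent_df = make_static_cols_consistent(example_df, ["altitude"], "wmo_id")
```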

_fill_missing_entries(df, combi_cols, static_cols, site_id_col)

Fill the input DataFrame with rows that correspond to missing entries. The expected entries are computed using all combinations of the values within the combi_cols. In practice, this will allow support for creating entries for times that are missing when a new site with an ID is added. If the DataFrame provided is completely empty, then the empty DataFrame is returned.

Parameters:

df – DataFrame to be filled with rows corresponding to missing entries.
combi_cols – The key columns within the DataFrame. All combinations of the values within these columns are expected to exist, otherwise, an entry will be created.
static_cols – The names of the columns that are considered “static” and therefore can be reliably filled using other entries for the given WMO ID.
site_id_col – Name of the column used to identify the sites within the DataFrame.

Returns:

DataFrame where any missing combination of the combi_cols will have been created.
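The idea of creating every combination of the key columns can be sketched with a pandas MultiIndex, as below. This is an illustration under stated assumptions, not the module's implementation: it assumes that the key combinations are already unique (duplicates removed) and that site_id_col is one of the combi_cols. The function name fill_missing_entries is hypothetical.

```python
from typing import List

import pandas as pd


def fill_missing_entries(
    df: pd.DataFrame, combi_cols: List[str], static_cols: List[str], site_id_col: str
) -> pd.DataFrame:
    """Add a row for every missing combination of the combi_cols values."""
    if df.empty:
        return df
    # All combinations of the values found in the key columns.
    full_index = pd.MultiIndex.from_product(
        [df[col].unique() for col in combi_cols], names=combi_cols
    )
    filled = df.set_index(combi_cols).reindex(full_index).reset_index()
    # Static columns can be copied from other rows for the same site.
    filled[static_cols] = filled.groupby(site_id_col)[static_cols].transform("first")
    return filled


times = pd.to_datetime(["2021-08-01 18:00", "2021-08-01 18:00", "2021-08-02 18:00"], utc=True)
example_df = pd.DataFrame(
    {
        "time": times,
        "wmo_id": ["03001", "03002", "03001"],
        "altitude": [15.0, 82.0, 15.0],
        "ob_value": [283.45, 283.91, 284.0],
    }
)
# Site 03002 has no entry for the second time, so a row is created for it.
filled_df = fill_missing_entries(example_df, ["time", "wmo_id"], ["altitude"], "wmo_id")
```

In the result, the new row for site 03002 has its altitude filled from the existing row for that site, while its ob_value is left as NaN.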

_prepare_dataframes(forecast_df, truth_df, forecast_period, percentiles=None, experiment=None)

Prepare DataFrames for conversion to cubes by: 1) checking which forecast representation is present, 2) checking that the expected columns are present, 3) optionally checking that the percentiles are as expected, 4) removing duplicates from the forecast and truth, 5) finding the sites common to both the forecast and truth DataFrames and 6) replacing and supplementing the truth DataFrame with information from the forecast DataFrame. Note that this final step will also ensure that a row containing a NaN for the ob_value is inserted for any missing observations.

Parameters:

forecast_df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, one of REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, altitude, period, height, cf_name, units and experiment. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.
truth_df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.
forecast_period (int) – Forecast period in seconds as an integer.
percentiles (Optional[List[float]]) – The set of percentiles to be used for estimating EMOS coefficients.
experiment (Optional[str]) – A value within the experiment column to select from the forecast table.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

A sanitised version of the forecasts and truth DataFrames that are ready for conversion to cubes.

_preprocess_temporal_columns(df)

Pre-process the columns with temporal dtype to convert from numpy datetime objects to pandas datetime objects. Casting the dtype of the columns to object type results in columns of dtype “object” with the contents of the columns being pandas datetime objects, rather than numpy datetime objects.

Parameters:

df (DataFrame) – A DataFrame with temporal columns with numpy datetime dtypes.

Return type:

DataFrame

Returns:

A DataFrame without numpy datetime dtypes. The contents of the columns with temporal dtypes are accessible as pandas datetime objects.
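The effect of casting temporal columns to object dtype can be illustrated as follows. The helper name temporal_columns_to_object is hypothetical, and the sketch assumes that both datetime and timedelta columns are cast.

```python
import pandas as pd


def temporal_columns_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Cast datetime and timedelta columns to object dtype (hypothetical helper)."""
    result = df.copy()
    for col in result.columns:
        dtype = result[col].dtype
        is_temporal = pd.api.types.is_datetime64_any_dtype(
            dtype
        ) or pd.api.types.is_timedelta64_dtype(dtype)
        if is_temporal:
            # Elements become pandas Timestamp / Timedelta objects.
            result[col] = result[col].astype(object)
    return result


example_df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2021-08-02 18:00"], utc=True),
        "forecast_period": pd.to_timedelta(["1 days"]),
        "forecast": [282.69],
    }
)
converted_df = temporal_columns_to_object(example_df)
```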

_quantile_check(df)

Check that the percentiles provided can be considered to be quantiles with equal spacing spanning the percentile range.

Parameters:

df (DataFrame) – DataFrame with a percentile column.

Raises:

ValueError – Percentiles are not equally spaced.

Return type:

None
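A sketch of such a check is shown below. It assumes that "equal spacing spanning the percentile range" means that n percentiles lie at 100 * i / (n + 1) for i = 1..n (for example 25, 50, 75 for n = 3); that interpretation, the function name check_equally_spaced_percentiles and the error message are assumptions.

```python
import numpy as np
import pandas as pd


def check_equally_spaced_percentiles(df: pd.DataFrame) -> None:
    """Raise a ValueError if the percentiles are not equally spaced quantiles."""
    percentiles = np.sort(df["percentile"].unique())
    # Assumed definition: n equally spaced points strictly inside (0, 100).
    expected = np.linspace(0, 100, len(percentiles) + 2)[1:-1]
    if not np.allclose(percentiles, expected):
        raise ValueError(
            f"Percentiles {percentiles.tolist()} are not equally spaced; "
            f"expected {expected.tolist()}."
        )


# Quartiles repeated across two sites pass the check.
check_equally_spaced_percentiles(pd.DataFrame({"percentile": [25.0, 50.0, 75.0] * 2}))
```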

_training_dates_for_calibration(cycletime, forecast_period, training_length)

Compute the date range required for extracting the training dataset. The final validity time within the training dataset is at least one day prior to the cycletime. The final validity time is additionally offset by the number of whole days within the forecast period, to ensure that the dates defined by the training dataset are in the past relative to the cycletime.

For example, for a cycletime of 20170720T0000Z with a forecast period of T+30 and a training length of 3 days, the validity time is 20170721T0600Z. Subtracting one day gives 20170720T0600Z. Note that this is in the future relative to the cycletime, and we want the training dates to be in the past relative to the cycletime. Rounding the forecast period down to the nearest day gives 1 day for T+30. Subtracting this additional day gives 20170719T0600Z, which is the final validity time within the training period. The validity times for a 3 day training period ending at 20170719T0600Z are then 20170719T0600Z, 20170718T0600Z and 20170717T0600Z.

Parameters:

cycletime (str) – Cycletime of a format similar to 20170109T0000Z. The training dates will always be in the past, relative to the cycletime.
forecast_period (int) – Forecast period in seconds as an integer.
training_length (int) – Training length in days as an integer.

Return type:

DatetimeIndex

Returns:

Datetimes defining the training dataset. The number of datetimes is equal to the training length.
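The arithmetic in the worked example above can be reproduced with the standard library, as sketched below. The function name training_dates is hypothetical and a plain list of datetimes is returned rather than a DatetimeIndex; this illustrates the calculation only.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def training_dates(cycletime: str, forecast_period: int, training_length: int) -> List[datetime]:
    """Return the validity times of the training dataset, earliest first."""
    cycle = datetime.strptime(cycletime, "%Y%m%dT%H%MZ").replace(tzinfo=timezone.utc)
    lead_time = timedelta(seconds=forecast_period)
    validity_time = cycle + lead_time
    # Forecast period rounded down to a whole number of days.
    whole_days = timedelta(days=lead_time.days)
    final_validity_time = validity_time - timedelta(days=1) - whole_days
    return [
        final_validity_time - timedelta(days=offset)
        for offset in range(training_length - 1, -1, -1)
    ]


# Cycletime 20170720T0000Z, forecast period T+30 (in seconds), 3 day training length.
dates = training_dates("20170720T0000Z", 30 * 3600, 3)
```

For these inputs the list holds 20170717T0600Z, 20170718T0600Z and 20170719T0600Z, matching the example in the description.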

_unique_check(df, column)

Check that the column contains only one unique value.

Parameters:

df (DataFrame) – The DataFrame to be checked.
column (str) – Name of a column in the DataFrame.

Raises:

ValueError – Only one unique value within the specified column is expected.

Return type:

None
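A minimal sketch of this kind of check is given below. The function name check_single_value is hypothetical, and counting missing values as a distinct value is an assumption.

```python
import pandas as pd


def check_single_value(df: pd.DataFrame, column: str) -> None:
    """Raise a ValueError if the column holds more than one distinct value."""
    # Assumption: a missing value counts as a distinct value.
    if df[column].nunique(dropna=False) > 1:
        raise ValueError(
            f"Only one unique value is expected within the {column} column. "
            f"Found: {df[column].unique().tolist()}"
        )


# A table containing a single diagnostic passes the check.
check_single_value(pd.DataFrame({"units": ["K", "K", "K"]}), "units")
```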

forecast_and_truth_dataframes_to_cubes(forecast_df, truth_df, cycletime, forecast_period, training_length, percentiles=None, experiment=None)

Convert a forecast DataFrame into an iris Cube and a truth DataFrame into an iris Cube.

Parameters:

forecast_df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, one of REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, period, height, cf_name, units. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.
truth_df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.
cycletime (str) – Cycletime of a format similar to 20170109T0000Z.
forecast_period (int) – Forecast period in seconds as an integer.
training_length (int) – Training length in days as an integer.
percentiles (Optional[List[float]]) – The set of percentiles to be used for estimating EMOS coefficients. These should be a set of equally spaced quantiles.
experiment (Optional[str]) – A value within the experiment column to select from the forecast table.

Return type:

Tuple[Cube, Cube]

Returns:

Forecasts and truths for the training period in Cube format.

forecast_dataframe_to_cube(df, training_dates, forecast_period)

Convert a forecast DataFrame into an iris Cube. The percentiles within the forecast DataFrame are rebadged as realizations.

Parameters:

df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, one of REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, period, height, cf_name, units. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.
training_dates (DatetimeIndex) – Datetimes spanning the training period.
forecast_period (int) – Forecast period in seconds as an integer.

Return type:

Cube

Returns:

Cube containing the forecasts from the training period.

get_forecast_representation(df)

Check which of REPRESENTATION_COLUMNS (percentile or realization) exists in the DataFrame.

Parameters:

df (DataFrame) – DataFrame expected to contain exactly one of REPRESENTATION_COLUMNS.

Returns:

The member of REPRESENTATION_COLUMNS found in the DataFrame columns.

Return type:

representation_type

Raises:

ValueError – If none of the allowed columns are present, or more than one is present.
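The "exactly one of" rule can be sketched as follows. The tuple below mirrors the two column names given in the description; the function name find_forecast_representation and the error message are hypothetical.

```python
import pandas as pd

# The two allowed representation columns, as named in the description above.
REPRESENTATION_COLUMNS = ("percentile", "realization")


def find_forecast_representation(df: pd.DataFrame) -> str:
    """Return the single representation column present in the DataFrame."""
    found = [col for col in REPRESENTATION_COLUMNS if col in df.columns]
    if len(found) != 1:
        raise ValueError(
            f"Expected exactly one of {REPRESENTATION_COLUMNS} in the DataFrame "
            f"columns, but found: {found}"
        )
    return found[0]


representation = find_forecast_representation(
    pd.DataFrame({"forecast": [282.69], "percentile": [5.0]})
)
```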

truth_dataframe_to_cube(df, training_dates)

Convert a truth DataFrame into an iris Cube.

Parameters:

df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude, altitude, cf_name, height, period. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.
training_dates (DatetimeIndex) – Datetimes spanning the training period.

Return type:

Cube

Returns:

Cube containing the truths from the training period.