improver.calibration.dataframe_utilities module

Functionality to convert a pandas DataFrame in the expected format into an iris cube.

Ingestion of DataFrames into iris cubes

DataFrames of the forecasts and truths (observations) can be provided for use with Ensemble Model Output Statistics (EMOS). The format expected for the forecast and truth DataFrames is described below. The forecasts are ensemble site forecasts in percentile or realization format at a set of observation sites. The truths are observations from observation sites.

Forecast DataFrame

The forecast DataFrame is expected to contain the following compulsory columns: forecast; altitude; blend_time; forecast_period; forecast_reference_time; time; wmo_id; diagnostic; latitude; longitude; period; height; cf_name; units; experiment; and exactly one of percentile or realization. Optionally, the DataFrame may also contain station_id. If the truth DataFrame also contains station_id, then forecast and truth data will be matched using both wmo_id and station_id. The station_id data may be either string or int. Any other columns not mentioned above will be ignored.

A summary of the expected contents of a forecast table is shown below.

| Column | Dtype | Notes |
| --- | --- | --- |
| forecast | float64 | The value for a particular forecast. |
| altitude | float32 | The altitude in metres. |
| blend_time | datetime64[ns,UTC] | The time at which a blend of models was produced. |
| forecast_period | timedelta64[ns] | The difference between the blend time (and forecast reference time) and the validity time. |
| forecast_reference_time | datetime64[ns,UTC] | The time at which the forecast analysis was made for a forecast from a single source. Equal to the blend_time for a forecast created from blending multiple forecast sources. |
| latitude | float32 | The latitude in degrees. |
| longitude | float32 | The longitude in degrees. |
| time | datetime64[ns,UTC] | The validity time of the forecasts. Signifies the end of the forecast period for period diagnostics. |
| wmo_id | object | The five-digit WMO ID. |
| station_id | object or int | Optional additional site identifier. |
| cf_name | object | The CF name for the diagnostic. For DataFrames consisting of one diagnostic, this is expected to be constant. |
| units | object | The units of the forecast value. For DataFrames consisting of one diagnostic, this is expected to be constant. |
| percentile | float64 | The percentile value. |
| realization | int | The realization number. |
| period | timedelta64[ns] | The period over which the forecast is valid. Set to missing data (NaT) for instantaneous forecasts. |
| height | float32 | The height of the forecast value. For DataFrames consisting of one diagnostic, this is expected to be constant. |
| diagnostic | category | The name of the diagnostic. For DataFrames consisting of one diagnostic, this is expected to be constant. |
| experiment | object | A value used for identifying how the data was generated when the table contains multiple equivalent forecasts. |

An example forecast table for an instantaneous diagnostic is shown below.

| index | forecast | altitude | blend_time | forecast_period | forecast_reference_time | latitude | longitude | time | wmo_id | cf_name | units | percentile | period | height | diagnostic | experiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 282.69 | 15 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 60 | -5 | 2021-08-02 18:00:00+00:00 | 03001 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 1 | 283.2 | 82 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 59 | -4 | 2021-08-02 18:00:00+00:00 | 03002 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 2 | 282.62 | 30 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 58 | -3 | 2021-08-02 18:00:00+00:00 | 03003 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 3 | 286.17 | 4 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 57 | -2 | 2021-08-02 18:00:00+00:00 | 03004 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 4 | 284.43 | 15 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 56 | -1 | 2021-08-02 18:00:00+00:00 | 03005 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
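As a rough illustration of the format described above, the following sketch builds a two-row forecast DataFrame in pandas that mirrors the first two rows of the example table. The construction approach (building from a dictionary and casting with astype) is an assumption made for illustration; any route that yields the same columns and dtypes should be equivalent.

```python
import pandas as pd

# Two rows mirroring the first two rows of the example forecast table.
forecast_df = pd.DataFrame(
    {
        "forecast": [282.69, 283.2],
        "altitude": [15.0, 82.0],
        "blend_time": pd.to_datetime(["2021-08-01 18:00"] * 2, utc=True),
        "forecast_period": pd.to_timedelta([1, 1], unit="D"),
        "forecast_reference_time": pd.to_datetime(["2021-08-01 18:00"] * 2, utc=True),
        "latitude": [60.0, 59.0],
        "longitude": [-5.0, -4.0],
        "time": pd.to_datetime(["2021-08-02 18:00"] * 2, utc=True),
        "wmo_id": ["03001", "03002"],
        "cf_name": "air_temperature",
        "units": "K",
        "percentile": [5.0, 5.0],
        # Instantaneous diagnostic, so the period is missing (NaT).
        "period": pd.Series([pd.NaT, pd.NaT], dtype="timedelta64[ns]"),
        "height": [1.5, 1.5],
        "diagnostic": "temperature_at_screen_level",
        "experiment": "threshold",
    }
)

# Cast the columns whose expected dtype differs from the pandas default.
forecast_df = forecast_df.astype(
    {
        "altitude": "float32",
        "latitude": "float32",
        "longitude": "float32",
        "height": "float32",
        "diagnostic": "category",
    }
)

print(forecast_df.dtypes)
```

Keeping wmo_id as a string preserves the leading zero of the five-digit identifier.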

An example forecast table for an instantaneous diagnostic including station_id is shown below. The last three rows share the same wmo_id but will be represented as different spot_index values in the output, since their station_id values differ.

| index | forecast | altitude | blend_time | forecast_period | forecast_reference_time | latitude | longitude | time | wmo_id | station_id | cf_name | units | percentile | period | height | diagnostic | experiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 282.69 | 15 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 60 | -5 | 2021-08-02 18:00:00+00:00 | 03001 | 029233 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 1 | 283.2 | 82 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 59 | -4 | 2021-08-02 18:00:00+00:00 | 03002 | 029234 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 2 | 282.62 | 30 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 58 | -3 | 2021-08-02 18:00:00+00:00 | 00000 | 029235 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 3 | 286.17 | 4 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 57 | -2 | 2021-08-02 18:00:00+00:00 | 00000 | 029236 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |
| 4 | 284.43 | 15 | 2021-08-01 18:00:00+00:00 | 1 days | 2021-08-01 18:00:00+00:00 | 56 | -1 | 2021-08-02 18:00:00+00:00 | 00000 | 029237 | air_temperature | K | 5 | NaT | 1.5 | temperature_at_screen_level | threshold |

An example forecast table for a period diagnostic is shown below.

| index | forecast | altitude | blend_time | forecast_period | forecast_reference_time | latitude | longitude | time | wmo_id | cf_name | units | percentile | period | height | diagnostic | experiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 282.69 | 15 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 60 | -5 | 2021-08-01 21:00:00+00:00 | 03001 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 1 | 283.2 | 82 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 59 | -4 | 2021-08-01 21:00:00+00:00 | 03002 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 2 | 282.62 | 30 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 58 | -3 | 2021-08-01 21:00:00+00:00 | 03003 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 3 | 286.17 | 4 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 57 | -2 | 2021-08-01 21:00:00+00:00 | 03004 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 4 | 284.43 | 15 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 56 | -1 | 2021-08-01 21:00:00+00:00 | 03005 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |

An example forecast table for a period diagnostic including station_id is shown below.

| index | forecast | altitude | blend_time | forecast_period | forecast_reference_time | latitude | longitude | time | wmo_id | station_id | cf_name | units | percentile | period | height | diagnostic | experiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 282.69 | 15 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 60 | -5 | 2021-08-01 21:00:00+00:00 | 03001 | 029233 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 1 | 283.2 | 82 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 59 | -4 | 2021-08-01 21:00:00+00:00 | 03002 | 029234 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 2 | 282.62 | 30 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 58 | -3 | 2021-08-01 21:00:00+00:00 | 00000 | 029235 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 3 | 286.17 | 4 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 57 | -2 | 2021-08-01 21:00:00+00:00 | 00000 | 029236 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |
| 4 | 284.43 | 15 | 2021-08-01 00:00:00+00:00 | 0 days 09:00:00 | 2021-08-01 00:00:00+00:00 | 56 | -1 | 2021-08-01 21:00:00+00:00 | 00000 | 029237 | temperature_at_screen_level_daytime_max | K | 5 | 0 days 12:00:00 | 1.5 | temperature_at_screen_level_max-daytime | threshold |

Truth DataFrame

The truth DataFrame is expected to contain the following compulsory columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally, the DataFrame may also contain station_id and units. If the forecast DataFrame also contains station_id, then forecast and truth data will be matched using both wmo_id and station_id. The station_id data may be either string or int. If the truth DataFrame contains a units column, then it will be used for the units of the output truth cube. Otherwise, the units of the truth cube will be copied from the units of the forecast DataFrame. Any other columns not mentioned above will be ignored.

A summary of the expected contents of a truth table is shown below.

| Column | Dtype | Notes |
| --- | --- | --- |
| time | datetime64[ns,UTC] | The time of the observation. |
| wmo_id | object | The five-digit WMO ID. |
| latitude | float32 | The latitude in degrees. |
| longitude | float32 | The longitude in degrees. |
| altitude | float32 | The altitude in metres. |
| ob_value | float32 | The value for a particular observation. |
| diagnostic | category | The name of the diagnostic. |
| units | str | Optional units of the observation values. |

An example truth table is shown below.

| index | ob_value | altitude | latitude | longitude | time | wmo_id | diagnostic |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 283.45 | 15 | 60 | -5 | 2021-08-02 18:00:00+00:00 | 03001 | temperature_at_screen_level |
| 1 | 283.91 | 82 | 59 | -4 | 2021-08-02 18:00:00+00:00 | 03002 | temperature_at_screen_level |
| 2 | 281.63 | 30 | 58 | -3 | 2021-08-02 18:00:00+00:00 | 03003 | temperature_at_screen_level |
| 3 | 286.55 | 4 | 57 | -2 | 2021-08-02 18:00:00+00:00 | 03004 | temperature_at_screen_level |
| 4 | 283.19 | 15 | 56 | -1 | 2021-08-02 18:00:00+00:00 | 03005 | temperature_at_screen_level |

_dataframe_column_check(df, compulsory_columns)

Check that the compulsory columns are present on the DataFrame. Any other columns within the DataFrame are ignored.

Parameters:
  • df (DataFrame) – DataFrame expected to contain the compulsory columns.

  • compulsory_columns (Sequence) – The names of the compulsory columns.

Raises:

ValueError – Raise an error if a compulsory column is missing.

Return type:

None
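A check of this kind could be sketched as follows. The function name check_compulsory_columns and the wording of the error message are hypothetical; this is not the module's implementation, only an illustration of the behaviour described above.

```python
from typing import Sequence

import pandas as pd


def check_compulsory_columns(df: pd.DataFrame, compulsory_columns: Sequence[str]) -> None:
    """Raise a ValueError if any compulsory column is absent (hypothetical helper)."""
    missing = [col for col in compulsory_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing compulsory columns: {missing}")


truth_df = pd.DataFrame(
    {"ob_value": [283.45], "time": ["2021-08-02 18:00"], "wmo_id": ["03001"], "extra": [1]}
)

# Passes silently: all requested columns are present, "extra" is ignored.
check_compulsory_columns(truth_df, ["ob_value", "time", "wmo_id"])
```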

_define_height_coord(height)

Define a height coordinate. A unit of metres is assumed.

Parameters:

height – The value for the height coordinate in metres.

Return type:

AuxCoord

Returns:

The height coordinate.

_define_time_coord(adate, time_bounds=None)

Define a time coordinate. The coordinate will have bounds, if bounds are provided.

Parameters:
Return type:

DimCoord

Returns:

A time coordinate. This coordinate will have bounds, if bounds are provided.

_drop_duplicates(df, cols)

Drop duplicates and then sort the DataFrame.

Parameters:
  • df (DataFrame) – DataFrame to have duplicates removed.

  • cols (Sequence[str]) – Columns for use in removing duplicates and for sorting.

Return type:

DataFrame

Returns:

A DataFrame with duplicates removed (only the last duplicate is kept). The DataFrame is sorted according to the columns provided.
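The behaviour described above (keep only the last duplicate, then sort) can be approximated with standard pandas calls. The function name drop_duplicates_and_sort is hypothetical, and resetting the index after sorting is a choice made for this sketch rather than something the description specifies.

```python
from typing import Sequence

import pandas as pd


def drop_duplicates_and_sort(df: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Keep the last row of each duplicate group, then sort by the same columns."""
    deduplicated = df.drop_duplicates(subset=list(cols), keep="last")
    # ignore_index=True renumbers the rows; this is an assumption of the sketch.
    return deduplicated.sort_values(by=list(cols), ignore_index=True)


df = pd.DataFrame(
    {
        "wmo_id": ["03002", "03001", "03002"],
        "time": pd.to_datetime(["2021-08-02 18:00"] * 3, utc=True),
        "ob_value": [283.0, 283.45, 283.91],
    }
)

# Site 03002 appears twice for the same time; only its last row survives.
result = drop_duplicates_and_sort(df, ["time", "wmo_id"])
print(result)
```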

_ensure_consistent_static_cols(forecast_df, static_cols, site_id_col)

Ensure that the columns expected to have the same value for a given site actually do have the same value. These “static” columns could change if, for example, the altitude of a site is corrected.

Parameters:
  • forecast_df (DataFrame) – Forecast DataFrame.

  • static_cols (List[str]) – List of columns that are expected to be “static”.

  • site_id_col (str) – The name of the column containing the site ID.

Return type:

DataFrame

Returns:

Forecast DataFrame with the same value for a given site for the static columns provided.
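One way to enforce a single value per site is a grouped transform. The sketch below assumes that the last value seen for each site should win (for example, a corrected altitude); which value the module actually keeps is not stated above, so treat that choice, and the function name make_static_cols_consistent, as assumptions.

```python
from typing import List

import pandas as pd


def make_static_cols_consistent(
    forecast_df: pd.DataFrame, static_cols: List[str], site_id_col: str
) -> pd.DataFrame:
    """Overwrite each static column with the last value recorded for that site."""
    result = forecast_df.copy()
    # transform("last") broadcasts the last non-null value of each group
    # back onto every row of that group.
    result[static_cols] = result.groupby(site_id_col)[static_cols].transform("last")
    return result


forecast_df = pd.DataFrame(
    {
        "wmo_id": ["03001", "03001", "03002"],
        "altitude": [15.0, 16.0, 82.0],  # altitude of 03001 was corrected
        "forecast": [282.69, 283.1, 283.2],
    }
)

consistent = make_static_cols_consistent(forecast_df, ["altitude"], "wmo_id")
print(consistent)
```

For "last" to mean "most recent", the rows would need to be in chronological order before the transform is applied.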

_fill_missing_entries(df, combi_cols, static_cols, site_id_col)

Fill the input DataFrame with rows that correspond to missing entries. The expected entries are computed using all combinations of the values within the combi_cols. In practice, this will allow support for creating entries for times that are missing when a new site with an ID is added. If the DataFrame provided is completely empty, then the empty DataFrame is returned.

Parameters:
  • df – DataFrame to be filled with rows corresponding to missing entries.

  • combi_cols – The key columns within the DataFrame. All combinations of the values within these columns are expected to exist, otherwise, an entry will be created.

  • static_cols – The names of the columns that are considered “static” and therefore can be reliably filled using other entries for the given WMO ID.

  • site_id_col – Name of the column used to identify the sites within the DataFrame.

Returns:

DataFrame where any missing combination of the combi_cols will have been created.
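The idea of creating a row for every combination of the key columns can be illustrated with a MultiIndex reindex. This is a sketch under stated assumptions, not the module's implementation: the function name fill_missing_entries_sketch is hypothetical, the site ID column is assumed to be one of the key columns, and the key columns are assumed to contain no duplicate combinations.

```python
from typing import List

import pandas as pd


def fill_missing_entries_sketch(
    df: pd.DataFrame, combi_cols: List[str], static_cols: List[str], site_id_col: str
) -> pd.DataFrame:
    """Add a row for every missing combination of the values in combi_cols."""
    if df.empty:
        return df
    # Every combination of the unique values found in the key columns.
    full_index = pd.MultiIndex.from_product(
        [df[col].unique() for col in combi_cols], names=combi_cols
    )
    # Reindexing inserts all-NaN rows for combinations that were absent.
    filled = df.set_index(combi_cols).reindex(full_index).reset_index()
    # Static columns are copied from another row for the same site.
    filled[static_cols] = filled.groupby(site_id_col)[static_cols].transform("first")
    return filled


times = pd.to_datetime(["2021-08-02 18:00", "2021-08-03 18:00"], utc=True)
truth_df = pd.DataFrame(
    {
        "time": [times[0], times[1], times[0]],
        "wmo_id": ["03001", "03001", "03002"],
        "altitude": [15.0, 15.0, 82.0],
        "ob_value": [283.45, 284.0, 283.91],
    }
)

# Site 03002 has no entry for the second time, so a row is created for it.
filled = fill_missing_entries_sketch(truth_df, ["time", "wmo_id"], ["altitude"], "wmo_id")
print(filled)
```

In the created row the altitude is filled from the existing entry for site 03002, while ob_value remains NaN.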

_prepare_dataframes(forecast_df, truth_df, forecast_period, percentiles=None, experiment=None)

Prepare DataFrames for conversion to cubes by: 1) checking which forecast representation is present, 2) checking that the expected columns are present, 3) (optionally) checking the percentiles are as expected, 4) removing duplicates from the forecast and truth, 5) finding the sites common to both the forecast and truth DataFrames and 6) replacing and supplementing the truth DataFrame with information from the forecast DataFrame. Note that this final step will also ensure that a row containing a NaN for the ob_value is inserted for any missing observations.

Parameters:
  • forecast_df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, one of REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, altitude, period, height, cf_name, units and experiment. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.

  • truth_df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.

  • forecast_period (int) – Forecast period in seconds as an integer.

  • percentiles (Optional[List[float]]) – The set of percentiles to be used for estimating EMOS coefficients.

  • experiment (Optional[str]) – A value within the experiment column to select from the forecast table.

Return type:

Tuple[DataFrame, DataFrame]

Returns:

A sanitised version of the forecasts and truth DataFrames that are ready for conversion to cubes.

_preprocess_temporal_columns(df)

Pre-process the columns with temporal dtype to convert from numpy datetime objects to pandas datetime objects. Casting the dtype of the columns to object type results in columns of dtype “object” with the contents of the columns being pandas datetime objects, rather than numpy datetime objects.

Parameters:

df (DataFrame) – A DataFrame with temporal columns with numpy datetime dtypes.

Return type:

DataFrame

Returns:

A DataFrame without numpy datetime dtypes. The content of the columns with temporal dtypes are accessible as pandas datetime objects.
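The effect of casting temporal columns to object dtype can be seen in a small sketch. The function name temporal_columns_to_object and the use of the pandas.api.types helpers to find the temporal columns are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_timedelta64_dtype


def temporal_columns_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Cast datetime and timedelta columns to object dtype (hypothetical helper)."""
    result = df.copy()
    for col in result.columns:
        if is_datetime64_any_dtype(result[col]) or is_timedelta64_dtype(result[col]):
            # Each element becomes a pandas Timestamp or Timedelta object.
            result[col] = result[col].astype(object)
    return result


df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2021-08-02 18:00"], utc=True),
        "forecast_period": pd.to_timedelta([1], unit="D"),
        "forecast": [282.69],
    }
)

converted = temporal_columns_to_object(df)
print(converted.dtypes)
print(type(converted["time"].iloc[0]))
```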

_quantile_check(df)

Check that the percentiles provided can be considered to be quantiles with equal spacing spanning the percentile range.

Parameters:

df (DataFrame) – DataFrame with a percentile column.

Raises:

ValueError – Percentiles are not equally spaced.

Return type:

None
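As an illustration of what "equally spaced quantiles spanning the percentile range" might mean, the sketch below assumes that n percentiles are expected at 100 * i / (n + 1) for i = 1..n, so that 19 percentiles give 5, 10, ..., 95. That definition, the tolerance and the function name check_equally_spaced_quantiles are assumptions; the module may define the expected set differently.

```python
import math
from typing import Sequence


def check_equally_spaced_quantiles(percentiles: Sequence[float]) -> None:
    """Raise a ValueError unless the unique percentiles are 100 * i / (n + 1)."""
    unique = sorted(set(percentiles))
    n = len(unique)
    expected = [100 * (i + 1) / (n + 1) for i in range(n)]
    if not all(math.isclose(a, b, abs_tol=1e-6) for a, b in zip(unique, expected)):
        raise ValueError(
            f"Percentiles are not equally spaced: got {unique}, expected {expected}"
        )


# 19 percentiles from 5 to 95 in steps of 5 pass the check.
check_equally_spaced_quantiles([float(p) for p in range(5, 100, 5)])
```

With a DataFrame, the values of the percentile column (for example df["percentile"].unique()) could be passed in directly.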

_training_dates_for_calibration(cycletime, forecast_period, training_length)

Compute the date range required for extracting the required training dataset. The final validity time within the training dataset is at least one day prior to the cycletime. The final validity time within the training dataset is additionally offset by the number of days within the forecast period to ensure that the dates defined by the training dataset are in the past relative to the cycletime.

For example, for a cycletime of 20170720T0000Z with a forecast period of T+30 and a training length of 3 days, the validity time is 20170721T0600Z. Subtracting one day gives 20170720T0600Z. Note that this is in the future relative to the cycletime and we want the training dates to be in the past relative to the cycletime. Rounding the T+30 forecast period down to the nearest day gives 1 day. Subtracting this additional day gives 20170719T0600Z. This is the final validity time within the training period. We then compute the validity times for a 3 day training period using 20170719T0600Z as the final validity time, giving 20170719T0600Z, 20170718T0600Z and 20170717T0600Z.

Parameters:
  • cycletime (str) – Cycletime of a format similar to 20170109T0000Z. The training dates will always be in the past, relative to the cycletime.

  • forecast_period (int) – Forecast period in seconds as an integer.

  • training_length (int) – Training length in days as an integer.

Return type:

DatetimeIndex

Returns:

Datetimes defining the training dataset. The number of datetimes is equal to the training length.
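The arithmetic in the worked example can be reproduced with the standard library. This sketch returns a plain list of datetimes rather than a DatetimeIndex, and the function name training_dates is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def training_dates(cycletime: str, forecast_period: int, training_length: int) -> List[datetime]:
    """Return the training validity times, oldest first (hypothetical helper)."""
    cycle = datetime.strptime(cycletime, "%Y%m%dT%H%MZ").replace(tzinfo=timezone.utc)
    period = timedelta(seconds=forecast_period)
    validity = cycle + period
    # Step back one day, plus the whole days contained in the forecast period.
    final = validity - timedelta(days=1 + period.days)
    return [final - timedelta(days=offset) for offset in reversed(range(training_length))]


# Cycletime 20170720T0000Z, forecast period T+30 (108000 seconds), 3 training days.
dates = training_dates("20170720T0000Z", 30 * 3600, 3)
for date in dates:
    print(date.strftime("%Y%m%dT%H%MZ"))
```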

_unique_check(df, column)

Check that the column contains only one unique value.

Parameters:
  • df (DataFrame) – The DataFrame to be checked.

  • column (str) – Name of a column in the DataFrame.

Raises:

ValueError – Only one unique value within the specified column is expected.

Return type:

None
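A minimal sketch of such a check is shown below. The function name check_single_unique_value is hypothetical, and counting missing values as a distinct value (dropna=False) is an assumption of this sketch.

```python
import pandas as pd


def check_single_unique_value(df: pd.DataFrame, column: str) -> None:
    """Raise a ValueError if the column holds more than one distinct value."""
    # dropna=False counts missing values as a distinct value (an assumption).
    if df[column].nunique(dropna=False) > 1:
        raise ValueError(
            f"Only one unique value is expected in column {column!r}, "
            f"found {list(df[column].unique())}"
        )


forecast_df = pd.DataFrame(
    {"units": ["K", "K", "K"], "wmo_id": ["03001", "03002", "03003"]}
)

# Passes silently: every row has the same units.
check_single_unique_value(forecast_df, "units")
```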

forecast_and_truth_dataframes_to_cubes(forecast_df, truth_df, cycletime, forecast_period, training_length, percentiles=None, experiment=None)

Convert a forecast DataFrame into an iris Cube and a truth DataFrame into an iris Cube.

Parameters:
  • forecast_df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, one of REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, period, height, cf_name, units. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.

  • truth_df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude and altitude. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.

  • cycletime (str) – Cycletime of a format similar to 20170109T0000Z.

  • forecast_period (int) – Forecast period in seconds as an integer.

  • training_length (int) – Training length in days as an integer.

  • percentiles (Optional[List[float]]) – The set of percentiles to be used for estimating EMOS coefficients. These should be a set of equally spaced quantiles.

  • experiment (Optional[str]) – A value within the experiment column to select from the forecast table.

Return type:

Tuple[Cube, Cube]

Returns:

Forecasts and truths for the training period in Cube format.

forecast_dataframe_to_cube(df, training_dates, forecast_period)

Convert a forecast DataFrame into an iris Cube. The percentiles within the forecast DataFrame are rebadged as realizations.

Parameters:
  • df (DataFrame) – DataFrame expected to contain the following columns: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, REPRESENTATION_COLUMNS (percentile or realization), diagnostic, latitude, longitude, period, height, cf_name, units. Optionally, the DataFrame may also contain station_id. Any other columns are ignored.

  • training_dates (DatetimeIndex) – Datetimes spanning the training period.

  • forecast_period (int) – Forecast period in seconds as an integer.

Return type:

Cube

Returns:

Cube containing the forecasts from the training period.

get_forecast_representation(df)

Check which of REPRESENTATION_COLUMNS (percentile or realization) exists in the DataFrame.

Parameters:

df (DataFrame) – DataFrame expected to contain exactly one of REPRESENTATION_COLUMNS.

Returns:

The member of REPRESENTATION_COLUMNS found in the DataFrame columns.

Return type:

representation_type

Raises:

ValueError – If none of the allowed columns are present, or more than one is present.
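The "exactly one of" rule can be sketched as follows. REPRESENTATION_COLUMNS is defined locally here with the two names given above; the function name find_representation is hypothetical, and it operates on any collection of column names (for example df.columns) rather than on a DataFrame.

```python
from typing import Iterable

# The two allowed representation columns, as named in the description above.
REPRESENTATION_COLUMNS = ("percentile", "realization")


def find_representation(columns: Iterable[str]) -> str:
    """Return the single representation column present (hypothetical helper)."""
    available = set(columns)
    present = [name for name in REPRESENTATION_COLUMNS if name in available]
    if len(present) != 1:
        raise ValueError(
            f"Expected exactly one of {REPRESENTATION_COLUMNS}, found {present}"
        )
    return present[0]


representation = find_representation(["forecast", "time", "wmo_id", "percentile"])
print(representation)
```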

truth_dataframe_to_cube(df, training_dates)

Convert a truth DataFrame into an iris Cube.

Parameters:
  • df (DataFrame) – DataFrame expected to contain the following columns: ob_value, time, wmo_id, diagnostic, latitude, longitude, altitude, cf_name, height, period. Optionally the DataFrame may also contain the following columns: station_id, units. Any other columns are ignored.

  • training_dates (DatetimeIndex) – Datetimes spanning the training period.

Return type:

Cube

Returns:

Cube containing the truths from the training period.