Opened 6 years ago

Last modified 5 years ago

#142 new enhancement

Coordinate Type: Ensemble — at Version 13

Reported by: markh Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: Cc:

Description (last modified by markh)

1. Title

Ensemble Coordinates

2. Moderator

TBC

3. Requirement

The description of the dimension of a data variable which describes an ensemble of forecasts may involve a number of elements of metadata.

Coordinate variables and auxiliary coordinates describing an ensemble need to be able to be labelled as such.

It is useful to encapsulate information about the nature of the ensemble with the coordinate. Current use cases for metadata standardisation are: the original size of the ensemble; the single-model or multi-model nature of the ensemble; and the presence of an explicit control forecast as realization=0.

4. Technical Proposal

4.6 Ensemble Coordinate

Variables representing an ensemble or collection of realizations shall have an attribute axis with a value E. These variables are discrete, as described in section 4.5. They do not represent continuous quantities.

Ensemble variables have a number of optional standardised attributes available for use. Further bespoke attributes to describe the ensemble in project specific ways are allowed.
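
As a minimal sketch of the identification rule in 4.6, the following Python models variables as plain dictionaries and picks out those labelled with axis = E. The dictionary layout and the variable names (realization, member_label, time) are assumptions made for illustration; they are not part of the proposal.

```python
# Each variable is modelled as {"attrs": {...}, "data": [...]}.
# This layout is an illustrative assumption, not a CF or netCDF API.

def ensemble_coordinates(variables):
    """Return the sorted names of variables whose axis attribute is "E"."""
    return sorted(
        name
        for name, var in variables.items()
        if var.get("attrs", {}).get("axis") == "E"
    )


variables = {
    # A numeric coordinate variable for the ensemble dimension.
    "realization": {"attrs": {"axis": "E"}, "data": [0, 1, 2]},
    # A string-valued auxiliary coordinate on the same dimension.
    "member_label": {"attrs": {"axis": "E"}, "data": ["ctl", "p1", "p2"]},
    # A time coordinate, which must not be picked up.
    "time": {"attrs": {"axis": "T"}, "data": [0.0, 6.0]},
}

ensemble_names = ensemble_coordinates(variables)
```

The check relies only on the axis attribute, so it works equally for numeric and string-valued coordinates.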

4.6.1 Ensemble Size

An ensemble coordinate may include an attribute named ensemble_size representing the original size of the ensemble.

This attribute shall have a value which is a positive integer.

This value provides a context for the realization, which is preserved through sub-setting, slicing and statistical processing. It is expected to remain unchanged through such operations on the data variable, even though those operations alter the values of the realization coordinate.
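
The intended behaviour can be sketched in Python as follows, assuming a coordinate is modelled as a dictionary of attributes and data. The helper names validate_ensemble_size and subset_coordinate are hypothetical and exist only for this illustration.

```python
def validate_ensemble_size(attrs):
    """Check the optional ensemble_size attribute: if present, a positive integer."""
    if "ensemble_size" not in attrs:
        return True  # the attribute is optional
    size = attrs["ensemble_size"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(size, int) and not isinstance(size, bool) and size > 0


def subset_coordinate(coord, indices):
    """Select members by index; attributes are copied through unchanged."""
    return {
        "attrs": dict(coord["attrs"]),
        "data": [coord["data"][i] for i in indices],
    }


full = {"attrs": {"axis": "E", "ensemble_size": 19}, "data": list(range(19))}

# Slice down to a single member: the coordinate data changes,
# but the original ensemble size is still recorded.
single = subset_coordinate(full, [4])
```

After the slice, the length of the coordinate is 1, yet ensemble_size still reports 19.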

4.6.2 Single Model Ensemble

An ensemble coordinate may include an attribute named single_model_ensemble representing the assertion that the ensemble of members all originate from the same numerical model.

This is a boolean field and may only contain the values true or false.

The absence of this attribute shall not be interpreted as a positive or negative statement. No inference about the models providing ensemble members shall be drawn from the absence of this attribute.
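
A reader of this attribute therefore has three outcomes to handle, not two. The sketch below assumes the value may arrive either as a Python bool or as the text true or false (the proposal does not say how the boolean is stored); the function name single_model_status is hypothetical.

```python
def single_model_status(attrs):
    """Return True, False, or None when no statement is made."""
    if "single_model_ensemble" not in attrs:
        return None  # absence is neither a positive nor a negative statement
    value = attrs["single_model_ensemble"]
    if isinstance(value, bool):
        return value
    # Assumption: a textual encoding uses exactly "true" or "false".
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"single_model_ensemble must be true or false, got {value!r}")


status_absent = single_model_status({"axis": "E"})
status_true = single_model_status({"single_model_ensemble": "true"})
status_false = single_model_status({"single_model_ensemble": False})
```

Returning None for the absent case keeps "not stated" distinct from "multi-model".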

4.6.3 Ensemble Control Member

An ensemble coordinate may include an attribute named ensemble_control_member.

This value provides a definition that one member of the ensemble is the control member and identifies this member. This control member shall have the identified value within the ensemble coordinate's data.

The absence of this attribute shall be interpreted as a negative statement, explicitly stating that there is no control member identified within the ensemble.
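
A minimal sketch of these two rules, again modelling the coordinate as a dictionary (an assumption for illustration; control_member_index is a hypothetical helper):

```python
def control_member_index(coord):
    """Return the position of the control member, or None if none is identified."""
    attrs = coord["attrs"]
    if "ensemble_control_member" not in attrs:
        return None  # absence means no control member is identified
    control = attrs["ensemble_control_member"]
    if control not in coord["data"]:
        # The identified value must occur in the coordinate's data.
        raise ValueError(f"control member {control!r} not found in coordinate data")
    return coord["data"].index(control)


with_control = {
    "attrs": {"axis": "E", "ensemble_control_member": 0},
    "data": [0, 1, 2, 3],
}
without_control = {"attrs": {"axis": "E"}, "data": [1, 2, 3]}

control_position = control_member_index(with_control)
no_control = control_member_index(without_control)
```

Because the check compares values rather than assuming realization 0, the same logic would apply to a string-valued coordinate.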

5. Benefits

Information regarding the nature of an ensemble is encoded in an ensemble coordinate, analogous to temporal coordinates.

Encoding of the ensemble size, the presence of a control member and the single or multiple model nature of the ensemble is standardised.

Future standardisation of ensemble characteristics has a model to follow.

6. Status Quo

At present there is no standardised way of capturing information or characteristics about an ensemble; all information is encoded ad hoc by data producers.

For example, the size of the ensemble can only be inferred from the length of a realization dimension. If the ensemble is sliced to leave only one member, the size of the ensemble is lost in the resulting data variable. This example is particularly problematic for conversion from CF to other data formats.

Change History (13)

comment:1 Changed 6 years ago by markh

This ticket leads on from the mailing list discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2015/058424.html and the 'next message (by thread)' postings

The historical discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/057010.html and the 'next message (by thread)' postings are also relevant

comment:2 Changed 6 years ago by mgschultz

Dear Mark,

Reads very well, I think. Only one suggestion: perhaps one should also try to standardize the "description" of the ensemble members? I guess this would be a character variable with string length and ensemble size as dimensions. Would it make sense to suggest another optional attribute for the ensemble coordinate variable which points to this description variable?

Best regards, Martin

comment:4 follow-up: Changed 6 years ago by jonathan

Dear Mark

Thanks for making this proposal. I didn't have the opportunity to comment in the email discussion because I was on holiday then.

I think you are right that ensemble axes need special recognition. We should also acknowledge that an ensemble axis is a special case of a discrete axis (CF sect 4.5). If we add a new section 4.6 for ensemble axes, that should be stated at the outset. Conversely, the ensemble axis should be added to the list of applications of discrete axes in sect 4.5.

When saying that an ensemble axis must have a standard name of realization, you are assuming it will have a coordinate variable. However, I don't think it needs to have one. The elements of an ensemble might be distinguished by a combination of auxiliary coordinate variables, and might not have any meaningful order, which is implied by a coordinate variable. However, I think it would be OK to say that a coordinate variable with a standard name of realization indicates it is an ensemble axis. I would say that units of 1 should not be required because dimensionless units are generally allowed to be omitted. (Requiring them would be something for the "strict" variety of CF which has been proposed.) If there is no other standard name for an ensemble axis for the moment, then I don't think we need a value of the axis attribute, because the standard name alone will do the job of identifying it, and the axis is redundant. (This is a different situation from the spatial axes, which have various means of identification, inherited from COARDS, so the axis attribute is a useful extra clue.)

Apart from the realization number, ensemble members might be identified by string-valued auxiliary coordinate variables. It was proposed a long time ago, but not agreed, that we could introduce standard names of institution and source for such variables, with the same meanings as the attributes of those names. In the context of CMIP, experiment would also be a good standard name to have, while the realizations in CMIP are identified by strings e.g. r1i1p1, not numbers. Therefore it would be good to allow realization coordinates to be string-valued. I think it would be useful to say something about the use of auxiliary coordinate variables to describe ensemble members in this way.
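
To illustrate what a string-valued realization label carries, here is a small sketch that splits a CMIP5-style label such as r1i1p1 into its realization, initialization and physics indices. The function name parse_rip and the returned dictionary keys are hypothetical choices for this example.

```python
import re

# CMIP5-style member label: r<realization>i<initialization>p<physics>.
_RIP = re.compile(r"^r(\d+)i(\d+)p(\d+)$")


def parse_rip(label):
    """Split a label such as "r1i1p1" into its three integer indices."""
    match = _RIP.match(label)
    if match is None:
        raise ValueError(f"not an r/i/p label: {label!r}")
    realization, initialization, physics = (int(g) for g in match.groups())
    return {
        "realization": realization,
        "initialization": initialization,
        "physics": physics,
    }


parsed = parse_rip("r1i1p1")
```

A numeric realization coordinate could hold only the first of these three indices, which is one reason to allow string values.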

About ensemble_size. In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list. That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations. In the earlier discussion, I believe that your use-case for subsetting was the selection of a single ensemble member, which I thought could be regarded as a point cell method. Do you have a use-case for a selection of a subset of several ensemble members? Maybe that could be recorded as a cell method too. Consider a situation where you have 10 ensemble members, you select 5 of them, and then you compute the ensemble mean. I think that recording the size 5, which would naturally belong in cell methods, would be at least as useful as recording the original size 10. I wonder what you use the original ensemble size for? It also seems to me that any treatment of this kind could apply to any discrete axis, not just an ensemble axis.

The standard name of source would be appropriate for a string-valued auxiliary coordinate variable that identifies the model (see CF sect 2.6.2). If we had that facility, I don't think you'd need the single_model_ensemble attribute. If they are all from the same model, it can be identified by a scalar coordinate of source; if they come from different models, there can be an auxiliary coordinate variable of source with the dimension of the ensemble size.
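
This alternative can be sketched as follows: the single-model question is answered from the source values themselves. The model names model_a and model_b are placeholders, and is_single_model is a hypothetical helper.

```python
def is_single_model(source_values):
    """Decide single-model status from a source coordinate.

    source_values is either one string (a scalar coordinate shared by all
    members) or a list with one source string per ensemble member.
    """
    if isinstance(source_values, str):
        return True  # a scalar source applies to every member
    return len(set(source_values)) == 1


scalar_case = is_single_model("model_a")
same_model_case = is_single_model(["model_a", "model_a", "model_a"])
multi_model_case = is_single_model(["model_a", "model_b", "model_a"])
```

Note that this approach only answers the question while the source coordinate is present; it makes no statement when the coordinate is missing.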

I haven't come across the idea behind the ensemble_control_member_0 before. This sounds rather specific to a particular application. Is it in sufficiently widespread use that it requires a standard? What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent.

Best wishes

Jonathan

comment:5 Changed 6 years ago by markh

To meet my use cases I have focussed on numerical realization coordinates.

However, I have no particular focus on coordinate variables, this identification mechanism should work for auxiliary coordinates just as well.

It sounds like there is a desire to have a string based identification and perhaps derived identifiers from multiple sources. These sound like valid use cases to me.

I suggest we generalise the proposal such that any coordinate can be used to describe an ensemble axis. This would suggest that the standard_name realization is not a good way to identify this coordinate type.

The introduction to chapter 4 states: The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T which stand for a longitude, latitude, vertical, or time axis respectively.

Would you like this to imply that axis should only be used on a coordinate variable? (Appendix A does not place such a restriction) (This is not part of the compliance check, afaics)

Would you like this to imply that a data variable will reference one or zero coordinates with an axis attribute of a particular value, or would it be suitable to have a data variable referencing multiple coordinates?

If neither of these restrictions apply, then we can add the axis = E identifier to any coordinate or auxiliary coordinate variable and use this to handle ensemble coordinate typing.

This seems like a good solution to me.

In light of this, I will reconsider the specification of a control member where the coordinate may not be numerical. This is a required use case, and commonly delivered in WMO specific contexts for short range ensemble forecasts.

Updates to follow

comment:6 in reply to: ↑ 4 Changed 6 years ago by markh

Replying to jonathan:

In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list.

#108 is not aiming to store an original size; the intent of #108 is to capture the domain, analogous to the 'interval' in cell_methods currently available, but in a richer fashion. In this scenario, the domain could store the realization numbers which were input to a statistical process, but this may not be the original size.

These two pieces of information are independent.

I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations.

In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members
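
A sketch of that workflow in Python, using synthetic member values (each member's value is simply its realization number) and an arbitrary choice of four dropped members; both are assumptions made purely for illustration.

```python
from statistics import mean, stdev

ORIGINAL_SIZE = 19

# Synthetic data: one value per realization, equal to the realization number.
values = {member: float(member) for member in range(ORIGINAL_SIZE)}

# Arbitrary choice of four members to drop, leaving 15 of the original 19.
dropped = {1, 5, 9, 13}
selected = [member for member in sorted(values) if member not in dropped]
selected_values = [values[member] for member in selected]

summary = {
    # Domain of the statistic: which realizations were inputs.
    "realizations": selected,
    # Independent fact about the coordinate: how large the ensemble was.
    "ensemble_size": ORIGINAL_SIZE,
    "mean": mean(selected_values),
    "standard_deviation": stdev(selected_values),
}
```

The list of input realizations and the original ensemble size are stored separately because neither can be derived from the other.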

That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations.

This is the case. More generally, the ensemble size is information about the coordinate, not about the data variable directly, and not about statistics calculated over the ensemble dimension, so cell_methods appears to me to be the wrong place to store this information.

I am more comfortable looking for analogies with space and time coordinates, such as calendar definitions for ways to define this metadata in a controlled manner.

comment:7 follow-up: Changed 6 years ago by jonathan

Dear Mark

Yes, I agree that axis="E" would be a better way to identify an ensemble axis than depending on a particular standard name. As originally introduced, the axis attribute was intended for coordinate variables, not auxiliary coordinate variables, but we've agreed a ticket which allows it for aux coord vars too. As you say, axis="E" could be attached to any of the aux coord vars, or the coord var (if there is one), of an ensemble axis.

I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations. In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members

OK. I too am more comfortable looking for analogies with spatiotemporal coordinates, and you can imagine that there might be need to record the original dimension of other axes before a selection was made. Therefore I think it would be good to use an attribute which didn't say "ensemble" in its name.

Although I appreciate this doesn't feel quite like cell methods, there is a relationship, I suggest, since cell methods has the idea that each cell represents variation within itself, and therefore allows the original spacing of the data to be recorded. By an extension, you could regard a selection of the points along an axis as an operation e.g. a sample operation, which maps the original full range of variation into a smaller number of cells. Then you put e.g. cell_methods="ensemble: sample (dimension: original_ensemble) ensemble: mean". This would use a new cell method, instead of a new attribute. In your case, dimension ensemble=15 and original_ensemble=19.

Best wishes

Jonathan

comment:8 in reply to: ↑ 7 Changed 5 years ago by jonathan

Correction to my last posting. I didn't get the final paragraph quite right. In your example, the ensemble originally had 19 members, of which you selected a subset of 15, and then you computed e.g. the mean of them. So finally the ensemble axis has size 1, because it's been collapsed, but you may want to record the size it had before collapse, and before sampling. This could be done by using cell_methods to list these as consecutive operations, with an extension to record the dimension before each one: cell_methods="ensemble: sample (dimension: full_ensemble) ensemble: mean (dimension: sub_ensemble)", where dimensions full_ensemble=19, sub_ensemble=15 and ensemble=1.
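
As a rough sketch of how a reader might interpret that string, the Python below parses the suggested "(dimension: name)" extension and resolves each name to a size. The syntax is only the suggestion made in this comment, not an agreed part of CF, and the function name parse_cell_methods is hypothetical.

```python
import re

# One step: "<axis>: <method>" optionally followed by "(dimension: <name>)".
_STEP = re.compile(r"(\w+):\s*(\w+)(?:\s*\(dimension:\s*(\w+)\))?")


def parse_cell_methods(text, dimensions):
    """Return (axis, method, size_before) for each step, in order.

    size_before is None when a step does not name a dimension.
    """
    steps = []
    for axis, method, dim_name in _STEP.findall(text):
        size_before = dimensions[dim_name] if dim_name else None
        steps.append((axis, method, size_before))
    return steps


dimensions = {"full_ensemble": 19, "sub_ensemble": 15, "ensemble": 1}
cell_methods = (
    "ensemble: sample (dimension: full_ensemble) "
    "ensemble: mean (dimension: sub_ensemble)"
)

steps = parse_cell_methods(cell_methods, dimensions)
```

Read in order, the steps say that 19 members were sampled and then 15 were averaged, leaving an ensemble axis of size 1.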

Cheers, Jonathan

comment:9 Changed 5 years ago by markh

  • Description modified (diff)

It looks from the comments like we are in agreement about the use of axis=E to define that a coordinate is an ensemble coordinate. It also looks like there are good reasons not to rely on a standard name.

I am updating the summary to reflect this.

There is still discussion about how to capture associated metadata; further consideration is needed on this aspect.

comment:10 Changed 5 years ago by markh

The question of how to capture coordinate-level metadata is still outstanding.

Regarding 'size of ensemble':

  • Jonathan has highlighted that it is plausible to address this as an extension to cell methods.
  • I have proposed that this is captured as an attribute on an ensemble coordinate.

Both of these appear to be reasonable interpretations.

The other two use cases I have to address to meet my users' needs are:

  • single model ensemble (boolean)
  • ensemble control member (value)

Neither of these elements seems to fit the Cell Methods paradigm at all well.

With this in mind I think it is preferable to encapsulate all the ensemble coordinate metadata as named attributes on the ensemble coordinate, rather than extending cell methods to meet this use case and still requiring extra attributes.

Please may people share thoughts on why there is benefit in extending cell methods to enable 'original ensemble size' to be encoded there, when it is still necessary to define new limited scope attributes to address the use cases I have provided?

If we do not identify strong benefit from the cell methods approach, I think we should use defined name attributes for all these cases.

Thank you, Mark

comment:11 Changed 5 years ago by jonathan

Dear Mark

Thanks for your posting. I would say that cell_methods offers a fairly natural way to encode more useful information than a simple attribute for the original ensemble size would do (as in my example above), and that there's no need to link this to your other two proposals, which don't belong in cell_methods, I agree. I commented above on these other two issues.

Rather than a new single_model_ensemble attribute, the source attribute could be used to identify the model. If there is just one model, there's just one word. If the ensemble has several models, it would be a list of words. Alternatively, you could have a string-valued auxiliary coordinate variable to record the model for each ensemble member.

I haven't come across the idea behind the ensemble_control_member_0 before. What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent. You mention that it is used in WMO contexts. Please could you give an example to illustrate the use case?

Best wishes

Jonathan

comment:12 Changed 5 years ago by markh

An ensemble control member is a deterministic forecast within the ensemble. There are no perturbations or stochastic physics applied to this member. It is generally stable through the development of the ensemble configuration.

Numerous models make a scientific decision to provide a control member as a distinct member of the ensemble; in these cases there is value in explicitly identifying this member. It is a member of the ensemble; it is not separate from it.

The Met Office GloSea ensemble makes a scientific decision not to have a control member: all members within the ensemble are perturbed and there is no control.

Other models contributing to the ECMWF S2S project https://software.ecmwf.int/wiki/display/S2S/ explicitly include a control member and label that member as the control as part of their data submission.

comment:13 Changed 5 years ago by markh

  • Description modified (diff)