Opened 6 years ago

Last modified 5 years ago

#142 new enhancement

Coordinate Type: Ensemble — at Version 31

Reported by: markh Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: Cc:

Description (last modified by markh)

1. Title

Ensemble Coordinates

2. Moderator

TBC

3. Requirement

The description of the dimension of a data variable which describes an ensemble of forecasts may involve a number of elements of metadata.

Coordinate variables and auxiliary coordinates describing an ensemble need to be able to be labelled as such.

It is useful to encapsulate information about the nature of the ensemble with the coordinate. Current use cases for metadata standardisation are: the original size of the ensemble; the single-model or multi-model nature of the ensemble; and the presence of an explicit control forecast as realization=0.

4. Technical Proposal

4.6 Ensemble Coordinate

Variables representing an ensemble or collection of realizations shall have an attribute axis with a value E. These variables are discrete, as described in section 4.5; they do not represent continuous quantities.
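
A minimal sketch of the identification rule, assuming the variables of a file have been read into a plain dictionary of attribute dictionaries. The layout and the function name find_ensemble_coordinates are hypothetical; neither is part of CF or of any library.

```python
# Hypothetical in-memory stand-in for the variables of a netCDF file:
# each variable name maps to a dict of its attributes.
def find_ensemble_coordinates(variables):
    """Return the sorted names of variables whose axis attribute is "E"."""
    return sorted(
        name for name, attrs in variables.items()
        if attrs.get("axis") == "E"
    )


variables = {
    "time": {"axis": "T", "standard_name": "time"},
    "realization": {"axis": "E", "standard_name": "realization"},
    "member_label": {"axis": "E"},
    "air_temperature": {"standard_name": "air_temperature"},
}
ensemble_coords = find_ensemble_coordinates(variables)
```

Note that the check relies only on the axis attribute, so it works for coordinates that carry no standard name.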

Ensemble variables have a number of optional standardised attributes available for use. Further bespoke attributes to describe the ensemble in project specific ways are always allowed.

4.6.1 Identifying Members

A data variable representing an ensemble will commonly have an ensemble coordinate with the standard name realization; this is not mandatory. The realization standard name is used to provide a unique identifying number to each ensemble member within the ensemble.

An ensemble coordinate providing a string label for each ensemble member within the ensemble may also be included. The standard name ensemble_member_label is proposed (canonical unit = (string expected)); this name is not yet approved as a standard_name. An ensemble_member_label shall have unique values and missing values are not allowed.
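
The two constraints on ensemble_member_label (unique values, no missing values) could be checked roughly as follows. This is a sketch only: the function name is hypothetical, and treating an empty string or None as "missing" is an assumption, since the proposal does not say how a missing label would be encoded.

```python
def validate_member_labels(labels, missing=("", None)):
    """Return a list of problems found in an ensemble_member_label array.

    `missing` lists the values treated as missing (an assumption made
    for this sketch).
    """
    problems = []
    if any(label in missing for label in labels):
        problems.append("missing value present")
    present = [label for label in labels if label not in missing]
    if len(set(present)) != len(present):
        problems.append("duplicate labels present")
    return problems


ok = validate_member_labels(["r1i1p1", "r2i1p1", "r3i1p1"])
duplicated = validate_member_labels(["r1i1p1", "r1i1p1"])
with_gap = validate_member_labels(["r1i1p1", ""])
```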

4.6.2 Ensemble Control Member

An ensemble coordinate with the standard name realization or ensemble_member_label may include an attribute named ensemble_control_member.

This value provides a definition that one member of the ensemble is the control member and identifies this member. This control member shall have the identified value within the ensemble coordinate's data.

The absence of this attribute shall be interpreted as a negative statement, explicitly stating that there is no control member identified within the ensemble.
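
As an illustration of how a reader of the data might apply these rules, the sketch below looks up the control member by value. The function name and the (declared, index) return shape are assumptions made for this example. The case where the attribute is set but its value is not in the coordinate data reflects the subsetting scenario discussed in the comments below.

```python
def control_member_index(values, attributes):
    """Locate the control member in an ensemble coordinate.

    Returns (declared, index):
      declared -- True if ensemble_control_member is set on the coordinate
      index    -- position of the control within `values`, or None if it
                  is not among the values (or no control is declared)
    """
    if "ensemble_control_member" not in attributes:
        # Absence is read as "no control member identified".
        return False, None
    control = attributes["ensemble_control_member"]
    index = values.index(control) if control in values else None
    return True, index


found = control_member_index(
    [0, 1, 2], {"axis": "E", "ensemble_control_member": 0}
)
no_control = control_member_index([1, 2, 3], {"axis": "E"})
```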

4.6.3 Single or Multiple Model Ensemble

An ensemble coordinate may be identified as being from one model or multiple models by providing further variables identified as coordinates or auxiliary coordinates by the data variable. All of these coordinates shall have an attribute axis with a value E.

The standard names source and institution are used to identify the multiplicity of models from which the ensemble is taken; one, the other, or both may be present.

Such coordinates may be scalar coordinate variables or they may be attached to the same dimension(s) as other ensemble coordinates referenced by a data variable. Scalar coordinates are commonly used to define a single model ensemble; in this case, this is informationally equivalent to auxiliary coordinates with identical values.

The absence of such source and institution coordinates shall not be interpreted as a positive or negative statement. No inference about the multiplicity of models from which the ensemble is taken shall be drawn from the absence of such coordinates.
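
The three possible outcomes (single model, multiple models, no statement) can be sketched as below. The function name and the string results are hypothetical; the input is assumed to be the list of values of a source (or institution) coordinate, with one entry for a scalar coordinate.

```python
def model_multiplicity(source_values):
    """Classify an ensemble from the values of a source coordinate.

    `source_values` is None when no such coordinate is present; otherwise
    it holds one value (scalar coordinate) or one value per member.
    """
    if not source_values:
        # Absence carries no information either way.
        return "unknown"
    if len(set(source_values)) == 1:
        return "single-model"
    return "multi-model"


scalar_case = model_multiplicity(["model_a"])
per_member_same = model_multiplicity(["model_a", "model_a", "model_a"])
per_member_mixed = model_multiplicity(["model_a", "model_b", "model_a"])
absent = model_multiplicity(None)
```

The scalar case and the per-member case with identical values give the same answer, which is the informational equivalence noted above.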

5. Benefits

Information regarding the nature of an ensemble is encoded in an ensemble coordinate, analogous to temporal coordinates.

Encoding of the presence of a control member and the single or multiple model nature of the ensemble is standardised.

Future standardisation of ensemble characteristics has a model to follow.

6. Status Quo

At present there is no standardised way of capturing information or characteristics about an ensemble; all information is encoded ad hoc by data producers.

For example, the size of the ensemble can only be inferred from the length of a realization dimension. If the ensemble is sliced, to leave only one member, the size of the ensemble is lost in the resulting data variable. This example is particularly problematic for conversion from CF to other data formats.
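
The loss of information on slicing can be illustrated with a small sketch. The in-memory coordinate layout and the helper slice_coordinate are hypothetical, and ensemble_size is used only as an illustrative attribute name taken from the discussion below; it is not an agreed CF attribute.

```python
def slice_coordinate(coord, keep):
    """Select members by position, carrying the attributes over unchanged."""
    return {
        "values": [coord["values"][i] for i in keep],
        "attributes": dict(coord["attributes"]),
    }


# A 19-member ensemble coordinate with an illustrative size attribute.
full = {
    "values": list(range(19)),
    "attributes": {"axis": "E", "ensemble_size": 19},
}
single = slice_coordinate(full, keep=[7])

# Size inferred from the dimension length: the original size is lost.
inferred_size = len(single["values"])
# Size recorded as coordinate metadata: survives the slice.
recorded_size = single["attributes"]["ensemble_size"]
```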

Change History (31)

comment:1 Changed 6 years ago by markh

This ticket leads on from the mailing list discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2015/058424.html and the 'next message (by thread)' postings

The historical discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/057010.html and the 'next message (by thread)' postings are also relevant

comment:2 Changed 6 years ago by mgschultz

Dear Mark,

reads very well, I think. Only one suggestion: perhaps one should also try to standardize the "description" of the ensemble members? I guess, this would be a character variable with string length and ensemble size as dimensions. Would it make sense to suggest another optional attribute for the ensemble coordinate variable which points to this description variable?

Best regards, Martin

comment:4 follow-up: Changed 6 years ago by jonathan

Dear Mark

Thanks for making this proposal. I didn't have the opportunity to comment in the email discussion because I was on holiday then.

I think you are right that ensemble axes need special recognition. We should also acknowledge that an ensemble axis is a special case of a discrete axis (CF sect 4.5). If we add a new section 4.6 for ensemble axes, that should be stated at the outset. Conversely, the ensemble axis should be added to the list of applications of discrete axes in sect 4.5.

When saying that an ensemble axis must have a standard name of realization, you are assuming it will have a coordinate variable. However, I don't think it needs to have one. The elements of an ensemble might be distinguished by a combination of auxiliary coordinate variables, and might not have any meaningful order, which is implied by a coordinate variable. However, I think it would be OK to say that a coordinate variable with a standard name of realization indicates it is an ensemble axis. I would say that units of 1 should not be required because dimensionless units are generally allowed to be omitted. (Requiring them would be something for the "strict" variety of CF which has been proposed.) If there is no other standard name for an ensemble axis for the moment, then I don't think we need a value of the axis attribute, because the standard name alone will do the job of identifying it, and the axis is redundant. (This is a different situation from the spatial axes, which have various means of identification, inherited from COARDS, so the axis attribute is a useful extra clue.)

Apart from the realization number, ensemble members might be identified by string-valued auxiliary coordinate variables. It was proposed a long time ago, but not agreed, that we could introduce standard names of institution and source for such variables, with the same meanings as the attributes of those names. In the context of CMIP, experiment would also be a good standard name to have, while the realizations in CMIP are identified by strings e.g. r1i1p1, not numbers. Therefore it would be good to allow realization coordinates to be string-valued. I think it would be useful to say something about the use of auxiliary coordinate variables to describe ensemble members in this way.

About ensemble_size. In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list. That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations. In the earlier discussion, I believe that your use-case for subsetting was the selection of a single ensemble member, which I thought could be regarded as a point cell method. Do you have a use-case for a selection of a subset of several ensemble members? Maybe that could be recorded as a cell method too. Consider a situation where you have 10 ensemble members, you select 5 of them, and then you compute the ensemble mean. I think that recording the size 5, which would naturally belong in cell methods, would be at least as useful as recording the original size 10. I wonder what you use the original ensemble size for? It also seems to me that any treatment of this kind could apply to any discrete axis, not just an ensemble axis.

The standard name of source would be appropriate for a string-valued auxiliary coordinate variable that identifies the model (see CF sect 2.6.2). If we had that facility, I don't think you'd need the single_model_ensemble attribute. If they are all from the same model, it can be identified by a scalar coordinate of source; if they come from different models, there can be an auxiliary coordinate variable of source with the dimension of the ensemble size.

I haven't come across the idea behind the ensemble_control_member_0 before. This sounds rather specific to a particular application. Is it in sufficiently widespread use that it requires a standard? What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent.

Best wishes

Jonathan

comment:5 Changed 6 years ago by markh

To meet my use cases I have focussed on numerical realization coordinates.

However, I have no particular focus on coordinate variables, this identification mechanism should work for auxiliary coordinates just as well.

It sounds like there is a desire to have a string based identification and perhaps derived identifiers from multiple sources. These sound like valid use cases to me.

I suggest we generalise the proposal such that any coordinate can be used to describe an ensemble axis. This would suggest that the standard_name realization is not a good way to identify this coordinate type.

The introduction to chapter 4 states: The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T which stand for a longitude, latitude, vertical, or time axis respectively.

Would you like this to imply that axis should only be used on a coordinate variable? (Appendix A does not place such a restriction) (This is not part of the compliance check, afaics)

Would you like this to imply that a data variable will reference one or zero coordinates with an axis attribute of a particular value, or would it be suitable to have a data variable referencing multiple coordinates?

If neither of these restrictions apply, then we can add the axis = E identifier to any coordinate or auxiliary coordinate variable and use this to handle ensemble coordinate typing.

This seems like a good solution to me.

In light of this, I will reconsider the specification of a control member where the coordinate may not be numerical. This is a required use case, and commonly delivered in WMO specific contexts for short range ensemble forecasts

Updates to follow

comment:6 in reply to: ↑ 4 Changed 6 years ago by markh

Replying to jonathan:

In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list.

#108 is not aiming to store an original size; the intent of #108 is to capture the domain, analogous to the 'interval' in cell_methods currently available, but in a richer fashion. In this scenario, the domain could store the realization numbers which were input to a statistical process, but this may not be the original size.

These two pieces of information are independent.

I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations.

In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members

That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations.

This is the case. More generally, the ensemble size is information about the coordinate, not about the data variable directly, and not about statistics calculated over the ensemble dimension, so cell_methods appears to me to be the wrong place to store this information.

I am more comfortable looking for analogies with space and time coordinates, such as calendar definitions for ways to define this metadata in a controlled manner.

comment:7 follow-up: Changed 6 years ago by jonathan

Dear Mark

Yes, I agree that axis="E" would be a better way to identify an ensemble axis than depending on a particular standard name. As originally introduced, the axis attribute was intended for coordinate variables, not auxiliary coordinate variables, but we've agreed a ticket which allows it for aux coord vars too. As you say, axis="E" could be attached to any of the aux coord vars, or the coord var (if there is one), of an ensemble axis.

I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations. In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members

OK. I too am more comfortable looking for analogies with spatiotemporal coordinates, and you can imagine that there might be need to record the original dimension of other axes before a selection was made. Therefore I think it would be good to use an attribute which didn't say "ensemble" in its name.

Although I appreciate this doesn't feel quite like cell methods, there is a relationship, I suggest, since cell methods has the idea that each cell represents variation within itself, and therefore allows the original spacing of the data to be recorded. By an extension, you could regard a selection of the points along an axis as an operation e.g. a sample operation, which maps the original full range of variation into a smaller number of cells. Then you put e.g. cell_methods="ensemble: sample (dimension: original_ensemble) ensemble: mean". This would use a new cell method, instead of a new attribute. In your case, dimension ensemble=15 and original_ensemble=19.

Best wishes

Jonathan

comment:8 in reply to: ↑ 7 Changed 5 years ago by jonathan

Correction to my last posting. I didn't get the final paragraph quite right. In your example, the ensemble originally had 19 members, of which you selected a subset of 15, and then you computed e.g. the mean of them. So finally the ensemble axis has size 1, because it's been collapsed, but you may want to record the size it had before collapse, and before sampling. This could be done by using cell_methods to list these as consecutive operations, with an extension to record the dimension before each one: cell_methods="ensemble: sample (dimension: full_ensemble) ensemble: mean (dimension: sub_ensemble)", where dimensions full_ensemble=19, sub_ensemble=15 and ensemble=1.
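
To make the suggested encoding concrete, here is a rough sketch of how a reader might recover the sizes from such a cell_methods string. The "(dimension: name)" comment is the extension proposed in this comment, not existing CF syntax, and the parser handles only this simple form. The function name and the dimension table are hypothetical.

```python
import re

# Matches "name: method" optionally followed by "(dimension: dimname)".
_ENTRY = re.compile(r"(\w+):\s+(\w+)(?:\s+\(dimension:\s+(\w+)\))?")


def sizes_before_operations(cell_methods, dimensions):
    """Return (axis, method, size_before) for each cell_methods entry.

    size_before is None when the entry has no dimension comment.
    """
    steps = []
    for axis, method, dim_name in _ENTRY.findall(cell_methods):
        size = dimensions[dim_name] if dim_name else None
        steps.append((axis, method, size))
    return steps


dimensions = {"full_ensemble": 19, "sub_ensemble": 15, "ensemble": 1}
steps = sizes_before_operations(
    "ensemble: sample (dimension: full_ensemble) "
    "ensemble: mean (dimension: sub_ensemble)",
    dimensions,
)
```

Read in order, the result says that 19 members were sampled down to 15, and that the mean was then taken over those 15.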

Cheers, Jonathan

comment:9 Changed 5 years ago by markh

  • Description modified (diff)

It looks from the comments like we are in agreement about the use of axis=E to define that a coordinate is an ensemble coordinate. It also looks like there are good reasons not to rely on a standard name.

I am updating the summary to reflect this.

There is still discussion about how to capture associated metadata, further consideration is needed on this aspect.

comment:10 Changed 5 years ago by markh

The question of how to capture coordinate-level metadata is still outstanding.

regarding 'size of ensemble':

  • Jonathan has highlighted that it is plausible to address this as an extension to cell methods.
  • I have proposed that this is captured as an attribute on an ensemble coordinate

Both of these appear reasonable interpretations

The other two use cases I have to address to meet my users' needs are:

  • single model ensemble (boolean)
  • ensemble control member (value)

Neither of these elements seem to fit the Cell Methods paradigm at all well.

With this in mind I think it is preferable to encapsulate all the ensemble coordinate metadata as named attributes on the ensemble coordinate, rather than extending cell methods to meet this use case and still requiring extra attributes.

Please may people share thoughts on why there is benefit in extending cell methods to enable 'original ensemble size' to be encoded there, when it is still necessary to define new limited scope attributes to address the use cases I have provided?

If we do not identify strong benefit from the cell methods approach, I think we should use defined name attributes for all these cases.

thank you mark

comment:11 Changed 5 years ago by jonathan

Dear Mark

Thanks for your posting. I would say that cell_methods offers a fairly natural way to encode more useful information than a simple attribute for the original ensemble size would do (as in my example above), and that there's no need to link this to your other two proposals, which don't belong in cell_methods, I agree. I commented above on these other two issues.

Rather than a new single_model_ensemble attribute, the source attribute could be used to identify the model. If there is just one model, there's just one word. If the ensemble has several models, it would be a list of words. Alternatively, you could have a string-valued auxiliary coordinate variable to record the model for each ensemble member.

I haven't come across the idea behind the ensemble_control_member_0 before. What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent. You mention that it is used in WMO contexts. Please could you give an example to illustrate the use case?

Best wishes

Jonathan

comment:12 Changed 5 years ago by markh

An ensemble control member is a deterministic forecast within the ensemble. There are no perturbations or stochastic physics applied to this member. It is generally stable through the development of the ensemble configuration.

Numerous models make a scientific decision to provide a control member as a distinct member of the ensemble, in these cases there is value in explicitly identifying this member. It is a member of the ensemble, it is not separate from it.

The Met Office GloSea ensemble makes a scientific decision not to have a control member, all members within the ensemble are perturbed, there is no control.

Other models contributing to the ECMWF S2S project https://software.ecmwf.int/wiki/display/S2S/ explicitly include a control member and label that member as the control as part of their data submission.

comment:13 Changed 5 years ago by markh

  • Description modified (diff)

comment:14 follow-up: Changed 5 years ago by jonathan

Dear Mark

OK, I understand about the ensemble control member. Thanks. That information about its purpose would be useful to include in your next. I think the ensemble_control_member attribute should be allowed only if axis="E" is present. How is the member identified? For instance, is it an index within the ensemble dimension?

Best wishes

Jonathan

comment:15 in reply to: ↑ 14 Changed 5 years ago by markh

Replying to jonathan:

I think the ensemble_control_member attribute should be allowed only if axis="E" is present.

I agree, this attribute is only valid for a variable with axis="E" set.

How is the member identified? For instance, is it an index within the ensemble dimension?

I propose the value of the coordinate which identifies the control is the value of the attribute. So, if there is a coordinate where the control member is identified by a realization value of zero, the coordinate would be:

int realization(realization) ;
    realization:axis = "E" ;
    realization:ensemble_control_member = 0 ;

comment:16 follow-up: Changed 5 years ago by jonathan

As I mentioned above, realizations might be identified by strings (as in CMIP) rather than numbers, which means that we couldn't specify the data type of ensemble_control_member. However, if it were an index within the ensemble dimension, it would definitely be a number.

Jonathan

comment:17 in reply to: ↑ 16 Changed 5 years ago by markh

Replying to jonathan:

As I mentioned above, realizations might be identified by strings (as in CMIP) rather than numbers, which means that we couldn't specify the data type of ensemble_control_member. However, if it were an index within the ensemble dimension, it would definitely be a number.

The data type of the attribute shall be consistent with the data type of the coordinate it is attached to.

Specifying the index is troublesome, as any change to the coordinate, whether in order, by subsetting, or anything similar, breaks the link.

This approach is too fragile, in my view. Specifying the coordinate's value is safer.
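
A small sketch of the fragility argument, using plain Python lists as a stand-in for the coordinate data. The member labels and variable names are made up for illustration.

```python
# Original ensemble coordinate values; "a" is the control member.
values = ["a", "b", "c", "d"]
control_value = "a"                        # control recorded by value
control_index = values.index("a")          # control recorded by index (0)

# A downstream tool reorders and subsets the ensemble axis.
reordered = ["d", "c", "a"]

# Lookup by value still finds the control member, wherever it now sits.
by_value = (
    reordered.index(control_value) if control_value in reordered else None
)

# Lookup by the stale index silently points at a different member.
by_index = reordered[control_index]
```

Here the value-based lookup lands on "a" at its new position, while the stored index now selects "d".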

comment:18 follow-up: Changed 5 years ago by jonathan

Dear Mark

Most CF attributes have a defined data type (string or numeric), though there are a few whose type depends on the variable viz. _FillValue, missing_value, flag_masks, flag_values. I agree that what you suggest would work, since the purpose of the attribute is to compare with the values in the variable to find the one which matches.

Another possibility occurs to me, which would avoid defining an attribute altogether - I'm sure you've noticed I tend to prefer not to define new machinery if we can make use of something we have! You could define a standard name such as ensemble_control_binary_mask, for an auxiliary coordinate variable which is 1 for the ensemble control member and 0 for the others. This would be perhaps even more robust, because it would survive subsetting and permutation of the ensemble axis, since all auxiliary coordinate variables have to be transformed consistently by such operations. If you deleted the ensemble control member from the ensemble, you would not have to remember to zap the attribute, since the 1 element would vanish from the mask at the same time.
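
The mask alternative can be sketched as follows, assuming every coordinate on the ensemble axis is subset with the same index selection. The helper subset_axis is hypothetical, and ensemble_control_binary_mask is only the name floated in this comment, not an approved standard name.

```python
def subset_axis(coords, keep):
    """Apply one index selection to every coordinate on the ensemble axis."""
    return {name: [vals[i] for i in keep] for name, vals in coords.items()}


coords = {
    "realization": [0, 1, 2, 3],
    # 1 marks the control member, 0 marks every other member.
    "ensemble_control_binary_mask": [1, 0, 0, 0],
}

# Reorder and subset while keeping the control: the mask follows it.
with_control = subset_axis(coords, keep=[3, 0])

# Drop the control: no 1 remains in the mask, with nothing to clean up.
without_control = subset_axis(coords, keep=[1, 2])
```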

Best wishes

Jonathan

comment:19 in reply to: ↑ 18 Changed 5 years ago by markh

Replying to jonathan:

Another possibility occurs to me, which would avoid defining an attribute altogether - I'm sure you've noticed I tend to prefer not to define new machinery if we can make use of something we have! You could define a standard name such as ensemble_control_binary_mask, for an auxiliary coordinate variable which is 1 for the ensemble control member and 0 for the others.

I can see what you are suggesting here and I can see that it would be functional.

I find the approach less clear than defining a particular coordinate value for the variable to provide this labelling function. I do not consider the 'control member' to be a masking operation, this seems like an odd fit to me.

This would be perhaps even more robust, because it would survive subsetting and permutation of the ensemble axis, since all auxiliary coordinate variables have to be transformed consistently by such operations. If you deleted the ensemble control member from the ensemble, you would not have to remember to zap the attribute, since the 1 element would vanish from the mask at the same time.

There is no need to remove such an attribute after operations, it is still valid, even if a lookup into the coordinate array returns false, due to subset operations. It still carries information in this case, that there was a control member defined.

char realization(realization) ;
    realization:axis = "E" ;
    realization:ensemble_control_member = "a" ;

is useful information, even when the only values in the array in the particular data set are

realization: ("c", "f")

I think this information is metadata about the coordinate: we are describing the coordinate, stating that there is a value which carries extra meaning, if it is present.

Encapsulating this information within the variable makes sense to me. I think it is well worth the price of adding one bespoke attribute to the convention.

mark

comment:20 Changed 5 years ago by markh

Similarly, the absence of this attribute on an ensemble coordinate explicitly states that there was no control member in the ensemble, which is also useful information.

comment:21 Changed 5 years ago by jonathan

Dear Mark

OK then for ensemble_control_member, thanks. My comments still stand regarding cell_methods and the indication of the number of models involved.

Best wishes and thanks

Jonathan

comment:22 Changed 5 years ago by markh

I think the suggestion about using source to provide the required information on number of models in the ensemble makes good sense. I think this will alleviate the need for a bespoke attribute

I will consider the wording for describing this and update the proposed text accordingly

comment:23 Changed 5 years ago by markh

I have reviewed #108, which has been suggested in context with the use of a CellMethod to store the size of the original ensemble.

It is my view that #108 needs significant revision if it is to be adopted to meet its current use case. More importantly it is aiming at a different use case so I don't think it is relevant without significant rethink.

To use a CellMethod to define the number of members an ensemble had when it was created, CF would need new CellMethod syntax. It is not clear to me that this is generally useful. I am wary of extending CellMethod's syntax only to meet this use case.

The 'number of members in the original ensemble' information is information about the ensemble coordinate. It is not information directly about the data variable. It is not information related to a 'cell'.

http://cf-metadata.github.io/#_data_representative_of_cells states:

When gridded data does not represent the point values of a field but instead represents some 
characteristic of the field within cells of finite "volume," a complete description of the variable 
should include metadata that describes the domain or extent of each cell, and the characteristic of 
the field that the cell values represent.

In this case, the gridded data may well be representing the point values of the field. Commonly there will be no 'cells of finite volume' for these data sets.

This information is about the ensemble coordinate. it is only relevant to the data variable through the ensemble coordinate. Keeping this information on the ensemble coordinate variable seems like a useful encapsulation of information.

As such, I reiterate my view that CF should encode the number of members an ensemble originally contained as an optional numerical attribute, specifically named for the purpose and only valid when attached to an ensemble coordinate.

I think this is a localised and sensible way to address the requirement to store this metadata interoperably.

Please may contributors with a preference for a CellMethods approach detail reasons why they feel this is a preferred solution to having a defined, limited scope, attribute?

thank you mark

comment:24 follow-ups: Changed 5 years ago by jonathan

Dear Mark

You are right that using cell_methods for this would be a generalisation. In fact the opening of sect 7, which you quote, is already too restrictive, in talking about "volume" (even with quotation marks), given that operations can be applied to any single or combination of coordinate axes. We should probably rewrite it anyway! In general, cell_methods indicates how the data you have represents the cells (hence the title of the section). I think it is plausible to regard taking a subset of the points within the cell, in order to represent the cell, as a cell method, and that is what you are doing when extracting a sub-ensemble. The minimum or the maximum, which are existing cell methods, are particular subsets (of size one) to represent the cell. The sample operation might also be performed on axes with continuous coordinate variables, taking every Nth point for example.

The advantages that I see of introducing a sample method with a dimension comment are (a) it meets your use case of recording that a subset has been extracted, and recording how big the full set was - both of these sorts of metadata may be useful in other ways too, (b) it allows you to record multiple subsetting, which your use-case needs (see comment 7 above) - you may want to know both the original size (19) and the number of members used to compute the mean (15), (c) it allows you to record when this subsetting occurred, in relation to other statistical reductions which might have been applied, (d) it doesn't require any new attributes, just an extension of an existing one.

Best wishes

Jonathan

comment:25 in reply to: ↑ 24 Changed 5 years ago by markh

Replying to jonathan:

thank you for your considered response

If we were to adopt this approach, what changes would be required to Cell Methods?

  1. Would we have to change the Cell Methods preamble to highlight its wider utility?
  2. What syntax would be added to Cell Methods to enable the storage of the size of a collection prior to an operation
    • note: this is different from the 'domain' discussion in #108, which is focussed on storing the values which were used from a coordinate which has been aggregated, not the size of that coordinate.
  3. Would subset be an explicit operation for a cell method?
    • If so, would this be a true subset (< but not <=)
  4. I do not expect to have a dimension in my file of length ensemble_size; in general I just have the number

By example, How would I write the metadata from:

  1. 1 GRIB2 message (ensemble 7 of a collection of size 19)?
  2. 19 GRIB2 messages (ensembles 0-18 of size 19)

thank you mark

comment:26 in reply to: ↑ 24 Changed 5 years ago by markh

Replying to jonathan:

a minor detail point

(b) it allows you to record multiple subsetting, which your use-case needs (see comment 7 above) - you may want to know both the original size (19) and the number of members used to compute the mean (15)

I do not need this for my use case. The only number I need to store here is 19, not 15.

Providing this capability goes beyond my requirements

comment:27 follow-up: Changed 5 years ago by jonathan

Dear Mark

  1. I haven't drafted exact words (I could help with that if we decide to do it this way) but I don't think it would be a large change. I think the general idea of cell methods is to record how the value given for each cell represents the range of values which may occur within the extent of the cell. This idea applies to all the non-default methods (mean, minimum, etc.) while for point and sum there is only one value under consideration, so the idea is irrelevant.
  2. My suggestion is a standardised comment dimension: dimension in () after the cell method to record the dimension as it was before the operation.
  3. I suggested sample but it could be subset - I think that would also be OK for continuous axes. I don't think the question of < or <= can be answered in general; it's just a subset of the elements that the axis previously had. What do you mean by "explicit"?
  4. It's OK to store a number in a dimension, isn't it? A scalar variable could be used instead, but a dimension has the additional possibility that you could use it to preserve coordinate variables or auxiliary coordinate variables of the axis as it was before the subset was taken. I know you don't need it for your case, but it's good to have the option.

I'm not sure that I have understood your example - sorry. If you have a dimension full (=19) and take some subset of it, so you now have an ensemble with dimension sub (which could be 1 or 19 or anything between) the cell method entry would be sub: subset (dimension: full).
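As a rough sketch of the entry described above, the following Python composes the cell method string from two dimension names. The function name `subset_cell_method` and the plain dictionary of dimension lengths are assumptions made for illustration; neither comes from the CF conventions nor from any library.

```python
def subset_cell_method(dims, sub, full):
    """Compose a cell_methods entry of the form discussed in this thread.

    dims is a hypothetical mapping of dimension name -> length.
    The subset must hold between 1 and len(full) members, as in the
    example above (sub could be 1 or 19 or anything between).
    """
    if not 1 <= dims[sub] <= dims[full]:
        raise ValueError(f"{sub} must have between 1 and {dims[full]} members")
    return f"{sub}: subset (dimension: {full})"


# Dimension lengths are illustrative only.
dims = {"full": 19, "sub": 15}
print(subset_cell_method(dims, "sub", "full"))
```

The check on sizes is only there to make the "subset of the elements that the axis previously had" idea concrete; the proposal itself does not define any validation.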

Best wishes

Jonathan

comment:28 in reply to: ↑ 27 Changed 5 years ago by markh

Replying to jonathan:

  1. Appendix E, Table E1 lists the current cell_method names. These describe the numerical process carried out on the values: point, sum, mean, and so on.
    • subset and sample are quite different in character. I don't think they fit in Table E1 very well.
    • Are these 'operations' just represented as 'point' cell_method instances?
  2. Dimensions are used to define the size of variables; when the size of a variable changes, the dimension changes with it. The value in this case is constant: it is an attribute of the ensemble coordinate. Storing it as a dimension is an indirection step which doesn't feel required and increases the risk of inconsistency. I prefer storing the explicit numerical value.

comment:29 Changed 5 years ago by jonathan

Dear Mark

I suppose we just feel differently about these things, so we'll have to see what others think. For my part, I think that taking a subset of the points in a cell is not radically different from taking a single one, such as the maximum or the minimum, which are special subsets. The facility of recording the size before the operation of cell methods would be generally useful; for example, to calculate a standard error from a standard deviation you need to know how big the sample was from which the statistic was calculated. If the single number has no purpose other than this, storing it in a dimension (rather than a scalar coordinate or in text in the attribute) is perhaps unnecessary machinery, but I don't see what inconsistency it would cause if nothing uses this dimension. On the other hand, it would be positively useful if you wanted to keep some auxiliary coordinate variables for the full sample, which I can imagine being valuable for ensembles. I know that this is anticipating a use-case which has not been raised, but at the time when we have a free choice about how to do things to meet an existing use-case, it's sensible to choose an approach which offers future possibilities. We could offer the alternatives of encoding the number as text in the cell_methods, or naming a dimension in the cell_methods. This would not be a difficulty for parsing since dimension names must begin with a letter.
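To illustrate why the pre-operation sample size matters, here is a minimal sketch of the standard error calculation mentioned above, using the usual relation SE = SD / sqrt(n). The function name is hypothetical and the numbers are made up for the example.

```python
import math


def standard_error(std_dev, sample_size):
    """Standard error of the mean from a standard deviation and sample size."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return std_dev / math.sqrt(sample_size)


# Illustrative values only: a standard deviation of 2.0 over 16 members.
print(standard_error(2.0, 16))
```

Without the sample size recorded somewhere in the metadata, the second argument cannot be recovered from the statistic alone.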

Best wishes

Jonathan

comment:30 Changed 5 years ago by jonathan

Dear Mark

On further thought, I agree with you about the dimension. For your use-case, you don't need to keep old coordinates, so you don't need a dimension, and it's sufficient to give the size explicitly in the cell_methods. Your use case could be something like ensemble: subset (dimension: 19) ensemble: mean (dimension: 15) to indicate that first a subset of 15 was taken from 19, then the 15 were averaged. If you're not interested in noting that the mean was of only a subset, you could just put ensemble: mean (dimension: 19) to record the original full size. The dimension ensemble=1 is for the resulting axis after all this processing. If subsequently there is a use-case for recording coordinates as they were before cell_methods were applied, it would be an obvious and easy extension to put a dimension name instead of the number.
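A minimal sketch of how a reader of such a string might pull out the recorded sizes is shown below. The regular expression and the function `parse_cell_methods` are assumptions for illustration; they cover only the simple `axis: method (dimension: value)` form used in this thread, not the full cell_methods syntax. A value made only of digits is treated as an explicit size, and anything else as a dimension name, relying on the point made earlier that dimension names begin with a letter.

```python
import re

# Hypothetical pattern for "<axis>: <method> (dimension: <size-or-name>)".
ENTRY = re.compile(r"(\w+):\s*(\w+)\s*\(dimension:\s*(\w+)\)")


def parse_cell_methods(text):
    """Return a list of (axis, method, original) tuples.

    'original' is an int when the comment holds an explicit size and a
    str when it holds a dimension name.
    """
    entries = []
    for axis, method, original in ENTRY.findall(text):
        value = int(original) if original.isdecimal() else original
        entries.append((axis, method, value))
    return entries


example = "ensemble: subset (dimension: 19) ensemble: mean (dimension: 15)"
for entry in parse_cell_methods(example):
    print(entry)
```

With this reading, the first entry carries the original ensemble size (19) and the second the number of members averaged (15).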

Best wishes

Jonathan

comment:31 Changed 5 years ago by markh

  • Description modified (diff)

I have updated this proposal to take account of source and institution information for model multiplicity definition.

I have also added a short section on ensemble labelling, including a proposed new standard name: ensemble_member_label, canonical unit (a string value is expected) for providing string labels for ensemble members.
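The two constraints stated in section 4.6.1 for ensemble_member_label (unique values, no missing values) could be checked along the following lines. This is only a sketch: the function name is hypothetical, labels are held in a plain Python list, and treating `None` or an empty string as "missing" is an assumption of the example rather than something the proposal defines.

```python
def check_member_labels(labels):
    """Return a list of problems found; an empty list means the labels pass."""
    problems = []
    # Assumption: None or "" stands in for a missing value.
    if any(label is None or label == "" for label in labels):
        problems.append("missing value")
    present = [label for label in labels if label not in (None, "")]
    if len(set(present)) != len(present):
        problems.append("duplicate value")
    return problems


# Illustrative labels only.
print(check_member_labels(["control", "perturbed_01", "perturbed_02"]))
print(check_member_labels(["a", "a", None]))
```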

I remain worried by the approach under discussion for identifying original ensemble size using cell methods. I think that this information, in the context of my use case, is information about the coordinate variable, not the data variable.

Given the lack of agreement on this topic, I have removed it completely from this proposal. It is an additive change, whichever route is adopted. I propose we evaluate this ticket without that detail; we can return to it later if the rest of this proposal is deemed suitable.

Does this updated proposal have sufficient merit that it may be adopted? Are there remaining areas of concern?

thank you

markh
