Opened 7 years ago
Last modified 4 years ago
#108 new enhancement
Defining a domain for a cell_method — at Version 5
Reported by: | markh | Owned by: | cf-conventions@… |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | | Cc: | antonio.cofino@… |
Description (last modified by markh)
1. Title
Defining the domain of operation as part of a cell_method definition.
2. Moderator
see comments
3. Requirement
The information held in a data variable's coordinates may be valuable in defining the cell_method of a result data variable calculated from that initial data variable.
4. Initial Statement of Technical Proposal
7.3.2. Recording the spacing of the original data and other information
To indicate more precisely how the cell method was applied, extra information may be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts.
The interval spacing keyword is used to provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing. The syntax is (interval: value unit), where value is a numerical value and unit is a string that can be recognized by UNIDATA's Udunits package [UDUNITS]. The unit will usually be dimensionally equivalent to the unit of the corresponding dimension, but this is not required (which allows, for example, the interval for a standard deviation calculated from points evenly spaced in distance along a parallel to be reported in units of length even if the zonal coordinate of the cells is given in degrees). Recording the original interval is particularly important for standard deviations. For example, the standard deviation of daily values could be indicated by cell_methods="time: standard_deviation (interval: 1 day)" and of annual values by cell_methods="time: standard_deviation (interval: 1 year)".
If the cell method applies to a combination of axes, they may have a common original interval e.g. cell_methods="lat: lon: standard_deviation (interval: 10 km)". Alternatively, they may have separate intervals, which are matched to the names of axes by position e.g. cell_methods="lat: lon: standard_deviation (interval: 0.1 degree_N interval: 0.2 degree_E)", in which 0.1 degree applies to latitude and 0.2 degree to longitude.
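As an illustration only, the positional matching of intervals to axes described above could be sketched in Python as follows. The function name parse_intervals and the regular expressions are assumptions made for this sketch, not part of the convention. It handles only the simple "axis: method (interval: value unit)" form and assumes that a unit string contains no spaces.

```python
import re

# Hypothetical helper patterns (not defined by CF): one cell_methods entry is
# one or more "name:" axes, a method word, and an optional "(...)" part.
ENTRY_RE = re.compile(r"((?:\w+:\s+)+)(\w+)\s*(?:\(([^)]*)\))?")
INTERVAL_RE = re.compile(r"interval:\s*(\S+)\s+(\S+)")


def parse_intervals(cell_methods):
    """Return a list of (axes, method, {axis: (value, unit)}) tuples."""
    entries = []
    for axes_text, method, extra in ENTRY_RE.findall(cell_methods):
        axes = [name.rstrip(":") for name in axes_text.split()]
        intervals = [(float(v), u) for v, u in INTERVAL_RE.findall(extra)]
        if not intervals:
            per_axis = {}
        elif len(intervals) == 1:
            # A single interval is common to every axis of the entry.
            per_axis = {axis: intervals[0] for axis in axes}
        elif len(intervals) == len(axes):
            # Separate intervals are matched to the axes by position.
            per_axis = dict(zip(axes, intervals))
        else:
            raise ValueError("number of intervals does not match number of axes")
        entries.append((axes, method, per_axis))
    return entries


example = "lat: lon: standard_deviation (interval: 0.1 degree_N interval: 0.2 degree_E)"
for axes, method, per_axis in parse_intervals(example):
    print(axes, method, per_axis)
```

A full parser would also need to handle the non-standardized parts and the where/over forms of cell_methods, which this sketch ignores.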
To explicitly define the domain over which the statistic was calculated, the syntax (domain: varname [varname]) may be used. In this case, each 'varname' listed is the name of an ancillary variable, explicitly referenced by the data variable. Each of the ancillary variables referenced contributes to the definition of the domain over which the cell_method operation was conducted.
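To make the proposed scoping rule concrete, here is a minimal Python sketch of how a checker might verify that every name after domain: is in scope for the data variable. It assumes that "explicitly referenced" means listed in the data variable's ancillary_variables attribute, and it represents the attributes as a plain dictionary. The function names and the variable names region_lat and region_lon are hypothetical.

```python
import re


def domain_variables(cell_methods):
    """Return the names listed after 'domain:' inside any parenthesised part."""
    names = []
    for extra in re.findall(r"\(([^)]*)\)", cell_methods):
        tokens = extra.split()
        if "domain:" in tokens:
            start = tokens.index("domain:") + 1
            for token in tokens[start:]:
                if token.endswith(":"):
                    # Another keyword (e.g. "interval:") ends the name list.
                    break
                names.append(token)
    return names


def unreferenced_domain_variables(attrs):
    """Return domain names that the ancillary_variables attribute does not list."""
    listed = domain_variables(attrs.get("cell_methods", ""))
    in_scope = set(attrs.get("ancillary_variables", "").split())
    return [name for name in listed if name not in in_scope]


# Hypothetical attributes of a result data variable.
attrs = {
    "cell_methods": "area: mean (domain: region_lat region_lon)",
    "ancillary_variables": "region_lat",
}
print(unreferenced_domain_variables(attrs))
```

Under these assumptions, a name reported by unreferenced_domain_variables would be a compliance failure, because it is named in the domain string but is not in scope for the data variable.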
5. Benefits
The community will benefit by having enhanced capabilities for defining cell_methods for data variables calculated across complex domains.
https://cf-pcmdi.llnl.gov/trac/wiki/aggregateExampleMH presents a use case for this capability.
6. Status Quo
The capability to define complex domains for cell_methods will not be standardised.
Change History (5)
comment:1 follow-up: ↓ 2 Changed 7 years ago by jonathan
comment:2 in reply to: ↑ 1 Changed 7 years ago by markh
Replying to jonathan:
I feel this wording is not clear in its intention. For the usual case of a numerical coordinate, the domain over which the statistic is calculated is indicated by the cell bounds. Your intention is to provide the original coordinates, isn't it? That's a more detailed sort of information than the typical spacing.
Absolutely, the aim of the ticket is to enable the data creator to provide this level of detail in a specified manner where they deem it useful, such as in my use case. Perhaps:
For complex domains, cell bounds and interval definitions may be unable to properly define the domain the method was applied over. To explicitly define the domain over which the statistic was calculated, the syntax (domain: varname varname) may be used. Each 'varname' is the name of an ancillary variable, referenced by the data variable. Each of the ancillary variables referenced contributes to the definition of the domain over which the cell_method operation was conducted.
I continue to think that a more economical and therefore better approach would be to give the dimension name for the uncollapsed coordinate variable(s), rather than naming the variables themselves. They must all have that dimension if they are in the file, so I think this is robust for the CF convention, which is formulated for single netCDF files. I agree that programs that process files have to be careful about this kind of thing, but there are many other parts of the convention which are similarly fragile. I know you've commented on this already on the email list, but I didn't quite follow your argument there, sorry to say.
The concern I have expressed with using the dimension name for the domain specification is that the named dimension is not used by the data variable, so any connection is fragile: that dimension is not in scope for the data variable.
I wonder if there are potential problems with data integrity wrapped up with this kind of referencing. It is different from the current name:dimension cell_method reference, as currently I can just check that dimension is a data variable dimension and I know I have the relevant information in scope.
I can see that referencing the dimension name is a suitable way of delivering to the use case I have presented, and I do not pretend that referencing ancillary variables avoids all referencing problems. I think it is a plausible solution.
I prefer the slightly longer notation, where each ancillary variable is named within the domain string, and each named ancillary variable must be in scope for the data variable. I think this lends itself better to compliance checking and provides information more clearly to the data consumer.
comment:3 follow-up: ↓ 4 Changed 6 years ago by jonathan
Dear Mark
I'd like to put forward a different argument in support of your preference, namely that if we relied on the dimension name alone it would have to be assumed that all variables with that dimension were auxiliaries of the field before collapse. That is not necessarily so; some such variables might be irrelevant. Looking at the issue that way, it now seems to me that what we need to record is the coordinates and auxiliary coordinates of the field before collapse. This information is more complete than you can record with interval.
Here's a proposal for the new para along these lines:
The original domain of the variable before the statistical operation was applied can be described completely using "dimensions: dimname [dimname ...]" and "coordinates: varname [varname ...]". The dimnames are the names of the netCDF dimensions for the affected axes in the original domain. This syntax implies that any 1D variable whose name is dimname is a coordinate variable of the original domain. The varnames are the names of auxiliary coordinate variables for the affected axes in the original domain. Before the statistical operation was carried out, these variables would have been named by the coordinates attribute of the original data variable.
In addition, I think we would need to amend the opening para of 7.3.2
To indicate more precisely how the cell method was applied, extra information may optionally be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts. The possible standardized parts begin with interval:, dimensions: and coordinates:.
To provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing, the syntax is ...
and perhaps we might rename the section to a more general title.
New requirements in the conformance document for sect 7.3.2:
- The names listed after dimensions: must all be names of netCDF dimensions, which must not be dimensions of the data variable.
- Any 1D variable whose name equals the name of one of the original dimensions is regarded as a coordinate variable of the original domain and therefore must be strictly monotonic (increasing or decreasing) and must not have the _FillValue or missing_value attributes.
- If there is a coordinates: list for a cell_methods entry there must also be a dimensions: list for the same entry.
- The names listed after coordinates: must all be names of netCDF variables. These variables are regarded as auxiliary coordinate variables of the original domain. They must not have as dimensions any of the dimensions to which the cell_methods entry applies. Their dimensions must be a subset of the other dimensions of the data variable and the original dimensions named by this cell_methods entry. In addition, any label variable will have a trailing dimension for the maximum string length.
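A rough Python sketch of how the structural requirements listed above might be checked is given below. It is an illustration under stated assumptions, not a proposed implementation. The file is modelled with plain Python containers, the function name check_original_domain and all variable names are hypothetical, and the checks that need data values or string-length dimensions (monotonicity, _FillValue, label variables) are left out.

```python
def check_original_domain(file_dims, variables, data_var,
                          applied_dims, orig_dims, orig_coords):
    """Return a list of problems found for one cell_methods entry.

    file_dims: set of netCDF dimension names in the file
    variables: dict mapping variable name to its tuple of dimension names
    applied_dims: dimensions the cell_methods entry applies to
    orig_dims / orig_coords: names listed after dimensions: / coordinates:
    """
    problems = []
    data_dims = set(variables[data_var])
    for name in orig_dims:
        if name not in file_dims:
            problems.append(f"{name}: not a netCDF dimension")
        elif name in data_dims:
            problems.append(f"{name}: is a dimension of {data_var}")
    if orig_coords and not orig_dims:
        problems.append("coordinates: list given without a dimensions: list")
    # Allowed dimensions: the other dimensions of the data variable
    # plus the original dimensions named by this entry.
    allowed = (data_dims - set(applied_dims)) | set(orig_dims)
    for name in orig_coords:
        if name not in variables:
            problems.append(f"{name}: not a netCDF variable")
            continue
        dims = set(variables[name])
        if dims & set(applied_dims):
            problems.append(f"{name}: uses a dimension the entry applies to")
        elif not dims <= allowed:
            problems.append(f"{name}: has a dimension outside the allowed set")
    return problems


# Hypothetical file layout: a regional mean computed from station data.
file_dims = {"time", "region", "station"}
variables = {
    "tas_mean": ("time", "region"),
    "station_lat": ("station",),
    "station_lon": ("station",),
}
print(check_original_domain(file_dims, variables, "tas_mean",
                            applied_dims=["region"],
                            orig_dims=["station"],
                            orig_coords=["station_lat", "station_lon"]))
```

The example layout is invented for this sketch; in it the named original dimension station is not a dimension of the data variable, and both auxiliary coordinates depend only on station.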
What do you think?
Cheers
Jonathan
comment:4 in reply to: ↑ 3 Changed 6 years ago by markh
Replying to jonathan:
Dear Mark
I'd like to put forward a different argument in support of your preference, namely that if we relied on the dimension name alone it would have to be assumed that all variables with that dimension were auxiliaries of the field before collapse. That is not necessarily so; some such variables might be irrelevant.
I agree, this risks problems.
Looking at the issue that way, it now seems to me that what we need to record is the coordinates and auxiliary coordinates of the field before collapse. This information is more complete than you can record with interval.
Here's a proposal for the new para along these lines:
The original domain of the variable before the statistical operation was applied can be described completely using "dimensions: dimname [dimname ...]" and "coordinates: varname [varname ...]". The dimnames are the names of the netCDF dimensions for the affected axes in the original domain. This syntax implies that any 1D variable whose name is dimname is a coordinate variable of the original domain. The varnames are the names of auxiliary coordinate variables for the affected axes in the original domain. Before the statistical operation was carried out, these variables would have been named by the coordinates attribute of the original data variable.
This seems significantly more complex than the use of ancillary variables I have proposed. I do not yet see where it adds particular extra value.
Using ancillary variables makes it a conscious choice which ex-coordinate variables are included, and gives access to the dimensions via the ancillary variable.
I think that I find the ancillary approach neater than the dimname and varname you have suggested as an alternative.
I would be wary of adopting this alternative as it does not seem to me to add value, but it does add complexity.
Are there factors I am missing?
Would my example benefit from the description of dimname and varname instead?
mark
comment:5 Changed 6 years ago by markh
- Description modified (diff)
Dear Mark
I feel this wording is not clear in its intention. For the usual case of a numerical coordinate, the domain over which the statistic is calculated is indicated by the cell bounds. Your intention is to provide the original coordinates, isn't it? That's a more detailed sort of information than the typical spacing.
I continue to think that a more economical and therefore better approach would be to give the dimension name for the uncollapsed coordinate variable(s), rather than naming the variables themselves. They must all have that dimension if they are in the file, so I think this is robust for the CF convention, which is formulated for single netCDF files. I agree that programs that process files have to be careful about this kind of thing, but there are many other parts of the convention which are similarly fragile. I know you've commented on this already on the email list, but I didn't quite follow your argument there, sorry to say.
Best wishes
Jonathan