Opened 4 years ago
Last modified 7 months ago
#108 new enhancement
Defining a domain for a cell_method
Reported by: | markh | Owned by: | cf-conventions@… |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | Cc: | antonio.cofino@… |
Description (last modified by cofino)
1. Title
Defining the domain of operation as part of a cell_method definition.
2. Moderator
see comments
3. Requirement
The information contained within a data variable's coordinate may have value in defining a cell_method for a result data variable calculated from the initial data variable.
4. Initial Statement of Technical Proposal
7.3.2. Recording the spacing of the original data and other information
To indicate more precisely how the cell method was applied, extra information may be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts.
The interval keyword is used to provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing. The syntax is (interval: value unit), where value is a numerical value and unit is a string that can be recognized by UNIDATA's Udunits package [UDUNITS]. The unit will usually be dimensionally equivalent to the unit of the corresponding dimension, but this is not required (which allows, for example, the interval for a standard deviation calculated from points evenly spaced in distance along a parallel to be reported in units of length even if the zonal coordinate of the cells is given in degrees). Recording the original interval is particularly important for standard deviations. For example, the standard deviation of daily values could be indicated by cell_methods="time: standard_deviation (interval: 1 day)" and of annual values by cell_methods="time: standard_deviation (interval: 1 year)".
If the cell method applies to a combination of axes, they may have a common original interval e.g. cell_methods="lat: lon: standard_deviation (interval: 10 km)". Alternatively, they may have separate intervals, which are matched to the names of axes by position e.g. cell_methods="lat: lon: standard_deviation (interval: 0.1 degree_N interval: 0.2 degree_E)", in which 0.1 degree applies to latitude and 0.2 degree to longitude.
To explicitly define the domain over which the statistic was calculated, the syntax (domain: varname [varname]) may be used. In this case, each 'varname' listed is the name of an ancillary variable, explicitly referenced by the data variable. Each of the ancillary variables referenced contributes to the definition of the domain over which the cell_method operation was conducted.
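As an illustration of the syntax described above, the following is a minimal sketch of how a reader might split a cell_methods string into its axis names, method, interval values and domain variable names. The function name parse_cell_methods and the returned dictionary layout are assumptions made for this example, the domain: keyword is the one proposed in this ticket rather than part of the current conventions, and the sketch ignores non-standardized comments and "where"/"over" clauses.

```python
import re

# One cell_methods entry: "name: [name: ...] method [(extra information)]".
_ENTRY = re.compile(
    r"((?:\w+:\s+)+)"        # one or more "name: " prefixes
    r"(\w+)"                 # the method, e.g. mean
    r"(?:\s*\(([^)]*)\))?"   # optional information in parentheses
)


def parse_cell_methods(text):
    """Return a list of dicts, one per cell_methods entry (hypothetical layout)."""
    entries = []
    for names, method, info in _ENTRY.findall(text):
        entry = {
            "names": re.findall(r"(\w+):", names),
            "method": method,
            "interval": [],
            "domain": [],
        }
        key = None
        for token in info.split():
            if token in ("interval:", "domain:"):
                key = token[:-1]
                if key == "interval":
                    # Each "interval:" keyword starts a new value/unit pair.
                    entry["interval"].append([])
            elif key == "interval":
                entry["interval"][-1].append(token)
            elif key == "domain":
                entry["domain"].append(token)
            # Tokens before any keyword are non-standardized and are skipped.
        entry["interval"] = [" ".join(parts) for parts in entry["interval"]]
        entries.append(entry)
    return entries
```

Intervals are returned in the order given, so they can be matched to the axis names by position as the text requires.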
5. Benefits
The community will benefit by having enhanced capabilities for defining cell_methods for data variables calculated across complex domains.
aggregateExampleMH presents a use case for this capability.
6. Status Quo
The capability to define complex domains for cell_methods will not be standardised.
Change History (12)
comment:1 follow-up: ↓ 2 Changed 4 years ago by jonathan
comment:2 in reply to: ↑ 1 Changed 4 years ago by markh
Replying to jonathan:
I feel this wording is not clear in its intention. For the usual case of a numerical coordinate, the domain over which the statistic is calculated is indicated by the cell bounds. Your intention is to provide the original coordinates, isn't it. That's a more detailed sort of information than the typical spacing.
Absolutely, the aim of the ticket is to enable the data creator to provide this level of detail in a specified manner where they deem it useful, such as in my use case. Perhaps:
For complex domains, cell bounds and interval definitions may be unable to properly define the domain the method was applied over. To explicitly define the domain over which the statistic was calculated, the syntax (domain: varname varname) may be used. Each 'varname' is the name of an ancillary variable, referenced by the data variable. Each of the ancillary variables referenced contributes to the definition of the domain over which the cell_method operation was conducted.
I continue to think that a more economical and therefore better approach would be to give the dimension name for the uncollapsed coordinate variable(s), rather than naming the variables themselves. They must all have that dimension if they are in the file, so I think this is robust for the CF convention, which is formulated for single netCDF files. I agree that programs that process files have to be careful about this kind of thing, but there are many other parts of the convention which are similarly fragile. I know you've commented on this already on the email list, but I didn't quite follow your argument there, sorry to say.
The concern I have expressed with using the dimension name for the domain specification is that the named dimension is not used by the data variable, so any connection is fragile: that dimension is not in scope for the data variable.
I wonder if there are potential problems with data integrity wrapped up with this kind of referencing. It is different from the current name:dimension cell_method reference, as currently I can just check that dimension is a data variable dimension and I know I have the relevant information in scope.
I can see that referencing the dimension name is a suitable way of delivering to the use case I have presented, and I do not pretend that referencing ancillary variables avoids all referencing problems. I think it is a plausible solution.
I prefer the slightly longer notation, where each ancillary variable is named within the domain string, and each named ancillary variable must be in scope for the data variable. I think this lends itself better to compliance checking and provides information more clearly to the data consumer.
comment:3 follow-up: ↓ 4 Changed 4 years ago by jonathan
Dear Mark
I'd like to put forward a different argument in support of your preference, namely that if we relied on the dimension name alone it would have to be assumed that all variables with that dimension were auxiliaries of the field before collapse. That is not necessarily so; some such variables might be irrelevant. Looking at the issue that way, it now seems to me that what we need to record is the coordinates and auxiliary coordinates of the field before collapse. This information is more complete than you can record with interval.
Here's a proposal for the new para along these lines:
The original domain of the variable before the statistical operation was applied can be described completely using "dimension: dimname [dimname ...]" and "coordinates: varname [varname ...]". The dimnames are the names of the netCDF dimensions for the affected axes in the original domain. This syntax implies that any 1D variable whose name is dimname is a coordinate variable of the original domain. The varnames are the names of auxiliary coordinate variables for the affected axes in the original domain. Before the statistical operation was carried out, these variables would have been named by the coordinates attribute of the original data variable.
In addition, I think we would need to amend the opening para of 7.3.2
To indicate more precisely how the cell method was applied, extra information may optionally be included in parentheses () after the identification of the method. This information includes standardized and non-standardized parts. The possible standardized parts begin with interval:, dimensions: and coordinates:.
To provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing, the syntax is ...
and perhaps we might rename the section to a more general title.
New requirements in the conformance document for sect 7.3.2:
- The names listed after dimensions: must all be names of netCDF dimensions, which must not be dimensions of the data variable.
- Any 1D variable whose name equals the name of one of the original dimensions is regarded as a coordinate variable of the original domain and therefore must be strictly monotonic (increasing or decreasing) and must not have the _FillValue or missing_value attributes.
- If there is a coordinates: list for a cell_methods entry there must also be a dimensions: list for the same entry.
- The names listed after coordinates: must all be names of netCDF variables. These variables are regarded as auxiliary coordinate variables of the original domain. They must not have as dimensions any of the dimensions to which the cell_methods entry applies. Their dimensions must be a subset of the other dimensions of the data variable and the original dimensions named by this cell_methods entry. In addition, any label variable will have a trailing dimension for the maximum string length.
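The requirements listed above could be checked mechanically. The sketch below works on a hypothetical in-memory description of a file (plain dictionaries, not any real netCDF library API); the function names, the dictionary layout and the "dtype" key used to recognise label variables are all assumptions made for illustration.

```python
def strictly_monotonic(values):
    """True if values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def check_original_domain(ds, data_var, axes, dim_names, coord_names):
    """Return a list of problems found for one cell_methods entry.

    ds is a hypothetical description of a file:
      {"dimensions": {name: size},
       "variables": {name: {"dims": (...), "dtype": str,
                            "attrs": {...}, "data": [...]}}}
    axes are the names to which the cell_methods entry applies;
    dim_names and coord_names are the proposed dimensions: and
    coordinates: lists for that entry.
    """
    errors = []
    variables = ds["variables"]
    data_dims = set(variables[data_var]["dims"])

    if coord_names and not dim_names:
        errors.append("coordinates: list given without a dimensions: list")

    for name in dim_names:
        if name not in ds["dimensions"]:
            errors.append(f"{name}: not a netCDF dimension")
        elif name in data_dims:
            errors.append(f"{name}: is a dimension of {data_var}")
        var = variables.get(name)
        if var is not None and tuple(var["dims"]) == (name,):
            # A 1D variable named after the dimension is treated as a
            # coordinate variable of the original domain.
            attrs = var.get("attrs", {})
            if "_FillValue" in attrs or "missing_value" in attrs:
                errors.append(f"{name}: coordinate variable allows missing data")
            if not strictly_monotonic(var.get("data", [])):
                errors.append(f"{name}: coordinate variable is not strictly monotonic")

    allowed = (data_dims - set(axes)) | set(dim_names)
    for name in coord_names:
        var = variables.get(name)
        if var is None:
            errors.append(f"{name}: not a netCDF variable")
            continue
        dims = list(var["dims"])
        if var.get("dtype") == "char" and dims:
            dims = dims[:-1]  # drop the trailing string-length dimension
        if set(dims) & set(axes):
            errors.append(f"{name}: uses a dimension the entry applies to")
        if not set(dims) <= allowed:
            errors.append(f"{name}: has dimensions outside the permitted set")
    return errors
```

An empty list means the entry passed every check in this sketch; it says nothing about the rest of the conformance document.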
What do you think?
Cheers
Jonathan
comment:4 in reply to: ↑ 3 Changed 4 years ago by markh
Replying to jonathan:
Dear Mark
I'd like to put forward a different argument in support of your preference, namely that if we relied on the dimension name alone it would have to be assumed that all variables with that dimension were auxiliaries of the field before collapse. That is not necessarily so; some such variables might be irrelevant.
I agree, this risks problems.
Looking at the issue that way, it now seems to me that what we need to record is the coordinates and auxiliary coordinates of the field before collapse. This information is more complete than you can record with interval.
Here's a proposal for the new para along these lines:
The original domain of the variable before the statistical operation was applied can be described completely using "dimension: dimname [dimname ...]" and "coordinates: varname [varname ...]". The dimnames are the names of the netCDF dimensions for the affected axes in the original domain. This syntax implies that any 1D variable whose name is dimname is a coordinate variable of the original domain. The varnames are the names of auxiliary coordinate variables for the affected axes in the original domain. Before the statistical operation was carried out, these variables would have been named by the coordinates attribute of the original data variable.
This seems significantly more complex than the use of ancillary variables I have proposed. I do not yet see where it adds particular extra value.
Using ancillary variables makes it a conscious choice which ex-coordinate variables are recorded, and gives access to the dimensions via the ancillary variable.
I think that I find the ancillary approach neater than the dimname and varname you have suggested as an alternative.
I would be wary of adopting this alternative as it does not seem to me to add value, but it does add complexity.
Are there factors I am missing?
Would my example benefit from the description of dimname and varname instead?
mark
comment:5 Changed 3 years ago by markh
- Description modified (diff)
comment:6 Changed 3 years ago by markh
This ticket is still open and awaiting further discussion.
Please could interested parties reconsider this ticket in its current state and feed back thoughts.
The ticket description has been updated to reflect my latest position on this proposal
thank you mark
comment:7 Changed 3 years ago by jonathan
Dear Mark
My suggestion is not much more complex than yours. I would say that the main difference in our proposals is that I have written down more completely what would need to be changed, including in the conformance document, and that might make it appear more complicated. You propose to add domain: varname [varname ...] to the cell_methods comment. I propose to add dimension: dimname [dimname ...] and/or coordinates: varname [varname ...]. The difference between our proposals is that mine distinguishes between variables which were originally 1D (Unidata) coordinate variables of the original domain and variables which were originally auxiliary coordinate variables, whereas yours does not distinguish between these. It is not clear to me from your wording "Each of the ancillary variables referenced contributes to the definition of the domain over which the cell_method operation was conducted" whether these variables are coordinate or auxiliary coordinate variables of the original domain. You refer to them as ancillary variables, but I think that's not the right term, because they are not ancillary variables in the usual CF sense, i.e. named by an ancillary_variables attribute and containing per-gridbox metadata.
If you wish to record only 1D coordinate variables of the original domain, in my proposal you would do that by naming their dimensions. Their dimensions must exist in the file, and it is an obvious convention that the coordinate variables should have the same name as their dimensions, because that's the usual Unidata rule. Thus naming the dimensions is no less convenient than naming 1D coordinate variables - it's the same thing. If you wish to record only auxiliary coordinate variables, in my proposal you would name these variables and their dimensions separately. I think it's more convenient to be told the names of the dimensions than to inspect the variables to find out what the dimensions are. Finally, I would argue that my proposal maps in an obvious way to the original definition of the domain e.g. a high-resolution grid-point data variable:
dimensions:
    oldx = 720 ;
    oldy = 960 ;
variables:
    float oldx(oldx) ;
    float oldy(oldy) ;
    float temperature(oldy, oldx) ;
        temperature:cell_methods = "oldx: oldy: point" ;
is meaned into gridboxes 10x10 times larger, recording the original domain information:
dimensions:
    oldx = 720 ;
    newx = 72 ;
    oldy = 960 ;
    newy = 96 ;
variables:
    float oldx(oldx) ;
    float newx(newx) ;
    float oldy(oldy) ;
    float newy(newy) ;
    float temperature(newy, newx) ;
        temperature:cell_methods = "newx: newy: mean (dimension: oldx oldy)" ;
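To make the collapse concrete, here is a minimal sketch of averaging a grid-point field into gridboxes a fixed factor larger in each direction. It assumes an unweighted mean and a field held as a list of rows; the function name block_mean is invented for this example and nothing here is prescribed by the proposal.

```python
def block_mean(field, factor):
    """Average a 2D field (list of rows) over factor x factor blocks.

    Assumes an unweighted mean and that both axis lengths are exact
    multiples of factor.
    """
    ny, nx = len(field), len(field[0])
    if ny % factor or nx % factor:
        raise ValueError("axis lengths must be multiples of factor")
    coarse = []
    for j in range(0, ny, factor):
        row = []
        for i in range(0, nx, factor):
            cells = [
                field[jj][ii]
                for jj in range(j, j + factor)
                for ii in range(i, i + factor)
            ]
            row.append(sum(cells) / len(cells))
        coarse.append(row)
    return coarse
```

With factor=10, a 960 by 720 field would reduce to 96 by 72, matching the sizes of the new dimensions in the example above.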
In your example, there are no 1D coordinate variables, but there are 1D auxiliary coordinate variables. In your proposal, you would encode domain: experiment_id source realization institution. The original dimension ensemble=21 is renamed to member=21 and you have a new dimension ensemble=1. In my proposal, you would encode dimension: member coordinates: experiment_id source realization institution.
Best wishes
Jonathan
comment:8 follow-up: ↓ 9 Changed 3 years ago by taylor13
Dear Jonathan and Mark,
I’m worried that this proposal will overburden cell_methods, making it too complicated for the casual user to understand. Moreover, I’m not sure it is wise to start including in this way information on the various datasets that might have been used in deriving the actual variable of interest. Until now most of the metadata has been about characteristics of the data itself (what it represents), not how it was derived. For example, we might note with cell_methods that some variable represents mean values, but we haven’t said much about how the means are computed (e.g., the details of weighting). At least one exception to this general rule has been made already: in the cell_methods attribute, we can specify the “interval”, which doesn’t directly describe the field stored in the file, but rather the field from which it was derived. The question is: do we want to extend cell_methods to describe further characteristics of the source datasets from which the stored variable was calculated?
If so, we should also ask what use will be made of this information. Clearly, there is value in providing information on how data has been derived, and humans reading netCDF metadata will benefit. Will the information be made use of by computer codes too (other than perhaps reading the information and recording it or reformatting it and displaying it)?
In this ticket we have been discussing an incremental addition to our capability to provide a more complete description of how data in our files are derived. A specific way of doing that has been proposed, but I would prefer an approach that would be more general and allow for additional details concerning *how* the data appearing in a file were derived.
I think the approach should be able to allow us to describe the following procedures used to produce commonly derived data products:
- mean over a dimension not representing a numerical coordinate (e.g., multi-model mean, which is what started this ticket)
- difference fields (e.g., anomalies, where we want to know anomalies relative to what)
- sums of fields (e.g., sum of sensible and latent heat)
- divergences (e.g., what finite difference formula was used?)
- filtering (e.g., band-pass filters)
- transformations (e.g., indicating characteristics of the source data used to derive Fourier coefficients)
- change of units (e.g., in calculating geopotential height, indicating the value assumed for gravity (g))
- record the definitions (now only recorded in appendix D) used to translate from non-dimensional vertical coordinates to dimensional coordinate values.
. .
I suggest for all of these we might define a new CF variable attribute: “formula” (or “method” or “procedure” or ???)
The information in “formula” would indicate how the variable had been calculated. How prescriptive we are in specifying the structure and syntax for the formula would depend on how we think the information will be used. Do machines need to interpret it to do something useful, or do we expect that only humans will take an interest in it?
We could record the formula with pseudo computer code, or in a specific computer language, or using the LaTeX notation, or in a less prescriptive way left more-or-less up to the data writer. No matter what we agree to as far as structure and syntax, I think it would be essential to allow references to variables actually included in the netCDF file. Some of these variables might be “dummy” variables (i.e., variables defined in the file but not containing any data).
For example, if our variable is the sum of sensible and latent heat, we could record:
dimensions:
    lon = 144 ;
    lat = 73 ;
variables:
    float shpluslh(lat, lon) ;
        shpluslh:units = "W m-2" ;
        shpluslh:formula = "shpluslh = sh + lh" ;
    char sh ;
        sh:standard_name = "surface_upward_sensible_heat_flux" ;
        sh:type = "float" ;
    char lh ;
        lh:standard_name = "surface_upward_latent_heat_flux" ;
        lh:type = "float" ;
Note that sh and lh are “dummy” variables because no data are stored in them, but we can make use of all of the cf attributes to describe them. We can infer that they are “dummy” variables because of the presence of the “type” attribute, which we newly define here. The “type” attribute would be required for any dummy variable and the actual variable type of all dummy variables is “character” (of length 1, perhaps set to "").
If we want to indicate the dimensions of a dummy variable, then we could introduce a “dimensions” attribute (again new to CF), which would list the dimensions. For example, in the above we might include:
    sh:dimensions = "lon, lat" ;
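If a formula is allowed to refer to variables in the file, a checker could at least confirm that every name it mentions is declared. The sketch below assumes the formula is simple arithmetic on bare variable names, which is only one of the possibilities discussed here since no syntax has been agreed; the function names are invented for this example, and function-like tokens or index letters in a richer formula would also be reported as names.

```python
import re


def referenced_names(formula):
    """Return the identifiers in a formula string, in order of first use."""
    names = []
    for name in re.findall(r"[A-Za-z_]\w*", formula):
        if name not in names:
            names.append(name)
    return names


def missing_references(formula, variables):
    """Return names used in the formula that are not declared in the file.

    variables is any container of declared variable names (hypothetical
    stand-in for the contents of a real file).
    """
    return [name for name in referenced_names(formula) if name not in variables]
```

A misspelled dummy variable name in the formula would show up in the list returned by missing_references.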
As another example, consider how one would record the information in the example introduced by Mark at the beginning of this ticket.
dimensions:
    longitude = 144 ;
    latitude = 73 ;
    time = UNLIMITED ; // (120 currently)
    time_bnd = 2 ;
    m = 21 ;
    string4 = 4 ;
    string15 = 15 ;
    string60 = 60 ;
variables:
    ...
    int realization(m) ;
        realization:standard_name = "realization" ;
        realization:long_name = "Number of the simulation in the ensemble" ;
    char experiment_id(m, string4) ;
        experiment_id:long_name = "Experiment identifier" ;
    char source(m, string60) ;
        source:long_name = "Method of production of the data" ;
    char institution(m, string15) ;
        institution:long_name = "Institution responsible for the forecast system" ;
    ...
    float hus_sd(time, latitude, longitude) ;
        hus_sd:units = "1" ;
        hus_sd:standard_name = "specific_humidity" ;
        hus_sd:cell_methods = "leadtime: mean (interval: 6 h)" ;
        hus_sd:formula = "hus_sd=mean_over_m(hus)" ;
    char hus ;
        hus:type = "float" ;
        hus:dimensions = "m, time, latitude, longitude" ;
        hus:standard_name = "specific_humidity" ;
        hus:coordinates = "experiment_id source realization institution" ;
As a third example, consider Example 4.3 of the current conventions. We could make explicit the definition of the coordinate by adding the “formula” attribute:
float lev(lev) ;
    lev:long_name = "sigma at layer midpoints" ;
    lev:positive = "down" ;
    lev:standard_name = "atmosphere_sigma_coordinate" ;
    lev:formula_terms = "sigma: lev ps: PS ptop: PTOP" ;
    lev:formula = "[p(n,k,j,i)] = PTOP + lev[k] * (PS[n,j,i] - PTOP)" ;
Note that to indicate which variables are not actually defined in the file, we use square brackets. So it is clear that p(n,k,j,i), as well as the indices (i,j,k,n) cannot be found in this file. We leave the formula_terms attribute in for backward compatibility.
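For readers who want to see the sigma formula in the example evaluated, here is a minimal sketch for a single time step. It assumes plain nested lists rather than any particular array library, and the function name sigma_to_pressure is invented for this illustration.

```python
def sigma_to_pressure(lev, ps, ptop):
    """Evaluate p[k][j][i] = ptop + lev[k] * (ps[j][i] - ptop).

    lev  : sigma values at layer midpoints (1D list)
    ps   : surface pressure for one time step (list of rows)
    ptop : pressure at the top of the model (scalar)
    """
    return [
        [[ptop + sigma * (p_surface - ptop) for p_surface in row] for row in ps]
        for sigma in lev
    ]
```

A sigma value of 0 gives ptop everywhere and a value of 1 gives the surface pressure, which is a quick sanity check on the stored terms.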
As a final example suppose we store the time-mean planetary albedo
dimensions:
    lon = 144 ;
    lat = 73 ;
    time = 1 ;
    nv = 2 ;
variables:
    float albedo(lat, lon) ;
        albedo:standard_name = "planetary_albedo" ;
        albedo:units = "1" ;
        albedo:formula = "albedo(lat,lon) = rsut(time,lat,lon)/rsdt(time,lat,lon)" ;
    char rsut ;
        rsut:dimensions = "time,lat,lon" ;
        rsut:type = "float" ;
        rsut:standard_name = "toa_outgoing_shortwave_flux" ;
        rsut:cell_methods = "time: mean" ;
    char rsdt ;
        rsdt:dimensions = "time,lat,lon" ;
        rsdt:type = "float" ;
        rsdt:standard_name = "toa_incoming_shortwave_flux" ;
        rsdt:cell_methods = "time: mean" ;
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "days since 1999-01-01" ;
        time:bounds = "time_bnds" ;
    double time_bnds(time, nv) ;
Note: It would have been incorrect to simply include cell_methods as an attribute of albedo, since we are not reporting the time-mean of albedo itself, but the ratio of the time-means of the two fluxes.
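The distinction drawn in the note can be seen with two time samples at one grid point. The numbers below are made up purely for illustration and the function names are not part of any proposal.

```python
def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def ratio_of_means(outgoing, incoming):
    """Ratio of the time-mean fluxes, as described by the formula attribute."""
    return mean(outgoing) / mean(incoming)


def mean_of_ratios(outgoing, incoming):
    """Time-mean of the instantaneous ratio, which is a different quantity."""
    return mean([o / i for o, i in zip(outgoing, incoming)])


# Made-up flux samples (W m-2) at two times.
rsut = [100.0, 50.0]
rsdt = [400.0, 100.0]
```

Here ratio_of_means(rsut, rsdt) is 75/250 = 0.3, while mean_of_ratios(rsut, rsdt) is (0.25 + 0.5)/2 = 0.375, so a cell_methods of "time: mean" on albedo would describe the wrong quantity.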
As a reminder, I am not yet proposing a specific structure/syntax for the “formulas”. Perhaps we could initially leave that unspecified and see what users of the conventions come up with. I am proposing to add the “formula”, “type” and “dimensions” attributes and have introduced the idea of “dummy” variables containing no actual data but enabling more precise definitions of how data in the file were derived. The "type" and "dimensions" attributes would be permitted to be used only with "dummy" variables.
Hope someone is interested enough to improve on the above.
Best regards, Karl
comment:9 in reply to: ↑ 8 Changed 3 years ago by davidhassell
Replying to taylor13:
Hell Karl,
I would suggest that the presence of a "dimensions" attribute could be enough to signify a dummy variable. The dummy variable may then be given a netCDF type in usual manner. I.e. your
char rsut ;
    rsut:dimensions = "time lat lon" ;
    rsut:type = "float" ;
    rsut:standard_name = "toa_outgoing_shortwave_flux" ;
    rsut:cell_methods = "time: mean" ;
would become
float rsut ;
    rsut:dimensions = "time lat lon" ;
    rsut:standard_name = "toa_outgoing_shortwave_flux" ;
    rsut:cell_methods = "time: mean" ;
Scalar variables would be dealt with by assigning an empty string to the "dimensions" attribute:
double scalar_var ;
    scalar_var:dimensions = "" ;
    scalar_var:standard_name = "something_or_other" ;
    scalar_var:cell_methods = "time: mean" ;
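A reader applying this rule might look something like the sketch below. It treats the presence of a "dimensions" attribute as the only marker of a dummy variable, accepts either spaces or commas between names since both spellings appear in this ticket, and uses a hypothetical dictionary layout for a variable rather than any real library's objects.

```python
def describe_variable(var):
    """Classify a variable as real or dummy and report its dimensions.

    var is a hypothetical description: {"dims": (...), "attrs": {...}}.
    A "dimensions" attribute marks a dummy variable; an empty string
    means the dummy variable is scalar.
    """
    attrs = var.get("attrs", {})
    if "dimensions" not in attrs:
        return {"dummy": False, "dims": tuple(var.get("dims", ()))}
    names = attrs["dimensions"].replace(",", " ").split()
    return {"dummy": True, "dims": tuple(names)}
```

Because the netCDF type of the dummy variable itself carries the data type, no separate "type" attribute is consulted in this version.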
This is essentially the approach that I have used in the "CFA-netCDF" conventions, which describe the single-file netCDF storage of datasets aggregated across multiple files (in many ways a fully generalised version of NcML storage) and use dummy netCDF variables.
Hope that helps,
All the best,
David
comment:10 Changed 3 years ago by taylor13
o David,
[the "o" is to complete your email's truncated greeting ... together we can probably get it right]
Your modification to my suggestion is simpler and clearer and thus better. What we need to check is whether you can declare a character variable without any dimensions (i.e., does netCDF assume the string length is 1?). If not, then the declaration of a dummy variable of type character would have to be written:
dimensions:
    string1 = 1 ;
variables:
    char cdata(string1) ;
        cdata:dimensions = "time" ;
I don't really think we would ever want to store a dummy variable of type character, but we should check what the syntax would be in that case. [By the way we certainly would have to have addressed this problem under my original proposal.]
best regards, Karl
comment:11 Changed 7 months ago by cofino
- Cc antonio.cofino@… added
- Description modified (diff)
comment:12 Changed 7 months ago by cofino
- Description modified (diff)
Dear Mark
I feel this wording is not clear in its intention. For the usual case of a numerical coordinate, the domain over which the statistic is calculated is indicated by the cell bounds. Your intention is to provide the original coordinates, isn't it. That's a more detailed sort of information than the typical spacing.
I continue to think that a more economical and therefore better approach would be to give the dimension name for the uncollapsed coordinate variable(s), rather than naming the variables themselves. They must all have that dimension if they are in the file, so I think this is robust for the CF convention, which is formulated for single netCDF files. I agree that programs that process files have to be careful about this kind of thing, but there are many other parts of the convention which are similarly fragile. I know you've commented on this already on the email list, but I didn't quite follow your argument there, sorry to say.
Best wishes
Jonathan