Opened 6 years ago
Closed 6 years ago
#105 closed enhancement (wontfix)
Scalar Coordinates
Reported by: | markh | Owned by: | cf-conventions@… |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | Cc: |
Description
1. Title
Scalar Coordinates
2. Moderator
unknown
3. Requirement
The ability to store scalar coordinates in CF NetCDF data sets.
Scalar Coordinates are a valuable semantic concept, allowing data variables to encode coordinates which do not vary across the domain of the data variable.
Invariance with respect to the data variable's dimensions is the key characteristic of these coordinates and this is all the meaning that should be explicitly defined for such a coordinate. Further characteristics, such as implied degrees of freedom and potential inter-relationships, may be inferred downstream, at the discretion of the data consumer.
The current conventions provide a clear specification of vector coordinates, encoded as either Coordinate Variables or Auxiliary Coordinate Variables. Such variables, which may be encoded as not varying with respect to the data variable, can be used wherever it is vital that the inter-relationship and degree of freedom are explicitly encoded. The specification does not make it clear enough what is meant by a scalar coordinate variable.
The current conventions document (1.6) does not make it clear whether the scalar coordinate variable (section 5.7) is:
an encoding detail with explicitly no semantic content, merely storing vector coordinates:
where the characteristics of these vector coordinates must be inferred by context;
a semantic container enabling scalar coordinates to be stored and recognised as scalar coordinates.
This interpretation requires clarification, which this proposal aims to deliver. Explicitly, this proposal will define the scalar coordinate variable as a semantic container for coordinate information.
4. Initial Statement of Technical Proposal
1.2 Terminology
A NetCDF variable which contains coordinate information for a data variable, where the coordinate information does not vary with respect to the data variable's dimensions.
5.7 Scalar Coordinate Variables
A scalar coordinate variable defines a coordinate which applies to an entire data variable equally. The scalar coordinate does not vary with respect to the data variable's dimensions.
The scalar coordinate variable is associated with a data variable via the coordinates attribute. The scalar coordinate variable does not share any dimensions with the data variable.
The variable name of a scalar coordinate variable must not match the name of any dimension in the file.
Bounds may be defined for a scalar coordinate variable in the same way as other coordinates.
A scalar coordinate variable may be interpreted as implying a potential further dimension, of size one, for the data variable. However, scalar coordinate variables do not define explicit independent dimensions.
Note that use of scalar coordinate variables for latitude, longitude, vertical, or time coordinates will inhibit COARDS conforming applications from recognizing them.
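As an illustration of the proposed 5.7 text, the following minimal sketch models a file as plain Python dictionaries and picks out the scalar coordinate variables of a data variable. The dictionary layout, the function name `find_scalar_coordinates` and the variable names are hypothetical, chosen only for this example; no netCDF library is used, and only numeric (dimensionless) scalars are considered.

```python
def find_scalar_coordinates(dimensions, variables, data_name):
    """Return the names listed in the data variable's coordinates attribute
    that have no dimensions of their own (numeric scalar coordinates)."""
    attributes = variables[data_name].get("attributes", {})
    listed = attributes.get("coordinates", "").split()
    scalars = []
    for name in listed:
        if variables[name]["dims"]:
            continue  # has dimensions, so it is a vector coordinate
        if name in dimensions:
            # Proposed rule: the variable name must not match any dimension name.
            raise ValueError(f"{name} clashes with a dimension of the same name")
        scalars.append(name)
    return scalars


# Hypothetical toy dataset: one 2D field with two scalar coordinates.
dimensions = {"lat": 73, "lon": 96}
variables = {
    "air_temp": {
        "dims": ("lat", "lon"),
        "attributes": {"coordinates": "height forecast_reference_time"},
    },
    "lat": {"dims": ("lat",), "attributes": {}},
    "lon": {"dims": ("lon",), "attributes": {}},
    "height": {"dims": (), "attributes": {"units": "m"}},
    "forecast_reference_time": {"dims": (), "attributes": {}},
}

scalar_names = find_scalar_coordinates(dimensions, variables, "air_temp")
print(scalar_names)
```

The check relies only on the two properties stated in the proposed text: association through the coordinates attribute, and the absence of any shared dimension.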
6.1 Labels
... {Unchanged, up to the last sentence}
If a character variable referenced by a data variable's coordinates attribute has only one dimension, the maximum length of the string, it is a string-valued scalar coordinate variable (see Section 5.7 Scalar Coordinate Variables).
9.2 Collections, Instances and Elements
... {First two paragraphs unchanged}
If there is only a single feature to be stored in a data variable, the instance dimension may be omitted. In this case the mandatory space-time coordinate variables will not vary with respect to the instance dimension. These space-time coordinates, as defined in table 9.1, may need to be defined as scalar coordinate variables, to maintain the required relationships for the feature types.
5. Benefits
The community benefits from this proposal by gaining clarity over the interpretation of scalar coordinate variables as a simple semantic concept within CF with a clear encoding.
All use cases presented to the mailing list to date are supported by this proposal, including:
- encoding of size one explicit coordinate variables as scalar coordinate variables to simplify data variable shape;
- encoding of coordinates where there are multiple possible inter-dependency relationships, such as:
- multiple related time coordinates;
- model, experiment, run identifiers for multi-model analyses;
- encoding of discrete sampling geometries where there is a single feature;
- encoding of coefficients as coordinates.
The data producer has the opportunity to select a scalar coordinate as the most appropriate way of describing some aspects of their data. The encoding convenience of omitting dimensions of size one is provided for, with a clear recognition that this encoding option slightly limits the richness of expression available with CF vector coordinates.
6. Status Quo
Currently the interpretation of scalar coordinate variables is unclear. It is recognised that a clarification is required. As such the status quo is deemed undesirable (e.g.4, 15)
There is another ticket, #104, proposing an alternative change. These two tickets are mutually exclusive, so they should not both be approved.
Change History (10)
comment:1 Changed 6 years ago by stevehankin
comment:2 Changed 6 years ago by biard
Steve,
I like the way you think! In particular, the question regarding the mathematical distinction got my mind going. I don't know if it will generate more heat than light, but here's what your questions led me into. (By the way, I'm an "independent" when it comes to this issue. I see merit in the arguments of both the "Originalist" and "Living CF" parties.)
If I think about the concept of scalar vs vector/matrix in linear algebra, I find that a scalar is treated differently than a matrix of dimension 1 x 1. Multiplication of a matrix by a scalar is defined as multiplication of every element of the matrix by the scalar. Multiplication of one matrix by another is only possible when they share a dimension. Following that sort of semantics, a scalar coordinate could well be considered to be different than a vector coordinate in a similar fashion. A scalar coordinate is considered to apply to every element of a variable as an ensemble. A vector coordinate is considered to apply to a variable through a relation over one or more shared dimensions.
If I continue to follow the linear algebra analogy, I can decompose a matrix multiplication by a vector into a sum of scalars (taken from the vector) multiplying column vectors (taken from the matrix). Correspondingly, I can compose a set of scalars multiplying column vectors (that all have the same length) into a matrix multiplied by a vector. This corresponds to the multi-file aggregation of variables with scalar coordinates into a single higher-dimension variable with a vector coordinate composed from scalar coordinates.
Thinking about the problem this way actually is leading me to smile more on the idea of scalar coordinates being different from vector coordinates of length 1. The semantic difference is specifically honored within linear algebra, and adopting this approach doesn't seem (to me) to constrain anyone in terms of how they may choose to compose (or not compose) higher-dimension relations using multiple files.
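The decomposition used in this analogy can be checked with a small sketch in plain Python: a matrix-vector product is computed directly, and again as a sum of scalars (taken from the vector) multiplying column vectors (taken from the matrix). The helper names are illustrative only and the example makes no claim about how any CF software aggregates files.

```python
def matvec(matrix, vector):
    """Ordinary matrix-vector product, one dot product per row."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]


def matvec_by_columns(matrix, vector):
    """Same product, built as a sum of scalar * column-vector terms."""
    total = [0] * len(matrix)
    for j, scalar in enumerate(vector):
        column = [row[j] for row in matrix]
        scaled = [scalar * value for value in column]  # scalar applies to every element
        total = [a + b for a, b in zip(total, scaled)]
    return total


A = [[1, 2], [3, 4], [5, 6]]
x = [10, 100]

print(matvec(A, x))             # [210, 430, 650]
print(matvec_by_columns(A, x))  # [210, 430, 650]
```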
comment:3 Changed 6 years ago by jonathan
Dear all
I don't think anyone will be surprised to read that I disagree with this proposal insofar as it is inconsistent with the proposal of ticket 104, which was made by David and me. Like Jim, I appreciate Steve's clear questions.
Perhaps unlike Jim, I agree with Steve on the answer to his third question. I do not think that CF needs to make any semantic distinction between a scalar and a vector of size one. The proposed text says, "A scalar coordinate variable defines a coordinate which applies to an entire data variable equally," but this is also true of a coordinate variable of size one, isn't it? For instance, a scalar coordinate variable of height=1.5 m means exactly the same as a size-one coordinate variable of height=1.5 m. The alternative of a scalar is offered because it is less effort to encode it, that's all. But they mean the same: both of them indicate that all the values in the data variable apply at a height of 1.5 m.
Moreover, I think that drawing a formal distinction introduces an unnecessary conceptual complexity for aggregation. If I have one data variable with a scalar coordinate variable of height=1.5 m, and another for the same geophysical quantity with a scalar coordinate variable of height=10 m, and the horizontal grid and all other coordinates are the same in the two variables, I should be able to aggregate them into a single data variable having a size-two dimension for height. I would expect to be able to do exactly the same thing if the two data variables each had a size-one coordinate variable of height instead of a scalar, or if one was a scalar and the other had size one. I see no need or value to having conceptual differences among these cases.
Going the other way, if I extract a single height level from a data variable with a multivalued height coordinate, I expect to get a size-one height coordinate. However, I don't think this has a different meaning from a data variable which has a scalar height coordinate of the same value. Both of them are on the same single level. There is no semantic distinction, in my opinion, just a formal one, which is a matter of convenience.
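The aggregation and extraction described above can be sketched with a toy model in plain Python. Everything here is an assumption made for illustration: fields are dictionaries, the height may be given either as a bare scalar or as a size-one list, and the data is stored as the horizontal grid only (the leading size-one axis of the size-one encoding is dropped for brevity).

```python
def as_single_height(height):
    """Accept a scalar height or a size-one sequence and return the bare value."""
    if isinstance(height, (list, tuple)):
        if len(height) != 1:
            raise ValueError("expected a single height value")
        return height[0]
    return height


def aggregate_by_height(fields):
    """Stack single-level fields into one field with a multivalued height."""
    ordered = sorted(fields, key=lambda f: as_single_height(f["height"]))
    return {
        "height": [as_single_height(f["height"]) for f in ordered],
        "data": [f["data"] for f in ordered],
    }


def extract_level(aggregated, index):
    """Pull one level back out, keeping a size-one height coordinate."""
    return {
        "height": [aggregated["height"][index]],
        "data": [aggregated["data"][index]],
    }


field_10m = {"height": [10.0], "data": [[5.0, 6.0], [7.0, 8.0]]}  # size-one encoding
field_1p5m = {"height": 1.5, "data": [[1.0, 2.0], [3.0, 4.0]]}    # scalar encoding

combined = aggregate_by_height([field_10m, field_1p5m])
print(combined["height"])  # [1.5, 10.0]
```

In this sketch the two encodings are treated identically, which is the behaviour argued for in this comment; a library following the #105 view could instead choose to keep them distinct.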
Steve's question 1 asks what is the semantic difference between a coordinate variable and an auxiliary coordinate variable. According to the draft CF data model, whose development has got stuck because we can't agree on the issue being discussed in this ticket and ticket 104, a coordinate variable is 1D, monotonic and numeric (I presume because if it's not numeric it's hard to make it reliably monotonic), while an auxiliary coordinate variable can be multidimensional, string-valued or numeric, and doesn't have to be monotonic (which wouldn't make sense if it was multidimensional anyway). In addition, there can be only one coordinate variable for any given dimension, but there can be any number of auxiliary coordinate variables with any given dimension.
Steve's question 2 refers to the special case when you have both a coordinate variable and an auxiliary coordinate variable with a single size-one dimension. With size one, monotonicity is not an issue, obviously. If they are both numeric, either of them could be the coordinate variable, and the other the auxiliary coordinate variable. This freedom of choice is not limited to size one. If you have a 1D coordinate variable and several 1D auxiliary coordinate variables of the same dimension, and all of them are numeric and monotonic, any one of the auxiliary coordinate variables could equally serve as the coordinate variable. For instance, I might have a multivalued coordinate variable of height, and going along with it an auxiliary coordinate variable model_level_number(height), which is also monotonic. It would be equally valid to switch them round and have a coordinate variable of model level number and an auxiliary coordinate variable height(model_level_number). This is a choice I can freely make when encoding the dataset. It depends on whether I want to regard height or model level number as an independent spatial coordinate of the data. The other quantity is a dependent variable, a function of the independent spatial coordinate. The distinction is indicated formally in netCDF by making one of them a (Unidata) coordinate variable, whose name equals the name of its dimension, and the other a (CF) coordinate variable, named by the CF coordinates attribute of the data variable. Part of the answer to Steve's question 2 is that this distinction is made syntactically in just the same way for size-one variables as for multivalued variables.
Since a numeric scalar coordinate variable doesn't have a dimension, it's not possible to tell formally whether it's semantically the same as a coordinate variable or an auxiliary coordinate variable. Therefore in ticket 104 we propose to make it clear that it is semantically a coordinate variable. (If it's not numeric, it must be semantically an auxiliary coordinate variable.) We think that's what was meant when scalar coordinate variables were invented. We also think this is the right choice because you only really need an auxiliary coordinate variable if you also have a coordinate variable (for instance, if you have both model level number and height), and in that situation you must have a netCDF dimension to show that they are linked. That's the only way CF-netCDF offers to indicate the connection between them. Hence you cannot use scalar coordinate variables in that situation. If you do have scalars of both height and model level number, the most obvious interpretation is that they are independent, we think. If that wasn't the intention of the data-writer, the file is defective, but not illegal. It needn't do much damage in practice, because it should be easy for the user of software to indicate that these two coordinates should actually be regarded as belonging together, if that is relevant to know (for instance, in aggregation).
The bottom line of this rather lengthy contribution, in response to Steve's nice short one, is that
- Question 1 is already clear in CF.
- Question 2 is clear for size one dimensions.
- Question 2 is not clearly enough answered for scalars, and ticket 104 proposes to make it clear (i.e. numeric scalars are coordinate variables, not auxiliaries).
- My answer to question 3 is No. I can't see anything in CF currently which makes a distinction and I don't think it would be helpful to do so.
Cheers
Jonathan
comment:4 Changed 6 years ago by markh
Replying to stevehankin:
Useful questions Steve, thank you
A point of view on these from me:
- what is the semantic difference (not the syntactic difference) between a coordinate variable and an auxiliary coordinate variable?
- an Auxiliary Coordinate variable describes dimensions of a data variable, it is not definitive;
- a Coordinate Variable provides a definitive, ordered, reversible definition of one dimension of a data variable.
- through what syntax does the semantic distinction in question 1 get represented when the coordinate variables are of length 1?
- If there is a relevant dimension of size 1 defined in the file, then the facilities exist to make the distinction.
- However, scalar coordinate variables do not vary with any data dimensions
- as such they are not Coordinate variables, they are more like Auxiliary Coordinates which describe zero data dimensions.
- Because they are scalars it is a very easy step to promote a scalar to a coordinate variable, a vector, by adding a dimension
- thus they may be treated as functionally equivalent by the data consumer
- however, this is an extra step, taken by data consumers, adding information to the data set
- This distinction recognises that Scalar Coordinates are semantically less rich than other vector coordinates.
- does CF regard the mathematical distinction between a scalar and vector of length 1 as significant?
- I think it does, in its current use in various places in the conventions document and the community;
- a scalar is a different thing with different properties from a size one vector:
- I don't think this is hair splitting, I think it is useful and a facet that will be generally recognised by data consumers and creators.
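The "promotion" step mentioned above, in which a data consumer turns a scalar into a size-one coordinate variable by adding a dimension, might look like the following sketch. The dictionary model of a file and the function name `promote_scalar` are hypothetical; the data values themselves are not modelled, only the metadata that changes.

```python
import copy


def promote_scalar(dimensions, variables, data_name, coord_name):
    """Return new (dimensions, variables) in which a scalar coordinate has
    become a size-one coordinate variable leading the data variable."""
    coord = variables[coord_name]
    if coord["dims"]:
        raise ValueError(f"{coord_name} already has dimensions")
    if coord_name in dimensions:
        raise ValueError(f"a dimension named {coord_name} already exists")

    new_dimensions = dict(dimensions)
    new_variables = copy.deepcopy(variables)  # leave the input untouched

    # The consumer adds information here: a new dimension of size one.
    new_dimensions[coord_name] = 1
    new_variables[coord_name]["dims"] = (coord_name,)
    new_variables[coord_name]["values"] = [coord["values"]]
    new_variables[data_name]["dims"] = (coord_name,) + tuple(variables[data_name]["dims"])

    # A coordinate variable is found via its dimension name, so it no
    # longer needs to be listed in the coordinates attribute.
    attrs = new_variables[data_name]["attributes"]
    remaining = [n for n in attrs.get("coordinates", "").split() if n != coord_name]
    if remaining:
        attrs["coordinates"] = " ".join(remaining)
    else:
        attrs.pop("coordinates", None)
    return new_dimensions, new_variables


dimensions = {"lat": 2, "lon": 3}
variables = {
    "air_temp": {"dims": ("lat", "lon"), "attributes": {"coordinates": "height"}},
    "height": {"dims": (), "values": 1.5, "attributes": {"units": "m"}},
}

new_dimensions, new_variables = promote_scalar(dimensions, variables, "air_temp", "height")
print(new_variables["air_temp"]["dims"])  # ('height', 'lat', 'lon')
```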
comment:5 follow-up: ↓ 6 Changed 6 years ago by markh
I have had a few requests for more examples, so I have prepared a wiki page to try and assist in the comprehension of the issues surrounding this ticket and ticket #104
comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 6 years ago by davidhassell
Replying to markh:
I have had a few requests for more examples, so I have prepared a wiki page to try and assist in the comprehension of the issues surrounding this ticket and ticket #104
Thank you for the examples, Mark, but I'm afraid that I find some of your comments quite partisan.
"The right hand column shows a number of current uses of scalar coordinates we have encountered in software creating CF NetCDF datasets. All of these examples become invalid if #104 is implemented but remain valid if #105 is implemented."
Please could you say which software libraries these are?
"For #104 we will have extensive need to re-engineer code and revisit data writing and reading capabilities, whilst retaining backwards compatibility with pre 1.7 CF data sets."
With respect, the amount of code you may have written based on #105 prior to its possible acceptance does not, by itself, seem to be a valid reason for accepting it! That said, Jonathan has previously pointed out that in a software library, it is easy to apply the #105 view on #104 datasets, so your work will not have been in vain, whatever happens.
What do you mean by "retaining backwards compatibility with pre 1.7 CF data sets."?
"In all these cases #104 is driving a change in my behaviour as a data creator, where as #105 is enabling me to carry on as I currently work, whilst clarifying the interpretation of my data for consumers."
See above.
In your multimodel ensemble example, you say that "the data creator should not have to define a set of inter-relationships at the point of data writing, there are no valid ways of doing this to meet all needs , no unique answer. #105 recognises these characteristics as emergent properties, not defined at the time of writing each data set."
I disagree. When the experiment was set up, the dimensionality and inter-relationships were clearly defined. For example, it was known if exp_param2 was auxiliary or independent to exp_param1. One could easily create an API which throws away that information, but I don't think that the conventions should support a situation where it is not known if exp_param2 was actually auxiliary or independent to exp_param1.
The same holds, in my view, for your forecast times example. In general, a single forecast can have many times (and therefore forecast periods) but only one forecast reference time, so I would be happy with encoding this, for example, as:
dimensions:
    time = 24 ;  // or size 1, it makes no difference
variables:
    time(time) ;
    forecast_reftime ;  // #104 scalar coordinate => not auxiliary to time
    forecast_refperiod(time) ;
All the best,
David
comment:7 in reply to: ↑ 6 Changed 6 years ago by markh
Replying to davidhassell:
Hello David
I'll try to answer your questions
"The right hand column shows a number of current uses of scalar coordinates we have encountered in software creating CF NetCDF datasets. All of these examples become invalid if #104 is implemented but remain valid if #105 is implemented."
Please could you say which software libraries these are?
The software I am most aware of is a range of in house code bases, written in a variety of programming languages and interfacing to the NetCDF API, to write datasets.
I know some of these interface to libraries, such as CDAT and Met Office IDL libraries, others make use of tools such as NCO and CDO, and others are pretty self-contained.
There is quite a range of architectures.
What do you mean by "retaining backwards compatibility with pre 1.7 CF data sets."?
#104 proposes a change in interpretation of scalar coordinate variables. If 104 is adopted, we will need to retain the capability to recognise one interpretation of scalars for <= 1.6 datasets and another for >= 1.7 data sets.
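A rough sketch of what such version-dependent handling might involve is given below. It assumes that the dataset declares its version in a global Conventions attribute of the form "CF-1.6", and that #104 would take effect from CF 1.7; the function names and the returned labels are hypothetical and are not taken from any existing library.

```python
def parse_cf_version(conventions):
    """Extract (major, minor) from a Conventions string such as 'CF-1.6'."""
    for token in conventions.replace(",", " ").split():
        if token.startswith("CF-"):
            major, minor = token[3:].split(".")
            return int(major), int(minor)
    raise ValueError("no CF version found in Conventions attribute")


def scalar_interpretation(conventions):
    """Pick which reading of scalar coordinate variables to apply.

    Assumption for this sketch: the #104 reading applies from CF 1.7 onwards.
    """
    if parse_cf_version(conventions) >= (1, 7):
        return "ticket-104"
    return "pre-1.7"


print(scalar_interpretation("CF-1.6"))  # pre-1.7
print(scalar_interpretation("CF-1.7"))  # ticket-104
```

Comparing versions as integer tuples rather than as strings or floats keeps a hypothetical later release such as 1.10 ordered after 1.7.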
The same holds, in my view, for your forecast times example. In general, a single forecast can have many times (and therefore forecast periods) but only one forecast reference time
The models which output data don't define the nature of the three time coordinate inter-relationship; they only define that there are two degrees of freedom. All interpretations of this statement are equally valid, and useful, when the data is created.
The representation you have used is not one I would choose to use, even if I no longer had access to scalar coordinates; it does not represent my understanding of the use of time in our forecast models.
mark
comment:8 Changed 6 years ago by jonathan
Dear Mark
You say that ticket 104 proposes a change in the interpretation of scalar coordinate variables, and that it would make some existing datasets invalid. I have to say that I think both of these statements are incorrect. As you know, I think the interpretation proposed in ticket 104 is not only the intention of the convention when it was written, but also the most obvious and simplest interpretation of the existing text. However, this debate demonstrates that it's not the only possible interpretation, and I can't argue that yours is definitely excluded by the existing text. But I would emphasise, as I've said before, that adopting 104 does not invalidate any existing dataset. I think that the examples you have given of datasets not OK with 104 but OK with 105 are all legal CF-netCDF files, and would remain so if we clarified the convention as 104 proposes, but they are not good examples of CF-netCDF files.
For instance, why would you not make a single model level number and the corresponding single vertical coordinate value belong to the same (size-one) dimension? They are very likely to belong together. It is hard to imagine how it could be useful to have independent multivalued coordinates of model level number and vertical coordinate. If they are not independent when there are two levels, why would it be useful not to indicate their relationship when there is only one level? I would argue that a file is deficient, though not invalid, if it does not show this relationship.
In another example, you write concerning time (i.e. forecast time), forecast reference time (i.e. analysis time) and forecast period, that "The models which output data don't define the nature of the three time coordinate inter-relationship; they only define that there are two degrees of freedom." This sounds bizarre to me. I find it hard to imagine the writer of the data (or the writer of the software which created the data) thinking, "There are three variables but only two independent dimensions and I want to leave it deliberately vague which are the dimensions." I think it is more likely that this is a degenerate case of a system which has defined relationships between one or two multivalued coordinates in general. One interesting possibility is that it might be one of a collection of forecasts, with various times and forecast reference times, that don't constitute a 2D array. In that case there could be a single discrete axis (CF section 4.5), with no coordinate variable, and all three coordinates would be auxiliary. There's only one dimension in this case. How can you be sure that there are two dimensions (but not which two they are) if all three of the coordinates are single-valued? I think it is best to record the relationship which is most appropriate to the system which generated the dataset, but you may wish to reinterpret the data if a different description is more convenient.
You write that "104 is driving a change in my behaviour as a data creator, where as 105 is enabling me to carry on as I currently work." I would say that 104 does not force anyone to change how they write single-valued coordinates, but it might strongly encourage them to clarify their intentions by writing more informative CF-netCDF files.
Best wishes
Jonathan
comment:9 Changed 6 years ago by markh
comment:10 Changed 6 years ago by markh
- Resolution set to wontfix
- Status changed from new to closed
ticket closed, no further action required
After reading this, my "2 cents" (hoping I will offend no one by saying this) is that for me this text does not clarify; instead it further confuses. It seems to me that we are moving towards clarity when CF concepts can be described in fewer words; not more.
I wonder if the effort to refine the definition of a scalar variable is misplaced. Does the confusion actually come from our having fuzzy answers to these questions: