Opened 6 years ago
Closed 2 years ago
#104 closed enhancement (fixed)
Clarify the interpretation of scalar coordinate variables
Reported by: | jonathan | Owned by: | davidhassell |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | | Cc: | |
Description
David Hassell and I propose the following changes to the CF standard document to clarify the interpretation of scalar coordinate variables. We do not think this is a material change to the convention, which already implies this interpretation, and we believe this interpretation is what was intended when scalar coordinate variables were introduced.
Section 5.7, Scalar coordinate variables.
The convention contains the following sentences:
Under COARDS the method of providing a single valued coordinate was to add a dimension of size one to the variable, and supply the corresponding coordinate variable. The new scalar coordinate variable is a convenience feature which avoids adding size one dimensions to variables. Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable.
These sentences are OK as they stand, we think, but it would be better to describe the current situation without emphasising its history, so we would propose to replace them with the following. In addition, we propose to add a bit extra, as shown, to the second sentence below:
The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables. A scalar coordinate variable has the same information content and can be used in the same contexts as a size one coordinate variable, if numeric, or a size one auxiliary coordinate variable, if a string (Section 6.1).
The next sentence would be unchanged; it mentions how this situation relates to that of the COARDS convention.
At the end of the section, after the example, we propose to append the following sentences:
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
We think the above interpretation and implications are consistent with the convention already, which says in this section that, "Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable". However, it would be an improvement to spell it out.
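As an illustration only, the proposed reading can be sketched in plain Python. The `Variable` class and the `logical_dimensions` function below are hypothetical stand-ins invented for this sketch; they are not part of CF or of any netCDF library, and string-valued coordinates (which carry a string-length dimension) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical stand-in for a netCDF variable; not a real library class."""
    name: str
    dims: tuple = ()         # dimension names as stored in the file
    coordinates: tuple = ()  # names listed in the CF `coordinates` attribute


def logical_dimensions(data_var, variables):
    """Return the dimensions of data_var under the proposed interpretation.

    Every numeric scalar coordinate variable (listed in `coordinates` and
    having no dimensions) is treated as one independent size-one dimension.
    """
    logical = list(data_var.dims)
    for coord_name in data_var.coordinates:
        if variables[coord_name].dims == ():
            logical.append(coord_name)
    return tuple(logical)


# Example: temperature(lat, lon) with two numeric scalar coordinate variables.
variables = {
    "time": Variable("time"),
    "height": Variable("height"),
}
temperature = Variable("temperature", dims=("lat", "lon"),
                       coordinates=("time", "height"))
print(logical_dimensions(temperature, variables))
```

Under this sketch the two scalar coordinates are simply appended as extra notional dimensions, which is the sense in which they are "regarded as though they were all independent coordinate variables with dimensions of size one".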
Section 6.1, Labels
Replace the last sentence
If a character variable has only one dimension (the maximum length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables).
with
If a character variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables). As such, it has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension. This is a convenience feature which avoids adding the size one dimension to the data variable.
The last part is a repetition of what Section 5.7 says. The reason for the change is that the existing wording is careless in implying that a string could be a coordinate variable; in fact, this is not possible, since string-valued coordinates must be auxiliary coordinate variables.
The interpretation of scalar coordinate variables in Section 9 may be different from the above, and this may require further clarification if the above is agreed.
Change History (71)
comment:1 Changed 6 years ago by mgschultz
comment:2 Changed 6 years ago by jonathan
Dear Martin
I think it's fine to split the sentence - thanks, good idea. I would suggest we insert some extra words to make it even clearer:
A numeric scalar coordinate variable has the same information content and can be used in the same contexts as a size one numeric coordinate variable. Similarly, a string-valued scalar coordinate variable has the same meaning and purposes as a size one string-valued auxiliary coordinate variable (Section 6.1).
Is that all right?
Cheers
Jonathan
comment:3 Changed 6 years ago by mgschultz
Perfect! Martin
comment:4 follow-up: ↓ 5 Changed 6 years ago by edward.campbell
Hello,
I strongly disagree with your proposed change. You claim that this is a clarification and not a material change. I do not agree with this. Specifically, you wish to append a section stating that if a data variable has two or more scalar coordinate variables, those scalar coordinate variables are independent. This is not currently the case - they may be independent, they may not be - the conventions make no assertion either way. If the dimensions are not specified one cannot currently infer independence. Put another way, without a dimension (a direction if you like) these are true scalars as opposed to unit vectors. Your proposed change will impose a single interpretation of existing data where a single interpretation does not currently exist. That in my view constitutes a material change.
Ed
comment:5 in reply to: ↑ 4 Changed 6 years ago by davidhassell
Replying to edward.campbell:
Dear Ed,
Thank you for posting on our proposal.
This view is, of course, not unexpected. However, to move things on from the current area of discussion on this point (on this ticket and elsewhere), I suggest that it would be useful if you, or anyone, could post some examples of software libraries which by design interpret, read or write CF-netCDF datasets in the manner which you suggest. This would give us some evidence for or against your assertions about the reinterpretation of existing datasets, and how widespread this practice may be. I don't know of any, but I'm the first to admit that I'm familiar with only a small fraction of what is out there.
If the dimensions are not specified one cannot currently infer independence. Put another way, without a dimension (a direction if you like) these are true scalars as opposed to unit vectors. Your proposed change will impose a single interpretation of existing data where a single interpretation does not currently exist. That in my view constitutes a material change.
I don't agree with your 'direction' analogy. Scalar coordinates are often endowed with a direction, via some combination of their positive property, their units property and their bounds' values. So I would say that they are indeed logically equivalent to unit vectors.
All the best,
David
comment:6 Changed 6 years ago by edward.campbell
David,
Thanks for the response. I'm intrigued that you think units, bounds and/or a positive attribute can combine to turn a scalar quantity into a vector quantity. I do not believe this to be true, but putting that aside, I think your call for implementations that would be affected by this change is a good idea. However, given that the spec is not clear I suspect the interpretation we are debating may also be within user code rather than the libraries that load the data from file. Despite this I support your call for evidence so we can make an informed decision.
Ed
comment:7 Changed 6 years ago by ngalbraith
Small detail in the text, both old and new:
'If a character variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable' ...
This implies that all 1-d character variables are coordinates; not true. Can this be edited while you're modifying this section of the text?
Thanks - Nan
comment:8 Changed 6 years ago by markh
I disagree with this proposal.
We do not think this is a material change to the convention, which already implies this interpretation, and we believe this interpretation is what was intended when scalar coordinate variables were introduced.
Such an implication is really not clear in the conventions as they stand.
The reinterpretation of existing CF compliant data implied by the 'defect' status of this ticket is particularly undesirable.
I view this as a regressive proposal, limiting the utility of CF for a number of scenarios without good justification. This functionality is providing benefit to the community and we should not remove that.
The interpretation of scalar coordinate variables in Section 9 may be different from the above, and this may require further clarification if the above is agreed.
This statement highlights to me that the interpretation put forward here is not complete.
I have significant reservations about having a whole range of interpretations of scalar coordinate variables depending on the context they exist in with respect to a data variable.
This is a particularly confusing approach for data creators and I fear it will lead to CF compliant data sets where the intended meaning is not correctly encoded.
I think there is a better way to approach this, which will deliver a cleaner and clearer result.
comment:9 in reply to: ↑ description Changed 6 years ago by jonathan
Dear Nan
Thank you for your comment. In the context of the whole paragraph, it is probably clearer than it appears when quoted by itself, but I agree it could be clarified further. Thanks for the suggestion. We could replace "character variable" in this sentence with "string-valued auxiliary coordinate variable". Is that OK?
Here are the text changes we are now proposing, following Martin's and Nan's comments.
Section 5.7, Scalar coordinate variables.
Replace
Under COARDS the method of providing a single valued coordinate was to add a dimension of size one to the variable, and supply the corresponding coordinate variable. The new scalar coordinate variable is a convenience feature which avoids adding size one dimensions to variables. Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable.
with
The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables. A numeric scalar coordinate variable has the same information content and can be used in the same contexts as a size one numeric coordinate variable. Similarly, a string-valued scalar coordinate variable has the same meaning and purposes as a size one string-valued auxiliary coordinate variable (Section 6.1).
At the end of the section, add:
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
Section 6.1, Labels
Replace the last sentence
If a character variable has only one dimension (the maximum length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables).
with
If a string-valued auxiliary coordinate variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables). As such, it has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature.
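A minimal sketch of how the revised Section 6.1 wording might be tested in software, assuming a simplified description of each variable. The function name, its parameters and the `"char"` type label are all hypothetical. Following Nan's point, a one-dimensional character variable only counts if it is actually named by the data variable's coordinates attribute.

```python
def is_string_scalar_coordinate(name, dtype, dims, data_dims, coordinates_attr):
    """Return True if the variable reads as a string-valued scalar coordinate.

    name             -- variable name
    dtype            -- "char" for character data (hypothetical label)
    dims             -- the variable's own dimension names
    data_dims        -- dimension names of the data variable
    coordinates_attr -- the data variable's `coordinates` attribute (a string)
    """
    listed = name in coordinates_attr.split()
    # The single dimension must be the string length, so it must not be
    # one of the data variable's dimensions.
    only_string_length = len(dims) == 1 and dims[0] not in data_dims
    return listed and dtype == "char" and only_string_length
```

For example, a `model_id(strlen)` character variable listed in the coordinates attribute of a `(lat, lon)` data variable would qualify, whereas an unlisted `comment(strlen)` variable or a `station_name(station, strlen)` label would not.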
Cheers
Jonathan
comment:10 Changed 6 years ago by jonathan
Dear Ed and Mark
As you know, this is a long-running discussion with Mark. However, I maintain that what David and I propose here is what the convention is intended to mean. Firstly, this is because I am confident that it's what the authors of the convention had in mind when scalar coordinate variables were invented. Of course memories are not perfect, so I cannot be certain, but that's my honest recollection. Secondly, I think it is the most obvious interpretation of the words as they stand now. If we wanted to introduce a conceptually different kind of coordinate variable, it would be misleading to write, "Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable." I would say that the evident intention of that statement is that scalar coordinate variables mean the same as size one coordinate variables (not size one auxiliary coordinate variables). Thirdly, it is sufficient to have two abstract kinds of coordinate variable, and because it is simpler it should be preferred.
David and I propose these changes because Mark suggests an alternative interpretation, which would not be what the convention was intended to mean, and which we would therefore like to exclude. Our aim is to clarify the intention of the CF document. That is why this is a defect ticket. The document is erroneous in apparently allowing an interpretation which was not intended.
Your arguments are concerned with flexibility in the interpretation of CF-compliant data. I think that's a different matter from what the convention means. It's fine to reinterpret data as part of analysing or processing it. That's a routine thing to do. Therefore our statement of what the convention means does not prevent your interpretation of CF-compliant data. I wrote about this on the email list. It would be interesting to know your response to that (on the email list). I appreciate that you would be rightly concerned if there was a loss of flexibility, but I really don't think you need to be. You can offer the user of software a flexibility in interpreting scalar coordinate variables, if you want to (changing size one Unidata coordinate variables into size-one auxiliary coordinate variables or vice-versa). It is flexibility in the treatment of data which is of benefit to users of data.
I appeal to you to consider this argument as a kind of compromise and to allow the clarification we propose to be made.
Best wishes and thanks
Jonathan
comment:11 Changed 6 years ago by stevehankin
Hi guys,
It's clear that convictions run deep on this issue. It is a lot less clear what "this issue" is (so many demonstrative pronouns!). Here is the disputed paragraph, right?
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
Concrete examples that illustrate the undesirable interpretations are needed. (At least for this discussion.) The text as it stands is awfully dense to serve us well in the standard. If readers go away scratching their heads about the intended meaning, there will be divergent interpretations.
comment:12 Changed 6 years ago by mcginnis
I would like to second Steve's request for some examples that illustrate the different interpretations.
I've been following this issue both here and on the mailing list, and I have to admit that despite my best attempts, I just don't understand what folks are disagreeing about, and I suspect that I am not alone...
comment:13 follow-up: ↓ 14 Changed 6 years ago by markh
I will try to illustrate a useful example. Consider a CF compliant NetCDF data variable with a 2 dimensional data array and two coordinate variables:
- latitude
- longitude
This data variable declares auxiliary coordinates, using the coordinates attribute. All but one of these are scalar coordinate variables:
- surface_pressure(latitude, longitude)
- a
- p0
- b
- time
- forecast_period
- forecast_reference_time
- source
- experiment_id
- model_id
We have many such data sets. Often these are converted from other formats using one of a number of converters developed inside my office and by collaborating organisations over the past few years. Sometimes they are the results of analyses which have been persisted by researchers.
These have always been viewed as reasonable data sets, given the comprehension of scalar coordinate variables.
Under this proposal, this data set would be viewed as exactly equivalent to a CF compliant NetCDF data variable with an 11 dimensional data array and 11 coordinate variables:
- latitude
- longitude
- a
- p0
- b
- time
- forecast_period
- forecast_reference_time
- source
- experiment_id
- model_id
along with one auxiliary coordinate:
- surface_pressure(latitude, longitude)
The proposal makes explicit that there should be no semantic differentiation between the two data sets, the first is merely an encoding short hand for the second.
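The arithmetic behind the "11 dimensional" figure can be checked with a short sketch. The dictionary layout is an assumption made for illustration: each name from the coordinates attribute maps to its dimensions, and any string-length dimension on the string-valued entries is ignored.

```python
# Dimensions of the data variable as stored in the file.
stored_dims = ("latitude", "longitude")

# Names listed in the coordinates attribute, mapped to their dimensions.
# An empty tuple marks a scalar coordinate variable (string-length
# dimensions are omitted here for simplicity).
coordinates = {
    "surface_pressure": ("latitude", "longitude"),
    "a": (),
    "p0": (),
    "b": (),
    "time": (),
    "forecast_period": (),
    "forecast_reference_time": (),
    "source": (),
    "experiment_id": (),
    "model_id": (),
}

scalar_names = [name for name, dims in coordinates.items() if dims == ()]

# Under the proposal each scalar coordinate implies one size-one dimension.
logical_dims = stored_dims + tuple(scalar_names)
print(len(scalar_names), len(logical_dims))
```

The nine scalar coordinates plus the two stored dimensions give the eleven dimensions of the second encoding, with `surface_pressure` left as the single auxiliary coordinate.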
This interpretation does not adequately reflect the data and metadata. It defines a set of independent coordinates which were not intended to be independent by the data creator. This will cause considerable confusion in the interpretation of these data sets, most of which are in storage, to be returned to at a later date as required.
If the data creators had been shown an 11 dimensional data set when they created it, they would have very likely raised questions about the output, but the scalar coordinate variable has not been interpreted in this way. The first encoding appeared to people to be a logical and sensible representation of their data.
Scalar coordinate variables have been viewed by users as semantic containers: no clear statement existed in the conventions to disagree with this perspective.
In most cases these data creators used software which had been deemed appropriate by the organisation and trusted its output would be valid. For years it has been deemed to be valid.
With cases such as this in mind, I do not think the community have the luxury of making decisions solely based on confidence in what the authors of the convention had in mind when scalar coordinate variables were invented. I think the interpretation of these features of the convention by users needs to be taken into account.
I have previously posted examples from the discrete sampling geometries section of the conventions which require different interpretation. I do not agree with the proposed approach of handling these as further special cases, requiring further clarification text not in this proposal.
comment:14 in reply to: ↑ 13 Changed 6 years ago by davidhassell
Replying to markh:
Dear Mark,
Thank you for posting an example, especially one which also describes the position that Jonathan and I are stating.
This interpretation does not adequately reflect the data and metadata. It defines a set of independent coordinates which were not intended to be independent by the data creator.
Interesting that you should say this. If the creator had intended for some of these size one coordinates to be dependent, I rather think that they should have indicated this when they created the dataset, by giving them a shared dimension. How else can we know what the creator intended? A human can try to infer (correctly?), but a computer can't.
For example, I might presume from their names that you mean a, p0 and b to be related height coordinates (but as the data variable is surface pressure, I'm not 100% sure ...). If so, then these should have been written as something like the size 1, 1-d coordinates a(height), p0(height), b(height) and don't need to be listed in the coordinates attribute.
If the data creators had been shown an 11 dimensional data set when they created it, they would have very likely raised questions about the output, but the scalar coordinate variable has not been interpreted in this way. The first encoding appeared to people to be a logical and sensible representation of their data.
... and perhaps wished that they had clearly encoded the relationships between dependent variables when they created the file in the first place. You seem to acknowledge that the creator meant to indicate that some coordinates are related, but has neglected to encode this in the file.
Given that we have a mechanism for linking dependent size 1 coordinates (via a shared dimension), if no link is specified then I prefer the only unambiguous interpretation: the exact opposite - i.e. they are independent.
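One way to picture this rule is to group coordinates by the dimension they share. The sketch below is an assumption-laden illustration: the function name, the input layout and the `<implied:...>` key format are all invented here. Coordinates sharing a dimension fall into one group (dependent), while each scalar coordinate gets a notional dimension of its own (independent).

```python
from collections import defaultdict


def group_by_dimension(coord_dims):
    """Group coordinate names by the single dimension they span.

    coord_dims maps each coordinate name to a tuple of its dimensions;
    an empty tuple means a scalar coordinate variable, which is placed
    alone in a group keyed by a made-up implied dimension name.
    """
    groups = defaultdict(list)
    for name, dims in coord_dims.items():
        key = dims[0] if dims else f"<implied:{name}>"
        groups[key].append(name)
    return dict(groups)


# time and forecast_period share the size-one dimension "t"; height is scalar.
example = {
    "time": ("t",),
    "forecast_period": ("t",),
    "height": (),
}
print(group_by_dimension(example))
```

In this example the time and forecast period end up linked through `t`, and the height stands alone, which is how the dependence would be readable by a computer rather than inferred by a human.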
I would rather not change the convention just to attach credibility to some datasets which were unintentionally created without all of the relevant information.
All the best,
David
comment:15 follow-up: ↓ 21 Changed 6 years ago by stevehankin
What CF calls a "scalar coordinate variable" is syntactically an "auxiliary coordinate" i.e. it lacks the axname(axname) syntax, and it is associated with a variable via the coordinates attribute. About "auxiliary coordinate variables" CF says
The use of coordinate variables is required whenever they are applicable. That is, auxiliary coordinate variables may not be used as the only way to identify latitude and longitude coordinates that could be identified using coordinate variables.
So for length >1 coordinates CF insists on using the axname(axname) syntax, but for length=1 coordinates CF allows us to borrow the auxiliary coordinate syntax. Through these 2 rules CF has orphaned the concept of a "scalar auxiliary coordinate variable". This missing concept lies at the heart of the current discussion.
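The purely syntactic distinction described here can be sketched as a small classifier. This is a simplified illustration for numeric variables only, and the function name and returned labels are hypothetical. The sketch also makes the gap visible: a dimensionless variable named in the coordinates attribute can only be labelled one way, so nothing in the syntax separates a scalar coordinate from a would-be "scalar auxiliary coordinate".

```python
def classify(name, dims, coordinates_attr):
    """Classify a numeric variable from its syntax alone.

    name             -- variable name
    dims             -- tuple of the variable's dimension names
    coordinates_attr -- the data variable's `coordinates` attribute (a string)
    """
    listed = name in coordinates_attr.split()
    if dims == (name,):
        # The axname(axname) form.
        return "coordinate variable"
    if listed and dims == ():
        return "scalar coordinate variable"
    if listed:
        return "auxiliary coordinate variable"
    return "not a coordinate"
```

For instance, `classify("lat", ("lat",), "time lat2d")` gives the coordinate variable case, `classify("time", (), "time lat2d")` the scalar case, and `classify("lat2d", ("y", "x"), "time lat2d")` the auxiliary case.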
This is a genuine flaw in the current CF specification. It deserves to be fixed. The text at the start of this trac ticket is wrong when it says that it is not proposing a material change. But it is a needed change.
The text, however, deserves to be expanded to much better illuminate the nature of the problem that it is fixing. Frankly, it should get explicit in saying that it is defining a rule that addresses an ambiguity created by the definitions of auxiliary coordinate variable and scalar coordinate variable.
comment:16 Changed 6 years ago by jonathan
Dear Mark
Thanks for your example. I think it is notable that you say netCDF datasets with many scalar coordinate variables have often been converted from other formats. I would suggest this means the people who wrote the converters didn't fully understand the CF convention, and if so, this is a mistake. Indeed, a previous email exchange that we had revealed that I had made such a mistake in my conversion program pp_cfwrite, which I'll correct. (Namely, even if there is only one model level, it should have a size-one dimension for the level, in order to indicate that the model level number and the vertical coordinate are linked.)
Why are a, p0 and b included as auxiliary coordinate variables? - David also asks about them. I guess they are formula terms for the atmosphere hybrid sigma pressure coordinate. Formula terms are not normally listed by the coordinates attribute; there is no prohibition on doing that, but I don't see it suggested in Section 4.3.2. Similarly perhaps surface_pressure is also a formula term (ps in the formula) and should not be in the coordinates attribute. Why isn't there a vertical coordinate variable (perhaps a scalar)? This variable ought to have a formula_terms attribute, pointing to these other quantities.
As you and Richard have pointed out before, there is a link between forecast_period, forecast_reference_time and time. This link should be indicated in the file by use of coordinate and auxiliary coordinate variables, rather than coding them all as scalars.
In the CF convention, source is an attribute, not a coordinate variable.
If experiment_id and model_id could vary separately, it is fine to have them as scalar coordinate variables. On the other hand, if this data variable is really one from a set of experiment--model combinations, it would be more informative to encode them as auxiliary coordinate variables of a shared size one dimension, because they belong together.
I think this variable is probably equivalent to one which has six independent dimensions (latitude, longitude, vertical, time, forecast time, and ensemble member). This is fine, I think, since a variable in which all six of these dimensions were multivalued is quite likely to be useful sometimes.
I'm not suggesting that the file is erroneous. It no doubt passes the CF checker and is CF-compliant, but it is not ideal. It suggests that certain quantities are independent coordinates which actually are not. The problem is that many errors cannot be formally detected, because much of CF is optional, and because we allow everything which is not explicitly prohibited, in order to allow other conventions to be used in combination with CF. So I don't think the data creator should be worried by our being more precise about what CF is intended to mean. It will still be a valid file, but not a good example of CF.
Best wishes
Jonathan
comment:17 follow-up: ↓ 18 Changed 6 years ago by biard
I've been following this TRAC ticket and the discussion on the listserv, and I am wondering if I have correctly understood the point under dispute. Is the statement below a fair representation of the difference between the two "camps"?
The point under dispute is whether scalar coordinate variables should be assumed to have (implied) size-one dimensions that are independent of one another, or whether there should be no assumption at all made about dependence or independence of the implied dimensions.
Does that sum it up, or is there something more?
comment:18 in reply to: ↑ 17 Changed 6 years ago by jonathan
Dear Jim et al.
Replying to biard:
The point under dispute is whether scalar coordinate variables should be assumed to have (implied) size-one dimensions that are independent of one another, or whether there should be no assumption at all made about dependence or independence of the implied dimensions.
Does that sum it up, or is there something more?
Yes, I think that's right. That's the contentious point.
It has an implication for the CF logical data model (that is, an abstraction of the way we encode the metadata in a netCDF file). Mark and Ed think that scalar coordinate variables are a distinct concept in the logical data model from coordinate variables and auxiliary coordinate variables. David and I think that scalar coordinate variables are a convenient shorthand for coordinate variables and auxiliary coordinate variables, and not a distinct concept of the logical data model.
Like Mark, Steve thinks that defining this either way is a material change rather than a clarification or a correction of the convention. Steve writes
This is a genuine flaw in the current CF specification. It deserves to be fixed. The text at the start of this trac ticket is wrong when it says that it is not proposing a material change. But it is a needed change.
I think it isn't a change in the convention, myself, but since it is contentious I'm happy to change this ticket to a proposal for amendment rather than a defect ticket. That means it needs to be positively agreed, rather than agreed by default if not objected to. At the moment, it can't be agreed as a defect ticket anyway because there are outstanding objections. As Steve says,
[This ticket] addresses an ambiguity created by the definitions of auxiliary coordinate variable and scalar coordinate variable.
Our intention is to remove that ambiguity by clarifying that scalar coordinate variables represent size-one numeric coordinate variables and size-one string-valued auxiliary coordinate variables. We don't think that in practice this will limit the flexibility that software has in interpreting CF-netCDF files, and it doesn't make any existing file invalid. Really it's only an issue for the logical data model, we think.
Best wishes
Jonathan
comment:19 Changed 6 years ago by jonathan
- Type changed from defect to enhancement
comment:20 Changed 6 years ago by mcginnis
Thanks for providing that example! It's very helpful, and I think I actually understand the discussion now.
Jonathan's interpretation of the example file makes sense to me. Listing the source and the formula terms for the sigma height in the coordinates attribute seems strange to me.
I wouldn't go so far as to actively campaign for assuming that scalar coordinate variable implies size-one dimension -- the issue of older files NOT assuming that makes me prefer to abstain -- but I will say that it's what I always thought it meant based on the CF spec.
Cheers,
Seth
comment:21 in reply to: ↑ 15 Changed 6 years ago by markh
Replying to stevehankin:
I agree with Steve's analysis of this ticket.
I support the change of ticket type, we can now talk about this issue as a forward looking adaption to CF.
The text, however, deserves to be expanded to much better illuminate the nature of the problem that it is fixing.
I think the nature of the problem and the objective for the solution are not well defined in the ticket proposal.
It would be good to have the objective of the ticket clearly stated. I think this objective needs to be agreed before any text to deliver this objective may be finalised.
comment:22 Changed 6 years ago by jonathan
Steve and Mark have requested a statement of the purpose of this ticket. This is how I would describe it:
The current version of the CF standard does not state clearly that a scalar coordinate variable implies the notional existence of a size-one dimension which has not been included in the netCDF file as a dimension of the data variable. We believe this to have been the intention of the authors, and that is preferable because it is the most obvious and simplest interpretation of the CF standard, and we therefore propose these changes to make that intention clearer. It is possible that the writers of some existing CF-netCDF files might have had a different interpretation, but the changes we propose would not invalidate any existing file. The ambiguity relates only to the logical interpretation of the file. Resolving the ambiguity is therefore important to the CF data model, which is under discussion in other tickets.
Note that the above statement is a motivation for the ticket. It's not intended as new text for the CF standard document. The proposed changes are as shown in comment 9.
Jonathan
comment:23 Changed 6 years ago by markh
Thank you for the statement of purpose, from your point of view; it is helpful.
From my perspective the objective of this ticket is to remove any capability to encode a Scalar Coordinate in a NetCDF file.
The term 'scalar' is only ever to be interpreted as part of an encoding short cut with no semantic implications.
As is clear from my earlier comments I don't support this approach. With this in mind I have created a new ticket #105 to enable an alternative approach to be discussed on its merits.
I have noted on that ticket that it is not consistent with this ticket and the proposed changes should be seen as mutually exclusive.
comment:24 Changed 6 years ago by jonathan
Dear all
In an email, Steve has suggested further helpful text to describe the problem we are trying to resolve. Here is my version of his suggestion.
Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables. There is, however, a key difference between the interpretation of scalar coordinate variables and auxiliary coordinate variables. Scalar coordinates have the same status in a CF file as (conventional, Unidata, COARDS) coordinates in which the dimension name and the variable name match. These coordinates define the independent variables (spatiotemporal and others) for the data variable. Auxiliary coordinate variables provide extra information as a function of these independent variables, as alternative numeric values (which don't have to be unique or monotonic along a given dimension), or string-valued labels. To indicate that a variable is intended to be an auxiliary coordinate variable, it is necessary to give it a dimension, in order to show which coordinate variable(s) it belongs to. Numeric scalar coordinate variables are not to be interpreted as auxiliary coordinate variables.
Cheers
Jonathan
comment:25 Changed 6 years ago by mgschultz
Dear Jonathan,
I like this text and believe it gives a rather clear view. There is only one point which seems to be implicit and could perhaps be made more explicit: You refer to the Unidata and COARDS convention to give coordinate variables the same name as the corresponding dimension. Is it therefore safe to assume that a scalar variable with the same name as a dimension is a coordinate variable? Note that this limits the flexibility for switching coordinate variables and auxiliary coordinate variables around as you suggested in the other post on height and model_level.
Best regards,
Martin
comment:26 Changed 6 years ago by biard
Martin,
I believe that the situation you describe is specifically prohibited (or should be). Having a scalar variable with the same name as a dimension would be very bad berries.
Grace and peace,
Jim
comment:27 Changed 6 years ago by jonathan
Dear Martin
I'm glad you like the motivation statement. Thanks are due to Steve.
Sect 5.7 strongly recommends against giving a scalar coordinate variable the same name as a dimension, as Jim advocates too. That's because it could be confusing: the scalar coordinate variable is a shorthand; it takes the place of a size-one coordinate variable, with the dimension omitted.
If you wanted to exchange the roles of a coordinate variable and a 1D auxiliary coordinate variable which share a size-one dimension in a netCDF file, you would have to rename the dimension. Also, auxiliary coordinate variables have to be named by the coordinates attribute; coordinate variables are not normally listed there, but it is not prohibited. When I was writing about exchanging their roles, I was thinking more of doing this in software which processes, and hence could reinterpret, the CF-netCDF file, rather than changing the file itself.
Is this OK?
Best wishes
Jonathan
comment:28 follow-up: ↓ 64 Changed 6 years ago by jonathan
Dear all
For the sake of clarity, I will restate what we now propose in this ticket. This is the motivation:
Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables. There is, however, a key difference between the interpretation of scalar coordinate variables and auxiliary coordinate variables. Scalar coordinates have the same status in a CF file as (conventional, Unidata, COARDS) coordinates in which the dimension name and the variable name match. These coordinates define the independent variables (spatiotemporal and others) for the data variable. Auxiliary coordinate variables provide extra information as a function of these independent variables, as alternative numeric values (which don't have to be unique or monotonic along a given dimension), or string-valued labels. To indicate that a variable is intended to be an auxiliary coordinate variable, it is necessary to give it a dimension, in order to show which coordinate variable(s) it belongs to. Numeric scalar coordinate variables are not to be interpreted as auxiliary coordinate variables.
This is the change to the convention:
Section 5.7, Scalar coordinate variables.
Replace
Under COARDS the method of providing a single valued coordinate was to add a dimension of size one to the variable, and supply the corresponding coordinate variable. The new scalar coordinate variable is a convenience feature which avoids adding size one dimensions to variables. Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable.
with
The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables. A numeric scalar coordinate variable has the same information content and can be used in the same contexts as a size one numeric coordinate variable. Similarly, a string-valued scalar coordinate variable has the same meaning and purposes as a size one string-valued auxiliary coordinate variable (Section 6.1).
At the end of the section, add:
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
Section 6.1, Labels
Replace the last sentence
If a character variable has only one dimension (the maximum length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables).
with
If a string-valued auxiliary coordinate variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables). As such, it has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature.
Since no new objections or comments have been made for three weeks, and enough support has been expressed previously, I think this ticket could be accepted according to the rules. However, I guess that Mark, for instance, is not content with it. It would be helpful to hear views from others on the proposal as it now stands.
Cheers
Jonathan
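The distinction drawn in the proposed addition to Section 5.7 (scalar coordinate variables are read as independent, while related single-valued coordinates share one size-one dimension) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the dictionary layout, the function name `group_single_valued_coordinates` and the example variable names are hypothetical, and all coordinates are assumed to be numeric.

```python
from typing import Dict, List, Tuple

Dims = Dict[str, int]                   # dimension name -> size
Variables = Dict[str, Tuple[str, ...]]  # variable name -> its dimension names


def group_single_valued_coordinates(
    dims: Dims, variables: Variables, coordinates: List[str]
) -> List[List[str]]:
    """Group single-valued coordinates by how they are stored.

    Each scalar coordinate variable forms a group of its own (independent).
    Variables sharing the same size-one dimension are grouped together (related).
    Coordinates with more than one value are ignored.
    """
    groups: Dict[Tuple[str, str], List[str]] = {}
    for name in coordinates:
        var_dims = variables[name]
        if var_dims == ():
            # No netCDF dimensions: a scalar coordinate variable.
            groups[("scalar", name)] = [name]
        elif len(var_dims) == 1 and dims[var_dims[0]] == 1:
            # One size-one dimension: related to others on that dimension.
            groups.setdefault(("dimension", var_dims[0]), []).append(name)
    return list(groups.values())


# Hypothetical file: time and forecast_period share the size-one dimension
# "time"; height is a scalar coordinate variable.
dims = {"time": 1, "lat": 73, "lon": 96}
variables = {
    "time": ("time",),
    "forecast_period": ("time",),
    "height": (),
    "lat": ("lat",),
    "lon": ("lon",),
}
groups = group_single_valued_coordinates(
    dims, variables, ["time", "forecast_period", "height", "lat", "lon"]
)
print(groups)  # [['time', 'forecast_period'], ['height']]
```

Here time and forecast_period are marked as related by their shared dimension, whereas height, being a scalar coordinate variable, stands alone.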
comment:29 Changed 6 years ago by russ
Having just reviewed this ticket and ticket #105 as well as all the associated comments, I'm convinced by the arguments in favor of #104. I believe the new clarifications capture the original intent of scalar coordinate variables as a convenient and natural shortcut, avoiding the proliferation of size 1 dimensions. Dependencies among scalar values are best captured by use of explicit dimensions expressing such relationships. In their absence, scalar coordinate variables may be assumed to be independent.
--Russ
comment:30 Changed 6 years ago by caron
I'm pondering the relationship of coordinate (regular or auxiliary) variables to the concepts of dependence and independence. It may be very worthwhile to figure this out, though I don't think the conventions as they stand are clear.
1. Is there a clear definition of dependence and independence that can be used in this context?
2. I can't think of any counterexamples to the notion that a coordinate variable represents an independent variable. But there are some auxiliary coordinate variables that I'm not sure are dependent variables, namely:
2.1 2D coordinates lat(x,y) and lon(x,y) are auxiliary coordinates, but it seems that they could be considered independent, especially when the x(x) and y(y) coordinates are missing.
2.2 Are discrete sampling coordinates lat(sample) and lon(sample) dependent or independent?
comment:31 follow-up: ↓ 32 Changed 6 years ago by markh
I am particularly wary of the following sentence in this proposal:
The interpretation of scalar coordinate variables in Section 9 may be different from the above, and this may require further clarification if the above is agreed.
The implications for discrete sampling geometries need to be considered for any change to the conventions for version 1.7.
In my view this approach makes the description of discrete sampling geometries more complex, but whatever my view, I think the implications should be clearly documented as part of this ticket.
comment:32 in reply to: ↑ 31 Changed 6 years ago by davidhassell
Replying to markh:
Dear Mark,
In my view this approach makes the description of discrete sampling geometries more complex, but whatever my view, I think the implications should be clearly documented as part of this ticket.
I don't think, at the moment, that the interpretation of scalar coordinate variables in Section 9 will be different under what Jonathan and I are proposing.
The examples H.4, H.5 and H.9, for example, are fine for me under our proposal. Is it a problem, in these examples, that time, lat, lon (and alt) are independent? I would expect it, I think. One reason for this is, for example, that given enough of these files for different locations, I'd want to try and stitch them together to make a data variable with independent lat and lon dimensions which are both of size greater than 1.
I suspect, however, that you have spent more time than me considering this issue, so it would be helpful if you could expand on any implications you think there may be.
Many thanks and all the best,
David
comment:33 Changed 6 years ago by taylor13
Dear all,
I vote against ticket 105 and I favor 104, but I think the wording is still not ideal for the addition to the end of section 5.7 which is proposed to read:
"If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables."
For the second sentence do we mean: "If two or more single-valued coordinates are not independent, but have related values (for instance, station number and station name, or vertical coordinate and model level number, Section 6.2), then one should be stored as a coordinate of size one dimension and the rest should be stored as associated auxiliary coordinate variables."
I was confused by "should be stored as coordinate or auxiliary coordinate variables". Which is it?
Note besides modifying the end of the sentence, I replaced one of the "for instances"; it seems to me that time and forecast period would often be considered to be independent coordinates, no? I'm not familiar with forecasting terminology, but I would think "period" would indicate the time since initialization whereas "time" would indicate the time at the end of the period. These would be independent only if you assume all forecasts were initiated at the same time.
Please clarify this before accepting the ticket.
thanks, Karl
comment:34 Changed 6 years ago by biard
Karl,
Regarding coordinates vs auxiliary coordinates - there is nothing that forces you to make one of these coordinate-type variables under consideration a "primary" coordinate, is there? I think it is completely valid for a group of coordinate-type variables to share a common dimension name that is not the name of any of the variables. In such a case, all of them would be auxiliary coordinates for one or more data variables.
Grace and peace,
Jim
comment:35 Changed 6 years ago by taylor13
Jim,
Thanks for clarifying. My attempt to reword is wrong, I guess, since several auxiliary coordinates can be associated with one another simply by defining them to be functions of the common dimension. It doesn't matter if one of them is an actual coordinate or not.
Also my statement that "These would be independent only if you assume all forecasts were initiated at the same time" should have read "These would be independent except in the case when all forecasts were initiated at the same time." But I'm probably wrong about that too.
My support for accepting ticket 104 still stands, but I'm worried that it is difficult to understand all of this by reading the documentation. The whole idea is rather simple, but I'm not sure we've explained it concisely and clearly.
cheers, Karl
comment:36 Changed 6 years ago by jonathan
Dear John, Karl, Jim
I think Jim is correct in saying that you can have a number of 1D auxiliary coordinate variables of the same dimension, without there being a (Unidata, COARDS) coordinate variable of that dimension. This might well happen, for example, with a model ensemble axis. The ordering of such an axis is arbitrary, and there isn't necessarily a monotonic coordinate variable for it. The elements are distinguished by a combination of information such as ensemble member number and model identity. Therefore I don't think we can prescribe whether coordinate or auxiliary coordinate variables should be used.
I agree with John to the extent that coordinate variables and auxiliary coordinate variables are similar and sometimes exchangeable semantically, but I wouldn't say they are the same. The differences that we state in the draft data model (latest version of this text) are that coordinate variables must be one-dimensional, numeric and monotonic (implying all their values must be distinct). Auxiliary coordinate variables can be multidimensional, string-valued and don't have to be monotonic.
Note that in the data model we use the term dimension coordinate construct to correspond to a (Unidata, COARDS) coordinate variable, and auxiliary coordinate construct for a (CF) auxiliary coordinate variable.
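The formal criteria listed above for a (Unidata, COARDS) coordinate variable can be written as a small check. This is a minimal sketch, not part of the draft data model: the function names are hypothetical, and the name-matching test reflects the convention that a coordinate variable shares its name with its dimension.

```python
from numbers import Real
from typing import Sequence, Tuple


def is_strictly_monotonic(values: Sequence) -> bool:
    """True if values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def could_be_coordinate_variable(
    name: str, dims: Tuple[str, ...], values: Sequence
) -> bool:
    """Check the formal criteria: 1-D on its own dimension, numeric, monotonic."""
    one_dimensional = len(dims) == 1 and dims[0] == name
    numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    # Short-circuit so that non-numeric values are never compared.
    return one_dimensional and numeric and is_strictly_monotonic(values)


# A conventional coordinate variable lat(lat).
print(could_be_coordinate_variable("lat", ("lat",), [10.0, 20.0, 30.0]))  # True
# Same values on another dimension: only an auxiliary coordinate variable.
print(could_be_coordinate_variable("lat", ("sample",), [10.0, 20.0, 30.0]))  # False
# Repeated values are not strictly monotonic.
print(could_be_coordinate_variable("time", ("time",), [0, 0, 1]))  # False
# A single value passes trivially.
print(could_be_coordinate_variable("height", ("height",), [2.0]))  # True
```

The last case shows why these formal properties cannot by themselves settle the role of a single-valued coordinate: with only one value, monotonicity is satisfied vacuously.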
To address Karl's other concern, I would propose
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (this might be the case, for instance, for time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
That's only slightly different from before. I've inserted "this might be the case", because time and forecast period might be related, but they might not, as Karl suggests - it depends on the application. It's valuable to indicate whether there is a relationship or not (this has been discussed earlier in this ticket). Is this text OK?
Best wishes
Jonathan
comment:37 follow-up: ↓ 40 Changed 6 years ago by jonathan
Dear Mark and John
Discrete sampling geometries will require an addition to the CF data model, but we decided in the first place to agree the data model for CF 1.5, in which we do not have DSGs. I think the question John asks, about whether lat(sample) and lon(sample) are independent in a DSG, will have to be thought about in the context of the data model for CF 1.6 and later.
Cheers
Jonathan
comment:38 Changed 6 years ago by jonathan
Dear all
Arising from a comment of Nan's, I'd like to propose a further clarification to the text of 5.7, namely to change the first para from
When a variable has an associated coordinate which is single-valued, that coordinate may be represented as a scalar variable. Since there is no associated dimension these scalar coordinate variables should be attached to a data variable via the coordinates attribute.
to
When a variable has an associated coordinate which is single-valued, that coordinate may be represented as a scalar variable (i.e. a data variable which has no netCDF dimensions). Since there is no associated dimension these scalar coordinate variables should be attached to a data variable via the coordinates attribute.
Jonathan
comment:39 Changed 6 years ago by taylor13
Dear all,
Concerning Jonathan's post above (8/14/13 08:54:56), I think the additional text is very helpful and should be included.
regards, Karl
comment:40 in reply to: ↑ 37 ; follow-ups: ↓ 41 ↓ 47 Changed 6 years ago by caron
The DSG v1.6 uses dimensions in a new way, which I agree can wait to be thought through. But point data representations are valid in 1.5, so it might be better to think through the implications of this for point data. For example, from the proposal:
"If two or more single-valued coordinates are not independent, but have related values..., they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables."
But you don't want to add a new dimension onto point data just to represent that two coordinates are not independent. An example:
    dimensions:
      sample = 39238923;
    variables:
      float data(sample);
        data:coordinates = "lat lon time";
      float lat(sample);
      float lon(sample);
      float time(sample);
Suppose that you are sampling at the same point. It's intuitive to indicate this using scalar coordinates:
    dimensions:
      sample = 39238923;
    variables:
      float data(sample);
        data:coordinates = "lat lon time";
      float lat;
      float lon;
      float time(sample);
So here you are really using a shorthand for lat(sample) and lon(sample), indicating that these are constants, which is really useful to know.
You might also want to include the height levels of the sample:
    dimensions:
      sample = 39238923;
    variables:
      float data(sample);
        data:coordinates = "lat lon time heightAboveGround heightAboveMsl";
      float lat;
      float lon;
      float heightAboveGround;
      float heightAboveMsl;
      float time(sample);
You don't really want to do this:
    dimensions:
      sample = 39238923;
      height = 1;
    variables:
      float data(sample, height);
        data:coordinates = "lat lon time heightAboveGround heightAboveMsl";
      float lat;
      float lon;
      float heightAboveGround(height);
      float heightAboveMsl(height);
      float time(sample);
Well, maybe you do. I don't like it because it makes it look like a profile.
comment:41 in reply to: ↑ 40 ; follow-up: ↓ 43 Changed 6 years ago by davidhassell
Replying to caron:
Hello John,
Thanks for this. It's definitely clarified things for me. I still bravely maintain that DSG is not affected logically by #104. I sympathize with your last example ("I don't like it because it makes it look like a profile"), but surely the resemblance is passing, since the featureType attribute will be "point", rather than "profile", and extra dimensions are not prohibited. Is that right?
Suppose that you are sampling at the same point. It's intuitive to indicate this using scalar coordinates:
    dimensions:
      sample = 39238923;
    variables:
      float data(sample);
        data:coordinates = "lat lon time";
      float lat;
      float lon;
      float time(sample);
So here you are really using a shorthand for lat(sample) and lon(sample), indicating that these are constants, which is really useful to know.
I like this example, but I don't understand how the scalar lat is shorthand for lat(sample) when sample = 39238923. Am I missing something?
All the best,
David
comment:42 follow-up: ↓ 44 Changed 6 years ago by jonathan
Dear John
CF 1.6 talks about scalar coordinate variables in section 9.2:
If there is only a single feature to be stored in a data variable, there is no need for an instance dimension and it is permitted to omit it. The data will then be one-dimensional, which is a special (degenerate) case of the multidimensional array representation. The instance variables will be scalar coordinate variables; the data variable and other auxiliary coordinate variables will have only an element dimension and not have an instance dimension, e.g. data(o) and t(o) for a single timeSeries.
Your first example, with data(sample), lat(sample), lon(sample) and time(sample), could be a collection of point data, with sample, which is (much!) greater than one, being the instance dimension (i.e. the number of points). No scalar coordinate variables are involved in that case.
Alternatively, it could be a single trajectory feature, in which sample is the element dimension (the number of points along the trajectory). Again, there are no scalar coordinates.
Your second example, with data(sample), lat, lon and time(sample), could be a single timeseries feature, in which sample is the element dimension (the number of times in the timeseries). This has two scalar coordinate variables. Following this ticket, they would be regarded as logically equivalent to lat(lat) and lon(lon), as though the data were dimensioned data(sample,lat,lon) with lat=1 and lon=1. This is a valid and logically equivalent way of representing a single timeseries.
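The logical equivalence described for the second example can be sketched as a transformation on a simple in-memory description of the file. This is a minimal sketch under stated assumptions: the dictionary layout and the function name `expand_scalar_coordinates` are hypothetical, and every scalar coordinate is assumed to be numeric (string-valued ones would correspond to auxiliary coordinate variables instead).

```python
from typing import Dict, List, Tuple

Dims = Dict[str, int]                   # dimension name -> size
Variables = Dict[str, Tuple[str, ...]]  # variable name -> its dimension names


def expand_scalar_coordinates(
    dims: Dims, variables: Variables, data_name: str, coordinates: List[str]
) -> Tuple[Dims, Variables]:
    """Give each scalar coordinate its own size-one dimension.

    Returns new dictionaries; the inputs are left unchanged.
    """
    new_dims = dict(dims)
    new_vars = dict(variables)
    data_dims = list(variables[data_name])
    for name in coordinates:
        if variables[name] == ():      # no netCDF dimensions: scalar coordinate
            new_dims[name] = 1         # dimension named after the variable
            new_vars[name] = (name,)   # now a size-one coordinate variable
            data_dims.append(name)     # the data variable gains the dimension
    new_vars[data_name] = tuple(data_dims)
    return new_dims, new_vars


# The single-timeseries example: data(sample), lat, lon, time(sample).
dims = {"sample": 39238923}
variables = {"data": ("sample",), "lat": (), "lon": (), "time": ("sample",)}
new_dims, new_vars = expand_scalar_coordinates(
    dims, variables, "data", ["lat", "lon", "time"]
)
print(new_vars["data"])  # ('sample', 'lat', 'lon')
print(new_dims)          # {'sample': 39238923, 'lat': 1, 'lon': 1}
```

The size-one dimensions add no data values, so the stored array is the same in both forms; only the declared structure differs.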
If it's actually not a timeseries, but still a collection of points which happen to be coincident (?), you have to dimension lat and lon with sample, in order to agree with Table 9.1.
Thus, this ticket doesn't appear to cause a problem for DSGs, but I expect we will have to do some more thinking about the logical data model. With that caveat, is the ticket OK as it stands?
Best wishes and thanks
Jonathan
comment:43 in reply to: ↑ 41 Changed 6 years ago by caron
Hi David:
Replying to davidhassell:
Replying to caron:
Hello John,
Thanks for this. It's definitely clarified things for me. I still bravely maintain that DSG is not affected logically by #104. I sympathize with your last example ("I don't like it because it makes it look like a profile"), but surely the resemblance is passing, since the featureType attribute will be "point", rather than "profile", and extra dimensions are not prohibited. Is that right?
No, it's not really fatal, just an idiom that I like. But it does make me wonder if height is really an "independent" variable. If so, it would be nice to know more precisely what that means.
Also, note that I am not referring to the implications of DSG representation, just the way that one would do point data pre-DSG.
Suppose that you are sampling at the same point. It's intuitive to indicate this using scalar coordinates:
    dimensions:
      sample = 39238923;
    variables:
      float data(sample);
        data:coordinates = "lat lon time";
      float lat;
      float lon;
      float time(sample);
So here you are really using a shorthand for lat(sample) and lon(sample), indicating that these are constants, which is really useful to know.
I like this example, but I don't understand how the scalar lat is shorthand for lat(sample) when sample = 39238923. Am I missing something?
I just mean to say that by making it a scalar, you indicate that all the values of lat(39238923) would be identical if you made it into a dimensioned variable. So by looking at the "structural metadata" of the file, you know something that's very useful.
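The reading described here, in which a scalar stands for a dimensioned variable whose values are all identical, can be sketched as a pair of helpers. This illustrates that reading only, not the interpretation proposed in #104; the function names and the sample values are hypothetical.

```python
from typing import List, Optional


def collapse_if_constant(values: List[float]) -> Optional[float]:
    """Return the shared value if every entry is identical, otherwise None."""
    first = values[0]
    return first if all(v == first for v in values) else None


def expand_to_dimension(value: float, size: int) -> List[float]:
    """Rebuild the dimensioned form, e.g. lat(sample), from a scalar."""
    return [value] * size


# A constant lat(sample) can be stored as a scalar and rebuilt without loss.
lat_sample = [51.0, 51.0, 51.0, 51.0]
scalar_lat = collapse_if_constant(lat_sample)
print(scalar_lat)  # 51.0
print(expand_to_dimension(scalar_lat, len(lat_sample)) == lat_sample)  # True

# Varying values cannot be collapsed.
print(collapse_if_constant([51.0, 51.5, 51.0]))  # None
```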
Regards, John
comment:44 in reply to: ↑ 42 Changed 6 years ago by caron
Hi all:
From my POV, CF Conventions have not previously made any clear semantic distinction between coordinate variables and auxiliary coordinate variables. Now tickets 104 and 105 seem to have that as an important distinction. Yet the main difference between the two, monotonicity, has no effect on a scalar coordinate.
FYI, in the CDM data model, there is no real difference between coordinate variables and auxiliary coordinate variables, nor between scalar and non-scalar coordinates. However, maybe there should be, if we can figure it out.
I'm guessing that the real issue is with the data model, which I apologize for not having followed. Perhaps that's what I need to do next in order to have anything useful to say.
In terms of the actual wording:
"The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables. A numeric scalar coordinate variable has the same information content and can be used in the same contexts as a size one numeric coordinate variable. Similarly, a string-valued scalar coordinate variable has the same meaning and purposes as a size one string-valued auxiliary coordinate variable (Section 6.1)."
"If a string-valued auxiliary coordinate variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables). As such, it has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature."
both seem quite good.
This sentence:
"If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (for instance, time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables."
implies a distinction between "independent coordinate variables" and "not independent coordinate variables", and indicates how to use the dimensions of a variable to convey which is which.
My gut instinct is that dimension usage is actually already overloaded, and in the absence of a data model which would spell out the implications, we are probably making things unnecessarily complex and confusing with this addition.
What I would need to accept either proposal is a working definition of "dependent" and "independent", especially relative to "coordinate variable" and "auxiliary coordinate variable". Is it true that a "coordinate variable" must be independent? Is an "auxiliary coordinate variable" always dependent, or can it be either? or?
Regards, John
comment:45 follow-up: ↓ 46 Changed 6 years ago by jonathan
Dear John
Yes, we agree, this whole ticket is really about clarifying the data model. The clarifications being discussed are concerned with how CF-netCDF metadata are interpreted, and do not affect the legality of CF-netCDF files, although they do imply that some ways are better than others for encoding a given dataset. The part you quote last is the crucial issue in this debate.
I think you have made a good point, and thanks for that. In the draft data model as it stands, we distinguish dimension (Unidata, COARDS) coordinates and auxiliary coordinates purely on formal grounds (uniqueness, monotonicity, dimensionality, data type). That's because we based the data model on the CF-netCDF convention, of course. But I think you have correctly identified the conceptual distinction, and we should put that in the data model document as well, namely that the dimension coordinate variables are independent, and the auxiliary coordinate variables are dependent. It's because the dimension coordinates are independent that they must be unique and one-dimensional. Those formal properties don't help with a scalar coordinate, as you say, but the idea that it's an independent variable is still valid.
In some situations, a CF-netCDF file might have auxiliary coordinate variables of dimensions which do not have dimension coordinate variables. One situation is an axis for an unordered collection, such as an ensemble axis. In that case, I suppose the index along the ensemble dimension is the independent variable, in a sense, but that is not useful information since the ordering is arbitrary, and there's no need for explicit independent coordinates. In the case you mentioned earlier, of 2D lat and lon auxiliary coordinate variables if the 1D projection coordinates are not given, I would say that the auxiliary coordinates are still dependent on the projection coordinates. Even though the latter are absent, there are formulae which define the relationship.
Will this ticket be OK, do you think, if we add some text to insert a couple of sentences in the CF-netCDF standard, when the two kinds of coordinate variable are introduced, to point out their distinction in role of independence/dependence? If so, I'll draft some extra text for this ticket. Do you think we really need to define what independence and dependence mean in the CF-netCDF standard, or can we assume that people will understand them in their usual mathematical sense?
Cheers
Jonathan
comment:46 in reply to: ↑ 45 Changed 6 years ago by caron
Hi Jonathan and all:
Well, if we had to make a decision here, I would say that "coordinate variables" = "independent" and "auxiliary coordinates" = "dependent" would be the correct one. Obviously a very simple and powerful idea.
For all the examples I can think of, this interpretation seems reasonable to me. It would be good if others look at their data files and see if there is an important exception to this rule.
For gridded data, I agree that aux coordinates lat(x,y) and lon(x,y) are best thought of as dependent on coordinate variables x(x) and y(y), even when x(x) and y(y) are missing.
For point data, I agree that aux coordinates lat(time), lon(time) are best thought of as dependent on the coordinate variable time(time).
However, under this interpretation, it seems that a scalar auxiliary coordinate should be thought of as dependent, not independent. For example if "lat(time)" is actually constant, so one uses a scalar auxiliary coordinate "lat" in its place, it seems that it is still a dependent variable, and should not be thought of as adding another degree of freedom to the data, i.e. with some implicit dimension lat(1).
Conversely, if you want to indicate that the domain has another independent coordinate, you should dimension your data with it, even if that dimension happens to only be of length 1, and you should add a coordinate variable of length 1. A good example I run across a lot in GRIB data is vertical coordinates. If one could have multiple vertical levels, say on pressure levels, I will add a vertical dimension even if there's only one such level in the file. But for things like "Temperature at the tropopause" I won't add a vertical dimension. Obviously one could do it differently, but I think that's reasonable.
This interpretation has the advantage that one can add as many auxiliary coordinates as you need, without increasing the dimensionality of the domain. I think that's the essence of what it means to have n independent coordinates.
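The difference between the two readings can be made concrete by counting independent coordinates. This is a rough sketch with hypothetical names (`domain_dimensionality`, the dictionary layout); the flag selects between the reading above, where scalar coordinates are dependent, and the reading proposed in #104, where each numeric scalar coordinate implies its own size-one dimension.

```python
from typing import Dict, List, Tuple

Variables = Dict[str, Tuple[str, ...]]  # variable name -> its dimension names


def domain_dimensionality(
    variables: Variables,
    data_name: str,
    coordinates: List[str],
    scalars_are_independent: bool,
) -> int:
    """Count the independent coordinates of a data variable's domain."""
    count = len(variables[data_name])  # one per declared dimension
    if scalars_are_independent:
        # Each scalar coordinate is read as an implicit size-one dimension.
        count += sum(1 for name in coordinates if variables[name] == ())
    return count


variables = {
    "data": ("sample",),
    "lat": (),
    "lon": (),
    "heightAboveGround": (),
    "heightAboveMsl": (),
    "time": ("sample",),
}
coords = ["lat", "lon", "time", "heightAboveGround", "heightAboveMsl"]

print(domain_dimensionality(variables, "data", coords, False))  # 1
print(domain_dimensionality(variables, "data", coords, True))   # 5
```

Under the first reading, any number of scalar coordinates leaves the domain one-dimensional; under the second, each one adds a size-one axis.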
What do you think?
Regards, John
comment:47 in reply to: ↑ 40 Changed 6 years ago by markh
In caron, John provides us with a useful set of examples and reminds us that there are many uses of non-gridded data which predate CF 1.6 and the DSG implementations. I expect many of these use cases to continue, not making use of the DSG featureType attribute, which I do not think is mandated.
I would like to pick out one of John's examples, a definition of a collection of samples at a single location:
dimensions:
  sample = 39238923;
variables:
  float data(sample);
    data:coordinates = "lat lon time";
  float lat;
  float lon;
  float heightAboveGround;
  float heightAboveMsl;
  float time(sample);
In my naive and simple days of scalars I happily interpreted this as an observation collection at a point, a single location in space. My understanding of #104 is that, in the future, this is to be treated as exactly equivalent to:
dimensions:
  sample = 39238923;
  lat = 1;
  lon = 1;
  heightAboveGround = 1;
  heightAboveMsl = 1;
variables:
  float data(sample);
    data:coordinates = "lat lon time";
  float lat(lat);
  float lon(lon);
  float heightAboveGround(heightAboveGround);
  float heightAboveMsl(heightAboveMsl);
  float time(sample);
This seems an odd way of representing the data to me which doesn't fit comfortably with the neat flexibility I have seen in use in many datasets. I would avoid using scalar coordinate variables under any circumstances on this basis.
I am uncomfortable with this re-interpretation of John's example; I have numerous examples with similar characteristics.
- How, under the terms of #104, may I represent such data, such that it explicitly defines 1 degree of freedom and a collection of metadata elements, some of which are invariant with respect to the data variable?
comment:48 follow-up: ↓ 52 Changed 6 years ago by jonathan
Dear John
If you have a coordinate variable lat(time), I think it must be a trajectory - is that right? In Table 9.1, we define a trajectory as "a series of data points along a path through space with monotonically increasing times", so it is natural to record time as the independent coordinate, and latitude depends on it. In that situation, according to this ticket, you would not replace lat(time) with a scalar lat, even if the latitude were unchanging all the way along the trajectory, because doing so would incorrectly imply that latitude is independent. You have to keep lat(time) and fill it with repeated values. Actually, I think that is better anyway in this case, precisely because latitude really is a function of time. I think it would be strange to use a different way of storing a trajectory if the latitude values were [51, 51, 51, 51] on the one hand, or [51, 51.001, 51, 51] on the other.
For timeseries data, the latitude and time coordinates are independent. If it's a single timeseries, it will have only one latitude, and in that situation, according to this ticket, it would be fine to replace the size-one coordinate variable lat(lat) with the scalar lat, and omit the dimension lat=1.
If you have data on multiple vertical levels with a physical coordinate, it might have dimensions data(pressure,lat,lon). When you extract a single level, you reduce the dimension to pressure=1. The size-one pressure(pressure) is an independent coordinate. In this situation, CF permits you to replace it with a scalar pressure and omit the size-one dimension. You don't have to do this; it's a convenience feature, as the standard document says. According to this ticket, replacing pressure(pressure) with pressure does not alter the interpretation. It is still a size-one independent coordinate. This ticket says that the scalar coordinate implies the existence of a size-one dimension.
I agree that you do not add a vertical dimension for temperature at the tropopause, because the tropopause doesn't have a numerical coordinate to define it. CF provides standard_names for quantities that exist on special physically defined levels, for this reason. CF does not provide standard names for levels which can be specified with coordinate values. There isn't a standard name for temperature at 2.0 m height, because that can be specified with a size-one height coordinate. It could be either height(height) with dimension height=1, or a scalar height. According to this ticket, they both logically describe an independent size-one dimension.
CF has this convenience feature because it allows you to have many size-one independent coordinates without having to create netCDF dimensions for them. This ticket says that the domain logically has these size-one dimensions anyway.
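The equivalence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries are a hypothetical in-memory description of a file (not any netCDF library's API), the dimension sizes 73 and 96 are made up, and all scalar coordinates are assumed to be numeric.

```python
# Hypothetical in-memory description of a CF-netCDF file; not a real library API.
def expand_scalar_coordinates(dimensions, variables, data_name, coordinates):
    """Return copies of (dimensions, variables) in which every scalar
    coordinate of the data variable is given its own size-one dimension."""
    dims = dict(dimensions)
    new_vars = {name: tuple(d) for name, d in variables.items()}
    data_dims = list(new_vars[data_name])
    for name in coordinates:
        if new_vars[name] == ():          # scalar coordinate variable
            dims[name] = 1                # the implied size-one dimension
            new_vars[name] = (name,)      # pressure -> pressure(pressure)
            data_dims.append(name)        # position of a size-one axis is arbitrary
    new_vars[data_name] = tuple(data_dims)
    return dims, new_vars


# A single pressure level extracted from data(pressure,lat,lon); sizes are assumed.
dimensions = {"lat": 73, "lon": 96}
variables = {"data": ("lat", "lon"), "lat": ("lat",), "lon": ("lon",), "pressure": ()}
new_dims, new_vars = expand_scalar_coordinates(dimensions, variables, "data", ["pressure"])
```

Under the reading proposed in this ticket, the input and output descriptions would be interpreted identically; only the explicit netCDF dimension differs.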
Is that all right?
Cheers
Jonathan
comment:49 follow-up: ↓ 50 Changed 6 years ago by jonathan
Dear Mark
Concerning the example
dimensions:
  sample = 39238923;
variables:
  float data(sample);
    data:coordinates = "lat lon time";
  float lat;
  float lon;
  float heightAboveGround;
  float heightAboveMsl;
  float time(sample);
I would say this is not an ideal way of storing this data. I agree that, according to this ticket, it implies five independent dimensions. It is not illegal, but it is likely that it doesn't really describe the data's logical structure properly (of course, I am guessing at what the true situation is), in two ways:
- The sample dimension is really an independent coordinate of time. So a 1D (dimension) coordinate variable should be used for this, not an auxiliary.
- Height above ground and height above MSL are necessarily related. The above does not show their relationship.
Therefore, I think this data should probably be stored like this instead:
dimensions:
  time = 39238923;
  heightAboveGround = 1;
variables:
  float data(time,heightAboveGround);
    data:coordinates = "lat lon heightAboveMsl";
  float lat;
  float lon;
  float heightAboveGround(heightAboveGround);
  float heightAboveMsl(heightAboveGround);
  float time(time);
or the roles of heightAboveGround and heightAboveMsl could be exchanged.
The data actually has four independent coordinates, and one dependent coordinate. If we adopt this ticket, the interpretation is unambiguous. The original version has a different interpretation. It might cause a nuisance to the analyst that the dependencies had not been indicated in the file, but it shouldn't be hard to fix it up when the data is being processed.
As in previous discussions, I would argue that adopting this ticket doesn't make anything illegal which is currently legal, but in some cases it may lead to data-writers being clearer about their intentions, and that's a good thing, I would say.
Cheers
Jonathan
comment:50 in reply to: ↑ 49 Changed 6 years ago by markh
Replying to jonathan:
In the example lat, lon and both heights are invariant with respect to the data. This is not the same as stating that they are independent.
The data set has 1 degree of freedom, the sample; all other coordinate quantities are invariant with respect to the data or vary with the sample dimension. I'm afraid I don't see a good way of encoding this information with coordinates under this proposal.
comment:51 Changed 6 years ago by jonathan
Dear Mark
That's a matter of interpretation. There is a choice to be made with this dataset. One possibility is to regard it as the timeseries for a single point from a 4D dataset (time,vertical,lat,lon), in which case it is natural to regard the spatial coordinates as independent and with size one. That's how I represented it above.
Alternatively, you could regard it as a single timeseries from a set of timeseries at scattered points. In that case, the spatial coordinates are not independent; there is one independent coordinate for location and another for time. For several timeseries, you would have for instance (this is like Example H.2 in the CF document)
dimensions:
  time = 39238923;
  location = 10;
variables:
  float data(time,location);
    data:coordinates = "lat lon heightAboveGround heightAboveMsl";
  float lat(location);
  float lon(location);
  float heightAboveGround(location);
  float heightAboveMsl(location);
  float time(time);
If you regard the single-location timeseries as a special case of the above, with location=1, then according to this ticket you cannot drop the "location" dimension of the size-one coordinates. You have to keep it in order to show that they are related and do not vary independently of one another.
These are different points of view. Given the means of production of the data (e.g. was it selected from a GCM, or is it a single observational location), one of them is probably more natural than the other. The data-writer can choose, but it's easy for the analyst to convert one to the other if necessary, since the data are all the same, it's only a matter of size-one dimensions being included or not. The effect of this ticket is to give an unambiguous interpretation to the dataset, thus:
- numeric scalar coordinate is equivalent to size-one (Unidata, COARDS, dimension) coordinate, and is an independent variable.
- numeric size-one (CF) auxiliary coordinate is a dependent variable.
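As a rough sketch of these two rules, the following Python function classifies a coordinate from its name and dimensions alone. The function name and arguments are hypothetical, chosen for illustration; the handling of string-valued scalars follows the ticket description, which treats them as size-one auxiliary coordinates.

```python
def coordinate_role(name, var_dims, is_numeric=True):
    """Classify a coordinate under the reading proposed in this ticket.

    name     -- variable name
    var_dims -- tuple of the variable's dimension names, () for a scalar
    """
    if var_dims == ():
        # A numeric scalar stands for a size-one dimension coordinate;
        # a string scalar is regarded as a size-one auxiliary coordinate.
        return "independent" if is_numeric else "dependent"
    if var_dims == (name,) and is_numeric:
        # (Unidata, COARDS, dimension) coordinate variable
        return "independent"
    # Anything else is an auxiliary coordinate variable
    return "dependent"


roles = {
    "scalar lat": coordinate_role("lat", ()),
    "lat(lat)": coordinate_role("lat", ("lat",)),
    "lat(location)": coordinate_role("lat", ("location",)),
    "time(time)": coordinate_role("time", ("time",)),
}
```

Here the scalar lat and lat(lat) both come out as independent, while lat(location) comes out as dependent even when location has size one.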
I think this is nice and simple. I am getting a sense of deja vu, though.
Best wishes
Jonathan
comment:52 in reply to: ↑ 48 Changed 6 years ago by caron
Hi Jonathan:
I think the core of the disagreement might be alternative ways to understand what the domain of the variable is. We agree that the dimensionality of the domain is the number of independent coordinates. So let's just clarify how we each interpret the number of independent coordinates in a few example cases.
First, let me use radar data to illustrate the difference between domain and range dimensionality. I'm guessing you may not disagree here, but I think it may help others to follow along. Consider a sweep from a scanning radar mounted on an airplane. This might look like:
float elev(y, x);
float azimuth(y, x);
float radius(y, x);
float data(y, x);
  data:coordinates = "azimuth elev radius";
I think of the data as 2 dimensional, embedded in 3 dimensional space. So the number of independent coordinates is 2, and the total number of coordinates is 3.
OK, so what's the dimensionality of a trajectory? In the CDM data model, it's considered one-dimensional, because it can be represented as:
float lat(sample);
float lon(sample);
float alt(sample);
float time(sample);
float data(sample);
  data:coordinates = "lat lon alt time";
The idea is that it looks like a 1D line embedded in 4D space/time.
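A minimal Python sketch of this distinction, assuming the number of independent coordinates is simply the number of dimensions of the data variable and the total is the number of names in its coordinates attribute (the helper name is hypothetical):

```python
def dimensionality(data_dims, coordinates_attr):
    """Return (number of data dimensions, number of listed coordinates)."""
    return len(data_dims), len(coordinates_attr.split())


# Radar sweep: data(y, x) with coordinates "azimuth elev radius"
radar = dimensionality(("y", "x"), "azimuth elev radius")

# Trajectory: data(sample) with coordinates "lat lon alt time"
trajectory = dimensionality(("sample",), "lat lon alt time")
```

The radar sweep gives (2, 3), a 2D surface in 3D space, and the trajectory gives (1, 4), a 1D line in 4D space/time.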
Since time is almost always monotonic, it's nice to use:
float lat(time);
float lon(time);
float alt(time);
float time(time);
float data(time);
  data:coordinates = "lat lon alt time";
What about time series data?
Well, it could also look like:
float lat(time);
float lon(time);
float alt(time);
float time(time);
float data(time);
  data:coordinates = "lat lon alt time";
although
float lat;
float lon;
float alt;
float time(time);
float data(time);
  data:coordinates = "lat lon alt time";
is really nice in making it clear from the structural metadata alone that all the data is at one point.
Note that under proposal #104, if you want to preserve the idea that a time series has only one degree of freedom, it would be difficult, since scalar coordinates imply that they are independent. Your only choice would be the one above that looks just like a trajectory.
When there "really is" another degree of freedom, then adding another explicit dimension seems natural, as we agree with height coordinates for gridded data.
comment:53 follow-up: ↓ 54 Changed 6 years ago by jonathan
Dear John
We agree about the 2D radar data (two independent dimensions) and the trajectory (one independent dimension). Table 9.1 defines a timeseries as "a series of data points at the same spatial location with monotonically increasing times". A collection of timeseries in a discrete sampling geometry has data(i,o) and mandatory coordinates x(i) y(i) t(i,o), where i is the instance dimension and o the element dimension i.e.
float lat(stations);
float lon(stations);
float time(time);
float temp(stations,time);
  temp:coordinates = "lat lon";
in the orthogonal representation, like example H.2. In this representation, time is a 1D coordinate variable, because it's the same for all the stations, so it does not have the instance dimension. If there is only one station, you can keep the size-one instance dimension stations=1, or you can omit it (last para of sect 9.2):
float lat;
float lon;
float time(time);
float temp(time);
  temp:coordinates = "lat lon";
like example H.4. According to this ticket, the interpretations are different. If you keep the size-one dimension, you are making explicit that lat and lon are related: they are two coordinates which share a single dimension of the domain. If you drop the size-one dimension, lat and lon are independent dimensions. That means the single timeseries is equivalent to
float lat(lat);
float lon(lon);
float time(time);
float temp(time,lat,lon);
with lat=1 and lon=1. That is, it's regarded as having been extracted from a 2D array of timeseries. Physically, this is a perfectly fine interpretation; that might indeed be how the single timeseries was obtained.
Therefore I think that the two interpretations are physically distinct, and this ticket recognises them as distinct. However, it is clearly not difficult to convert between them, as the data is completely unaffected; they only differ through the presence and absence of size-one dimensions.
However, this kind of data:
float lat(time);
float lon(time);
float time(time);
float temp(time);
  temp:coordinates = "lat lon";
is not a timeseries feature, according to Table 9.1. It's a trajectory. In a timeseries, the x and y coordinates do not vary with the element dimension (time in this case).
Best wishes
Jonathan
comment:54 in reply to: ↑ 53 ; follow-up: ↓ 55 Changed 6 years ago by caron
Hi Jonathan:
Suppose that the data provider would prefer to represent her time-series data as having a single independent coordinate, i.e. as 1-dimensional? How would she do that under ticket #104?
Regards, John
comment:55 in reply to: ↑ 54 ; follow-up: ↓ 56 Changed 6 years ago by jonathan
Dear John
Suppose that the data provider would prefer to represent her time-series data as having a single independent coordinate, i.e. as 1-dimensional? How would she do that under ticket #104?
If the single timeseries is recorded like this:
dimensions:
  station = 1;
  time = NNN;
variables:
  float lat(station);
  float lon(station);
  float time(time);
  float temp(station,time);
    temp:coordinates = "lat lon";
it would be interpreted as having two independent dimensions, one of space and one of time. The space dimension has size one and lat and lon both depend on it. Is that what you mean by 1D? It can't be truly 1D unless you omit the spatial information altogether. A timeseries discrete sampling geometry must have at least horizontal coordinates.
If the single timeseries is recorded like this:
dimensions:
  time = NNN;
variables:
  float lat;
  float lon;
  float time(time);
  float temp(time);
    temp:coordinates = "lat lon";
or like this:
dimensions:
  lon = 1;
  lat = 1;
  time = NNN;
variables:
  float lon(lon);
  float lat(lat);
  float time(time);
  float temp(time,lat,lon);
it would be interpreted as having three independent dimensions, of longitude, latitude and time, with the longitude and latitude both having size one. The main point of this ticket is that these last two representations are logically equivalent.
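To make the counting concrete, here is a small Python sketch that applies the reading proposed in this ticket to the three layouts above. It is an illustration only: the dictionaries are a hypothetical description of each file (variable name mapped to its dimensions), and all coordinates are assumed to be numeric.

```python
def independent_dimensions(variables, data_name, coordinates=()):
    """Independent dimensions of a data variable under the ticket's reading:
    its explicit netCDF dimensions plus one implied size-one dimension
    for each numeric scalar coordinate."""
    independent = list(variables[data_name])
    for name in coordinates:
        if variables[name] == ():   # scalar coordinate variable
            independent.append(name)
    return independent


# The three single-timeseries layouts discussed above
dsg_single_station = {"lat": ("station",), "lon": ("station",),
                      "time": ("time",), "temp": ("station", "time")}
scalar_form = {"lat": (), "lon": (), "time": ("time",), "temp": ("time",)}
grid_point = {"lon": ("lon",), "lat": ("lat",),
              "time": ("time",), "temp": ("time", "lat", "lon")}

n_dsg = len(independent_dimensions(dsg_single_station, "temp", ["lat", "lon"]))
n_scalar = len(independent_dimensions(scalar_form, "temp", ["lat", "lon"]))
n_grid = len(independent_dimensions(grid_point, "temp"))
```

The station layout yields two independent dimensions, while the scalar layout and the explicit lon=1, lat=1 layout both yield three.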
So there are two logically distinct ways of representing a single timeseries. Which one you choose depends on whether you regard it as a single feature from a discrete sampling geometry (scattered points), or as the time-dependent data from a single point on a grid.
Best wishes
Jonathan
comment:56 in reply to: ↑ 55 Changed 6 years ago by caron
Hi Jonathan:
I think that this is the root of the disagreement/misunderstanding. Just as radar data is 2-dimensional (has 2 independent coordinates) but is "embedded" in 3D space (has 3 coordinates needed to represent its position on the earth), so too can time-series data be understood as a 1-dimensional subspace ("manifold") embedded in 4D space. But I don't think that proposal #104 allows that possibility.
Regards, John
comment:57 Changed 6 years ago by jonathan
Dear John
I don't understand, I'm afraid. I think it is natural to regard a single timeseries as a kind of core bored, or skewered, through 4D space-time, at a particular fixed single point (with size-one x and y coordinates, required in a timeseries DSG, and perhaps a size-one z coordinate), ranging over all times. Both the logical representations above have that idea, I would say; they differ according to whether you regard this single timeseries as one from a number of scattered points in a DSG, or one from a point selected in 2D or 3D continuous space. This ticket does not change what is allowed or disallowed wrt the present convention.
Best wishes
Jonathan
comment:58 follow-up: ↓ 59 Changed 6 years ago by caron
Hi Jonathan:
This representation:
variables:
  float lat;
  float lon;
  float time(time);
  float temp(time);
    temp:coordinates = "lat lon";
would be interpreted as having three independent dimensions. But that's incorrect, or at least should be optional. It should be possible to interpret this as having only one independent coordinate, i.e. "1D". But there's no way to represent that in proposal #104.
The more general issue is that the number of dimensions and the number of coordinates have different meanings. Interpreting a scalar coordinate as a dimension 1 coordinate (meaning independent) complicates things without any gain that I can see. Better is to let it be understood as a dependent variable, and if you want to indicate that it's an independent coordinate, then you have to make it a coordinate variable, i.e. give it a dimension of length 1, and add that dimension to the data.
OTOH, there's nothing special about a scalar coordinate, and it should not be handled in a special way in the data model. It's just an auxiliary coordinate, period. By the current definition of coordinate and auxiliary coordinate, it's clearly an auxiliary coordinate.
What this discussion has helped clarify for me is that coordinate variables are independent, and auxiliary coordinates are dependent variables. I think that's a really valuable advance in the data model.
Things are complicated a bit by the DSG representations that introduce, e.g., a station dimension, which allows one to factor out the station info from the observation. Since all we have in the classic model are multidimensional arrays, one has to use dimensions for lots of things, not just to indicate the domain dimensionality. I guess we should look hard at the DSG representation to see what it says about the assertion that "coordinate variables are independent, and auxiliary coordinates are dependent variables".
regards, John
comment:59 in reply to: ↑ 58 ; follow-up: ↓ 60 Changed 6 years ago by jonathan
Dear John
What this discussion has helped clarify for me is that coordinate variables are independent, and auxiliary coordinates are dependent variables. I think that's a really valuable advance in the data model.
I agree. We will put that in the data model.
OTOH, there's nothing special about a scalar coordinate, and it should not be handled in a special way in the data model. It's just an auxiliary coordinate, period. By the current definition of coordinate and auxiliary coordinate, it's clearly an auxiliary coordinate.
It is formally an auxiliary coordinate variable, but not logically so. That's a point that Steve made and is included in the motivation for this ticket - now a long way back in comment 28:
Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables. There is, however, a key difference between the interpretation of scalar coordinate variables and auxiliary coordinate variables. Scalar coordinates have the same status in a CF file as (conventional, Unidata, COARDS) coordinates in which the dimension name and the variable name match. These coordinates define the independent variables (spatiotemporal and others) for the data variable. Auxiliary coordinate variables provide extra information as a function of these independent variables, as alternative numeric values (which don't have to be unique or monotonic along a given dimension), or string-valued labels. To indicate that a variable is intended to be an auxiliary coordinate variable, it is necessary to give it a dimension, in order to show which coordinate variable(s) it belongs to. Numeric scalar coordinate variables are not to be interpreted as auxiliary coordinate variables.
Yes, the single timeseries representation with scalar lat and lon coordinates would, according to this ticket, be regarded as having independent dimensions of lat and lon. That is a possible interpretation, and I would say it's the preferred one in the case that the timeseries was extracted from a gridded dataset.
I would argue that it is possible, according to this ticket, to store the data with the interpretation you prefer, that it's one timeseries from a DSG. That is what it means if we use auxiliary coordinates i.e.
dimensions:
  station = 1;
  time = NNN;
variables:
  float lat(station);
  float lon(station);
  float time(time);
  float temp(station,time);
    temp:coordinates = "lat lon";
In this case, the station dimension is a discrete axis. CF section 4.5 says a discrete axis "indicates either an ordered list or an unordered collection, and does not correspond to any continuous coordinate variable." So it is a netCDF dimension, but it doesn't correspond to any independent physical dimension. It's just an index; only time is an independent physical dimension. Isn't that what you want?
Consider the case with station=2 but otherwise the same. Now you have two 1D timeseries, each with a single independent physical dimension of time. The other netCDF dimension is just an index to bundle them up. You could put exactly the same data in one of the chapter 9 ragged representations. Then the data array would have only one netCDF dimension, but the station dimension would still exist, and you would still have lat(station) and lon(station) just as above. Again, the station dimension is simply an index. Would you agree that this does not add any physical dimensions to the individual timeseries?
Best wishes
Jonathan
comment:60 in reply to: ↑ 59 Changed 6 years ago by caron
Hi all:
I would argue that it is possible, according to this ticket, to store the data with the interpretation you prefer, that it's one timeseries from a DSG. That is what it means if we use auxiliary coordinates i.e.
dimensions:
  station = 1;
  time = NNN;
variables:
  float lat(station);
  float lon(station);
  float time(time);
  float temp(station,time);
    temp:coordinates = "lat lon";
In this case, the station dimension is a discrete axis. CF section 4.5 says a discrete axis "indicates either an ordered list or an unordered collection, and does not correspond to any continuous coordinate variable." So it is a netCDF dimension, but it doesn't correspond to any independent physical dimension. It's just an index; only time is an independent physical dimension. Isn't that what you want?
Consider the case with station=2 but otherwise the same. Now you have two 1D timeseries, each with a single independent physical dimension of time. The other netCDF dimension is just an index to bundle them up. You could put exactly the same data in one of the chapter 9 ragged representations. Then the data array would have only one netCDF dimension, but the station dimension would still exist, and you would still have lat(station) and lon(station) just as above. Again, the station dimension is simply an index. Would you agree that this does not add any physical dimensions to the individual timeseries?
I agree with your interpretation and that this is a possible way to represent 1D data.
It is formally an auxiliary coordinate variable, but not logically so. That's a point that Steve made and is included in the motivation for this ticket - now a long way back in comment 28:
Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables. There is, however, a key difference between the interpretation of scalar coordinate variables and auxiliary coordinate variables. Scalar coordinates have the same status in a CF file as (conventional, Unidata, COARDS) coordinates in which the dimension name and the variable name match. These coordinates define the independent variables (spatiotemporal and others) for the data variable. Auxiliary coordinate variables provide extra information as a function of these independent variables, as alternative numeric values (which don't have to be unique or monotonic along a given dimension), or string-valued labels. To indicate that a variable is intended to be an auxiliary coordinate variable, it is necessary to give it a dimension, in order to show which coordinate variable(s) it belongs to. Numeric scalar coordinate variables are not to be interpreted as auxiliary coordinate variables.
I don't see the motivation in this comment; it appears to be a statement of the proposal. Also, this statement of the purpose of auxiliary coordinates misses the critical point that they can represent the embedding of the data domain into a higher-dimensional space (think of 2D radar in 3D Cartesian space).
Here's why I think scalar coordinates need to be understood as auxiliary coordinates. Suppose you want to add a length=1 coordinate, and you now get to decide if it's dependent or independent. If independent, you must add a new dimension to the data, and if you want it to be a dependent variable, you must not add a new dimension to the data. In the independent case, you can use a standard coordinate variable lat (lat=1). But in the dependent case, you can't use anything other than the scalar form. So we need to leave that form as indicating an auxiliary coordinate.
The example that I've been using to illustrate this is:
1D data:
  dimensions:
    time = 23236727;
  variables:
    float data(time);
      data:coordinates = "lat lon time";
    float time(time);
    float lat;
    float lon;
3D data:
  dimensions:
    lat = 1;
    lon = 1;
    time = 23236727;
  variables:
    float data(time, lat, lon);
      data:coordinates = "lat lon time";
    float time(time);
    float lat(lat);
    float lon(lon);
The dimensionality of the data indicates the dimensionality of the domain. The number of coordinates represents the dimensionality of the range. If you lose track of that then none of this really matters.
Regards John
comment:61 Changed 6 years ago by stevehankin
I think this is our question: When a scalar coordinate is associated with a data variable through a coordinates attribute, is the proper interpretation that the scalar coordinate is an independent axis of length 1? Or simply a dependent variable?
The current text is internally contradictory (*). But I like the free wheeling approach that is found in this definition:
scalar coordinate variable
A scalar variable that contains coordinate data. Functionally equivalent to either a size one coordinate variable or a size one auxiliary coordinate variable.
I suspect that we should continue to live with this ambiguity. To emphasize this perspective consider example H.5 (Single time series, including deviations from a nominal fixed spatial location http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.6/cf-conventions.html#idp8320208 ) in which scalar variables 'lat' and 'lon' may be regarded as independent or dependent depending on the context in which the file is being used.
This ambiguity is (I think) a challenge for the data model rather than for CF, itself. Can it be addressed by redefining some concepts inside of the data model, without altering CF? (So easy to lose track of the question. Sorry if I have.)
====
(*) The text also says: "Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable."
comment:62 Changed 6 years ago by caron
Hi Jonathan, Steve and all:
Whatever the outcome of this discussion, I feel like it has really helped me understand the meaning of dependent/independent coordinate variables.
Most data providers are not thinking about this when they design their files, I'm sure, so it remains to be seen if existing CF files are consistent in the use of coordinate/auxiliary coordinate as indicating dependence/independence.
Would love to hear others' perspectives, and I'll be out of touch next week.
Kind regards, John
comment:63 Changed 6 years ago by jonathan
Dear Steve and John
Thanks for your additions.
Yes, as Steve says, this discussion is entirely about interpretation and the data model. The current text can be seen as contradictory. This ticket proposes to remove the ambiguity by following the text, "Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable." That follows what Steve said earlier, and I quoted: "Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables." I think that the definition appears to be ambiguous because there are two kinds of scalar coordinate variable mentioned in the CF document. There are numeric scalar coordinate variables, which were introduced to mean the same thing as size-one numeric coordinate variables, and there are string-valued scalar coordinate variables, which have to be logically regarded as auxiliary coordinate variables because string-valued coordinate variables are not possible in netCDF. This ticket proposes to clarify that as well.
We do not think that the ambiguity is desirable or necessary, and it was not intended. We could retain the ambiguity by creating a third class of coordinate (not dimension coordinate, not auxiliary coordinate) in the data model, as Mark argues in ticket 105. David and I think that's an unnecessary complexity which has not been envisaged in the design of CF up to now, and that it's better for data-writers to indicate what their data mean.
I'd like to understand John's point but it looks like I'll have to wait for a week. :-) The problem I'm having with it is shown by the example implied in my last posting:
    dimensions:
        station=2;
        time=NNN;
    variables:
        float lat(station);
        float lon(station);
        float time(time);
        float temp(station,time);
            temp:coordinates="lat lon";
This is a discrete sampling geometry with two timeseries. The data variable has two dimensions. John writes, "The dimensionality of the data indicates the dimensionality of the domain. The number of coordinates represents the dimensionality of the range." So what is the dimensionality of the data here? I think, by expressing it as a DSG, we are saying the data is physically one-dimensional (time). The other netCDF dimension is just an index, a discrete axis. If you like, you can regard lat and lon as a function of this index. It's like an ensemble axis, which is also not a physical coordinate. If this is not true, I do not understand how chapter 9 can possibly meet John's requirement that timeseries should be regarded as physically one-dimensional. In my opinion, if station=1 in the above example, we have the same interpretation of it. The data is still physically one-dimensional, though formally two-dimensional.
In order to decide about this ticket, the CF committee has been asked to vote on it. So far, Nan, Roy, Russ, Martin, Rich, Karl and I have given support. It would be good to hear from Balaji, Philip, Bryan and Alison, as well as more from John and Steve, when possible.
Best wishes
Jonathan
comment:64 in reply to: ↑ 28 Changed 6 years ago by jonathan
Since there have been a couple of changes since the proposal was last stated, here it is again in its current form:
Overview, or motivation
Scalar coordinate variables provide a convenient way to encode coordinate variables of size one. They do so by borrowing the syntax that is otherwise used for auxiliary coordinate variables. There is, however, a key difference between the interpretation of scalar coordinate variables and auxiliary coordinate variables. Scalar coordinates have the same status in a CF file as (conventional, Unidata, COARDS) coordinates in which the dimension name and the variable name match. These coordinates define the independent variables (spatiotemporal and others) for the data variable. Auxiliary coordinate variables provide extra information as a function of these independent variables, as alternative numeric values (which don't have to be unique or monotonic along a given dimension), or string-valued labels. To indicate that a variable is intended to be an auxiliary coordinate variable, it is necessary to give it a dimension, in order to show which coordinate variable(s) it belongs to. Numeric scalar coordinate variables are not to be interpreted as auxiliary coordinate variables.
This is the change to the convention:
Section 5.7, Scalar coordinate variables.
Replace
When a variable has an associated coordinate which is single-valued, that coordinate may be represented as a scalar variable. Since there is no associated dimension these scalar coordinate variables should be attached to a data variable via the coordinates attribute.
with
When a variable has an associated coordinate which is single-valued, that coordinate may be represented as a scalar variable (i.e. a data variable which has no netCDF dimensions). Since there is no associated dimension these scalar coordinate variables should be attached to a data variable via the coordinates attribute.
Replace
Under COARDS the method of providing a single valued coordinate was to add a dimension of size one to the variable, and supply the corresponding coordinate variable. The new scalar coordinate variable is a convenience feature which avoids adding size one dimensions to variables. Scalar coordinate variables have the same information content and can be used in the same contexts as a size one coordinate variable.
with
The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables. A numeric scalar coordinate variable has the same information content and can be used in the same contexts as a size one numeric coordinate variable. Similarly, a string-valued scalar coordinate variable has the same meaning and purposes as a size one string-valued auxiliary coordinate variable (Section 6.1).
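As an illustration only (not part of the proposed text), the equivalence stated for the numeric case can be sketched with two CDL fragments. The variable names, dimension sizes and units are hypothetical, chosen just to show the pattern; under the proposed wording the two encodings would carry the same information.

```cdl
// Form A: numeric scalar coordinate variable.
// No size one dimension is added to the data variable;
// the scalar is attached through the coordinates attribute.
dimensions:
  lat = 18 ;
  lon = 36 ;
variables:
  float temp(lat, lon) ;
    temp:coordinates = "height" ;
  float lat(lat) ;
  float lon(lon) ;
  float height ;              // scalar: no netCDF dimensions
    height:units = "m" ;

// Form B: equivalent size one numeric coordinate variable.
// The data variable gains a dimension of size one, and the
// coordinate variable has the same name as that dimension.
dimensions:
  height = 1 ;
  lat = 18 ;
  lon = 36 ;
variables:
  float temp(height, lat, lon) ;
  float height(height) ;
    height:units = "m" ;
  float lat(lat) ;
  float lon(lon) ;
```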
At the end of the section, add:
If a data variable has two or more scalar coordinate variables, they are regarded as though they were all independent coordinate variables with dimensions of size one. If two or more single-valued coordinates are not independent, but have related values (this might be the case, for instance, for time and forecast period, or vertical coordinate and model level number, Section 6.2), they should be stored as coordinate or auxiliary coordinate variables of the same size one dimension, not as scalar coordinate variables.
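A sketch of the distinction drawn in this paragraph, again as an illustration rather than proposed text. The names, sizes and units below are assumptions. In the first fragment the two single-valued coordinates are unrelated, so scalar coordinate variables are appropriate; in the second, forecast period is related to time, so both are stored on the same size one dimension.

```cdl
// Independent single-valued coordinates: scalar coordinate variables.
// Each is read as its own independent axis of size one.
dimensions:
  lat = 18 ;
  lon = 36 ;
variables:
  float temp(lat, lon) ;
    temp:coordinates = "time height" ;
  double time ;
    time:units = "days since 1990-1-1 0:0:0" ;
  float height ;
    height:units = "m" ;

// Related single-valued coordinates: not scalar coordinate variables.
// forecast_period is an auxiliary coordinate on the size one
// time dimension, which shows that it belongs to the time axis.
dimensions:
  time = 1 ;
  lat = 18 ;
  lon = 36 ;
variables:
  float temp(time, lat, lon) ;
    temp:coordinates = "forecast_period" ;
  double time(time) ;
    time:standard_name = "time" ;
    time:units = "days since 1990-1-1 0:0:0" ;
  double forecast_period(time) ;
    forecast_period:standard_name = "forecast_period" ;
    forecast_period:units = "hours" ;
```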
Section 6.1, Labels
Replace the last sentence
If a character variable has only one dimension (the maximum length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables).
with
If a string-valued auxiliary coordinate variable has only one dimension (the maximum length of the string), it is a string-valued scalar coordinate variable (see Section 5.7, Scalar Coordinate Variables). As such, it has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature.
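For illustration only, a string-valued scalar coordinate variable of the kind this sentence describes might look like the fragment below. The names station_name and strlen and the sizes are hypothetical. The character variable's single dimension is the maximum string length, so it holds one label for the whole data variable.

```cdl
dimensions:
  time = 100 ;
  strlen = 20 ;                 // maximum length of the string
variables:
  float temp(time) ;
    temp:coordinates = "station_name" ;
  double time(time) ;
    time:units = "days since 1990-1-1 0:0:0" ;
  char station_name(strlen) ;   // one string: a string-valued scalar coordinate
```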
In addition, when/if we return to the data model discussion, we should note that (dimension) coordinate constructs are for independent coordinates, and auxiliary coordinate constructs for dependent coordinates, as John's comments have suggested.
Cheers
Jonathan
comment:65 Changed 6 years ago by apamment
After careful reading of both tickets #104 and #105 and the relevant sections of CF 1.6, I have come to the conclusion that I support the approach taken in ticket 104. Therefore I vote to accept #104 and reject #105.
I agree that the original intention in section 5.7 seems to have been that scalar coordinate variables represent independent axes. Scalar coordinates are stated to be an alternative to adding a size one dimension, which implies that each one does indeed represent an independent dimension of the data variable. I can't help feeling that if CF had stuck to the COARDS way of doing things, i.e., always requiring the addition of size 1 dimensions to data variables, then this ambiguity about whether scalar coordinates are independent or auxiliary would not have arisen. Perhaps this is something to bear in mind if we are tempted to add any more "convenience features" in the future.
I can see how the kind of practices highlighted by Mark may have arisen, but I think it is highly undesirable for data files to be written in which the relationship between the coordinates and auxiliary coordinates is ambiguous. I fully agree with Jonathan and David's position that users of data are free to interchange coordinates and auxiliary coordinates as appropriate for their application of the data. Furthermore, I think the ability to do this in a meaningful way (i.e. correctly interpreting the data) is enhanced by making the relationships between the possible alternative coordinates clear at the time of writing. Not to document these relationships is to supply incomplete and possibly misleading metadata which diminishes the opportunities to reuse the data and interpret it appropriately. I certainly do not think we should change the convention to encourage ambiguous use of either coordinate or auxiliary coordinate variables, hence my rejection of the proposal in ticket 105.
I confess to having become rather bogged down when reading the discussion of dimensionality versus degrees of freedom, but John's examples helped me to understand (eventually!) the issue. My feeling is to go with Jonathan's way of doing things, i.e., that scalar coordinates are independent and data with only one real degree of freedom, like John's time series, should be described by adding auxiliary coordinates of size one. I think John's idea of making scalar coordinates dependent is in some ways simpler, but it is a more radical change than the one Jonathan is proposing. I had a look at the CF 1.6 conformance document to see what it says about scalar coordinates. They are referred to in section 7.3 (and also, of course, in the corresponding section of the conventions). The text on cell methods clearly treats scalar coordinates as separate dimensions along which statistical processing may have taken place. For this reason, I think that changing scalar coordinates to being dependent rather than independent is in fact a break with the way things have been done in the past and it risks making CF1.7 incompatible with earlier versions.
Regardless of whether we ultimately choose to declare scalar coordinates as dependent or independent, I would advocate the inclusion in the conventions document of as many examples as are necessary to fully illustrate the intended use of both scalar coordinates and auxiliary coordinates. In this way, we can hope to avoid future ambiguities and misunderstandings. Maybe we could even include clearly signposted "Do" and "Don't" examples, showing first how scalar coordinate variables are meant to be used and also how they are not meant to be used.
Best wishes, Alison
comment:66 Changed 6 years ago by jonathan
Dear all
Thank you for your thoughtful comments, Alison.
Following the above useful discussions with John and Steve, and further email discussions, I'd like to add one more change to this proposal:
In Section 1.2, change the definition
scalar coordinate variable: A scalar variable that contains coordinate data. Functionally equivalent to either a size one coordinate variable or a size one auxiliary coordinate variable.
to
scalar coordinate variable. A scalar variable (i.e. one with no dimensions) that contains coordinate data. Depending on context, it may be functionally equivalent either to a size-one coordinate variable (Section 5.7) or to a size-one auxiliary coordinate variable (Sections 6.1 and 9.2).
This is in order to define what "scalar" means, and to provide pointers to the relevant parts of the document where the role of scalar coordinate variables is clarified.
It's become clear from these discussions that it would be better to indicate the role of scalar coordinate variables explicitly. I will make a proposal in another ticket to define new cf_role values for that purpose. However, if agreed, that wouldn't be implemented until CF 1.7 or 1.8. At the moment we are discussing the interpretation and the data model of CF 1.5. None of the changes proposed in this ticket intend to change anything about CF-netCDF; they are concerned only with clarifying what CF 1.5 means. I don't want to make a proposal to modify CF metadata in this ticket, because I fear that it would delay the conclusion of the data model discussion, which has already been in progress for a couple of years (beginning in ticket 68)!
If this ticket is agreed, we can go back to continuing the data model discussion in ticket 98. Following other comments by John, we will there propose inserting some text about the correspondence between dimension/auxiliary and independent/dependent coordinates in the data model.
Cheers
Jonathan
comment:67 Changed 6 years ago by markh
I appreciate the clarifications and thoughts from all parties involved in this discussion, and the patience shown with me.
The community has reached its view, and I have no further objections to the principle of the approach stated in this ticket. I would like to see the discussions on this ticket conclude with suitable clarification text in time for the 1.7 conventions document.
I will consider the text as posted above and add any comments I have on wording as part of concluding this ticket.
mark
comment:68 Changed 6 years ago by caron
ok with me
comment:69 Changed 6 years ago by jonathan
A month has passed with no new comments being made. There are no outstanding objections and the ticket has enough support to be accepted according to the rules, so I presume that it is concluded and this change should be included in CF version 1.7, in the form shown in comment 64 and comment 66. Thanks to all the participants in the discussion.
comment:70 Changed 3 years ago by davidhassell
- Owner changed from cf-conventions@… to davidhassell
- Status changed from new to accepted
comment:71 Changed 2 years ago by painter1
- Resolution set to fixed
- Status changed from accepted to closed
Dear Jonathan and David,
For the most part I like your changes and vote for them. Only one minor suggestion on the first alteration: I think it would be even clearer if we split the second sentence:
Best regards,
Martin