Opened 4 years ago

Last modified 3 years ago

#95 new task

Development of CF 1.5 Data Model

Reported by: markh
Owned by: cf-conventions@…
Priority: medium
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc:

Description

This ticket follows on from the agreement of terms of reference #88 for an implementation neutral model for CF. The agreed terms of reference are restated here.

Scope, Terms and Conditions

  1. The CF community will adopt a data model as part of the CF Metadata Project.
  2. The data model will be a complementary resource to the:
    • CF Conventions Document
    • CF Standard Name Table
    • CF Conformance Requirements & Recommendations
    • Guidelines for Construction of CF Standard Names
  3. The data model will be maintained by the community, using the same mechanisms as are used for the conventions, conformance and standard_name documents.
  4. A version of the data model will be published at the same time as or as soon as possible after each version of the CF conventions, consistent with that version and having the same version number, beginning from version 1.5.
  5. Discussions of proposed changes to the CF conventions should consider consistency with the data model. If inconsistencies exist, these should be addressed, either by altering the proposal or by proposing a change to the data model.
  6. Equally, consideration of the data model may motivate changes to the CF conventions. In this case proposed changes to the conventions will be discussed and agreed using the current mechanisms.
  7. The responsibility for maintaining the data model and for its consistency with the CF conventions will belong to a new committee, but anyone may propose changes to the data model in the same way as changes to the CF conventions.
  8. The scope of the data model is to define the concepts of CF and the relationships that exist between these concepts.
  9. The data model provides a logical, implementation neutral, abstraction of the concepts defined by CF.
  10. The data model does not define the interface to CF.

Benefits

The data model is believed to offer the following benefits:

  • Providing an orientation guide to the CF Conventions Document
  • Guiding the development of software compatible with CF
  • Facilitating the creation of an Application Programming Interface which 'behaves/feels like CF' and is intuitive to use.
  • Providing a reference point for gap analysis and conflict analysis of the CF specification
  • Providing a communication tool for discussing CF concepts and proposals for changes to the CF specification
  • Setting the ground work to expand CF beyond netCDF files.

Activity

The discussions in #68 have developed many of the ideas for this work. The objective of this ticket is to build on the work of #68 to agree the details of the CF data model consistent with version 1.5 of the CF Conventions for NetCDF.

The draft data model is being worked on by the community on the CF Trac wiki:

The draft CF Data Model v1.5

The final data model details will be placed in:

The CF Data Model v1.5

upon agreement to this ticket pending publication by the CF community.

Change History (107)

comment:1 follow-up: Changed 4 years ago by jonathan

Dear Mark

David and I opened ticket 68 with a proposed data model. You offered to be the moderator of that ticket. It seems to me that we have been having a productive discussion and we are almost in agreement, with the only outstanding issues being whether axes have an independent existence (which you and I have been discussing by email) and whether our proposal for transforms is adequate (on which you have some ideas that you're going to present).

I think that the correct procedure at this point on ticket 68 would be to request that the discussion about time representation be continued elsewhere. Apart from that, that ticket is on course to agree a data model. The aim has always been to be implementation-neutral.

If, as moderator of ticket 68, you think it would clarify the discussion to move it to a new ticket, surely it should carry on from its current position. David has an up-to-date version of the HTML diagram, which he has not yet published on ticket 68 because we have been waiting to see if we could reach agreement with you by email concerning the point about domain axes. If it would facilitate the discussion, the HTML document of ticket 68 could be converted into a wiki for further discussion. We also have a UML diagram on that ticket, which David intends to keep up to date as the discussion proceeds. I don't think we should start all over again!

Jonathan

comment:2 in reply to: ↑ 1 Changed 4 years ago by markh

Replying to jonathan:

Dear Mark

David and I opened ticket 68 with a proposed data model. You offered to be the moderator of that ticket. It seems to me that we have been having a productive discussion and we are almost in agreement, with the only outstanding issues being whether axes have an independent existence (which you and I have been discussing by email) and whether our proposal for transforms is adequate (on which you have some ideas that you're going to present).

I think that the correct procedure at this point on ticket 68 would be to request that the discussion about time representation be continued elsewhere. Apart from that, that ticket is on course to agree a data model. The aim has always been to be implementation-neutral.

If, as moderator of ticket 68, you think it would clarify the discussion to move it to a new ticket, surely it should carry on from its current position. David has an up-to-date version of the HTML diagram, which he has not yet published on ticket 68 because we have been waiting to see if we could reach agreement with you by email concerning the point about domain axes. If it would facilitate the discussion, the HTML document of ticket 68 could be converted into a wiki for further discussion. We also have a UML diagram on that ticket, which David intends to keep up to date as the discussion proceeds. I don't think we should start all over again!

Jonathan

Hello Jonathan

I agree, I have no desire to start over again and I don't believe I have implied doing this. I see this ticket as merely a continuation of #68, with the agreed objectives clearly stated in the ticket definition.

The notes I put up on the wiki are my attempt to summarise the discussions we have had to date. I hope that in general these statements may be seen as useful summaries of our discussions, but clearly I will have written a number of things which need adjusting or completely rewriting/replacing.

I feel that a wiki is a useful place to maintain the latest information as we discuss it.

I have not copied information from pages you and David have published, instead taking the section headings and writing a short summary of what I thought the title meant. I think that this will enable us to take the sections one by one and agree on a final wording describing the construct.

I have tried to be as brief as possible following various comments on #68 about complexity and brevity.

My proposal as moderator of #68 / #95 is to continue discussions here, taking each type in turn and attempting to agree text describing it and the nature of its relationships in the UML model.

If this simply involves taking a block of text from one of your or David's documents and pasting it into the wiki then that seems a fine approach to me. If we want to adjust wording or simplify statements we can address those in turn.

Similarly we will update or replace the diagram and keep the latest version visible on the wiki.

Does this feel like a workable approach to you?

mark

comment:3 follow-up: Changed 4 years ago by davidhassell

Hello,

I think that it's a bit confusing for two open tickets to both claim to have a description and diagram of proposed CF data models! I would say that to date, the draft data model has been developed in the discussion on ticket 68.

I have just posted, on ticket 68, the latest draft (text and diagram) of the data model, which incorporates many of the points raised in that discussion, which should be a good point of reference.

All the best,

David

comment:4 in reply to: ↑ 3 Changed 4 years ago by markh

Replying to davidhassell:

I think that it's a bit confusing for two open tickets to both claim to have a description and diagram of proposed CF data models! I would say that to date, the draft data model has been developed in the discussion on ticket 68.

My proposal is to freeze #68 with respect to the data model, version 1.5, and continue discussions on the model to their conclusion on this ticket. There has been too much confusion on the scope of #68 which I think continues to detract from discussions.

I have just posted, on ticket 68, the latest draft (text and diagram) of the data model, which incorporates many of the points raised in that discussion, which should be a good point of reference.

That is useful, thank you.

I feel that a workable approach is to take the contents of the document and diagram you have posted and, section by section, conduct a final review of wording and intent.

Each agreed section can then be used to populate the Draft Wiki, replacing my placeholder text on this page until it represents the conclusion of this ticket.

comment:5 Changed 4 years ago by markh

I propose that we start with one of the constructs not yet agreed: CoordinateReferenceSystem

In the early stages of CF, grid_mapping variables were mainly used to define the coordinate transformation from a pair of horizontal spatial coordinates to their latitude, longitude equivalents, generally 2D auxiliary_coordinates. These 2D auxiliary_coordinates are also supplied according to the convention, in particular to support interoperability with COARDS convention NetCDF-aware software.

I feel that the use of coordinate reference systems in CF is far more extensive than that these days, with the importance of spatial referencing for data stretching beyond the singular requirement to provide latitude and longitude as geographic coordinates.

There is a change scheduled for CF 1.7 extending the referencing syntax to support some of these use cases (#70) and discussions around specification which are likely to continue as the scope of CF continues to grow.

As the scope for this ticket is to deliver a CF data model which represents CF for NetCDF 1.5 I think it is reasonable to limit the expression of grid_mappings to that version. As such I think it is reasonable to shelve CoordinateReferenceSystems for this iteration of the data model and just have a Transform construct, to represent the 'transform only' grid_mappings of CF 1.5 and parametrised vertical coordinates.

We can revisit this type as we evolve the data model to version 1.7.

Is this a reasonable approach?

mark

comment:6 follow-up: Changed 4 years ago by jonathan

Dear all

In comment 134 on 23 Nov I summarised the discussion of ticket 68 up to that point. Various comments were made on the summary, as a result of which David and I made some changes to the model we had proposed, which are included in the version he has posted. Here is a summary of my previous summary, with comments on the changes made. In my previous posting, I invited people to say if they disagreed with the points made. I've recorded below the disagreements which were noted. I hope it is safe to assume that silence meant agreement. In that case, the only outstanding point in this list is Mark's concern about domain axes.

  • What to call the space in which the field exists. David and I called it a space, but we are OK with calling it a domain instead, following Jon's suggestion, which various people support. As far as I can see, no-one disagrees with domain. But note that it is not purely spatiotemporal (unlike Jon's table), since CF allows non-spatiotemporal coordinates. Also, it is not defined only by coordinates, but also cell measures, cell methods and transforms (i.e. grid mapping and formula terms), as has already been discussed.

I think everyone agrees with using the word domain - thanks for that, Jon. Bryan agreed that the domain could include spatiotemporal coordinates. Mark supported this point. No-one has disagreed.

  • Whether the data model should specifically mention the possibility of a domain existing without data in it, i.e. a field with no data. David and I included this by saying that the data array is optional in the field. There has been agreement that this could be useful, but Mark has argued that since CF-netCDF doesn't currently have a convention for a field with no data, we shouldn't mention it in the data model. I can't argue against this, since it's the usual view we take about CF, that we don't add to it until we need to. So we should remove it for the moment from the proposed data model (but keep it in mind as an appealing and probably useful generalisation). OK?

Bryan mentioned description of irregular grids and gridspec definitions as example use cases for a dataless domain. Bob Oehmke also supported this need, since ESMF regridding needs it. Bryan thinks we should look further ahead than we normally do when modifying CF. Mark earlier argued against this concept, but in his comment 138 on 28 Nov he said he supported all the points in my summary except the next one, which would therefore appear to support this one. Is that right, Mark? We are content to omit this part for the moment, expecting that it will soon be added when a convention for it is proposed for CF-netCDF, and dataless fields are not currently in our document. However, we would be happy to put them back in if you are content that we do so. No-one else has objected to the idea.

  • Whether dimensions have an independent identity in the field. Could we call a 1D axis a domain axis? Our CF data model would then distinguish domain axis construct (corresponding to netCDF dimension and the 1D coordinate variable of the same name, if there is one) and auxiliary coordinate construct (corresponding to a CF-netCDF auxiliary coordinate variable).

Bryan supported the concept of domain axis construct. Mark expressed concerns about this point. Following subsequent useful discussions with Mark, we have changed our document. We now have a domain axis construct, whose only function is to indicate the existence of a dimension of the domain, and provide the size of the dimension. That means the dimension coordinate construct (corresponding to the Unidata 1D netCDF coordinate variable) and the auxiliary coordinate construct (corresponding to the CF-netCDF 1D or multi-dimensional auxiliary coordinate variable) are more similar than we had before. This feels like an improvement. Both types of coordinate construct refer to the domain axes. The field data array also refers to the domain axes for its dimensions, but it does not have to include dimensions of size one. This is because CF-netCDF allows data variables to omit dimensions of size one, by using scalar coordinate variables. In previous discussions, I don't think anyone else has been concerned about this treatment. I wonder what you think now about our current version, Mark?

  • Whether fields are independent as far as the data model is concerned. I think that everyone agrees with this, provided that we also note that it is possible to test, using the information described by the data model, whether two fields are defined with the same domain, and that if the fields are actually contained in a single CF-netCDF file, identity of dimensions and coordinates may make it easier to verify that two fields are defined with the same domain. Is that OK with everyone?

Mark supported this point. Bryan agreed with it too, pointing out that it might be desirable to put the domain definition in a different file from the data variables, which would be possible if we allowed dataless domains. We agree with Steve that the duplication of a domain definition is not an issue for the data model. The inclusion of the note proposed above (which is in the version of our document that David posted) was to meet a concern arising from earlier comments of John Caron's. It's an aspect of implementation in CF-netCDF files, not an aspect of the data model itself. No other objections have been made.

  • I think we all agree that there is a need for the CF data model to envisage fields which are distributed over several CF-netCDF files or which reside in non-netCDF files or in memory. Steve argues that we are already doing it, and he's right. Nonetheless the CF-netCDF convention does not deal with this. It is for individual netCDF files and says nothing about aggregation rules or non-netCDF files. The question is whether, despite this limitation, we should write the CF data model description to permit the idea of fields spread over several datasets or in non-netCDF files. Do we all agree that we should do that?

Bryan and Mark agreed with this. Martin commented that a field distributed over several files should be regarded as a single field by the data model. We agree; our document says "a field construct may span data variables in more than one file." No-one has disagreed with this point, but Steve commented that the data model should not talk about files at all. We agree that files are not required by the data model itself; our data model document does, however, remark on how things are implemented in CF-netCDF, which is a convention for netCDF files. We thought that making these references to CF-netCDF would make the document easier to understand than if we described the data model in a purely abstract way, but the intention is that the data model is independent of the file format, as others have argued for.

  • I suspect we all agree that we do not want the CF data model to contain rules for aggregation. John has just made such a remark. We acknowledge the need to be able to do this, but we are not ready to standardise it. The CDM and NCML support aggregation according to certain rules. David and I have proposed a different set of aggregation rules in ticket 78, which David has implemented in cf-python. Aggregation rules are a step beyond the CF data model at the moment. Is that OK?

Bryan and Mark agreed with this. No-one has disagreed.

In comment 5 on this ticket, Mark has raised the treatment of grid_mapping etc. I'd like to discuss this, but I'd prefer not to deal with too many things at once, so I wonder if we can first be sure we agree on the above points? If we do, that goes a long way towards agreeing the data model.

To be clear, I am proposing that our proposed data model, except for the part on transforms, should be adopted as the CF data model. Are there other substantial aspects that have not yet been discussed about which others have comments?

I suggest that we postpone agreeing the explicit text until we have agreed the content. If there are lots of detailed concerns about text, we could try editing the document on a wiki, or circulating the document among those who wish to participate, whichever is more efficient.

Best wishes

Jonathan

comment:7 Changed 4 years ago by mgschultz

John,

this is a very good summary, and I agree with pretty much all you have written. However, I do have a general concern about the scope of the data model as it is described at http://www.met.reading.ac.uk/~david/cfdm_recast.html. In the "preamble" it states "The present document proposes a data model corresponding to the CF metadata standard (version 1.5). The data model avoids prescribing more than is needed for interpreting CF as it stands, in order to avoid inconsistency with future developments of CF." I believe that these two sentences are the major cause for some of the confusion or discussions you refer to, and I would propose to change them. Specifically, explicitly allowing the data model already now to look beyond the present CF-1.5 convention will facilitate its further development as well as the developments of specific APIs and the convention itself.

The two points I read as "open" from your summary of the summary could in my view be agreed upon more easily if we remove the need for CF-1.5 consistency, or rather: relax this a little.

  1. empty fields: I don't think these are in conflict with the current CF convention. Although they don't seem to matter in current real-life single netcdf-CF files (else there would have been stronger voicing of concerns or support), it seems like everyone agrees that they are a logical aspect of an abstract data model, in particular with respect to "setting the groundwork...". My suggestion would therefore be to leave this in the data model description, but state explicitly that this is an extension beyond the current CF-1.5 standard, because there has been no need for this in actual applications so far. [side remark: from a software engineering perspective, I would regard this option as the possibility to have a pointer to a data field set to NULL, which is in any event a good condition to check for]
  2. fields straddling across netcdf files: as you say, the current netcdf-CF convention doesn't deal with this, even though it is common practice, and a concept supported by various analysis tools. I think it would be silly to let the data model fall behind reality, "just because" the convention is not up to this yet. In fact, the cfdm web page notes that "Providing a communication tool for discussing CF concepts and proposals for changes to the CF specification" is one of the primary objectives of the data model, and indeed, I see a case for this here.

As you also point out implicitly, one reason for confusion may be the relation of convention - cf-netcdf file(s) - data model. As it has evolved from the need to standardize metadata in netcdf files, the convention specifically deals with (individual) netcdf files. The data model comes from a rather different angle, and - as was discussed in this ticket - would be a lot stronger without tying it (too closely) to the specifics of (single) netcdf data files. This is indeed a fundamental difference, which in my eyes contradicts the two introductory sentences I cited above. I hope that if we agree on this introduction, thereby better defining the scope of the data model, we might reach consensus more easily and also make it easier to expand the data model (and possibly the convention) later.

So, here is my suggestion for this paragraph: "The present document proposes a data model corresponding to the CF metadata standard (version 1.5). The data model strives to maintain consistency with CF as it stands, but extends beyond the version 1.5 standard in a few places where its more general scope demanded this and where the authors do not anticipate inconsistency with future developments of CF. This document is illustrated by the accompanying UML diagram of the data model."

Best regards,

Martin

comment:8 follow-up: Changed 4 years ago by jonathan

Dear Martin

Thanks for your comments. Bryan's earlier comment is similar, that in the data model we might allow ourselves to look a bit further into the future than we do when modifying the CF-netCDF standard. We should say that the data model must be sufficient for CF 1.5, but need not be restricted to CF 1.5. I would be happy to use your words for the preamble. I wonder what David and others think.

Best wishes

Jonathan

comment:9 in reply to: ↑ 8 Changed 4 years ago by Oehmke

Replying to jonathan:

Dear Martin

Thanks for your comments. Bryan's earlier comment is similar, that in the data model we might allow ourselves to look a bit further into the future than we do when modifying the CF-netCDF standard. We should say that the data model must be sufficient for CF 1.5, but need not be restricted to CF 1.5. I would be happy to use your words for the preamble. I wonder what David and others think.

Best wishes

Jonathan

I think that it would be a good idea. Allowing the data model to be "sufficient" would give it the freedom to be a bit more forward looking. It would also remove the necessity of putting explicit restrictions in for things which are likely to be allowed later (e.g. the empty fields issue).

Bob

comment:10 in reply to: ↑ 6 Changed 4 years ago by markh

Replying to jonathan:

  • Whether dimensions have an independent identity in the field. Could we call a 1D axis a domain axis? Our CF data model would then distinguish domain axis construct (corresponding to netCDF dimension and the 1D coordinate variable of the same name, if there is one) and auxiliary coordinate construct (corresponding to a CF-netCDF auxiliary coordinate variable).

Bryan supported the concept of domain axis construct. Mark expressed concerns about this point. Following subsequent useful discussions with Mark, we have changed our document. We now have a domain axis construct, whose only function is to indicate the existence of a dimension of the domain, and provide the size of the dimension. That means the dimension coordinate construct (corresponding to the Unidata 1D netCDF coordinate variable) and the auxiliary coordinate construct (corresponding to the CF-netCDF 1D or multi-dimensional auxiliary coordinate variable) are more similar than we had before. This feels like an improvement. Both types of coordinate construct refer to the domain axes. The field data array also refers to the domain axes for its dimensions, but it does not have to include dimensions of size one. This is because CF-netCDF allows data variables to omit dimensions of size one, by using scalar coordinate variables. In previous discussions, I don't think anyone else has been concerned about this treatment. I wonder what you think now about our current version, Mark?

I think it is a useful step to qualify the relationships between the Fields and the Coordinate/AuxiliaryCoordinate with domain_axes, this feels like a sensible abstraction.

If we do this it is essential that the relationship between the domain_axis collection, the Fields data array and the points and bounds of the Coordinate/AuxiliaryCoordinate are well defined.

The shape (dimensionality, size, order) of the domain_axis collection of a Field must be the shape of the Field's data array or degenerate unambiguously to this shape.

The shape of all coords referenced via a domain axis must conform to the domain axis shape.

I think if these factors are clearly defined in the data model then we have a workable solution. Extra behaviour must be truly additive and not affect this behaviour.
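
As an illustration only, the two shape constraints above could be checked along the following lines. This is a minimal Python sketch, not part of the proposal: the function names `data_shape_conforms` and `coord_shape_conforms`, and the use of a name-to-size mapping for the domain axes, are assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def data_shape_conforms(axis_sizes: Sequence[int], data_shape: Tuple[int, ...]) -> bool:
    """Return True if data_shape equals the ordered domain axis sizes,
    optionally with some or all size-one axes omitted."""
    i = 0  # current position in data_shape
    for size in axis_sizes:
        if i < len(data_shape) and data_shape[i] == size:
            i += 1        # this axis is present in the data array
        elif size == 1:
            continue      # a size-one axis may be omitted
        else:
            return False  # a larger axis is missing or out of order
    return i == len(data_shape)


def coord_shape_conforms(
    axes: Dict[str, int], coord_axes: Sequence[str], coord_shape: Tuple[int, ...]
) -> bool:
    """Return True if a coordinate array has exactly the sizes of the
    domain axes it refers to, in the order it refers to them."""
    return tuple(axes[name] for name in coord_axes) == tuple(coord_shape)
```

For example, with hypothetical domain axes of sizes (1, 73, 96), data arrays of shape (73, 96) and (1, 73, 96) both conform, while (96, 73) does not.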

As yet I do not see the benefit delivered by having the domain axis as a separate construct, referenced by the Field. However, I think the risk of this causing serious problems can be mitigated by the constraints I have detailed, binding domain_axis collections to the Field data properties and mediating the shape constraints between Fields and other constructs.

On this basis I think we have a model which we can work with.

I think it is important that we are wary of introducing new features without sufficient consideration at this stage, so I would avoid providing DomainAxis Constructs with any other behaviour or properties. The discussion of how these types develop should be continued once CF1.5 is completed, aligned with CFforNetCDF1.5.

So, some further thought on precise wording is required, but in principle I think we can work with this.

comment:11 follow-up: Changed 4 years ago by markh

It appears we are getting close to agreement in principle to the various components and interactions in the model.

I think we should now attempt to review and finalise the text and diagram to provide the definitive perspective on the model.

I find it easier to write introductions to completed texts, so I suggest we start with the type definitions text and diagram, section by section, and return to the introductory text near the end.

Jonathan and David have proposed a text for the Field Construct:

Field construct

The central concept of the data model is a field construct. In a dataset contained in a single netCDF file, each data variable usually corresponds to a field construct, but a field construct might be a combination of several data variables. In a dataset comprising several netCDF files, a field construct may span data variables in more than one file, for instance from different ranges of a time coordinate (to be introduced by Gridspec in CF version 1.7). Rules for aggregating data variables from one or several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software.

This data model makes a central assumption that each field construct is independent. Data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified, it will not affect any other field construct. Explicit tests of equality will be required to establish whether two data variables have the same coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. In a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.
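
The independence assumption in the paragraph above can be pictured with a small sketch. It is an illustration under stated assumptions, not a proposed implementation: the `Field` class, `read_fields_sharing` and `same_coordinates` are hypothetical names, and plain Python lists stand in for coordinate arrays.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    """Hypothetical stand-in for a field construct: a name plus coordinates."""
    name: str
    coords: Dict[str, List[float]] = field(default_factory=dict)


def read_fields_sharing(coords: Dict[str, List[float]], names: List[str]) -> List[Field]:
    """Mimic reading data variables that share coordinate variables in a file.

    Each field receives its own deep copy of the coordinates, so altering
    one field in memory cannot affect another.
    """
    return [Field(name, copy.deepcopy(coords)) for name in names]


def same_coordinates(a: Field, b: Field) -> bool:
    """Explicit equality test: compare coordinate values, not object identity."""
    return a.coords == b.coords


tas, pr = read_fields_sharing({"lat": [-45.0, 45.0]}, ["tas", "pr"])
shared_before = same_coordinates(tas, pr)  # equal values, independent copies
tas.coords["lat"][0] = -60.0               # modify one field only
shared_after = same_coordinates(tas, pr)   # pr is unchanged, so no longer equal
```

Sharing within a file is treated here purely as a storage detail; whether two fields have the same coordinates is always decided by comparing values.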

Each field construct may have

  • An ordered list of zero or more domain axis constructs.
  • A data array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. If there are no domain axes of greater size than one, the data array may be a scalar. If there are no domain axes then the data array must be a scalar. Domain axes of size one can be omitted because their position in the order of domain axes makes no difference to the order of data elements in the array. The elements of the data array must all be of the same data type, which may be numeric, character or string.
  • An unordered collection of dimension coordinate constructs.
  • An unordered collection of auxiliary coordinate constructs.
  • An unordered collection of cell measure constructs.
  • A cell methods construct, which refers to the domain axes (but not their sizes).
  • An unordered collection of transform constructs.
  • Other properties, which are metadata that do not refer to the domain axes, and serve to describe the data the field contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. They are attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense.
  • A list of ancillary fields. This corresponds to the CF-netCDF ancillary_variables attribute, which identifies other fields that provide metadata.

All the components of the field construct bar the data array are optional.
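
To make the list of components easier to picture, here is one possible in-memory layout, offered as a sketch only. The class and attribute names are assumptions chosen for the example and are not defined by the draft text; dictionaries keyed by arbitrary identifiers stand in for the "unordered collections", and `Any` is used where the draft does not fix a representation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DomainAxis:
    """A domain axis only records that a dimension exists and its size."""
    size: int


@dataclass
class FieldConstruct:
    # The data array is the only required component.
    data: Any
    # Ordered list of zero or more domain axes.
    domain_axes: List[DomainAxis] = field(default_factory=list)
    # Unordered collections, keyed by a hypothetical identifier.
    dimension_coordinates: Dict[str, Any] = field(default_factory=dict)
    auxiliary_coordinates: Dict[str, Any] = field(default_factory=dict)
    cell_measures: Dict[str, Any] = field(default_factory=dict)
    transforms: Dict[str, Any] = field(default_factory=dict)
    # A single cell methods construct, if any.
    cell_methods: Optional[Any] = None
    # Metadata that does not refer to the domain axes.
    properties: Dict[str, Any] = field(default_factory=dict)
    # Other fields that provide metadata for this one.
    ancillary_fields: List["FieldConstruct"] = field(default_factory=list)


# A scalar field with no domain axes and a single descriptive property.
scalar = FieldConstruct(data=273.15, properties={"units": "K"})
```

Every component except `data` has an empty default, mirroring the statement that all components bar the data array are optional.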

Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain.

The CF-netCDF formula_terms (see also Transform constructs) and ancillary_variables attributes make links between field constructs. These links are fragile. If a field construct is written to a file, it is not required that any other field constructs to which it is linked are also written to the file. If an operation alters one field construct in a way which could invalidate a relationship with another field construct, the link should be broken. The user of software will have to be aware of these relationships and remake them if applicable and useful.
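
One way to picture these fragile links is sketched below, purely as an illustration. `LinkedField` and `subset` are hypothetical names, and dropping every link on subsetting is just one conservative policy that software might choose; the text above leaves the choice to the software and its user.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedField:
    """Hypothetical field with data and links to ancillary fields."""
    name: str
    data: List[float]
    ancillary: List["LinkedField"] = field(default_factory=list)


def subset(f: LinkedField, start: int, stop: int) -> LinkedField:
    """Return a new field holding a slice of the data.

    The ancillary links are deliberately not carried over: the linked
    fields still describe the original domain, so the relationship
    could be invalid for the subset. The user may remake the links.
    """
    return LinkedField(f.name, f.data[start:stop])


quality = LinkedField("quality_flag", [0.0, 0.0, 1.0])
temperature = LinkedField("temperature", [280.0, 281.0, 282.0], [quality])
part = subset(temperature, 0, 2)  # part has no ancillary links
```

The original `temperature` field keeps its link to `quality`; only the derived field starts without one.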

I invite comments on this text and suggestions for additions, removals and edits.

thank you mark

comment:12 in reply to: ↑ 11 ; follow-up: Changed 4 years ago by jonblower

Replying to markh:

Quick question about this part just for clarification:

"a field construct might be a combination of several data variables"

Are you thinking here of things like velocities that are combinations of components (as discussed in another ticket)?

comment:13 Changed 4 years ago by mgschultz

Dear Mark et al.,

by and large I think this text is pretty good already. I am trying to take the viewpoint of an outsider who doesn't know much about CF. From this perspective a bit more clarification may be useful (but perhaps some of this would be found in the introduction then). In a few instances, I would prefer the text to be more "structured" in the sense that it first describes things on the most abstract level before going into specifics or examples. Also, I had the same question as Jon concerning the exact meaning of "several data variables". This may indeed become quite complicated. Therefore, I would suggest to limit the meaning of one field construct to exactly one scalar field, and rather introduce another construct "field collection" or similar, which deals with the possible relation between fields, such as vector components. This separation should also make it easier to actually implement the data model. Below please find my additions/comments or revisions in italics.

Field construct

The central concept of the data model is a field construct. A field construct corresponds to exactly one data array of arbitrary dimension together with associated information on the spatio-temporal domain where the data resides and other metadata information about the data. In a dataset contained in a single netCDF file, each data variable corresponds to one field construct, but in a dataset comprising several netCDF files, a field construct may span data variables in more than one file, for instance from different ranges of a time coordinate (to be introduced by Gridspec in CF version 1.7). Rules for aggregating data variables from several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software.

This data model makes a central assumption that each field construct is independent. Technically, data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified for example by averaging the field values over one dimension, it will not affect any other field construct.

Explicit tests of domain consistency will be required to establish whether two data variables have the same coordinates or at least share a subset of these coordinates [Note: I am thinking of surface pressure versus a 3D temperature field here, for example].

[I suggest to delete the following, because there is no need to refer to netcdf files here]Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. In a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.

Each field construct consists of

  • An ordered list of zero or more domain axis constructs.
  • A data array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. If there are no domain axes of greater size than one, the data array may be a scalar. If there are no domain axes then the data array must be a scalar. [I suggest to remove the following: isn't this an implementation detail?] Domain axes of size one can be omitted because their position in the order of domain axes makes no difference to the order of data elements in the array. The elements of the data array must all be of the same data type, which may be numeric, character or string. In an actual implementation of this data model, the data array may not always physically reside in memory, but could also consist of a set of pointers and/or methods which allow access to arbitrary portions of the data array.
  • An optional unordered collection of dimension coordinate constructs.
  • An optional unordered collection of auxiliary coordinate constructs.
  • An optional unordered collection of cell measure constructs.
  • An optional cell methods construct, which refers to the domain axes (but not their sizes).
  • An optional unordered collection of transform constructs (e.g. as specified by the CF formula_terms).
  • Other properties, which are metadata that do not refer to the domain axes, and serve to describe the data the field contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. These properties correspond to attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense. The properties list may be empty.
  • An optional list of ancillary fields. This corresponds to the CF-netCDF ancillary_variables attribute, which identifies other fields that provide metadata.

Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain.

The CF-netCDF formula_terms (see also Transform constructs) and ancillary_variables attributes make links between field constructs. These links are fragile and it might not always be possible for a data processing software to maintain a consistent set of such links. [I suggest to remove the following, because I think this is again an implementation detail - the general statement above should suffice in terms of the data model]If a field construct is written to a file, it is not required that any other field constructs to which it is linked are also written to the file. If an operation such as writing a field construct to a file alters one field construct in a way which could invalidate a relationship with another field construct, the link could [I think that some cases may be tractable by 'intelligent' software] be broken. [is the following needed?]The user of software will have to be aware of these relationships and remake them if applicable and useful.

[Perhaps a general remark concerning the relationship between CF-1.5 and the data model at the end?]The CF data model field construct extends the scope of a CF-1.5 netcdf variable in that it is defined independent of the actual representation of the data in one physical file. This has consequences concerning the exact relation between the data model and the CF standard and opens up several areas of discussion with respect to actual implementations of the data model. Some of these - in particular the consistency and aggregation rules - should be defined as implementation standard in order to provide a similar "look and feel" to different software packages that rely on the CF data model.

Best regards,

Martin

comment:14 in reply to: ↑ 12 Changed 4 years ago by markh

Replying to jonblower:

Replying to markh:

Quick question about this part just for clarification:

"a field construct might be a combination of several data variables"

Are you thinking here of things like velocities that are combinations of components (as discussed in another ticket)?

Hello Jon

I believe this is explicitly not what is being suggested. The inference I draw from this statement is that a single Field may be represented by multiple data variables, separated across files.

This allows a Field to represent a dataset with one definition (e.g. standard_name, unit, cell_methods etc) where the data is split up into multiple variables for convenience.

I think the other ticket you refer to is looking at a construct which would allow multiple Fields, with different definitions, to be treated as an entity; a very different thing.

The limitation on scope for this ticket is to concepts present in CF for NetCDF 1.5 so this interesting topic will have to wait until single Field definition is complete: the conclusion of this ticket. (I am raring to go on this one)

mark

comment:15 Changed 4 years ago by markh

My view on the proposed text is that it is conflating two objectives: the definition of the construct and the relationship of this construct to a CF NetCDF file. I am concerned that this has a significant impact on the clarity of the definition.

I think the key features are not clear enough in the proposed text.

I would prefer to separate these two aspects. I wonder whether two documents would help, one defining the model and one relating it to NetCDF; however, I would be content to have the NetCDF relationship as a subsection within the data model construct sections.

I suggest an alternative text which tries to implement this approach. I have tried to capture the required information for the model from the previously suggested text, in doing so I have made a couple of changes.

Field Construct

The central concept of the data model is a Field construct. A Field represents a single phenomenon with metadata to define that phenomenon and to define the domain which the phenomenon is sampled from.

The domain of the Field defines the Field's location in time, space and all other degrees of freedom it may have; it also may provide further contextual metadata. A field construct may be regarded as a domain definition with data in that domain.

The Field contains one multi-dimensional array of data values, which may include missing data. Elements of the data array must all be of the same data type.

The data array has shape, an ordered set of dimensions with extents defined by the Field's explicit domain_axes.

The data model makes a central assumption that each Field construct is independent.

Attributes are interpreted consistently with Appendix A. Attributes of the CF Conventions for NetCDF files (1.5) (a relevant subset of this table could be included in the definition perhaps), apart from coordinates and grid_mapping which are not to be used within the scope of the data model.

New attribute names are introduced by the data model: domain_axes, dim_coordinates, aux_coordinates, transforms. These attributes, along with cell_measures, mediate the relationships between a Field and the constructs which define its domain, using containment associations qualified by the DomainAxis instances.

The cell_methods attribute qualifies the phenomenon referencing a CellMethod collection.

All the Field's attributes are optional except for data.

The Field Construct in a NetCDF File

In a dataset contained in a single netCDF file, each data variable usually corresponds to a field construct, but a field construct might be a combination of several data variables, as long as they represent the same phenomenon, over comparable but not overlapping domains. In a dataset comprising several netCDF files, a field construct may span data variables in more than one file, for instance from different ranges of a time coordinate (to be introduced by Gridspec in CF version 1.7). Rules for aggregating data variables from one or several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software.

Data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, this is viewed solely as a means of saving disk space, and it is assumed that implementations will be able to alter any field construct without affecting other field constructs. For instance, if the coordinates of one field construct are modified, it will not affect any other field construct. Explicit tests of equality will be required to establish whether two data variables have the same coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. In a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.

comment:16 Changed 4 years ago by markh

The discussion on Field is still open for comment, but I would like to continue discussions. With that in mind I request comments on the proposed text for a DimensionCoordinate:

Dimension coordinate construct

A dimension coordinate construct indicates the physical meaning and locations of the cells for a unique domain axis of the field.

A dimension coordinate construct may contain

  • A scalar or one-dimensional numerical coordinate array of the size specified for the domain axis. The elements of the coordinate array must all be of the same numeric data type, they must all have different non-missing values, and they must be monotonically increasing or decreasing. Dimension coordinate constructs cannot have string-valued coordinates. In this data model, a CF-netCDF string-valued coordinate variable or string-valued scalar coordinate variable corresponds to an auxiliary coordinate construct (not a dimension coordinate construct), with a domain axis which is not associated with a dimension coordinate construct.
  • A two-dimensional boundary coordinate array, whose slow-varying (second in Fortran) dimension equals the size specified by the domain axis construct, and whose fast-varying dimension is two, indicating the extent of the cell. For climatological time dimensions, the bounds are interpreted in a special way indicated by the cell methods.
  • Properties (in the same sense as for the field construct) serving to describe the coordinates.

In this data model we permit a domain axis not to have a coordinate array if there is no appropriate numeric monotonic coordinate. That is the case for a dimension that runs over ocean basins or area types, for example, or for a domain axis that indexes timeseries at scattered points. Such domain axes do not correspond to a continuous physical quantity. (They will be called index dimensions in CF version 1.6.)
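
The constraints on the coordinate array and the bounds array described above can be checked mechanically. The following is a minimal sketch, assuming plain Python sequences stand in for the arrays and that a missing value is represented as NaN or a non-numeric entry; the function names are hypothetical and chosen only for this example.

```python
import math
from numbers import Real
from typing import Sequence


def is_valid_dimension_coordinate(values: Sequence[object], axis_size: int) -> bool:
    """Check size, numeric type, absence of missing values and strict monotonicity."""
    if len(values) != axis_size:
        return False
    # All elements must be numeric, of one type, and not missing (NaN).
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return False
    if len({type(v) for v in values}) > 1:
        return False
    if any(math.isnan(v) for v in values):
        return False
    # Strictly increasing or strictly decreasing implies all values differ.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


def has_valid_bounds(bounds: Sequence[Sequence[float]], axis_size: int) -> bool:
    """Check that there is one pair of bounds per cell along the domain axis."""
    return len(bounds) == axis_size and all(len(cell) == 2 for cell in bounds)


print(is_valid_dimension_coordinate([0.0, 1.5, 3.0], 3))   # increasing
print(is_valid_dimension_coordinate([30, 20, 10], 3))      # decreasing
print(is_valid_dimension_coordinate(["a", "b"], 2))        # string-valued: rejected
```

A string-valued array fails this check, which matches the statement that such coordinates belong in an auxiliary coordinate construct instead.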

comment:17 Changed 4 years ago by markh

As the auxiliary coordinate is a similar construct I also invite comments on the proposed text.

This may help if people would like to comment on them as a pair.

Auxiliary coordinate construct

An auxiliary coordinate construct provides auxiliary information for interpreting the cells of an ordered list of one or more domain axes of the field.

An auxiliary coordinate construct must contain

  • A coordinate array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. The elements of the coordinate array must all be of the same data type (numeric, character or string), but they do not have to be distinct or monotonic. Missing values are not allowed (in CF version 1.5).

and may also contain

  • A boundary coordinate array with all the dimensions, in the same order, as the coordinate array, and a fastest-varying dimension (first dimension in Fortran) equal to the number of vertices of each cell.
  • Properties serving to describe the coordinates.

Auxiliary coordinate constructs correspond to auxiliary coordinate variables named by the coordinates attribute of a data variable in a CF-netCDF file. CF recommends there to be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude. As for dimension constructs, auxiliary coordinate constructs of different field constructs are independent in the data model.

comment:18 Changed 4 years ago by mgschultz

Hello Mark,

excellent! I only have a few minor remarks:

  1. "A dimension coordinate construct indicates the physical meaning and locations of the cells for a unique domain axis of the field." -- do we need the term "cells" here? I believe this implies some sort of gridded set-up (i.e. a numerical model), while in principle, "locations" should be all we need to know (?).
  2. I agree that string coordinates should be disallowed in the data model. Yet, it may be good to clarify that a dimension coordinate can consist of ordinal values, i.e. index values - for example indices of an auxiliary coordinate construct (if I understand the latter correctly).
  3. Concerning the auxiliary construct, this is hard to read (but I wouldn't know how to do it better right now). Perhaps it may help to also refer to the example of hybrid pressure vertical coordinates here as another example?

Best regards,

Martin

comment:19 follow-up: Changed 4 years ago by ngalbraith

Also just a couple of comments/questions here.

Can we assume that when you use the term 'array' you include singletons? Or does that need to be stated explicitly?

Is the definition a little too specific?

'An auxiliary coordinate construct provides auxiliary information for interpreting the cells of an ordered list of one or more domain axes of the field.'

The information in the aux coordinate isn't necessarily 'auxiliary' - only the construct is (i.e. as an alternative to a dimension coordinate). The aux coordinate doesn't need to be monotonic, and in some cases may not be ordered.

Would a shorter version be less ... limiting: 'An auxiliary coordinate construct provides information for interpreting one or more domain axes.'

Side note: We use aux coordinates to provide heights of meteorological sensors in files that may have subsurface data with a depth dimension. We may also use them for moorings that have a singleton lat/long dimension coordinate but also have GPS time series data.

comment:20 follow-up: Changed 4 years ago by jonathan

Dear all

I apologise for silence on this ticket; I haven't had time in the last couple of weeks, but I do intend to contribute. On the Field construct, I agree we could gather the netCDF-specific comments together. In doing this Mark has produced a somewhat different text from the one we had before, on which Martin provided some suggestions. I'd like to compare these three versions.

The CF-netCDF convention does not have a clear statement of the purpose of coordinate or auxiliary coordinate variables, as far as I can see. The nearest I have found is in section 4, where it speaks of both: "The values of a coordinate variable or auxiliary coordinate variable indicate the locations of the gridpoints." Following that, we could say "gridpoint", as Martin suggests, because the bounds are optional, but sometimes the bounds are the important information and the gridpoints are notional (for extensive quantities). "Cells" covers both cases.

Perhaps we could say "A dimension coordinate construct provides physical coordinates to locate the cells along a unique domain axis, such that all cells which have the same index along that axis share the same coordinate value, and have a different coordinate value from cells with any other index along the axis" and "An auxiliary coordinate construct provides physical coordinates to locate the cells as a function of their indices along an ordered list of one or more domain axes". That's a bit more formal, but I think it shows the distinctions between the two kinds.

I agree with Martin that we could add text such as "In CF-netCDF, a string-valued auxiliary coordinate construct could be represented by a numerical auxiliary coordinate construct with a flag_meanings attribute to supply the translation to strings."
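
To make the suggestion concrete, here is a minimal sketch of how string labels could be turned into numeric codes plus `flag_values` and `flag_meanings` attributes. The function name and the choice of consecutive integer codes are assumptions for illustration; only the two attribute names come from CF.

```python
from typing import Dict, List, Sequence, Tuple


def encode_string_coordinate(labels: Sequence[str]) -> Tuple[List[int], Dict[str, object]]:
    """Replace string labels by integer codes and describe the codes with flag attributes."""
    # Distinct labels, in order of first appearance.
    meanings: List[str] = []
    for label in labels:
        if label not in meanings:
            meanings.append(label)
    codes = [meanings.index(label) for label in labels]
    attributes: Dict[str, object] = {
        "flag_values": list(range(len(meanings))),
        # flag_meanings is a blank-separated list, so blanks inside a label
        # are replaced by underscores.
        "flag_meanings": " ".join(m.replace(" ", "_") for m in meanings),
    }
    return codes, attributes


codes, attributes = encode_string_coordinate(
    ["atlantic ocean", "pacific ocean", "atlantic ocean"]
)
print(codes)
print(attributes["flag_meanings"])
```

The numeric codes carry no physical meaning of their own; the translation back to strings relies entirely on the attributes.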

I think the text already allows for singleton auxiliary coordinate values, because it says that an aux coord construct might refer to only one domain axis (so it could be 1D array with a single element).

I think we may have made a mistake in the last bit. I think it should say "CF requires there to be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude", not "recommends". I believe that's what the start of sect 5 means, but it is obscurely stated. Steve Hankin and I have separately been discussing ways to rephrase it which could be proposed as a defect ticket.

Best wishes

Jonathan

comment:21 follow-up: Changed 4 years ago by mgschultz

Dear Jonathan,

actually I was trying to argue against the use of "cell". A dictionary definition gives you:

  1. a small room, as in a convent or prison.
  2. any of various small compartments or bounded areas forming part of a whole.

Thus, a "cell" would refer only to a bounded entity and disregard the case where you only care about the location itself. Using Heisenberg's principle ;-) one could rather argue that "location" is a term that is imprecise enough to allow for both interpretations. On the other hand, I like the clarification you provide in "the bounds are optional, but sometimes the bounds are the important information and the gridpoints are notional (for extensive quantities)." - If we can insert this bit into the definition, the meaning should become pretty clear.

Martin

comment:22 in reply to: ↑ 19 Changed 3 years ago by markh

Replying to ngalbraith:

Can we assume that when you use the term 'array' you include singletons? Or does that need to be stated explicitly?

Do you differentiate between an array of size one and a singleton?

I think that an array may always be of size one and that this explicitly allows singletons.

We can state this explicitly if that helps.

Is the definition a little too specific? 'An auxiliary coordinate construct provides auxiliary information for interpreting the cells of an ordered list of one or more domain axes of the field.'

The information in the aux coordinate isn't necessarily 'auxiliary' - only the construct is (i.e. as an alternative to a dimension coordinate). The aux coordinate doesn't need to be monotonic, and in some cases may not be ordered.

Would a shorter version be less ... limiting: 'An auxiliary coordinate construct provides information for interpreting one or more domain axes.'

I like this, and I agree that the auxiliary part is how the Field interprets it, rather than a characteristic of the coordinate data.

Side note: We use aux coordinates to provide heights of meteorological sensors in files that may have subsurface data with a depth dimension. We may also use them for moorings that have a singleton lat/long dimension coordinate but also have GPS time series data.

comment:23 in reply to: ↑ 21 Changed 3 years ago by markh

Replying to mgschultz:

actually I was trying to argue against the use of "cell". A dictionary definition gives you:

  1. a small room, as in a convent or prison.
  2. any of various small compartments or bounded areas forming part of a whole.

Thus, a "cell" would refer only to a bounded entity and disregard the case where you only care about the location itself. Using Heisenberg's principle ;-) one could rather argue that "location" is a term that is imprecise enough to allow for both interpretations. On the other hand, I like the clarification you provide in "the bounds are optional, but sometimes the bounds are the important information and the gridpoints are notional (for extensive quantities)." - If we can insert this bit into the definition, the meaning should become pretty clear.

I am not sure about this. I find the term 'cell' useful to describe the region, in my domain's phase space, which may be extensive, or infinitesimally small, but always has one and only one data value.

A cell may be defined as 'a small part of something' or 'the smallest basic unit of a plant or animal', both of which capture pretty well the nature of one element of my Field to me.

comment:24 follow-up: Changed 3 years ago by markh

In trying to describe the two coordinate constructs, I have been considering roles and characteristics, and pondering the comments made to date.

It appears to me we have two characteristics here and two roles. The management of the role seems to me to sit firmly with the Field, whilst the characteristic is in the nature of the coordinate the Field references. The language used to describe this is not delivering this message for me.

I wonder whether we would improve the model by recognising that the 'dimension' or 'auxiliary' nature of the coordinate is defined by the Field while the coordinate characteristics are the responsibility of the coordinate to define. I think this division of responsibility could bring clarity to the model.

This leads me to suggest an alteration to the descriptions of the associations.

  • A Field will declare:
    • dim_coordinates: a set of containment associations, each mediated by one and only one explicit domain axis.
    • aux_coordinates: a set of containment associations, each mediated by a collection of domain axes (recognising the constraints on shape matching).

This, in turn, leads me to think that we may have our names wrong for our constructs. The coordinate types are not dimension and auxiliary, they are ordered and unordered.

An OrderedCoord is a coordinate instance which asserts that it has an explicitly defined order: it is sortable, monotonic and 1-dimensional.

An UnorderedCoord is a coordinate instance which does not assert that it has an explicitly defined order.

The only constraint required on the Field is that the dim_coordinates association may only contain OrderedCoord references.
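
A rough sketch of this division of responsibility, purely to aid discussion: the class names `OrderedCoord` and `UnorderedCoord` and the association names `dim_coordinates` and `aux_coordinates` are taken from the text above, while the common base class `Coord`, the method names and the use of Python dataclasses are assumptions of the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Coord:
    """Hypothetical common base: a coordinate holding a flat sequence of values."""
    values: Sequence[float]


@dataclass
class UnorderedCoord(Coord):
    """Makes no assertion about the order of its values."""


@dataclass
class OrderedCoord(Coord):
    """Asserts an explicitly defined order: values must be strictly monotonic."""

    def __post_init__(self) -> None:
        pairs = list(zip(self.values, self.values[1:]))
        if not (all(a < b for a, b in pairs) or all(a > b for a, b in pairs)):
            raise ValueError("an OrderedCoord must be strictly monotonic")


@dataclass
class Field:
    # Each dim_coordinates entry is mediated by exactly one domain axis.
    dim_coordinates: Dict[str, Coord] = field(default_factory=dict)
    # Each aux_coordinates entry is mediated by a collection of domain axes.
    aux_coordinates: List[Tuple[Tuple[str, ...], Coord]] = field(default_factory=list)

    def add_dim_coordinate(self, axis: str, coord: Coord) -> None:
        # The only constraint on the Field: dim_coordinates holds OrderedCoord only.
        if not isinstance(coord, OrderedCoord):
            raise TypeError("dim_coordinates may only contain OrderedCoord references")
        self.dim_coordinates[axis] = coord

    def add_aux_coordinate(self, axes: Sequence[str], coord: Coord) -> None:
        self.aux_coordinates.append((tuple(axes), coord))
```

In this sketch the role (dimension or auxiliary) is decided by which association the Field places the coordinate in, and an `OrderedCoord` may appear in either.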

comment:25 in reply to: ↑ 20 Changed 3 years ago by markh

Replying to jonathan:

I think we may have made a mistake in the last bit. I think it should say "CF requires there to be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude", not "recommends". I believe that's what the start of sect 5 means, but it is obscurely stated. Steve Hankin and I have separately been discussing ways to rephrase it which could be proposed as a defect ticket.

I do not think this should feature in the data model, I do not think it is required.

While I can see the NetCDF interoperability benefits which come from software which looks for latitude and longitude coordinates, I think this is a NetCDF implementation benefit which clutters the data model unnecessarily.

CF model datasets may be fully defined by coordinates and transforms (coordinate reference systems) such that extra latitude and longitude coordinates are not required, they are inferrable from the defined properties.

comment:26 in reply to: ↑ 24 Changed 3 years ago by markh

Replying to markh:

I have written up how the data model may look, if the suggestion from my previous post was taken up, to assist the consideration of this.

I do not see this as a significant change of capability or approach, just clarity of presentation.

Is the naming approach I have adopted for this example helpful?

Does this provide a good base for us to work from?

thank you

comment:27 follow-up: Changed 3 years ago by jonathan

Dear all

Here's a new proposal for the text about the field construct. This is based on the version Mark posted from David's document, incorporating suggestions from Martin. Following Mark's suggestion to make a clearer separation between the logical data model and the netCDF format, I have moved most of the netCDF-specific comments to the end.

Regarding the point about omitting domain axes of size one, I wouldn't say it's an implementation detail. It's more of an explanatory note, so I've put it in brackets. Regarding the remarks about domain consistency, I think we should keep these because they arose from an earlier quite lengthy debate on the issue, in which the agreement was to spell it out like this.

I'm going to post separately about the coordinate constructs. Note that we still have to deal with cell measures, cell methods, properties (which I hope should all be easy to agree) and transforms (which need most thought). Please see http://www.met.reading.ac.uk/~david/cfdm_recast.html for the proposals David and I have made for these.

Field construct

The central concept of the data model is a field construct. A field construct corresponds to exactly one data array together with associated information about the domain in which the data resides (defined by spatio-temporal and other coordinates) and other metadata. This data model makes a central assumption that each field construct is independent.

Each field construct may contain the following, all of which are optional.

  • An ordered list of domain axis constructs.
  • A data array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. (It is possible to omit domain axes of size one because their position in the order of domain axes makes no difference to the order of data elements in the array.) If there are no domain axes of greater size than one, the single datum may be a scalar instead of an array. If the data array has more than one element, they must all be of the same data type, which may be numeric, character or string.
  • An unordered collection of dimension coordinate constructs.
  • An unordered collection of auxiliary coordinate constructs.
  • An unordered collection of cell measure constructs.
  • A cell methods construct, which refers to the domain axes (but not their sizes).
  • An optional unordered collection of transform constructs (corresponding to CF-netCDF formula_terms and grid_mapping).
  • Other properties, which are metadata that do not refer to the domain axes, and serve to describe the data the field contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. These properties correspond to attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense.
  • A list of ancillary fields (corresponding to the CF-netCDF ancillary_variables attribute, which identifies other data variables that provide metadata).

Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain.
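
For readers who find a data structure easier to follow than a list, the components above could be laid out roughly as follows. This is only a sketch under assumptions: the class and attribute names are invented for the example, `Any` stands in for constructs not yet defined in this ticket, and dictionaries keyed by an arbitrary identifier are used to represent unordered collections.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DomainAxis:
    name: str
    size: int


@dataclass
class FieldConstruct:
    # Ordered list of domain axis constructs.
    domain_axes: List[DomainAxis] = field(default_factory=list)
    # Data array (or scalar); its shape follows the domain axes.
    data: Any = None
    # Unordered collections, keyed by an arbitrary identifier.
    dimension_coordinates: Dict[str, Any] = field(default_factory=dict)
    auxiliary_coordinates: Dict[str, Any] = field(default_factory=dict)
    cell_measures: Dict[str, Any] = field(default_factory=dict)
    # A single cell methods construct, referring to domain axes by name.
    cell_methods: Optional[Any] = None
    transforms: Dict[str, Any] = field(default_factory=dict)
    # Properties: metadata that do not refer to the domain axes.
    properties: Dict[str, Any] = field(default_factory=dict)
    # Other fields that provide metadata (ancillary_variables in CF-netCDF).
    ancillary_fields: List["FieldConstruct"] = field(default_factory=list)


temperature = FieldConstruct(
    domain_axes=[DomainAxis("time", 1), DomainAxis("lat", 3)],
    data=[280.0, 285.0, 290.0],  # the size-one time axis is omitted from the data
    properties={"standard_name": "air_temperature", "units": "K"},
)
print(len(temperature.domain_axes))
```

Every component has an empty default, reflecting that all of them are optional in this version of the text.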

The CF-netCDF formula_terms (see also Transform constructs) and ancillary_variables attributes make links between field constructs. These links are fragile and it might not always be possible for data processing software to maintain a consistent set of such links when writing fields to files or manipulating them in memory.

CF-netCDF considers fields which are contained in single netCDF files. In a dataset contained in a single netCDF file, each data variable corresponds to one field construct. This data model has a broader scope. It applies also to data contained in memory and to datasets comprising several netCDF files. A field construct may span data variables in more than one file, for instance from different ranges of a time coordinate. Rules for aggregating data variables from several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software. Technically, data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified by averaging the field values over one dimension, it will not affect any other field construct.

Explicit tests of domain consistency will be required to establish whether two data variables have the same coordinates or share a subset of these coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. Within a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.

Best wishes

Jonathan

comment:28 follow-ups: Changed 3 years ago by jonathan

Dear all

Here are some comments and a new proposal for the text about the coordinate constructs.

Regarding the definition of "cell", I have the same instinct as Mark about what this means, but if Martin finds additional description helpful as clarification, then I imagine it's likely to be useful to others as well if we include it.

Regarding the statement of the purpose of an auxiliary coord construct, I think that "An auxiliary coordinate construct provides information for interpreting one or more domain axes", while correct, is not informative enough. I would now propose "An auxiliary coordinate construct provides physical coordinates to locate the cells along one or more domain axes." That is, the information it provides is physical coordinates which locate the cells. The fact that it's an ordered list of domain axes can be postponed to the description of the data array.

I have to say that I don't find Mark's restatement of purposes in comment 24 makes things clearer for me. I agree that one has to read the description of both the field and the coordinates to get the ideas, but since in total it's not very long, that shouldn't be an obstacle. I also do not agree that ordered versus unordered is the principal distinction between dimension and auxiliary coordinate constructs. Dimension coordinates are ordered in CF, I think, because monotonicity makes uniqueness easier to enforce, and because it's useful for doing calculations with the coordinate variable. In my view the principal distinction is that dimension coordinates uniquely locate the cells (or points) along the domain axes, while auxiliary coordinates supply information to locate the cells in optional alternative ways. Auxiliary coordinates can be and often are also monotonic.

Regarding the requirement for latitude and longitude auxiliary coordinates, I would argue that since it's in CF-netCDF and not something which is a feature specifically of the netCDF file format (like, for instance, the Coordinates attribute is), it must be part of the CF data model. This is the complement of the argument whereby we are omitting features from the CF data model if they are not currently in CF-netCDF, even though they may be logically obvious. This requirement for auxiliary coordinates could subsequently be dropped at a later version of CF-netCDF, and then we would also remove it from the CF data model.

Here's a new proposed version of the text for dimension and auxiliary coordinate constructs, bearing in mind all the above. I have changed the terminology for bounds to call them "boundary arrays" instead of "boundary coordinate arrays", because that allows us to make a convenient distinction between "coordinate arrays" and "boundary arrays". We also need text for domain axis constructs, not previously debated, so I've included that too.

Domain axis construct

A domain axis construct declares a dimension of the field. It must contain

  • A size, which is an integer that must be greater than zero, but could be equal to one.

Dimension coordinate construct

A dimension coordinate construct provides physical coordinates to locate the cells at unique positions along a single domain axis.

A dimension coordinate construct may contain:

  • A one-dimensional numerical coordinate array of the size specified for the domain axis. If the size is one, the single coordinate value may be a scalar instead of an array. If the size is greater than one, the elements of the coordinate array must all be of the same numeric data type, they must all have different non-missing values, and they must be monotonically increasing or decreasing. Dimension coordinate constructs cannot have string-valued coordinates.
  • A two-dimensional numerical boundary array, whose slow-varying dimension (first in CDL, second in Fortran) equals the size specified by the domain axis construct, and whose fast-varying dimension is two, indicating the extent of the cell. For climatological time dimensions, the bounds are interpreted in a special way indicated by the cell methods. Sometimes the bounds are the important information for locating the cell, and the coordinates are notional, especially for extensive quantities.
  • Properties (in the same sense as for the field construct) serving to describe the coordinates.

In this data model, a CF-netCDF string-valued coordinate variable or string-valued scalar coordinate variable corresponds to an auxiliary coordinate construct (not a dimension coordinate construct), with a domain axis that is not associated with any dimension coordinate construct.

In this data model we permit a domain axis construct not to have a dimension coordinate construct if there is no appropriate numeric monotonic coordinate. That is the case for a dimension that runs over ocean basins or area types, for example, or for a domain axis that indexes timeseries at scattered points. Such domain axes do not correspond to a continuous physical quantity. (They will be called index dimensions in CF version 1.6.)
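As an illustration only, the constraints stated above for the coordinate array of a dimension coordinate construct can be sketched as a small Python check. The function name `is_valid_dimension_coordinate` and the use of `None` to stand for a missing value are assumptions made for this sketch; they are not defined by CF.

```python
from typing import Optional, Sequence


def is_valid_dimension_coordinate(values: Sequence[Optional[float]], axis_size: int) -> bool:
    """Check a candidate coordinate array against the rules described in the text."""
    # The array must have the size specified by the domain axis (greater than zero).
    if axis_size < 1 or len(values) != axis_size:
        return False
    # Values must be numeric; strings and missing values (None, by assumption) are rejected.
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False
    # A single value is trivially acceptable.
    if axis_size == 1:
        return True
    # Strict monotonicity also guarantees that all values are distinct.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing
```

Under these assumptions a repeated value or a change of direction fails the check, which is why a domain axis indexing ocean basins or scattered stations would be left without a dimension coordinate construct.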

Auxiliary coordinate construct

An auxiliary coordinate construct provides physical coordinates to locate the cells along one or more domain axes. An auxiliary coordinate construct must contain:

  • A coordinate array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. If all domain axes are of size one, the single coordinate value may be a scalar instead of an array. If the array has more than one element, they must all be of the same data type (numeric, character or string), but they do not have to be distinct or monotonic. Missing values are not allowed (in CF version 1.5). In CF-netCDF, a string-valued auxiliary coordinate construct can be stored either as a character array with an additional dimension (last dimension in CDL) for maximum string length, or represented by a numerical auxiliary coordinate variable with a flag_meanings attribute to supply the translation to strings.

and may also contain

  • A boundary array with all the dimensions, in the same order, as the coordinate array, and an additional dimension (following the coordinate array dimensions in CDL, preceding them in Fortran) equal to the number of vertices of each cell.
  • Properties (in the same sense as for the field construct) serving to describe the coordinates.

Auxiliary coordinate constructs correspond to auxiliary coordinate variables named by the coordinates attribute of a data variable in a CF-netCDF file. CF requires there to be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude.
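To make the shape rules above concrete, here is a minimal Python sketch. It lists the array shapes permitted when any domain axes of size one may be omitted, and it forms the boundary array shape in CDL order. Both function names are hypothetical and exist only for this illustration.

```python
from itertools import product
from typing import Sequence, Set, Tuple


def allowed_array_shapes(axis_sizes: Sequence[int]) -> Set[Tuple[int, ...]]:
    """Shapes allowed for an array spanning these axes, in the order listed."""
    # A size-one axis may be kept or omitted; any larger axis must be kept.
    options = [((n,), ()) if n == 1 else ((n,),) for n in axis_sizes]
    # Concatenate one choice per axis to build each candidate shape.
    return {sum(choice, ()) for choice in product(*options)}


def boundary_shape(coord_shape: Tuple[int, ...], n_vertices: int) -> Tuple[int, ...]:
    """Boundary array shape in CDL order: the vertex dimension comes last."""
    return tuple(coord_shape) + (n_vertices,)


print(allowed_array_shapes([1, 23, 712]))
print(boundary_shape((23, 712), 4))
```

If every axis has size one, the empty shape `()` is among the results, corresponding to the scalar case mentioned in the text.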

Cheers

Jonathan

comment:29 Changed 3 years ago by jonathan

Sorry, I've made a mistake in the Field construct. I forgot that we have agreed the data array is mandatory (in this version of the data model) since that is the case in CF 1.5. It should therefore begin:

Each field construct must contain:

  • An ordered list of zero or more domain axis constructs.
  • A data array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. (It is possible to omit domain axes of size one because their position in the order of domain axes makes no difference to the order of data elements in the array.) If there are no domain axes of greater size than one, the single datum may be a scalar instead of an array. If the data array has more than one element, they must all be of the same data type, which may be numeric, character or string.

and may optionally contain:

  • An unordered collection of dimension coordinate constructs.

etc.

Jonathan

comment:30 in reply to: ↑ 28 Changed 3 years ago by markh

Replying to jonathan:

I have to say that I don't find Mark's restatement of purposes in comment 24 makes things clearer for me. I agree that one has to read the description of both the field and the coordinates to get the ideas, but since in total it's not very long, that shouldn't be an obstacle. I also do not agree that ordered versus unordered is the principal distinction between dimension and auxiliary coordinate constructs. Dimension coordinates are ordered in CF, I think, because monotonicity makes uniqueness easier to enforce, and because it's useful for doing calculations with the coordinate variable. In my view the principal distinction is that dimension coordinates uniquely locate the cells (or points) along the domain axes, while auxiliary coordinates supply information to locate the cells in optional alternative ways. Auxiliary coordinates can be and often are also monotonic.

Hello Jonathan

On this point, the perspective I am trying to investigate is whether the typing of the construct is the important factor; I am currently of the view that it is not.

As such, I have suggested that the description of the role does not belong with the construct, rather with the Field which references it. A dimcoord/orderedcoord does not know what it does, it only knows what it is (a constrained (1d, strictly monotonic) coord/auxcoord/unorderedcoord).

The Field is responsible for the role of each of its coords. I think that, in the current text, the Field is too much of an unaware container of clever constructs; I think this perspective should be reversed, with the Field being a smart container of simple constructs. This makes it easier for more emergent properties to be developed over time.

So, using your terminology, I advocate that

Field construct

...

  • An unordered collection of dimension coordinate constructs.

is changed to say

Field construct ...

  • attributes:
    • dimension_coordinates
      • this attribute defines the domain axes of the Field
      • each domain_axis may reference 1 dimension coordinate construct, defining the meaning of this domain_axis
      • Each dimension coordinate construct provides physical coordinates to locate the cells at unique positions along a single domain axis of the Field.

Dimension coordinate construct

A dimension coordinate construct may contain:

  • A one-dimensional numerical coordinate array ....

This led me to suggest the change of name, to make it clear where the responsibility lies. The role is the important factor for me, I'm not wedded to the name change or the 'ordered' naming.

When applying this to a NetCDF file, the NetCDF file creates the Field's domain_axes (the NetCDF dimensions) and the dimension_coordinates attribute (the collection of coordinate variables) for a data variable. The dimcoord/orderedcoord construct is only the contents of the coordinate variable (points, bounds, standard_name) and nothing else: no contextual awareness.

This perspective is really valuable when working with CF fields dynamically (in models, post processing etc). I think it is consistent with CF for NetCDF and more powerful and flexible than the proposed description of roles. e.g.:

A monotonic coordinate may be explicitly placed in the Field's auxiliary_coordinates collection, then later be used to define a newly explicit domain axis as the Field grows in size and shape.

comment:31 follow-up: Changed 3 years ago by mgschultz

Dear all,

just a quick question as my eye caught the word "monotonic" for domain axis (if this has been discussed elsewhere, please forgive me): how about "circular coordinates"? If you cut out Europe from a global map of 0..359 degrees, many tools will put for example 350, 355, 0, 5, 10 degrees in the lon variable. Should the data model enforce -10, -5, 0, 5, 10 instead?

Cheers,

Martin

comment:32 in reply to: ↑ 31 Changed 3 years ago by markh

Replying to mgschultz:

just a quick question as my eye caught the word "monotonic" for domain axis (if this has been discussed elsewhere, please forgive me): how about "circular coordinates"? If you cut out Europe from a global map of 0..359 degrees, many tools will put for example 350, 355, 0, 5, 10 degrees in the lon variable. Should the data model enforce -10, -5, 0, 5, 10 instead?

Martin

Hi Martin, interesting question you raise.

My view, on first look at the spec, is that a strictly monotonic coordinate would have to be

-10, -5, 0, 5, 10

or

0, 5, 10, 350, 355

both of which seem sub-optimal to me.

I think there is a potential solution, which we could propose (separate from this ticket) which would be to enable the explicit definition of circular coordinates.

This could be done, for example, by providing two new attributes, circular (a boolean flag) and modulus, the number where the values 'wrap around', which could be used by any coordinate.

Perhaps I have missed that this capability already exists, maybe it is implicit for the longitude coordinate; I hope someone will correct me, if I have.

If not, and this functionality would be new, I suggest that we break the discussion out into a separate ticket and propose a change for CF 1.7.

mark

comment:33 Changed 3 years ago by stevehankin

The idea of an attribute "modulo=[length]" has been proposed in trac email discussions a number of times and is badly needed. The fact is that users have been utilizing such attributes for years (pre-dating CF) ... just have never found the energy to converge on a CF encoding. Here's an example in the documentation for the Ferret program: http://ferret.pmel.noaa.gov/Ferret/documentation/users-guide/Grids-Regions/REGIONS#_VPINDEXENTRY_731

However, the use of the modulo attribute does not negate the requirement that the numerical encoding of the axis be monotonic. It merely informs the application reading the coordinates that (for modulo=360) a coordinate of -180 is the same position as a coordinate of 540.
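As a rough Python sketch of the point above: a modulo value identifies equivalent positions, while the stored coordinate values must still be monotonic. The function names are hypothetical, and the modulo attribute is only a proposal in this discussion, not part of CF 1.5.

```python
from typing import List, Sequence


def same_position(a: float, b: float, modulo: float) -> bool:
    """True if two coordinate values denote the same position on a circular axis."""
    # Exact comparison is fine for integer-valued inputs; real data may need a tolerance.
    return (a - b) % modulo == 0


def unwrap_increasing(values: Sequence[float], modulo: float) -> List[float]:
    """Re-encode wrapped values as a strictly increasing, equivalent sequence."""
    result: List[float] = []
    for v in values:
        # Add whole multiples of modulo until the value exceeds its predecessor.
        while result and v <= result[-1]:
            v += modulo
        result.append(v)
    return result


print(same_position(-180, 540, 360))
print(unwrap_increasing([350, 355, 0, 5, 10], 360))
```

For the longitude example raised earlier in the thread, the wrapped values 350, 355, 0, 5, 10 can be re-encoded as 350, 355, 360, 365, 370. Modulo 360 this is position-for-position equivalent to -10, -5, 0, 5, 10, and both encodings are monotonic.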

comment:34 in reply to: ↑ 27 ; follow-up: Changed 3 years ago by markh

Replying to jonathan:

Regarding the point about omitting domain axes of size one, I wouldn't say it's an implementation detail. It's more of an explanatory note, so I've put it in brackets. Regarding the remarks about domain consistency, I think we should keep these because they arose from an earlier quite lengthy debate on the issue, in which the agreement was to spell it out like this.

  • An ordered list of domain axis constructs.
  • A data array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. (It is possible to omit domain axes of size one because their position in the order of domain axes makes no difference to the order of data elements in the array.) If there are no domain axes of greater size than one, the single datum may be a scalar instead of an array. If the data array has more than one element, they must all be of the same data type, which may be numeric, character or string.

Hello Jonathan

I think there are issues with this definition of domain axes and data for the Field.

The domain axes mediate the qualified association of field and coordinate. To deliver this effectively they must provide unambiguous mappings between coordinate arrays and the data array.

If the domain_axes are only identified by order and some may be omitted, then this is no longer secure.

The relationship between a domain axis and a data dimension must be explicit and unambiguous to enable the coordinate relationships to be defined correctly.

I think that we have 2 types of domain axis, the ones explicit in the Field's data array, and the ones which are not explicit in the data, which must be of length 1.

I do not think that the second type, domain axes not explicit in the Field's data array, should be ordered and I do not think that their number needs to be fixed for a given Field. The Field should be flexible here.

For this reason I propose that only the domain axes which are explicit in the Field's data array are declared by the field.

Additional to this the Field defines one potential domain axis of size 1. This provides the idea that the Field's conceptual dimensionality may be more than the dimensionality of the data array, whilst constraining the relationships to be consistent with the data array as defined.

I believe that this better represents how CF NetCDF files handle single valued coordinates not bound to NetCDF dimensions than the approach you have defined and I feel that this provides a more flexible approach for other implementations.

what do you think?

mark

comment:35 in reply to: ↑ 34 Changed 3 years ago by markh

Replying to markh:

For example, consider two instances of Fields which declare 6 domain axes

[1, 23, 1, 1, 1, 712]

and define a data array of shape

[23, 1, 712]

The dimension_coordinates each contain 6 coord instances, in the order of the domain axes these are:

forecast reference time (1), ensemble member (23), time (1), height (1), northing (1), easting (712)

The two fields differ in their data and the values of time and forecast reference time only. So, I should be able to combine the two fields into a single field, by inspecting their domains and sampling.

But, I do not know whether the time coordinate is referencing the domain axis defining the Field's second dimension, of size one, or if it defines a domain axis not explicit in the data array. As such I cannot decide whether the resultant Field's data array should have a shape of

[23, 2, 712]

or

[23, 2, 1, 712]

I feel this situation can arise too often if we declare the domain axes as 1 ordered list of sizes, then omit an arbitrary number of size 1 instances from the Field's data array.

I think my proposal addresses this issue.
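The ambiguity described in this example can be reproduced with a short Python sketch. It assumes, as the example does, that domain axes are identified only by their position and size; the function name is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def candidate_axis_mappings(axis_sizes: Sequence[int],
                            data_shape: Sequence[int]) -> List[Tuple[int, ...]]:
    """Ordered subsets of axis indices whose sizes reproduce the data shape."""
    n = len(data_shape)
    return [
        idx
        for idx in combinations(range(len(axis_sizes)), n)  # preserves axis order
        if tuple(axis_sizes[i] for i in idx) == tuple(data_shape)
    ]


# Axis order: forecast reference time, ensemble member, time, height, northing, easting
axis_sizes = [1, 23, 1, 1, 1, 712]
mappings = candidate_axis_mappings(axis_sizes, [23, 1, 712])
print(mappings)
```

Under that assumption three mappings fit: the size-one data dimension could belong to time, height or northing (indices 2, 3 or 4). Sizes and order alone cannot tell them apart.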

comment:36 follow-ups: Changed 3 years ago by davidhassell

Hello Mark,

I think I see your point about the ordering, but perhaps for different reasons?

The domain axes don't need to be ordered, because they have identities other than their size (independently of any associated coordinates). The field then needs, just like a coordinate or cell measure, to have an ordered list of domain axes which match up with its array's dimensions.

Perhaps we could say that a field contains:

  • An (unordered) set of zero or more domain axis constructs.
  • A data array whose shape is determined by an ordered subset of the domain axes. Only domain axes of size 1 may be omitted. If no domain axes are given then the data array is a scalar. Domain axes of size one can be omitted because their position in the order of domain axes makes no difference to the order of data elements in the array. The elements of the data array must all be of the same data type, which may be numeric, character or string.
  • etc.

All the best,

David

comment:37 in reply to: ↑ 36 ; follow-up: Changed 3 years ago by markh

Replying to davidhassell:

The domain axes don't need to be ordered, because they have identities other than their size (independently of any associated coordinates).

I thought this was explicitly not the case. I do not see anything in the discussions about the domain axis that provides an identity. Please may you describe further what this identity is?

thank you mark

comment:38 in reply to: ↑ 37 ; follow-up: Changed 3 years ago by davidhassell

Hello Mark,

The domain axes don't need to be ordered, because they have identities other than their size (independently of any associated coordinates).

I thought this was explicitly not the case. I do not see anything in the discussions about the domain axis that provides an identity. Please may you describe further what this identity is?

Domain axis identity was discussed a bit in ticket #68, but looking back at that, perhaps the type of domain axis identity I meant was not clear. I'll try to explain.

A domain axis has an identity in the model by virtue of the fact that it is a construct independent of other similar constructs. It doesn't have an identity in the same way that a coordinate has a standard name.

By way of simple analogy, in a computer program you might say something like:

x=2
y=3

Here, both variables x and y are integers, but they've got different names and you could use those variable names to declare an array, for example:

a=Array_of_Floats(y, x)

I hope that makes sense,

All the best,

David

comment:39 in reply to: ↑ 38 Changed 3 years ago by markh

Replying to davidhassell:

A domain axis has an identity in the model by virtue of the fact that it is a construct independent of other similar constructs. It doesn't have an identity in the same way that a coordinate has a standard name.

Hi David

I understand your point of view now, thank you, but I'm afraid I don't agree with your implications.

I think our scope for this ticket is to define how the semantics of the data model provide clear relations between components to enable an implementation to provide such functionality however it sees fit.

Whilst a programming language's internal referencing system may exist, I do not think any aspect of the data model can be dependent on such implementation details. The model is an abstract, self consistent entity; the relationships need to be based on semantics (imho).

As such I don't think we can use 'domain axis identity' as you describe it.

comment:40 follow-up: Changed 3 years ago by jonathan

Dear Mark

This appears to be a repetition of a debate we have already had, and I thought we had settled. Back in my summary in comment 6 I wrote

Whether dimensions have an independent identity in the field. Could we call a 1D axis a domain axis? Our CF data model would then distinguish domain axis construct (corresponding to netCDF dimension and the 1D coordinate variable of the same name, if there is one) and auxiliary coordinate construct (corresponding to a CF-netCDF auxiliary coordinate variable).

Bryan supported the concept of domain axis construct. Mark expressed concerns about this point. Following subsequent useful discussions with Mark, we have changed our document. We now have a domain axis construct, whose only function is to indicate the existence of a dimension of the domain, and provide the size of the dimension. That means the dimension coordinate construct (corresponding to the Unidata 1D netCDF coordinate variable) and the auxiliary coordinate construct (corresponding to the CF-netCDF 1D or multi-dimensional auxiliary coordinate variable) are more similar than we had before. This feels like an improvement. Both types of coordinate construct refer to the domain axes. The field data array also refers to the domain axes for its dimensions, but it does not have to include dimensions of size one. This is because CF-netCDF allows data variables to omit dimensions of size one, by using scalar coordinate variables. In previous discussions, I don't think anyone else has been concerned about this treatment. I wonder what you think now about our current version, Mark?

In comment 10, you wrote

As yet I do not see the benefit delivered by having the domain axis as a separate construct, referenced by the Field. However, I think the risk of this causing serious problem can be mitigated by the constraints I have detailed, binding domain_axis collections to the Field data properties and mediating the shape constraints between Fields and other constructs.

I took that as acquiescence, although without enthusiasm, by you. :-) But I have to confess that I do not understand the meaning of "binding domain_axis collections to the Field data properties and mediating the shape constraints between Fields and other constructs." Perhaps that's what I've got to understand now!

Anyway, in the most recent postings I agree with David. I think that the domain axis construct in the CF data model corresponds to the dimension in a CF-netCDF file. It indicates that a certain dimension exists and has a particular size. Just as the dimension has an identity in the CF-netCDF file, so the domain axis construct has an identity in the abstract representation of the data model. As you say, the particular way an identity is implemented depends on the language or file format (a dimension is identified by a name in CDL, but more fundamentally by an integer in a netCDF file, for instance), but that is not essential to the data model.

Because the axis constructs each have a unique identity, the field construct knows which of the axes the data array is dimensioned with. They are not identified by their order or size, but by who they are. In the same way, Mark Hedley, David Hassell and Jonathan Gregory each have unique identities. Two of them are employed by the Met Office, two of them are employed by the University of Reading, and two of them are professional software designers, but despite these ambiguities they are distinguishable entities. In the same way, a netCDF data variable might have two dimensions which are the same size, but they are distinguishable because they have different identities. float data(x) is not the same as float data(y) even if x and y both have size 10.

The axis constructs which are used to dimension a field data array or an auxiliary coordinate array are in a particular order because this order determines the order of the elements of the array. It is safe to omit axis constructs of size one because they make no difference to the order of the elements of the array. Omitting them doesn't mean the field or the coordinate construct is not aware of them.
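One way to picture axes identified "by who they are" is the following Python sketch, in which axis objects compare by identity rather than by size. The class and function names are assumptions for illustration; the data model deliberately leaves the identity mechanism to the implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(eq=False)  # eq=False keeps identity-based equality: equal size is not equal axis
class DomainAxis:
    size: int


def data_shape(data_axes: List[DomainAxis]) -> Tuple[int, ...]:
    """Shape of an array dimensioned by an ordered list of axes."""
    return tuple(axis.size for axis in data_axes)


def omitted_axes_are_size_one(all_axes: List[DomainAxis], data_axes: List[DomainAxis]) -> bool:
    """Only axes of size one may be left out of the data array's dimensions."""
    return all(
        axis.size == 1
        for axis in all_axes
        if not any(axis is used for used in data_axes)
    )


# Two axes of the same size remain distinguishable.
x, y = DomainAxis(10), DomainAxis(10)
print(x == y)

# The six-axis example from earlier in the thread, with the data array using three axes.
frt, member, time, height, northing, easting = (DomainAxis(n) for n in (1, 23, 1, 1, 1, 712))
all_axes = [frt, member, time, height, northing, easting]
data_axes = [member, northing, easting]
print(data_shape(data_axes), omitted_axes_are_size_one(all_axes, data_axes))
```

Because the data array here refers to `northing` itself rather than to "some axis of size one", the field knows which axis its second dimension is.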

You made another point earlier, that "the Field is too much of an unaware container of clever constructs; I think this perspective should be reversed, with the Field being a smart container of simple constructs." That was nicely put! I don't feel strongly about this, myself. I think you have to read the whole document (both the field and its components) to comprehend the data model, so in a sense it doesn't make any difference where the logical function of each component is stated. The formal description is unaffected by this. However it might be more intelligible to describe the logical functions where the components are first introduced i.e. in the field description, as you say.

Best wishes

Jonathan

comment:41 Changed 3 years ago by mgschultz

Dear all,

can't this issue be resolved simply by stating explicitly that the relation between the field and the domain axes depends on the software that is used to implement the data model? It seems that we all agree that there is and must be a "photo id". How this id is established (via a person at passport control or some automated scanner) is up to the implementation.

Cheers,

Martin

comment:42 follow-up: Changed 3 years ago by markh

Replying to mgschultz:

Dear all,

can't this issue be resolved simply by stating explicitly that the relation between the field and the domain axes depends on the software that is used to implement the data model? It seems that we all agree that there is and must be a "photo id". How this id is established (via a person at passport control or some automated scanner) is up to the implementation.

Cheers,

Martin

Hello Martin

I think you may well be right, that we need to agree at such a level of detail.

I seem to be struggling to illustrate my concerns effectively.

I think the domain_axis and coordinate model works very nicely for points arrays > length 1.

All my concern is around coordinates of length one and preserving the flexibility of interpretation of these.

I work with a lot of data stored as 2D datasets which I aggregate based on metadata. How these datasets aggregate is interpreted by the metadata, the 'shape', 'ordering' etc emerges from the collection of 2D Fields.

I want the ability to add a single valued coordinate to my Field without defining a domain_axis at any time.

Additionally I want the ability to evaluate two or more Fields with single valued coordinates, establish that they share coordinate definitions and that the values are different. At this point I will create a new domain_axis on one Field, a new data dimension, and bind my new two-valued coordinate to this data dimension, aggregating my two Fields into one.

By stating that the domain_axes are all defined by the Field, even if some of them are not evident as data dimensions, I feel I have lost this ability to see domain_axes emerge from my data as I aggregate.

This has driven my suggestions to treat domain_axes not bound to data dimensions differently from bound domain_axes.

May I have some assurance that the principle of what I am doing is consistent with interpretations of the design intent and that we are not mandating that 'all domain axes must be declared up front'?

I do not think we should be giving the impression of limiting the degrees of freedom of our Fields in this way.

The flexibility that CF NetCDF provides for size one coordinates is a really useful feature that I am worried we are designing away in the model.

Perhaps Martin's suggestion of a simpler, less prescriptive wording allows us to preserve this behaviour.

mark

comment:43 in reply to: ↑ 42 Changed 3 years ago by jonathan

Dear Mark

Actually I think we may already have the same perspective. You write

I want the ability to add a single valued coordinate to my Field without defining a domain_axis at any time. ... The flexibility that CF NetCDF provides for size one coordinates is a really useful feature that I am worried we are designing away in the model.

My understanding is that size-one domain axis constructs preserve this flexibility. This is precisely because, in the CF-netCDF standard, scalar coordinate variables and size-one coordinate variables are regarded as equivalent, which is the flexibility you value. Described in terms of the CF data model, adding a new size-one coordinate to a CF-netCDF data variable means (i) creating a domain axis construct of size one, (ii) creating a dimension coordinate construct having this domain axis (and therefore also of size one), (iii) adding the new size-one domain axis construct to the list of axes which dimension the data array of the field. Adding a new scalar coordinate to a CF-netCDF data variable is described in the data model by exactly the same steps (i) and (ii), but step (iii) is not taken: the dimensions of the data array are unchanged in the data model representation, just as in the CF-netCDF representation.
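Steps (i) to (iii) can be sketched in Python as follows. This is an illustration under assumed names (`DomainAxis`, `Field`, `add_single_valued_coordinate`), not an agreed implementation. Appending the new axis at the end of the data dimensions is an arbitrary choice, since the position of a size-one axis does not affect element order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(eq=False)  # axes are distinguished by identity, not by size
class DomainAxis:
    size: int


@dataclass
class Field:
    axes: List[DomainAxis] = field(default_factory=list)       # all domain axes
    data_axes: List[DomainAxis] = field(default_factory=list)  # axes dimensioning the data
    dim_coords: Dict[str, Tuple[DomainAxis, list]] = field(default_factory=dict)

    def data_shape(self) -> Tuple[int, ...]:
        return tuple(axis.size for axis in self.data_axes)


def add_single_valued_coordinate(f: Field, name: str, value: float,
                                 as_data_dimension: bool) -> DomainAxis:
    axis = DomainAxis(size=1)             # step (i): new size-one domain axis
    f.axes.append(axis)
    f.dim_coords[name] = (axis, [value])  # step (ii): dimension coordinate on that axis
    if as_data_dimension:                 # step (iii): only for a size-one coordinate variable
        f.data_axes.append(axis)
    return axis


lat, lon = DomainAxis(73), DomainAxis(96)
f = Field(axes=[lat, lon], data_axes=[lat, lon])
add_single_valued_coordinate(f, "height", 1.5, as_data_dimension=False)  # scalar coordinate
print(f.data_shape(), len(f.axes))
```

In both cases the field gains a domain axis and a coordinate. Only the data array's list of dimensions differs, which mirrors the equivalence of scalar and size-one coordinate variables in CF-netCDF.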

Additionally I want the ability to evaluate two or more Fields with single valued coordinates, establish that they share coordinate definitions and that the values are different. At this point I will create a new domain_axis on one Field, a new data dimension, and bind my new two-valued coordinate to this data dimension, aggregating my two Fields into one.

In order to aggregate the two fields, they must first both have domain axis constructs for the appropriate quantity, e.g. indicated by standard name. For example, one of them might be at 1.5 m height, the other at 10 m height. However, in CF-netCDF terms, one of them could be a scalar coordinate variable and the other a size-one dimension coordinate variable, since these are logically equivalent. The process of aggregation combines the size-one domain axis constructs of the separate fields into a domain axis of size two of the aggregated field.

By stating that the domain_axes are all defined by the Field, even if some of them are not evident as data dimensions, I feel I have lost this ability to see domain_axes emerge from my data as I aggregate.

In my understanding, the existence of the size-one axis constructs of the two fields is what enables the aggregation to happen.
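A minimal Python sketch of such an aggregation is given below, using plain dictionaries to stand in for fields. The dictionary layout and the function name `aggregate` are assumptions made for this illustration. Aggregation rules are not defined by CF 1.5, so this shows only one possible reading.

```python
from typing import Any, Dict


def aggregate(field_a: Dict[str, Any], field_b: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Combine two fields along the size-one axis whose coordinate is `name`."""
    # Each field must already have exactly one value for this coordinate.
    (a_val,) = field_a["coords"][name]
    (b_val,) = field_b["coords"][name]
    if a_val == b_val:
        raise ValueError("coordinate values must differ to form an axis of size two")
    # All other coordinates must agree for the two domains to be consistent.
    others_a = {k: v for k, v in field_a["coords"].items() if k != name}
    others_b = {k: v for k, v in field_b["coords"].items() if k != name}
    if others_a != others_b:
        raise ValueError("fields differ in coordinates other than " + name)
    # Sort by coordinate value so the new axis of size two is monotonic.
    pairs = sorted([(a_val, field_a["data"]), (b_val, field_b["data"])], key=lambda p: p[0])
    return {
        "coords": {**others_a, name: [p[0] for p in pairs]},
        "data": [p[1] for p in pairs],  # new leading dimension of size two
    }


a = {"coords": {"height": [10.0], "latitude": [0.0, 1.0]}, "data": [280.0, 281.0]}
b = {"coords": {"height": [1.5], "latitude": [0.0, 1.0]}, "data": [285.0, 286.0]}
combined = aggregate(a, b, "height")
print(combined["coords"]["height"])
```

The sketch works only because both inputs already carry a size-one height coordinate, which is the point made above about size-one axis constructs enabling aggregation.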

Is this the same process as you describe, but in different words?

Best wishes

Jonathan

comment:44 in reply to: ↑ 40 Changed 3 years ago by markh

Replying to jonathan:

You made another point earlier, that "the Field is too much of an unaware container of clever constructs; I think this perspective should be reversed, with the Field being a smart container of simple constructs." That was nicely put! I don't feel strongly about this, myself. I think you have to read the whole document (both the field and its components) to comprehend the data model, so in a sense it doesn't make any difference where the logical function of each component is stated. The formal description is unaffected by this. However it might be more intelligible to describe the logical functions where the components are first introduced i.e. in the field description, as you say.

I think this will be a helpful approach. It makes it clear that a Field defines the role of its coordinate instances.

I have updated the draft to try and reflect the comments and text to date, while maintaining consistency. I have put in italics all known (to me) areas of uncertainty.

I hope this reflects the views and textual suggestions to date. Please pick out all issues and raise them here (as usual)

comment:45 follow-up: Changed 3 years ago by markh

Replying to jonathan:

...

Is this the same process as you describe, but in different words?

Hello Jonathan

I am hoping so, but I'm still not sure. Perhaps I can try to take these terms and cast a practical example, to see whether this fits with your perspective?

Consider defining a Field with a 2D array of data, with these data dimensions defined by DomainAxes and DimCoords latitude and longitude respectively. My intention is to take a large number of these and run them through an aggregator; the important factor here is defining what I have; the aggregator is out of scope, imo.

I have a list of single valued metadata elements which describe my data:

height_level, time, forecast_reference_time, forecast_period, ensemble_member, source, institute

I need to define appropriate domain_axes and coordinates for these.

  • source and institute
    • textual data coordinates
    • must be AuxCoords (data type)
    • no bound domain_axis required
      • but these may form an aggregating factor at a later date
  • height_level and time
    • numerical data coordinates
    • I want to define their order within the data array
      • I expect these to be aggregated and I want the order correct
    • create as DimCoords
    • domain_axes are created, at dimensions 2 and 3 respectively
    • these do not exist in the Field's data array (which is 2D)
  • forecast_reference_time, forecast_period, ensemble_member
    • numerical data coordinates
    • I do not want to define their order within the data array
      • these may or may not be aggregated over
      • if they are aggregated, the ordering of dimensions is not important
      • any one may never be aggregated, and may never define a domain_axis
    • no domain_axes are created
    • DimCoord instances are created and placed in the Field's auxiliary_coords container

The times are a crucial factor here, as I have 3 time coordinates but only 2 degrees of freedom. I never want 3 time-related domain_axes; an aggregation operation should return 2 time-related domain_axes and one 2D time coordinate (otherwise my data is faulty). Which of forecast_period or forecast_reference_time becomes a DomainAxis and DimCoord is evaluated on inspection; it is data dependent.

So, I have a Field with

a 2D array of data values (Iy, Ix)

4 domain_axes, ordered (1, 1, Iy, Ix)

a dimensions_coords container, with:

DimCoords: time, height_level, latitude, longitude

an auxiliary_coords container, with:

DimCoords: forecast_reference_time, forecast_period, ensemble_member

AuxCoords: source, institute

Does such a Field instance conform to your view of the CF data model specification?

comment:46 in reply to: ↑ 45 ; follow-up: Changed 3 years ago by davidhassell

Replying to markh:

Dear Mark,

Does such a Field instance conform to your view of the CF data model specification?

I'm afraid not! I think the field you describe has:

5 unordered domain axes ({size}): h{1}, t{1}, x{96}, e{1}, y{73}

a 2D array of data(y, x)

4 dimension coordinates:

time(t), height_level(h), latitude(y), longitude(x)

5 auxiliary coordinates:

forecast_reference_time(t), forecast_period(t), ensemble_member(e), source(e), institute(e)

In my field, domain axis e has no dimension coordinate and ensemble_member is an auxiliary coordinate. This is because I am presuming that, even though it's numeric, the coordinate's value seems arbitrary to me. However, I could just as easily have made ensemble_member one of 5 dimension coordinates and had only 4 auxiliary coordinates.
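
The field described above can be written down as plain data structures. This is only a sketch in ordinary Python containers; the variable names are assumptions for illustration and do not come from any CF library.

```python
# Domain axes: identifier -> size. A dict is used because the collection is unordered.
domain_axes = {"h": 1, "t": 1, "x": 96, "e": 1, "y": 73}

# The ordered subset of axes that dimensions the data array: data(y, x)
data_axes = ["y", "x"]

# Each coordinate construct records which domain axis it spans.
dimension_coords = {
    "time": "t",
    "height_level": "h",
    "latitude": "y",
    "longitude": "x",
}
auxiliary_coords = {
    "forecast_reference_time": "t",
    "forecast_period": "t",
    "ensemble_member": "e",
    "source": "e",
    "institute": "e",
}

data_shape = tuple(domain_axes[a] for a in data_axes)
axes_without_dim_coord = set(domain_axes) - set(dimension_coords.values())
```

Here `axes_without_dim_coord` contains only `e`, matching the statement that domain axis e has no dimension coordinate.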

All the best,

David

comment:47 in reply to: ↑ 46 ; follow-up: Changed 3 years ago by markh

Replying to davidhassell: Hello David

In your illustration do you also have some sort of construct which maps your DomainAxis instances to the data array?

If so, what is the nature of this construct?

Additionally you have named each DomainAxis (h, t, x, e, y). Is this just for illustration, an attribute of each DomainAxis instance, or an identifier defined by the field to manage its domain_axes collection?

thank you

mark

comment:48 in reply to: ↑ 47 ; follow-up: Changed 3 years ago by davidhassell

Replying to markh:

Replying to davidhassell:

Hello Mark,

In your illustration do you also have some sort of construct which maps your DomainAxis instances to the data array?

If so, what is the nature of this construct?

No construct. The field just has an ordered subset of the domain axes which correspond to the shape of its data array, just like a dimension coordinate, auxiliary coordinate and cell measure construct (see also comment 37).

Additionally you have named each DomainAxis (h, t, x, e, y). Is this just for illustration, an attribute of each DomainAxis instance, or an identifier defined by the field to manage its domain_axes collection?

They should be seen as identifiers defined by the field to manage its domain_axes collection.

All the best,

David

comment:49 in reply to: ↑ 48 ; follow-up: Changed 3 years ago by markh

Replying to davidhassell:

Hello David

I am struggling more with the proposal of an unordered collection of DomainAxis instances than I have with an ordered one. I thought one of the objectives was to define the Field's data array; without domain axis ordering this seems not to be delivered.

I think if we are to define the DomainAxis construct the Field must define an ordered list of them as its domain_axes.

The instance of a Field from my example which you have described does not deliver on some key aspects I require, I will try to explain how:

  • the ordering of the dimensions defined by the height_level and time coordinates is not defined;
    • my Fields will become extensive in height_level and time, during aggregation; how would my aggregator know which way round to order these dimensions in the data array?
  • you have defined forecast_period and forecast_reference_time as functions of t, suggesting a link to the time coordinate.
    • one of these two coordinates is independent of time, while the other is a function of both time and the previously mentioned coordinate independent of time.
      • which is which is either decided by inspection of the data during aggregation or by user preference stated to the aggregator
  • you have defined source and institute as functions of the domain_axis defined by an ensemble_member coordinate:
    • this is explicitly not the case, these coordinates are not stated to vary with any other coordinates

Does your interpretation of the model allow for a coordinate which has only 1 data value and is not a function of any DomainAxis?

Also, when comparing Fields:

  • do the domain axis identifiers form part of the comparison criteria?
  • do the domain axes a coordinate is a function of form part of the comparison criteria?

thank you mark

comment:50 in reply to: ↑ 49 Changed 3 years ago by davidhassell

Replying to markh:

Dear Mark,

I am struggling more with the proposal of an unordered collection of DomainAxis instances than I have with an ordered one. I thought one of the objectives was to define the Field's data array; without domain axis ordering this seems not to be delivered.

I think if we are to define the DomainAxis construct the Field must define an ordered list of them as its domain_axes.

I'm sorry to hear that! The field domain's domain axes are unordered. There can be no order to a domain's domain axes (what would it mean if there were?). Those that relate to each dimension of the field's data array do, however, comprise an ordered list which the field knows about. This is the same treatment that auxiliary coordinate and cell measure constructs get (i.e. they know which domain axes relate to the shape of their data arrays), which is nice.

The instance of a Field from my example which you have described does not deliver on some key aspects I require, I will try to explain how:

  • the ordering of the dimensions defined by the height_level and time coordinates is not defined;
  • my Fields will become extensive in height_level and time, during aggregation; how would my aggregator know which way round to order these dimensions in the data array?

This is beyond the scope of the data model. The order of dimensions in your aggregated field's data array is arbitrary and entirely at the whim of your aggregation software.

  • you have defined forecast_period and forecast_reference_time as functions of t, suggesting a link to the time coordinate.
  • one of these two coordinates is independent of time, while the other is a function of both time and the previously mentioned coordinate independent of time.

I'm afraid I don't understand. In your view of this example, you have only two size 1 domain axes (for height and time) and have only two auxiliary coordinates (source and institute). Where is this 2-d auxiliary coordinate, and what is the "previously mentioned coordinate independent of time"? I don't find your dimension/auxiliary containers useful, particularly when the auxiliary container contains dimension coordinates.

  • which is which is either decided by inspection of the data during aggregation or by user preference stated to the aggregator

How can a field's coordinates' domain axes not be defined within the field?

  • you have defined source and institute as functions of the domain_axis defined by an ensemble_member coordinate:
  • this is explicitly not the case, these coordinates are not stated to vary with any other coordinates

OK. In this case we need two further, size 1 domain axes for source and institute:

7 unordered domain axes ({size}): h{1}, t{1}, x{96}, e{1}, y{73}, s{1}, i{1}

and the source and institute auxiliary coordinates span the s and i dimensions respectively (as opposed to both spanning e, as I had before).

Does your interpretation of the model allow for a coordinate which has only 1 data value and is not a function of any DomainAxis?

Definitely not. Note that a CF-netCDF scalar coordinate has an implied domain axis, just not one that is encoded in a netCDF file.

Also, when comparing Fields:

  • do the domain axis identifiers form part of the comparison criteria?

No.

  • do the domain axes a coordinate is a function of form part of the comparison criteria?

Yes. How else can you ascertain if two constructs are equal or equivalent? The domain axes' identifiers are not used, but their physical meanings (ascertained from coordinates which span them) are used.
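
A rough sketch of such a comparison, assuming a hypothetical dictionary-based field in which each coordinate name maps to the identifier of the axis it spans. The function `domain_signature` is illustrative only; a real comparison would also need to look at coordinate values, units and other properties.

```python
def domain_signature(field):
    """Describe each domain axis by its size and the coordinates spanning it."""
    signature = []
    for axis_id, size in field["domain_axes"].items():
        names = sorted(name for name, axis in field["coords"].items()
                       if axis == axis_id)
        signature.append((size, tuple(names)))
    # Sorting discards both the axis identifiers and any storage order.
    return sorted(signature)


a = {"domain_axes": {"t": 1, "y": 73, "x": 96},
     "coords": {"time": "t", "latitude": "y", "longitude": "x"}}
# Same physical domain as `a`, but with different axis identifiers.
b = {"domain_axes": {"dim2": 96, "dim0": 1, "dim1": 73},
     "coords": {"longitude": "dim2", "time": "dim0", "latitude": "dim1"}}
# Same identifiers as `a`, but the size-one axis has a different physical meaning.
c = {"domain_axes": {"t": 1, "y": 73, "x": 96},
     "coords": {"height": "t", "latitude": "y", "longitude": "x"}}
```

With these inputs, `a` and `b` have equal signatures even though their identifiers differ, while `a` and `c` do not.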

As an empirical aside, the proposed fully general aggregation rules (#78) are entirely defined on the data model as I see it, and they work just fine in the cf-python library (as do field combination with broadcasting and field comparison). I'll shortly be posting more about this on the general list.

All the best,

David

comment:51 Changed 3 years ago by jonathan

Dear Mark and David

This debate seems to be rather confused. I am not quite sure what we are arguing about or whether there needs to be an argument!

One particular point of confusion is about the time coordinates in Mark's example, I suspect. I think Mark means that forecast_period = time - forecast_reference_time, so only two of these are independent. This means there are not three domain axes of time, but it's not obvious from the description of the example whether there is one or two.
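
To make the relationship concrete, here is a small worked example using Python's standard `datetime` module. The dates are arbitrary illustrative values.

```python
from datetime import datetime, timedelta

forecast_reference_time = datetime(2012, 1, 1, 0, 0)  # arbitrary example value
time = datetime(2012, 1, 1, 6, 0)                     # arbitrary example value

# forecast_period = time - forecast_reference_time
forecast_period = time - forecast_reference_time

# Any one of the three can be recovered from the other two,
# so only two of them are independent.
recovered_time = forecast_reference_time + forecast_period
```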

The lack of clarity perhaps arises because the description doesn't say what the fields look like in netCDF. I think Mark's intention could be represented in CF-netCDF like this:

  • source and institution are string-valued scalar coordinate variables.
  • height and time are size-one (Unidata) coordinate variables.
  • ensemble is a numeric scalar coordinate variable.
  • either forecast_reference_time or forecast_period is a size-one numeric (Unidata) coordinate variable, and the other is an auxiliary coordinate variable with the same dimension.
  • The data variable is dimensioned (time,height,latitude,longitude).

In the CF data model, this becomes a Field with

  • eight domain axis constructs: s, i, height, time, longitude, latitude, ensemble and either forecast_reference_time or forecast_period.
  • a data array dimensioned by the ordered list of domain axes [time, height, latitude, longitude] (or perhaps the other way round! - I'm not sure which convention we are using). It's 4D, not 2D, if the position of time and height are significant.
  • six numeric dimension coordinate constructs: height, time, longitude, latitude, ensemble and either forecast_reference_time or forecast_period, each dimensioned by the corresponding domain axis construct.
  • one numeric auxiliary coordinate construct: either forecast_period or forecast_reference_time, dimensioned by the same domain axis construct as whichever is the dimension coordinate construct.
  • two string-valued auxiliary coordinate constructs: source(s) and institution(i). These are not dimension coordinate constructs (source(source) and institute(institute)) because CF does not permit string-valued coordinate variables.

There are alternatives in the netCDF, which would imply corresponding differences in the CF data model description e.g.

  • source and institution could be string-valued auxiliary coordinate variables with the ensemble dimension, as David assumed. Then the domain axis constructs s and i would not exist, and the ensemble domain axis construct would dimension these auxiliary coordinate constructs.
  • forecast_reference_time and forecast_period could both be auxiliary coordinate variables with the time dimension, as David assumed. Then they would not have domain axis constructs; they would be dimensioned with the time axis construct.
  • The data variable could be dimensioned (latitude,longitude). Then the field data array would be 2D in the data model. There would be no other difference.

These differences would affect the aggregation, because aggregation is a process of concatenating dimensions, and therefore depends on which dimensions there are. After aggregation, the data array will have more than two dimensions of size greater than one. If the input data arrays had only latitude and longitude as dimensions, the position of the new dimensions in the order is not defined by the input, and hence would have to be decided by the aggregating software, or by instructions given to the aggregating software by the user. As has been said, the aggregation rules are not part of the CF data model.

I hope that helps a bit.

Best wishes

Jonathan

comment:52 in reply to: ↑ 28 ; follow-up: Changed 3 years ago by jonathan

Dear all

The text on Mark's wiki page isn't exactly the same as the text we have discussed in the ticket.

Here's the text for the field construct from comment 27, modified by comment 29 and comment 36, and the text for the domain axis and coordinate constructs from comment 28. Following Mark's suggestion about the field being a smart container of simple constructs, I have moved the statements of the logical functions of the components into the description of the field construct. The data array and ancillary fields did not have statements of purpose, so for the data array I have inserted "contains the data of the field", and for the ancillary fields I have inserted "contain metadata about the elements of the field's data array". In addition, for the dimension coordinate construct, I have inserted, "A dimension coordinate construct corresponds to a netCDF coordinate variable, whose name is the same as the name of its single dimension, or a CF-netCDF numeric scalar coordinate variable." Those are the only bits of text which haven't previously been exhibited in postings on this ticket.

I hope we can settle the ongoing discussion about axes soon, so that we can proceed to work on the text for cell methods, cell measures and transforms.

Cheers

Jonathan

Field construct

The central concept of the data model is a field construct. A field construct corresponds to exactly one data array together with associated information about the domain in which the data resides (defined by spatio-temporal and other coordinates) and other metadata. This data model makes a central assumption that each field construct is independent.

Each field construct must contain:

  • An unordered list of zero or more domain axis constructs. Each domain axis construct declares a dimension of the field.
  • A data array, which contains the data of the field. The shape of the data array is determined by an ordered subset of the domain axes. All domain axes of size greater than one must be included in the subset, but domain axes of size one may optionally be omitted, because their position in the order of domain axes makes no difference to the order of data elements in the array. If there are no domain axes of greater size than one, the single datum may be a scalar instead of an array. If the data array has more than one element, they must all be of the same data type, which may be numeric, character or string.

and may optionally contain:

  • An unordered collection of dimension coordinate constructs. Each dimension coordinate construct provides physical coordinates to locate the cells at unique positions along a single domain axis.
  • An unordered collection of auxiliary coordinate constructs. Each auxiliary coordinate construct provides physical coordinates to locate the cells along one or more domain axes.
  • Yet to be worked upon: An unordered collection of cell measure constructs.
  • Yet to be worked upon: A cell methods construct, which refers to the domain axes (but not their sizes).
  • Yet to be worked upon: An optional unordered collection of transform constructs (corresponding to CF-netCDF formula_terms and grid_mapping).
  • Other properties, which are metadata that do not refer to the domain axes, and serve to describe the data the field contains. Properties may be of any data type (numeric, character or string) and can be scalars or arrays. These properties correspond to attributes in the netCDF file, but we use the term "property" instead because not all CF-netCDF attributes are properties in this sense.
  • A list of ancillary fields, which contain metadata about the elements of the field's data array.

Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain.
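
The rule relating the shape of the data array to the domain axes, as stated in the list of required contents above, could be checked along the following lines. This is a sketch only; `check_data_axes` is a hypothetical helper, with the domain axes held as a mapping from identifier to size.

```python
def check_data_axes(domain_axes, data_axes):
    """domain_axes: identifier -> size; data_axes: ordered list of identifiers.

    Returns the implied shape of the data array, or raises ValueError.
    """
    if len(set(data_axes)) != len(data_axes):
        raise ValueError("a domain axis may dimension the data array only once")
    unknown = [a for a in data_axes if a not in domain_axes]
    if unknown:
        raise ValueError(f"data axes not declared by the field: {unknown}")
    # Axes of size one may be omitted; larger axes may not.
    omitted = [a for a, size in domain_axes.items()
               if size > 1 and a not in data_axes]
    if omitted:
        raise ValueError(f"axes of size greater than one must be included: {omitted}")
    return tuple(domain_axes[a] for a in data_axes)
```

For example, with axes `{"t": 1, "y": 73, "x": 96}` both `["y", "x"]` and `["t", "y", "x"]` are acceptable orderings, whereas `["y"]` is rejected because the axis of size 96 is missing.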

The CF-netCDF formula_terms (see also Transform constructs) and ancillary_variables attributes make links between field constructs. These links are fragile and it might not always be possible for data processing software to maintain a consistent set of such links when writing fields to files or manipulating them in memory.

CF-netCDF considers fields which are contained in single netCDF files. In a dataset contained in a single netCDF file, each data variable corresponds to one field construct. This data model has a broader scope. It applies also to data contained in memory and to datasets comprising several netCDF files. A field construct may span data variables in more than one file, for instance from different ranges of a time coordinate. Rules for aggregating data variables from several files into a single field construct are needed but are not defined by CF version 1.5; such rules are regarded as the concern of data processing software. Technically, data variables stored in CF-netCDF files are often not independent, because they share coordinate variables. However, we view this solely as a means of saving disk space, and we assume that software will be able to alter any field construct in memory without affecting other field constructs. For instance, if the coordinates of one field construct are modified by averaging the field values over one dimension, it will not affect any other field construct.

Explicit tests of domain consistency will be required to establish whether two data variables have the same coordinates or share a subset of these coordinates. Such tests are necessary in general if CF is applied to a dataset comprising more than one file, because different variables may then reside in different files, with their own coordinate variables. Within a netCDF file, tests for the equality of coordinates between different data variables may be simplified if the data variables refer to the same coordinate variable.

Domain axis construct

A domain axis construct must contain

  • A size, which is an integer that must be greater than zero, but could be equal to one.

Dimension coordinate construct

A dimension coordinate construct may contain:

  • A one-dimensional numerical coordinate array of the size specified for the domain axis. If the size is one, the single coordinate value may be a scalar instead of an array. If the size is greater than one, the elements of the coordinate array must all be of the same numeric data type, they must all have different non-missing values, and they must be monotonically increasing or decreasing. Dimension coordinate constructs cannot have string-valued coordinates.
  • A two-dimensional numerical boundary array, whose slow-varying dimension (first in CDL, second in Fortran) equals the size specified by the domain axis construct, and whose fast-varying dimension is two, indicating the extent of the cell. For climatological time dimensions, the bounds are interpreted in a special way indicated by the cell methods. Sometimes the bounds are the important information for locating the cell, and the coordinates are notional, especially for extensive quantities.
  • Properties (in the same sense as for the field construct) serving to describe the coordinates.
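
The constraints on the coordinate and boundary arrays listed above could be checked roughly as follows. `check_dimension_coordinate` is a hypothetical helper working on plain Python lists; it does not check that all elements share one numeric data type.

```python
from numbers import Real


def check_dimension_coordinate(values, axis_size, bounds=None):
    """Validate a 1D coordinate array (and optional bounds) for a domain axis."""
    if len(values) != axis_size:
        raise ValueError("coordinate array must match the domain axis size")
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        raise ValueError("dimension coordinates must be numeric")
    # Strictly increasing or strictly decreasing implies all values differ.
    steps = [b - a for a, b in zip(values, values[1:])]
    if not (all(s > 0 for s in steps) or all(s < 0 for s in steps)):
        raise ValueError("values must be distinct and monotonic")
    if bounds is not None:
        # Slow-varying dimension is the axis size; fast-varying dimension is two.
        if len(bounds) != axis_size or any(len(cell) != 2 for cell in bounds):
            raise ValueError("bounds must have shape (axis size, 2)")
    return True
```

A size-one coordinate passes trivially, since there are no neighbouring values to compare.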

A dimension coordinate construct corresponds to a netCDF coordinate variable, whose name is the same as the name of its single dimension, or a CF-netCDF numeric scalar coordinate variable. A CF-netCDF string-valued coordinate variable or string-valued scalar coordinate variable corresponds to an auxiliary coordinate construct (not a dimension coordinate construct), with a domain axis that is not associated with any dimension coordinate construct.

In this data model we permit a domain axis construct not to have a dimension coordinate construct if there is no appropriate numeric monotonic coordinate. That is the case for a dimension that runs over ocean basins or area types, for example, or for a domain axis that indexes timeseries at scattered points. Such domain axes do not correspond to a continuous physical quantity. (They will be called index dimensions in CF version 1.6.)

Auxiliary coordinate construct

An auxiliary coordinate construct must contain:

  • A coordinate array whose shape is determined by the domain axes in the order listed, optionally omitting any domain axes of size one. If all domain axes are of size one, the single coordinate value may be a scalar instead of an array. If the array has more than one element, they must all be of the same data type (numeric, character or string), but they do not have to be distinct or monotonic. Missing values are not allowed (in CF version 1.5). In CF-netCDF, a string-valued auxiliary coordinate construct can be stored either as a character array with an additional dimension (last dimension in CDL) for maximum string length, or represented by a numerical auxiliary coordinate variable with a flag_meanings attribute to supply the translation to strings.

and may also contain

  • A boundary array with all the dimensions, in the same order, as the coordinate array, and an additional dimension (following the coordinate array dimensions in CDL, preceding them in Fortran) equal to the number of vertices of each cell.
  • Properties (in the same sense as for the field construct) serving to describe the coordinates.

Auxiliary coordinate constructs correspond to auxiliary coordinate variables named by the coordinates attribute of a data variable in a CF-netCDF file. CF requires there to be auxiliary coordinate constructs of latitude and longitude if there is two-dimensional horizontal variation but the horizontal coordinates are not latitude and longitude.

comment:53 follow-up: Changed 3 years ago by mgschultz

Dear Jonathan,

this text looks good to me -- except that the relation between the data array dimensions and the domain axes is not stated at all. If I understood the discussion correctly, then this is considered an aspect of the implementation. That's fine, but the data model description should say this! I expect that many people who wish to use the data model will wonder about this relation.

Here is my suggestion: Field construct [...] Collectively, the domain axis, dimension coordinate, auxiliary coordinate, cell measure and cell method constructs describe the domain in which the data resides. Thus a field construct can be regarded as a domain with data in that domain. The relation between domain axes and the shape of the data array must be defined by the implementing software.

Best regards,

Martin

comment:54 in reply to: ↑ 36 ; follow-up: Changed 3 years ago by markh

The discussions around the DomainAxis type have consistently referred to a Field having an ordered list of these types until recently.

In 36 davidhassell first brought forward the idea that this collection should be unordered. I do not think this is a good approach, I think it causes issues of confusion and added complexity without adding utility, as far as I can see.

I would like to see this approach dropped, unless a strong justification for it can be clearly stated. I think it is making the process of reaching consensus more difficult.

mark

comment:55 in reply to: ↑ 52 Changed 3 years ago by markh

Replying to jonathan:

The text on Mark's wiki page isn't exactly the same as the text we have discussed in the ticket.

Thank you for the further text Jonathan.

In an attempt to keep discussions structured and posts to a manageable size, I have put the text you posted into the wiki page. I hope this better represents the status of the discussions than my previous attempt.

I have also defined in the wiki a markup approach to facilitate multiple editors. I would like to try this approach to enable comments and thoughts to be captured, then discussed in turn.

I have also added a little formatting to aid readability, but hopefully the text is faithful to your post, aside from the italic markup which is the record of my suggested additions.

I would like to try and handle issues in constrained posts, so I will start with a single theme from me:

I think it is helpful if the Field names the containers which it uses to reference the types it contains, for example:

Each Field Construct may optionally contain:

  • dimension_coords: A collection of dimension coordinate constructs.

I think this approach will aid communication. It allows us to talk in (hopefully) intuitive language, such as The Field's dimension_coordinates.

This label does not have to be used in an implementation, rather an implementation should define how this label has been interpreted. So, CF-NetCDF would define that this label is implemented using NetCDF's Coordinate Variable identification mechanism.

Do contributors also think this is helpful?

mark

comment:56 in reply to: ↑ 53 Changed 3 years ago by jonathan

Dear Martin

Thanks for your comment,

the relation between the data array dimensions and the domain axes is not stated at all.

Actually there is text in the description of the field construct which is intended to address this, namely

The shape of the data array is determined by an ordered subset of the domain axes. All domain axes of size greater than one must be included in the subset, but domain axes of size one may optionally be omitted, because their position in the order of domain axes makes no difference to the order of data elements in the array.

The domain axes store the size of the dimensions. Does this meet the need? I don't think it's implementation-dependent; the dimensionality of the data array in the CF logical data model is the same as the dimensionality of the data variable in the CF-netCDF file.

Best wishes

Jonathan

comment:57 in reply to: ↑ 54 Changed 3 years ago by jonathan

Dear Mark

Thanks for your comment.

In 36 davidhassell first brought forward the idea that this collection should be unordered. I do not think this is a good approach, I think it causes issues of confusion and added complexity without adding utility, as far as I can see.

In making this suggestion, I think David was intending to address a concern of yours by clarifying the situation. The domain axes which are used to dimension the data array must be in a certain order i.e. the order of the netCDF dimensions. But that is only a subset of the domain axes. There can also be domain axes of size one, not used to dimension the data array, whose logical existence is implied by CF-netCDF scalar coordinate variables. Since these are not used as dimensions, they do not have a particular order. Therefore it makes sense to David and me to say (as I did in my text) that the complete set of domain axes is an unordered collection, but the subset of them used to dimension the data array is an ordered list.

Does that help?

Cheers

Jonathan

comment:58 Changed 3 years ago by jonathan

Dear Mark

In comment 52 I wrote

Each field construct ... may optionally contain:

  • An unordered collection of dimension coordinate constructs.

You suggest

Each Field Construct may optionally contain:

  • dimension_coords: A collection of dimension coordinate constructs.

To be honest, I don't think this adds clarity for me. I think that if I say, "the field's dimension coordinate constructs", then it must be referring to the "unordered collection of dimension coordinate constructs" that the field contains. That seems intuitive and obvious to me.

I would not call them "the field's dimension coordinates" (without "constructs") because that would be confusing. A dimension coordinate construct contains an array of coordinate values. Those values are "the coordinates". The construct is more than "coordinates"; it also contains properties, bounds, and so on.

Best wishes

Jonathan

comment:59 follow-up: Changed 3 years ago by markh

An issue I don't think we have addressed yet is the interpretation of attribute types (variable and global) within the data model.

Each variable in a file defines a dictionary of attributes, which live with that variable, whatever type that variable is.

The file also defines a global attributes dictionary.

As the CF model states:

This data model makes a central assumption that each field construct is independent.

Should the global attributes dictionary information be recognised and referenced by each data variable independently?

Should these global attributes be handled separately from the variable attributes?

comment:60 in reply to: ↑ 59 Changed 3 years ago by davidhassell

Replying to markh:

Dear Mark,

In our original model, Jonathan and I proposed:

  • it is assumed that any relevant global attribute is also an attribute of every data variable [in the file], although it is superseded if the data variable has its own attribute.

This is not inconsistent with 2.6.2 in the conventions: "When an attribute appears both globally and as a variable attribute, the variable's version has precedence"
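
As a rough sketch of that rule, assuming attributes are held in plain dictionaries and that every global attribute counts as "relevant" (the function name `field_attributes` and the attribute values are made up for illustration):

```python
def field_attributes(global_attrs, variable_attrs):
    """Attributes of one field: globals act as defaults, variable attributes win."""
    merged = dict(global_attrs)    # start from the file-level defaults
    merged.update(variable_attrs)  # the variable's own values take precedence
    return merged


# Illustrative values only.
global_attrs = {"institution": "Centre A", "comment": "file-level note"}
tas_attrs = {"standard_name": "air_temperature", "comment": "variable-level note"}
tas_field = field_attributes(global_attrs, tas_attrs)
```

Here `tas_field` inherits `institution` from the file and keeps its own `comment`.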

All the best,

David

comment:61 follow-up: Changed 3 years ago by markh

Hello David

any relevant global attribute

could be a bit problematic: how do we establish relevance? It feels safer to me to include all global attributes on each data variable.

The key aim of my question is to investigate whether we recognise the difference between a global attribute and a data variable attribute?

This is an important distinction in some cases; for example, if we want to recreate a file exactly as it was read.

Should a CF Field know which of its attributes were encoded in the file as globals and which were encoded as data attributes?

comment:62 in reply to: ↑ 61 ; follow-up: Changed 3 years ago by bnl

Replying to markh:

Should a CF Field know which of its attributes were encoded in the file as globals and which were encoded as data attributes?

Absolutely not. It's irrelevant. How can it matter? The key is in the word "encoded", the data model should be independent of the encoding.

comment:63 in reply to: ↑ 62 ; follow-up: Changed 3 years ago by markh

Replying to bnl:

Replying to markh:

Should a CF Field know which of its attributes were encoded in the file as globals and which were encoded as data attributes?

Absolutely not. It's irrelevant. How can it matter? The key is in the word "encoded", the data model should be independent of the encoding.

Hello Bryan

point taken, my language was not clear.

I'll rephrase my question:

Does the semantics of an attribute include its typing, variable or global?

i.e. is a global attribute of 'institute' able to be differentiated from a data variable attribute of 'institute' within the scope of a Field in the data model?

mark

comment:64 in reply to: ↑ 63 ; follow-up: Changed 3 years ago by bnl

Replying to markh:

Replying to bnl:

Replying to markh:

Should a CF Field know which of its attributes were encoded in the file as globals and which were encoded as data attributes?

Absolutely not. It's irrelevant. How can it matter? The key is in the word "encoded", the data model should be independent of the encoding.

I'll rephrase my question:

Does the semantics of an attribute include its typing, variable or global?

i.e. is a global attribute of 'institute' able to be differentiated from a data variable attribute of 'institute' within the scope of a Field in the data model?

One of the nice things about not having read the rest of the discussion, is that I can just dive in with a confident: No!

Global attributes are an artefact of file based encoding, a convenience artefact at that ...

... so one only ends up with an issue when there is a global attribute and a field attribute, and they differ, and we have a convention that in this case, one effectively ignores the global attribute while creating the field. Thereafter it matters not (or should not matter).

This becomes obvious when one encodes the same content alongside different fields in a different file ... the original global attributes are just field attributes ... one might, I suppose, in the new file grab out common attributes and encode them as global file attributes, but that's just the encoding.
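
The "grab out common attributes" step when writing a new file could look something like the sketch below. It assumes fields are plain dictionaries of attributes; `split_common` and the `allowed_global` set are hypothetical, the latter standing in for whatever attributes the conventions permit at file level.

```python
def split_common(fields, allowed_global):
    """Split field attributes into (global_attrs, per_variable_attrs) for writing.

    fields: mapping of variable name -> dict of field attributes.
    An attribute is promoted to global only if it is in allowed_global and
    every field carries it with the same value.
    """
    attr_sets = list(fields.values())
    common = {}
    if attr_sets:
        for key, value in attr_sets[0].items():
            if key in allowed_global and all(
                key in attrs and attrs[key] == value for attrs in attr_sets
            ):
                common[key] = value
    per_variable = {
        name: {k: v for k, v in attrs.items() if k not in common}
        for name, attrs in fields.items()
    }
    return common, per_variable


# Illustrative values only.
fields = {
    "tas": {"institution": "Centre A", "units": "K"},
    "pr": {"institution": "Centre A", "units": "kg m-2 s-1"},
}
global_attrs, variable_attrs = split_common(fields, {"institution", "source", "comment"})
```

The split is purely a property of the output file; the fields themselves are unchanged by it.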

mark

comment:65 in reply to: ↑ 64 Changed 3 years ago by bnl

(Mark, I should say sorry about brevity of reply. I'm doing something else, but saw something coming by that I thought I understood ... so made an effort to contribute. B)

comment:66 follow-up: Changed 3 years ago by spascoe

I'm going to try to rephrase Bryan's point because I agree with it.

The Data Model needs to differentiate between the encoding of attributes and attributes of model constructs. The encoding states how a field construct's attributes can be encoded and decoded into global and/or variable attributes. These rules would include how variable attributes override global attributes.

Then the rest of the data model can ignore the distinction between global and variable attributes. Everything can be expressed in attributes of constructs.

That would seem to clarify the situation. It would also help disentangle the CF model from the format, which is what we are trying to do.

comment:67 in reply to: ↑ 66 Changed 3 years ago by bnl

Replying to spascoe:

The Data Model needs to differentiate between the encoding of attributes and attributes of model constructs. The encoding states how a field construct's attributes can be encoded and decoded into global and/or variable attributes. These rules would include how variable attributes override global attributes.

Then the rest of the data model can ignore the distinction between global and variable attributes. Everything can be expressed in attributes of constructs.

I don't think I agree with the restatement in detail.

The first paragraph more or less contradicts the second. Of course we have to have a "set of rules for the netcdf encoding" of the data model, and that includes describing the convenience method of using global attributes. That isn't the CF data model, it's the CF-netcdf encoding convention.

The CF-data model itself ought to be agnostic about the encoding rules - which is what the second para says, I'm just trying to clarify the relationship between the first and second paragraph.

That would seem to clarify the situation. It would also help disentangle the CF model from the format, which is what we are trying to do.

Which it would with the understanding that we're dealing with two different sets of logic: the CF-netcdf convention and the CF data model.

But there should be no expectation that a round trip from CF-netcdf via CF to CF-netcdf should necessarily preserve global attributes per se ...

... if providers wish to build software that does that, then that's fine, then they're building implementations of "site-specific-software" that are overloading CF with their own conventions ... which is perfectly fine!

comment:68 follow-ups: Changed 3 years ago by mgschultz

Dear all,

so, if I understand correctly, then the issue is essentially whether or not to have a concept like "global attributes" in the CF data model. I would argue in favor of this, and generally in favor of supporting some sort of hierarchy in the data model. There are reasons why HDF5 or netCDF4 allow for groups, and why global attributes were allowed in the first place. Also, any kind of XML metadata are hierarchical. Don't namespaces offer a solution here? If we allow for a hierarchy of field constructs and attributes, then the rule "local overwrites global" can be implemented quite naturally, one can move attributes up or down the hierarchy level (up will require some rules), and thus the "encoding" of the data in a specific file format should create little problems. Without recognition of hierarchy levels there will be many more problems, I believe.

Perhaps it is also useful to discuss this with an example: assume you have N models generating M data sets each from X experiments. Clearly there are metadata which are specific to a variable from the output of one experiment, metadata describing an experiment, others which are specific for a model, and finally some generic metadata describing the project, data center, etc. All of this can be reflected in a hierarchical model, but I fear that things (in particular changes) would easily get lost if all attributes are maintained at the variable level only.
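
To make the hierarchical proposal concrete, here is a small sketch using Python's standard `collections.ChainMap`. The four levels and all attribute values are invented for illustration; this is one possible reading of "local overwrites global", not an agreed design.

```python
from collections import ChainMap

# Illustrative hierarchy levels, from most generic to most specific.
project = {"project": "Example MIP", "institution": "Data Centre X"}
model = {"source": "Model N1"}
experiment = {"experiment": "X1", "comment": "experiment-level note"}
variable = {"standard_name": "air_temperature", "comment": "variable-level note"}

# Lookup searches the mappings left to right, so the most local level wins,
# while the outer levels remain intact and inspectable.
attrs = ChainMap(variable, experiment, model, project)
```

Looking up `attrs["comment"]` gives the variable-level value, yet the experiment-level comment is still reachable through `attrs.maps`, so nothing is lost by the override.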

An alternative way would be to allow for "independent" attributes which can be defined outside any field construct, and to allow field construct attributes to be links to such independent attributes. Note that this still doesn't mean a 1:1 reflection of the netcdf global attributes model, because it may be that some field constructs share one independent attribute and another set of field constructs shares another independent attribute with the same name.

Best regards,

Martin

comment:69 in reply to: ↑ 68 Changed 3 years ago by spascoe

Replying to mgschultz:

Dear all,

so, if I understand correctly, then the issue is essentially whether or not to have a concept like "global attributes" in the CF data model. I would argue in favor of this, and generally in favor of supporting some sort of hierarchy in the data model.

IMO Martin's argument is about encoding of the model. We need to define this encoding but it needs to be separated from the model.

There are reasons why HDF5 or netCDF4 allow for groups, and why global attributes were allowed in the first place. Also, any kind of XML metadata are hierarchical. Don't namespaces offer a solution here? If we allow for a hierarchy of field constructs and attributes, then the rule "local overwrites global" can be implemented quite naturally, one can move attributes up or down the hierarchy level (up will require some rules), and thus the "encoding" of the data in a specific file format should create little problems. Without recognition of hierarchy levels there will be many more problems, I believe.

Naturally if our model is very close to the encoding it will make writing the data easier. However, it doesn't make reading it easier and it doesn't help describing the underlying concepts independent of the format. See below for my explanation.

Perhaps it is also useful to discuss this with an example: assume you have N models generating M data sets each from X experiments. Clearly there are metadata which are specific to a variable from the output of one experiment, metadata describing an experiment, others which are specific for a model, and finally some generic metadata describing the project, data center, etc. All of this can be reflected in a hierarchical model, but I fear that things (in particular changes) would easily get lost if all attributes are maintained at the variable level only.

I think you are worrying too much about duplication of information when stored. This is an encoding issue; the model doesn't need to worry about the difference between 2 references to a piece of data and 2 identical copies of that data.

From a data writer perspective your statement makes sense but from a data consumer less so. A user could access 2 CF field constructs from 2 separate files (or 2 separate services). Since they are encoded separately they cannot be physically part of the same hierarchy. How do you tell whether they were created by the same model? -- you have to compare the relevant field metadata for equality. That metadata may have been encoded in global attributes, variable attributes or generated dynamically from a service. The point is that the user doesn't need to know how it's encoded to interact with the model.
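
A sketch of that comparison, under the assumption that decoding simply applies the "variable overrides global" rule (the `decode` function and the two in-memory "files" are hypothetical stand-ins for real readers):

```python
def decode(global_attrs, variables):
    """Decode a file-like encoding into per-field attribute dictionaries."""
    return {name: {**global_attrs, **attrs} for name, attrs in variables.items()}


# Encoding 1: shared metadata stored once as a global attribute.
file_a = ({"source": "Model N1"}, {"tas": {"units": "K"}})
# Encoding 2: the same metadata stored directly on the variable.
file_b = ({}, {"tas": {"units": "K", "source": "Model N1"}})

# The consumer compares field metadata without knowing how it was encoded.
same_model = decode(*file_a)["tas"]["source"] == decode(*file_b)["tas"]["source"]
fields_equal = decode(*file_a) == decode(*file_b)
```

Both encodings decode to the same field, which is the sense in which the encoding is invisible to the user of the model.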

Also, if we model a hierarchy we end up with multiple ways of representing a field construct in the model because many attributes could be defined at the global or variable level. You would then need a meta-model to define when 2 constructs are formally different but semantically the same. IMO the point of the data model is that it shouldn't need further abstraction.

comment:70 in reply to: ↑ 68 ; follow-ups: Changed 3 years ago by davidhassell

Hello,

(I might not have written this if I had seen Stephen's post first. But I didn't, so I did, so you might as well see it!)

Global attributes apply to each data variable in the dataset, do they not, unless superseded locally by a data variable's attribute? Therefore CF-netCDF global attributes are an encoding convenience - they allow us to avoid adding them to every data variable in the dataset, thus reducing redundancy of information and dataset size (good things).

So I think that there should be no logical distinction between global and data attributes in the data model. They're all just 'field attributes'.

I also think that it's not the preserve of the data model to ensure that a CF-netCDF dataset written out is identical to one read in. So long as the logical content of the input and output datasets is the same then the data model has done its job, regardless, in fact, of the formats of the two datasets. As Bryan says, software built on the CF data model will make decisions on how to format its output within the confines of creating a CF compliant file.

All the best,

David

comment:71 follow-up: Changed 3 years ago by mgschultz

Dear Stephen and David,

point taken and accepted! However, I think that when writing this up we should include some "implementation remark" here to explain that there is an issue here, which however needs to be dealt with on the implementation level.

At some point I am only beginning to worry that the gap between the data model and any specific implementation will be so wide that the data model as such may become useless. In fact, I would tend to disagree with David's statement "So long as the logical content of the input and output datasets is the same then the data model has done its job, regardless, in fact, of the formats of the two datasets." -- as soon as you deal with a dataset format, you are bound to certain implementation details, and it may therefore not be possible to exactly preserve the "logical content" of input and output. So, I think it is an illusion to create a fully abstract data model, and the value of the data model would be greater if it at least points to "preferred" ways of implementing it.

Best regards,

Martin

comment:72 in reply to: ↑ 70 ; follow-ups: Changed 3 years ago by pbentley

Hi David,

Global attributes apply to each data variable in the dataset, do they not?

My current understanding is that this is not the case.

AFAIK, the netcdf library does not implement this behaviour. And I'm not aware of any explicit statement within the CF specification that says that global attributes routinely act as providers of default values in the case where a variable-scope attribute is absent. But I'm happy to be corrected if I've missed the salient bit of text.

I'm fairly sure the netcdf tools I use (the usual suspects) don't rely on this behaviour. Could be wrong.

Phil

comment:73 in reply to: ↑ 72 Changed 3 years ago by davidhassell

Replying to pbentley:

Hi Phil,

I think you're right that the conventions don't state this behaviour explicitly, but I think that the conventions' statement "When an attribute appears both globally and as a variable attribute, the variable's version has precedence" does directly imply that a global attribute is applicable to all data variables, unless a particular data variable overrides it with a new value that is true for that data variable only.

Since global attributes are "intended to provide information about where the data came from and what has been done to it", if they don't apply to the data variables in the dataset, what are they for, I wonder ... (something tells me that I'm overlooking something glaringly obvious - if so, do let me know!).

All the best,

David

comment:74 in reply to: ↑ 72 Changed 3 years ago by bnl

Replying to pbentley:

Hi David,

Global attributes apply to each data variable in the dataset, do they not?

My current understanding is that this is not the case.

AFAIK, the netcdf library does not implement this behaviour. And I'm not aware of any explicit statement within the CF specification that says that global attributes routinely act as providers of default values in the case where a variable-scope attribute is absent. But I'm happy to be corrected if I've missed the salient bit of text.

I'm fairly sure the netcdf tools I use (the usual suspects) don't rely on this behaviour. Could be wrong.

The netcdf data model makes no assumption about what happens with file attributes as opposed to variable attributes, and so the netcdf libraries don't either. What we are dealing with here is the CF data model ... which I think we are clear on does imply that global attributes are inherited by variables ... the CF convention seems clear on that. Where there is any contention at all seems to be whether there is any semantic difference between such attributes on a variable ... does a variable need to know whether it inherited an attribute from the file or has it in its own right? My answer: No.

comment:75 in reply to: ↑ 71 Changed 3 years ago by davidhassell

Replying to mgschultz:

Hi Martin,

point taken and accepted! However, I think that when writing this up we should include some "implementation remark" here to explain that there is an issue here, which however needs to be dealt with on the implementation level.

OK. I entirely agree about the usefulness of implementation details - the original data model proposal document already has many descriptions of how the various constructs are represented in CF-netCDF.

At some point I am only beginning to worry that the gap between the data model and any specific implementation will be so wide that the data model as such may become useless. In fact, I would tend to disagree with David's statement "So long as the logical content of the input and output datasets is the same then the data model has done its job, regardless, in fact, of the formats of the two datasets." -- as soon as you deal with a dataset format, you are bound to certain implementation details, and it may therefore not be possible to exactly preserve the "logical content" of input and output. So, I think it is an illusion to create a fully abstract data model, and the value of the data model would be greater if it at least points to "preferred" ways of implementing it.

However, we must remember that we are only considering file formats which are capable of storing any fully CF-compliant dataset. So I would say, rather boldly, perhaps, that the preservation of "logical content" is a given.

All the best,

David

comment:76 in reply to: ↑ 70 ; follow-ups: Changed 3 years ago by bjlittle

Replying to davidhassell:

Hi David,

Okay I'm jumping into this conversation somewhat cold, so apologies if I'm covering a previously discussed point.

I'm curious about your statement with regards to global/data attributes:

I also think that it's not the preserve of the data model to ensure that a CF-netCDF dataset written out is identical to one read in. So long as the logical content of the input and output datasets is the same then the data model has done its job, regardless, in fact, of the formats of the two datasets. As Bryan says, software built on the CF data model will make decisions on how to format its output within the confines of creating a CF compliant file.

I know that we're discussing the data model here and not implementation specifics, but I believe that proposing a CF data model that can only promise similar logical content will cause big issues for many data users. I've seen many, many real world CF NetCDF files where (misguided or not) a user expects the global attributes to be preserved through the load-process-save cycle.

My concern is that if the community agree on a data model that only guarantees similar logical content, then there will be a real world impact as a consequence. Such a decision should be made explicit, at the very least, as it will be contrary to the current expectation some data users may have.

I hope that this point isn't too inappropriate given the current discussion.

Best regards, Bill

comment:77 in reply to: ↑ 76 Changed 3 years ago by bnl

Replying to bjlittle:

Replying to davidhassell:

Hi David,

Okay I'm jumping into this conversation somewhat cold, so apologies if I'm covering a previously discussed point.

Hi Bill. I jumped in a few comments back. It's the done thing :-)

I'm curious about your statement with regards to global/data attributes:

I also think that it's not the preserve of the data model to ensure that a CF-netCDF dataset written out is identical to one read in. So long as the logical content of the input and output datasets is the same then the data model has done its job, regardless, in fact, of the formats of the two datasets. As Bryan says, software built on the CF data model will make decisions on how to format its output within the confines of creating a CF compliant file.

I know that we're discussing the data model here and not implementation specifics, but I believe that proposing a CF data model that can only promise similar logical content will cause big issues for many data users. I've seen many, many real world CF NetCDF files where (misguided or not) a user expects the global attributes to be preserved through the load-process-save cycle.

My concern is that if the community agree on a data model that only guarantees similar logical content, then there will be a real world impact as a consequence. Such a decision should be made explicit, at the very least, as it will be contrary to the current expectation some data users may have.

The problem is that other data users have different expectations, and some of us think that the "file as a bucket" metaphor is holding us back. In particular, what do we do with (original) global file attributes by the time the variable is some way down any kind of reasonable workflow where we are accumulating provenance. They're just (logically) variable attributes because we have different assemblages of variables by then ...

... which is to say that I have never expected global file attributes to propagate *as global file attributes* through my workflow, as attributes, yes.

So, I agree, let's make it explicit. The lack of explicitness is what has caused this thread :-)

I hope that this point isn't too inappropriate given the current discussion.

Best regards, Bill

comment:78 in reply to: ↑ 76 ; follow-up: Changed 3 years ago by davidhassell

Replying to bjlittle:

Hi Bill,

No apologies required!

Remember that one of the aims of the data model is to set the groundwork to expand CF beyond netCDF files. I don't think that guaranteeing the same logical content will be a problem for users, as a user interacts with a software library, and not directly with the data model. I would expect a software library to add inconsequential user-friendliness. For example, it might be nice to remember netCDF variable names when reading a file so that they may be reused, if possible, when writing the data out again. But netCDF variable names are not part of the CF data model.

All the best,

David

comment:79 in reply to: ↑ 78 ; follow-up: Changed 3 years ago by rhattersley

Consider a CF-netCDF compliant file containing a single data variable. It is perfectly compliant to define both a global "comment" attribute and a data variable "comment" attribute containing different information. It's a worrying consequence of the all-attributes-are-really-data-attributes approach that it renders the global attribute meaningless.
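
The single-variable case can be written out as a short sketch. The comment strings are invented, and the two dictionaries `flattened` and `scoped` are just two hypothetical ways a reader might hold the result:

```python
# Illustrative attribute values for a file with one data variable.
global_attrs = {"comment": "Regridded from the native model grid."}
variable_attrs = {"comment": "Daily mean of hourly values."}

# Flattened view: the variable attribute takes precedence, so the
# global comment is no longer visible on the field.
flattened = {**global_attrs, **variable_attrs}

# Alternative view that keeps both scopes side by side.
scoped = {"global": dict(global_attrs), "variable": dict(variable_attrs)}
```

In the flattened view only the variable's comment survives; the scoped view keeps both, at the cost of carrying the global/variable distinction into the model.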

Replying to davidhassell:

Remember that one of the aims of the data model is to set the groundwork to expand CF beyond netCDF files.

Whilst this is true, it's also important to remember that the terms of reference for the CF 1.5 data model constrain us to be consistent with version 1.5 of the CF-netCDF conventions.

In particular, the statement "When an attribute appears both globally and as a variable attribute, the variable's version has precedence" has been shown to be subject to two different interpretations. Since we cannot go back in time and re-write version 1.5 of the conventions (but we should clarify the next version!) we must embrace this ambiguity in the data model, and the data model needs to contain both the global and variable attribute concepts.

FWIW, the title of section 2.6.2, "Description of file contents" suggests that the original authors saw these attributes as file-based as opposed to variable-based.

Regards, Richard

comment:80 in reply to: ↑ 79 ; follow-up: Changed 3 years ago by bnl

Replying to rhattersley:

Consider a CF-netCDF compliant file containing a single data variable. It is perfectly compliant to define both a global "comment" attribute and a data variable "comment" attribute containing different information.

Good point.

It's a worrying consequence of the all-attributes-are-really-data-attributes approach that it renders the global attribute meaningless.

I think that's stretching the point a bit far. It's certainly true that one needs to think a bit harder about what it means to "take precedence". This example clearly indicates that it should not mean destroy.

There is room to argue that this means one should allow variables to inherit two types of attributes (with semantic distinction between them). I personally don't think it needs semantic distinction, but concede that this example may have pushed me out of my comfort zone.

comment:81 in reply to: ↑ 80 ; follow-up: Changed 3 years ago by bnl

While I'm out of my comfort zone, I'm still not up for two semantic distinctions. What happens if I take variable "A" from file "afile" and variable B from file "bfile" and put them both in file "cfile". What then do I do with the global attributes from file "afile" and "bfile"? Do I leave them as semantically distinct global attributes of A and B? What if I do it again and put them in "dfile". What now has happened to all these (different) global attributes?

"file" attributes really only have useful meaning when variables all stay in the same containers ... and never get mixed and matched into differing output files.

comment:82 in reply to: ↑ 81 ; follow-up: Changed 3 years ago by rhattersley

Replying to bnl:

While I'm out of my comfort zone, I'm still not up for two semantic distinctions. What happens if I take variable "A" from file "afile" and variable B from file "bfile" and put them both in file "cfile". What then do I do with the global attributes from file "afile" and "bfile"? Do I leave them as semantically distinct global attributes of A and B? What if I do it again and put them in "dfile". What now has happened to all these (different) global attributes?

Indeed. These are very good questions, and ones that have been nagging at me for quite some time. Having seen what happens when one takes the all-attributes-are-really-data-attributes approach (which is what we did with Iris), I'm now of the opinion that only the end-user knows. All we should do is present the information (all of it!) and let them decide.

"file" attributes really only have useful meaning when variables all stay in the same containers ... and never get mixed and matched into differing output files.

Just as CF doesn't attempt to describe what happens to data variable metadata when one processes a data variable, it shouldn't describe what happens to file metadata when one processes files.

comment:83 in reply to: ↑ 82 ; follow-up: Changed 3 years ago by bnl

Replying to rhattersley:

Replying to bnl:

While I'm out of my comfort zone, I'm still not up for two semantic distinctions. What happens if I take variable "A" from file "afile" and variable B from file "bfile" and put them both in file "cfile". What then do I do with the global attributes from file "afile" and "bfile"? Do I leave them as semantically distinct global attributes of A and B? What if I do it again and put them in "dfile". What now has happened to all these (different) global attributes?

Indeed. These are very good questions, and ones that have been nagging at me for quite some time. Having seen what happens when one takes the all-attributes-are-really-data-attributes approach (which is what we did with Iris), I'm now of the opinion that only the end-user knows. All we should do is present the information (all of it!) and let them decide.

"file" attributes really only have useful meaning when variables all stay in the same containers ... and never get mixed and matched into differing output files.

Just as CF doesn't attempt to describe what happens to data variable metadata when one processes a data variable, it shouldn't describe what happens to file metadata when one processes files.

We agree on the problem :-). Not the solution :-).

For me, there is no real concept of file metadata, notwithstanding the heading you referred to earlier. What there is is a concept of a convenience method of bundling variable metadata together for all variables in a file.

The simplest solution would be to have the following BEHAVIOUR become conventional: when processing variables, take "file" metadata, and tag it with the filename while assigning it to variables ... then when the variables travel, they at least have the provenance go with them ... but all such provenance are just variable attributes. One could then interpret something as taking precedence however one wanted.
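
One way to picture that behaviour, as a sketch only: the `attach_file_metadata` function and the `file:<filename>:<name>` key format are made up here, and are not proposed as a CF or netCDF naming rule.

```python
def attach_file_metadata(variable_attrs, global_attrs, filename):
    """Return variable attributes plus file-level attributes tagged with their source.

    The "file:<filename>:<name>" key format is a made-up convention for this sketch.
    """
    tagged = dict(variable_attrs)
    for name, value in global_attrs.items():
        tagged[f"file:{filename}:{name}"] = value
    return tagged


# Variables A and B come from different files and later travel together.
a = attach_file_metadata({"comment": "about A"}, {"comment": "about afile"}, "afile")
b = attach_file_metadata({"comment": "about B"}, {"history": "made by tool"}, "bfile")
```

Everything on `a` and `b` is then an ordinary variable attribute, so moving both into "cfile" or "dfile" needs no special treatment of the original file-level metadata.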

But, from a data model perspective, we just have variable attributes. What one does in one's own workflow is one's own convention ... sometimes we try and solve too many problems in CF itself. I think the act of trying to handle "file metadata" from a previous file in a future file as "file metadata" would be having us taking the wrong coloured pill ...

comment:84 Changed 3 years ago by taylor13

From a user's perspective, I've always interpreted global attributes as covering all the variables in a file, and a variable attribute as only applying to a particular variable. So in effect, I would use a global attribute simply as a shorthand way of providing information, so that I wouldn't have to repeat it for each and every variable.

Having not read much of the discussion about this, I would think that software could unambiguously interpret the metadata simply by converting all global attributes to variable attributes. If a global and variable attribute had the same name (e.g., "comment"), you would have a problem if they were inconsistent, but I would say whoever wrote the data messed up.

best regards, Karl

p.s. I just read Bryan's posting, which I think might say about the same thing.

comment:85 Changed 3 years ago by jonblower

I'm late to this conversation, but this seems to be a special case of a general problem of aggregation and inheritance of metadata in a hierarchy. A file is an aggregation of variables, whose attributes can be specified at the file or variable level. Also, a set of files can be aggregated into a dataset. Then we have ensembles etc etc.

I'm not sure what the solution is, but other systems (e.g. OGC services) have the concept of metadata inheritance, with the behaviour of the inheritance dependent on the metadata item in question - metadata can be inherited, overridden or aggregated by children in the hierarchy. Other items of metadata are not considered inheritable at all and only exist on the parent object.

For example, if a file has a comment field, and the variable has a comment field, these might be aggregated in the data model to produce a resulting variable with two comment fields (it doesn't seem unreasonable to me that there would be valid use cases for this). However, in an aggregated dataset, each variable can only have one scale/offset pairing, otherwise it doesn't make sense (this precise case has just bitten someone in THREDDS-land by the way), so the semantics of how these items are derived are important (what about valid_range for example)?
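
A per-attribute inheritance policy of this kind might be sketched as follows. The policy names, the `POLICY` table and the assignment of particular attributes to particular policies are all illustrative assumptions; nothing here has been agreed for specific CF attributes.

```python
OVERRIDE, AGGREGATE, PARENT_ONLY = "override", "aggregate", "parent_only"

# Illustrative policy choices only.
POLICY = {
    "institution": OVERRIDE,   # inherited unless the child has its own value
    "comment": AGGREGATE,      # parent and child values are both kept
    "title": PARENT_ONLY,      # never passed down to the child
}


def resolve(parent, child, policy=POLICY, default=OVERRIDE):
    """Combine parent (file) and child (variable) attributes according to policy."""
    result = dict(child)
    for name, value in parent.items():
        rule = policy.get(name, default)
        if rule == PARENT_ONLY:
            continue
        if name not in child:
            result[name] = value                  # plain inheritance
        elif rule == AGGREGATE:
            result[name] = [value, child[name]]   # keep both, parent first
        # OVERRIDE with a child value present: the child's value stays
    return result


file_attrs = {"institution": "A", "comment": "file note", "title": "T"}
var_attrs = {"comment": "var note", "units": "K"}
resolved = resolve(file_attrs, var_attrs)
```

Attributes such as a scale/offset pair would sit on the child only and pass through untouched, whereas `comment` ends up with two entries.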

comment:86 in reply to: ↑ 83 Changed 3 years ago by davidhassell

Replying to bnl:

But, from a data model perspective, we just have variable attributes. What one does in one's own workflow is one's own convention ...

I agree.

David

comment:87 Changed 3 years ago by graybeal

When an attribute appears both globally and as a variable attribute, the variable's version has precedence

It's a worrying consequence of the all-attributes-are-really-data-attributes approach that it renders the global attribute meaningless.

I think that's stretching the point a bit far. It's certainly true that one needs to think a bit harder about what it means to "take precedence". This example clearly indicates that it should not mean destroy.

Although the wording is ambiguous, I've always thought it meant the global attribute is lost, which I thought was bad design, because I'm convinced global attributes should be managed as Jon Blower describes. If global and variable attributes exist and differ, maybe "whoever wrote the data messed up", or maybe they followed the description to provide a default value and override values.

I see "how do we handle global attributes during processing?" as a red herring. If collections get remerged into new combinations, then global attributes can be handled via provenance, describing where the newly merged variable attributes came from (and if desired, the attributes of those sources). Whereas, if the task is to merge a group of uniform collections into a single collection, the global attributes should already be identical or vary in known and manageable ways; if not, one is back to the provenance solution.

So if the CF data model is only describing past practices, you can forget about that -- the original was too ambiguous to present a single correct answer. (In which case provide the new answer, and people can satisfy the ambiguity however they feel is right.)

If the data model provides the best way forward, it should definitely support Jon's list.

comment:88 Changed 3 years ago by jonathan

Dear all

I agree with Bryan, Karl and David. Helpful software may combine the contents of attributes, but as far as the data model is concerned, I have always understood the convention of "variable attribute takes precedence" to mean that the global att can be ignored if the variable has its own att. The global att is a default for all variables in the file and is therefore irrelevant when a default is not required. In Jon's terms, I think that means the att can be inherited and overridden but is not aggregated. Of course this is not true of most CF atts - only the ones which are permitted both as global and variable atts.

Best wishes

Jonathan

comment:89 follow-up: Changed 3 years ago by taylor13

Hi all,

Just to note one use case (CMIP5 specifications) where, perhaps mistakenly, it was assumed that attributes appearing both as global and (with different text) as variable would be concatenated to provide more complete information. The assumption wasn't that if the variable attribute was present it would supersede/override the global one. Note that the description below distinguishes between types of information that might appear in comment and history attributes which would be common among variables vs. different from one variable to the next.

Here's the description of the relevant CMIP5 attributes taken from http://cmip-pcmdi.llnl.gov/cmip5/docs/CMIP5_output_metadata_requirements.pdf :

CMIP5 optional GLOBAL attributes:

comment = A character string containing additional information about the data or methods used to produce it. The user might, for example, want to provide a description of how the initial conditions for a simulation were specified and how the model was spun-up (including the length of the spin-up period).

history = A character string containing an audit trail for modifications to the original data. Each modification is typically preceded by a "timestamp". The "history" attribute provided here will be a global one that should not depend on which variable is contained in the file. A variable-specific "history" can also be included as an attribute attached to the output variable.

CMIP5 optional VARIABLE attributes:

comment = a character string providing further information concerning the variable (e.g., if the variable is mrso (soil_moisture_content), the comment might read "includes subsurface water, both frozen and liquid, but not surface water, snow or ice").

history = a character string containing an audit trail for modifications to the original data (e.g., indicate what calculations were performed in post-processing, such as interpolating to standard pressure levels or changing the units). Each modification is typically preceded by a "timestamp". Note that this history attribute is variable-specific, whereas the global history attribute defined above (see "Optional attributes" under "Requirements for global attributes") provides information concerning the model simulation itself or refers to processing procedures common to all variables.

Don't know if this is helpful, but thought I'd pass it along.

Best regards, Karl

comment:90 in reply to: ↑ 89 Changed 3 years ago by markh

Replying to taylor13:

Hi all,

Just to note one use case (CMIP5 specifications) where, perhaps mistakenly, it was assumed that attributes appearing both as global and (with different text) as variable would be concatenated to provide more complete information. The assumption wasn't that if the variable attribute was present it would supersede/override the global one. Note that the description below distinguishes between types of information that might appear in comment and history attributes which would be common among variables vs. different from one variable to the next.

For what it is worth, I took a similar interpretation to the CMIP5 guidance.

I interpreted

'takes precedence'

to mean that the data variable attribute was to be treated as having primacy, having more import than the global attribute. This is different from saying that the global attribute in this case has no meaning / is overridden / is ignored if a data variable has a variable attribute with the same name.

I think it is useful to work with CF data variables as independent entities, so I think we should look to apply file global attributes to data variables.

I think there is some semantic content in the information encoding approach in numerous cases, so I think there is value in having two attribute containers, perhaps 'data_attrs' and 'inherited_attrs', so that we can preserve the source of these attributes and provide a mechanism for controlling how NetCDF files are written from Field instances.

They are all attributes on the Field; in this view, there just happen to be two types. This doesn't have to impact on most uses: at the top level the Field has attributes. I would add that this applies to all attributes, ones recognised by CF and ones outside of the CF conventions.

This meets the use case of reproducing files, which I think is a valuable capability that many users expect, and enables precedence to be applied without deleting potentially informative metadata.
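
A rough Python sketch of the two-container idea, assuming a hypothetical `Field` class; only the container names 'data_attrs' and 'inherited_attrs' come from the text above, and the method names and sample values are invented for illustration:

```python
class Field:
    """Hypothetical field holding attributes in two containers."""

    def __init__(self, data_attrs=None, inherited_attrs=None):
        self.data_attrs = dict(data_attrs or {})            # from the data variable
        self.inherited_attrs = dict(inherited_attrs or {})  # from the file globals

    @property
    def attributes(self):
        """Top-level view: data variable attributes take primacy."""
        merged = dict(self.inherited_attrs)
        merged.update(self.data_attrs)
        return merged

    def to_file_layout(self):
        """Return (global_attrs, variable_attrs) as they were read."""
        return dict(self.inherited_attrs), dict(self.data_attrs)


field = Field(
    data_attrs={"comment": "variable note"},
    inherited_attrs={"comment": "file note", "institution": "Example Centre"},
)
top_level = field.attributes
globals_out, variable_out = field.to_file_layout()
```

Most callers would only touch `attributes`; the two containers matter when writing the file back out, where neither comment has been lost.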

comment:91 Changed 3 years ago by bnl

... but this distinction would be difficult to interpret once you mix and match variables which have previously inherited attributes. In practice you couldn't distinguish between the importance and meaning of these attributes down the line, so if you can't distinguish them, they have no *absolute* semantic meaning. I've suggested previously a compromise which would allow software in a specific workflow to read such extra semantic meaning, but one couldn't rely on all software understanding such meaning, since in effect CF would then have to support the semantics of all possible workflows.

So, I have no problem if you give variable A two attributes, one of which is marked in such a way as to indicate it is inherited when you read the file, but the CF data model should be agnostic of that, even if *your* software is not.

comment:92 Changed 3 years ago by davidhassell

Hello,

I interpreted 'takes precedence' to mean that the data variable attribute was to be treated as having primacy, having more import than the global attribute. This is different from saying that the global attribute in this case has no meaning / is overridden / is ignored if a data variable has a variable attribute with the same name.

However, how can we discern the case where the data variable's attribute really is meant to entirely replace the global value? An example might be a global source attribute having the value 'Unified model' and the file contains a data variable (containing data from another model) which has a source attribute of 'ARPEGE'.

This example suggests to me that the "precedence" we are talking about should mean "overrides".

If I understand it correctly, the very useful (thanks, Karl) CMIP5 strategy neatly sidesteps this issue by i) insisting that common global and data variable attributes should contain information of pre-defined types and are mutually exclusive and ii) disallowing certain global attributes as data variable attributes (such as source). But I don't think that this is an option available to us at the moment for CF-netCDF in general.

It'd be nice to somehow support both views, I'm not sure how we could do that.

I don't think that this issue has any bearing on the question of whether or not global attributes should be stored separately to data variable attributes in the data model. Whether you "combine" or "override" in the conventions - the end result is still just some metadata on the field. The data model is solely about describing the logical information content of a field, but that doesn't, and shouldn't, stop a software library from storing extra information for, for example, writing out fields to CF-netCDF in a particular, arbitrary way.

All the best,

David

comment:93 Changed 3 years ago by jonathan

Dear all

For what it's worth, my memory is that when we wrote "has precedence" we meant "is used rather than". Surely that is the usual meaning of this phrase? The free online dictionary says "take precedence over" means "take priority over, outweigh, come before, supersede, prevail over". That's how I understand it. "Precede" means "go before". If A precedes B, it does not mean they go side by side.

Cheers

Jonathan

comment:94 Changed 3 years ago by ngalbraith

For example, if a file has a comment field, and the variable has a comment field, these might be aggregated in the data model to produce a resulting variable with two comment fields (it doesn't seem unreasonable to me that there would be valid use cases for this).

I also thought of the history attribute; if globals are really meant to be replaced by variable attributes during subsetting or aggregation, then I have a big problem, as does almost anyone who stores observational data in CF.

The early processing steps of each variable in an observational file may be different, and so are recorded as variable attributes, while processes performed later - after all the variables are put into a single file - are recorded as globals. If the global history is discarded, we've lost a significant chunk of provenance.

comment:95 Changed 3 years ago by taylor13

Dear Jonathan and all,

I agree that "precedence" has the meaning you ascribe to it, but I think this was a mistake. I suggest that "When an attribute appears both globally and as a variable attribute, the variable's version has precedence." should be replaced with a statement that global attributes should contain only information that applies to all variables in the file, whereas variable attributes with the same name should contain additional information applying to the specific variable. If this were done, files constructed correctly would never have conflicting information stored in global and variable attributes of the same name. In this case the user could safely make use of both.

I think saying that the variable attribute supersedes the global attribute information is asking for trouble, as the casual user may not be aware that the global attribute should be ignored in this case. Most users would, I think, incorrectly assume that both the global attribute information and the variable attribute information were valid. [Like me they might have missed that "precedence" statement in the conventions document.]

Does anyone know of files where a global attribute does not apply to a variable that has an attribute of the same name?

best regards, Karl

comment:96 Changed 3 years ago by bnl

Hi Folks

We now have two threads running on the global/variable attribute dilemma: (1) do such things have any semantic difference? and (2) what to do with data where both global and variable attributes with the same name exist (the precedence argument).

Taking the second first: I think the consensus of the argument thus far is that we *may* have data where Jonathan's interpretation of precedence (aka replace) might give the wrong results, and that in any case, we need to build up provenance as we manipulate data, which means we want to keep adding information to attributes as we process data.

What then should be the best practice to do the latter? Multiple attributes with the same name, or add/aggregate information into the same attribute? Effectively we can't have multiple attributes with the same name, so "best practice" is to add provenance to "the" history comment as software processes data.

Note that in the previous paragraph I didn't mention whether the attributes were "file" or "variable". The same argument applies to both.

So now we can think of file/variable attribute conflict as just being a special case of having to deal with aggregating information as software processes data (and that will now go beyond history; we might be adding authors, sources, etc.). There are three possible conflict cases to deal with:

  1. A global attribute and a variable attribute of the same name exist, and they are complementary, that is, the right answer is to join them together in some sense.
  2. A global attribute and a variable attribute of the same name exist, but the intent was that the global attribute has been overridden by the variable attribute.
  3. A global attribute and a variable attribute of the same name exist, but the intent was that the global attribute overrides the variable attribute.

The third we can discard. If such data exists, the file is not CF compliant. The first two are clearly at least consistent with some interpretations of CF, but frankly, we can't know what was intended. We can fix that in a future version, but we are where we are.

Returning to my very first enumeration, then we have the semantic difference issue. I believe that my previous interventions, plus the argument above about workflow, suggest that there is no semantic difference between global and variable attributes, we simply have to deal with the problem above: two possible interpretations of how to process the attributes.

I suggest for existing CF data (up to 1.7 I suppose), we recommend software simply aggregate them, but tag the global one as "inherited global %s %s"%(filename,timestamp). A human will have to disambiguate any problems that arise from the correct behaviour having been replace not aggregate. This does not affect the data model!
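
One way this tagging might look in Python is sketched below. Only the tag format "inherited global %s %s" comes from the suggestion above; the function name, the newline separator, the ordering (variable value first) and the sample file name and timestamp are assumptions:

```python
def aggregate_with_tag(name, global_attrs, var_attrs, filename, timestamp):
    """Join a variable attribute with the same-named global one.

    The global part is prefixed with a tag recording where it came from,
    so a human can later tell which text was inherited.
    """
    parts = []
    if name in var_attrs:
        parts.append(str(var_attrs[name]))
    if name in global_attrs:
        tag = "inherited global %s %s" % (filename, timestamp)
        parts.append("%s: %s" % (tag, global_attrs[name]))
    return "\n".join(parts)  # assumption: one entry per line


combined = aggregate_with_tag(
    "comment",
    {"comment": "file note"},
    {"comment": "variable note"},
    "example.nc",            # hypothetical file name
    "2013-01-01T00:00:00Z",  # hypothetical timestamp
)
```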

For future versions of CF, we choose which of the above interpretations we want, and make it unambiguous. Either way, we simply treat global and variable attributes as having no different semantics.

comment:97 follow-up: Changed 3 years ago by biard

As I read this discussion, I can't help feeling that we are over-thinking this. A physical analogy to the file/variable attribute question is a box with widgets inside it, all of which have various labels on them. The label on the box named "handling" tells me about the handling of the box (or the collection of widgets in the box, taken as a whole). A label on a widget in the box that is also named "handling" tells me about the handling of the widget. The only way to know whether or not to treat the information on a widget label as replacing or adding to the information on the box label is to read and understand them both.

Without coming up with some sort of formalized grammar and vocabulary for this, I don't see any good way to resolve the "person-in-the-loop" issue. I am also not convinced that there is a need to resolve it. If the attribute is named "comment", I think it is almost assuredly OK to just treat them as independent. If the attribute is named source, it's going to depend on details of what the attribute values are. And so on, and so forth.

comment:98 in reply to: ↑ 97 ; follow-up: Changed 3 years ago by bnl

Replying to biard:

As I read this discussion, I can't help feeling that we are over-thinking this. A physical analogy to the file/variable attribute question is a box with widgets inside it, all of which have various labels on them. The label on the box named "handling" tells me about the handling of the box (or the collection of widgets in the box, taken as a whole). A label on a widget in the box that is also named "handling" tells me about the handling of the widget. The only way to know whether or not to treat the information on a widget label as replacing or adding to the information on the box label is to read and understand them both.

Without coming up with some sort of formalized grammar and vocabulary for this, I don't see any good way to resolve the "person-in-the-loop" issue. I am also not convinced that there is a need to resolve it. If the attribute is named "comment", I think it is almost assuredly OK to just treat them as independent. If the attribute is named source, it's going to depend on details of what the attribute values are. And so on, and so forth.

That's exactly what I was saying in my penultimate paragraph, but you've put it better by not suggesting any best practice *in processing*. I think the global/variable attribute as currently formulated does not allow an unambiguous resolution, and so the data model cannot either.

comment:99 in reply to: ↑ 98 ; follow-up: Changed 3 years ago by jonathan

Dear all

Replying to bnl:

I think the global/variable attribute as currently formulated does not allow an unambiguous resolution, and so the data model cannot either.

The data model is about data variables (fields), not about files. A field has only one property of a given name (source, etc.). So if it were decided that the current convention is ambiguous about how global and variable attributes in a netCDF file are to be treated, the data model could not say how the contents of the relevant property should be derived from the netCDF attributes. Is that right?

That seems rather unsatisfactory to me. Karl proposes we change the convention so it doesn't mention "precedence". I agree, that would be a change. I think the current convention is actually clear. If there is a variable attribute, the global attribute should be ignored. That's what "precedence" means. However, Karl and Nan have given examples where the information in the global and variable attributes need to be combined, so clearly this situation has to be taken seriously. I would say these datasets were not written correctly according to any existing version of the convention, so we don't have to allow for it in the present data model.

There are two things we can do, however.

  • Acknowledge that this situation exists, and recommend that software should have options to combine the attributes upon reading them in. This behaviour should be optional, because there may also exist datasets written on the assumption that variable attributes override global attributes, as was intended (I believe). The user of the data will have to decide which treatment is appropriate.
  • Clarify the convention in the next version to avoid this problem for future data. That should be the subject of a different ticket, because this ticket is about the data model for CF 1.5.
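
The first option could be sketched as a reader function with an opt-in flag. This is an illustration under assumptions: the function name `field_property`, the `combine` flag, the separator and the global-first ordering are all hypothetical choices, with the default following the "variable attribute is used rather than the global one" reading:

```python
def field_property(name, global_attrs, var_attrs, combine=False, separator="\n"):
    """Derive one field property from netCDF-style attribute dicts.

    combine=False: the variable attribute is used rather than the global one.
    combine=True:  both values are joined, global first.
    Returns None if neither level defines the attribute.
    """
    has_global = name in global_attrs
    has_var = name in var_attrs
    if has_var and has_global and combine:
        return separator.join([str(global_attrs[name]), str(var_attrs[name])])
    if has_var:
        return var_attrs[name]
    if has_global:
        return global_attrs[name]  # the global value acts as the default
    return None


global_attrs = {"history": "2012-05-01 merged into one file"}
var_attrs = {"history": "2012-04-01 interpolated to pressure levels"}

overridden = field_property("history", global_attrs, var_attrs)
combined = field_property("history", global_attrs, var_attrs, combine=True)
```

Either way the field ends up with a single `history` property; the flag only changes how its value is derived from the file.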

Best wishes

Jonathan

comment:100 in reply to: ↑ 99 ; follow-up: Changed 3 years ago by davidhassell

Replying to jonathan:

Dear all

Replying to bnl:

I think the global/variable attribute as currently formulated does not allow an unambiguous resolution, and so the data model cannot either.

The data model is about data variables (fields), not about files. A field has only one property of a given name (source, etc.). So if it were decided that the current convention is ambiguous about how global and variable attributes in a netCDF file are to be treated, the data model could not say how the contents of the relevant property should be derived from the netCDF attributes. Is that right?

My interpretation was similar, but not quite the same? The lack of an unambiguous resolution doesn't affect the data model because, whichever resolution is chosen (precedence or concatenation), the field (in a data model sense) still has just one comment or one source, etc.

I agree with the rest of Jonathan's analysis. I like the idea for dealing with the current situation of simply giving software an option to combine the attributes upon reading them in.

All the best,

David

comment:101 in reply to: ↑ 100 ; follow-up: Changed 3 years ago by rhattersley

Replying to davidhassell:

The lack of an unambiguous resolution doesn't affect the data model ...

As things stand I disagree with that interpretation. But perhaps discussing a new aspect will help shed light on the issue, so I'd like to bring up the subject of "non-standard attributes"...

When one describes the phenomenon in a data variable, one is not confined to using the "standard_name" attribute and can utilise the free-form value of the "long_name" attribute instead. Similarly, one may also use "non-standard attributes" to record additional facets of the metadata. This is explicitly recognised in the conventions document and much use is made of this extensibility in practice.

In the light of which, the data model needs to reflect this open-ended capability and include an explicit definition of non-standard attributes.

NB. The conventions also state that, "Application programs should ignore attributes that they do not recognise or which are irrelevant for their purposes." But this statement does not apply here as the data model is deliberately intended to be generic and cannot be categorised as an application program.

comment:102 in reply to: ↑ 101 ; follow-up: Changed 3 years ago by davidhassell

Replying to rhattersley:

Replying to davidhassell:

The lack of an unambiguous resolution doesn't affect the data model ...

As things stand I disagree with that interpretation. But perhaps discussing a new aspect will help shed light on the issue, so I'd like to bring up the subject of "non-standard attributes"...

Hello Richard,

Good point about non-standard attributes (although I'm not sure how they relate to the global/data attribute debate). Our proposal suggests, in the Other Properties section:

The CF data model allows field, dimension and auxiliary coordinate constructs to have other properties not defined by CF, provided they do not conflict with CF, but since they are not part of the CF standard, the data model does not provide any interpretation of them.

All the best,

David

comment:103 in reply to: ↑ 102 ; follow-up: Changed 3 years ago by markh

Replying to davidhassell:

Good point about non-standard attributes (although I'm not sure how they relate to the global/data attribute debate).

CF recognises that people use non-CF attributes and impart meaning into these which should be carried along with a dataset once it is extracted from a NetCDF file into the data model.

So, consider a file which has a global attribute of 'responsible_authority' and two data variables, one with a data variable attribute of 'responsible_authority'.

This key is not CF, the data model objective is just to preserve the information, not provide any CF insight into the interpretation.

If two name spaces are used, one for global attributes and one for data variable attributes, this preservation becomes easy.

Some parties in this discussion have favoured one attribute set, I presume with unique keys. In this case I am concerned that the information from the file related to this key cannot be stored in a Field instance.

comment:104 in reply to: ↑ 103 Changed 3 years ago by davidhassell

Replying to markh:

Dear Mark,

We are trying to build a logical data model which describes the irreducible building blocks of the CF conventions. As has been mentioned a few times, whichever of the two interpretations of "precedence" is used (concatenation or replacement), a single field attribute results. Therefore, that is all that is required by the data model. There may be value in a software library giving a user the choice to have distinct global and data variable attributes, but that is not the issue here.

All the best,

David

comment:105 Changed 3 years ago by biard

I think this way of thinking about the issue of global vs variable attributes within the CF data model is taking us down a wrong path. The idea that there should, in essence, be no "container" construct within the CF data model seems to me to be an attempt to force a particular viewpoint, rather than model what exists.

Given that it's not likely that people will change this course, here's a suggestion for a compromise. If you allow an attribute to itself have a "parent" attribute associated with it, you could capture both the variable attribute value and the global attribute value without forcing a particular way of resolving the precedence question. You could give the association a name (that escapes me at the moment) to indicate the precedence relationship.
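
A possible shape for this compromise, sketched in Python. The class name `Attribute`, the field name `parent` and the `chain` helper are hypothetical; the sample values reuse the 'Unified model' / 'ARPEGE' source example from comment 92:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attribute:
    name: str
    value: str
    parent: Optional["Attribute"] = None  # e.g. the same-named global attribute

    def chain(self):
        """Values from this attribute up through its parents, nearest first."""
        node, values = self, []
        while node is not None:
            values.append(node.value)
            node = node.parent
        return values


global_source = Attribute("source", "Unified model")
variable_source = Attribute("source", "ARPEGE", parent=global_source)
values = variable_source.chain()
```

Both values are retained, and whether the parent is to be read as overridden or as complementary is left to whoever consumes the chain.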

comment:106 Changed 3 years ago by jonathan

Dear Jim

I am not quite sure what you mean by "container". Do you mean a file, perhaps, which contains fields? I believe that we have so far agreed in this ticket that the CF data model is concerned with fields. That's indicated by this text, which I think we have all been happy with e.g. in comment 27: "The central concept of the data model is a field construct. A field construct corresponds to exactly one data array together with associated information about the domain in which the data resides (defined by spatio-temporal and other coordinates) and other metadata. This data model makes a central assumption that each field construct is independent." The reason for taking this approach is that we want the CF data model to apply to data which isn't stored in physical netCDF files. It might be served from some other kind of database or file format, even if presented as though it were netCDF to the application.

Because the fields are independent, David and I think that each field has only one comment property, for instance. The CF standard says that variable attributes have precedence over global attributes, which I think means that the value of the field's comment property should come from the data variable's comment attribute, if it has one, disregarding the global comment attribute. If the data variable has no comment attribute, the property comes from the global comment attribute, if there is one. However, it's been pointed out by Karl and others that some files have been written on the assumption that both global and variable attributes simultaneously apply to the data variable. That is, in my view, an erroneous interpretation of the CF-netCDF standard, but it doesn't cause a problem for the data model, if we allow software implementations to combine the global and variable comment attributes in some way, if both exist, to obtain the comment attribute for the field. Thus, I think we can just have a single comment property for the field in the data model, and we can regard it as an issue for implementation how the value of that property is got from the netCDF file.

Is your suggestion something like Mark's in comment 103, which distinguishes information from global and variable attributes? Although that would be possible, it's not consistent with CF version 1.5.

Mark's example mentions an attribute responsible_authority. This is not a CF attribute anyway, so the CF data model doesn't need to say anything about how it should be handled, except that it is legal to have such attributes, as David said in comment 102.

Best wishes

Jonathan

comment:107 Changed 3 years ago by davidhassell

Hello,

I'd like to open up a new topic for discussion, namely transform constructs, but this in no way means stopping the others!

I think that the current proposal's definition of a transform construct is a bit confused by trying to be too general, going beyond what CF-1.5 states, so Jonathan and I have tightened it up a bit.

It is still the case that transform constructs are contained by a parent field construct, but we have altered their definition to be more precise:

A transform construct defines a mapping from one set of coordinates which cannot geo-locate the field construct's data to another set of coordinates that can geo-locate the field construct's data.

The proposed full text for this section is:


Transform Constructs

A transform construct defines a mapping from one set of coordinates which cannot geo-locate the field construct's data to another set of coordinates that can geo-locate the field construct's data.

A transform construct contains

  • A transform name which indicates the nature of the transformation and implies the formula to be used. A CF-netCDF file does not explicitly record the formula; it depends on the application software knowing what to do.
  • An unordered collection of variables which correspond to the terms of the transformation formula. The variables may be scalar parameters, pointers to dimension or auxiliary coordinate constructs of the field construct, or pointers to other field constructs. The collection includes the input coordinates being mapped, but not the output coordinates.

A transform construct provides geo-locating metadata to all of the dimension and auxiliary coordinate constructs referenced by it.

Transform constructs correspond to the functions of the CF-netCDF attributes formula_terms, which describes how to compute a vertical auxiliary coordinate variable from components (CF Appendix D), and grid_mapping, which describes how to compute true latitude and longitude auxiliary coordinate variables from horizontal projection dimension coordinates, or describes the figure of earth for true latitude and longitude coordinate variables (CF Appendix F).

The transform name is the standard_name of a vertical coordinate variable with formula_terms, and the grid_mapping_name of a grid_mapping variable. The scalar parameters are scalar data variables (which should have units if dimensional) named by formula_terms, and attributes of grid_mapping variables (for which the units are specified by the transform construct). The role of each term in the formula of the transform construct is identified by its keyword in a formula_terms attribute, or its attribute name in a grid_mapping variable.

Note that the transform construct for a CF-netCDF grid mapping of latitude_longitude is a special case in which the outputs are the same as the input coordinates.
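
As a rough illustration of the proposed text, a transform construct could be represented as a name plus an unordered mapping of terms. The class and function names below are hypothetical; pointers to coordinate or field constructs are represented simply by variable names, and the parser assumes the "key: variable" pair layout of a formula_terms attribute:

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class TransformConstruct:
    name: str  # a standard_name (formula_terms) or a grid_mapping_name
    # Each term is a scalar parameter (float) or a pointer, represented
    # here by the name of a coordinate or field construct.
    terms: Dict[str, Union[float, str]] = field(default_factory=dict)


def from_formula_terms(standard_name, formula_terms):
    """Build a transform from a string such as 'sigma: sigma ps: ps ptop: ptop'."""
    tokens = formula_terms.split()
    if len(tokens) % 2 != 0:
        raise ValueError("formula_terms must be 'key: variable' pairs")
    terms = {}
    for key, variable in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError("malformed term key: %r" % key)
        terms[key[:-1]] = variable  # strip the trailing colon
    return TransformConstruct(name=standard_name, terms=terms)


vertical = from_formula_terms(
    "atmosphere_sigma_coordinate", "sigma: sigma ps: ps ptop: ptop"
)

# Grid mapping parameters are scalar attributes rather than pointers.
horizontal = TransformConstruct(
    name="rotated_latitude_longitude",
    terms={"grid_north_pole_latitude": 37.5, "grid_north_pole_longitude": 177.5},
)
```

The formula itself is not stored anywhere in this structure, matching the statement that application software is expected to know it from the transform name.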


All the best,

David
