Opened 2 years ago

Closed 7 months ago

#145 closed enhancement (fixed)

Subconvention for associated files, proposed for use in CMIP6

Reported by: jonathan Owned by: davidhassell
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: Cc:

Description

1 Title

Subconvention for associated files

2 Moderator

Balaji

3 Purposes

CMIP6, like CMIP5, will store cell_measures variables in a different file from the data variables to which they belong. This is to save storage space, but it is not legal in CF, which is a convention that so far applies only to individual self-contained netCDF files. To relax that restriction requires regarding two or more files as though they were a single dataset. Rules for aggregating files are needed. In this ticket a simple mechanism is proposed which is sufficient for CMIP6 to allow one file to refer to another file.

Note that the file referred to is not necessarily identified by name, because this is fragile and caused some difficulties in CMIP5. This proposal does not say exactly how the file should be found. A further convention specifically for CMIP6, and not part of CF, will be needed for that, and other users of this subconvention would similarly have to adopt their own rule.

4 Status quo and benefits

CMIP6 files will not be CF-conforming without this change. Legalising them is a benefit, and the mechanism will probably be useful in other situations. This proposal arose from email discussions involving Balaji, Jonathan, Steve Griffies and others in September 2014 during discussions about the Ocean Model Development Panel recommendations of diagnostics for CMIP6.

5 Detailed proposal

The proposal is to introduce a subconvention of CF, i.e. a convention which is not part of CF but is intended to be used in combination with it. It is proposed to insert the following text as a named (but not numbered) section of the CF standard document, before Appendix A. The title of the section will be "Subconvention for associated files" and the text is below. In addition to the following new section, in Table A.1 of Appendix A insert an entry for associated_files, type S, use G, link "Associated files subconvention", description "Indicates where files containing metadata variables can be found".

CF is a convention for individual netCDF files, which implies that if a data variable refers to another variable containing metadata, the variables must be in the same file. This subconvention provides a mechanism to allow data variables in one file to refer to metadata variables in another file or files. When this subconvention is used, the netCDF file containing the data variable should contain CF-n/associated-files in its global Conventions attribute, where n is the CF version number to which it conforms.
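As an illustration of how a reader of such files might detect the opt-in, here is a minimal Python sketch. The function name uses_associated_files is hypothetical, and the sketch assumes that the Conventions attribute is a blank-separated or comma-separated list of convention names.

```python
import re

def uses_associated_files(conventions: str) -> bool:
    """Return True if a Conventions string opts in to the subconvention."""
    # Assumption: convention names are separated by blanks or commas.
    tokens = re.split(r"[,\s]+", conventions.strip())
    # Accept any CF version number n, e.g. "CF-1.7/associated-files".
    pattern = r"CF-\d+(\.\d+)*/associated-files"
    return any(re.fullmatch(pattern, token) for token in tokens)

print(uses_associated_files("CF-1.7/associated-files"))
print(uses_associated_files("CF-1.7"))
```

A file whose Conventions attribute is only CF-1.7 would be treated as an ordinary self-contained CF file.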

The optional global attribute associated_files of the file containing the data variable indicates where the files containing metadata variables can be found. This attribute is a string whose syntax is not standardised. For instance, it could be the path to a file, a URL of a file, or a URL of a website where the required file could be found (thus requiring human intervention). Applications which use this subconvention may define their own rules for the syntax and the interpretation of the associated_files attribute.

The metadata variables to which this subconvention applies are those identified by the coordinates, formula_terms, grid_mapping and cell_measures attributes. These metadata variables are identified by name. The named variables may be stored in either the same file as the data variable which refers to them, as usual, or in other files, provided that

  • There is only one variable of that name in the data in any of the files concerned (the file containing the data variable and any of the associated files), so that the identification of the metadata variable is unambiguous.
  • If the metadata variable is in a different file from the data variable, its dimensions must have names which are also names of dimensions in the file containing the data variable, and these dimensions must have the same sizes as they do in that file. These rules are usual CF conventions when the metadata variable is in the same file as the data variable.

Example

A file containing a data variable:

dimensions:
  lat=73;
  lon=96;
  level=20;
variables:
  float temperature(level,lat,lon);
    temperature:cell_measures="area: areacell";
    temperature:standard_name="air_temperature";
    temperature:units="degC";
// global attributes:
  :Conventions="CF-1.7/associated-files" ;
  :associated_files="http://some.web.site/areacell.nc";

In this example, the associated_files attribute gives the URL of the following file, which contains a metadata variable:

dimensions:
  lat=73;
  lon=96;
variables:
  float areacell(lat,lon);
    areacell:units="m2";
// global attributes:
  :Conventions="CF-1.7" ;

The variable areacell would need to be in the same file as temperature according to standard CF. This subconvention allows it to be stored in a different file. It would be an error if there was a variable called areacell in both files, since it would be ambiguous which should be used. It would be an error if the latitude and longitude dimensions had names other than lat and lon, or different sizes e.g. lat=180, in the second file, because they must correspond to dimensions of the data variable in the first file.
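The two conditions can be sketched as a check in Python. This is only an illustration: each file is represented by a plain dictionary (a hypothetical stand-in for whatever a netCDF library would return), and check_associated is an invented name, not part of any existing tool.

```python
def check_associated(data_file, associated_files, var_name):
    """Find var_name among the files and check it against both conditions."""
    all_files = [data_file] + list(associated_files)
    # Condition 1: exactly one variable of that name across all files.
    holders = [f for f in all_files if var_name in f["variables"]]
    if len(holders) != 1:
        raise ValueError(f"{var_name!r} found in {len(holders)} files, expected 1")
    holder = holders[0]
    # Condition 2: dimension names and sizes must match the data file.
    for dim in holder["variables"][var_name]:
        if dim not in data_file["dimensions"]:
            raise ValueError(f"dimension {dim!r} is not in the data file")
        if holder["dimensions"][dim] != data_file["dimensions"][dim]:
            raise ValueError(f"dimension {dim!r} differs in size")
    return holder

# The two files of the example, reduced to dimensions and variable shapes.
data_file = {
    "dimensions": {"lat": 73, "lon": 96, "level": 20},
    "variables": {"temperature": ("level", "lat", "lon")},
}
area_file = {
    "dimensions": {"lat": 73, "lon": 96},
    "variables": {"areacell": ("lat", "lon")},
}

holder = check_associated(data_file, [area_file], "areacell")
print(holder is area_file)
```

With lat=180 in the second file, or with areacell present in both files, the check raises ValueError.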

Change History (63)

comment:1 Changed 2 years ago by mgschultz

I like this proposal, but I have one issue with the statement "The optional global attribute associated_files of the file containing the data variable indicates where the files containing metadata variables can be found. This attribute is a string whose syntax is not standardised." - once again, this will make it difficult to use automated systems to even identify whether a file can be found under the given string or not. I would suggest to add two attributes like associated_files with the meaning you described and associated_file_url as a more specific term which would require that something machine-readable is found under the referenced URL. Either one is optional, but one of them should (must?) be used. Another term could be associated_file_identifier which would have to go together with a description of a catalogue (web service) where a file with this identifier can be found.

Again, all of this would be optional, but if there is at least a chance to generate files which are both CF-compliant and machine-interpretable, then I think we should go for it.

Best regards, Martin

comment:3 Changed 2 years ago by mcginnis

I think adding a standardized mechanism for associating different files is a reasonable idea.

However, I also think that relaxing the single-file restriction is a very, very bad idea.

Metadata, coordinates, and other ancillary data must be kept tightly coupled to data variables because otherwise it's far too easy for them to get lost or out of sync and to hinder or corrupt data processing and analysis in subtle and insidious ways. One of the biggest strengths of the netcdf file format is that it packages everything up in a way that makes it easy to keep it all together and in sync. Separating the ancillary data into a different file would make it nearly impossible to write software that can sensibly handle basic tasks like subsetting the data.

Now, if the problem is that there are 3-D variables with time-varying height/depth (e.g., masscello for Boussinesq ocean models with time-dependent grid cell volumes), the solution I have seen, and which is fully CF legal as best I can tell, is to recognize that in this case the height variable is not an ancillary variable at all, but simply another data variable, and treat it as such. That's how geopotential height is generally handled for the atmosphere: it's a variable like temperature or velocity, not an ancillary coordinate like 2-D lat or lon. You don't list it in "coordinates" or "cell_measures"; you just have to know what it is so you can use it when you're making calculations that involve volume, the same way that you have to know what the velocity variable is to use when you're making calculations that involve momentum.

I think you could make an argument that, unlike "coordinates", "formula_terms", and "grid_mapping", "cell_measures" identifies (useful) supplemental data rather than (essential) ancillary data and that it should therefore be allowed to refer to a variable in a separate file, and propose a mechanism for making external references like that. But I think it would be a really bad idea to break the single-file requirements for "coordinates", "formula_terms", and "grid_mapping", so I have to oppose this ticket as it is currently written.

Cheers, Seth

comment:4 Changed 2 years ago by stevehankin

http://cf-trac.llnl.gov/trac/ticket/145

============

I second the concern that Martin has raised. CF needs to ensure that a reliable API always exists with which to access the datasets that it defines -- including aggregated datasets. As long as the syntax of the "associated_files" reference remains unstandardized, the reliability of the API is compromised. Plus, external references are brittle/unreliable. Exploring here whether a way around these limitations may be found ....

The problem of CMIP6 cell_measures is not unique in calling for a generalization of the single file CF dataset into an aggregation of files. Other related aggregation needs include (the venerable) aggregations of files into time series; aggregations of models into ensembles; aggregations of forecasts into FMRCs (forecast model run collections); and aggregations of tiles into "gridspec" datasets. (Also aggregations of DSG feature files ... but that may be another discussion.) All of these aggregation types share the need that CMIP6 has raised: that there may be time-independent/ensemble-independent/forecast-independent fields that we also want to include in the aggregation (e.g. cell_measures, a land-ocean mask, etc.).

Here is a simple (simplistic?) proposal that attempts to address the need to associate CF files into aggregations in a manner that does not bring brittle/unreliable external references into CF. The idea is to re-define the problem such that the external references become implicit rather than explicit; that the CF files contain the metadata needed for an external piece of software to create valid file linkages on-the-fly. Here's a straw man:

define new CF attribute "external_aggregation_keys"

a white-space separated list of strings that indicates memberships in higher level aggregations

Examples of usage

CMIP6 example:

model history output files and the file containing the cell_measures field all contain:

 external_aggregation_keys = "GFDL_CM2.5_CMIP6_global_lowres_cellMeasures";

time series example

all model outputs that belong to the same model run contain:

 external_aggregation_keys = "GFDL_CM2.5_CMIP6_global_lowres_runstart_21oct2015";

combined aggregation of external cell measures and time series files

model output history files all contain 2 keys:

external_aggregation_keys = "GFDL_CM2.5_CMIP6_global_lowres_cellMeasures GFDL_CM2.5_CMIP6_global_lowres_runstart_21oct2015";

the cell_measures file contains only 1:

 external_aggregation_keys = "GFDL_CM2.5_CMIP6_global_lowres_cellMeasures";
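To make the straw man concrete, here is a minimal Python sketch of how an external piece of software might use the keys to work out which files belong together. The file names and the function group_by_key are hypothetical; only the attribute values are taken from the examples above.

```python
from collections import defaultdict

def group_by_key(key_attrs):
    """Map each aggregation key to the set of files that carry it."""
    groups = defaultdict(set)
    for file_name, keys in key_attrs.items():
        # The attribute is a white-space separated list of keys.
        for key in keys.split():
            groups[key].add(file_name)
    return dict(groups)

CELL_KEY = "GFDL_CM2.5_CMIP6_global_lowres_cellMeasures"
RUN_KEY = "GFDL_CM2.5_CMIP6_global_lowres_runstart_21oct2015"

# Hypothetical file names mapped to their external_aggregation_keys values.
key_attrs = {
    "history_1.nc": f"{CELL_KEY} {RUN_KEY}",
    "history_2.nc": f"{CELL_KEY} {RUN_KEY}",
    "cell_measures.nc": CELL_KEY,
}

groups = group_by_key(key_attrs)
for key in sorted(groups):
    print(key, sorted(groups[key]))
```

Here the cell measures key collects all three files, while the run key collects only the two history files.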

There is a problem, of course. In the CMIP6 example the model history output files would fail in a CF conformance tester because the "cell_measures" attributes they contain point to a field that is missing from the file. However, the creators of these model history files know of this situation as they are writing the files. They just need a mechanism to indicate it. So we propose a second new CF attribute, "external_references". In the CMIP6 example above the model history files would need to contain

external_references = "this_cell_measures_field";

(Arguably, the same concern over breaking CF conformance applies to the original proposal: the CF file needs to confess to the variable(s) it is missing.)

To fully flesh out this proposal we need to consider how client applications should behave when they encounter datasets that contain "external_references" fields. I am leaving that as a future discussion, after seeing whether other CFers consider this entire approach loopy.

Last edited 2 years ago by stevehankin

comment:5 Changed 2 years ago by caron

External forces dictate how a dataset is divided into physical files. These are at odds with the desire for complete metadata in a single file. My opinion is to relax the notion of completeness as a requirement for "CF-compliance". It seems that is the gist of this proposal.

If a file doesn't have a "cell_measures" variable, it is still a valid CF file. So why make it not valid CF if it indicates that the "cell_measures" variable lives in another file? Perhaps, though, the attribute should be called something else like "external_cell_measure"?

I have to agree with the concerns about trying to specify the URL of external files. It seems likely that the link will eventually fail, although Martin's additions give it more traction. But I'm more in sympathy with specifying how an external program can assemble files into metadata-complete datasets, and what rules to follow in generating the component files. Steve's ideas seem to me to be in a good direction.

In short, it seems that we could

1) clarify the meaning of "CF-compliant", and allow a way to relax the need for strict "metadata-completeness". A subconvention as proposed, or some other global attribute would work.

2) come up with a standard way to specify, in an external config file, which files are meant to be collected together into a logical dataset, and how component files should be written to enable that.

In the end, systems like CMIP6 need flexibility in how to divide very large datasets into physical files. We should accommodate that.

comment:6 Changed 2 years ago by jonathan

Dear all

Thanks for your thoughtful comments.

Yes, as John says, the gist of the proposal is to relax the requirement for files to be self-contained. I think this is necessary, because CMIP and no doubt other projects already treat datasets as distributed across files, and we need to recognise and accommodate that situation.

Seth, you disagree with this on principle, and I'm not sure I can think of any other arguments to persuade you! You are right of course that you can make the file legal by dropping the external references to metadata, but that's not so helpful. I can only point out that this proposal is a subconvention (we haven't had one before). That is, it's something you have to opt into explicitly, and the present CF conventions requiring self-contained files will still apply by default. Can you live with that?

By the way, I should note that I didn't include bounds in the proposal, because data variables don't have bounds.

The proposed method specifies rather vaguely how the file(s) are to be found, requiring user intervention, because that's what CMIP6 needs, for the sake of flexibility and robustness. The attribute name associated_files is already in use (though not CF standard) in earlier CMIP datasets, so we kept the name for backward compatibility. However, I'm fine with including an attribute associated_files_url in this proposal as well, providing a blank-separated list of URLs to files, following Martin's suggestion, which has John's support. I think that these should be alternatives i.e. it would be an error to include both. The choice of which to use would be made by the project concerned. Would that be OK, Martin and John?

I think that the addition of this attribute should be sufficient to make the dataset legal, without having to change any of the other attribute names. This means that you could simply copy all the variables unchanged into one file (omitting any duplicates) to make a legal CF-netCDF file with no external references. I like that because it emphasises that the collection of files is to be treated as one file.

Steve's right to point out there are other varieties of aggregation. This proposal is for a very simple mechanism, but it's not intended to exclude other, more powerful, approaches. For example, in ticket 78, David and I proposed a mechanism for aggregating data variables using existing CF metadata only, like the time aggregation which is needed with CMIP datasets. David has coded those rules in cf-python. They are closely related, we think, to the CF data model, which is still under discussion - but that's another subject, not needed for the present proposal.

Best wishes

Jonathan

comment:7 Changed 2 years ago by stevehankin

Jonathan et al.,

There's a concern that I believe is in all of our minds, but has not been made explicit enough: pathnames and URLs embedded in CF files are "brittle"

If associated_files names a file path, it will likely become invalid as soon as the files are downloaded onto some other file system.

If associated_files names a URL, it will be valid only for some finite period of time ... until the URL changes or is removed.

Conclusion: the proposed sub-convention will lead to corrupted CF datasets.

Have we side-stepped this concern by making the proposal a "sub-convention"? Myself, I'd say no to that. The CF document should not, itself, knowingly lead to corrupted datasets.

Are there alternative approaches that may serve the needs of CMIP6 without the brittle linkage problem just outlined? I believe the answer may be yes and I think John agreed. A straw man is proposed, above; as an entry point see the "external_references" attribute, which is a suggestion of how (borrowing John's words) to "relax the notion of completeness". What are your thoughts regarding the substance of that straw man? Could the approach be made satisfactory with modifications?

Last edited 2 years ago by stevehankin

comment:8 Changed 2 years ago by mcginnis

Hi Jonathan, et al.

I agree with Steve that labeling this a "sub-convention" in no way addresses the problem. A sub-convention is only opt-in for data providers whom it conveniences, not for data consumers and software developers whom it causes pain. So no, that doesn't allay my concerns.

I do like Steve's general approach of providing a mechanism to identify a given file as part of a larger dataset that you should go looking in if there's something missing that you need. That's useful not just for metadata, but also for data in general: if I have a humidity field and want the temperature that goes along with it, it would be very handy to have a unique ID to check, rather than piecing it together from a bunch of different global attributes. So generally speaking, I'm with John Caron in thinking that's a good direction to explore.

However, I remain strongly opposed to putting paths and filenames into the metadata. They are so brittle and so easily invalidated that I regard them as already broken before you start. URLs are nearly as bad. If you're going to have an external reference in a file, I really think it needs to be something that will persist and that there's a commitment to keep updated, like a DOI.

I could get behind a proposal for adding an attribute that uniquely identifies the dataset that a file belongs to and associating the dataset with a DOI. (Or multiple datasets / DOIs, since the membership could be hierarchical.) Steve's strawman lists other types of aggregation that would also benefit, and I think there's a lot of potential value that such a proposal could add.

But it's still not clear to me whether there's a case where it makes sense to use this mechanism to violate the principle of metadata completeness by putting ancillary data in a separate file. The example in the original proposal is not at all compelling. It pulls a static 2-D field out of a static 3-D file, saving 28k on a file that's only barely over half a meg. That's absolutely trivial, and there's no reason to do it.
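For reference, those figures can be reproduced from the dimensions in the original example. This is a back-of-envelope Python sketch which assumes 4-byte floats and ignores netCDF header overhead; the variable names are illustrative.

```python
FLOAT_BYTES = 4  # assumption: both variables are 4-byte floats
lat, lon, level = 73, 96, 20  # dimension sizes from the example

# areacell(lat, lon): the 2-D field moved to the associated file
areacell_bytes = lat * lon * FLOAT_BYTES
# temperature(level, lat, lon): the 3-D data variable that stays behind
temperature_bytes = level * lat * lon * FLOAT_BYTES

print(areacell_bytes)     # 28032 bytes, i.e. about 28k
print(temperature_bytes)  # 560640 bytes, i.e. just over half a megabyte
print(round(100 * areacell_bytes / temperature_bytes, 1))  # 5.0 percent
```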

This is even more true for formula_terms and grid_mapping. These variables take up at most a few dozen bytes. There is never, ever going to be a case where the storage savings of putting them in a separate file will be worth the associated inconvenience.

Nor does it ever make sense to put the variables referenced by the coordinates attribute into a separate file. Because these variables are used to geolocate the data, they are essential, not optional. In normal use, they are not tiny, but they are much smaller than the data variable(s) in a file, being associated with some subset of the coordinate variables. In all normal uses of netcdf, the storage savings of separating out this essential ancillary data is never going to be worth it.

This also holds for a static cell_measures variable, so the only case that's left is a time-varying cell_measures variable. I think that it may well make sense for CMIP to separate out cell_measures for time-varying ocean volumes and for CF to support that. I also think that this proposal is not the way to do it, and that (as per my previous comment), the real issue is that cell_measures isn't ancillary data at all, but associated data, and that solutions to that problem already exist. But it's hard to be sure and harder to discuss, because we don't have an example that showcases that problem directly.

To sum up, I remain strongly opposed to creating a subconvention that relaxes the requirement for files to be self-contained, especially with respect to coordinates, formula_terms, and grid_mapping. But I don't think we need to do that to accommodate CMIP's need to put cell_measures data in a separate file. Jonathan, could you provide some more details on this problem, perhaps with a realistic example? With that to use as a reference, I think we can make CMIP6 CF-compliant with only minimal changes to either the spec or the CMIP data. But this could also be an opportunity to get a big usability boost by adding a mechanism for DOI-based external references.

Thanks, Seth

comment:9 Changed 2 years ago by martin.juckes

Hello Jonathan et al,

I agree with Steve's concern about URLs in files, many of which will expire within the useful life of the archive. One approach, picking up on Seth's comment about DOIs, might be to restrict this to use with DOIs. This would require the modelling centres to put some work into getting the externally referenced files cleaned up before writing the files which make the references, but would that be a bad thing? The attribute is not much use if the external data is not reliably available, and a DOI is the only way we can guarantee that.

Steve's straw man is an interesting alternative, but I think it can only work if there is a formally defined namespace which can be used to identify the externally referenced resource. E.g.

external_namespaces = 'dr:http://w3id.org/cmip6dr/ns'
associated_resource = 'dr:GFDL.CM2-5.areacell'
resource_id = 'dr:GFDL.CM2-5.CMIP6.historical.r1i2p3f1.tasmax'

The last line declares the identity of the current file in the declared namespace, the 2nd line is the identity of the external file.

Apart from the addition of the namespace declaration, this would work exactly as suggested by Steve: the two attributes would provide a robust mechanism for confirming that a given resource matches the external reference by requiring that "associated_resource" in the referring file matches "resource_id" in the external file. The declaration of a namespace gives the user somewhere to go to find out how to interpret these strings if they need help in discovering the external resource.
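The matching rule could be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and it assumes that external_namespaces is a blank-separated list of prefix:URI entries and that the external file declares the same namespace.

```python
def parse_namespaces(declaration):
    """Map each prefix to its namespace URI."""
    namespaces = {}
    for entry in declaration.split():
        # Split at the first colon only, since the URI contains colons too.
        prefix, _, uri = entry.partition(":")
        namespaces[prefix] = uri
    return namespaces

def expand(identifier, namespaces):
    """Turn 'prefix:name' into a (namespace URI, name) pair."""
    prefix, _, local = identifier.partition(":")
    if prefix not in namespaces:
        raise KeyError(f"undeclared namespace prefix {prefix!r}")
    return namespaces[prefix], local

def is_referenced_resource(referring, candidate):
    """True if candidate's resource_id is what referring asks for."""
    wanted = expand(referring["associated_resource"],
                    parse_namespaces(referring["external_namespaces"]))
    offered = expand(candidate["resource_id"],
                     parse_namespaces(candidate["external_namespaces"]))
    return wanted == offered

# Global attributes of the referring file, as in the example above.
data_attrs = {
    "external_namespaces": "dr:http://w3id.org/cmip6dr/ns",
    "associated_resource": "dr:GFDL.CM2-5.areacell",
    "resource_id": "dr:GFDL.CM2-5.CMIP6.historical.r1i2p3f1.tasmax",
}
# Assumed global attributes of the external file holding areacell.
areacell_attrs = {
    "external_namespaces": "dr:http://w3id.org/cmip6dr/ns",
    "resource_id": "dr:GFDL.CM2-5.areacell",
}

print(is_referenced_resource(data_attrs, areacell_attrs))
```

Comparing expanded pairs rather than raw strings means two files could use different prefixes for the same namespace and still match.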

This 2nd option may be easier to implement than a DOI based system.

regards, Martin J.

comment:10 Changed 2 years ago by lowry

This is a discussion I have encountered several times over the past 10 years. When we designed SeaDataNet in 2005, the counsel we received was that URIs embedded in data should be URNs because the longevity of URLs cannot be trusted. Consequently, our data and metadata are liberally populated with references like SDN:P01::PRESPR01. Ifremer maintain a URN to URL resolver, which if you know where to find it will translate these into URLs like http://vocab.nerc.ac.uk/collection/P01/PRESPR01/.

Since then, issues caused by dynamic resolution from URNs to URLs (especially if the URNs come from multiple namespaces) have caused best practice recommendations to switch from URNs to URLs. I don't see a problem with this PROVIDING those in charge of the URLs can be trusted to ensure the URLs do not expire within the useful life of anything in which they have been embedded.

So, it all comes down to trust and in my view the responsibility level is the same whether maintaining URLs or a URN namespace. URLs are much easier for any client so providing CF can find one or more organisations prepared to deliver URLs with a long-term guarantee I would go with them.

DOIs are a special kind of URN, where URN to URL resolution comes with a longevity guarantee, but there is a cost. Somebody needs to maintain URL landing pages and guarantee that any changes made to the digital object are done in a strictly versioned environment, which may require minting and daisy-chaining additional DOIs. This is worthwhile in support of published science transparency and reproducibility, but may be over the top for linkages in NetCDF data files.

comment:11 Changed 2 years ago by jonathan

Dear all

There are various issues in this discussion.

The need for the proposal

The motivation is to save space. Seth finds that the case I gave, of cell_measures, is not compelling. However, that is actually the use-case of CMIP6. The files will not contain the cell_measures variables that the cell_measures attribute names, which would have to be included in every data file. There are many data files, because each variable is in a different one, and they are usually split up by time as well, so that's lots of duplication. Even so, I agree, the cell measures would not be a large fraction of a data file, which usually contains decades of monthly fields. Nonetheless, that's the reason for it. The information will not be there, and the same was true in CMIP5, whose files are therefore formally defective wrt CF. We'd like to avoid that for CMIP6.

It has quite often been proposed on the email list that coordinates should not have to be stored in the data file. This is not an issue for 1D coordinates; the problem is the 2D latitude-longitude coordinates for non-latitude-longitude grids, and their bounds, which are four times larger; thus, for every field, you have coordinate information which is five times larger. I think it's reasonable to ask for a mechanism which allows this to be stored in a different file from the data variables. However, we could explicitly restrict its use to multidimensional (auxiliary) coordinate variables.

The issue applies to formula terms which are fields, such as the surface pressure field for an atmosphere sigma coordinate. This field is probably time-dependent, and therefore might be as large as the data variable. Again, we could restrict the use of the mechanism to multidimensional formula terms, because it's not a problem for 1D (level-dependent) formula terms or scalars.

I agree that there aren't problems of space for grid_mapping, and it's my mistake to have included them. I propose to leave that out.

The use-case is sufficiently convincing for CMIP6. We could decide anyway not to support it in CF, and to correct the illegality in CMIP6 by omitting the cell_measures attribute, but I think that's less helpful. It's OK for the data-writer to put cell measures in a different file with no link to the data variable, because they are optional anyway. If the data-producer decides to do that, it will make the files more useful if there is a means to say where the metadata is stored. That's the motivation for this proposal. The cell_measures attribute is useful too because it gives the data-user the name of the variable in the metadata file, once they have obtained it.

Robust information about file location

Robustness is the reason our original proposal doesn't require the URL of the metadata file to be stored in the data file. Martin suggested that as an alternative, and I think it's fine to include it. It would also be fine to have associated_files_doi as a further alternative. This choice can be left to the data-producer. However, CMIP6 won't use these alternatives, I believe, and if they are making this proposal unpopular I would rather omit them. CMIP6 will use associated_files to supply a URL to a central CMIP6 web page which explains where the files are stored. It's a kind of documentation, recorded in the file.

I think that using this subconvention implies that the data-producer (or curator, in the case of CMIP6) has accepted the responsibility of maintaining online information so that associated_files, associated_files_url or associated_files_doi will continue to be valid, for as long as the data itself is useful. We could state that in the text of the subconvention.

External references

If I understand correctly, part of Steve's suggestion is to indicate which variables identified in the data file are not stored in the file. In my proposal, any variable which was missing could be looked for in the metadata file(s), but I appreciate that listing which ones they are would permit more checking to be done. I'd be OK with including another global attribute external_variables (I would suggest - to be more specific), which is a blank-separated list of names of variables which are referred to by attributes but contained in metadata file(s) rather than in the file with the data. Requiring this would make the subconvention more error-proof.
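The extra checking this would permit can be sketched in Python. It is only an illustration: the file is represented by plain dictionaries, the function names are hypothetical, and the set of referring attributes is taken to be coordinates, formula_terms and cell_measures.

```python
REFERRING_ATTRS = ("coordinates", "formula_terms", "cell_measures")

def referenced_names(attr_name, value):
    """Extract the variable names that a referring attribute mentions."""
    tokens = value.split()
    if attr_name in ("formula_terms", "cell_measures"):
        # These use "key: variable" pairs; drop the "key:" tokens.
        return [t for t in tokens if not t.endswith(":")]
    return tokens

def undeclared_missing(variables, global_attrs):
    """List names that are neither in the file nor in external_variables."""
    declared = set(global_attrs.get("external_variables", "").split())
    problems = set()
    for attrs in variables.values():
        for attr_name in REFERRING_ATTRS:
            for name in referenced_names(attr_name, attrs.get(attr_name, "")):
                if name not in variables and name not in declared:
                    problems.add(name)
    return sorted(problems)

# Attributes of the variables in the data file of the original example.
variables = {"temperature": {"cell_measures": "area: areacell"}}

print(undeclared_missing(variables, {"external_variables": "areacell"}))  # []
print(undeclared_missing(variables, {}))  # ['areacell']
```

A conformance checker could report the second case as an error, while accepting the first as a declared external reference.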

Aggregation keys

The other part of Steve's proposal, if I've got it right, is to include keys that indicate which files belong together in a dataset. I don't think we need to do that for CMIP6. I appreciate that it would allow more checking that the user had got the right file, but CMIP users are accustomed to recognising files as being grouped by their names and directories, which pretty thoroughly identify them. Obviously this is a CMIP convention, not something we would include in CF, in which filenames have no significance.

Personally I would tend to argue against adding a convention for aggregation keys, because I think that aggregation of files into datasets should be fluid. For different purposes, I might choose different groups of files, sometimes aggregating a set of variables for a given model, sometimes a set of models for a given variable, for example. I use the CF metadata to construct aggregated variables, as in the rules David and I described in ticket 78. I would not find a prerecorded grouping to be helpful. However, if there's a use-case for it, let's discuss it in a separate ticket.

Best wishes

Jonathan

comment:12 Changed 2 years ago by jonathan

I'd like to add that I think it would be better if CMIP6 files always contained their cell measures, and that in previous discussions about omitting multidimensional auxiliary lat and lon coordinates I've supported the view that they must be stored in the file with the data variable (as required by CF), despite the space implications. That is, I would be happy if we did not have to consider a convention, such as the current proposal, for storing metadata in a different file from data. The present proposal is prompted by a practical requirement to do that, although it's less convenient for users of the data.

It will be interesting to hear what Karl and Balaji think about it for CMIP6, in the light of this discussion.

Jonathan

comment:13 Changed 2 years ago by stevehankin

[oops. I overlooked your most recent post. below responds to your lengthy post 25 hours ago. if the CMIP6 use case for this ticket is withdrawn that changes things ...]

=====

Hi Jonathan,

Here's a little 'yahoo', cuz it sounds like (imho) we may all be quite close to a consensus. A quick summary of the 3 relatively small issues I see remaining open:

  1. Can we all agree on the need to introduce an 'external_variables' attribute? By requiring that 'external_variables' be present, and that it name any variables that have been removed to another file, we enable clients to process these incomplete CF files in a much more robust manner.
  2. What is the best way to formulate the attribute that links the associated files together? There are semantics embedded into each of the candidate attribute names: 'associated_files', 'associated_files_url', and 'external_keys' (the last, possibly augmented by indicators of the namespace at which the external key can be resolved). This question needs to be wrestled to the ground. In the interest of focusing on consensus I'm leaving this as a tbd issue in this post.
  3. I do not see the merits of defining a "sub-convention". If a solution is robust enough to include in the CF document, then should it not be robust enough to be regarded as a proper part of the CF convention?

OK. But here's a final, 'meta' issue. I hope it will not shatter the sense of consensus. Namely, I agree that ensemble/timeSeries/forecast aggregation might best be taken up in a separate ticket. But, the form of aggregation spelled out in this ticket is the association of variables found in multiple files into a single conceptual dataset. Associating a 'cell_measures' field is a special case of what NCML calls 'union aggregation'. Would you not agree that a solution that can address the general case of union aggregation is preferable to a narrower CMIP-tailored solution? Our discussions suggest that we can achieve the broader goal of union aggregation without additional complexity -- thereby sweeping away the urge to define a "sub-convention".

  • Steve

comment:14 Changed 2 years ago by jonathan

Dear Steve

I don't expect that the CMIP6 use-case will be withdrawn. All I meant was that the world would be simpler if we didn't have to consider file aggregation. However, I think we probably do.

I agree with external_variables. As you say, it enables applications to detect errors.

For the present use case, associated_files with no specific format (it has to be decided upon by the data-producer) is sufficient, and preferable to associated_files_url and associated_files_doi. I do not mind whether these are included in the proposal as well, however, if there is a need for them apart from CMIP. Others' views would be useful on this point.

I agree that the proposal can be made quite solid and safe, but it may still be better as a subconvention because some people (Seth may not be alone) might disagree with modifying the principle of CF defining self-contained files. We can make this departure optional by its being a subconvention. Others' views would be useful on this point as well.

Yes, I agree, this method of aggregation is the union of files. It does not amalgamate variables; it simply treats them all as if they were in one virtual file. I expect there might be different ways of doing this in detail, though I haven't thought about what they might be! But an advantage of its being a subconvention is that it allows for different ways to do it.

I've also realised this proposal may not be sufficient, because it doesn't say what happens to coordinate variables. In the CMIP case, the cell_measures file contains coordinate variables as well as dimensions. I think we should say something about that, because by the usual Unidata convention the (1D) coordinate variables are associated implicitly with the dimensions. Here are three alternatives:

  1. Coordinate variables for the dimensions of the metadata variables are those which are found in the data file; if there are coordinate variables also in the metadata file they should be ignored.
  2. If there is a coordinate variable in the metadata file for a dimension of the metadata variable, it must have the same coordinate values as the coordinate variable of the same name in the data file.
  3. Coordinate variables for the dimensions of the metadata variable are not allowed in the metadata file. (That would be inconsistent with what CMIP does.)
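To make the second alternative concrete, here is a minimal sketch of the consistency check it implies. Files are again modelled as plain dicts (a hypothetical layout, not a real netCDF API), and a coordinate variable is taken to be a variable with the same name as its dimension. Exact equality of values is assumed here; whether a tolerance should be allowed is a separate design question that this sketch does not settle.

```python
def inconsistent_coordinates(data_file, metadata_file, metadata_var):
    """List dimensions of `metadata_var` whose coordinate values differ between files.

    Each file is a hypothetical dict:
      {"variables": {name: {"dims": (...), "values": [...]}}}
    """
    mismatched = []
    for dim in metadata_file["variables"][metadata_var]["dims"]:
        meta_coord = metadata_file["variables"].get(dim)
        data_coord = data_file["variables"].get(dim)
        if meta_coord is None or data_coord is None:
            continue  # no coordinate variable on one side: nothing to compare
        if list(meta_coord["values"]) != list(data_coord["values"]):
            mismatched.append(dim)
    return mismatched


data_file = {"variables": {
    "lat": {"dims": ("lat",), "values": [-45.0, 45.0]},
    "lon": {"dims": ("lon",), "values": [0.0, 180.0]},
}}
metadata_file = {"variables": {
    "lat": {"dims": ("lat",), "values": [-45.0, 45.0]},
    "lon": {"dims": ("lon",), "values": [0.0, 90.0]},
    "areacella": {"dims": ("lat", "lon"), "values": [[1.0, 1.0], [1.0, 1.0]]},
}}
bad_dims = inconsistent_coordinates(data_file, metadata_file, "areacella")
```

In this made-up example the lon values differ between the two files, so lon is flagged and lat is not.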

Another way in which this CMIP6 use-case differs from a general union of files is that it's specifically for putting metadata variables in a different file from the data variable. A general union would allow files containing data variables to be treated together. That needs more general rules for how to deal with coordinate variables, as above, but also for all the other things. If you unite data files, you might have multiple copies of auxiliary coordinate variables, formula terms, etc. Should that be forbidden, should duplicates be checked, which copy should be preferred?

Best wishes

Jonathan

comment:15 Changed 2 years ago by martin.juckes

Hello Jonathan et al.,

I'd like to expand the CMIP6 use-case, but I think the associated requirements are met by the form of the proposal favoured by Jonathan.

The use case is for making an explicit link to masking fields, such as a time varying sea-ice mask. Some variables, such as sea-ice temperature, are only defined on a portion of the numerical mesh. In the CMIP5 data the way to find the associated mask is to scan through the standard_output.xls sheets looking for the appropriate variable. It would be a great step forward to have the masking variable defined in the file along with a clue to help find it.

I think this use case favours a specification with a flexible associated_files link because it will in general be difficult to make a robust link to a specific masking file at the time the data is written. As the mask is time varying this use case also reinforces the argument for keeping it in a separate file rather than having it copied into every data file.

Concerning the coordinates attribute, I agree that this needs to be handled carefully. In the CMIP context there will be coordinate variables associated with latitude and longitude which really need to be in the external file in a way which is consistent with the data file, and also coordinate variables associated with additional dimensions or properties which are irrelevant to the specification of the field in the external file. Jonathan's first suggestion appears to have the metadata file referring back to the data file, which sounds dangerous to me. I think it would be safer to insist that the appropriate auxiliary coordinates in the external file must be declared in a way which is consistent with their declarations in the data file:

Relevant auxiliary coordinates, interpreted as those which are:
    (a) listed in the data variable's *coordinates* attribute and
    (b) associated with the dimensions of the external variable,
must be declared in the external file consistently with their declaration in
the data file. That is, the auxiliary variable should be present and declared
in the *coordinates* attribute of the external variable.

This will only work if we can find a robust definition of Relevant auxiliary coordinates. Does the above work?
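One possible reading of that definition can be sketched in a few lines. The sketch assumes that "associated with the dimensions of the external variable" means that all of the auxiliary coordinate's dimensions are among the external variable's dimensions; that interpretation, the dict layout and the function name are all assumptions for illustration, not part of the proposal.

```python
def relevant_aux_coordinates(data_file, data_var, external_dims):
    """Names satisfying (a) and (b) under the interpretation stated above.

    `data_file` is a hypothetical dict:
      {"variables": {name: {"dims": (...), "attrs": {...}}}}
    """
    # (a) listed in the data variable's coordinates attribute
    listed = data_file["variables"][data_var]["attrs"].get("coordinates", "").split()
    relevant = []
    for name in listed:
        aux = data_file["variables"].get(name)
        if aux is None:
            continue
        # (b) assumed meaning: the coordinate has at least one dimension and
        # all of its dimensions belong to the external variable.
        if aux["dims"] and set(aux["dims"]) <= set(external_dims):
            relevant.append(name)
    return relevant


data_file = {"variables": {
    "tas": {"dims": ("time", "j", "i"), "attrs": {"coordinates": "lat lon height"}},
    "lat": {"dims": ("j", "i"), "attrs": {}},
    "lon": {"dims": ("j", "i"), "attrs": {}},
    "height": {"dims": (), "attrs": {}},  # scalar coordinate, no dimensions
}}
relevant = relevant_aux_coordinates(data_file, "tas", ("j", "i"))
```

Under this reading the 2D lat and lon would have to be declared in the external file, while the scalar height coordinate would not.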

Martin

comment:16 Changed 2 years ago by stevehankin

Hi Martin,

The entirely valid point that you have raised about proper handling of coordinates illustrates that the CMIP6 use case is the same as the general "union aggregation" case. Within CMIP data management union aggregation has been a neglected challenge that has simmered under the surface from the beginning -- e.g. temperature and salinity fields from an ocean model are distributed as independent CF files with their (essential) shared coordinate relationships rendered invisible to the CF data model. The rules that you are outlining are the rules that implicitly connect these files.

The task of union aggregation has been addressed successfully by a number of systems -- NcML as a familiar example with years of successful practice behind it. Why not properly document the rules of union aggregation in the CF document and make that part of the standard? What is a "sub-convention" anyway? The proposed text says "subconvention[s] of CF [are] conventions which are not part of CF, but intended to be used in combination with CF". hmmm ... Evidently sub-convention-conformant datasets will be corrupted under CF-proper yet acceptable under the sub-convention.... So are they CF-conformant? Only a lawyer will be able to make sense of the next CF version if we take this route.

comment:17 Changed 2 years ago by martin.juckes

Hi Steve,

I agree that the definition of "subconvention[s]" leaves something to be desired. We could follow the bridge example (i.e. the card game) where conventions come with "variations" (e.g. the Absy convention is a variation of the Stayman convention). It should be clear that a file conforming to variations of the convention will comply with all the rules except those explicitly changed in the variation. This would only require a slight change of wording to rename it a "Variation for associated files" and would, I think, capture the intent of the proposal better than "sub-convention". It should be clear that datasets conforming to a variation do not, in general, conform to the un-varied convention and software developers would be free to choose whether they support variations.

The aggregation issue is a broader question. The specific CMIP6 requirement to go from a data file to an associated file with relevant metadata is not met by the NcML style aggregation in which a new document referring to both is created: this could be called explicit aggregation. ESGF makes use of what might be called implicit aggregation, where files which have common values for a set of attributes are considered to be an aggregate entity which should have certain properties. I raised the idea of a formal namespace declaration above, which could be used to make the ESGF approach more rigorous: this approach does not, on reflection, appear to be appropriate for use here, but I hope to develop the namespace idea in the CMIP6 data request with the idea that it might become a useful basis for formal aggregation rules later.

Martin

comment:19 Changed 2 years ago by stevehankin

Here is what appears to be the underlying contradiction motivating #145:

  1. CMIP6 has a compelling need to call its files "CF conformant"
  2. CMIP6 has a compelling need to ensure that cell_measures are included in output datasets
  3. CMIP6 has a compelling need to distribute files that lack the cell_measures field
  4. CF conformance insists (currently) that all the elements of a dataset be contained in a single file

==> therefore CMIP6 files will not be CF conformant ... contradicting no. 1.

To resolve this inherent contradiction #145 proposes to introduce a deliberate ambiguity: CMIP6 can claim their datasets are CF conformant; others can claim the same datasets are not CF conformant. This is effectively the same statement as yours, "It should be clear that datasets conforming to a variation [a.k.a. subconvention] do not, in general, conform to the un-varied convention and software developers would be free to choose whether they support variations."

Resolving a technical contradiction by knowingly introducing ambiguity will degrade CF in my opinion. If a self-consistent technical solution exists, it is much to be preferred -- even if it requires a bit more work. That said, I'm a retired guy now. If the current #145 seems like a good path forward to the current CF participants, that would seem to reflect a new set of benchmarks by which the quality of CF is being assessed. I will step aside and not voice any further objections.

comment:20 Changed 2 years ago by martin.juckes

Hi Steve,

My suggestion was not intended as a knowing introduction of ambiguity. From a technical perspective I can't see any inconsistency, but your comments raise the issue of confusion caused by non-technical references to datasets, that is, by people mixing up the variant and the unvaried convention when referring to CMIP6. This is an area which I had not considered and I concede that you have a point. As I see it, ambiguity has been created through the creation of CMIP5 files which are not CF compliant and the need that some feel, highlighted in your first point, to refer to them as CF compliant. The ticket provides a means of reducing the level of ambiguity which might be seen as a good thing, whether or not you believe the creation of the ambiguity was justified.

I do not fully understand where the need to use the "cell_measures" attribute for references to external fields comes from. There is much in the CMIP6 file format specifications which is external to CF without creating conflict and I haven't seen a strong justification for not dealing with the required reference to areacello in that way.

regards, Martin

comment:21 follow-ups: Changed 2 years ago by mcginnis

My stance: it's okay to put cell_measures in another file, but not coordinates or formula_terms.


I'm sympathetic to the problems of storage space bloat due to duplications, but I think the idea of putting essential metadata in a separate file goes against the core ideals of the CF convention.

The purpose of CF is to define the necessary elements of a netcdf file such that conforming files "contain sufficient metadata that they are self-describing in the sense that ... each value can be located in space and time." That quote comes from the very first paragraph of the standard, and that says to me that "self-describing" is the first and number-one priority of the standard.

Consequently, I think that means that any information that is needed to geolocate the data must go in the file itself. Even if we come up with a good mechanism for referring to related information in other files, it shouldn't be used on the essential auxiliary data. To do so violates the primary purpose of CF.

I would say that the essential auxiliary data elements in question are any variables referenced by coordinates, grid_mapping, and formula_terms attributes. These are the variables that you need to figure out where the data is located in space and time. That's why Chapter 5 requires any other spatiotemporal dimension be associated with auxiliary time/lat/lon/height variables via the coordinates attribute.

Note, however, that cell_measures, which is the motivating use case for this proposal, is not on that list. Those other 3 attributes are required if they are applicable, while cell_measures is strictly optional. You need it for certain calculations, but you don't need it to geolocate the data. It's extra. I would say that ancillary_variables falls into the same category: it's useful but not required.

So if we can agree to amend the proposal to apply only to optional (ancillary) variables referenced by attributes (currently, cell_measures and ancillary_variables), and not to essential (auxiliary) variables referenced by attributes (currently, grid_mapping, coordinates, and formula_terms), I think this is worth further discussion.

Otherwise, I have to maintain my opposition. I think the issue of CF files being self-describing with regard to geolocation of the data is non-negotiable. That's one of the most central points of CF, and we should not relax that restriction.

(The idea of putting actual coordinate variables in a separate file is, in my opinion, completely beyond the pale. We should be moving away from GRIB, not towards it.)

Cheers, Seth

comment:22 Changed 2 years ago by caron

"Consequently, I think that means that any information that is needed to geolocate the data must go in the file itself."

I'm sympathetic to the thrust of this; in fact, I'd like to have an option for the CF conformance-checker to FAIL when the data is not geo-located. Currently, it only does a syntactic check, and doesn't try to figure out metadata completeness. Thus one can claim CF compliance, when it just ain't so.

In terms of essential metadata, "grid_mapping", "coordinates", and "formula_terms" seem correct (I'd have to review what else is in this category), and I'd like to make a friendly amendment that 2D lat/lon are optional if the grid_mapping is present and correct.

So I propose that:

  • CF checker continue to support the current syntactic checking, and add an option to check "essential metadata is present".
  • define essential metadata and rules for checking.
  • clarify how to reference "non-essential metadata" that may be stored in external files.

comment:24 in reply to: ↑ 21 Changed 2 years ago by caron

FYI, GRIB (almost) always has the coordinate info in the GRIB record, albeit as (the equivalent of) "grid-mapping" info. There is an option to specify it externally, but it is not used in current practice, at least for GRIB2, AFAIK.

comment:25 Changed 2 years ago by jonathan

Dear all

Some comments on recent points.

Subconvention. Steve doesn't like the principle of having a subconvention which permits files that wouldn't be legal without it. The reason I proposed it as a subconvention was anticipating objections like Seth's, that CF is a standard for single self-contained files. Maybe the word "subconvention" is misleading. You could imagine instead writing it as a separate document, for a new netCDF convention that applies to groups of files. The document would say, "This standard is exactly the same as CF, with this one exception" ... which is the content of this proposal. That makes it a "superconvention" instead, maybe, but the effect is the same. Does it feel any more acceptable like that?

Union. The other reason for not proposing this as part of regular CF is that there's more than one way to aggregate files, with different purposes. Union of files, which Steve and Martin discussed, is a different method. You can treat several files as though they were one file provided all the variables in them have different names, or if there is more than one variable of a given name (such as a coordinate variable), every instance of it must be exactly the same. Checking that would be time-consuming. This sort of file union would not be appropriate for CMIP6, because there are always many data variables of the same name, since the data is split up into ranges of time in different files, but the variable for a given quantity always has the same name. However in some situations that sort of file union, which could be done implicitly as Martin says (e.g. regarding all the netCDF files in a directory as constituting a single dataset), might be what you want. A different subconvention or superconvention could be defined for it.
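The union rule described above (names must be unique across files, or else every instance must be identical) can be sketched as follows. Files are modelled as plain dicts, which is a hypothetical layout chosen for the example; a real check would have to compare attributes and data read from netCDF files, which is where the time cost mentioned above would come from.

```python
def union_conflicts(files):
    """Return names that occur in more than one file with differing content.

    Each file is a hypothetical dict: {"variables": {name: {...}}}.
    """
    seen = {}
    conflicts = set()
    for f in files:
        for name, var in f["variables"].items():
            if name in seen and seen[name] != var:
                conflicts.add(name)  # same name, different content
            seen.setdefault(name, var)
    return sorted(conflicts)


# Two time chunks of the same quantity, as in a CMIP-style archive (made-up values).
file_a = {"variables": {
    "lat": {"dims": ("lat",), "values": [-45.0, 45.0]},
    "tas": {"dims": ("time", "lat"), "values": [[280.0, 281.0]]},
}}
file_b = {"variables": {
    "lat": {"dims": ("lat",), "values": [-45.0, 45.0]},
    "tas": {"dims": ("time", "lat"), "values": [[282.0, 283.0]]},
}}
conflicts = union_conflicts([file_a, file_b])
```

The identical lat coordinate passes, but tas appears in both files with different content, which is why this kind of union does not fit data split into time ranges.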

Cell measures and auxiliary coordinate variables. Cell measures are the use-case for CMIP6. It's fine with me to restrict this proposal to cell measures. I included auxiliary coordinate variables too because it has been requested more than once on the email list that it should be allowed to put them in another file, to save space. But we can leave that to another ticket. Personally, I think we should keep the requirement to supply auxiliary 2D latitude and longitude coordinate variables when applicable. Even if grid_mapping is supplied, it's asking a lot for a user or an application to be able to compute longitude and latitude for every possible mapping.

Checking for non-geolocated data. The CF-checker can check for errors which are capable of being detected, as written down in the conformance document. I don't see how to check for data that should be geolocated but apparently isn't, because CF is not a convention only for geolocated data. It's not required for data variables to have all or any of latitude, longitude and vertical dimensions. What do you have in mind, John? One way would be to add options to the CF-checker whereby the user would assert that the data must have two horizontal coordinates, for instance; given that extra hint, the checker could tell if they were missing. "CF compliance" means not breaking rules. It doesn't mean doing everything which is optional. By email we have discussed defining a "strict" variant of CF in which more things would be mandatory, but it still could not be required to have two horizontal coordinate variables in every case.

Here are some ways we could proceed with this proposal.

  • Decide that CMIP6 should put the cell measures variables in the file, after all. I'd be interested to hear what Karl and Balaji think of this. Martin has pointed out that, while this is a small fractional increase in space requirements, it does amount to a lot of Gbyte.
  • Drop the cell_measures attribute in CMIP6 datasets, so that a user of the dataset would have to know that the variable does exist (in some other file) and what it's called.
  • Use the external_variables attribute simply as a way to make the file legal (because the cell measures is named but missing) and say nothing more in the CF document about where such missing variables might be found. We can leave that to CMIP6-specific conventions. In that case this proposal would be just to define the external_variables attribute and I don't think we'd need a subconvention for that.
  • Stick to the proposal for cell_measures only, with external_variables. In that case I'd suggest that nothing in the metadata file is relevant apart from the named cell measures variable and its dimensions. If there are coordinates, etc. in there which apply as well, they should be ignored. This depends on the cell measures being geolocated consistently with the data file, but even when cell measures are in the data file we depend on the data-writer having got it right!

Do you have views on these alternatives, or others to suggest?

Best wishes

Jonathan

comment:29 Changed 2 years ago by taylor13

Dear all,

This ticket was originally inspired by CMIP, where we wanted users to be able to discover what areas and volumes were associated with grid cells used to store certain variables in the ocean and the atmosphere. Jonathan has nicely summarized our options. After thinking about CMIP, it appears to me that there is really very little added value to including the cell_measures attribute when the "areas" and "volumes" are stored in another file.

I'm pretty sure for the CMIP6 time-frame there won't be any software capable of automatically retrieving areacella, areacello, or volcello fields if they are stored in a separate file. I also don't think that it is necessary from a user's perspective for these fields to be stored with the data of interest. The argument for doing this is that then we could define the cell_measures attribute and the files would conform to the current CF conventions. The arguments against are: 1) the file size is (sometimes substantially) increased, and 2) CMOR would have to be modified substantially to accommodate the cell measure fields (along with the data itself).

The main value for CMIP of defining the cell_measures attribute is that it explicitly indicates whether areacella or areacello is the correct measure to use with a particular variable. My feeling is that users of CMIP data will not have any trouble deciding which of the measures is appropriate. (Note that the grid coordinate values will unambiguously tell you whether areacella or areacello contains the cell areas of interest.)

I don't think we should complicate the CF conventions (especially given the arguments articulated by Steve Hankin and others) without some stronger motivation.

In summary I think we should omit the cell_measures attribute in CMIP (Jonathan's 2nd option). Then there is no immediate need to alter the convention, and we might consider the more general issue of external files and concatenation of files at a less hurried pace.

If someone can provide a CMIP6 use case where omitting cell_measures would cause problems, please describe it.

thanks and best regards, Karl

comment:31 Changed 2 years ago by balaji

I had a longer response a month or so earlier, but owing to some issue in Trac (which still hasn't been explained, but seems to be resolved now, thanks to Jeff...) it was lost, leaving no trace.

Then I started a second reply, which is also moot now, given Karl's mail from an hour ago:-). (Trac warns you now when this happens, which is pretty clever.)

Karl, I'm afraid I must dissent from your proposal to drop the cell_measures requirement. Without this there will be grievous errors in any computation of integrals from these files, in the presence of masks (land-sea boundary), area fractions (vegetation tiles on the land surface), and coordinate systems where the bounds information is not sufficient to define cell measures (tripolar grid data).
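A small worked example of the kind of error meant here, using made-up numbers: two cells with equal area according to their bounds, one of which is only one quarter land. The values, areas and fractions are hypothetical and serve only to show how far a bounds-based mean can drift from one that uses the true cell measure.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values` using `weights` (e.g. cell areas)."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


values = [10.0, 20.0]          # hypothetical land-surface quantity in two cells
bounds_area = [2.0, 2.0]       # areas inferred from cell bounds alone
land_fraction = [0.25, 1.0]    # first cell is mostly ocean
true_area = [a * f for a, f in zip(bounds_area, land_fraction)]

naive = weighted_mean(values, bounds_area)   # ignores the area fraction
correct = weighted_mean(values, true_area)   # uses the true cell measure
```

With these numbers the bounds-based mean is 15.0 while the mean weighted by the true areas is 18.0.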

The original proposal from Jonathan and me did not intend for associated_files to be machine-usable. We were fully aware that encoding pathnames inside files is brittle and fragile under copy, and so on.

The idea was to allow a human reading data and needing the variable pointed to by cell_measures to be able to go find it in another file if needed. It is certainly unwarranted duplication to repeat-store 3D (and possibly 4D) variables in every file.

To this end, my proposal would be:

  • include external_variables in CF, a comma-separated list of variable names that may be pointed to by some entity or attribute in the file. CF will not specify in any way where these are to be found. A CF checker should throw a warning when one of these is present.
  • CMIP can include associated_files as its convention, nothing to do with CF. CF permits that. I wouldn't change the name of that attribute, since it's its legacy. CMIP can also, independently of CF, specify how to use this: for instance, to point to a URL maintained by the data owner that explains where these fields are found.
  • CMIP could also stipulate for instance that, for CMIP, the only legitimate use of external_variables is in cell_measures.

comment:32 follow-up: Changed 2 years ago by taylor13

Dear Balaji and all,

I think it is essential to have areacella, areacello, and volcello fields archived and made available for CMIP analysis. As you say, "Without [them] there will be grievous errors in any computation of integrals from these files." My suggestion was just to drop the cell_measures *attribute* (not relax the requirement that the fields be stored). This would bring CMIP into conformance with the CF-conventions.

Anyone wanting to perform an integral analysis will be aware that they need cell weights (areas or volumes) to do this, and will be able to find the names of those variables and download them just as with any other variable.

Would this be o.k.?

cheers, Karl

comment:33 in reply to: ↑ 32 Changed 2 years ago by davidhassell

Hi Karl, et al.,

I favour including both the cell_measures attribute and the external_variables attribute defined by Balaji in comment:31, assuming that the latter were included in the CF conventions, i.e. making it explicit, in a CF compliant manner, that cell measures exist but not in this file.

I think that this could help stop software gamely creating incorrect weights from coordinates because it thinks that there are no cell measures.

I hope I've understood the arguments to date ...

Thanks,

David

comment:34 Changed 2 years ago by balaji

I agree with David: without cell_measures there is a much greater chance of error. Even if people deduce these may be in an external file, there could be more than one: for example, a different areacello variable for quantities on a staggered grid. cell_measures will indicate which areacello corresponds to which variable.

To summarize, the proposal is:

  • external_variables to be included in CF, containing a comma-separated list of variable names possibly referenced in the file, but not present in the file.
  • cell_measures to be a CMIP requirement where cell areas and volumes cannot be computed from coordinate variables alone.

comment:35 Changed 2 years ago by balaji

I agree with David: without cell_measures there is a much greater chance of error. Even if people deduce these may be in an external file, there could be more than one: for example, a different areacello variable for quantities on a staggered grid. cell_measures will indicate which areacello corresponds to which variable.

To summarize, the proposal is:

  • external_variables to be included in CF, containing a comma-separated list of variable names possibly referenced in the file, but not present in the file.
  • cell_measures to be a CMIP requirement where cell areas and volumes cannot be computed from coordinate variables alone.
  • associated_files as a CMIP convention, to be used as the data writer wishes.
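As a rough illustration of the staggered-grid point above, the sketch below parses cell_measures strings, which in CF consist of blank-separated "measure: variable_name" pairs. The data variable names and the names areacello_u and volcello_u are hypothetical, chosen only to show two variables on different grids pointing at different area variables; this is not code from any CF tool.

```python
def parse_cell_measures(value):
    """Parse a CF cell_measures string such as "area: areacello" into a dict."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"malformed cell_measures: {value!r}")
    measures = {}
    # Tokens alternate between "measure:" keys and variable names.
    for key, name in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"expected 'measure:' but found {key!r}")
        measures[key[:-1]] = name
    return measures


# Hypothetical cell_measures attributes of two data variables on different grids.
cell_measures_attrs = {
    "tos": "area: areacello",
    "uo": "area: areacello_u volume: volcello_u",
}

# Which area variable belongs to which data variable.
area_for = {
    var: parse_cell_measures(value)["area"]
    for var, value in cell_measures_attrs.items()
}
```

Here area_for makes explicit that the two data variables need different area weights, which is the information that would be lost if the cell_measures attribute were dropped.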

comment:36 Changed 2 years ago by martin.juckes

Concerning the cell_measures attribute, I'm not sure that this is the best way to resolve the problems we have in CMIP6. The way it was implemented in CMIP5, as a property defined in the CMOR tables, resulted in it being set incorrectly for some model/variable combinations. But perhaps we should follow that discussion in another venue.

For the external_variables attribute, it will be important to be specific about where admissible references might be, i.e. what categories of variables would be allowed to be absent from the file. Clearly coordinate variables are out of the question, and also auxiliary coordinate variables. What about bounds and climatology variables? The safest approach is probably to give a list of allowed usages, e.g. "... list of variables possibly referenced in cell_measures, ancillary_variables, or formula_terms".

comment:37 Changed 2 years ago by jonathan

Dear all

Referring to Martin's point, I think this proposal should allow external_variables for cell_measures only. Originally I included other sorts of metadata in this proposal, but objections have been raised, and it's better to follow the usual principle of proposing only what we currently need. However this is clearly an extensible mechanism.

If the majority is in favour, I am content to follow Balaji's suggestion, which is also approximately my third option, with a couple of small changes:

  • external_variables to be included in CF, containing a blank-separated list of names of cell_measures variables which may be referenced in the file, but might not be present in the file. (Lists in CF attributes are always blank-separated, and I suggest that it should not be an error to include in the file a variable which is listed in external_variables, nor to name a variable in external_variables which is not referenced in the file.)
  • cell_measures as a CMIP requirement where cell areas and volumes cannot be computed from coordinate variables alone.
  • associated_files as a CMIP convention, to be used as the data writer wishes.

If we proceed like this, only the first point above needs to be agreed in this CF ticket, and we don't need a subconvention for it; this can be a change to CF in general. We are not addressing concerns like Seth's and Steve's about incomplete files and I wonder if they still object to this reduced proposal, which meets the needs of CMIP6.

Best wishes

Jonathan

comment:38 Changed 2 years ago by mcginnis

Hi Jonathan,

I'm comfortable with this plan and willing to support the proposal in its reduced form.

Because cell_measures is optional information, I think it's appropriate to allow it to be split off into a separate file. The external_variables attribute as described here seems like a functional mechanism for doing that while still enabling users to figure out how to do their area integrals.

It would be good to someday have a robust and machine-usable way to reference the external files, but I think it's entirely reasonable to defer that for later, and I don't think this sets up any roadblocks for future development in that direction. And maybe by the next time the issue comes up, it'll be more obvious how to do it...

Cheers,

Seth

(P.S.: If and when it comes up in the future, I would also support extending the use of external_variables to variables listed in ancillary_variables as well.)

comment:39 Changed 2 years ago by stevehankin

Jonathan et al.,

Similar to Seth. The variants of this ticket that have been described in the past few posts do address my primary concern. Moving the formalization of "associated_files" outside of CF has eliminated the need for an ambiguous "sub-convention". Introducing "external_variables" means that CF files self-document the areas where they may have external dependencies.

Support for external dependencies -- a.k.a. multi-file aggregations -- has been an important unresolved need within CF for over a decade. In model management multi-file CF datasets are the norm, rather than the exception. If the introduction of an "external_variables" attribute can serve as a stepping stone to addressing that need, it is all the more desirable.

comment:40 Changed 2 years ago by jonathan

Dear Seth and Steve

Thanks for your useful comments, and thanks to Steve for suggesting external_variables. If Karl agrees with this approach too, I will start a new ticket with the reduced proposal.

I agree too that there is a need for aggregation in general. As I mentioned before, David Hassell and I formulated an algorithm, proposed by ticket 78, for aggregating variables, which can be in groups of files, based only on CF metadata, with no reference to variable names. Our aggregation rules don't say anything about filenames either. David has implemented these rules in cf-python. There is a utility for processing one or more CF-netCDF files to work out the aggregation, and write an aggregated file, or a file which records the structure of the aggregation without the data. We think this approach is suitable for CMIP datasets, for instance, but it's only one possible approach. Aggregation rules go beyond CF, and I'm sure there are other approaches that would be useful in different situations.

Best wishes

Jonathan

comment:41 Changed 2 years ago by taylor13

Dear Jonathan and all,

Forgive the 10-day delay responding. Leaving aside CMIP requirements (which don't need to be decided here), the proposal above is:

"external_variables to be included in CF, containing a blank-separated list of names of cell_measures variables which may be referenced in the file, but might not be present in the file. (Lists in CF attributes are always blank-separated, and I suggest that it should not be an error to include in the file a variable which is listed in external_variables, nor to name a variable in external_variables which is not referenced in the file.)"

Did you mean "should be an error" (you wrote "should not be an error")? Assuming it wasn't a typo, why do we want to allow variable names to be listed in external_variables if the variable actually exists *in* the file? This seems both unnecessary and confusing to me. Even more troubling to me is that you say it's o.k. to name a variable in external_variables which isn't even referenced in the file. I'm against that.

Here is wording I would use: "external_variables is a global attribute containing a blank-separated list of names of cell_measures variables that are referenced in the file, but are not present in the file."

Perhaps I've missed something.

Best regards, Karl

comment:42 Changed 2 years ago by jonathan

Dear Karl

I was inclined to be tolerant, but you prefer to be intolerant. :-) I did mean "should not be an error" because I thought those errors would be harmless. I assume that you would consult external_variables when you had looked for a named variable in the file and found it wasn't there, to check if it was allowed to be missing. If that is how the attribute is used, it doesn't matter if it lists variables which are actually in the file or which are not named in the file, since it won't be consulted about variable names in those two categories. However, if you and others think we should be strict about this, I won't disagree.
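The tolerant lookup order described above can be sketched as follows. This is only an illustration under stated assumptions: the function name locate_variable and the three result labels are hypothetical, and the file is represented by a plain set of variable names rather than by a real netCDF dataset.

```python
def locate_variable(name, file_variables, external_variables=""):
    """Classify a referenced variable as 'internal', 'external' or 'missing'.

    file_variables: names of the variables actually present in the file.
    external_variables: value of the global attribute (blank-separated list).
    """
    # Look in the file first; external_variables is not consulted at all
    # when the variable is found there.
    if name in file_variables:
        return "internal"
    # Only a variable that is absent from the file is checked against the list.
    if name in external_variables.split():
        return "external"
    # Referenced, not in the file, and not declared as external.
    return "missing"
```

With this order of checks, an entry in external_variables that names a variable present in the file, or one that is never referenced, has no effect on the result, which is why such entries were described as harmless.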

Best wishes

Jonathan

comment:43 Changed 23 months ago by taylor13

Dear Jonathan,

I agree that being lenient won't cause too many problems, and software could be programmed not to get tripped up, but in a file containing lots of variables, it may be difficult for a human to quickly determine whether some cell_measures referenced in the file actually must be obtained outside the file (i.e., this would require a very careful reading of the ncdump header info., looking at all the cell_measures attributes, etc., which could take some time). So, for that reason, I would like the declaration in the "external_variables" attribute to be definitively interpretable as "the listed variable is referenced in the file, but not *in* the file." [But I could live with being tolerant, since I try to be a nice guy.] best, Karl

comment:44 Changed 23 months ago by taylor13

Dear all,

To counter the comment I just made, there is a use case where there could be an advantage in allowing a variable to appear in external_variables that wasn't actually referenced by a cell_measures attribute.

In CMIP6 we are considering storing mesh information under the "UGRID" conventions. UGRID includes a "mesh" attribute attached to a variable stored on a mesh described by the variable named by the mesh attribute (e.g., mesh = fine_mesh tells us that the mesh description is found in a variable named "fine_mesh"). Since in CMIP6 we would like to include the "mesh" attribute, but place the mesh information in an external file, it would be useful to allow "external_variables" to include the name of the mesh variable. This would not be possible unless we allowed external_variables to name a variable not referenced (by the CF convention attributes) in the file.

I should note that currently I think UGRID requires the mesh information to be included in the file, so the UGRID conventions would have to be modified to allow for an "external_variables" global attribute to name the variable used to convey mesh information.

Are there arguments against being tolerant?

best regards,

Karl

comment:45 Changed 23 months ago by jonathan

Dear Karl

I don't have a strong view about this and would be happy to go whichever way you think suits CMIP6.

Best wishes

Jonathan

comment:46 Changed 22 months ago by taylor13

Dear all,

We need to reach a final decision on this. The proposal is to insert the following paragraph at the end of the description of the "cell_measures" attribute (just before Example 7.3).


Although it is recommended that any variable referenced by cell_measures be included in the file, it is not required to be. If the variable is located in an external file (rather than in the file where it is referenced), then this must be indicated by defining a global attribute named "external_variables". The value of external_variables is a blank-separated list of the names of all cell_measures variables located in external files. Note that defining an external_variables attribute breaks the normal CF convention rule that variables referenced by an attribute must be located in the file where they are referenced. To allow other conventions (governing metadata in a file conforming to the CF conventions) to make use of the external_variables attribute, it is permissible for this attribute to include the names of variables besides those referenced by CF convention attributes.


Also add a row to Appendix A. There the description would read: "name(s) of variables not found in the file but referenced by a cell_measures attribute"

best regards, Karl

comment:47 Changed 22 months ago by jonathan

Dear Karl

I think that this proposal is not quite internally consistent. In 7.2 you say that external_variables may be used other than by cell_measures but in Appendix A you don't permit this. I suggest we could be a bit clearer and allow room for flexibility by separating the definition of external_variables from its use by cell_measures. To do this, I suggest:

  • Insert the following paragraph at the end of the description of the cell_measures attribute in section 7.2 (just before Example 7.3).

A variable referenced by cell_measures is not required to be present in the file containing the data variable. If the cell_measures variable is located in another file (an "external file"), rather than in the file where it is referenced, it must be listed in the external_variables attribute of the referencing file (Section 2.6.3).

  • Introduce a new subsection 2.6.3, title "External variables".

The global external_variables attribute is a blank-separated list of the names of variables which are named by attributes in the file but which are not present in the file. These variables are to be found in other files (called "external files") but CF does not provide conventions for identifying the files concerned. The only CF standard attribute which is allowed to refer to external variables is cell_measures.

  • Add an entry for external_variables to Appendix A, type S, use G, links (new) section 2.6.3 and section 7.2, description "Identifies variables which are named by attributes in the file but which are not present in the file."

Also in the conformance document, add

2.6.3 External variables

Requirements:

  • The external_variables attribute is of string type and contains a blank-separated list of variable names.
  • No variable named by external_variables is allowed in the file.

In its section 7.2, change "that must exist in the file" to "that must either exist in the file or be named by the external_variables attribute".
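A minimal sketch of how a checker might test the requirements listed above, assuming the file's metadata has already been read into plain Python containers. The function name check_external_variables and its argument layout are hypothetical and do not describe any existing CF checker.

```python
def check_external_variables(file_variables, global_attrs, cell_measures_attrs):
    """Return a list of error strings; an empty list means no violation found.

    file_variables: set of names of variables present in the file.
    global_attrs: dict of global attributes.
    cell_measures_attrs: dict mapping data variable name -> cell_measures string.
    """
    errors = []
    raw = global_attrs.get("external_variables", "")
    # Requirement: the attribute is of string type.
    if not isinstance(raw, str):
        errors.append("external_variables must be of string type")
        raw = ""
    external = raw.split()  # blank-separated list of variable names

    # Requirement: no variable named by external_variables is allowed in the file.
    for name in external:
        if name in file_variables:
            errors.append(f"{name}: listed in external_variables but present in the file")

    # Section 7.2: a cell_measures variable must either exist in the file
    # or be named by the external_variables attribute.
    for var, value in cell_measures_attrs.items():
        for name in value.split()[1::2]:  # names follow each "measure:" token
            if name not in file_variables and name not in external:
                errors.append(
                    f"{var}: cell_measures variable {name} is neither in the file "
                    "nor named by external_variables"
                )
    return errors
```

For example, a file containing only tos, with external_variables set to "areacello" and a cell_measures attribute of "area: areacello" on tos, passes all three checks.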

Note that I haven't recommended that cell measures variables be included in the data file, because if it was a recommendation, the CF checker would have to give a warning every time an external variable was used. Since we are requiring external variables in CMIP6, this would be rather impolite and inconsistent.

What do you think?

Jonathan

comment:48 Changed 22 months ago by davidhassell

Dear Jonathan,

Thanks for this complete description, which I'm very happy with.

David

comment:49 Changed 22 months ago by balaji

I too concur! This is very clear and I can see what to do in a different use-case.

If no one else raises anything in the next two weeks, why don't we go ahead, update the conventions, and close this ticket?

comment:50 Changed 22 months ago by taylor13

Hi Jonathan and all,

Your revision is an improvement and I support it in full. I hope that if there are objections by anyone to the proposed new attribute, these will be raised immediately, so that we can adopt some final version as soon as possible.

thanks, Karl

comment:51 Changed 22 months ago by stevehankin

Well crafted. Thumbs up from me too.

comment:52 Changed 21 months ago by davidhassell

Four weeks have passed with no further comment. Let's accept the ticket. Thank you, Jonathan, for proposing and to everyone for the debate.

David

comment:53 Changed 11 months ago by davidhassell

  • Owner changed from cf-conventions@… to davidhassell
  • Status changed from new to accepted

comment:54 Changed 9 months ago by painter1

The changes to the CF Conventions document were implemented today. The conformance document has not been updated yet.

comment:55 Changed 8 months ago by martin.juckes

The phrase "The only CF standard attribute which is allowed to refer to external variables is cell_measures" proposed for the new section 2.6.3 is too restrictive. As it stands it implies unintended restrictions on the contents of CF attributes which are not constrained by the convention. The history and comment attributes, for example, are CF standard attributes which are allowed to refer to external variables. I suggest instead "The only attribute which is allowed to make a CF standard reference to external variables is cell_measures".

It would be helpful to define the idea of a "self-contained" file, but that would be going beyond the scope of this ticket. It is clearly, I think, not intended to restrict the use of references to external documents in the history attribute.

comment:56 Changed 8 months ago by davidhassell

Martin - I support the new phrasing for that sentence.

Thanks, David

comment:57 Changed 8 months ago by jonathan

Dear Martin

I see the difficulty and thanks for pointing it out. I think that your wording might raise the question, "What does a 'CF standard reference' mean?" To avoid that, I would suggest instead, "The only attribute for which CF standardises the use of external variables is cell_measures." What do you think?

Best wishes

Jonathan

comment:58 Changed 8 months ago by martin.juckes

Dear Jonathan, Yes, that would be better. It would follow that the sentence "Identifies variables which are named by attributes in the file but which are not present in the file" for the external_variables entry in Appendix A should be modified to "Identifies variables which are named by cell_measures attributes in the file but which are not present in the file".

regards, Martin

comment:59 Changed 8 months ago by jonathan

Dear Martin

Yes, I agree, it would be consistent to be more restrictive in Appendix A as well, as you suggest. Thanks

Jonathan

comment:60 Changed 7 months ago by painter1

The conformance document has been updated for this ticket, Ticket 145. Now I see a couple of recent changes to this ticket, for the CF Conventions document. I would like confirmation that they represent the consensus before I implement them for version 1.7:

  1. In section 2.6.3, replace the final sentence, "The only CF standard attribute which is allowed to refer to external variables is cell_measures.", with "The only attribute for which CF standardises the use of external variables is cell_measures."
  2. In Appendix A, table A.1, row for "external_variables", change the description "Identifies variables which are named by attributes in the file but which are not present in the file." to "Identifies variables which are named by cell_measures attributes in the file but which are not present in the file."

comment:61 Changed 7 months ago by davidhassell

Hi Jeff,

Looks good to me,

David

comment:62 Changed 7 months ago by balaji

To me too.

comment:63 Changed 7 months ago by painter1

  • Resolution set to fixed
  • Status changed from accepted to closed