Opened 5 years ago

Last modified 9 months ago

#78 new enhancement

CF aggregation rules

Reported by: davidhassell Owned by: cf-conventions@…
Priority: medium Milestone:
Component: cf-conventions Version:
Keywords: Cc:

Description

In this ticket we propose a set of rules, based on CF metadata, for deciding whether or not two arbitrary CF field constructs may be aggregated into one, larger field construct. A field construct (hereafter a field) is as defined in the proposed CF data model (see ticket #68), as are all other terms written in bold. In terms of CF-netCDF files, a field corresponds to a data variable, with all its attributes, coordinate variables, auxiliary coordinate variables, etc.

Aggregation may be thought of as the combination of one field with another to create a new field that occupies a larger space. In practice, this means combining two fields so that their data arrays are concatenated along exactly one dimension, as are their coordinate arrays which span that dimension, in such a way that the aggregated field conforms to the CF data model (and is therefore CF-netCDF compliant).
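As an illustration only, the idea of concatenating along exactly one dimension can be sketched with a toy two-dimensional field held in plain Python structures. The dictionary layout (`dims`, `coords`, `data`) and the function name `concatenate` are assumptions made for this sketch; they are not part of the CF data model or of cf-python.

```python
def concatenate(a, b, axis):
    """Join two toy 2-D fields along `axis` (0 or 1).

    Each field is a dict with hypothetical keys:
      dims   - tuple of two dimension names, e.g. ("time", "lat")
      coords - dict mapping each dimension name to a list of values
      data   - list of rows, shaped like (len(coords[dims[0]]), len(coords[dims[1]]))
    """
    dims = a["dims"]
    if dims != b["dims"]:
        raise ValueError("this sketch assumes the same dimension order")
    # The dimension that is NOT being joined must match exactly.
    other = dims[1 - axis]
    if a["coords"][other] != b["coords"][other]:
        raise ValueError("fields differ along %r" % other)
    # Concatenate the coordinate array that spans the joined dimension.
    name = dims[axis]
    coords = dict(a["coords"])
    coords[name] = a["coords"][name] + b["coords"][name]
    # Concatenate the data array along the same dimension.
    if axis == 0:
        data = a["data"] + b["data"]
    else:
        data = [row_a + row_b for row_a, row_b in zip(a["data"], b["data"])]
    return {"dims": dims, "coords": coords, "data": data}


first = {"dims": ("time", "lat"),
         "coords": {"time": [0, 1], "lat": [10, 20]},
         "data": [[1, 2], [3, 4]]}
second = {"dims": ("time", "lat"),
          "coords": {"time": [2], "lat": [10, 20]},
          "data": [[5, 6]]}
joined = concatenate(first, second, axis=0)
```

Here the two fields differ only along `time`, so the result has three time values and an unchanged `lat` coordinate.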

The CF-netCDF convention at present applies only to individual files, but there is a common and increasing need to be able to treat a collection of files as a single dataset, and the CF standard does not define how this should be done. Like the ticket for the CF data model, this ticket does not propose any change to the CF standard. Our purpose is to write down general abstract rules for CF field aggregation which are consistent with the abstract CF data model.

These proposed CF aggregation rules make no reference to netCDF file format. They are built solely on the abstract CF data model. As such, they may be applied equally to fields stored in CF-netCDF files or to fields contained in a memory representation of the CF data model. To support the CF data model, we produced the cf-python software, and the latest version of that software includes an aggregation function based on these aggregation rules. This function can be used to combine CF-netCDF files by aggregating the fields they contain.

Our proposed rules are more flexible than the existing schemes that we are aware of. They are similar to the NcML aggregation types JoinExisting and JoinNew, but are more general in various ways, such as that the aggregating dimension need not be the outer dimension, nor be in the same position in different fields. Also, if combining fields from various netCDF files, the netCDF variable names need not match, because the variables are identified by their metadata instead of by their names. Any number of fields may ultimately be aggregated along one or more dimensions by repeated aggregations between pairs of fields. Our software can handle this general approach, but it needs optimisation.
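A minimal sketch of two of these points, identification by metadata rather than by variable name and repeated pairwise aggregation, is given below. It is not the cf-python implementation: the field layout, the choice of `standard_name` plus `units` as the identity, and the names `identity`, `join_pair` and `aggregate_all` are all assumptions for illustration, and only a single `time` dimension is handled.

```python
from functools import reduce


def identity(field):
    """Metadata used to decide which fields belong together (an assumption)."""
    return (field["standard_name"], field["units"])


def join_pair(a, b):
    """Aggregate two toy 1-D fields along time; the variable name is ignored."""
    if identity(a) != identity(b):
        raise ValueError("fields describe different quantities")
    return {"ncvar": a["ncvar"],
            "standard_name": a["standard_name"],
            "units": a["units"],
            "time": a["time"] + b["time"],
            "data": a["data"] + b["data"]}


def aggregate_all(fields):
    """Group fields by identity, then aggregate each group pair by pair."""
    groups = {}
    for field in sorted(fields, key=lambda f: f["time"][0]):
        groups.setdefault(identity(field), []).append(field)
    return [reduce(join_pair, group) for group in groups.values()]


fields = [
    {"ncvar": "temp_2m", "standard_name": "air_temperature", "units": "K",
     "time": [2, 3], "data": [281.0, 282.0]},
    {"ncvar": "tas", "standard_name": "air_temperature", "units": "K",
     "time": [0, 1], "data": [279.0, 280.0]},
    {"ncvar": "pr", "standard_name": "precipitation_flux", "units": "kg m-2 s-1",
     "time": [0, 1], "data": [0.0, 0.1]},
]
result = aggregate_all(fields)
```

The two temperature fields are combined even though their netCDF variable names (`tas`, `temp_2m`) differ, while the precipitation field is left on its own.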

This proposal is closely related to the CF data model (ticket #68), and we would welcome comments on that ticket as well as the present one.

David Hassell (d.c.hassell at reading.ac.uk)
Jonathan Gregory (j.m.gregory at reading.ac.uk)

Change History (4)

comment:1 in reply to: ↑ description ; follow-up: Changed 4 years ago by mgschultz

Replying to davidhassell:

The CF data model, cf-python, and these aggregation rules are all very useful. Yet, I have a hard time understanding these rules. They look like the result of a hard thinking process condensed into semi-formal language, and this makes them difficult to understand. It would be great if they were "documented" from a user perspective. Here, I suggest the following structure (and maybe this could lead to identifying aggregations that are not possible yet or not possible in principle). I adopt the ncml terminology where possible:

  1. SimpleUnion:

Add variables from different sources into one dataset. Requirements: where coordinates are the same (based on what? axis attribute, standard_name?), they must have the same values

1b. SupersetUnion: A variant of 1. where overlapping data would be possible. Example: merging of several time series from individual stations into one file with ragged time series format, allowing for missing values in the individual time series. In this case, the "superset" of all time values needs to be created and missing_value has to be inserted where needed. [perhaps this example should be considered a different operation altogether, because one also needs to reformat individual longitude and latitude values into longitude and latitude arrays keyed on the station id]

  2. Join (new or existing):

ncml states "along their [existing|new], outer dimension". Is this the same as "unlimited" dimension? Or does it mean the "slowest varying dimension"? In an abstract data model, this shouldn't matter, and a Join should be possible along any dimension. However, certain rules must apply, and these may be specific for certain dimensions (in particular of axis type "T"). If I understand correctly, then most of the aggregation rules you spell out deal with this type of Join operation. Life would be much easier if certain attributes for coordinate variables at least would be mandatory rather than optional. Requirements: both files must have the same variables [again, one could think of automatic filling with missing_value if one of the files contains only a subset of the variables from the other one -- this is a real use case]. Coordinate requirements are described separately for different types of coordinates below.

2a) Join along time axis (probably most common): Requirements (probably incomplete: you thought about this more than I): time should not overlap [you actually allow for this in the case of running averages, but I think that even then the actual time values shouldn't overlap, whereas bounds may], the calendar attribute must be mappable [this is less strict than "the same": either one could use a synonym, or even a 1:1 translation rule which of course must be specified somehow], the time resolution must be the same [really? perhaps except for numerical rounding errors? I had such a case when I worked with station data from different sources: one gave 4 decimals, the other 6. I knew that both were hourly resolution, so I had to provide a tolerance criterion to sort the data correctly. Should this be left to the user to decide? In any case a warning should appear in any application if the delta-time is not the same; then again: if there are irregular time intervals, this can become quite tricky]
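The tolerance criterion mentioned here could look something like the following sketch. It assumes regularly spaced, increasing time values and a user-supplied absolute tolerance; the function name `can_join_time` and the sample values (hourly data expressed in days, one source rounded to 4 decimals and the other to 6) are made up for illustration, and irregular intervals are deliberately not handled.

```python
import math


def can_join_time(t_a, t_b, abs_tol):
    """Return True if t_b may follow t_a on one regular time axis.

    Both inputs are increasing lists of time values in the same units.
    """
    # Time values themselves must not overlap or repeat.
    if t_a[-1] >= t_b[0]:
        return False
    # Every step of the joined axis must agree with the first step,
    # within the tolerance chosen by the user.
    joined = list(t_a) + list(t_b)
    steps = [later - earlier for earlier, later in zip(joined, joined[1:])]
    return all(math.isclose(step, steps[0], abs_tol=abs_tol) for step in steps)


four_decimals = [0.0, 0.0417, 0.0833]          # hourly, in days, rounded to 4 places
six_decimals = [0.125000, 0.166667, 0.208333]  # hourly, in days, rounded to 6 places

loose = can_join_time(four_decimals, six_decimals, abs_tol=1e-3)
strict = can_join_time(four_decimals, six_decimals, abs_tol=1e-6)
```

With the loose tolerance the two sources are accepted as one hourly series; with the strict tolerance the rounding differences cause the join to be refused, which is the point at which an application might warn the user.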

2b) join along lon or lat axis (more generally "X" and "Y"): Requirements: no overlap. Simple case: the other dimension must be identical in size and values (then one can "simply" paste the two chunks together). More complicated: if two regions are defined, one can define the outer bounding_box and fill everything outside regions A and B with missing_values. [again: a real use case, for example when pasting data "tiles" from continental land surface models into one global grid] -- in an extended view, one can even think about a "Join operator" for overlapping regions (e.g. "useA", "useB", "add", "subtract", "average").
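The "more complicated" tile case can be sketched as follows, purely to make the use case concrete. The tile layout, the fill value `MISSING` and the function name `paste_tiles` are hypothetical, and overlapping tiles are simply rejected rather than combined with a "Join operator".

```python
MISSING = -999.0  # hypothetical missing_value used to pad the bounding box


def paste_tiles(tiles):
    """Paste non-overlapping lat/lon tiles into their outer bounding grid.

    Each tile is a dict with keys "lat", "lon" (lists of coordinate values)
    and "data" (rows indexed by lat, columns indexed by lon).
    """
    # The output axes are the sorted union of all tile coordinates.
    lats = sorted({y for tile in tiles for y in tile["lat"]})
    lons = sorted({x for tile in tiles for x in tile["lon"]})
    grid = [[MISSING] * len(lons) for _ in lats]
    filled = set()
    for tile in tiles:
        for i, y in enumerate(tile["lat"]):
            for j, x in enumerate(tile["lon"]):
                if (y, x) in filled:
                    raise ValueError("tiles overlap at lat=%r lon=%r" % (y, x))
                filled.add((y, x))
                grid[lats.index(y)][lons.index(x)] = tile["data"][i][j]
    return lats, lons, grid


tile_a = {"lat": [0, 1], "lon": [0, 1], "data": [[1.0, 2.0], [3.0, 4.0]]}
tile_b = {"lat": [2], "lon": [2], "data": [[9.0]]}
lats, lons, grid = paste_tiles([tile_a, tile_b])
```

Every grid cell that belongs to neither tile keeps the fill value.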

2c) join along vertical coordinate: (Ouch!) rather than listing requirements, one can write down what can go wrong. For example: one is positive="up", the other is positive="down"; One is in units of pressure, the other in units of length (= height); One is irregular (e.g. hybrid sigma), the other regular (e.g. fixed pressure grid). In all of the above cases, one may wish to be able to combine records...

3) collection (in ncml "forecastmodelruncollection"): Haven't thought about this one much. Could call for adoption of a "group" concept, similar to netcdf4 data model?

4) numerical aggregation (not in ncml): perform arithmetic operations on two variables from one file or two variables from separate files. I guess, in this case one can require all dimensions to be the same. Although it would be nice if one can, for example, multiply a 2D field with all levels of a 3D field, ...

Best regards,

Martin

comment:2 in reply to: ↑ 1 Changed 4 years ago by davidhassell

Replying to mgschultz:

The CF data model, cf-python, and these aggregation rules are all very useful. Yet, I have a hard time understanding these rules. They look like the result of a hard thinking process condensed into semi-formal language, and this makes them difficult to understand. It would be great if they were "documented" from a user perspective. Here, I suggest the following structure (and maybe this could lead to identifying aggregations that are not possible yet or not possible in principle). I adopt the ncml terminology where possible:

Thank you for looking at this and for your support. Jonathan and I do appreciate that the rules we have presented require some thought to get to grips with. (Just to tie in with the data model discussions, I would like to stress that, like the data model, these rules are file format and file structure independent, and so are equally applicable to data stored in memory and on disk.)

I think that we ended up with a description that is perhaps more useful to someone who wants to code up the process rather than an end-user who wants to understand if, how or why their data have been aggregated. This is deliberate, as the idea of the rules is that they merely formalize your existing, intuitive idea of whether or not two fields are aggregatable. That is you (a human) know already whether or not two fields logically form a single larger field, but a computer needs more formal and entirely general rules to do the same job and to save you inspecting each and every one of your fields. As a consequence, I'm not sure that I agree that a more user orientated description would be useful, or even possible.

From your suggestions, you raise some good use cases which are not catered for by the aggregation rules ("merging of several time series from individual stations into one file with ragged time series format" and "pasting data tiles from continental land surface models into one global grid"). However, whether or not such collections of fields do form a single, larger field is a user choice, and one which may require extra preprocessing. It is not possible for that choice to be informed by the metadata - hence their exclusion from the general rules. That said, any software implementation of these rules could quite happily include extensions to cover situations such as these.

I'm not in favour of singling out certain named axes for special treatment, purely because I don't think it necessary. Again, your software may, as an option, restrict the rules on a particular axis (for example, to disallow box average time filters with strictly monotonic coordinate values but overlapping bounds), but if the operation is 'safe' (i.e. unambiguous), then it should be allowed in the rules.

2c) join along vertical coordinate: (Ouch!) rather than listing requirements, one can write down what can go wrong. For example: one is positive="up", the other is positive="down"; One is in units of pressure, the other in units of length (= height); One is irregular (e.g. hybrid sigma), the other regular (e.g. fixed pressure grid). In all of the above cases, one may wish to be able to combine records...

The 'direction' of a dimension, and the order of a variable's dimensions and the variable's units, should be purely the worry of the implementing software. The rules assume that any such mismatches may be reconciled, where possible. In the example you give of 'hybrid sigma' and 'fixed pressure', these fields are not aggregatable, in just the same way as 'hybrid sigma' and 'time' axes are not aggregatable.

4) numerical aggregation (not in ncml): perform arithmetic operations on two variables from one file or two variables from separate files. I guess, in this case one can require all dimensions to be the same. Although it would be nice if one can, for example, multiply a 2D field with all levels of a 3D field, ...

I was very interested to see that you mentioned this. We have separately developed similar (combination) rules for when this should be allowed, which include the use case you give of broadcasting (for example, multiply a 2D field with all levels of a 3D field), and are currently implementing these in cf-python. Our original intention was to document these rules in this software, but if there is interest we'd be happy to discuss them openly. Do they belong with the aggregation rules? My initial thought is not, since the aggregation rules are all about determining if two CF data model spaces differ only in one dimension and are merely part of a larger one, whereas the combination rules are all about determining if two spaces are comparable where they overlap, and they may not even overlap at all. I'll write up the combination rules and post a link on the e-mail forum if anyone wants to see more.
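To make the broadcasting use case concrete, here is a minimal sketch of multiplying a 2D field with every level of a 3D field. It is not the combination rules referred to above, nor cf-python code: it works on nested lists, assumes the 3D field is ordered (level, y, x), and the function name `multiply_levels` is hypothetical.

```python
def multiply_levels(field3d, field2d):
    """Multiply each (y, x) level of `field3d` by the (y, x) array `field2d`."""
    for level in field3d:
        # The horizontal shapes must agree for the operation to be unambiguous.
        if len(level) != len(field2d) or any(
                len(row3) != len(row2) for row3, row2 in zip(level, field2d)):
            raise ValueError("horizontal shapes do not match")
    return [[[v * w for v, w in zip(row3, row2)]
             for row3, row2 in zip(level, field2d)]
            for level in field3d]


levels = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]      # two levels of a 2 x 2 grid
weights = [[10, 0], [1, 2]]      # one 2 x 2 grid
product = multiply_levels(levels, weights)
```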

Many thanks again and all the best,

David

comment:3 Changed 4 years ago by jonathan

Dear Martin

Like David, I appreciate that our document is not an easy read. Like you, I think that it would be useful for us to provide more of a user guide as well. This would be quite hard work to write, but hopefully easier to read! I can imagine providing a discursive description, with diagrams, of some of our use cases, showing why they can or cannot be aggregated. This would help the user understand why the software implementing these rules behaved as it did.

Another thing which it might help to say, by way of explanation, is this: We are thinking about a situation in which one or more multidimensional data variables, each containing a single quantity (identified by a standard name and other CF metadata), have been chopped/sliced up in index space (each part of each original data variable having a range of the indices along each original axis). Then dimensions which are reduced to a size of one by the slicing might, or might not, be discarded, the order of dimensions in the new smaller data variables might be changed, and any of the dimensions might be reversed. Finally these new data variables are distributed arbitrarily among various files. Our aggregation rules are designed to reassemble the original variables from the disordered fragments, using CF metadata only, like a data archaeologist. Like a very intolerant archaeologist, the rules will refuse to allow fragments to be stuck together if there appears to be a gap between them. However they do support the possibility of the fragments of the whole having been made to somewhat different standards in the first place (such as order of dimensions being different) so long as this doesn't make a logical difference to the whole.
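A toy version of this "intolerant archaeologist" for a single dimension might look like the sketch below. It is an illustration under stated assumptions rather than the proposed rules: fragments carry cell bounds and data only, exact equality of adjacent bounds is used as the contiguity test (what "equality" means could be an implementation choice), overlapping bounds are refused along with gaps for simplicity, and the names `normalise` and `reassemble` are hypothetical.

```python
def normalise(fragment):
    """Return a copy of the fragment with cells in increasing order."""
    cells = [tuple(sorted(cell)) for cell in fragment["bounds"]]
    data = list(fragment["data"])
    # A reversed dimension is put back the right way round.
    if len(cells) > 1 and cells[0][0] > cells[-1][0]:
        cells.reverse()
        data.reverse()
    return {"bounds": cells, "data": data}


def reassemble(fragments):
    """Stick fragments together along one dimension, refusing any gap."""
    ordered = sorted((normalise(f) for f in fragments),
                     key=lambda f: f["bounds"][0][0])
    for previous, following in zip(ordered, ordered[1:]):
        if previous["bounds"][-1][1] != following["bounds"][0][0]:
            raise ValueError("fragments are not contiguous")
    return {"bounds": [cell for f in ordered for cell in f["bounds"]],
            "data": [value for f in ordered for value in f["data"]]}


forward = {"bounds": [(0, 1), (1, 2)], "data": [10, 11]}
backward = {"bounds": [(4, 3), (3, 2)], "data": [13, 12]}  # reversed fragment
whole = reassemble([backward, forward])
```

The reversed fragment is turned round and placed after the forward one, because its bounds continue exactly where the other fragment stops.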

To describe it as above might seem contrived, but this is the situation you are in if you download CMIP5 data, for instance. Only the time dimension has been chopped up in that case, so it's quite easy, although duplicate times need to be detected, as these rules would do. The rules would also allow 3D xyz fields to be reassembled from 2D xy fields. This is always necessary with data from the Met Office Unified Model, so of particular interest to us. The rules would not aggregate your "complicated 2b" or 2c cases, which could not have arisen from chopping up one variable and keeping all the bits. The rules are more tolerant than some of your requirements and more stringent than others. There are some criteria which are debatable, obviously, and might be implementation choices, such as what "equality" means.

The CF convention applies to individual netCDF files. It may equally well be regarded as applying to any single dataset, in any file format, especially if we agree a data model. However, in practice, some datasets are distributed over files, and for a long time there has been a need to deal with such collections of files. There are several ways to do this. Our proposed aggregation rules are one way. Another way is to write a metafile which contains instructions on how to do the aggregation, like NcML, or in the Gridspec-F proposal. This file needs to be written by hand or to make some assumptions if it's going to be generated automatically. Our rules are more general than the aggregations that can be described by NcML or Gridspec-F, I think. It is useful to be able to do it automatically, using rules like ours, because a dataset might be assembled from any arbitrary collection of files; new files might be added at any time, or files removed. At the moment, I tend to think that aggregation rules are something which should be considered as accessories to the CF convention, rather than part of it. In that case, various alternatives could all be advertised, and supported by different software.

Best wishes

Jonathan

comment:4 Changed 9 months ago by stevehankin

Jonathan and David,

At your suggestion I'm picking up a thread from trac #145 here.

The rules contained in this ticket (#78) provide guidance on when it is ok for a client application to merge fields. This is fine as far as it goes, but it does not provide the right level of information to coach a client on which files should be considered as candidates for aggregation.

Let's consider modelers' needs as our example. Say the model outputs 1 year of time steps per file, with each variable (temp, u, v, ...) in separate files. For a single model run it creates a matrix of NtimePeriods x Mvariables files. Let's say that at the client end we are confronted with a directory containing the outputs of 5 such model runs. CF does not currently provide a simple mechanism that would permit an application to scan this directory and infer what the file creator knew -- that these files represent 5 model outputs, when suitably aggregated.

CMIP has formalized a set of CF attribute conventions that address this problem in the full glory of CMIP models: multiple institutions, model names, scenarios, time periods, etc. CF needs analogous machinery -- simpler and more general. Here's a straw man:

  1. define a new CF global attribute
        aggregation_key = "some string";
    
  2. Offer guidance on the creation of the aggregation key string such as: "It is important that the string that is generated be unique to this dataset. We suggest generating an MD5 hash from a list of metadata items -- to be decided on by the file creator -- that guarantees uniqueness. For example the list may contain the following pieces of information"
  • institution
  • project/scenario
  • name of code generating files
  • creation date

The end result will be a global attribute such as

    aggregation_key = "d131dd02c5e6eec4";

that will be found in common among all of the files needed in this aggregation.
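As a sketch of how such a key might be generated, the snippet below hashes a list of metadata items with MD5 from the Python standard library. The function name `make_aggregation_key`, the newline separator, the truncation to 16 hexadecimal characters (chosen only to match the length of the example key above) and the sample metadata values are all assumptions, not part of the straw man.

```python
import hashlib


def make_aggregation_key(items, length=16):
    """Build a key from metadata strings chosen by the file creator."""
    # Join the items with a separator so that ("ab", "c") and ("a", "bc")
    # do not hash to the same value.
    text = "\n".join(items)
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest[:length]


key = make_aggregation_key([
    "Example Institute",        # institution (made-up value)
    "example-project/run-1",    # project/scenario (made-up value)
    "example_model v1.0",       # name of code generating files (made-up value)
    "2014-01-01T00:00:00Z",     # creation date (made-up value)
])
```

The same list of items always yields the same key, so every file written by one model run would carry an identical `aggregation_key` attribute.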

===

What needs to be added to this straw man is a strategy to handle files that may be shared in common by multiple model runs. For example, the cell_measures field that was the topic of #145, might be shared among all model runs that use a particular gridded coordinate system. A solution to this problem might be simply to allow multiple aggregation keys.

In all model outputs from a single run we find

    aggregation_key = "d131dd02c5e6eec4 c69821bcb6a88393";

where the first key identifies the model run and the second identifies the grid geometry. The grid-geometry files would contain only the single identifying key

    aggregation_key = "c69821bcb6a88393";

From this information a client application can quickly scan a directory and infer which files the data creator intended to be aggregated.
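The directory scan could be sketched as follows. To stay self-contained, the sketch assumes the `aggregation_key` attribute of each file has already been read into a dictionary; the file names and the function name `group_by_key` are hypothetical, and multiple keys are assumed to be separated by whitespace as in the example above.

```python
from collections import defaultdict


def group_by_key(file_attrs):
    """Map each aggregation key to the sorted list of files that carry it.

    `file_attrs` maps a file name to the value of its aggregation_key
    global attribute, which may hold several space-separated keys.
    """
    groups = defaultdict(list)
    for filename, attribute in sorted(file_attrs.items()):
        for key in attribute.split():
            groups[key].append(filename)
    return dict(groups)


file_attrs = {
    "temp_year1.nc": "d131dd02c5e6eec4 c69821bcb6a88393",
    "temp_year2.nc": "d131dd02c5e6eec4 c69821bcb6a88393",
    "cell_area.nc": "c69821bcb6a88393",
}
groups = group_by_key(file_attrs)
```

The model-run key selects the two data files, while the grid-geometry key selects those files together with the shared grid file.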
