Opened 10 years ago

Last modified 10 years ago

#29 new enhancement

Common Concept subtopic: CDL as constraint language

Reported by: caron
Owned by: cf-conventions@…
Priority: medium
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc:

Description

Taking matters precariously into my own hands, I am splitting out a particular part of the Common Concept proposal into a separate discussion. (If this turns out to be a bad idea, we can just delete it.) I will try to copy the relevant sections of the original proposal here, so readers can focus on what's already been said.

The original proposal:

The common_concept should bundle together properties of a CF variable (e.g. the standard_name and specific cell_methods or scalar coordinate variable values) with a scoped name and a CF registered universal resource name (URN).

(comment) This, along with the use cases below, imply that some modified version of CDL would be used essentially as a constraint language to express the semantics that the data variable should have. This discussion will try to focus on just the use of CDL as a way of expressing these semantics.

Use Cases

  1. Communities wish to define a common short form name (see ticket 11). For example, a GFDL modelling group might want to define high cloud. They would propose

Common Concept: {gfdl.noaa.gov}high_cloud
Defined as:

dimensions:
    hgt=1;
variables:
    float x(unconstrained);
        x:standard_name="cloud_area_fraction";
        x:units="1";
        x:coordinates="height + unconstrained";
        x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
    float height (unconstrained);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

Note that in this case the height variable may be geospatially varying or not; the common concept does not limit this aspect.

  2. A very similar example is Near Surface Temperature. Some parts of the world use 2 m air temperature, some 6-foot air temperature, and some work with 1.5 m. Near-surface temperature is commonly understood to be a temperature from less than 10 m.

Here we want simply to associate a coordinate variable, but one with some restricted properties.

Common Concept: {badc.nerc.ac.uk}near_surface_air_temperature
Defined as:

variables:
    float tsurf(unconstrained) ;
        tsurf:standard_name = "air_temperature" ;
        tsurf:units = "K" ;
        tsurf:coordinates = "height + unconstrained" ;
        tsurf:common_concept = 
          "{badc.nerc.ac.uk}near_surface_air_temperature; urn:cf-cc:blah123" ;
    float height;
        height:units="m";
        height:valid_min=0.0;
        height:valid_max=10.0;

Valid data files simply indicate which height they actually use, but are limited to one!

  3. The IPCC AR5 wishes to declare a bunch of new common variables, including climate statistics such as frost days and tropical days. These are not easily amenable to introduction as standard names, yet they are frequently used concepts.

While they could be introduced as standard names, the addition of common concepts, one new standard name, and one new cell method would let a wide range of these types of data be included without difficulty.

The AMS glossary says: frost day: An observational day on which frost occurs; one of a family of climatic indicators (e.g., thunderstorm day, rain day). The definition is somewhat arbitrary, depending upon the accepted criteria for a frost observation. Thus, it may be 1) a day on which the minimum air temperature in the thermometer shelter falls below 0degC (32degF); 2) a day on which a deposit of white frost is observed on the ground; 3) in British usage, a day on which the minimum temperature at the level of the ground or on the tops of low, close-growing vegetation falls to -0.9degC (30.4degF) or below (also called a "day with ground frost"); and perhaps others. The present trend is to drop such terms in favor of something less ambiguous, such as "day with minimum temperature below 0degC (32degF)".

Potentially these could be all introduced as standard names, but they all depend on two underlying factors: something we could characterise with a standard name of "number_of_occurrences", and something we can characterise as a cell_method_modifier of threshold crossing. It would seem simpler and more scalable to introduce these new features, and use the common concepts, rather than proliferate standard names.

Change History (22)

comment:1 Changed 10 years ago by caron

Jonathan http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:8

...the technical proposal needs a syntax definition for the translation of a common concept into other CF metadata. This would be part of the CF standard document and have corresponding entries in the conformance document.

...It could be argued that with the scope and URN specified then software will be able to recognise a given common concept, regardless of which alternative name is used for it. However, I expect that the majority of users of the data will not have software available to do that. They will be analysing the data by inspecting the attributes using the netCDF library, or with some existing package which doesn't have automatic translation of common concepts.

comment:2 Changed 10 years ago by caron

John Graybeal http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:14

  • Goal 1: Identifier for standard names/concepts + metadata. I'm having a real problem with this one. Unfortunately I can only put it in terms of questions, which relate to My Assumptions below:

    A) Given that CF will not subsume all terms that anyone ever has a need to create and exchange, should the design solution require a connection to CF variables -- or to the CF processes -- for any of these additional concepts?

    B) Re the declaration "It would seem simpler and more scalable to introduce these new features, and use the common concepts, rather than proliferate standard names" -- what is the essential difference between proliferating common concepts and proliferating standard names?

    C) If there is a common concept called air_temperature_at_2m, will the definition specify the precise variances in elevation which are acceptable, so everyone can tell exactly whether their data fits? What about precisely specifying acceptable variances in all the other attributes, and in properties that aren't even in CF attributes yet? (Benno's comment about precisely characterizing properties may be relevant here; I'm not entirely sure.)

    D) What will happen when you want an essentially equivalent concept, but with less precise or more precise variances in one or more attributes? Semantic naming is likely to run aground here, at least to a first order.

    E) What is the tradeoff between manual review of each contribution of such an identifier, and the ability to keep up with an ever-increasing number of variables? Especially when you realize that many of these variables could be generated automatically, in response to individual observation results or model runs?

In short, embedding the mechanism for a very narrowly defined identifier deep within the CF processes and services can only be supported in an automated, computable way, which raises the question of what CF adds to that architecture. I suspect CF can play a key role, but the role needs to be well understood programmatically and technically.

comment:3 Changed 10 years ago by caron

John Caron http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:17

My first thought about using CDL to define "concepts" is that CDL is inappropriate because it's not semantically rich enough. Here is the example, using unconstrained to try to represent which coordinates are constrained and which are not:

dimensions:
    hgt=1;
variables:
    float x(unconstrained);
        x:standard_name="cloud_area_fraction";
        x:units="1";
        x:coordinates="height + unconstrained";
        x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
    float height (unconstrained);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

This is certainly a good way to guide file writers by example, and it would be great if it really worked, but I fear it's not very precise or machine-readable. For example, the units probably don't have to match; they only have to be udunits-convertible.
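The convertibility point can be made concrete with a short sketch. This is a hypothetical illustration, not proposed checker code: the dimension table is a tiny hand-picked subset, and a real CF tool would delegate the test to udunits2.

```python
# Minimal sketch of "udunits convertible": two unit strings are
# acceptable for the same concept when they share a base dimension,
# even though the strings differ.  Hypothetical table; a real checker
# would call udunits2 instead.
BASE_DIMENSION = {
    "m": "length", "km": "length", "ft": "length",
    "K": "temperature", "degC": "temperature", "degF": "temperature",
    "1": "dimensionless", "%": "dimensionless",
}

def units_convertible(u1, u2):
    """True if values in u1 can be converted to u2 (same dimension)."""
    d1, d2 = BASE_DIMENSION.get(u1), BASE_DIMENSION.get(u2)
    return d1 is not None and d1 == d2

print(units_convertible("km", "m"))   # True: heights in km still satisfy "m"
print(units_convertible("m", "kg"))   # False: different dimensions never match
```

So a checker that demanded the literal string "1" would wrongly reject a file whose units were "%".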

The use of unconstrained means that it's not parseable CDL. If we are augmenting CDL, I might prefer:

dimensions:
    hgt=1;
variables:
    float x(hgt,...);
        x:standard_name="cloud_area_fraction";
        x:units="1";
        x:coordinates="height";
        x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
    float height(hgt);
        height:units="m";
        height:valid_min=7000.;
        height:valid_max=14000.;

but again the semantics are not precise. I also guess that there will be semantics that it can't express (I will try to think of some). But I'm willing to be persuaded if I can't come up with any deal-breakers.

It would be useful to gather a dozen or more different examples to work with before deciding CDL can really do an adequate job.

The alternative is obviously RDF and its variants. I personally haven't been impressed by the usability of RDF; it seems to me to be an expert-only system. Where are the killer apps and the Web 3.0 sites? However, in order to do automatic reasoning, I assume that whatever we do use will have to be expressed in RDF. So is it possible to write an automatic translator of the proposed CDL notation into RDF?

comment:4 Changed 10 years ago by benno

As you say, it is not parsable CDL. The approach I have been pursuing is to write CDL in RDF (bi-directional map), i.e. one can express the datasets, then one can use OWL/RDF (or any of a number of other rule/semantic systems) to write the rules that map one set of attributes to another set. OWL will get us pretty far I think, though it is one class at a time.

comment:5 Changed 10 years ago by caron

russ http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:18

There is a "Linked Data" community (Stephen Pascoe mentioned this in the "what are standard names for" CF thread in referring to the http://linkeddata.org/ site), who have proposed simpler alternatives than RDF/XML for representing triples in the RDF data model. The alternatives being used currently include Turtle and TriX.

From Linked Data Tutorial

... There are various ways to serialize RDF descriptions. Your data source should at least provide RDF descriptions as RDF/XML, which is the only official syntax for RDF. As RDF/XML is not very human-readable, your data source could additionally provide Turtle descriptions when asked for MIME-type application/x-turtle. In situations where you think people might want to use your data together with XML technologies such as XSLT or XQuery, you might additionally serve a TriX serialization, as TriX works better with these technologies than RDF/XML.

The RDF data model might be expressible in an augmented CDL or NcML using ideas or notation from Turtle (or N3 or TriX) ...

comment:6 Changed 10 years ago by caron

benno http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:19

There are a number of rdf formats. The help from rapper, a commonly used RDF utility, currently shows

 -i FORMAT, --input FORMAT   Set the input format to one of:
    rdfxml                  RDF/XML (default)
    ntriples                N-Triples
    turtle                  Turtle Terse RDF Triple Language
    rss-tag-soup            RSS Tag Soup
    grddl                   GRDDL over XHTML/XML using XSLT
    guess                   Pick the parser to use using content type and URI
  -o FORMAT, --output FORMAT  Set the output format to one of:
    ntriples                N-Triples (default)
    rdfxml-xmp              RDF/XML (XMP Profile)
    rdfxml-abbrev           RDF/XML (Abbreviated)
    rdfxml                  RDF/XML
    rss-1.0                 RSS 1.0

GRDDL is particularly interesting -- it is a way of instrumenting an XML file (or its schema file) so that RDF applications can find a file of XSLT transformations to translate the XML to RDF. So one file can serve both the original XML clients and RDF clients.

comment:7 Changed 10 years ago by caron

jonathan http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:20

Common concepts may translate to a range of CF metadata, not just standard names e.g. they may require an entry in cell_methods (like daily mean temperature does) or a particular coordinate variable (like surface air temperature does). If preliminary registration of concepts were possible, it might lead to files being written with only the common concept identified, lacking other important CF metadata, not just the standard name. This would make the metadata in the file incomplete and hence less useful, the opposite of what CF has aimed to do.

comment:8 Changed 10 years ago by caron

bnl (Bryan) http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:24

As far as syntax description goes, I had hoped we could use CDL (possibly with a minor extension or two), since the great majority of Netcdf users find it easy to understand. I don't know whether RDF per se provides us with a constraint language as such, and that appears to be the problem here. Can Benno or anyone else give us an example of how we can use RDF to express constraints? (The formal alternative might be OCL, but I think that would be an anathema to most of us, and we'd end up wanting to limit the syntax ... mind you if we have to invent something to add to CDL it might be a good place to start).

... whatever constraint mechanism we want to add will have to go into CF since CDL won't include it. In practice that will limit the concepts we can support to only those we can define constraints for. We shouldn't try to think of every possible use case up front. Once we have the principle established, we can add functionality as necessary. (This also means the CF checker can implement it with confidence, since it will be well defined.)

comment:9 Changed 10 years ago by caron

jonathan http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:26

The common concept, indicated by the URN, translates into other CF metadata (isn't that the proposal?) such as standard name, coordinates, cell_methods. This metadata will be recorded in the file. That means if you change the definition of the concept in terms of these metadata, the metadata in the file will become incorrect, and inconsistent with the URN also recorded in the file. That is my objection to provisional registration.

For example, suppose someone provisionally registers the concept of daily-maximum surface air temperature. They suggest that this concept translates to standard_name="daily_maximum_surface_air_temperature", and it is assigned an opaque URN with that translation. They write files containing this standard name and URN. Then we debate the proposed new standard name, and it is pointed out that in CF metadata we would describe this concept with standard_name="air_temperature", a size-one coordinate variable of height with a value in the range 1.5-2.0 m, and a cell_methods entry for time of maximum within days. But it is too late. The files have been written with an invalid standard name and lacking the other required metadata. The problem is that files cannot usually be provisionally written. Once written, they last forever. Hence I think provisional registration of a concept is only acceptable on the condition that no data will be written until the standard name has been agreed.

comment:10 Changed 10 years ago by caron

benno http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:28

Well I started to write down the mapping in RDF/OWL (see http://iridl.ldeo.columbia.edu/ontologies/sampleconceptclass.owl, subject to revision), but I realized that the example I was trying to reproduce (gfdl:high_cloud) was not what I thought it was.

So my basic strategy is to define an OWL class that has two different necessary-and-sufficient conditions, i.e. either one is sufficient to put a variable in the class. On the one hand, it can have a class characterized as common_concept has gfdl:high_cloud. On the other hand, there is the equivalent class standard_name has "cloud_area_fraction", units has "1", and cfobj:hasCoordinate is some a_high_height. (cfobj:hasCoordinate is an explicit property from my attempt to write down the concepts behind CF, i.e. what about data CF is trying to express, and the rules necessary to deduce the abstract relationships from what is found in a netcdf/CF file.) The plan is that once a reasoner sees multiple necessary-and-sufficient conditions for a class, having one implies the other(s), and one gets reasoning, i.e. it can associate all the different ways of labeling the variable as being high_cloud with the particular variable at hand.

However, defining the class "a_high_height" does not seem quite right in the example. First of all, it has to have the standard_name "height" -- variable names do not have meaning in CF. But secondly, this fragment says there is some specified height; it just has to be between the two limits. That is more specific than common_concept has gfdl:high_cloud, which does not specify the height. So that would lead to a number of nested classes, with standard_name has "cloud_area_fraction" and units has "1" containing common_concept has "gfdl:high_cloud" containing the specific class standard_name has "cloud_area_fraction" and units has "1" and ncobj:hasCoordinate some a_high_height.

To express the concept of an ambiguous height between two limits in CF would be more of a challenge. I would think you could specify the two limits as the bounds of the height variable, with a cell_method of 'point' -- ideally you would leave the height value as missing, but I suppose you won't do too much damage by picking a value between the limits. I guess this brings up the general question of what the coordinate values mean if bounds are specified, i.e. if they are just the center of the interval, with no commitment as to where the "point" measurement was taken.

If someone could tell me how to do this (vaguely specify height), I could make a set of CF statements precisely equivalent to common_concept has gfdl:high_cloud. I am also not totally happy with the surface example, for similar reasons (particularly because it names a coordinate variable that does not share a dimension -- seems to me you want to specify a height dimension of length 1).

comment:11 Changed 10 years ago by caron

steve hankin http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:30

The semantics of the standard_names are not rich enough to capture all that needs to be captured. We see this in the case of high cloud, which needs to combine the semantics of "cloud_area_fraction" with information known through the Z axis of the variable. We also should see very similar connections in the relationship between (say) the standard variable sea_skin_temperature and the standard variable sea_surface_temperature, but we currently fail to capture the relationship between these variables in a machine-accessible way. Conclusion: We need to add richness (ontological information) to our standards name framework.

comment:12 Changed 10 years ago by caron

benno http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:36

I'll take a shot at clarifying my question. As an oceanographer guessing as to what gfdl:high_cloud means, I offer the following alternate CF representation

dimensions:
    hgt=1;
variables:
    float x(hgt + unconstrained);
        x:standard_name="cloud_area_fraction";
        x:units="1";
        x:cell_methods="hgt: maximum";
        x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
    float hgt(hgt);
        hgt:units="m";
        hgt:standard_name="height";
        hgt:bounds="hgt_bounds";
    float hgt_bounds(hgt,2);
data:
    hgt_bounds=7000.,14000.;

i.e. that gfdl:high_cloud is the maximum cloud_area_fraction between 7000 and 14000m. This CF representation differs from the example presented in the original proposal in that a realization of this template does not additionally specify the height of the cloud_area_fraction within the 7000-14000m layer.

I think in this case the set of CF elements other than common_concept, and the single attribute common_concept=gfdl:high_cloud, would be considered equivalent.

This assumes, of course, that I have the meaning of gfdl:high_cloud correct.

comment:13 Changed 10 years ago by caron

bnl http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:37

Strictly I can't comment on gfdl:high_cloud, but I could on a possible badc:high_cloud which might be cloud_area_fraction above 7km (strictly I might have no upper boundary, but let's put it at an arbitrary tropopause for the purpose of this example, so say 11 km to make the maths easier)

... Now as an aside, let me say that for this thought experiment that it is a model product and what is actually done is use a specific assumption - let's say random_overlap - and integrate the cloud_area_fraction on each model level between these bounds (this might be done to compare with a satellite derived product that can only give broad height bands) ...

Now I somehow want to get random_overlap into my cell methods ... which may be a topic for another day, so let's revert to the satellite case ... but both are the same from the following point of view: it's not from an arbitrary height per se ... each instance has a specific height ... we could write this as data from 9km +/- 2km ... (which I can get into CF without too much hassle) ...

... but I might have another model with data from 7-9km called high_cloud marked at 8 km, but I might be happy to have that marked as badc:high_cloud too ...

So the common concept would have to allow me to write the data from both instances in the normal way, but to mark both as high_cloud, and the cf_checker would have to parse the "constraints" on high_cloud and check that both are valid high_cloud things. I don't think any data processing software would need to know anything other than it could use the high_cloud as a label (for visualisation rather than cloud_area_fraction, or for selection a la Balaji's example which cannot otherwise be done automagically with CF data files) ...

comment:14 Changed 10 years ago by caron

Steve Hankin http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:40

I have great reservations about Section 5 (Maintenance) in the original proposal ("We would expect communities to propose common concepts to the CF mailing list, and for the standard-names secretary to provide a URN"). IMHO the contents of the common_concept tags should be the private responsibility of the organizations that insert them. There should be no "registration" of the URNs in the CF standard. I understand that this would open a potential for namespace collisions, but it is a remote potential.

The motivation for registration is given as "reduce the necessity for the proliferation of some classes of standard names". There is an alternative, and I think better, way to address this. Namely, create a formal ontology for the standard names list. (a.k.a. "add semantic richness"). Then there is no longer a problem to having the standard names list grow to include many terms that are only subtly different concepts from one another.

comment:15 Changed 10 years ago by caron

jonathan http://cf-pcmdi.llnl.gov/trac/ticket/24#comment:41

My understanding of the proposal is that 2 is essential. Just as standard names are centrally registered, common concepts are to be registered as well, which represent the combination of standard names with other CF metadata. You are right that we can add more richness to standard names, and we certainly do that by creating new distinctions among them, but by design there are many aspects of CF metadata that are not in standard names, such as cell_methods, and some commonly discussed quantities involve these other metadata in their definition. So I support 2.

comment:16 Changed 10 years ago by caron

Ok, those are the comments that are relevant to the subtopic CDL as constraint language according to my reading. Sorry if I munged or left out anyone's previous posts - feel free to clarify.

I'd like to analyze these examples to clarify what the CDL is supposed to mean. So, just starting with the first one:

dimensions:
   hgt=1;
variables:
   float x(unconstrained);
       x:standard_name="cloud_area_fraction";
       x:units="1";
       x:coordinates="height + unconstrained";
       x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
   float height (unconstrained);
       height:units="m";
       height:valid_min=7000.;
       height:valid_max=14000.;

Means:

  1. variable x is a kind of "cloud_area_fraction" quantity
  2. variable x has units that are udunits-compatible with "1"
  3. the vertical height of variable x lies between 7 and 14 km

Doesn't mean:

  1. the variable has name "x"
  2. is a float
  3. has unit string = "1"
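
The "means"/"doesn't mean" split can be made concrete as a check that looks only at the semantic properties. This is a hypothetical Python sketch -- the dict-based file layout and the function name are illustrative, not anything proposed in the ticket:

```python
# Sketch: test the three "means" statements while ignoring the
# "doesn't mean" aspects (variable name, storage type, exact unit
# string).  The dict layout standing in for a netCDF file is made up.
DIMENSIONLESS = ("1", "%", "")  # crude stand-in for a udunits check

def is_high_cloud(var, coords):
    # 1. a kind of "cloud_area_fraction" quantity
    if var.get("standard_name") != "cloud_area_fraction":
        return False
    # 2. units udunits-compatible with "1"
    if var.get("units") not in DIMENSIONLESS:
        return False
    # 3. the vertical height lies between 7 and 14 km
    height = coords.get("height")
    return height is not None and all(7000.0 <= v <= 14000.0 for v in height)

# the variable's own name and storage type never enter the check
cloud = {"standard_name": "cloud_area_fraction", "units": "%"}
print(is_high_cloud(cloud, {"height": [10000.0]}))  # True
```
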

The requirements on the vertical coordinate are difficult. In satellite data, my understanding is that the actual height is unknown (can someone more knowledgeable comment on whether that's true?). In the above, you are forced to put an actual value into the height coordinate. This is one of Benno's points, and he offers the alternative (leaving off the cell_methods attribute for now):

dimensions:
    hgt=1;
variables:
    float x(hgt + unconstrained);
        x:standard_name="cloud_area_fraction";
        x:units="1";
        x:common_concept="{gfdl.noaa.gov}high_cloud;tbd";
    float hgt(hgt);
        hgt:units="m";
        hgt:standard_name="height";
        hgt:bounds="hgt_bounds";
    float hgt_bounds(hgt,2);
data:
    hgt_bounds=7000.,14000.;

This is probably more correct, but you are still forced to put an actual value into the height coordinate (putting in missing data would be even worse, I think).

This is related to the question of whether one should or must put in a vertical dimension for a 2D field such as a "cloud area fraction" variable. It's the only way to associate a vertical coordinate value, but if you don't know the vertical coordinate value, is it required?

So then would CF require conformant "{gfdl.noaa.gov}high_cloud" variables to have a length-1 vertical dimension and a corresponding coordinate variable, whose values shall be ignored but whose valid min/max or bounds correctly express the allowable range?

I would say that it's difficult for CDL to concisely and unambiguously express the statement "the vertical height of variable x lies between 7 and 14 km".

comment:17 Changed 10 years ago by caron

Example 2:

variables:
   float tsurf(unconstrained) ;
       tsurf:standard_name = "air_temperature" ;
       tsurf:units = "K" ;
       tsurf:coordinates = "height + unconstrained" ;
       tsurf:common_concept ="{badc.nerc.ac.uk}near_surface_air_temperature; urn:cf-cc:blah123" ;
   float height;
       height:units="m";
       height:valid_min=0.0;
       height:valid_max=10.0;

Means:

  1. variable tsurf is a kind of "air_temperature"
  2. variable tsurf has units udunits-compatible with "K"
  3. the vertical height of variable tsurf lies between 0 and 10 m above the surface of the earth.

If the actual height is unknown, then we have the same situation as with Example 1. Assuming that the height is known (probably the common case), then the main question is whether a vertical dimension and coordinate is required in the file.

If it is required, some of the metadata shown in the example is not necessarily required; e.g. tsurf:coordinates is not needed when coordinate variables are used. So a conformance checker cannot be too naive about the syntax of the common-concept CDL "definition".

If I were writing such a conformance checker, I would likely translate the CDL definition into semantic statements (like those above), and then implement code for each type of semantic statement. Even after accumulating lots of common concepts, I would guess the number of statement types would stay rather small, maybe fewer than ten. If that's the case, I would personally prefer to just let the common_concept semantics be defined by those statements, with the CDL becoming an example for file writers to follow.
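The checker design described here can be sketched as one handler per statement type, with a common-concept definition reduced to a short list of statements. The statement vocabulary, function names, and data layout below are hypothetical illustrations, not proposed syntax:

```python
# Sketch: a conformance checker that dispatches on semantic statement
# types.  A common-concept definition is just a list of statements.
def _is_a(var, coords, name):
    return var.get("standard_name") == name

def _compatible_units(var, coords, unit):
    return var.get("units") == unit  # a real checker would ask udunits

def _vertical_range(var, coords, lo, hi):
    height = coords.get("height")
    return height is not None and all(lo <= v <= hi for v in height)

CHECKS = {
    "is-a": _is_a,
    "has-compatible-units-with": _compatible_units,
    "has-vertical-coordinate-range": _vertical_range,
}

def conforms(var, coords, statements):
    return all(CHECKS[kind](var, coords, *args) for kind, *args in statements)

# the near-surface air temperature concept as three statements
near_surface_air_temperature = [
    ("is-a", "air_temperature"),
    ("has-compatible-units-with", "K"),
    ("has-vertical-coordinate-range", 0.0, 10.0),
]
tsurf = {"standard_name": "air_temperature", "units": "K"}
print(conforms(tsurf, {"height": [2.0]}, near_surface_air_temperature))  # True
```

Adding a new statement type then means adding one handler, which is also what makes the checker's behavior well defined.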

(It's possible that those statements could be made into RDF statements, but I'm not really sure. If so, then we can see whether the reasoning engines can provide anything new.)

The problem with this approach, where the CDL is an example rather than a definition, is that the common_concept semantics are not completely captured in the file; rather, they are captured in the external definition and only approximated in the file.

comment:18 follow-ups: Changed 10 years ago by bnl

(No, it's not that the actual height is necessarily unknown, it's that the definition of "high cloud" is cloud that lies between x and y km, whether that's a cloud deck at some height in between or an integrated amount over all the clouds at all heights in between.)

But your fundamental point is that the common concepts are best defined "outside" of CDL. I've no argument with that, but I'm not sure that free text semantic statements are the right way to define common concepts. I think it should be possible to add a minimal constraint language to CDL or to use RDF. Either way, we still need definitions of the *types* of constraints, and you're all right that we didn't do that properly in the proposal and it needs doing.

comment:19 in reply to: ↑ 18 Changed 10 years ago by benno

Replying to bnl:

(No, it's not that the actual height is necessarily unknown, it's that the definition of "high cloud" is cloud that lies between x and y km, whether that's a cloud deck at some height in between or an integrated amount over all the clouds at all heights in between.)

I think I can clarify my point with Bryan's additional information.

If "high cloud" can be applied to data where the height is unknown, then the part of the CF representation that is equivalent to the concept of "high cloud" cannot require specifying the height.

On the other hand, any particular dataset may have that additional information, i.e. some netcdf files will in fact specify the height for "high cloud". That would be additional information, along with longitude, latitude, source...

comment:20 follow-up: Changed 10 years ago by bnl

Crikey, isn't free text ambiguous?

I don't anticipate not having a height to go in the file. But both this and the 2 m air temperature height share the same characteristic: we don't much care about the value of the height, provided it lies within the bounds of the common concept definition. So what we need is to a) put it in, but b) more importantly, get the cell bounds on height right, and c) label it with the common concept (which a human can interpret to tell you that the actual height is less important than the bounds).

So, the question is how to put the constraint on the height that the user puts in the file ... and to ensure that the cell bounds also lie within the constraints (let's call them x and y)?

I imagine the actual height range of the data being *b*, with bounds *a* and *c*, and the constraint needing to ensure that *a,c>x* and *a,c<=y* (and hence x<b<y, but in this case *b* is more like metadata than something one does calculations with).

So how can we write that constraint?
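Written out as executable code, the constraint on the bounds is short. This sketch just restates Bryan's inequalities; the function name is illustrative:

```python
# Bryan's constraint: the data height b, with cell bounds a and c,
# fits the common-concept limits x and y when a,c > x and a,c <= y.
# (Then x < b <= y follows, since a <= b <= c.)  Illustrative only.
def height_fits_concept(a, b, c, x, y):
    assert a <= b <= c, "bounds must bracket the height value"
    return a > x and c <= y

# e.g. a high-cloud concept with limits x=7 km, y=11 km
print(height_fits_concept(8.0, 9.0, 10.0, 7.0, 11.0))   # True
print(height_fits_concept(6.0, 9.0, 12.0, 7.0, 11.0))   # False: bounds spill out
```

The open question in the thread is not the arithmetic but where such a rule would live: CDL has no way to state it, so it would have to sit in the external concept definition.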

comment:21 in reply to: ↑ 20 Changed 10 years ago by benno

Replying to bnl:

I don't anticipate not having a height to go in the file.

OK, I'll have to believe you. My experience with other data sources is different, so let's look at a few of those.

ECMWF ERA-40 supplies high-cloud with no corresponding height variable (http://data.ecmwf.int/data/d/era40_daily/) (near as I can tell -- my current browser cannot get past their disclaimer page, but when we downloaded it in grib it did not have height information). ECMWF ERA-15 did the same.

NOAA NCEP EMC CFS (the retrospective is all I have dealt with) supplies cloud cover at grib level type 234 ("high cloud"). There are also grib level types 232 (high-cloud bottom) and 233 (high-cloud top) where one could supply pressures, but they are not in the datasets I have access to.

NOAA NCEP-NCAR CDAS-1 DAILY grib files supply high_cloud cover, convective cover, rh-type cover, cloud bottom pressure, cloud top pressure, cloud top temperature. At least if you download the daily-update files -- none of them are kept in the long-term files. So here at least I would have the outer high cloud layer limits -- but not the "center" that the CF representation seems to require. The MONTHLY files omit the cloud top and bottom pressure, but at least they are available for the whole record.

NOAA NCEP-DOE Reanalysis-2 supplies high cloud base pressure, high cloud top pressure, high cloud cover, and high cloud top temperature. So here I would have a chance, again if I could either leave the center value unspecified or be sure that it was unimportant.

NASA ISCCP D2 (satellite cloud) supplies High Level cloud amount, top pressure, and top temperature (joys of looking down, I guess). So here we have a single height, albeit at the edge of an unknown interval.

The point is

1) frequently we get high-cloud cloudiness without height, and

2) when we get height, it can be top and bottom with no center, or top with no width.

So even if we are to believe that gfdl/badc will always supply height with high cloud data, so that the above CDL does in fact say what you want it to say in that regard, the question still stands w.r.t. an unspecified interval center, i.e.

a) is there a way to not specify an interval center, or

b) is there a way to be sure the center will not be misconstrued.

And, of course,

c) would you want to consider any of this data as gfdl:high_cloud?

comment:22 in reply to: ↑ 18 Changed 10 years ago by caron

Replying to bnl:

But your fundamental point is that the common concepts are best defined "outside" of CDL. I've no argument with that, but I'm not sure that free text semantic statements are the right way to define common concepts. I think it should be possible to add a minimal constraint language to CDL or to use RDF. Either way, we still need definitions of the *types* of constraints, and you're all right that we didn't do that properly in the proposal and it needs doing.

Examples 1 and 2 seem to have these statement types:

  1. x is-a standard_name
  2. x has-compatible-units-with unit
  3. x has-vertical-coordinate-range min, max

For example 3 (frost days), I'm not sure what else can be done other than:

  1. number of days with minimum temperature below 0 degC

It would be helpful to have more examples.

In looking at a few random existing standard names, I notice that some can also be seen as statements about the vertical level, e.g. the last six here:

  1. air_pressure
  2. air_pressure_anomaly
  3. air_pressure_at_cloud_base
  4. air_pressure_at_cloud_top
  5. air_pressure_at_convective_cloud_base
  6. air_pressure_at_convective_cloud_top
  7. air_pressure_at_freezing_level
  8. air_pressure_at_sea_level