Opened 5 years ago

Last modified 3 years ago

#99 new enhancement

Taxon Names and Identifiers

Reported by: lowry
Owned by: cf-conventions@…
Priority: high
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc:

Description

New section to be added to the Convention

6.1.2 Taxon Names and Identifiers

A taxon is a named level within a biological classification, such as a class, genus and species. Within the marine environment there are at least half a million taxa. However, CF isn't confined to the marine environment and so the number runs into millions, even billions.

When a variable in CF describes a property of a taxon, such as its numeric concentration or abundance, one approach would be to incorporate the taxon name into the Standard Name. However, experience with other parameter vocabularies has shown that this can quickly become unsustainable. Consequently, taxonomic names are handled in a similar manner to geographic names using a generic Standard Name for the data variable plus co-ordinate variables to carry the label text. The data variable is labelled using Standard Names of the form 'property_of_taxon_in_medium'. For example, taxon abundance in a water body would be described by the Standard Name 'number_concentration_of_taxon_in_sea_water'. The labelling co-ordinate variables have the Standard Names 'taxon_name' and 'taxon_identifier'.

The taxon name included in the data must be taken from a recognised source. Currently, these are the World Register of Marine Species or WoRMS (http://www.marinespecies.org/), which is the preferred resource for the marine environment, or the Integrated Taxonomic Information System or ITIS (http://www.itis.gov/) for terrestrial flora and fauna. Note that the only requirement for CF is that the name used is registered in at least one of the named resources. It does not have to be designated as 'valid'.

The taxon_identifier from either WoRMS (the aphia ID) or ITIS (the taxonomic serial number or TSN) needs to include a namespace string, which is 'aphia:' or 'tsn:'. For example, Calanus finmarchicus is encoded as either 'aphia:104464' or 'tsn:85272'. For the marine domain WoRMS has more complete coverage and so aphia IDs are preferred.
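The namespace rule for taxon_identifier described above could be checked along the following lines. This is only a minimal Python sketch, not part of the proposal: the function name parse_taxon_identifier and the KNOWN_NAMESPACES set are hypothetical, and it assumes that only the two namespaces named above ('aphia:' and 'tsn:') are allowed and that the local part is always a plain integer.

```python
# Hypothetical helper: split a namespaced taxon_identifier such as
# 'aphia:104464' into its namespace and numeric ID.
KNOWN_NAMESPACES = {"aphia", "tsn"}  # the two namespaces named in the proposal


def parse_taxon_identifier(value):
    """Return (namespace, numeric_id), or raise ValueError if malformed."""
    namespace, sep, local_id = value.strip().partition(":")
    if not sep:
        raise ValueError(f"missing namespace in {value!r}")
    if namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"unknown namespace {namespace!r}")
    if not local_id.isdigit():
        raise ValueError(f"non-numeric ID in {value!r}")
    return namespace, int(local_id)


# The two encodings of Calanus finmarchicus quoted in the text.
print(parse_taxon_identifier("aphia:104464"))
print(parse_taxon_identifier("tsn:85272"))
```

A reader of the file could use the returned namespace to decide which registry (WoRMS or ITIS) to query for the numeric ID.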
Example 6.3 This example shows how the taxonomic information would be encoded for a simple time series of abundance for two taxa. For clarity, a lot of information, such as the time variable, has been omitted.

dimensions:
  time=1000;
  string80=80;
  taxon=2;
variables:
  float abundance(time,taxon);
    abundance:standard_name="number_concentration_of_taxon_in_sea_water";
    abundance:coordinates="taxon_identifier taxon_name";
  char taxon_name(taxon,string80);
    taxon_name:standard_name="taxon_name";
  char taxon_identifier(taxon,string80);
    taxon_identifier:standard_name="taxon_identifier";
data:
  taxon_name = "Calanus finmarchicus", "Calanus helgolandicus";
  taxon_identifier = "aphia:104464", "aphia:104466";

Consequences for Standard Names

The following new Standard Names are required to describe the label variables and to support the bacterial data request that inspired the creation of this ticket. One more has been included in support of the above example.

  • taxon_name: The human-readable label for the taxon such as Calanus finmarchicus. The label should be registered in either WoRMS or ITIS and spelled exactly as registered.
  • taxon_identifier: The machine-readable identifier for the taxon registration in either WoRMS (the aphia ID) or ITIS (the taxonomic serial number or TSN), including namespace. The namespace strings are 'aphia:' or 'tsn:'. For example, Calanus finmarchicus is encoded as either 'aphia:104464' or 'tsn:85272'. For the marine domain WoRMS has more complete coverage and so aphia IDs are preferred.
  • colony_forming_unit_number_concentration_of_taxon_in_sea_water: "Colony Forming Unit" means an estimate of the viable bacterial or fungal numbers determined by counting colonies grown from a sample. "Number concentration" means the number of particles or other specified objects per unit volume. "Taxon" means an organism named in the taxon_name and taxon_identifier variables.
  • number_concentration_of_taxon_in_sea_water: "Number concentration" means the number of particles or other specified objects per unit volume. "Taxon" means an organism named in the taxon_name and taxon_identifier variables.

Change History (12)

comment:1 Changed 5 years ago by jonathan

Dear Roy

Thanks for making this proposal. You provide good arguments in support of doing it this way. I think that quite a lot of the above text is actually the arguments in support of making the change, and we will not need to include it all in the CF standard document. I would suggest that the text might be

Taxon names and identifiers

A taxon is a named level within a biological classification, such as a class, genus and species. Quantities dependent on taxa have generic standard_names containing the word taxon, and the taxa are identified by auxiliary coordinate variables.

Then go on to describe the conventions for names and IDs, and give the example(s).

You compare this proposal to geographic regions, and I agree with that, but it's a bit more complicated because of the alternative sets of labels. I propose that we should tidy the CF document by promoting 6.1.1 on "Geographic regions" to 6.3 (i.e. remove it from 6.1), and adding yours as 6.4. Then 6.1 and 6.2 will describe mechanisms in CF, and 6.3 and 6.4 applications of these mechanisms.

I am still concerned about the possibility for confusion in identification of taxa, but I accept that we have to work with the best there is! I'd like to suggest something a bit more demanding, however:

  • We require taxon_name from WoRMS or ITIS, as you say.
  • We require that either taxon_identifier (the aphiaID) or taxonomic_serial_number (from ITIS) should be provided, and we recommend that both should be provided, particularly the aphiaID. Including both will make the dataset more thoroughly self-describing and hence more useful for data exchange.
  • It is an error if the taxon_name does not agree with the taxon_identifier or the taxonomic_serial_number (this includes the case where the taxon_identifier and the taxonomic_serial_number are inconsistent). However, this error isn't one the CF checker could detect, because it would require there to exist a cross-reference table between WoRMS and ITIS, and I assume there isn't one of those available.

Thus, I would put the two sorts of identifier into separate auxiliary coordinate variables. Since they have different standard names, they wouldn't need the namespace identifier, and the values can be integers, rather than strings. Missing data can be given for any taxon which doesn't have an identifier.

What do you think?

Best wishes

Jonathan

comment:2 Changed 5 years ago by graybeal

A few suggestions on this, which I love to see proposed.

The use of the term 'label' with respect to standard names is a bit confusing, since 'label' is something I'd expect the long name to do. An example: "using a generic Standard Name for the data variable plus co-ordinate variables to carry the label text. The data variable is labelled using Standard Names of the form 'property_of_taxon_in_medium'." The terms used in the body of the standard when speaking of the standard name are always either 'identify' or 'describe', which is more appropriate for a standard name.

I had to reread Jonathan's suggestion, and I think it amounts to this: "Both taxon name and taxon identifier are required, to maximize understanding and interoperability. They may be obtained, as a corresponding pair, from either WORMS or ITIS."

First, are we sure we want to specify these are the only two acceptable sources, just because they are the two most prominent/recognizable/acceptable sources at this time? Since the identifier effectively identifies the source, I'm inclined to accept any source the user deems acceptable, perhaps strongly recommending these two. But your judgment works for me here.

I'm OK with requiring both name and ID. I'm a little confused by Jonathan's last paragraph, particularly "missing data can be given for any taxon which doesn't have an identifier". If they are in ITIS or WORMS, they have an identifier. If they aren't -- and this is partly why I suggested not constraining to ITIS and WORMS -- they need to have an identifier from _somewhere_, or they aren't really a taxon. I was rather hoping the identifier could be a URL, or at least a URN, but it appears ITIS doesn't provide such a thing. (!?!)

In any case, the suggested handling of a missing ID made me wonder if we are talking about two possible usage scenarios. (1) The user variable being described is simply a number (number_concentration, for example), with many measurements being taken, and they all have the same taxon info. (2) There are 3 variables being described; the measurement number, and one or two variables that describe the taxon for that particular measurement number. In this case the additional variables' values define the meaning of the primary variable, which could be different for each 'row'.

Are we proposing this solution for case (1), case (2), or both?

Regarding the conformance of the name to the ID, I think we should stipulate that one of these two values is authoritative, and the other is informative. Since the ID is truly what uniquely specifies the taxon in the database, I think it should be considered authoritative; the other is explanatory text. The name for that ID may even change over time (if those DBs work as I believe them to), wacky as that seems; but I don't think this represents an intolerable conflict.

Finally, in the original proposal ' "Taxon" means an organism named in the taxon_name and taxon_identifier variables.' appears twice.

comment:3 Changed 5 years ago by jonathan

Dear John

I am sure you and Roy know more about the available taxonomic databases. If CF isn't going to provide its own, I think we should be explicit about which ones should be used, and it should be as few as possible. That is because, in the limiting case that every data provider used a different taxonomic database, the datasets would no longer be comparable. You wouldn't know whether Graybeal species number 94308 was the same as Lowry species number 612095, even if they did have the same species name, since the names are not regarded as reliable. So I don't think we ought to leave it open to the data writer to use any database they deem to be acceptable.

Ideally we would have only one external authority, but Roy says that is not sufficient, and suggests there are two. To maximise portability of data, I therefore suggested that it be recommended for both to be used. However, in some cases the species concerned will be in one but not the other. That is when there will be missing data in one of the auxiliary coordinates. For instance:

variables:
  int aphiaID(taxa);
    aphiaID:_FillValue=-1;
    aphiaID:standard_name="taxon_identifier";
  int tsn(taxa);
    tsn:_FillValue=0;
    tsn:standard_name="taxonomic_serial_number";
data:
  taxon_name="Homo sapiens", "Fraxinus excelsior", "Struthio camelus";
  aphiaID=1,32768,-1;
  tsn=42,0,7776;

In this entirely made-up example, F. excelsior appears in WoRMS but not ITIS, while S. camelus is in ITIS but not WoRMS, so there are missing data elements in the auxiliary coordinate variables.

I think if both are provided, as recommended, they should be consistent and it is an error if they are not. For example, TSN 42 might actually be Pan troglodytes rather than H. sapiens. This would be an error. If we just said, "let WoRMS take precedence", the purpose of providing TSN as well would be undermined. If we provide both and they are consistent, software with a preference for one of them can use that one. If they are not guaranteed to be consistent, you would get different results depending on which identifier you use.
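The consistency rule argued for here could be sketched in Python as below. This is an illustration under stated assumptions, not an existing CF checker feature: the cross-reference table CROSS_REFERENCE is hypothetical (the thread assumes no such table is available), the function name check_consistency is invented, and the identifier values and fill values are the made-up ones from the CDL example above.

```python
# Fill values taken from the made-up CDL example in this comment.
APHIA_FILL = -1
TSN_FILL = 0

# Hypothetical cross-reference table (aphiaID -> TSN). The values are
# invented for illustration and do not describe real registry entries.
CROSS_REFERENCE = {1: 42}


def check_consistency(aphia_ids, tsns, xref=CROSS_REFERENCE):
    """Return the indices of taxa whose two identifiers disagree.

    A taxon is compared only when both identifiers are present (neither
    is a fill value) and its aphiaID appears in the cross-reference table.
    """
    errors = []
    for i, (aphia, tsn) in enumerate(zip(aphia_ids, tsns)):
        if aphia == APHIA_FILL or tsn == TSN_FILL:
            continue  # missing in one database: nothing to compare
        expected = xref.get(aphia)
        if expected is not None and expected != tsn:
            errors.append(i)
    return errors


# Values from the example: consistent, so no indices are reported.
print(check_consistency([1, 32768, -1], [42, 0, 7776]))
# Same data with a wrong TSN for the first taxon.
print(check_consistency([1, 32768, -1], [43, 0, 7776]))
```

Taxa that carry a fill value in either variable are skipped, which matches the case where a species is registered in only one of the two databases.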

Best wishes

Jonathan

comment:4 Changed 5 years ago by graybeal

Thank you for the clarification. I think this is a critical detail worth additional attention.

The premise of semantic interoperability, based on extensive real-world experience, is that interoperability can not be achieved through constraining everyone to use one vocabulary (or two). There are any number of good reasons that the chosen vocabulary(ies) may not be sufficient. With semantics, this weakness is easily overcome by creating relations between vocabularies. In some cases those relations are precise and homomorphic; in other cases they are descriptive instead, but still powerful, and extensible with time.

So if you give me *one* taxonomic ID, I don't need a second one, or a name, or a verification of their relationship; I will have, external to CF, the tools and relations that tell me how those are related. (WORMS is a 'best practices' case of this; you can look up a matching entry in ITIS directly from the WORMS entry.) Trying to replicate this functionality makes CF more complex, at no value to CF, because the linked open data and semantic communities will take care of it much more robustly, and much less expensively for everyone.

Roy has said these are the two vocabularies, yes, and I am sure he knows more about them than I do. But when I asked a practicing biologist about ITIS, this was the answer I got: "Species 2000 was an umbrella group that combined ITIS and other sources. SP2000 provides LSIDs (including for ITIS names)." When I looked at Species 2000, it indicated records are harvested from 3 ITIS databases, a large number of WORMS databases, and about 100 others. From this I conclude that ITIS and WORMS are indeed very valuable and credible, and there are many other sources of taxonomic data that are valuable and credible.

This illustrates a clear choice. We can attempt to collect and summarize the combined wisdom of the biological-technical community to keep abreast over time about which of these taxonomic databases are necessary and sufficient for CF. Or we can defer that judgment to the ongoing efforts in the community, which seem likely to continue to be ongoing. Certainly if we want to encourage interoperability, recommending the most prominent (WORMS, ITIS, SP2000, ?) as a good source of taxa and their unique identifiers seems sensible.

comment:5 Changed 5 years ago by lowry

Thanks Jonathan for your constructive feedback, with which I have no issues. I did initially consider a variable for each id, but felt my suggestion in the example would be more acceptable.
As to John's point I strongly feel that a name given should be from an authoritative source and that source should provide an identifier that does something useful. I also strongly feel that acceptable sources should be named and that WoRMS and ITIS should be included as both of these offer governance that provides a mechanism for verifying and adding proposals for new names. However, I don't believe that the list of two should be prescriptive. Should the community require it then obviously the list could be extended.
Whether S2000 should be added depends on whether LSIDs are in active use in the community likely to use CF. In my community (SeaDataNet) they aren't - we use aphiaIDs, but common sense has prevailed and the aphiaID has been incorporated into the LSID. For example, my favourite copepod C. finmarchicus has an LSID of urn:lsid:marinespecies.org:taxname:104464 and an aphiaID of 104464. So, the list could be extended as and when required, but at the moment I don't feel anything is gained by adding S2000.
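The relationship between the aphiaID and the LSID in the example above can be sketched in Python. This is a minimal illustration that assumes every WoRMS LSID follows the single pattern quoted above (urn:lsid:marinespecies.org:taxname: followed by the aphiaID); the function names are hypothetical.

```python
# Prefix taken from the C. finmarchicus example quoted above; treating it
# as the only valid pattern is an assumption made for this sketch.
LSID_PREFIX = "urn:lsid:marinespecies.org:taxname:"


def aphia_id_to_lsid(aphia_id):
    """Build the LSID string for an integer aphiaID."""
    return f"{LSID_PREFIX}{int(aphia_id)}"


def lsid_to_aphia_id(lsid):
    """Extract the integer aphiaID from an LSID, or raise ValueError."""
    if not lsid.startswith(LSID_PREFIX):
        raise ValueError(f"not a marinespecies.org taxname LSID: {lsid!r}")
    local_id = lsid[len(LSID_PREFIX):]
    if not local_id.isdigit():
        raise ValueError(f"non-numeric aphiaID in {lsid!r}")
    return int(local_id)


print(aphia_id_to_lsid(104464))
print(lsid_to_aphia_id("urn:lsid:marinespecies.org:taxname:104464"))
```

Because the conversion is purely textual in both directions, a dataset that stores one form loses nothing relative to the other.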

comment:6 Changed 5 years ago by painter1

I'm sorry to bother you - this is just a test. Google seems to be rejecting messages as spam when they originate in the CF Trac system.

  • Jeff Painter

comment:7 Changed 5 years ago by painter1

Once again I apologize for having to send a test message.

  • Jeff Painter

comment:8 Changed 5 years ago by painter1

This is another test message, hopefully the last one.

  • Jeff Painter

comment:9 Changed 4 years ago by graybeal

I see this ticket, on Taxon Names and Identifiers, has not been addressed since the original discussion over a year ago.

I think it is most important that the ticket move forward. Though Roy's team may have moved on, this problem will need to be addressed in CF sooner or later. While only Roy, Jonathan, and I have discussed it, I suspect many CF lurkers have need for this capability.

The following issues seem acceptably resolved:

  • promoting 6.1.1 on "Geographic regions" to 6.3 (i.e. remove it from 6.1), and adding Roy's as 6.4. Then 6.1 and 6.2 will describe mechanisms in CF, and 6.3 and 6.4 applications of these mechanisms.
  • Initial text rewording by Jonathan: "A taxon is a named level within a biological classification, such as a class, genus and species. Quantities dependent on taxa have generic standard_names containing the word taxon, and the taxa are identified by auxiliary coordinate variables."
  • Requiring name and identifier is reasonable (to make the description self-contained).

The following questions are open:

  • How many identifier/sources if multiple are available? Roy suggested 1, Jonathan recommends 2, John suggests user's choice.
  • How many sources? Roy suggested 2 (extensible), John says CF should not limit (and if it does, the 2 suggested are not the best 2).
  • What kind of identifier? Roy suggested namespace + ':' + local text ID; Jonathan proposed (agreeable to Roy) separate int variables for the WORMS aphia ID and the ITIS taxonomic serial number; and John prefers globally unique identifiers, LSIDs being the common practice (not offered directly by ITIS, only indirectly through Catalog of Life). In Jonathan's scheme each ID type would have a separate int variable, dimensioned to the number of taxa being defined.

(Incidentally, http://www.jbiomedsem.com/content/2/1/7 provides a detailed analysis of the Catalog of Life identifier approach, which integrates the data from ITIS, WORMS, and Species 2000, among many others, and includes thoughts of why the CoL approach wasn't more widely adopted (at that time anyway). Another extended discussion at http://soyouthinkyoucandigitize.wordpress.com/2013/01/28/what-gets-linked-to-global-unique-identifiers-guids-in-natural-history-collection-digitization/. The point is that while going round and round is definitely possible, I want to cleanly account for more than what a specific part of the CF community does today, if we can.)

Looking for a common path, the following seems pretty close:

  • Support multiple identifier sources; specifying those to be provided _if available_
    • if it isn't available in ITIS or WORMS, it should still be citable
    • if the user always uses WORMS, we should not force them to translate to ITIS, and vice versa
    • While I happen to think Catalog of Life is more suitable than ITIS, I'll forego the argument as long as we aren't exclusive
  • Use Jonathan's proposed approach for WORMS and ITIS, but allow the extension for others (e.g., CoL) for other globally unique identifiers; with any globally unique identifier to be given the standard name taxon_global_identifier, and can be text (which most will be) or int (for UUIDs, for example)
    • The comparability of identifiers A to B to C etc. will inevitably be done at a domain-specific application level, well beyond the concern of CF (but readily achievable by domain experts)
    • It won't be necessary to define unique identifier types for each source, since globally unique identifiers are by their nature distinguishable and uniquely relatable to their source
    • If we accept this adjustment, we don't have to argue on the merits whether Catalog of Life is better than ITIS (not so much because of LSIDs, but because it includes many more sources than just ITIS).

So this might give us the following example:

variables:
  int aphiaID(taxa);
    aphiaID:_FillValue=-1;
    aphiaID:standard_name="taxon_identifier";
  int tsn(taxa);
    tsn:_FillValue=0;
    tsn:standard_name="taxonomic_serial_number";
  string col(taxa);
    col:_FillValue="null";
    col:standard_name="taxon_global_identifier";
    col:comment="LSID from Catalog of Life";
data:
  taxon_name="Homo sapiens", "Fraxinus excelsior", "Struthio camelus";
  aphiaID=1,32768,-1;
  tsn=42,0,7776;
  col="urn:lsid:catalogueoflife.org:taxon:f33e0fe1-ac8e-11e3-805d-020044200006:col20140401",
         "urn:lsid:catalogueoflife.org:taxon:0ad7462a-ac8f-11e3-805d-020044200006:col20140401",
         "urn:lsid:catalogueoflife.org:taxon:ebff2886-ac8e-11e3-805d-020044200006:col20140401";

Last edited 4 years ago by graybeal

comment:10 Changed 4 years ago by jonathan

Dear John

Thank you for moving this forward. I am not an expert and defer to you and Roy, but what you propose seems fine to me. I think the text for the standard would need to be more specific about the content of a variable containing taxon global identifiers. Wouldn't it have to be string data, in order to be able to say (as a URN) what sort of identifier it is, as in your example? Can it be required to be a URN? If the col variable is an array of strings, following the netCDF classic model (as we do so far in CF), it should be a 2D char array.
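The 2D char array layout mentioned here can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, and padding with NUL bytes is an assumption (space padding is also seen in practice). Each string becomes one fixed-width row, as in a char variable dimensioned (taxon, string80).

```python
# Hypothetical helpers showing how variable-length strings map onto the
# fixed-width rows of a 2D char array in the netCDF classic model.
PAD = b"\x00"  # assumed padding byte for this sketch


def strings_to_char_rows(values, width):
    """Encode each string as a bytes row of exactly `width` characters."""
    rows = []
    for value in values:
        encoded = value.encode("ascii")
        if len(encoded) > width:
            raise ValueError(f"{value!r} is longer than {width} characters")
        rows.append(encoded.ljust(width, PAD))
    return rows


def char_row_to_string(row):
    """Strip the padding from one row and decode it back to a string."""
    return row.rstrip(PAD).decode("ascii")


names = ["Homo sapiens", "Fraxinus excelsior", "Struthio camelus"]
rows = strings_to_char_rows(names, 80)
print([len(row) for row in rows])
print([char_row_to_string(row) for row in rows])
```

The string dimension has to be at least as long as the longest identifier, which matters for LSID URNs since they are considerably longer than a species name.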

I note with sadness that F. excelsior is currently threatened by a nasty disease in England.

Jonathan

comment:11 follow-up: Changed 3 years ago by graybeal

This from Roy via the CF list 2014.04.23:

Dear John and Jonathan,

Resurrecting Trac ticket 99 has been on my Todo list for some time - thanks John. I'm not in the office until Tuesday and my Trac login credentials are on my work PC. Hence this reply via the normal list.

Since the last correspondence on this ticket, SeaDataNet have adopted an LSID URI syntax incorporating the AphiaID, driven by standards from the OBIS community for identifying taxa. It would make a lot of sense to adopt this in CF. I don't have the details to hand, but if anyone from VLIZ or ICES is watching this thread maybe they could provide the precise syntax of this URI. If not, I'll dig it out next week. As Jonathan says, this will need to be a string array.

John - are you able to write draft text for inclusion in the CF documentation?

Cheers, Roy.

comment:12 in reply to: ↑ 11 Changed 3 years ago by graybeal

Replying to graybeal:

This from Roy via the CF list 2014.04.23: John - are you able to write draft text for inclusion in the CF documentation?

In theory, yes. I am having trouble making the time right now but will try to get to it.

Meanwhile, we got another request on the list that involved chemical names and organism parts. So far CF tends to include chemical names in the standard name, but if the number of amount_of_<substance>_in_<body part> combinations is high I wonder if that is not another candidate for the same treatment. Something I'll keep in the back of my mind if I start writing a draft.
