Investigate if ChecklistBank could expose the readonly GBIF v1 Species API #1320

Open
mdoering opened this issue May 15, 2024 · 7 comments
@mdoering
Member

mdoering commented May 15, 2024

What are the differences between the v1 GBIF Species API and ChecklistBank's data model? Are there any true blockers?

@mdoering
Member Author

Name parser

  • this is straightforward, no problems here

@mdoering
Member Author

mdoering commented May 15, 2024

Searching names

Species suggest and search both have similar parameters and return types. The exact behavior of the search (scoring/ranking) is likely to be different:

  • datasetKey UUID needs to be mapped to CLB's int datasetKey
  • constituentKey UUID needs to be mapped to CLB's int datasetKey -> sourceDatasetKey
  • rank OK
  • higherTaxonKey OK, but will be a string
  • status OK
  • extinct OK
  • habitat OK = environment
  • threat: MISSING, but could be added
  • nameType OK
  • nomenclaturalStatus: very different vocabulary is being used in CLB. I would think this is a very niche parameter that would not be a blocker
  • origin: OK, but slightly different vocab values. Not all can be mapped
  • issue OK, but rather different vocab values. Not all can be mapped
  • hl: highlighting is not yet supported in CLB (and troublesome to implement)
  • limit/offset: OK
  • facet: OK (might be some other facet names we could map - and available facets also differ)
  • facetMincount: NOT SUPPORTED
  • facetMultiselect: NOT SUPPORTED
  • facetLimit: NOT SUPPORTED
  • facetOffset: NOT SUPPORTED
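
The parameter notes above could be captured in a small translation layer. The following is only a rough sketch, not CLB code: the UUID-to-int lookup table and its entry are made-up placeholders, the function name is hypothetical, and it assumes that parameters marked OK keep the same name on the CLB side. Parameters with diverging vocabularies are simply reported rather than mapped.

```python
from uuid import UUID

# Hypothetical lookup from GBIF dataset UUIDs to CLB int dataset keys.
# The entry below is a made-up placeholder, not a real dataset.
GBIF_TO_CLB_DATASET = {
    UUID("00000000-0000-0000-0000-000000000001"): 1001,
}

# v1 parameters listed above as missing or not supported in CLB.
UNSUPPORTED = {
    "threat", "hl",
    "facetMincount", "facetMultiselect", "facetLimit", "facetOffset",
}

# v1 parameters whose vocabularies differ and cannot be fully mapped.
VOCAB_MISMATCH = {"nomenclaturalStatus", "origin", "issue"}


def translate_params(v1_params):
    """Translate v1 search parameters into (assumed) CLB parameters.

    Returns the translated dict and the list of parameter names that
    were skipped because they are unsupported or need a vocabulary mapping.
    """
    translated = {}
    skipped = []
    for name, value in v1_params.items():
        if name in UNSUPPORTED or name in VOCAB_MISMATCH:
            skipped.append(name)
        elif name == "datasetKey":
            translated["datasetKey"] = GBIF_TO_CLB_DATASET[UUID(value)]
        elif name == "constituentKey":
            translated["sourceDatasetKey"] = GBIF_TO_CLB_DATASET[UUID(value)]
        elif name == "habitat":
            # habitat in v1 corresponds to environment in CLB
            translated["environment"] = value
        elif name == "higherTaxonKey":
            # CLB taxon ids are strings; how a v1 int key maps to a CLB id
            # is a separate identifier question and is not solved here.
            translated["higherTaxonKey"] = str(value)
        else:
            # rank, status, extinct, nameType, limit, offset, facet, ...
            translated[name] = value
    return translated, skipped
```

A caller would probably want to turn the skipped names into a warning or a 400 response, depending on how strict the compatibility layer should be.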

Return type

  • no Linnean ranks, but they could be added, which is desirable as users have already requested it: Store flat classification for immutable datasets #1122
  • numDescendants: NOT SUPPORTED, but could be for immutable datasets
  • numOccurrences: NOT SUPPORTED. I wonder whether that is even still in use in GBIF. We could add this by calling the GBIF API to retrieve counts
  • descriptions: NOT SUPPORTED, but there is a generic TaxonProperty extension that could maybe be used instead. Alternatively a new extension could be added, which isn't such a big thing.
  • vernacularNames: all OK, but some properties are missing and would need to be added:
    • lifeStage
    • plural

@mdoering
Member Author

mdoering commented May 15, 2024

Species response type: see above. Additionally:

  • deleted: CLB releases are immutable and the way deleted identifiers work is different. We can resolve older, now deleted IDs, but searching and working across all of them is difficult and may not be possible
  • lastCrawled: OK (but can also be uploads)
  • lastInterpreted: OK, but really always the same as crawled

@mdoering
Member Author

mdoering commented May 15, 2024

v1 methods which do not exist at all:

  • /species/{usageKey}/toc
  • /species/{usageKey}/speciesProfiles: we only keep a few of these properties directly on the taxon, as most of them are 1:1 and make no sense in an extension. DwC forced us that way. E.g. extinct, environment and livingPeriod exist, but lifeForm, habitat, ageInDays, sizeInMillimeter and massInGram do not exist and would have to be TaxonProperty records. Doable, but it takes quite some mapping effort
  • /species/{usageKey}/metrics: does not exist at all. Would need to be precalculated and stored similar to the flat classification

@mdoering
Member Author

Identifiers are the biggest problem. ChecklistBank has compound keys consisting of a datasetKey (int) and a dataset-scoped id (String), which is the original identifier from the source, while v1 has a single int key that is unique across all datasets.

COL stable identifiers are short strings, but can be converted bidirectionally into an int. That won't work for other dataset identifiers.
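
To illustrate what a bidirectional conversion between a short string id and an int could look like, here is a minimal sketch using a plain positional encoding. The 29-character alphabet is an assumption made for illustration and may differ from what COL actually uses; `encode` and `decode` are hypothetical names.

```python
# Assumed alphabet for illustration only: digits and consonants that are
# hard to confuse. The real COL identifier alphabet may differ.
ALPHABET = "23456789BCDFGHJKLMNPQRSTVWXYZ"
BASE = len(ALPHABET)
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def encode(n: int) -> str:
    """Convert a non-negative int into a short string id."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(s: str) -> int:
    """Convert a short string id back into an int.

    Raises ValueError for ids containing characters outside the alphabet,
    which is the case for arbitrary source identifiers.
    """
    if not s:
        raise ValueError("empty identifier")
    n = 0
    for c in s:
        if c not in INDEX:
            raise ValueError(f"character {c!r} is not in the id alphabet")
        n = n * BASE + INDEX[c]
    return n
```

Any int survives the round trip through `encode` and `decode`, but an arbitrary source identifier such as a URI or LSID fails in `decode`, which is the reason the trick does not carry over to other datasets.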

@MattBlissett
Collaborator

MattBlissett commented May 15, 2024

Backbone taxon keys are used in other GBIF APIs:

  • Occurrence search
  • Map tile filter
  • Quarterly analytics (in the CSV data, just kingdom keys) which is used by the country reports.
  • Existing downloads which were created with a taxon filter

I can't think of an exposure of non-backbone keys. Things like the IUCN Red List resolution during interpretation don't store the keys.

@mdoering
Member Author

Does that mean we cannot change the keys without breaking the other APIs, or is it a matter of (not) changing the data type from int to string?
If the APIs accepted both an old backbone integer and a new string id, we might be able to offer a smooth transition. Old integers would be mapped internally to the new ids, which could then also be submitted directly.

Note also that there are 17 accepted kingdoms in COL these days, mostly viruses.
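
The transition idea of accepting both key types could look roughly like the sketch below. It is not an existing GBIF or CLB function: the lookup table, its entries and the name `resolve_taxon_key` are hypothetical. It also assumes that a purely numeric value is always an old backbone key, which only holds if new string ids can never consist of digits alone.

```python
# Hypothetical table from old backbone integer keys to new string ids.
# Both sides are made-up placeholders.
LEGACY_KEY_TO_ID = {
    1: "NEW-ID-A",
    5: "NEW-ID-B",
}


def resolve_taxon_key(raw: str) -> str:
    """Return the new string id for a taxon key parameter.

    Purely numeric input is treated as an old backbone integer and mapped
    internally; anything else is assumed to already be a new string id.
    """
    raw = raw.strip()
    if raw.isascii() and raw.isdigit():
        try:
            return LEGACY_KEY_TO_ID[int(raw)]
        except KeyError:
            raise LookupError(f"unknown legacy backbone key: {raw}") from None
    return raw
```

If new ids can be all digits, the two key spaces would need an explicit marker, for example a separate parameter name or a prefix, instead of guessing from the value.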
