Frog now assigns provenance data to FoLiA, which, among other things, allows us to detect a rerun of (parts of) Frog on a FoLiA document. BUT:
Handling this is quite dangerous and needs a lot of thinking.
Assigning useful IDs to all provenance information of the several tools
Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?
etc
As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.
We do have to consider the realistic case where somebody only runs certain modules of Frog (say PoS-tagging and lemmatisation) and someone else at a later stage wants to add something else, like NER or parsing.
Assigning useful IDs to all provenance information of the several tools
The random component strategy I use is easy and works well for that.
Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?
Add a new one, since it's a completely new Frog run, which can be done at a very different time, by a very different user, and on a very different machine than the older one.
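To make the "add a new one" option concrete, here is a minimal sketch. It assumes provenance is modelled as a plain list of nested dicts, which is not the FoLiA library's API, and the helper names `random_processor_id` and `add_frog_run` as well as the `tool.<hex>` ID layout are hypothetical. Each Frog run becomes its own top-level processor with the modules of that run as sub-processors, and every ID gets a random component.

```python
import secrets


def random_processor_id(tool):
    """Hypothetical ID layout: tool name plus a random hex component."""
    return f"{tool}.{secrets.token_hex(4)}"


def add_frog_run(provenance, modules):
    """Append a NEW top-level Frog processor for this run.

    The modules used in this run become its sub-processors; processors
    from earlier runs are left untouched.
    """
    run = {
        "id": random_processor_id("frog"),
        "name": "frog",
        "processors": [
            {"id": random_processor_id(m), "name": m, "processors": []}
            for m in modules
        ],
    }
    provenance.append(run)
    return run


# Two runs, possibly by different users at different times.
provenance = []
first_run = add_frog_run(provenance, ["ucto", "MBLEM"])
second_run = add_frog_run(provenance, ["NER"])
```

After both calls, `provenance` holds two separate Frog processors rather than one processor with an extra sub-processor.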
As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.
Yes, as a temporary solution that seems quite acceptable. It will take some time for users to run into this issue anyway.
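The temporary guard could look roughly like the sketch below. It is only an illustration: provenance is assumed to be a list of nested dicts rather than real FoLiA objects, and `FrogRerunError`, `has_frog_provenance` and `check_may_run_frog` are hypothetical names, not part of Frog or any FoLiA library.

```python
class FrogRerunError(RuntimeError):
    """Raised when the input already carries Frog provenance (hypothetical)."""


def iter_processors(processors):
    """Walk processors and their sub-processors depth-first."""
    for proc in processors:
        yield proc
        yield from iter_processors(proc.get("processors", []))


def has_frog_provenance(provenance):
    """True if any processor at any depth is named 'frog'."""
    return any(p["name"].lower() == "frog" for p in iter_processors(provenance))


def check_may_run_frog(provenance):
    """Refuse to run Frog on a document that Frog has already processed."""
    if has_frog_provenance(provenance):
        raise FrogRerunError(
            "document already contains Frog provenance; rerunning is not supported yet"
        )


# A document touched only by another tool passes the check ...
clean = [{"name": "ucto", "processors": []}]
check_may_run_frog(clean)

# ... while one with a Frog processor is rejected, even when nested.
frogged = [{"name": "pipeline", "processors": [{"name": "frog", "processors": []}]}]
try:
    check_may_run_frog(frogged)
    rejected = False
except FrogRerunError:
    rejected = True
```

Note that a check this coarse also blocks the case of adding NER or parsing later to a document that only went through PoS-tagging and lemmatisation, which is why it is only a stopgap.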
OK, so running a new (partial) Frog run requires a new provenance record.
Regarding the IDs:
The random component strategy I use is easy and works well for that.
I already started implementing along these lines, but:
A big disadvantage is the lack of reproducibility. Every run on a given input will generate different IDs, which makes it nearly impossible to check for REAL differences in (integration) tests.
So my solution will probably be to register the IDs in use while processing the document.
Frog will force these IDs into a fixed format, probably 'tool.n', so 'ucto.2' or 'MBLEM.4'.
When new provenance is added, we take the highest ID for that tool and increment it by 1.
In this way processing will be deterministic.
Within the Frog pipeline this won't be a problem. Other tools might use a different scheme for IDs, but that will be opaque to Frog.
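A minimal sketch of this registration scheme, under a few assumptions: the class name `ProcessorIdRegistry` and its methods are hypothetical, numbering for a tool not seen before starts at 1, and any ID that does not match 'tool.n' is treated as another tool's opaque ID and ignored.

```python
import re

# IDs in Frog's own format: a tool name, a dot, and a decimal counter.
ID_PATTERN = re.compile(r"^(?P<tool>.+)\.(?P<n>\d+)$")


class ProcessorIdRegistry:
    """Tracks the highest counter seen per tool (hypothetical helper)."""

    def __init__(self):
        self._highest = {}

    def register(self, processor_id):
        """Record an ID already present in the document."""
        match = ID_PATTERN.match(processor_id)
        if match is None:
            return  # foreign scheme: opaque to us, nothing to track
        tool, n = match.group("tool"), int(match.group("n"))
        if n > self._highest.get(tool, 0):
            self._highest[tool] = n

    def next_id(self, tool):
        """Return 'tool.n' with n one above the highest registered counter."""
        n = self._highest.get(tool, 0) + 1  # assumption: counting starts at 1
        self._highest[tool] = n
        return f"{tool}.{n}"


registry = ProcessorIdRegistry()
for existing in ["ucto.1", "ucto.2", "MBLEM.3", "proc.other.a1b2c3"]:
    registry.register(existing)

new_ucto = registry.next_id("ucto")    # "ucto.3"
new_mblem = registry.next_id("MBLEM")  # "MBLEM.4"
new_ner = registry.next_id("NER")      # "NER.1"
```

Because the next ID depends only on the IDs already in the document, two runs on the same input produce identical provenance, which is what the integration tests need.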
Although this seems relatively easy to implement, I will keep it on the wish-list for after the 2.0 release.
That will also give us time to think about a nice way to have Frog neatly (re)do just one module.
@proycon Any comments?