Currently, all of the nodes that the quality engine runs on are listed in `metadig.properties`. When we had two or three nodes this wasn't a big deal, but now we have eight and the number is growing rapidly. Similarly, the tasks for each of these nodes are listed in `taskList.csv`. Adding a new node requires a configuration change, which means a `helm upgrade` has to be run to install the new config files. This is not ideal, and on top of that the config files will quickly become unwieldy.
There are a few options here, I think, which I'll outline below:
1. Dynamic lookup for nodes and dynamic assignment of tasks
Using `bookkeeper` we can look up the hosted repositories that metadig-engine should be running on. I think we would need to combine that list with a list of DataONE nodes funded through other sources (ADC, KNB, ESS-DIVE) that are listed in `metadig.properties`. This keeps some hard-coded values in `metadig.properties`, but the number would (maybe?) be much smaller and would not change as frequently as the list generated by bookkeeper.
From this list of nodes, we would assume that the tasks are all fairly similar: each node gets a quality task with the DataONE FAIR suite and a member node assessment task. We could also do a regex search and assign additional quality tasks to nodes based on the node/suite name (e.g., `urn:node:ARCTIC` would match `arctic-data-center-suite`). This way custom suites can still be written and run without us having to hard-code every node/suite combination in the task list. This would effectively remove the node score and quality tasks from `taskList.csv`, leaving the portal scoring tasks, the CN node listing task, and the data file acquisition tasks in the file. I'm not sure how some of the parameters would be set, though; things like the formatId filter or harvest begin date would have to be generic rather than configurable.
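To make the regex assignment concrete, here's a minimal sketch of what the matching could look like. The suite IDs and the node-name-to-suite naming convention are assumptions for illustration, not the engine's actual identifiers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of regex-based suite assignment. The suite IDs and the
// node-name-to-suite convention below are hypothetical.
public class SuiteMatcher {

    private static final Pattern NODE_ID = Pattern.compile("^urn:node:(\\w+)$");

    // Custom suites known to the engine, keyed by node short name (assumed IDs).
    private static final Map<String, String> CUSTOM_SUITES = Map.of(
        "ARCTIC", "arctic-data-center-suite",
        "ESS_DIVE", "ess-dive-suite");

    // Every node gets the FAIR suite; nodes whose short name matches a
    // registered custom suite get that suite as well.
    public static List<String> suitesFor(String nodeId) {
        List<String> suites = new ArrayList<>();
        suites.add("FAIR-suite"); // baseline suite ID is assumed
        Matcher m = NODE_ID.matcher(nodeId);
        if (m.matches() && CUSTOM_SUITES.containsKey(m.group(1))) {
            suites.add(CUSTOM_SUITES.get(m.group(1)));
        }
        return suites;
    }

    public static void main(String[] args) {
        // Prints [FAIR-suite, arctic-data-center-suite]
        System.out.println(suitesFor("urn:node:ARCTIC"));
    }
}
```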
Overall this solution only feels okay, because we can't really get all of the nodes we need dynamically, unless there is some flag in the DataONE API that I'm not aware of.
2. Static lookup for nodes and dynamic assignment of tasks
Take the list of nodes out of `metadig.properties` and put it in some other config file. Instead of listing both the subjectId and the subjectURL, just list the node ID and use the DataONE API to look up the endpoint and subjectId.
The task list would be handled as in option 1; a sketch of the node lookup follows.
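Something like this could replace the hard-coded subjectId/subjectURL pairs. It's a rough sketch against the CN's node registry endpoint, using plain XML parsing rather than the DataONE Java client, and the exact element names are my reading of the node list document rather than something I've verified against the schema:

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: resolve a node's baseURL and subject from the DataONE CN node
// registry, so only node IDs need to live in our config.
public class NodeLookup {

    private static final String CN_NODE_LIST = "https://cn.dataone.org/cn/v2/node";

    public static void main(String[] args) throws Exception {
        String targetId = "urn:node:ARCTIC";

        // Fetch the full node list from the CN.
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(CN_NODE_LIST)).GET().build(),
            HttpResponse.BodyHandlers.ofString());

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                resp.body().getBytes(StandardCharsets.UTF_8)));

        // Scan for the node entry matching our configured node ID.
        NodeList nodes = doc.getElementsByTagName("node");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element node = (Element) nodes.item(i);
            if (targetId.equals(text(node, "identifier"))) {
                System.out.println("baseURL: " + text(node, "baseURL"));
                System.out.println("subject: " + text(node, "subject"));
            }
        }
    }

    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent() : null;
    }
}
```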
3. Static lookup for nodes and tasks
A static list for the nodes and tasks would make the tasks more fully configurable, but we are still stuck with an unwieldy task list file. I would like to consider refactoring `taskList.csv` into a JSON file, though, hopefully in a way that makes it a little more readable and parsable; see the sketch below.
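For example, a JSON task list might look something like this. The field names here are purely illustrative, not a fixed schema:

```json
{
  "tasks": [
    {
      "name": "quality-knb",
      "type": "quality",
      "nodeId": "urn:node:KNB",
      "suiteId": "FAIR-suite",
      "schedule": "0 0 4 * * ?",
      "params": {
        "formatIdFilter": "eml://*",
        "harvestBeginDate": "2020-01-01"
      }
    },
    {
      "name": "portal-scores-cn",
      "type": "score",
      "nodeId": "urn:node:CN",
      "schedule": "0 30 4 * * ?"
    }
  ]
}
```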
None of these solutions seems ideal, though I think the second might be the easiest (mostly because I'm unsure where bookkeeper is in terms of development).

Thoughts @mbjones ?
These all sound like good options, with bookkeeper being the preferred route, though it is obviously more complicated. Peter did some initial work on incorporating bookkeeper into metadig that is currently turned off, so we should look into what he did and how it might work or contribute to a solution. As for KNB/ARCTIC/etc., I think those should go in bookkeeper as well, with valid quotas, so we can use a single solution to know which sites get which services.
As bookkeeper isn't yet ready for prime time, let's discuss a shorter-term strategy.