Currently, all of the nodes that the quality engine runs on are listed in `metadig.properties`. When we had two or three nodes this wasn't a big deal, but now we have eight and the number is growing rapidly. Similarly, the tasks for each of these nodes are listed in `taskList.csv`. Adding a new node requires a configuration change, which means a `helm upgrade` has to be run to install the new config files. This is not ideal, and on top of that the config files will quickly become unwieldy.
There are a few options here, I think, which I'll outline below:
1. Dynamic lookup for nodes and dynamic assignment of tasks
Using `bookkeeper` we can look up the hosted repositories that metadig-engine should be running on. I think we would need to combine that list with a list of DataONE nodes funded through other sources (ADC, KNB, ESS-DIVE) that are listed in `metadig.properties`. This keeps some hard-coded values in `metadig.properties`, but the number would (maybe?) be much smaller and would not change as frequently as the list generated by bookkeeper.
From this list of nodes, we would assume that the tasks are all fairly similar: each node gets a quality task with the DataONE FAIR suite and a member node assessment task. We could also do a regex search and assign additional quality tasks to nodes based on the node/suite name (e.g., `urn:node:ARCTIC` would match `arctic-data-center-suite`). This way custom suites can still be written and run without us having to hard-code every node/suite combination in the task list. This would effectively remove the node score and quality tasks from `taskList.csv`, leaving the portal scoring tasks, the CN node listing task, and the data file acquisition tasks in the file. I'm not sure how some of the parameters would be set, though; things like the formatId filter or harvest begin date would have to be generic rather than configurable.
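To make the regex assignment concrete, here's a minimal sketch of what the matching could look like. The suite IDs and the node-name-to-suite naming convention are assumptions for illustration, not the engine's actual identifiers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of regex-based suite assignment. The suite IDs and the
// node-name-to-suite convention below are hypothetical.
public class SuiteMatcher {

    private static final Pattern NODE_ID = Pattern.compile("^urn:node:(\\w+)$");

    // Custom suites known to the engine, keyed by node short name (assumed IDs).
    private static final Map<String, String> CUSTOM_SUITES = Map.of(
        "ARCTIC", "arctic-data-center-suite",
        "ESS_DIVE", "ess-dive-suite");

    // Every node gets the FAIR suite; nodes whose short name matches a
    // registered custom suite get that suite as well.
    public static List<String> suitesFor(String nodeId) {
        List<String> suites = new ArrayList<>();
        suites.add("FAIR-suite"); // baseline suite ID is assumed
        Matcher m = NODE_ID.matcher(nodeId);
        if (m.matches() && CUSTOM_SUITES.containsKey(m.group(1))) {
            suites.add(CUSTOM_SUITES.get(m.group(1)));
        }
        return suites;
    }

    public static void main(String[] args) {
        // Prints [FAIR-suite, arctic-data-center-suite]
        System.out.println(suitesFor("urn:node:ARCTIC"));
    }
}
```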
Overall this solution only feels okay, because we can't really get all of the nodes we need dynamically, unless there is some flag in the DataONE API that I'm not aware of.
2. Static lookup for nodes and dynamic assignment of tasks
Take the list of nodes out of `metadig.properties` and put it in some other config file. Instead of listing both the subjectId and the subjectURL, just list the node ID and use the DataONE API to look up the endpoint and subjectId.
The task list would be handled as in option 1; a sketch of the node lookup follows.
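Something like this could replace the hard-coded subjectId/subjectURL pairs. It's a rough sketch against the CN's node registry endpoint, using plain XML parsing rather than the DataONE Java client, and the exact element names are my reading of the node list document rather than something I've verified against the schema:

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: resolve a node's baseURL and subject from the DataONE CN node
// registry, so only node IDs need to live in our config.
public class NodeLookup {

    private static final String CN_NODE_LIST = "https://cn.dataone.org/cn/v2/node";

    public static void main(String[] args) throws Exception {
        String targetId = "urn:node:ARCTIC";

        // Fetch the full node list from the CN.
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(CN_NODE_LIST)).GET().build(),
            HttpResponse.BodyHandlers.ofString());

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                resp.body().getBytes(StandardCharsets.UTF_8)));

        // Scan for the node entry matching our configured node ID.
        NodeList nodes = doc.getElementsByTagName("node");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element node = (Element) nodes.item(i);
            if (targetId.equals(text(node, "identifier"))) {
                System.out.println("baseURL: " + text(node, "baseURL"));
                System.out.println("subject: " + text(node, "subject"));
            }
        }
    }

    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent() : null;
    }
}
```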
3. Static lookup for nodes and tasks
A static list for the nodes and tasks would make the tasks more fully configurable, but we are still stuck with an unwieldy task list file. I would like to consider refactoring `taskList.csv` into a JSON file, though, hopefully in a way that makes it a little more readable and parsable; see the sketch below.
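For example, a JSON task list might look something like this. The field names here are purely illustrative, not a fixed schema:

```json
{
  "tasks": [
    {
      "name": "quality-knb",
      "type": "quality",
      "nodeId": "urn:node:KNB",
      "suiteId": "FAIR-suite",
      "schedule": "0 0 4 * * ?",
      "params": {
        "formatIdFilter": "eml://*",
        "harvestBeginDate": "2020-01-01"
      }
    },
    {
      "name": "portal-scores-cn",
      "type": "score",
      "nodeId": "urn:node:CN",
      "schedule": "0 30 4 * * ?"
    }
  ]
}
```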
None of these solutions seems ideal, though I think the second might be the easiest (mostly because I'm unsure where bookkeeper is in terms of development).

Thoughts @mbjones ?
These all sound like good options, with bookkeeper being the preferred route, though it is obviously more complicated. Peter did some initial work on incorporating bookkeeper into metadig that is currently turned off, so we should look into what he did and how it might work or contribute to a solution. As for KNB/ARCTIC/etc., I think those should go in bookkeeper as well, with valid quotas, so we can use a single solution to know which sites get which services.
As bookkeeper isn't yet ready for prime time, let's discuss a shorter-term strategy.