[wip] Refactor of metrics operator #62

vsoch · 2023-09-20T04:56:56Z

This is WIP for a huge refactor because (like most things I make) I realized the design was not right / not good enough. I'll include here the notes I shared in Slack. Consider this a WIP because I have to update every single metric and test the case that I originally wanted with HPCToolkit, and I suspect that will take me a few more days (I actually just started on this last night, damn). I want to be able to summarize this more simply, e.g.:

A Metric Set is a collection of metrics to measure IO, performance, or networking that can be customized with addons.

And for the expected use case (for now) I expect the user to choose one metric, and then customize with addons for their needs. And that's it. We will have a registry of metrics (as we do now) and a registry of addons (not created yet for the UI) that range from volume types to drop-in applications/containers.

To summarize my (possibly terrible) ideas, the design we started with is as follows:

A Metric Set holds one or more metrics, plus a separate entity that represents an application (lammps), storage (anything from a PVC to a config map), or a standalone "roll your own" setup.
One metric generates one JobSet with some number of replicated jobs.
The Metric Sets come in flavors, storage and application, where the first expects a storage to be defined, the second shares the namespace, etc.
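
As a rough illustration of that first design (written in Python for brevity; `OriginalMetricSet`, `flavor`, and `job_sets` are hypothetical names and not the operator's actual types, and treating the flavors as mutually exclusive is an assumption), the flavor falls out of which extra entity is defined, and each metric maps to one JobSet:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OriginalMetricSet:
    """Hypothetical model of the first design (not the operator's real types)."""

    metrics: List[str]
    application: Optional[str] = None  # e.g. "lammps"
    storage: Optional[str] = None      # e.g. the name of a PVC or config map

    def flavor(self) -> str:
        # Assumption for this sketch: a set has an application OR a storage.
        if self.application is not None and self.storage is not None:
            raise ValueError("define an application or a storage, not both")
        if self.storage is not None:
            return "storage"
        if self.application is not None:
            return "application"
        return "standalone"

    def job_sets(self, replicas: int = 1) -> List[dict]:
        # One metric generates one JobSet with some number of replicated jobs.
        return [
            {"metric": name, "flavor": self.flavor(), "replicated_jobs": replicas}
            for name in self.metrics
        ]


storage_set = OriginalMetricSet(metrics=["io-fio"], storage="my-pvc")
print(storage_set.flavor())
print(storage_set.job_sets(replicas=2))
```

The sketch makes the rigidity visible: the set type is decided by a single hard-coded field, so anything that is neither "an application" nor "a storage" has no natural place to go.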

The Metric is an interface, so as I started to make different flavors of Metric interfaces (e.g., a Launcher Worker design that would generate a launcher replicated job and then workers as another one) I started to dislike it more and more: the designation of different Metric Set types didn't have a lot of meaning. It also seemed unlikely anyone would actually put two metrics alongside the same storage or application; at the least it would be challenging. And the arbitrary need to define some specific storage or application was not very flexible. Then I started implementing actual applications (e.g., lammps) as metrics themselves. The goop hit the fan when I realized I wanted to put together two metrics: lammps, which had a Launcher Worker design, and HPCToolkit, which was an entirely different thing that generated and shared a volume. I scribbled down some ideas last night at ~10pm (below, "MetricContainer" turned into the more generic "Addon") and worked until 2am on the refactor. For the new design:

A Metric Set is a generic shell: there is one type that holds some number of metrics scaled to some number of pods.
There is a new entity, an "Addon", that you can add to a Metric.
An addon is flexible: it can add volumes or containers, update entrypoints, or even provide applications.
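
A minimal sketch of how those three pieces could fit together (Python for brevity; every class, field, and helper name here is hypothetical, as is the `run-fio.sh` entrypoint, and none of it is the operator's actual API):

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PodSpec:
    """Simplified stand-in for the pod a metric runs in."""

    containers: List[str] = field(default_factory=list)
    volumes: List[str] = field(default_factory=list)
    entrypoint: str = ""


@dataclass
class Addon:
    """Hypothetical addon: a named customization applied to a metric's pod."""

    name: str
    customize: Callable[[PodSpec], None]


@dataclass
class Metric:
    """Hypothetical metric that can carry any number of addons."""

    name: str
    entrypoint: str
    addons: List[Addon] = field(default_factory=list)

    def pod_spec(self) -> PodSpec:
        # Start from the metric's own container, then let each addon modify it.
        spec = PodSpec(containers=[self.name], entrypoint=self.entrypoint)
        for addon in self.addons:
            addon.customize(spec)
        return spec


@dataclass
class MetricSet:
    """One generic type: some number of metrics scaled to some number of pods."""

    metrics: List[Metric]
    pods: int = 1


def empty_volume(name: str) -> Addon:
    """Build an addon that only contributes a volume."""

    def add_volume(spec: PodSpec) -> None:
        spec.volumes.append(name)

    return Addon(name=f"volume-{name}", customize=add_volume)


fio = Metric(name="io-fio", entrypoint="run-fio.sh", addons=[empty_volume("scratch")])
metric_set = MetricSet(metrics=[fio], pods=2)
spec = metric_set.metrics[0].pod_spec()
print(spec)
```

The point of the sketch is that `MetricSet` no longer needs subtypes: anything that used to be a special "storage" or "application" case becomes one more `Addon` in a metric's list.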

This means we can define HPCToolkit as an addon to the app-lammps metric, and it makes sense: we know HPCToolkit is going to add a container, an empty volume, and then update an entrypoint for the metric it is attached to. It also means all the storage tests we were doing? Well, now a volume type is just an addon you add to, for example, the io-fio metric.
So I conceptually like the direction of this design more, but it's much more complex because I've essentially broken up a block of LEGO into much smaller pieces, and of course getting those to work and updating all the existing metrics is going to be... a lot. I'm giving myself to the end of the week to get something working and then likely will need to abandon HPCToolkit.
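
To make the HPCToolkit case concrete, here is a hedged sketch of the three changes such an addon would make to the pod of the metric it is attached to. It uses plain dictionaries rather than real Kubernetes objects; the function name, the dictionary layout, the volume name, the placeholder lammps command, and the choice to prefix the entrypoint with `hpcrun` are all illustrative assumptions, not how the operator implements it.

```python
import copy


def hpctoolkit_addon(pod: dict) -> dict:
    """Return a modified copy of `pod`; the input is left untouched."""
    updated = copy.deepcopy(pod)

    # 1. Add a container that provides the HPCToolkit software.
    updated["containers"].append({"name": "hpctoolkit"})

    # 2. Add an empty volume so the two containers can share files.
    updated["volumes"].append({"name": "hpctoolkit-share", "type": "empty"})

    # 3. Update the entrypoint of the metric's own container (assumed to be
    #    the first one) so the application runs under the profiler.
    app = updated["containers"][0]
    app["entrypoint"] = "hpcrun " + app["entrypoint"]
    return updated


# Hypothetical pod for the app-lammps metric before any addon is applied.
lammps_pod = {
    "containers": [{"name": "app-lammps", "entrypoint": "lmp -in in.example"}],
    "volumes": [],
}

profiled = hpctoolkit_addon(lammps_pod)
print(profiled["containers"][0]["entrypoint"])
print([c["name"] for c in profiled["containers"]])
```

Returning a copy instead of mutating in place keeps the metric's base definition reusable, so the same app-lammps pod can be run with or without the addon.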

Requirements before this can be merged:

All metrics re-implemented and re-tested in this new design
My Kubecon experiments also verified to function the same
A web UI of addons
A larger version bump here (likely alpha 1 to alpha 2)

And probably something else I didn't think of. I'm giving myself to the end of the week to complete this and prototype hpctoolkit as an addon with the lammps app. This probably could be enough work to spread out over a few weeks to a month... no pressure! But also, I think I'm going to try my damn best anyway.

This is going to be a huge refactor to remove the "hard coded" application/storage legos and replace them with a more flexible setup, where we have one base metric set (no subtypes), metrics generate the replicated jobs (as many as they like, how they please), and addons are provided to them, which can range from additional volumes to containers (that provide volumes) to any kind of customization. This is not ready for any kind of testing, but I am mostly concerned about my computer blowing up and losing the work, so I am saving for good measure :) Also, yay today! :D

Signed-off-by: vsoch <[email protected]>

vsoch · 2023-09-28T03:52:49Z

This was merged with #68

vsoch added 9 commits September 19, 2023 02:10

definitely making bad life decisions

f525b8a

but might as well save the state of them! Signed-off-by: vsoch <[email protected]>

very satisfying deletion of things.

8084830

Signed-off-by: vsoch <[email protected]>

lammps ran!

b0f94c2

Signed-off-by: vsoch <[email protected]>

amg is back

9cc1769

Signed-off-by: vsoch <[email protected]>

bdas is back

9d47cf1

Signed-off-by: vsoch <[email protected]>

add back hpl

d79ecbc

we did not get this completely working before (likely the spack mpi install, as a basic hostname does not work) so a basic conversion is sufficient Signed-off-by: vsoch <[email protected]>

add back kripke

412217a

Signed-off-by: vsoch <[email protected]>

laghos

b8c8043

Signed-off-by: vsoch <[email protected]>

vsoch mentioned this pull request Sep 20, 2023

[wip] second design for metrics operator #63

Merged

4 tasks

vsoch closed this Sep 28, 2023
