[wip] Refactor of metrics operator #62

Closed
wants to merge 9 commits into from


@vsoch vsoch commented Sep 20, 2023

This is WIP for a huge refactor because (like most things I make) I realized the design was not right / was not good enough. I'll include here the notes I shared in Slack. Consider this a WIP because I have to update every single metric and test the case that I originally wanted with HPCToolkit, and I suspect that will take me a few more days (I actually just started on this last night, damn). I want to be able to summarize this more simply, e.g.:

A Metric Set is a collection of metrics to measure IO, performance, or networking that can be customized with addons.

And for the expected use case (for now) I expect the user to choose one metric, and then customize with addons for their needs. And that's it. We will have a registry of metrics (as we do now) and a registry of addons (not created yet for the UI) that range from volume types to drop-in applications/containers.


To summarize my (possibly terrible) ideas, the design we started with is as follows:

  • A Metric Set holds one or more metrics, and a separate entity that represents an application (lammps), storage (anything from a PVC to a config map), or a standalone "roll your own"
  • One metric generates one JobSet with some number of replicated jobs
  • The Metric Sets come in flavors, storage and application, where the first expects a storage defined, the second shares the namespace, etc.

The Metric is an interface, so as I started to make different flavors of Metric interfaces (e.g., a Launcher Worker design that would generate a launcher replicated job and then workers as another one) I started to dislike it more and more - the designation of different Metric Set types didn't have a lot of meaning. It also seemed unlikely anyone would actually put two metrics alongside the same storage or application, or at least it would be challenging. And the arbitrary need to define some specific storage or application was not very flexible. Then I started implementing actual applications (e.g., lammps) as metrics themselves. The goop hit the fan when I realized I wanted to put together two metrics: lammps, which had a Launcher Worker design, and HPCToolkit, which was an entirely different thing that generated and shared a volume. I scribbled down some ideas last night at ~10pm (below, "MetricContainer" turned into the more generic "Addon") and worked until 2am on the refactor. For the new design:

  • A Metric Set is a generic shell - there is one type that holds some number of metrics scaled to some number of pods.
  • There is a new entity, an "Addon", that you can add to a Metric.
  • An addon is flexible: it can add volumes, add containers, update entrypoints, or even provide applications.
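
As a rough illustration of the new design in the list above (this is not the operator's actual code; every class, field, and method name here is hypothetical), one generic Metric Set could hold metrics, and each metric could carry addons that customize it:

```python
import copy
from dataclasses import dataclass, field


class Addon:
    """Hypothetical base: an addon customizes the metric it is attached to."""

    def apply(self, metric: "Metric") -> None:
        raise NotImplementedError


@dataclass
class VolumeAddon(Addon):
    """Hypothetical addon: a volume type expressed as an addon."""

    volume_name: str

    def apply(self, metric: "Metric") -> None:
        metric.volumes.append(self.volume_name)


@dataclass
class Metric:
    """Hypothetical stand-in for a metric and what it will run with."""

    name: str
    addons: list = field(default_factory=list)
    volumes: list = field(default_factory=list)
    containers: list = field(default_factory=list)


@dataclass
class MetricSet:
    """One generic shell: some number of metrics scaled to some number of pods."""

    pods: int
    metrics: list = field(default_factory=list)

    def render(self) -> list:
        # Apply each metric's addons to a copy, so the inputs stay untouched.
        rendered = []
        for metric in self.metrics:
            result = copy.deepcopy(metric)
            for addon in result.addons:
                addon.apply(result)
            rendered.append(result)
        return rendered
```

With this shape there is no storage or application subtype of Metric Set: a volume is just one more addon in a metric's `addons` list.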

This means we can define HPCToolkit as an addon to the app-lammps metric, and it makes sense - we know HPCToolkit is going to add a container, an empty volume, and then update an entrypoint for the metric it is attached to. It also means all the storage tests we were doing? Well, now a volume type is just an addon you add to, for example, the io-fio metric.

So I conceptually like the direction of this design more, but it's much more complex because I've essentially broken up a block of LEGO into much smaller pieces, and of course getting those to work and updating all the existing metrics is going to be... a lot. I'm giving myself to the end of the week to get something working and then likely will need to abandon HPCToolkit.
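
To make the HPCToolkit case concrete, here is a minimal sketch of the three changes described above (add a container, add an empty volume, update the entrypoint). The dictionary layout, the function name, the image string, and the volume name are all made up for illustration, and wrapping the command with `hpcrun` is an assumption about how the entrypoint would be updated, not something the actual addon is confirmed to do:

```python
import copy


def attach_hpctoolkit_addon(metric: dict) -> dict:
    """Return a copy of `metric` with the three HPCToolkit-style changes.

    All keys and values here are hypothetical placeholders.
    """
    updated = copy.deepcopy(metric)

    # 1. Add a container that provides the tool (placeholder image name).
    updated["containers"].append(
        {"name": "hpctoolkit", "image": "example.invalid/hpctoolkit:placeholder"}
    )

    # 2. Add an empty volume to share between the containers.
    updated["volumes"].append({"name": "hpctoolkit-shared", "type": "empty"})

    # 3. Update the entrypoint so the original command runs under the tool.
    #    Prefixing with `hpcrun` is an assumption for this sketch.
    updated["entrypoint"] = ["hpcrun"] + metric["entrypoint"]

    return updated


# Hypothetical base description of the app-lammps metric.
lammps = {
    "name": "app-lammps",
    "entrypoint": ["lmp"],
    "containers": [],
    "volumes": [],
}

profiled = attach_hpctoolkit_addon(lammps)
```

The original `lammps` description is left unchanged, so the same metric could be run with or without the addon.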

Requirements before this can be merged:

  • All metrics re-implemented and re-tested in this new design
  • My Kubecon experiments also verified to function the same
  • A web UI of addons
  • A larger version bump here (likely alpha 1 to alpha 2)

And probably something else I didn't think of. I'm giving myself to the end of the week to complete this and prototype hpctoolkit as an addon with the lammps app. This probably could be enough work to spread out over a few weeks to a month... no pressure! But also, I think I'm going to try my damn best anyway.

This is going to be a huge refactor to remove the application/storage "hard coded"
legos replaced by a more flexible setup where we have one base metric set (no
subtypes) and then metrics generate the replicated jobs (as many as they like, how
they please) and then addons are provided to them, which can range from additional
volumes to containers (that provide volumes) to any kind of customization. This
is not ready for any kind of testing but I am mostly concerned about my computer
blowing up and losing the work so I am saving for good measure :) Also, yay today! :D

Signed-off-by: vsoch <[email protected]>
but might as well save the state of them!

Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
we did not get this completely working before (likely
the spack mpi install as a basic hostname does not work)
so a basic conversion is sufficient

Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>

vsoch commented Sep 28, 2023

This was merged with #68

@vsoch vsoch closed this Sep 28, 2023